Creating the reference file and index for Karma

Karma uses both a binary genome reference and a word index for performing read mapping.

To prepare a genome reference for use by Karma, concatenate the FASTA format chromosome reference files into a single file. The resulting file may contain one or more sequences.
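As a minimal sketch of the concatenation step, the commands below build two tiny stand-in FASTA files and join them with cat. The file names (chr21.fa, chr22.fa, combined.fa) and their contents are hypothetical placeholders, not real reference data; substitute your own chromosome files.

```shell
# Hypothetical stand-in files; replace with your real per-chromosome FASTA files.
workdir=$(mktemp -d)
printf '>chr21\nACGTACGT\n' > "$workdir/chr21.fa"
printf '>chr22\nTTGGCCAA\n' > "$workdir/chr22.fa"

# Concatenate in the order the sequences should appear in the reference.
# Each input must end with a newline, or a header would be glued to the
# last sequence line of the previous file.
cat "$workdir/chr21.fa" "$workdir/chr22.fa" > "$workdir/combined.fa"

# Every '>' header line marks one sequence in the combined file.
sequence_count=$(grep -c '^>' "$workdir/combined.fa")
echo "sequences in combined reference: $sequence_count"
```

Counting the header lines afterwards is a quick check that no chromosome was dropped or merged into its neighbor.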

For our example here, we'll build a reference for human chromosome 21.

Creating the binary version of the reference and the word index can be done with a single command:

karma --createIndex --reference Homo_sapiens.NCBI36.49.dna.chromosome.21.fa.gz

When the command finishes, five files will have been created alongside the reference file:

  • Homo_sapiens.NCBI36.49.dna.chromosome.21.umfa
  • Homo_sapiens.NCBI36.49.dna.chromosome.21.umwihi
  • Homo_sapiens.NCBI36.49.dna.chromosome.21.umwiwp
  • Homo_sapiens.NCBI36.49.dna.chromosome.21.umwhl
  • Homo_sapiens.NCBI36.49.dna.chromosome.21.umwhr

All five are binary files and may be large. For the combined NCBI build 36 reference, using the default arguments, the five files are about 1.5GB, 4.0GB, 11GB, 1.2GB, and 1.2GB respectively.
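To confirm that index creation produced everything, a small check like the following can help. It is a sketch under one assumption: that the output base name is the reference file name with a trailing .fa or .fa.gz removed, which matches the example above but is not confirmed for other naming schemes. The function name expected_index_files is hypothetical.

```shell
# Print the five file names expected next to a reference after indexing.
# Assumption: base name = reference name minus a trailing .gz and .fa.
expected_index_files() {
    ref=$1
    base=${ref%.gz}
    base=${base%.fa}
    for ext in umfa umwihi umwiwp umwhl umwhr; do
        printf '%s.%s\n' "$base" "$ext"
    done
}

# Report each expected file with its size, or flag it as missing.
for f in $(expected_index_files Homo_sapiens.NCBI36.49.dna.chromosome.21.fa.gz); do
    if [ -e "$f" ]; then
        ls -lh "$f"
    else
        echo "missing: $f"
    fi
done
```

Run it from the directory that holds the reference file; the sizes listed by ls also show how much disk space the index occupies.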

Larger references need large amounts of RAM. Even with a truncated reference, you will need at least 8GB of RAM to avoid excess paging. The actual amount of RAM used is largely determined by the relationship between the chosen index word length (--wordSize, which defaults to 15) and the number of times each possible word of that length occurs in the reference genome.
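One part of that relationship is easy to quantify: with four bases, the number of distinct words of length k is 4^k, so each extra base multiplies the number of possible words by four. The sketch below only prints those counts. It is not Karma's memory formula, and the function name possible_words is a hypothetical helper.

```shell
# Number of distinct k-base words over A, C, G, T: 4^k = 2^(2k).
possible_words() {
    echo $(( 1 << (2 * $1) ))
}

# Show how quickly the word space grows across nearby word sizes.
for k in 12 13 14 15 16; do
    echo "wordSize $k: $(possible_words "$k") possible words"
done
```

At the default word size of 15 there are already more than a billion possible words, which gives a feel for why moving to 16 costs noticeably more memory.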

To support shorter read lengths, choose a smaller index word size. At least two whole index words must be found in each read, so to map 25 base reads, for example, you must use an index created with --wordSize 12 or smaller.
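The two-words-per-read rule reduces to integer division: the largest usable word size is the read length divided by two, rounded down. The helper below, max_word_size, is a hypothetical name used only for this sketch, and the result is an upper bound from read length alone; available RAM may call for a smaller value.

```shell
# Largest word size that still lets two whole index words fit in a read.
max_word_size() {
    echo $(( $1 / 2 ))
}

for read_len in 25 28 30; do
    echo "read length $read_len: wordSize at most $(max_word_size "$read_len")"
done

# For 25 base reads this gives 12. Assuming --wordSize is passed to the
# same index creation command shown earlier, the call would look like:
#   karma --createIndex --wordSize 12 --reference <reference file>
```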

Karma uses adaptive index word choices, so any whole word in the read may be a reasonable candidate. This means that if you know all of your reads will be 70 bases long, for example, a 16-mer index word size might make sense, although it consumes more RAM.

For the human genome, word lengths from 12 to 15 make the most sense for today's computers. On machines with 64GB of RAM, it might make sense to try longer words.