|
How Karma WorksKarma is an heuristic k-tuple word method for aligning reads to a reference genome. It trades high memory use to obtain superior speed. See wikipedia for a discussion of various mapping approaches.Karma File FormatsKarma uses two different input file formats, one for the reference, and a related one for the reads that it is to map.The output format is SAM, which is a new format developed in part by the 1000 Genome Project.
Karma data structuresTo work, Karma needs two things, 1) the reference genome provided by the user, and 2) a Karma created pair of data structures that represents all unique locates of all possible 15 base character 'words'.Binary Genome ReferenceTo speed future loading of the reference genome, karma first converts the reference to a packed binary array with two bases per eight bit byte. The file is then mapped into memory when it is used. This allows nearly instantaneous loading of previously RAM cached data.Word IndexAfter creating the binary reference, karma then creates two additional arrays. The second array is a packed set of word locations for all possible words. The repeated locations of a particular word are all stored together, in genome increasing location.The first array is an index of all possible combinations of word values, so for 15 base pair words, it has 4**15 = 1073741824 entries in it. This array takes about 4GB. Each entry points to a location in the word locations table where the repeated occurrences for this word index are listed. The size of this table varies depending on the number of bases in the genome and how many of the word sequences in the genome are unique. For the human genome, with a word repeat cutoff of 5000, the size is around 11GB. Given these data structures, karma is then able to rapidly locate all occurrences of any given 15 base pair word. Karma Matching AlgorithmWe'll examine two cases - single end reads, then paired end reads. While there is some overlap in the cases, they are distinctly different. First, we will describe some computations that are common to both.Common operationsFor both single and paired end reads, there are several basic operations that need to happen.
Single End Matching
foreach read:
initialize bestMatch
foreach direction of the read:
foreach index word of the read:
calculate the number of mismatches, and the sum of the
quality scores at those mismatch quality positions.
check for various search termination conditions.
keep cumulative stats, including summation of posterior probabilities.
if this match is better than the current bestMatch
update bestMatch.
check for and randomly break ties.
done
done
bestMatch now contains a genome match position, a direction, and
a cumulative posterior probability, which gives us the map score.
update overall match statistics
print the bestMatch
done
print summary stat information
Paired End MatchingPaired end match processing has several other termination cases and conditions, and a few modifications as well:
foreach read pair:
iterate over all zero and one mutations on readA:
record all possible match candidate genome match positions
merge sort all match candidate positions for readA
initialize bestMatchA
initialize bestMatchB
iterate over all zero and one mutations on readB:
for each candidate match position in repeats for readB positionB:
binary search with a filter on positionB
if found:
update stats
if combined score is better than existing best score:
update bestMatchA
update bestMatchB
(current no tie breaking)
done
done
bestMatchA and bestMatchB now contain genome match position, direction, etc for each
update overall match statistics
print the pair of bestMatches
done
print summary stat information
|