How Karma Works

Karma is an heuristic k-tuple word method for aligning reads to a reference genome. It trades high memory use to obtain superior speed. See wikipedia for a discussion of various mapping approaches.

Karma File Formats

Karma uses two different input file formats, one for the reference, and a related one for the reads that it is to map.

The output format is SAM, which is a new format developed in part by the 1000 Genome Project.

Karma data structures

To work, Karma needs two things, 1) the reference genome provided by the user, and 2) a Karma created pair of data structures that represents all unique locates of all possible 15 base character 'words'.

Binary Genome Reference

To speed future loading of the reference genome, karma first converts the reference to a packed binary array with two bases per eight bit byte. The file is then mapped into memory when it is used. This allows nearly instantaneous loading of previously RAM cached data.

Word Index

After creating the binary reference, karma then creates two additional arrays. The second array is a packed set of word locations for all possible words. The repeated locations of a particular word are all stored together, in genome increasing location.

The first array is an index of all possible combinations of word values, so for 15 base pair words, it has 4**15 = 1073741824 entries in it. This array takes about 4GB. Each entry points to a location in the word locations table where the repeated occurrences for this word index are listed. The size of this table varies depending on the number of bases in the genome and how many of the word sequences in the genome are unique. For the human genome, with a word repeat cutoff of 5000, the size is around 11GB.

Given these data structures, karma is then able to rapidly locate all occurrences of any given 15 base pair word.

Karma Matching Algorithm

We'll examine two cases - single end reads, then paired end reads. While there is some overlap in the cases, they are distinctly different. First, we will describe some computations that are common to both.

Common operations

For both single and paired end reads, there are several basic operations that need to happen.
  • Iterate over all mutations of the index words in a single read.
  • Calculate the map quality of the read based on the sum of the posterior probabilities.
  • Identify common conditions for terminating a search.
    • an index word contains unreadable bases
    • an index word hits a 'high repeat' area (hit a word that exceeds --occurrenceCutoff)
    • a set of evaluated read hits had a combined posterior quality that exceeds a reasonable threshhold

Single End Matching

foreach read:
    initialize bestMatch

    foreach direction of the read:

	    foreach index word of the read:

		calculate the number of mismatches, and the sum of the
                    quality scores at those mismatch quality positions.

                check for various search termination conditions.

                keep cumulative stats, including summation of posterior probabilities.

                if this match is better than the current bestMatch

                    update bestMatch.

                    check for and randomly break ties.
            done
    done

    bestMatch now contains a genome match position, a direction, and
    a cumulative posterior probability, which gives us the map score.

    update overall match statistics

    print the bestMatch

done

print summary stat information

Paired End Matching

Paired end match processing has several other termination cases and conditions, and a few modifications as well:
  • If one read has bad bases in an index word, both reads are dropped.
  • Map quality includes the sum of mismatches for both reads, not just one.
  • others?
foreach read pair:
    iterate over all zero and one mutations on readA:
        record all possible match candidate genome match positions

    merge sort all match candidate positions for readA


    initialize bestMatchA
    initialize bestMatchB

    iterate over all zero and one mutations on readB:
        for each candidate match position in repeats for readB positionB:
	    binary search with a filter on positionB
            if found:
                update stats
                if combined score is better than existing best score:
                    update bestMatchA
                    update bestMatchB
                (current no tie breaking)
        done
    done

    bestMatchA and bestMatchB now contain genome match position, direction, etc for each

    update overall match statistics

    print the pair of bestMatches

done

print summary stat information