MACH Tutorial - Input Files

MACH 1.0 Tutorial

A full tutorial is not yet available, but the README file (pasted below) should give you a flavor of how MACH works.
README File

INPUT FILES
===========

Mach 1.0 needs Merlin format data and pedigree files as input.

The data file should look like this:

 M marker1
 M marker2
 ...
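
As an illustration of the layout above, here is a minimal Python
sketch that produces the text of a data file from a list of marker
names. The function name and the marker names are hypothetical, and
the sketch assumes the data file contains nothing but "M" lines, as
in the example above.

```python
# Sketch: build the text of a Merlin-format data file, one "M <marker>"
# line per marker. Names used here are illustrative, not part of MACH.


def format_data_file(marker_names):
    """Return data file text with one 'M name' line per marker, in order."""
    lines = []
    for name in marker_names:
        # A marker name must be a single whitespace-free token.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"marker name must be a single token: {name!r}")
        lines.append(f"M {name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    markers = ["marker1", "marker2", "marker3"]  # hypothetical marker names
    print(format_data_file(markers), end="")
```

The markers should be listed in the same order as the genotype
columns of the pedigree file.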

The pedigree file should list one individual per row. Each row 
should start with a family id and individual id, followed by a
father and mother id (which should both be 0, 'zero', since
mach1 assumes individuals are unrelated), and sex. These initial
columns are followed by a series of marker genotypes, each with 
two alleles. Alleles can be coded as 1, 2, 3, 4 or A, C, G, T.

For example:

 FAM1001   ID1234  0   0   M   1 1   1 2   2 2
 FAM1002   ID1234  0   0   F   1 2   2 2   3 3

Or:

  FAM1001   ID1234  0   0   M  A A   A C   C C
  FAM1002   ID1234  0   0   F  A C   C C   G G
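
The row layout described above can be checked with a short Python
sketch like the one below. The function name is hypothetical. The
text above does not say how missing genotypes are coded, so this
sketch accepts only the eight allele codes listed; extend
VALID_ALLELES if your files use a missing-data code.

```python
# Sketch: check one pedigree row against the layout described in the text.
# Function and variable names are illustrative, not part of MACH.
VALID_ALLELES = set("1234ACGT")


def parse_pedigree_row(line, n_markers):
    """Return (family, individual, sex, genotypes) or raise ValueError."""
    fields = line.split()
    expected = 5 + 2 * n_markers  # five leading columns, two alleles per marker
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    family, individual, father, mother, sex = fields[:5]
    if father != "0" or mother != "0":
        raise ValueError("father and mother ids must both be 0")
    alleles = fields[5:]
    for allele in alleles:
        if allele not in VALID_ALLELES:
            raise ValueError(f"unexpected allele code: {allele!r}")
    # Pair up consecutive alleles into one genotype per marker.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return family, individual, sex, genotypes


if __name__ == "__main__":
    row = "FAM1001   ID1234  0   0   M   1 1   1 2   2 2"
    print(parse_pedigree_row(row, n_markers=3))
```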
 


USING MACH 1.0 for HAPLOTYPING
==============================

To use Mach 1.0 to haplotype a sample of unrelated individuals, 
you'll need a MERLIN format pedigree and data file. You should
make sure that markers are ordered according to their physical
position and use the --phase command line option to request the
output of phased chromosomes.

The key parameters for managing the quality of inferred haplotypes
and the amount of computational effort expended in generating them
are the --rounds and --states parameters. If missing data is not 
distributed evenly among the available individuals, you should 
also consider the --weighted parameter (which favors using individuals 
with more genotype data as templates for haplotyping other individuals).

The parameter --rounds K specifies how many iterations of the Markov 
sampler should be run. If there isn't much missing data, a value of
50 should give a reasonable solution. Larger values will provide even
better solutions, at the cost of more computation.

The parameter --states K specifies how many haplotypes should be
considered when updating each individual. Larger values will generate
more accurate solutions, but may slow things down a bit (as well as
requiring more memory). A value of 200 or larger typically provides
quite good solutions. The default is to use all available haplotypes
for each update (but this can require a lot of memory and time!).

Other important parameters are --compact (reduces memory use) and
--poll K (to request intermediate solutions after K iterations).

Example Usage:

   mach -d sample.dat -p sample.ped --rounds 50 --states 200 --phase
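
If you launch runs from a script, a small helper can keep the
parameters above in one place. The following Python sketch only
assembles the command line and does not run anything. The helper
name and its defaults are hypothetical; the executable name and the
options are the ones used in this README.

```python
# Sketch: assemble a haplotyping command line from the options described
# above. The helper is illustrative and is not part of MACH.
import shlex


def haplotyping_command(dat, ped, rounds=50, states=200,
                        weighted=False, compact=False):
    """Return the argument list for a haplotyping run."""
    args = ["mach", "-d", dat, "-p", ped,
            "--rounds", str(rounds), "--states", str(states)]
    if weighted:
        args.append("--weighted")  # favor well-genotyped individuals as templates
    if compact:
        args.append("--compact")   # reduce memory use
    args.append("--phase")         # request phased chromosomes
    return args


if __name__ == "__main__":
    print(shlex.join(haplotyping_command("sample.dat", "sample.ped")))
```

The returned list can be passed to subprocess.run once you have
checked that the printed command matches what you intend to run.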


USING MACH 1.0 to INFER UNTYPED MARKERS
=======================================

To use Mach 1.0 to infer genotypes at untyped markers, you 
should use the --geno command line option. There are two main
strategies for imputation:

INCLUDE REFERENCE (e.g. HAPMAP) GENOTYPES IN YOUR DATASET:

If you select this option, you should simply create one large
pooled dataset. Some individuals will have missing data and 
others will have much more complete genotyping information. 

In addition to estimating the most likely genotype for 
each individual, you can use the --dosage and --quality 
command line options to request additional information about 
each inferred genotype.

USE REFERENCE (e.g. HAPMAP) HAPLOTYPES AS INPUT:

If you select this option, you should generate a file that 
includes a set of reference haplotypes. These can be typed 
at more markers than are available in your sample. You will
also need a small file that lists all the markers that appear
in the phased haplotypes.

Then, to estimate missing genotypes, you'll need to provide
the Merlin format data and pedigree files, the reference 
haplotypes and the list of SNPs in the reference haplotypes.
All markers in the pedigree should also appear in the
reference haplotype set. 
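
The requirement that every marker in the pedigree also appears in
the reference set can be checked before starting a run. The Python
sketch below is one way to do so. It assumes that the SNP list file
holds one marker name per line (taking the first token of each
line), which is not stated in this README; the function names are
hypothetical.

```python
# Sketch: list markers from the data file that are absent from the
# reference SNP list. Assumes one marker name per line in the SNP list.


def markers_in_data_file(data_text):
    """Return marker names from the 'M name' lines of a data file."""
    names = []
    for line in data_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "M":
            names.append(fields[1])
    return names


def markers_missing_from_reference(data_text, snp_list_text):
    """Return data file markers that do not appear in the SNP list."""
    reference = set()
    for line in snp_list_text.splitlines():
        fields = line.split()
        if fields:
            reference.add(fields[0])
    return [m for m in markers_in_data_file(data_text) if m not in reference]


if __name__ == "__main__":
    data = "M rs1\nM rs2\nM rs3\n"   # hypothetical sample.dat contents
    snps = "rs1\nrs3\nrs4\n"         # hypothetical hapmap.snps contents
    print(markers_missing_from_reference(data, snps))
```

An empty result means that all markers in the data file are covered
by the reference list.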

Most of the time, you'll get good estimates of genotypes at untyped 
markers using the --rounds N and --greedy options.

If you don't use the --greedy option, you can control computational
effort with the --weighted and --states options. However, this
alternative strategy generally requires quite a few more iterations
before converging to a good solution.

Examples:

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 50 --greedy --geno

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 500 --states 200 --geno

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 500 --states 200 --weighted --geno

SPEEDING UP IMPUTATION

The standard genotype imputation approach, described in the 
preceding section, works best when you execute a large
number of iterations of the Markov Chain (50-100). These iterations 
are used to simultaneously update the crossover map (which determines
the likely locations for haplotype transitions), to update the error
rate map (which flags unusual markers), and to estimate the 
missing genotypes. 

An alternative approach is to use a single set of estimates for
the crossover and error rate maps and, conditional on these, to 
find the most likely genotypes. This approach seems to work quite
well. To use it, use the --crossovermap and --errormap options to
specify estimates of error and crossover rates from a previous
mach run, and request the --mle option instead of --geno. 

If you don't have an available set of map estimates, you can 
request that Mach estimate them using a small number of iterations
of the Markov Chain with the --rounds option.

Examples:

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --crossovermap mach.rec --errormap mach.erate --greedy --mle

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --greedy --mle --rounds 5


MACH1 OUTPUT KEY
================

Mach 1.0 generates a table that provides useful information
about each marker. The filename for the table has the extension
.info or .mlinfo, depending on whether the --mle option is used.
 
This table includes the marker name, allele labels, and minor allele 
frequency for each marker. In addition, the estimated probability 
that an average imputed genotype will match an experimental 
genotype is output (this should be 1.0 for genotyped markers, and
will often be less for untyped markers). You will also get an
estimate of the r-squared correlation between estimated
genotype scores and true genotypes.
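
To make the r-squared measure concrete, the Python sketch below
computes the squared correlation between estimated genotype scores
and known genotypes for one marker. It assumes that a genotype
score is an expected allele count between 0 and 2 and that true
genotypes are available for comparison. Mach 1.0 reports an
estimate of this quantity; the sketch is not its internal
calculation, and the function name is hypothetical.

```python
# Sketch: squared Pearson correlation between estimated genotype scores
# and true allele counts for a single marker. Illustrative only.


def r_squared(estimated, observed):
    """Return the squared correlation between two equal-length sequences."""
    n = len(estimated)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_e = sum(estimated) / n
    mean_o = sum(observed) / n
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(estimated, observed))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_e == 0 or var_o == 0:
        raise ValueError("r-squared is undefined when a sequence is constant")
    return cov * cov / (var_e * var_o)


if __name__ == "__main__":
    scores = [0.1, 0.9, 1.8, 1.2]  # hypothetical estimated allele counts
    truth = [0, 1, 2, 1]           # hypothetical true allele counts
    print(round(r_squared(scores, truth), 3))
```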

ASSESSING QUALITY OF SOLUTIONS
==============================

One simple way to empirically assess quality of the solutions 
generated by Mach 1.0 is to use the --mask option. This option 
hides a small proportion of genotypes from the haplotyper and 
then compares the imputed genotypes at these locations with 
the actual genotypes.

Examples:

  mach -d sample.dat -p sample.ped --rounds 50 --states 200 --mask 0.02

  mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 50 --greedy --mask 0.02
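
The idea behind masking can be sketched in a few lines of Python:
hide a proportion of the known genotypes, impute them, and count
how often the imputed genotype matches the hidden one. This is a
conceptual illustration under simple assumptions (genotypes as
unordered allele pairs, None for a hidden genotype), not a
description of how Mach 1.0 implements --mask. All names are
hypothetical.

```python
# Sketch: hide a proportion of genotypes, then score imputed values
# against the hidden originals. Conceptual illustration only.
import random


def mask_genotypes(genotypes, proportion, seed=0):
    """Return (masked copy, {index: original genotype}) for hidden entries."""
    rng = random.Random(seed)
    masked = list(genotypes)
    hidden = {}
    for index, genotype in enumerate(genotypes):
        if genotype is not None and rng.random() < proportion:
            hidden[index] = genotype
            masked[index] = None
    return masked, hidden


def concordance(imputed, hidden):
    """Return the fraction of hidden genotypes that were imputed correctly."""
    if not hidden:
        raise ValueError("no genotypes were hidden")
    matches = sum(
        1 for index, truth in hidden.items()
        if sorted(imputed[index]) == sorted(truth)  # allele order is ignored
    )
    return matches / len(hidden)


if __name__ == "__main__":
    observed = [("A", "A"), ("A", "C"), ("C", "C"), ("A", "C")]
    masked, hidden = mask_genotypes(observed, proportion=0.5, seed=1)
    print(masked, hidden)
```

With a small masked proportion such as 0.02, most genotypes remain
available to the haplotyper while the hidden ones give an empirical
estimate of accuracy.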