HPeak: An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data

Readme | Download | Home | FAQ | Update

Introduction


This program is for the purpose of defining genome-wide ChIP-enriched peaks in the human genome using short sequence reads. The underlying algorithm is based on a two-state HMM. 

 

Setup


Please visit Install page for details

 

Usage


perl HPeak.pl < -format FORMAT -t TFILE -n NAME > [Options] 

Options:

  -h                               Show this help message

  -format                       Format of tag file, "ELAND", "BED" or "Custom". REQUIRED

                                   When choosing "Custom", the following information needs to be specified in the order of:

                                   1. The column to store genome file where match was found

                                   2. The column to store base position of match

                                   3. The column to store direction of match

                                   The column numbers start at 1 and separate with comma.

                                   Example: the ELAND format can be presented as -format custom[7,8,9]

  -t TFILE                   Treatment file name. REQUIRED

  -n, -name                   Experiment name used to generate output file. REQUIRED

  -c CFILE                   Control file name.

  -fmin                         Minimal DNA fragment size. DEFAULT: 100

  -fmax                        Maximal DNA fragment size. DEFAULT: 300

  -w, -window             Window size (bp). DEFAULT: 25

  -s, -sig                      P-value threshold for peak detection. DEFAULT: 1e-3

  -wig                         Whether to generate WIG file that can be visualized in UCSC genome broswer

  -seq                         Whether to extract peak sequences

  -ann                        Whether to extract nearest gene information for peaks

Example:

perl HPeak.pl -format ELAND -t stimu.inp -n stimu

perl HPeak.pl -format CUSTOM[6,7,8] -t chip.inp -c mock.inp -n chip-mock -fmin 100 -fmax 300 -w 25 -s 1e-3 -wig -seq -ann

Parameters:


–format: input file format:

HPeak currently allows three types of input formats: ELAND, a modified BED or CUSTOM. ELAND is for files of s_N_eland_result.txt format produced by Illumina Genome Analyzer Pipeline 0.2 software suite (see the figure below for illustration).

>PATHBIO-SOLEXA_20F0CAAXX:1:39:231:694  TTTTTTGAGGAATATATGTATATATNTNTGTTGGGT    U0      1       1       0       hs_ref_chr5.fa  68396743        F       DD

>PATHBIO-SOLEXA_20F0CAAXX:1:39:856:492  TCTCCCAATTGACCTTTGGGATATGNGNATAAAATT    U0      1       0       0       hs_ref_chr7.fa  128899421       F       DD

Columns 7-9 contain mapping information we used are: 

7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).

The detailed information can be obtained from Illumina.

BED format is different from the one used in UCSC genome browser (detailed information about the original BED format can be found at http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED). In addition to the three required columns of chromosome, start, end positions, HPeak requires a fourth column which contains the strand information (+: positive strand, -: negative strand). They have to occupy the first four columns and need to be in the right order. The sequences /rows do not need to be sorted. More columns are allowed, but will be ignored by HPeak. An example of the modified BED format input file is shown below.

chr1     51131       51155       +
chr1     52543       52567       -
chr1     60254       60278       -

HPeak allows great flexibility in terms of the input format. Most non-standard formats can be used with the "CUSTOM" format option. The square bracket right after "CUSTOM" contains the column numbers in which chromosome, start and strand information are stored. For example, ELAND format is equivalent to "CUSTOM[7,8,9]". Note that no space is allowed in the phrase. It is required that the string “chr” and “.fa” (part of genome sequence files in which a match is achieved for the read) be present in the chromosome column. Other rows will be ignored. The strand can be denoted as either “F/R” or “+/-“.

HPeak support the new Illumina Genome Analyzer Pipeline 0.3.0. In this version, the final eland result for each lane is summarized in files called s_N_export.txt. One can use “CUSTOM[9,10,11](Note that no space is allowed in the phrase) to read this type of files. 

Note that the current version of HPeak only works for Human Genome.

–t, –c: input filenames

In addition to data files, one needs “input files” to run HPeak. These files contain location and names of sequencing data files. Each line in these files represents one data file. If there are multiple data files specified in the input file. they will be merged so sequences from multiple sources can be combined. A sample file is shown below.

Example.inp:

/data/GERALD/s_1_eland_result.txt
/data/GERALD/s_2_eland_result.txt
/data/GERALD/s_3_eland_result.txt
 

–fmin: Minimum fragment width

The lower bound of the length of size-selected DNA fragments. The default value is 100 bp. This number will be rounded to a multiple of the window size described below.

–fmax: Maximum fragment width

The upper bound of the length of size-selected DNA fragments. The default value is 300 bp. This number will be rounded to a multiple of the window size described below.

 –w, –window: window size

This program partitions the genome into small segment so counting of read coverage is conducted in each segment. This strategy allows comparison across samples. Larger window size reduce computation time and file size but lower resolution in defining enriched regions. The default value is 25bp.

 –s, –sig: significance level

This is the p-value threshold to determine whether a peak is significantly ChIP-enriched. Multiple comparisons are adjusted using the Bonferroni method, i.e., p/N, N is total number of regions. The default significance level is 0.001 (same as in Robertson et al. Nature Methods 2007).

–wig: whether to generate WIG format coverage profile file for the enriched regions.

–seq: whether to generate FASTA format sequence files for the enriched regions.

–ann: whether to generate detailed annotation information for the enriched regions.

Detailed annotation including GC%, conservation rate, genomic features (exon, intron, intergenic,…) and up- and down- stream gene name. Details about this file can be found in the Optional output files section.

Output:


.allregions.txt:

This is the main output file. It is in BED format indicating chromosome, start and end location, and the length (in bp) of all enriched regions. The last column contains the maximum coverage among all bins in this region. Note that the first column only contains the chromosome number, and X and Y are replaced by 23 and 24 for easier numerical manipulations.

.hpeak.out:

This is the new main output file. Compared to .allregions.txt, he only difference is an inserted extra column (column 5) which indicates the location within the peak that has the highest hypothetical DNA fragment (HDF) coverage). Note that the first column only contains the chromosome number, and X and Y are replaced by 23 and 24 for easier numerical manipulations.

.sum

This file contains three main parts. The first part lists all the parameter values entered. The second part contains total number of reads and uniquely mapped reads from treated and control samples. The third part summarizes the number of enriched regions, total length of DNA covered by these regions, and range of read coverage in these regions.

.log

This file contains all the raw output information. It is for debugging purpose.

Optional output files:

.seq

This is a FASTA format file containing the sequence of all enriched regions. Such a file is useful for motif scan which is often a follow up study. Note that we used the unmasked Build 36.1 finished human genome assembly (hg18, Mar. 2006). 
 
.annotation.txt

This file contains detailed annotation of the enriched regions. A typical annotation file is showm below:

chr1:1273426-1273850    425     13      GC%: 64.9       Masked%: 0.0    0+/-0.01        Intron  DVL1    NM_182779       -
chr1:1332076-1332350    275     8       GC%: 64.4       Masked%: 8.4    0.31+/-0.44     Exon    MRPL20  NM_017971       -

chr1:1499426-1499725    300     10      GC%: 76.0       Masked%: 0.0    0.03+/-0.16     Exon    SSU72   NM_014188       -

The columns are: peak genomic poisition, peak length, peak max height, GC content, repeated sequence percentage, mean and standard deviation of conservative scores for the enriched region, relationship with nearest genes including whether the peak is located within the gene or between genes, gene name, GB accession number, strand, distance to gene transcription start site. If the enriched region is located between genes, both upstream and downstream genes will be provided.

 .wig

This file contains WIG format coverage profile file for the enriched regions. One can directly upload the generated file onto UCSC genome browser for visualization.

Sample Data


There are three sample datasets provided in this package, sample.chip.txt, sample.mock.txt and sample.stimu_eland_result.txt. The first two files are subsets of the corresponding datasets generated from the Johnson et al. NRSF ChIP-Seq study (Johnson et al. Science 2007). The data are downloaded from Illumina website: http://www.illumina.com/downloads/Illumina_ChIP-Seq_Demo_Data_Johnson_Science_2007.zip. The format of these data is slightly different from the standard ELAND format. So one needs to use the “CUSTOM[6,7,8]” format option. Two input files: chip.inp and mock.inp are provided. A sample command can be found in the file sampleCommand. The third file is a subset of the corresponding dataset generated from the Robertson et al. STAT1 ChIP-Seq study (Robertson et al. Nature Methods 2007). The data are downloaded from BCGSC website: http://www.bcgsc.ca/data/chipseq.

 


[ HPeak ] | Qin Lab | Chinnaiyan Lab ]