GenomeComb
Genomecomb moved to github on https://github.com/derijkp/genomecomb with documentation on https://derijkp.github.io/genomecomb. For up to date versions, go there. These pages only remain here for the data on the older scientific application (or if someone really needs a long obsolete version of the software)
This text describes how to analyze, combine, and annotate whole genome, exome or targetted sequencing raw data using genomecomb. Starting from raw data, a fully annotated projectdir will be created, that can be further queried and analyzed using cg select and cg viz
In the howto, the smal example and test data set ori_mixed_yri_mx2 downloadable from the genomecomb website will be used. This data set was derived from publically available exome and genome sequencing data by extracting only raw data covering the region of the MX2 gene (on chr21) and a part of the ACO2 gene (on chr22).
We start by creating and populating a projectdir (tmp/mixed_yri_mx2) with samples using the cg project_addsample command. (The projectdir will be made with the first addsample call if it does not yet exists)
# Create the projectdir # This is not strictly needed: # It would be made with the first addsample call if it did not yet exists mkdir -p tmp/mixed_yri_mx2 # Add a Complete genomics sample named cgNA19240mx2. # ori/mixed_yri_mx2/cgNA19240mx2 contains an ASM directory as obtained from Complete Genomics cg project_addsample tmp/mixed_yri_mx2 cgNA19240mx2 ori/mixed_yri_mx2/cgNA19240mx2 # Add illumina fastq files from a genome sequencing using wildcards to find all fastq files cg project_addsample tmp/mixed_yri_mx2 gilNA19240mx2 ori/mixed_yri_mx2/gilNA19240mx2/*.fq.gz # Add illumina fastq files from exome sequencing # The option -targetfile gives a regionfile with the target regions for this sample # (The \ at the end of the first line means the the command is continued on the next line) cg project_addsample -targetfile ori/mixed_yri_mx2/reg_hg19_exome_SeqCap_EZ_v3.tsv.lz4 \ tmp/mixed_yri_mx2 exNA19239mx2 ori/mixed_yri_mx2/exNA19239mx2/*.fq.gz # Add another exome sequencing sample # Here we do not give the individual fastq files as argument, but the directory containing them. # you can only do this if the directory does not contain fastq files from other samples cg project_addsample -targetfile ori/mixed_yri_mx2/reg_hg19_exome_SeqCap_EZ_v3.tsv.lz4 \ tmp/mixed_yri_mx2 exNA19240mx2 ori/mixed_yri_mx2/exNA19240mx2
The cg process_project command will run the entire secondary analysis (clipping, alignment, variant calling, reports, ...) and part of the tertiary analysis (combining samples, annotation, ...) on the samples in the projectdir Reference information (genome sequence, annotation files, etc.) are in /data/refseq/hg19
cg process_project -verbose 2 -d 2 -split 1 -dbdir /data/refseq/hg19 tmp/mixed_yri_mx2
If the command is interupted, The processing can continue from where it was by repeating the same command. Processing can also be distributed over more than one processing unit: The -d 2 parameter in the example distibutes the processing over 2 cores on the local machine. If you have a Sun Grid Engine cluster, you can distribute on the cluster using -d sge
All analysis results are added in the projectdir, described in projectdir. cg viz allows an easy (graphical interface) view of the results.
cg viz tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4
opens the annotated combined variant file (format described in format_tsv) using cg viz, allowing you to browse through the table (even if it is millions of lines long). Look in howto_view for some examples of what to do in cg viz.
cg select can be used to query the result files: selecting or adding fields, selecting variants fullfilling given criteria or making summaries using a command line tool (functionality is similar to cg viz, with a few extras). As such it can be more easily integrated in other workflows and exact logging of what was done is easier dan using cg viz.
Some examples are shown here, you can find more examples in howto_query and an extensive description of the possibilities in the cg select help.
# Which fields are in the file, results are shown on the terminal (stdout) cg select -h tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 # Which "samples" are in the file, results are shown on the terminal (stdout) cg select -n tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 # Write a file tmp/short.tsv containing only the basic variant fields cg select -f 'chromosome begin end type ref alt' \ tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 tmp/short.tsv # Add a field count that calculates in how many of the "NA19240 samples" zyg is either t or m cg select \ -f 'chromosome begin end type ref alt {count=scount($zyg in "t m" and $sample regexp "NA19240")}' \ tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 tmp/short.tsv # Select only variants where gatk has made a homozygous call in sample gilNA19240mx2 # The results are compressed using "piping": # The results of cg select are transferred (indicated by the | or pipe character) to "cg lz4" # cg lz4 compresses the data. # the > character redirects the results of the compression to the file tmp/homozygous.tsv.lz4 cg select -q '$zyg-gatk-rdsbwa-gilNA19240mx2 == "m"' \ tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 | cg lz4 > tmp/homozygous.tsv.lz4 # intGene_impact contains the impact of the variant on all alternative splice variants in the intGene # gene dataset. intGene_impact contains a list of impacts (so we cannot use ==). # You can also query using regular expression. The following selects only variants with # impact matching the regexp pattern "UTR|RNA": containing UTR or RNA cg select -q '$intGene_impact regexp "UTR|RNA"' \ tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 tmp/utr.tsv # You can also properly check the list vs a given list of impacts # (getting the correct list may be easier in cg viz using Easyquery) # shares is true if the list in intGene_impact shares elements with the given list cg select \ -q '$intGene_impact shares {RNASTART UTR5KOZAK UTR5SPLICE UTR5 RNAEND UTR3SPLICE UTR3 RNASPLICE RNA}' \ tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.lz4 tmp/utr2.tsv
The following is an example text describing the default process_project workflow for Illumina sequencing with the proper references (The <version> actually used for the various tools can be found in the analysisinfo files):
Analysis was performed using the pipeline integrated in genomecomb (1) The pipeline used fastq-mcf (2) for adapter clipping. Reads were then aligned using bwa mem (3) and the resulting sam file converted to bam using samtools (4). Bam files were sorted and duplicates were removed using biobambam bammarkduplicates2 (5). Realignment in the neighborhood of indels was performed with GATK (6). Variants were called at all positions with a totalcoverage >= 5 using both GATK (6) and samtools (4). At this initial stage positions with a coverage < 5 or a quality < 30 were considered unsequenced. The resulting variant sets of different individuals were combined and annotated using genomecomb (1). (1) genomecomb version <version>, Reumers, J*, De Rijk, P*, Zhao, H, Liekens, A, Smeets, D, Cleary, J, Van Loo, P, Van Den Bossche, M, Catthoor, K, Sabbe, B, Despierre, E, Vergote, I, Hilbush, B, Lambrechts, D and Del-Favero, J (2011) Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing. Nature biotechnology, 30, 61-88 [PMID: 22178994] (2) fastq-mcf version <version>, Erik Aronesty (2011). ea-utils : "Command-line tools for processing biological sequencing data"; Expression Analysis, Durham, NC http://code.google.com/p/ea-utils (3) bwa version <version>, Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168] (4) samtools version <version>, Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943] (5) biobambam <version>, Tischler G., Leonard S. (2013) Biobambam: Tools for read pair collation based algorithms on BAM files. Source Code for Biology and Medicine 9(1) (6) GATK version <version>, DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. NATURE GENETICS 43:491-498 [PMID: 21478889]
For amplicon sequencing, remove the markduplicates part, and add Amplicon primers were clipped using genomecomb (1) after making the realignment.
If freebayes was used to call variants, add it to the list of variant callers, also for freebayes, genoqual is used for determining unsequenced (freebayes does not give the correct value in quality for doing this), e.g.:
... and freebayes (x). For GATK, positions with a coverage < 5 or a quality score < 30 were considered unsequenced. For freebayes coverage < 5 or a genotype quality < 30 was used as a cut-off. (x) freebayes version <version>, Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012