Genomecomb moved to github on https://github.com/derijkp/genomecomb with documentation on https://derijkp.github.io/genomecomb. For up to date versions, go there. These pages only remain here for the data on the older scientific application (or if someone really needs a long obsolete version of the software)

Process_project

Format

cg process_project ?options? projectdir ?dbdir?

Summary

process a sequencing project directory (projectdir), generating full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data from various sources.

Description

The cg process_project command performs the entire secondary analysis (clipping, alignment, variant calling, reports, ...) and part of the tertiary analysis (combining samples, annotation, ...) on a number of samples that may come from various sources. A practical example of the workflow can be found in howto_process_project.

The command expects a basic genomecomb project directory (as described extensively in projectdir) containing a number of samples with raw data (fastq, Complete genomics results, ...). Each sample is in a separate subdirectory of a directory named samples in the projectdir. You can add samples manually or using the cg project_addsample command as described in howto_process_project.

Per sample analysis

In the first step, each sampledir is processed using cg process_sample; Samples in one project can come from different sources (Complete genomics, illumina sequencing) and be of different types (shotgun, amplicon). Some options are applied to all samples, e.g. the -amplicons option (for amplicon sequencing analysis) will place (a link to) the given amplicons file in each sampledir. These options should only be used in projects with uniform samples. For mixed samples, these options can be applied specifically by placing files, e.g. an amplicon file (named reg_*_amplicons.tsv) in the appropriate sample directories. More information on specific sample types and options can be found in the description of cg process_sample.

Combined analysis

In the final step process_project will call cg process_multicompar to combine sample results in the subdirectory compar. Different result files may be present depending on the type of analysis:

annot_compar-projectname.tsv: multicompar file containing information for all variants in all samples (and all methods). If a variant is not present in one of the samples, the information at the position of the variant will be completed (is the position sequenced or not, coverage, ...) The file is also annotated with all databases in dbdir (impact on genes, regions of interest, known variant data)
sreg-projectname.tsv: sequenced region multicompar file containing for all regions whether they are sequenced (1) or nor (0) for each sample.
annot_cgsv-projectname.tsv: combined results of Complete Genomics structural variant calling
annot_cgcnv-projectname.tsv: combined results of Complete Genomics CNV calling

Arguments

projectdir: project directory with illumina data for different samples, each sample in a sub directory. The proc will search for fastq files in dir/samplename/fastq/
dbdir: directory containing reference data (genome sequence, annotation, ...). dbdir can also be given in a projectinfo.tsv file in the project directory. process_illumina called with the dbdir parameter will create the projectinfo.tsv file.

Options

This command can be distributed on a cluster or using multiple with job options (more info with cg help joboptions)

As different types of original data are processed differently, not all options are applicable. Options that are not applicable to the given type of data are ignored.

-dbdir dbdir: dbdir can also be given as an option (instead of second parameter)
-minfastqreads num: fastq based samples with less than num reads in the fastq files are not processed and not added to the final compar.
-paired 1/0 (-p): sequenced are paired/unpaired
-adapterfile file: Use file for possible adapter sequences
-removeskew num: -k parameter for sequence clipping using fastq-mcf: sKew percentage-less-than causing cycle removal
-aligner aligner (-a): use the given aligner for mapping to the reference genome (default bwa)
-realign value: If value is 0, realignment will not be performed, use 1 for (default) realignment with gatk, or srma for alignment with srma if 1, bam files are realigned using gatk, use value srma to align using srma.
-removeduplicates 0/1/picard: By default duplicates will be removed (1) using bammarkduplicates2 (from biobambam2) except for amplicon sequencing. With this option you can specifically request or turn of duplicate removal (overruling the default). If you want to use large amounts of memory, you can still use picard for removing duplicates (third option)
-amplicons ampliconfile: This option turns on amplicon sequencing analysis (as described in cg process_sample) using the amplicons defained in ampliconfile for all samples that do not have a sample specific amplicon file yet.
-varcallers varcallers: (space separated) list of variant callers to be used (default "gatk sam"). Currently supported are" gatk, sam and freebayes
-split 1/0: split multiple alternative genotypes over different line
-downsampling_type NONE/ALL_READS/BY_SAMPLE/: sets the downsampling type used by GATK (empty for default).
-reports list: use basic (default) for creating most reports, or all for all reports. If you only want some made, give these as a space separated list. Possible reports are (further explained in cg process_reports): fastqstats fastqc flagstat_reads flagstat_alignments histodepth vars hsmetrics covered histo predictgender
-dbfile file: Use the given file for extra (files in dbdir are already used) annotation. This option can be given more than once; all given files will be added
-dbfiles files: Use files for extra (files in dbdir are already used) annotation. files should be a space separated list of files.
-conv_nextseq 1/0: generate fastqs for nextseq run & create sample folders - rundir should be placed in projectdir of resulting variants. This option can be added multiple times (with different files)
-targetfile targetfile: if targetfile is provided, coverage statistics will be calculated for this region
-targetvarsfile file: Use this option to easily check certain target positions/variants in the multicompar. The variants in file will allways be added in the final multicompar file, even if none of the samples is variant (or even sequenced) in it.
-m maxopenfiles (-maxopenfiles): The number of files that a program can keep open at the same time is limited. pmulticompar will distribute the subtasks thus, that the number of files open at the same time stays below this number. With this option, the maximum number of open files can be set manually (if the program e.g. does not deduce the proper limit, or you want to affect the distribution).
-samBQ number: only for samtools; minimum base quality for a base to be considered (samtools --min-BQ option)
-jobsample 0/1: By default (0) the processing of each sample is split in many separate jobs. If you have to process many samples with relatively short indivual runtimes, you can set this to 1 to run each sample in one job, thus reducing the job managment overhead.

This command can be distributed on a cluster or using multiple with job options (more info with cg help joboptions)

Dependencies

Some of the programs needed in this workflow are not distributed with genomecomb. gatk and picard should be installed separately. Their installation location can be given using the environment variables GATK and PICARD. These should point to the installation directory that contains the jar files. If these environment variables are not set, a directory named gatk and picard will be searched in the PATH. If used, freebayes must also be installed separately, and should be runnable from the path.

Example

export GATK=/opt/bio/GenomeAnalysisTK-2.4-9-g532efad/
export PICARD=/opt/bio/picard-tools-1.87
cg process_project -d sge testproject /complgen/refseq/hg19

Home

Contact

Installation

Documentation

Scientific applications