Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb,
with documentation at https://derijkp.github.io/genomecomb.
For up-to-date versions, go there. These pages remain here only for the data on the older scientific
application (or for anyone who really needs a long-obsolete version of the software).
Process_project
Format
cg process_project ?options? projectdir ?dbdir?
Summary
process a sequencing project directory (projectdir), generating full analysis
information (variant calls, multicompar, reports, ...) starting from
raw sample data from various sources.
Description
The cg process_project
command performs the entire secondary analysis (clipping, alignment,
variant calling, reports, ...) and part of the tertiary analysis
(combining samples, annotation, ...) on a number of samples that may
come from various sources. A practical example of the workflow can be
found in howto_process_project.
The command expects a basic genomecomb project directory (as
described extensively in projectdir)
containing a number of samples with raw data (fastq, Complete
Genomics results, ...). Each sample is in a separate subdirectory of
a directory named samples in the projectdir. You can add
samples manually or using the cg
project_addsample command as described in howto_process_project.
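As a rough illustration of the layout described above, a project skeleton could be created by hand as follows. This is only a sketch: the project location and the sample names (sample1, sample2) are hypothetical, and the fastq subdirectory is assumed from the path mentioned under Arguments (samplename/fastq/).

```shell
# Hypothetical project location and sample names, for illustration only.
projectdir="${TMPDIR:-/tmp}/testproject"

# Each sample gets its own subdirectory under "samples" in the projectdir.
mkdir -p "$projectdir/samples"
for sample in sample1 sample2; do
    # Assumption: raw fastq files go in a fastq/ subdirectory of the sample.
    mkdir -p "$projectdir/samples/$sample/fastq"
done

# Show the resulting directory tree.
find "$projectdir" -type d | sort
```

The raw fastq files of each sample would then be copied or linked into its fastq directory before running cg process_project.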
Per sample analysis
In the first step, each sampledir is processed using cg process_sample. Samples in one
project can come from different sources (Complete Genomics, Illumina
sequencing) and be of different types (shotgun, amplicon). Some
options are applied to all samples, e.g. the -amplicons option (for
amplicon sequencing analysis) will place (a link to) the given
amplicons file in each sampledir. These options should only be used
in projects with uniform samples. For mixed samples, these options
can be applied specifically by placing files, e.g. an amplicon file
(named reg_*_amplicons.tsv) in the appropriate sample directories.
More information on specific sample types and options can be found in
the description of cg
process_sample.
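For a project with mixed samples, the per-sample placement described above might look like the sketch below. All names are hypothetical (mixedproject, amplicon_sample1, shotgun_sample1, reg_hg19_amplicons.tsv); only the reg_*_amplicons.tsv naming pattern comes from the text, and the empty placeholder stands in for a real amplicons file.

```shell
# Hypothetical project with one amplicon sample and one shotgun sample.
projectdir="${TMPDIR:-/tmp}/mixedproject"
mkdir -p "$projectdir/samples/amplicon_sample1" "$projectdir/samples/shotgun_sample1"

# Empty placeholder standing in for a real amplicons file.
ampliconfile="$projectdir/reg_hg19_amplicons.tsv"
: > "$ampliconfile"

# Link the amplicons file only into the sample that needs amplicon analysis;
# the shotgun sample is left untouched.
ln -sf "$ampliconfile" "$projectdir/samples/amplicon_sample1/reg_hg19_amplicons.tsv"

# List the sample directories that now carry an amplicons file.
ls "$projectdir"/samples/*/reg_*_amplicons.tsv
```

With this layout the -amplicons option would not be given on the command line, since it applies to all samples that do not yet have a sample specific amplicon file.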
Combined analysis
In the final step process_project will call cg process_multicompar to
combine sample results in the subdirectory compar. Different result
files may be present depending on the type of analysis:
- annot_compar-projectname.tsv
- multicompar file containing information for all variants in
all samples (and all methods). If a variant is not present in one
of the samples, the information at the position of the variant will
be completed (is the position sequenced or not, coverage, ...). The
file is also annotated with all databases in dbdir (impact on
genes, regions of interest, known variant data).
- sreg-projectname.tsv
- sequenced region multicompar file containing for all regions
whether they are sequenced (1) or not (0) for each sample.
- annot_cgsv-projectname.tsv
- combined results of Complete Genomics structural variant
calling
- annot_cgcnv-projectname.tsv
- combined results of Complete Genomics CNV calling
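Since the set of result files depends on the type of analysis, a small check such as the one below can show which of the files listed above were actually produced. This is a sketch only: list_compar_results is a hypothetical helper (not part of genomecomb), and it assumes the result files sit directly in the compar subdirectory under exactly the names given above.

```shell
# list_compar_results is a hypothetical helper, not a genomecomb command.
# Usage: list_compar_results projectdir projectname
list_compar_results() {
    projectdir=$1
    projectname=$2
    # File name prefixes taken from the list of possible result files.
    for prefix in annot_compar sreg annot_cgsv annot_cgcnv; do
        file="$projectdir/compar/$prefix-$projectname.tsv"
        if [ -e "$file" ]; then
            echo "present: $file"
        else
            echo "absent:  $file"
        fi
    done
}

# Assumption: the project name equals the name of the project directory.
list_compar_results testproject testproject
```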
Arguments
- projectdir
- project directory with Illumina data for different samples,
each sample in a subdirectory. The command will search for fastq
files in dir/samplename/fastq/
- dbdir
- directory containing reference data (genome sequence,
annotation, ...). dbdir can also be given in a projectinfo.tsv file
in the project directory. process_illumina called with the dbdir
parameter will create the projectinfo.tsv file.
Options
This command can be distributed on a cluster or using multiple
cores with job options (more info with cg
help joboptions).
As different types of original data are processed differently, not
all options are applicable. Options that are not applicable to the
given type of data are ignored.
- -dbdir dbdir
- dbdir can also be given as an option (instead of
second parameter)
- -minfastqreads num
- fastq based samples with fewer than num reads in the
fastq files are not processed and not added to the final compar.
- -paired 1/0 (-p)
- sequences are paired (1) or unpaired (0)
- -adapterfile file
- Use file for possible adapter sequences
- -removeskew num
- -k parameter for sequence clipping using fastq-mcf: sKew
percentage-less-than causing cycle removal
- -aligner aligner (-a)
- use the given aligner for mapping to the reference genome
(default bwa)
- -realign value
- If value is 0, realignment will not be performed. Use
1 (default) to realign bam files using gatk, or srma to
realign using srma.
- -removeduplicates 0/1/picard
- By default duplicates will be removed (1) using
bammarkduplicates2 (from biobambam2), except for amplicon
sequencing. With this option you can explicitly request or turn
off duplicate removal (overruling the default). If you want to use
large amounts of memory, you can still use picard for removing
duplicates (third option).
- -amplicons ampliconfile
- This option turns on amplicon sequencing analysis (as
described in cg
process_sample) using the amplicons defined in
ampliconfile for all samples that do not have a sample
specific amplicon file yet.
- -varcallers varcallers
- (space separated) list of variant callers to be used (default
"gatk sam"). Currently supported are: gatk, sam and
freebayes
- -split 1/0
- split multiple alternative genotypes over different lines
- -downsampling_type NONE/ALL_READS/BY_SAMPLE/
- sets the downsampling type used by GATK (empty for default).
- -reports list
- use basic (default) for creating most reports, or all for all
reports. If you only want some made, give these as a space
separated list. Possible reports are (further explained in cg process_reports): fastqstats
fastqc flagstat_reads flagstat_alignments histodepth vars hsmetrics
covered histo predictgender
- -dbfile file
- Use the given file for extra (files in dbdir
are already used) annotation. This option can be given more than
once; all given files will be added
- -dbfiles files
- Use files for extra (files in dbdir are already
used) annotation. files should be a space separated list of
files.
- -conv_nextseq 1/0
- generate fastqs for nextseq run & create sample folders -
rundir should be placed in projectdir of resulting variants. This
option can be added multiple times (with different files)
- -targetfile targetfile
- if targetfile is provided, coverage statistics will be
calculated for this region
- -targetvarsfile file
- Use this option to easily check certain target
positions/variants in the multicompar. The variants in file
will always be added to the final multicompar file, even if none
of the samples has a variant (or is even sequenced) at that position.
- -m maxopenfiles (-maxopenfiles)
- The number of files that a program can keep open at the same
time is limited. pmulticompar will distribute the subtasks in such
a way that the number of files open at the same time stays below this
limit. With this option, the maximum number of open files can be
set manually (e.g. if the program does not deduce the proper limit,
or you want to affect the distribution).
- -samBQ number
- only for samtools; minimum base quality for a base to be
considered (samtools --min-BQ option)
- -jobsample 0/1
- By default (0) the processing of each sample is split into many
separate jobs. If you have to process many samples with relatively
short individual runtimes, you can set this to 1 to run each sample
in one job, thus reducing the job management overhead.
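To show how several of the options above combine on one command line, here is a sketch that assembles a call and prints it instead of running it. It uses only options documented above; the project name and dbdir are taken from the Example section, and the value 1000 for -minfastqreads is an arbitrary illustrative choice, not a recommended setting.

```shell
# Project directory and reference directory as used in the Example section.
projectdir=testproject
dbdir=/complgen/refseq/hg19

# Options come before projectdir, following the Format line:
#   cg process_project ?options? projectdir ?dbdir?
set -- cg process_project \
    -paired 1 \
    -aligner bwa \
    -realign 1 \
    -varcallers "gatk sam" \
    -reports basic \
    -minfastqreads 1000 \
    "$projectdir" "$dbdir"

# Print the assembled command; replace this echo with "$@" to start the analysis.
cmdline="$*"
echo "would run: $cmdline"
```

Keeping the command in the positional parameters preserves the quoting of the space separated varcallers list when it is eventually run with "$@".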
Dependencies
Some of the programs needed in this workflow are not distributed
with genomecomb. gatk and picard should be installed separately.
Their installation location can be given using the environment
variables GATK and PICARD. These should point to the installation
directory that contains the jar files. If these environment variables
are not set, directories named gatk and picard will be searched for in
the PATH. If used, freebayes must also be installed separately, and
should be runnable from the PATH.
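Before starting a long run, the dependency setup described above can be checked with a few lines of shell. The sketch below is an assumption about what a useful check looks like, not something genomecomb provides: check_dependencies is a hypothetical helper that only tests whether GATK and PICARD point to directories containing jar files and whether freebayes is found on the PATH.

```shell
# check_dependencies is a hypothetical helper, not a genomecomb command.
check_dependencies() {
    status=0
    for var in GATK PICARD; do
        # Read the value of the environment variable named in $var.
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "$var is not set; the PATH will be searched instead"
        elif [ ! -d "$dir" ]; then
            echo "$var=$dir is not a directory"
            status=1
        elif ! ls "$dir"/*.jar >/dev/null 2>&1; then
            echo "$var=$dir contains no jar files"
            status=1
        else
            echo "$var=$dir looks usable"
        fi
    done
    # freebayes is only needed when it is listed in -varcallers.
    if command -v freebayes >/dev/null 2>&1; then
        echo "freebayes found on the PATH"
    else
        echo "freebayes not found on the PATH (only needed if used)"
    fi
    return $status
}

check_dependencies || echo "fix the settings above before running cg process_project"
```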
Example
export GATK=/opt/bio/GenomeAnalysisTK-2.4-9-g532efad/
export PICARD=/opt/bio/picard-tools-1.87
cg process_project -d sge testproject /complgen/refseq/hg19
Category
Process