Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages remain here only for the data on the older scientific application (or if someone really needs a long-obsolete version of the software).

Process

Format

cg process_sample ?options? ?oridir? sampledir

Summary

Processes one sample directory (projectdir), generating full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data that can come from various sources.

Description

The command expects a basic genomecomb sample directory (as described extensively in projectdir). It generates all kinds of result data (variant calls, sequenced regions, ...), analyses and reports.

Several types of source data (fastq files, Complete Genomics analysis dir, ...) are supported. The directory containing the original starting data can be given as an option, an argument, or it can be present in a dir named ori in the sample dir. If given as an argument, a link named ori to the original data will be made. The results (which analyses, etc.) differ according to the type of original data, the parameters given (e.g. use -amplicons for amplicon sequencing) and files in the sampledir.

By default, process_sample will only create files that do not exist yet, or update ones that are older than the files they depend on. This way an analysis that was interrupted can simply be restarted (by giving the same command), and it will proceed from where it left off.
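
The rule described above (create a file if it is missing, update it if it is older than something it depends on) can be illustrated with a minimal Python sketch. This is not genomecomb's own implementation; the function name needs_update and the use of file modification times are assumptions made for illustration.

```python
import os


def needs_update(target, dependencies):
    """Return True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        # The result does not exist yet, so it has to be created.
        return True
    target_mtime = os.path.getmtime(target)
    # Any dependency modified after the target makes the target stale.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

A restarted run that applies such a check to every step skips the results that are already complete and up to date, and redoes only what is missing or stale.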

Arguments

oridir: directory containing original data, this can be a data directory as it comes from Complete Genomics, or simply a directory containing fastq files or a bam file. This argument is optional; if not given, a directory named ori containing the source data is expected in the sampledir (This can be a softlink), or a fastq dir with fastq files.
sampledir: name of the sample directory to be created (or completed if it already exists)

Options

This command can be distributed on a cluster or using multiple cores with job options (more info with cg help joboptions)

As different types of original data are processed differently, not all options are always applicable. Options that are not applicable to the given type of data are ignored.

-dbdir dbdir: Some of the analyses require a reference genome and databases; dbdir gives the directory where to find these
-oridir oridir: directory containing original data, this can be a data directory as it comes from Complete Genomics, or simply a directory containing fastq files or a bam file. A softlink to oridir named ori will be made in the sample directory.
-minfastqreads number: if fewer than number reads are found in the fastq files of the sample, the sample is not processed.
-paired 1/0 (-p): sequences are paired (1) or unpaired (0)
-adapterfile file: Use file for possible adapter sequences
-removeskew num: -k parameter for sequence clipping using fastq-mcf: sKew percentage-less-than causing cycle removal
-aligner aligner: use the given aligner for mapping to the reference genome (default bwa)
-realign value: If value is 0, realignment will not be performed; use 1 for (default) realignment with gatk, or srma for realignment with srma
-removeduplicates 0/1/picard: By default duplicates will be removed (1) using bammarkduplicates2 (from biobambam2), except for amplicon sequencing. With this option you can specifically request or turn off duplicate removal (overruling the default). If you want to use large amounts of memory, you can still use picard for removing duplicates (third option)
-amplicons ampliconfile: This option turns on amplicon sequencing analysis (see further) using the amplicons defined in ampliconfile
-varcallers varcallers (-v): (space separated) list of variant callers to be used (default "gatk sam")
-split 1/0 (-s): split multiple alternative genotypes over different lines
-downsampling_type NONE/ALL_READS/BY_SAMPLE/: sets the downsampling type used by GATK (empty for default).
-targetfile targetfile: if targetfile is provided, coverage statistics will be calculated for this region
-reports list: use basic (default) for creating most reports (or all for all reports). If you only want some made, give these as a space separated list. Possible reports are: flagstat_reads flagstat_alignments fastqc histodepth vars hsmetrics covered
-samBQ number: only for samtools; minimum base quality for a base to be considered (samtools --min-BQ option)
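
As an illustration of how the format and options above combine, the following Python sketch assembles a command line of the form cg process_sample ?options? ?oridir? sampledir. The helper build_process_sample_command and all paths are hypothetical; only the option names are taken from the list above.

```python
import shlex


def build_process_sample_command(sampledir, oridir=None, options=None):
    """Assemble an argv list for: cg process_sample ?options? ?oridir? sampledir."""
    argv = ["cg", "process_sample"]
    for name, value in (options or {}).items():
        # Every option is given as -name value.
        argv += ["-" + name, str(value)]
    if oridir is not None:
        # The optional oridir argument goes before the sampledir.
        argv.append(oridir)
    argv.append(sampledir)
    return argv


# Hypothetical paths, for illustration only.
command = build_process_sample_command(
    "samples/sample1",
    oridir="/data/run42/sample1_fastq",
    options={"dbdir": "/complgen/refseq/hg19", "varcallers": "gatk sam", "split": 1},
)
# shlex.join quotes the space separated varcallers list as a single argument.
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run; it is only printed here so that the sketch runs without genomecomb installed.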

Sample types

Several types of sample data are supported:

(Illumina) (targeted) shotgun sequencing

In this case the starting raw data for the sample is fastq files. These should be in a subdirectory of the sampledir named fastq. They can also be in a directory ori in the sampledir, in which case the fastq dir will be made and links to the fastq files made in it.

The names of matching fastq files of paired reads should be consecutive when sorted naturally, with the forward reads first. The usual naming of these files (same name, except for a 1 and 2) is ok. The name of each sample is taken from the sampledir name. The sample name should not contain hyphens (-).
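
The pairing requirement can be checked before starting an analysis. The sketch below sorts file names naturally (numbers compared as numbers) and pairs consecutive files as forward and reverse reads. The functions natural_key and pair_fastqs are illustrative names, and this natural sort is an approximation that may differ in details from the one genomecomb uses.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 'L2' sorts before 'L10'."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


def pair_fastqs(filenames):
    """Sort naturally and pair consecutive files as (forward, reverse)."""
    ordered = sorted(filenames, key=natural_key)
    if len(ordered) % 2 != 0:
        raise ValueError("paired data needs an even number of fastq files")
    return list(zip(ordered[0::2], ordered[1::2]))


# Hypothetical file names, for illustration only.
for forward, reverse in pair_fastqs(
    ["s_L10_R2.fq", "s_L2_R1.fq", "s_L10_R1.fq", "s_L2_R2.fq"]
):
    print(forward, reverse)
```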

By default reads are clipped using fastq-mcf, aligned to the reference genome in dbdir using bwa mem, duplicates removed (using picard) and realigned (using gatk). Variants are called using gatk and samtools. All files generated have names following the convention of using hyphens to separate different elements about the file. The first element is the type of file. The last element (before the extension) is the sample name. There can be several steps in between. Each sampledir will contain results for this individual sample of the following type:

map-rdsbwa-sample1.bam: bam file created by aligning the reads of sample1 to the reference genome in dbdir using bwa. The bam file has been sorted (s), duplicate marked (d), and realigned (r).

var-gatk-rdsbwa-sample1.tsv: a variant file that contains variants called by gatk based on map-rdsbwa-sample1.bam. Positions with a quality < 30 or coverage < 5 are considered unsequenced. Lower quality variants (but with quality >= 10) are still included in the variant list, but have a "u" in the sequenced and zyg columns to indicate that they are considered unsequenced.
sreg-gatk-rdsbwa-sample1.tsv: A region file with all regions that can be considered sequenced (quality >= 30 and coverage >= 5) using the same methods and quality measures as var-gatk-rdsbwa-sample1.tsv. Any position in those regions that is not in the variant file can be called reference with the same reliability as the variant calls.
varall-gatk-rdsbwa-sample1.tsv: variant calling data by gatk for all positions with >= 5 coverage (also reference called positions). This file is used to create the sreg files, and to update data in making multicompar files later.
reg_cluster-gatk-rdsbwa-sample1.tsv: regions with many clustered variants (which are less reliable)
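
The thresholds mentioned for the var and sreg files can be summarised in a small Python sketch. It only restates the cutoffs given above (quality 30, coverage 5, and quality 10 for inclusion in the variant list); the function name classify_variant and its return format are assumptions for illustration, not genomecomb code.

```python
MIN_QUALITY_SEQUENCED = 30   # quality < 30 is considered unsequenced
MIN_COVERAGE_SEQUENCED = 5   # coverage < 5 is considered unsequenced
MIN_QUALITY_LISTED = 10      # variants with quality >= 10 stay in the variant list


def classify_variant(quality, coverage):
    """Return (listed, sequenced) for a variant call.

    listed: the variant appears in the variant list.
    sequenced: the position counts as sequenced; a listed variant that is
    not sequenced is the case marked with "u".
    """
    sequenced = (quality >= MIN_QUALITY_SEQUENCED
                 and coverage >= MIN_COVERAGE_SEQUENCED)
    listed = quality >= MIN_QUALITY_LISTED
    return listed, sequenced
```

For example, a variant with quality 15 and coverage 20 gives (True, False): it is kept in the list, but marked as unsequenced.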

For samtools variant calling on the same bam file (map-rdsbwa-sample1.bam), these result files are named var-sam-rdsbwa-sample1.tsv, sreg-sam-rdsbwa-sample1.tsv, varall-sam-rdsbwa-sample1.tsv and reg_cluster-sam-rdsbwa-sample1.tsv.
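
Because sample names may not contain hyphens, a result file name can be split unambiguously into its elements. The following Python sketch does this for the convention described above (type first, sample name last, steps in between). ResultName and parse_result_name are hypothetical names, and multi-part extensions such as .tsv.gz are not handled.

```python
import os
from typing import NamedTuple


class ResultName(NamedTuple):
    filetype: str    # first element, e.g. "var" or "map"
    steps: list      # elements in between, e.g. ["gatk", "rdsbwa"]
    sample: str      # last element before the extension
    extension: str   # e.g. "tsv" or "bam"


def parse_result_name(filename):
    """Split a result file name into type, steps, sample and extension."""
    base, extension = os.path.splitext(os.path.basename(filename))
    parts = base.split("-")
    if len(parts) < 2:
        raise ValueError("expected at least type-sample in " + filename)
    return ResultName(parts[0], parts[1:-1], parts[-1], extension.lstrip("."))


print(parse_result_name("var-gatk-rdsbwa-sample1.tsv"))
```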

If the experiment used e.g. exome capture, this can be indicated by the presence of a file named reg_targets.tsv (or matching reg_*_targets*.tsv) in the sampledir (or the option -targetfile). If present, coverage statistics will be calculated for this region.

Amplicon sequencing

Amplicon sequencing samples are indicated by the presence of a file named reg_amplicons.tsv (or matching reg_*_amplicons*.tsv) in the sampledir. If the option -amplicons is given to the command, a link to the given ampliconfile will be created in the sampledir and used. If an ampliconfile (or link) already exists in the sampledir, it will NOT be overwritten! (only a warning given).
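
The file name patterns used to recognise amplicon files (and, in the same way, target files) can be tested with a few lines of Python. This sketch assumes that the patterns behave like shell-style globs; is_region_file is a hypothetical helper, and it says nothing about which file is used when several match.

```python
from fnmatch import fnmatchcase


def is_region_file(filename, kind):
    """Check a file name against reg_<kind>.tsv or reg_*_<kind>*.tsv.

    kind is "amplicons" or "targets".
    """
    return (filename == f"reg_{kind}.tsv"
            or fnmatchcase(filename, f"reg_*_{kind}*.tsv"))


# Hypothetical file names, for illustration only.
for name in ["reg_amplicons.tsv", "reg_hg19_amplicons-panel1.tsv", "reg_targets.tsv"]:
    print(name, is_region_file(name, "amplicons"))
```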

An amplicon file is a tsv file indicating the genomic location of the amplicons. It must have the following fields:

chromosome: chromosome of amplicon
begin: start of sequenceable part of amplicon: i.e. at the end of the forward primer
end: end of sequenceable part of amplicon: i.e. before the reverse primer sequence in the genome reference
outer_begin: start of amplicon including primers, i.e. start of forward primer in the genome
outer_end: end of amplicon including primers, i.e. end of reverse primer sequence in the genome
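
An amplicon file can be checked against the field list above before it is used. The Python sketch below verifies that all five fields are present and that the sequenceable part lies within the outer coordinates. The function check_amplicon_file is hypothetical; it assumes a plain tab-separated file whose first line is the header (no comment lines) and integer coordinates.

```python
import csv
import io

REQUIRED_FIELDS = ["chromosome", "begin", "end", "outer_begin", "outer_end"]


def check_amplicon_file(handle):
    """Return a list of problems found in a tab-separated amplicon file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        return ["missing fields: " + ", ".join(missing)]
    problems = []
    # Data starts on line 2, after the header line.
    for line_number, row in enumerate(reader, start=2):
        begin, end = int(row["begin"]), int(row["end"])
        outer_begin, outer_end = int(row["outer_begin"]), int(row["outer_end"])
        if not outer_begin <= begin <= end <= outer_end:
            problems.append(
                f"line {line_number}: expected outer_begin <= begin <= end <= outer_end"
            )
    return problems


# Hypothetical amplicon, for illustration only.
example = "chromosome\tbegin\tend\touter_begin\touter_end\nchr1\t120\t280\t100\t300\n"
print(check_amplicon_file(io.StringIO(example)))
```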

Amplicon sequencing samples can also start from the fastq files and are processed similarly to shotgun sequencing, but analysis is different in a few ways:

Variants will only be called in the sequenceable part of the amplicons (i.e. between begin and end); off-target mappings are not called. To avoid wrong results caused by "sequencing" primers, the primer parts of amplicons will be clipped based on their mapping on the expected amplicons in the bam file. (This is done by replacing the sequence by Ns and reducing quality to 0 for these positions.)
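
The primer clipping described above (sequence replaced by Ns, quality reduced to 0) can be sketched on a single read as follows. This is a simplified illustration, not the genomecomb implementation: mask_primers is a hypothetical function, coordinates are assumed to be zero-based and half-open, and the read is assumed to align without indels or soft clips.

```python
def mask_primers(read_start, sequence, qualities, begin, end):
    """Mask bases outside [begin, end) with N and quality 0.

    Assumes an ungapped alignment, so base i of the read sits at
    reference position read_start + i.
    """
    masked_sequence = []
    masked_qualities = []
    for offset, (base, quality) in enumerate(zip(sequence, qualities)):
        position = read_start + offset
        if begin <= position < end:
            # Inside the sequenceable part of the amplicon: keep as is.
            masked_sequence.append(base)
            masked_qualities.append(quality)
        else:
            # Primer (or outside) position: hide the base from variant calling.
            masked_sequence.append("N")
            masked_qualities.append(0)
    return "".join(masked_sequence), masked_qualities


print(mask_primers(98, "ACGTACGT", [30] * 8, 100, 104))
```

Masking instead of removing the bases keeps the read length and alignment unchanged, while bases with quality 0 no longer contribute to variant calls.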

Several options use different defaults when amplicon sequencing is specified (-removeduplicates 0 -removeskew 0 -dt NONE).

Complete Genomics sequencing

Complete Genomics source data is already aligned and variant called. The region and variant information is converted to a format similar to that used for (Illumina) shotgun sequencing, with some differences:

Naming uses a cg-cg- prefix: var-cg-cg-sample1.tsv, sreg-cg-cg-sample1.tsv, reg_cluster-cg-cg-sample1.tsv, ...
Some files are not present (e.g. no varall)
Extra files, e.g. the directory coverage-cg-sample contains whole genome coverage, refscore, ... data in the form of bcol files.
CG data can include structural variant (cgsv-sample.tsv) and CNV (cgcnv-sample.tsv) calls

Precalculated data

The sampledir may contain precalculated data from other pipelines. If these are in the correct format, they will be integrated in the project. vcf files (var-*.vcf) will be converted to tsv files, and their variants included in the multicompar.