cg process_illumina ?options? projectdir ?dbdir?


process an illumina sequencing project directory. This starts from fastq read data and results in a genomecomb directory with annotated variant comparisons.


As input, the command expects a basic genomecomb project directory with illumina sequencing data (projectdir).

The projectdir will contain a subdirectory for each sample (sampledir) in the project. The sample directories may also be in a subdirectory called samples (which is more organized). Each sampledir contains a subdirectory named fastq that contains the fastq files for that sample. The names of matching fastq files of paired reads should be consecutive when sorted naturaly,the forward reads first. The usual naming of these files (same name, except for a 1 and 2) is ok. Only subdirectories containing these fastq dirs are considered sampledirs. The name of each sample is taken from the sampledir name. The sample name should not contain hyphens (-)

By default reads are clipped using fastq-mcf, aligned to the reference genome in dbdir using bwa mem, duplicates removed (using picard) and realigned (using gatk). Variants are called using gatk and samtools. All files generated have names following the convention of using hyphens to separate different elements about the file. The first element is the type of file. The last element (before the extension) is the sample name. There can be several steps in between. Each sampledir will contain results for this individual sample of the following type:

bam file created by aligning the reads of sample1 to the reference genome in dbdir using bwa. The bam file has been sorted (s), duplicate marked (d), and realigned (r).
a variant file that contains variants called by gatk based on map-rdsbwa-sample1.bam. Positions with a quality < 30 or coverage < 5 are considered unsequenced. Lower quality variants (but with quality >= 10) are still included in the variant list, but have the a "u" in the sequenced and zyg columns to indicate that they are considered unsequenced
A region file with all regions that can be considered sequenced (quality >= 30 and coverage >= 5) using the same methods and quality measures as var-gatk-rdsbwa-sample1.tsv. Any position in those regions that is not in the variant file can be called reference with the same reliability as the variant calls.
variant calling data by gatk for all positions with >= 5 coverage (also reference called positions). This file is used to create the sreg files, and to update data in making multicompar files later.
regions with many clustered variants (which are less reliable)

For samtools variant calling on the same bamfile (map-rdsbwa-sample1.bam), these result files are named var-sam-rdsbwa-sample1.tsv, sreg-sam-rdsbwa-sample1.tsv, varall-sam-rdsbwa-sample1.tsv, reg_cluster-sam-rdsbwa-S0489.tsv

The sampledir may contain precalculated data data from other pipelines. If these are in the correct format, they will be integrated in the project. vcf files (var-*.vcf) will be converted to tsv files, and their variants included in the multicompar.

In projectdir a subdirectory compar will be made. This will contain comparisons of all samples:

multicompar file containing information for all variants in all samples (and all methods). If a variant is not present in one of the samples, the information at the position of the variant will be completed (is the position sequenced or not, coverage, ...) The file is also annotated with all databases in dbdir (impact on genes, regions of interest, known variant data)
sequenced region multicompar file containing for all regions whether they are sequenced (1) or nor (0) for each sample.


project directory with illumina data for different samples, each sample in a sub directory. The proc will search for fastq files in dir/samplename/fastq/
directory containing reference data (genome sequence, annotation, ...). dbdir can also be given in a projectinfo.tsv file in the project directory. process_illumina called with the dbdir parameter will create the projectinfo.tsv file.


-realign value
If value is 0, realignment will not be performed, use 1 for (default) realignment with gatk, or srma for alignment with srma if 1, bam files are realigned using gatk, use value srma to align using srma.
-split 1/0
split multiple alternative genotypes over different line
-dbdir dbdir
dbdir can also be given as an option (instead of second parameter)
-paired 1/0
sequenced are paired/unpaired
-adapterfile file
Use file for possible adapter sequences
-dbfile file
Use file for extra (files in dbdir are already used) annotation
-conv_nextseq 1/0
generate fastqs for nextseq run & create sample folders - rundir should be placed in projectdir of resulting variants. This option can be added multiple times (with different files)
-bedfile bedfile
if bedfile is provided, coverage statistics will be calculated for this region using picard CalculateHsMetrics

supports job options (more info with cg help joboptions):

-d number
distribute subjobs of command over number processes
-d sge
use grid engine to distribute subjobs


Some of the programs needed in this workflow are not distributed with genomecomb. gatk and picard should be installed separately. Their installation location can be given using the environment variables GATK and PICARD. These should point to the installation directory that contains the jar files. If these environment variables are not set, a directory named gatk and picard will be searched in the PATH.


export GATK=/opt/bio/GenomeAnalysisTK-2.4-9-g532efad/
export PICARD=/opt/bio/picard-tools-1.87
cg process_illumina -d sge testproject /complgen/refseq/hg19