GenomeComb
Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages remain here only for the data on the older scientific application (or for anyone who really needs a long-obsolete version of the software).
cg process_sample ?options? ?oridir? sampledir
Processes one sample directory (projectdir), generating full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data that can come from various sources.
The command expects a basic genomecomb sample directory (as described extensively in projectdir). It generates all kinds of result data (variant calls, sequenced regions, ...), analyses and reports.
Several types of source data (fastq files, Complete Genomics analysis dir, ...) are supported. The directory containing the original starting data can be given as an option, an argument, or it can be present in a dir named ori in the sample dir. If given as an argument, a link named ori to the original data will be made. The results (which analyses, etc.) differ according to the type of original data, the parameters given (e.g. use -amplicons for amplicon sequencing) and files in the sampledir.
By default, process_sample will only create files that do not exist yet, or update ones that are older than files they depend on. This way an analysis that was interrupted can simply be restarted (by giving the same command), and it will proceed from where it left off.
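The update rule described above (create a file if it is missing, redo it if it is older than something it depends on) can be sketched as follows. This is only an illustration of the idea based on file modification times; the function name `needs_update` is hypothetical and this is not genomecomb's actual implementation.

```python
import os

def needs_update(target, dependencies):
    """Return True if target is missing or older than any of its dependencies.

    Hypothetical helper illustrating a make-like rule; not part of genomecomb.
    """
    if not os.path.exists(target):
        # A result file that does not exist yet must always be created.
        return True
    target_mtime = os.path.getmtime(target)
    # Redo the target if at least one dependency was modified after it.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Under such a rule, restarting an interrupted run skips every result that is already present and newer than its inputs, and only the remaining steps are executed.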
This command can be distributed on a cluster or run on multiple cores using job options (more info with cg help joboptions).
As different types of original data are processed differently, not all options are always applicable. Options that are not applicable to the given type of data are ignored.
Several types of sample data are supported:
In this case the starting raw data for the sample is fastq files. These should be in a subdirectory of the sampledir named fastq. They can also be in a directory ori in the sampledir, in which case the fastq dir will be made and links to the fastq files made in it.
The names of matching fastq files of paired reads should be consecutive when sorted naturally, with the forward reads first. The usual naming of these files (same name, except for a 1 and 2) is ok. The name of each sample is taken from the sampledir name. The sample name should not contain hyphens (-).
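To check in advance whether a set of fastq file names will pair up as described, a natural sort can be approximated as below. This is a sketch under the assumption that "natural" sorting means comparing runs of digits as numbers; the helpers `natural_key` and `pair_fastqs` and the file names in the usage note are made up for illustration, and genomecomb's own sort may differ in details.

```python
import re

def natural_key(name):
    """Sort key that compares digit runs numerically, so 'a10' sorts after 'a2'."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text (even index) and digits (odd index).
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]

def pair_fastqs(names):
    """Pair consecutive files after natural sorting: (forward, reverse)."""
    ordered = sorted(names, key=natural_key)
    if len(ordered) % 2:
        raise ValueError("odd number of fastq files; cannot pair them")
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]
```

For example, `pair_fastqs(["s_L002_R2.fastq.gz", "s_L001_R1.fastq.gz", "s_L002_R1.fastq.gz", "s_L001_R2.fastq.gz"])` pairs the R1 and R2 file of each lane, with the R1 file first.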
By default, reads are clipped using fastq-mcf and aligned to the reference genome in dbdir using bwa mem, after which duplicates are removed (using picard) and the alignments are realigned (using gatk). Variants are called using gatk and samtools. All generated files are named following a convention in which hyphens separate the different elements describing the file. The first element is the type of file. The last element (before the extension) is the sample name. There can be several steps in between. Each sampledir will contain results for this individual sample of the following type:
For samtools variant calling on the same bamfile (map-rdsbwa-sample1.bam), these result files are named var-sam-rdsbwa-sample1.tsv, sreg-sam-rdsbwa-sample1.tsv, varall-sam-rdsbwa-sample1.tsv, reg_cluster-sam-rdsbwa-S0489.tsv.
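The naming convention (file type first, sample name last, processing steps in between, all separated by hyphens) can be taken apart mechanically. The sketch below uses a hypothetical helper, `parse_result_name`, and assumes a single file extension; it also shows why a sample name containing a hyphen would be ambiguous.

```python
import os

def parse_result_name(filename):
    """Split a result file name into type, intermediate steps and sample name.

    Hypothetical helper; assumes a single extension such as .tsv or .bam.
    """
    stem, extension = os.path.splitext(os.path.basename(filename))
    elements = stem.split("-")
    if len(elements) < 2:
        raise ValueError(f"{filename!r} does not follow the type-...-sample convention")
    return {
        "type": elements[0],        # e.g. var, sreg, varall, map
        "steps": elements[1:-1],    # e.g. ["sam", "rdsbwa"]
        "sample": elements[-1],     # a hyphen in the sample name would be split here
        "extension": extension,
    }
```

For var-sam-rdsbwa-sample1.tsv this yields the type var, the steps sam and rdsbwa, and the sample sample1.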
If the experiment used e.g. exome capture, this can be indicated by the presence of a file named reg_targets.tsv (or matching reg_*_targets*.tsv) in the sampledir (or the option -targetfile). If present, coverage statistics will be calculated for this region.
Amplicon sequencing samples are indicated by the presence of a file named reg_amplicons.tsv (or matching reg_*_amplicons*.tsv) in the sampledir. If the option -amplicons is given to the command, a link to the given ampliconfile will be created in the sampledir and used. If an ampliconfile (or link) already exists in the sampledir, it will NOT be overwritten! (only a warning given).
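The file name patterns that mark a sample as targeted or amplicon sequencing can be checked with ordinary glob matching. The following sketch uses a hypothetical helper, `find_region_file`; returning the first match in sorted order is an assumption made here for determinism, not documented genomecomb behaviour.

```python
import fnmatch
import os

def find_region_file(sampledir, kind):
    """Return the path of a reg_<kind>.tsv or reg_*_<kind>*.tsv file, or None.

    kind is "targets" or "amplicons". Hypothetical helper for illustration.
    """
    patterns = [f"reg_{kind}.tsv", f"reg_*_{kind}*.tsv"]
    for name in sorted(os.listdir(sampledir)):
        # fnmatchcase keeps matching case-sensitive on every platform.
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            return os.path.join(sampledir, name)
    return None
```

Calling `find_region_file(sampledir, "amplicons")` before a run shows whether an amplicon file (or link) is already present, in which case the one given with -amplicons would not replace it.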
An amplicon file is a tsv file indicating the genomic location of the amplicons. It must have the following fields:
Amplicon sequencing samples can also start from the fastq files and are processed similarly to shotgun sequencing, but analysis is different in a few ways:
Variants will only be called in the sequenceable part of the amplicons (i.e. between begin and end); off-target mappings are not called. To avoid wrong results caused by "sequencing" primers, the primer parts of amplicons are clipped based on their mapping to the expected amplicons in the bam file. (This is done by replacing the sequence with Ns and reducing the quality to 0 at these positions.)
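The effect of this primer clipping on a single read can be illustrated with a simplified model. The sketch below assumes an ungapped alignment (no insertions, deletions or soft clips) and 0-based, half-open coordinates; both are assumptions made for this illustration, and `mask_primers` is a hypothetical function, not the code genomecomb runs on the bam file.

```python
def mask_primers(read_start, sequence, qualities, begin, end):
    """Replace bases outside [begin, end) with N and set their quality to 0.

    read_start is the reference position of the first base of the read;
    begin and end delimit the sequenceable part of the amplicon.
    """
    masked_sequence = []
    masked_qualities = []
    for offset, (base, quality) in enumerate(zip(sequence, qualities)):
        position = read_start + offset
        if begin <= position < end:
            # Inside the sequenceable part: keep the base as sequenced.
            masked_sequence.append(base)
            masked_qualities.append(quality)
        else:
            # Primer part: hide the base from variant callers.
            masked_sequence.append("N")
            masked_qualities.append(0)
    return "".join(masked_sequence), masked_qualities
```

With a read starting at position 100, the sequence ACGTACGT and an amplicon with begin 102 and end 106, the first two and last two bases are masked, so the primers cannot contribute evidence to a variant call.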
Several options use different defaults when amplicon sequencing is specified (-removeduplicates 0 -removeskew 0 -dt NONE).
Complete Genomics source data is already aligned and variant called. The region and variant information is converted to a format similar to that used for (Illumina) shotgun sequencing, with some differences:
The sampledir may contain precalculated data from other pipelines. If these are in the correct format, they will be integrated in the project. Any vcf files (var-*.vcf) will be converted to tsv files, and their variants included in the multicompar.
Process