GenomeComb



Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. Go there for up-to-date versions. These pages remain here only for the data on the older scientific application (or in case someone really needs a long obsolete version of the software).

genome project directory

Although most genomecomb commands can be run individually on files located anywhere, some commands expect or generate data organised in a specific project structure. A genomecomb project directory (projectdir for short) is a directory containing (links to) raw data and analysis results in this particular structure, which is described in this help.

The cg process_project command can, for example, be used to generate a projectdir with full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data from various sources.

overview

A projectdir basically contains individual sample directories (in the subdirectory samples) and overview data (sample comparisons) in the following structure:

samples
directory containing a separate sample directory for each sample
compar
directory containing files that combine and compare data from all samples (multicompar files)
projectinfo.tsv
a file with some metadata about the analysis/data in the projectdir

As long as compatible analyses (same reference genome, split or unsplit variants) were used, the same sampledir can be used in multiple projects (e.g. using a soft link).

The project name is taken from the filename of the projectdir. The project name is used (among other things) in naming most of the overview files. These filenames end with a hyphen-minus character followed by the project name. For this reason, the hyphen-minus character may not be used in the project name.
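The naming rule above can be illustrated with a minimal Python sketch. The helper names (project_name, init_projectdir, overview_filename) are hypothetical and not part of genomecomb; the sketch only assumes what the text states: the project name is the last component of the projectdir path, it may not contain a hyphen-minus, and overview files end in a hyphen-minus followed by the project name.

```python
from pathlib import Path


def project_name(projectdir):
    """Return the project name: the last path component of the projectdir."""
    name = Path(projectdir).name
    if "-" in name:
        # The hyphen-minus separates file name elements, so it cannot be in the name.
        raise ValueError(f"project name {name!r} must not contain a hyphen-minus (-)")
    return name


def init_projectdir(projectdir):
    """Create the samples and compar subdirectories (hypothetical helper)."""
    projectdir = Path(projectdir)
    name = project_name(projectdir)
    for sub in ("samples", "compar"):
        (projectdir / sub).mkdir(parents=True, exist_ok=True)
    return name


def overview_filename(prefix, name, ext="tsv"):
    """Build an overview file name such as annot_compar-<project name>.tsv."""
    return f"{prefix}-{name}.{ext}"
```

For a projectdir named myproject, overview_filename("annot_compar", "myproject") gives annot_compar-myproject.tsv, matching the pattern used in the compar directory.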

Most of the result files in a projectdir/sampledir are tab-separated value files (file extension tsv) of various types (described in format_tsv). For space reasons, files are often compressed. genomecomb tools can generally handle compressed files transparently.

sample directory

Individual sample data is in subdirectories of the samples directory in the projectdir. Each of these sampledirs contains the raw data and analysed data from one sample. The sample name is taken from the filename of the sampledir. As hyphen-minus characters are used in naming the analysis result files, which end with the sample name, this character (-) should not be present in the name.

sample source data

ori
A sampledir can contain a (link to a) directory containing the original sequencing data, named ori. The commands cg process_sample or cg process_project can be used to analyse the data and produce a fully filled sampledir/projectdir.
fastq
If the original data is in the form of fastq files, the fastq files for that sample are present in a subdirectory named fastq. (If fastq files are found in the ori directory, a fastq dir is made, and the files linked.) The names of matching fastq files of paired reads should be consecutive when sorted naturally, the forward reads first. The usual naming of these files (same name, except for a 1 and 2) is ok.
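To check whether a set of fastq file names satisfies this pairing rule, a sketch along the following lines may help. It is an illustration only: natural_key and pair_fastqs are hypothetical helpers, and the natural sort used here (digit runs compared as numbers, case ignored) is an assumption that may differ in detail from the ordering genomecomb applies internally.

```python
import re


def natural_key(name):
    """Sort key comparing digit runs numerically, so "r2" sorts before "r10"."""
    # re.split with a capturing group alternates text and digit parts,
    # so parts at the same index always have the same type.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def pair_fastqs(filenames):
    """Pair consecutive files after natural sorting: (forward, reverse)."""
    ordered = sorted(filenames, key=natural_key)
    if len(ordered) % 2:
        raise ValueError("odd number of fastq files; cannot pair them")
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]
```

With the usual naming (same name except for a 1 and 2), each forward file ends up directly before its reverse file, so consecutive pairs correspond to read pairs.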

sample results

All generated files have names that follow the convention of using a hyphen-minus to separate the different elements of the file name. The first element indicates what is in the file. The last element (before the extension) is the sample name. There can be several steps in between.
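This convention can be sketched in Python as a small parser that splits a result file name into its elements. The function parse_result_name is hypothetical, and the list of compression suffixes is an assumption for illustration; adjust it to the compression actually used in your project.

```python
from pathlib import PurePath

# Assumed compression suffixes (illustrative, not an authoritative list).
COMPRESSION_SUFFIXES = (".gz", ".zst", ".lz4", ".bz2")


def parse_result_name(filename):
    """Split e.g. var-gatk-rdsbwa-sample1.tsv into type, steps, sample, ext."""
    name = PurePath(filename).name
    # Drop one compression suffix, if present.
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Drop the file extension (the part after the last dot).
    stem, dot, ext = name.rpartition(".")
    if not dot:
        stem, ext = name, ""
    elements = stem.split("-")
    if len(elements) < 2:
        raise ValueError(f"{filename!r} does not follow the type-...-sample pattern")
    return {
        "type": elements[0],      # what is in the file (var, sreg, map, ...)
        "steps": elements[1:-1],  # analysis steps in between, possibly empty
        "sample": elements[-1],   # sample name
        "ext": ext,
    }
```

For var-gatk-rdsbwa-sample1.tsv this yields type var, steps gatk and rdsbwa, and sample sample1. The split is only unambiguous because sample names may not contain a hyphen-minus.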

Each sampledir can contain results for this individual sample of the following type (depending on source data):

map-rdsbwa-sample1.bam
bam file created by aligning the reads of sample1 to the reference genome in dbdir using bwa. The bam file has been sorted (s), duplicate marked (d), and realigned (r).
var-gatk-rdsbwa-sample1.tsv
a tsv variant file that contains variants called by gatk based on map-rdsbwa-sample1.bam.
sreg-gatk-rdsbwa-sample1.tsv
A region file with all regions that can be considered sequenced using the same methods and quality measures as var-gatk-rdsbwa-sample1.tsv. Any position in those regions that is not in the variant file can be called reference with the same reliability as the variant calls.
varall-gatk-rdsbwa-sample1.tsv
variant file containing variant calls by gatk for all positions with >= 5 coverage (also reference called positions). This file is used to create the sreg files, and to update data in making multicompar files later.
reg_cluster-gatk-rdsbwa-S0489.tsv
regions with many clustered variants (which are less reliable)
bcolall
directory containing whole genome coverage, refscore, ... data in the form of bcol files. These files can be used to create the sreg files, and to update data in making multicompar files later. (In older project dirs, this directory may be called coverage-cg-* and contain old style formatted bcol files)
cgsv-sample.tsv
structural variants
cgcnv-sample.tsv
CNV data

The result files from samtools variant calling on the same bam file (map-rdsbwa-sample1.bam) are named var-sam-rdsbwa-sample1.tsv, sreg-sam-rdsbwa-sample1.tsv, varall-sam-rdsbwa-sample1.tsv, reg_cluster-sam-rdsbwa-S0489.tsv.

For Complete Genomics alignment and variant calling, the files are named var-cg-cg-sample1.tsv, sreg-cg-cg-sample1.tsv, reg_cluster-cg-cg-S0489.tsv.
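The relation between the var and sreg files described above (a position inside a sequenced region that is not in the variant file can be called reference) can be illustrated with a simplified sketch. It works on in-memory data rather than on the actual tsv files, handles single positions only, and assumes regions are half-open (begin included, end excluded); classify_position is a hypothetical helper, not a genomecomb function.

```python
def classify_position(chrom, pos, variants, sreg):
    """Classify a position as "variant", "reference" or "unsequenced".

    variants: set of (chrom, pos) tuples taken from a var file
    sreg:     list of (chrom, begin, end) regions, assumed half-open
    """
    if (chrom, pos) in variants:
        return "variant"
    # Sequenced but not in the variant file: can be called reference.
    if any(c == chrom and begin <= pos < end for c, begin, end in sreg):
        return "reference"
    # Outside all sequenced regions: no reliable call is possible.
    return "unsequenced"
```

The same distinction is what allows a multicompar file to tell "reference" apart from "not sequenced" for samples that lack a given variant.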

The sampledir may contain precalculated data from other pipelines. If these are in the correct format, they will be integrated in the project. vcf files (var-*.vcf) will be converted to tsv files, and their variants included in the multicompar.

compar dir

The subdirectory compar contains comparisons of all samples, e.g.:

annot_compar-projectname.tsv
multicompar file containing information for all variants in all samples (and all methods). If a variant is not present in one of the samples, the information at the position of the variant will be completed (is the position sequenced or not, coverage, ...). The file is also annotated with all databases in dbdir (impact on genes, regions of interest, known variant data).
sreg-projectname.tsv
sequenced region multicompar file indicating, for all regions, whether they are sequenced (1) or not (0) in each sample.

projectinfo.tsv

projectinfo.tsv is a tsv file containing data about the project. It must have 2 columns: key and value. The following keys can be found:

dbdir
directory containing reference data (genome sequence, annotation, ...).
split
if 1, each alternative allele is on a separate line. If 0, multiple alternative alleles at the same location are on one line together with the allele-specific data, the relevant fields containing (comma-separated) lists.
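Reading such a file into a dictionary could look like the following sketch. It assumes that the first line is a header naming the two columns key and value, as in other tsv files; read_projectinfo is a hypothetical helper, not part of genomecomb.

```python
import csv


def read_projectinfo(path):
    """Read projectinfo.tsv into a {key: value} dictionary."""
    with open(path, newline="") as f:
        # QUOTE_NONE: treat quote characters as plain text in tab-separated data.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        fields = reader.fieldnames or []
        if "key" not in fields or "value" not in fields:
            raise ValueError("projectinfo.tsv must have the columns key and value")
        return {row["key"]: row["value"] for row in reader}
```

Values are returned as strings, so a check on the split setting would compare against "1" or "0" rather than a number.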