Genomecomb moved to github on https://github.com/derijkp/genomecomb
with documentation on https://derijkp.github.io/genomecomb.
For up-to-date versions, go there. These pages remain here only for the data on the older scientific
application (or in case someone really needs a long-obsolete version of the software).
genome project directory
Although most genomecomb commands can be individually run on files
located anywhere, some of the commands expect or generate data
organised in a specific project structure: A genomecomb project
directory or projectdir for short is a directory containing
(links to) raw data and analysis in this particular structure, which
is described in this help.
The cg process_project
command can e.g. be used to generate a projectdir with full analysis
information (variant calls, multicompar, reports, ...) starting from
raw sample data from various sources.
overview
A projectdir basically contains individual sample directories (in
the subdirectory samples) and overview data (sample comparisons) in
the following structure:
- samples
- directory containing a separate sample directory for each
sample
- compar
- directory containing files that combine and compare data from
all samples (multicompar files)
- projectinfo.tsv
- a file with some metadata about the analysis/data in the
projectdir
As long as compatible analyses (same reference genome,
split or unsplit variants) were used, the same sampledir can be
used in multiple projects (e.g. using a soft link).
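The layout above can be inspected with a few lines of Python. This is only a minimal sketch, not part of genomecomb: the function name describe_projectdir and the keys of the returned dictionary are made up for illustration, and only the names samples, compar and projectinfo.tsv come from the description above.

```python
import os

def describe_projectdir(projectdir):
    """Summarise a directory laid out as a genomecomb projectdir (illustrative helper)."""
    root = os.path.abspath(projectdir)
    samples_dir = os.path.join(root, "samples")
    samples = []
    if os.path.isdir(samples_dir):
        # Every subdirectory of samples/ is treated as one sampledir;
        # os.path.isdir follows soft links, so linked sampledirs are included.
        samples = sorted(
            name for name in os.listdir(samples_dir)
            if os.path.isdir(os.path.join(samples_dir, name))
        )
    return {
        "project": os.path.basename(root),  # project name = name of the projectdir
        "samples": samples,
        "has_compar": os.path.isdir(os.path.join(root, "compar")),
        "has_projectinfo": os.path.isfile(os.path.join(root, "projectinfo.tsv")),
    }
```

Calling describe_projectdir on a projectdir path returns the project name, the list of sample names, and whether the compar directory and projectinfo.tsv are present.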
The project name is taken from the filename of the
projectdir. The project name will be used (among other things) in naming most of
the overview files. These filenames will end with a hyphen-minus
character followed by the project name. For this reason, the
hyphen-minus character may not be used in the project name.
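The naming rule can be illustrated with a small Python sketch. The helper overview_filename is hypothetical (genomecomb does not provide it); it simply appends a hyphen-minus and the project name to a file prefix and rejects project names that contain a hyphen-minus.

```python
def overview_filename(prefix, projectname, ext="tsv"):
    """Build an overview file name of the form <prefix>-<projectname>.<ext> (illustrative)."""
    if "-" in projectname:
        # A hyphen-minus in the project name would make the file name ambiguous.
        raise ValueError(f"project name may not contain '-': {projectname!r}")
    return f"{prefix}-{projectname}.{ext}"

print(overview_filename("annot_compar", "myproject"))  # annot_compar-myproject.tsv
```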
Most of the result files in a projectdir/sampledir are
tab-separated value files (file extension tsv) of various types
(described in format_tsv). For space
reasons, files are often compressed. genomecomb tools can generally
handle compressed files transparently.
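Outside genomecomb, such files can be read with standard tools. The Python sketch below is an assumption-laden example, not genomecomb code: it handles only plain and gzip-compressed files (other compression methods would need additional libraries), and it assumes that lines starting with # before the header are comments that can be skipped. The function names open_text and read_tsv are made up for illustration.

```python
import csv
import gzip

def open_text(path):
    """Open a plain or gzip-compressed text file for reading (illustrative helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")

def read_tsv(path):
    """Return the rows of a tab-separated file as a list of dicts keyed by column name."""
    with open_text(path) as handle:
        # Assumption: lines starting with '#' are comments, not data.
        lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

All values are returned as strings; converting positions or counts to numbers is left to the caller.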
sample directory
Individual sample data is in subdirectories of the samples directory
in the projectdir. Each of these sampledirs contains the raw data and
analysed data from one sample. The sample name is taken from
the filename of the sampledir. As hyphen-minus characters are used in
naming the analysis results files ending with the sample name, this
character (-) should not be present in the name.
sample source data
- ori
- A sampledir can contain a (link to a) directory containing
the original sequencing data, named ori. The commands cg process_sample or cg process_project can be used
to analyse the data and produce a fully filled sampledir/projectdir.
- fastq
- If the original data is in the form of fastq files, the fastq
files for that sample are present in a subdirectory named
fastq. (If fastq files are found in the ori directory, a
fastq dir is made, and the files linked.) The names of matching
fastq files of paired reads should be consecutive when sorted
naturally, with the forward reads first. The usual naming of these files
(same name, except for a 1 and 2) is OK.
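To check whether a set of fastq file names would pair up as expected, the natural-sort rule can be approximated in Python. This is a sketch of one common definition of natural sorting (runs of digits compared as numbers), not genomecomb's own implementation; the function names and the example file names used with it are hypothetical.

```python
import re

def natural_key(name):
    """Sort key that compares runs of digits numerically (so L2 sorts before L10)."""
    # re.split with a capturing group alternates text and digit tokens,
    # so tokens at the same position always have the same type.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

def pair_fastqs(filenames):
    """Pair consecutive files after natural sorting: (forward, reverse), ..."""
    ordered = sorted(filenames, key=natural_key)
    if len(ordered) % 2 != 0:
        raise ValueError("odd number of fastq files, cannot form pairs")
    return list(zip(ordered[0::2], ordered[1::2]))
```

With file names that differ only by a 1 and a 2, each returned tuple holds the forward reads file first and the matching reverse reads file second.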
sample results
All files generated have names following the convention of using
hyphen-minus to separate different elements of the file. The first
element indicates what is in the file. The last element (before the
extension) is the sample name. There can be several steps in between.
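As an illustration of this convention, a name such as var-gatk-rdsbwa-sample1.tsv can be split into its elements with the Python sketch below. The helper parse_result_name is hypothetical, and the list of compression extensions it strips is an assumption made for the example, not a statement of what genomecomb supports.

```python
import os

# Assumption for illustration only: compression suffixes to strip before parsing.
COMPRESSION_EXTS = (".gz", ".zst", ".lz4")

def parse_result_name(filename):
    """Split a result file name into type, intermediate steps, sample name and extension."""
    name = os.path.basename(filename)
    for cext in COMPRESSION_EXTS:
        if name.endswith(cext):
            name = name[: -len(cext)]
            break
    stem, ext = os.path.splitext(name)
    parts = stem.split("-")
    if len(parts) < 2:
        raise ValueError(f"not a <type>-...-<sample> file name: {filename!r}")
    return {
        "type": parts[0],      # first element: what is in the file
        "steps": parts[1:-1],  # elements in between, e.g. caller and alignment
        "sample": parts[-1],   # last element before the extension
        "ext": ext.lstrip("."),
    }

print(parse_result_name("var-gatk-rdsbwa-sample1.tsv"))
# {'type': 'var', 'steps': ['gatk', 'rdsbwa'], 'sample': 'sample1', 'ext': 'tsv'}
```

Because the elements are split on the hyphen-minus, this only works when the sample name itself contains no hyphen-minus, which is the reason for the naming restriction.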
Each sampledir can contain results for this individual sample of
the following types (depending on source data):
- map-rdsbwa-sample1.bam
- bam file created by aligning the reads of sample1 to the
reference genome in dbdir using bwa. The bam file has been sorted
(s), duplicate marked (d), and realigned (r).
- var-gatk-rdsbwa-sample1.tsv
- a tsv variant file that
contains variants called by gatk based on map-rdsbwa-sample1.bam.
- sreg-gatk-rdsbwa-sample1.tsv
- A region file with all regions that can be considered
sequenced using the same methods and quality measures as
var-gatk-rdsbwa-sample1.tsv. Any position in those regions that is
not in the variant file can be called reference with the same
reliability as the variant calls.
- varall-gatk-rdsbwa-sample1.tsv
- variant file containing variant calls by gatk for all
positions with >= 5 coverage (also reference called positions).
This file is used to create the sreg files, and to update data in
making multicompar files later.
- reg_cluster-gatk-rdsbwa-S0489.tsv
- regions with many clustered variants (which are less
reliable)
- bcolall
- directory containing whole genome coverage, refscore, ...
data in the form of bcol files. These files
can be used to create the sreg files, and to update data in making
multicompar files later. (In older project dirs, this directory may
be called coverage-cg-* and contain old style formatted bcol files)
- cgsv-sample.tsv
- structural variants
- cgcnv-sample.tsv
- CNV data
The result files from samtools variant calling on the same
bamfile (map-rdsbwa-sample1.bam) are named
var-sam-rdsbwa-sample1.tsv, sreg-sam-rdsbwa-sample1.tsv,
varall-sam-rdsbwa-sample1.tsv, reg_cluster-sam-rdsbwa-S0489.tsv.
For Complete Genomics alignment and variant calling the files are
named var-cg-cg-sample1.tsv, sreg-cg-cg-sample1.tsv,
reg_cluster-cg-cg-S0489.tsv.
The sampledir may contain precalculated data from other
pipelines. If these are in the correct format, they will be
integrated in the project. vcf files (var-*.vcf) will be converted to
tsv files, and their variants included in the multicompar.
compar dir
The subdirectory compar contains comparisons of all samples, e.g.:
- annot_compar-projectname.tsv
- multicompar file containing information for all variants in
all samples (and all methods). If a variant is not present in one
of the samples, the information at the position of the variant will
be completed (is the position sequenced or not, coverage, ...). The
file is also annotated with all databases in dbdir (impact on
genes, regions of interest, known variant data)
- sreg-projectname.tsv
- sequenced region multicompar file containing for all regions
whether they are sequenced (1) or not (0) for each sample.
projectinfo.tsv
projectinfo.tsv is a tsv file containing
data about the project. It must have 2 columns: key and value. The
following keys can be found:
- dbdir
- directory containing reference data (genome sequence,
annotation, ...).
- split
- if 1, each alternative allele is on a separate line. If 0,
multiple alternative alleles at the same location and their
allele-specific data are on one line, with the relevant fields
containing (comma-separated) lists.
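Since projectinfo.tsv is a two-column key/value table, it can be loaded into a dictionary. The Python sketch below assumes a plain (uncompressed) file whose first line is the header with the columns key and value, as described above; the function name read_projectinfo is made up for illustration and is not part of genomecomb.

```python
import csv

def read_projectinfo(path):
    """Load a projectinfo.tsv (columns: key, value) into a dict of strings (illustrative)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["key"]: row["value"] for row in reader}
```

For example, info = read_projectinfo("projectinfo.tsv") followed by info.get("split") == "1" would tell whether the project uses split variants; values are returned as strings.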