Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb,
with documentation at https://derijkp.github.io/genomecomb.
For up-to-date versions, go there. These pages remain here only for the data on the older scientific
application (or in case someone really needs a long-obsolete version of the software).
tsv format
The standard file format used in GenomeComb is the widely
supported, simple, yet flexible tab-separated
values file (tsv). A tsv file is a simple text file containing
tabular data, where each line represents a record or row in the
table. Each field value of a record is separated from the next by a
tab character. Unlike in csv, values cannot be quoted, so they cannot
contain tabs or newlines unless these are encoded, e.g. using escape
sequences (\t, \n).
The first line of the tsv file that does not start with a # is a
header that contains the column names (or fields). Genomecomb
allows comment lines (lines starting with a # character) containing
metadata to precede the header.
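The layout described above (optional # comment lines, then a header, then tab-separated records) can be read with a few lines of Python. This is only a minimal sketch, not how genomecomb itself parses files; the function name read_tsv and the example content are made up for illustration.

```python
import io

def read_tsv(handle):
    """Return (comments, header, rows) from a tsv text stream.

    comments: the leading lines that start with '#', without the '#'
    header:   list of column names
    rows:     list of dicts mapping column name to the (string) value
    """
    comments = []
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if header is None:
            if line.startswith("#"):
                # comment lines may only precede the header
                comments.append(line[1:])
            else:
                header = line.split("\t")
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return comments, header, rows

# Hypothetical file content, only for illustration
example = io.StringIO(
    "#example metadata line\n"
    "chromosome\tbegin\tend\tname\n"
    "chr1\t0\t1\tfirst base\n"
    "chr2\t100\t200\texample\n"
)
comments, header, rows = read_tsv(example)
print(header)
print(rows[0]["name"])
```

All values are kept as strings here; converting begin and end to integers is left to the caller.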
Depending on the columns present, tsv files can be used for
various purposes. Usually the files are used to describe features on
a reference genome sequence. In this context, Genomecomb recognizes
columns with specific field names to have a special meaning. The
order/position of the columns does not matter, although genomecomb
will usually write tsv files with columns in a specific order. All
tsv files can be easily queried using the cg select functionality or loaded into a
local database.
region file format
Region files are used to indicate regions on the genome, and
optionally associate a name, score, annotation, etc. with them. Region
files contain at least the following fields:
- chromosome
- a string indicating the chromosome name.
- begin
- a number indicating the start of the region (half-open
coordinates).
- end
- a number indicating the end of the region (half-open
coordinates).
Any extra columns can be added to provide information on the region.
Coordinates are in zero-based half-open format as used by UCSC
bed files and Complete Genomics files. This means for instance that
the first base of a sequence will be indicated by begin=0 and end=1.
It is possible to indicate regions not containing a base, e.g. the
position/region before the first base would be indicated by begin=0,
end=0. chromosome is just a string. Many genomecomb tools
allow mixing the "chr1" and "1" types of notation for
chromosome, meaning that "chr1 1 2" is considered the same
region as "1 1 2".
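To make the zero-based half-open convention concrete, the following Python sketch computes region sizes, converts to one-based inclusive coordinates, and compares regions while ignoring a "chr" prefix. The function names are hypothetical, and the prefix handling is only an assumption about what "mixing" the notations amounts to.

```python
def region_length(begin, end):
    """Number of bases in a zero-based half-open region."""
    return end - begin

def to_one_based_inclusive(begin, end):
    """Convert zero-based half-open coordinates to one-based inclusive ones.

    Empty regions (begin == end) have no one-based inclusive equivalent.
    """
    if end <= begin:
        raise ValueError("empty region cannot be expressed as one-based inclusive")
    return begin + 1, end

def normalize_chromosome(name):
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal (assumption)."""
    if name.startswith("chr") and len(name) > 3:
        return name[3:]
    return name

def same_region(a, b):
    """Compare two (chromosome, begin, end) tuples, ignoring the 'chr' prefix."""
    return (normalize_chromosome(a[0]), a[1], a[2]) == (
        normalize_chromosome(b[0]), b[1], b[2])

print(region_length(0, 1))           # the first base: length 1
print(region_length(0, 0))           # position before the first base: length 0
print(to_one_based_inclusive(0, 1))  # first base in one-based notation
print(same_region(("chr1", 1, 2), ("1", 1, 2)))
```

A convenient property of half-open coordinates is that the size of a region is always end minus begin, and adjacent regions share a boundary value without overlapping.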
Most genomecomb tools expect region files to be sorted on
chromosome, begin, end (and files created by genomecomb are usually
sorted). You can sort files using the -s option of cg select. Use -s - to sort using the
default expected sort order:
cg select -s - file sortedfile
For chromosome names, a natural sort is (to be) used. This sorts strings
alphabetically, except that a multi-digit number is ordered as a
single character, meaning that chr11 will come (as expected by most
humans) after chr2 instead of before it (as the typically used lexical
sort would do).
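The difference between a natural sort and a lexical sort can be sketched in Python as follows. This is an illustration of the idea, not genomecomb's own implementation; natural_key is a hypothetical helper, and placing digit runs before text when the two meet at the same position is an arbitrary choice made here.

```python
import re

def natural_key(text):
    """Sort key that treats each run of digits as a single number."""
    key = []
    # re.split with a capture group alternates text, digits, text, ...
    for index, part in enumerate(re.split(r"([0-9]+)", text)):
        if part == "":
            continue
        if index % 2:
            # odd positions hold the captured digit runs
            key.append((0, int(part)))
        else:
            key.append((1, part))
    return key

names = ["chr11", "chr2", "chr1", "chrX", "chr10"]
lexical = sorted(names)
natural = sorted(names, key=natural_key)
print(lexical)
print(natural)

# Sorting regions on chromosome (natural), begin, end
regions = [("chr10", 5, 10), ("chr2", 50, 60), ("chr2", 5, 10)]
sorted_regions = sorted(regions, key=lambda r: (natural_key(r[0]), r[1], r[2]))
print(sorted_regions)
```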
If the normal chromosome, begin, end fields are not present, the
following alternative fieldnames are also recognized:
- chromosome
- "chrom", "chr", "chr1",
"genoName", "tName" and "contig"
- begin
- "start", "end1", "chromStart",
"genoStart", "tStart", "txStart",
"pos" and "offset" (end1 is recognised as begin
because of the structural variant code in genomecomb, where
start1,end1 and start2,end2 regions surround a SV).
- end
- "start2", "chromEnd",
"genoEnd", "tEnd" or "txEnd"
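Looking up the region columns with these fallbacks could be sketched as below. The name lists are taken from the text above; the order in which alternatives are tried (standard name first, then the listed order) and the helper names are assumptions, not documented genomecomb behaviour.

```python
# Standard name first, then the alternatives in the order listed above
CHROMOSOME_NAMES = ["chromosome", "chrom", "chr", "chr1",
                    "genoName", "tName", "contig"]
BEGIN_NAMES = ["begin", "start", "end1", "chromStart", "genoStart",
               "tStart", "txStart", "pos", "offset"]
END_NAMES = ["end", "start2", "chromEnd", "genoEnd", "tEnd", "txEnd"]

def find_field(header, candidates):
    """Return the index of the first candidate present in header, or None."""
    for name in candidates:
        if name in header:
            return header.index(name)
    return None

def locate_region_fields(header):
    """Return (chromosome, begin, end) column indices for a header."""
    positions = (
        find_field(header, CHROMOSOME_NAMES),
        find_field(header, BEGIN_NAMES),
        find_field(header, END_NAMES),
    )
    if None in positions:
        raise KeyError("header lacks a chromosome, begin or end column")
    return positions

bed_like = ["chrom", "chromStart", "chromEnd", "name"]
print(locate_region_fields(bed_like))

# Structural variant style header: the region between end1 and start2
sv_like = ["chr1", "start1", "end1", "start2", "end2"]
print(locate_region_fields(sv_like))
```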
variant file format
A variant file is a tsv file containing a list of variants. The
chromosome, begin and end fields indicate the location of the variant
in the same way as in region files, while other fields define the
variant at that location. The following basic fields are present in a
variant file:
- chromosome
- chromosome name.
- begin
- start of the region (half-open coordinates).
- end
- end of the region (half-open coordinates)
- type
- type of variation: snp, ins, del, sub are recognised
- ref
- genotype of the reference sequence at the feature. For large
deletions, the size of the deletion can be used. Insertions will
have an empty string as ref. (also reference)
- alt
- alternative allele(s). If there is more than one
alternative, they are separated by commas. Deletions have an empty
string as alt allele. (also alternative)
Variant files should be sorted on the fields chromosome,
begin, end, type, alt (in that order).
Further fields can be present to describe the variant in the
sample or to provide annotation information; many depend on the variant
caller used (e.g. gatk_vars and sam_vars). The following fields have
specific meanings in genomecomb:
- sequenced
- single letter code describing sequencing status (described
below)
- zyg
- zygosity, a single letter code indicating the zygosity of
the sample for the variant (described below)
- alleleSeq1
- genotype of variant at one allele
- alleleSeq2
- genotype of variant at other allele
- phased
- order of genotypes in alleleSeq1 and alleleSeq2 is
significant (phase is known)
- quality
- quality of the variant (or reference) call, assigned by the
variant caller. Normally phred scaled: -10*log10(prob(call in ALT
is wrong))
- coverage
- number of reads used to call the variant (covering the
variant). This is as reported by the variant caller; the interpretation
of coverage can differ between callers.
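As an illustration of the phred scaling mentioned for the quality field, the conversion between a quality value and the probability that the call is wrong can be written as follows. The function names are hypothetical; whether a given caller actually reports phred-scaled values should be checked in that caller's documentation.

```python
import math

def phred_to_error_probability(quality):
    """Probability that the call is wrong for a phred-scaled quality."""
    return 10 ** (-quality / 10)

def error_probability_to_phred(probability):
    """Phred-scaled quality: -10 * log10(probability that the call is wrong)."""
    return -10 * math.log10(probability)

# A quality of 30 corresponds to roughly a 1 in 1000 chance of a wrong call
print(phred_to_error_probability(30))
print(error_probability_to_phred(0.001))
```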
genomecomb has 2 options to deal with the presence of multiple
alternative alleles on the same position:
- multiallelic
- one line per position and type. All alternative alleles are
in one line in the variant file. The alt field contains a list
(separated by commas) of alternative alleles. If any of the other
fields contains values specific to an allele (e.g. frequency of the
allele in a population), this field will contain a comma-separated
list with values in the same order as the alt allele list.
- split
- each alternative allele is on a separate line, e.g. an A to G,C
variant (multiallelic notation) is split into an A to G and an A to
C variant. While cg select can select
based on lists in fields (as in multiallelic mode), split mode
makes querying and selection of variants much easier.
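The relation between the two notations can be sketched in Python. This is a simplified illustration and not the conversion genomecomb performs: split_multiallelic is a hypothetical helper, the caller has to say which fields are allele-specific, and the frequency field is an invented example.

```python
def split_multiallelic(record, allele_fields=()):
    """Turn one multiallelic record (a dict) into one record per alt allele.

    allele_fields names the fields that hold one comma-separated value per
    alt allele; all other fields are copied unchanged.
    """
    alts = record["alt"].split(",")
    result = []
    for index, alt in enumerate(alts):
        single = dict(record)  # copy, so the input record is not modified
        single["alt"] = alt
        for field in allele_fields:
            values = record[field].split(",")
            if len(values) == len(alts):
                single[field] = values[index]
        result.append(single)
    # variant files are sorted on chromosome, begin, end, type, alt;
    # only alt differs between the new records
    result.sort(key=lambda r: r["alt"])
    return result

# The A to G,C example from the text, with a hypothetical frequency field
multi = {"chromosome": "chr1", "begin": "100", "end": "101", "type": "snp",
         "ref": "A", "alt": "G,C", "frequency": "0.10,0.02"}
for variant in split_multiallelic(multi, allele_fields=["frequency"]):
    print(variant["ref"], "to", variant["alt"], variant["frequency"])
```

Sample-specific fields such as sequenced and zyg cannot simply be copied when splitting, because their meaning depends on the given alternative allele.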
sequenced field
The sequenced field indicates the sequencing status of the variant in
the sample. The following codes can be found:
- u
- the position is considered unsequenced in the sample (e.g.
because coverage or quality was too low).
- v
- the variant was found in the sample.
- r
- the position was sequenced, but the given variant is not
present.
In multiallelic mode, r always means that the genotype is
reference. In split mode, however, "v" will only be
assigned if the specific alternative is present in the genotype, so
"r" will be used even if there are non-reference alleles,
as long as they are not the given alternative!
When calling variants using GATK or samtools, genomecomb picks a
relatively low quality threshold (coverage < 5 or quality < 30)
for considering variants unsequenced (sensitivity over specificity).
You can always apply more stringent quality filtering on the result
using cg select.
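The split mode rules for the sequenced field, combined with the thresholds mentioned above, can be summarised in a small Python sketch. It only mirrors the description in this section; the function name and argument layout are hypothetical, and the actual genomecomb logic may take more information into account.

```python
def sequenced_code(allele1, allele2, alt, coverage, quality,
                   min_coverage=5, min_quality=30):
    """Sequenced code (u, v or r) for one alt allele in split mode."""
    if coverage < min_coverage or quality < min_quality:
        return "u"  # considered unsequenced
    if alt in (allele1, allele2):
        return "v"  # the given alternative is present in the genotype
    return "r"      # sequenced, but the given alternative is absent

print(sequenced_code("A", "G", "G", coverage=20, quality=50))
# a different non-reference allele (C) still gives r for the G variant
print(sequenced_code("A", "C", "G", coverage=20, quality=50))
print(sequenced_code("A", "G", "G", coverage=3, quality=50))
```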
zyg
Possible zyg codes are:
- m
- homozygous; the sample has two times the given alternative
allele
- t
- heterozygous; the sample has the given alternative allele and
one reference allele
- c
- compound; the sample has two different non-reference alleles.
In split mode, c is only used if one of those is the given variant
alt allele.
- o
- other; This is only used in split mode when the sample
contains non-reference alleles other than the variant alt allele.
- r
- reference
- u
- unspecified/unsequenced
It is possible to have an assigned
zyg other than u (e.g. t) even when the sequenced field is u,
meaning that the variant caller could make a zygosity
estimate/prediction, but the variant call is not of sufficient quality
to be considered sequenced.
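The zyg codes in split mode can be derived from the two alleles of the genotype, as in the following Python sketch. It follows the descriptions above and nothing more; the function name is hypothetical, and using "?" to mark an unknown allele is an assumption made for this example.

```python
def zygosity(allele1, allele2, ref, alt, unknown="?"):
    """Zyg code for one alt allele in split mode."""
    alleles = (allele1, allele2)
    if unknown in alleles:
        return "u"  # unspecified
    alt_count = alleles.count(alt)
    ref_count = alleles.count(ref)
    if alt_count == 2:
        return "m"  # homozygous for the given alt allele
    if alt_count == 1 and ref_count == 1:
        return "t"  # heterozygous: given alt allele plus reference
    if alt_count == 1:
        return "c"  # compound: given alt allele plus another non-reference allele
    if ref_count == 2:
        return "r"  # reference
    return "o"      # only non-reference alleles other than the given alt

for genotype in [("G", "G"), ("A", "G"), ("C", "G"), ("A", "C"), ("A", "A")]:
    print(genotype, zygosity(genotype[0], genotype[1], ref="A", alt="G"))
```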
multicompar file
In a multicompar file, data for different samples is present in
one file, so they can be compared. Fields that are specific to a
sample have the samplename added to the fieldname separated by a
dash, e.g. the zygosity of a variant in sample1 can be found in the
column named zyg-sample1. A small example multicompar variant file
with two samples would contain the following fields:
- chromosome
- chromosome name.
- begin
- start of the region (half-open coordinates).
- end
- end of the region (half-open coordinates)
- type
- type of variation: snp, ins, del, sub are recognised
- ref
- reference sequence
- alt
- alternative allele(s).
- sequenced-sample1
- sequencing status of sample1
- zyg-sample1
- zygosity of sample1
- quality-sample1
- variant quality in sample1
- alleleSeq1-sample1
- genotype of variant at one allele in sample1
- alleleSeq2-sample1
- genotype of variant at other allele in sample1
- sequenced-sample2
- sequencing status of sample2
- zyg-sample2
- zygosity of sample2
- quality-sample2
- variant quality in sample2
- alleleSeq1-sample2
- genotype of variant at one allele in sample2
- alleleSeq2-sample2
- genotype of variant at other allele in sample2
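Given a multicompar header like the one listed above, the shared fields and the per-sample fields can be separated in Python. This is a sketch under the assumption that field names never contain a dash, so that the first dash always separates field name and sample name; the function name is hypothetical.

```python
def split_multicompar_header(header):
    """Return (common_fields, fields_per_sample) for a multicompar header."""
    common = []
    per_sample = {}
    for column in header:
        # split on the first dash only (assumes field names contain no dash)
        field, dash, sample = column.partition("-")
        if dash:
            per_sample.setdefault(sample, []).append(field)
        else:
            common.append(column)
    return common, per_sample

header = ["chromosome", "begin", "end", "type", "ref", "alt",
          "sequenced-sample1", "zyg-sample1", "quality-sample1",
          "alleleSeq1-sample1", "alleleSeq2-sample1",
          "sequenced-sample2", "zyg-sample2", "quality-sample2",
          "alleleSeq1-sample2", "alleleSeq2-sample2"]
common, per_sample = split_multicompar_header(header)
print(common)
print(sorted(per_sample))
print(per_sample["sample1"])
```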
tab based bioinformatics formats
Some formats used in bioinformatics contain data in a tab
separated format where the header does not conform to the tsv specs.
Most Genomecomb commands will detect and support some of these
alternative comment/header styles:
- sam
- starts with "@HD VN", header lines start with @,
uses fixed columns
- vcf
- starts with "fileformat=VCF", the last
"comment" line contains the header. In order to extract
the data merged in some of the vcf fields into a genomecomb
supported tsv, use cg vcf2tsv
- Complete genomics
- header line is preceded by an empty line and starts with a
> character
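A rough way to tell these header styles apart, based only on the characteristics listed above, is sketched below in Python. It is not how genomecomb detects formats; the function name, the returned labels and the example lines are made up for illustration, and the vcf test tolerates leading # characters in front of "fileformat=VCF".

```python
def detect_format(lines):
    """Guess the header style of a tab-based file from its first lines."""
    lines = list(lines)
    if not lines:
        return "tsv"
    first = lines[0]
    if first.startswith("@HD"):
        return "sam"
    if first.lstrip("#").startswith("fileformat=VCF"):
        return "vcf"
    # Complete Genomics: the header line follows an empty line and starts with >
    for previous, current in zip(lines, lines[1:]):
        if previous.strip() == "" and current.startswith(">"):
            return "complete_genomics"
    return "tsv"

# Illustrative first lines, not taken from real files
print(detect_format(["@HD\tVN:1.6\tSO:coordinate"]))
print(detect_format(["##fileformat=VCFv4.2", "#CHROM\tPOS\tID"]))
print(detect_format(["#EXAMPLE_KEY\texample value", "", ">locus\tchromosome"]))
print(detect_format(["chromosome\tbegin\tend"]))
```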