GenomeComb



Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages only remain here for the data on the older scientific application (or in case someone really needs a long obsolete version of the software).

tsv format

The standard file format used in GenomeComb is the widely supported, simple, yet flexible tab-separated values file (tsv). A tsv file is a simple text file containing tabular data, where each line represents a record or row in the table. Each field value of a record is separated from the next by a tab character. As values cannot be quoted (as they can in csv), they cannot contain tabs or newlines unless these are encoded, e.g. by using escape characters (\t, \n).

The first line of the tsv file that does not start with a # is a header that contains the column names (or fields). Genomecomb allows comment lines (indicated by starting the line with a # character) containing metadata to precede the header.
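The layout described above (optional # comment lines, then a header, then tab-separated records) can be read with a few lines of Python. This is only a minimal sketch: the function name read_tsv and the sample data are made up for illustration, and it does not decode escape characters or handle the other header styles genomecomb supports.

```python
import io


def read_tsv(handle):
    """Return (comments, header, rows) read from a tsv text stream.

    Lines starting with '#' before the header are collected as comments;
    the first other line is taken as the header.
    """
    comments = []
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if header is None:
            if line.startswith("#"):
                comments.append(line)
            else:
                header = line.split("\t")
            continue
        # Every line after the header is a record; values stay strings.
        rows.append(dict(zip(header, line.split("\t"))))
    return comments, header, rows


# Hypothetical example data, not taken from a real genomecomb file.
example = io.StringIO(
    "#some metadata\n"
    "chromosome\tbegin\tend\tname\n"
    "chr1\t0\t1\tfirst_base\n"
)
comments, header, rows = read_tsv(example)
print(header)
print(rows[0]["name"])
```

Because columns are looked up by name, the order of the columns in the file does not matter to code written this way.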

Depending on the columns present, tsv files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. In this context, Genomecomb recognizes columns with specific field names to have a special meaning. The order/position of the columns does not matter, although genomecomb will usually write tsv files with columns in a specific order. All tsv files can be easily queried using the cg select functionality or loaded into a local database.

region file format

Region files are used to indicate regions on the genome, and potentially associate a name, score, annotation, ... with them. Region files contain at least the following fields:

chromosome
a string indicating the chromosome name.
begin
a number indicating the start of the region (half-open coordinates).
end
a number indicating the end of the region (half-open coordinates).

Any extra columns can be added to provide information on the region.

Coordinates are in zero-based half-open format, as used by UCSC bed files and Complete Genomics files. This means, for instance, that the first base of a sequence is indicated by begin=0 and end=1. It is possible to indicate regions not containing a base, e.g. the position/region before the first base is indicated by begin=0, end=0. The chromosome field is just a string. Many genomecomb tools allow mixing the "chr1" and "1" types of notation for chromosomes, meaning that "chr1 1 2" is considered the same region as "1 1 2".
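The arithmetic of zero-based half-open coordinates can be illustrated with a short Python sketch. The helper names are hypothetical, and stripping a leading "chr" is just one assumed way to compare the two chromosome notations; genomecomb's own handling may differ in detail.

```python
def region_size(begin, end):
    """Number of bases in a zero-based half-open region."""
    return end - begin


def to_one_based_inclusive(begin, end):
    """Convert a non-empty half-open region to 1-based inclusive coordinates."""
    if end <= begin:
        raise ValueError("an empty region has no 1-based inclusive equivalent")
    return begin + 1, end


def normalize_chromosome(name):
    """Assumed normalisation: drop a leading 'chr' so 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name


print(region_size(0, 1))             # first base of a sequence: size 1
print(region_size(0, 0))             # position before the first base: size 0
print(to_one_based_inclusive(0, 1))  # the first base is base 1 in 1-based terms
print(normalize_chromosome("chr1") == normalize_chromosome("1"))
```

A useful property of the half-open convention is that the size of a region is always end - begin, with no off-by-one correction.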

Most genomecomb tools expect region files to be sorted on chromosome, begin, end (and files created by genomecomb are usually sorted). You can sort files using the -s option of cg select. Use -s - to sort using the default expected sort order:

cg select -s - file sortedfile

For chromosome names, a natural sort is (to be) used. This sorts strings alphabetically, except that a multi-digit number is treated as a single unit and ordered by its numeric value, meaning that chr11 will come after chr2 (as most humans expect) instead of before it (as the typically used lexical sort would do).
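The difference between a lexical sort and a natural sort can be sketched in Python as follows. The key function natural_key is a hypothetical illustration of the general idea, not genomecomb's implementation; in particular, placing numbers before letters at the same position is an assumption made here.

```python
import re


def natural_key(text):
    """Sort key that compares runs of digits by numeric value."""
    key = []
    for part in re.split(r"(\d+)", text):
        if part == "":
            continue
        if part.isdigit():
            key.append((0, int(part), ""))  # numeric run, compared as a number
        else:
            key.append((1, 0, part))        # text run, compared alphabetically
    return key


names = ["chr11", "chr2", "chrX", "chr1"]
print(sorted(names))                   # lexical: chr11 sorts before chr2
print(sorted(names, key=natural_key))  # natural: chr2 sorts before chr11

# Sorting (chromosome, begin, end) regions with the same key:
regions = [("chr11", 5, 10), ("chr2", 7, 9), ("chr2", 0, 3)]
print(sorted(regions, key=lambda r: (natural_key(r[0]), r[1], r[2])))
```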

If the normal chromosome, begin, end fields are not present, the following alternative fieldnames are also recognized:

chromosome
"chrom", "chr", "chr1", "genoName", "tName" and "contig"
begin
"start", "end1", "chromStart", "genoStart", "tStart", "txStart", "pos" and "offset" (end1 is recognised as begin because of the structural variant code in genomecomb, where the start1,end1 and start2,end2 regions surround an SV).
end
"start2", "chromEnd", "genoEnd", "tEnd" or "txEnd"

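Looking up the region columns by name, including the alternative field names listed above, could be sketched in Python like this. The function find_region_columns is hypothetical, and trying the alternatives in the order in which they are listed is an assumption; the source does not state which name wins when several are present.

```python
# Standard name first, then the alternatives in the order listed above.
ALTERNATIVES = {
    "chromosome": ["chromosome", "chrom", "chr", "chr1", "genoName", "tName", "contig"],
    "begin": ["begin", "start", "end1", "chromStart", "genoStart", "tStart",
              "txStart", "pos", "offset"],
    "end": ["end", "start2", "chromEnd", "genoEnd", "tEnd", "txEnd"],
}


def find_region_columns(header):
    """Map chromosome/begin/end to column indexes in the given header."""
    columns = {}
    for field, names in ALTERNATIVES.items():
        for name in names:
            if name in header:
                columns[field] = header.index(name)
                break
        else:
            raise KeyError("no column found for " + field)
    return columns


# A bed-like header, with the columns deliberately not in the usual order.
print(find_region_columns(["name", "chromStart", "chromEnd", "chrom"]))
```
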
variant file format

A variant file is a tsv file containing a list of variants. The chromosome, begin and end fields indicate the location of the variant in the same way as in region files, while other fields define the variant at that location. The following basic fields are present in a variant file:

chromosome
chromosome name.
begin
start of the region (half-open coordinates).
end
end of the region (half-open coordinates)
type
type of variation: snp, ins, del, sub are recognised
ref
genotype of the reference sequence at the feature. For large deletions, the size of the deletion can be used instead. Insertions have an empty string as ref. (The field name reference is also recognised.)
alt
alternative allele(s). If there is more than one alternative, they are separated by commas. Deletions have an empty string as alt allele. (The field name alternative is also recognised.)

Variant files should be sorted on the fields chromosome, begin, end, type, alt (in that order).
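The conventions for the basic fields can be made concrete with a small consistency check in Python. This is a sketch under assumptions: the function check_variant is hypothetical, and the rule that an insertion has begin equal to end (a region containing no base) is inferred from the coordinate system rather than stated explicitly for variants.

```python
def check_variant(begin, end, vtype, ref, alt):
    """Return a list of problems; an empty list means the fields are consistent."""
    problems = []
    size = end - begin
    if vtype == "snp":
        if size != 1:
            problems.append("a snp should span exactly one base")
    elif vtype == "ins":
        if ref != "":
            problems.append("an insertion should have an empty ref")
        if size != 0:
            # Assumption: an insertion sits between two bases, so begin == end.
            problems.append("an insertion should have begin == end")
    elif vtype == "del":
        if alt != "":
            problems.append("a deletion should have an empty alt")
        # ref is either the deleted sequence or, for large deletions, its size.
        ref_size = int(ref) if ref.isdigit() else len(ref)
        if ref_size != size:
            problems.append("ref does not match the size of the region")
    return problems


print(check_variant(10, 11, "snp", "A", "G"))
print(check_variant(20, 20, "ins", "", "T"))
print(check_variant(30, 32, "del", "AC", ""))
print(check_variant(30, 2030, "del", "2000", ""))
print(check_variant(20, 21, "ins", "", "T"))
```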

Further fields can be present to describe the variant in the sample or annotation information, many depending on the variant caller used (e.g. gatk_vars and sam_vars). The following fields have specific meanings in genomecomb:

sequenced
single letter code describing sequencing status (described below)
zyg
zygosity, a single letter code indicating the zygosity of sample for the variant. (described below)
alleleSeq1
genotype of variant at one allele
alleleSeq2
genotype of variant at other allele
phased
order of genotypes in alleleSeq1 and alleleSeq2 is significant (phase is known)
quality
quality of the variant (or reference) call, assigned by the variant caller. Normally phred scaled: -10 * log_10(prob(call in ALT is wrong)).
coverage
number of reads used to call the variant (covering the variant). This is as reported by the variant caller; the interpretation of coverage can differ between callers.
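The phred scaling mentioned for the quality field is the standard conversion between an error probability and a score. A minimal Python sketch (the function names are chosen here for illustration):

```python
import math


def phred_to_error_probability(quality):
    """Probability that the call is wrong, given a phred-scaled quality."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred-scaled quality for a given (non-zero) error probability."""
    return -10 * math.log10(probability)


for quality in (10, 20, 30):
    print(quality, phred_to_error_probability(quality))
```

A quality of 30 therefore corresponds to an estimated 1 in 1000 chance that the call is wrong, and each additional 10 points divides that probability by 10.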

genomecomb has 2 options to deal with the presence of multiple alternative alleles on the same position:

multiallelic
one line per position and type. All alternative alleles are on one line in the variant file. The alt field contains a list (separated by commas) of alternative alleles. If any of the other fields contains values specific to an allele (e.g. the frequency of the allele in a population), this field will contain a comma-separated list with the values in the same order as the list of alt alleles.
split
Each alternative allele is on a separate line, e.g. an A to G,C variant (multiallelic notation) is split into an A to G and an A to C variant. While cg select can select based on lists in fields (as in multiallelic mode), split mode makes querying and selection of variants much easier.
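The relation between the two notations can be sketched in Python. The function split_multiallelic and the freq field are hypothetical, the caller has to say which fields hold per-allele lists, and the sketch does not recompute sample fields such as sequenced or zyg, which have a different meaning in split mode.

```python
def split_multiallelic(record, per_allele_fields=()):
    """Turn one multiallelic record (a dict) into one record per alt allele."""
    alts = record["alt"].split(",")
    result = []
    for index, alt in enumerate(alts):
        new = dict(record)
        new["alt"] = alt
        for field in per_allele_fields:
            # Per-allele values are listed in the same order as the alt alleles.
            new[field] = record[field].split(",")[index]
        result.append(new)
    return result


# Hypothetical record: an A to G,C variant with a made-up per-allele field.
record = {"chromosome": "chr1", "begin": "10", "end": "11", "type": "snp",
          "ref": "A", "alt": "G,C", "freq": "0.1,0.02"}
for line in split_multiallelic(record, per_allele_fields=["freq"]):
    print(line["ref"], line["alt"], line["freq"])
```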

sequenced field

The sequenced field indicates the sequencing status of the variant in the sample. The following codes can be found:

u
the position is considered unsequenced in the sample (e.g. because coverage or quality was too low).
v
the variant was found in the sample.
r
the position was sequenced, but the given variant is not present.

In multiallelic mode, r always means that the genotype is reference. In split mode however, "v" will only be assigned if the specific alternative is present in the genotype. So "r" will be used even if there are non-reference alleles, as long as they are not the given alternative!

When calling variants using GATK or samtools, genomecomb picks a relatively low quality threshold (coverage < 5 or quality < 30) for considering variants unsequenced (sensitivity over specificity). You can always apply more stringent quality filtering on the result using cg select.
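How the codes and thresholds above combine can be sketched for split mode in Python. The function sequenced_code is hypothetical and only restates the rules in this section; the actual logic in genomecomb may take more information from the variant caller into account.

```python
def sequenced_code(coverage, quality, allele1, allele2, alt,
                   min_coverage=5, min_quality=30):
    """Sequenced code (u, v or r) for one alt allele in split mode."""
    if coverage < min_coverage or quality < min_quality:
        return "u"  # too little evidence: considered unsequenced
    if alt in (allele1, allele2):
        return "v"  # the given alternative is present in the genotype
    return "r"      # sequenced, but this specific alternative is absent


print(sequenced_code(3, 50, "A", "G", alt="G"))
print(sequenced_code(20, 50, "A", "G", alt="G"))
print(sequenced_code(20, 50, "A", "C", alt="G"))
```

The last call shows the split mode subtlety: the genotype A/C is not reference, but the code for the alternative G is still r.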

zyg

Possible zyg codes are:

m
homozygous; the sample has two times the given alternative allele
t
heterozygous; the sample has the given alternative allele and one reference allele
c
compound; the sample has two different non-reference alleles. In split mode, c is only used if one of those is the given variant alt allele.
o
other; This is only used in split mode when the sample contains non-reference alleles other than the variant alt allele.
r
reference
u
unspecified/unsequenced

It is possible to have an assigned zyg other than u (e.g. t) even when the sequenced field is u, meaning that the variant caller could make a zygosity estimate/prediction, but the variant call is not of sufficient quality to consider it sequenced.
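The zyg codes in split mode can be derived from the two alleles of a genotype, as in the following Python sketch. The function zygosity is hypothetical, and representing an unknown allele as None is an assumption made for this example.

```python
def zygosity(allele1, allele2, ref, alt):
    """Zyg code in split mode for the given alt allele."""
    if allele1 is None or allele2 is None:
        return "u"  # assumption: None marks an unknown allele
    alleles = (allele1, allele2)
    alt_count = alleles.count(alt)
    ref_count = alleles.count(ref)
    if alt_count == 2:
        return "m"  # homozygous for the given alt
    if alt_count == 1 and ref_count == 1:
        return "t"  # heterozygous: given alt plus reference
    if alt_count == 1:
        return "c"  # compound: given alt plus another non-reference allele
    if ref_count == 2:
        return "r"  # reference
    return "o"      # other non-reference allele(s), not the given alt


for genotype in [("G", "G"), ("A", "G"), ("C", "G"), ("A", "C"), ("A", "A")]:
    print(genotype, zygosity(*genotype, ref="A", alt="G"))
```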

multicompar file

In a multicompar file, data for different samples is present in one file, so that the samples can be compared. Fields that are specific to a sample have the sample name added to the field name, separated by a dash, e.g. the zygosity of a variant in sample1 can be found in the column named zyg-sample1. A small example multicompar variant file with two samples would contain the following fields:

chromosome
chromosome name.
begin
start of the region (half-open coordinates).
end
end of the region (half-open coordinates)
type
type of variation: snp, ins, del, sub are recognised
ref
reference sequence
alt
alternative allele(s).
sequenced-sample1
sequencing status of sample1
zyg-sample1
zygosity of sample1
quality-sample1
variant quality in sample1
alleleSeq1-sample1
genotype of variant at one allele in sample1
alleleSeq2-sample1
genotype of variant at other allele in sample1
sequenced-sample2
sequencing status of sample2
zyg-sample2
zygosity of sample2
quality-sample2
variant quality in sample2
alleleSeq1-sample2
genotype of variant at one allele in sample2
alleleSeq2-sample2
genotype of variant at other allele in sample2
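The field-sample naming convention makes it easy to find out which samples a multicompar file contains. A minimal Python sketch, with hypothetical function names, and assuming that the field part of a column name never contains a dash itself (so everything after the first dash is the sample name):

```python
def split_field(fieldname):
    """Split 'zyg-sample1' into ('zyg', 'sample1'); 'alt' gives ('alt', None)."""
    base, dash, sample = fieldname.partition("-")
    return (base, sample) if dash else (fieldname, None)


def sample_names(header):
    """Sample names found in a header, in order of first appearance."""
    names = []
    for field in header:
        _, sample = split_field(field)
        if sample is not None and sample not in names:
            names.append(sample)
    return names


header = ["chromosome", "begin", "end", "type", "ref", "alt",
          "sequenced-sample1", "zyg-sample1", "quality-sample1",
          "sequenced-sample2", "zyg-sample2", "quality-sample2"]
print(sample_names(header))
print(split_field("zyg-sample1"))
```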

tab based bioinformatics formats

Some formats used in bioinformatics contain data in a tab separated format where the header does not conform to the tsv specs. Most Genomecomb commands will detect and support some of these alternative comments/header styles:

sam
starts with "@HD VN", header lines start with @, uses fixed columns
vcf
starts with "##fileformat=VCF", the last "comment" line contains the header. In order to extract the data merged in some of the vcf fields into a genomecomb supported tsv, use cg vcf2tsv.
Complete genomics
header line is preceded by an empty line and starts with a > character
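The distinguishing marks listed above are enough for a rough format guess based on the first lines of a file. The following Python sketch is a hypothetical illustration of such a check, not the detection code used by genomecomb; it accepts the VCF marker with or without leading # characters.

```python
def guess_format(lines):
    """Guess the flavour of a tab-separated file from its first lines."""
    lines = [line.rstrip("\r\n") for line in lines]
    if not lines:
        return "tsv"
    first = lines[0]
    if first.startswith("@HD"):
        return "sam"
    if first.lstrip("#").startswith("fileformat=VCF"):
        return "vcf"
    # Complete Genomics: header starts with '>' and follows an empty line.
    for previous, current in zip(lines, lines[1:]):
        if previous == "" and current.startswith(">"):
            return "complete genomics"
    return "tsv"


print(guess_format(["@HD\tVN:1.6"]))
print(guess_format(["##fileformat=VCFv4.2", "#CHROM\tPOS"]))
print(guess_format(["#some metadata", "", ">col1\tcol2"]))
print(guess_format(["#comment", "chromosome\tbegin\tend"]))
```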