GenomeComb
GenomeComb has moved to GitHub at https://github.com/derijkp/genomecomb with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages remain here only for the data on the older scientific application (or in case someone really needs a long-obsolete version of the software).
The most important changes in 0.9.0 are:
Support has been added for the analysis of exome and mastr sequencing projects (both Illumina and Solid).
Commands: cg process_illumina, cg process_mastr.
Multiple analysis methods are supported within one sample.
These are reflected in the file names, which follow the pattern type-variantcaller-mapper-sample.tsv to indicate the mapping and variant calling algorithms used.
For example, the variant file for sample x1, where variants were called using gatk on an alignment made using bwa after duplicate removal and realignment (rdsbwa), would be in a folder x1 and be called var-gatk-rdsbwa-x1.tsv
Column names specific to samples/methods in the multicompar file use the same convention, e.g. sequenced-varcaller-mapper-sample
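The naming convention above can be illustrated with a small Python sketch that splits a result file name into its four parts. The function and field names (parse_result_name, filetype, and so on) are hypothetical and not part of genomecomb; how sample names containing a "-" are handled is an assumption.

```python
from typing import NamedTuple


class ResultName(NamedTuple):
    """Parts of a result file name (field names are illustrative only)."""
    filetype: str
    variantcaller: str
    mapper: str
    sample: str


def parse_result_name(filename: str) -> ResultName:
    """Split 'type-variantcaller-mapper-sample.tsv' into its four parts."""
    stem = filename[:-4] if filename.endswith(".tsv") else filename
    # Split at most three times so a sample name containing '-' stays whole
    # (an assumption; the notes above do not say how such names are handled).
    parts = stem.split("-", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected file name: {filename!r}")
    return ResultName(*parts)
```

Calling parse_result_name("var-gatk-rdsbwa-x1.tsv") yields the type var, the variant caller gatk, the mapper rdsbwa and the sample x1.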
varall files were introduced to provide information on positions that are not in the variant lists.
They are created by making genotype calls for all (sufficiently covered) positions, including reference positions. They are used to create sequenced region files (reference calls should also be of sufficient quality), and to complete quality, coverage, etc. data for non-variant positions when making the multicompar.
The option (split) has been added to several commands to work with a slightly different format where multiple alternative alleles are stored on separate lines. This is now the default for exome/mastr sequencing.
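As a rough illustration of the split format, the Python sketch below turns one multiallelic record into one record per alternative allele. It assumes the alternative alleles are stored comma-separated in a field called alt; the record layout and the function name split_alleles are hypothetical, and sample-specific columns are simply copied rather than split.

```python
def split_alleles(variant):
    """Yield one copy of the variant per comma-separated alternative allele."""
    for alt in variant["alt"].split(","):
        # All other fields are copied unchanged (a simplification).
        yield {**variant, "alt": alt}


multiallelic = {"chromosome": "1", "begin": 100, "end": 101, "ref": "A", "alt": "C,T"}
split_rows = list(split_alleles(multiallelic))
```

Here split_rows holds two records that differ only in alt ("C" and "T").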
A zyg (zygosity) column has been added. The following codes are used:
m hoMozygous
t heTerozygous
c compound: the variant is present in one allele, but the other allele is not reference
o other: the variant is not present, but the genotype is also not reference (it contains another alternative allele)
r reference
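One way to read these codes is as a function of the two genotype alleles. The Python sketch below is an interpretation of the definitions above, not the code genomecomb uses; the function name zyg_code and its parameters (ref, alt, allele1, allele2) are assumptions made for illustration.

```python
def zyg_code(ref: str, alt: str, allele1: str, allele2: str) -> str:
    """Return the one-letter zyg code for one variant in one sample."""
    alleles = (allele1, allele2)
    n_alt = alleles.count(alt)
    if n_alt == 2:
        return "m"  # homozygous: both alleles carry the variant
    if n_alt == 1:
        other = allele2 if allele1 == alt else allele1
        # heterozygous if the other allele is reference, compound otherwise
        return "t" if other == ref else "c"
    if alleles == (ref, ref):
        return "r"  # reference on both alleles
    return "o"  # variant absent, but at least one allele is not reference
```

For a variant with reference A and alternative G, the genotype G/T would be coded c and the genotype T/A would be coded o.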
There are also a lot of new fields depending on the methods used.
The new cg select options -g (group) and -gc (groupcols) allow you to extract summary data from multicompar, e.g.
cg select -g 'refGene_gene {}' -gc 'sample {} zyg m count' compar/annot_compar-mastr130215nbd.tsv
will give you a table showing, for each gene (rows), the number of homozygous variants in each sample (columns).
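The summary produced by this kind of grouping can be mimicked in a few lines of Python. This is only a sketch of the counting logic, not of how cg select is implemented; the rows, gene names and sample names below are made-up example data, and the zyg-sample column layout is an assumption based on the naming convention described earlier.

```python
from collections import defaultdict

# Hypothetical multicompar rows: one dict per variant, one zyg column per sample.
rows = [
    {"refGene_gene": "BRCA1", "zyg-gatk-rdsbwa-x1": "m", "zyg-gatk-rdsbwa-x2": "t"},
    {"refGene_gene": "BRCA1", "zyg-gatk-rdsbwa-x1": "m", "zyg-gatk-rdsbwa-x2": "m"},
    {"refGene_gene": "TP53", "zyg-gatk-rdsbwa-x1": "r", "zyg-gatk-rdsbwa-x2": "m"},
]


def count_homozygous(rows):
    """Return {gene: {sample: number of rows with zyg == 'm'}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        gene = row["refGene_gene"]
        for column, value in row.items():
            if column.startswith("zyg-"):
                sample = column[len("zyg-"):]
                # Adding a bool increments the count only for homozygous calls.
                counts[gene][sample] += value == "m"
    return {gene: dict(per_sample) for gene, per_sample in counts.items()}


table = count_homozygous(rows)
```

Each key of table corresponds to a row of the summary (a gene), and each inner key to a column (a sample).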
Sample aggregates are a special class of functions for getting summary information over the samples for a variant line.
Within the parameters of these functions, you can use the column names without the sample name part. (The field sample will also be available, containing the sample name.) E.g.
scount($zyg == "m" and $sample matches "gatk-*") will give the number of gatk analysed samples that are homozygous for a variant
slist($zyg == "m" and $sample matches "gatk-*",$sample) will return a list of these samples
sample aggregates can be used to define calculated fields as well as in queries
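To make the idea of a sample aggregate concrete, here is a small Python emulation of scount and slist over a single variant line. It is a sketch only: the variant data and sample names are invented, the Python functions are not genomecomb's implementation, and treating "matches" as glob-style matching (via fnmatchcase) is an assumption.

```python
from fnmatch import fnmatchcase

# One variant line, regrouped per sample (hypothetical sample names and values).
variant = {
    "gatk-rdsbwa-x1": {"zyg": "m"},
    "gatk-rdsbwa-x2": {"zyg": "t"},
    "sam-rdsbwa-x1": {"zyg": "m"},
}


def scount(variant, condition):
    """Count the samples for which condition(sample, fields) is true."""
    return sum(1 for sample, fields in variant.items() if condition(sample, fields))


def slist(variant, condition, value):
    """List value(sample, fields) for the samples that satisfy condition."""
    return [value(sample, fields) for sample, fields in variant.items()
            if condition(sample, fields)]


def homozygous_gatk(sample, fields):
    # Mirrors: $zyg == "m" and $sample matches "gatk-*"
    return fields["zyg"] == "m" and fnmatchcase(sample, "gatk-*")


n_homozygous = scount(variant, homozygous_gatk)
homozygous_samples = slist(variant, homozygous_gatk, lambda sample, fields: sample)
```

The condition sees the fields without the sample name part, plus the sample name itself, which is the behaviour the two examples above rely on.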
many other new functions were added: compare, if, sum, distinct, ...
The (autogenerated) ROW field will give you the line number
The viewer supports all the new querying methods of cg select, and has some additions of its own as well: calculated columns can now also be added in the viewer
A query builder has been added to aid in making queries. Queries are still plain text (you can still copy/paste, etc.), but the query builder helps with creating the text
Summaries (as with the -g and -gc options) are supported as a different table view. You can also show these summaries immediately as a chart.
cg regselect: select all regions or variants in region_file1 that overlap with regions in region_file2
cg bam histo: make histogram from bam file
cg splitalleles: convert a multiallelic variant file to the split allelic format. (You get better information if you use the split format from the start, as the conversion cannot know how to split every column.)
and many fixes and optimizations
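The kind of overlap selection that cg regselect is described as doing in the list above can be sketched in Python as follows. This is not the tool's implementation: regions are represented here as hypothetical (chromosome, begin, end) tuples, coordinates are assumed to be half-open (begin included, end excluded), and the simple all-against-all comparison ignores any sorting the real command may rely on.

```python
def overlaps(a, b):
    """True if two (chromosome, begin, end) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def regselect(regions1, regions2):
    """Keep the entries of regions1 that overlap any region in regions2."""
    return [r for r in regions1 if any(overlaps(r, q) for q in regions2)]


# Made-up example data.
regions1 = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
regions2 = [("chr1", 150, 160), ("chr1", 400, 500)]
selected = regselect(regions1, regions2)
```

With half-open coordinates, a region ending at 400 does not overlap one beginning at 400, so only the first entry of regions1 is selected.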