Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages remain here only for the data on the older scientific application (or for anyone who really needs a long-obsolete version of the software).

releasenotes 0.10.0

Release 0.10.0

Many improvements have been made, some highlights:

Annotation

Added specific annotation of miRNA genes

Analysis

Multiple analysis methods are supported within one sample. They are reflected in the file names, which use the pattern type-variantcaller-mapper-sample.tsv to indicate the mapping and variant calling algorithms used. For example, the variant file for sample x1, where variants were called using gatk on an alignment made using bwa after duplicate removal and realignment (rdsbwa), would be in a folder x1 and be called var-gatk-rdsbwa-x1.tsv. Column names specific to samples/methods in the multicompar file use the same convention, e.g. sequenced-varcaller-mapper-sample.
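As an illustration of the naming convention above, the following Python sketch splits such a file name into its four parts. It is not part of genomecomb; the function and class names are hypothetical, and it assumes that only the first three hyphens act as separators (so a sample name containing hyphens stays intact).

```python
from typing import NamedTuple


class AnalysisFile(NamedTuple):
    """Hypothetical container for the parts of an analysis file name."""
    filetype: str
    variantcaller: str
    mapper: str
    sample: str


def parse_analysis_filename(filename: str) -> AnalysisFile:
    """Split a name like var-gatk-rdsbwa-x1.tsv into its four parts."""
    stem = filename[:-len(".tsv")] if filename.endswith(".tsv") else filename
    # Assumption: only the first three hyphens are separators.
    parts = stem.split("-", 3)
    if len(parts) != 4:
        raise ValueError(
            f"{filename!r} does not match type-variantcaller-mapper-sample.tsv"
        )
    return AnalysisFile(*parts)


info = parse_analysis_filename("var-gatk-rdsbwa-x1.tsv")
print(info.filetype, info.variantcaller, info.mapper, info.sample)
```

The same split applies to method-specific column names in a multicompar file, such as sequenced-gatk-rdsbwa-x1.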
varall files were introduced to provide information on positions that are not in the variant lists. They are created by making genotype calls for all (sufficiently covered) positions, including reference positions. They are used to create the sequenced region files (reference calls must also be of sufficient quality) and to fill in quality, coverage, etc. data for non-variant positions when making the multicompar.
A split option has been added to several commands to work with a slightly different format, in which multiple alternative alleles are stored on separate lines. This is now the default for exome/mastr sequencing.
A zyg (zygosity) column has been added. The following codes are used:
- m: hoMozygous
- t: heTerozygous
- c: compound (the variant is present in one allele, but the other allele is not reference)
- o: other (the variant is not present, but the genotype is also not reference: it contains another alternative allele)
- r: reference
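The zygosity codes can be read as a small decision rule on the two called alleles. The Python sketch below is one way to express that rule, not genomecomb's own implementation; the function name and its parameters are assumptions made for illustration, and it only covers the five codes listed here (unknown or missing genotypes are not handled).

```python
def zygosity(ref: str, alt: str, allele1: str, allele2: str) -> str:
    """Return the zyg code for one variant (ref -> alt) in one sample."""
    alleles = (allele1, allele2)
    alt_count = alleles.count(alt)
    if alt_count == 2:
        return "m"  # both alleles carry the variant
    if alt_count == 1:
        other = allele2 if allele1 == alt else allele1
        # The variant sits on one allele: heterozygous if the other allele
        # is reference, compound if it is yet another alternative.
        return "t" if other == ref else "c"
    if allele1 == ref and allele2 == ref:
        return "r"  # neither allele differs from the reference
    return "o"      # variant absent, but some other alternative is present


for genotype in (("G", "G"), ("A", "G"), ("G", "T"), ("A", "T"), ("A", "A")):
    print(genotype, zygosity("A", "G", *genotype))
```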
There are also a lot of new fields depending on the methods used.
In mastrs, amplicons can now be clipped correctly (overlapping amplicons no longer pose potential problems).

Querying

cg select
- The new cg select options -g (group) and -gc (groupcols) allow you to extract summary data from a multicompar file, e.g. cg select -g 'refGene_gene {}' -gc 'sample {} zyg m count' compar/annot_compar-mastr130215nbd.tsv will give you a table with, for each gene (rows), the number of homozygous variants in each sample (columns).
- Sample aggregates are a special class of functions that give summary information over the samples for a variant line. Within the parameters of these functions, you can use the column names without the sample name part. (The field sample is also available and contains the sample name.) Sample aggregates can be used to define calculated fields as well as in queries, e.g.
  - scount($zyg == "m" and $sample matches "gatk-*") will give the number of gatk analysed samples that are homozygous for a variant
  - slist($zyg == "m" and $sample matches "gatk-*",$sample) will return a list of these samples
- Many other new functions were added: compare, if, sum, distinct, length, split, catch, chr_clip, zyg, regextract, ucount, sucount, ...
- The (autogenerated) ROW field will give you the line number
- You can use a sampleinfo file in cg select (default datafilename.sampleinfo.tsv): This file can contain extra information about the samples in the datafile (e.g. gender,disease,...) that can be used in a query.
- You can define calculated fields using wildcards, e.g. {geno-*="alleleSeq1-*/alleleSeq2-*"} creates a geno-... column for each matching pair of alleleSeq1-... and alleleSeq2-... columns. Multiple wildcards can be used by incorporating multiple successive asterisks (*, **, ***, ...)
- You can use calculated columns given in the -f parameter in the query (prepend a - to not show it in the result)
- Added the -rc (removecomments) option
- The -nh option now allows replacement of a wrong header (it only gives a warning if the new header length differs from the old one)
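To make the sample aggregates described above more concrete, here is a rough Python analogue of scount and slist. It is only a sketch of the idea, not how cg select evaluates them: the helper names are hypothetical, the example row is made up, every column containing a hyphen is assumed to be of the form field-samplename, and fnmatchcase stands in for the matches operator.

```python
from fnmatch import fnmatchcase


def sample_views(row):
    """Regroup a wide variant row into one small dict per sample."""
    views = {}
    for column, value in row.items():
        field, sep, sample = column.partition("-")
        if sep:  # assumption: a hyphen marks a sample-specific column
            views.setdefault(sample, {"sample": sample})[field] = value
    return list(views.values())


def scount(row, condition):
    """Number of samples for which condition holds."""
    return sum(1 for view in sample_views(row) if condition(view))


def slist(row, condition, field):
    """Values of field for the samples for which condition holds."""
    return [view[field] for view in sample_views(row) if condition(view)]


def is_gatk_homozygous(view):
    # Mirrors: $zyg == "m" and $sample matches "gatk-*"
    return view.get("zyg") == "m" and fnmatchcase(view["sample"], "gatk-*")


# Made-up variant line with one zyg column per method/sample combination.
row = {
    "chromosome": "1", "begin": "1000", "end": "1001",
    "zyg-gatk-rdsbwa-x1": "m",
    "zyg-gatk-rdsbwa-x2": "t",
    "zyg-sam-rdsbwa-x1": "m",
    "zyg-gatk-rdsbwa-x3": "m",
}
print(scount(row, is_gatk_homozygous))
print(slist(row, is_gatk_homozygous, "sample"))
```

Inside the condition, fields are addressed without their sample suffix (zyg rather than zyg-gatk-rdsbwa-x1), which is the same shorthand the sample aggregates offer.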

cg viz
- added roc curves and precision-recall graphs
- added showcmdline: you can see (and copy) the commandline for the current query and fields.
- calculated columns can now also be added in the viewer
- A query builder has been added to aid in making queries. Queries are still plain text (you can still copy/paste, etc.), but the query builder helps in creating the text
- easyquery adds some commonly used queries in an easier format
- summaries (as with the -g and -gc options) are supported as a different table view. You can also show these summaries immediately as a chart.
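The kind of summary produced by the -g and -gc options (genes as rows, samples as columns, counts of homozygous variants as cells) can be pictured with the Python sketch below. It only illustrates the grouping idea and is not genomecomb code: the function name and the example rows are made up, and columns are assumed to be named zyg-samplename.

```python
from collections import Counter, defaultdict


def count_homozygous_per_gene(rows):
    """Count, per gene and per sample, the variants with zyg code "m"."""
    table = defaultdict(Counter)
    for row in rows:
        counts = table[row["refGene_gene"]]  # creates the gene row if needed
        for column, value in row.items():
            if column.startswith("zyg-") and value == "m":
                counts[column[len("zyg-"):]] += 1
    return {gene: dict(counts) for gene, counts in table.items()}


# Made-up variant lines for two samples, x1 and x2.
rows = [
    {"refGene_gene": "geneA", "zyg-x1": "m", "zyg-x2": "t"},
    {"refGene_gene": "geneA", "zyg-x1": "m", "zyg-x2": "m"},
    {"refGene_gene": "geneB", "zyg-x1": "r", "zyg-x2": "m"},
]
for gene, counts in count_homozygous_per_gene(rows).items():
    print(gene, counts)
```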

New commands

cg liftregion and cg liftsample: for lifting regions to another genome build
cg regselect: select all regions or variants in region_file1 that overlap with regions in region_file2
cg splitalleles: convert a multiallelic variant file to the split allelic format. (You get better information if you use the split format from the start, as this command cannot know how to split every column.)
cg collapsealleles: go from split variant file to multiallelic variant file
cg split: split a tab separated file into multiple tab separated files based on the content of a (usually chromosome) field.
cg long: change tsv file from wide to long format (each sample in separate line)
cg wide: change tsv file from long to wide format (sample data in separate columns with sample name as suffix in fieldname)
cg fixtsv: make sure that all lines of the tsv file have the correct number of columns (same as header)
cg gene2reg: extract regions from a gene file
cg bam histo: make histogram from bam file
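To picture what cg long and cg wide (listed above) do to a table, the following Python sketch converts one made-up row between the two layouts. It is an illustration under stated assumptions, not the commands' implementation: the function names are hypothetical, the shared columns are passed in explicitly, and every other wide column is assumed to be named field-samplename.

```python
SHARED = ("chromosome", "begin", "end")


def wide_to_long(row, shared=SHARED):
    """One output row per sample; the sample suffix moves to a sample column."""
    per_sample = {}
    for column, value in row.items():
        if column in shared:
            continue
        field, _, sample = column.partition("-")
        per_sample.setdefault(sample, {})[field] = value
    base = {name: row[name] for name in shared}
    return [
        {**base, "sample": sample, **fields}
        for sample, fields in per_sample.items()
    ]


def long_to_wide(long_rows, shared=SHARED):
    """Merge rows with the same shared columns; re-attach the sample suffix."""
    wide = {}
    for row in long_rows:
        key = tuple(row[name] for name in shared)
        target = wide.setdefault(key, {name: row[name] for name in shared})
        for field, value in row.items():
            if field not in shared and field != "sample":
                target[f"{field}-{row['sample']}"] = value
    return list(wide.values())


# Made-up wide row with two samples.
wide_row = {
    "chromosome": "1", "begin": "100", "end": "101",
    "zyg-x1": "m", "coverage-x1": "30",
    "zyg-x2": "t", "coverage-x2": "25",
}
long_rows = wide_to_long(wide_row)
for long_row in long_rows:
    print(long_row)
print(long_to_wide(long_rows) == [wide_row])
```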

Various new options and optimizations

Several new options were added to various commands, e.g.

cg genome_seq
- added -p snpdbpattern option to allow selection of snp database
- The -gcsplit option allows results to be split based on gc content
- Using the -split option, all sequences are saved in separate files
- With -l (limitchars), some characters (that cause havoc with mpcr) can be excluded from the names
cg exportplink
- added codegeno option (-c)
- added -samples (-s) option, proper error message on error in query
- added nulllines (-n) option, default on
cg cat: added options for concatenating chromosome split files: -s option, extra -c values

The update also includes many optimizations (e.g. a faster multicompar, makeprimers using only local data, support for lz4 compressed files) and fixes.
