Genomecomb moved to github on https://github.com/derijkp/genomecomb
with documentation on https://derijkp.github.io/genomecomb.
For up-to-date versions, go there. These pages remain here only for the data on the older scientific
application (or in case someone really needs a long-obsolete version of the software).
Release 0.10.0
Many improvements have been made. Some highlights:
Annotation
- Added specific annotation of miRNA genes
Analysis
- Multiple analysis methods are supported within one sample.
These are reflected in the file names, which use the pattern
type-variantcaller-mapper-sample.tsv to indicate the mapping and
variant calling algorithms. For example, the variant file for sample
x1, where variants were called using gatk on an alignment made with
bwa after duplicate removal and realignment (rdsbwa), would be in a
folder x1 and be called var-gatk-rdsbwa-x1.tsv. Column names specific
to samples/methods in the multicompar file use the same convention,
e.g. sequenced-varcaller-mapper-sample.
- varall files were introduced to provide information on
positions that are not in the variant lists. They are created by
making genotype calls for all (sufficiently covered) positions
(including reference positions). They are used to create sequenced
region files (reference calls should also be of sufficient quality),
and to complete quality, coverage, etc. data for non-variant
positions when making the multicompar.
- The option (split) has been added to several commands to work
with a slightly different format where multiple alternative alleles
are stored on separate lines. This is now the default for
exome/mastr sequencing.
- A zyg (zygosity) column has been added. The following codes are
used:
  m: hoMozygous
  t: heTerozygous
  c: compound (the variant is present in one allele, but the other
     allele is not reference)
  o: other (the variant is not present, but the genotype is also not
     reference: it contains another alternative allele)
  r: reference
- There are also a lot of new fields depending on the methods
used.
- In mastrs, amplicons can now be clipped correctly (overlapping
amplicons no longer pose potential problems).
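
The zyg codes listed above can be illustrated with a small Python sketch. This is not genomecomb's implementation: the function name zyg_code and its arguments (the two called alleles plus the reference and alternative allele of the variant line) are assumptions made for this example, and it only covers the five codes named in these notes.

```python
def zyg_code(allele1, allele2, ref, alt):
    """Return a one-letter zygosity code for a diploid genotype.

    Hypothetical helper: allele1/allele2 are the called alleles,
    ref/alt are the reference and alternative allele of the variant.
    """
    n_alt = (allele1, allele2).count(alt)
    if n_alt == 2:
        return "m"  # hoMozygous: both alleles carry the variant
    if n_alt == 1:
        other = allele2 if allele1 == alt else allele1
        # heTerozygous if the other allele is reference, else compound
        return "t" if other == ref else "c"
    if allele1 == ref and allele2 == ref:
        return "r"  # reference on both alleles
    return "o"      # variant absent, but genotype is not reference either


# Usage: variant line with reference G and alternative A
for genotype in [("A", "A"), ("G", "A"), ("A", "T"), ("T", "T"), ("G", "G")]:
    print(genotype, zyg_code(*genotype, ref="G", alt="A"))
```

With reference G and alternative A, these five genotypes map to m, t, c, o and r respectively.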
Querying
- cg select
- The new cg select options -g (group) and -gc (groupcols)
allow you to extract summary data from a multicompar file, e.g.
cg select -g 'refGene_gene {}' -gc 'sample {} zyg m count' compar/annot_compar-mastr130215nbd.tsv
will give you a table with, for each gene (rows), the number of
homozygous variants for each sample (columns).
- Sample aggregates are a special class of functions that give
summary info over the samples for a variant line. Within the
parameters of these functions, you can use the column names
without the sample name part. (The field sample is also
available and contains the sample name.) Sample aggregates can be
used to define calculated fields as well as in queries, e.g.
- scount($zyg == "m" and $sample matches "gatk-*") will give the
number of gatk-analysed samples that are homozygous for a variant
- slist($zyg == "m" and $sample matches "gatk-*",$sample) will
return a list of these samples
- Many other new functions were added: compare, if, sum,
distinct, length, split, catch, chr_clip, zyg, regextract,
ucount, sucount, ...
- The (autogenerated) ROW field will give you a line's
number
- You can use a sampleinfo file in cg select (default
datafilename.sampleinfo.tsv): This file can contain extra
information about the samples in the datafile (e.g.
gender,disease,...) that can be used in a query.
- You can define calculated fields using wildcards, e.g.
{geno-*="alleleSeq1-*/alleleSeq2-*"} to create a
geno-... column for each matching alleleSeq1-... and
alleleSeq2-.... Multiple wildcards can be used by writing
multiple successive asterisks (*, **, ***, ...)
- You can use calculated columns given in the -f parameter
in the query (prepend a - to not show it in the result)
- Added the -rc (removecomments) option
- The -nh option now allows replacement of a wrong header
(only gives a warning if the new header length differs from the old one)
- cg viz
- added roc curves and precision-recall graphs
- added showcmdline: you can see (and copy) the commandline
for the current query and fields.
- calculated columns can now also be added in the viewer
- A query builder has been added to aid in making queries.
Queries are still plain text (you can still copy/paste, etc.), but the
query builder helps in creating the text
- easyquery adds some commonly used queries in an easier
format
- Summaries (as with the -g and -gc options) are supported as a
different table view. You can also show these summaries
immediately as a chart.
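
The sample aggregates described above can be sketched in Python to show what they compute for one variant line. This is an illustration under stated assumptions, not how cg select is implemented: the helper per_sample_views, the dict representation of a line, and the use of glob matching (fnmatchcase) as a stand-in for the matches operator are all choices made for this example. Sample-specific columns are assumed to be named field-sample, split at the first "-".

```python
from fnmatch import fnmatchcase


def per_sample_views(row):
    """Group 'field-sample' columns of one line into one dict per sample."""
    views = {}
    for column, value in row.items():
        field, sep, sample = column.partition("-")
        if not sep:
            continue  # column is not sample specific (e.g. chromosome)
        views.setdefault(sample, {"sample": sample})[field] = value
    return views


def scount(row, condition):
    """Count the samples for which condition(view) is true."""
    return sum(1 for view in per_sample_views(row).values() if condition(view))


def slist(row, condition, field):
    """List the value of `field` for every sample passing the condition."""
    return [view.get(field) for view in per_sample_views(row).values()
            if condition(view)]


# Hypothetical multicompar line with four analysed samples
row = {
    "chromosome": "1", "begin": "100",
    "zyg-gatk-rdsbwa-x1": "m", "zyg-gatk-rdsbwa-x2": "t",
    "zyg-sam-rdsbwa-x1": "m", "zyg-gatk-rdsbwa-x3": "m",
}


def homozygous_gatk(view):
    return view.get("zyg") == "m" and fnmatchcase(view["sample"], "gatk-*")


print(scount(row, homozygous_gatk))           # 2
print(slist(row, homozygous_gatk, "sample"))  # ['gatk-rdsbwa-x1', 'gatk-rdsbwa-x3']
```

Inside the condition, fields are addressed without the sample name part (zyg rather than zyg-gatk-rdsbwa-x1), which mirrors how the parameters of a sample aggregate are written.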
New commands
- cg liftregion and cg liftsample: for lifting regions to another
genome build
- cg regselect: select all regions or variants in region_file1
that overlap with regions in region_file2
- cg splitalleles: convert a multiallelic variant file to the
split allelic format. (You get better information if you use the
split format from the start, as this command cannot know how to
split every column)
- cg collapsealleles: go from split variant file to
multiallelic variant file
- cg split: split a tab-separated file into multiple
tab-separated files based on the content of a (usually chromosome)
field.
- cg long: change tsv file from wide to long format (each
sample in separate line)
- cg wide: change tsv file from long to wide format (sample
data in separate columns with sample name as suffix in fieldname)
- cg fixtsv: make sure that all lines of the tsv file have the
correct number of columns (same as header)
- cg gene2reg: extract regions from a gene file
- cg bam histo: make histogram from bam file
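
To make the wide and long formats mentioned for cg long and cg wide concrete, here is a minimal Python sketch of a wide-to-long conversion. It is not the cg long implementation: the function name wide_to_long, the placement of the sample column first, and the rule that every column containing a "-" is a field-sample column are assumptions made for this example (real files may have other columns containing "-").

```python
import csv
import io


def wide_to_long(wide_text):
    """Turn a wide tsv (field-sample columns) into one line per sample."""
    reader = csv.DictReader(io.StringIO(wide_text), delimiter="\t")
    header = reader.fieldnames
    common = [c for c in header if "-" not in c]
    samples, fields = [], []
    for column in header:
        if "-" in column:
            field, sample = column.split("-", 1)
            if sample not in samples:
                samples.append(sample)
            if field not in fields:
                fields.append(field)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample"] + common + fields)
    for row in reader:
        for sample in samples:
            per_sample = [row.get(f"{field}-{sample}", "") for field in fields]
            writer.writerow([sample] + [row[c] for c in common] + per_sample)
    return out.getvalue()


wide = "chromosome\tbegin\tzyg-x1\tzyg-x2\n1\t100\tm\tt\n"
print(wide_to_long(wide), end="")
```

For the two-sample input above, the result has the header sample, chromosome, begin, zyg, followed by one line for x1 and one for x2. The reverse direction (as in cg wide) would append the sample name to each field name again.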
Various new options and optimizations
Several new options were added to various commands, e.g.
- cg genome_seq
- added -p snpdbpattern option to allow selection of snp
database
- The -gcsplit option allows results to be split based on
gc content
- Using the -split option, all sequences are saved in
separate files
- With -l (limitchars), some characters (that cause havoc
with mpcr) can be excluded from the names
- cg exportplink
- added codegeno option (-c)
- added -samples (-s) option, proper error message on error
in query
- added nulllines (-n) option, default on
- cg cat: added options for concatenating chromosome split
files: -s option, extra -c values
The update also includes many optimizations (e.g. a faster
multicompar, makeprimers using only local data, support for
lz4 compressed files) and fixes.