GenomeComb

Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages remain here only for the data on the older scientific application (or in case someone really needs a long-obsolete version of the software).

releasenotes 0.98.7

Release 0.98.7

Major updates to genomecomb and the annotation databases have been made.

Workflow improvements

A new generic cg process_project command was added to process the entire workflow from original data to variants and reporting for multiple samples. It can handle samples of mixed type (CGI genomes, Illumina) and has options to select aligner, variant callers, etc. on the command line. New choices were added for e.g. aligner (minimap2) and variant caller (freebayes). A workflow can be run directly (single core), distributed locally over multiple cores, or run on a grid engine based cluster, using one simple option. An interrupted workflow can easily be continued, and results can be updated when a source file has changed. cg process_sample (which is used by cg process_project) can be used separately to process one sample, and cg process_multicompar to combine multiple independently processed samples.

A major focus was optimization of the analysis pipeline: converting all tools to 64-bit, streamlining performance on the cluster, and replacing or updating several of the components. For example, duplicate removal is now (by default) done by tools from the biobambam suite instead of the (much more resource-hungry) picard markduplicates, and samtools has been upgraded to the latest htslib-based version. Several genomecomb tools (e.g. multicompar) have also been optimized and parallelized so that they can handle multicompars with tens of thousands of samples (much) faster and using fewer resources.

Results can differ according to the tools and versions used, so when reporting, it is important to know which tools, versions and options were actually used. genomecomb keeps this provenance data in analysisinfo files. Their name is based on the name of the result file with the extra extension .analysisinfo. A process workflow run also creates a log file with info on all subtasks/jobs run.

Reporting and QC have also been extended in the new workflow. It uses cg process_reports (which can also be used separately) to calculate a number of statistics that are stored in the sample directory (fastqstats, fastqc, flagstats, vars, hsmetrics, covered, predictgender). Most of these reports (except fastqc) are functional rather than fancy: a tsv file with the fields sample, source (i.e. the program used to make them), parameter and value.
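Because these reports are plain long-format tsv files, they are easy to post-process outside genomecomb as well. The following is a minimal Python sketch, assuming a report with exactly the four columns named above. The example rows, the parameter names (total_reads, mapped_reads) and the helper load_report are made up for illustration and are not output of a real cg process_reports run.

```python
import csv
import io
from collections import defaultdict

# Hypothetical report content; parameter names and values are invented
# for illustration only.
EXAMPLE_REPORT = (
    "sample\tsource\tparameter\tvalue\n"
    "sample1\tflagstats\ttotal_reads\t1000\n"
    "sample1\tflagstats\tmapped_reads\t950\n"
    "sample2\tflagstats\ttotal_reads\t2000\n"
    "sample2\tflagstats\tmapped_reads\t1800\n"
)


def load_report(handle):
    """Return {sample: {(source, parameter): value}} from a long-format report."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        table[row["sample"]][(row["source"], row["parameter"])] = row["value"]
    return dict(table)


report = load_report(io.StringIO(EXAMPLE_REPORT))
for sample, values in sorted(report.items()):
    total = int(values[("flagstats", "total_reads")])
    mapped = int(values[("flagstats", "mapped_reads")])
    # Derive a per-sample percentage from two parameters of the same source.
    print(f"{sample}\t{100 * mapped / total:.1f}% mapped")
```

Keeping one parameter per line means that reports from different tools can simply be concatenated before loading.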

Compression

In order to optimize storage space, most tsv databases and results will now be compressed using lz4. (lz4 was chosen over the more familiar gzip because decompression is an order of magnitude faster and it allows random access.) cg commands will transparently accept compressed files: you can use cg viz, cg select, etc. on compressed files exactly as on uncompressed files. For other tools or software you may have to decompress first (commands that have been added for this are explained later).

Genomic Reference

The reference genome previously contained only the canonical chromosome sequences. In this release the random sequences (chromosome known, but not the position on it; *_random) and unplaced sequences (not associated with a chromosome; chrUn_*) are also included.
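If downstream scripts need to treat these extra sequences differently, the two name patterns mentioned above are enough to recognise them. A small Python sketch, assuming only those two patterns; the function classify_sequence and the example names are illustrative, and everything that matches neither pattern is lumped together as "other", which is a simplification.

```python
from fnmatch import fnmatchcase


def classify_sequence(name):
    """Classify a reference sequence name using the two patterns named above."""
    if fnmatchcase(name, "*_random"):
        return "random"      # chromosome known, position on it unknown
    if fnmatchcase(name, "chrUn_*"):
        return "unplaced"    # not associated with any chromosome
    return "other"           # canonical chromosomes and anything else


# Example names, used here only as illustration.
for name in ["chr1", "chr1_gl000191_random", "chrUn_gl000211"]:
    print(name, classify_sequence(name))
```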

Annotation (cg annotate -h)

Gene annotation has been considerably improved. The major change in the output is in the *_descr field, which now follows the latest HGVS variant nomenclature (v 15.11) almost completely (some small deviations for brevity or usefulness are described in the annotate help: cg annotate -h). In the *_descr field, pre was changed to up (for upstream) and post to down (for downstream).

The impact code UTR5KOZAK was added, and some codes changed name: CDSDELSPLICE -> CDSSPLICE and CDSDELSTART -> CDSSTARTDEL. The impact of complex changes (sub, inv) is now indicated using CDSCOMP, CDSSTARTCOMP and GENECOMP.

A new annotation database format (multi-allelic bcol) has been introduced that is much more efficient for databases such as CADD, both in space (>4 times smaller) and time (orders of magnitude faster annotation).

The annotate command itself has also been optimized, allowing for parallel execution on the cluster. The new -replace option allows you to choose what to do with annotations that are already in the variant file (replace, skip, or give an error), and -u allows changing the "upstream" size (default 2000).

Annotation databases

All annotation databases have been updated. Beside simple updates, other important changes have been made:

For easier querying of genes, an integrated geneset annotation file has been made (intGene) that incorporates refGene, gencode, ensGene and knownGene. Only this geneset and the refGene set (for when minimal ref info is needed) are included in the default annotation. Gene annotation itself has also been considerably improved (see the Annotation section). The individual geneset files and prediction genesets are available in the extra directory.

Several databases (e.g. CADD) now use the new, more efficient multi-allelic bcol format. Because of this, CADD could now be included in the default analysis (moved from the hg19/extra folder to the hg19 folder).

The accompanying info files now contain at the start of the file basic information about version, source, citation, ... that you may need to properly cite its use.

Frequencies for most variant databases are now given as percentages, as indicated by "freqp" in the name. This saves space (fewer characters needed) and is generally easier to interpret. Adapt your queries for new data sets accordingly.
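In practice, adapting a query means multiplying any fractional cutoff by 100. The following Python sketch only illustrates that arithmetic; the function fraction_to_percent and the example values are hypothetical and not part of genomecomb.

```python
def fraction_to_percent(fraction):
    """Convert an allele frequency given as a fraction (0..1) to percent (0..100)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a fraction between 0 and 1, got {fraction}")
    return fraction * 100.0


# A rare-variant cutoff of 1%, expressed both ways.
old_cutoff = 0.01                             # cutoff against a fractional frequency field
new_cutoff = fraction_to_percent(old_cutoff)  # cutoff against a freqp (percentage) field

# Hypothetical freqp values (already in percent) for three variants.
freqp_values = [0.02, 0.5, 12.3]
rare = [value for value in freqp_values if value < new_cutoff]
print(new_cutoff, rare)
```

Reusing the old fractional cutoff unchanged on a freqp field would silently select far fewer variants, since 0.01 would then mean 0.01%.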

The dbsnp version number (now at 150) has been removed from the name. As dbsnp frequencies are not reliable anyway (e.g. there are frequencies based on only one sample) they are no longer included in the annotation automatically. (They are still in the annotation file)

Several new annotation files/databases were added:

gnomad
variants with frequencies from Broad (based on 15,496 whole genomes and 123,136 exomes). By default, the maximum frequency in any of the populations and the frequency in the (non-Finnish) European population (nfe) are added. You can add the other populations by annotating with the version in the extra directory.
kaviar
compilation of SNVs, indels, and complex variants observed in humans (based on 13.2K whole genome, 64.6K exome)
lincRNA
long non-coding RNAs
refcoding
simple region file (in extra) with coding regions (from refGene)
intcoding
simple region file (in extra) with coding regions (from intGene)

Querying

Queries using cg select and cg viz have also seen several improvements. Most changes are in the summary functions:

The possibility to loop over lists of values (e.g. genes, impacts) has been added to the summarizing options (-g, -gc), by prepending the field name with - (unique elements in the list) or + (all elements). The howto_query gives an example of how this can be used (together with the new transcripts function) to find genes with multiple variants.

The names of columns made (by the -gc option) now start with the summary function instead of ending with it. This makes it possible to have the sample as last element, creating multi-sample summary tsv files properly formatted for further analysis using cg select.

The functions median, q1 and q3 were added (as function and as -gc aggregate function).
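For readers less familiar with these statistics: q1, median and q3 are the three cut points that divide sorted data into four equal parts. The Python sketch below computes them with the standard library. The interpolation rule cg select uses is not stated here, so the "inclusive" method is an assumption and results on small data sets may differ slightly from those of cg.

```python
import statistics

values = [1, 2, 3, 4, 5]

# n=4 asks for quartiles; the three cut points are q1, the median and q3.
# method="inclusive" treats the data as the whole population (an assumption
# of this sketch, not a statement about cg select).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
print(q1, median, q3)
```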

New options include:

-samples samplelist
returns tsv with only given samples
-ssamples samplelist
returns tsv with only given samples in the given order
-hp list
allows you to specify an (alternative) header on the command line
-sr sortfields
sort on given fields in reverse order
-rc
remove comments from tsv file
-samplingskip num
take only a sample of the tsv file, skipping the num lines between sampling
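To make the effect of -samplingskip concrete, here is a Python sketch of systematic sampling as described above: keep one data line, skip num lines, keep the next, and so on. That the header line is always kept and that sampling starts at the first data line are assumptions of this sketch, and sample_lines is an illustrative helper, not a genomecomb function.

```python
from itertools import islice


def sample_lines(lines, skip):
    """Yield the header, then every (skip + 1)-th data line."""
    iterator = iter(lines)
    header = next(iterator, None)
    if header is None:
        return  # empty input: nothing to sample
    yield header
    # A step of skip + 1 keeps one line and passes over the next `skip` lines.
    yield from islice(iterator, 0, None, skip + 1)


rows = ["id"] + [str(i) for i in range(10)]
sampled = list(sample_lines(rows, skip=3))
print(sampled)
```

With skip=3 and ten data lines, the data lines kept are the 1st, 5th and 9th.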

While "cg select -h" gives you help on the possibilities of cg select, "cg help howto_query" can now be used to get a help/howto using examples (although not touching on all possibilities). You can find the example files used in the howto in /complgen/examples

vcf conversion

cg vcf2tsv was substantially changed. In the vcf format, snps and indels at the same position are clumped together into one multiallelic complex variant, which makes further analysis and annotation harder. This was especially apparent when creating the gnomad annotation file.

vcf2tsv now tries a lot harder to separate these complex variants into simpler separate alleles/types. The -split ori option has been added for when you want to transfer the original multiallelic format to tsv as a substitution. The -removefields option was added to remove some (superfluous) fields while converting vcf files.
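The general idea of separating a multiallelic record can be sketched as follows: handle each alternative allele on its own, trim the bases it shares with the reference, and classify what remains. This Python sketch is not the algorithm used by cg vcf2tsv. The function split_multiallelic, the 0-based half-open output coordinates and the type names are assumptions made for illustration.

```python
def split_multiallelic(pos, ref, alts):
    """Split one multiallelic VCF-style record into simple per-allele variants.

    pos is the 1-based VCF position. Results are tuples of
    (begin, end, type, ref, alt) with 0-based half-open coordinates.
    """
    variants = []
    for alt in alts:
        r, a, begin = ref, alt, pos - 1
        # Trim the shared suffix first, so indels in repeats end up leftmost.
        while r and a and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Then trim the shared prefix, moving the start position along.
        while r and a and r[0] == a[0]:
            r, a, begin = r[1:], a[1:], begin + 1
        if len(r) == 1 and len(a) == 1:
            kind = "snp"
        elif not r:
            kind = "ins"
        elif not a:
            kind = "del"
        else:
            kind = "sub"
        variants.append((begin, begin + len(r), kind, r, a))
    return variants


# One record at position 100 with reference CA and three alternative alleles.
for variant in split_multiallelic(100, "CA", ["CG", "C", "CAT"]):
    print(variant)
```

The single record in the example becomes a snp, a deletion and an insertion, each of which can be annotated and compared on its own.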

Various other new options and commands

Various new commands were added

Workflow

cg project_addsample
convenience function to add a sample directory to a project directory.
cg renamesamples
Converts the sample names in a file or entire directory to other names

Several workflow components supporting parallel processing are now available as separate commands or were newly added:

cg map_bwa
bwa alignment
cg map_bowtie2
bowtie2 alignment
cg map_minimap2
minimap2 alignment
cg realign_abra
realignment around indels using abra
cg realign_gatk
realignment around indels using gatk
cg var_freebayes
call variants using freebayes
cg basecaller_albacore
basecall nanopore reads using albacore

Analysis

cg predictgender
predicts gender for a sample (from a bamfile/sampledir)
cg sam_ampliconscount
count reads mapping to each amplicon in ampliconsfile. file can be a bam or sam file.
cg depth_histo
makes a histogram of the sequencing depth in the given bamfile, optionally subdivided into on- and off-target regions
cg histo
makes a fast histogram of values in the given field
cg hsmetrics
Creates a hsmetrics file (target only) using picard CalculateHsMetrics

Tsv Analysis

cg multiselect
use the same cg select query on multiple tsv files.
cg paste
merge lines of tab separated files.
cg tsvjoin
join two tsv files based on common fields (must be sorted)
cg tsvdiff
compare tsv files
cg keyvalue
Converts data in tsv format from wide format (data for each sample in separate columns) to keyvalue format
cg colvalue
Converts data in tsv format from a long key-value format to a wide column-value format
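The wide and long (key-value) layouts handled by cg keyvalue and cg colvalue carry the same information in different shapes. The Python sketch below illustrates the two directions of that reshaping on in-memory rows. The exact column names and output layout of the cg commands are not specified here, so the field names (id, sample1, sample2) and the helper functions are assumptions.

```python
def wide_to_long(header, rows, id_field):
    """Turn one-column-per-sample rows into (id, key, value) rows."""
    id_index = header.index(id_field)
    long_rows = []
    for row in rows:
        for index, column in enumerate(header):
            if index != id_index:
                long_rows.append((row[id_index], column, row[index]))
    return long_rows


def long_to_wide(long_rows, id_field):
    """Inverse of wide_to_long: rebuild one column per key."""
    keys, table = [], {}
    for row_id, key, value in long_rows:
        if key not in keys:
            keys.append(key)  # keep keys in order of first appearance
        table.setdefault(row_id, {})[key] = value
    header = [id_field] + keys
    rows = [[row_id] + [values.get(key, "") for key in keys]
            for row_id, values in table.items()]
    return header, rows


# Hypothetical wide table: one value column per sample.
header = ["id", "sample1", "sample2"]
rows = [["var1", "10", "12"], ["var2", "0", "7"]]

long_rows = wide_to_long(header, rows, "id")
print(long_rows)
print(long_to_wide(long_rows, "id"))
```

The long form is convenient for grouping and summarizing, the wide form for side-by-side comparison of samples.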

Bamfiles:

cg bam2fastq
can properly extract fastq files from a bam file (using biobambam)
cg fastq2tsv
convert fastq file to tsv format, e.g. to analyse using cg select
cg bamreorder
Changes the order of the contigs/chromosomes in a bam file
cg bam2reg
extract regions with a given minimum coverage in a bam file

Conversion

cg sortfastq
sort a fastq file (based on name)
cg tsv2bed
Converts data in tab-separated format (tsv) format to bed format
cg sam2tsv
converts a sam (or bam) file to tsv format (for e.g. querying using cg select)
cg tsv2sam
converts a tsv file (previously converted from sam) back to sam format

(De)Compression commands

For transfer to other tools not supporting lz4 transparently, some extra commands were added:

cg zcat
output (uncompressed) contents of (multiple) files to stdout (compression type is detected from file name, can be mixed)
cg lz4cat
uncompress an lz4 compressed stream (from stdin to stdout)
cg lz4ra
Using lz4ra, you can access a random part of an lz4 compressed file, without needing to decompress the entire file
cg less
view (page through) a compressed file
cg lz4less
view (paged) the contents of a stream that is lz4 compressed (pipe from stdin, decompress and send to less)
cg lz4
compress files using lz4
cg lz4index
create an index (.lz4i) to get faster random access to an lz4 compressed file

Cluster

cg qjobs
returns running and waiting jobs on a grid engine cluster in tsv format (so they can be analysed using cg select)
cg qsub
submit a command to the cluster (grid engine)

dev/filing

cg hardsync
creates a hardlinked copy of a directory in another location
cg cplinked
create a copy of a directory where each file in it is a softlink to the src

Other commands got new options and capabilities

cg regextract
added -filtered, -min and -max options (instead of the confusing -above and cutoff)
cg gene2reg
added -upstream option to include upstream and downstream regions in output
cg genome_seq
regions can be entered as a parameter as well as in a regionfile
cg exportplink
-samples may use wildcard
cg annotate
added -distrchr option, only for annotategene
cg select
error on overwrite, added -overwrite option, better error messages
cg liftover
can work in pipes
cg regjoin
added option -fields to join regions that have the same values in the given fields, can work in pipes
cg cat
added -n option to add column showing which file a line comes from

Help

The help system was also improved: you can still get an overview using "cg help" and detailed info on specific commands using "cg help command" or "cg command -h". The help will now by default be shown using a pager (so you can scroll through it instead of it running off the screen immediately; press q to quit) and using colors. Overview help topics (about formats, generic options, process commands, ...) have been extended.