GenomeComb
Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. Go there for up-to-date versions. These pages only remain here for the data on the older scientific application (or for anyone who really needs a long obsolete version of the software).
Major updates to genomecomb and the annotation databases have been made.
A new generic cg process_project command was added to run the entire workflow, from original data to variants and reporting, for multiple samples. It can handle samples of mixed type (CGI genomes, Illumina) and has command-line options to select the aligner, variant callers, etc. New choices were added for e.g. the aligner (minimap2) and the variant caller (freebayes). A workflow can be run directly (single core), distributed locally over multiple cores, or run on a grid engine based cluster using one simple option. An interrupted workflow can easily be continued, or its results updated when a source file has changed. cg process_sample (which is used by cg process_project) can be used separately to process one sample, and cg process_multicompar to combine multiple independently processed samples.
A major focus was optimization of the analysis pipeline: converting all tools to 64-bit, streamlining performance on the cluster, and replacing or updating several components. For example, duplicate removal is now (by default) done by tools from the biobambam suite instead of the (much more resource-hungry) picard markduplicates, and samtools has been upgraded to the latest htslib-based version. Several genomecomb tools (e.g. multicompar) have also been optimized and parallelized so that they can handle multicompars with tens of thousands of samples (much) faster and with fewer resources.
Results can differ according to the tools and versions used, so when reporting it is important to know which tools, versions and options were actually used. genomecomb keeps this provenance data in analysisinfo files. Their name is based on the name of the result file, with the extra extension .analysisinfo. A process workflow run also creates a log file with info on all subtasks/jobs run.
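As a small illustration of the naming rule above, the sketch below derives the provenance file name from a result file name. It assumes the extension is simply appended to the complete file name; how compressed result files are handled is not covered here, and the example path is hypothetical.

```python
from pathlib import Path


def analysisinfo_path(result_file: str) -> Path:
    """Return the assumed analysisinfo path for a result file.

    Assumption: ".analysisinfo" is appended to the full file name,
    keeping any existing extension such as ".tsv".
    """
    path = Path(result_file)
    return path.with_name(path.name + ".analysisinfo")


# Hypothetical result file, for illustration only.
print(analysisinfo_path("sample1/var-sample1.tsv"))
```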
Reporting and QC have also been extended in the new workflow. It uses cg process_reports (which can also be used separately) to calculate a number of statistics that are stored in the sample directory (fastqstats, fastqc, flagstats, vars, hsmetrics, covered, predictgender). Most of these reports (except fastqc) are functional rather than fancy: a tsv file with the columns sample, source (i.e. the program used to make them), parameter and value.
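Because the reports are plain tsv files with four columns, they are easy to read from a script. The following is a minimal sketch, assuming the header names are literally sample, source, parameter and value; the rows and the parameter names (total_reads, mapped_reads) are invented for illustration and are not taken from real genomecomb output.

```python
import csv
import io

# Hypothetical report content: the columns follow the description above,
# the rows and parameter names are made up for this example.
REPORT = (
    "sample\tsource\tparameter\tvalue\n"
    "sample1\tflagstats\ttotal_reads\t1000\n"
    "sample1\tflagstats\tmapped_reads\t990\n"
)


def load_report(handle):
    """Map (sample, source, parameter) to the value string of each report line."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["sample"], row["source"], row["parameter"]): row["value"]
        for row in reader
    }


stats = load_report(io.StringIO(REPORT))
mapped = int(stats[("sample1", "flagstats", "mapped_reads")])
total = int(stats[("sample1", "flagstats", "total_reads")])
print(f"mapped fraction: {mapped / total:.2f}")
```

For a real report, open the file and pass the file handle to load_report instead of the StringIO object (a compressed report would have to be decompressed first).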
In order to optimize storage space, most tsv databases and results are now compressed using lz4. (lz4 was chosen over the more familiar gzip because decompression is an order of magnitude faster and it allows random access.) cg commands transparently accept compressed files: you can use cg viz, cg select, etc. on compressed files exactly as on uncompressed files. For other tools or software you may have to decompress first (commands that have been added for this are explained later).
The reference genome previously only contained the canonical chromosome sequences. In this release the random sequences (chromosome known, but location on it unknown; *_random) and unplaced sequences (not associated with a chromosome; chrUn_*) are also included.
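If downstream scripts need to treat the added sequences differently, the two name patterns mentioned above can be matched directly. This is a minimal sketch: the function name and the catch-all label "other" are illustrative choices, and names that match neither pattern are not guaranteed to be canonical chromosomes.

```python
from fnmatch import fnmatchcase


def classify_sequence(name: str) -> str:
    """Classify a reference sequence name using the patterns from the text."""
    if fnmatchcase(name, "chrUn_*"):
        return "unplaced"  # not associated with a chromosome
    if fnmatchcase(name, "*_random"):
        return "random"  # chromosome known, location unknown
    return "other"  # everything else, e.g. chr1


for seq in ("chr1", "chr1_gl000191_random", "chrUn_gl000211"):
    print(seq, classify_sequence(seq))
```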
Gene annotation has been considerably improved. The major change in the output is in the *_descr field, which now follows the latest HGVS variant nomenclature (v 15.11) almost completely (some small deviations for brevity or usefulness are described in the annotate help: cg annotate -h). In the *_descr field, pre was changed to up (for upstream) and post to down (for downstream).
The impact code UTR5KOZAK was added, and some codes changed name: CDSDELSPLICE -> CDSSPLICE and CDSDELSTART -> CDSSTARTDEL. The impact of complex changes (sub, inv) is now indicated using CDSCOMP, CDSSTARTCOMP and GENECOMP.
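Scripts or saved queries that still use the old impact codes need the two renames listed above. One way to handle this, sketched here with illustrative names (RENAMED_IMPACTS and update_impact are not part of genomecomb), is a small lookup table:

```python
# Old impact code -> new impact code, as listed in the text above.
RENAMED_IMPACTS = {
    "CDSDELSPLICE": "CDSSPLICE",
    "CDSDELSTART": "CDSSTARTDEL",
}


def update_impact(code: str) -> str:
    """Return the current name of an impact code; unchanged codes pass through."""
    return RENAMED_IMPACTS.get(code, code)


old_filter = ["CDSDELSPLICE", "CDSDELSTART", "UTR5KOZAK"]
print([update_impact(code) for code in old_filter])
```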
A new annotation database format (multi-allelic bcol) has been introduced that is much more efficient for databases such as CADD, both in space (>4 times smaller) and time (orders of magnitude faster annotation).
The annotate command itself has also been optimized, allowing for parallel execution on the cluster. The new -replace option allows you to choose what to do with annotations that are already in the variant file (replace, skip, or give an error), and -u allows changing the "upstream" size (default 2000).
All annotation databases have been updated. Besides simple updates, other important changes have been made:
For easier querying of genes, an integrated geneset annotation file has been made (intGene) that incorporates refGene, gencode, ensGene and knownGene. Only this geneset and the refGene set (for when minimal ref info is needed) are included in the default annotation. Gene annotation itself has also considerably improved (see below). The individual geneset files and prediction genesets are available in the extra directory.
Several databases (e.g. CADD) use the new, more efficient multi-allelic bcol format. Because of this, CADD could now be included in the default analysis (moved from the hg19/extra folder to the hg19 folder).
The accompanying info files now contain at the start of the file basic information about version, source, citation, ... that you may need to properly cite its use.
Frequencies for most variant databases are now given as percentages, as indicated by "freqp" in the name. This saves space (fewer characters needed) and is generally easier to interpret. Adapt your queries for new data sets accordingly.
The dbsnp version number (now at 150) has been removed from the name. As dbsnp frequencies are not reliable anyway (e.g. there are frequencies based on only one sample) they are no longer included in the annotation automatically. (They are still in the annotation file)
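The switch to percentages in the "freqp" fields means that cutoffs written as fractions must be multiplied by 100 before comparing. A minimal sketch of that conversion, with hypothetical helper names and without handling of empty field values:

```python
def fraction_to_percent(fraction: float) -> float:
    """Convert an allele frequency given as a fraction (0-1) to a percentage."""
    return fraction * 100.0


def below_threshold(freqp: str, max_fraction: float) -> bool:
    """Compare a percentage value from a "freqp" field with a fraction cutoff."""
    return float(freqp) < fraction_to_percent(max_fraction)


# A cutoff of 0.01 (1%) applied to values stored as percentages.
print(below_threshold("0.5", 0.01))  # 0.5% is below 1%
print(below_threshold("2.3", 0.01))  # 2.3% is not
```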
Several new annotation files/databases were added
Queries using cg select and cg viz have also seen several improvements. Most changes are in the summary functions:
The possibility to loop over lists of values (e.g. genes, impacts) has been added to the summarizing options (-g, -gc), by prepending the field name with - (unique elements in the list) or + (all elements). The howto_query gives an example of how this could be used (together with the new transcripts function) to find genes with multiple variants.
The names of columns made (by the -gc option) now start with the summary function instead of ending with it. This makes it possible to have the sample as last element, creating multi-sample summary tsv files properly formatted for further analysis using cg select.
The functions median, q1 and q3 were added (as function and as -gc aggregate function).
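To make the difference between the - and + prefixes for list fields concrete, the sketch below mimics the idea in plain Python. It is not how cg select is implemented: the ";" list separator, the gene names and the function name are assumptions made for this example.

```python
from collections import Counter


def count_list_elements(values, unique, sep=";"):
    """Count list elements over all lines.

    unique=True  mimics the "-" prefix: each distinct element once per line.
    unique=False mimics the "+" prefix: every element of every list.
    """
    counts = Counter()
    for value in values:
        items = value.split(sep)
        counts.update(set(items) if unique else items)
    return counts


# One (hypothetical) gene list per variant line.
gene_lists = ["GENE1;GENE1;GENE2", "GENE1"]
per_variant = count_list_elements(gene_lists, unique=True)
print(per_variant)
print(count_list_elements(gene_lists, unique=False))
# Genes hit by more than one variant line
print(sorted(gene for gene, n in per_variant.items() if n > 1))
```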
New options include
While "cg select -h" gives you help on the possibilities of cg select, "cg help howto_query" can now be used to get a help/howto using examples (although not touching on all possibilities). You can find the example files used in the howto in /complgen/examples
cg vcf2tsv was substantially changed. In the vcf format, snps and indels at the same position are clumped together into one multiallelic complex variant, which makes further analysis and annotation harder. This was especially apparent when creating the gnomad annotation file.
vcf2tsv now tries a lot harder to separate these complex variants into simpler separate alleles/types. The -split ori option has been added for when you want to transfer the original multiallelic format to tsv as a subst. The -removefields option was added to remove some (superfluous) fields while converting vcf files.
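The idea of separating a multiallelic vcf record into simpler alleles can be sketched as follows. This is an illustration of the general technique (trimming bases shared by the reference and alternative allele, then classifying what is left), not the algorithm vcf2tsv actually uses. Positions are kept 1-based as in vcf, and the type names snp, ins, del and sub are used as plain labels.

```python
def simplify_allele(pos, ref, alt):
    """Reduce one ref/alt pair to its minimal form and classify it."""
    # Trim shared leading bases (vcf pads indels with the preceding base).
    while ref and alt and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    # Trim shared trailing bases.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    if len(ref) == 1 and len(alt) == 1:
        kind = "snp"
    elif not ref:
        kind = "ins"
    elif not alt:
        kind = "del"
    else:
        kind = "sub"
    return pos, kind, ref, alt


def split_record(pos, ref, alts):
    """Return one simplified (pos, type, ref, alt) tuple per alternative allele."""
    return [simplify_allele(pos, ref, alt) for alt in alts]


# One multiallelic record: a deletion, a snp and an insertion at one position.
for allele in split_record(100, "CA", ["C", "CG", "CAT"]):
    print(allele)
```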
Various new commands were added
Workflow
Several workflow components supporting parallel processing are now available as separate commands, or were newly added.
Analysis
Tsv Analysis
Bamfiles:
Conversion
(De)Compression commands
For transfer to other tools not supporting lz4 transparently, some extra commands were added:
Cluster
dev/filing
Other commands got new options and capabilities
The help system was also improved: you can still get an overview using "cg help" and detailed info on specific commands using "cg help command" or "cg command -h". By default, the help is now shown using a pager (so you can scroll through it instead of it running off the screen immediately; press q to quit) and in color. Overview help topics (about formats, generic options, process commands, ...) have been extended.