GenomeComb

Introduction

GenomeComb provides tools to generate, combine and analyze whole genome, exome or targeted sequencing data. Variant files in tab-separated format from different sequencing datasets can be combined, taking into account which regions were actually sequenced (given as region files in tab-separated format), and then annotated and queried (several examples can be seen in the Howto::Query section). A graphical user interface able to browse and query multi-million line tab-separated files is also included.

The cg process_illumina command can be used to generate annotated multisample data starting from fastq files, using tools such as bwa for alignment and GATK and samtools for variant calling. Sequencing data can also be imported from Complete Genomics (cg_process_sample command), Real Time Genomics (cg_process_rtgsample command) and VariantCallFormat (VCF) variant files (vcf2sft command). Multiple genomes can then be compared within a single annotated file using the multicompar protocol. Information on each separate command can be found in the reference section.

File formats

The standard file format used in GenomeComb is a simple tab-delimited file (without quoting), where the first (non-comment) line contains the column names. This format is very flexible and allows for fast parsing. Depending on the columns present, files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. The file extension .sft (simple feature table) or plain .tsv can be used to refer to this format. In this context, genomecomb gives the following columns specific meanings:

chromosome or chrom
chromosome name. Many genomecomb tools allow mixing chr1 and 1 notations
begin or start
start of feature. Half-open coordinates, as used by UCSC bed files and Complete Genomics files, are expected. This means, for instance, that the first base of a sequence is indicated by start=0 and end=1. An insertion before the first base has start=0, end=0.
end
end of feature in half-open coordinates
type
type of variation: snp, ins, del, sub are recognised
ref or reference
genotype of the reference sequence at the feature. For large deletions, the size of the deletion can be used.
alt
alternative genotype(s). If there is more than one alternative, they are separated by commas.
alleleSeq1
genotype of features at one allele
alleleSeq2
genotype of features at other allele
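
As an illustration of the column conventions above, the following is a minimal Python sketch (not part of genomecomb) that reads such a tab-delimited file and computes feature lengths from the half-open coordinates. The function names and the example data are hypothetical, and the sketch assumes that comment lines start with "#".

```python
import csv
import io


def read_sft(handle):
    """Yield each data row as a dict keyed by column name."""
    # Assumption: comment lines start with '#'; they are skipped before
    # the header line is read.
    lines = (line for line in handle if not line.startswith("#"))
    # QUOTE_NONE because the format does not use quoting.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def feature_length(row):
    """Length of a feature in half-open coordinates (end - begin)."""
    return int(row["end"]) - int(row["begin"])


# Hypothetical example data: a SNP at the first base and an insertion
# before the first base.
example = io.StringIO(
    "# hypothetical example data\n"
    "chromosome\tbegin\tend\ttype\tref\talt\n"
    "chr1\t0\t1\tsnp\tA\tG\n"
    "chr1\t0\t0\tins\t\tT\n"
)

rows = list(read_sft(example))
lengths = [feature_length(row) for row in rows]
print(lengths)
```

With half-open coordinates the length is simply end minus begin, so the SNP covers one base and the insertion covers none.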

Most tools expect the files to be sorted on chromosome, begin, end, type and will create sorted files. You can sort files using the -s option of cg_select. Not all of these columns need to be present, and any other columns can be added and searched. In files containing data for multiple samples, columns that are specific to a sample have -samplename appended to the column name. Some examples of (minimal) columns present for various genomecomb files:

region file
chromosome begin end
variant file
chromosome begin end type ref alt ?alleleSeq1? ?alleleSeq2?
multicompar file
chromosome begin end type ref alt alleleSeq1-sample1 alleleSeq2-sample1 alleleSeq1-sample2 alleleSeq2-sample2 ...
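
The -samplename suffix convention makes it possible to work out from the header alone which columns belong to which sample. A minimal Python sketch of this, assuming that the field part of a column name never contains a hyphen (the function name is hypothetical and not part of genomecomb):

```python
from collections import defaultdict


def split_sample_columns(header):
    """Separate shared columns from sample-specific ones.

    Assumption: the field part of a column name contains no hyphen, so
    everything after the first '-' is taken to be the sample name.
    """
    common = []
    per_sample = defaultdict(list)
    for name in header:
        field, sep, sample = name.partition("-")
        if sep:
            per_sample[sample].append(field)
        else:
            common.append(name)
    return common, dict(per_sample)


header = [
    "chromosome", "begin", "end", "type", "ref", "alt",
    "alleleSeq1-sample1", "alleleSeq2-sample1",
    "alleleSeq1-sample2", "alleleSeq2-sample2",
]
common, per_sample = split_sample_columns(header)
print(common)
print(per_sample)
```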

These files can easily be queried using the cg_select functionality or can be loaded into a local database.
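
To give an idea of what loading into a local database can look like, here is a small Python sketch that uses an in-memory SQLite database as a stand-in. It does not show genomecomb's own database support; the table name and the region data are hypothetical.

```python
import sqlite3

# Hypothetical region data: chromosome, begin, end (half-open coordinates).
regions = [
    ("chr1", 0, 100),
    ("chr1", 250, 400),
    ("chr2", 10, 20),
]

conn = sqlite3.connect(":memory:")
# "begin" and "end" are SQL keywords, so they are quoted as identifiers.
conn.execute(
    'CREATE TABLE regions (chromosome TEXT, "begin" INTEGER, "end" INTEGER)'
)
conn.executemany("INSERT INTO regions VALUES (?, ?, ?)", regions)

# Total number of bases covered by the regions on chr1.
total = conn.execute(
    'SELECT SUM("end" - "begin") FROM regions WHERE chromosome = ?',
    ("chr1",),
).fetchone()[0]
print(total)
conn.close()
```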

The format does not use quoting, so values in the table cannot contain tabs or newlines unless they are encoded using escape characters (\t, \n).
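
A minimal Python sketch of such an encoding is given below. The function names are hypothetical, and the sketch assumes that values never contain a literal backslash followed by t or n, since the text above does not specify how backslashes themselves are handled.

```python
def encode_value(value):
    """Replace real tabs and newlines with the two-character escapes \\t and \\n."""
    return value.replace("\t", "\\t").replace("\n", "\\n")


def decode_value(text):
    """Reverse encode_value.

    Assumption: the original value held no literal backslash-t or
    backslash-n sequences, which would otherwise be decoded as well.
    """
    return text.replace("\\t", "\t").replace("\\n", "\n")


original = "line one\nline two\twith tab"
encoded = encode_value(original)
print(encoded)
print(decode_value(encoded) == original)
```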

How to start

The Howto section gives some extended examples of how to process genome files and query the results.