GenomeComb provides tools to generate, combine and analyze whole genome, exome or targeted sequencing data. Variant files in tab-separated format from different sequencing datasets can be combined, taking into account which regions were actually sequenced (given as region files in tab-separated format), and then annotated and queried (several examples are given in the Howto::Query section). A graphical user interface that can browse and query multi-million-line tab-separated files is also included.
The cg process_illumina command can be used to generate annotated multisample data starting from fastq files, using tools such as bwa for alignment and GATK and samtools for variant calling. Sequencing data can also be imported from Complete Genomics (cg_process_sample command), Real Time Genomics (cg_process_rtgsample command) and Variant Call Format (VCF) variant files (vcf2sft command). Multiple genomes can then be compared within a single annotated file using the multicompar protocol. Information on each command can be found in the reference section.
The standard file format used in GenomeComb is a simple tab-delimited file (without quoting), where the first (non-comment) line contains the column names. This is very flexible and allows for fast parsing. Depending on the columns present, files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. The file extension .sft (simple feature table) or plain .tsv can be used to refer to this format. In this context, genomecomb gives the following columns specific meanings:
Most tools expect the files to be sorted on chromosome, begin, end, type and will create sorted files. You can sort files using the -s option of cg_select. Not all of these columns need to be present, and any other columns can be added and searched. In files containing data for multiple samples, columns that are specific to a sample have -samplename appended to the column name. Some examples of (minimal) columns present for various genomecomb files:
These files can easily be queried using the cg_select functionality or can be loaded into a local database.
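The -samplename convention for multi-sample files can be illustrated with a small Python sketch that separates shared columns from per-sample columns in a header line. This is not part of GenomeComb. The function name split_sample_columns and the column names zyg and quality are hypothetical, and the sketch assumes that field names themselves contain no hyphen, so everything after the first hyphen is taken to be the sample name.

```python
from collections import defaultdict


def split_sample_columns(header):
    """Separate shared columns from per-sample columns.

    Assumption (not from the GenomeComb docs): field names contain no '-',
    so the text after the first '-' in a column name is the sample name.
    """
    shared = []
    per_sample = defaultdict(list)
    for name in header:
        field, sep, sample = name.partition("-")
        if sep and sample:
            # Column of the form field-samplename
            per_sample[sample].append(field)
        else:
            # No sample suffix: the column applies to all samples
            shared.append(name)
    return shared, dict(per_sample)


# Hypothetical header: 'zyg' and 'quality' are illustrative field names
header = ["chromosome", "begin", "end", "type",
          "zyg-sample1", "zyg-sample2", "quality-sample1"]
shared, per_sample = split_sample_columns(header)
print(shared)
print(per_sample)
```

Grouping the columns this way makes it easy to see which fields are available for each sample before writing a query.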
The format does not use quoting, so values in the table cannot contain tabs or newlines unless they are encoded using escape characters (\t, \n).
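As a rough illustration of these rules, the following Python sketch reads such a file: it skips leading comment lines, takes the first remaining line as the column names, splits on tabs without any quote handling, and decodes \t and \n in values. It is a minimal sketch, not GenomeComb's own parser. The names unescape and read_table are hypothetical, and the "#" comment prefix is an assumption.

```python
import io

# Only the two escapes mentioned in the text are decoded
_ESCAPES = {"t": "\t", "n": "\n"}


def unescape(value):
    """Turn the two-character sequences \\t and \\n into a real tab/newline."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] in _ESCAPES:
            out.append(_ESCAPES[value[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def read_table(handle, comment_prefix="#"):
    """Return (header, rows); each row is a dict keyed by column name.

    Assumption: comment lines start with comment_prefix ('#' by default).
    """
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if header is None:
            if line.startswith(comment_prefix):
                continue  # comment before the header line
            header = line.split("\t")  # first non-comment line: column names
            continue
        fields = [unescape(f) for f in line.split("\t")]
        rows.append(dict(zip(header, fields)))
    return header, rows


# Small in-memory example; the 'note' column is purely illustrative
text = "# a comment line\nchromosome\tbegin\tend\tnote\nchr1\t100\t101\tfirst\\tsecond\n"
header, rows = read_table(io.StringIO(text))
print(header)
print(rows)
```

Because no quoting is involved, splitting each line on the tab character is sufficient, which is part of why the format is fast to parse.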
The Howto section gives some extended examples of how to process genome files and query the results.