Genomecomb moved to github on https://github.com/derijkp/genomecomb
with documentation on https://derijkp.github.io/genomecomb.
For up-to-date versions, go there. These pages remain here only for the data on the older scientific
application (or in case someone really needs a long-obsolete version of the software).
Headers
This text briefly describes the fields in the example file (and in most
files produced by genomecomb analysis). More detail on the format
of the example files can be found in format_tsv.
Some of the fields apply to the variant itself: the basic variant
fields describe the variant, and the annotation fields give information
about it (location etc.).
Sample specific fields are given separately for each sample by
appending a dash and the sample name to the generic field name. They
specify data about the variant that can differ between samples,
e.g. the genotype (a sample can have a reference, heterozygous, ...
genotype for the variant) or the sequencing quality score.
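The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of genomecomb: the helper names split_field and group_by_sample are hypothetical, and the sketch assumes that a generic field name never contains a dash itself (which holds for the fields listed on this page).

```python
def split_field(column):
    """Split a column name into (generic field, sample name).

    The split is made at the first dash. Columns without a dash
    (e.g. "chromosome" or "intGene_impact") are not sample specific,
    so the sample name is returned as None.
    """
    field, dash, sample = column.partition("-")
    if not dash:
        return column, None
    return field, sample


def group_by_sample(header):
    """Map each sample name to the list of generic fields it has."""
    samples = {}
    for column in header:
        field, sample = split_field(column)
        if sample is not None:
            samples.setdefault(sample, []).append(field)
    return samples


header = [
    "chromosome",
    "begin",
    "zyg-gatk-rdsbwa-testNA19240chr21il",
    "coverage-gatk-rdsbwa-testNA19240chr21il",
    "zyg-sam-rdsbwa-testNA19240chr21il",
]
print(group_by_sample(header))
```

Splitting at the first dash rather than the last keeps the full sample name (which itself contains dashes, e.g. gatk-rdsbwa-testNA19240chr21il) in one piece.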
Basic variant fields
- chromosome: chromosome number
- begin: start position of variant (zero based)
- end: end position of variant (zero based)
- type: type of variant: single nucleotide variant (snp), short
insertion (ins), deletion (del) or substitution (sub), or a
combination of two different variant types at a heterozygous
position (e.g. del_snp).
- ref: The reference allele at this position
- alt: The alternative allele at this position
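As a rough sketch of how these basic fields might be read, the Python below parses a small tab-separated table and splits a combined type such as del_snp into its parts. It assumes a plain tab-separated file with a single header line and no comment lines; the helper names and the example row are made up for illustration and are not taken from the example file.

```python
import csv
import io

# The four basic variant types named in the list above.
BASIC_TYPES = {"snp", "ins", "del", "sub"}


def component_types(type_value):
    """Split a type such as "del_snp" into its basic types."""
    parts = type_value.split("_")
    for part in parts:
        if part not in BASIC_TYPES:
            raise ValueError(f"unexpected variant type: {part!r}")
    return parts


def read_variants(handle):
    """Yield one dict per variant, with begin and end converted to int."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["begin"] = int(row["begin"])
        row["end"] = int(row["end"])
        row["types"] = component_types(row["type"])
        yield row


# Made-up row, for illustration only.
example = io.StringIO(
    "chromosome\tbegin\tend\ttype\tref\talt\n"
    "21\t1000\t1001\tsnp\tA\tG\n"
)
variants = list(read_variants(example))
print(variants[0]["chromosome"], variants[0]["begin"], variants[0]["types"])
```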
Sample specific fields for cg samples
- sequenced-cg-cg-testNA19238chr2122cg: Position sequenced
by CG/CG? ("u" = unsequenced, "v"=variant,
"r"=reference)
- zyg-cg-cg-testNA19238chr2122cg: zygosity
according to CG
- alleleSeq1-cg-cg-testNA19238chr2122cg: First allele
called in CG genome.
- alleleSeq2-cg-cg-testNA19238chr2122cg: Second allele
called in CG genome.
- totalScore1-cg-cg-testNA19238chr2122cg: Variant score
at first allele.
- totalScore2-cg-cg-testNA19238chr2122cg: Variant score
at second allele.
- coverage-cg-cg-testNA19238chr2122cg: Coverage depth at
this position by CG.
- refscore-cg-cg-testNA19238chr2122cg: Reference score
at this position.
- refcons-cg-cg-testNA19238chr2122cg: in a poorly called
region (according to CG)
- cluster-cg-cg-testNA19238chr2122cg: Presence within a
cluster of SNVs by CG
The same fields are present for NA19239 and NA19240
Sample specific fields for gatk analysis
- sequenced-gatk-rdsbwa-testNA19240chr21il: sequenced by
Illumina/GATK? ("u" = unsequenced, "v"=variant,
"r"=reference)
- zyg-gatk-rdsbwa-testNA19240chr21il: zygosity according
to GATK (m=homozygous, t=heterozygous, c=compound, o=other,
r=reference, u=unsequenced/unspecified)
- alleleSeq1-gatk-rdsbwa-testNA19240chr21il: First
allele called in Illumina genome by GATK.
- alleleSeq2-gatk-rdsbwa-testNA19240chr21il: Second
allele called in Illumina genome by GATK.
- quality-gatk-rdsbwa-testNA19240chr21il: Quality score
for this position as called by GATK on Illumina genome.
- phased-gatk-rdsbwa-testNA19240chr21il: order of
genotypes in alleleSeq1 and alleleSeq2 is significant (phase is
known)
- genotypes-gatk-rdsbwa-testNA19240chr21il: list of
genotypes, can contain more than 2
- alleledepth_ref-gatk-rdsbwa-testNA19240chr21il: depth
of reference allele
- alleledepth-gatk-rdsbwa-testNA19240chr21il: depth of
alternative alleles
- coverage-gatk-rdsbwa-testNA19240chr21il: Coverage
depth at this position by Illumina/GATK.
- genoqual-gatk-rdsbwa-testNA19240chr21il: quality of
the genotypes
- PL-gatk-rdsbwa-testNA19240chr21il: Normalized,
Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and
B=alt; not applicable if site is not biallelic
- BaseQRankSum-gatk-rdsbwa-testNA19240chr21il:
Phred-scaled p-value From Wilcoxon Rank Sum Test of Alt Vs. Ref
base qualities
- totalcoverage-gatk-rdsbwa-testNA19240chr21il: Total
Depth, counting all reads (DP in vcf INFO)
- DS-gatk-rdsbwa-testNA19240chr21il: Were any of the
samples downsampled?
- Dels-gatk-rdsbwa-testNA19240chr21il: Fraction of Reads
Containing Spanning Deletions
- FS-gatk-rdsbwa-testNA19240chr21il: Phred-scaled
p-value using Fisher's exact test to detect strand bias
- HaplotypeScore-gatk-rdsbwa-testNA19240chr21il:
Consistency of the site with at most two segregating haplotypes
- MQ-gatk-rdsbwa-testNA19240chr21il: RMS Mapping Quality
- MQ0-gatk-rdsbwa-testNA19240chr21il: Total Mapping
Quality Zero Reads
- MQRankSum-gatk-rdsbwa-testNA19240chr21il: Z-score From
Wilcoxon rank sum test of Alt vs. Ref read mapping qualities
- QD-gatk-rdsbwa-testNA19240chr21il: Variant
Confidence/Quality by Depth
- ReadPosRankSum-gatk-rdsbwa-testNA19240chr21il: Z-score
from Wilcoxon rank sum test of Alt vs. Ref read position bias
- SOR-gatk-rdsbwa-testNA19240chr21il: Symmetric Odds
Ratio of 2x2 contingency table to detect strand bias
- cluster-gatk-rdsbwa-testNA19240chr21il: Presence
within a cluster of SNVs by Illumina/GATK (if yes "cl",
if no "")
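The one-letter codes and the Phred-scaled values in the list above can be decoded with a few small helpers. The following Python is a minimal sketch under stated assumptions: the function names are hypothetical, and the PL value is assumed to be stored as comma-separated integers (as in VCF), which is not stated on this page. The Phred relation itself is standard: a Phred-scaled value Q corresponds to 10^(-Q/10).

```python
# Codes as listed for the sequenced and zyg fields above.
SEQUENCED = {"u": "unsequenced", "v": "variant", "r": "reference"}
ZYGOSITY = {
    "m": "homozygous",
    "t": "heterozygous",
    "c": "compound",
    "o": "other",
    "r": "reference",
    "u": "unsequenced/unspecified",
}


def describe_call(row, sample):
    """Return (sequenced label, zygosity label) for one sample in a row."""
    sequenced = row[f"sequenced-{sample}"]
    zyg = row[f"zyg-{sample}"]
    return (
        SEQUENCED.get(sequenced, "unknown code"),
        ZYGOSITY.get(zyg, "unknown code"),
    )


def phred_to_probability(q):
    """Convert a Phred-scaled value Q to 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def pl_to_relative_likelihoods(pl, sep=","):
    """Convert a PL list (assumed comma separated) to relative likelihoods.

    PL values are normalized so that the most likely genotype has PL 0,
    which maps to a relative likelihood of 1.0.
    """
    return [phred_to_probability(int(value)) for value in pl.split(sep)]


sample = "gatk-rdsbwa-testNA19240chr21il"
row = {f"sequenced-{sample}": "v", f"zyg-{sample}": "t"}
print(describe_call(row, sample))
```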
Sample specific fields for samtools analysis
- sequenced-sam-rdsbwa-testNA19240chr21il: sequenced by
Illumina/samtools? ("u" = unsequenced,
"v"=variant, "r"=reference)
- zyg-sam-rdsbwa-testNA19240chr21il: zygosity according
to samtools (m=homozygous, t=heterozygous, c=compound, o=other,
r=reference, u=unsequenced/unspecified)
- alleleSeq1-sam-rdsbwa-testNA19240chr21il: First allele
called in Illumina genome by samtools.
- alleleSeq2-sam-rdsbwa-testNA19240chr21il: Second
allele called in Illumina genome by samtools.
- quality-sam-rdsbwa-testNA19240chr21il: Quality score
for this position as called by samtools on Illumina genome.
- phased-sam-rdsbwa-testNA19240chr21il: order of
genotypes in alleleSeq1 and alleleSeq2 is significant (phase is
known)
- genotypes-sam-rdsbwa-testNA19240chr21il: list of
genotypes, can contain more than 2
- genoqual-sam-rdsbwa-testNA19240chr21il: genotype
quality
- loglikelihood-sam-rdsbwa-testNA19240chr21il:
- coverage-sam-rdsbwa-testNA19240chr21il: Coverage depth
at this position by Illumina/samtools.
- DV-sam-rdsbwa-testNA19240chr21il: number of
high-quality non-reference bases
- SP-sam-rdsbwa-testNA19240chr21il: Phred-scaled strand
bias P-value
- PL-sam-rdsbwa-testNA19240chr21il: List of Phred-scaled
genotype likelihoods
- totalcoverage-sam-rdsbwa-testNA19240chr21il: Raw read
depth (DP in vcf INFO); samtools does not count all mapping reads
for totalcoverage (bad reads are filtered out)
- DP4-sam-rdsbwa-testNA19240chr21il: number of
high-quality ref-forward bases, ref-reverse, alt-forward and
alt-reverse bases
- MQ-sam-rdsbwa-testNA19240chr21il: Root-mean-square
mapping quality of covering reads
- FQ-sam-rdsbwa-testNA19240chr21il: Phred probability of
all samples being the same
- AF1-sam-rdsbwa-testNA19240chr21il: Max-likelihood
estimate of the first ALT allele frequency (assuming HWE)
- AC1-sam-rdsbwa-testNA19240chr21il: Max-likelihood
estimate of the first ALT allele count (no HWE assumption)
- IS-sam-rdsbwa-testNA19240chr21il: Maximum number of
reads supporting an indel and fraction of indel reads
- PV4-sam-rdsbwa-testNA19240chr21il: P-values for strand
bias, baseQ bias, mapQ bias and tail distance bias
- PC2-sam-rdsbwa-testNA19240chr21il: Phred probability
of the nonRef allele frequency in group1 samples being larger
(or smaller) than in group2.
- QBD-sam-rdsbwa-testNA19240chr21il: Quality by Depth:
QUAL/#reads
- RPB-sam-rdsbwa-testNA19240chr21il: Read Position Bias
- MDV-sam-rdsbwa-testNA19240chr21il: Maximum number of
high-quality nonRef reads in samples
- VDB-sam-rdsbwa-testNA19240chr21il: Variant Distance
Bias (v2) for filtering splice-site artefacts in RNA-seq data.
Note: this version may be broken.
- cluster-sam-rdsbwa-testNA19240chr21il: Presence within
a cluster of SNVs by Illumina/samtools (if yes "cl", if no
"")
Annotation fields
- intGene_impact: impact on gene transcripts (can be a list)
of the intGene gene set (integration of refGene, knownGene, gencode,
ensGene)
- intGene_gene: gene name
- intGene_descr: description of transcript and location
of variant in transcript in several ways (including hgvs)
- lincRNA_impact: impact on long non-coding genes
- lincRNA_gene: long non-coding gene name
- lincRNA_descr: description of transcript and location
of variant in long non-coding transcript
- refGene_impact: impact on refGene genes
- refGene_gene: name of refGene genes
- refGene_descr: list of refGene transcripts and
description of location and effect of variant on the transcript
- mirbase20_impact: impact on mirbase20 miRNA
- mirbase20_mir: mirbase20 miRNA variant is located in
- chainSelf: Presence in self-chained region (yes =
"label from UCSC", no = "")
- cytoBand: approximate location of bands seen on
Giemsa-stained chromosomes.
- dgvMerged: Database of Genomic Variants (Structural
Var Regions)
- evofold: conserved functional RNA structures based on
predictions made with the EvoFold program
- gad: Genetic Association Database
- genomicSuperDups: Presence in segmental duplication
(yes = "label from UCSC", no = "")
- gwasCatalog_name: name of single nucleotide polymorphisms
(SNPs) identified by published GWAS
- gwasCatalog_score: score for gwasCatalog SNPs
- homopolymer_base: if part of a homopolymer, the
homopolymer base is given
- homopolymer_size: if part of a homopolymer, the
homopolymer size is given
- microsat: Presence in a microsatellite
- oreganno: literature-curated regulatory regions,
transcription factor binding sites, and regulatory polymorphisms
from oreganno
- phastConsElements46way_name: evolutionary conservation
region name (phastcons raw log odds scores)
- phastConsElements46way_score: evolutionary
conservation score (phastcons)
- phastConsElements46wayPlacental_name: evolutionary
conservation name in placental mammals
- phastConsElements46wayPlacental_score: evolutionary
conservation score in placental mammals
- phastConsElements46wayPrimates_name: evolutionary
conservation name in primates
- phastConsElements46wayPrimates_score: evolutionary
conservation score in primates
- rmsk: Presence in a RepeatMasker region (yes =
"label from UCSC", no = "")
- simpleRepeat: Presence in simple tandem repeat (yes =
"label from UCSC", no = "")
- targetScanS_name: Putative miRNA binding site
- targetScanS_score: Putative miRNA binding site score
- tfbsConsSites_name: transcription factor binding site
- tfbsConsSites_score: transcription factor binding site
score
- tRNAs: transfer RNA
- vistaEnhancers_name: Vista distant-acting
transcriptional enhancer
- vistaEnhancers_score: Vista distant-acting
transcriptional enhancer score
- wgEncodeCaltechRnaSeq: Encode RnaSeq score
- wgEncodeH3k4me1: Encode H3k4me1 score
- wgEncodeH3k4me3: Encode H3k4me3 score
- wgEncodeH3k27ac: Encode H3k27ac score
- wgEncodeRegDnaseClusteredV3_name: Encode Dnase region
name
- wgEncodeRegDnaseClusteredV3_score: Encode Dnase region
score
- wgEncodeRegTfbsClusteredV3_name: Encode transcription
factor binding site name
- wgEncodeRegTfbsClusteredV3_score: Encode transcription
factor binding site score
- wgRna_name: RNA gene (pre-miRNA, snoRNA or scaRNA)
name
- wgRna_score: RNA gene score
- 1000g3: frequency (in percent) of the variant in the
1000 genomes data set
- cadd: deleteriousness of single nucleotide variants
according to CADD
- clinvar_acc: clinvar accession number
- clinvar_disease: clinvar disease
- gnomad_max_freqp: maximum population frequency (in
percent) in the gnomAD (genomic) database
- gnomad_nfe_freqp: maximum frequency in the NFE
population in the gnomAD (genomic) database
- kaviar: frequency (in percent) in the Kaviar database
- snp147: variant name (in the dbSNP 147 database)
- snp147Common: frequency in dbSNP 147 database (only
for variants > 1% in a sufficient population)
- evs_ea_freqp: frequency (percent) in Exome Variant
Server (european american population)
- evs_aa_freqp: frequency (percent) in Exome Variant
Server (african american population)
- evs_ea_mfreqp: homozygote frequency (percent) in Exome
Variant Server (european american population)
- evs_aa_mfreqp: homozygote frequency (percent) in Exome
Variant Server (african american population)