Genomecomb has moved to GitHub at https://github.com/derijkp/genomecomb, with documentation at https://derijkp.github.io/genomecomb. For up-to-date versions, go there. These pages remain here only for the data on the older scientific application (or in case someone really needs a long-obsolete version of the software).

genome_seq

Format

cg genome_seq ?options? regionfile/regions dbdir ?outfile?

Summary

Returns sequences of regions in the genome (fasta file), optionally masked for snps/repeats

Description

This command returns the sequences of the genomic regions listed in regionfile, in fasta format (to stdout or to the file outfile). Regionfile is a tab-delimited file with at least the following columns: chromosome begin end. Repeatmasker repeats are softmasked (lower case) in the output sequences. Optionally you can hardmask repeats, and softmask or hardmask known (dbsnp) variants based on their frequency.
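The difference between softmasking and hardmasking described above can be illustrated with a minimal Python sketch. This is not genomecomb's implementation: the function name mask and the zero-based, half-open (begin, end) coordinates are assumptions made for the example, and only the mode values "s", "N" and 0 are taken from the -r option.

```python
def mask(seq, intervals, mode="s"):
    """Mask the given (begin, end) intervals in seq.

    Assumption for this sketch: intervals are zero-based and half-open.
    mode "s" lowercases the bases (softmask), mode "N" replaces them
    with N (hardmask), and mode "0" leaves the sequence untouched.
    """
    if str(mode) == "0":
        return seq
    chars = list(seq)
    for begin, end in intervals:
        # Clip the interval to the sequence so out-of-range input is harmless
        for i in range(max(begin, 0), min(end, len(chars))):
            chars[i] = "N" if mode == "N" else chars[i].lower()
    return "".join(chars)


seq = "ACGTACGTAC"
repeats = [(2, 5)]  # hypothetical repeat covering bases 2..4
soft = mask(seq, repeats, "s")  # "ACgtaCGTAC"
hard = mask(seq, repeats, "N")  # "ACNNNCGTAC"
```

Softmasking keeps the underlying bases readable for downstream tools that ignore case, while hardmasking removes them from consideration entirely.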

Arguments

regionfile: tab-delimited file containing targets, with at least the following columns: chromosome begin end.
regions: If the string given for regionfile/regions does not exist as a file, it is parsed as a list of regions. Each region is given as chromosome, begin, end, and the fields can be separated in a variety of ways (colon, dash, comma, space or newline). For example, all of the following formats are accepted: 'chr1:100-200,chr2:100-200' 'chr1 100 200 chr2 100 200' 'chr1-100-200 chr2-100-200'
dbdir: directory containing reference genomes and variation data
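The flexible regions syntax can be sketched in Python as follows. This is only an illustration of the documented separators (colon, dash, comma, space, newline), not the parser genomecomb actually uses; the function name parse_regions is hypothetical, and the sketch assumes chromosome names do not themselves contain any of the separator characters.

```python
import re


def parse_regions(text):
    """Parse a regions string into (chromosome, begin, end) tuples.

    Splits on any run of colon, dash, comma or whitespace and groups the
    resulting tokens in threes. Assumes chromosome names contain none of
    these separators (a simplification made for this sketch).
    """
    tokens = [t for t in re.split(r"[:\-,\s]+", text.strip()) if t]
    if len(tokens) % 3 != 0:
        raise ValueError("expected chromosome, begin, end for every region")
    return [
        (tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


# The three example notations from the text give the same result
a = parse_regions("chr1:100-200,chr2:100-200")
b = parse_regions("chr1 100 200 chr2 100 200")
c = parse_regions("chr1-100-200 chr2-100-200")
# each is [("chr1", 100, 200), ("chr2", 100, 200)]
```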

Options

-f freq (--freq): only softmask (lowercase) dbsnp variants if they have a frequency > freq (given as a fraction, default is 0, use -1 to include all)
-fp freqp (--freqp): only softmask (lowercase) dbsnp variants if they have a frequency > freqp (given as a percentage, default is 0, use -1 to include all)
-n freqn (--freqn): only mask (using N) dbsnp variants if they have a frequency > freqn (given as a fraction, default is 0.2, use -1 to include all)
-np freqnp (--freqnp): only mask (using N) dbsnp variants if they have a frequency > freqnp (given as a percentage, default is 20, use -1 to include all)
-p snpdbpattern (--snpdbpattern): determines which variant databases are used (dbdir/var_*snpdbpattern*.tsv.gz). Default is "snp" for dbsnp. You can, for example, use "Common" for the common variants in dbsnp
-d delsize (--delsize): only mask (using N) dbsnp variants if they are smaller than delsize (default is 5, use -1 to include all)
-r repeatmasker (--repeatmasker): how to mask repeatmasker repeats: "s" means softmask (lowercase), use "N" to mask using Ns, and 0 for no repeatmasking (default is "s")
-i idcolumn (--id): The ids for the fasta file will be taken from the given column (location will be added after a space)
-c concatseq (--concat): using this option, all regions will be concatenated into one sequence with concatseq between them. To just concatenate the sequences, use -c ''
-m mapfile (--mapfile): Create a map file that describes which regions in the newly created fasta file map to which regions in the genome
--namefield namefield: entries in the map file will have a name obtained from the namefield column in the region file
-cn concatname (--concatname): The concatname will be the name of the sequence in the generated fasta file (if not given, the name will be based on the file)
-e concatend (--concatend): The sequence given by concatend will be added to start and end of the final sequence (only if -c option was used)
-ca concatadj (--concatadj): The concatseq (-c option) will only be added if regions are separated by at least one base. concatadj will be used to concat adjoining regions (and is '' by default)
-g windowsize (--gc): add gc content on the id line. If windowsize is 0, only the total gc content will be added. For windowsize > 0, the max gc content for the given windowsize will also be added (default = -1 for no gc content)
-gs gccontent (--gcsplit): Split the result into low and high gc (high has gc >= gccontent). The gc used depends on the -g option. If -g is not given, the maxgc at a windowsize of 100 is used. This option cannot be combined with concatenating sequences, and outfile has to be specified. Two files will be generated, with lowgc and highgc added to the given outfile name.
-gd 0/1 (--gcdisplay): determines whether the gc content is actually displayed on the name line. By setting this to 0, you can set a windowsize (using -g) to split the files on, without the gc content being displayed on the name line. If you set -gd to 1 without setting -g, the total gc content will be shown
-s 0/1 (--split): If this option is 1, each region will be saved as a separate fasta file. The
-l char (--limitchars): Replace all but alphanumeric characters, _, . and - in the sequence names by char
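
To make the gc options more concrete, here is a small Python sketch of total gc content, maximum gc content over a sliding window (as described for -g), and the low/high split (as described for -gs). It is an illustration under stated assumptions, not genomecomb's code: the function names are hypothetical, gc content is expressed as a fraction, every base (including N) counts toward the length, and sequences no longer than the window fall back to the total gc content.

```python
def gc_fraction(seq):
    """Total gc content of seq as a fraction (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def max_window_gc(seq, windowsize):
    """Highest gc fraction found in any window of windowsize bases."""
    seq = seq.upper()
    # Assumption: short sequences are scored on their total gc content
    if windowsize <= 0 or len(seq) <= windowsize:
        return gc_fraction(seq)
    count = sum(1 for base in seq[:windowsize] if base in "GC")
    best = count
    for i in range(windowsize, len(seq)):
        # Slide the window one base: add the new base, drop the oldest
        count += (seq[i] in "GC") - (seq[i - windowsize] in "GC")
        best = max(best, count)
    return best / windowsize


def gc_bin(seq, gccontent, windowsize=100):
    """Return "highgc" if the max window gc is >= gccontent, else "lowgc"."""
    return "highgc" if max_window_gc(seq, windowsize) >= gccontent else "lowgc"


seq = "ATATGCGCAT"
total = gc_fraction(seq)          # 0.4
peak = max_window_gc(seq, 4)      # 1.0, from the window "GCGC"
label = gc_bin(seq, 0.5, 4)       # "highgc"
```

The window maximum is what makes the split useful for picking out regions with a local gc-rich stretch even when their overall gc content is moderate.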
