Genomic Data Standards
Genomic data have reached a high level of standardization in the scientific community. Today, most high-impact journals ask authors to deposit their genomic data in a public repository, such as the Sequence Read Archive (SRA) or dbSNP, before publication.
Below are the most widely accepted formats that are relevant to the data and analyses generated in TERRA-REF.
Raw reads + quality scores
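Raw reads with per-base quality scores are conventionally exchanged as FASTQ files; that format name is an assumption here, since this section does not state it. The sketch below shows how a four-line FASTQ record might be read and its quality string decoded. The function name `parse_fastq`, the example read, and the Phred+33 encoding are illustrative assumptions, not part of the TERRA-REF pipeline.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from 4-line FASTQ records."""
    # Each record is four lines: @id, sequence, '+' separator, quality string.
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        scores = [ord(c) - 33 for c in qual]
        # Keep only the first token of the header as the read identifier.
        yield header[1:].split()[0], seq, scores


# Hypothetical single-read example.
example = ["@read1 sample\n", "ACGT\n", "+\n", "II5!\n"]
records = list(parse_fastq(example))
```

Real FASTQ files are usually compressed and far too large to hold in a list, so a production reader would stream from a file handle instead.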
Reference genome assembly
The reference genome assembly (used for read alignment or BLAST searches) is in FASTA format. FASTA files generally need to be indexed and formatted before use; aligners, BLAST, and other applications provide built-in commands for this purpose.
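To illustrate what indexing a FASTA file involves, here is a minimal sketch that records each sequence name and its length. It is not the index format of any particular tool (commands such as `samtools faidx`, `bwa index`, or `makeblastdb` write their own, richer indexes); the function name `index_fasta` and the example sequences are hypothetical.

```python
from typing import Dict, List, Optional


def index_fasta(lines: List[str]) -> Dict[str, int]:
    """Map each sequence name in a FASTA file to its length in bases."""
    lengths: Dict[str, int] = {}
    name: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence name")
            # The name is the first token after '>'; the rest is description.
            name = fields[0]
            if name in lengths:
                raise ValueError(f"duplicate sequence name: {name}")
            lengths[name] = 0
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines; sum them up.
            lengths[name] += len(line)
    return lengths


# Hypothetical two-sequence example.
fasta_lines = [">chr1 test contig", "ACGTAC", "GT", ">chr2", "NNNN"]
fasta_index = index_fasta(fasta_lines)
```

A real index also stores byte offsets so that a region can be fetched without reading the whole file; this sketch only captures names and lengths.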
Sequence alignments
Sequence alignments are in BAM format. In addition to the nucleotide sequence, the BAM format contains fields that describe mapping and read quality. BAM files are binary but can be visualized with IGV. If needed, BAM can be converted to SAM (a text format) with SAMtools.
BAM is the preferred format for the Sequence Read Archive (SRA).
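Once a BAM file has been converted to SAM text (typically with something like `samtools view -h in.bam > out.sam`, where the file names are placeholders), each alignment is one tab-separated line with eleven mandatory fields. The sketch below parses such a line and exposes the mapping quality and two flag bits; the function name `parse_sam_line` and the example record are hypothetical.

```python
from typing import Dict, Union

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Union[str, int, bool]]:
    """Parse the mandatory fields of one SAM alignment line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("SAM alignment line has fewer than 11 fields")
    record: Dict[str, Union[str, int, bool]] = {}
    # zip stops after the 11 mandatory columns; optional tags are ignored.
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    flag = int(record["FLAG"])
    record["is_unmapped"] = bool(flag & 0x4)   # bit 0x4: read unmapped
    record["is_reverse"] = bool(flag & 0x10)   # bit 0x10: reverse strand
    return record


# Hypothetical alignment: read mapped to the reverse strand of chr1.
sam_line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
alignment = parse_sam_line(sam_line)
```

Header lines in a SAM file start with `@` and would need to be skipped before calling this function.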
SNP and genotype variants
SNP and genotype variants are in VCF format. VCF is also the format required by dbSNP, the largest public repository of SNPs.
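A VCF data line is tab-separated, with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) ahead of any genotype columns. As a rough sketch of how a variant record can be read, the example below parses those fixed columns; the function name `parse_vcf_line` and the variant shown are hypothetical.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one VCF data line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("VCF data line has fewer than 8 columns")
    chrom, pos, vid, ref, alt, qual, filt, info = cols[:8]

    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info_dict: Dict[str, Any] = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    return {
        "CHROM": chrom,
        "POS": int(pos),                        # 1-based position
        "ID": None if vid == "." else vid,      # '.' means missing
        "REF": ref,
        "ALT": alt.split(","),                  # multiple alleles allowed
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": info_dict,
    }


# Hypothetical variant with two alternate alleles.
vcf_line = "chr1\t12345\trs001\tA\tG,T\t50\tPASS\tDP=14;DB"
variant = parse_vcf_line(vcf_line)
```

Lines beginning with `#` are header lines and would need to be filtered out first; per-sample genotype columns after INFO are ignored in this sketch.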