Welcome to MSI’s bioref database webpage! The Minnesota Supercomputing Institute (MSI) houses several public biological reference (or “bioref”, for short) databases that are of broad use to researchers, such as NCBI BLAST and ENSEMBL. Our goal in hosting these databases is to save our users and groups valuable storage space and time.

These databases are maintained and regularly updated by the MSI Research Informatics (RI) Core Bioinformatics group. BLAST is updated quarterly, and ENSEMBL is updated annually at the beginning of each calendar year. These databases can be found at this location on our system:

/common/bioref/ 

Databases and Reference Files Provided in bioref

BLAST databases

MSI maintains the entire set of NCBI BLAST sequence databases, which currently comprises 34 separate databases. Table 1 below shows the name of each BLAST database, its database format version, the versions of NCBI BLAST software it is compatible with, the full path to its location on our system, and a brief description of its contents.


Latest BLAST databases are located within:

/common/bioref/blast/latest/

If you load BLAST software version 2.12.0 or later using “module load” on our system, the latest bioref BLAST databases are automatically made available in your environment, so there is no need to specify the full path to a database: providing only the database name (column 1 of Table 1) as the database argument will suffice. Please note that the BLAST databases in bioref are in dbV5 format, so they will not work with versions of BLAST software older than 2.12.0.
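
As a rough illustration of searching by database name only, the sketch below assembles a nucleotide search against "nt". The query and output file names and the `run` helper are hypothetical, and the exact module name to load is site-specific (check `module avail`). The script defaults to printing the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only. DRY_RUN defaults to 1 so the script just prints the command;
# set DRY_RUN=0 on a node where a BLAST module (2.12.0 or newer) is loaded.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"   # show the command instead of executing it
    else
        "$@"
    fi
}

# Hypothetical input and output file names.
query="my_query.fasta"
hits="my_query_vs_nt.tsv"

# "nt" is a database name from column 1 of Table 1; no path is given.
# -outfmt 6 requests tab-separated output.
run blastn -db nt -query "$query" -outfmt 6 -out "$hits"
```

For a protein query, `blastp` with a protein database such as "nr" or "swissprot" follows the same pattern.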


Table 1. BLAST databases and their locations in bioref

BLAST database name | db format | BLAST version compatibility | Full path to database | Contents*
16S_ribosomal_RNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/16S_ribosomal_RNA | Microbial 16S RNA sequences from the RefSeq Targeted Loci project.
18S_fungal_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/18S_fungal_sequences |
28S_fungal_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/28S_fungal_sequences |
Betacoronavirus | dbV5 | 2.12.0+ | /common/bioref/blast/latest/Betacoronavirus |
cdd_delta | dbV5 | 2.12.0+ | /common/bioref/blast/latest/cdd_delta | Condensed conserved domain database for use with deltablast protein searches.
env_nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/env_nr | Protein sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are EXCLUDED from the nr database.
env_nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/env_nt | Nucleotide sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are EXCLUDED from the nt database.
human_genome | dbV5 | 2.12.0+ | /common/bioref/blast/latest/human_genome | Current RefSeq human genome assembly (GRCh) with various database masking.
ITS_eukaryote_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ITS_eukaryote_sequences | Database with a collection of eukaryotic Internal Transcribed Spacer sequences.
ITS_RefSeq_Fungi | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ITS_RefSeq_Fungi | Database with a collection of fungal Internal Transcribed Spacer sequences.
landmark | dbV5 | 2.12.0+ | /common/bioref/blast/latest/landmark | Complete proteomes from a few selected representative genomes spanning a wide taxonomic range; the main database used by the SmartBLAST services.
LSU_eukaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/LSU_eukaryote_rRNA | Database with large subunit rRNA sequences for eukaryotes.
LSU_prokaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/LSU_prokaryote_rRNA | Database with large subunit rRNA sequences for prokaryotes.
mito | dbV5 | 2.12.0+ | /common/bioref/blast/latest/mito | Proteins from the annotated mitochondrial genomes.
mouse_genome | dbV5 | 2.12.0+ | /common/bioref/blast/latest/mouse_genome | Current RefSeq mouse genome assembly (GRCm) with various database masking.
nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/nr | A collection of protein sequences with entries from GenPept, Swissprot, PDB, PRF, PIR and the NCBI Reference Sequence (RefSeq) project.
nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/nt | The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, con, and environmental sequences are excluded. RefSeq genomic entries are also excluded.
pataa | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pataa | Protein sequences from patents as supplied by USPTO. These entries are EXCLUDED from the nr database.
patnt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/patnt | Nucleotide sequences from patents as supplied by USPTO to GenBank, or from EU/Japan Patent Agencies through EMBL/DDBJ. Entries are EXCLUDED from the nt database.
pdbaa | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pdbaa | Protein sequences from PDB structure records’ protein components.
pdbnt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pdbnt | Sequences for the nucleotide components of PDB structure records.
ref_euk_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_euk_rep_genomes | Eukaryotic representative genomes from the NCBI RefSeq project.
ref_prok_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_prok_rep_genomes | Prokaryotic representative genomes from the NCBI RefSeq project.
refseq_protein | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_protein | Protein sequences from the NCBI RefSeq project.
refseq_rna | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_rna | RNA sequences from the NCBI RefSeq project, also included in the nt database.
refseq_select_prot | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_select_prot |
refseq_select_rna | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_select_rna |
ref_viroids_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_viroids_rep_genomes | Viroid representative genomes from the NCBI RefSeq project.
ref_viruses_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_viruses_rep_genomes | Virus representative genomes from the NCBI RefSeq project.
SSU_eukaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/SSU_eukaryote_rRNA | A database with small subunit rRNA sequences from eukaryotes, including fungi.
swissprot | dbV5 | 2.12.0+ | /common/bioref/blast/latest/swissprot | Protein sequences from the Swiss-Prot sequence database (last major update).
taxdb | dbV5 | 2.12.0+ | /common/bioref/blast/latest/taxdb | A non-sequence database file containing taxonomic information for sequences in the preformatted databases, providing common and scientific names for each entry.
tsa_nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/tsa_nr | Protein sequences from the Transcriptome Shotgun Assembly. Its entries are EXCLUDED from the nr database.
tsa_nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/tsa_nt | A database with earlier non-project based Transcriptome Shotgun Assembly (TSA) entries. Project-based TSA entries are NOT included. Entries are EXCLUDED from the nt database.
* Descriptions are from the BLAST Help Manual.

BLAST Update Schedule

January

April

July

October

The bioref BLAST databases are updated quarterly in the months listed in the update schedule, and each version is kept for 2 years. If needed, you can find older versions of the BLAST databases within the /common/bioref/blast/ folder, named by the month and year they were downloaded (e.g. “blast_update_04_2023” refers to the April 2023 update).
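
To pin an analysis to one dated snapshot rather than "latest", the folder name can be built from the month and year. This is a minimal sketch of the naming pattern described above; the function name `blast_snapshot_dir` is hypothetical, and whether a given snapshot still exists (or how its contents are arranged) should be confirmed with `ls` on the system.

```shell
#!/usr/bin/env bash
# Sketch only: build the path of a dated BLAST snapshot from the naming
# pattern blast_update_MM_YYYY under /common/bioref/blast/.
blast_snapshot_dir() {
    local month="${1#0}"   # month number; one leading zero is stripped so "08" is accepted
    local year="$2"        # four-digit year, e.g. 2023
    printf '/common/bioref/blast/blast_update_%02d_%s\n' "$month" "$year"
}

snapshot="$(blast_snapshot_dir 4 2023)"
echo "$snapshot"   # /common/bioref/blast/blast_update_04_2023
```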
 

ENSEMBL databases

The genomes and gene annotations for all organisms from ENSEMBL, except bacteria, are stored in /common/bioref/ensembl/. In addition to storing the genomes (FASTA format) and gene annotations (GFF and GTF formats), we also provide pre-built genome indices for BWA, HISAT2, Bowtie, Bowtie2, Samtools, NCBI_toolkit, and Picard. Note: Protein sequences and transcript sequences (FASTA format) corresponding to the genes in the ENSEMBL GTF/GFFs are not provided.


Latest ENSEMBL databases are located within:

/common/bioref/ensembl/

Organisms are organized into directories based on their kingdom (plants, metazoa, fungi, etc.), except “main”, which contains model organisms and vertebrates.

ENSEMBL Directory Structure

Directory Name | Contents
main | Model organisms and vertebrates
grch37 | GRCh37 build of the human genome
fungi | Fungi
metazoa | Invertebrates and other animals
plants | Green plants and algae
protists | Organisms colloquially known as “protists”


Within each “genus_species” directory, you will find the following subdirectories:

Directory Name | Contents
annotation | Gene model annotations in GFF and GTF formats
blast | NCBI BLAST+ index files for the genome (not peptide, not transcripts)
bowtie | Index files for read mapping with bowtie version 1
bowtie2 | Index files for read mapping with bowtie version 2
bwa | Index files for read mapping with BWA
hisat2 | Index files for splice-aware read mapping with HISAT2
seq | FASTA sequence files for the genome
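
The sketch below composes the expected location of one resource for one organism and lists it if present. It assumes the layout is /common/bioref/ensembl/<division>/<genus_species>/<resource> with lowercase organism names such as "homo_sapiens"; both points are assumptions, and the function name `ensembl_resource_dir` is hypothetical, so verify the real layout and index file names with `ls`.

```shell
#!/usr/bin/env bash
# Sketch only: compose the expected location of one resource for one organism.
# ASSUMPTION: the layout is <root>/<division>/<genus_species>/<resource>;
# any extra directory levels on the real system are not covered here.
ENSEMBL_ROOT="/common/bioref/ensembl"

ensembl_resource_dir() {
    local division="$1"   # main, grch37, fungi, metazoa, plants, or protists
    local organism="$2"   # e.g. homo_sapiens (naming is an assumption)
    local resource="$3"   # annotation, blast, bowtie, bowtie2, bwa, hisat2, or seq
    printf '%s/%s/%s/%s\n' "$ENSEMBL_ROOT" "$division" "$organism" "$resource"
}

dir="$(ensembl_resource_dir main homo_sapiens hisat2)"
if [ -d "$dir" ]; then
    ls "$dir"                     # inspect the actual index file names
else
    echo "not found: $dir" >&2    # adjust the path to match the real layout
fi
```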

Software Versions/Commands used for genomic sequence indexing

Software | Version | Command
BWA | 0.7.17 | bwa index
HISAT2 | 2.1.0 | hisat2-build
bowtie2 | 2.3.4.1 | bowtie2-build
bowtie | 1.1.2 | bowtie-build
picard | 2.3.0 | picard CreateSequenceDictionary
ncbi_toolkit | 25.2.0 | makeblastdb
samtools | 1.9 | samtools faidx
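
For a genome that is not in bioref, comparable indices can be built with the same tools. The sketch below only prints each tool's basic invocation; the exact options MSI used are not stated on this page, and "genome.fa", the index prefix, and the function name `print_index_commands` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print the basic indexing command for each tool in the table.
# These are the tools' standard invocations, not necessarily the exact
# options used to build the bioref indices.
print_index_commands() {
    local fasta="$1"    # genome FASTA file
    local prefix="$2"   # basename for the index files
    cat <<EOF
bwa index $fasta
hisat2-build $fasta $prefix
bowtie2-build $fasta $prefix
bowtie-build $fasta $prefix
picard CreateSequenceDictionary R=$fasta O=$prefix.dict
makeblastdb -in $fasta -dbtype nucl -out $prefix
samtools faidx $fasta
EOF
}

print_index_commands genome.fa genome
```

Printing the commands first makes it easy to review them, or to paste them into a job script, before spending compute time on large genomes.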


ENSEMBL Update Schedule

ENSEMBL databases are updated annually in January, and each version is kept for 4 years. If needed, you can find older versions of the ENSEMBL databases within the ENSEMBL folder, labeled by month and year (e.g. “ensembl_update_01_2023”).

Genome Indexes for Single-Cell Software

10X Genomics Cellranger References

/common/bioref/cellranger_10x
Contains pre-built cellranger genome index files downloaded from 10X (https://www.10xgenomics.com/support/software/cell-ranger/downloads) for the gene expression (gex), VDJ (vdj), and ATAC-seq/multiome (arc) 10X Genomics library preps. Indices are provided for a few different species, along with a combined human and mouse genome index. The directory also contains the probe set CSV files that are required for demultiplexing human and mouse 10X Flex libraries.

Parse Biosciences split-pipe References

/common/bioref/splitpipe_parse
Contains pre-built split-pipe indices for the human (hg38) and mouse (mm10) genomes for demultiplexing single-cell libraries from Parse Biosciences. Indices were built with parse-pipeline v1.5.0.

Kraken2/Bracken Database

/common/bioref/kraken2/
Contains two pre-made metagenomics indices downloaded from the following link: https://benlangmead.github.io/aws-indexes/k2. The suffix indicates the date of release (standard) or release number (silva). 
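
A hedged sketch of classifying paired-end reads against one of these indices follows. The index directory name is left as a placeholder because the dated folder names are not listed here; the read file names, thread count, read length of 150, and the `run` helper are all assumptions. The script prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only. DRY_RUN defaults to 1 so commands are printed, not executed;
# set DRY_RUN=0 once kraken2 and bracken are available in your environment.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ASSUMPTION: replace <index_dir> with a directory shown by
#   ls /common/bioref/kraken2/
db="/common/bioref/kraken2/<index_dir>"

# Hypothetical paired-end input files.
r1="sample_R1.fastq.gz"
r2="sample_R2.fastq.gz"

# Classify reads, writing per-read output and a summary report.
run kraken2 --db "$db" --threads 4 --paired \
    --report sample.kreport --output sample.kraken "$r1" "$r2"

# Re-estimate species-level abundance from the report (read length 150 assumed).
run bracken -d "$db" -i sample.kreport -o sample.bracken -r 150 -l S
```

Bracken needs k-mer distribution files for the chosen read length inside the index directory, so check that they are present before relying on the second step.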

Microbiome Databases

/common/bioref/microbiome/
Contains dada2_taxonomy_references (see https://benjjneb.github.io/dada2/training.html for more information)
and reference databases from the various biobakery_workflows (see https://huttenhower.sph.harvard.edu/biobakery_workflows/ for more information).

 
