Downloading the MIDASDB¶

MIDAS Reference Database (MIDASDB) is comprised of three components: representative genomes, species pangenomes, and marker genes. For each MIDASDB, six-digit numeric species ids are randomly assigned and stored in the corresponding metadata file (metadata.tsv).

For MIDAS2, we have already built two MIDASDBs from large, public, microbial genome databases:

midas2 database --list

uhgg 286997 genomes from 4644 species version 1.0
gtdb 258405 genomes from 47893 species version r202

For the purposes of this documentation we’ll generally assume that we’re working with the prebuilt uhgg MIDASDB and that the local mirror is in a subdirectory my_midasdb_uhgg.

Automatic database downloading is built into MIDAS2 analysis commands (e.g., run_snps and run_genes). Specifically, MIDAS2 will download a fraction of the full database; this subset is determined by which species are identified to be at high coverage.

However, when parallelizing computation across samples multiple commands might try to download the same database components simultaneously, a race condition. This may be problematic.

We therefore suggest that, for large-scale analyses, users pre-download the MIDASDB.

Users should start by downloading the taxonomic marker genes.

midas2 database \
    --init \
    --midasdb_name uhgg \
    --midasdb_dir my_midasdb_uhgg

This is everything needed to run abundant species detection.

It is possible to download an entire MIDASDB using the following command:

midas2 database \
  --download \
  --midasdb_name uhgg \
  --midasdb_dir my_midasdb_uhgg \
  --species all

This requires a large amount of data transfer and storage: 93 GB for MIDASDB-uhgg and 539 GB for MIDASDB-gtdb.

Note

The database would be much larger except that files are compressed with LZ4 to minimize storage requirements.

Alternatively, we strongly recommend that users take a more customized approach to database loading, taking advantage of species-level database sharding to download and decompress only the necessary portions of a MIDASDB.

Afterwards, we can collect a list of species present in a list of samples. Parsing the MIDAS2 output files (midas2_output/merge/species/species_prevalence.tsv) presents a convenient way to do this.

awk '$6 > 1 {print $1}' midas2_output/merge/species/species_prevalence.tsv > all_species_list.tsv

Finally, we can download database components (both reference genomes and pangenome collections) based on these species.

midas2 database \
    --download \
    --midasdb_name uhgg \
    --midasdb_dir my_midasdb_uhgg \
    --species_list all_species_list.tsv

Afterwards, the single-sample parts of the SNV and CNV modules can be run in parallel and without a potential race condition.

Note

It is also possible for advance users to contruct their own MIDASDB from a custom genome collection (e.g. for metagenome assembled genomes).