Module: Species Selection

Reference-based metagenotyping depends crucially on the choice of reference sequences. Microbiome data may contain hundreds of species in one sample, and an ideal reference database is both representative and comprehensive in terms of the abundant species in the sample. A badly chosen reference may suffer both from ambiguous mapping of reads to two or more sequences (which may lead to reads being filtered out post-alignment due to low uniqueness) or spurious cross-mapping to incorrect sequences (which leads to metagenotype errors). Therefore, a typical MIDAS2 workflow starts with a species selection step, which customizes the MIDASDB to only include the genomes of sufficiently abundant species in each sample.

Single-Sample Analysis

MIDAS2 estimates species coverage by profiling the coverage of 15 universal, single copy, marker genes (SCGs, 15 per species), to quickly determine which species are abundant in the sample.

Warning

This is designed only to select species with sufficient coverage in each sample. It is not intended to quantify species abundance.

(In this document, we continue to use the data from the Quickstart as an example.)

midas2 run_species \
    --sample_name sample1 \
    -1 reads/sample1_R1.fastq.gz \
    --midasdb_name uhgg \
    --midasdb_dir my_midasdb_uhgg \
    --num_cores 4 \
    midas2_output

Tip

Single-sample analysis step can be parallelized over samples (e.g. xargs)

Note

The first time run_species is used, MIDAS2 will automatically download the marker gene database.

Warning

(Race condition) If starting multiple calls to run_species simultaneously, be sure that the marker gene database has already been downloaded. Otherwise multiple, redundant downloads may be started.

Cross-Sample Merging

We now run the merge_species command to merge the single-sample species profiling results for the samples listed our samples_list.

midas2 merge_species \
  --samples_list list_of_samples.tsv \
  --min_cov 2 \
  midas2_output/merge

The --min_cov flag defines the minimum median_marker_coverage for estimating species prevalence, which is output as the sample_counts statistic. See below.

Key Outputs

Single-Sample

For each sample, the primary output of the run_species command is a file containing information about species detected in the reads (e.g.) midas2_output/sample1/species/species_profile.tsv This file describes the coverage of each species’ marker genes in the sample. Species are sorted in decreasing order of median_marker_coverage. Only species with more than two marker genes covered with more than two reads (a very low bar) are reported in this file.

species_id

marker_read_counts

median_marker_coverage

marker_coverage

marker_relative_abundance

unique_fraction_covered

102337

4110

28.48

28.91

0.30

1.00

102506

734

4.98

4.98

0.05

0.93

Where the columns have the following meaning:

species_id:                 six-digit species id
marker_read_counts:         total mapped read counts
median_marker_coverage:     median coverage of the 15 SCGs
marker_coverage:            mean coverage of the 15 SCGs
marker_relative_abundance:  computed based on ``marker_coverage``
unique_fraction_covered:    the fraction of uniquely mapped SCGs genes

Downstream commands (run_snps and run_genes) use the median_marker_coverage and/or unique_fraction_covered to select sufficiently abundant species. See below.

Across-Samples

The primary output of the merging step is the file midas2_output/merge/species/species_prevalence.tsv.

species_id

median_abundance

mean_abundance

median_coverage

mean_coverage

sample_counts

102337

0.186

0.186

16.205

16.205

2

102506

0.035

0.035

2.967

2.967

2

Where the columns have the following meaning:

species_id:       six-digit species id
median_abundance: median marker_relative_abundance across samples
mean_abundance:   mean marker_relative_abundance across samples
median_coverage:  median median_marker_coverge across samples
mean_coverage:    mean median_marker_coverge across samples
sample_counts:    number of samples with median_marker_coverge >= min_cov

MIDAS2 also writes two species-by-sample matrices in the output directory: midas2_output/merge/species. Median marker coverage, and unique fraction covered are written to midas2_output/merge/species/species_marker_median_coverage.tsv and midas2_output/merge/species/species_unique_fraction_covered.tsv, respectively

Species Selection in Downstream Modules

In a standard SNV/CNV workflow, only sufficiently abundant species in the restricted species profile will be included to build representative genome (rep-genome) or pan-genome index and further to be genotyped. By default, both the run_snps and run_genes commands perform a species selection step. Both commands therefore assume that run_species has already been carried out for each sample.

Two flags, --select_by and --select_threshold, determine which species are selected:

  • --select_by followed by a comma separated list of column names in midas2_output/species/species_profile.tsv

  • --select_threshold followed by a comma-separated list of threshold values for selection.

For most analyses we recommend using the combination of median_marker_coverage > 2X and unique_fraction_covered > 0.5:

--select_by median_marker_coverage,unique_fraction_covered --select_threshold=2,0.5

Some users may wish to genotype low abundance species and should adjust the parameters accordingly:

--select_by median_marker_coverage,unique_fraction_covered --select_threshold=0,0.5

Alternatively, users can directly pick a list of species using the --species_list option. It is worth noting that the species in the provided species list are still subject to the --select_threshold restriction. Users can set --select_threshold=-1 to escape species selection filters based on the species profiling:

--species_list 102337,102506 --select_threshold=-1

All the species passing the species selection filters will be genotyped.

Having finished the species selection step, we can now go to the SNV or CNV modules, depending on the scientific aims.