Advanced Topics

Population SNV Calling

In this section, we cover the computation of population SNVs, the chunkified implementation of pileup, and the filters applied to species, samples, and genomic sites in the SNV module.

Important Concepts

  1. <species, relevant samples> selection

    Population SNV analysis restricts attention to “sufficiently well” covered species in “sufficiently many” samples.

    To be specific, a given <species, sample> pair is only kept (by default) if it has more than 40% horizontal genome coverage (genome_coverage) and 5X vertical genome coverage (genome_depth). Furthermore, only “sufficiently prevalent” species with “sufficiently many” relevant samples (sample_counts) are included in the population SNV analysis. Therefore, different species may have different lists of relevant samples.

  2. <site, relevant samples> selection

    For each genomic site, a sample is considered “relevant” if the corresponding site depth falls within the range defined by the input arguments site_depth and site_ratio * mean_genome_coverage; otherwise it is ignored for the across-samples SNV compute.

    Therefore, different genomic sites from the same species may have different panels of “relevant samples”. Genomic site prevalence can be computed as the ratio of the number of relevant samples for the given site over the total number of relevant samples for the given species.

  3. relevant site

    For each species, a site is considered “relevant” if the site prevalence meets the range defined by the input arguments snv_type and site_prev. By default, common SNVs with more than 90% prevalence are reported.
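
The three selection filters above can be sketched as follows. The function and argument names mirror the CLI arguments (genome_coverage, genome_depth, site_depth, site_ratio, site_prev), but the data structures are hypothetical simplifications, not the MIDAS2 internals; the site_depth and site_ratio values passed at the call site are illustrative, not defaults.

```python
# Sketch of the <species, sample>, <site, relevant samples>, and relevant-site
# filters described above.

def keep_species_sample(horizontal_coverage, vertical_depth,
                        genome_coverage=0.4, genome_depth=5.0):
    """Keep a <species, sample> pair passing both filters (40% and 5X per the text)."""
    return horizontal_coverage > genome_coverage and vertical_depth > genome_depth

def relevant_samples_for_site(site_depths, mean_genome_coverage,
                              site_depth, site_ratio):
    """Keep samples whose depth at this site lies in [site_depth, site_ratio * mean_genome_coverage]."""
    upper = site_ratio * mean_genome_coverage
    return [s for s, d in site_depths.items() if site_depth <= d <= upper]

def is_relevant_site(n_relevant_for_site, n_relevant_for_species, site_prev=0.9):
    """A site is relevant if its prevalence exceeds site_prev (default 90% for common SNVs)."""
    prevalence = n_relevant_for_site / n_relevant_for_species
    return prevalence > site_prev

# Example: 4 relevant samples for the species; 3 of them cover this site
# within the accepted depth range, so site prevalence is 3/4 = 0.75.
depths = {"s1": 12, "s2": 1, "s3": 8, "s4": 20}
kept = relevant_samples_for_site(depths, mean_genome_coverage=10,
                                 site_depth=2, site_ratio=5.0)
```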

Population SNV Computation

There are three main steps to compute and report population SNVs in MIDAS2.

First, for each relevant genomic site, MIDAS2 determines the set of alleles present across all relevant samples. Specifically, for each allele (A, C, G, T), the merge_snps command

  1. tallies the sample counts (sc_) of relevant samples containing the corresponding allele (sc_A:sc_T)

  2. sums up the read counts (rc_) of the corresponding allele across all the relevant samples (rc_A:rc_T).

site_id                             rc_A  rc_C  rc_G  rc_T  sc_A  sc_C  sc_G  sc_T
gnl|Prokka|UHGG000587_14|34360|A    26    10    0     0     1     2     0     0

Second, population major and minor alleles for a single site can be computed based on the accumulated read counts or sample counts across all relevant samples. The population major allele refers to the most abundant (read count) or prevalent (sample count) allele, and the population minor allele refers to the second most abundant or prevalent allele.

For example, for the site gnl|Prokka|UHGG000587_14|34360|A above, the population major allele is A when defined by accumulated read counts and C when defined by accumulated sample counts.
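
This step amounts to a simple ranking over the accumulated counts, sketched below with the table row above as data (hypothetical helper, not the merge_snps implementation):

```python
# Rank alleles by accumulated read counts (abundance) or sample counts
# (prevalence): the top allele is the population major allele, the
# runner-up is the population minor allele.

def major_minor(counts):
    """counts: allele -> accumulated count; returns (major, minor) alleles."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[0], ranked[1]

read_counts   = {"A": 26, "C": 10, "G": 0, "T": 0}  # rc_A:rc_T from the table
sample_counts = {"A": 1,  "C": 2,  "G": 0, "T": 0}  # sc_A:sc_T from the table

print(major_minor(read_counts))    # ('A', 'C'): major by abundance is A
print(major_minor(sample_counts))  # ('C', 'A'): major by prevalence is C
```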

Third, MIDAS2 collects and reports the sample-by-site matrices of (1) site depth and (2) allele frequency of the population minor allele computed above, for all the relevant samples. In these two matrices, MIDAS2 encodes site_depth = 0 and allele_frequency = -1 with the special meaning of a missing <site, sample> pair.
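
A minimal sketch of the two matrices and the missing-pair encoding, using hypothetical data (here the minor allele frequency is simply the minor allele's read count over the site depth):

```python
# Build the sample-by-site depth and minor-allele-frequency matrices; a missing
# <site, sample> pair is encoded as depth 0 and frequency -1.
pileups = {  # (site, sample) -> allele read counts; s2 has no data for site2
    ("site1", "s1"): {"A": 8, "C": 2},
    ("site1", "s2"): {"A": 5, "C": 5},
    ("site2", "s1"): {"A": 9, "C": 1},
}
minor = "C"  # population minor allele (assumed for this example)

depth_mat, freq_mat = {}, {}
for site in ("site1", "site2"):
    for sample in ("s1", "s2"):
        rc = pileups.get((site, sample))
        if rc is None:                       # missing <site, sample> pair
            depth_mat[(site, sample)] = 0
            freq_mat[(site, sample)] = -1
        else:
            depth = sum(rc.values())
            depth_mat[(site, sample)] = depth
            freq_mat[(site, sample)] = rc[minor] / depth
```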

Chunkified Pileup Implementation

Both single-sample and across-samples pileup are parallelized at the unit of a chunk of sites, which is indexed by <species_id, chunk_id>. When all chunks from the same species finish processing, the chunk-level pileup results are merged into a species-level pileup result.

This implementation makes population SNV analysis across thousands of samples possible. To compute the population SNV for one chunk, all the pileup results of the corresponding sites across all the samples need to be read into memory. With multiple CPUs, multiple chunks can be processed at the same time. Therefore, for large collections of samples, we recommend higher CPU counts and smaller chunk sizes to optimally balance memory and I/O usage, especially for highly prevalent species. Users can adjust the number of sites per chunk via chunk_size (default value = 1000000). MIDAS2 also has a robust_chunk option, which assigns different chunk sizes to different species based on species prevalence.
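
The chunking scheme can be sketched as follows; the multiprocessing usage is illustrative, not the MIDAS2 implementation:

```python
# Split one species' sites into chunks of chunk_size, index each chunk by
# <species_id, chunk_id>, process chunks in parallel, then merge per species.
from multiprocessing import Pool

def make_chunks(n_sites, chunk_size=1_000_000):
    """Return (chunk_id, start, end) site ranges covering one species."""
    return [(cid, start, min(start + chunk_size, n_sites))
            for cid, start in enumerate(range(0, n_sites, chunk_size))]

def process_chunk(chunk):
    """Placeholder for the per-chunk pileup/SNV compute."""
    cid, start, end = chunk
    return end - start  # e.g. number of sites processed

if __name__ == "__main__":
    chunks = make_chunks(2_500_000)      # 3 chunks for this species
    with Pool(processes=4) as pool:      # chunks processed in parallel
        per_chunk = pool.map(process_chunk, chunks)
    total_sites = sum(per_chunk)         # species-level merge step
```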

Build Your Own MIDASDB

MIDAS2 users can locally build a new MIDASDB for a custom collection of genomes. The target layout of MIDASDB can be found at MIDASDB Layout. This section is focused specifically on the database construction commands.

Table of Content (TOC)

To start with, users need to organize the genomes in a specific format and produce the TOC genomes.tsv as described in the MIDASDB Layout.

We have prepared a toy collection of genomes at tests/genomes, and we will build the new MIDASDB under the directory tests/genomes. There are two command-line parameters that users need to pass:

  • --debug: keep the local files after successfully building the database

  • --force: rebuild the database even if it already exists locally

Six-digit numeric species IDs are randomly assigned and can be stored in the corresponding metadata file (metadata.tsv).
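
Such random six-digit IDs could be generated as sketched below (a hypothetical helper, not part of MIDAS2):

```python
# Assign a unique random six-digit numeric species ID to each species;
# the seeded Random makes the assignment reproducible.
import random

def assign_species_ids(species_names, seed=0):
    rng = random.Random(seed)
    ids = rng.sample(range(100000, 1000000), len(species_names))
    return {name: str(sid) for name, sid in zip(species_names, ids)}
```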

MIDAS2 reserves --midasdb_name newdb for building a custom MIDASDB, and the new MIDASDB will be built at --midasdb_dir.

Rep-genome

First, annotate all the genomes:

midas2 annotate_genome --species all \
  --midasdb_name newdb --midasdb_dir my_new_midasdb \
  --debug --force

midas2 build_midasdb --generate_gene_feature \
  --genomes all \
  --midasdb_name newdb --midasdb_dir my_new_midasdb \
  --debug --force

SCG Markers

Second, infer SCGs for all the genomes and build marker database:

midas2 infer_markers --genomes all \
  --midasdb_name newdb --midasdb_dir my_new_midasdb \
  --debug --force

midas2 build_midasdb --build_markerdb \
  --midasdb_name newdb --midasdb_dir my_new_midasdb \
  --debug --force

Pan-genome

Third, build species pangenomes:

midas2 build_pangenome --species all \
  --midasdb_name newdb --midasdb_dir my_new_midasdb \
  --debug --force

midas2 build_midasdb --generate_cluster_info \
  --species all \
  --midasdb_name newdb --midasdb_dir my_new_midasdb \
  --debug --force

Build Your Own Genome Index

MIDAS2 builds a sample-specific rep-genome or pan-genome index for the species in the restricted species profile. However, we recognize the need to use one comprehensive list of species across all samples in the same study. In this section, we will go over the steps of building one genome index containing a customized list of species across a set of samples.

We presuppose users have already completed the across-samples species profiling and have midas2_output/merge/species/species_prevalence.tsv ready for the set of samples.

Species Selection

Users can select species based on prevalence from the species_prevalence.tsv file, e.g. the list of species that are present in at least one sample, by passing --select_by and --select_threshold to the build_bowtie2db command.
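
The selection logic can be sketched in a few lines. The sample_counts column name matches the --select_by value used in this section; the species_id column and the rows are hypothetical stand-ins for the real species_prevalence.tsv:

```python
# Select species present in at least one sample from species_prevalence.tsv
# (here an in-memory stand-in with hypothetical rows).
import csv, io

tsv = io.StringIO(
    "species_id\tsample_counts\n"
    "100001\t3\n"
    "100002\t1\n"
    "100003\t0\n"
)
select_threshold = 1  # mirrors --select_by sample_counts --select_threshold 1
selected = [row["species_id"]
            for row in csv.DictReader(tsv, delimiter="\t")
            if int(row["sample_counts"]) >= select_threshold]
```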

Build Genome Index

In this section, we will keep using the example data from Quickstart.

midas2 build_bowtie2db \
  --midasdb_name uhgg --midasdb_dir my_midasdb_uhgg \
  --select_by sample_counts \
  --select_threshold 2 \
  --bt2_indexes_name repgenomes \
  --bt2_indexes_dir one_bt2_indexes \
  --num_cores 8

Users can locate the generated rep-genome database at one_bt2_indexes/repgenomes, and the list of species in the rep-genome index at one_bt2_indexes/repgenomes.species.

Use Prebuilt Genome Index

If taking this approach, users can pass the pre-built rep-genome index to the single-sample run_snps analysis (or the pan-genome index to run_genes), as follows:

midas2 run_snps \
  --sample_name sample1 \
  -1 reads/sample1_R1.fastq.gz \
  --midasdb_name uhgg \
  --midasdb_dir my_midasdb_uhgg \
  --prebuilt_bowtie2_indexes one_bt2_indexes/repgenomes \
  --prebuilt_bowtie2_species one_bt2_indexes/repgenomes.species \
  --select_threshold=-1 \
  --num_cores 8 \
  ${midas_output}

Developer Notes

Export Your Conda Environment

conda update --all
conda clean --all
conda env export --no-builds | grep -v "^prefix:" > midas2.updated.yml