Output Files and Directory Layout¶

In this section we refer to the output directory for single-sample as midas_output/sample_name, and across-samples as midas_output. Input command-line options can be found at this page.

Single-Sample Results Layout¶

MIDAS2 analysis usually starts with species selection which selects sufficiently abundant species in each sample (command run_species). After completing this step, users can run the SNV module with run_snps or the CNV module with run_genes.

Here is an example of output from a single-sample analysis in whch the species, SNV, and CNV modules were run, as it would appear in the local filesystem.

Output                                 Producer        Meaning
-------------------------------------------------------------------------------------------
{midas_output}/{sample_name}
  |- species
     |- species_profile.tsv            run_species     Summary of species coverage
     |- markers_profile.tsv            run_species     Per species marker coverage
  |- snps
     |- snps_summary.tsv               run_snps        Summary of reads mapping to rep-genome
     |- {species}.snps.tsv.lz4         run_snps        Per species pileups
  |- genes
     |- genes_summary.tsv              run_genes       Summary of reads mapping to pan-genome
     |- {species}.genes.tsv.lz4        run_genes       Per species pan-gene coverage

 |- temp
     |- snps
        |- repgenomes.bam              run_snps        Rep-genome alignment file
        |- {species}/snps_XX.tsv.lz4
     |- genes
        |- pangenome.bam               run_genes       Pan-genome alignment file
        |- {species}/genes_XX.tsv.lz4
  |- bt2_indexes
     |- snps/repgenomes.*              run_snps        Sample-specific rep-genome database
     |- genes/pangenomes.*             run_genes       Sample-specific pan-genome database

Across-Samples Results Layout¶

For a collection of samples, population SNVs and CNVs can be estimated using the subcommands merge_snps and merge_genes. This is what the output looks like in the local filesystem.

Output                                             Producer        Meaning
---------------------------------------------------------------------------------------------------------------
{midas_output}
  |- species
    |- species_prevalence.tsv                      merge_species   Per species summary statistics across samples
    |- species/species_read_counts.tsv             merge_species   Species-by-sample reads counts matrix
    |- species/species_coverage.tsv                merge_species   Species-by-sample marker coverage matrix
    |- species/species_rel_abundance.tsv           merge_species   Species-by-sample relative abundance matrix
  |- snps
    |- snps_summary.tsv                            merge_snps      Alignment summary statistics per sample
    |- {species}/{species}.snps_info.tsv.lz4       merge_snps      Per species SNV metadata
    |- {species}/{species}.snps_freqs.tsv.lz4      merge_snps      Per species site-by-sample MAF matrix
    |- {species}/{species}.snps_depth.tsv.lz4      merge_snps      Per species site-by-sample reads depth matrix
  |-genes
    |- genes_summary.tsv                           merge_genes     Alignment summary statistics per sample
    |- {species}/{species}.genes_presabs.tsv.lz4   merge_genes     Per species gene-by-sample pre-abs matrix
    |- {species}/{species}.genes_copynum.tsv.lz4   merge_genes     Per species gene-by-sample copy number matrix
    |- {species}/{species}.genes_depth.tsv.lz4     merge_genes     Per species gene-by-sample reads depth matrix

MIDAS Reference Database Layout¶

To meet the challenge of increasing numbers of available genome sequences, MIDAS2 implemented a new database infrastructure, geared to run on AWS Batch and S3, to achieve elastic scaling for building MIDAS2 reference databases.

To be specific, the MIDAS2 reference database construction step can be executed in AWS using hundreds of r5d.24xlarge instances over a period of a couple of days, depositing built products in S3. For example, it took ~$80,000 and a week to build the species pan-genome for all 47,894 species of GTDB r202.

Table of Content¶

The new database infrastructure reads in a table of contents (TOC) file, containing genome-to-species assignment and a choice of representative genome for each species cluster. One TOC file (genomes.tsv) per MIDAS2 reference database. The TOC file has four columns, among which genome_is_representative specify whether the genome is the representative genome for the corresponding species. Only one representative per species.

genome	species	representative	genome_is_representative
GUT_GENOME138501	104351	GUT_GENOME269084	0
GUT_GENOME269084	104351	GUT_GENOME269084	1

By default, MIDAS2 inherits the representative genome assignments from published prokaryotic genome databases. Inspired by the importance of selecting proper reference genome for accurate SNV calling, this new infrastructure empowers users to dynamically re-assign the representative genomes, simply by modifying the genomes.tsv file accordingly.

Microbial Genome Collections¶

Unified Human Gastrointestinal Genome (UHGG)¶

A collection of 286,997 genomes assembled from metagenomes, isolates and single cells from human stool samples has been clustered into 4,644 gut-only species in UHGG 1.0 catalogues. The collection of all the UHGG genomes were mirrored in a S3 bucket, which serves as the input to the database construction. Six-digit numeric species ids were arbitrarily assigned. Instead of species name, these species_id are used as species identifier in all the reports generated by MIDAS2.

Genome Taxonomy Database (GTDB)¶

GTDB R06-RS202 contains 45,555 bacterial and 2,339 archaeal species clusters spanning 258,406 genomes, released on April 27th, 2021. The genome members for each species cluster are specified in the sp_clusters_r202.tsv, upon which order six-digit numeric species ids are assigned. GTDB only provided the sequences of the representative genomes, and we downloaded all the genomes from NCBI genomes repository using genome_updater.

Target Layout and Construction¶

A MIDAS2 reference database (MIDASDB) consists of three primary parts: rep-genome database, pan-genome database, and universal single copy genes (SGC) marker database. The target layout of any MIDASDB follows the same relative structure, based on the root directory of the database. The following toy example demonstrates the major steps to construct the MIDASDB and the target layout using a collection of two genomes (genome1 and genome2) from one species cluster species1.

TODO: insert image

Inputs¶

The input collection of genomes needs to be organized in the format cleaned_genomes/<species>/<genome>/<genome>.fna. And the table of contents file genomes.tsv needs to be generated accordingly, with randomly assigned six-digit species_id, to replace the species name. The genome name can be kept as it is.

genome	species	representative	genome_is_representative
genome1	100001	genome2	0
genome2	100001	genome2	1

Rep-Genome Database¶

Genome annotation is performed using Prokka, and the annotated genes are kept under the directory of genes_annotations/<species>/<genome>. The rep-genome database (for the SNV module) only includes the gene annotations and sequences for the representative genomes, as specified in the table of contents file.

gene_annotations/100001/genome2/genome2.fna.lz4
gene_annotations/100001/genome2/genome2.ffn.lz4
gene_annotations/100001/genome2/genome2.genes.lz4

SCG Marker Database¶

Marker genes are defined as universal, single-copy gene families. MIDAS2 uses a subset (15) of the PhyEco gene families. The pre-computed HMM model of this set of 15 single copy genes (SCGs) are available at:

s3://microbiome-pollardlab/uhgg_v1/marker_gene_models/phyeco/marker_genes.hmm.lz4
s3://microbiome-pollardlab/uhgg_v1/marker_gene_models/phyeco/marker_genes.mapping_cutoffs.lz4

For each annotated genome, the homologs of 15 SCGs are identified with hmmsearch, as well as the mapping of gene id to corresponding marker gene id, under the directory of marker_genes/phyeco/temp/<species>/<genome>.

marker_genes/phyeco/temp/100001/genome2/genome2.markers.fa
marker_genes/phyeco/temp/100001/genome2/genome2.markers.map

For all the representative genomes, the identified marker genes are concatenated into monolithic marker_genes.fa, from which a hs-blastn index is constructed. The indexed marker_genes.fa serves as the SCG marker databases.

marker_genes/phyeco/marker_genes.fa
marker_genes/phyeco/marker_genes.fa.sa
marker_genes/phyeco/marker_genes.fa.bwt
marker_genes/phyeco/marker_genes.fa.sequence

Pan-Genome Database¶

Species-level pan-genome refers to the set of non-redundant genes that represent the genetic diversity within one species cluster.

In order to construct the pan-genome database for each species, the first step is to concatenate the annotated genes from all genomes within the species cluster into pangenomes/100001/genes.ffn. The second step, which is also the most time-consuming step, is to cluster the concatenated genes based on 99% percent identity (PID) using vsearch. Each cluster is represented by the gene at its center - centroid gene (centroids.99.ffn). The centroid.99 genes are further on clustered to 95, 90, …, PID, respectively, and the mapping relationships are listed in centroid_info.txt. The top level centroids.ffn file represents the 99 percent identity clusters, and it serves as the species pan-genome databases.

Reads are aligned to the pan-genome database to determine the gene content and average copy number of strains in a sample (run_genes command), and reads can optionally aggregated into gene clusters at any of the lower clustering thresholds across samples (merge_genes command).

pangenomes/100001/centroids.ffn
pangenomes/100001/centroid_info.txt