We are releasing a 7B-parameter metagenomic foundation model designed for pandemic monitoring, trained on over 1.5T base pairs of DNA and RNA sequenced from wastewater.
A collaboration between researchers at USC, Prime Intellect, and the Nucleic Acid Observatory.
We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer language model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs.
This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present across the human microbiome. This model is designed to aid in public health applications relating to pandemic monitoring, pathogen detection, and early detection of emerging health threats.
We train a byte-pair encoding (BPE) tokenizer on our dataset, tailored to metagenomic sequences, and then pretrain our model. Our technical report details the pretraining data, tokenization strategy, and model architecture, and highlights the considerations and design choices that enable effective modeling of metagenomic data.
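To make the tokenization step concrete, here is a minimal sketch of how BPE merges can be learned from nucleotide strings. It is a toy illustration and not the METAGENE-1 tokenizer: the function names (`merge_pair`, `learn_bpe`, `encode`), the three example reads, and the number of merges are all assumptions made for this example. The actual vocabulary and training details are in the technical report.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(reads, num_merges):
    """Learn up to `num_merges` merge rules from a list of nucleotide strings."""
    corpus = [list(read) for read in reads]  # start from single bases
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for tokens in corpus:
            counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        if not counts:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(read, merges):
    """Tokenize a read by applying the learned merges in order."""
    tokens = list(read)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy corpus, only for illustration.
reads = ["ACGTACGTAC", "ACGTTTACGA", "GGACGTACGT"]
merges = learn_bpe(reads, num_merges=3)
tokens = encode("ACGTACGT", merges)
print(merges)
print(tokens)
```

On this toy corpus the frequent motif `ACGT` ends up as a single token, which is the general effect BPE is meant to have: recurring subsequences are represented with fewer tokens than rare ones.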
METAGENE-1 is trained on a newly collected metagenomic dataset comprising material from a very broad range (e.g., tens of thousands) of organisms, which was collected via metagenomic sequencing of human wastewater (i.e., municipal influent). This approach contrasts with prior genomic sequence models, which often focus on curated collections of specific species or genomic types.
We show the metagenomic data collection and sequencing pipeline in the figure above. The process begins with the collection of wastewater (left), which contains genomic fragments from a diverse collection of species (center). This is processed via high-throughput metagenomic sequencing to produce millions of paired-end reads (right), each consisting of hundreds of base pairs. The complete dataset comprises over 1.5 trillion base pairs of metagenomic sequences used for model pretraining.
METAGENE-1 achieves state-of-the-art performance on pathogen detection and metagenomic embedding benchmarks, as well as on other standard genomic evaluation tasks designed to evaluate models on human, animal, and other genomes, highlighting its generalization capabilities.
As an initial demonstration of the model's downstream application potential, we construct pathogen and anomaly detection scenarios and show METAGENE-1's capabilities on these detection tasks. See our technical report for details.
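One common way to frame sequence anomaly detection with a language model is to score each read by how poorly the model predicts it, and to flag reads with unusually high scores. The sketch below illustrates that idea only. It is not the procedure used for METAGENE-1 (see the technical report for that): a small order-2 Markov model stands in for the transformer, and the function names, the background reads, and the example reads are all assumptions.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_markov(reads, k=2):
    """Count next-base frequencies after each k-base context."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read)):
            counts[read[i - k:i]][read[i]] += 1
    return counts


def mean_nll(read, counts, k=2):
    """Mean negative log-likelihood per predicted base, with add-one smoothing."""
    total, n = 0.0, 0
    for i in range(k, len(read)):
        ctx = counts.get(read[i - k:i], Counter())
        prob = (ctx[read[i]] + 1) / (sum(ctx.values()) + len(ALPHABET))
        total -= math.log(prob)
        n += 1
    return total / n if n else 0.0


# Hypothetical "background" reads standing in for typical wastewater content.
background = ["ACGTACGTACGTACGT", "CGTACGTACGTACGTA", "GTACGTACGTACGTAC"]
model = train_markov(background)

typical = mean_nll("TACGTACGTACG", model)  # resembles the background
unusual = mean_nll("TTTTGGGGCCCC", model)  # contexts never seen in the background
print(f"typical: {typical:.3f}, unusual: {unusual:.3f}")
```

A read that looks like the background gets a lower score than one made of unseen contexts. In practice, a threshold on such a score would have to be calibrated on held-out data.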
METAGENE-1 offers capabilities that are valuable for biosurveillance and metagenomic anomaly detection. While the current version of the model poses minimal risks due to data and architectural choices, we carefully weighed its benefits against the potential for misuse, particularly in synthetic biology, and we emphasize the need for strict safety considerations for larger, more capable models. In our technical report, we discuss safety considerations in releasing a metagenomic model.
Purpose and Capabilities: METAGENE-1 is specifically optimized to detect anomalies in short metagenomic reads (100-300 base pairs), making it well-suited for tasks like pathogen detection and biosurveillance. The model’s architectural choices, such as its 512-token context length, limit its applicability to complex sequence design tasks, reducing misuse risks.
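The relationship between read length and context length can be checked with simple arithmetic. Every BPE token covers at least one base, so a read of n bases never needs more than n tokens. The sketch below uses the two figures stated above; the helper names are hypothetical, and special tokens such as a start-of-sequence marker are ignored as a simplifying assumption.

```python
CONTEXT_TOKENS = 512          # context length stated above
READ_LENGTHS_BP = (100, 300)  # read-length range stated above


def max_tokens_for_read(read_length_bp: int) -> int:
    """Worst case for BPE over bases: every base becomes its own token."""
    return read_length_bp


def fits_in_context(read_length_bp: int, context_tokens: int = CONTEXT_TOKENS) -> bool:
    """True if even the worst-case tokenization fits in the context window."""
    return max_tokens_for_read(read_length_bp) <= context_tokens


for n in READ_LENGTHS_BP:
    print(n, fits_in_context(n))
```

Even in the worst case, a 300-base read fits within the 512-token window, while sequences far longer than a single read are not guaranteed to fit.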
Open Source Impact: We believe the open release of METAGENE-1 will foster research in pathogen detection and biosurveillance by providing a valuable tool for scientists, and it will also facilitate research in scientific foundation models. However, we emphasize the need for rigorous safety evaluations before open-sourcing larger or more capable genomic models in the future, and are developing a suite of safety benchmarks for this purpose.
METAGENE-1 is brought to you by a team of researchers from USC, Prime Intellect, and the Nucleic Acid Observatory: