METAGENE-1: Metagenomic Foundation Model

H2 Essay Title

Authors

Johannes

Sami

Announcements

January 6, 2025

METAGENE-1: Metagenomic Foundation Model

We are releasing a 7B parameter metagenomic foundation model designed for pandemic monitoring, trained on over 1.5T base pairs of DNA and RNA sequenced from wastewater.

Paper: metagene-1-paper
Github: github.com/metagene-ai
Hugging Face: huggingface.co/metagene-ai

A collaboration between researchers at USC, Prime Intellect, and the Nucleic Acid Observatory.

‍

Overview

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer language model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs.

This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present across the human microbiome. This model is designed to aid in public health applications relating to pandemic monitoring, pathogen detection, and early detection of emerging health threats.

We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. We detail the pretraining data, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data, in our technical report.

‍

Data

METAGENE-1 is trained on a newly collected metagenomic dataset comprising material from a very broad range (e.g., tens of thousands) of organisms, which was collected via metagenomic sequencing of human wastewater (i.e., municipal influent). This approach contrasts with prior genomic sequence models, which often focus on curated collections of specific species or genomic types.

We show the metagenomic data collection and sequencing pipeline in the figure above. The process begins with the collection of wastewater (left), which contains genomic fragments from a diverse collection of species (center). This is processed via high-throughput metagenomic sequencing to produce millions of paired-end reads (right), each consisting of hundreds of base pairs. The complete dataset comprises over 1.5 trillion base pairs of metagenomic sequences used for model pretraining.

Performance

METAGENE-1 achieves state-of-the-art performance on pathogen detection and metagenomic embedding benchmarks, as well as other standard genomic evaluation tasks—designed to evaluate models on human, animal, and other genomes—highlighting its generalization capabilities.

As an initial demonstrations of the downstream application potential, we construct pathogen and anomaly detection scenarios, and show METAGENE-1's capabilities on these detection tasks. See our technical report for details.

Safety

METAGENE-1 offers capabilities that are valuable for biosurveillance and metagenomic anomaly detection. While the current version of the model poses minimal risks due to data and architectural choices, we carefully weighed its benefits against the potential for misuse, particularly in synthetic biology, and we emphasize the need for strict safety considerations for larger, more capable models. In our technical report, we discuss safety considerations in releasing a metagenomic model.

Purpose and Capabilities: METAGENE-1 is specifically optimized to detect anomalies in short metagenomic reads (100-300 base pairs), making it well-suited for tasks like pathogen detection and biosurveillance. The model’s architectural choices, such as its 512-token context length, limit its applicability to complex sequence design tasks, reducing misuse risks.

Open Source Impact: We believe the open release of METAGENE-1 will foster research in pathogen detection and biosurveillance by providing a valuable tool for scientists, and it will also facilitate research in scientific foundation models. However, we emphasize the need for rigorous safety evaluations before open-sourcing larger or more capable genomic models in the future, and are developing a suite of safety benchmarks for this purpose.

Team

METAGENE-1 is brought to you by a team of researchers from USC, Prime Intellect, and the Nucleic Acid Observatory:

Oliver Liu, USC
Sami Jaghouar, Prime Intellect
Johannes Hagemann, Prime Intellect
Shangshang Wang, USC
Jason Wiemels, USC
Jeff Kaufman, Nucleic Acid Observatory
Willie Neiswanger, USC

‍

Eurus-2-PRIME

System Prompt

When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process. [ASSESS] [ADVANCE] [VERIFY] [SIMPLIFY] [SYNTHESIZE] [PIVOT] [OUTPUT] You should strictly follow the format below: [ACTION NAME] # Your action step 1 # Your action step 2 # Your action step 3 ... Next action: [NEXT ACTION NAME]

User

Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$

Eurus-2-PRIME

[ASSESS]

We are given two quadratic polynomials $P(x)$ and $Q(x)$ with leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the points $(16,54)$ and $(20,53)$. We need to find the value of $P(0) + Q(0)$.

[ADVANCE]

Let's express the quadratic polynomials $P(x)$ and $Q(x)$ in the general form:
$ P(x) = 2x^2 + bx + c $
$ Q(x) = -2x^2 + dx + e $

Since the graphs of both polynomials pass through the points $(16,54)$ and $(20,53)$, we can substitute these points into the equations of $P(x)$ and $Q(x)$ to form a system of equations.

For $P(x)$:
$ P(16) = 2(16)^2 + 16b + c = 54 $
$ 512 + 16b + c = 54 $
$ 16b + c = -458 \quad \text{(Equation 1)} $

$ P(20) = 2(20)^2 + 20b + c = 53 $
$ 800 + 20b + c = 53 $
$ 20b + c = -747 \quad \text{(Equation 2)} $

For $Q(x)$:
$ Q(16) = -2(16)^2 + 16d + e = 54 $
$ -512 + 16d + e = 54 $
$ 16d + e = 566 \quad \text{(Equation 3)} $

$ Q(20) = -2(20)^2 + 20d + e = 53 $
$ -800 + 20d + e = 53 $
$ 20d + e = 853 \quad \text{(Equation 4)} $

[ADVANCE]

Now, we can solve the system of equations for $P(x)$:
Subtract Equation 1 from Equation 2:
$ (20b + c) - (16b + c) = -747 - (-458) $
$ 4b = -289 $
$ b = -\frac{289}{4} $

Substitute $b = -\frac{289}{4}$ into Equation 1:
$ 16\left(-\frac{289}{4}\right) + c = -458 $
$ -1156 + c = -458 $
$ c = 698 $

So, for $P(x)$, we have:
$ P(x) = 2x^2 - \frac{289}{4}x + 698 $
$ P(0) = 698 $

Next, we can solve the system of equations for $Q(x)$:
Subtract Equation 3 from Equation 4:
$ (20d + e) - (16d + e) = 853 - 566 $
$ 4d = 287 $
$ d = \frac{287}{4} $

Substitute $d = \frac{287}{4}$ into Equation 3:
$ 16\left(\frac{287}{4}\right) + e = 566 $
$ 1148 + e = 566 $
$ e = -582 $

So, for $Q(x)$, we have:
$ Q(x) = -2x^2 + \frac{287}{4}x - 582 $
$ Q(0) = -582 $

[ADVANCE]

Finally, we can find $P(0) + Q(0)$:
$ P(0) + Q(0) = 698 + (-582) = 116 $

So, the value of $P(0) + Q(0)$ is $\boxed{116}$.

INTELLECT-MATH-7B-SFT (ours)

System Prompt

Solve the following math problem efficiently and clearly. Think carefully and step by step about your response and reason before providing a final response. Conclude your response with: \n\nTherefore, the final answer is: $\\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.

User

Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$

INTELLECT-MATH-7B-SFT

So I've got this problem here about quadratic polynomials $P(x)$ and $Q(x)$. $P(x)$ has a leading coefficient of $2$, and $Q(x)$ has a leading coefficient of $-2$. Both polynomials pass through the points $(16,54)$ and $(20,53)$. I need to find $P(0) + Q(0)$.

First, since both $P(x)$ and $Q(x)$ are quadratic polynomials, I can express them in the general form:

$P(x) = 2x^2 + bx + c$

$Q(x) = -2x^2 + dx + e$

Wait, actually, the leading coefficient is the coefficient of the highest power of $x$, which for a quadratic is $x^2$. So yes, $P(x) = 2x^2 + bx + c$ and $Q(x) = -2x^2 + dx + e$.

Now, both polynomials pass through the points $(16,54)$ and $(20,53)$. That means when $x=16$, $P(16)=54$, and when $x=20$, $P(20)=53$. Similarly for $Q(x)$.

So, I can set up equations using these points.

First, for $P(x)$:

$P(16) = 2(16)^2 + b16 + c = 54$

Similarly, $P(20) = 2(20)^2 + b20 + c = 53$

Let me calculate $16^2$ and $20^2$:

$16^2 = 256$

$20^2 = 400$

So,

$P(16): 2256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 \Rightarrow 16b + c = -458$

$P(20): 2400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 \Rightarrow 20b + c = -747$

Now, I have two equations:

1) $16b + c = -458$

2) $20b + c = -747$

I can subtract equation 1 from equation 2 to eliminate $c$:

$(20b + c) - (16b + c) = -747 - (-458)$

$20b + c - 16b - c = -747 + 458$

$4b = -289$

So, $b = -289 / 4 = -72.25$

Wait, but coefficients are usually integers, but maybe not. Let me check.

Now, plug $b$ back into one of the equations to find $c$. Let's use equation 1:

$16(-72.25) + c = -458$

$-1156 + c = -458$

$c = -458 + 1156 = 698$

So, $P(x) = 2x^2 - 72.25x + 698$

Hmm, $P(0)$ would be $c$, which is $698$.

Now, let's do the same for $Q(x)$.

$Q(16) = -2*(16)^2 + d16 + e = 54 \Rightarrow -2256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 \Rightarrow 16d + e = 566$

$Q(20) = -2*(20)^2 + d20 + e = 53 \Rightarrow -2400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 \Rightarrow 20d + e = 853$

Now, equations for $Q(x)$:

1) $16d + e = 566$

2) $20d + e = 853$

Subtract equation 1 from equation 2:

$(20d + e) - (16d + e) = 853 - 566$

$4d = 287$

$d = 287 / 4 = 71.75$

Now, plug $d$ back into equation 1:

$1671.75 + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 - 1148 = -582$

So, $Q(x) = -2x^2 + 71.75x - 582$

Therefore, $Q(0) = e = -582$

Now, $P(0) + Q(0) = 698 + (-582) = 116$

Wait, but the problem says "present the answer in LaTeX format: \boxed{Your answer}"

So, I should write $\boxed{116}$

But let me double-check my calculations to make sure I didn't make any mistakes.

First, for $P(x)$:

$P(16) = 2(256) + 16b + c = 512 + 16b + c = 54 \Rightarrow 16b + c = -458$

$P(20) = 2*(400) + 20b + c = 800 + 20b + c = 53 \Rightarrow 20b + c = -747$

Subtracting: $4b = -289 \Rightarrow b = -72.25$

Then, $c = -458 - 16*(-72.25) = -458 + 1156 = 698$

Similarly for $Q(x)$:

$Q(16) = -2*(256) + 16d + e = -512 + 16d + e = 54 \Rightarrow 16d + e = 566$

$Q(20) = -2*(400) + 20d + e = -800 + 20d + e = 53 \Rightarrow 20d + e = 853$

Subtracting: $4d = 287 \Rightarrow d = 71.75$

Then, $e = 566 - 16*71.75 = 566 - 1148 = -582$

So, $P(0) = 698$, $Q(0) = -582$, sum is $116$.

Seems correct.

Final Answer

$\boxed{116}$

INTELLECT-MATH-7B (ours)

System Prompt

User

Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$

INTELLECT-MATH-7B

So I have two quadratic polynomials, $P(x)$ and $Q(x)$, with leading coefficients 2 and -2, respectively. Both pass through the points $(16,54)$ and $(20,53)$. I need to find $P(0) + Q(0)$.

First, since both are quadratic polynomials, I can express them in the general form:

$P(x) = 2x² + bx + c$

$Q(x) = -2x² + dx + e$

But since both pass through the points $(16,54)$ and $(20,53)$, I can set up equations using these points.

For $P(x)$:

$P(16) = 2*(16)² + b \cdot 16 + c = 54$

$P(20) = 2(20)² + b \cdot 20 + c = 53$

Similarly for $Q(x)$:

$Q(16) = -2(16)² + d \cdot 16 + e = 54$

$Q(20) = -2(20)² + d \cdot 20 + e = 53$

Let's compute the squares:

$16² = 256$

$20² = 400$

So for $P(x)$:

$2256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 = -458 \quad ...(1)$

$2400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 = -747 \quad ...(2)$

For $Q(x)$:

$-2256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 = 566 \quad ...(3)$

$-2400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 = 853 \quad ...(4)$

Now, I have two systems of linear equations:

For $P(x)$:

$16b + c = -458 \quad ...(1)$

$20b + c = -747 \quad ...(2)$

Subtract equation (1) from equation (2):

$(20b + c) - (16b + c) = -747 - (-458)$

$4b = -289 \Rightarrow b = -\frac{289}{4}$

Now plug $b$ back into equation (1):

$16\left(-\frac{289}{4}\right) + c = -458 \Rightarrow -1156 + c = -458 \Rightarrow c = -458 + 1156 = 698$

So $P(x) = 2x² - \frac{289}{4}x + 698$

For $Q(x)$:

$16d + e = 566 \quad ...(3)$

$20d + e = 853 \quad ...(4)$

Subtract equation (3) from equation (4):

$(20d + e) - (16d + e) = 853 - 566$

$4d = 287 \Rightarrow d = \frac{287}{4}$

Now plug $d$ back into equation (3):

$16\left(\frac{287}{4}\right) + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 -1148= -582$

So $Q(x)=-2x^{2}+\frac{287}{4}x-582$.

Now, $P(0)=c=698$.

$Q(0)=e=-582$.

Therefore, $P(0)+Q(0)=698+(-582)=116$.

Final Answer

$\boxed{116}$.

Authors

Johannes

Sami

Acknowledgements

METAGENE-1 was developed by Oliver Liu (USC), Sami Jaghouar and Johannes Hagemann (Prime Intellect), Shangshang Wang and Jason Wiemels (USC), Jeff Kaufman (Nucleic Acid Observatory), and Willie Neiswanger (USC).

15M to Build a Peer-to-Peer AI Protocol

Prime Intellect is building a peer-to-peer protocol for compute and intelligence, enabling collective creation, ownership, and access to sovereign open-source AI. We’re moving beyond centralized AI to empower anyone—from solo GPU operators to global datacenters—to contribute compute, code, or capital and shape the open and decentralized AI ecosystem.

SYNTHETIC-1 Release: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1

We are releasing SYNTHETIC-1, the largest open reasoning dataset generated from Deepseek-R1, collaboratively generated by compute contributors across the globe.

Introducing Prime Intellect’s Protocol & Testnet: A peer-to-peer compute and intelligence network

We’re excited to introduce our Protocol, and the first iteration of it on our Testnet. Prime Intellect’s Protocol is a peer-to-peer network that commoditizes compute and intelligence and coordinates the collective creation, ownership and governance of open-source AI.

METAGENE-1: Metagenomic Foundation Model

Overview

Data

Performance

Safety

Team

15M to Build a Peer-to-Peer AI Protocol

SYNTHETIC-1 Release: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1

Introducing Prime Intellect’s Protocol & Testnet: A peer-to-peer compute and intelligence network

Related Posts

15M to Build a Peer-to-Peer AI Protocol

SYNTHETIC-1 Release: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1

Introducing Prime Intellect’s Protocol & Testnet: A peer-to-peer compute and intelligence network