So I've been working on something that I think is pretty cool, and I wanted to share where it's at. The question I'm trying to answer: can we build a system that reliably identifies new mitochondrial haplogroups as data comes in?
The mitochondrial phylogenetic tree keeps growing. mitoLEAF—the community successor to PhyloTree—now has 6,409 haplogroups, up from PhyloTree's 5,435 when it was frozen in 2016. And ancient DNA is dropping thousands of new samples every year that might represent branches we haven't discovered yet.
So I built a system to find them.
The Problem: Orphan Samples
When you classify a mitogenome against a phylogenetic tree, you're looking for the best match—the node where the sample's variants most closely align. Most samples fit somewhere. But some don't fit well. They have too many "private variants"—mutations that don't match any known branch.
I categorize these samples by how much they deviate:
| Deviation Level | Private Variants | What It Means |
|---|---|---|
| Minimal | ~3 | Good fit, normal variation |
| Moderate | ~11 | Worth a second look |
| Substantial | ~22 | Might be a missing branch |
| Extensive | ~32+ | Almost certainly missing from the tree |
Samples with "substantial" or "extensive" deviation are interesting. If a bunch of them share the same set of mutations that aren't in the tree, that's probably a new haplogroup waiting to be named.
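To make the binning concrete, here's a minimal sketch of how a private-variant count could be mapped to a deviation level. The table gives typical counts, not hard boundaries, so the cutoffs below (roughly the midpoints between the table values) are my assumption, and `deviation_level` is a hypothetical name rather than anything from the actual pipeline.

```python
from bisect import bisect_right

# ASSUMPTION: cutoffs are approximate midpoints between the typical counts
# in the table (~3, ~11, ~22, ~32+). The real boundaries are not stated.
CUTOFFS = [7, 17, 27]
LEVELS = ["minimal", "moderate", "substantial", "extensive"]


def deviation_level(private_variant_count: int) -> str:
    """Map a count of private variants to a deviation level (hypothetical cutoffs)."""
    if private_variant_count < 0:
        raise ValueError("private_variant_count must be non-negative")
    # bisect_right returns how many cutoffs are <= the count, which is the level index.
    return LEVELS[bisect_right(CUTOFFS, private_variant_count)]


def is_orphan_candidate(private_variant_count: int) -> bool:
    """Samples at 'substantial' or 'extensive' deviation are the interesting ones."""
    return deviation_level(private_variant_count) in ("substantial", "extensive")
```

Keeping the cutoffs in one list makes them easy to tune once the real boundaries are known.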
What I Built
The system does a few things:
- Fetches sample data. I started with the AADR (Allen Ancient DNA Resource) because ancient samples are more likely to carry haplogroups that aren't well-represented in the modern-biased tree.
- Classifies against mitoLEAF. For each sample, find the best-matching node and calculate which variants are "private" (not explained by the assigned haplogroup).
- Clusters orphans. Group samples that share private variants. If multiple samples independently show the same mutations, that's evidence for a real branch.
- Enriches with metadata. Pull in publication info, geographic origin, and dates from GenBank to provide context.
- Generates proposals. Output in a format ready for submission to mitoLEAF.
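The classification and clustering steps above can be sketched roughly like this. It's a simplified illustration, not the pipeline's actual code: `private_variants` and `cluster_orphans` are hypothetical names, variants are plain strings, and "shares private variants" is interpreted as "carries the same pair of private variants", which is an assumption on my part.

```python
from collections import defaultdict
from itertools import combinations


def private_variants(sample_variants, haplogroup_variants):
    """Variants the sample carries that its assigned haplogroup does not explain."""
    return frozenset(sample_variants) - frozenset(haplogroup_variants)


def cluster_orphans(private_by_sample, min_shared=2, min_samples=2):
    """Group samples by shared combinations of private variants.

    private_by_sample: {sample_id: set of private variants}
    Returns {variant_combo: sorted sample ids} for every combination of
    `min_shared` variants carried by at least `min_samples` samples.
    ASSUMPTION: both thresholds are illustrative defaults.
    """
    groups = defaultdict(set)
    for sample_id, variants in private_by_sample.items():
        # Sorting first gives each combination one canonical ordering.
        for combo in combinations(sorted(variants), min_shared):
            groups[combo].add(sample_id)
    return {
        combo: sorted(ids)
        for combo, ids in groups.items()
        if len(ids) >= min_samples
    }


# Toy data with made-up variant labels, not real tree positions.
node = {"111A", "222G"}
samples = {
    "s1": {"111A", "222G", "333T", "444C"},
    "s2": {"111A", "222G", "333T", "444C", "555A"},
    "s3": {"111A", "222G", "666G"},
}
private = {sid: private_variants(v, node) for sid, v in samples.items()}
clusters = cluster_orphans(private)
# clusters == {("333T", "444C"): ["s1", "s2"]}
```

Enumerating pairs keeps the sketch short, but it reports overlapping combinations separately. A real implementation would probably merge them into one candidate branch.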
The Dataset
For this first pass, I analyzed:
| Source | Samples | Notes |
|---|---|---|
| AADR v62.0 (Ancient) | 8,200 | ≥90% MT coverage, spanning 45,000 years |
| 1000 Genomes + HGDP (Modern) | 3,326 | Global diversity panels as baseline |
| Total | 11,526 | |
The ancient samples span from Paleolithic (45,000 BP) through Early Modern periods, with the bulk coming from Bronze Age and Iron Age Europe. (That's where most of the archaeological work with good DNA preservation has happened.)
What I Found
After running the full pipeline:
- 2,975 haplogroups with assigned samples
- 480 haplogroups present only in ancient samples
- 479 high-confidence proposals for new branches

Those 479 high-confidence proposals are groups of samples that:
- Don't fit cleanly on the current tree
- Share variants with each other that aren't in mitoLEAF
- Have enough samples (≥2) to suggest it's not just sequencing error
- Score above 0.8 on a confidence metric based on sample count and variant consistency
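The exact confidence formula isn't spelled out here, so what follows is only a sketch of one way a score "based on sample count and variant consistency" could be put together. Everything in it is an assumption: the name `cluster_confidence`, the saturating sample-count term, the Jaccard-style consistency term, and the equal weighting.

```python
def cluster_confidence(variant_sets, count_weight=0.5):
    """Hypothetical confidence score in [0, 1] for one orphan cluster.

    variant_sets: one collection of private variants per sample in the cluster.
    ASSUMPTION: this is an illustrative formula, not the pipeline's metric.
    """
    sets = [set(v) for v in variant_sets]
    n = len(sets)
    if n < 2:
        return 0.0  # a lone sample could just be sequencing error
    union = set.union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    # Consistency: fraction of all observed private variants carried by every sample.
    consistency = len(shared) / len(union)
    # Sample-count term: rises quickly with n and saturates toward 1.
    count_term = 1.0 - 0.5 ** n
    return count_weight * count_term + (1.0 - count_weight) * consistency


def is_high_confidence(variant_sets, threshold=0.8):
    """Apply the 0.8 cutoff mentioned above to the hypothetical score."""
    return cluster_confidence(variant_sets) > threshold
```

With this particular formula, two samples carrying identical private variants score 0.875 and pass the cutoff, while a cluster whose members agree on only a fraction of their variants does not.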
The largest clusters are interesting: P1h with 45 samples, T2f1a1 with 26, and Q1 with 24. But there's a catch—P1h appears to be entirely from a single Papuan population study. All 45 samples have isolate names like "papuan6278xxx", which suggests they might be from related individuals in the same community. Archaeological and population genetics contexts often include relatives. The algorithm tries to filter these out, but it's imperfect.
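A cheap sanity check for the P1h situation is to look at whether a cluster's isolate names all share one prefix. The sketch below is a naming heuristic I'm assuming for illustration. It is not the relatedness filter the pipeline uses, and the function names, the 0.9 threshold, and the example IDs are all hypothetical.

```python
import re
from collections import Counter


def dominant_isolate_prefix(isolate_names):
    """Return (prefix, share) for the most common name prefix.

    The prefix is the isolate name with trailing digits stripped,
    so 'papuan6278001' and 'papuan6278002' both map to 'papuan'.
    """
    prefixes = Counter(
        re.sub(r"\d+$", "", name.strip().lower()) for name in isolate_names
    )
    prefix, count = prefixes.most_common(1)[0]
    return prefix, count / len(isolate_names)


def looks_single_source(isolate_names, threshold=0.9):
    """Flag clusters where nearly all isolates share one naming prefix.

    ASSUMPTION: a shared prefix hints at a single study or community.
    It is a prompt for manual review, not proof of relatedness.
    """
    if not isolate_names:
        return False
    return dominant_isolate_prefix(isolate_names)[1] >= threshold
```

A flag like this only says "go read the paper". Whether the individuals are actually related has to come from the study itself.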
I also found something unexpected: samples clustering at mt-MRCA, the very root of the tree. A total of 119 proposals showed up there, which seemed weird until I looked at them more closely.
They were all Neanderthals.
24 unique Neanderthal mitogenomes from 9 archaeological sites, correctly identified as Homo sapiens neanderthalensis in GenBank. They cluster at the root because Neanderthal mtDNA diverged from ours 400,000-500,000 years ago. Of course they don't fit on a tree built for anatomically modern humans.
What's interesting is that those 119 proposals aren't errors—they're the internal structure of Neanderthal mtDNA diversity. The system found a core signature (variants 709A, 827G, 1709A) shared by all 24 samples, plus three main sub-branches:
| Branch | Defining Variant | Samples | Sub-branches |
|---|---|---|---|
| N-1406C | 1406C | 21 | 73 |
| N-2831C | 2831C | 22 | 35 |
| N-7650T | 7650T | 18 | 11 |
The samples come from sites across Europe and Siberia: Goyet (Belgium, 8 samples), Vindija (Croatia, 3), Denisova Cave (Russia, 3), plus Spy, Scladina, Mezmaiskaya, Chagyrskaya, and Les Cottes. It's a small dataset, but enough to see phylogenetic structure within Neanderthal maternal lineages.
Explore Neanderthal mtDNA Diversity
Interactive visualization of 24 Neanderthal mitogenomes showing maternal lineage structure across 9 archaeological sites
(This is actually a good sign—the system correctly identified samples that are phylogenetically distant from modern humans, and built reasonable internal structure for them.)
The "Ancient-Only" Question
One thing that came out of the analysis: 480 haplogroups appear only in ancient samples, with zero representation in modern reference panels.
Some examples:
| Haplogroup | Ancient Samples | Modern | Time Range | Regions |
|---|---|---|---|---|
| N1a1a1 | 97 | 0 | Mesolithic–Medieval | Germany, Turkey, Hungary |
| K1a3 | 65 | 0 | Mesolithic–Medieval | UK, Turkey, Germany |
| J1c1b | 41 | 0 | Mesolithic–Early Modern | France, Spain, UK |
N1a1a1 is actually well-known in the literature as the "Neolithic farmer signature"—it peaked in Early Neolithic Central European populations. The fact that it shows up at zero in 1KG and HGDP is interesting, though it could mean either (a) the lineage is genuinely rare/extinct today, or (b) it persists in populations that aren't well-represented in those panels.
The system doesn't answer whether a haplogroup is extinct or just undersampled. That's a harder question that requires more targeted modern sampling. But it does flag which haplogroups are worth investigating.
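The "ancient-only" flag itself is simple to compute. Here's a minimal sketch, assuming each sample has already been reduced to a haplogroup label. The function name and the exact-label matching are my assumptions.

```python
from collections import Counter


def ancient_only(ancient_calls, modern_calls):
    """Haplogroups seen in ancient samples but never in the modern panels.

    ancient_calls, modern_calls: iterables of haplogroup labels, one per sample.
    Returns {haplogroup: ancient sample count}, largest first.
    ASSUMPTION: labels are compared exactly, so a modern sample assigned to
    a descendant branch does not count toward its parent haplogroup.
    """
    ancient = Counter(ancient_calls)
    modern = set(modern_calls)
    return {hg: n for hg, n in ancient.most_common() if hg not in modern}


# Toy labels for illustration only.
flagged = ancient_only(
    ["N1a1a1", "N1a1a1", "K1a3", "H1"],
    ["H1", "U5a1"],
)
# flagged == {"N1a1a1": 2, "K1a3": 1}
```

The exact-match caveat matters for interpretation: a lineage can look absent at one node while surviving in a sub-branch, so checking descendants is worth doing before calling anything extinct.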
The Visualization
I built an interactive visualization to explore all of this. It includes:
- A sunburst chart showing the mitoLEAF tree with orphan candidates highlighted
- A searchable table of all the haplogroups and their ancient/modern distributions
- Geographic and temporal context from the enrichment pipeline
Explore the Visualization
Interactive sunburst chart and searchable table of mtDNA orphan candidates and temporal distributions
Click around in the sunburst to zoom into different branches. The table is searchable by haplogroup name, region, or time period.
Can We Reliably Identify New Haplogroups?
So back to the key question: can we reliably identify new haplogroups as new data keeps coming in?
Based on this first pass, I think the answer is yes, with caveats:
- Clustering works. Samples that share private variants do cluster together, and many of those clusters correspond to branches that probably should exist in the tree.
- Ancient DNA is noisy. Post-mortem damage produces C→T and G→A substitutions that look like real variants. The system needs to flag these (it does) and weight them differently.
- Sample size matters. A single orphan sample could be sequencing error. Two or more samples with the same private variants are much more convincing.
- Human review is still needed. The pipeline generates candidates, but someone with domain expertise should vet them before proposing to mitoLEAF.
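For the damage caveat, flagging and down-weighting could look something like the sketch below. It assumes each variant is available as a (position, reference base, observed base) tuple. The function names and the 0.5 weight are hypothetical, not values from the pipeline.

```python
# Substitution types that post-mortem damage typically produces.
DAMAGE_LIKE = {("C", "T"), ("G", "A")}


def is_damage_like(ref, alt):
    """True if the substitution matches a post-mortem damage pattern."""
    return (ref.upper(), alt.upper()) in DAMAGE_LIKE


def variant_weights(variants, damage_weight=0.5):
    """Assign a weight to each variant, down-weighting damage-like changes.

    variants: iterable of (position, ref, alt) tuples.
    ASSUMPTION: damage_weight=0.5 is an arbitrary illustrative value.
    """
    return {
        v: (damage_weight if is_damage_like(v[1], v[2]) else 1.0)
        for v in variants
    }


# Toy variants with made-up positions.
weights = variant_weights([(100, "C", "T"), (200, "T", "C"), (300, "G", "A")])
# weights == {(100, "C", "T"): 0.5, (200, "T", "C"): 1.0, (300, "G", "A"): 0.5}
```

A damage-like variant that shows up in several independent samples is still good evidence, so down-weighting rather than discarding seems like the safer default.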
What's Next
This is a starting point. The system works, but there's more to do:
- Review individual studies. The high-confidence orphan clusters need manual review. It's possible (even likely) that the algorithm has treated related individuals as independent samples. That filtering is imperfect, and archaeological contexts often include relatives.
- Continuous monitoring. Set up the pipeline to run periodically as new data is released, flagging new orphan clusters as they emerge.
- Contribute to mitoLEAF. Take the best orphan clusters and turn them into formal haplogroup proposals.
The mitochondrial tree has grown from ~5,400 to ~6,400 haplogroups in the last decade. With 8,200+ ancient samples and more coming every year, there are probably branches we haven't named yet. This system is designed to find them.
The Bigger Picture
Here's what I think is cool about this: the phylogenetic tree isn't a static thing. It's a model of human maternal lineage that keeps getting refined as we sequence more people—including people who lived thousands of years ago.
Every new branch we add is a tiny piece of that history getting filled in. Every orphan cluster represents someone's maternal ancestors whose lineage we haven't formally recognized yet.
Life is so fucking cool.
Happy to answer questions if you have them! And if you poke around in the visualization and find something interesting, let me know. :)