Some of the most important populations in human history no longer exist as distinct groups. They're ghosts. In many cases we know they existed only because they left genetic traces in other populations. Using a statistical framework called f-statistics, we can detect these ghosts and even estimate how much ancestry they contributed to populations alive today.

In this post, I'll walk through detecting four ghost populations using public ancient DNA data: Basal Eurasians, Denisovans, Neanderthals, and the African "Ghost Modern" population. All the code is reproducible using the AADR dataset and the admixtools2 R package.

Why I Wanted to Try This

I first encountered ghost populations through the Ancient North Eurasians in a popular genetics book (I can't remember which one, which probably says something about my organizational skills). The idea that you could statistically discover a population that left no remains, just by looking at patterns in the DNA of their descendants, fascinated me. It felt like detective work with math.

My general approach when something like this catches my interest is to go read the original papers, figure out what tools the researchers used, and then try to reproduce the analysis myself. Not because I expect to discover anything new, but because doing it yourself is the fastest way to actually understand what's happening.

To make this kind of exploration practical, I've built up a local collection of common reference datasets: HGDP, 1000 Genomes, SGDP, and AADR. Having these on hand means I can test ideas immediately without spending hours (or days) waiting for multi-gigabyte downloads. It's the difference between "I wonder if..." and actually finding out.

The Key Idea (No Math Required)

Here's the core insight that makes ghost detection work:

When populations split apart and stop interbreeding, they start accumulating different random genetic changes. It's like two branches of a family that move to different countries. Over generations, they'll develop their own traditions, recipes, and quirks that the other branch doesn't have.

The longer two populations have been separated, the more different they become. Closely related populations share more genetic variants. Distantly related populations share fewer.

So if we see a population that shares unexpectedly few genetic variants with another group, fewer than their family tree would predict, something's off. They must have ancestry from somewhere else. Somewhere we haven't sampled. A ghost.

The ghost detector question: "Does this population fit where we think it belongs in the family tree? Or does it have secret ancestry from somewhere else?"

Quick Glossary

Before we dive in, here are the key terms you'll encounter:

What Are Ghost Populations?

A ghost population is any ancestral group that contributed DNA to people alive today (or to ancient people we've sampled) but left no remains we can directly sequence. Sometimes we discover them later. Denisovans were a "ghost" until we found a finger bone in a Siberian cave and sequenced it. Other times, like the Basal Eurasians, we may never find direct remains.

Ghosts reveal themselves through patterns that don't fit the family tree. If population A should be closely related to population B based on geography and history, but their DNA says otherwise, there's a ghost in the picture.

The f4-Statistic: How It Works

The f4-statistic is our ghost detector. It asks: "Do these four populations fit a simple family tree?"

Imagine four populations arranged in a tree. If there's been no mixing between branches, certain patterns should hold. When they don't, the f4-statistic catches it.

Technically, for four populations A, B, C, D:

f4(A, B; C, D) = E[(a - b)(c - d)]

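Here a, b, c, and d are the allele frequencies of a variant in each population, and the expectation is an average over all variants. The statistic measures whether the genetic differences between A and B are correlated with the differences between C and D. If A and B sit on one branch of the tree and C and D on another, with no gene flow between the branches, this should be zero. When it's not zero, something interesting happened, usually gene flow from a population we haven't accounted for.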
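
To make the zero-versus-nonzero behavior concrete, here is a minimal base-R sketch that simulates allele frequencies on a toy tree and computes f4 directly from the formula. Everything in it is an assumption made for illustration: the drift model (Gaussian noise of a fixed size), the number of SNPs, the 20% admixture fraction, and the helper name f4_from_freqs are not part of admixtools or the AADR analysis.

```r
# Toy simulation: f4 on a clean tree vs. a tree with gene flow.
# All parameters are illustrative assumptions, not estimates from real data.
set.seed(1)
n_snps <- 100000
drift <- function(p, sd = 0.03) p + rnorm(length(p), 0, sd)  # crude stand-in for genetic drift

root   <- runif(n_snps, 0.3, 0.7)  # ancestral allele frequencies
anc_ab <- drift(root)              # common ancestor of A and B
anc_cd <- drift(root)              # common ancestor of C and D
a <- drift(anc_ab)
b <- drift(anc_ab)
c_pop <- drift(anc_cd)
d_pop <- drift(anc_cd)

# f4(A, B; C, D) = mean over SNPs of (a - b) * (c - d)
f4_from_freqs <- function(a, b, c, d) mean((a - b) * (c - d))

f4_tree <- f4_from_freqs(a, b, c_pop, d_pop)

# Now let D receive 20% of its ancestry from B (gene flow across the tree)
d_admixed  <- 0.8 * d_pop + 0.2 * b
f4_admixed <- f4_from_freqs(a, b, c_pop, d_admixed)

print(c(tree = f4_tree, admixed = f4_admixed))
```

With these settings the tree value should land very close to zero, while the admixed value is clearly positive because B and D now share part of their history.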

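Reading f4 results: The Z-score is the f4 estimate divided by its standard error, and it tells you if the result is statistically significant. By convention, |Z| > 3 means "this pattern is real." The sign tells you which populations share extra ancestry: for f4(A, B; C, D), a positive value means B shares more alleles with D (or A with C), and a negative value means B shares more with C (or A with D).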
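
The standard error behind a Z-score is typically obtained with a block jackknife: the genome is split into contiguous blocks, the statistic is recomputed with each block left out, and the spread of those leave-one-out values gives the error. Below is a simplified base-R sketch of that idea. It assumes equal-sized blocks and takes a vector of per-SNP values of (a - b)(c - d) as input. The function name block_jackknife_z is made up for this post, and real tools use a weighted version that handles unequal block sizes.

```r
# Simplified (unweighted) block jackknife for a mean, assuming equal-sized blocks.
block_jackknife_z <- function(x, n_blocks) {
  # Assign consecutive SNPs to blocks 1..n_blocks
  block <- ceiling(seq_along(x) * n_blocks / length(x))
  est <- mean(x)
  # Recompute the statistic with each block left out in turn
  loo <- sapply(seq_len(n_blocks), function(k) mean(x[block != k]))
  # Jackknife standard error from the spread of the leave-one-out estimates
  se <- sqrt((n_blocks - 1) / n_blocks * sum((loo - mean(loo))^2))
  c(est = est, se = se, z = est / se)
}

# Tiny worked example: six per-SNP values in three blocks of two
res <- block_jackknife_z(c(1, 2, 3, 4, 5, 6), n_blocks = 3)
print(res)
```

Blocks matter because neighboring SNPs are inherited together, so treating every SNP as independent would make the standard error look smaller than it really is.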

Figure: The f4-statistic checks if four populations fit a simple family tree. If f4 ≠ 0, there's been mixing we didn't account for, often from a ghost population.

Setting Up the Analysis

We'll use the Allen Ancient DNA Resource (AADR), which contains genetic data from over 17,000 ancient and modern individuals. First, install the required packages:

# Install admixtools2 from GitHub
install.packages("remotes")
remotes::install_github("uqrmaie1/admixtools")

# Load libraries
library(admixtools)
library(dplyr)

# Set path to AADR data (eigenstrat format)
prefix <- "/path/to/v62.0_1240k_public"

The AADR data includes individuals labeled by population. For our analyses, we'll use populations like Mbuti.DG (African outgroup), Han.DG (East Asian), and various ancient samples.

Ghost 1: Basal Eurasians

The Basal Eurasians are perhaps the most elegant ghost story in human genetics. Here's the puzzle that revealed them:

If all non-Africans descend from the same out-of-Africa group, we'd expect its descendants to be roughly equally related to each other. In particular, any two European populations should share the same amount of genetic similarity with East Asians.

But the first farmers in Europe (from Anatolia, modern-day Turkey) share fewer genetic variants with East Asians than European hunter-gatherers do. That's weird. It's like finding out your cousin is less related to your uncle than you are. That shouldn't happen in a simple family tree.

The solution: those early farmers must have ancestry from a population that split off even earlier, before the ancestors of East Asians and European hunter-gatherers diverged. This "Basal Eurasian" lineage diluted their connection to East Asians. We've never found their bones, but their genetic signature is unmistakable.

Figure: The Basal Eurasian ghost. Early European Farmers carry ancestry from a lineage that split before WHG and East Asians diverged.

The Test

# Test for Basal Eurasian ancestry in Anatolian Neolithic
# If Anatolia_N has Basal ancestry, they should share FEWER
# alleles with East Asians than WHG does

f4_basal <- f4(
  prefix,
  pop1 = "Mbuti.DG",                    # African outgroup
  pop2 = "Han.DG",                      # East Asian
  pop3 = "Luxembourg_Mesolithic.DG",    # WHG (Loschbour)
  pop4 = "Turkey_Marmara_Barcin_N.DG"   # Anatolian Neolithic
)

print(f4_basal)

Results

Test | f4 | Z-score
f4(Mbuti, Han; WHG, Anatolia_N) | -0.00261 | -6.31
f4(Mbuti, Han; WHG, Iran_N) | -0.00294 | -7.90

What this means: The negative f4 confirms our suspicion: Anatolian farmers share fewer genetic variants with East Asians than hunter-gatherers do. With |Z| = 6.31, well beyond our threshold of 3, we can be very confident the asymmetry is real. The simplest explanation is that the early farmers carry ancestry from a lineage outside the group that gave rise to both WHG and East Asians.

Iran Neolithic shows an even larger deviation (f4 = -0.00294 versus -0.00261, with Z = -7.90), suggesting they may have even more Basal Eurasian ancestry than Anatolian farmers.
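
Because a Z-score is the estimate divided by its standard error, the two columns of the results table can be used to back out the standard error and to label each test. The sketch below does that in base R with the numbers reported above. The column names est and z and the helper interpret_f4 are assumptions chosen for this example, so check them against whatever your f4() call actually returns.

```r
# Values copied from the results table; column names are assumptions for this sketch.
results <- data.frame(
  test = c("f4(Mbuti, Han; WHG, Anatolia_N)", "f4(Mbuti, Han; WHG, Iran_N)"),
  est  = c(-0.00261, -0.00294),
  z    = c(-6.31, -7.90)
)

interpret_f4 <- function(df, threshold = 3) {
  df$se <- df$est / df$z                    # implied standard error
  df$significant <- abs(df$z) > threshold   # |Z| > 3 rule of thumb
  # For f4(pop1, pop2; pop3, pop4): negative means pop2 shares more with pop3
  df$pop2_closer_to <- ifelse(!df$significant, "no clear difference",
                              ifelse(df$est < 0, "pop3", "pop4"))
  df
}

print(interpret_f4(results))
```

The implied standard errors differ between the two rows, a reminder that a Z-score reflects both the size of the effect and how precisely it was measured.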

Confirming with qpAdm

We can go further and try to model Anatolian Neolithic as descending from known populations:

# Try to model Anatolia_N from WHG alone
model1 <- qpadm(
  prefix,
  target = "Turkey_Marmara_Barcin_N.DG",
  left = c("Luxembourg_Mesolithic.DG"),
  right = c("Mbuti.DG", "Han.DG", "Russia_MA1_UP.SG")
)
# p-value: 3.99e-15 - MODEL REJECTED

# Model with WHG + Iran_N (Basal proxy)
model2 <- qpadm(
  prefix,
  target = "Turkey_Marmara_Barcin_N.DG",
  left = c("Luxembourg_Mesolithic.DG", "Iran_GanjDareh_N.AG"),
  right = c("Mbuti.DG", "Han.DG", "Russia_MA1_UP.SG")
)
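model2$weights    # estimated ancestry proportion for each source
model2$rankdrop   # fit statistics, including the model p-value
# p-value: 0.37 - MODEL FITS (11% WHG + 89% Iran_N)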

Anatolian Neolithic cannot be modeled from WHG-like ancestry alone (p = 4×10⁻¹⁵). But when we add Iran Neolithic as a source, the model is no longer rejected (p = 0.37). This works because Iran Neolithic also carries Basal Eurasian ancestry, so it can stand in for the ghost we can't sample directly.
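
The intuition behind the admixture proportions is that a mixed population's f4-statistics are a weighted average of its sources' f4-statistics. The sketch below illustrates only that linear-mixture idea with ordinary least squares in base R. It is not how qpadm() is implemented (the real method uses a rank test on a matrix of f4-statistics and accounts for their covariance), and the f4 values are hypothetical numbers invented for the example.

```r
# Hypothetical f4-statistics of each source against three outgroup pairs.
# These numbers are invented for illustration, not taken from AADR.
f4_source1 <- c(0.0010, -0.0020, 0.0030)   # e.g. a WHG-like source
f4_source2 <- c(-0.0010, 0.0040, 0.0000)   # e.g. an Iran_N-like source

# Build a target that is exactly 11% source 1 and 89% source 2
f4_target <- 0.11 * f4_source1 + 0.89 * f4_source2

# Solve target = w * source1 + (1 - w) * source2 for w by least squares
estimate_weight <- function(target, s1, s2) {
  sum((target - s2) * (s1 - s2)) / sum((s1 - s2)^2)
}

w <- estimate_weight(f4_target, f4_source1, f4_source2)
print(c(source1 = w, source2 = 1 - w))
```

With real data the target never matches a mixture exactly, which is why the p-value matters: it asks whether the leftover mismatch is small enough to be noise.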

Ghost 2: Denisovans in Oceanians

Denisovans are our success story: a ghost that became real. For years, geneticists could see that Papuans and Aboriginal Australians carried DNA from something other than modern humans, but they didn't know what. Then in 2010, scientists sequenced a finger bone from a Siberian cave and finally met the Denisovans.

Today we know that people from Oceania (Papua New Guinea, Australia, and nearby islands) carry about 5% Denisovan DNA, the result of interbreeding tens of thousands of years ago. Most other populations carry far less, and Europeans carry essentially none. This is archaic introgression: gene flow from a different human species into our own.

Figure: Denisovan introgression. Papuans carry ~5% DNA from an archaic human species that interbred with modern humans in Asia.

The Test

# Test for Denisovan ancestry in Papuans vs Han
# Positive f4 = Papuans share MORE with Denisova than Han

f4_denisovan <- f4(
  prefix,
  pop1 = "Mbuti.DG",       # African outgroup
  pop2 = "Denisova.DG",    # Denisovan (we have their DNA!)
  pop3 = "Han.DG",         # East Asian
  pop4 = "Papuan.DG"       # Oceanian
)

Results

Test | f4 | Z-score
f4(Mbuti, Denisova; Han, Papuan) | +0.00387 | 13.19
f4(Mbuti, Denisova; Han, Australian) | +0.00397 | 12.48
f4(Mbuti, Denisova; Han, French) | -0.00033 | -1.95

What this means: The positive f4 with Z = 13.19 is enormous, more than four times our threshold of 3. Papuans share dramatically more genetic variants with Denisovans than Han Chinese do. Aboriginal Australians show the same pattern (Z = 12.48). Meanwhile, Europeans show no excess Denisovan ancestry relative to Han (|Z| = 1.95 is below our significance threshold).

This geographic pattern tells a story: the ancestors of Papuans and Australians interbred with Denisovans somewhere in Southeast Asia or Oceania, after they had already split from the ancestors of East Asians and Europeans.

Ghost 3: Neanderthals in Non-Africans

Neanderthals are the most famous "other humans." They lived in Europe and western Asia for hundreds of thousands of years before modern humans arrived. When the first Neanderthal genome was sequenced in 2010, the big surprise wasn't that we could read ancient DNA. It was that everyone outside of Africa carries 1-2% Neanderthal DNA.

When modern humans left Africa around 60,000 years ago, they met and interbred with Neanderthals. That mixing happened early, before the ancestors of Europeans and East Asians split apart, which is why both groups carry similar amounts of Neanderthal ancestry today.

Figure: Neanderthal introgression. All non-Africans carry ~2% Neanderthal DNA from interbreeding ~60,000 years ago.

The Test

# Test for Neanderthal ancestry in non-Africans
# Using Chimp as outgroup instead of Mbuti

f4_neanderthal <- f4(
  prefix,
  pop1 = "Chimp.REF",                 # Outgroup
  pop2 = "Vindija_Neanderthal.DG",    # Neanderthal
  pop3 = "Yoruba.DG",                 # African
  pop4 = "Han.DG"                     # Non-African
)

Results

Test | f4 | Z-score
f4(Chimp, Neanderthal; Yoruba, Han) | +0.00163 | 7.59
f4(Chimp, Neanderthal; Yoruba, French) | +0.00149 | 8.33
f4(Chimp, Neanderthal; French, Han) | +0.00014 | 0.82

What this means: Both Han Chinese and French share significantly more genetic variants with Neanderthals than Yoruba (West Africans) do. Z-scores above 7 are rock-solid evidence. But when we compare French to Han directly (Z = 0.82), there's no significant difference. In this test, their Neanderthal ancestry is statistically indistinguishable.

This tells us roughly when the interbreeding happened: after humans left Africa (which is why Africans carry little to no Neanderthal DNA) but before Europeans and East Asians split apart (which is why both groups carry similar amounts).

Ghost 4: The African "Ghost Modern" Population

The three ghosts above all involve ancestry outside Africa. But Africa (where our species originated and where humans have lived the longest) has its own ghosts.

When scientists sequenced ancient DNA from Shum Laka rock shelter in Cameroon (about 8,000 years old), they found something unexpected: these ancient Africans carried ancestry from a human lineage that split off extremely early, possibly around the same time the ancestors of the Khoe-San peoples (like the San and Ju|'hoansi) diverged from everyone else.

This isn't like Denisovans or Neanderthals; it's not a different species. It's a "ghost modern" human population: fully human, but from a branch of the family tree we've never directly sampled. Their DNA shows up in ancient Cameroonians but has been mostly diluted in present-day West Africans, probably through later population movements like the Bantu expansion.

Figure: The African Ghost Modern. Ancient Cameroonians carry ancestry from a deeply diverged human lineage with no known remains.

The Test

# Test for Ghost Modern ancestry using Khoe-San as deep African reference
# If Shum Laka has Ghost Modern ancestry, they should share FEWER
# alleles with Khoe-San than present-day West Africans do

f4_ghost_modern <- f4(
  prefix,
  pop1 = "Chimp.REF",                    # Outgroup
  pop2 = "Ju_hoan_North.DG",             # Khoe-San (deep African reference)
  pop3 = "Cameroon_ShumLaka_SMA.AG",     # Ancient Cameroon (~8 kya)
  pop4 = "Yoruba.DG"                     # Present-day West African
)

Results

Test | f4 | Z-score
f4(Chimp, Ju_hoan; ShumLaka, Yoruba) | +0.00045 | 3.06
f4(Chimp, Ju_hoan; ShumLaka, Mende) | -0.00024 | -1.27
f4(Chimp, Ju_hoan; ShumLaka, Mbuti) | ~0 | 0.09

What this means: The positive f4 with Z = 3.06 (just above our threshold of 3) reveals something surprising: present-day Yoruba are more closely related to the Khoe-San than these 8,000-year-old Cameroonians were. That's backwards from what you'd expect geographically, since Cameroon is closer to southern Africa than Nigeria is.

The explanation: the ancient Shum Laka people had ancestry from an even older branch of the human family tree. Present-day West Africans have less of this deep ancestry because later migrations (especially the Bantu expansion, which spread farming across Africa) mixed in people from other lineages.

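For the Mende people from Sierra Leone, the statistic is not significantly different from zero (Z = -1.27), so this test can't distinguish them from Shum Laka in their relationship to the Khoe-San. That is consistent with the Mende having retained more of this ancient deep ancestry than other West African groups, but it is only a hint.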

Summary: Four Ghosts, One Method

Figure: Bar chart of f4-statistic Z-scores for the four ghost population tests. The dashed line shows the |Z| = 3 significance threshold.

Ghost Population | Key Test | Z-score | Interpretation
Basal Eurasian | f4(Mbuti, Han; WHG, EEF) | -6.31 | EEF has ancestry from a lineage that split before the WHG/East Asian common ancestor
Denisovan | f4(Mbuti, Denisova; Han, Papuan) | +13.19 | Papuans have ~5% archaic Denisovan ancestry
Neanderthal | f4(Chimp, Neanderthal; Yoruba, non-African) | +7.59 | Non-Africans have ~2% archaic Neanderthal ancestry
Ghost Modern | f4(Chimp, Ju_hoan; ShumLaka, Yoruba) | +3.06 | Ancient Cameroonians have ancestry from an early-diverging modern human lineage

The same simple question, "Do these populations fit a simple family tree?", detected all four ghosts. When the answer is "no," we've found hidden ancestry. The direction of the f4-statistic tells us which population doesn't fit:

Tips for Ghost Hunting

The f4-statistic is powerful, but you need to set up your test carefully:

  1. Pick a good reference point. Your "outgroup" population should be distantly related and uninvolved in the mixing you're testing. For comparing non-African populations, we often use Mbuti (Central African) as a reference. For comparing humans to Neanderthals, we need chimps as a reference point.
  2. Understand what "zero" means. When f4 = 0, your four populations fit a simple family tree with no unexpected mixing. When f4 ≠ 0, something interesting happened. Your job is to figure out what.
  3. Run multiple tests. A single test could be explained in different ways. Running several tests from different angles helps you triangulate what actually happened.
  4. More samples = better results. Ancient DNA is messy and incomplete. Testing multiple individuals from each population, and using well-preserved samples when available, gives you cleaner, more reliable signals.
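
One way to act on tip 3 is to write down the family of related tests before running any of them. The sketch below builds such a table in base R for the Denisovan comparison. The short labels follow the results tables in this post; the exact AADR population labels for the Australian and French samples are not given here, so treat them as placeholders to replace with the labels in your own .ind file.

```r
# Build a table of related f4 tests: same outgroup, archaic genome, and baseline,
# with only the fourth population varied.
# Labels are placeholders, not verified AADR population labels.
tests <- expand.grid(
  pop1 = "Mbuti",
  pop2 = "Denisova",
  pop3 = "Han",
  pop4 = c("Papuan", "Australian", "French"),
  stringsAsFactors = FALSE
)
tests$label <- sprintf("f4(%s, %s; %s, %s)",
                       tests$pop1, tests$pop2, tests$pop3, tests$pop4)
print(tests$label)
```

Each row can then be passed to f4() in turn and the results compared side by side, which makes it easier to see whether a signal is specific to one population or shared across a region.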

More Ghosts to Find

This analysis barely scratches the surface. Other ghost populations waiting to be explored include:

One important limitation: some ghost populations can't be detected with f4-statistics at all. The West African ghost archaic (an archaic hominin that may have interbred with our ancestors in Africa) requires completely different methods because we have no reference genome to compare against. The methods described here only work when you have at least some idea of what you're looking for.

The tools are freely available, the data is public, and the ghosts are waiting to be found. All you need is curiosity and some R code.

We can detect populations that left no bones, no artifacts, nothing but their DNA scattered across the genomes of their descendants. That's the power of population genetics.

Code and Data

All analyses were performed using:

The complete analysis code is available in my ghost-popgen repository.

References