Ancient DNA lies to you. If you don't account for that, your haplogroup calls will be wrong in predictable and frustrating ways.

I've been building two haplogroup callers over the past year: eveHap for mitochondrial DNA and yallHap for the Y chromosome. Both needed to work with ancient samples from the AADR and similar archaeogenetics datasets. This meant grappling with a problem that most modern-focused tools either ignore or handle poorly: post-mortem DNA damage.

I wanted to share what I learned, because the solutions are pretty elegant once you understand the underlying biology.

The Problem: Dead DNA Does Weird Things

When an organism dies, its DNA starts degrading. One of the most common forms of damage is cytosine deamination—a chemical process where cytosine (C) loses an amino group and becomes uracil, which sequencers read as thymine (T). This creates characteristic damage patterns: an excess of apparent C→T substitutions concentrated at the 5' ends of reads, with the complementary G→A signal at the 3' ends.

The older and more degraded the sample, the worse this gets. Some ancient samples show damage rates above 30% at fragment termini. That's a lot of false mutations.

Why this matters for haplogroups: Haplogroup classification works by matching observed variants to a phylogenetic tree. If your sample has a bunch of C→T mutations that aren't real, you'll match the wrong branches. The damage patterns create systematic bias toward haplogroups that happen to have C→T or G→A defining mutations.

Existing tools have different approaches to this problem. Some ignore it entirely (looking at you, yhaplo). Some offer basic filtering. But I wanted something more sophisticated—approaches that could extract maximum information from damaged samples without being fooled by the artifacts.

Solution 1: Know Your Enemy (Damage Detection)

The first step is figuring out whether your sample is actually ancient and how damaged it is. Both tools can estimate damage rates directly from the data.

eveHap's Approach (BAM-level)

eveHap works with BAM files, which means it has access to individual sequencing reads. This lets it track exactly where damage patterns appear:

# eveHap samples reads and counts damage at fragment ends
evehap damage ancient.bam

# Output:
# C→T rate at 5' end: 18.3%
# G→A rate at 3' end: 16.7%
# Classification: ANCIENT (rate > 3%)

The damage detection is position-aware. It knows that C→T damage is concentrated in the first few bases of reads (on the forward strand) and G→A in the last few bases. This position information becomes useful for filtering.
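
To make that concrete, here's a minimal sketch of counting terminal C→T mismatches from a BAM with pysam. This is my illustration rather than eveHap's actual implementation, and it assumes the BAM carries MD tags so reference bases can be recovered per read:

# Sketch: C→T mismatch rate in the first few bases of forward-strand reads (not eveHap's code)
import pysam

def terminal_ct_rate(bam_path, n_terminal=5, max_reads=50_000):
    ct, c_total = 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for i, read in enumerate(bam):
            if i >= max_reads:
                break
            if read.is_unmapped or read.is_reverse or read.query_sequence is None:
                continue  # keep it simple: mapped, forward-strand reads only
            # (query_pos, ref_pos, ref_base) triples; ref_base is lowercase at mismatches (needs MD tag)
            for qpos, rpos, ref_base in read.get_aligned_pairs(with_seq=True):
                if qpos is None or rpos is None or qpos >= n_terminal or ref_base is None:
                    continue
                if ref_base.upper() == "C":
                    c_total += 1
                    if read.query_sequence[qpos] == "T":
                        ct += 1
    return ct / c_total if c_total else 0.0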

yallHap's Approach (VCF-level)

yallHap works with VCF files, not BAMs, so it doesn't have direct access to read positions. Instead, it uses variant density as a proxy for data quality:

# yallHap calculates variant density from the VCF
Variant Density = (Called Variants / Total Positions) × 100%

# Interpretation:
# <4%:  Unreliable classification regardless of method
# 4-10%: Bayesian mode recommended (+19pp improvement)
# ≥10%: Both modes achieve 97-99% accuracy

This turned out to be more reliable than relying on coverage annotations from metadata, which often don't match the actual data available for classification.
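
At its core the density check is just the fraction of the tool's informative positions that actually received a call in the VCF. A minimal sketch, with hypothetical helper names (not yallHap's actual code):

# Sketch: variant density over a panel of haplogroup-informative positions
def variant_density(called_positions, panel_positions):
    """Percentage of panel positions with a confident call in the VCF."""
    panel = set(panel_positions)
    called = len(set(called_positions) & panel)
    return 100.0 * called / len(panel)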

Solution 2: Filter the Noise (Selective Exclusion)

Once you know damage is present, the simplest approach is to filter out the suspicious variants. Both tools offer multiple filtering strategies:

Damage-Direction Filtering

The most targeted approach: only exclude variants that match the damage pattern.

Variant Type     Status    Rationale
C→T              Filtered  Matches deamination damage
G→A              Filtered  Matches deamination damage (complementary strand)
T→C              Kept      Reverse direction, unlikely to be damage
A→G              Kept      Reverse direction, unlikely to be damage
A→T, G→C, etc.   Kept      Transversions, not caused by deamination

# yallHap ancient mode
yallhap classify ancient.vcf --ancient

# eveHap damage filter
evehap classify ancient.bam --damage-filter
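
The rule itself is tiny. A sketch with hypothetical helper names:

# Sketch of the damage-direction rule from the table above
DAMAGE_PATTERNS = {("C", "T"), ("G", "A")}

def keep_variant(ref, alt):
    """Drop only substitutions that match the deamination signature."""
    return (ref, alt) not in DAMAGE_PATTERNS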

Transversions-Only Mode

The nuclear option: throw out all transitions (purine↔purine or pyrimidine↔pyrimidine changes) and only use transversions (purine↔pyrimidine changes). This is maximally conservative because deamination only causes transitions, never transversions.

# yallHap transversions-only
yallhap classify ancient.vcf --transversions-only

# Transitions (EXCLUDED):
# A↔G, C↔T

# Transversions (KEPT):
# A↔C, A↔T, G↔C, G↔T

The downside? You lose a lot of data. Transitions are more common than transversions in the human genome, so you're throwing away roughly two-thirds of your informative SNPs. For heavily damaged samples, this trade-off is worth it. For moderately damaged samples, there's a better way.
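
For reference, the transition/transversion split comes down to a one-line check (a sketch, not either tool's code):

# A change is a transversion when it crosses the purine/pyrimidine boundary
PURINES = {"A", "G"}

def is_transversion(ref, alt):
    return (ref in PURINES) != (alt in PURINES)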

Solution 3: Trust, But Verify (Probabilistic Scoring)

Instead of binary filtering (keep/discard), you can use probability to weight confidence in each variant. This is where the two tools diverge, reflecting the different challenges of mtDNA vs. Y chromosome classification.

eveHap: Kulczynski Scoring with Coverage Adjustment

eveHap uses a weighted Kulczynski similarity measure (the same algorithm as Haplogrep3). The key insight is that not all mutations are equally informative: mutations that recur frequently across the phylogeny carry less weight than stable, rarely recurring ones.

This prevents basal haplogroups from dominating the scoring just because they're defined by common mutations (a sketch of the weighted score follows after the snippet below). For ancient DNA, eveHap adds coverage-based confidence adjustment:

# Coverage-based confidence adjustment for ancient samples
def adjust_confidence(score, coverage):
    if coverage >= 0.50:
        return score          # No adjustment
    elif coverage >= 0.25:
        return score * 0.7    # 30% reduction
    elif coverage >= 0.10:
        return score * 0.5    # 50% reduction
    else:
        return score * 0.3    # 70% reduction
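
And here's a sketch of the weighted Kulczynski score itself, with per-mutation weights supplied as a dictionary. This is my reading of the approach, not eveHap's exact code:

# Sketch: weighted Kulczynski similarity between observed variants and a haplogroup's expected variants
def kulczynski_score(sample_variants, expected_variants, weights):
    shared = sample_variants & expected_variants          # sets of variant IDs
    w_shared = sum(weights[v] for v in shared)
    w_expected = sum(weights[v] for v in expected_variants)
    w_observed = sum(weights[v] for v in sample_variants)
    if w_expected == 0 or w_observed == 0:
        return 0.0
    # Average of "how much of the haplogroup we see" and "how much of the sample it explains"
    return 0.5 * (w_shared / w_expected + w_shared / w_observed)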

yallHap: Bayesian Posterior with Damage Modeling

yallHap takes a full Bayesian approach, computing posterior probabilities for all haplogroups.

For each branch in the Y-chromosome tree, yallHap computes a log-likelihood ratio:

log[P(data | descended through branch) / P(data | not descended)]

# For a derived allele call with error rate ε = 0.001:
# Log-ratio ≈ +7 (strong support for being on this branch)

# For an ancestral allele call:
# Log-ratio ≈ -7 (evidence against being on this branch)
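
Those numbers are easy to sanity-check (illustrative, not yallHap's exact model):

# Per-site log-likelihood ratio with a flat error rate ε
import math

def site_log_ratio(is_derived, epsilon=0.001):
    """Positive when the observed allele supports descent through the branch."""
    if is_derived:
        return math.log((1 - epsilon) / epsilon)   # ≈ +6.9 for ε = 0.001
    return math.log(epsilon / (1 - epsilon))       # ≈ -6.9 for ε = 0.001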

The key innovation for ancient DNA is damage modeling. When processing ancient samples, yallHap:

  1. Uses a higher effective error rate (5% vs 0.1%) to account for damage uncertainty
  2. Applies damage probability penalties to C→T and G→A variants
  3. Weights calls by allelic depth (how many reads support each allele)

# Damage probability calculation (fraction of the weight attributed to possible deamination)
def damage_probability(ref, alt, damage_rate=0.10):
    if (ref, alt) in {("C", "T"), ("G", "A")}:    # matches the deamination pattern
        return damage_rate                        # e.g., 10%
    elif (ref, alt) in {("T", "C"), ("A", "G")}:  # reverse direction
        return damage_rate * 0.3                  # less likely to be damage
    else:                                         # transversions
        return 0.0                                # not caused by deamination

# Weight is reduced by damage probability
weight = base_weight * (1 - damage_probability(ref, alt))

This means a C→T variant isn't thrown away—it just contributes less evidence than a transversion. You keep more data while appropriately discounting the suspicious stuff.

Allelic Depth Weighting

Ancient DNA often has very low coverage—maybe 2-3 reads covering a position. If 2 reads show the derived allele and 1 shows ancestral, how confident should you be?

yallHap computes a support ratio from the allelic depth (AD) field in the VCF:

support_ratio = reads_supporting_call / total_reads

# Support ratio determines confidence:
# < 50%: Heavy penalty (squared)
# 50-70%: Gradual recovery
# > 70%: Near-linear contribution
# Missing AD: Assume good support (trust the genotype quality)

This is especially valuable for ancient DNA where many positions have just 1-3 reads. A call supported by 3/3 reads should count more than one supported by 2/3 reads.
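
One plausible shape for that weighting, using the tier boundaries above. The cutoffs come from the text; the curve inside each tier is my guess, not yallHap's exact formula:

# Support-ratio weighting sketch; tier cutoffs from the text, curve shapes assumed
def support_weight(ad_supporting, ad_total):
    if ad_total == 0:
        return 1.0                         # missing AD: assume good support
    ratio = ad_supporting / ad_total
    if ratio < 0.5:
        return ratio ** 2                  # heavy penalty (squared)
    if ratio <= 0.7:
        return 0.25 + 2.0 * (ratio - 0.5)  # gradual recovery
    return ratio                           # near-linear contribution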

Solution 4: The Soft Touch (Damage Rescaling)

Sometimes you don't want to filter variants entirely—you just want to be appropriately skeptical. yallHap's damage rescaling option applies quality penalties to potentially damaged variants:

# Damage rescaling modes
--damage-rescale moderate:
    C→T / G→A: 50% quality penalty
    Other transitions: 25% penalty
    Transversions: No penalty

--damage-rescale aggressive:
    C→T / G→A: 75% quality penalty
    Other transitions: 50% penalty
    Transversions: No penalty

# Example:
# Original quality 30, C→T variant, moderate mode:
# Rescaled quality = 30 × (1 - 0.5) = 15

If the rescaled quality drops below the minimum threshold, the variant is treated as missing. Otherwise it contributes with reduced weight. This lets real C→T variants still contribute (if they have very high quality) while appropriately discounting ambiguous ones.
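
In code form, the rule looks roughly like this. The penalty fractions come from the listing above; min_quality is a hypothetical placeholder, not yallHap's actual default:

# Sketch of damage rescaling; returns None when the call should be treated as missing
PENALTIES = {
    "moderate":   {"damage": 0.50, "other_transition": 0.25, "transversion": 0.00},
    "aggressive": {"damage": 0.75, "other_transition": 0.50, "transversion": 0.00},
}

def rescale_quality(qual, variant_class, mode="moderate", min_quality=20):
    new_qual = qual * (1 - PENALTIES[mode][variant_class])
    return None if new_qual < min_quality else new_qual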

Putting It All Together

What should you use? Based on validation across thousands of ancient samples, here are my recommendations:

Data Quality                           eveHap (mtDNA)                       yallHap (Y-chr)
Modern WGS (>30×)                      Default mode                         Default mode (99.8% accuracy)
Ancient, good quality (≥10% density)   --damage-filter                      --ancient (97-99% accuracy)
Ancient, moderate (4-10% density)      --damage-filter + adaptive quality   --ancient --bayesian (+19pp improvement)
Ancient, poor (<4% density)            Manual review recommended            Consider --transversions-only, expect ~37% accuracy

The Bayesian mode in yallHap shows the biggest improvement at moderate variant density (4-10%), where there's enough data for probabilistic scoring to distinguish signal from noise, but not so much that simple counting works fine. At very low densities, even Bayesian inference can't rescue you from insufficient data.

What I Learned

Building these tools taught me a few things:

  1. Biology matters. You can't just throw algorithms at DNA data without understanding the underlying chemistry. The damage patterns aren't random—they're predictable consequences of biochemistry, and that predictability is what makes filtering possible.
  2. Probabilistic thinking beats binary filtering. The Bayesian approach in yallHap consistently outperforms simple transversions-only filtering in the moderate-quality regime. Keeping data and down-weighting it is usually better than throwing it away.
  3. Confidence scores matter. A lot of tools just give you a haplogroup call without any indication of reliability. For ancient DNA especially, you need to know when to trust the result and when to be skeptical.
  4. Variant density is underrated. Calculating data completeness directly from the VCF turned out to be more reliable than coverage annotations from metadata.

The Bigger Picture

Ancient DNA has transformed our understanding of human history. We can trace migrations, population replacements, and family relationships across thousands of years. But all of that depends on being able to accurately read the damaged genetic material that survived.

Haplogroup classification might seem like a small piece of the puzzle, but it's foundational. Mitochondrial and Y-chromosome haplogroups give us direct windows into maternal and paternal lineages—lines of descent that stretch back tens of thousands of years. Getting those calls right matters.

The biochemistry of DNA damage is just cool. The fact that we can predict and correct for the specific ways that cytosine decays over millennia, and use that knowledge to peer back into prehistory—that's the kind of thing that makes me love this field.

Life is so fucking cool.

If you want to dig into the code, both tools are open source.

Happy to answer questions if you have them! :)