The YFull Y-chromosome tree has 185,780 phylogenetically informative SNPs across 15,000+ subclades, roughly 23 times the ~8,000 SNPs in the 2016 ISOGG tree. When I built yallHap, I wanted to use the full YFull tree—but that meant solving some engineering problems.
This post is about those problems and how I approached them.
The Naive Approach (And Why It Doesn't Work)
The obvious way to classify a haplogroup: for each variant in your VCF, search through the SNP database to find a match, then figure out which tree node it belongs to.
With 185K SNPs, that's O(n×m) where n is your variants and m is your database. A typical Y-chromosome VCF might have 100-500 variants. At 185K comparisons per variant, you're looking at 18-90 million operations per sample. For a single sample, that's slow but survivable. For 7,000 ancient DNA samples? Not happening.
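For concreteness, the naive scan looks something like this; `variants` and `snp_database` are illustrative stand-ins, not yallHap's actual data structures:

```python
# Naive O(n×m) matching: every variant is checked against every database entry.
def naive_matches(variants, snp_database):
    """variants: (position, allele) pairs; snp_database: (position, name, node) records."""
    hits = []
    for pos, allele in variants:                    # n variants
        for snp_pos, name, node in snp_database:    # × m database entries
            if snp_pos == pos:
                hits.append((name, node, allele))
    return hits
```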
Solution: Index Everything
The fix is to build indices during initialization so that lookups are O(1) during classification. yallHap builds three hash tables:
_snp_name_to_pos: dict[str, int]
# "L21" → 15426875
_pos_to_snp: dict[int, SNP]
# 15426875 → SNP(name="L21", ancestral="G", derived="A", ...)
_node_to_snp_info: dict[str, list[tuple[int, str]]]
# "R-L21" → [(15426875, "L21"), (15426901, "S145"), ...]
The first two are straightforward—map SNP names to positions and vice versa. The third is the key optimization: for each tree node, pre-compute which SNP positions define it.
During classification, instead of searching through 185K SNPs for each haplogroup, yallHap just looks up the pre-computed list. Most haplogroups are defined by 1-10 SNPs, not 185K. This turns the scoring loop from "check every SNP in the database" to "check the handful of SNPs that actually matter for this node."
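As a sketch of the idea (the attribute shapes follow the snippets above, but the SNP fields and helper functions here are illustrative, not yallHap's actual API):

```python
from dataclasses import dataclass

@dataclass
class SNP:
    name: str
    position: int
    node: str        # e.g. "R-L21"
    ancestral: str
    derived: str

def build_node_index(snps: list[SNP]) -> dict[str, list[tuple[int, str]]]:
    """One pass over the database: group defining SNPs by tree node."""
    index: dict[str, list[tuple[int, str]]] = {}
    for snp in snps:
        index.setdefault(snp.node, []).append((snp.position, snp.name))
    return index

def score_node(node: str, genotypes: dict[int, str],
               node_index: dict[str, list[tuple[int, str]]],
               pos_to_snp: dict[int, SNP]) -> tuple[int, int]:
    """Count derived/ancestral calls using only the node's own SNPs (usually 1-10)."""
    derived = ancestral = 0
    for pos, _name in node_index.get(node, []):
        allele = genotypes.get(pos)
        if allele is None:
            continue                      # position not covered in this sample
        snp = pos_to_snp[pos]
        if allele == snp.derived:
            derived += 1
        elif allele == snp.ancestral:
            ancestral += 1
    return derived, ancestral
```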
Multi-Reference Genome Support
Different datasets use different reference genomes. Ancient DNA from the Reich Lab is typically aligned to GRCh37. Modern samples from gnomAD use GRCh38. The T2T consortium's complete Y-chromosome assembly is gaining traction.
yallHap stores positions for all three references simultaneously:
class SNPDatabase:
    _by_position_grch37: dict[int, list[SNP]]
    _by_position_grch38: dict[int, list[SNP]]
    _by_position_t2t: dict[int, list[SNP]]
When you specify --reference grch38, lookups use the GRCh38 index. No conversion needed at query time.
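A minimal sketch of that dispatch, assuming a `SNPDatabase` shaped like the class above (the helper name is mine, not yallHap's):

```python
def snps_at(db: "SNPDatabase", reference: str, position: int) -> list["SNP"]:
    """Look up SNPs at a position using the index for the requested reference."""
    index = {
        "grch37": db._by_position_grch37,
        "grch38": db._by_position_grch38,
        "t2t": db._by_position_t2t,
    }[reference]
    return index.get(position, [])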
T2T positions are handled lazily. The YBrowse SNP database doesn't include T2T coordinates natively, so yallHap computes them via liftover on first use. This adds ~30 seconds the first time you run with --reference t2t, but the lifted positions are cached for subsequent runs.
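Here's one way that lazy, run-to-run caching could look; the chain file, cache path, and use of `pyliftover` are my assumptions for the sketch, not necessarily what yallHap does:

```python
import json
import os

from pyliftover import LiftOver

CACHE = "t2t_positions.json"  # hypothetical cache location

def lift_to_t2t(grch38_positions, chain_path="hg38ToHs1.over.chain.gz"):
    """Map GRCh38 Y positions to T2T once, reusing a cache file on later runs."""
    if os.path.exists(CACHE):
        with open(CACHE) as fh:
            return {int(k): v for k, v in json.load(fh).items()}
    lo = LiftOver(chain_path)              # parsing the chain is the slow, one-time step
    lifted = {}
    for pos in grch38_positions:           # pyliftover is 0-based; inputs are 1-based
        hits = lo.convert_coordinate("chrY", pos - 1)
        lifted[pos] = hits[0][1] + 1 if hits else None
    with open(CACHE, "w") as fh:
        json.dump(lifted, fh)
    return lifted
```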
Parallel Processing Without the Pain
Python's multiprocessing has a gotcha: passing large objects between processes requires pickling them, which is slow. The tree and SNP database together are ~65MB. Pickling that for every sample would kill performance.
The solution is to load the data independently in each worker process:
from concurrent.futures import ProcessPoolExecutor, as_completed

_worker_classifier = None  # Global in each worker process

def _init_worker(tree_path, snp_db_path, config):
    """Called once when each worker process starts."""
    global _worker_classifier
    tree = Tree.from_json(tree_path)
    snp_db = load_snp_database(snp_db_path)
    _worker_classifier = HaplogroupClassifier(tree, snp_db, **config)

def _classify_file(vcf_path):
    """Runs inside a worker; uses the pre-loaded classifier."""
    return _worker_classifier.classify(vcf_path)

# Main process: only file paths cross the process boundary
with ProcessPoolExecutor(
    max_workers=threads,
    initializer=_init_worker,
    initargs=(tree_path, snp_db_path, config),
) as executor:
    futures = {executor.submit(_classify_file, vcf): vcf for vcf in vcf_files}
    for future in as_completed(futures):
        result = future.result()
        # ... handle result
Each worker loads the tree and SNP database once on startup. After that, classification is just CPU work with no inter-process communication. The main process never pickles the large objects—it just passes file paths.
Memory Usage
With 12 worker processes, each holding its own copy of the data, you might expect memory to balloon. In practice the total stays under a gigabyte:
| Component | Size | Notes |
|---|---|---|
| YFull tree | ~15 MB | JSON parsed to nested dicts |
| SNP database | ~50 MB | 185K SNPs × 3 reference indices |
| Per-worker overhead | ~5 MB | Python interpreter, etc. |
| 12 workers total | ~850 MB | ≈ 12 × (15 + 50 + 5) MB |
Most of that is the SNP database replicated across workers. If memory is tight, --threads 1 runs everything in the main process with a single copy of the data.
Benchmarks
On an M3 Max MacBook Pro (12 performance cores):
| Dataset | Samples | Time | Per-sample |
|---|---|---|---|
| 1000 Genomes Phase 3 | 1,233 | 314s | 0.25s |
| gnomAD high-coverage | 1,231 | 1,508s | 1.22s |
| AADR ancient | 7,333 | 1,367s | 0.19s |
The gnomAD samples are slower because they're 30× high-coverage WGS—much larger VCF files. The AADR ancient samples are sparse (many missing positions), so there's less data to process per sample.
For comparison, yhaplo (which uses the smaller ISOGG tree) is about 2.4× faster on the same samples. That's the trade-off: yallHap uses a tree with 23× more SNPs, which gives finer resolution but takes longer to process.
The Bayesian Mode Overhead
yallHap's --bayesian flag enables probabilistic scoring with posterior probabilities. This is more accurate for ancient DNA but adds computational cost.
In heuristic mode, scoring is simple counting: for each haplogroup, count derived vs. ancestral alleles. In Bayesian mode, the classifier computes log-likelihood ratios for every branch on the path from the root to each candidate haplogroup, then normalizes to get posteriors.
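The scoring itself is conceptually simple. Here's a hedged sketch of the idea (flat error rate, uniform prior, illustrative names; yallHap's actual model may differ):

```python
import math

def log_likelihood(path_snps, observed, error_rate=0.01):
    """Sum log-likelihood ratios of 'derived' vs 'ancestral' along a root-to-node path."""
    ll = 0.0
    for pos, derived_allele in path_snps:
        allele = observed.get(pos)
        if allele is None:
            continue                      # missing position: no evidence either way
        if allele == derived_allele:
            ll += math.log((1 - error_rate) / error_rate)
        else:
            ll += math.log(error_rate / (1 - error_rate))
    return ll

def posteriors(candidates, observed):
    """Normalize per-candidate log-likelihoods into posterior probabilities."""
    lls = {hg: log_likelihood(snps, observed) for hg, snps in candidates.items()}
    m = max(lls.values())
    weights = {hg: math.exp(ll - m) for hg, ll in lls.items()}  # subtract max to avoid overflow
    total = sum(weights.values())
    return {hg: w / total for hg, w in weights.items()}
```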
The Bayesian classifier pre-computes which SNPs belong to each haplogroup during initialization:
_haplogroup_snps: dict[str, list[SNP]]
# "R-L21" → [SNP(...), SNP(...), ...]
This avoids repeated tree traversal during scoring. Even so, Bayesian mode is roughly 2× slower than heuristic mode. For modern high-coverage samples where heuristic mode achieves 99.8% accuracy, the extra computation isn't worth it. For ancient samples at 4-10% variant density, the +19 percentage point accuracy improvement justifies the cost.
What I'd Do Differently
A few things I'd reconsider if starting over:
- Rust or C++ for the hot path. The scoring loop is pure Python, which is fine for thousands of samples but would bottleneck at millions. A compiled extension for the inner loop would help.
- Memory-mapped SNP database. Right now each worker loads its own copy. A memory-mapped file would let workers share read-only data without the complexity of shared memory.
- Streaming VCF processing. Currently yallHap loads all variants into memory before scoring. For very large VCFs, streaming would reduce peak memory.
That said, the current implementation handles the datasets I care about (population-scale ancient DNA studies, typically 1K-10K samples) without issues. Premature optimization, etc.
The Code
yallHap is open source: github.com/trianglegrrl/yallHap
The indexing logic is in classifier.py (_build_position_index). Parallel processing is in cli.py (_init_worker and the batch command). The Bayesian classifier lives in bayesian.py.
Happy to answer questions. :)