The fastest way I've found to understand a bioinformatics method is to run it myself. Not on toy data or tutorials, but on real samples where I can compare my results to published findings. The problem is that "real data" often means multi-gigabyte downloads that take hours or days. If you want to test an idea at 11pm on a Tuesday, waiting for a 50GB dataset isn't really an option.
So I've been building up a local library of commonly-used reference data. Having AADR, 1000 Genomes, HGDP, and SGDP already on disk means I can go from "I wonder if..." to actually finding out. I'm impatient, and "it only takes minutes" instead of "it takes days" is a big win in bioinformatics. This post walks through what I've collected, how it's organized, and how I use it for different kinds of analyses.
Why Bother?
For me, reading a paper about f-statistics is one thing. Running the actual analysis on real ancient DNA and seeing Z-scores that match the published results is different. I catch details I'd otherwise miss. I run into errors that teach me about edge cases. I build intuition for what "normal" looks like, so I can spot when something's weird.
The datasets I use most often are the standard population genetics references:
- HGDP (Human Genome Diversity Project): 929 individuals from 54 populations worldwide
- 1000 Genomes: 2,504 individuals from 26 populations, with full genome sequences
- SGDP (Simons Genome Diversity Project): 279 individuals at 40x coverage, with excellent Y chromosome data
- AADR (Allen Ancient DNA Resource): 17,000+ ancient and modern individuals, the workhorse dataset for ancient DNA research
I also keep reference genomes, variant databases, haplogroup trees, and metagenomic databases. At last count, I'm sitting on about 30TB across five drives. (Yes, this is probably overkill for someone who isn't running a lab. I never claimed to be sensible about this.)
My Directory Structure
I organize things by type rather than by project. This makes it easier to share data across different analyses. Here's the rough layout:
```
references/
├── genomes/            # hg38, hg19, CHM13, primate genomes
├── aadr/               # Allen Ancient DNA Resource (v62.0, v54.1)
├── 1000genomes/        # Phase 3 VCFs
├── human-y/            # Y-DNA SNP databases, trees
├── mito/               # mtDNA references and panels
├── kraken2/            # Metagenomic databases
├── gnomad/             # Population allele frequencies
└── contamLD/           # Contamination estimation panels

datasets/
├── panels/             # Population reference panels
│   ├── 1kg/mtdna/      # 1000 Genomes mtDNA (FASTA, HSD)
│   ├── hgdp/mtdna/     # HGDP mtDNA (BAM, FASTA, haplogroups)
│   ├── sgdp/           # SGDP (mtDNA VCFs, Y VCFs)
│   └── aadr/mtdna/     # Ancient mtDNA (rCRS and RSRS aligned)
├── validation/         # Known-haplogroup samples for testing
│   ├── ydna/           # Ancient Y-DNA BAMs with confirmed calls
│   └── damaged_mtdna/  # Simulated damage for ML training
└── projects/           # Project-specific sample collections
```
The key thing is that any analysis can find the data it needs without me having to remember where I put something three months ago.
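In practice that just means a thin lookup layer between analyses and the filesystem. Here's a minimal sketch of the idea, assuming roots and keys that mirror the tree above; the paths and names are placeholders for illustration, not a real API.

```python
from pathlib import Path

# Hypothetical roots; point these at wherever the tree above actually lives.
REFERENCES_ROOT = Path("/data/references")
DATASETS_ROOT = Path("/data/datasets")

# Illustrative keys mirroring the directory layout above.
DATASETS = {
    "hg38": REFERENCES_ROOT / "genomes" / "hg38",
    "aadr_v62": REFERENCES_ROOT / "aadr" / "v62.0",
    "1kg_mtdna": DATASETS_ROOT / "panels" / "1kg" / "mtdna",
    "ydna_validation": DATASETS_ROOT / "validation" / "ydna",
}

def dataset_path(key: str) -> Path:
    """Return the directory for a dataset key, failing loudly if it isn't on disk."""
    path = DATASETS[key]
    if not path.exists():
        raise FileNotFoundError(f"{key} expected at {path}; check the drive is mounted.")
    return path

if __name__ == "__main__":
    print(dataset_path("aadr_v62"))
```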
Reference Genomes
I keep multiple human genome builds because different tools and datasets expect different versions. The main ones:
| Build | When to Use It |
|---|---|
| GRCh38 (hg38) | Current standard. Most new data aligns here. |
| GRCh37 (hg19) | Legacy data. A lot of ancient DNA and older studies use this. |
| hs37d5 | 1000 Genomes reference with decoy sequences. Good for ancient DNA. |
| CHM13 (T2T) | The telomere-to-telomere assembly. Complete, no gaps. Use for Y chromosome work. |
Each genome has BWA and Bowtie2 indexes pre-built. Building indexes from scratch takes ages, so having them ready is a significant time-saver.
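To make "pre-built" concrete, here's a minimal sketch of the build-once step. It assumes `bwa`, `bowtie2-build`, and `samtools` are on your PATH, and the reference path is a placeholder.

```python
import subprocess
from pathlib import Path

def ensure_indexes(fasta: Path) -> None:
    """Build BWA, Bowtie2, and faidx indexes for a reference FASTA if they're missing."""
    # BWA: writes <fasta>.amb/.ann/.bwt/.pac/.sa next to the FASTA.
    if not Path(f"{fasta}.bwt").exists():
        subprocess.run(["bwa", "index", str(fasta)], check=True)

    # Bowtie2: writes <prefix>.1.bt2 ... <prefix>.rev.2.bt2.
    prefix = fasta.with_suffix("")  # e.g. hg38.fa -> hg38
    if not Path(f"{prefix}.1.bt2").exists():
        subprocess.run(["bowtie2-build", str(fasta), str(prefix)], check=True)

    # samtools faidx: writes <fasta>.fai, needed by most downstream tools.
    if not Path(f"{fasta}.fai").exists():
        subprocess.run(["samtools", "faidx", str(fasta)], check=True)

if __name__ == "__main__":
    ensure_indexes(Path("/data/references/genomes/hg38/hg38.fa"))  # placeholder path
```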
I also keep primate reference genomes (chimpanzee, bonobo) for comparative work. These are useful as outgroups when you're doing things like f-statistics, and for sanity-checking metagenomic classifications.
Mitochondrial and Y Chromosome References
The mitochondrial genome and the Y chromosome get their own treatment because they're analyzed differently from the autosomes.
For mtDNA, I keep:
- rCRS (revised Cambridge Reference Sequence): The standard mtDNA reference. Most haplogroup calling tools expect this.
- RSRS (Reconstructed Sapiens Reference Sequence): The ancestral mtDNA sequence. Some ancient DNA workflows prefer this.
- Reference panels: Known haplogroup sequences for training and validation.
For the Y chromosome:
- ybrowse SNP databases: ~430MB of Y-DNA SNP positions in both hg19 and hg38 coordinates
- YFull tree: The phylogenetic tree of Y haplogroups, in JSON format
- Extracted Y references: Just chrY from each genome build, for faster alignment when you only care about the Y
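The extracted Y references are nothing fancy, just `samtools faidx` pulling one contig out of an indexed FASTA. A minimal sketch, with placeholder paths:

```python
import subprocess
from pathlib import Path

def extract_chry(fasta: Path, out: Path, contig: str = "chrY") -> None:
    """Write just the Y chromosome from a reference FASTA."""
    # samtools uses (or builds) <fasta>.fai to pull out the requested contig.
    with open(out, "w") as fh:
        subprocess.run(["samtools", "faidx", str(fasta), contig], stdout=fh, check=True)

if __name__ == "__main__":
    # UCSC-style builds name the contig "chrY"; Ensembl/1000 Genomes builds use "Y".
    extract_chry(Path("/data/references/genomes/hg38/hg38.fa"), Path("chrY_hg38.fa"))
```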
Population Panels for Testing
This is probably the most useful part of my setup. I've curated subsets of the major population datasets in formats ready for different kinds of analysis.
mtDNA Testing (for tools like eveHap and HaploGrep)
| Dataset | Samples | Formats |
|---|---|---|
| HGDP | 829 FASTA, 1,656 BAM | BAM, FASTA, HSD, haplogroup calls |
| 1000 Genomes | 1,002 | FASTA, HSD |
| AADR (ancient) | 864 | BAM, FASTA, VCF, HSD (rCRS and RSRS aligned) |
Having BAMs available (not just FASTA consensus sequences) matters for tools like HaploGrep3 and eveHap that can work directly from alignment data. This lets you test the "call haplogroups from BAM" feature instead of just running on pre-called variants. The HGDP mtDNA BAMs (~100GB) are particularly nice because they're well-characterized modern samples with known haplogroups, so you can validate your pipeline against published results.
Y-DNA Testing (for tools like yallhap and yleaf)
Y chromosome haplogroup calling from ancient DNA is tricky because coverage is often low and damage patterns can confuse callers. To test pipelines properly, I've assembled a validation set of ancient Y-DNA BAMs with known haplogroups:
| Sample | Period | Y-Haplogroup | Size |
|---|---|---|---|
| Kennewick Man | ~9,000 BP | Q-M3 | 3.4GB |
| I0231, I0443 (Yamnaya) | Bronze Age | R1b1a | 1.8GB |
| VK287, VK296 | Viking Age | I1 | 3.3GB |
| VK292, VK582 | Viking Age | R1b, R1a | 740MB |
These samples span different time periods, coverage levels, and haplogroups, which helps catch edge cases in haplogroup calling tools. When I'm testing a new version of pathPhynder or yallhap, I can run the whole validation set and see if the results match expectations.
For modern Y-DNA comparisons, I use the SGDP chrY VCFs (279 samples, 138GB compressed). These are high-coverage full-genome sequences, so the Y data is clean.
Population Genetics Datasets
For PCA, ADMIXTURE, and f-statistics work, the AADR is my primary dataset. I keep two versions:
- v62.0 (1240K): The current release, ~5GB in EIGENSTRAT format. This is what I use for most f4/qpAdm analyses.
- v54.1.p1 (HO array): An older snapshot in PLINK format. Some published analyses used this version, so I keep it for reproducibility.
I also maintain merged datasets with ancient samples projected onto PCA space built from the modern reference panels, and I keep pre-built projection panels for common analyses.
Tip: AADR updates regularly as new ancient DNA is published. I download new versions but keep old ones around. When you're trying to reproduce a paper's analysis, using the same data version matters.
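Since the 1240K release ships in EIGENSTRAT format, a quick way to see what a given version actually contains is to tally the population labels in its .ind file (three whitespace-separated columns: sample ID, sex, population). A minimal sketch; the filename is a placeholder that depends on the release you downloaded:

```python
from collections import Counter
from pathlib import Path

def populations(ind_file: Path) -> Counter:
    """Count samples per population label in an EIGENSTRAT .ind file."""
    counts = Counter()
    with open(ind_file) as fh:
        for line in fh:
            fields = line.split()  # sample ID, sex, population
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts

if __name__ == "__main__":
    # Placeholder path; the actual filename depends on the AADR release.
    counts = populations(Path("/data/references/aadr/v62.0/v62.0_1240k_public.ind"))
    for pop, n in counts.most_common(10):
        print(f"{pop}\t{n}")
```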
Metagenomics
When you sequence ancient DNA, most of what you get isn't human. It's soil bacteria, fungi, and whatever else was in the burial environment. Metagenomic classification tells you what's in your sample and helps estimate contamination.
I run Kraken2 for this, and it needs prebuilt databases. I keep several:
- standard_16gb: The standard database, fits in 16GB RAM
- PlusPFP-16: Extended with protozoa, fungi, plants
- 7primates: Just primate genomes, for quick human/non-human classification
- NCBI taxonomy: The full taxonomy (~43GB), for downstream analysis
Building these databases from scratch is painful (downloading NCBI takes forever, building the hash tables takes hours). Having them pre-built lets me run metagenomic classification immediately when I get new sequence data.
I've used this setup to look at ancient samples that turned out to be mostly bacterial, which saved me from wasting time trying to call human variants on non-human data.
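That check is mostly just reading the Kraken2 report, which is a tab-separated file of clade percentage, read counts, rank code, taxid, and name. Here's a minimal sketch that pulls the clade-level percentages for a few taxa of interest; the report path and taxon list are illustrative:

```python
from pathlib import Path

def clade_percentages(report: Path, names=("Homo sapiens", "Bacteria", "unclassified")):
    """Pull clade-level read percentages for selected taxa from a Kraken2 report.

    Kraken2 report columns (tab-separated): percent of reads in clade, reads in
    clade, reads assigned directly, rank code, NCBI taxid, indented name.
    """
    wanted = {n: 0.0 for n in names}
    with open(report) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 6:
                continue
            name = cols[5].strip()
            if name in wanted:
                wanted[name] = float(cols[0])
    return wanted

if __name__ == "__main__":
    for taxon, pct in clade_percentages(Path("sample.kreport")).items():
        print(f"{taxon}: {pct:.1f}% of reads")
```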
Ancient DNA Projects
I keep raw data from various ancient DNA projects I've worked on or wanted to explore. These include:
- Nazca/Peru samples from public SRA repositories
- nf-core/eager pipeline runs on various ancient samples
- mapDamage output showing ancient DNA damage patterns
- Chachapoya samples from Laguna de los Condores (~1000-1500 AD)
The archived nf-core/eager runs are useful because they include all the intermediate files. When something goes wrong with a new analysis, I can compare against a known-good run to figure out what's different.
Use Cases: What I Actually Do With This
Testing haplogroup callers
When a new version of yhaplo or eveHap comes out, or when I want to compare different tools, I run them against my validation sets. The HGDP mtDNA BAMs give me modern samples with high coverage and known haplogroups. The ancient Y-DNA BAMs give me challenging low-coverage samples where tool performance can vary significantly.
This is how I learned that some haplogroup callers fall apart on ancient DNA with deamination damage, while others handle it gracefully.
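The actual comparison step is nothing more sophisticated than this. A minimal sketch, assuming a tab-separated file (with a header row) of sample IDs, expected haplogroups, and the calls a tool produced; the file layout is my own convention, not any tool's output format:

```python
import csv
from pathlib import Path

def compare_calls(tsv: Path) -> None:
    """Report mismatches between expected and called haplogroups.

    Assumes columns: sample, expected, called. Exact string matching is naive
    (a call of R1b1a against an expected R1b may still be consistent), so
    mismatches are a prompt to look closer, not an automatic failure.
    """
    total, mismatches = 0, []
    with open(tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            total += 1
            if row["called"] != row["expected"]:
                mismatches.append(row)
    print(f"{total - len(mismatches)}/{total} exact matches")
    for row in mismatches:
        print(f"  {row['sample']}: expected {row['expected']}, got {row['called']}")

if __name__ == "__main__":
    compare_calls(Path("validation/ydna/expected_vs_called.tsv"))  # placeholder path
```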
Reproducing published analyses
When I read a paper that uses f-statistics or qpAdm, I often try to reproduce their main results. Having AADR locally means I can do this immediately. If my numbers match, great. If they don't, that's interesting too, and usually means I'm learning something about methodology or data processing.
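For the f-statistics piece specifically, the core quantity is simple enough to write down: given per-population allele frequencies at each SNP, f4(A, B; C, D) is the average of (pA - pB)(pC - pD). Here's a minimal numpy sketch; real analyses go through ADMIXTOOLS or admixtools2, which also handle missing data and block-jackknife standard errors:

```python
import numpy as np

def f4(p_a: np.ndarray, p_b: np.ndarray, p_c: np.ndarray, p_d: np.ndarray) -> float:
    """f4(A, B; C, D): mean over SNPs of (pA - pB) * (pC - pD).

    Each argument is an array of allele frequencies, one entry per SNP,
    for populations A, B, C, and D at the same sites.
    """
    return float(np.mean((p_a - p_b) * (p_c - p_d)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy frequencies at 1,000 SNPs; for unrelated random "populations" the
    # expected f4 is ~0, which is the no-shared-drift baseline.
    freqs = rng.uniform(0.05, 0.95, size=(4, 1000))
    print(f4(*freqs))
```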
Quick PCA/ADMIXTURE explorations
Sometimes I just want to see where a sample falls in PCA space or what ADMIXTURE thinks its ancestry composition is. With pre-built reference panels and projection loadings, I can do this in a few minutes instead of having to set up a full analysis from scratch each time.
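The projection step is also less mysterious than it sounds: with per-SNP means and loadings from a PCA built on the reference panel, a new sample's coordinates are just its centered genotypes times the loadings. A minimal sketch with toy data; in practice I rely on smartpca's lsqproject, which also handles missing genotypes and per-SNP normalization properly:

```python
import numpy as np

def project(genotypes: np.ndarray, snp_means: np.ndarray, loadings: np.ndarray) -> np.ndarray:
    """Project one sample onto pre-computed PCA axes.

    genotypes: per-SNP genotypes coded 0/1/2 for the sample being projected.
    snp_means: per-SNP mean genotype in the reference panel the PCA was built on.
    loadings:  SNPs x components matrix of loadings from that PCA.
    """
    return (genotypes - snp_means) @ loadings

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_snps, n_pcs = 5000, 10
    sample = rng.integers(0, 3, size=n_snps).astype(float)   # toy genotypes
    means = np.full(n_snps, 1.0)                              # toy reference means
    loadings = rng.normal(size=(n_snps, n_pcs)) / np.sqrt(n_snps)
    print(project(sample, means, loadings)[:2])               # first two PCs
```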
Building intuition
Running the same analysis on many different samples has taught me what "normal" looks like. I've started to recognize when Z-scores are suspiciously high, when PCA projections look weird, when coverage patterns suggest something's off. I don't think I could have gotten that from reading papers alone.
Practical Tips
If you're thinking about building your own data library, here's what I've learned:
- Start with what you actually need. Don't download everything at once. Get the datasets for your current project, then expand as needed. My collection grew over several years of different projects.
- Keep track of versions. Reference datasets update. Tools change. When you can't reproduce something from six months ago, version mismatches are usually the culprit.
- Pre-build indexes. BWA indexing a human genome takes an hour or more. Doing it once and keeping the index saves that time every future analysis.
- Use symlinks liberally. When multiple projects need the same data, symlinks avoid duplication without copying files around.
- Document what you have. I maintain a catalog markdown file that lists every dataset with paths, sizes, and intended uses. (I should probably keep it more up to date than I do; there's a sketch of the size-scanning step after this list.)
- Clean up work directories. Nextflow and Snakemake pipelines generate massive work/ directories. I recently recovered 6TB by cleaning these out.
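For the catalog, the size-scanning part is easy to script; the "intended use" column still has to be written by hand. A minimal sketch with placeholder roots:

```python
from pathlib import Path

def dir_size_gb(path: Path) -> float:
    """Total size of all files under a directory, in GB."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

def write_catalog(roots: list[Path], out: Path) -> None:
    """Write a markdown table of top-level dataset directories and their sizes."""
    lines = ["| Dataset | Path | Size (GB) |", "|---|---|---|"]
    for root in roots:
        for d in sorted(p for p in root.iterdir() if p.is_dir()):
            lines.append(f"| {d.name} | {d} | {dir_size_gb(d):.1f} |")
    out.write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_catalog([Path("/data/references"), Path("/data/datasets")], Path("CATALOG.md"))
```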
Storage Reality
Let me be honest about the storage situation. My current setup uses about 30TB across five drives:
| Drive | Size | Contents |
|---|---|---|
| References | ~2TB | Reference genomes, AADR, databases |
| Working | ~2TB | Active analysis workspace |
| Data drives | ~26TB | 1000 Genomes, HGDP, SGDP, curated datasets, archives |
You don't need this much to get started. The core datasets (AADR at 5GB, 1000 Genomes VCFs at 30GB, reference genomes at ~20GB) fit on a single 100GB partition. Expand from there based on what you're actually doing.
What This Enables
The whole point of this setup is to lower the barrier between "I wonder..." and "let me find out." When testing an idea means downloading 50GB and waiting overnight, you don't test as many ideas. When the data is already there, curiosity becomes action.
I've used this library to detect ghost populations, validate haplogroup calling tools, explore ancient DNA damage patterns, and sanity-check metagenomic classifications. None of that required anything special. Just having the data ready.
This isn't magic; it's just technology that most people haven't used before. 🤷♀️
If you're interested in population genetics or ancient DNA, I'd encourage you to start building your own collection. Download AADR. Grab some 1000 Genomes data. Run an analysis. See if you get the same numbers as a published paper. That's how I've been learning this stuff, anyway.
Resources
- AADR (Allen Ancient DNA Resource)
- 1000 Genomes Project
- HGDP (Human Genome Diversity Project)
- SGDP (Simons Genome Diversity Project)
- gnomAD (population allele frequencies)
- Kraken2 (metagenomic classification)
- nf-core/eager (ancient DNA pipeline)
- admixtools2 (f-statistics and qpAdm)
- HaploGrep3 (mtDNA haplogroup classification)
- yhaplo (Y-DNA haplogroup calling)