Getting Started
Input TSV format
genopack build and genopack add accept a tab-separated file:
| Column | Required | Description |
|---|---|---|
accession |
✓ | Unique genome identifier (e.g. GCA_000008085.1) |
file_path |
✓ | Path to .fa, .fna, .fa.gz, or .fna.gz |
completeness |
CheckM completeness % (0–100) | |
contamination |
CheckM contamination % | |
taxonomy |
Full lineage string (d__;p__;c__;o__;f__;g__;s__) |
|
| extra columns | Stored verbatim in meta.tsv sidecar |
Example:
accession file_path completeness contamination taxonomy
GCA_000008085.1 /data/GCA_000008085.1.fna.gz 98.7 0.3 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
GCA_000006945.2 /data/GCA_000006945.2.fna.gz 99.1 0.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
Build an archive
| Flag | Default | Description |
|---|---|---|
-i / --input |
required | Input TSV |
-o / --output |
required | Output .gpk path |
-t / --threads |
1 | Parallel decompression threads |
-z / --zstd-level |
6 | zstd compression level (1–22) |
--no-hnsw |
off | Skip HNSW index (recommended for > 1M genomes) |
--no-cidx |
off | Skip contig index (saves memory for large archives) |
--taxonomy-group |
off | Group genomes by genus before shard formation |
Archive statistics
Output includes: generation, shard count, genome count, total bp, compression ratio, section inventory.
Extract genomes
# Single accession
genopack extract mydb.gpk --accession GCA_000008085.1 -o genome.fasta
# From file (one accession per line)
genopack extract mydb.gpk --accessions-file list.txt -o genomes.fasta
# With quality filter
genopack extract mydb.gpk --min-completeness 95 --max-contamination 5 -o filtered.fasta
Taxonomy
# Lookup one genome
genopack taxonomy mydb.gpk --accession GCA_000008085.1
# Export as NCBI taxdump (Kraken/Kaiju compatible)
genopack taxdump mydb.gpk -f taxdump -o ./taxdump/
# Export columnar binary (fast offline lookup)
genopack taxdump mydb.gpk -f columnar -o ./taxonomy/
The columnar binary export produces four files:
| File | Description |
|---|---|
acc2taxid.bin |
Sorted (FNV-1a-64(acc), taxid) pairs - O(log n) binary search |
taxnodes.bin |
Sorted node records + name pool - O(log n) taxid lookup |
acc2taxid.tsv |
accession\ttaxid - pandas/polars/R compatible |
taxonomy.tsv |
taxid\tparent_taxid\trank\tname\tis_synthetic |
Find similar genomes
Uses HNSW approximate nearest-neighbour on KMRX k=4 tetranucleotide profiles. Results are sorted by cosine similarity descending.
Contig accession lookup
# Single contig → genome_id
genopack cidx mydb.gpk --accession NZ_JAVJIU010000001.1
# Batch (one contig per line)
genopack cidx mydb.gpk --accessions-file contigs.txt --threads 8
Repack by taxonomy
After building a large archive with randomly distributed shards, re-shard by genus for fast per-taxon NFS access. Geodesic-style workflows that process one taxon at a time read only the relevant shards instead of the entire archive.
| Flag | Default | Description |
|---|---|---|
-t / --threads |
1 | OMP decompression threads |
-z / --zstd-level |
6 | Output compression level |
-m / --max-memory |
32 | Max buffered FASTA GB before eviction |
--taxonomy-rank |
g |
g = genus, f = family |
The algorithm runs in three phases:
- Directory scan - reads only shard headers +
GenomeDirEntryarrays (~300 MB vs 3.1 TB for full blobs), builds the complete genome→taxonomy routing table. - Sort - sorts all genome records by
(taxonomy, oph_fingerprint)in memory. - Decompress + route + write - single sequential pass; flushes the largest active shard writer when memory cap is hit (minimises shard fragmentation vs "flush all").
Append genomes
Appends a new shard generation. Existing shards are untouched; the catalog is updated with a new CATL fragment. Use genopack repack afterwards if taxonomy grouping is important.
Remove genomes
Soft-deletes (tombstones) genomes. Physical space is not reclaimed; use genopack repack to compact.
Distributed build
For collections too large for a single node:
Each node builds its slice to local NVMe (/scratch), rsyncs the part back, then a parallel merge (pwrite one thread per part) assembles the final archive. NFS readahead efficiency is maintained by reading each part archive sequentially.
Deduplication
Removes near-identical sequences using KMRX cosine similarity as a pre-filter, then exact sequence comparison. Keeps the genome with the highest completeness.