CLI Reference
Synopsis
All commands accept --help / -h for option details. Wherever a flag below takes an archive, the path may be either a single archive directory (containing toc.bin) or a directory containing one or more part_*.gpk (multipart set).
build
Build a new archive from a TSV.
The output mydb.gpk is a directory containing toc.bin plus section files. Defaults: per-taxon shard grouping (--taxon-group, genus rank), kmer-NN sort within shards, OPH sketches (--sketch, k=16, sketch size 10 000), CIDX contig index, auto codec.
| Flag | Default | Description |
|---|---|---|
| -i / --input | required | Input TSV (accession\tfile_path\t[completeness\tcontamination\ttaxonomy\t…]) |
| -o / --output | required | Output archive directory (.gpk) |
| -t / --threads | 16 | I/O threads (decompression + compression) |
| -z / --zstd-level | 6 | zstd compression level (1–22) |
| -p / --parallel | 1 | Parallel build workers (auto-merge) |
| --no-dict | off | Disable shared dictionary training |
| --ref-dict | off | Use first genome in each shard as reference content dictionary |
| --delta | off | Compress non-reference blobs against first genome via zstd prefix |
| --mem-delta / --no-mem-delta | on | k=31 k-mer seeded exact-match encoding for highly similar shard groups |
| --2bit / --no-2bit | on | Pack nucleotides to 2 bits/base before zstd (~1.5–2× extra compression) |
| --no-cidx / --cidx | on (skip) | Skip CIDX contig index; pass --cidx to build it |
| --kmer-sort / --no-kmer-sort | on | Sort genomes within each shard by kmer4 NN chain |
| --taxon-group / --no-taxon-group | on | Group genomes into per-taxon shards (requires taxonomy column) |
| --taxon-rank | g | Rank for grouping (g = genus, f = family) |
| --sketch / --no-sketch | on | Compute OPH sketches |
| --sketch-kmer | 16 | OPH sketch k-mer size |
| --sketch-kmers | unset | Comma list (e.g. 16,21,31) → multi-k SKCH in a single pass |
| --sketch-size | 10000 | Number of OPH bins |
| --sketch-syncmer | 0 | Open syncmer prefilter s (0 disables) |
| --coordinator | unset | NFS manifest coordinator: manifest_dir:/output.gpk |
| -v / --verbose | off | Verbose progress |
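
As a rough illustration of the input format, the sketch below writes a two-row TSV and then calls build with flags taken from the table above. The file names, accessions, quality values, and GTDB-style lineage string are all made up for the example, and the build step only runs if a genopack binary is on the PATH.

```bash
# Write a two-genome input TSV: accession, file_path, completeness,
# contamination, taxonomy (tab-separated). All values are hypothetical.
tsv=genomes.tsv
lineage='d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica'

# printf reuses the five-field format for each group of five arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
  GCF_000000001.1 genomes/GCF_000000001.1.fa.gz 99.1 0.4 "$lineage" \
  GCF_000000002.1 genomes/GCF_000000002.1.fa.gz 98.7 1.2 "$lineage" > "$tsv"

# Run the build only when genopack is installed; other options keep their defaults.
if command -v genopack >/dev/null 2>&1; then
  genopack build -i "$tsv" -o mydb.gpk -t 8 -z 6
fi
```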
merge
Merge multiple .gpk archives into one. Uses parallel pwrite (one thread per part) for NFS efficiency.
genopack merge -l parts.txt -o merged.gpk
# or
genopack merge part1.gpk part2.gpk part3.gpk -o merged.gpk
| Flag | Default | Description |
|---|---|---|
| -l / --list | | Text file with one .gpk path per line |
| -o / --output | required | Output path |
| -t / --threads | auto | Merge threads (one per input part) |
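
One way to produce the list file for -l is sketched below. The helper name make_part_list and the parts/ directory are hypothetical; the only thing assumed about genopack is the one-path-per-line format described in the table.

```bash
# Collect every *.gpk entry directly under a directory into a list file,
# one path per line, in sorted order. Archives are directories, so
# -maxdepth 1 keeps find from descending into them.
make_part_list() {
  dir=$1
  out=$2
  find "$dir" -maxdepth 1 -name '*.gpk' | sort > "$out"
}

# Possible usage (requires genopack on the PATH):
#   make_part_list parts parts.txt
#   genopack merge -l parts.txt -o merged.gpk
```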
stat
Print archive statistics.
Output: generation, shard count, live/total genome count, total bp, compression ratio, per-section inventory. Accepts a single archive directory or a multipart set; multipart sets show an aggregated total plus per-part breakdown.
inspect
Report SKCH layout and preload memory cost.
For each archive (single or each part_*.gpk in a multipart directory) prints: live genome count, sketch_size (bins), mask_words (ceil(sketch_size/64)), the list of kmer_sizes stored, bytes per sketch per k, bytes per genome, and total preload size. Use this to decide whether to mmap-preload sketches or stream them frame-by-frame on memory-tight nodes. --json emits machine-readable output.
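The mask_words figure can be checked by hand. The snippet below works it out for the default sketch size of 10000 bins; the byte count assumes each mask word is a 64-bit integer, which the divisor suggests but this page does not state outright.

```bash
sketch_size=10000

# ceil(sketch_size / 64) using integer arithmetic.
mask_words=$(( (sketch_size + 63) / 64 ))

# Assumption: one mask word is 64 bits, i.e. 8 bytes.
mask_bytes=$(( mask_words * 8 ))

echo "mask_words=$mask_words mask_bytes=$mask_bytes"
# prints: mask_words=157 mask_bytes=1256
```

The full bytes-per-sketch value also depends on the per-bin width, so take it from the inspect output rather than deriving it.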
extract
Extract genomes as FASTA.
| Flag | Description |
|---|---|
| --accession ACC | Extract single genome |
| --accessions-file FILE | Extract list of accessions (one per line) |
| --min-completeness FLOAT | Completeness filter (0–100) |
| --max-contamination FLOAT | Contamination filter |
| -o / --out | Output FASTA (default: stdout) |
slice
Extract a subsequence by accession and coordinates.
Decompresses only the checkpoint blocks covering the requested region (sub-genome granularity).
add
Append genomes to an existing archive (new shard generation).
Existing shards are untouched. The catalog receives a new CATL fragment. Use repack afterwards if taxonomy grouping is required.
rm
Soft-delete (tombstone) genomes.
Marks genomes as deleted in a new catalog fragment. Physical space is not reclaimed; use repack to compact.
dedup
Tombstone duplicate genomes (same sequence, different accession) in place.
Walks every shard, hashes each genome's canonical FASTA content, groups duplicates and tombstones all but one representative per group. Modifies the archive in place by appending a new CATL fragment with the tombstones; physical bytes are reclaimed only by repack. With --dry-run, prints the duplicate groups without writing.
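The grouping idea can be approximated outside the archive on plain FASTA files. This is a toy sketch, not the tool's implementation: "canonical" is assumed here to mean headers dropped, line wrapping removed, and bases upper-cased, and the helper names seq_hash and dup_groups are invented. It relies on sha256sum from GNU coreutils and ignores contig boundaries.

```bash
# Hash the sequence content of one uncompressed FASTA file, ignoring
# headers, line wrapping, and case.
seq_hash() {
  grep -v '^>' "$1" | tr -d '\n\r' | tr 'a-z' 'A-Z' | sha256sum | cut -d' ' -f1
}

# Print each group of files that share a hash: one group per line,
# members separated by tabs. Files with a unique hash are not printed.
dup_groups() {
  for f in "$@"; do
    printf '%s\t%s\n' "$(seq_hash "$f")" "$f"
  done | sort | awk -F'\t' '
    $1 != prev { if (n > 1) print line; line = $2; n = 1; prev = $1; next }
    { line = line "\t" $2; n++ }
    END { if (n > 1) print line }'
}
```

Calling dup_groups a.fa b.fa c.fa prints one line per duplicate group, which is roughly the information a --dry-run reports.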
repack
Re-shard an archive by taxonomy for fast per-taxon NFS access.
| Flag | Default | Description |
|---|---|---|
| -t / --threads | 1 | OMP decompression threads |
| -z / --zstd-level | 6 | Output compression level |
| -m / --max-memory | 32 | Max buffered FASTA data in GB before eviction |
| --taxonomy-rank | g | g = genus, f = family |
| -v / --verbose | off | Log every source shard processed |
When to use: After building a large archive without --taxon-group, or after many add operations that scattered genomes across shards. A repacked archive allows geodesic-style tools to read only the ~1,900 shards belonging to Salmonella instead of all 24,000+ shards.
Algorithm: Three-phase two-pass design.
- Phase 1 (fast): Reads only GenomeDirEntry arrays from each shard (~300 MB total for a 3.1 TB archive). Builds a full genome→taxonomy routing table in memory.
- Phase 2: Sorts all genome records by (taxonomy, oph_fingerprint).
- Phase 3: Single sequential pass through the source archive; decompresses FASTAs with OMP parallelism; routes each genome to its pre-determined taxon writer; flushes the largest writer when the memory cap is hit (minimises partial-shard fragmentation).
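
The Phase 3 eviction rule (flush the largest writer when the cap is hit) can be simulated on a toy stream. The sketch below is only a model of that rule under stated assumptions: the function name route, the input format (taxon and size per genome, tab-separated), and the unitless sizes are all invented for illustration.

```bash
# Read "taxon<TAB>size" lines in arrival order and buffer sizes per taxon.
# Whenever the buffered total exceeds the cap, flush the largest buffer.
route() {
  awk -F'\t' -v cap="$1" '
    {
      buf[$1] += $2
      total += $2
      while (total > cap) {
        big = ""
        for (t in buf) if (big == "" || buf[t] > buf[big]) big = t
        print "flush " big " " buf[big]
        total -= buf[big]
        delete buf[big]
      }
    }
    # Whatever is still buffered at end of input is written last.
    END { for (t in buf) print "final " t " " buf[t] }'
}
```

For example, `printf 'A\t4\nB\t3\nA\t4\nB\t2\n' | route 10` prints `flush A 8` followed by `final B 5`: the third genome pushes the total to 11, so the larger buffer (A) is evicted and B keeps accumulating into a single shard.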
taxonomy
Group of taxonomy utilities. Each operation is its own subcommand.
taxonomy show
Print the lineage for one accession (or every accession when --accession is omitted). With --json, emits one JSON object per line.
taxonomy normalize
Take an accession\ttaxonomy\tfile_path TSV and produce a normalized 10-rank lineage TSV. With --ncbi-taxdump, fills missing ranks from NCBI nodes.dmp + names.dmp.
taxonomy partition
Partition a normalized TSV into N balanced parts at a given rank (g = genus, f = family) for parallel/distributed builds. Writes part_0.tsv … part_{N-1}.tsv under the output directory.
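To give a feel for what "balanced" can mean here, the sketch below assigns genera to N parts with a simple greedy heuristic: largest group first, always into the currently lightest part. This is an assumed strategy for illustration only; the page does not say which balancing method the subcommand uses, and the helper name assign_parts is hypothetical.

```bash
# Read one genus label per line (one line per genome) on stdin and print
# "genus<TAB>part" with parts numbered 0..N-1.
assign_parts() {
  n=$1
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr -k2,2 | awk -v n="$n" '
    BEGIN { for (p = 0; p < n; p++) load[p] = 0 }
    {
      # Pick the part with the smallest genome count so far.
      best = 0
      for (p = 1; p < n; p++) if (load[p] < load[best]) best = p
      load[best] += $1
      print $2 "\t" best
    }'
}
```

With 5, 3, 2, and 2 genomes in genera A to D and N = 2, this puts A and D in part 0 and B and C in part 1, giving 7 and 5 genomes respectively.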
taxonomy assign-taxids
Assign canonical concept IDs to lineage paths and emit canonical_path\tconcept_id (sorted by path). Optionally writes a per-accession accession\tconcept_id\ttaxonomy map.
taxonomy diff
genopack taxonomy diff --current current.tsv --gtdb bac120_taxonomy.tsv --gtdb ar53_taxonomy.tsv -o out/
Diff a current taxonomy TSV against a new GTDB release and write per-category TSVs (added, removed, reassigned) plus a summary.txt to out/. With --write-unchanged, also writes the (often huge) unchanged.tsv.
taxonomy patch
Patch taxonomy assignments in place, either against a .gpk archive or a flat input TSV.
# Patch the archive directly
genopack taxonomy patch --archive mydb.gpk --patch reassignments.tsv
# Patch an input TSV before rebuilding
genopack taxonomy patch --tsv genomes.tsv --patch reassignments.tsv [--tsv-out patched.tsv]
# GTDB-Tk classify_summary input
genopack taxonomy patch --archive mydb.gpk --patch gtdbtk.summary.tsv --gtdbtk
Default patch format is accession\tnew_taxonomy. --gtdbtk accepts GTDB-Tk's classify_summary format directly. --no-normalize disables 7→10 rank normalization (default: on).
taxdump
Export taxonomy in NCBI or columnar binary format.
genopack taxdump mydb.gpk -f taxdump -o ./taxdump/
genopack taxdump mydb.gpk -f columnar -o ./taxonomy/
| Format | Output files | Description |
|---|---|---|
| taxdump | names.dmp, nodes.dmp, acc2taxid.dmp | NCBI taxdump - Kraken/Kaiju compatible |
| columnar | acc2taxid.bin, taxnodes.bin, acc2taxid.tsv, taxonomy.tsv | Fast offline lookup |
Similarity search and contig lookup (library-only)
There is no genopack similar or genopack cidx CLI. KMRX/HNSW similarity search and CIDX contig→genome lookup are exposed through the C++ API only:
- ArchiveReader::find_similar(...) / find_similar_by_accession(...): KMRX cosine similarity, HNSW-accelerated when an HNSW section is present, linear-scan fallback otherwise.
- ArchiveReader::find_contig_genome_id(accession): CIDX binary search, ~150M queries/s/core.
See API → Similarity & contig lookup.
reindex
Build or rebuild auxiliary index sections in place.
| Flag | Default | Description |
|---|---|---|
| --force | off | Rebuild indexes even if already present |
| --no-gidx | off | Skip GIDX (useful when only --skch is needed and GIDX is absent/unwanted) |
| --txdb | off | Build the taxonomy tree (TXDB) from existing TAXN lineage strings |
| --cidx FILE | unset | Build the contig accession index (CIDX) from a build TSV (accession\ttaxonomy\tfile_path) |
| --cidx-threads | 8 | Threads for parallel FASTA decompression while indexing contigs |
| --skch | off | Compute OPH sketches for genomes missing from existing SKCH sections |
| --skch-threads | 8 | Threads for parallel sketch computation |
| --sketch-kmer | inherit / 16 | OPH k-mer size for a single-k SKCH section |
| --sketch-kmers | unset | Comma list (e.g. 16,21,31) → multi-k SKCH in one pass |
| --sketch-size | inherit / 10000 | OPH sketch size |
| --sketch-syncmer | inherit / 0 | Open-syncmer prefilter s (0 disables) |
Typical uses: an archive built with --no-cidx later wants contig lookup (--cidx genomes.tsv); a TAXN-only archive needs the tree (--txdb); SKCH layout needs to be upgraded to V4 seekable (--skch --force); or a multi-k variant is needed (--skch --sketch-kmers 16,21,31 --force).
verify
Checks archive integrity: TailLocator checksum, TocHeader checksum, and per-shard checksums. Exits 0 if all checks pass, non-zero on any failure.
| Flag | Default | Description |
|---|---|---|
| archive | required | Archive directory to verify |
| --verbose | off | Print OK lines in addition to failures |
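
Because verify signals failure through its exit status, it slots into shell scripts directly. The wrapper below is a small sketch; the function name check_archive is hypothetical, and it relies only on the documented behaviour (exit 0 on success, non-zero otherwise) and the positional archive argument.

```bash
# Verify one archive and report the result. Returns the exit status that
# genopack verify produced, so callers can chain on it.
check_archive() {
  if genopack verify "$1"; then
    echo "OK: $1"
  else
    status=$?
    echo "FAILED ($status): $1" >&2
    return "$status"
  fi
}

# Possible usage before a downstream step:
#   check_archive mydb.gpk || exit 1
```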
coordinator
NFS-coordinated assembly mode for distributed builds. Workers run genopack build with --coordinator <manifest_dir>:<output.gpk>; the coordinator process waits for the expected number of worker manifests, then merges parts into a single archive.
genopack coordinator -o mydb.gpk --workers 64 --nfs-dir /shared/manifests/ \
[--ntdb /path/to/ncbi/taxdump/]
| Flag | Default | Description |
|---|---|---|
| -o / --output | required | Final merged archive path |
| --workers | required | Expected number of worker manifests |
| --nfs-dir | required | Shared directory where workers drop manifests |
| --ntdb | unset | NCBI nodes.dmp + names.dmp directory; embeds an NTDB section for offline taxid resolution |
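
The sketch below prints (without running) the commands for a small distributed build, to show how the worker and coordinator invocations might line up. Several details are assumptions: the per-worker input names follow the part_N.tsv pattern from taxonomy partition, the per-worker -o paths under work/ are invented, and the --coordinator value uses the manifest_dir:output form described above. Check genopack build --help before relying on any of it.

```bash
workers=4
manifests=/shared/manifests
final=mydb.gpk

# Print the build command for worker number $1 (dry run only).
worker_cmd() {
  printf 'genopack build -i parts/part_%s.tsv -o work/part_%s.gpk --coordinator %s:%s\n' \
    "$1" "$1" "$manifests" "$final"
}

# One line per worker, then the coordinator invocation that waits for them.
i=0
while [ "$i" -lt "$workers" ]; do
  worker_cmd "$i"
  i=$((i + 1))
done
echo "genopack coordinator -o $final --workers $workers --nfs-dir $manifests/"
```

In practice each printed worker line would be submitted to a separate node through whatever scheduler is in use, with the coordinator started once on a node that can see the shared manifest directory.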