# CLI Reference

## Synopsis

All commands accept --help / -h for option details.
## build

Build a new archive from a TSV.

| Flag | Default | Description |
|---|---|---|
| -i / --input | required | Input TSV (accession file_path [completeness contamination taxonomy ...]) |
| -o / --output | required | Output .gpk path |
| -t / --threads | 1 | Parallel FASTA decompression threads |
| -z / --zstd-level | 6 | zstd compression level (1–22) |
| --no-hnsw | off | Skip HNSW index (recommended for > 1M genomes) |
| --no-cidx | off | Skip CIDX contig index |
| --taxonomy-group | off | Bucket genomes by genus before shard formation |
| --taxonomy-rank | g | g = genus, f = family (requires --taxonomy-group) |
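A typical invocation combining the flags above (file names are illustrative):

```shell
# Build an archive from a TSV manifest with 16 decompression threads,
# a higher zstd level, and genus-level grouping before sharding.
# genomes.tsv and mydb.gpk are placeholder names.
genopack build -i genomes.tsv -o mydb.gpk -t 16 -z 9 --taxonomy-group
```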
## merge

Merge multiple .gpk archives into one. Uses parallel pwrite (one thread per part) for NFS efficiency.

```shell
genopack merge -l parts.txt -o merged.gpk
# or
genopack merge part1.gpk part2.gpk part3.gpk -o merged.gpk
```

| Flag | Default | Description |
|---|---|---|
| -l / --list | | Text file with one .gpk path per line |
| -o / --output | required | Output path |
| -t / --threads | auto | Merge threads (one per input part) |
## stat

Print archive statistics.

Output: generation, shard count, live/total genome count, total bp, compression ratio, per-section inventory.
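For example (archive name illustrative, following the positional-argument pattern used by the other commands):

```shell
# Print generation, shard count, genome counts, total bp,
# compression ratio, and the per-section inventory.
genopack stat mydb.gpk
```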
## extract

Extract genomes as FASTA.

| Flag | Description |
|---|---|
| --accession ACC | Extract single genome |
| --accessions-file FILE | Extract list of accessions (one per line) |
| --min-completeness FLOAT | Completeness filter (0–100) |
| --max-contamination FLOAT | Contamination filter |
| --min-length INT | Minimum assembly length in bp |
| --max-contigs INT | Maximum contig count |
| -o / --output | Output FASTA (default: stdout) |
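A sketch of a filtered bulk extraction, assuming the archive path is passed positionally as in the cidx and taxdump examples (file names are illustrative):

```shell
# Keep only genomes >= 90% complete and <= 5% contaminated,
# writing them all to a single FASTA file.
genopack extract mydb.gpk --min-completeness 90 --max-contamination 5 -o filtered.fasta
```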
## slice

Extract a subsequence by accession and coordinates.

Decompresses only the checkpoint blocks covering the requested region (sub-genome granularity).
## add

Append genomes to an existing archive (new shard generation).

Existing shards are untouched. The catalog receives a new CATL fragment. Use repack afterwards if taxonomy grouping is required.
## rm

Soft-delete (tombstone) genomes.

Marks genomes as deleted in a new catalog fragment. Physical space is not reclaimed; use repack to compact.
## dedup

Remove near-identical sequences.

Uses KMRX cosine similarity as a pre-filter, then exact comparison. Keeps the genome with the highest completeness.
## repack

Re-shard an archive by taxonomy for fast per-taxon NFS access.

| Flag | Default | Description |
|---|---|---|
| -t / --threads | 1 | OMP decompression threads |
| -z / --zstd-level | 6 | Output compression level |
| -m / --max-memory | 32 | Max buffered FASTA data in GB before eviction |
| --taxonomy-rank | g | g = genus, f = family |
| -v / --verbose | off | Log every source shard processed |
When to use: After building a large archive without --taxonomy-group, or after many add operations that scattered genomes across shards. A repacked archive allows geodesic-style tools to read only the ~1,900 shards belonging to Salmonella instead of all 24,000+ shards.
Algorithm: three-phase, two-pass design.

- Phase 1 (fast): Reads only GenomeDirEntry arrays from each shard (~300 MB total for a 3.1 TB archive). Builds a full genome→taxonomy routing table in memory.
- Phase 2: Sorts all genome records by (taxonomy, oph_fingerprint).
- Phase 3: Single sequential pass through the source archive; decompresses FASTAs with OMP parallelism; routes each genome to its pre-determined taxon writer; flushes the largest writer when the memory cap is hit (minimises partial-shard fragmentation).
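A sketch of a repack run using only the flags documented above; the archive path is assumed to be positional as in the other commands, and the file name is illustrative:

```shell
# Re-shard by genus with 16 OMP threads and a 64 GB buffer cap,
# logging each source shard as it is processed.
genopack repack mydb.gpk -t 16 -m 64 --taxonomy-rank g -v
```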
## taxonomy

Query taxonomy for one genome.
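For example — the --accession flag here is an assumption mirroring extract and cidx, and the accession is illustrative:

```shell
# Look up the stored taxonomy string for one genome.
genopack taxonomy mydb.gpk --accession GCF_000195955.2
```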
## taxdump

Export taxonomy in NCBI or columnar binary format.

```shell
genopack taxdump mydb.gpk -f taxdump -o ./taxdump/
genopack taxdump mydb.gpk -f columnar -o ./taxonomy/
```

| Format | Output files | Description |
|---|---|---|
| taxdump | names.dmp, nodes.dmp, acc2taxid.dmp | NCBI taxdump; Kraken/Kaiju compatible |
| columnar | acc2taxid.bin, taxnodes.bin, acc2taxid.tsv, taxonomy.tsv | Fast offline lookup |
## similar

Find similar genomes by KMRX cosine similarity.

Uses the HNSW approximate nearest-neighbour index (if present) or falls back to a linear scan.

| Flag | Default | Description |
|---|---|---|
| -k / --top-k | 10 | Number of results |
| --min-sim | 0.0 | Minimum cosine similarity threshold |
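A sketch, assuming the query genome is selected with --accession as in the other commands (the accession is illustrative):

```shell
# Top 20 nearest genomes with cosine similarity >= 0.9,
# served from the HNSW index when one is present.
genopack similar mydb.gpk --accession GCF_000195955.2 -k 20 --min-sim 0.9
```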
## cidx

Look up genome_id from a contig accession via the CIDX index.

```shell
# Single
genopack cidx mydb.gpk --accession NZ_JAVJIU010000001.1
# Batch
genopack cidx mydb.gpk --accessions-file contigs.txt --threads 8 -o results.tsv
```

Throughput: ~150M queries/s on a single core (binary search on a sorted FNV-1a-64 hash array).
## reindex

Append a missing GIDX or HNSW section to an existing archive.

Useful when an archive was built with --no-hnsw and the index is needed later.