Binary Format

A .gpk file is a seekable single-file container inspired by Parquet. Sections can be appended without rewriting existing data, and the TOC at the end is updated with each generation. A 64-byte TailLocator at EOF points to the current TOC, allowing the reader to open the archive with two seeks.

File layout

FileHeadermagic · version · UUID · created_ts · generation 128 B

SHRD × Ncompressed genome shards — each generation appended, old shards untouched variable

CATLcolumnar genome metadata (SoA, sorted by oph_fingerprint)

GIDXgenome_id → (section_id, dir_index, catl_row)

ACCXFNV-1a hash table: accession → genome_id

CIDXsorted (FNV-1a-64(contig), genome_id) array

TAXNFNV-1a hash table: accession → lineage string

TXDBfull taxonomy tree (taxid/parent/rank/name + acc→taxid)

KMRXfloat[n × 136] L2-normalised k=4 tetranucleotide profiles

HNSWhnswlib serialised blob + label map

TOMBtombstone records for soft-deleted genomes

TOCzstd-compressed SectionDesc[] — one record per section

TailLocatorfixed footer at EOF, points to TOC offset 64 B

Section types

Magic	Name	Description
`SHRD`	Shard	Compressed genome blobs and directory
`CATL`	Catalog	Columnar genome metadata (SoA, sorted by oph)
`GIDX`	Genome index	genome_id to (section_id, dir_index, catl_row)
`ACCX`	Accession index	FNV-1a hash table: accession string to genome_id
`CIDX`	Contig index	Sorted (FNV-1a-64(contig_acc), genome_id) array
`TAXN`	Taxonomy strings	FNV-1a hash table: accession to lineage string
`TXDB`	Taxonomy tree	Parsed taxid/parent/rank/name nodes + acc-to-taxid table
`KMRX`	K-mer profiles	float[n × 136] L2-normalised k=4 tetranucleotide frequencies
`HNSW`	HNSW index	hnswlib serialised blob + label map (genome_id per vector)
`TOMB`	Tombstone	Soft-deleted genome_id records

`FileHeader` — 128 bytes

Offset	Size	Field	Description
0	4 B	`magic`	`GPK\x01`
4	2 B	`version_major`	Breaking format change
6	2 B	`version_minor`	Backward-compatible extension
8	16 B	`uuid`	Archive UUID (stable across generations)
24	8 B	`created_ts`	Unix timestamp of initial build
32	8 B	`generation`	Monotonically incremented on each `add`/`rm`/`repack`
40	88 B	reserved	Zero-padded

Shard section (`SHRD`)

SHRD variable

ShardHeadermagic · shard_id · n_genomes · codec · dict_size
dir_offset · dict_offset · blob_area_offset · checkpoint_offset 128 B

GenomeDirEntry[n_genomes]genome_id · oph_fingerprint · blob_offset · blob_len_cmp · blob_len_raw
checkpoint_idx · n_checkpoints — sorted by oph_fingerprint 64 B × n

zstd dictionaryoptional — ZSTD_DICT and REF_DICT codecs only dict_size B

Blob areablob[0] · blob[1] · ... — each independently decompressible variable

CheckpointEntry[]optional — symbol_offset · block_offset · enables sub-genome slice 16 B × n

Genomes are sorted by oph_fingerprint within each shard. Nearby OPH values indicate similar k-mer content, maximising zstd LDM reuse and shared dictionary effectiveness.

`ShardHeader` — 128 bytes

Offset	Size	Field	Description
0	4 B	`magic`	`SHRD`
4	2 B	`version`
6	2 B	`codec`	Compression codec (see table below)
8	8 B	`shard_id`	Unique shard identifier
16	4 B	`n_genomes`	Number of genomes in this shard
20	4 B	`dict_size`	Dictionary size in bytes (0 if none)
24	8 B	`genome_dir_offset`	Byte offset of `GenomeDirEntry[]` from section start
32	8 B	`dict_offset`	Byte offset of zstd dictionary
40	8 B	`blob_area_offset`	Byte offset of blob area
48	8 B	`checkpoint_area_offset`	Byte offset of `CheckpointEntry[]` (0 if none)
56	72 B	reserved	Zero-padded

`GenomeDirEntry` — 64 bytes each

Offset	Size	Field	Description
0	8 B	`genome_id`
8	8 B	`oph_fingerprint`	Order-preserving hash; entries sorted by this value
16	8 B	`blob_offset`	Byte offset of compressed blob from blob area start
24	8 B	`blob_len_cmp`	Compressed size in bytes
32	8 B	`blob_len_raw`	Uncompressed size in bytes
40	4 B	`checkpoint_idx`	Index into `CheckpointEntry[]` for first checkpoint
44	4 B	`n_checkpoints`	Number of checkpoints (0 if slice not needed)
48	16 B	reserved

`CheckpointEntry` — 16 bytes each (optional)

Offset	Size	Field	Description
0	8 B	`symbol_offset`	Byte offset within the decompressed genome at this checkpoint
8	8 B	`block_offset`	Byte offset of the corresponding zstd block within the blob

Codec values

Value	Name	Description
0	`PLAIN`	Each blob is independent zstd
1	`ZSTD_DICT`	Shared dictionary trained on first N genomes
2	`REF_DICT`	First genome used as reference content dictionary
3	`DELTA`	Non-reference blobs zstd-compressed with `refPrefix` from genome 0
4	`MEM_DELTA`	Seed with k=31 k-mers; store MEM list + zstd verbatim residue

Catalog section (`CATL`)

CATL variable

CatlHeadermagic · n_rows · n_groups · stats_offset · rows_offset 32 B

RowGroupStatsV2[n_groups]min/max oph · min/max completeness · min/max genome_length
enables predicate pushdown — skip entire groups without row scan 72 B × n

GenomeMeta[n_rows]sorted by oph_fingerprint 72 B × n

Multiple CATL fragments (one per generation) are merged by MergedCatalogReader at read time; newer fragments take precedence on duplicate genome_id.

`CatlHeader` — 32 bytes

Offset	Size	Field	Description
0	4 B	`magic`	`CATL`
4	4 B	`n_rows`	Total number of `GenomeMeta` rows
8	4 B	`n_groups`	Number of row groups
12	4 B	reserved
16	8 B	`stats_offset`	Byte offset of `RowGroupStatsV2[]` from section start
24	8 B	`rows_offset`	Byte offset of `GenomeMeta[]` from section start

`GenomeMeta` — 72 bytes each

Offset	Size	Field	Description
0	8 B	`genome_id`
8	8 B	`oph_fingerprint`
16	4 B	`completeness`	CheckM completeness × 100 (fixed-point)
20	4 B	`contamination`	CheckM contamination × 100
24	8 B	`genome_length`	Total assembly length in bp
32	4 B	`n_contigs`
36	4 B	`shard_id`	Which shard holds this genome
40	32 B	reserved

Genome index (`GIDX`)

A flat array of fixed-size records sorted by genome_id, enabling O(log n) binary search. Each record maps:

genome_id  ->  (section_id, dir_index, catl_row_index)

section_id is resolved via the TOC to find the shard's file offset. dir_index is the entry's position in GenomeDirEntry[]. catl_row_index is the row in the merged catalog.

TOC and TailLocator

The TOC is a zstd-compressed array of SectionDesc records.

`SectionDesc`

struct SectionDesc {
    uint32_t type;              // section magic (e.g. SEC_SHRD)
    uint16_t version;
    uint16_t flags;
    uint64_t section_id;        // unique, monotonically increasing
    uint64_t file_offset;
    uint64_t compressed_size;
    uint64_t uncompressed_size;
    uint64_t item_count;        // genomes in a shard, rows in a catalog, etc.
    uint64_t aux0;              // type-specific (shard_id for SHRD)
    uint64_t aux1;
    uint8_t  checksum[16];
};

Open sequence

lseek(-64, SEEK_END) — read the 64-byte TailLocator
lseek(toc_offset, SEEK_SET) — read and decompress the TOC
Parse SectionDesc[] — mmap the entire file

Versioning

Field	Description
`FileHeader.version_major`	Breaking format change
`FileHeader.version_minor`	Backward-compatible extension
`FileHeader.generation`	Monotonically incremented on each `add`/`rm`/`repack`
`SectionDesc.version`	Per-section format version (e.g. shard v4 added checkpoints)

KMRX section

Stores L2-normalised k=4 canonical tetranucleotide frequency vectors (136 dimensions; reverse-complement collapsing reduces unique k-mers from 256 to 136).

Offset	Size	Field	Description
0	32 B	`KmrxHeader`	magic, n_genomes, flags
32	n × 8 B	`genome_ids[n]`	Sorted ascending; binary search for O(log n) lookup
32 + n×8	n × 544 B	`profiles[n][136]`	Parallel to `genome_ids`; stored uncompressed

HNSW section

Embeds a serialised hnswlib index blob. Default build parameters: M=16, efConstruction=200.

Offset	Size	Field	Description
0	64 B	`HnswSectionHeader`	magic, n_elements, M, efConstruction
64	variable	hnswlib blob	Serialised hnswlib index
64 + blob	n × 8 B	`label_map[n]`	Translates hnswlib internal label i to `genome_id`

Binary Format

File layout

Section types

FileHeader — 128 bytes

Shard section (SHRD)

ShardHeader — 128 bytes

GenomeDirEntry — 64 bytes each

CheckpointEntry — 16 bytes each (optional)