Skip to content

API Reference

All public symbols live in the taph namespace. Include <taph/frame_selector_decl.hpp>.


FrameSelector

The main entry point. All methods are static.

compute_sample_profile

static SampleDamageProfile compute_sample_profile(
    const std::vector<std::string>& sequences);

One-shot profile computation from a vector of sequences. Internally calls reset_sample_profile, iterates update_sample_profile, then finalize_sample_profile.

Parameters

Name Type Description
sequences const std::vector<std::string>& DNA sequences (ACGT). Reliable estimation requires enough read length and coverage to define both terminal (positions 0–14) and interior (positions 30 to L-30) regions.

Returns A fully populated SampleDamageProfile. Call profile.is_valid() to check whether enough reads were processed (≥ 1000).


update_sample_profile

static void update_sample_profile(
    SampleDamageProfile& profile,
    const std::string& seq);

Accumulate one read into profile.

Safe parallel pattern: update separate SampleDamageProfile objects in parallel threads, then merge them on the main thread with merge_sample_profiles. Do not call this on the same profile object from multiple threads without external synchronization.


update_sample_profile_pe

static void update_sample_profile_pe(
    SampleDamageProfile& profile,
    const std::string& r1,
    const std::string& r2);

Accumulate one read pair into profile. Maps R1[i] to top-strand 5'-end position i and the complement of R2[i] to top-strand 3'-end position i, so a single fragment contributes once to ct5 and once to ga3 (no double-count).

Pairs whose insert length is shorter than the read length read into the sequencing adapter and would imprint adapter composition onto the terminal channels. Such pairs are detected by R1/R2 overlap (15 bp window, ≤3 mismatches) and skipped; the count is reported in pe_short_insert_skipped.

Sets profile.input_mode = InputMode::PAIRED. Same threading contract as update_sample_profile — accumulate into per-thread profiles, then merge.


update_sample_profile_weighted

static void update_sample_profile_weighted(
    SampleDamageProfile& profile,
    const std::string& seq,
    float weight);

Weighted accumulation for alignability-weighted damage estimation.

Note: Only accumulates Channel A (T/(T+C), A/(A+G)), interior baseline, codon-position, and CpG counts. Channel B/C/D/E codon counts, hexamers, and GC-bin stratified data are not accumulated. Profiles built with this method will have channel_b_valid = false and GC-stratified fields unpopulated after finalization.


finalize_sample_profile

static void finalize_sample_profile(SampleDamageProfile& profile);

Compute all derived statistics from accumulated counts: fits the exponential decay, estimates D_max, runs the BIC library-type classifier. Must be called once after all update_* calls.


merge_sample_profiles

static void merge_sample_profiles(
    SampleDamageProfile& dst,
    const SampleDamageProfile& src);

Merge raw counts from src into dst. Useful for parallel accumulation. Call finalize_sample_profile(dst) after merging.


reset_sample_profile

static void reset_sample_profile(SampleDamageProfile& profile);

Zero the main accumulators used by standard damage estimation and library classification.

Important: reset_sample_profile does not currently clear every auxiliary diagnostic accumulator — including Channel B₃′/C/D/E, GC-stratified, and alignability-weighted fields. For guaranteed clean reuse, prefer constructing a fresh SampleDamageProfile{} (zero-initialized) rather than calling reset_sample_profile on a previously populated object.


SampleDamageProfile

All fields are public. Key results after finalize_sample_profile:

Primary damage estimates

Field Type Description
d_max_5prime float Calibrated D_max at 5' end: \(A/(1-b)\)
d_max_3prime float Calibrated D_max at 3' end
d_max_combined float Final asymmetry-aware D_max estimate
lambda_5prime float Fitted decay constant \(\lambda\) at 5'
lambda_3prime float Fitted decay constant \(\lambda\) at 3'
asymmetry float \|d5 - d3\| / mean(d5, d3), >0.5 flagged as suspicious
d_max_source_str() const char* Source used for d_max_combined: "average", "5prime_only", "3prime_only", "channel_b_structural", "channel_b3_structural", "min_asymmetry", "max_ss_asymmetry", "none"

Chemistry-aware background and area-excess

Field Type Description
bg_5prime_anchored / bg_3prime_anchored float Tail-anchored background: trimmed mean of C→T (G→A) rate over positions 20..49, restricted to positions with denom ≥ 100. Used as b in the Briggs fit.
bg_n_positions_5prime / bg_n_positions_3prime int Number of tail positions that contributed to the anchored bg.
bg_denominator_5prime / bg_denominator_3prime double Total T+C (A+G) coverage in the bg window.
briggs_pos0_masked_5prime / briggs_pos0_masked_3prime bool True when position 0 was excluded from the Briggs fit because the protocol-tag fingerprint dominated it (ligation footprint).
damage_5prime_area_excess / damage_3prime_area_excess float \(\sum_{i=k}^{14} \max(0, r_i - b)\) over the first 15 positions (k=1 if pos0 masked, else k=0). Robust headline statistic for cross-library comparison; preferred over d_max when chemistry-tag is set.
damage_5prime_lr / damage_3prime_lr float Log-likelihood ratio of the per-position binomial(rate, bg) model vs the bg-only null over positions k..14. Coverage-scaling companion to area-excess.

Input mode

Field Type Description
input_mode InputMode SINGLE (default) or PAIRED. Set to PAIRED by update_sample_profile_pe.
pe_short_insert_skipped uint64_t Number of pairs skipped because R1/R2 overlap revealed insert < read length (adapter read-through).

Library-type classification

Field Type Description
library_type LibraryType DOUBLE_STRANDED, SINGLE_STRANDED, or UNKNOWN
library_type_auto_detected bool true if set by classifier, false if user-forced
library_type_str() const char* "double-stranded", "single-stranded", "unknown"

LibraryType::UNKNOWN means no model beat the null (M_bias): insufficient damage signal for a confident call. Treat as DS for deduplication unless metadata is available.

BIC model scores (library-type classifier)

Field Type Description
library_bic_bias double BIC of M_bias (null model)
library_bic_ds double Best DS model BIC
library_bic_ss double Best SS model BIC
library_bic_mix double BIC of M_SS_full (4-channel unconstrained)

library_bic_ds - library_bic_ss > 0 means SS is favoured. Values reach ~10⁹ at high coverage; stored as double to preserve precision.

Per-channel classifier amplitudes and ΔBIC

Field Description
libtype_amp_ct5 / libtype_dbic_ct5 5' C→T fitted amplitude and ΔBIC
libtype_amp_ga3 / libtype_dbic_ga3 3' G→A smooth decay (pos 1-10)
libtype_amp_ga0 / libtype_dbic_ga0 3' G→A pos-0 spike
libtype_amp_ct3 / libtype_dbic_ct3 3' C→T (SS original-orientation signal)

ΔBIC > 0 means the alt (decay) model is preferred over the null for that channel.

Per-position arrays

All arrays are 15 elements, indexed 0–14 from the read terminus.

Field Description
damage_rate_5prime[p] C→T excess rate at 5' position p
damage_rate_3prime[p] G→A excess rate at 3' position p
t_freq_5prime[p] Before finalize_sample_profile: raw T count. After: T/(T+C) ratio (normalized in-place).
tc_total_5prime[p] T+C coverage at 5' position p (raw count; not normalized by finalization)
a_freq_3prime[p] Before finalize_sample_profile: raw A count. After: A/(A+G) ratio (normalized in-place).
ag_total_3prime[p] A+G coverage at 3' position p (raw count; not normalized by finalization)

GC-stratified mixture model

Available after finalize_sample_profile. Requires at least one GC bin with sufficient reads.

Field Type Description
mixture_pi_ancient float Fraction of C-sites in high-damage components
mixture_d_ancient float Expected damage rate among ancient reads (δ > 5%)
mixture_d_population float Population-average damage rate across all C-sites
mixture_d_reference float Damage rate in GC > 50% bins (metaDMG proxy)
mixture_K int Number of mixture components selected by BIC
mixture_converged bool Whether the EM algorithm converged
gc_stratified_valid bool At least one GC bin has a valid estimate

Context-aware 5' C→T (upstream base)

Per-context accumulators and shrinkage-fitted amplitudes, indexed by CTX_AC = 0, CTX_CC = 1, CTX_GC = 2, CTX_TC = 3 (constants under SampleDamageProfile::UpstreamContext, with N_UPSTREAM_CTX = 4).

Field Type Description
ct5_t_by_upstream[ctx][pos] / ct5_total_by_upstream[ctx][pos] double[4][N_POS] T and T+C counts at 5' position pos split by upstream base. Raw before finalize_sample_profile.
ct5_t_interior_by_upstream[ctx] / ct5_total_interior_by_upstream[ctx] double[4] Interior counts per upstream context, used as per-context baseline.
dmax_ct5_by_upstream[ctx] float[4] Fitted exponential amplitude per context. NaN if insufficient coverage.
baseline_ct5_by_upstream[ctx] float[4] Interior baseline rate per context.
cov_ct5_terminal_by_upstream[ctx] / cov_ct5_interior_by_upstream[ctx] float[4] Coverage used in the per-context fit.
dipyr_contrast float ½(d_CC + d_TC) − ½(d_AC + d_GC). Positive for dipyrimidine-biased (UV-like) damage.
cpg_contrast float d_GC − ⅓(d_AC + d_CC + d_TC). Positive for CpG-dominant (methylation-driven) damage.
context_heterogeneity_chi2 float Chi-squared statistic (df = 3) for uniformity across the four contexts.
context_heterogeneity_p float Ladder-quantized p-value (0.9, 0.5, 0.1, 0.05, 0.01, 0.001).
context_heterogeneity_detected bool chi2 > 7.81 (p < 0.05).

Validation and reliability flags

Field Type Description
damage_validated bool Joint model evidence supports genuine terminal deamination
damage_artifact bool Channel A-like enrichment is present, but joint evidence indicates composition/artifact rather than real damage
terminal_inversion bool Summary flag: terminal damage rate < interior at either end
inverted_pattern_5prime / inverted_pattern_3prime bool End-specific terminal depletion flags (full window below interior)
position_0_artifact_5prime / position_0_artifact_3prime bool Sharp position-0 artifact: C→T at p=0 < p=1 (5′) or isolated G→A spike at p=0 (3′). Indicates adapter/ligation artifact at the terminal nucleotide. Independent of hexamer_excess_tc, which tracks full-window chemistry bias.
composition_bias_5prime / composition_bias_3prime bool Control channel rises with the damage channel, suggesting compositional rather than damage-driven enrichment
is_valid() bool n_reads >= 1000
is_detection_unreliable() bool true when inversion or composition-bias flags indicate unreliable reference-free detection

Utility functions

// Validation state: VALIDATED, CONTRADICTED, or UNVALIDATED
taph::DamageValidationState state = taph::get_damage_validation_state(profile);

// Suppression factor for downstream damage-aware deduplication
float factor = taph::get_damage_suppression_factor(profile);
// 1.0 = validated, 0.5 = unvalidated, 0.0 = artifact

Complement-asymmetry oxidative channels (F, G, H)

These fields populate the complement_asymmetry JSON block. Each channel computes a binomial z-score comparing the trinucleotide-context stop rate in a terminal window (positions p0–4, adapter-adjusted) against a far-interior reference (pos ≥ 30, ≥ 50 counts; mid-read fallback pos 5–14). libtaph stores raw z-scores; threshold application is left to the consumer.

See Damage types — Channels F, G, H for the biochemical background, trinucleotide context lists, and the adapter-skip gate logic.

Channel F — C→A (bottom-strand 8-oxoG complement)

Pre-contexts: TCA/TCG/TAC/TGC. Stop contexts: TAA/TAG/TGA.

Field Type Description
channel_f_valid bool ≥ 200 total trinucleotide counts across pre + stop
channel_f_z float Binomial z-score, 5′ terminal vs interior; negative = depletion
ca_stop_rate_terminal float Stop rate in terminal window
ca_stop_rate_interior float Stop rate in mid-read interior (fallback baseline)
ca_stop_rate_baseline float Stop rate in far-interior reference (pos ≥ 30)
ca_uniformity_ratio float Terminal / baseline ratio; > 1 indicates enrichment
channel_f3_valid bool ≥ 200 counts for 3′ end
ca_stop_rate_terminal_3prime float 3′ terminal stop rate
ca_stop_rate_interior_3prime float 3′ interior stop rate
ca_stop_rate_baseline_3prime float 3′ far-interior baseline
ca_uniformity_ratio_3prime float 3′ terminal / baseline ratio
ca_pre_interior uint64_t Raw pre-context count, far-interior (pos ≥ 30)
ca_stop_interior uint64_t Raw stop-context count, far-interior

Channel G — C→G (further-oxidized guanine: Gh, Sp)

Pre-contexts: TCA/TAC. Stop contexts: TGA (from TCA), TAG (from TAC).

Field Type Description
channel_g_valid bool ≥ 200 total counts
channel_g_z float Binomial z-score, 5′ terminal vs interior
cg_stop_rate_terminal float Terminal stop rate
cg_stop_rate_interior float Interior stop rate
cg_stop_rate_baseline float Far-interior baseline
cg_uniformity_ratio float Terminal / baseline ratio
channel_g3_valid bool ≥ 200 counts for 3′ end
cg_stop_rate_terminal_3prime float 3′ terminal stop rate
cg_stop_rate_interior_3prime float 3′ interior stop rate
cg_stop_rate_baseline_3prime float 3′ far-interior baseline
cg_uniformity_ratio_3prime float 3′ terminal / baseline ratio
cg_pre_interior uint64_t Raw pre-context count, far-interior
cg_stop_interior uint64_t Raw stop-context count, far-interior

Channel H — A→T (empirical; mechanism uncertain)

Pre-contexts: AAA/AAG/AGA. Stop contexts: TAA/TAG/TGA.

Field Type Description
channel_h_valid bool ≥ 200 total counts
channel_h_z float Binomial z-score, 5′ terminal (positions p0_h5–4) vs interior
channel_h_z_p2plus float Same but terminal window restricted to positions 2–4; more robust because position 0 carries a lower background A→T rate
fgh_adapter_prefixes_excluded uint32_t Number of 5′ hexamer prefix bins excluded from F/G/H terminal counts; 0 = no adapter stubs detected
at_stop_rate_terminal float Terminal stop rate
at_stop_rate_interior float Interior stop rate
at_stop_rate_baseline float Far-interior baseline
at_uniformity_ratio float Terminal / baseline ratio
channel_h3_valid bool ≥ 200 counts for 3′ end
at_stop_rate_terminal_3prime float 3′ terminal stop rate
at_stop_rate_interior_3prime float 3′ interior stop rate
at_stop_rate_baseline_3prime float 3′ far-interior baseline
at_uniformity_ratio_3prime float 3′ terminal / baseline ratio
at_pre_interior uint64_t Raw pre-context count, far-interior
at_stop_interior uint64_t Raw stop-context count, far-interior

Typical usage patterns

Streaming with thread-parallel accumulation

#include <taph/frame_selector_decl.hpp>
#include <thread>

// Per-thread profiles
std::vector<taph::SampleDamageProfile> partial(n_threads);
for (auto& p : partial) taph::FrameSelector::reset_sample_profile(p);

// Fill partial[i] in parallel...

// Merge on main thread
taph::SampleDamageProfile final_profile;
taph::FrameSelector::reset_sample_profile(final_profile);
for (const auto& p : partial)
    taph::FrameSelector::merge_sample_profiles(final_profile, p);

taph::FrameSelector::finalize_sample_profile(final_profile);

Forcing library type

taph::SampleDamageProfile profile = /* ... */;
profile.forced_library_type = taph::SampleDamageProfile::LibraryType::DOUBLE_STRANDED;

When forced_library_type != UNKNOWN, downstream code should use the forced type; library_type_auto_detected will be false.


Damage Interpretation (taph/library_interpretation.hpp)

Higher-level functions that operate on a finalized SampleDamageProfile to produce scores, flags, and summaries suitable for reporting.

Hexamer utilities

std::array<char,7> taph::decode_hex(int code);
int taph::encode_hex_at(const std::string& s, int pos);

Encode/decode 12-bit 6-mer codes over the ACGT alphabet.

Adapter stub detection

taph::AdapterStubs taph::detect_adapter_stubs(
    const SampleDamageProfile& dp,
    const uint32_t hex3_terminal[4096],
    uint64_t n_hex3);

Detects 5' and 3' adapter remnants from hexamer enrichment. Pass the 4096-element terminal hexamer histogram (hex3_terminal) from a pre-scan pass alongside n_hex3 (total counts). Returns AdapterStubs with stubs5/stubs3 (up to 5 each), adapter_clipped, adapter3_clipped, and flag_hex_artifact.

3' detection is gated on adapter_clipped (5' stubs must be present first).

void taph::FrameSelector::recompute_fgh_excluding_adapter_prefixes(
    SampleDamageProfile& dp,
    const std::vector<uint32_t>& excl5,
    const std::vector<uint32_t>& excl3);

Re-sums all F/G/H terminal counts excluding reads whose first hexamer code is in excl5 (5′ end) or last hexamer code is in excl3 (3′ end), then recomputes channel_f_z, channel_g_z, channel_h_z, and channel_h_z_p2plus from the filtered counts. Must be called after finalize_sample_profile. Sets fgh_adapter_prefixes_excluded to excl5.size(). Hexamer codes come from taph::encode_hex_at or taph::encode_hexamer. Pass empty vectors to recompute with no exclusion (useful for testing the unfiltered baseline).

Hexamer statistics

taph::HexStats taph::compute_hex_stats(const SampleDamageProfile& dp);

Returns Shannon entropy of terminal/interior hexamer distributions, Jensen-Shannon divergence, and a multinomial G-test (statistic, z-score, p-value) comparing terminal vs interior composition.

Score functions

Function Output Meaning
compute_cpg_score(dp) CpgScore{z, p} log₂(CpG / non-CpG d_max) / SE
compute_oxog_interior_score(dp) OxogInteriorScore{z, p} Chargaff-symmetric G→T excess over 16 trinucleotide contexts
compute_oxog_trinuc(dp) OxogTrinucResult{cosine, n_ctx} Cosine similarity of per-context G→T residuals to empirical 8-oxoG reference
compute_depur_score(dp, is_ss) DepurScore{z, p, z5, z3, shift5, shift3} Conjunction test on terminal A/(A+G) [5'] and T/(T+C) [3'] enrichment

Damage mask

taph::DamageMask taph::compute_damage_mask(
    const SampleDamageProfile& dp, bool is_ss,
    double threshold = 0.05, int min_cov = 100);

Marks the first INTERP_N_POS (15) positions where terminal damage exceeds threshold above background. Returns DamageMask with pos[15], n_masked, and masked_str (comma-separated indices).

Library QC flags

taph::LibraryQcFlags taph::compute_library_qc_flags(
    const SampleDamageProfile& dp, bool is_ss,
    bool flag_hex_artifact, double jsd,
    double h_term, double short_read_frac);

Nine boolean flags covering adapter remnants, hexamer biases, short-read spikes, depurination, absent 3' signal (DS), and inward-displaced G→A.

Preservation summary

taph::PreservationSummary taph::compute_preservation_summary(
    const SampleDamageProfile& dp, bool is_ss,
    bool adapter_clipped, bool flag_hex_artifact,
    double cpg_score_z, double oxog_score_z,
    double oxog_context_cosine, double hex_shift_p);

Combines evidence from all channels into a single preservation assessment. Fields: authenticity_eff, authenticity_evidence, d5_raw, d5_hexamer_corrected, d5_was_corrected, oxidation_eff, oxidation_evidence, qc_risk_eff, qc_evidence, label (e.g. "ancient", "weak", "modern-like").

DamageContextProfile

struct taph::DamageContextProfile {
    enum class DominantProcess {
        None, LowDamage, CytosineDeamination, CpgEnrichedDeamination,
        DipyrimidineBiased, OxidativeLike,
        FragmentationBias, LibraryArtifactLikely
    };
    static constexpr const char* method = "training_free";
    bool reference_required = false;
    bool alignment_required = false;

    float terminal_deamination_score, cpg_context_score;
    float dipyrimidine_context_score, oxidative_context_score;
    float fragmentation_context_score, library_artifact_score;

    DominantProcess dominant_process;
    std::string     dominant_process_str;
    std::string     interpretation;

    struct Evidence { /* d_max_{5,3}, lambda_{5,3}, log2_cpg_ratio, cpg_z,
                         dipyr_contrast, ox_gt_asymmetry, s_oxog_{mean,max},
                         purine_enrichment_5prime, hex_shift_z,
                         adapter_clipped, adapter3_clipped, flag_hex_artifact,
                         position_0_artifact_{5,3}prime, n_reads */ } evidence;
};

const char* taph::to_string(DamageContextProfile::DominantProcess p);

taph::DamageContextProfile taph::compute_damage_context_profile(
    const SampleDamageProfile& dp,
    double cpg_z, double hex_shift_z,
    bool adapter_clipped, bool adapter3_clipped, bool flag_hex_artifact);

Training-free, reference-free summary aggregator. Six scores are in [0, 1] (or NaN when not evaluable); dominant_process is a deterministic rule over the scores. Underlying raw numbers are mirrored into evidence for auditing and re-normalisation by downstream tools. See methods.md for score formulas and the rule.


Length-stratified profile (taph/length_stratified_profile.hpp)

Up to four per-length SampleDamageProfile instances with shared accumulate / merge / finalize semantics.

struct taph::LengthBinStats {
    static constexpr std::size_t MAX_BINS = 4;
    std::array<SampleDamageProfile, MAX_BINS> profiles;
    std::vector<int> edges;
    std::size_t n_bins = 1;
    SampleDamageProfile::LibraryType forced_library_type;

    void configure(const std::vector<int>& new_edges);
    void update(std::string_view seq, int length);
    void merge(const LengthBinStats& other);
    void finalize_all();
    std::size_t bin_index(int length) const;
};

configure(edges) validates that edges are strictly increasing and at most three splits (four bins) and sets up the per-bin profiles. update routes a read by bin_index(length) and accumulates into that bin's profile. merge requires matching edges. finalize_all runs FrameSelector::finalize_sample_profile on every populated bin; forced_library_type propagates to each bin if set.

Automatic length-bin edges (taph/log_length_gmm.hpp)

taph::LogLengthGmmResult taph::detect_log_length_gmm_edges(
    const std::vector<uint64_t>& histogram,
    double log_min, double log_max,
    int min_length, int max_length,
    int max_components = 4);

std::vector<int> taph::detect_quantile_length_edges(
    const std::vector<uint64_t>& histogram,
    double log_min, double log_max,
    int min_length, int max_length,
    int n_bins);

detect_log_length_gmm_edges fits a 1D Gaussian mixture on a log-length histogram, selects n_components ∈ [1, max_components] by BIC, and returns split points between adjacent component means. LogLengthGmmResult holds edges, n_components, bic, and converged. detect_quantile_length_edges is the deterministic fallback when the GMM fails to separate modes; it splits the histogram into equal-count quantile bins.

Typical wiring:

auto hist = /* log-length histogram from a pre-scan */;
auto gmm = taph::detect_log_length_gmm_edges(hist, log_min, log_max, lmin, lmax);
std::vector<int> edges = gmm.converged && gmm.n_components > 1
    ? gmm.edges
    : taph::detect_quantile_length_edges(hist, log_min, log_max, lmin, lmax, 3);

taph::LengthBinStats stats;
stats.configure(edges);
for (auto& r : reads) stats.update(r.seq, r.length);
stats.finalize_all();

Length × GC joint mixture (taph/length_gc_joint_mixture.hpp)

taph::LengthGcJointMixtureResult
taph::fit_length_gc_joint_mixture(const LengthBinStats& stats);

Shared-component 2-Gaussian mixture over the flat length × GC cell grid of a finalized LengthBinStats. Each cell contributes its d_max and c_sites. Returns:

Field Type Description
d_ancient double Damaged-component mean, shared across all cells.
pi_ancient double Damaged-component mixing weight.
d_population double c_sites-weighted mean d_max across all cells.
converged bool EM convergence state.
separated bool true if the fit produced a distinct damaged component.
cell_w_ancient[b][g] vector<array<double, N_GC_BINS>> Posterior P(damaged | cell) per (length bin, GC bin). -1.0 for cells with insufficient C-sites. Empty when !separated.

Fixed priors match DamageMixtureModel: μ₀ = 0, τ₀ = 0.01, τ₁ = 0.10, per-cell sigma floor 0.02. Use d_ancient as a single shared amplitude and cell_w_ancient[b][g] to recover per-length-bin ancient fractions inside each GC column.


Hexamer tables and domain classifier (taph/hexamer_tables.hpp)

Eight pre-computed 4096-bin hexamer frequency tables plus a softmax domain classifier that scores a read against all of them.

enum class taph::Domain {
    META, GTDB, FUNGI, PROTOZOA, INVERTEBRATE,
    PLANT, VERTEBRATE_MAMMALIAN, VERTEBRATE_OTHER, VIRAL
};

Domain::GTDB covers bacteria + archaea. Domain::META is the default for ancient metagenomes and selects the ensemble-of-all-domains path. parse_domain(name) accepts "meta" | "metagenome" | "all" (all three map to Domain::META), "gtdb" | "bacteria" | "archaea", "fungi", "protozoa", "invertebrate", "plant", "mammal" | "vertebrate_mammalian", "vertebrate" | "vertebrate_other", "viral", and falls back to Domain::META on anything unrecognised. domain_name(d) gives the canonical string.

Encoding and table lookup

uint32_t taph::encode_hexamer(const char* seq);                         // 12-bit code, UINT32_MAX on N
float    taph::get_hexamer_freq(uint32_t code);                         // active domain
float    taph::get_hexamer_freq(uint32_t code, Domain d);
float    taph::get_hexamer_freq(const char* seq);                       // active domain
float    taph::get_hexamer_freq(const char* seq, Domain d);
float    taph::get_hexamer_score(const char* seq);                      // log2(freq / uniform)
float    taph::get_hexamer_score(const char* seq, Domain d);

get_hexamer_score takes a sequence pointer (there is no uint32_t overload), returns log2(freq / (1/4096)), and floors at -10.0 for unseen hexamers.

Active domain and ensemble mode

void          taph::set_active_domain(Domain d);
taph::Domain& taph::active_domain();             // reference to the global
taph::Domain  taph::get_active_domain();         // value accessor
void          taph::set_ensemble_mode(bool enabled);
bool          taph::get_ensemble_mode();
void          taph::set_domain_probs(const MultiDomainResult& r);
taph::MultiDomainResult& taph::get_domain_probs();
float         taph::get_ensemble_hexamer_freq(uint32_t code);

set_active_domain writes a global Domain (not thread-local) that gates the default-table overloads; read it via get_active_domain(). The ensemble flag (ensemble_mode_enabled()) and the posterior buffer (current_domain_probs()) are thread-local, so per-thread pipelines can carry their own blend. With ensemble mode off, get_ensemble_hexamer_freq falls through to the active-domain table. With ensemble mode on, it returns the posterior-weighted average of the eight tables using the probabilities written by set_domain_probs.

Domain classifier

struct taph::MultiDomainResult {
    float gtdb_prob, fungi_prob, protozoa_prob, invertebrate_prob;
    float plant_prob, vertebrate_mammalian_prob, vertebrate_other_prob, viral_prob;
    Domain best_domain;
    float  best_score;
};

taph::MultiDomainResult taph::score_all_domains(const std::string& seq, int frame);

Sums log2(freq[h] * 4096 + 1e-10) across every in-frame hexamer for each of the eight domains, then normalises with temperature-scaled softmax (exp(0.1 * (score - max))). best_domain is the argmax and best_score is the pre-softmax log-sum for that domain.

Typical ensemble flow for mixed-community samples:

auto probs = taph::score_all_domains(read, /*frame=*/0);
taph::set_domain_probs(probs);
taph::set_ensemble_mode(true);
// Subsequent get_ensemble_hexamer_freq(code) returns the posterior-weighted
// background rather than committing to one domain table.

Hexamer fields on SampleDamageProfile

Raw 4096-bin terminal and interior histograms populated during accumulation and consumed by the interpretation functions in taph/library_interpretation.hpp.

Field Type Description
hexamer_count_5prime[4096] double[4096] Hexamer counts at 5' positions 0-5.
hexamer_count_interior[4096] double[4096] Hexamer counts at the interior (middle third).
n_hexamers_5prime size_t Total counted terminal hexamers.
n_hexamers_interior size_t Total counted interior hexamers.
hexamer_terminal_tc float T/(T+C) ratio at 5′ terminal positions across all C/T hexamer pairs.
hexamer_interior_tc float T/(T+C) ratio at interior positions (same hexamer pairs).
hexamer_excess_tc float hexamer_terminal_tc − hexamer_interior_tc. Library chemistry / priming bias metric — not an adapter remnant signal. A negative value means the library prep preferentially produces C-starting read starts (e.g. SS ligation junction composition), which enriches C-context trinucleotides in the channels F and G pre-count denominator and suppresses their z-scores toward negative values. Adapter remnants are separately tracked by adapter_clipped, flag_hex_artifact, and position_0_artifact_*.
hexamer_damage_llr float Log-likelihood ratio of T-enrichment over C-enrichment at 5′ terminal hexamers vs interior. Positive = T-starting hexamers enriched (consistent with deamination). Used to detect inverted_pattern_5prime when < −0.02 and no p0 artifact.

Paired with the 3' terminal hexamer histogram produced by the pre-scan (passed to detect_adapter_stubs as hex3_terminal[4096] with n_hex3). The 3' histogram is kept out of SampleDamageProfile so it does not bloat per-thread state in accumulators that do not need it.