Skip to content

Interpreting libtaph output

This page is a practical guide for researchers reading a SampleDamageProfile or its JSON serialisation. It does not document every field — the API reference does that — but explains what the headline numbers mean, what counts as a strong or weak signal, and how to combine fields into a defensible call about a sample. For the underlying biology and statistics see Damage types and Methods.

A useful default rule of thumb: never look at one number in isolation. libtaph emits a dozen partly redundant statistics so that compositional artefacts and genuine damage can be distinguished. A high d_max_5prime on its own is suggestive; the same number with a positive Channel B validation, an elevated F/G/H trio, and a clean QC flag panel is a confident call.

1. Core deamination metrics

d_max_5prime and d_max_3prime

These are the calibrated terminal C→T excess at the 5′ end and the calibrated terminal G→A excess at the 3′ end (or terminal C→T at the 3′ end for single-stranded original-orientation libraries). They are computed as A / (1 − b) where A is the fitted exponential amplitude above the interior background and b is the anchored background rate. Physically they answer: "by how much does the cytosine-deamination–driven substitution rate at the very end of the read exceed what the interior of the same library shows?"

A working interpretation scale, for chemistry-untreated libraries:

d_max value Interpretation
≥ 0.20 Strong signal — well-preserved aDNA, often Pleistocene
0.08 – 0.20 Moderate — typical Holocene aDNA
0.02 – 0.08 Weak — borderline; needs cross-channel support
< 0.02 Effectively absent — modern, USER-treated, or contaminated

These thresholds are heuristics, not hard cutoffs. When the library has been chemically treated (e.g. USER, single-stranded), a moderate d_max may still be meaningful — see the source string returned by d_max_source_str().

lambda_5prime and lambda_3prime

lambda is the decay constant of the fitted exponential. Larger λ means the elevated rate falls back to background more steeply, i.e. damage is concentrated in the first one or two terminal positions. Lower λ means a slow tail of damage extending several bases inward, which is characteristic of older, more heavily degraded material. As a guideline, λ > 0.5 is sharp (fresh or chemically-driven), λ ≈ 0.2–0.4 is typical of well-preserved aDNA, and λ < 0.1 indicates long damage tails seen in highly degraded samples.

d_max_combined versus the per-end values

d_max_combined is the asymmetry-aware single-number summary; the string returned by d_max_source_str() records how it was derived. Use the combined value when reporting a single damage number per library. Use the individual d_max_5prime / d_max_3prime when comparing the two ends, when diagnosing library type, or when one end is known to be compromised by adapter remnants or end-repair artefacts.

If d_max_source_str() returns "5prime_only", "3prime_only", or "none", libtaph has decided that one or both ends are inverted (terminal rate below interior) and should not be averaged. Treat the combined value with care in those cases.

bg_5prime and bg_3prime (interior baseline)

These are the interior C→T (5′) and G→A (3′) rates used as the baseline b in the exponential fit (the field referred to as "fit baseline" in some downstream tools). They are not damage measurements — they reflect the ambient sequencing-error and sequence-composition rate. A baseline of 0.002–0.005 is typical for clean Illumina data; baselines above ~0.01 suggest elevated sequencing error, mapping artefacts, or — in the context of an aDNA workflow — a substantial fraction of modern contamination diluting the ancient signal. Use bg_5prime and bg_3prime as a modern-contamination proxy alongside the d_max values: high d_max with low baseline is a clean ancient call, while moderate d_max with elevated baseline is suspicious.

2. Library type classification (DS vs SS)

The library_type field takes values DOUBLE_STRANDED, SINGLE_STRANDED, or UNKNOWN. The classifier examines which terminal sub-channels (ct5, ga3, ct3, ga0) are active under the seven-model BIC selection described in Damage types.

In a double-stranded library (NEBNext, TruSeq), strand-overhang deamination produces C→T at the 5′ end and G→A at the 3′ end. The 3′ end shows G→A rather than C→T because the complementary strand of a damaged C-overhang carries a G, and that strand is what is sequenced from the 3′ direction.

In a single-stranded library (Gansauge & Meyer, Santa Cruz), both ends of the original strand survive intact through circularisation, so the underlying C→T deamination is visible at both the 5′ and 3′ ends. Consequently d_max_5prime and d_max_3prime should be of similar magnitude in an SS library, whereas a DS library only shows the 5′ C→T directly (the 3′ value reports G→A on the opposite strand).

UNKNOWN means no sub-channel beat the null model. This can happen legitimately in modern libraries with no damage, but in a sample that should be ancient it is a warning. Check is_valid() first: if n_reads < 1000 the classifier has not had enough evidence to commit. Below that threshold the call is unreliable and downstream tools should treat the type as missing rather than as UNKNOWN.

The flag library_type_auto_detected distinguishes a classifier call from a user-forced override. If you have set forced_library_type on the profile, honour it; otherwise inspect the auto-detected value.

3. Damage validation through cross-channel evidence

libtaph cross-validates Channel A (terminal C→T excess) against Channel B (stop-codon-context C→T) to separate genuine deamination from composition-driven enrichment.

  • damage_validated == true — Channel B independently confirms the Channel A signal. The terminal enrichment is consistent with cytosine deamination acting in stop-codon-bearing contexts.
  • damage_artifact == true — Channel A enrichment is present, but Channel B contradicts it. The terminal excess is driven by sequence composition (often GC-rich genomes or mapping artefacts), not deamination. A sample with damage_artifact == true should not be reported as ancient on the strength of d_max alone. Re-examine read lengths, mapping quality, and the F/G/H channels.
  • Neither flag set — there is too little Channel B coverage to corroborate or refute, or the Channel A signal is too weak to test. The result is inconclusive and downstream interpretation rests on Channel A alone.

composition_bias_5prime / composition_bias_3prime fire when the control channel rises in lockstep with the damage channel as the read terminus is approached. That co-elevation is the signature of a compositional artefact (GC-rich material driving both). When set, the corresponding terminal damage call is unreliable even if d_max looks reasonable.

position_0_artifact_5prime / position_0_artifact_3prime indicate a depleted position-0 with downstream enrichment, the typical fingerprint of adapter or ligation residue. Genuine deamination in a DS library produces a peak at position 0, not a depletion there; a depleted p0 with rising p1 through p4 is almost always a clipping or ligation artefact. When this flag is set, libtaph automatically shifts the fit window inward and (in DS libraries with flag_hex_artifact) reports a hexamer-corrected d5_hexamer_corrected extrapolated back to position 0.

4. Oxidative damage channels (F, G, H)

Channels F, G, and H detect oxidative damage through terminal trinucleotide context z-scores against the interior baseline. All three are reported as binomial z-scores (channel_f_z, channel_g_z, channel_h_z).

A practical scale for the 5′ z-scores:

channel_*_z Interpretation
~0 No detectable enrichment over interior
1–2 Borderline; coverage-limited noise
3–5 Detectable signal — the conventional z > 3 threshold
5–10 Strong signal — typical of oxidatively damaged ancient material
> 10 Very strong, but check for compositional drivers

A z-score of 5 for Channel F combined with Channel H near zero is consistent with GC-rich bacterial DNA driving the C-context channels through composition rather than oxidation. A z-score of 5 for F and H indicates the enrichment extends to AT-context channels too, which composition alone cannot explain.

The co-elevation pattern F+ G+ H+ is the diagnostic signature of authentic oxidative aDNA. All three channels rise together because 8-oxoG-mediated miscoding affects many trinucleotide contexts. Channel H (A→T) is particularly informative: its pre-contexts are AT-rich (AAA, AAG, AGA), so unlike F and G it cannot be inflated by GC-rich composition. A pattern of F+ G+ with H ≈ 0 is therefore diagnostic of compositional bias, most commonly GC-rich bacterial contamination (e.g. Pseudomonas) rather than genuine oxidative damage on the host material.

For Channel H specifically, prefer channel_h_z_p2plus over channel_h_z when the position-0 background A→T rate is depressed. The p2plus variant restricts the terminal window to positions 2–4 and avoids dilution by the low p0 rate; downstream consumers commonly use max(channel_h_z, channel_h_z_p2plus).

The per-channel *_uniformity_ratio (ca_uniformity_ratio, cg_uniformity_ratio, at_uniformity_ratio) is the terminal-to-baseline ratio. A ratio above 1 indicates terminal enrichment over interior; below 1 indicates depletion. Values near 1 mean no terminal effect and should be read together with the corresponding z-score before drawing a conclusion.

The 3′ counterparts (channel_f3_z, channel_g3_z, channel_h3_z, together with their *_3prime uniformity ratios) are typically weaker than their 5′ counterparts. This is partly because 3′ end-repair processes mask some terminal positions, and partly because adapter trimming on the 3′ side removes some of the terminal context. Treat 3′ z-scores below the 5′ values by a factor of 2–3 as normal; only worry when the 3′ signal is higher than the 5′ signal, which often signals an inversion or trimming artefact.

5. Channel C/D — 8-oxoG context diagnostics

Channels C and D summarise the strand-asymmetry of G→T versus C→A at interior positions, the canonical 8-oxoG signature.

ox_gt_uniformity is the terminal-to-interior ratio of the G→T rate. Values near 1.0 indicate that the G→T enrichment is interior-dominated: the signal is interpretable as 8-oxoG rather than as a terminal deamination artefact. Values substantially above 1.0 mean the G→T rate spikes at the terminus, which is more consistent with a terminal compositional effect than with bulk oxidative damage.

ox_gt_asymmetry is the strand contrast ox_gt_baseline − ox_ca_baseline. Positive values (G→T > C→A) are the expected sign for genuine 8-oxoG, because oxidation of guanine on one strand is read as G→T and has no corresponding C→A counterpart on the other strand of a pooled DS library. Negative values point the other way and usually indicate a sequencing or composition artefact rather than oxidation.

s_oxog_16ctx reports the per-context G→T asymmetry across the 16 trinucleotide contexts NGN. It is informative when coverage is high (reported as non-null when the per-context coverage exceeds 500). A broad enrichment across most contexts is consistent with genuine oxidative damage. A spike confined to one or two contexts (typically GG-flanked) is characteristic of modern oxidative chemistry rather than diffuse aDNA oxidation; it can also be noise when coverage is borderline. When s_oxog_16ctx is null or only a couple of contexts are populated, treat the panel as uninformative and rely on the F/G/H z-scores instead.

6. Channel E — depurination and AP-site fragmentation

Channel E captures the AP-site-mediated fragmentation pattern: ancient DNA preferentially breaks 3′ of purine residues (A and G), so reads from fragmented aDNA have purines enriched at the position immediately 5′ of the fragmentation site.

purine_enrichment_5prime is the headline statistic — the excess of A+G in the −1 / 5′-flanking position over what a uniform composition would predict.

  • Positive enrichment (e.g. > 0.05) is the expected aDNA signature; values above 0.15 are strong evidence of AP-site-mediated fragmentation.
  • Near-zero enrichment is consistent with random shearing, USER treatment, or modern DNA where AP-site chemistry has not had time to operate.
  • Negative enrichment is unusual and typically indicates a sonication-driven or otherwise non-biological fragmentation profile.

Channel E is independent of the deamination channels. A sample can show strong fragmentation enrichment with weak d_max (USER-treated aDNA, or material where deamination has saturated and lost its terminal asymmetry). In the dominant-process rule (see below), a strong Channel E with weak deamination is what produces the fragmentation_bias label.

7. Damage-context profile and dominant_process

The dominant-process classifier emits a single label summarising what process best describes the library, alongside six normalised scores in [0, 1] (or NaN when unevaluable). The label is assigned by a deterministic rule top-to-bottom; the first match wins. See Methods for the exact thresholds.

Label Implication
cytosine_deamination The default ancient-DNA call: terminal C→T deamination dominates with no contradictory evidence.
cpg_enriched_deamination Deamination is amplified at CpG sites; consistent with vertebrate methylation. Suggests host-mammalian aDNA.
dipyrimidine_biased Damage is enriched at CC/TC dipyrimidine contexts; a UV-photoproduct signature seen in surface-exposed material (skin, sun-exposed bone fragments).
oxidative_like Oxidative-channel score is high relative to terminal deamination. May reflect long-term storage oxidation, post-mortem oxygen exposure, or — if F+G+ but H≈0 — bacterial-composition contamination masquerading as oxidation.
fragmentation_bias Strong purine enrichment with weak terminal deamination. Often seen in chemically-treated libraries or very old material where deamination has saturated.
library_artifact_likely Adapter / hexamer / position-0 evidence is present and terminal deamination is below the threshold for a damage call. The library is unreliable; investigate trimming and re-prep.
low_damage Terminal deamination is essentially zero. Either modern DNA, a USER-treated library, or contamination so heavy that ancient signal is diluted.
none Insufficient reads (< 1000) or no finite d_max at either end. The classifier abstains.

The six scores let downstream tools re-normalise without rescanning. A library with terminal_deamination_score = 0.85 and library_artifact_score = 0.05 is a confident damage call; the reverse ratio is a confident artefact call.

8. GC-stratified mixture model

For samples that are mixtures of ancient and modern DNA, the per-read exponential summary in Channel A is a population average and can mislead. The two-component mixture over the GC × length cell grid disentangles them.

  • mixture_pi_ancient — the fraction of C-sites in the high-damage component. A value near 1.0 means almost all reads carry damage; near 0.0 means almost none do; intermediate values indicate a true mixture.
  • mixture_d_ancient — the expected damage rate among the reads assigned to the ancient component (i.e. the damage rate conditional on being ancient). This is the number to compare against d_max from clean libraries.

When the mixture model and d_max agree, the library is uniform (ancient or modern; no mixing). When mixture_pi_ancient is intermediate (say 0.2 – 0.6) and mixture_d_ancient is high while d_max_combined is moderate, the library is a mixture: there is a real ancient component, but it is diluted by modern reads. In that case mixture_d_ancient is the better number to report for the ancient fraction, and mixture_pi_ancient is the proxy for the contamination level.

9. QC flags and when to distrust results

Several flags should override an otherwise positive damage call.

  • terminal_inversion — terminal damage rate is below interior at one or both ends. The fit has been forced into a fallback configuration. Likely causes: read-orientation bug upstream, mis-trimmed adapters, or a reverse-complemented BAM. Investigate before reporting any d_max from this profile.
  • hexamer_excess_tc — a library chemistry / priming bias metric, not an adapter remnant signal (see flag_hex_artifact and adapter_clipped for those). It measures T/(T+C) at read starts minus T/(T+C) in the read interior across all matched C/T hexamer pairs. A negative value means the library prep preferentially produces C-starting read starts (e.g. ligation junction composition in SS protocols), enriching C-context trinucleotides in the channels F and G pre-count denominator across the full 5′ terminal window. The effect is false negatives in F/G (suppression toward large negative z-scores), not false positives in Channel A. Interpret F/G z-scores with caution when hexamer_excess_tc < −0.05; the raw value is always reported.
  • short_read_frac — the fraction of reads below the short-read threshold. High values indicate library-prep size selection failure or extreme fragmentation; both reduce the reliability of terminal statistics because end-effects dominate the per-read information budget.
  • flag_hex_artifact / adapter_clipped / adapter3_clipped — direct evidence of adapter remnants. The hexamer-corrected d5 is computed for DS libraries when these conditions co-occur with a position-0 artefact, but the safer call is to re-trim and re-run.
  • is_detection_unreliable() — true when terminal inversion or composition-bias flags fire. Treat every damage call from such a profile with caution; report values only with their flag context.

10. Common patterns in practice

Pattern Likely interpretation
High d_max, damage_validated, F+G+H+ all elevated, DS library Authentic ancient DNA with concurrent oxidative damage. The standard "good ancient sample" call.
High d_max, damage_validated, F+G+ with H ≈ 0, DS library Authentic ancient DNA with minimal oxidative damage; or, in unsupervised data, GC-rich-composition contamination — disambiguate with damage_validated and host coverage.
High d_max in a negative control with F+G+ H ≈ 0 GC-rich bacterial contamination driving Channel A through composition. Not ancient damage.
Moderate d_max with damage_artifact == true Composition artefact — Channel B contradicts Channel A. Report as unreliable.
d_max ≈ 0 and dominant_process == low_damage Modern DNA, USER-treated library, or contamination so heavy that ancient signal is diluted.
SS library with d_max_5prime ≈ d_max_3prime Single-stranded library showing the expected symmetric C→T at both ends.
dominant_process == dipyrimidine_biased UV-driven photoproduct signature. Common in skin, hair, or surface-exposed bone fragments.
dominant_process == fragmentation_bias AP-site-mediated fragmentation dominates over terminal deamination. Typical of USER-treated or heavily chemically processed libraries.
terminal_inversion == true with otherwise high d_max Likely orientation or trimming bug upstream — investigate before trusting the call.
Mixture model: mixture_pi_ancient = 0.3, mixture_d_ancient = 0.18, d_max_combined = 0.06 A genuine ancient signal diluted by ~70% modern contamination. Report mixture_d_ancient as the ancient damage rate.

When in doubt, look at all three layers — the d_max value, the cross-channel validation flags, and the dominant-process label — and require them to agree. libtaph is built so that no single number can be the whole story.