Skip to content

Damage types and channels

libdart-damage scans raw FASTQ reads for six biochemical damage channels and writes a structured JSON report. Damage values are fractions in [0, 1]; multiply by 100 for percentages.

Ancient DNA degradation falls into three broad categories: hydrolytic damage (cytosine deamination; depurination and AP-site fragmentation), oxidative damage (8-oxoguanine formation), and library-preparation artefacts that mimic or confound these signals. Each channel targets one of these processes. Together they cross-validate that the detected signal is genuine ancient damage rather than a composition or preparation artefact.


JSON output reference

Top-level fields

Field Type Description
input string Input FASTQ path
n_reads int Total reads scanned
library_type string "double-stranded", "single-stranded", or "unknown"
library_type_auto bool true if type was auto-detected (not forced by --library-type)
library_type_rescued bool true if a rescue rule overrode the primary BIC winner
damage_status string "present", "weak", or "absent"

deamination block

Cytosine spontaneously deaminates to uracil by hydrolysis of the C4 amino group. In an intact duplex, uracil-DNA glycosylase removes the lesion efficiently. In the single-stranded overhangs that form at the ends of nicked or fragmented ancient molecules, repair is unavailable: the uracil persists and templates thymine during PCR amplification, producing the characteristic C→T excess in ancient DNA reads [Briggs et al. 2007; Lindahl 1993]. The rate of spontaneous deamination in double-stranded DNA is approximately 10⁻¹³ s⁻¹ per cytosine under physiological conditions (~100–500 events per cell per day), and is accelerated by low pH and elevated temperature [Shen et al. 1994; Lindahl 1993]; cold, alkaline burial slows it by orders of magnitude, explaining DNA survival over millennia.

Because only the single-stranded overhang is susceptible, the excess decays exponentially inward from the fragment end [Jónsson et al. 2013]:

\[\delta(p) = b + A \cdot e^{-\lambda p}\]

where \(p\) is distance from the terminus (0-indexed), \(b\) is the interior background rate, \(A\) is the terminal excess above background, and \(\lambda\) is the decay constant. d_max is defined as \(A/(1-b)\) — the fraction of terminal cytosine sites that were converted to thymine, after accounting for the background T proportion.

Damage estimates

Field Description
d_max_5prime C→T excess at the 5′ terminus above the interior background
d_max_3prime G→A excess at the 3′ terminus (DS) or C→T excess (SS original-strand) above background
d_max_combined Best single damage estimate for this library; source records how it was derived
d_metamatch GC-composition-weighted estimate comparable to metaDMG's D_max [Michelsen et al. 2022]
source How d_max_combined was derived — see table below
lambda_5prime Exponential decay rate at the 5′ end; larger values mean shorter overhang length
lambda_3prime Exponential decay rate at the 3′ end
bg_5prime Interior T/(T+C) baseline at the 5′ end; reflects sequence composition, not damage
bg_3prime Interior A/(A+G) baseline at the 3′ end

source values:

Value Meaning
"5prime_only" 3′ signal inverted by adapter artefact; d_max_combined = d_max_5prime
"3prime_only" 5′ signal inverted; d_max_combined = d_max_3prime
"average" Both ends valid; weighted average or mixture model
"max_ss_asymmetry" SS library with unequal end amplitudes; max(d_max_5, d_max_3)
"min_asymmetry" DS with high end asymmetry; conservative min(d_max_5, d_max_3)
"channel_b_structural" Channel B stop-codon estimate; used when more reliable than Channel A
"none" Both ends inverted; no recoverable signal; d_max_combined = 0

Adapter remnants of 1–2 bp can depress T/(T+C) at position 0, hiding the biological signal. When this is detected, the fit window is shifted automatically and d_max_combined already reflects the correction. Do not read per_pos_5prime_ct[0] directly for such libraries.

Validation flags

Channel B tests the same C→T deamination through stop-codon context (CAA/CAG/CGA → TAA/TAG/TGA), which is insensitive to GC-composition bias. It provides a composition-independent cross-check on Channel A.

Field Description
validated Channel B corroborates Channel A deamination
artifact Channel A signal present but Channel B contradicts it; likely GC-composition artefact rather than genuine damage

Per-position rates

Field Description
per_pos_5prime_ct[15] Raw T/(T+C) at 5′ positions 0–14
per_pos_3prime[15] A/(A+G) at 3′ positions 0–14 for DS; T/(T+C) for SS original-strand

Values of -1 indicate insufficient coverage at that position.

cpg_like sub-block — methylation signal

When cytosine is methylated at the C5 position (5-methylcytosine, 5mC), its deamination product is thymine rather than uracil [Shen et al. 1994]. Deamination of 5mC in double-stranded DNA occurs at approximately twice the rate of unmodified cytosine [Shen et al. 1994]. An additional factor is that the thymine product creates a T:G mismatch, which is repaired by thymine-DNA glycosylase (TDG) less efficiently than the U:G mismatch from cytosine deamination is repaired by uracil-DNA glycosylase [Wiebauer & Jiricny 1990; Neddermann & Jiricny 1994]. In ancient DNA, this produces an enhanced C→T rate specifically at positions where the downstream base is G. A cpg_ratio above 1 indicates methylation-dependent deamination, the expected pattern for vertebrates and most eukaryotes with CpG methylation. Values near 1 indicate either unmethylated CpGs (e.g., plants with low global methylation, many fungi), modern contamination, or signal too eroded to detect context dependence.

Field Description
dmax_ct5_cpg 5′ C→T amplitude in CpG context
dmax_ct5_noncpg 5′ C→T amplitude in non-CpG context
cpg_ratio dmax_cpg / dmax_noncpg; values >1 indicate methylation-enhanced deamination
log2_cpg_ratio log₂(cpg_ratio); 0 means no context dependence
baseline_cpg Interior T/(C+T) at CpG positions
baseline_noncpg Interior T/(C+T) at non-CpG positions
cov_terminal_cpg Total T+C observations at CpG terminal positions
cov_terminal_noncpg Total T+C at non-CpG terminal positions
effcov_terminal_cpg Effective coverage: cov × (1 − baseline)
effcov_terminal_noncpg Effective coverage: cov × (1 − baseline)

All fields are null when signal is insufficient or the amplitude estimate hits its boundary.


bic block — library-type classifier

Library preparation determines which damage channels are active, so the BIC model selection that drives classification also reveals which sub-channels contributed signal. This block exposes the classifier's evidence for inspecting borderline calls.

Double-stranded (DS) library protocols (e.g., NEBNext, TruSeq [Meyer & Kircher 2010]) ligate adapters to both strands. Reads from the original strand show C→T at the 5′ end; reads from the complementary strand map the same damaged cytosines as G→A at the 3′ end. Single-stranded (SS) protocols (e.g., Gansauge & Meyer [2013]; Santa Cruz [Kapp et al. 2021]; SRSLY [Gansauge et al. 2017]) circularize and sequence only one strand per molecule, producing library-type-specific patterns depending on which strand is captured (see Library-type classification).

The classifier fits four sub-channel amplitudes simultaneously:

  • ct5 — 5′ C→T exponential decay (deamination on the sequenced strand)
  • ga3 — 3′ G→A exponential decay (deamination on the complementary strand, appearing as G→A when read 5′→3′)
  • ga0 — 3′ position-0 G→A spike (ligation-site pattern in SS complement-orientation libraries; also produced by certain DS end-repair protocols)
  • ct3 — 3′ C→T decay (SS original-orientation only, where the same strand carries damage at both ends)

These are evaluated under seven biological models: a null (all channels flat), two DS models (symmetric and with end-repair spike), a DS end-repair-only model, and three SS models (complement, original, and mixed orientations). The model with the lowest BIC wins [Schwarz 1978].

Field Description
bias BIC of the null model (all channels flat; no damage)
ds BIC of the best-fitting DS model
ss BIC of the best-fitting SS model
ct5_amp Fitted ct5 amplitude
ga3_amp Fitted ga3 amplitude
ga0_amp Fitted ga0 amplitude (3′ position-0 spike)
ct3_amp Fitted ct3 amplitude

Lower BIC = better fit. library_type is "double-stranded" when ds < ss, "single-stranded" when ss < ds, and "unknown" when neither beats bias.


complement_asymmetry block — oxidative damage (Channels C and D)

Guanine is the most easily oxidised DNA base owing to its low one-electron reduction potential [Steenken & Jovanovic 1997]. Reactive oxygen species — primarily hydroxyl radical (•OH) and singlet oxygen (¹O₂) — attack the C8 position of guanine to form 7,8-dihydro-8-oxoguanine (8-oxoG) [Cadet et al. 2010]. The replicative polymerase misreads 8-oxoG as adenine via a syn conformation, inserting dAMP opposite the lesion and producing G→T transversions in the forward strand and C→A in the reverse strand [Shibutani et al. 1991]. Unlike cytosine deamination, which is confined to single-stranded overhangs, oxidative damage accumulates throughout the molecule with no strong positional preference.

Chargaff's first rule and the strand asymmetry principle. Chargaff's first rule states that in a double-stranded molecule [G] = [C] and [A] = [T] across complementary strands [Chargaff 1950]. As a consequence, if oxidation affects both strands symmetrically, the G→T rate in forward-strand reads exactly equals the C→A rate in reverse-strand reads, and they cancel in a pooled DS library; s_gt and D are near zero even when oxidation is substantial. These statistics become informative when oxidation is strand-asymmetric (one strand preferentially oxidised during burial) or when only one strand is present (SS libraries, where no complementary strand exists to provide the cancelling C→A signal) [Mitchell & Bridge 2006].

Rates and strand asymmetry

Field Description
tg_interior G→T rate in the read interior (background)
ac_interior C→A rate in the read interior
tg_terminal G→T rate at 5′ terminal positions
ac_terminal C→A rate at 5′ terminal positions
gt_bg_fitted Fitted uniform background from the G→T model
gt_term_fitted Fitted terminal amplitude
gt_decay_fitted Fitted decay constant
s_gt G→T vs C→A strand asymmetry contrast; near zero for balanced DS oxidation, informative for SS
D Overall 8-oxoG asymmetry index
per_pos_5prime_gt[15] Raw G→T fraction at 5′ positions 0–14

channel_c_detected: true when G→T stop-codon conversions (GAG/GAA/GGA → stop) are detected at levels consistent with uniform oxidation throughout the read rather than terminal enrichment from co-occurring deamination.

8-oxoG 16-context panel

G→T asymmetry split by flanking trinucleotide context (NGN, all 16 combinations). The context index is 4 × enc(left) + enc(right) where enc(A)=0, C=1, G=2, T=3.

Field Description
s_oxog Overall G→T strand asymmetry at interior positions
se_s_oxog Standard error of s_oxog
s_oxog_16ctx[16] Per-context G→T asymmetry; null when coverage < 500
cov_oxog_16ctx[16] Coverage per trinucleotide context bin

In modern oxidative chemistry, guanine oxidation in GG-containing contexts is preferentially favoured [Cadet et al. 2010]. A single-context spike confined to one or two bins is more consistent with a preparation artefact than with the broad multi-context enrichment expected from genuine oxidative damage. A spike confined to one or two contexts is more consistent with an oxidation artefact introduced during library preparation.


interior_ct_cluster block — interior co-deamination

Under prolonged exposure conditions, cytosines within single-stranded micro-domains — short stretches that transiently unpair during storage — can be co-deaminated in the same hydrolytic event. This produces co-occurring T residues at nearby positions more often than expected under site independence. The signal is in the read interior (middle third), away from the terminal overhangs that drive Channel A.

Chargaff's second parity rule and the AG control. Chargaff's second (intra-strand) parity rule states that within a single strand, [A] ≈ [T] and [C] ≈ [G] [Rudner et al. 1968]. This symmetry arises from the evolutionary history of symmetric mutation processes acting on the genome. It underpins the AG co-occurrence control used here: if elevated T co-occurrence at adjacent {C,T} sites arose purely from strand composition rather than damage clustering, the G co-occurrence at adjacent {A,G} sites should show a matched elevation. The contrast between the CT and AG tracks therefore isolates genuine deamination clustering from composition effects.

For each pair of non-CpG {C,T} sites at separation d = 1–10 bp, the observed co-T fraction is compared to the expectation under site independence.

Field Description
short_asym_log2oe log₂(CT obs/exp) − log₂(AG obs/exp); composition-corrected effect size
short_log2oe log₂(CT obs/exp) without composition correction
short_z Normalised CT-vs-AG contrast statistic, summed over d = 1–5
short_obs Observed co-T pairs at d = 1–5
short_exp Expected co-T pairs at d = 1–5 under independence
reads_used Reads contributing ≥ 2 eligible non-CpG {C,T} sites
sep_log2oe[10] log₂(obs/exp) at each separation d = 1–10

short_z > 3 with positive short_asym_log2oe indicates genuine interior co-deamination beyond what strand composition predicts. In most aDNA samples this signal is weak; it becomes informative for very old material or samples with unusual interior damage such as permafrost specimens.


depurination block — AP-site fragmentation (Channel E)

The glycosidic bond linking a purine base (adenine or guanine) to the deoxyribose sugar is susceptible to acid-catalyzed hydrolysis. The rate under physiological conditions is approximately 3 × 10⁻¹¹ per purine per second (~2,000–10,000 events per cell per day), substantially faster than cytosine deamination [Lindahl & Nyberg 1972; Lindahl 1993]. The resulting apurinic (AP) site is a metastable abasic residue: β-elimination converts the AP deoxyribose to a strand break, fragmenting the molecule at the site of base loss. Because purines (A and G) are lost preferentially over pyrimidines (C and T), and because fragmentation at an AP site leaves the removed purine's position at the 5′ end of the newly formed fragment, the sequenced 5′ termini are enriched for purines — the signature detected by Channel E. This signal provides evidence of ancient fragmentation independent of deamination, and is particularly informative for samples where deamination is low but preservation conditions are known to be acidic or warm [Dabney et al. 2013].

Field Description
detected true when 5′ purine enrichment is statistically significant
enrichment_5prime Purine excess at 5′ terminal positions relative to interior
enrichment_3prime Purine excess at 3′ terminal positions relative to interior
rate_interior Interior A+G fraction (composition baseline)

Channel E is reported for sample characterisation and is not used by fqdup for position masking.


Library-type classification

Which damage channels are active depends on library preparation. Double-stranded (DS) protocols ligate adapters to both strands, so reads from both the original and complementary strand are sequenced. Single-stranded (SS) protocols circularize and sequence only one strand per molecule; which strand is captured depends on the protocol and ligation orientation.

Library type Active sub-channels Pattern
DS ct5 + ga3 Symmetric exponential C→T at 5′ and G→A at 3′
DS + end-repair artifact ct5 + ga3 + ga0 DS pattern plus isolated 3′ pos-0 spike
SS complement-orientation ga0 Strong 3′ pos-0 G→A spike; 5′ flat
SS original-orientation ct5 + ct3 C→T at both 5′ and 3′ ends; no G→A
SS mixed orientations ct5 + ga0 5′ C→T decay plus 3′ pos-0 spike
UNKNOWN No sub-channel beats the null model

In DS libraries, the symmetric ct5 ≈ ga3 pattern is a consequence of Chargaff's first rule: the original strand contributes C→T at its 5′ end; reads from the complementary strand map the same damaged positions as G→A at the 3′ end. The shared-amplitude DS model (M_DS_symm) enforces this symmetry as a model constraint, which also means it penalises libraries with strongly asymmetric ends.

UNKNOWN is the correct classification for zero-damage or near-zero-damage libraries where the damage pattern contains insufficient information to distinguish library type. fqdup falls back to standard hashing for UNKNOWN libraries, which has no effect when d_max is near zero.


Channel biology reference

Channel Measures Chemistry JSON output
A (ct5/ga3/ga0/ct3) C→T and G→A terminal rates Cytosine deamination deamination, bic
B / B₃′ Stop codon C→T / G→A frequency Deamination in triplet context validated, artifact
C G→T stop codon uniformity 8-oxoG oxidation channel_c_detected
D G→T and C→A transversion rates 8-oxoG oxidation complement_asymmetry
E Purine 5′ enrichment AP-site fragmentation depurination
CpG split C→T amplitude by CpG / non-CpG context Methylation-enhanced deamination cpg_like
Interior clustering Adjacent CT co-occurrence in read interior Clustered interior deamination interior_ct_cluster
8-oxoG 16-ctx G→T asymmetry by trinucleotide context 8-oxoG context specificity s_oxog_16ctx

Channels B–E cross-validate Channel A without contributing to position masking. If Channel A detects deamination but Channel B contradicts it, the signal is flagged as a composition artefact rather than genuine ancient damage.


How fqdup uses damage channels

Position masking

Only Channel A drives position masking. Position p is masked when the damage rate at that position, after subtracting the interior baseline b, exceeds the threshold τ (default 5%). The mask is applied symmetrically at both ends so that canonical hashing remains correct under reverse-complement: min(hash(seq), hash(rc(seq))) requires mask(rc(seq)) = rc(mask(seq)).

Masking by library type

Library type Positions masked
DS 5′ pos p and 3′ pos p symmetrically (ct5/ga3)
DS + artifact Same as DS; ga0 spike falls within the masked region
SS complement 3′ pos 0 only (ga0)
SS original 5′ and 3′ symmetrically (ct5/ct3)
SS mixed 5′ from ct5 + 3′ pos 0 from ga0
UNKNOWN None — standard hashing

Artifact suppression

When artifact: true, Channel A fired but Channel B contradicted it, indicating GC-composition bias rather than genuine damage. fqdup can optionally suppress masking in this case to avoid over-collapsing reads from modern contamination.


Internal analyses (not in JSON)

The following analyses are computed during damage estimation. Their results feed into the reported fields above but are not written to JSON directly.

  • Hexamer detection: T/(T+C) averaged over positions 1–6; provides position-0-bias-resistant corroboration of Channel A that is unaffected by adapter remnants at position 0.
  • Briggs biophysical model: Re-fits Channel A as δ_s (single-stranded overhang rate) and δ_d (double-stranded background rate), following [Briggs et al. 2007]. R² assesses fit quality.
  • Joint probabilistic model: Bayesian comparison of damage-present vs damage-absent using both BIC and Bayes factor. Drives damage_status and provides a fallback estimate for d_max_combined when the exponential fit is unreliable.
  • GC-conditional damage bins: Independent exponential fits across 10 GC-content bins. Separates genuine deamination from GC-composition artefacts and produces the adjusted estimate that feeds into d_metamatch.
  • Mixture model (K-component EM): EM over GC-stratified reads to separate ancient from modern components. When converged, mixture_d_reference feeds into d_max_combined.
  • Codon-position tracking: C→T rate at codon positions 0, 1, 2; supplementary wobble diagnostic.
  • Adapter offset detection: Per-end fit window shift of 1–3 bp when adapter remnants are detected. Applied before computing d_max_combined, which already reflects any correction.

References

Neddermann P, Jiricny J (1994) Efficient removal of uracil from G·U mispairs by the mismatch-specific thymine DNA glycosylase from HeLa cells. Proc Natl Acad Sci USA 91:1642–1646. DOI: 10.1073/pnas.91.5.1642

Briggs AW, Stenzel U, Johnson PLF, et al. (2007) Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA 104:14616–14621. DOI: 10.1073/pnas.0704665104

Cadet J, Douki T, Ravanat J-L (2010) Oxidatively generated base damage to cellular DNA. Free Radic Biol Med 49:9–21. DOI: 10.1016/j.freeradbiomed.2010.03.025

Chargaff E (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia 6:201–209. DOI: 10.1007/BF02173653

Dabney J, Knapp M, Glocke I, et al. (2013) Complete mitochondrial genome sequence of a Middle Pleistocene cave bear reconstructed from ultrashort DNA fragments. Proc Natl Acad Sci USA 110:15758–15763. DOI: 10.1073/pnas.1314445110

Gansauge M-T, Meyer M (2013) Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA. Nat Protoc 8:737–748. DOI: 10.1038/nprot.2013.038

Gansauge M-T, Gerber T, Glocke I, et al. (2017) Single-stranded DNA library preparation from highly degraded DNA using T4 DNA ligase. Nucleic Acids Res 45:e79. DOI: 10.1093/nar/gkx033

Jónsson H, Ginolhac A, Schubert M, Johnson PLF, Orlando L (2013) mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29:1682–1684. DOI: 10.1093/bioinformatics/btt193

Kapp JD, Green RE, Shapiro B (2021) A fast and efficient single-stranded genomic library preparation method optimized for ancient DNA. J Hered 112:241–249. DOI: 10.1093/jhered/esab012

Michelsen C, Pedersen MW, Fernandez-Guerra A, et al. (2022) metaDMG: A fast and accurate ancient DNA damage toolkit for metagenomic data. bioRxiv. DOI: 10.1101/2022.12.06.519264

Lindahl T (1993) Instability and decay of the primary structure of DNA. Nature 362:709–715. DOI: 10.1038/362709a0

Lindahl T, Nyberg B (1972) Rate of depurination of native deoxyribonucleic acid. Biochemistry 11:3610–3618. DOI: 10.1021/bi00769a018

Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010:pdb.prot5448. DOI: 10.1101/pdb.prot5448

Mitchell D, Bridge R (2006) A test of Chargaff's second rule. Biochem Biophys Res Commun 340:90–94. DOI: 10.1016/j.bbrc.2005.11.160

Rudner R, Karkas JD, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proc Natl Acad Sci USA 60:921–922. DOI: 10.1073/pnas.60.3.921

Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464. DOI: 10.1214/aos/1176344136

Shen JC, Rideout WM, Jones PA (1994) The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res 22:972–976. DOI: 10.1093/nar/22.6.972

Shibutani S, Takeshita M, Grollman AP (1991) Insertion of specific bases during DNA synthesis past the oxidation-damaged base 8-oxodG. Nature 349:431–434. DOI: 10.1038/349431a0

Steenken S, Jovanovic SV (1997) How easily oxidizable is DNA? One-electron reduction potentials of adenosine and guanosine radicals in aqueous solution. J Am Chem Soc 119:617–618. DOI: 10.1021/ja962255b

Wiebauer K, Jiricny J (1990) Mismatch-specific thymine DNA glycosylase and DNA polymerase β mediate the correction of G·T mispairs in nuclear extracts from human cells. Proc Natl Acad Sci USA 87:5842–5845. DOI: 10.1073/pnas.87.15.5842

Willerslev E, Cooper A (2005) Ancient DNA. Proc Biol Sci 272:3–16. DOI: 10.1098/rspb.2004.2813

Orlando L, Gilbert MTP, Willerslev E (2015) Reconstructing ancient genomes and epigenomes. Nat Rev Genet 16:395–408. DOI: 10.1038/nrg3935

Pääbo S, Poinar H, Serre D, et al. (2004) Genetic analyses from ancient DNA. Annu Rev Genet 38:645–679. DOI: 10.1146/annurev.genet.37.110801.143214