Damage types and channels
libdart-damage scans raw FASTQ reads for six biochemical damage channels and writes a structured JSON report. Damage values are fractions in [0, 1]; multiply by 100 for percentages.
Ancient DNA degradation falls into three broad categories: hydrolytic damage (cytosine deamination; depurination and AP-site fragmentation), oxidative damage (8-oxoguanine formation), and library-preparation artefacts that mimic or confound these signals. Each channel targets one of these processes. Together they cross-validate that the detected signal is genuine ancient damage rather than a composition or preparation artefact.
JSON output reference
Top-level fields
| Field | Type | Description |
|---|---|---|
input |
string | Input FASTQ path |
n_reads |
int | Total reads scanned |
library_type |
string | "double-stranded", "single-stranded", or "unknown" |
library_type_auto |
bool | true if type was auto-detected (not forced by --library-type) |
library_type_rescued |
bool | true if a rescue rule overrode the primary BIC winner |
damage_status |
string | "present", "weak", or "absent" |
deamination block
Cytosine spontaneously deaminates to uracil by hydrolysis of the C4 amino group. In an intact duplex, uracil-DNA glycosylase removes the lesion efficiently. In the single-stranded overhangs that form at the ends of nicked or fragmented ancient molecules, repair is unavailable: the uracil persists and templates thymine during PCR amplification, producing the characteristic C→T excess in ancient DNA reads [Briggs et al. 2007; Lindahl 1993]. The rate of spontaneous deamination in double-stranded DNA is approximately 10⁻¹³ s⁻¹ per cytosine under physiological conditions (~100–500 events per cell per day), and is accelerated by low pH and elevated temperature [Shen et al. 1994; Lindahl 1993]; cold, alkaline burial slows it by orders of magnitude, explaining DNA survival over millennia.
Because only the single-stranded overhang is susceptible, the excess decays exponentially inward from the fragment end [Jónsson et al. 2013]:
where \(p\) is distance from the terminus (0-indexed), \(b\) is the interior background rate,
\(A\) is the terminal excess above background, and \(\lambda\) is the decay constant. d_max
is defined as \(A/(1-b)\) — the fraction of terminal cytosine sites that were converted to
thymine, after accounting for the background T proportion.
Damage estimates
| Field | Description |
|---|---|
d_max_5prime |
C→T excess at the 5′ terminus above the interior background |
d_max_3prime |
G→A excess at the 3′ terminus (DS) or C→T excess (SS original-strand) above background |
d_max_combined |
Best single damage estimate for this library; source records how it was derived |
d_metamatch |
GC-composition-weighted estimate comparable to metaDMG's D_max [Michelsen et al. 2022] |
source |
How d_max_combined was derived — see table below |
lambda_5prime |
Exponential decay rate at the 5′ end; larger values mean shorter overhang length |
lambda_3prime |
Exponential decay rate at the 3′ end |
bg_5prime |
Interior T/(T+C) baseline at the 5′ end; reflects sequence composition, not damage |
bg_3prime |
Interior A/(A+G) baseline at the 3′ end |
source values:
| Value | Meaning |
|---|---|
"5prime_only" |
3′ signal inverted by adapter artefact; d_max_combined = d_max_5prime |
"3prime_only" |
5′ signal inverted; d_max_combined = d_max_3prime |
"average" |
Both ends valid; weighted average or mixture model |
"max_ss_asymmetry" |
SS library with unequal end amplitudes; max(d_max_5, d_max_3) |
"min_asymmetry" |
DS with high end asymmetry; conservative min(d_max_5, d_max_3) |
"channel_b_structural" |
Channel B stop-codon estimate; used when more reliable than Channel A |
"none" |
Both ends inverted; no recoverable signal; d_max_combined = 0 |
Adapter remnants of 1–2 bp can depress T/(T+C) at position 0, hiding the biological
signal. When this is detected, the fit window is shifted automatically and d_max_combined
already reflects the correction. Do not read per_pos_5prime_ct[0] directly for such
libraries.
Validation flags
Channel B tests the same C→T deamination through stop-codon context (CAA/CAG/CGA → TAA/TAG/TGA), which is insensitive to GC-composition bias. It provides a composition-independent cross-check on Channel A.
| Field | Description |
|---|---|
validated |
Channel B corroborates Channel A deamination |
artifact |
Channel A signal present but Channel B contradicts it; likely GC-composition artefact rather than genuine damage |
Per-position rates
| Field | Description |
|---|---|
per_pos_5prime_ct[15] |
Raw T/(T+C) at 5′ positions 0–14 |
per_pos_3prime[15] |
A/(A+G) at 3′ positions 0–14 for DS; T/(T+C) for SS original-strand |
Values of -1 indicate insufficient coverage at that position.
cpg_like sub-block — methylation signal
When cytosine is methylated at the C5 position (5-methylcytosine, 5mC), its deamination
product is thymine rather than uracil [Shen et al. 1994]. Deamination of 5mC in double-stranded DNA occurs at approximately twice the rate of
unmodified cytosine [Shen et al. 1994]. An additional factor is that the
thymine product creates a T:G mismatch, which is repaired by thymine-DNA glycosylase (TDG)
less efficiently than the U:G mismatch from cytosine deamination is repaired by
uracil-DNA glycosylase [Wiebauer & Jiricny 1990; Neddermann & Jiricny
1994]. In ancient DNA, this produces an
enhanced C→T rate specifically at positions where the downstream base is G. A cpg_ratio
above 1 indicates methylation-dependent deamination, the expected pattern for vertebrates
and most eukaryotes with CpG methylation. Values near 1
indicate either unmethylated CpGs (e.g., plants with low global methylation, many fungi),
modern contamination, or signal too eroded to detect context dependence.
| Field | Description |
|---|---|
dmax_ct5_cpg |
5′ C→T amplitude in CpG context |
dmax_ct5_noncpg |
5′ C→T amplitude in non-CpG context |
cpg_ratio |
dmax_cpg / dmax_noncpg; values >1 indicate methylation-enhanced deamination |
log2_cpg_ratio |
log₂(cpg_ratio); 0 means no context dependence |
baseline_cpg |
Interior T/(C+T) at CpG positions |
baseline_noncpg |
Interior T/(C+T) at non-CpG positions |
cov_terminal_cpg |
Total T+C observations at CpG terminal positions |
cov_terminal_noncpg |
Total T+C at non-CpG terminal positions |
effcov_terminal_cpg |
Effective coverage: cov × (1 − baseline) |
effcov_terminal_noncpg |
Effective coverage: cov × (1 − baseline) |
All fields are null when signal is insufficient or the amplitude estimate hits its
boundary.
bic block — library-type classifier
Library preparation determines which damage channels are active, so the BIC model selection that drives classification also reveals which sub-channels contributed signal. This block exposes the classifier's evidence for inspecting borderline calls.
Double-stranded (DS) library protocols (e.g., NEBNext, TruSeq [Meyer & Kircher 2010]) ligate adapters to both strands. Reads from the original strand show C→T at the 5′ end; reads from the complementary strand map the same damaged cytosines as G→A at the 3′ end. Single-stranded (SS) protocols (e.g., Gansauge & Meyer [2013]; Santa Cruz [Kapp et al. 2021]; SRSLY [Gansauge et al. 2017]) circularize and sequence only one strand per molecule, producing library-type-specific patterns depending on which strand is captured (see Library-type classification).
The classifier fits four sub-channel amplitudes simultaneously:
- ct5 — 5′ C→T exponential decay (deamination on the sequenced strand)
- ga3 — 3′ G→A exponential decay (deamination on the complementary strand, appearing as G→A when read 5′→3′)
- ga0 — 3′ position-0 G→A spike (ligation-site pattern in SS complement-orientation libraries; also produced by certain DS end-repair protocols)
- ct3 — 3′ C→T decay (SS original-orientation only, where the same strand carries damage at both ends)
These are evaluated under seven biological models: a null (all channels flat), two DS models (symmetric and with end-repair spike), a DS end-repair-only model, and three SS models (complement, original, and mixed orientations). The model with the lowest BIC wins [Schwarz 1978].
| Field | Description |
|---|---|
bias |
BIC of the null model (all channels flat; no damage) |
ds |
BIC of the best-fitting DS model |
ss |
BIC of the best-fitting SS model |
ct5_amp |
Fitted ct5 amplitude |
ga3_amp |
Fitted ga3 amplitude |
ga0_amp |
Fitted ga0 amplitude (3′ position-0 spike) |
ct3_amp |
Fitted ct3 amplitude |
Lower BIC = better fit. library_type is "double-stranded" when ds < ss,
"single-stranded" when ss < ds, and "unknown" when neither beats bias.
complement_asymmetry block — oxidative damage (Channels C and D)
Guanine is the most easily oxidised DNA base owing to its low one-electron reduction potential [Steenken & Jovanovic 1997]. Reactive oxygen species — primarily hydroxyl radical (•OH) and singlet oxygen (¹O₂) — attack the C8 position of guanine to form 7,8-dihydro-8-oxoguanine (8-oxoG) [Cadet et al. 2010]. The replicative polymerase misreads 8-oxoG as adenine via a syn conformation, inserting dAMP opposite the lesion and producing G→T transversions in the forward strand and C→A in the reverse strand [Shibutani et al. 1991]. Unlike cytosine deamination, which is confined to single-stranded overhangs, oxidative damage accumulates throughout the molecule with no strong positional preference.
Chargaff's first rule and the strand asymmetry principle. Chargaff's first rule
states that in a double-stranded molecule [G] = [C] and [A] = [T] across complementary
strands [Chargaff 1950]. As a consequence, if oxidation affects both
strands symmetrically, the G→T rate in forward-strand reads exactly equals the C→A rate
in reverse-strand reads, and they cancel in a pooled DS library; s_gt and D are
near zero even when oxidation is substantial. These statistics become informative when
oxidation is strand-asymmetric (one strand preferentially oxidised during burial) or when
only one strand is present (SS libraries, where no complementary strand exists to provide
the cancelling C→A signal) [Mitchell & Bridge 2006].
Rates and strand asymmetry
| Field | Description |
|---|---|
tg_interior |
G→T rate in the read interior (background) |
ac_interior |
C→A rate in the read interior |
tg_terminal |
G→T rate at 5′ terminal positions |
ac_terminal |
C→A rate at 5′ terminal positions |
gt_bg_fitted |
Fitted uniform background from the G→T model |
gt_term_fitted |
Fitted terminal amplitude |
gt_decay_fitted |
Fitted decay constant |
s_gt |
G→T vs C→A strand asymmetry contrast; near zero for balanced DS oxidation, informative for SS |
D |
Overall 8-oxoG asymmetry index |
per_pos_5prime_gt[15] |
Raw G→T fraction at 5′ positions 0–14 |
channel_c_detected: true when G→T stop-codon conversions (GAG/GAA/GGA → stop) are
detected at levels consistent with uniform oxidation throughout the read rather than
terminal enrichment from co-occurring deamination.
8-oxoG 16-context panel
G→T asymmetry split by flanking trinucleotide context (NGN, all 16 combinations).
The context index is 4 × enc(left) + enc(right) where enc(A)=0, C=1, G=2, T=3.
| Field | Description |
|---|---|
s_oxog |
Overall G→T strand asymmetry at interior positions |
se_s_oxog |
Standard error of s_oxog |
s_oxog_16ctx[16] |
Per-context G→T asymmetry; null when coverage < 500 |
cov_oxog_16ctx[16] |
Coverage per trinucleotide context bin |
In modern oxidative chemistry, guanine oxidation in GG-containing contexts is preferentially favoured [Cadet et al. 2010]. A single-context spike confined to one or two bins is more consistent with a preparation artefact than with the broad multi-context enrichment expected from genuine oxidative damage. A spike confined to one or two contexts is more consistent with an oxidation artefact introduced during library preparation.
interior_ct_cluster block — interior co-deamination
Under prolonged exposure conditions, cytosines within single-stranded micro-domains — short stretches that transiently unpair during storage — can be co-deaminated in the same hydrolytic event. This produces co-occurring T residues at nearby positions more often than expected under site independence. The signal is in the read interior (middle third), away from the terminal overhangs that drive Channel A.
Chargaff's second parity rule and the AG control. Chargaff's second (intra-strand) parity rule states that within a single strand, [A] ≈ [T] and [C] ≈ [G] [Rudner et al. 1968]. This symmetry arises from the evolutionary history of symmetric mutation processes acting on the genome. It underpins the AG co-occurrence control used here: if elevated T co-occurrence at adjacent {C,T} sites arose purely from strand composition rather than damage clustering, the G co-occurrence at adjacent {A,G} sites should show a matched elevation. The contrast between the CT and AG tracks therefore isolates genuine deamination clustering from composition effects.
For each pair of non-CpG {C,T} sites at separation d = 1–10 bp, the observed co-T
fraction is compared to the expectation under site independence.
| Field | Description |
|---|---|
short_asym_log2oe |
log₂(CT obs/exp) − log₂(AG obs/exp); composition-corrected effect size |
short_log2oe |
log₂(CT obs/exp) without composition correction |
short_z |
Normalised CT-vs-AG contrast statistic, summed over d = 1–5 |
short_obs |
Observed co-T pairs at d = 1–5 |
short_exp |
Expected co-T pairs at d = 1–5 under independence |
reads_used |
Reads contributing ≥ 2 eligible non-CpG {C,T} sites |
sep_log2oe[10] |
log₂(obs/exp) at each separation d = 1–10 |
short_z > 3 with positive short_asym_log2oe indicates genuine interior co-deamination
beyond what strand composition predicts. In most aDNA samples this signal is weak; it
becomes informative for very old material or samples with unusual interior damage such as
permafrost specimens.
depurination block — AP-site fragmentation (Channel E)
The glycosidic bond linking a purine base (adenine or guanine) to the deoxyribose sugar is susceptible to acid-catalyzed hydrolysis. The rate under physiological conditions is approximately 3 × 10⁻¹¹ per purine per second (~2,000–10,000 events per cell per day), substantially faster than cytosine deamination [Lindahl & Nyberg 1972; Lindahl 1993]. The resulting apurinic (AP) site is a metastable abasic residue: β-elimination converts the AP deoxyribose to a strand break, fragmenting the molecule at the site of base loss. Because purines (A and G) are lost preferentially over pyrimidines (C and T), and because fragmentation at an AP site leaves the removed purine's position at the 5′ end of the newly formed fragment, the sequenced 5′ termini are enriched for purines — the signature detected by Channel E. This signal provides evidence of ancient fragmentation independent of deamination, and is particularly informative for samples where deamination is low but preservation conditions are known to be acidic or warm [Dabney et al. 2013].
| Field | Description |
|---|---|
detected |
true when 5′ purine enrichment is statistically significant |
enrichment_5prime |
Purine excess at 5′ terminal positions relative to interior |
enrichment_3prime |
Purine excess at 3′ terminal positions relative to interior |
rate_interior |
Interior A+G fraction (composition baseline) |
Channel E is reported for sample characterisation and is not used by fqdup for position masking.
Library-type classification
Which damage channels are active depends on library preparation. Double-stranded (DS) protocols ligate adapters to both strands, so reads from both the original and complementary strand are sequenced. Single-stranded (SS) protocols circularize and sequence only one strand per molecule; which strand is captured depends on the protocol and ligation orientation.
| Library type | Active sub-channels | Pattern |
|---|---|---|
| DS | ct5 + ga3 | Symmetric exponential C→T at 5′ and G→A at 3′ |
| DS + end-repair artifact | ct5 + ga3 + ga0 | DS pattern plus isolated 3′ pos-0 spike |
| SS complement-orientation | ga0 | Strong 3′ pos-0 G→A spike; 5′ flat |
| SS original-orientation | ct5 + ct3 | C→T at both 5′ and 3′ ends; no G→A |
| SS mixed orientations | ct5 + ga0 | 5′ C→T decay plus 3′ pos-0 spike |
| UNKNOWN | — | No sub-channel beats the null model |
In DS libraries, the symmetric ct5 ≈ ga3 pattern is a consequence of Chargaff's first rule: the original strand contributes C→T at its 5′ end; reads from the complementary strand map the same damaged positions as G→A at the 3′ end. The shared-amplitude DS model (M_DS_symm) enforces this symmetry as a model constraint, which also means it penalises libraries with strongly asymmetric ends.
UNKNOWN is the correct classification for zero-damage or near-zero-damage libraries where the damage pattern contains insufficient information to distinguish library type. fqdup falls back to standard hashing for UNKNOWN libraries, which has no effect when d_max is near zero.
Channel biology reference
| Channel | Measures | Chemistry | JSON output |
|---|---|---|---|
| A (ct5/ga3/ga0/ct3) | C→T and G→A terminal rates | Cytosine deamination | deamination, bic |
| B / B₃′ | Stop codon C→T / G→A frequency | Deamination in triplet context | validated, artifact |
| C | G→T stop codon uniformity | 8-oxoG oxidation | channel_c_detected |
| D | G→T and C→A transversion rates | 8-oxoG oxidation | complement_asymmetry |
| E | Purine 5′ enrichment | AP-site fragmentation | depurination |
| CpG split | C→T amplitude by CpG / non-CpG context | Methylation-enhanced deamination | cpg_like |
| Interior clustering | Adjacent CT co-occurrence in read interior | Clustered interior deamination | interior_ct_cluster |
| 8-oxoG 16-ctx | G→T asymmetry by trinucleotide context | 8-oxoG context specificity | s_oxog_16ctx |
Channels B–E cross-validate Channel A without contributing to position masking. If Channel A detects deamination but Channel B contradicts it, the signal is flagged as a composition artefact rather than genuine ancient damage.
How fqdup uses damage channels
Position masking
Only Channel A drives position masking. Position p is masked when the damage rate at
that position, after subtracting the interior baseline b, exceeds the threshold τ (default
5%). The mask is applied symmetrically at both ends so that canonical hashing remains
correct under reverse-complement: min(hash(seq), hash(rc(seq))) requires
mask(rc(seq)) = rc(mask(seq)).
Masking by library type
| Library type | Positions masked |
|---|---|
| DS | 5′ pos p and 3′ pos p symmetrically (ct5/ga3) |
| DS + artifact | Same as DS; ga0 spike falls within the masked region |
| SS complement | 3′ pos 0 only (ga0) |
| SS original | 5′ and 3′ symmetrically (ct5/ct3) |
| SS mixed | 5′ from ct5 + 3′ pos 0 from ga0 |
| UNKNOWN | None — standard hashing |
Artifact suppression
When artifact: true, Channel A fired but Channel B contradicted it, indicating
GC-composition bias rather than genuine damage. fqdup can optionally suppress masking in
this case to avoid over-collapsing reads from modern contamination.
Internal analyses (not in JSON)
The following analyses are computed during damage estimation. Their results feed into the reported fields above but are not written to JSON directly.
- Hexamer detection: T/(T+C) averaged over positions 1–6; provides position-0-bias-resistant corroboration of Channel A that is unaffected by adapter remnants at position 0.
- Briggs biophysical model: Re-fits Channel A as δ_s (single-stranded overhang rate) and δ_d (double-stranded background rate), following [Briggs et al. 2007]. R² assesses fit quality.
- Joint probabilistic model: Bayesian comparison of damage-present vs damage-absent using
both BIC and Bayes factor. Drives
damage_statusand provides a fallback estimate ford_max_combinedwhen the exponential fit is unreliable. - GC-conditional damage bins: Independent exponential fits across 10 GC-content bins.
Separates genuine deamination from GC-composition artefacts and produces the adjusted
estimate that feeds into
d_metamatch. - Mixture model (K-component EM): EM over GC-stratified reads to separate ancient from
modern components. When converged,
mixture_d_referencefeeds intod_max_combined. - Codon-position tracking: C→T rate at codon positions 0, 1, 2; supplementary wobble diagnostic.
- Adapter offset detection: Per-end fit window shift of 1–3 bp when adapter remnants
are detected. Applied before computing
d_max_combined, which already reflects any correction.
References
Neddermann P, Jiricny J (1994) Efficient removal of uracil from G·U mispairs by the mismatch-specific thymine DNA glycosylase from HeLa cells. Proc Natl Acad Sci USA 91:1642–1646. DOI: 10.1073/pnas.91.5.1642
Briggs AW, Stenzel U, Johnson PLF, et al. (2007) Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA 104:14616–14621. DOI: 10.1073/pnas.0704665104
Cadet J, Douki T, Ravanat J-L (2010) Oxidatively generated base damage to cellular DNA. Free Radic Biol Med 49:9–21. DOI: 10.1016/j.freeradbiomed.2010.03.025
Chargaff E (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia 6:201–209. DOI: 10.1007/BF02173653
Dabney J, Knapp M, Glocke I, et al. (2013) Complete mitochondrial genome sequence of a Middle Pleistocene cave bear reconstructed from ultrashort DNA fragments. Proc Natl Acad Sci USA 110:15758–15763. DOI: 10.1073/pnas.1314445110
Gansauge M-T, Meyer M (2013) Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA. Nat Protoc 8:737–748. DOI: 10.1038/nprot.2013.038
Gansauge M-T, Gerber T, Glocke I, et al. (2017) Single-stranded DNA library preparation from highly degraded DNA using T4 DNA ligase. Nucleic Acids Res 45:e79. DOI: 10.1093/nar/gkx033
Jónsson H, Ginolhac A, Schubert M, Johnson PLF, Orlando L (2013) mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29:1682–1684. DOI: 10.1093/bioinformatics/btt193
Kapp JD, Green RE, Shapiro B (2021) A fast and efficient single-stranded genomic library preparation method optimized for ancient DNA. J Hered 112:241–249. DOI: 10.1093/jhered/esab012
Michelsen C, Pedersen MW, Fernandez-Guerra A, et al. (2022) metaDMG: A fast and accurate ancient DNA damage toolkit for metagenomic data. bioRxiv. DOI: 10.1101/2022.12.06.519264
Lindahl T (1993) Instability and decay of the primary structure of DNA. Nature 362:709–715. DOI: 10.1038/362709a0
Lindahl T, Nyberg B (1972) Rate of depurination of native deoxyribonucleic acid. Biochemistry 11:3610–3618. DOI: 10.1021/bi00769a018
Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010:pdb.prot5448. DOI: 10.1101/pdb.prot5448
Mitchell D, Bridge R (2006) A test of Chargaff's second rule. Biochem Biophys Res Commun 340:90–94. DOI: 10.1016/j.bbrc.2005.11.160
Rudner R, Karkas JD, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proc Natl Acad Sci USA 60:921–922. DOI: 10.1073/pnas.60.3.921
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464. DOI: 10.1214/aos/1176344136
Shen JC, Rideout WM, Jones PA (1994) The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res 22:972–976. DOI: 10.1093/nar/22.6.972
Shibutani S, Takeshita M, Grollman AP (1991) Insertion of specific bases during DNA synthesis past the oxidation-damaged base 8-oxodG. Nature 349:431–434. DOI: 10.1038/349431a0
Steenken S, Jovanovic SV (1997) How easily oxidizable is DNA? One-electron reduction potentials of adenosine and guanosine radicals in aqueous solution. J Am Chem Soc 119:617–618. DOI: 10.1021/ja962255b
Wiebauer K, Jiricny J (1990) Mismatch-specific thymine DNA glycosylase and DNA polymerase β mediate the correction of G·T mispairs in nuclear extracts from human cells. Proc Natl Acad Sci USA 87:5842–5845. DOI: 10.1073/pnas.87.15.5842
Willerslev E, Cooper A (2005) Ancient DNA. Proc Biol Sci 272:3–16. DOI: 10.1098/rspb.2004.2813
Orlando L, Gilbert MTP, Willerslev E (2015) Reconstructing ancient genomes and epigenomes. Nat Rev Genet 16:395–408. DOI: 10.1038/nrg3935
Pääbo S, Poinar H, Serre D, et al. (2004) Genetic analyses from ancient DNA. Annu Rev Genet 38:645–679. DOI: 10.1146/annurev.genet.37.110801.143214