How to decide whether to keep the third codon position when performing an oligotyping analysis on functional genes

In the following example, we use the acsA (acetyl-coenzyme A synthetase) gene data from Rhodopirellula species from Zure et al. 2016 to perform a Minimum Entropy Decomposition (MED) analysis, the unsupervised flavour of oligotyping.

If the package is not yet installed:

install.packages("devtools")
## 
## The downloaded binary packages are in
##  /var/folders/s1/7hqkd8bj4pjc8rvxncf_97_m0000gq/T//Rtmpx8irJW/downloaded_packages
library(devtools)
install_github("genomewalker/oligo4fun")
## Downloading GitHub repo genomewalker/oligo4fun@master
## Installing oligo4fun
## Skipping 6 packages ahead of CRAN: BatchJobs, digest, R6, Rcpp, sp, tidyr
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
##   --no-environ --no-save --no-restore CMD INSTALL  \
##   '/private/var/folders/s1/7hqkd8bj4pjc8rvxncf_97_m0000gq/T/Rtmpx8irJW/devtools6cd559097ec/genomewalker-oligo4fun-ce888bc'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library'  \
##   --install-tests

Then we load the package:

library(oligo4fun)
## Loading required package: ape
## Loading required package: ggplot2
## Loading required package: tidyr

Let’s have a look at the data, first at the entropy file:

data(entropy)
entropy
## Source: local data frame [483 x 2]
## 
##       V1     V2
##    (int)  (dbl)
## 1    413 1.4098
## 2    269 1.3759
## 3    218 1.3265
## 4    299 1.3041
## 5    419 1.2751
## 6    263 1.2687
## 7    260 1.1766
## 8    329 1.1619
## 9    275 1.0633
## 10   449 1.0600
## ..   ...    ...

and then at the MED input alignment file:

data(aln)
aln
## 361890 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 483 
## 
## Labels: Sample-101_Read10 Sample-101_Read19 Sample-101_Read55 Sample-101_Read72 Sample-101_Read90 Sample-101_Read105 ...
## 
## More than 1 million nucleotides: not printing base composition

Any fasta file can be read with the read.dna function from the ape package (using a placeholder file name here):

aln <- read.dna(file = "my_alignment.fasta", format = "fasta", as.character = FALSE, skip = 0)

First we will explore the entropy value distribution on all codon positions:

entropy_plot <- plot_entropy_density(entropy)
## Note: Entropy dataframe has been processed
entropy_plot

The resulting plot shows that the 3rd codon position has higher entropy values, which suggests that using this alignment as input for MED could overestimate the diversity.
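The per-position pattern behind such a plot can be sketched in a few lines of base R (this is an illustrative sketch, not oligo4fun's implementation): compute the Shannon entropy of each alignment column and average it per codon position.

```r
# Sketch (base R, not the package internals): Shannon entropy per alignment
# column, then averaged by codon position (1, 2 or 3).
col_entropy <- function(column) {
  p <- table(column) / length(column)  # base frequencies in this column
  -sum(p * log2(p))                    # Shannon entropy in bits
}

# Toy in-frame alignment: 4 sequences x 6 columns (two codons)
toy <- rbind(c("a", "t", "g", "a", "t", "g"),
             c("a", "t", "a", "a", "t", "c"),
             c("a", "t", "t", "a", "t", "t"),
             c("a", "t", "c", "a", "t", "a"))

ent <- apply(toy, 2, col_entropy)
codon_pos <- ((seq_along(ent) - 1) %% 3) + 1  # 1,2,3,1,2,3,...
tapply(ent, codon_pos, mean)  # position 3 dominates: 0, 0 and 2 bits
```

In this toy example all the variation sits in the 3rd codon position, which is exactly the pattern the entropy density plot reveals for the acsA data.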

The next step is to determine whether the higher entropy values reflect real changes in the amino acid sequence or are silent substitutions. We will use a saturation plot to explore the transition/transversion ratio. As the NGS datasets used for MED are typically very large, in this example we pick 500 random sequences for the analysis (we recommend using >5000 sequences). For the distance calculation we use Kimura's 2-parameter distance (K80). We set all = FALSE because we only want to focus on the 3rd codon position; with all = TRUE the ratio is calculated for all codon positions.

saturation_estimate <- estimate_saturation(aln, nseqs = 500, model = "K80", all = FALSE )
## Note: A random seed has been selected
## Seed used: 1376112 
## Randomly selected 500 sequences
## 
## Sequence alignment containing only the 3rd codon position
## Calculating the number of pairwise transitions and transversions
## Computing matrix of pairwise distances using model K80 
## Calculating statistics
## Drawing saturation plot
## Results suggest that there is saturation
## Removing 3rd codon position from the alignment
## Note: Substitution saturation occurs when the frequency of transitions (mean:23.02) overtakes the frequency of transversions (mean:12.19)
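The transition/transversion distinction that drives this statistic can be sketched as follows (a hypothetical helper, not part of oligo4fun): a substitution between two purines (a, g) or two pyrimidines (c, t) is a transition, anything crossing the purine/pyrimidine divide is a transversion.

```r
# Sketch (hypothetical helper, not from oligo4fun): classify a substitution
# between two bases as a transition or a transversion.
substitution_type <- function(a, b) {
  purines <- c("a", "g")  # c and t are the pyrimidines
  if (a == b) return("identical")
  if ((a %in% purines) == (b %in% purines)) "transition" else "transversion"
}

substitution_type("a", "g")  # "transition"   (purine -> purine)
substitution_type("c", "t")  # "transition"   (pyrimidine -> pyrimidine)
substitution_type("a", "t")  # "transversion" (purine -> pyrimidine)
```

When a site is saturated, repeated transitions pile up at the same positions, so their observed frequency overtakes that of transversions, as reported in the note above.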

We can visually inspect the saturation plot:

saturation_estimate$plot

Get some statistics:

saturation_estimate$stats
## $mean_ti
## [1] 23.01783
## 
## $sd_ti
## [1] 17.36827
## 
## $mean_tv
## [1] 12.18915
## 
## $sd_tv
## [1] 9.456685

Or get the alignment without the 3rd codon position which can be used for the MED analysis:

saturation_estimate$aln_no_3rd
## 361890 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 322 
## 
## Labels: Sample-101_Read10 Sample-101_Read19 Sample-101_Read55 Sample-101_Read72 Sample-101_Read90 Sample-101_Read105 ...
## 
## More than 1 million nucleotides: not printing base composition
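Conceptually, removing the 3rd codon position just drops every third column of an in-frame alignment, which is why the sequence length above goes from 483 to 322 (483 minus 161 codons). A minimal base-R sketch of the idea (not the package internals):

```r
# Sketch (base R, not the package internals): drop every third column of an
# in-frame alignment matrix, i.e. remove the 3rd codon position.
drop_third_position <- function(aln_matrix) {
  keep <- (seq_len(ncol(aln_matrix)) - 1) %% 3 != 2  # FALSE at columns 3, 6, 9, ...
  aln_matrix[, keep, drop = FALSE]
}

toy <- matrix(rep(c("a", "t", "g"), 2), nrow = 1)  # one sequence, two codons
ncol(drop_third_position(toy))  # 4 columns remain (as 483 -> 322 above)
```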

But a single subsample of 500 or 5000 sequences is not enough to be confident that there is no substitution saturation in our alignment. In this case one can use the function estimate_saturation_n to perform n random subsamplings of the main alignment file. An example using the sequential version, performing 10 random subsamplings of 500 sequences:

random_saturation_estimates <- estimate_saturation_n(aln, rep = 10, nseqs = 500, model = "K80", parallel = FALSE, all = FALSE, save_alns = FALSE)
## |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
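The sequential logic amounts to repeatedly drawing random subsets of sequence indices and computing the saturation statistics on each. A self-contained sketch of that pattern (hypothetical helper names; the real function applies estimate_saturation to each subsample):

```r
# Sketch of the sequential subsampling pattern (hypothetical, base R):
# draw `rep` random subsamples of `nseqs` sequence indices and apply a
# statistic to each one.
subsample_stats <- function(n_total, rep, nseqs, stat) {
  lapply(seq_len(rep), function(i) {
    idx <- sample.int(n_total, nseqs)  # random subsample of sequence indices
    stat(idx)                          # e.g. compute Ti/Tv stats on aln[idx, ]
  })
}

set.seed(42)
res <- subsample_stats(n_total = 361890, rep = 10, nseqs = 500, stat = length)
unlist(res)  # each replicate uses exactly 500 sequences
```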

For a real-world analysis we would perform at least 100 random subsamplings of >5000 sequences. The parallel version uses BatchJobs to distribute the random subsamplings over multiple cores or a computer cluster. An example using the cores of a single computer:

random_saturation_estimates <- estimate_saturation_n(aln, rep = 10, nseqs = 500, model = "K80", parallel = TRUE, all = FALSE, save_alns = FALSE, conf_file = "~/.BatchJobs.R", reg_id = "rhodo", reg_dir = "~/rhodo_dir", keep_reg = FALSE)
## Creating dir: ~/rhodo_dir
## Loading required package: BatchJobs
## Loading required package: BBmisc
## Sourced 2 configuration files:
##   1: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/BatchJobs/etc/BatchJobs_global_config.R
##   2: /Users/ufo/.BatchJobs.R
## BatchJobs configuration:
##   cluster functions: Multicore
##   mail.from: 
##   mail.to: 
##   mail.start: none
##   mail.done: none
##   mail.error: none
##   default.resources: 
##   debug: FALSE
##   raise.warnings: FALSE
##   staged.queries: TRUE
##   max.concurrent.jobs: Inf
##   fs.timeout: NA
##   measure.mem: TRUE
## Saving registry: /Users/ufo/rhodo_dir/registry.RData
## Adding 10 jobs to DB.
## Saving conf: /Users/ufo/rhodo_dir/conf.RData
## Submitting 10 chunks / 10 jobs.
## Cluster functions: Multicore.
## Auto-mailer settings: start=none, done=none, error=none.
## Sending 10 submit messages...
## Might take some time, do not interrupt this!
## 7 temporary submit errors logged to file '/Users/ufo/rhodo_dir/submit.log'.
## First message: Multicore busy: L
## Syncing registry ...
## Reducing 10 results...
## Status for 10 jobs at 2016-03-02 15:49:30
## Submitted: 10 (100.00%)
## Started:   10 (100.00%)
## Running:    0 (  0.00%)
## Done:      10 (100.00%)
## Errors:     0 (  0.00%)
## Expired:    0 (  0.00%)
## Time: min=9.00s avg=16.60s max=20.00s

For reproducibility purposes, all seeds used by estimate_saturation and estimate_saturation_n are stored in the returned object under seed.

We also recommend setting save_alns = FALSE, as writing out all the alignments can produce a heavy I/O load when using a large number of iterations.

The contents of the .BatchJobs.R file:

cluster.functions = makeClusterFunctionsMulticore(ncpus = max(getOption("mc.cores", parallel::detectCores()) - 1L, 1L))
staged.queries = TRUE
debug = FALSE

You can find examples for other queue managers here. For example, if our cluster uses SGE, our .tmpl file should be similar to this one and the .BatchJobs.R file should contain:

cluster.functions = makeClusterFunctionsSGE("simple.tmpl")
staged.queries = TRUE
debug = FALSE

Once we have the results, we can explore the returned object, starting with the combined statistics:

random_saturation_estimates$combined_stats
## Source: local data frame [10 x 4]
## 
##     mean_ti    sd_ti  mean_tv    sd_tv
##       (dbl)    (dbl)    (dbl)    (dbl)
## 1  23.34313 17.59591 12.36005 9.572586
## 2  22.61843 17.52720 11.98275 9.603165
## 3  21.76359 17.43766 11.37730 9.545728
## 4  22.63791 17.50658 11.89146 9.588195
## 5  22.03190 17.09155 11.57948 9.382785
## 6  22.22527 17.31354 11.82006 9.529700
## 7  22.90797 17.37829 12.26666 9.616049
## 8  21.55975 17.14403 11.40653 9.527593
## 9  22.30829 17.17119 11.62052 9.363744
## 10 20.59689 17.84109 10.80402 9.673496

Or we can look at the distribution of the average values of transitions and transversions across all random subsamplings:

write_saturation_plot_density(random_saturation_estimates, dir = "~", format = "pdf", show = TRUE)

## Plots written to  ~/saturation_test_alignment_density_030220161549.pdf

This plot also suggests that there are more transitions than transversions, so it is advisable to remove the 3rd codon position from the alignment.

We can easily do this:

write_saturation_alignments(random_saturation_estimates, dir = "~", no3rd = TRUE)
## Saving original alignment without the 3rd codon position
## Alignment(s) written to ~/saturation_test_alignment_no3rd_030220161549.fasta

We can save all plots on disk:

write_saturation_plots(random_saturation_estimates, dir = "~", format = "pdf")
## Saving saturation plots from multiple alignment random subsamplings
## |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## Plot(s) written to  ~

And all random alignments used:

write_saturation_alignments(random_saturation_estimates, dir = "~")
## Saving random alignments used for the saturation estimation
## |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## Alignment(s) written to  ~

After all these steps we can use the exported alignment for the MED analysis.

For any questions related to oligo4fun, open a new issue here.