Performs mixed membership clustering of ancient DNA (aDNA) samples using DNA damage patterns: mutation, flanking base, distance from read end, and strand break information. By default, the model follows the framework of Shiraishi et al. (2015).

archaic_fit(
  dat,
  K,
  tol = 0.1,
  labs = NULL,
  gom_method = "independent",
  gom.control = list(),
  output_dir = NULL
)
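
A minimal sketch of a call with the arguments above, assuming archaic_fit is already loaded in the session. The directory names are hypothetical placeholders and K = 2 is an arbitrary choice; the guard skips the call when the function or the directories are not available.

```r
# Hypothetical directories, each assumed to hold the MFF files to model jointly.
dirs <- c("data/moderns", "data/ancients")

# Arguments mirror the usage above; K = 2 is an arbitrary illustrative choice.
args <- list(
  dat = dirs,
  K = 2,
  tol = 0.1,
  labs = NULL,
  gom_method = "independent",
  gom.control = list(),
  output_dir = NULL
)

# Run only when archaic_fit is available and the placeholder directories exist.
can_run <- exists("archaic_fit", mode = "function") && all(dir.exists(dirs))
if (can_run) {
  fit <- do.call(archaic_fit, args)
}
```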

Arguments

dat

One of: (a) output from archaic_prepare, (b) a vector of directories hosting the MFF files that the user wants to model jointly, or (c) a matrix of counts with samples along the rows and mismatch signatures along the columns.
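
The layout of a count matrix (samples in rows, mismatch signatures in columns) can be sketched as follows. The sample names, signature labels, and counts are made up for illustration; the real function presumably expects column names in the signature format produced by archaic_prepare, which these placeholders do not reproduce.

```r
# Toy count matrix: one row per sample, one column per mismatch signature.
# Sample names and signature labels are invented placeholders.
set.seed(1)
samples <- c("sample_1", "sample_2", "sample_3")
signatures <- paste0("signature_", 1:5)

counts <- matrix(
  rpois(length(samples) * length(signatures), lambda = 20),
  nrow = length(samples),
  dimnames = list(samples, signatures)
)

# Basic sanity checks: non-negative whole numbers, named rows and columns.
stopifnot(
  is.matrix(counts),
  all(counts >= 0),
  all(counts == round(counts)),
  !is.null(rownames(counts)),
  !is.null(colnames(counts))
)
```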

K

The number of clusters to fit.

tol

The convergence tolerance for the Grade of Membership (GoM) model fit. Defaults to 0.1.

labs

A factor of labels used to group the samples in visualization. It may be used to distinguish samples from different labs or different library preparations. Defaults to NULL.
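
A small sketch of building such a factor. The sample names and lab labels are hypothetical, and the one-label-per-sample alignment is an assumption rather than documented behavior.

```r
# Hypothetical samples and their source labs; one label per sample, in sample order.
sample_names <- c("sample_1", "sample_2", "sample_3", "sample_4")
labs <- factor(c("lab_A", "lab_A", "lab_B", "lab_B"))

# The labels only help with grouping if they line up with the samples.
stopifnot(length(labs) == length(sample_names))

# Number of samples per group.
table(labs)
```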

gom_method

The GoM method type. Defaults to "independent", the independent model proposed by Shiraishi2015. The other option is the full model, which uses the implementation due to Taddy2012.
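
The difference between the two parameterizations can be illustrated with a toy cluster. Under an independence assumption in the spirit of Shiraishi2015, the probability of a signature is the product of per-feature probabilities, whereas a full model gives every distinct signature its own probability. The features, levels, and probabilities below are invented for illustration and do not reflect the function's internals.

```r
# Toy per-feature distributions for one cluster (all values invented).
mutation <- c("C->T" = 0.7, "C->A" = 0.1, "T->C" = 0.2)
flank    <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
position <- c(pos1 = 0.6, pos2 = 0.3, pos3 = 0.1)

# Independence: the probability of a full signature is the product of
# its per-feature probabilities, giving a 3 x 4 x 3 array.
joint <- outer(outer(mutation, flank), position)

# Free parameters per cluster under each parameterization.
n_levels <- c(length(mutation), length(flank), length(position))
params_independent <- sum(n_levels - 1)   # 2 + 3 + 2 = 7
params_full        <- prod(n_levels) - 1  # 3 * 4 * 3 - 1 = 35

# The factorized probabilities still form a valid distribution.
stopifnot(isTRUE(all.equal(sum(joint), 1)))
```

The independent form needs far fewer parameters per cluster, which is the usual motivation for preferring it when the number of distinct signatures is large.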

gom.control

Control parameters for the GoM model fit.

output_dir

The output directory where the model is saved. If NULL, the current working directory is used.

Value

A GoM model fitted to the aggregated data from archaic_prepare. The output contains both the clusters (represented by mismatch signature frequencies) and the mixing proportions of the clusters in each sample/MFF file. It also includes a model assessment score, such as the BIC, for comparing models.
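
How cluster profiles and mixing proportions relate in a mixed membership model can be sketched with toy numbers. The names omega and theta, and all values, are assumptions made for illustration; they are not taken from the actual return value of archaic_fit.

```r
# Toy mixing proportions: 3 samples x 2 clusters, each row sums to 1.
omega <- matrix(c(0.9, 0.1,
                  0.5, 0.5,
                  0.2, 0.8),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("sample_", 1:3),
                                c("cluster_1", "cluster_2")))

# Toy cluster profiles: 4 signatures x 2 clusters, each column sums to 1.
theta <- matrix(c(0.4, 0.1,
                  0.3, 0.1,
                  0.2, 0.3,
                  0.1, 0.5),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("signature_", 1:4),
                                c("cluster_1", "cluster_2")))

# Expected signature frequencies per sample are a mixture of the cluster profiles.
expected_freq <- omega %*% t(theta)

stopifnot(
  isTRUE(all.equal(unname(rowSums(omega)), rep(1, 3))),
  isTRUE(all.equal(unname(colSums(theta)), rep(1, 2))),
  isTRUE(all.equal(unname(rowSums(expected_freq)), rep(1, 3)))
)
```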

References

Taddy2012. Taddy, M., 2012, March. On estimation and selection for topic models. In Artificial Intelligence and Statistics (pp. 1184-1193).

Shiraishi2015. Shiraishi, Y., Tremmel, G., Miyano, S. and Stephens, M., 2015. A simple model-based approach to inferring and visualizing cancer mutation signatures. PLoS Genetics, 11(12), p.e1005657.