Background
Unsupervised integrative clustering of multiple omics datasets to identify unknown subgroups of patients is an increasingly common approach in data analysis, especially in precision medicine efforts for complex diseases. We have previously shown that integration of multiple omics datasets can drastically improve the statistical power to detect subgroups in small clinical cohorts in COPD (Li CX et al., ERJ 2018). However, missing data represents a major limiting factor in such studies.
Objective
Missingness of specific data modalities for different subjects results in vastly different numbers of omics combinations available for each pair of subjects, as well as variations in sample size (n) between the respective omics combinations.
Method
We have developed an extension of the consensus clustering algorithm for integration of multi-omics data, ccml (cran.rstudio.com/web/packages/ccml), to facilitate the inclusion of omics datasets with unequal numbers of missing labels in multi-omics predictions in clinical cohorts.
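To illustrate the general idea of building a consensus from omics layers with unequal missingness, the following is a minimal Python sketch and not the ccml implementation. It assumes that each omics dataset has already been clustered separately, that a subject missing from a dataset is encoded as NaN, and that each pair of subjects is scored only over the datasets in which both were measured. The function name `pairwise_consensus` is hypothetical.

```python
import numpy as np

def pairwise_consensus(labels):
    """Pairwise co-clustering frequency across omics layers with missing labels.

    labels: array-like of shape (n_omics, n_subjects) holding per-omics cluster
            labels, with np.nan where a subject was not measured (assumption).
    Returns (consensus, n_shared):
      consensus[i, j] = fraction of shared omics layers in which i and j
                        received the same label (np.nan if none are shared)
      n_shared[i, j]  = number of omics layers in which both i and j are observed
    """
    labels = np.asarray(labels, dtype=float)
    observed = (~np.isnan(labels)).astype(int)       # (n_omics, n_subjects)

    # Number of omics layers available for each pair of subjects.
    n_shared = observed.T @ observed                 # (n_subjects, n_subjects)

    # NaN never compares equal, so pairs with a missing label count as "not same".
    same = labels[:, :, None] == labels[:, None, :]  # (n_omics, n, n)
    n_same = same.sum(axis=0)

    # Normalise by the pair-specific number of shared layers, not the total.
    consensus = np.full(n_shared.shape, np.nan)
    np.divide(n_same, n_shared, out=consensus, where=n_shared > 0)
    return consensus, n_shared

# Toy example: 3 omics layers, 4 subjects, some labels missing.
toy_labels = [
    [0,      0,      1, 1],
    [0,      np.nan, 1, 0],
    [np.nan, 0,      0, np.nan],
]
consensus, n_shared = pairwise_consensus(toy_labels)
```

Normalising each pair by its own count of shared layers keeps subjects with sparse coverage in the analysis; the `n_shared` matrix shows how much evidence supports each consensus value.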
Results
Evaluation of the ccml algorithm using the Karolinska COSMIC cohort demonstrates that ccml effectively predicts molecularly distinct subgroups from the integration of 9 omics datasets, tolerating as much as 60% data missingness when 5 or more data platforms are integrated.
Conclusion
The ccml package is a downstream tool for multi-omics integration analysis that mitigates the limitations posed by missing data, a prevalent issue in human cohort studies involving multiple data modalities.