Abstract

Interstitial lung disease (ILD) is a common clinical manifestation of acid sphingomyelinase deficiency (ASMD), a rare lysosomal storage disease. We applied machine learning to electronic health records (EHRs) to produce a data-driven decision tree (algorithm) that flags patients at high risk of ASMD among patients with unexplained ILD. We hypothesized that a machine learning algorithm using clinical and laboratory traits associated with ASMD types A/B or B could distinguish patients with the disease from matched controls.

Using EHRs from Optum's Humedica de-identified dataset (2007–2021), the ASMD cohort was enriched with 199 clinical characteristics and 11 laboratory measurements. An algorithm was trained against an extracted matched control cohort (ratio 1:20). The algorithm distinguished ASMD patients with pulmonary manifestations from the general population with pulmonary manifestations. It was further internally validated on the entire cohort, and then applied to an unexplained ILD cohort.

The algorithm highlighted the following features: HDL cholesterol, aspartate transaminase, bilirubin, hemoglobin, neurodegeneration, and thrombophilia. It distinguished 31 ASMD patients from 620 matched controls, with sensitivity ~80% and specificity >99%. Applying the algorithm to an unexplained ILD cohort ?50 years (N=35,930) flagged 691 potential ASMD patients.
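
The reported figures can be checked with a short calculation. The sketch below is illustrative only: the cohort sizes (31 cases, 620 controls, 691 flagged of 35,930) come from the abstract, but the true-positive and false-positive counts are hypothetical values chosen to be consistent with "sensitivity ~80%" and "specificity >99%"; the abstract does not report the confusion matrix.

```python
# Cohort sizes stated in the abstract.
N_CASES = 31          # ASMD patients
N_CONTROLS = 620      # matched controls
N_ILD = 35_930        # unexplained ILD cohort
N_FLAGGED = 691       # flagged as potential ASMD


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true cases that the algorithm flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of controls that the algorithm does not flag."""
    return tn / (tn + fp)


# ASSUMPTION: hypothetical counts, not reported in the abstract.
# 25 of 31 cases flagged and 3 of 620 controls flagged are merely
# one combination compatible with "~80%" and ">99%".
tp_assumed = 25
fp_assumed = 3
fn_assumed = N_CASES - tp_assumed
tn_assumed = N_CONTROLS - fp_assumed

control_ratio = N_CONTROLS / N_CASES       # controls per case
sens = sensitivity(tp_assumed, fn_assumed)
spec = specificity(tn_assumed, fp_assumed)
flagged_fraction = N_FLAGGED / N_ILD       # share of ILD cohort flagged

print(f"controls per case: {control_ratio:.0f}")
print(f"sensitivity (assumed counts): {sens:.1%}")
print(f"specificity (assumed counts): {spec:.1%}")
print(f"flagged fraction of ILD cohort: {flagged_fraction:.2%}")
```

Only the control ratio and the flagged fraction follow directly from the stated numbers; the sensitivity and specificity lines show how the metrics are defined rather than reproducing the study's actual counts.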

A machine learning-derived algorithm captured patients with ASMD types A/B or B from EHR data with high specificity and flagged a reasonably small proportion (<2%) of the unexplained ILD cohort as potential ASMD patients. This algorithm may enhance early diagnosis of ASMD, though further validation is still needed.