Background: Pleural effusion has various origins, and it can be challenging to differentiate the causes clinically.
Objectives: This study aimed to develop a machine learning model to classify five etiologies of pleural effusion.
Methods: In this retrospective study, data were collected from patients who underwent thoracentesis at a tertiary care center between October 2013 and December 2018. Each pleural effusion was labelled with one of five etiologies: transudative, malignant, parapneumonic, tuberculous, or other. From 49 features derived from clinical information, blood tests, and pleural fluid, we used the mutual information method to select a minimal feature set that was optimal for classifying pleural effusion. We compared the performance of five models: multinomial logistic regression, support vector machine, random forest, extreme gradient boosting, and light gradient boosting machine (LightGBM). The final model was validated by 5-fold cross-validation.
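As a rough illustration of mutual-information feature ranking, the sketch below scores each feature by its mutual information with the etiology label and sorts the features from most to least informative. It is not the study's implementation: the equal-width binning estimator, the function names, and the toy values are all assumptions made for the example, and the original work may have used a different estimator.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # Each term is p(x, y) * log(p(x, y) / (p(x) * p(y))), written with counts.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def bin_values(values, n_bins=3):
    """Equal-width binning of a continuous feature (an assumed simplification)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def rank_features(features, labels, n_bins=3):
    """Return (feature name, score) pairs sorted by descending mutual information."""
    scores = {
        name: mutual_information(bin_values(vals, n_bins), labels)
        for name, vals in features.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, invented for illustration only (not taken from the study).
labels = ["transudative"] * 4 + ["malignant"] * 4
features = {
    "pleural_ldh": [100, 120, 110, 130, 900, 950, 1000, 980],  # separates the classes
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],                         # unrelated to the label
}
ranking = rank_features(features, labels)
```

In this toy case the separating feature is ranked first and the unrelated one scores zero. A minimal feature set would then be chosen by keeping the top-ranked features and checking how classification performance changes as features are dropped.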
Results: A total of 2,253 patients (median age 64.0 [54.0, 74.0] years; men 45.5%) were evaluated. The features with the highest mutual information values were pleural lactate dehydrogenase, protein, adenosine deaminase, and carcinoembryonic antigen. The LightGBM model combined with single imputation and standard scaling showed the best performance in classifying the five etiologies. The optimal model, which used a minimal set of 18 features, achieved the highest accuracy and F1 score: 0.819 and 0.805 in the validation set, and 0.788 and 0.772 in the extra-validation set.
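For readers less familiar with the two reported metrics, the sketch below computes accuracy and a multiclass F1 score from true and predicted labels. The abstract does not state how F1 was averaged across the five classes, so macro averaging is an assumption here, and the labels are toy values rather than study data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
y_true = ["transudative", "transudative", "malignant", "malignant",
          "parapneumonic", "parapneumonic"]
y_pred = ["transudative", "transudative", "malignant", "parapneumonic",
          "parapneumonic", "parapneumonic"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging weights every etiology equally regardless of how many patients fall into it, so with imbalanced classes the F1 score can sit below accuracy.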
Conclusions: A machine learning model classified five etiologies of pleural effusion with acceptable performance while using fewer biomarkers.