Predictive Modeling For Early Lung Cancer Detection Using Ensemble Machine Learning
Keywords:
Lung Cancer Detection, Ensemble Machine Learning, SMOTE, Stacked Classifier, CatBoost, XGBoost, LightGBM, Flask Web Application, Predictive Modeling, Healthcare AI.
Abstract
Lung cancer remains one of the leading causes of cancer-related deaths worldwide, accounting for approximately 1.8 million deaths annually. Early detection is critical for improving patient survival rates and enabling timely therapeutic interventions. This paper presents a machine learning-based prediction system for early lung cancer detection using survey-based clinical and lifestyle data. The system employs a stacked ensemble that combines five base classifiers (CatBoost, XGBoost, LightGBM, AdaBoost, and Random Forest) with Logistic Regression as the meta-learner, which takes the base classifiers' outputs and produces the final binary prediction.
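The stacking arrangement described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes scikit-learn's `StackingClassifier`, uses synthetic data from `make_classification` in place of the survey dataset (the feature count of 15 is a placeholder), and substitutes scikit-learn's `GradientBoostingClassifier` for the CatBoost, XGBoost, and LightGBM models so that the example depends on a single library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 309 rows to mirror 270 + 39 cases,
# with a placeholder feature count and a similar class imbalance.
X, y = make_classification(n_samples=309, n_features=15, n_informative=8,
                           weights=[0.13, 0.87], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    # Stand-in for the CatBoost / XGBoost / LightGBM base models.
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# The meta-learner is trained on out-of-fold class probabilities
# produced by the base learners (5-fold cross-validation).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

predictions = stack.predict(X_test)          # final binary predictions
test_accuracy = stack.score(X_test, y_test)  # accuracy on the held-out split
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for rows the base learners have already seen, is what keeps the second stage from simply memorising the base models' training fit.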
To address the class imbalance in the dataset (270 cancer vs. 39 non-cancer cases), the Synthetic Minority Over-sampling Technique (SMOTE) was applied to balance the training distribution. Balancing improves the model’s ability to classify the minority (non-cancer) cases correctly and reduces bias toward the majority class. The stacked ensemble achieved an accuracy of 96.9%, precision of 96.98%, recall of 96.82%, F1-score of 96.90%, and an ROC-AUC score of 0.99, outperforming all individual classifiers. High recall (sensitivity) is critical in medical diagnosis because it minimizes false negatives, that is, missed cancer cases. Combining several learners also improves prediction reliability and reduces model variance compared to single classifiers, and cross-validation was used to check that performance remains consistent on unseen data.
Feature importance analysis highlights key contributing factors such as smoking habits, anxiety, fatigue, and respiratory symptoms, which improves interpretability. Additionally, a Flask-based web application was implemented, providing a user-friendly interface for real-time prediction, data visualization, and result interpretation. The system is modular, scalable, and clinically accessible, and its architecture allows integration with other healthcare systems or datasets in the future.
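The oversampling step can be illustrated with a small, dependency-free sketch of the idea behind SMOTE: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_oversample`, the two-feature demo points, and the choice of `k=5` are illustrative assumptions, not the paper's code; in practice a library implementation such as `SMOTE` from imbalanced-learn, fitted on the training split only, would normally be used.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Class counts quoted in the abstract (whole dataset, before any split).
n_cancer, n_non_cancer = 270, 39
n_needed = n_cancer - n_non_cancer  # synthetic non-cancer samples needed: 231

# Hypothetical two-feature minority samples, used only for illustration.
demo_rng = random.Random(1)
minority = [(demo_rng.random(), demo_rng.random()) for _ in range(n_non_cancer)]

synthetic = smote_oversample(minority, n_needed)
balanced_minority_size = len(minority) + len(synthetic)  # 270
```

Because every synthetic point is an interpolation between two real minority samples, the new samples stay inside the region the minority class already occupies instead of duplicating existing rows exactly.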
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.










