Predicting Imbalanced Data Using Machine Learning Approaches: a Case Study of Heart Patients

Salehi Amiri, Amir Reza | 2023

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 56505 (01)
  4. University: Sharif University of Technology
  5. Department: Industrial Engineering
  6. Advisor(s): Khedmati, Majid
  7. Abstract: One of the challenging issues in machine learning is data imbalance, which occurs when one or more classes have far more or far fewer samples than the others. Imbalance in a dataset often leads to misleading accuracy scores, poor prediction of the minority class, and weak generalization. It appears in many domains, including fraud detection, disease diagnosis, email spam detection, and fake news detection. The methods proposed to address it fall into four groups: data-level, algorithm-level, cost-sensitive, and ensemble approaches. This study proposes two hybrid balancing algorithms for binary and multi-class datasets. The algorithms combine oversampling, undersampling, clustering, and hybrid prediction algorithms to produce a balanced dataset while mitigating three challenges: an overly large dataset after balancing, information loss from undersampling, and the problems caused by random sampling of instances. Their performance is first compared with that of similar competing methods on diverse datasets using relevant metrics. Once their efficacy is demonstrated, the algorithms are applied to cardiac datasets for validation: questionnaire-based datasets of cardiac patients for the binary case and laboratory-based datasets of cardiac patients for the multi-class case. Analysis of these datasets proves effective for preventive and predictive care of cardiac patients. The results indicate that the proposed methods improve the identification of patient classes and yield more effective outcomes. Finally, the factors that influence prediction are ranked by significance, offering insight into the critical factors of the disease.
  8. Keywords: Machine Learning; Clustering; Imbalanced Data; Multiclass Classification; Cardiovascular Patients
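
The abstract's point about misleading accuracy, and the basic idea of mixing oversampling with undersampling, can be illustrated with a small sketch. This is not the thesis's algorithm: the function `random_hybrid_balance`, the toy labels, and the choice of the mean class size as the resampling target are all assumptions made for illustration. It also relies on plain random sampling, which is one of the weaknesses the proposed methods aim to mitigate (for example through clustering).

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


def random_hybrid_balance(X, y, seed=0):
    """Hypothetical helper: resample every class to the mean class size.

    Classes larger than the target are randomly undersampled (without
    replacement); smaller classes are randomly oversampled (with replacement).
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)

    target = round(len(y) / len(by_class))  # assumed target: mean class size
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        if len(rows) >= target:
            chosen = rng.sample(rows, target)  # undersample the larger class
        else:
            extra = rng.choices(rows, k=target - len(rows))  # duplicate at random
            chosen = rows + extra  # oversample the smaller class
        X_bal.extend(chosen)
        y_bal.extend([label] * target)
    return X_bal, y_bal


# Toy, made-up data: 95 samples of class 0 and 5 samples of class 1.
y = [0] * 95 + [1] * 5
X = [[i] for i in range(100)]

# A "model" that always predicts the majority class looks accurate,
# yet it never identifies a single minority-class sample.
always_majority = [0] * len(y)
print(accuracy(y, always_majority))            # 0.95
print(recall(y, always_majority, positive=1))  # 0.0

X_bal, y_bal = random_hybrid_balance(X, y)
print(Counter(y_bal))  # Counter({0: 50, 1: 50})
```

Resampling both classes toward a common target keeps the balanced dataset at its original size, instead of inflating it as pure oversampling would or discarding most of the majority class as pure undersampling would.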
