
Machine learning approach for predicting cardiovascular disease in Bangladesh: evidence from a cross-sectional study in 2023

Abstract

Background

Cardiovascular diseases (CVDs) are the leading cause of death worldwide. Lower- and middle-income countries (LMICs), such as Bangladesh, are also affected by several types of CVD, such as heart failure and stroke. The leading cause of death in Bangladesh has recently shifted from severe infections and parasitic illnesses to CVDs.

Materials and methods

The study dataset comprised a random sample of 391 CVD patients' medical records collected between August 2022 and April 2023 using simple random sampling. Moreover, 260 data points were collected from individuals with no CVD problems for comparison purposes. Crosstabs and chi-square tests were used to determine the association between CVD and the explanatory variables. Logistic regression, Naïve Bayes classifier, Decision Tree, AdaBoost classifier, Random Forest, Bagging Tree, and Ensemble learning classifiers were used to predict CVD. The performance evaluations encompassed accuracy, sensitivity, specificity, and area under the receiver operator characteristic (AU-ROC) curve.

Results

Random Forest achieved the highest precision among the six techniques considered. The precision rates of the classifiers were as follows: Logistic Regression (93.67%), Naïve Bayes (94.87%), Decision Tree (96.1%), AdaBoost (94.94%), Random Forest (96.15%), and Bagging Tree (94.87%). The Random Forest classifier maintained the best balance between correct and incorrect predictions: with 98.04% accuracy, it achieved the highest precision (96.15%), robust recall (100%), and a high F1 score (97.7%). In contrast, the Logistic Regression model achieved the lowest accuracy (95.42%). Remarkably, the Random Forest classifier also achieved the highest AUC value (0.989).

Conclusion

This research mainly focused on identifying factors that are critical in impacting patients with CVD and predicting CVD risk. It is strongly advised that the Random Forest technique be implemented in a system for predicting cardiac diseases. This research may change clinical practice by providing doctors with a new instrument to determine a patient’s CVD prognosis.


Introduction

Cardiovascular diseases (CVDs) encompass several conditions affecting the heart and blood vessels. These include cardiac failure (heart failure, HF), cerebrovascular disorders such as stroke, and coronary illnesses such as heart attack [1]. CVDs constitute a broad category of cardiac and blood vessel conditions, including coronary artery disease, which is characterized by an insufficient supply of oxygenated blood to the heart, and cerebrovascular disease, which impairs blood circulation in the brain. Additionally, chronic heart failure is a condition in which the heart muscle suffers permanent damage [2].

CVDs encompass a range of disorders that affect the heart and blood vessels. This category includes conditions such as coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other related ailments. According to the World Health Organization (WHO), approximately 17.9 million deaths occurred due to CVD worldwide in 2016, accounting for 31% of all deaths worldwide; 85% of these deaths were due to heart attack and stroke [3]. Heart disease occurs when the heart fails to circulate enough blood to the organs. It is frequently caused by high blood pressure, insulin resistance, infections, or other cardiovascular disorders [4].

CVD is a major health issue worldwide, affecting approximately 26 million individuals globally each year [5]. Individuals in lower- and middle-income countries (LMICs) such as Bangladesh are affected by several types of CVD [6]. The leading cause of death in Bangladesh has progressively shifted from severe infections and parasitic illnesses to CVDs, which accounted for only 8% of total deaths in 1986 and 5% in 2018, with a higher prevalence in urban areas (8%) than in rural areas (2%) [6, 7]. In Bangladesh, heart disease had the highest reported prevalence (21%), whereas stroke had the lowest recorded prevalence (1%) in 2018 [7].

According to previous studies, the most important behavioral risk factors for CVDs, particularly heart disease and stroke, are unhealthy diet, physical inactivity, tobacco use, and harmful use of alcohol [8]. Dyslipidemia, tobacco use, diabetes, hypertension, and overweight have also been reported as potential risk factors for heart failure in previous studies [9, 10]. Another study conducted by Hossain et al. (2023) found that age, sex, smoking, obesity, diet, physical activity, stress, chest pain type, previous chest pain, diastolic blood pressure, diabetes, and troponin were the most important factors for identifying CVD risk [11]. Different experiences at different stages of epidemiological transition and urbanization, with varying life expectancies, diverse demographic profiles, and differences in environmental and genetic risk factors, could explain the different relationships between these risk factors and CVD mortality in Asian and Western societies [12].

Patients with heart disease do not exhibit symptoms in the early stages of the disease, but they do in later stages, which can often be too late to manage or treat [13]. As a result, despite the difficulty, early detection and prediction of CVD susceptibility in seemingly healthy patients is essential for determining the prognosis [13]. It remains difficult for cardiologists to diagnose and treat patients in the early stages [14]. Working with databases of heart disease patients is a practical application; it is therefore reasonable to draw on the knowledge of diverse professionals compiled in such databases to aid the diagnosis process [15]. Every conventional model for assessing CVD risk implicitly assumes that every risk factor is linearly related to the CVD outcome [14]. These models tend to oversimplify complicated relationships, including nonlinear interactions among several risk factors [14]. Prediction models based on machine learning algorithms are robust against common limitations of traditional statistical models, such as nonlinearity, multicollinearity, interactions, and the complexities present in large datasets [16]. Moreover, prediction models based on machine-learning algorithms are expected to demonstrate better predictive performance than traditional statistical methods [16]. For this reason, machine learning approaches have shown great promise in supporting clinical decision-making, helping create clinical guidelines and management algorithms, and encouraging the adoption of evidence-based clinical practices for the treatment of cardiovascular diseases (CVDs) [13]. Additionally, the early diagnosis of CVDs using machine learning approaches can lessen the need for costly and time-consuming clinical and laboratory tests, which saves costs for both individuals and the healthcare system [13].

Recently, machine learning models have been widely used to precisely predict CVD risk factors. Hossain et al. (2023) conducted a study predicting the risk of heart failure using distinct artificial intelligence techniques (logistic regression, Naïve Bayes, K-nearest neighbor (K-NN), support vector machine (SVM), decision tree, random forest, and multilayer perceptron (MLP)) [11]. In that study, the authors found that the Random Forest model achieved the highest accuracy rate (90%) compared to other machine learning models. Furthermore, previous studies have used machine learning approaches to predict heart failure risk using clinical, behavioral, socio-demographic, and socioeconomic features [17, 18]. Ensemble learning is critical for producing excellent forecast outcomes in a variety of real-world applications. For example, ensemble machine learning technologies such as random forests, XGBoost, light gradient boosting machines, and soft voting have improved the early identification of diabetes mellitus by merging numerous models to increase predictive accuracy. Their efficiency and cost-effectiveness make them excellent instruments for diabetes screening and diagnosis, providing faster and less expensive alternatives to traditional procedures [19]. In health research, ensemble learning methods such as bagging, boosting, and stacking are used to increase the accuracy and reliability of Alzheimer's disease detection models by combining several machine learning algorithms [20]. According to research in sports science, footballer positions may be reliably and precisely classified when stacked ensemble machine learning models are applied to datasets such as FIFA '19 [21]. A novel hybrid data-mining approach predicts Salmonella prevalence in agricultural waterways by combining ensemble feature selection and machine learning methods.
The combined ANN and RF ensemble outperformed existing approaches, providing an enhanced strategy for accurately detecting and mitigating contamination in agricultural water sources [22]. In forecasting Escherichia coli levels in agricultural water, ensemble models such as random forest and AdaBoost using meteorological data performed better than individual models, indicating the potential for more precise predictions in agricultural contexts [23]. In addition, Long Short-Term Memory (LSTM) networks have been effectively used for cryptocurrency data analysis, with remarkable success in accurately anticipating price patterns and providing useful insights for investors and traders in the unpredictable crypto market [24].

A recent literature review showed that some models perform well but lack reproducibility and suffer from problems that limit their reliability [25,26,27,28]. Several models have recently been established to improve effectiveness, but they still do not achieve optimal performance [29,30,31,32]. To address this gap, this study was conducted to learn more about the prevalence and risk factors of cardiac disease in Bangladesh. Therefore, this study seeks to answer the following research questions, in line with its aims and objectives:

  • To accurately predict cardiovascular disease (CVD) using different machine learning and ensemble learning approaches.

  • To identify significant predictors of heart failure.

  • To determine the best classification technique among the applied models for predicting cardiovascular disease (CVD).

This study compares multiple modeling strategies, including logistic regression, Naïve Bayes classifier, Decision Tree, AdaBoost classifier, Random Forest, Bagging Tree, and Ensemble learning classifiers, to reliably predict cardiovascular disease (CVD). First, we describe these methods to demonstrate their usefulness and optimization methodologies. Next, we divided the preprocessed dataset into training and test sets for model building and forecasting, with performance assessed using accuracy, precision, recall, and F1 score. Finally, the chosen models were used to diagnose heart failure, followed by an evaluation of their CVD prediction ability. This study could assist physicians and health scientists in classifying high-risk patients and in making a novel diagnosis to prevent cardiac failure through counseling and medication.

Methods

Data collection

Bangladeshi individuals aged > 15 years were included in this study, comprising individuals both with and without cardiac disease. A questionnaire was used to collect primary data from Dhaka Medical College, the National Institute of Cardiovascular Disease (NICVD), and BIRDEM; these three institutions provide treatment for patients with cardiovascular disease. Patients from all regions of Bangladesh were included. The research dataset comprised a random sample of clinical reports of 391 patients with cardiac failure gathered from August 2022 to April 2023. In addition, 260 data points were collected from individuals with no cardiac failure for comparison purposes. The sample size was estimated using Cochran's formula, and data were gathered using a simple random sampling procedure [33].

Dependent variables

In this study, we considered cardiac disease as the dependent variable, distinguishing individuals with and without cardiac disease. We asked patients, "Do you have heart disease according to a diagnosis?" and recorded answers of 'yes' or 'no'.

Independent variables

In our study, we considered several independent variables: gender (male, female); education (no education, primary, secondary, higher secondary); division (Dhaka, Chattogram, Khulna, Rajshahi, Barisal, Sylhet, Mymensingh, Rangpur); residence (urban, rural); socio-economic status (< 20,000, 20,000–40,000, > 40,000 Taka); takes physical exercise regularly (yes, no); consumes two or more servings of fruits or vegetables per day (yes, no); eats junk food regularly (yes, no); keeps too much salt in the diet (yes, no); feels bad about oneself (yes, no); feels no interest or pleasure in doing anything (yes, no); feels hopeless (yes, no); has sound sleep at night (yes, no); has a smoking habit (yes, no); has a habit of drinking alcohol (yes, no); has high blood pressure (yes, no); has a high cholesterol level (yes, no); has a family history of heart failure (yes, no); has anemia (yes, no); has any type of diabetes (yes, no); has hypertension (yes, no); has a sleep apnea problem (yes, no); has irregular heart rhythms (yes, no); has coronary artery disease (yes, no); has angina symptoms (yes, no); has kidney, lung, or other major disease (yes, no); takes statins to decrease cholesterol level (yes, no); BMI (calculated from height and weight); and platelet, creatinine, and sodium levels. For further clarification, please see the questionnaire attached in the supplementary file (see Table 1).

Table 1 Descriptive statistics (categorical) of different variables of Cardiovascular patient

Statistical analysis

Crosstabs were used to obtain descriptive statistics for heart disease and the explanatory variables. The chi-square test was used to determine the association between heart disease and the independent variables. The features that contributed substantially were selected for machine learning (ML) training and classification. Python machine-learning classifiers with fivefold cross-validation were used for classification: logistic regression, Naïve Bayes classifier, Decision Tree, AdaBoost classifier, Random Forest, Bagging Tree, and Ensemble learning. The data were divided into test (20%) and training (80%) sets (Fig. 1). The machine learning performance indicators were the area under the receiver operator characteristic (AU-ROC) curve, sensitivity, specificity, and accuracy. Python software was used for the statistical analysis at a 5% significance level.
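The split and cross-validation procedure described above can be sketched as follows. This is a minimal illustration on synthetic stand-in data (the study dataset is not public), but note that an 80/20 split of 651 records reproduces the 520/131 train/test counts reported later in the paper:

```python
# Sketch of the analysis workflow: 80/20 train-test split plus fivefold
# cross-validation. The data here is a synthetic stand-in for the 651-record
# CVD dataset (391 cases, 260 controls); column semantics are not reproduced.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=651, n_features=10, random_state=42)

# 80% training / 20% test, stratified on the outcome, as in the study design.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Fivefold cross-validation on the training portion.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)

print(len(X_train), len(X_test))  # 520 131
```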

Fig. 1
figure 1

Workflow of the cardiovascular disease prediction model

Different ML Techniques

Logistic regression

Logistic regression is a machine learning technique for solving classification problems and is based on the concept of probability. It is used when the target variable is categorical. The model converts probability to odds and then takes the logarithm of the odds. The mathematical form of the model is

$${\text{log}}\left[\frac{{P}_{i}}{1-{P}_{i}}\right]={\beta }_{0}+{\beta }_{1}{X}_{i1}+{\beta }_{2}{X}_{i2}+\dots +{\beta }_{k}{X}_{ik}$$

where \({P}_{i}\) denotes the probability that the event occurs and \(1-{P}_{i}\) the probability that it does not.

The ratio of the two represents the odds of the event, and the left-hand side expresses the log-odds. \({\beta }_{0}\) is the intercept, representing the mean value of the log-odds when all independent variables are set to zero. \({\beta }_{1}\), \({\beta }_{2}\),…, \({\beta }_{k}\) are the regression coefficients, which measure the rate of change of the log-odds due to changes in the independent variables \(({X}_{i1},{X}_{i2},\dots ,{X}_{ik})\) [34].

The sigmoid function converts any real value to a value in the range from zero to one. It appears as an S-shaped curve and can be defined as

$$f\left(x\right)=\frac{1}{1+{e}^{-x}}$$

However, a cost function, such as the cross-entropy loss, works in this regression system to measure the loss between the predicted probabilities and actual labels. The purpose of logistic regression is to minimize the cost function during the training phase [35]. Optimizing the hyperparameters is key to achieving the optimal performance of this algorithm. Machine-learning algorithms inherently rely on default parameter values if they are not manually adjusted by the user. For our primary dataset, we configured certain hyperparameters to tailor the behavior of the model. For instance, setting the "penalty = L2" dictates the norm used in penalization, while "C = 1.0" signifies the inverse of regularization strength. Additionally, "solver = lbfgs" specifies the optimization problem-solving approach. Other default parameters include "tol" (tolerance for stopping), "fit_intercept" (specifies whether to add a constant), "class_weight" (adjusts for class imbalance), "random_state" (random number generator for data shuffling), "max_iter" (maximum number of iterations), among others.
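The hyperparameter configuration described above can be written out as a short scikit-learn sketch. The values shown are the library defaults named in the text, fit here to synthetic stand-in data rather than the study dataset:

```python
# Logistic regression with the hyperparameters discussed above (scikit-learn
# defaults written explicitly); the training data is a synthetic stand-in.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

clf = LogisticRegression(
    penalty="l2",        # norm used in the penalization
    C=1.0,               # inverse of regularization strength
    solver="lbfgs",      # optimization algorithm
    tol=1e-4,            # tolerance for stopping
    fit_intercept=True,  # add the constant (intercept) term
    class_weight=None,   # no class-imbalance adjustment
    max_iter=100,        # maximum number of iterations
    random_state=0,
)
clf.fit(X, y)

# Sigmoid-transformed class probabilities for one sample; they sum to 1.
proba = clf.predict_proba(X[:1])
print(proba.sum())
```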

Naive bayes classifier

Naive Bayes is a supervised learning method that solves classification problems by applying the conditional probability concept of Bayes' theorem. It is mostly employed for text categorization with large training sets. The underlying assumption is that the attributes are uncorrelated and independent of one another. For a classification problem, Bayes' theorem is written as:

$$P\left(y|X\right)=\frac{P\left(X|y\right)P(y)}{P(X)}$$

where

y = the target variable,

X = (x1, x2, x3, …, xn) = the input features,

P(y) = the prior probability of the target variable, and

P(X|y) = the likelihood function.

Substituting X and expanding using the chain rule, Bayes' theorem becomes [36]

$$P\left(y|{x}_{1},{x}_{2},\dots ,{x}_{n}\right)\propto P(y)\prod_{i=1}^{n}P\left({x}_{i}|y\right)$$

The model utilizes two parameters: "priors" for specifying the prior probabilities of the classes (set to none), and "var_smoothing" for incorporating variances to enhance stability (set to 1e-9).
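The two parameters noted above correspond directly to scikit-learn's Gaussian Naive Bayes implementation; a minimal sketch on synthetic stand-in data:

```python
# Gaussian Naive Bayes with the two parameters discussed above; the data is a
# synthetic stand-in for the study dataset.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=0)

nb = GaussianNB(
    priors=None,          # class priors estimated from the data
    var_smoothing=1e-9,   # variance added for numerical stability
)
nb.fit(X, y)

acc = nb.score(X, y)  # training accuracy on the synthetic data
print(acc)
```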

Decision tree

Decision trees are supervised learning techniques that can solve both regression and classification problems, although they are mostly employed for classification. A decision tree is a tree-structured classifier with two types of nodes for classifying unknown data: decision nodes, which contain several branches and are used to make decisions, and leaf nodes, which present the outcomes of those decisions. Attribute selection measures (ASM), such as information gain and the Gini index, are used to select the best attribute for the root node and each sub-node. Based on the information gain estimate, which tells us how much information a feature provides about a class, we split the node and build the decision tree. An attribute with high information gain is preferred over one with low information gain, and information gain can be written as

Information gain = Entropy(S) − [(weighted average) × Entropy(each feature)]

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

where S = the total set of samples, P(yes) = the probability of 'yes', and P(no) = the probability of 'no'.

On the other hand, the Gini index is a measure of purity or impurity used by the classification and regression tree process to create a decision tree. A low Gini index should be chosen over a high one, and it can be calculated as

Gini Index = 1 − \(\sum_{j}{P}_{j}^{2}\)

where \({P}_{j}\) denotes the proportion of instances at a node that belong to class j [37].
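The two impurity measures above are easy to verify numerically; a pure node (all one class) has zero impurity, while a 50/50 node is maximally impure under both measures:

```python
# Binary-class entropy and Gini index, as defined above.
import math

def entropy(p_yes: float) -> float:
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)."""
    total = 0.0
    for p in (p_yes, 1.0 - p_yes):
        if p > 0:  # 0 * log2(0) is taken as 0
            total -= p * math.log2(p)
    return total

def gini(p_yes: float) -> float:
    """Gini Index = 1 - sum of squared class proportions."""
    return 1.0 - (p_yes ** 2 + (1.0 - p_yes) ** 2)

print(entropy(0.5), gini(0.5))  # 1.0 0.5  (maximally impure node)
print(entropy(1.0), gini(1.0))  # 0.0 0.0  (pure node)
```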

The model's learning parameters include the following: criterion defines the function used to assess split quality; splitter determines the strategy for selecting splits at each node; max_depth specifies the maximum depth of the tree; min_samples_split sets the minimum number of samples required to split an internal node; min_samples_leaf establishes the minimum number of samples required to form a leaf node; min_weight_fraction_leaf determines the minimum weighted fraction of the total sum of weights required at a leaf; max_features specifies the number of features to consider when making splits; and random_state ensures reproducibility by initializing the random number generator. The values assigned to these parameters are listed in Table 2.
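These parameters map directly onto scikit-learn's decision-tree constructor. The values below are illustrative defaults, not the Table 2 values (which are not reproduced here), and the data is a synthetic stand-in:

```python
# Decision tree using the parameters listed above; values are illustrative
# defaults, not the study's Table 2 settings. Data is a synthetic stand-in.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",              # split-quality function ("gini" or "entropy")
    splitter="best",               # split-selection strategy at each node
    max_depth=None,                # grow until leaves are pure
    min_samples_split=2,           # minimum samples to split an internal node
    min_samples_leaf=1,            # minimum samples at a leaf node
    min_weight_fraction_leaf=0.0,  # minimum weighted fraction at a leaf
    max_features=None,             # consider all features at each split
    random_state=0,                # reproducibility
).fit(X, y)

acc_tree = tree.score(X, y)  # a fully grown tree fits the training set
print(tree.get_depth(), acc_tree)
```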

Table 2 The values of parameters of some ML Models

AdaBoost classifier

The AdaBoost algorithm, also known as Adaptive Boosting, was proposed by Freund and Schapire. It is a machine learning ensemble method that uses boosting techniques for the final classification and generates n decision trees during the data-learning stage. When a decision tree is constructed, the records incorrectly classified by the previous model are given higher priority, and only these records are used as inputs for the next model. This process is repeated until the desired number of base learners has been generated. Recall that all boosting strategies allow records to be repeated [38]. The tuning parameters used in this model for learning are max_depth; base_estimator, the base estimator used to build the boosted ensemble; algorithm, which defines how the weights for each classifier are computed; learning_rate, which shrinks the contribution of each classifier; n_estimators, the maximum number of estimators, indicating when boosting terminates; and random_state. The values of these parameters are listed in Table 2.
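A minimal AdaBoost sketch with the tuning parameters named above; the base estimator defaults to a depth-1 decision tree (a stump), and the values shown are illustrative defaults rather than the Table 2 settings:

```python
# AdaBoost on synthetic stand-in data. n_estimators and learning_rate are
# the tuning parameters discussed above; values are illustrative defaults.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

ada = AdaBoostClassifier(
    n_estimators=50,    # maximum number of boosting rounds
    learning_rate=1.0,  # shrinks each classifier's contribution
    random_state=0,     # default base estimator: a depth-1 decision tree
).fit(X, y)

acc_ada = ada.score(X, y)
print(acc_ada)
```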

Random forest

The Random Forest classifier is based on the principle of ensemble learning, which is the process of merging numerous classifiers to solve a complicated problem and enhance the model's performance. It employs a variety of decision trees on different subsets of the provided data and averages their results to increase the prediction accuracy on that dataset. Instead of depending on a single decision tree, the random forest collects forecasts from each tree and predicts the final output based on the majority vote of the predictions. The larger the number of trees in the forest, the higher the accuracy and the lower the risk of overfitting. Its operation has two phases: first, it builds a random forest by combining N decision trees, and then it obtains a prediction from each tree built in the first stage. An attribute is selected for each decision tree using the information gain or Gini index [39]. The parameters used in this algorithm for learning are criterion, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features (the number of features to draw from X to train each base estimator), n_estimators, random_state, oob_score (whether to use out-of-bag samples to estimate the generalization accuracy), bootstrap (whether bootstrap samples are used when building trees), and n_jobs (the number of jobs to run in parallel for both fit and predict). Table 2 lists the values of the tuning parameters.
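The parameters listed above can be sketched with scikit-learn's random forest; as before, the values are illustrative defaults on synthetic stand-in data, not the Table 2 settings:

```python
# Random forest with the parameters discussed above; illustrative defaults
# on synthetic stand-in data, not the study's Table 2 settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    criterion="gini",  # split-quality function
    max_depth=None,    # grow each tree fully
    bootstrap=True,    # draw bootstrap samples when building trees
    oob_score=True,    # estimate generalization accuracy from out-of-bag samples
    n_jobs=-1,         # fit and predict trees in parallel
    random_state=0,
).fit(X, y)

print(rf.oob_score_)  # out-of-bag estimate of generalization accuracy
```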

Bagging tree

Bagging, also referred to as bootstrap aggregating, is an ensemble learning method that enhances the efficiency and precision of machine-learning algorithms. It uses bootstrapping to draw random samples from the data and estimate a population parameter. Assume the training set consists of n observations and m features. A random sample is drawn from the training dataset with replacement, and a random subset of the m features is chosen to build a model on the sampled data. Each node is split on the attribute that yields the optimal split, and each tree is grown fully, giving the largest possible number of nodes. These steps are repeated n times. The method then integrates the output of the individual decision trees to produce the most accurate forecast. The integrated classifier prediction is a weighted aggregate of the separate classifier predictions and can be written as

$$H\left({d}_{i}\right)=\text{sign}\left(\sum_{m=1}^{M}{\alpha }_{m}{H}_{m}({d}_{i})\right)$$

where \(H\left({d}_{i}\right)\) is the final decision function for a given instance \({d}_{i}\), obtained by weighting the individual classifiers by their respective coefficients.

sign(·) takes the sign of its argument, returning +1 if the argument is positive, −1 if it is negative, and 0 if it is zero; in binary classification, the final conclusion is determined from the sign of the weighted sum.

M is the total number of classifiers in the ensemble.

\({\alpha }_{m}\) represents the weight of the mth classifier, and \({H}_{m}\left({d}_{i}\right)\) is the prediction of the mth classifier for the instance \({d}_{i}\) [40]. The parameters used in this model for learning are max_depth, max_features, max_samples (whether all (1) samples are used or not (0)), base_estimator, n_estimators, random_state, oob_score, bootstrap, and n_jobs. Table 2 lists the values of the parameters used in this algorithm.
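The bagging procedure above maps onto scikit-learn's bagging classifier, whose default base estimator is a decision tree; the parameter values are illustrative defaults on synthetic stand-in data:

```python
# Bagging (bootstrap aggregating) with the parameters discussed above;
# illustrative defaults on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=0)

bag = BaggingClassifier(
    n_estimators=50,   # number of bootstrapped base trees
    max_samples=1.0,   # fraction of samples drawn (with replacement) per tree
    max_features=1.0,  # fraction of features drawn per base estimator
    bootstrap=True,    # bootstrap samples, not whole-set copies
    oob_score=True,    # out-of-bag estimate of generalization accuracy
    random_state=0,    # default base estimator: a decision tree
).fit(X, y)

print(bag.oob_score_)
```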

Ensemble learning techniques

Ensemble learning is a strategy that integrates many machine-learning algorithms to generate a single optimum predictive model with decreased volatility (by bagging), bias (via boosting), and enhanced predictions (via stacking). This method offers robustness against data uncertainties and improves accuracy. Boosting, stacking, and bagging are the three primary categories of ensemble learning techniques [41] (Fig. 2).
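One simple way to combine the individual classifiers described in this section is soft voting, which averages the predicted class probabilities across models. This is a minimal sketch on synthetic stand-in data, not the study's exact ensemble configuration:

```python
# Soft-voting ensemble over three of the classifiers described above;
# a minimal sketch on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # average class probabilities instead of hard majority votes
).fit(X, y)

acc_vote = ensemble.score(X, y)
print(acc_vote)
```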

Fig. 2
figure 2

Cardiac failure prediction model structure

Results

Descriptive statistics

The mean age of the respondents was 57.21 years. Among them, 60% were male and 37% were female (Table 1). Approximately 60.1% of the participants in the sample had cardiovascular disease, whereas the remaining 39.9% were not affected by any type of cardiac failure. The dataset contains several medical disorders, including high cholesterol (66.8%), hypertension (54.7%), and diabetes (60.8%). Most participants (65.6%) were of normal weight, 28.7% were overweight, and 5.7% were underweight. The average platelet, creatinine, and sodium levels were 263,430.47/µL (reference range 150,000–400,000/µL), 1.777 mg/dL (0.40–1.40 mg/dL), and 146.335 mmol/L (135–148 mmol/L), respectively (Table 3).

Table 3 Summary statistics (continuous) of different variables of Cardiovascular patient

Primary education constituted the highest percentage of the sample (38.4%); 17.1% had no education, 26.6% had a secondary education, and 18.0% had a higher secondary education. Most participants (58.8%) came from middle-income families (20,000–40,000 Taka), whereas 21.4% came from low-income (< 20,000 Taka) and 19.8% from high-income (> 40,000 Taka) families. Most participants lived in rural areas (61%). According to the table, 67.3% of respondents engaged in regular physical activity. Approximately 86.3% consumed two or more servings of fruits or vegetables each day, and 70.0% did not consume junk food on a regular basis. More than half (55.9%) of the participants had sound sleep at night (Table 1).

A significant proportion of the respondents reported different negative mental health indicators, such as feeling bad about themselves (75.3%), feeling hopeless (79.9%), and having little interest or pleasure in doing activities (59.3%). Among the participants, 47.5% smoked, and 6.5% drank alcohol.

According to the chi-square test, CVD was significantly associated with gender, educational level, socio-economic status, regular physical exercise, sound sleep at night, regularly eating junk food, keeping too much salt in the diet, feeling bad about oneself, feeling hopeless, having a smoking habit, having a habit of drinking alcohol, having high blood pressure, having a high cholesterol level, having a family history of heart failure, having anemia, having any type of diabetes, having hypertension, having a sleep apnea problem, having irregular heart rhythms, coronary artery disease, angina symptoms, kidney, lung, or other major diseases, and BMI; all of these variables had a p-value of less than 0.05. However, there was no discernible link between CVD and division, residence, consuming two or more servings of fruits or vegetables daily, or feeling no interest or pleasure in doing anything (Table 4).

Table 4 Relationship between different variables and cardiovascular disease

Implementation and analysis of different machine learning models

This study employed multiple ML models to predict CVD in Bangladesh. The effectiveness of the employed ML models was analyzed using confusion matrices, and a comparison among all employed ML techniques was also conducted. The next section examines the data and presents the findings, paving the way for the subsequent assessment of performance across the different classification techniques.

Data analysis

The collected data were scrutinized and categorized into male and female segments, as illustrated in Fig. 3 and Table 5. Of a total of 651 samples, 391 individuals were diagnosed with CVD. The analysis further indicated that the incidence rates in males and females were 66.5% and 33.5%, respectively. Notably, the number of males diagnosed with heart disease exceeded that of females.

Fig. 3
figure 3

Relationship Between Gender features and CVD

Table 5 Analysis of CVD Dataset

Performance analysis

To assess and gauge the efficacy of the employed algorithms, a comprehensive evaluation was conducted using the confusion matrix and an array of pertinent metrics, which encompassed the ROC curve, True Positives, True Negatives, False Positives, False Negatives, precision, recall, F1 score, and accuracy. In the subsequent section, we present a performance analysis of each algorithm.
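These metrics all derive from the four confusion-matrix cells. As a worked check, plugging in the logistic-regression counts reported below (51 true negatives, 5 false positives, 74 true positives, and by subtraction 1 false negative out of 131 test samples) reproduces the paper's figures:

```python
# Confusion-matrix-based metrics, computed from the logistic regression
# counts reported in the Results section: TN=51, FP=5, FN=1, TP=74.
tn, fp, fn, tp = 51, 5, 1, 74

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # sensitivity / true-positive rate
specificity = tn / (tn + fp)      # true-negative rate
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy * 100, 2))   # 95.42
print(round(precision * 100, 2))  # 93.67
print(round(recall * 100, 2))     # 98.67
print(round(f1 * 100, 2))         # 96.1
```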

Logistic regression

The logistic regression model was trained on 520 samples and subsequently tested on 131 samples following the train-test split paradigm. Upon analyzing the performance of the model, we obtained the confusion matrix represented in Fig. 4. In this matrix, the yellow and green cells indicate correct predictions, where the model's output matches the target, whereas the purple cells signify instances of mismatch with the target. Figure 4 reveals that the Logistic Regression model accurately predicted 51 cases of no CVD and incorrectly predicted five samples; it correctly identified 74 CVD cases. Consequently, the total number of correct predictions was 125, against six incorrect predictions. From this analysis, the model's overall accuracy was calculated as 95.42%, as depicted in Fig. 5. Additionally, the precision rate of the model was 93.67%, and the recall rate was 98.67%. The F1 score (96.1%) in Fig. 5 shows that the model strikes a commendable balance between making precise positive predictions and capturing most of the positive instances.

Fig. 4
figure 4

Confusion Matrix of Logistic Regression

Fig. 5
figure 5

Classification Report of Logistic Regression Model

Figure 6, on the other hand, represents the ROC (Receiver Operating Characteristic) curve for the Logistic Regression model. In this representation, the Y-axis corresponds to the true positive rate, whereas the X-axis represents the False Positive Rate. Notably, the Area Under the ROC Curve (AUC) was calculated as 0.96 for both classes, signifying a high level of discriminative power and effectiveness in distinguishing between classes.

Fig. 6
figure 6

ROC Curve of Logistic Regression

Naïve Bayes classifier

The confusion matrix derived from testing the Naïve Bayes model on the collected dataset is illustrated in Fig. 7. This matrix presents the model's predictions on the test data. Of the 131 test samples, the classifier accurately predicted 74 samples for Class 1 and 52 samples for Class 0. There was one incorrect prediction for the positive class and four incorrect predictions for the negative class, giving 126 correct predictions against five incorrect ones. The classification report of the technique is provided in Fig. 8, which shows that the model achieved an accuracy of 96.18%, with an error rate of 3.82%. The model excelled in positive predictions, with a robust precision of 94.87% and an impressive recall of 98.67%. The high precision minimizes false positives, while the strong recall captures a substantial portion of the actual positive cases, showcasing the model's proficiency. With an F1 score of 96.73%, the model maintains a fine balance between precision and recall, making accurate positive predictions while comprehensively capturing positive instances. Figure 9 displays the ROC curve, illustrating the model's performance with an AUC of 0.96 for both the positive and negative classes, confirming its strong ability to distinguish between classes in binary classification tasks.

Fig. 7
figure 7

Confusion Matrix of Naïve Bayes

Fig. 8
figure 8

Classification Report of Naïve Bayes Model

Fig. 9
figure 9

ROC Curve of Naïve Bayes

Decision tree classifier

The collected dataset was used to train and test a Decision Tree classifier, and the resulting confusion matrix is displayed in Fig. 10. In this matrix, the green and yellow cells indicate that the model's output class matches the target class, whereas the purple cells signify instances in which it does not. For Class 1, the classifier correctly predicted 56.49% (74) of the samples and made incorrect predictions in only 0.76% (1) of cases. For Class 0, the classifier accurately predicted 40.46% (53) of the samples, with only 2.29% (3) incorrect predictions. In total, the Decision Tree classifier correctly identified 127 of the 131 samples and misclassified four. As shown in Fig. 11, the classification report highlights the model's performance. It exhibited a notably high recall (true-positive rate) of 98.67%, reflecting its ability to capture positive instances effectively, together with a commendable precision of 96.1%; the resulting F1 score of 97.37% underscores the model's overall acceptability and effectiveness. Figure 12 shows the classifier's ROC curve. Impressively, the Area Under the ROC Curve (AUC) measures 0.97 for both Class 0 and Class 1, indicating a high level of discriminatory power in distinguishing between the two classes.

Fig. 10
figure 10

Confusion Matrix of Decision Tree Classifier

Fig. 11
figure 11

Classification Report of Decision Tree Classifier

Fig. 12
figure 12

ROC Curve of Decision Tree Classifier

AdaBoost classifier

The confusion matrix generated by testing the AdaBoost classifier on the collected dataset is shown in Fig. 13. This matrix represents the model's predictions on the test data. Of the 131 test samples, the classifier correctly predicted 75 samples (57.25% of the total) for Class 1 and 52 samples (39.69%) for Class 0. Notably, the model made no incorrect predictions for Class 1 and only four (3.05%) incorrect predictions for Class 0, for totals of 127 correct and four incorrect predictions. The classification report of the model is presented in Fig. 14, revealing that the model achieved a remarkable accuracy of 96.95%, with an error rate of only 3.05%. The model excelled at making positive predictions, boasting an impressive precision of 94.94% and a perfect recall of 100%. With an F1 score of 97.4%, the classifier's predictions exhibited an exceptional balance between precision and recall, underlining its proficiency. Figure 15 presents the ROC curve, depicting the false-positive rate on the x-axis and the true-positive rate on the y-axis. Impressively, the Area Under the ROC Curve (AUC) measures 0.98 for both the positive and negative classes, affirming the model's strong discriminatory power in binary classification tasks.

Fig. 13
figure 13

Confusion Matrix of AdaBoost Classifier

Fig. 14
figure 14

Classification Report of AdaBoost Classifier

Fig. 15
figure 15

ROC Curve of AdaBoost Classifier

Random forest classifier

The collected dataset served as the basis for training and testing a Random Forest classifier, and the resulting confusion matrix is depicted in Fig. 16. In this matrix, the green and yellow cells indicate instances where the model's output class matches the target class, while the purple cells signify cases where it does not. For Class 1, the classifier made correct predictions for all 75 samples (57.25% of the test set). For Class 0, the classifier accurately predicted 40.46% (53) of the samples, with 2.29% (3) incorrect predictions. In total, the Random Forest classifier correctly identified 128 of the 131 instances and misclassified three, resulting in an impressive overall accuracy of 97.7%. Figure 17 presents the classifier's classification report, demonstrating a perfect recall (true-positive rate) of 100% and a commendable precision of 96.15%, which together yield a notably high F1 score of 98.04%, signifying the model's very good acceptability. Figure 18 shows the classifier's ROC curve, where the Area Under the ROC Curve (AUC) reaches an impressive 0.99 for both Class 0 and Class 1. This high AUC value underscores the classifier's exceptional ability to distinguish between the two classes in binary classification tasks.

Fig. 16
figure 16

Confusion Matrix of Random Forest Classifier

Fig. 17
figure 17

Classification Report of Random Forest Classifier

Fig. 18
figure 18

ROC Curve of Random Forest Classifier

Bagging tree

The Bagging Tree model was trained on a dataset comprising 520 samples and subsequently tested on 131 samples following the train-test split methodology. After scrutinizing the performance of the tested model, we derived the confusion matrix shown in Fig. 19. In this matrix, the yellow and green cells signify instances where the model's output aligns with the target, whereas the purple cells denote mismatches. As shown in Fig. 19, the Bagging Tree model correctly predicted 52 of the no-heart-disease samples, misclassified four, and correctly identified 74 of the 75 heart disease cases. The model thus achieved 126 correct predictions and five incorrect predictions, resulting in an overall accuracy of 96.18%, as depicted in Fig. 20. In addition, the precision and recall rates were 94.87% and 98.67%, respectively. The F1 score, shown in Fig. 20, indicates an excellent balance between producing precise positive forecasts and capturing the majority of actual positive cases. Figure 21 presents the ROC curve of the Bagging Tree model, where the Y-axis denotes the true-positive rate and the X-axis the false-positive rate. Impressively, the Area Under the ROC Curve (AUC) measures 0.98 for both Class 1 and Class 0, signifying the model's strong ability to distinguish effectively between the two classes in binary classification tasks.

Fig. 19
figure 19

Confusion Matrix of Bagging Tree Classifier

Fig. 20
figure 20

Classification Report of Bagging Tree Classifier

Fig. 21
figure 21

ROC Curve of Bagging Tree Classifier

Comparative analysis

A comparative analysis was conducted among the classifiers: Logistic Regression, Naïve Bayes, Decision Tree, AdaBoost, Random Forest, and Bagging Tree. This assessment thoroughly examined the performance metrics and ROC curves, as depicted in Figs. 22 and 23. The performance of these classifiers is further compared in Table 6, focusing on precision, recall, F1 score, accuracy, and ROC. The precision rates for the classifiers are as follows: Logistic Regression (93.67%), Naïve Bayes (94.87%), Decision Tree (96.1%), AdaBoost (94.94%), Random Forest (96.15%), and Bagging Tree (94.87%). Among the six techniques considered, Random Forest stands out as having the highest precision. Furthermore, all classifiers demonstrated exceptional true-positive rates, with both AdaBoost and Random Forest achieving a perfect 100% recall. The Random Forest classifier also achieved the highest accuracy, at 97.7%, and, with its high precision and perfect recall, the strongest F1 score, at 98.04%; although the other models performed well, they did not match Random Forest in this regard. In contrast, the Logistic Regression model achieved the lowest accuracy, at 95.42%. Figure 23 and Table 6 provide clear evidence of the Area Under the ROC Curve (AUC) for Class 0 and Class 1. Across the classifiers, the AUC values are as follows: Logistic Regression (0.959), Naïve Bayes (0.957), Decision Tree (0.967), AdaBoost (0.984), Random Forest (0.989), and Bagging Tree (0.985). Remarkably, the Random Forest classifier attained the highest AUC value, 0.989.
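The comparative protocol above can be sketched as a single loop over the six classifier families. The data and hyperparameters below are illustrative stand-ins, not the study's configuration:

```python
# Evaluate the six classifier families on one shared train/test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=651, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=131, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Bagging Tree": BaggingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]     # class-1 probability for the AUC
    results[name] = {
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "accuracy": accuracy_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }

for name, r in results.items():
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Tabulating `results` yields a comparison of the same shape as Table 6.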

Fig. 22
figure 22

Comparison Chart of Performance Metrics among Employed Classifiers

Fig. 23
figure 23

Comparison Graph of Area Under ROC curve

Table 6 Comparison Table of Metrics among Different Classifiers

The selected classifiers are applied to assess entirely new samples that have not been previously tested. The algorithm's prediction process consists of the following steps:

  i. Evaluate the dataset with fresh cases.

  ii. Following the learning phase, export the entire trained model from the software application to the workspace for further prediction.

  iii. Next, upload the new test dataset, ensuring that it is appropriately normalized. This dataset should maintain identical attribute fields to the full training dataset, with the sole exception of the absence of target class values.

  iv. Within the working environment, define a dedicated function for each of the exported trained models, following the format 'yfit = trainedmodel.predictFunction(T)', where 'trainedmodel' corresponds to the name of the compact model and 'T' is the reference to the test dataset.

  v. Execute the evaluation of the test dataset, and subsequently employ the various classifier algorithms for testing.
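The export-and-predict workflow in steps i-v can be sketched as a deployment loop. The joblib-based version below is an assumed Python equivalent of the 'yfit = trainedmodel.predictFunction(T)' convention the steps describe, not the authors' actual tooling:

```python
# Export a trained model plus its normalizer, then score fresh, unlabeled cases.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=651, n_features=20, random_state=1)

# Fit normalization on the training data (step iii requires identical scaling).
scaler = MinMaxScaler().fit(X)
model = RandomForestClassifier(random_state=1).fit(scaler.transform(X), y)

# Step ii: deliver the trained model from the application to the workspace.
path = os.path.join(tempfile.gettempdir(), "cvd_model.joblib")
joblib.dump((scaler, model), path)

# Later session (steps iii-v): reload, normalize the new records identically,
# and predict. X_new stands in for genuinely new, unlabeled cases.
scaler, model = joblib.load(path)
X_new = X[:5]
yfit = model.predict(scaler.transform(X_new))
print(yfit)
```

The new records must carry the same attribute fields as the training set, minus the target column, exactly as step iii requires.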

The primary objective of this research is to identify and employ an algorithm that outperforms the existing early prediction systems for heart diseases. In pursuit of this goal, we aim to enhance the accuracy of heart disease prediction. This research was motivated by the critical need to develop more effective and reliable methods for the early detection and prognosis of heart diseases. By exploring a range of machine learning and statistical modeling approaches, we aim to discover an algorithm that can greatly increase the accuracy and efficiency of heart disease prediction, ultimately contributing to better patient care and healthcare outcomes.

Proposed classifier

Based on the findings from the aforementioned studies, it is evident that the Random Forest classifier outperforms all other classifiers in terms of predictive accuracy and performance. Therefore, we strongly recommend adopting the Random Forest technique within a system for heart disease prediction. It is important to note that our collected dataset had not previously been used for training and testing; hence, we propose leveraging the best classifier, based on the results presented earlier. It is essential to recognize that a classifier's performance is not universally superior in all scenarios; it can vary with factors such as dataset size and additional attributes. The Random Forest classifier stands out for its robustness in various aspects of model performance and generalization. Unlike a single decision tree, it is less susceptible to overfitting, making it a reliable choice for modeling complex datasets. Moreover, it can effectively manage noisy or irrelevant features in a dataset without compromising performance. Random Forest demonstrates strong generalization capabilities, allowing it to perform well on unseen data across a wide range of classification tasks, and it can efficiently handle large datasets with high-dimensional feature spaces. This robustness is primarily attributed to its ensemble-based approach, which leverages multiple decision trees to address overfitting and noise and to enhance overall generalization performance. Therefore, while Random Forest demonstrates promise in this context, the choice of the most suitable classifier should always be context-dependent and assessed with consideration of the specific data and problem at hand. Hence, we delved into an in-depth analysis of how the features within our dataset influence the outcomes of the Random Forest classifier. To carry out this examination, we harnessed SHapley Additive exPlanations (SHAP). Figures 24 and 25, represented as "bee swarm plots" of SHAP values, give insight into the effect of every feature on the predictions of the trained and tested model. Both figures provide a clear visualization of the attributes that significantly influence the model's output. It is worth noting that among the 28 attributes, only 20 were deemed significant, as shown in the plots. These are the key features that play a pivotal role in shaping the model's predictions.
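The paper attributes feature influence with SHAP. As a lighter-weight, related diagnostic, a Random Forest's built-in impurity-based importances give a similar global ranking; the sketch below uses hypothetical placeholder feature names rather than the study's 28 attributes:

```python
# Rank features by a Random Forest's impurity-based importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=651, n_features=8, n_informative=4,
                           random_state=7)
features = [f"attr_{i}" for i in range(X.shape[1])]   # hypothetical names

model = RandomForestClassifier(random_state=7).fit(X, y)

# feature_importances_ sums to 1; larger values mean stronger influence.
ranking = sorted(zip(features, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Unlike SHAP, these importances are global and unsigned: they rank attributes by influence but do not show whether a feature pushes individual predictions up or down, which is what the bee swarm plots in Figs. 24 and 25 add.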

Fig. 24
figure 24

Average Impact of Each Feature on Model Prediction

Fig. 25
figure 25

Beeswarm Plot of SHAP values impact on the Random Forest Model

In Fig. 24, the graph displays two axes: the x-axis is designated "SHAP value (average impact on model output magnitude)," and the y-axis is labeled "Features." The graph clearly illustrates that the features with the highest SHAP values encompass cholesterol, hypertension, irregular heart rhythms, and sleep apnea. This implies that these particular attributes are of the utmost importance in influencing the model's predictions. It is worth emphasizing that the SHAP values provided are averages, and the specific impact of a feature on a particular prediction may fluctuate contingent on the values of the other features involved. In Fig. 25, a compelling pattern emerges as we observe the impact of cholesterol values on model predictions. Notably, lower cholesterol values are associated with negative SHAP values, represented by points extending towards the left and becoming increasingly blue. Conversely, higher cholesterol values yield positive SHAP values, depicted by points extending towards the right and turning increasingly red. The density of these red dots is notably high, indicating that the "Cholesterol" feature exerts a substantial impact on the model's predictions; in essence, the prediction relies significantly on this feature. Furthermore, it is important to clarify the directional influence of the SHAP values: a positive SHAP value indicates that an attribute elevates the forecast, whereas a negative value suggests that a feature has a diminishing effect on the prediction, offering valuable insight into the model's decision-making process. Certain attributes, such as "Take_CholesterolMedication," "Angina," and "Anemia," are observed to exert a downward influence on the model's predictions, as denoted by the presence of blue dots. Interestingly, coronary disease appeared to have a comparatively lower impact on the model's predictions. The effects of the other attributes are situated toward the lower portion of the plot, indicating their relatively less significant role in shaping the model's output.

In summary, Figs. 24 and 25 offer a valuable glimpse into the primary determinants that influence the risk of heart disease, as assessed by the Random Forest (RF) model. Nonetheless, it is vital to bear in mind that the SHAP values provided are averages, and that the specific impact of a feature on a particular prediction can vary depending on the values of other associated features. Some supplementary observations were extracted from these figures.

  • Individuals with sleep apnea, angina, or a family history of cardiovascular illness have a heightened risk of CVD.

  • The adoption of cholesterol medications is correlated with a lower risk of cardiovascular illness.

  • Smoking is associated with a higher risk of cardiovascular illness and stroke.

  • Experiencing feelings of distress, hopelessness, or a lack of interest is also associated with an elevated risk of CVDs.

These data provide a foundation for devising strategies to prevent and mitigate CVD risks. For instance, individuals with elevated cholesterol levels, hypertension, or irregular heart rhythms should collaborate closely with their healthcare providers to manage these conditions effectively. Those with a family history of heart disease or other predisposing risk factors should engage in discussions with their healthcare professionals to explore methods of risk reduction and tailored prevention approaches.

Efficiency of the RF compared to the other published article

In the majority of cases, the efficiency of the RF model is higher than that reported in previous similar studies (Table 7).

Table 7 Efficiency of the RF compared to the other published article

Discussion

Cardiovascular disease (CVD) ranks highest among all causes of death globally [42]. Late detection of cardiac issues significantly worsens the prognosis for patients [43]. Machine learning is a vital tool for diagnosing conditions such as heart issues, movement abnormalities, and other disorders. When such information is predicted accurately in advance, physicians can gain valuable insights that help them customize each patient's diagnosis and treatment strategy.

The goal of this project was to predict CVD risk among Bangladeshi people using different machine learning models. In our study, the precision rates were highest for Random Forest (96.15%), followed by Decision Tree (96.1%), AdaBoost (94.94%), Naïve Bayes (94.87%), and Bagging Tree (94.87%). Among the techniques studied, Random Forest was the most precise. Moreover, the Random Forest classifier achieved the highest accuracy, correctly classifying 97.7% of the test samples; while the other models performed admirably, they fell short of Random Forest. In addition, with its strong recall and high precision, the Random Forest classifier achieved the highest F1 score, at 98.04%. Random Forest thus produced the best prediction result, with 97.7% accuracy, which is similar to previous studies [11]. Similar to our study, a previous study found that the Random Forest (RF) approach achieved almost 100% accuracy, sensitivity, and specificity in identifying features with the highest likelihood of heart disease [42]. Kumar et al. (2020) employed a range of machine learning methods to forecast heart disease [44] and found that, in comparison with alternative classifier methods, random forests had the highest accuracy, at 85.71%. In other studies, Naïve Bayes achieved the highest accuracy of 84.16% when employing the ten most crucial characteristics [45, 46]. According to previous studies, decision trees have the lowest accuracy rate (77.55%), but when combined with boosting approaches, they outperform with an accuracy of 82.17% [47]. However, with an accuracy of 95.42%, the Logistic Regression model had the lowest performance in our study. By combining principal component analysis with alternating decision trees, M.A. Jabbar et al. achieved 92.2% accuracy using a logistic regression model [48].

The findings of previous studies clearly demonstrate that the Random Forest classifier outperforms all other classifiers in terms of expected accuracy and performance [11]. Incorporating the Random Forest approach into a system for CVD prediction is highly recommended. To validate and forecast cardiovascular illness independently, this study focused on understanding cardiovascular disease and its main contributing factors, in addition to providing a collection of industry-standard benchmark machine learning algorithms. Factors such as salt intake, feelings of inferiority, depression, smoking, blood pressure, family history of heart failure, high cholesterol, anemia, diabetes, hypertension, sleep apnea, and other health issues were also significantly associated with CVD, which is consistent with previous studies [49]. In the current study, the "cholesterol" characteristic played a major role in the forecast. It is also critical to define the direction in which the SHAP values influence the prediction: a positive SHAP value indicates a feature that increases the prediction. According to previous studies, cholesterol is an important risk factor for cardiovascular disorders [50].

This study could impact clinical practice by providing physicians with a new tool to estimate a patient's chance of survival. The results revealed risk factors and subtle trends that may not be readily apparent to medical practitioners. Early identification is critical because quick action can prevent and treat CVD. Machine learning algorithms can be used to calculate a person's lifetime risk of heart disease. These algorithms can enable proactive preventative measures and provide continuous risk assessments by continuously monitoring and analyzing health data.

Strength

The primary strength of this study lies in its ability to discern the significance and contribution of individual factors to the prediction of cardiovascular disease (CVD) risk, achieved through the utilization of SHAP values. Additionally, this study incorporated both behavioral and clinical factors in the prediction of CVD risk, providing a comprehensive perspective on the influencing variables.

Limitations

This study has certain limitations. First, it is a cross-sectional study that provides a snapshot of information at a specific point in time. A longitudinal study that tracks patients over an extended period would be beneficial to enhance our understanding and predictive accuracy. Second, the sample size in this study was limited to 651, which may impede the precision of predicting cardiovascular disease (CVD) risk using machine learning models. Future investigations could benefit from larger sample sizes to improve the robustness of our findings.

Advantages

The findings of this study will be of great assistance to policymakers in making decisions regarding patients with heart failure, especially those who are vulnerable in Bangladesh. Additionally, the proposed best-fitting model, Random Forest, will aid medical professionals and lab technicians in detecting heart failure at an earlier stage of the disease. Furthermore, the government of Bangladesh can utilize this research to gain a better understanding of the current state of heart failure patients and formulate policies in the healthcare sector based on this information. Policymakers can also target interventions at the features that most strongly influence the prediction of heart disease.

Conclusions

This study provides valuable insights into the prediction of cardiovascular disease (CVD) in Bangladesh, a country where CVDs are increasingly becoming a leading cause of mortality. The study of CVD holds great significance for Bangladesh because of the disease's effects on the country's socioeconomic development, healthcare infrastructure, and public health. Through the utilization of various machine learning techniques, including Logistic Regression, Naïve Bayes, Decision Tree, AdaBoost, Random Forest, and Bagging Tree classifiers, we aimed to identify the critical factors influencing CVD and develop a robust predictive model. Random Forest was the most successful classifier of the methods examined; it showed the best precision, accuracy, recall, F1 score, and area under the receiver operator characteristic curve (AU-ROC). The Random Forest classifier surpassed the other models with the highest precision rate, providing clinicians with a trustworthy tool for determining patient prognosis and CVD risk. This study highlights the importance of using machine learning techniques in the healthcare industry to improve the early identification and management of CVD, especially in low- and middle-income countries such as Bangladesh. The application of machine learning algorithms for CVD prediction is significantly changing the way researchers, patients, and healthcare professionals approach the prevention and management of cardiovascular disease. Healthcare practitioners can improve patient care techniques and make better decisions by applying the Random Forest methodology in their clinical practice.

Availability of data and materials

Data will be made available upon request to the corresponding author.

Abbreviations

(mcL):

Platelets per microliter

(mg/dL):

Milligram per deciliter

(mmol/L):

Millimoles per liter

References

  1. WHO. World Heart Day 2017. WHO; 2017. https://blogs.biomedcentral.com/on-medicine/2017/09/28/world-heart-day-2017-at-the-heart-of-health/.

  2. Almazroi AA. Survival prediction among heart patients using machine learning techniques. Math Biosci Eng. 2022;19(1):134–45. https://doi.org/10.3934/mbe.2022007.


  3. WHO. Cardiovascular diseases (CVDs) fact sheets. WHO; 2016. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)?gad_source=1&gclid=Cj0KCQjwlN6wBhCcARIsAKZvD5igKkwWXscr1WZsSwfhzznkBgao-Qi40ekShtjmkHlIkWAv7mDBg8IaAjrWEALw_wcB.

  4. Boyer K. Encyclopedia of Global Health. National Heart, Lung, and Blood Institute (NHLBI); 2011.


  5. Dokainish H, et al. Global mortality variations in patients with heart failure: results from the International Congestive Heart Failure (INTER-CHF) prospective cohort study. Lancet Glob Health. 2017. https://doi.org/10.1016/S2214-109X(17)30196-1.


  6. Islam AM, Mohibullah A, Paul T. Cardiovascular disease in Bangladesh: a review. Bangladesh Heart J. 2017;31(2):80–99. https://doi.org/10.3329/bhj.v31i2.32379.


  7. Chowdhury MZI, et al. Prevalence of cardiovascular disease among bangladeshi adult population: A systematic review and meta-analysis of the studies. Vasc Health Risk Manag. 2018;14:165–81. https://doi.org/10.2147/VHRM.S166111.


  8. WHO. Fact sheets: Cardiovascular diseases (CVDs). WHO; 2021.


  9. Lestari Santika Dewi NGAP, Adelia Yasmin AAAD, Citra Riesti Wulan NM, Wira Natanagara IGC. Factors affecting chronic heart failure in patients with end-stage renal disease at Bhayangkara Hospital Denpasar. Biosci Med J Biomed Transl Res. 2022. https://doi.org/10.37275/bsm.v6i7.545.

  10. Woo K, Dowding D. Factors affecting the acceptance of telehealth services by heart failure patients: An integrative review. Telemedicine and e-Health. 2018. https://doi.org/10.1089/tmj.2017.0080.


  11. Hossain MI, et al. Heart disease prediction using distinct artificial intelligence techniques: performance analysis and comparison. Iran J Comput Sci. 2023. https://doi.org/10.1007/s42044-023-00148-7.


  12. Sasayama S. Heart disease in asia. Circulation. 2008;118(25):2669–71. https://doi.org/10.1161/CIRCULATIONAHA.108.837054.


  13. Baghdadi NA, FarghalyAbdelaliem SM, Malki A, Gad I, Ewis A, Atlam E. Advanced machine learning techniques for cardiovascular disease early detection and diagnosis. J Big Data. 2023;10(1):1–29.


  14. Pal M, Parija S, Panda G, Dhama K, Mohapatra RK. Risk prediction of cardiovascular disease using machine learning classifiers. Open Med. 2022;17(1):1100–13. https://doi.org/10.1515/med-2022-0508.


  15. Mohi Uddin KM, Ripa R, Yeasmin N, Biswas N, Dey SK. Machine learning-based approach to the diagnosis of cardiovascular vascular disease using a combined dataset. Intell Med. 2023;7:100100. https://doi.org/10.1016/j.ibmed.2023.100100.


  16. Mehrabani-Zeinabad K, Feizi A, Sadeghi M, Roohafza H, Talaei M, Sarrafzadegan N. Cardiovascular disease incidence prediction by machine learning and statistical techniques: a 16-year cohort study from eastern Mediterranean region. BMC Med Inform Decis Mak. 2023;23(1):1–12. https://doi.org/10.1186/s12911-023-02169-5.


  17. Zhao X, et al. A deep learning model for early risk prediction of heart failure with preserved ejection fraction by DNA methylation profiles combined with clinical features. Clin Epigenetics. 2022. https://doi.org/10.1186/s13148-022-01232-8.


  18. Luo C, Zhu Y, Zhu Z, Li R, Chen G, Wang Z. A machine learning-based risk stratification tool for in-hospital mortality of intensive care unit patients with heart failure. J Transl Med. 2022. https://doi.org/10.1186/s12967-022-03340-8.


  19. Doğru A, Buyrukoğlu S, Arı M. A hybrid super ensemble learning model for the early-stage prediction of diabetes risk. Med Biol Eng Comput. 2023. https://doi.org/10.1007/s11517-022-02749-z.


  20. Buyrukoglu S. Improvement of machine learning models' performances based on ensemble learning for the detection of Alzheimer disease. In: Proceedings - 6th International Conference on Computer Science and Engineering, UBMK 2021. 2021. https://doi.org/10.1109/UBMK52708.2021.9558994.


  21. Buyrukoğlu S, Savaş S. Stacked-Based Ensemble Machine Learning Model for Positioning Footballer. Arab J Sci Eng. 2023. https://doi.org/10.1007/s13369-022-06857-8.


  22. Buyrukoğlu S. New hybrid data mining model for prediction of Salmonella presence in agricultural waters based on ensemble feature selection and machine learning algorithms. J Food Saf. 2021. https://doi.org/10.1111/jfs.12903.


  23. Buyrukoğlu G, Buyrukoğlu S, Topalcengiz Z. Comparing Regression Models with Count Data to Artificial Neural Network and Ensemble Models for Prediction of Generic Escherichia coli Population in Agricultural Ponds Based on Weather Station Measurements. Microb Risk Anal. 2021. https://doi.org/10.1016/j.mran.2021.100171.


  24. Buyrukoglu S. Promising cryptocurrency analysis using deep learning. In: ISMSIT 2021 - 5th International Symposium on Multidisciplinary Studies and Innovative Technologies, Proceedings. 2021. https://doi.org/10.1109/ISMSIT52890.2021.9604721.


  25. Alba AC, Agoritsas T, Jankowski M, Courvoisier D, Walter SD, Guyatt GH, Ross HJ. Risk prediction models for mortality in ambulatory patients with heart failure: a systematic review. Circulation: Heart Failure. 2013;6(5):881–9.

  26. Lam CSP. Heart failure in Southeast Asia: facts and numbers. ESC Heart Failure. 2015. https://doi.org/10.1002/ehf2.12036.

  27. Yap J, Lim FY, Chia SY, Allen JC, Jaufeerally FR, Macdonald MR, Chai P, Loh SY, Lim P, Zaw MWW, Teo L, Sim D, Lam CSP. Prediction of survival in Asian patients hospitalized with heart failure: validation of the OPTIMIZE-HF risk score. J Card Fail. 2019. https://doi.org/10.1016/j.cardfail.2019.02.016.

  28. Canepa M, Fonseca C, Chioncel O, Laroche C, Crespo-Leiro MG, Coats AJS, Mebazaa A, Piepoli MF, Tavazzi L, Maggioni AP, Anker SD, Filippatos G, Ferrari R, et al. Performance of prognostic risk scores in chronic heart failure patients enrolled in the European Society of Cardiology Heart Failure Long-Term Registry. JACC Heart Fail. 2018. https://doi.org/10.1016/j.jchf.2018.02.001.

  29. Straw S, Byrom R, Gierula J, Paton MF, Koshy A, Cubbon R, Drozd M, Kearney MT, Witte KK. Predicting one-year mortality in heart failure using the ‘Surprise Question’: a prospective pilot study. Eur J Heart Fail. 2019. https://doi.org/10.1002/ejhf.1353.

  30. Dauriz M, Mantovani A, Bonapace S, Verlato G, Zoppini G, Bonora E, Targher G. Prognostic impact of diabetes on long-term survival outcomes in patients with heart failure: a meta-analysis. Diabetes Care. 2017. https://doi.org/10.2337/dc17-0697.

  31. Segar MW, Vaduganathan M, Patel KV, et al. Machine learning to predict the risk of incident heart failure hospitalization among patients with diabetes: the WATCH-DM risk score. Diabetes Care. 2019. https://doi.org/10.2337/dc19-0587.

  32. Son MK, Lim NK, Park HY. Predicting stroke and death in patients with heart failure using CHA2DS2-VASc score in Asia. BMC Cardiovasc Disord. 2019. https://doi.org/10.1186/s12872-019-1178-0.

  33. Morse JM. Determining sample size. Qual Health Res. 2000;10(1):3–5.

  34. Niu L. A review of the application of logistic regression in educational research: common issues, implications, and suggestions. Educ Rev. 2020. https://doi.org/10.1080/00131911.2018.1483892.

  35. Zou X, Hu Y, Tian Z, Shen K. Logistic Regression Model Optimization and Case Analysis. In: Proceedings of the IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT 2019). 2019. https://doi.org/10.1109/ICCSNT47585.2019.8962457.

  36. Taheri S, Mammadov M. Learning the naive bayes classifier with optimization models. Int J Appl Math Comput Sci. 2013. https://doi.org/10.2478/amcs-2013-0059.

  37. Charbuty B, Abdulazeez A. Classification Based on Decision Tree Algorithm for Machine Learning. J Appl Sci Technol Trends. 2021. https://doi.org/10.38094/jastt20165.

  38. Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning (ICML 1996). 1996. p. 148–156. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=d186abec952c4348870a73640bf849af9727f5a4.

  39. Pal M. Random forest classifier for remote sensing classification. Int J Remote Sens. 2005. https://doi.org/10.1080/01431160412331269698.

  40. Machová K, Barčák F, Bednár P. A bagging method using decision trees in the role of base classifiers. Acta Polytech Hung. 2006.

  41. Franklin J. The elements of statistical learning: data mining, inference and prediction. Mathematical Intelligencer. 2005. https://doi.org/10.1007/BF02985802.

  42. Ali MM, Paul BK, Ahmed K, Bui FM, Quinn JMW, Moni MA. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput Biol Med. 2021;136:104672. https://doi.org/10.1016/j.compbiomed.2021.104672.

  43. García-Ordás MT, Bayón-Gutiérrez M, Benavides C, Aveleira-Mata J, Benítez-Andrades JA. Heart disease risk prediction using deep learning techniques with feature augmentation. Multimed Tools Appl. 2023;82:31759–73.

  44. Kumar NK, Sindhu GS, Prashanthi DK, Sulthana AS. Analysis and prediction of cardiovascular disease using machine learning classifiers. In: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). IEEE; 2020. p. 15–21.

  45. Sharma V, Yadav S, Gupta M. Heart disease prediction using machine learning techniques. In: 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN). IEEE; 2020. p. 177–181.

  46. Ramalingam VV, Dandapath A, Raja MK. Heart disease prediction using machine learning techniques: a survey. Int J Eng Technol. 2018;7:684–7.

  47. Pouriyeh S, Vahid S, Sannino G, De Pietro G, Arabnia H, Gutierrez J. A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease. In 2017 IEEE symposium on computers and communications (ISCC). IEEE; 2017. p. 204–207.

  48. Jabbar MA, Deekshatulu BL, Chandra P. Alternating decision trees for early diagnosis of heart disease. In: International Conference on Circuits, Communication, Control and Computing. IEEE; 2014. p. 322–328.

  49. Jindal H, Agrawal S, Khera R, Jain R, Nagrath P. Heart disease prediction using machine learning algorithms. In IOP conference series: materials science and engineering (Vol. 1022, No. 1). IOP Publishing; 2021. p. 012072.

  50. Fazakis N, Dritsas E, Kocsis O, Fakotakis N, Moustakas K. Long-term cholesterol risk prediction using machine learning techniques in ELSA database. In: Proceedings of the International Joint Conference on Computational Intelligence (IJCCI 2021). 2021. p. 445–50. https://doi.org/10.5220/0010727200003063.

Acknowledgements

We would like to express our gratitude to all the patients who showed their eagerness to provide information for our study.

Funding

This research received funding from the National Science and Technology (NST) of the Bangladesh Government under a Research and Development (R&D) fellowship.

Author information

Authors and Affiliations

Authors

Contributions

SH conceptualized and designed the study, framed the hypothesis, extracted the data, conducted the statistical analysis, supervised the project, and drafted, revised, and critically reviewed the manuscript. MKH conducted the data analysis, tabulated the results, and drafted the manuscript. MOF supervised the project and revised and drafted the manuscript. NA, RH, and KH collected the data for this study and drafted the manuscript. All authors have read the manuscript, agreed on the authorship order, and approved the work for submission to the journal.

Authors’ information

Not Applicable.

Corresponding author

Correspondence to Sorif Hossain.

Ethics declarations

Ethics approval and consent to participate

This study did not involve human or animal experimentation. Informed consent was obtained from all patients with CVD and/or their legal guardian(s) before data collection. Ethical approval was obtained from the Ethics Committee of Noakhali Science and Technology University (reference no. NSTU/SCI/EC/2023/202).

Consent for publication

Participants authorized the publication of the analyzed survey results without disclosure of identifiable information. Consent was obtained verbally during the survey through questions posed by the enumerator; no signed or written consent was collected. Patients were asked verbally for permission to use their data for research purposes. Additionally, the analysis refrained from reporting any personal or sensitive information about patients.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Hossain, S., Hasan, M.K., Faruk, M.O. et al. Machine learning approach for predicting cardiovascular disease in Bangladesh: evidence from a cross-sectional study in 2023. BMC Cardiovasc Disord 24, 214 (2024). https://doi.org/10.1186/s12872-024-03883-2


Keywords