ABSTRACT
Objective
To develop an automated method that accurately predicts patients’ hemoglobin A1c (HbA1c) values.
Methods
Data from 118,265 unique patients were collected from 30 hospitals in İstanbul. Although the study began with a large dataset, extensive data cleaning reduced the final modeling dataset to 180 complete records. While this limits generalizability, it allows for a controlled evaluation of prediction models under realistic constraints of missing clinical data. For the analysis, the data were split into training and testing sets at a 70%-30% ratio. A gradient boosting algorithm was used to train a machine learning model to predict HbA1c values.
Results
In this study, a machine learning model was developed to predict HbA1c using patients’ readily obtainable vital data. On the test set, the model achieved an overall accuracy of 80% and a Cohen’s kappa of 0.549, which is acceptable. Validation results are also promising, with an accuracy of 82%.
Conclusion
Although predicting HbA1c using different parameters can be advantageous, these predictions may be inaccurate and should be used and interpreted with caution in patients with anemia, polycythemia, or hemoglobinopathies. Our values, with specificity of 86.72%, sensitivity of 76.71%, and accuracy of 81.76%, can help eliminate this problem. In addition to satisfactory testing and validation results, the study not only explains the gradient-boosting machine learning model but also provides detailed information on cleaning noisy data and imputing missing health data.
INTRODUCTION
Artificial intelligence (AI) and machine learning (ML) have rapidly become essential tools in healthcare for predicting disease outcomes, supporting diagnosis, and personalizing treatment plans. Among chronic conditions, diabetes mellitus stands out due to its prevalence and long-term complications. Accurate prediction and early intervention are critical to improving patient outcomes. One biomarker central to diabetes management is hemoglobin A1c (HbA1c), which reflects average blood glucose levels over a two- to three-month period and is a key metric for both diagnosis and monitoring of the disease (1).
While HbA1c is typically measured through laboratory tests, several real-world challenges hinder its timely availability (2), such as cost, limited access to laboratories in rural areas, and delays resulting from the required three-month measurement period. Moreover, in cases where patients have hemoglobin-related disorders (e.g., anemia, polycythemia, hemoglobinopathies), direct measurement may yield unreliable results. In such scenarios, a reliable prediction of HbA1c based on alternative, easily obtainable physiological parameters (e.g., age, glucose level, blood pressure) becomes clinically valuable, potentially enabling faster interventions and improved continuity of care.
Although various ML approaches have been applied to predict glycemic trends, most existing models fall short in predicting HbA1c. For instance, Plis et al. (3) used support vector regression and autoregressive integrated moving average models informed by glucose dynamics to predict short-term glucose levels, but these models rely heavily on patient-specific time-series glucose data, limiting their generalizability. Zou et al. (4) compared decision trees, random forests, and neural networks for diabetes prediction using physical exam data, but did not focus on HbA1c as a predictive target, nor did they address the implications of missing or noisy data—a common issue in real-world health datasets.
This study addresses these gaps by proposing a gradient boosting model specifically tailored to predict HbA1c levels using routinely collected patient data. Unlike prior approaches, the proposed model incorporates robust data-preprocessing strategies—including advanced outlier detection, synthetic data augmentation, and ML-based imputation of missing values—resulting in a cleaner, more usable dataset (5). While the study began with a large dataset, strict data quality protocols and completeness criteria narrowed the final sample used for modeling. Gradient boosting is chosen because of its strengths in handling imbalanced datasets, reducing bias, and offering higher accuracy with less overfitting compared with traditional ensemble methods such as random forests (6).
In summary, this research contributes to the field by demonstrating that HbA1c can be predicted with high accuracy from non-invasive, widely available clinical parameters, using a refined gradient-boosting model, offering a practical solution in settings where HbA1c cannot be directly measured or trusted.
METHODS
A two-stage analysis was performed in this study. The main purpose of the study is to estimate the HbA1c value, which is effective for predicting the development of diabetes. However, due to missing data, this estimation model could not be directly applied, and the data deficiencies had to be addressed beforehand, as explained in the following sections.
In the second stage, the HbA1c value, the main purpose of the study, was estimated. For this purpose, the HbA1c value was converted to discrete data. At this stage, HbA1c values were categorized into clinically relevant risk groups in alignment with American Diabetes Association guidelines and were further stratified as follows: values <5.7% were labeled as low risk (normal), 5.7-6.4% as moderate (prediabetes), 6.5-7.9% as high (controlled diabetes), and ≥8.0% as very high (uncontrolled diabetes). These thresholds were chosen to reflect increasing risk of diabetes-related complications and to align with clinical treatment targets. A total of 180 patients with complete HbA1c data were included in the study. Ethical approval was obtained from the Non-Invasive Clinical Research Ethics Committee of İstanbul Medipol University (approval no: 685, date: 03.09.2020). As this was a retrospective study using anonymized data, no informed consent was required from the patients. The data were divided into 70% training and 30% test sets using stratified sampling before ML model training. Synthetic minority over-sampling technique was applied to the 70% data slice used for learning, increasing the number of records to 375, and testing was performed with the remaining 30% slice (7, 8). No synthetic data generation was carried out for the testing data.
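The categorization above can be expressed as a simple thresholding function. The sketch below (in Python, with an illustrative function name not taken from the study) encodes the four ADA-aligned cut-points described in the text:

```python
def hba1c_risk_category(hba1c_percent: float) -> str:
    """Map an HbA1c value (%) to the four risk groups used in this study."""
    if hba1c_percent < 5.7:
        return "low"        # normal
    elif hba1c_percent < 6.5:
        return "moderate"   # prediabetes (5.7-6.4%)
    elif hba1c_percent < 8.0:
        return "high"       # controlled diabetes (6.5-7.9%)
    else:
        return "very high"  # uncontrolled diabetes (>= 8.0%)
```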
ML algorithms are highly configurable by their hyperparameters. These parameters substantially influence the behavior, complexity, speed, and other aspects of the algorithm, and their values must be selected carefully to achieve optimal performance (9). In the estimation of the HbA1c variable, the best result was obtained when the gradient boosting algorithm was customized with the following hyperparameters: tree depth=8, number of models=200, and learning rate=0.05. A learning rate of 0.1 is usually a good starting point for gradient boosting algorithms, but the optimal learning rate depends on the number of models. For this reason, the model achieved its best performance when the number of models was high and the learning rate was low. Also, for the k-nearest neighbor algorithm, the Euclidean distance metric was used, and k was set to 5. The optimum values for the support vector machine (SVM) algorithm were kernel type radial basis function and sigma=1.6.
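The study was implemented in KNIME; as a rough equivalent, the reported hyperparameters could be configured in scikit-learn as sketched below. Note that the sigma-to-gamma conversion for the RBF kernel, gamma = 1/(2·sigma²), is an assumption about how KNIME parameterizes the kernel and is not stated in the text:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Gradient boosting: tree depth = 8, number of models = 200, learning rate = 0.05
gb = GradientBoostingClassifier(max_depth=8, n_estimators=200, learning_rate=0.05)

# k-nearest neighbors: Euclidean distance metric, k = 5
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")

# SVM with an RBF kernel; sigma = 1.6 translated to scikit-learn's gamma
# under the assumed RBF form exp(-||x - y||^2 / (2 * sigma^2))
svm = SVC(kernel="rbf", gamma=1.0 / (2 * 1.6 ** 2))
```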
Selection and Identification of Cases
Data
In this study, data from 118,265 unique patients were collected at 30 hospitals in İstanbul. The dataset was anonymized before inclusion in the study by stripping off identifying features. The dataset contains 12 patient attributes (variables). These variables are continuous or categorical and include age, gender, height, weight, patient diabetes status, family history of diabetes, and average glucose level. All variables are given in Table 1.
Preparation of Data
The main objective of the study is to estimate HbA1c levels, an important parameter in individuals with diabetes. However, before reaching this stage, some data preprocessing needed to be carried out to prepare the data for analysis. It is well known that one problem with health data is that they may be incomplete, inaccurate, missing, or noisy. The performance of the AI models may be poor, or the models’ estimation accuracy may be low due to noisy data. Data preprocessing is probably the most important stage of data analysis and data mining, and therefore of ML.
First, it was observed that there were missing data in the raw dataset. Since the number of missing mean glucose values in the dataset is high (i.e., postprandial blood sugar; 107,012 missing values), these were completely excluded from the dataset. After removal of records with missing data, 11,253 unique patient records remained in the relevant dataset. After that, outlier cleaning was performed. To detect the outliers for each variable, the first and third quartiles (Q1 and Q3) were computed. An observation was flagged as an outlier if it lay outside the range R=[Q1−k(IQR), Q3+k(IQR)], where IQR=Q3−Q1 and k=1.5 (10, 11). After this process, 6,754 records remained in the dataset.
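The quartile-based outlier rule can be sketched in pure Python as follows (quartile estimates differ slightly across computation methods; `statistics.quantiles` with its default exclusive method is used here as one reasonable choice, and the function names are illustrative):

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Compute the acceptance range R = [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the observations that fall inside the IQR range."""
    lo, hi = iqr_bounds(values, k)
    return [v for v in values if lo <= v <= hi]
```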
Among these records, 6,686 mean fasting glucose values were missing and 68 were present. The aim was to fill the records with missing mean fasting glucose values using the existing values. As a first step, missing mean fasting glucose values were predicted and imputed into the dataset. In the original dataset, the fasting glucose value was a continuous variable. To handle missing values for this variable, the fasting glucose value was first converted into discrete categories. Accordingly, fasting glucose values between 0 and 70 were labeled as hypoglycemia; values between 70 and 100 were labeled as normal; those between 100 and 125 were labeled as prediabetes; and values higher than 125 were labeled as diabetes. After this transformation, the dataset included 16 patients with diabetes, 25 with hypoglycemia, and 27 with prediabetes.
Because the aim was to fill in the missing mean fasting glucose records using the non-missing values, 68 existing values were synthetically reproduced using the k-nearest neighbor technique, increasing the number of non-missing values to 204. This was done to increase the capability of the AI ML model.
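The kNN-based synthetic reproduction described above resembles SMOTE-style interpolation: each synthetic record is placed at a random point on the segment between an existing record and one of its nearest neighbors. A minimal pure-Python sketch (illustrative only, not the study's KNIME implementation) is:

```python
import math
import random

def synthesize(records, n_new, n_neighbors=5, seed=0):
    """Generate n_new synthetic records by interpolating each chosen record
    toward one of its k nearest neighbors (Euclidean distance)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        base = rng.choice(records)
        # k nearest neighbors of `base` among the other records
        neighbors = sorted(
            (r for r in records if r is not base),
            key=lambda r: math.dist(base, r),
        )[:n_neighbors]
        nb = rng.choice(neighbors)
        frac = rng.random()  # position along the base->neighbor segment
        out.append(tuple(b + frac * (n - b) for b, n in zip(base, nb)))
    return out
```

In the study, the 68 existing records were expanded to 204 in this spirit, i.e., 136 synthetic records were generated from the originals.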
An AI model was built, using these 204 records, to replace missing fasting glucose values with predicted values. The model was trained using a gradient boosting algorithm to impute the missing values for 6,686 patients. However, not all the predictions made by the AI model were added to the dataset. Only those predicted with a confidence of 95% or higher were added to the dataset. Subsequently, 4,331 records met this criterion. Together with the original non-missing dataset, the total number of records reached 4,399 (4,331+68).
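The 95%-confidence filter amounts to a simple threshold on the model's predicted-class probabilities. A sketch is shown below; the tuple structure is hypothetical, for illustration only:

```python
def confident_imputations(predictions, threshold=0.95):
    """Keep only imputed values whose predicted-class probability meets the
    threshold. `predictions` holds (record_id, predicted_value, confidence)
    tuples; only (record_id, predicted_value) pairs are returned."""
    return [(rid, val) for rid, val, conf in predictions if conf >= threshold]
```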
Nevertheless, the dataset included missing values for the target variable HbA1c. All records with missing HbA1c were removed from the dataset, and, finally, 180 clean, complete records were ready for analysis. Figure 1 shows the workflow diagram of the data cleaning and preprocessing steps, including (1) exclusion of records with missing glucose values, (2) outlier detection using the IQR method, (3) imputation of missing fasting glucose values using an ML model, and (4) removal of incomplete HbA1c records.
The final dataset used for training consisted of 180 fully populated records.
A Close Look at the Cleaned Data
In its final form, the dataset includes 80 female and 100 male patients: 78 patients with prediabetes, 72 with hypoglycemia, and 30 with diabetes. Table 2 summarizes descriptive statistics for all numeric variables in the dataset.
Technical Information
In this study, the KNIME platform (version 4.4.2; Konstanz Information Miner, Germany) (12) was used to create learning models, develop predictive models from data, and visualize data. The learning algorithm used throughout the study is gradient boosting trees (13). Boosting is an ensemble learning method that sequentially generates base models. Boosting algorithms aim to construct a strong learner by sequentially combining weak learners generated at each iteration according to predefined rules. Gradient boosting, a boosting algorithm, is used to reduce model bias (6).
The decision trees used in gradient boosting are themselves weak learners. A weak learner is an ML model that performs only slightly better than chance; in gradient-boosted trees, the weak learners are shallow decision trees. Each new tree added to the ensemble (i.e., the combination of all previous trees) minimizes the loss function associated with the ensemble (14). Decision trees split the data by iteratively asking questions; however, they are prone to overfitting, and this is true for gradient-boosted trees as well. To reduce this risk, the gradient boosting algorithm combines multiple shallow decision trees rather than relying on a single deep tree.
The steps of the boosting method are listed below:
1. All samples start with equal weights,
2. Base learner 1 is trained on the training examples,
3. When training is complete, the weights of misclassified samples are increased and the weights of correctly classified samples are decreased,
4. Base learner 2 is trained on the reweighted samples,
5. Steps 3 and 4 are repeated until M base learners are obtained,
6. The results of the M base learners are combined to produce the final prediction.
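The reweighting scheme listed above is the classical (AdaBoost-style) boosting formulation; gradient boosting instead fits each new tree to the gradient of the loss. The listed steps can nevertheless be sketched as a minimal AdaBoost on one-dimensional data (the stump learner, toy dataset, and function names below are illustrative, not from the study):

```python
import math

def stump_learner(xs, ys, weights):
    """Weak learner: best threshold stump on 1-D data (labels in {-1, +1})."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return err, (lambda x, t=thr, s=sign: s if x >= t else -s)

def adaboost(xs, ys, n_rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n                      # step 1: equal weights
    learners = []
    for _ in range(n_rounds):
        err, h = stump_learner(xs, ys, weights)  # steps 2/4: train base learner
        err = max(err, 1e-10)                    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote weight
        # step 3: raise weights of misclassified samples, lower the rest
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
        learners.append((alpha, h))

    def predict(x):
        # step 6: weighted vote of all base learners
        return 1 if sum(a * h(x) for a, h in learners) >= 0 else -1
    return predict
```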
The weight of each base learner in the boosting method differs (15). The hyperparameters of the gradient boosting algorithm were set to tree depth=8, number of models=200, and learning rate=0.05. As a result, the mean fasting glucose value was imputed for incomplete records using predictions with a confidence of 95% or higher. Figure 2 depicts the overall model for training, testing, and validation.
Statistical Analysis
Testing
As mentioned above, 30% of the dataset, which consists of 55 records, was used for testing. As shown in Figure 3, the model achieved an accuracy of 80% and Cohen’s kappa of 0.549 during testing. While accuracy alone can be misleading in imbalanced datasets, Cohen’s kappa provides a more balanced measure of classification performance by accounting for agreement occurring by chance. A kappa value of 0.549 indicates moderate agreement, suggesting that the model produces reliable predictions beyond random chance but still has room for improvement. Examination of Figure 3 shows that the machine predicts a high level of HbA1c with 100% accuracy. On the other hand, there were no data to test for very high HbA1c values. Finally, the accuracies for the low and medium levels are 79.07% and 77.78%, respectively.
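Cohen's kappa can be computed directly from a confusion matrix as (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch follows; the example matrix in the test is illustrative, not the study's actual confusion matrix:

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows = actual classes, columns = predicted classes)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # observed agreement: fraction of samples on the diagonal
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # chance agreement: product of marginal row and column frequencies
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / total ** 2
    return (observed - expected) / (1 - expected)
```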
Table 3 shows a detailed evaluation of the gradient boosting, k-nearest neighbors, and SVM algorithms. As the table shows, all F-measures are acceptable, and both sensitivity and specificity are adequate for the gradient boosting algorithm.
While making the above predictions, the algorithm also ranked the variables according to their importance. It should be kept in mind that this ranking is relative and is produced by the gradient boosting algorithm itself; it reflects how much information the algorithm obtained from the features shown in Table 4. As can be seen, the patient’s average glucose, age, height, and blood pressure are more important to the algorithm than the other features.
RESULTS
Since the testing data were drawn from the training data, were preprocessed, and had some missing values imputed using predictions from other variables, the results could have been biased. Due to this possibility, data collection continued during the study, and these data were used to validate the learning. The test dataset used to evaluate the model’s performance consisted of only 55 records, and the validation set included 73 medium and 75 low HbA1c values. No records representing the “high” or “very high” categories were present in the validation data. This limited sample size, particularly the absence of two of the four target classes during validation, reduces generalizability of the findings and undermines multi-class classification performance.
When Table 5 and Figure 4 are examined, it is seen that the area under the curve is 0.81, which is close to 1, and the sensitivity and specificity statistics are above 76%, suggesting that the machine’s predictions are accurate enough to be used. On the other hand, the area under the confidence curve of the machine is approximately 50%. This can be interpreted as follows: although the machine predicts with an accuracy of 81.76%, its confidence is low (Table 6, Figure 5). In this case, when the model is applied to real data, any prediction with a confidence exceeding 55% may be sufficiently reliable.
DISCUSSION
HbA1c value is a criterion used in the diagnosis and treatment of diabetes. It is an important biomarker because, as the HbA1c level increases or decreases, the risk of diabetes-related complications changes accordingly. For this reason, it plays a critical role in assessing both the possibility of disease progression and the effectiveness of treatment. This value reflects the average of fasting and postprandial blood glucose levels over the preceding 2-3 months and is a robust indicator of long-term glycemic control. It is not only useful for monitoring diabetes but also predictive of cardiovascular complications.
While traditional glucose tests, such as fasting blood glucose, provide only a snapshot of a patient’s current state, they do not reflect long-term trends. Continuous glucose monitoring addresses this gap partially, but remains costly and less accessible. In contrast, HbA1c provides a time-averaged measure; predicting it from easily collected variables may be useful in contexts where laboratory testing is unavailable or unreliable.
Study Limitations
One potential limitation of the study is the significant reduction in usable data due to missing values. While the initial dataset was large, our strict criteria for data quality and the need for complete records led to a smaller final sample size. This decision was intended to avoid bias and to ensure model integrity. However, it does mean that the findings should be interpreted as a proof-of-concept, and further research with larger, more complete datasets is warranted to validate the model’s generalizability.
In our study, the gradient boosting model achieved a test accuracy of 80% and a Cohen’s kappa of 0.549. This kappa score indicates moderate agreement between the predicted and actual HbA1c categories, suggesting the model performs better than random classification. In the validation phase, the model demonstrated 81.76% accuracy, with sensitivity and specificity values exceeding 76% for low and medium HbA1c categories. These metrics support the model’s robustness and generalizability within the observed data range. However, due to the absence of high and very high HbA1c values in the validation dataset, further validation with a balanced sample is necessary to confirm performance across all classes.
Another critical consideration in clinical ML is model interpretability. Gradient boosting models are often perceived as black boxes, which can undermine trust in their outputs. To address this, we conducted a variable importance analysis. Features such as average glucose, age, height, and blood pressure were identified as the most influential predictors of HbA1c. This aligns with clinical knowledge and increases the model’s transparency. Future work should incorporate advanced interpretability techniques, such as SHapley Additive exPlanations or local interpretable model-agnostic explanations, to further explain individual predictions and enhance physicians’ confidence in the system.
CONCLUSION
Although predicting HbA1c from different parameters is advantageous, such predictions can be misleading because accurate results cannot be obtained for patients with anemia, polycythemia, or hemoglobinopathies. Our model, with a specificity of 86.72%, a sensitivity of 76.71%, and an accuracy of 81.76%, can help address this problem. The positive and negative predictive values are also above acceptable levels. Based on these results, the model is a useful estimation tool in situations where HbA1c cannot be measured, when a result is needed without the three-month waiting period, or when hemoglobin-related diseases render HbA1c measurements unreliable.


