For the complete code and additional details of this project, please visit the GitHub repository.

Introduction

The National Health and Nutrition Examination Survey (NHANES), conducted by the National Centre for Health Statistics (NCHS) under the Centres for Disease Control and Prevention (CDC), is a comprehensive, ongoing programme designed to assess the health and nutritional status of adults and children across the United States. Combining data from interviews, physical examinations, and laboratory tests, NHANES provides a nationally representative dataset that offers critical insights into population health. This report analyses the NHANES dataset within the R environment to explore the relationship between Body Mass Index (BMI)—a widely used measure of body composition—and various health and lifestyle indicators. Specifically, this analysis seeks to address three key questions:

  1. How does BMI vary across demographic groups such as gender, education, and household income?

  2. What are the associations between BMI and health indicators, including blood pressure, cholesterol, diabetes, and depression?

  3. How do lifestyle factors such as sleep, physical activity, alcohol use, and smoking relate to BMI and broader health outcomes?

The goal is to uncover patterns and associations that can inform public health strategies and deepen our understanding of BMI in the health of the U.S. adult population.

All analyses and visualisations were completed using R, with the code provided below for transparency and reproducibility.

Data Preparation

The NHANES dataset was obtained from the NHANES library within R and loaded with the below code:

data(NHANES)

Please note that custom themes were created which is utilised later in the code used to create the visualisations.

custom_theme <- theme_minimal(base_size = 10) +
        theme(plot.title = element_text(margin = margin(b = 10)),
        axis.title.x = element_text(margin = margin(t = 10)),
        axis.title.y = element_text(margin = margin(r = 10)))

custom_histogram <- geom_histogram(binwidth = 1,
                              fill = "lightblue",
                              color = "black",
                              size = 0.25)

custom_boxplot <- geom_boxplot(fill = "lightblue",
                               color = "black",
                               width = 0.5,
                               size = 0.25,
                               outlier.size = 0.25)

custom_point <- geom_point(color = "blue", 
                           size = 1)

custom_smooth <- geom_smooth(method = "lm", 
                             color = "red", 
                             se = TRUE, 
                             size = 1)

Only a select number of variables were extracted from the NHANES dataset. The selected variables were chosen because they covered key demographic factors (such as gender, age, and education) and essential health indicators (including BMI, blood pressure, cholesterol, diabetes status, and lifestyle behaviours such as physical activity, smoking, and alcohol consumption). This approach ensured the dataset remained focused and manageable, as the original NHANES dataset contains a large number of variables.

nhanes_clean <- NHANES |>
  select (ID, Gender, Age, Education,
          HHIncome, Height,
          BMI, BPSysAve, BPDiaAve,
          TotChol, Diabetes, Depressed,
          SleepHrsNight, PhysActiveDays,
          Alcohol12PlusYr, Smoke100)

In addition, it was decided to only focus on adults and therefore exclude those participants who were aged under 18 from our modified dataset.

nhanes_clean <- nhanes_clean |> 
  filter(Age >= 18)

It was also imperative that no missing data was included in the dataset and so rows where there was at least one field with an N/A result were removed from the modified dataset.

nhanes_clean <- nhanes_clean |>
  filter(!is.na(ID) & !is.na(Gender) & !is.na(Age) & !is.na(Education) &
           !is.na(HHIncome) & !is.na(BMI) & !is.na(BPSysAve) & 
           !is.na(TotChol) & !is.na(Diabetes) & !is.na(Depressed) & 
           !is.na(SleepHrsNight) & !is.na(PhysActiveDays) & 
           !is.na(Alcohol12PlusYr) & !is.na(Smoke100))

Two additional columns were also created in the modified dataset to categorise BMI into their standard categories and Household Income into larger brackets.

# Create new column that categorises BMI
# and places column after BMI
nhanes_clean <- nhanes_clean |> 
  mutate(BMI_Category = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Normal",
    BMI < 30 ~ "Overweight",
    BMI >= 30 ~ "Obese",
    TRUE ~ NA_character_
  )) |>
  relocate(BMI_Category, .after = BMI)

# Convert new BMI column to a factor type and order levels
# to correct order
nhanes_clean$BMI_Category <- factor(nhanes_clean$BMI_Category, 
                                    levels = c("Underweight", "Normal", "Overweight", "Obese"))

# Create new column that categorises Household income
# into broader brackets
nhanes_clean <- nhanes_clean |> 
  mutate(HHIncomeV2 = case_when(
    HHIncome %in% c(" 0-4999", " 5000-9999", "10000-14999", "15000-19999", "20000-24999") ~ "0-24999",
    HHIncome %in% c("25000-34999", "35000-44999", "45000-54999", "55000-64999") ~ "25000-64999",
    HHIncome %in% c("65000-74999", "75000-99999") ~ "65000-99999",
    HHIncome == "more 99999" ~ "100000+",
    TRUE ~ NA_character_
  )) |>
  relocate(HHIncomeV2, .after = HHIncome)

# Convert new HHIncome column to a factor type and order levels
# to correct order
nhanes_clean$HHIncomeV2 <- factor(nhanes_clean$HHIncomeV2, 
                                  levels = c("0-24999", "25000-64999", "65000-99999", "100000+"))

Exploratory Analysis

After these actions were completed, the modified dataset contained 2,809 rows of data compared to the original dataset, which contained 10,000 rows. It should be highlighted that this substantial reduction means that only a small portion of the original data remains, which could impact the reliability and validity of the dataset. The smaller sample size may introduce bias or limit the generalisability of any findings, as certain trends or relationships present in the full dataset may no longer be accurately represented.

# Count rows in each dataset, therefore showing how many
# of the original rows were removed from the above filters
nrow(NHANES)
## [1] 10000
nrow(nhanes_clean)
## [1] 2809

A quick analysis of the dataset showed that there were 1,349 females included versus 1,460 males; the minimum age recorded was 20 years while the maximum age was 80 years (median age was 45 years); the median BMI was 27.29 (categorised as “Overweight”) with a significant outlier maximum of 63.30; and a median total cholesterol of 5.020 mg/dL (considered to be within the recommended range) with a minimum of 1.530 mg/dL and a maximum of 12.280 mg/dL.

See below the code as well as the results of these statistics.

summary(nhanes_clean)
##        ID           Gender          Age                 Education   
##  Min.   :51647   female:1349   Min.   :20.00   8th Grade     : 100  
##  1st Qu.:56675   male  :1460   1st Qu.:32.00   9 - 11th Grade: 258  
##  Median :61630                 Median :45.00   High School   : 511  
##  Mean   :61662                 Mean   :45.99   Some College  : 911  
##  3rd Qu.:66614                 3rd Qu.:58.00   College Grad  :1029  
##  Max.   :71915                 Max.   :80.00                        
##                                                                     
##         HHIncome         HHIncomeV2      Height           BMI       
##  more 99999 :806   0-24999    :509   Min.   :140.0   Min.   :15.02  
##  75000-99999:371   25000-64999:937   1st Qu.:163.1   1st Qu.:23.90  
##  35000-44999:258   65000-99999:557   Median :170.2   Median :27.29  
##  45000-54999:240   100000+    :806   Mean   :170.0   Mean   :28.23  
##  25000-34999:230                     3rd Qu.:176.6   3rd Qu.:31.60  
##  55000-64999:209                     Max.   :200.4   Max.   :63.30  
##  (Other)    :695                                                    
##       BMI_Category    BPSysAve        BPDiaAve         TotChol       Diabetes  
##  Underweight: 45   Min.   : 81.0   Min.   :  0.00   Min.   : 1.530   No :2568  
##  Normal     :890   1st Qu.:109.0   1st Qu.: 64.00   1st Qu.: 4.290   Yes: 241  
##  Overweight :940   Median :118.0   Median : 71.00   Median : 5.020             
##  Obese      :934   Mean   :120.3   Mean   : 70.34   Mean   : 5.074             
##                    3rd Qu.:128.0   3rd Qu.: 78.00   3rd Qu.: 5.720             
##                    Max.   :217.0   Max.   :116.00   Max.   :12.280             
##                                                                                
##    Depressed    SleepHrsNight    PhysActiveDays  Alcohol12PlusYr Smoke100  
##  None   :2274   Min.   : 2.000   Min.   :1.000   No : 485        No :1600  
##  Several: 391   1st Qu.: 6.000   1st Qu.:2.000   Yes:2324        Yes:1209  
##  Most   : 144   Median : 7.000   Median :3.000                             
##                 Mean   : 6.924   Mean   :3.718                             
##                 3rd Qu.: 8.000   3rd Qu.:5.000                             
##                 Max.   :12.000   Max.   :7.000                             
## 

The distribution of BMI and total cholesterol amongst the dataset can also be illustrated in the following charts.

ggplot(nhanes_clean, aes(x = BMI)) +
  custom_histogram +
  labs(title = "BMI Distribution",
       x = "BMI",
       y = "Count") +
  custom_theme
 Figure 1. BMI distribution

Figure 1. BMI distribution

ggplot(nhanes_clean, aes(x = TotChol)) +
  custom_histogram +
  labs(title = "Total Cholesterol Distribution",
       x = "Total Cholesterol (mg/dL)",
       y = "Count") +
  custom_theme
 Figure 2. Total cholesterol distribution

Figure 2. Total cholesterol distribution

BMI Across Demographic Groups

BMI by Gender

Analysis of BMI distribution by gender revealed that males tend to have a slightly higher median BMI than females, though females exhibit a wider range of BMI values, including more extreme outliers. This suggests potential differences in body composition patterns between genders.

ggplot(nhanes_clean, aes(x = BMI, y = Gender)) +
  custom_boxplot +
  labs(title = "BMI - Comparison Between Genders",
       x = "BMI",
       y = "Gender") +
  custom_theme
 Figure 3. BMI among genders - boxplot

Figure 3. BMI among genders - boxplot

BMI Across Education Levels and Income Brackets

The distribution of BMI sat lower amongst College graduates compared to the other levels of education (highest attained), while a similar pattern was found within the highest income bracket ($100,000+) compared to the lower income brackets. Nonetheless, the median BMI for both College graduates as well as for the highest income bracket both sat above 25.

ggplot(nhanes_clean, aes(x = BMI, y = Education)) +
  custom_boxplot +
  labs(title = "BMI - Comparison Among Different Education Levels",
       x = "BMI",
       y = "Education Level") +
  custom_theme
 Figure 4. BMI among education levels - boxplot

Figure 4. BMI among education levels - boxplot

ggplot(nhanes_clean, aes(x = BMI, y = HHIncomeV2)) +
  custom_boxplot +
  labs(title = "BMI - Comparison Between Household Income Brackets",
       x = "BMI",
       y = "Household Income Bracket ($)") +
  custom_theme
 Figure 5. BMI among income brackets - boxplot

Figure 5. BMI among income brackets - boxplot

Relationships with Health Indicators

BMI and Diabetes/Depression

Looking at the distribution of BMI for various medical conditions, it was found that those participants that had diabetes did have a BMI that sat much higher than those without diabetes. While the difference between those that have depression often and those who never had depression was not as pronounced, it was found that those that had depression on most days still had a somewhat higher BMI spread than those that did not.

ggplot(nhanes_clean, aes(x = BMI, y = Diabetes)) +
    custom_boxplot +
    labs(title = "BMI - Comparison of Diabetes Status",
         x = "BMI",
         y = "Diabetes Status") +
    custom_theme
 Figure 6. BMI among diabetes status - boxplot

Figure 6. BMI among diabetes status - boxplot

ggplot(nhanes_clean, aes(x = BMI, y = Depressed)) +
    custom_boxplot +
    labs(title = "BMI - Comparison of Frequency of Depression",
         x = "BMI",
         y = "Depression Status") +
    custom_theme
 Figure 7. BMI among depression status - boxplot

Figure 7. BMI among depression status - boxplot

BMI and Alcohol/Smoking

Finally, looking at the distribution of BMI amongst those who drink or smoke, it was found that those participants who had drank more than 12 standard drinks in any one year tended to have a lower BMI median and spread compared to those who never did, while even those who smoked at least 100 cigarettes in their lifetime tended to have a slightly lower median BMI, the spread was very similar to those who didn’t smoke.

ggplot(nhanes_clean, aes(x = BMI, y = Alcohol12PlusYr)) +
    custom_boxplot +
    labs(title = "BMI - Comparison of Alcohol Status",
         x = "BMI",
         y = "12 Drink Consumption in Any One Year") +
    custom_theme
 Figure 8. BMI among drinkers and non-drinkers - boxplot

Figure 8. BMI among drinkers and non-drinkers - boxplot

ggplot(nhanes_clean, aes(x = BMI, y = Smoke100)) +
    custom_boxplot +
    labs(title = "BMI - Comparison of Smoking Status",
         x = "BMI",
         y = "Smoked >=100 Cigarettes in Lifetime") +
    custom_theme
 Figure 9. BMI among smokers and non-smokers - boxplot

Figure 9. BMI among smokers and non-smokers - boxplot

While the distribution of BMI informed us of some information, the analysis also needed to see the direct relationship with BMI compared to other fields within the dataset and see whether these relationships were both strong and significant.

BMI and Cholesterol

Firstly, the relationship between BMI and total cholesterol was analysed. As shown in the chart below, their was no significant direct relationship between BMI and total cholesterol, even when the general pattern was split between males and females.

ggplot(nhanes_clean, aes(x = BMI, y = TotChol)) +
    custom_point +
    geom_smooth(aes(color = Gender), method = "lm", se = FALSE, size = 1) +
    labs(title = "BMI vs Total Cholesterol",
         x = "BMI",
         y = "Total Cholesterol (mg/dL)") +
    custom_theme
 Figure 10. BMI vs Total Cholesterol

Figure 10. BMI vs Total Cholesterol

When performing a correlation test on this data and the relationship between BMI and total cholesterol (including within the male and female groups), it was found that there were very small correlations between these two variables but more note worthy was that the correlations were not statistically significant (i.e. p-value > 0.05). Hence, it can be determined within our dataset that there is no relationship between BMI and total cholesterol.

# Correlation and p-value for BMI vs Total Cholesterol
cor.test(nhanes_clean$BMI, nhanes_clean$TotChol, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$BMI and nhanes_clean$TotChol
## t = 0.13518, df = 2807, p-value = 0.8925
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03443504  0.03953116
## sample estimates:
##         cor 
## 0.002551546
# Correlation and p-value for BMI vs Total Cholesterol (Males)
cor.test(subset(nhanes_clean, Gender == "male")$BMI, subset(nhanes_clean, Gender == "male")$TotChol, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  subset(nhanes_clean, Gender == "male")$BMI and subset(nhanes_clean, Gender == "male")$TotChol
## t = 0.54353, df = 1458, p-value = 0.5868
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03709630  0.06548758
## sample estimates:
##        cor 
## 0.01423309
# Correlation and p-value for BMI vs Total Cholesterol (Females)
cor.test(subset(nhanes_clean, Gender == "female")$BMI, subset(nhanes_clean, Gender == "female")$TotChol, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  subset(nhanes_clean, Gender == "female")$BMI and subset(nhanes_clean, Gender == "female")$TotChol
## t = -0.043003, df = 1347, p-value = 0.9657
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05454019  0.05220347
## sample estimates:
##        cor 
## -0.0011717

BMI and Blood Pressure

The relationship between BMI and both systolic and diastolic blood pressure was analysed. As shown on the charts below and based on the Pearson’s product-moment correlation calculations, it was found that there is a weak but nonetheless positive relationship between BMI and blood pressure, where blood pressure rises as BMI increases. It was also found that these correlations are statistically significant.

ggplot(nhanes_clean, aes(x = BMI, y = BPSysAve)) +
    custom_point +
    custom_smooth +
    labs(title = "BMI vs Systolic Blood Pressure",
         x = "BMI",
         y = "Systolic BP") +
    custom_theme
 Figure 11. BMI vs systolic blood ressure

Figure 11. BMI vs systolic blood ressure

cor.test(nhanes_clean$BMI, nhanes_clean$BPSysAve, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$BMI and nhanes_clean$BPSysAve
## t = 8.1378, df = 2807, p-value = 5.984e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1154819 0.1877460
## sample estimates:
##       cor 
## 0.1518168
ggplot(nhanes_clean, aes(x = BMI, y = BPDiaAve)) +
    custom_point +
    custom_smooth +
    labs(title = "BMI vs Diastolic Blood Pressure",
         x = "BMI",
         y = "Diastolic BP") +
    custom_theme
 Figure 12. BMI vs diastolic blood pressure

Figure 12. BMI vs diastolic blood pressure

cor.test(nhanes_clean$BMI, nhanes_clean$BPDiaAve, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$BMI and nhanes_clean$BPDiaAve
## t = 7.7661, df = 2807, p-value = 1.126e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1086317 0.1810446
## sample estimates:
##       cor 
## 0.1450324

BMI and Sleep Hours/Physical Activity

Looking at the relationship between BMI and hours slept per night, it was found that there is a weak but statistically significant negative correlation between number of hours slept versus BMI where BMI raises as hours slept reduces. In addition, there was also found to have been a similar statistically significant pattern between days of physical activity versus BMI where BMI increases as days of physical activity reduces.

ggplot(nhanes_clean, aes(x = BMI, y = SleepHrsNight)) +
    custom_smooth +
    labs(title = "BMI vs Hours Slept Per Night",
         x = "BMI",
         y = "Sleep Hours") +
    scale_y_continuous(breaks = seq(2, 12, by = 1), limits = c(2, 12)) +
    custom_theme
 Figure 13. BMI vs hours slept per night

Figure 13. BMI vs hours slept per night

cor.test(nhanes_clean$BMI, nhanes_clean$SleepHrsNight, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$BMI and nhanes_clean$SleepHrsNight
## t = -3.4935, df = 2807, p-value = 0.000484
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10253026 -0.02888336
## sample estimates:
##         cor 
## -0.06579642
ggplot(nhanes_clean, aes(x = BMI, y = PhysActiveDays)) +
    custom_smooth +
    labs(title = "BMI vs Days Physically Active",
         x = "BMI",
         y = "No of Physical Active Days") +
    scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 7)) +
    custom_theme
 Figure 14. BMI vs days of physical activity

Figure 14. BMI vs days of physical activity

cor.test(nhanes_clean$BMI, nhanes_clean$PhysActiveDays, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$BMI and nhanes_clean$PhysActiveDays
## t = -2.374, df = 2807, p-value = 0.01766
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.081611225 -0.007792551
## sample estimates:
##         cor 
## -0.04476299

Physical Activity and its Impact on Other Health Factors

While the main aim of this analysis was to look at the relationship of BMI to other health indicators and lifestyle choices, it was decided to further explore the relationship of days of physical activity to other indicators of health to see whether physical activity does in fact have a significant postivie influence on the health of the participants in this dataset.

It was found that the participants who had diabetes actually had a higher median number of days of physical activity as well as more distribution in the higher number of days of physical activity compared to those who do not have diabetes. On the other hand, there was no difference found in the distribution of physical activity days between those with varying levels of depression. Analysing the relationship between days of physical activity and total cholesterol, there was found to be a weak but statistically significant positive relationship where days of physical activity correlates slightly with a higher level of total cholesterol.

ggplot(nhanes_clean, aes(x = PhysActiveDays, y = Diabetes)) +
    custom_boxplot +
    labs(title = "Physical Activity vs Diabetes",
         x = "Days Physically Active",
         y = "Diabetes Status") +
    scale_x_continuous(breaks = seq(1, 7, by = 1)) +
    custom_theme
 Figure 15. Distribution of physical activity days between diabetes Status

Figure 15. Distribution of physical activity days between diabetes Status

ggplot(nhanes_clean, aes(x = PhysActiveDays, y = Depressed)) +
    custom_boxplot +
    labs(title = "Physical Activity vs Depression",
         x = "Days Physically Active",
         y = "Depression Frequency") +
    scale_x_continuous(breaks = seq(1, 7, by = 1)) +
    custom_theme
 Figure 16. Distribution of physical activity days between depression status

Figure 16. Distribution of physical activity days between depression status

ggplot(nhanes_clean, aes(x = PhysActiveDays, y = TotChol)) +
    custom_point +
    custom_smooth +
    labs(title = "Physical Activity vs Total Cholesterol",
         x = "Days Physically Active",
         y = "Total Cholesterol (mg/dL)") +
    scale_x_continuous(breaks = seq(1, 7, by = 1)) +
    custom_theme
 Figure 17. Relationship between days of physical activity and total cholesterol

Figure 17. Relationship between days of physical activity and total cholesterol

cor.test(nhanes_clean$PhysActiveDays, nhanes_clean$TotChol, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$PhysActiveDays and nhanes_clean$TotChol
## t = 2.0614, df = 2807, p-value = 0.03935
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.001899012 0.075754035
## sample estimates:
##        cor 
## 0.03887962

Sleep and Its Impact on Other Health Factors

A similar analysis was completed on the effect of number of hours slept per night on similar health indicators.

It was found that the distribution of sleep hours between those with diabetes and those without did not differ. However, while the distribution of sleep hours had a similar pattern between those who suffer from no days of depression and those who suffer from several days of depression, there was a distribution of lower number of sleep hours with those who suffered from depression on most days.

In addition, there was found to be no statistically significant correlation between sleep hours and total cholesterol.

ggplot(nhanes_clean, aes(x = SleepHrsNight, y = Diabetes)) +
    custom_boxplot +
    labs(title = "Sleep vs Diabetes",
         x = "Hours Slept Per Night",
         y = "Diabetes Status") +
    scale_x_continuous(breaks = seq(2, 12, by = 1)) +
    custom_theme
 Figure 18. Distribution of sleep hours between diabetes status

Figure 18. Distribution of sleep hours between diabetes status

ggplot(nhanes_clean, aes(x = SleepHrsNight, y = Depressed)) +
    custom_boxplot +
    labs(title = "Sleep vs Depression",
         x = "Hours Slept Per Night",
         y = "Depression Frequency") +
    scale_x_continuous(breaks = seq(2, 12, by = 1)) +
    custom_theme
 Figure 19. Distribution of sleep hours between depression status

Figure 19. Distribution of sleep hours between depression status

ggplot(nhanes_clean, aes(x = SleepHrsNight, y = TotChol)) +
    custom_point +
    custom_smooth +
    labs(title = "Sleep vs Total Cholesterol",
         x = "Hours Slept Per Night",
         y = "Total Cholesterol (mg/dL)") +
    scale_x_continuous(breaks = seq(2, 12, by = 1)) +
    custom_theme
 Figure 20. Relationship between sleep hours and total Cholesterol

Figure 20. Relationship between sleep hours and total Cholesterol

cor.test(nhanes_clean$SleepHrsNight, nhanes_clean$TotChol, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nhanes_clean$SleepHrsNight and nhanes_clean$TotChol
## t = -0.78352, df = 2807, p-value = 0.4334
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05174208  0.02220845
## sample estimates:
##         cor 
## -0.01478704

Conclusion

This analysis of the NHANES dataset explored the relationships between Body Mass Index (BMI) and various demographic, health, and lifestyle factors. The findings highlight several key patterns:

In regards to the effect of physical activity and sleep on other health indicators, the following patterns were found:

While this study provides useful insights, the significant data reduction (from 10,000 to 2,809 participants) may limit the generalisability of the findings.