For the complete code and additional details of this project, please visit the GitHub repository.
The National Health and Nutrition Examination Survey (NHANES), conducted by the National Center for Health Statistics (NCHS) within the Centers for Disease Control and Prevention (CDC), is a comprehensive, ongoing programme designed to assess the health and nutritional status of adults and children across the United States. Combining data from interviews, physical examinations, and laboratory tests, NHANES provides a nationally representative dataset that offers critical insights into population health. This report analyses the NHANES dataset within the R environment to explore the relationship between Body Mass Index (BMI), a widely used measure of body composition, and various health and lifestyle indicators. Specifically, this analysis seeks to address three key questions:
How does BMI vary across demographic groups such as gender, education, and household income?
What are the associations between BMI and health indicators, including blood pressure, cholesterol, diabetes, and depression?
How do lifestyle factors such as sleep, physical activity, alcohol use, and smoking relate to BMI and broader health outcomes?
The goal is to uncover patterns and associations that can inform public health strategies and deepen our understanding of the role of BMI in the health of the U.S. adult population.
All analyses and visualisations were completed using R, with the code provided below for transparency and reproducibility.
The NHANES dataset was obtained from the NHANES package in R and loaded with the code below:
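library(NHANES)   # provides the NHANES dataset
library(dplyr)    # select(), filter(), mutate(), case_when(), relocate()
library(ggplot2)  # plotting functions used for the figures
data(NHANES)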
Note that a custom theme and several reusable geom layers were created first; they are used later in the code that produces the visualisations.
custom_theme <- theme_minimal(base_size = 10) +
theme(plot.title = element_text(margin = margin(b = 10)),
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 10)))
custom_histogram <- geom_histogram(binwidth = 1,
fill = "lightblue",
color = "black",
size = 0.25)
custom_boxplot <- geom_boxplot(fill = "lightblue",
color = "black",
width = 0.5,
size = 0.25,
outlier.size = 0.25)
custom_point <- geom_point(color = "blue",
size = 1)
custom_smooth <- geom_smooth(method = "lm",
color = "red",
se = TRUE,
size = 1)
Only a select number of variables were extracted from the NHANES dataset. The selected variables were chosen because they covered key demographic factors (such as gender, age, and education) and essential health indicators (including BMI, blood pressure, cholesterol, diabetes status, and lifestyle behaviours such as physical activity, smoking, and alcohol consumption). This approach ensured the dataset remained focused and manageable, as the original NHANES dataset contains a large number of variables.
nhanes_clean <- NHANES |>
select(ID, Gender, Age, Education,
HHIncome, Height,
BMI, BPSysAve, BPDiaAve,
TotChol, Diabetes, Depressed,
SleepHrsNight, PhysActiveDays,
Alcohol12PlusYr, Smoke100)
In addition, the analysis focuses on adults only, so participants aged under 18 were excluded from the modified dataset.
nhanes_clean <- nhanes_clean |>
filter(Age >= 18)
The analysis also required complete records, so any row with a missing value (NA) in at least one of the selected variables was removed from the modified dataset.
nhanes_clean <- nhanes_clean |>
filter(!is.na(ID) & !is.na(Gender) & !is.na(Age) & !is.na(Education) &
!is.na(HHIncome) & !is.na(Height) & !is.na(BMI) & !is.na(BPSysAve) &
!is.na(BPDiaAve) & !is.na(TotChol) & !is.na(Diabetes) & !is.na(Depressed) &
!is.na(SleepHrsNight) & !is.na(PhysActiveDays) &
!is.na(Alcohol12PlusYr) & !is.na(Smoke100))
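The same complete-case filter can be written more compactly. The sketch below uses base R's `complete.cases()` on a small made-up data frame (`toy`, `required` and all values are hypothetical stand-ins, not NHANES data); it keeps only the rows that have no `NA` in the listed columns.

```r
# Toy data frame standing in for nhanes_clean (values are made up)
toy <- data.frame(
  ID      = 1:5,
  BMI     = c(22.1, NA, 31.4, 27.0, 18.2),
  TotChol = c(4.9, 5.3, NA, 6.1, 4.2)
)

# Columns that must be complete; in the report this would be
# every selected variable
required <- c("ID", "BMI", "TotChol")

# complete.cases() returns TRUE for rows with no NA in the given columns
toy_complete <- toy[complete.cases(toy[, required]), ]

nrow(toy_complete)  # 3 rows remain (IDs 1, 4 and 5)
```

In a dplyr pipeline, `tidyr::drop_na()` or `filter(if_all(everything(), ~ !is.na(.x)))` should express the same idea without listing each column by hand.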
Two additional columns were also created in the modified dataset: one that groups BMI into its standard categories and one that groups household income into broader brackets.
# Create new column that categorises BMI
# and places column after BMI
nhanes_clean <- nhanes_clean |>
mutate(BMI_Category = case_when(
BMI < 18.5 ~ "Underweight",
BMI < 25 ~ "Normal",
BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese",
TRUE ~ NA_character_
)) |>
relocate(BMI_Category, .after = BMI)
# Convert new BMI column to a factor type and order levels
# to correct order
nhanes_clean$BMI_Category <- factor(nhanes_clean$BMI_Category,
levels = c("Underweight", "Normal", "Overweight", "Obese"))
# Create new column that categorises Household income
# into broader brackets
nhanes_clean <- nhanes_clean |>
mutate(HHIncomeV2 = case_when(
HHIncome %in% c(" 0-4999", " 5000-9999", "10000-14999", "15000-19999", "20000-24999") ~ "0-24999",
HHIncome %in% c("25000-34999", "35000-44999", "45000-54999", "55000-64999") ~ "25000-64999",
HHIncome %in% c("65000-74999", "75000-99999") ~ "65000-99999",
HHIncome == "more 99999" ~ "100000+",
TRUE ~ NA_character_
)) |>
relocate(HHIncomeV2, .after = HHIncome)
# Convert new HHIncome column to a factor type and order levels
# to correct order
nhanes_clean$HHIncomeV2 <- factor(nhanes_clean$HHIncomeV2,
levels = c("0-24999", "25000-64999", "65000-99999", "100000+"))
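As an alternative to `case_when()` followed by `factor()`, the BMI thresholds can be expressed in a single call to base R's `cut()`, which returns a factor with the levels already in order. This is only a sketch: `bmi_category()` and `example_bmi` are hypothetical names, and the values are made up to sit on and around the category boundaries.

```r
# Hypothetical helper: bin BMI values into the four standard categories.
# right = FALSE makes each interval closed on the left, e.g. [18.5, 25),
# which matches the case_when() thresholds used in the report.
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"),
      right = FALSE)
}

# Made-up values chosen to sit on and around the boundaries
example_bmi <- c(17, 18.5, 24.9, 25, 30, 45, NA)
bmi_category(example_bmi)
```

Missing BMI values stay `NA`, as they do with the `TRUE ~ NA_character_` fallback.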
After these steps, the modified dataset contained 2,809 rows, compared with 10,000 rows in the original dataset. This is a substantial reduction: only about 28% of the original rows remain, which could affect the reliability and validity of the results. Because rows were dropped for being under 18 or for containing missing values, the remaining sample may be biased, and trends or relationships present in the full dataset may no longer be accurately represented, which limits the generalisability of any findings.
# Count rows in each dataset, therefore showing how many
# of the original rows were removed from the above filters
nrow(NHANES)
## [1] 10000
nrow(nhanes_clean)
## [1] 2809
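To see which variables account for most of the lost rows, one option is to count missing values per column before filtering. The sketch below shows the idea on a made-up data frame (`toy` and its values are hypothetical, not NHANES data); the same two calls could be pointed at the selected NHANES columns.

```r
# Toy data frame standing in for the selected columns (made-up values)
toy <- data.frame(
  BMI            = c(22.1, NA, 31.4, 27.0, 25.5),
  TotChol        = c(4.9, 5.3, 5.8, 6.1, 4.2),
  PhysActiveDays = c(NA, NA, 3, NA, 5)
)

# Number of missing values per column, largest first
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_counts

# Share of rows that are complete on every column
complete_share <- mean(complete.cases(toy))
complete_share
```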
A quick summary of the dataset showed that 1,349 females and 1,460 males were included. Ages ranged from 20 to 80 years (median 45 years); the minimum is 20 rather than 18 because the NHANES package only reports Education for participants aged 20 or older, so 18- and 19-year-olds were dropped by the missing-value filter. The median BMI was 27.29 (categorised as “Overweight”), with an extreme maximum of 63.30. The median total cholesterol was 5.02 mmol/L (the NHANES package records TotChol in mmol/L rather than mg/dL), which is close to the commonly cited desirable limit of about 5.2 mmol/L, with a minimum of 1.53 mmol/L and a maximum of 12.28 mmol/L.
The code and the full summary output are shown below.
summary(nhanes_clean)
## ID Gender Age Education
## Min. :51647 female:1349 Min. :20.00 8th Grade : 100
## 1st Qu.:56675 male :1460 1st Qu.:32.00 9 - 11th Grade: 258
## Median :61630 Median :45.00 High School : 511
## Mean :61662 Mean :45.99 Some College : 911
## 3rd Qu.:66614 3rd Qu.:58.00 College Grad :1029
## Max. :71915 Max. :80.00
##
## HHIncome HHIncomeV2 Height BMI
## more 99999 :806 0-24999 :509 Min. :140.0 Min. :15.02
## 75000-99999:371 25000-64999:937 1st Qu.:163.1 1st Qu.:23.90
## 35000-44999:258 65000-99999:557 Median :170.2 Median :27.29
## 45000-54999:240 100000+ :806 Mean :170.0 Mean :28.23
## 25000-34999:230 3rd Qu.:176.6 3rd Qu.:31.60
## 55000-64999:209 Max. :200.4 Max. :63.30
## (Other) :695
## BMI_Category BPSysAve BPDiaAve TotChol Diabetes
## Underweight: 45 Min. : 81.0 Min. : 0.00 Min. : 1.530 No :2568
## Normal :890 1st Qu.:109.0 1st Qu.: 64.00 1st Qu.: 4.290 Yes: 241
## Overweight :940 Median :118.0 Median : 71.00 Median : 5.020
## Obese :934 Mean :120.3 Mean : 70.34 Mean : 5.074
## 3rd Qu.:128.0 3rd Qu.: 78.00 3rd Qu.: 5.720
## Max. :217.0 Max. :116.00 Max. :12.280
##
## Depressed SleepHrsNight PhysActiveDays Alcohol12PlusYr Smoke100
## None :2274 Min. : 2.000 Min. :1.000 No : 485 No :1600
## Several: 391 1st Qu.: 6.000 1st Qu.:2.000 Yes:2324 Yes:1209
## Most : 144 Median : 7.000 Median :3.000
## Mean : 6.924 Mean :3.718
## 3rd Qu.: 8.000 3rd Qu.:5.000
## Max. :12.000 Max. :7.000
##
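The NHANES package documentation gives TotChol in mmol/L. For readers more used to mg/dL, the sketch below converts the minimum, median and maximum from the summary above, assuming the usual cholesterol factor of about 38.67 mg/dL per mmol/L. The helper name `chol_mmol_to_mgdl()` is hypothetical and not part of the original analysis.

```r
# Conversion factor for cholesterol: 1 mmol/L is about 38.67 mg/dL
# (based on a molar mass of roughly 386.65 g/mol)
CHOL_MGDL_PER_MMOLL <- 38.67

# Hypothetical helper, not part of the original analysis
chol_mmol_to_mgdl <- function(mmol_per_l) {
  mmol_per_l * CHOL_MGDL_PER_MMOLL
}

# Minimum, median and maximum TotChol from the summary above
chol_mgdl <- round(chol_mmol_to_mgdl(c(min = 1.53, median = 5.02, max = 12.28)))
chol_mgdl  # min 59, median 194, max 475
```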
The distribution of BMI and total cholesterol amongst the dataset can also be illustrated in the following charts.
ggplot(nhanes_clean, aes(x = BMI)) +
custom_histogram +
labs(title = "BMI Distribution",
x = "BMI",
y = "Count") +
custom_theme
Figure 1. BMI distribution
ggplot(nhanes_clean, aes(x = TotChol)) +
custom_histogram +
labs(title = "Total Cholesterol Distribution",
x = "Total Cholesterol (mmol/L)",
y = "Count") +
custom_theme
Figure 2. Total cholesterol distribution
Analysis of BMI distribution by gender revealed that males tend to have a slightly higher median BMI than females, though females exhibit a wider range of BMI values, including more extreme outliers. This suggests potential differences in body composition patterns between genders.
ggplot(nhanes_clean, aes(x = BMI, y = Gender)) +
custom_boxplot +
labs(title = "BMI - Comparison Between Genders",
x = "BMI",
y = "Gender") +
custom_theme
Figure 3. BMI among genders - boxplot
BMI was distributed lower among college graduates than among the other levels of education (highest attained), and a similar pattern was found for the highest income bracket ($100,000+) compared with the lower income brackets. Nonetheless, the median BMI for college graduates and for the highest income bracket still sat above 25, the threshold for the overweight category.
ggplot(nhanes_clean, aes(x = BMI, y = Education)) +
custom_boxplot +
labs(title = "BMI - Comparison Among Different Education Levels",
x = "BMI",
y = "Education Level") +
custom_theme
Figure 4. BMI among education levels - boxplot
ggplot(nhanes_clean, aes(x = BMI, y = HHIncomeV2)) +
custom_boxplot +
labs(title = "BMI - Comparison Between Household Income Brackets",
x = "BMI",
y = "Household Income Bracket ($)") +
custom_theme
Figure 5. BMI among income brackets - boxplot
Looking at the distribution of BMI across medical conditions, participants with diabetes had a noticeably higher BMI distribution than those without diabetes. The difference between the depression groups was less pronounced, but participants who reported depression on most days still had a somewhat higher BMI spread than those who reported none.
ggplot(nhanes_clean, aes(x = BMI, y = Diabetes)) +
custom_boxplot +
labs(title = "BMI - Comparison of Diabetes Status",
x = "BMI",
y = "Diabetes Status") +
custom_theme
Figure 6. BMI among diabetes status - boxplot
ggplot(nhanes_clean, aes(x = BMI, y = Depressed)) +
custom_boxplot +
labs(title = "BMI - Comparison of Frequency of Depression",
x = "BMI",
y = "Depression Status") +
custom_theme
Figure 7. BMI among depression status - boxplot
Finally, looking at the distribution of BMI by drinking and smoking status, participants who had consumed at least 12 alcoholic drinks in any one year tended to have a lower median BMI and spread than those who had not. Participants who had smoked at least 100 cigarettes in their lifetime also tended to have a slightly lower median BMI, although their spread was very similar to that of non-smokers.
ggplot(nhanes_clean, aes(x = BMI, y = Alcohol12PlusYr)) +
custom_boxplot +
labs(title = "BMI - Comparison of Alcohol Status",
x = "BMI",
y = "Consumed >=12 Drinks in Any One Year") +
custom_theme
Figure 8. BMI among drinkers and non-drinkers - boxplot
ggplot(nhanes_clean, aes(x = BMI, y = Smoke100)) +
custom_boxplot +
labs(title = "BMI - Comparison of Smoking Status",
x = "BMI",
y = "Smoked >=100 Cigarettes in Lifetime") +
custom_theme
Figure 9. BMI among smokers and non-smokers - boxplot
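The boxplot comparisons above are read off the charts by eye. A small numeric summary per group can back them up; the sketch below computes the median and interquartile range of BMI by group with base R's `tapply()`. The data frame `toy` and its values are made up for illustration, and the same pattern would apply to any of the grouping columns in the report.

```r
# Toy data standing in for nhanes_clean (values are made up)
toy <- data.frame(
  Gender = c("female", "female", "female", "male", "male", "male"),
  BMI    = c(21.0, 26.5, 33.0, 24.0, 27.5, 29.0)
)

# Median and interquartile range of BMI within each group
bmi_median <- tapply(toy$BMI, toy$Gender, median)
bmi_iqr    <- tapply(toy$BMI, toy$Gender, IQR)

data.frame(median = bmi_median, IQR = bmi_iqr)
```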
While the distributions above are informative, the analysis also needed to examine the direct relationship between BMI and other numeric fields in the dataset, and to assess whether these relationships were both strong and statistically significant.
Firstly, the relationship between BMI and total cholesterol was analysed. As shown in the chart below, there was no apparent linear relationship between BMI and total cholesterol, even when the trend was fitted separately for males and females.
ggplot(nhanes_clean, aes(x = BMI, y = TotChol)) +
custom_point +
geom_smooth(aes(color = Gender), method = "lm", se = FALSE, size = 1) +
labs(title = "BMI vs Total Cholesterol",
x = "BMI",
y = "Total Cholesterol (mmol/L)") +
custom_theme
Figure 10. BMI vs Total Cholesterol
Pearson correlation tests on BMI and total cholesterol (overall and within the male and female groups) found correlations very close to zero; more noteworthy, none of them was statistically significant (i.e. p-value > 0.05). Hence, this dataset provides no evidence of a linear relationship between BMI and total cholesterol.
# Correlation and p-value for BMI vs Total Cholesterol
cor.test(nhanes_clean$BMI, nhanes_clean$TotChol, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$BMI and nhanes_clean$TotChol
## t = 0.13518, df = 2807, p-value = 0.8925
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03443504 0.03953116
## sample estimates:
## cor
## 0.002551546
# Correlation and p-value for BMI vs Total Cholesterol (Males)
cor.test(subset(nhanes_clean, Gender == "male")$BMI, subset(nhanes_clean, Gender == "male")$TotChol, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: subset(nhanes_clean, Gender == "male")$BMI and subset(nhanes_clean, Gender == "male")$TotChol
## t = 0.54353, df = 1458, p-value = 0.5868
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03709630 0.06548758
## sample estimates:
## cor
## 0.01423309
# Correlation and p-value for BMI vs Total Cholesterol (Females)
cor.test(subset(nhanes_clean, Gender == "female")$BMI, subset(nhanes_clean, Gender == "female")$TotChol, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: subset(nhanes_clean, Gender == "female")$BMI and subset(nhanes_clean, Gender == "female")$TotChol
## t = -0.043003, df = 1347, p-value = 0.9657
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05454019 0.05220347
## sample estimates:
## cor
## -0.0011717
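The three tests above repeat the same call on different subsets. One way to avoid the repetition is a small helper that splits the data by group and collects the estimate and p-value for each. This is a sketch only: `cor_by_group()` is a hypothetical helper, and the data frame `toy` is made up (it borrows the report's column names but none of its values).

```r
# Hypothetical helper: Pearson correlation of two columns within each level
# of a grouping column, returned as one row per group
cor_by_group <- function(df, x, y, group) {
  pieces <- split(df, df[[group]])
  rows <- lapply(names(pieces), function(g) {
    test <- cor.test(pieces[[g]][[x]], pieces[[g]][[y]], method = "pearson")
    data.frame(group   = g,
               n       = nrow(pieces[[g]]),
               r       = unname(test$estimate),
               p_value = test$p.value)
  })
  do.call(rbind, rows)
}

# Made-up data: a positive trend in group "a", a negative one in group "b"
toy <- data.frame(
  Gender  = rep(c("a", "b"), each = 6),
  BMI     = rep(1:6, times = 2),
  TotChol = c(1, 3, 2, 5, 4, 6,   6, 4, 5, 2, 3, 1)
)

cor_by_group(toy, "BMI", "TotChol", "Gender")
```

Called as `cor_by_group(nhanes_clean, "BMI", "TotChol", "Gender")`, it should return one row per gender with the same estimates as the separate `cor.test()` calls.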
The relationship between BMI and both systolic and diastolic blood pressure was then analysed. As shown in the charts below and by the Pearson product-moment correlations, there is a weak but positive relationship between BMI and blood pressure, where blood pressure tends to rise as BMI increases. Both correlations are statistically significant.
ggplot(nhanes_clean, aes(x = BMI, y = BPSysAve)) +
custom_point +
custom_smooth +
labs(title = "BMI vs Systolic Blood Pressure",
x = "BMI",
y = "Systolic BP") +
custom_theme
Figure 11. BMI vs systolic blood pressure
cor.test(nhanes_clean$BMI, nhanes_clean$BPSysAve, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$BMI and nhanes_clean$BPSysAve
## t = 8.1378, df = 2807, p-value = 5.984e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1154819 0.1877460
## sample estimates:
## cor
## 0.1518168
ggplot(nhanes_clean, aes(x = BMI, y = BPDiaAve)) +
custom_point +
custom_smooth +
labs(title = "BMI vs Diastolic Blood Pressure",
x = "BMI",
y = "Diastolic BP") +
custom_theme
Figure 12. BMI vs diastolic blood pressure
cor.test(nhanes_clean$BMI, nhanes_clean$BPDiaAve, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$BMI and nhanes_clean$BPDiaAve
## t = 7.7661, df = 2807, p-value = 1.126e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1086317 0.1810446
## sample estimates:
## cor
## 0.1450324
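A correlation can be highly significant and still weak. Squaring r gives the share of variance in one variable that a straight-line fit on the other accounts for, and the t statistic printed by `cor.test()` can be recovered from r and the degrees of freedom. The sketch below works through both using the values reported above; `t_from_r()` is a hypothetical helper name.

```r
# Reported Pearson correlations and degrees of freedom from the output above
r_sys    <- 0.1518168   # BMI vs systolic BP
r_dia    <- 0.1450324   # BMI vs diastolic BP
deg_free <- 2807

# Share of variance accounted for by a simple linear fit (r squared):
# roughly 2.3% for systolic and 2.1% for diastolic blood pressure
r_squared <- round(c(systolic = r_sys^2, diastolic = r_dia^2), 3)
r_squared

# The t statistic follows from r and the degrees of freedom:
# t = r * sqrt(df) / sqrt(1 - r^2)
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t_from_r(r_sys, deg_free)  # close to the reported t = 8.1378
t_from_r(r_dia, deg_free)  # close to the reported t = 7.7661
```

With over 2,800 observations, even a correlation of about 0.15 yields a very small p-value, which is why these relationships are described as weak despite being statistically significant.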
Looking at the relationship between BMI and hours slept per night, there is a weak but statistically significant negative correlation: BMI tends to rise as hours slept fall. A similar, statistically significant pattern was found for days of physical activity, where BMI tends to increase as the number of physically active days falls.
ggplot(nhanes_clean, aes(x = BMI, y = SleepHrsNight)) +
custom_smooth +
labs(title = "BMI vs Hours Slept Per Night",
x = "BMI",
y = "Sleep Hours") +
scale_y_continuous(breaks = seq(2, 12, by = 1), limits = c(2, 12)) +
custom_theme
Figure 13. BMI vs hours slept per night
cor.test(nhanes_clean$BMI, nhanes_clean$SleepHrsNight, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$BMI and nhanes_clean$SleepHrsNight
## t = -3.4935, df = 2807, p-value = 0.000484
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10253026 -0.02888336
## sample estimates:
## cor
## -0.06579642
ggplot(nhanes_clean, aes(x = BMI, y = PhysActiveDays)) +
custom_smooth +
labs(title = "BMI vs Days Physically Active",
x = "BMI",
y = "Number of Physically Active Days") +
scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 7)) +
custom_theme
Figure 14. BMI vs days of physical activity
cor.test(nhanes_clean$BMI, nhanes_clean$PhysActiveDays, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$BMI and nhanes_clean$PhysActiveDays
## t = -2.374, df = 2807, p-value = 0.01766
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.081611225 -0.007792551
## sample estimates:
## cor
## -0.04476299
While the main aim of this analysis was to examine the relationship of BMI to other health indicators and lifestyle choices, the relationship between days of physical activity and other health indicators was also explored, to see whether physical activity does in fact have a significant positive influence on the health of the participants in this dataset.
Participants with diabetes actually had a higher median number of physically active days, and more of their distribution sat in the higher range of days, than those without diabetes. On the other hand, no difference was found in the distribution of physically active days between the depression groups. For total cholesterol, there was a weak but statistically significant positive correlation, with more days of physical activity associated with slightly higher total cholesterol.
ggplot(nhanes_clean, aes(x = PhysActiveDays, y = Diabetes)) +
custom_boxplot +
labs(title = "Physical Activity vs Diabetes",
x = "Days Physically Active",
y = "Diabetes Status") +
scale_x_continuous(breaks = seq(1, 7, by = 1)) +
custom_theme
Figure 15. Distribution of physical activity days by diabetes status
ggplot(nhanes_clean, aes(x = PhysActiveDays, y = Depressed)) +
custom_boxplot +
labs(title = "Physical Activity vs Depression",
x = "Days Physically Active",
y = "Depression Frequency") +
scale_x_continuous(breaks = seq(1, 7, by = 1)) +
custom_theme
Figure 16. Distribution of physical activity days by depression status
ggplot(nhanes_clean, aes(x = PhysActiveDays, y = TotChol)) +
custom_point +
custom_smooth +
labs(title = "Physical Activity vs Total Cholesterol",
x = "Days Physically Active",
y = "Total Cholesterol (mmol/L)") +
scale_x_continuous(breaks = seq(1, 7, by = 1)) +
custom_theme
Figure 17. Relationship between days of physical activity and total cholesterol
cor.test(nhanes_clean$PhysActiveDays, nhanes_clean$TotChol, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$PhysActiveDays and nhanes_clean$TotChol
## t = 2.0614, df = 2807, p-value = 0.03935
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.001899012 0.075754035
## sample estimates:
## cor
## 0.03887962
A similar analysis was completed for the number of hours slept per night against the same health indicators.
The distribution of sleep hours did not differ between those with and without diabetes. For depression, the distribution of sleep hours was similar between those reporting no days of depression and those reporting several days, but those who reported depression on most days tended to sleep fewer hours.
In addition, there was found to be no statistically significant correlation between sleep hours and total cholesterol.
ggplot(nhanes_clean, aes(x = SleepHrsNight, y = Diabetes)) +
custom_boxplot +
labs(title = "Sleep vs Diabetes",
x = "Hours Slept Per Night",
y = "Diabetes Status") +
scale_x_continuous(breaks = seq(2, 12, by = 1)) +
custom_theme
Figure 18. Distribution of sleep hours by diabetes status
ggplot(nhanes_clean, aes(x = SleepHrsNight, y = Depressed)) +
custom_boxplot +
labs(title = "Sleep vs Depression",
x = "Hours Slept Per Night",
y = "Depression Frequency") +
scale_x_continuous(breaks = seq(2, 12, by = 1)) +
custom_theme
Figure 19. Distribution of sleep hours by depression status
ggplot(nhanes_clean, aes(x = SleepHrsNight, y = TotChol)) +
custom_point +
custom_smooth +
labs(title = "Sleep vs Total Cholesterol",
x = "Hours Slept Per Night",
y = "Total Cholesterol (mmol/L)") +
scale_x_continuous(breaks = seq(2, 12, by = 1)) +
custom_theme
Figure 20. Relationship between sleep hours and total cholesterol
cor.test(nhanes_clean$SleepHrsNight, nhanes_clean$TotChol, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nhanes_clean$SleepHrsNight and nhanes_clean$TotChol
## t = -0.78352, df = 2807, p-value = 0.4334
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05174208 0.02220845
## sample estimates:
## cor
## -0.01478704
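This report runs nine correlation tests, so some p-values below 0.05 could arise by chance alone. As a hedged sensitivity check that is not part of the original analysis, the sketch below applies Holm's adjustment with base R's `p.adjust()` to the p-values copied from the nine `cor.test()` outputs in this report. The labels in `p_values` are only descriptive names chosen here.

```r
# p-values copied from the nine cor.test() outputs in this report
p_values <- c(
  "BMI vs TotChol"            = 0.8925,
  "BMI vs TotChol (males)"    = 0.5868,
  "BMI vs TotChol (females)"  = 0.9657,
  "BMI vs BPSysAve"           = 5.984e-16,
  "BMI vs BPDiaAve"           = 1.126e-14,
  "BMI vs SleepHrsNight"      = 0.000484,
  "BMI vs PhysActiveDays"     = 0.01766,
  "PhysActiveDays vs TotChol" = 0.03935,
  "SleepHrsNight vs TotChol"  = 0.4334
)

# Holm's step-down adjustment controls the family-wise error rate
p_holm <- p.adjust(p_values, method = "holm")

# Which results stay below 0.05 after adjustment
names(p_holm)[p_holm < 0.05]
```

Under this adjustment the blood pressure and sleep correlations with BMI stay below 0.05, while the two weakest results (BMI vs days of physical activity, and days of physical activity vs total cholesterol) do not, so those two are best read as tentative.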
This analysis of the NHANES dataset explored the relationships between Body Mass Index (BMI) and various demographic, health, and lifestyle factors. The findings highlight several key patterns:
Demographic Differences: Males exhibited a slightly higher median BMI than females, though females had a wider range of BMI values. BMI was generally lower among college graduates and individuals in higher income brackets, although even within these groups, the median BMI remained in the overweight category.
Health Associations: BMI showed a clear association with diabetes, as individuals with diabetes had markedly higher BMI values. A weaker but still notable relationship was observed with depression, where those experiencing depression on most days had slightly higher BMI values. Notably, there was no significant relationship between BMI and total cholesterol levels.
Lifestyle Factors: Alcohol consumption and smoking were associated with slightly lower BMI values, although these group differences were not formally tested and further investigation would be needed to say anything about causality. Sleep duration and physical activity each showed a weak but statistically significant negative correlation with BMI, so fewer active days and shorter sleep were associated with higher BMI.
Blood Pressure Relationship: A statistically significant but weak positive correlation was found between BMI and both systolic and diastolic blood pressure, indicating that individuals with higher BMI tend to have higher blood pressure levels.
With regard to the relationship of physical activity and sleep to other health indicators, the following patterns were found:
Days of Physical Activity: Surprisingly, those with diabetes had a higher number of physically active days, and days of physical activity showed a weak but statistically significant positive correlation with total cholesterol. Further exploration could be done on the effect that hours of physical activity per week have on these indicators of health.
Hours Slept Per Night: Sleep hours only differed noticeably among those who suffered from depression on most days, who recorded fewer hours of sleep. Further exploration could be done on the effect that quality of sleep has on these indicators of health.
While this study provides useful insights, the substantial data reduction (from 10,000 to 2,809 rows) may limit the generalisability of the findings.