Data Exercise

Author

Kevin Kosewick

For this exercise, I’ll be using a dataset I got from the CDC website about smoking-attributable expenses in the U.S. The data include expenses per state for categories such as hospital care, ambulatory care, and prescription drugs, as well as overall expenses for the U.S., from 2005 to 2009. We’ll do an exploratory data analysis and some data processing for this exercise.

#load the dataset and packages
library(dplyr)
library(here)
library(ggplot2)
library(car)
smoke_expense<- read.csv(here("cdcdata-exercise", "SAE.csv"))
#make sure data fully loaded by checking number of rows and columns. We should have 19 variables and 1560 observations
nrow(smoke_expense)
[1] 1560
ncol(smoke_expense)
[1] 19
#check the structure and summary of the data. For the sake of the webpage, I've commented out the str() line, but if you're viewing this in R go ahead and delete the #
summary(smoke_expense)
      Year      LocationAbbr       LocationDesc        DataSource       
 Min.   :2005   Length:1560        Length:1560        Length:1560       
 1st Qu.:2006   Class :character   Class :character   Class :character  
 Median :2007   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2007                                                           
 3rd Qu.:2008                                                           
 Max.   :2009                                                           
  TopicType          TopicDesc         MeasureDesc          Variable        
 Length:1560        Length:1560        Length:1560        Length:1560       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 Data_Value_Unit    Data_Value_Type      Data_Value      
 Length:1560        Length:1560        Min.   :     8.4  
 Class :character   Class :character   1st Qu.:   105.7  
 Mode  :character   Mode  :character   Median :   306.1  
                                       Mean   :  1545.2  
                                       3rd Qu.:   842.3  
                                       Max.   :132459.8  
 Data_Value_Footnote_Symbol Data_Value_Footnote GeoLocation       
 Length:1560                Length:1560         Length:1560       
 Class :character           Class :character    Class :character  
 Mode  :character           Mode  :character    Mode  :character  
                                                                  
                                                                  
                                                                  
 Topic.Type.ID        Topic.ID          Measure.ID        SubMeasureID      
 Length:1560        Length:1560        Length:1560        Length:1560       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  DisplayOrder
 Min.   :1.0  
 1st Qu.:2.0  
 Median :3.5  
 Mean   :3.5  
 3rd Qu.:5.0  
 Max.   :6.0  
#str(smoke_expense)

We can see that there’s one observation for each of our 6 expense types per state per year. We can also see from the structure and summary that there are many columns we don’t need for an analysis. We can go ahead and pick out the ones that are of interest to us: year, location, variable (which refers to the type of expense), and value (which is the cost in millions of dollars for each expense type). All of the other columns seem to be for record-keeping purposes.

#make a new data frame containing only our four columns of interest: Year, LocationAbbr, Variable, and Data_Value.
smoke_expense_2 <- smoke_expense[, c("Year", "LocationAbbr", "Variable", "Data_Value")]
#check the new object to make sure it has everything we want. For the sake of the webpage, I've commented out the str() line, but if you're viewing this in R go ahead and delete the #
#str(smoke_expense_2)
summary(smoke_expense_2)
      Year      LocationAbbr         Variable           Data_Value      
 Min.   :2005   Length:1560        Length:1560        Min.   :     8.4  
 1st Qu.:2006   Class :character   Class :character   1st Qu.:   105.7  
 Median :2007   Mode  :character   Mode  :character   Median :   306.1  
 Mean   :2007                                         Mean   :  1545.2  
 3rd Qu.:2008                                         3rd Qu.:   842.3  
 Max.   :2009                                         Max.   :132459.8  

This new object is much more condensed and easier to work with. We’ll check now to make sure there are no NA values and then proceed with some EDA.

#check for NA values
na_check<- is.na(smoke_expense_2)
print(sum(na_check))
[1] 0
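
If we wanted a per-column breakdown rather than just the overall total, a quick sketch using base R on the same smoke_expense_2 object would be:

#count NA values in each column separately
colSums(is.na(smoke_expense_2))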

Since there aren’t any NA values and looking at the structure indicates no missing values, we can begin to check the mean and standard deviation of each expense type across the five years (2005-2009) in the data set. Rather than looking at every state, let’s focus on 5 to make this a bit easier. We’ll use GA, TN, MS, CA, and FL.

Using Microsoft Copilot with GPT-4 in “Precise Mode”, I entered the following prompt to get the code I’m about to use: “For my exploratory analysis I want to summarize each variable in a way that can be described by a distribution. For instance, I want to be able to determine the mean and standard deviation of each expense type for 5 different states over the 5-year period recorded in the dataset. What is the best approach for this and could you provide some example code?”

# Filter for the 5 states you are interested in
states <- c("TN", "MS", "GA", "FL", "CA")
smoke_filtered <- smoke_expense_2 %>% filter(LocationAbbr %in% states)

# Calculate mean and standard deviation
smoke_summary <- smoke_filtered %>%
  group_by(LocationAbbr, Year, Variable) %>%
  summarise(
    Mean = mean(Data_Value, na.rm = TRUE),
    SD = sd(Data_Value, na.rm = TRUE)
  )
`summarise()` has grouped output by 'LocationAbbr', 'Year'. You can override
using the `.groups` argument.
# Print the summary statistics
print(smoke_summary)
# A tibble: 150 × 5
# Groups:   LocationAbbr, Year [25]
   LocationAbbr  Year Variable             Mean    SD
   <chr>        <int> <chr>               <dbl> <dbl>
 1 CA            2005 Ambulatory          2057.    NA
 2 CA            2005 Hospital            5309.    NA
 3 CA            2005 Nursing Home         592.    NA
 4 CA            2005 Other                620.    NA
 5 CA            2005 Prescription Drugs  1883.    NA
 6 CA            2005 Total              10460.    NA
 7 CA            2006 Ambulatory          2122.    NA
 8 CA            2006 Hospital            5652.    NA
 9 CA            2006 Nursing Home         641.    NA
10 CA            2006 Other                677.    NA
# ℹ 140 more rows
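
As the message above notes, summarise() can be told explicitly what to do with the grouping; for example, adding .groups = "drop" returns an ungrouped tibble and silences the message. A minimal sketch of the same summary with that argument:

# Same summary as above, but return an ungrouped result and suppress the message
smoke_filtered %>%
  group_by(LocationAbbr, Year, Variable) %>%
  summarise(
    Mean = mean(Data_Value, na.rm = TRUE),
    SD = sd(Data_Value, na.rm = TRUE),
    .groups = "drop"
  )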

This is good information, but since there’s only one observation for each expense category per state per year, we don’t learn much from a mean or standard deviation (SD) calculation (hence the NA values in the SD column above). We can instead group across the years to get one mean and SD per state and expense type, which will also make it easier to create synthetic data for the second part of this exercise.

# Calculate mean and standard deviation for all 5 states combined
smoke_summary <- smoke_filtered %>%
  group_by(LocationAbbr, Variable) %>%
  summarise(
    Mean = mean(Data_Value, na.rm = TRUE),
    SD = sd(Data_Value, na.rm = TRUE)
  )
`summarise()` has grouped output by 'LocationAbbr'. You can override using the
`.groups` argument.
# Print the summary statistics
print(smoke_summary)
# A tibble: 30 × 4
# Groups:   LocationAbbr [5]
   LocationAbbr Variable             Mean     SD
   <chr>        <chr>               <dbl>  <dbl>
 1 CA           Ambulatory          2294.  208. 
 2 CA           Hospital            6006.  571. 
 3 CA           Nursing Home         699.   87.9
 4 CA           Other                725.   76.8
 5 CA           Prescription Drugs  2172.  197. 
 6 CA           Total              11896. 1131. 
 7 FL           Ambulatory          1419.  105. 
 8 FL           Hospital            3532.  270. 
 9 FL           Nursing Home         595.   58.9
10 FL           Other                514.   53.3
# ℹ 20 more rows

Now we can create some visualizations of this information for easier interpretation of the data. I entered the following prompt into Microsoft Copilot with GPT-4 in “Precise mode”: “I want to make plots now to determine if the distribution of this data is relatively normal. How would I go about that?”

I received this code string from it; the only modifications I made were using the proper column names and setting the binwidth to the square root of the number of observations (about 12.25):

# Create a histogram
ggplot(smoke_filtered, aes(x = Data_Value)) + 
  geom_histogram(binwidth = 12.25, fill = "blue", color = "black") +
  facet_wrap(~ Variable) +
  theme_minimal() +
  labs(title = "Histogram of Expenses", x = "Expense", y = "Frequency")

# Create a density plot
ggplot(smoke_filtered, aes(x = Data_Value)) + 
  geom_density(fill = "blue") +
  facet_wrap(~ Variable) +
  theme_minimal() +
  labs(title = "Density Plot of Expenses", x = "Expense", y = "Density") 

# Create a Q-Q plot with the car package
qqPlot(smoke_filtered$Data_Value, distribution = "norm", main = "Q-Q Plot of Expenses")

[1] 131  45

The histogram isn’t very useful for much besides visualizing the raw numbers, but the Q-Q plot shows that the distribution of the data as a whole is roughly normal with some outliers in both tails. The density plots show that the “Other”, “Nursing Home”, “Prescription Drugs”, and “Ambulatory” expenses are fairly skewed, while the other two (“Hospital” and “Total”) are closer to normally distributed. Hopefully this is enough information to create a good synthetic data set that mimics the trends seen in this one!
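
As a numeric complement to these visual checks, a normality test could be run for each expense type. The sketch below assumes the smoke_filtered object from above and uses the Shapiro-Wilk test from base R; with only 25 observations per expense type the test has limited power, so it should be read alongside the plots rather than instead of them.

#run a Shapiro-Wilk normality test separately for each expense type
smoke_filtered %>%
  group_by(Variable) %>%
  summarise(shapiro_p = shapiro.test(Data_Value)$p.value)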

This section was contributed by Emma Hardin-Parker

# Load required packages
library(here)
library(dplyr)
library(ggplot2)
library(skimr)
library(gtsummary)
# Set a seed for reproducibility
set.seed(207)
# Define the number of observations to generate
n_obs <- 150

I am now going to get a feel for the data using the skim() function from skimr and the tbl_summary() function from gtsummary.

skimr::skim(smoke_filtered)
Data summary
Name smoke_filtered
Number of rows 150
Number of columns 4
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LocationAbbr 0 1 2 2 0 5 0
Variable 0 1 5 18 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2007.0 1.42 2005.0 2006.00 2007.00 2008.00 2009.0 ▇▇▇▇▇
Data_Value 0 1 1747.8 2581.55 50.6 225.15 660.15 2073.02 13292.4 ▇▁▁▁▁
gtsummary::tbl_summary(smoke_filtered, statistic = list(
  all_continuous() ~ "{mean}/{median}/{min}/{max}/{sd}",
  all_categorical() ~ "{n} / {N} ({p}%)"
))
Characteristic          N = 150¹
Year
    2005                30 / 150 (20%)
    2006                30 / 150 (20%)
    2007                30 / 150 (20%)
    2008                30 / 150 (20%)
    2009                30 / 150 (20%)
LocationAbbr
    CA                  30 / 150 (20%)
    FL                  30 / 150 (20%)
    GA                  30 / 150 (20%)
    MS                  30 / 150 (20%)
    TN                  30 / 150 (20%)
Variable
    Ambulatory          25 / 150 (17%)
    Hospital            25 / 150 (17%)
    Nursing Home        25 / 150 (17%)
    Other               25 / 150 (17%)
    Prescription Drugs  25 / 150 (17%)
    Total               25 / 150 (17%)
Data_Value              1,748 / 660 / 51 / 13,292 / 2,582
¹ n / N (%); Mean / Median / Minimum / Maximum / SD

Now I am going to create a synthetic data set based on the actual data.

syn_smoke <- data.frame(
  Year = integer(n_obs),
  LocationAbbr = character(n_obs),
  Variable = character(n_obs),
  Data_Value = numeric(n_obs)
)


# Year: sample years with probabilities proportional to their frequencies in the
# real data (note that sampling from character labels overwrites the integer
# column with a character one, as the glimpse() output below shows)
syn_smoke$Year <- sample(c("2005", "2006", "2007", "2008", "2009"),
                         n_obs, replace = TRUE,
                         prob = as.integer(table(smoke_filtered$Year)))

# LocationAbbr: sample the 5 states with probabilities proportional to their
# frequencies in the real data (sample() rescales the probabilities to sum to 1)
syn_smoke$LocationAbbr <- sample(c("GA", "MS", "TN", "FL", "CA"),
                                 n_obs, replace = TRUE,
                                 prob = as.numeric(table(smoke_filtered$LocationAbbr)/100))

# Variable: sample the 6 expense types with probabilities proportional to their
# frequencies in the real data
syn_smoke$Variable <- sample(c("Ambulatory", "Hospital", "Nursing Home", "Other", "Prescription Drugs", "Total"),
                             n_obs,
                             replace = TRUE,
                             prob = as.numeric(table(smoke_filtered$Variable)/100))

# Data_Value: draw expenses uniformly between the minimum and maximum observed values
syn_smoke$Data_Value <- round(runif(n_obs,
                                    min = min(smoke_filtered$Data_Value),
                                    max = max(smoke_filtered$Data_Value)), 1)
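
The uniform draw above spreads values evenly between the observed minimum and maximum. A hypothetical alternative (sketched below, not used in the rest of this exercise) would be to draw each value from a normal distribution using the per-state, per-expense-type means and SDs stored in smoke_summary earlier, which would keep the synthetic values closer to the original spread; syn_smoke_alt is just an illustrative name.

# Alternative sketch (not used below): draw Data_Value from a normal distribution
# per state and expense type, using the means/SDs computed earlier in smoke_summary
syn_smoke_alt <- syn_smoke %>%
  left_join(smoke_summary, by = c("LocationAbbr", "Variable")) %>%
  mutate(Data_Value = round(rnorm(n(), mean = Mean, sd = SD), 1)) %>%
  select(Year, LocationAbbr, Variable, Data_Value)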

To make sure the synthetic data set was created properly, I am going to use the head(), glimpse(), and summary() functions to check that the first few rows and the overall structure look okay for further analyses.

head(syn_smoke)
  Year LocationAbbr   Variable Data_Value
1 2009           CA   Hospital       76.5
2 2009           TN Ambulatory     2217.7
3 2009           FL   Hospital      411.9
4 2005           TN   Hospital     2993.4
5 2008           CA Ambulatory      910.9
6 2006           FL Ambulatory     7999.9
summary(syn_smoke)
     Year           LocationAbbr         Variable           Data_Value     
 Length:150         Length:150         Length:150         Min.   :   76.5  
 Class :character   Class :character   Class :character   1st Qu.: 3172.8  
 Mode  :character   Mode  :character   Mode  :character   Median : 7116.2  
                                                          Mean   : 6566.1  
                                                          3rd Qu.: 9791.8  
                                                          Max.   :13236.8  
glimpse(syn_smoke)
Rows: 150
Columns: 4
$ Year         <chr> "2009", "2009", "2009", "2005", "2008", "2006", "2006", "…
$ LocationAbbr <chr> "CA", "TN", "FL", "TN", "CA", "FL", "CA", "GA", "GA", "GA…
$ Variable     <chr> "Hospital", "Ambulatory", "Hospital", "Hospital", "Ambula…
$ Data_Value   <dbl> 76.5, 2217.7, 411.9, 2993.4, 910.9, 7999.9, 1914.5, 9803.…

Everything looks as it should, so it’s time to make some exploratory figures and tables.

# Calculate mean and standard deviation for all 5 states combined
syn_smoke_summary <- syn_smoke %>%
  group_by(LocationAbbr, Variable) %>%
  summarise(
    Mean = mean(Data_Value, na.rm = TRUE),
    SD = sd(Data_Value, na.rm = TRUE)
  )
`summarise()` has grouped output by 'LocationAbbr'. You can override using the
`.groups` argument.
# Create a histogram
syn_hist <- ggplot(syn_smoke, aes(x = Data_Value)) + 
  geom_histogram(binwidth = 12.25, fill = "blue", color = "black") +
  facet_wrap(~ Variable) +
  theme_minimal() +
  labs(title = "Histogram of Expenses", x = "Expense", y = "Frequency")
print(syn_hist)

# Create a density plot
library(scales)
syn_dens <- ggplot(syn_smoke, aes(x = Data_Value)) + 
  geom_density(fill = "blue") +
  facet_wrap(~ Variable) +
  theme_minimal() +
  labs(title = "Density Plot of Expenses", x = "Expense", y = "Density") +
  scale_x_continuous(labels = label_number())
print(syn_dens)

# Create a Q-Q plot
library(car)
syn_qq <- qqPlot(syn_smoke$Data_Value, distribution = "norm", main = "Q-Q Plot of Expenses")

The histograms produced from the original data and the synthetic data were quite different. Most of the per-Variable histograms created with the original data were right-skewed, while the histograms produced from the synthetic data were spread fairly evenly across the range.

The density plots also look different between the original data and the synthetic data; however, I struggled to code this plot. The axis labels defaulted to scientific notation, and I had to add an additional line of code (using the scales package) to display plain numbers instead. Even with that change, the densities are on average higher in the synthetic plot than in the original plot.
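
For reference, the same plain-number formatting can be applied to the density axis as well; a minimal sketch adding one more layer to the syn_dens plot defined above:

#format the density (y) axis with plain numbers instead of scientific notation
syn_dens + scale_y_continuous(labels = label_number())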

The Q-Q plot was undoubtedly the most interesting plot to compare to the original data. The synthetic Q-Q plot is noticeably more normally distributed than the original plot. Though the histograms differed drastically, it is much easier to compare normality between the data sets using a Q-Q plot, so I am happy that I was able to successfully create this one.
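
One way to make this comparison directly (a sketch, assuming the smoke_filtered and syn_smoke objects created above) is to stack the expense values from both data sets and facet a ggplot Q-Q plot by source:

#stack the original and synthetic expense values and compare their Q-Q plots side by side
comparison <- bind_rows(
  smoke_filtered %>% transmute(Source = "Original", Data_Value),
  syn_smoke %>% transmute(Source = "Synthetic", Data_Value)
)
ggplot(comparison, aes(sample = Data_Value)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ Source) +
  theme_minimal() +
  labs(title = "Q-Q Plot of Expenses: Original vs. Synthetic")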