R Coding Exercise

Load the dslabs package, installing it first if you haven’t already. Then, inspect the gapminder dataset.
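
The install step can be made conditional so that re-running the script does not reinstall anything. The sketch below is one possible way to do that; `ensure_installed()` is a hypothetical helper name, not part of dslabs or base R, and it relies only on the base functions `requireNamespace()` and `install.packages()`.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded after the (possible) install.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call installs nothing and returns TRUE.
ok <- ensure_installed("stats")
```

For this exercise, the same helper would be called as `ensure_installed("dslabs")` before `library(dslabs)`.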

#load dslabs package and tidyverse
library(dslabs)
library(tidyverse)
#look at help file for gapminder data
help(gapminder)
#get an overview of data structure
str(gapminder)

'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

#get a summary of data 
summary(gapminder)

                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586

#determine the type of object gapminder is
class(gapminder)

[1] "data.frame"

Create a new object that contains only the African countries. Then, check the structure and summary of the new object.

#create the object with only African countries
african_countries <- gapminder[gapminder$continent == "Africa", ]
#check the structure and summary
str(african_countries)

'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

summary(african_countries)

         country          year      infant_mortality life_expectancy
 Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
 Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
 Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
 Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
 Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
 Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
 (Other)     :2565                  NA's   :226                     
   fertility       population             gdp               continent   
 Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
 1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
 Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
 Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
 3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
 Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
 NA's   :51      NA's   :51          NA's   :637                        
                       region   
 Eastern Africa           :912  
 Western Africa           :912  
 Middle Africa            :456  
 Northern Africa          :342  
 Southern Africa          :285  
 Australia and New Zealand:  0  
 (Other)                  :  0
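
The zero counts for the other continents and regions in the summary above appear because row subsetting keeps every factor level, even levels with no rows left. If those empty levels get in the way, base R's `droplevels()` removes them. A minimal sketch on a made-up miniature data frame (the values are illustrative only, not taken from gapminder):

```r
# Made-up miniature of the gapminder layout (illustrative values only).
toy <- data.frame(
  country   = factor(c("Algeria", "Angola", "Albania", "Argentina")),
  continent = factor(c("Africa", "Africa", "Europe", "Americas"))
)

# Row subsetting keeps all original factor levels, even those with no rows left.
africa_toy <- toy[toy$continent == "Africa", ]
levels(africa_toy$continent)

# droplevels() removes the unused levels from every factor column.
africa_clean <- droplevels(africa_toy)
levels(africa_clean$continent)
```

Applied to this exercise, the equivalent call would be `droplevels(african_countries)`.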

Now, using the new African countries object, create two new objects. One should only contain “infant_mortality” and “life_expectancy” and the other should only hold “population” and “life_expectancy”.

#create the object with only population and life expectancy data
african_countries_pop_life <- african_countries[, c("population", "life_expectancy")]
#create the object with only infant mortality and life expectancy data
african_countries_infant_life <- african_countries[, c("infant_mortality", "life_expectancy")]

Now that we’ve created two new objects that look at these specific variables, we can inspect them and get a better idea of the data.

#look at the structure and summary of the first object
str(african_countries_pop_life)

'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...

summary(african_countries_pop_life)

   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51

#do the same for the second object
str(african_countries_infant_life)

'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...

summary(african_countries_infant_life)

 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226

Using the two new objects, we can now create plots to characterize the relationship between life expectancy, population, and infant mortality. We’ll create two plots: one showing life expectancy vs. infant mortality and one showing life expectancy vs. population size. The latter will use a log scale on the x-axis to make the data easier to visualize.

#load ggplot2 to create better plots
library(ggplot2)

# Plot 1: Life expectancy vs. Infant mortality. labs() sets the plot title.
ggplot(african_countries_infant_life, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point() +
  labs(title = "Life Expectancy vs. Infant Mortality")

Warning: Removed 226 rows containing missing values or values outside the scale range
(`geom_point()`).

# Plot 2: Life expectancy vs. Population size.
#scale_x_log10 puts the x axis (population) on a log scale.
ggplot(african_countries_pop_life, aes(x = population, y = life_expectancy)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "Life Expectancy vs. Population Size (log scale)")

Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_point()`).

In Plot 1, we can see a negative correlation: as infant mortality increases, life expectancy decreases. This makes sense, since more developed countries with better healthcare have higher life expectancies and lower infant mortality rates. In Plot 2, population size and life expectancy are positively correlated. This is plausible, as longer lives and less frequent deaths allow for greater population growth. The “streaks” in the data arise because each country appears once per year in the dataset, so its points trace a path over time.
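
The direction of each relationship can also be checked numerically with `cor()`. Because both variables contain missing values, the call needs `use = "complete.obs"`; otherwise the result is `NA`. The sketch below uses made-up numbers purely to show the mechanics, and takes `log10()` of population to mirror the log-scaled axis of Plot 2.

```r
# Made-up values for illustration; not taken from gapminder.
toy <- data.frame(
  infant_mortality = c(150, 120, NA, 60, 30),
  life_expectancy  = c(45, 50, 55, 62, 70),
  population       = c(1e6, 5e6, 2e7, NA, 1e8)
)

# With the default settings, a single NA makes the whole result NA.
r_default <- cor(toy$infant_mortality, toy$life_expectancy)

# use = "complete.obs" drops rows with an NA in either variable first.
r_infant <- cor(toy$infant_mortality, toy$life_expectancy, use = "complete.obs")

# Correlate against log10(population) to match a log-scaled x axis.
r_pop <- cor(log10(toy$population), toy$life_expectancy, use = "complete.obs")

r_infant  # negative for this toy data
r_pop     # positive for this toy data
```

On the real data, the same calls would take the columns of `african_countries_infant_life` and `african_countries_pop_life`.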

Knowing this, we can begin to narrow in on certain years and see which ones would be easiest to analyze given our dataset. We’ll figure out which years have missing data for infant mortality.

#find which years have missing data for infant mortality.
#is.na() flags the rows where infant_mortality is NA.
#select() keeps only the year column for those rows.
african_countries %>%
  filter(is.na(infant_mortality)) %>%
  select(year)
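
The pipeline above returns one row per missing observation, so the same year can appear many times. A compact alternative is to list the distinct years, or count the missing values per year. The following is a base R sketch on made-up rows (the toy values are assumptions for illustration, not the gapminder data):

```r
# Made-up rows for illustration; NA marks a missing infant-mortality value.
toy <- data.frame(
  year             = c(1960, 1960, 1961, 1961, 2000, 2000, 2016, 2016),
  infant_mortality = c(NA, 148, NA, NA, 60, 45, NA, NA)
)

# Years of the rows where infant mortality is missing.
na_years <- toy$year[is.na(toy$infant_mortality)]

# Distinct years with at least one missing value.
years_with_na <- sort(unique(na_years))

# Number of missing values in each of those years.
na_counts <- table(na_years)

years_with_na
na_counts
```

Years absent from `years_with_na` (2000 in this toy example) have no missing infant-mortality values.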

Infant mortality data are missing for years up to 1981 and again for 2016, so we’ll select 2000, which falls between those gaps. We’ll now create a new object containing only the observations from 2000.

#create an object with only data from 2000
african_countries_2000 <- african_countries[african_countries$year == 2000, ]

Now, we’ll make the same plots as before using only the data from 2000.

# Plot 3: Life expectancy vs. Infant mortality.
ggplot(african_countries_2000, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point() +
  labs(title = "Life Expectancy vs. Infant Mortality")

# Plot 4: Life expectancy vs. Population size. 
ggplot(african_countries_2000, aes(x = population, y = life_expectancy)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "Life Expectancy vs. Population Size (log scale)")

There still seems to be a negative correlation in Plot 3, but Plot 4 shows no noticeable correlation. We can now fit some linear models to the 2000 data and draw more conclusions.

#Table 1: fit life expectancy as a function of infant mortality. 
#lm() creates a linear model for the specified variables from a given dataset.
fit1 <- lm(life_expectancy ~ infant_mortality, african_countries_2000)
#print results to screen
summary(fit1)


Call:
lm(formula = life_expectancy ~ infant_mortality, data = african_countries_2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08

#Table 2: fit life expectancy as a function of population size
fit2 <- lm(life_expectancy ~ population, african_countries_2000)
#print results to screen
summary(fit2)


Call:
lm(formula = life_expectancy ~ population, data = african_countries_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

Based on the fit results, infant mortality is a statistically significant predictor of life expectancy for African countries in the year 2000 (p = 2.83e-08). Population, on the other hand, is not a statistically significant predictor of life expectancy in 2000 (p = 0.616). These are logical conclusions given our prior knowledge of demography.
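
Rather than reading the slope and p-value off the printed summary, they can be pulled out programmatically. `coef(summary(fit))` returns the coefficient table as a matrix, and `confint()` gives confidence intervals. A minimal sketch, assuming made-up data in place of `african_countries_2000`:

```r
# Made-up data for illustration only.
toy <- data.frame(
  infant_mortality = c(30, 45, 60, 80, 100, 120, 150),
  life_expectancy  = c(70, 66, 63, 58, 55, 50, 46)
)

fit <- lm(life_expectancy ~ infant_mortality, data = toy)

# coef(summary(fit)) is a matrix with one row per term and the columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)".
coefs   <- coef(summary(fit))
slope   <- coefs["infant_mortality", "Estimate"]
p_value <- coefs["infant_mortality", "Pr(>|t|)"]

# 95% confidence interval for each coefficient (the default level).
ci <- confint(fit)

slope
p_value
ci
```

The same indexing would work on `fit1` and `fit2`, using `"population"` as the row name for the second model.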

This section contributed by Cory Cribb

Next, we load the dslabs dataset “murders”. It is a more morbid data set, but interesting to explore nonetheless.

library(dslabs)
help(murders)
str(murders)

'data.frame':   51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...

summary(murders)

    state               abb                      region     population      
 Length:51          Length:51          Northeast    : 9   Min.   :  563626  
 Class :character   Class :character   South        :17   1st Qu.: 1696962  
 Mode  :character   Mode  :character   North Central:12   Median : 4339367  
                                       West         :13   Mean   : 6075769  
                                                          3rd Qu.: 6636084  
                                                          Max.   :37253956  
     total       
 Min.   :   2.0  
 1st Qu.:  24.5  
 Median :  97.0  
 Mean   : 184.4  
 3rd Qu.: 268.0  
 Max.   :1257.0

Since I am originally from the Southern region of the US, let’s explore murders in that region.

South_Murders <- murders[murders$region == "South", ]
str(South_Murders)

'data.frame':   17 obs. of  5 variables:
 $ state     : chr  "Alabama" "Arkansas" "Delaware" "District of Columbia" ...
 $ abb       : chr  "AL" "AR" "DE" "DC" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ population: num  4779736 2915918 897934 601723 19687653 ...
 $ total     : num  135 93 38 99 669 376 116 351 293 120 ...

summary(South_Murders)

    state               abb                      region     population      
 Length:17          Length:17          Northeast    : 0   Min.   :  601723  
 Class :character   Class :character   South        :17   1st Qu.: 2967297  
 Mode  :character   Mode  :character   North Central: 0   Median : 4625364  
                                       West         : 0   Mean   : 6804378  
                                                          3rd Qu.: 8001024  
                                                          Max.   :25145561  
     total      
 Min.   : 27.0  
 1st Qu.:111.0  
 Median :207.0  
 Mean   :246.8  
 3rd Qu.:293.0  
 Max.   :805.0

From this data set, it appears the researchers classified 17 states (counting the District of Columbia) as being in the Southern region. Let’s explore whether population size has any relationship to gun murders.

Pop_and_murder <- South_Murders[, c("total", "population")]
str(Pop_and_murder)

'data.frame':   17 obs. of  2 variables:
 $ total     : num  135 93 38 99 669 376 116 351 293 120 ...
 $ population: num  4779736 2915918 897934 601723 19687653 ...

summary(Pop_and_murder)

     total         population      
 Min.   : 27.0   Min.   :  601723  
 1st Qu.:111.0   1st Qu.: 2967297  
 Median :207.0   Median : 4625364  
 Mean   :246.8   Mean   : 6804378  
 3rd Qu.:293.0   3rd Qu.: 8001024  
 Max.   :805.0   Max.   :25145561

Create a scatter plot with total gun murders on the x-axis and state population on the y-axis, then add a best fit line to see whether there is a trend.

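#scatter plot of state population (y) against total gun murders (x)
#passing data = avoids attach(), which masks tidyr's population object
plot(population ~ total, data = South_Murders,
     main = "Total gun murders vs. population size",
     xlab = "Total gun murders", ylab = "population")
#add the least-squares best fit line
abline(lm(population ~ total, data = South_Murders))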

From a quick view of the plot, there appears to be a positive correlation: states with higher populations have more gun murders. Let’s run a linear model to see whether the relationship is statistically significant.

fit3 <- lm(population~total, South_Murders)
summary(fit3)


Call:
lm(formula = population ~ total, data = South_Murders)

Residuals:
     Min       1Q   Median       3Q      Max 
-5332407  -680032   482183  1257898  1945758 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -443125     717119  -0.618    0.546    
total          29370       2229  13.178 1.19e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1898000 on 15 degrees of freedom
Multiple R-squared:  0.9205,    Adjusted R-squared:  0.9152 
F-statistic: 173.7 on 1 and 15 DF,  p-value: 1.189e-09

From the simple linear regression, we see that the slope is statistically significant (p = 1.19e-09). The adjusted R-squared is 0.9152, which indicates a strong, positive linear relationship between total gun murders and population size in the Southern region of the United States.
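
In a simple linear regression, the multiple R-squared equals the squared Pearson correlation between the two variables, and it is the same whichever variable is treated as the response. For the fit above, the multiple R-squared of 0.9205 therefore corresponds to a correlation of about 0.96. The sketch below checks this identity on made-up numbers (assumed values, not the murders data):

```r
# Made-up data for illustration only.
toy <- data.frame(
  total      = c(30, 90, 120, 200, 350, 800),
  population = c(9e5, 2.9e6, 4.6e6, 6.5e6, 9.5e6, 2.5e7)
)

# Pearson correlation between the two variables.
r <- cor(toy$total, toy$population)

# R-squared from regressing in each direction.
r2_pop_on_total <- summary(lm(population ~ total, data = toy))$r.squared
r2_total_on_pop <- summary(lm(total ~ population, data = toy))$r.squared

# Both R-squared values match the squared correlation.
all.equal(r^2, r2_pop_on_total)
all.equal(r^2, r2_total_on_pop)
```

The slopes of the two regressions differ, but the strength of the linear association they report is identical.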