R Coding Exercise

Load the dslabs package, installing it first if you haven’t already. Then inspect the gapminder dataset.
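The install step can be made conditional so that the package is only downloaded when it is missing. The sketch below is one way to do that; `ensure_package()` is a hypothetical helper name chosen for illustration, not a function from dslabs or base R.

```r
# Hypothetical helper (not part of dslabs or base R):
# install a package only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be loaded.
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers an install.
ensure_package("stats")
```

Calling `ensure_package("dslabs")` before `library(dslabs)` would follow the same pattern. In a non-interactive session, `install.packages()` may also need a CRAN mirror to be configured.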

#load dslabs package and tidyverse
library(dslabs)
library(tidyverse)
#look at help file for gapminder data
help(gapminder)
#get an overview of data structure
str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
#get a summary of data 
summary(gapminder)
                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586  
#determine the type of object gapminder is
class(gapminder)
[1] "data.frame"

Create a new object that contains only the African countries. Then, check the structure and summary of the new object.

#create the object with only African countries
african_countries <- gapminder[gapminder$continent == "Africa", ]
#check the structure and summary
str(african_countries)
'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(african_countries)
         country          year      infant_mortality life_expectancy
 Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
 Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
 Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
 Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
 Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
 Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
 (Other)     :2565                  NA's   :226                     
   fertility       population             gdp               continent   
 Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
 1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
 Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
 Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
 3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
 Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
 NA's   :51      NA's   :51          NA's   :637                        
                       region   
 Eastern Africa           :912  
 Western Africa           :912  
 Middle Africa            :456  
 Northern Africa          :342  
 Southern Africa          :285  
 Australia and New Zealand:  0  
 (Other)                  :  0  
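The summary above still lists the other continents and regions with counts of 0, because subsetting a data frame keeps every factor level even when no rows use it. A minimal sketch of how `droplevels()` removes those empty levels, using a small made-up data frame as a stand-in for gapminder:

```r
# Toy stand-in for gapminder (values are made up for illustration).
toy <- data.frame(
  country   = factor(c("Algeria", "Angola", "Brazil", "France")),
  continent = factor(c("Africa", "Africa", "Americas", "Europe"))
)

# Same subsetting pattern as above; unused factor levels are kept.
africa <- toy[toy$continent == "Africa", ]
nlevels(africa$continent)  # still counts all three continents

# droplevels() discards levels that no longer occur in the subset.
africa_clean <- droplevels(africa)
levels(africa_clean$continent)
```

Applied to the real data, `droplevels(african_countries)` should leave only the African levels in `continent` and `region`, which keeps summaries and plot legends free of empty categories.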

Now, using the new African countries object, create two new objects. One should only contain “infant_mortality” and “life_expectancy” and the other should only hold “population” and “life_expectancy”.

#create the object with only population and life expectancy data
african_countries_pop_life <- african_countries[, c("population", "life_expectancy")]
#create the object with only infant mortality and life expectancy data
african_countries_infant_life <- african_countries[, c("infant_mortality", "life_expectancy")]

Now that we’ve created two new objects that look at these specific variables, we can inspect them and get a better idea of the data.

#look at the structure and summary of the first object
str(african_countries_pop_life)
'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
summary(african_countries_pop_life)
   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51                         
#do the same for the second object
str(african_countries_infant_life)
'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
summary(african_countries_infant_life)
 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226                     
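The `NA's` rows in these summaries are worth checking directly, since missing values are silently dropped when plotting. As a rough sketch, the counts can be computed with base R; the data frame below is made up for illustration and only mimics the two-column objects created above.

```r
# Toy data frame with a few missing values (made-up numbers).
toy <- data.frame(
  infant_mortality = c(148.2, NA, 186.9, NA),
  life_expectancy  = c(47.5, 36.0, 38.3, 50.3)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# complete.cases() flags rows with no missing value in any column.
n_complete <- sum(complete.cases(toy))
n_complete
```

Running `colSums(is.na(african_countries_infant_life))` on the real object should reproduce the counts shown in the `NA's` row of its summary.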

Using the two new objects, we can now create plots to characterize how life expectancy relates to infant mortality and to population. We’ll create two plots: one of life expectancy vs. infant mortality and one of life expectancy vs. population size. The latter uses a log scale on the x-axis, because population spans several orders of magnitude.

#load ggplot2 for plotting (already attached by tidyverse, so this call is optional)
library(ggplot2)

# Plot 1: Life expectancy vs. Infant mortality. labs() sets the plot title.
ggplot(african_countries_infant_life, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point() +
  labs(title = "Life Expectancy vs. Infant Mortality")
Warning: Removed 226 rows containing missing values or values outside the scale range
(`geom_point()`).

# Plot 2: Life expectancy vs. Population size.
#scale_x_log10 puts the x axis (population) on a log scale.
ggplot(african_countries_pop_life, aes(x = population, y = life_expectancy)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "Life Expectancy vs. Population Size (log scale)")
Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_point()`).

In Plot 1, we can see a negative correlation: countries with higher infant mortality tend to have lower life expectancy. This makes sense, since more developed countries with better healthcare have higher life expectancies and lower infant mortality rates. In Plot 2, population size and life expectancy appear positively correlated. A plausible explanation is that longer lives and less frequent deaths contribute to population growth. The “streaks” in the data arise because each country appears once per year in the dataset, so each streak traces a single country over time.
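A visual impression of correlation can be backed up with a number. The sketch below uses `cor()` on a small made-up data frame (not the gapminder values) to show how missing values and the log scale can be handled; the column names simply mirror the ones used in this exercise.

```r
# Toy data (made-up values) with one missing population entry.
toy <- data.frame(
  population      = c(1e5, 1e6, NA, 1e7, 1e8),
  life_expectancy = c(50, 54, 60, 57, 62)
)

# With the default settings, a single NA makes the result NA.
cor(toy$population, toy$life_expectancy)

# use = "complete.obs" drops rows where either variable is NA.
# log10() mirrors the log-scaled x axis used in Plot 2.
r <- cor(log10(toy$population), toy$life_expectancy, use = "complete.obs")
r
```

The same call on `african_countries_pop_life` would give a single pooled correlation across all years, so it is subject to the same country-over-time streaks seen in the plot.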

Knowing this, we can narrow the analysis down to a single year and see which years are easiest to analyze given our dataset. We’ll start by finding which years have missing data for infant mortality.

#find which years have missing data for infant mortality.
#is.na() flags the rows whose infant_mortality value is NA
#select() keeps only the year column for those rows.
african_countries %>%
  filter(is.na(infant_mortality)) %>%
  select(year)
    year
1   1960
2   1960
3   1960
4   1960
5   1960
6   1960
7   1960
8   1960
9   1960
10  1960
11  1961
12  1961
13  1961
14  1961
15  1961
16  1961
17  1961
18  1961
19  1961
20  1961
21  1961
22  1961
23  1961
24  1961
25  1961
26  1961
27  1961
28  1962
29  1962
30  1962
31  1962
32  1962
33  1962
34  1962
35  1962
36  1962
37  1962
38  1962
39  1962
40  1962
41  1962
42  1962
43  1962
44  1963
45  1963
46  1963
47  1963
48  1963
49  1963
50  1963
51  1963
52  1963
53  1963
54  1963
55  1963
56  1963
57  1963
58  1963
59  1963
60  1964
61  1964
62  1964
63  1964
64  1964
65  1964
66  1964
67  1964
68  1964
69  1964
70  1964
71  1964
72  1964
73  1964
74  1964
75  1965
76  1965
77  1965
78  1965
79  1965
80  1965
81  1965
82  1965
83  1965
84  1965
85  1965
86  1965
87  1965
88  1965
89  1966
90  1966
91  1966
92  1966
93  1966
94  1966
95  1966
96  1966
97  1966
98  1966
99  1966
100 1966
101 1966
102 1967
103 1967
104 1967
105 1967
106 1967
107 1967
108 1967
109 1967
110 1967
111 1967
112 1967
113 1968
114 1968
115 1968
116 1968
117 1968
118 1968
119 1968
120 1968
121 1968
122 1968
123 1968
124 1969
125 1969
126 1969
127 1969
128 1969
129 1969
130 1969
131 1970
132 1970
133 1970
134 1970
135 1970
136 1971
137 1971
138 1971
139 1971
140 1971
141 1971
142 1972
143 1972
144 1972
145 1972
146 1972
147 1972
148 1973
149 1973
150 1973
151 1973
152 1973
153 1973
154 1974
155 1974
156 1974
157 1974
158 1974
159 1975
160 1975
161 1975
162 1975
163 1975
164 1976
165 1976
166 1976
167 1977
168 1977
169 1977
170 1978
171 1978
172 1979
173 1979
174 1980
175 1981
176 2016
177 2016
178 2016
179 2016
180 2016
181 2016
182 2016
183 2016
184 2016
185 2016
186 2016
187 2016
188 2016
189 2016
190 2016
191 2016
192 2016
193 2016
194 2016
195 2016
196 2016
197 2016
198 2016
199 2016
200 2016
201 2016
202 2016
203 2016
204 2016
205 2016
206 2016
207 2016
208 2016
209 2016
210 2016
211 2016
212 2016
213 2016
214 2016
215 2016
216 2016
217 2016
218 2016
219 2016
220 2016
221 2016
222 2016
223 2016
224 2016
225 2016
226 2016
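A listing with one row per missing value is long to read. As a sketch of a more compact alternative, the missing values can be counted per year with base R's `table()`; the data frame below is a made-up stand-in for `african_countries`.

```r
# Toy stand-in for african_countries (made-up values).
toy <- data.frame(
  year             = c(1960, 1960, 1961, 1961, 2000, 2016),
  infant_mortality = c(NA, NA, NA, 120.5, 80.1, NA)
)

# Keep the years of rows with missing infant mortality, then count per year.
missing_by_year <- table(toy$year[is.na(toy$infant_mortality)])
missing_by_year
```

In the dplyr pipeline used above, replacing `select(year)` with `count(year)` should give the same kind of per-year tally.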

Infant mortality data are missing for some countries up to 1981 and again in 2016, so we’ll select 2000, a year with no missing infant mortality values. We’ll now create a new object with only the observations from 2000.

#create an object with only data from 2000
african_countries_2000 <- african_countries[african_countries$year == 2000, ]

Now, we’ll make the same plots as before using only the data from 2000.

# Plot 3: Life expectancy vs. Infant mortality.
ggplot(african_countries_2000, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point() +
  labs(title = "Life Expectancy vs. Infant Mortality")

# Plot 4: Life expectancy vs. Population size. 
ggplot(african_countries_2000, aes(x = population, y = life_expectancy)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "Life Expectancy vs. Population Size (log scale)")

There still seems to be a negative correlation in Plot 3, but Plot 4 shows no noticeable correlation. We can now fit some linear models to the year 2000 data and draw more conclusions.

#Table 1: fit life expectancy as a function of infant mortality. 
#lm() creates a linear model for the specified variables from a given dataset.
fit1 <- lm(life_expectancy ~ infant_mortality, african_countries_2000)
#print results to screen
summary(fit1)

Call:
lm(formula = life_expectancy ~ infant_mortality, data = african_countries_2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08
#Table 2: fit life expectancy as a function of population size
fit2 <- lm(life_expectancy ~ population, african_countries_2000)
#print results to screen
summary(fit2)

Call:
lm(formula = life_expectancy ~ population, data = african_countries_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

Based on the fit results, infant mortality is a statistically significant predictor of life expectancy for African countries in the year 2000 (p = 2.83e-08). Population, on the other hand, is not a statistically significant predictor of life expectancy in 2000 (p = 0.616). These conclusions are consistent with our prior knowledge of demography.
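Rather than reading the slope and p-value off the printed summary, they can be pulled out of the fitted model programmatically. The following is a minimal sketch on made-up data; `toy` stands in for `african_countries_2000` and its numbers are not from gapminder.

```r
# Toy data (made-up values) standing in for african_countries_2000.
toy <- data.frame(
  infant_mortality = c(20, 40, 60, 80, 100, 120),
  life_expectancy  = c(70, 66, 61, 58, 52, 50)
)

fit <- lm(life_expectancy ~ infant_mortality, data = toy)

# coef(summary(fit)) is a matrix with one row per term and the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coefs   <- coef(summary(fit))
slope   <- coefs["infant_mortality", "Estimate"]
p_value <- coefs["infant_mortality", "Pr(>|t|)"]

# 95% confidence intervals for the intercept and slope.
confint(fit)
```

Applying `coef(summary(fit1))` and `confint(fit1)` to the real model gives the same information as the printed table in a form that can be reused in later code.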

This section contributed by Cory Cribb

Next, we load the dslabs dataset “murders”. It is a more morbid data set, but interesting to explore nonetheless.

library(dslabs)
help(murders)
str(murders)
'data.frame':   51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...
summary(murders)
    state               abb                      region     population      
 Length:51          Length:51          Northeast    : 9   Min.   :  563626  
 Class :character   Class :character   South        :17   1st Qu.: 1696962  
 Mode  :character   Mode  :character   North Central:12   Median : 4339367  
                                       West         :13   Mean   : 6075769  
                                                          3rd Qu.: 6636084  
                                                          Max.   :37253956  
     total       
 Min.   :   2.0  
 1st Qu.:  24.5  
 Median :  97.0  
 Mean   : 184.4  
 3rd Qu.: 268.0  
 Max.   :1257.0  

Since I am originally from the Southern region of the US, let’s explore murders in that region.

South_Murders <- murders[murders$region== "South", ]
str(South_Murders)
'data.frame':   17 obs. of  5 variables:
 $ state     : chr  "Alabama" "Arkansas" "Delaware" "District of Columbia" ...
 $ abb       : chr  "AL" "AR" "DE" "DC" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ population: num  4779736 2915918 897934 601723 19687653 ...
 $ total     : num  135 93 38 99 669 376 116 351 293 120 ...
summary(South_Murders)
    state               abb                      region     population      
 Length:17          Length:17          Northeast    : 0   Min.   :  601723  
 Class :character   Class :character   South        :17   1st Qu.: 2967297  
 Mode  :character   Mode  :character   North Central: 0   Median : 4625364  
                                       West         : 0   Mean   : 6804378  
                                                          3rd Qu.: 8001024  
                                                          Max.   :25145561  
     total      
 Min.   : 27.0  
 1st Qu.:111.0  
 Median :207.0  
 Mean   :246.8  
 3rd Qu.:293.0  
 Max.   :805.0  

According to this data set, 17 states (including the District of Columbia) are classified as being in the Southern region. Let’s explore whether population size has any relationship to gun murders.

Pop_and_murder <- South_Murders[, c("total", "population")]
str(Pop_and_murder)
'data.frame':   17 obs. of  2 variables:
 $ total     : num  135 93 38 99 669 376 116 351 293 120 ...
 $ population: num  4779736 2915918 897934 601723 19687653 ...
summary(Pop_and_murder)
     total         population      
 Min.   : 27.0   Min.   :  601723  
 1st Qu.:111.0   1st Qu.: 2967297  
 Median :207.0   Median : 4625364  
 Mean   :246.8   Mean   : 6804378  
 3rd Qu.:293.0   3rd Qu.: 8001024  
 Max.   :805.0   Max.   :25145561  

Create a scatter plot viewing total gun murders on the x-axis and state population on the y-axis to observe a trend. Add a best fit line to the plot to see if there is a trend.

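#plot total murders (x) against population (y); columns are referenced directly instead of using attach()
plot(South_Murders$total, South_Murders$population, main = "Total gun murders vs. population size", xlab = "Total gun murders", ylab = "Population")
#add the least-squares line from regressing population on total
abline(lm(population ~ total, data = South_Murders))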

A quick look at the plot suggests a positive correlation: states with larger populations have more gun murders. Let’s fit a linear model to see whether the relationship is statistically significant.

fit3 <- lm(population~total, South_Murders)
summary(fit3)

Call:
lm(formula = population ~ total, data = South_Murders)

Residuals:
     Min       1Q   Median       3Q      Max 
-5332407  -680032   482183  1257898  1945758 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -443125     717119  -0.618    0.546    
total          29370       2229  13.178 1.19e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1898000 on 15 degrees of freedom
Multiple R-squared:  0.9205,    Adjusted R-squared:  0.9152 
F-statistic: 173.7 on 1 and 15 DF,  p-value: 1.189e-09

From the simple linear regression, we see that the slope is statistically significant. The adjusted R-squared is 0.9152, which indicates a strong positive linear relationship between total gun murders and population size in the Southern region of the United States.
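In a regression with a single predictor, the multiple R-squared equals the squared Pearson correlation, so the reported value of 0.9205 corresponds to a correlation of roughly 0.96. The sketch below checks that identity on made-up numbers (not the murders data) and also shows one common way to account for state size, a rate per 100,000 residents; the `rate` column is an illustrative addition, not part of the original analysis.

```r
# Toy data (made-up values) standing in for South_Murders.
toy <- data.frame(
  total      = c(30, 90, 120, 200, 350, 800),
  population = c(9e5, 2.9e6, 4.5e6, 6.1e6, 9.5e6, 2.5e7)
)

fit <- lm(population ~ total, data = toy)
r   <- cor(toy$total, toy$population)

# For a regression with one predictor, R-squared is the squared correlation.
all.equal(summary(fit)$r.squared, r^2)

# Murders per 100,000 residents, so states of different sizes can be compared.
toy$rate <- toy$total / toy$population * 1e5
toy$rate
```

Because larger states will almost always have more murders in absolute terms, comparing rates is often more informative than comparing totals.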