Chapter 6 Working with Data
Importing/exporting data, cleaning data, reshaping data, wrangling data, and other data manipulation tasks are an important and often overlooked component of data science. The book Spector (2008), while a few years old, is still an excellent reference for data-related issues, and is available for free at http://catalog.lib.msu.edu/record=b7254984~S39a. In this chapter we will present a variety of tools for common data analysis tasks you will encounter in R.
6.1 Data import
Data come in a dizzying variety of formats. It might be a proprietary format used by commercial software such as Excel, SPSS, or SAS. It might be structured using a relational model, for example, the USDA Forest Service Forest Inventory and Analysis database (Burrill et al. 2018). It might be a data-interchange format such as JSON (JavaScript Object Notation; Pezoa et al. (2016)), or a markup language format such as XML (Extensible Markup Language; Bray et al. (2008)), perhaps with specialized standards for describing ecological information such as EML (Ecological Metadata Language; Jones et al. (2019)). The good news is we've yet to encounter a data format unreadable by R, thanks to the large community of R package developers writing tools for the data formats specific to their respective disciplines. For example, the foreign and haven packages provide functions for reading and writing data in common proprietary statistical software formats. Similarly, as we'll see in Chapter 8, the rgdal and raster packages read and write pretty much any spatial data format used in popular GIS software.
Fortunately many datasets are (or can be) saved as plain text files. Most software can read and write such files, so our initial focus is on reading and writing plain text files.
The function read.table and its offshoots, such as read.csv, are used to read two-dimensional data, i.e., rows and columns, from a plain text file and are the core functions for reading external datasets into R. For example, the file BrainAndBody.csv contains data on the brain weight, body weight, and name of some terrestrial animals. Here are the first few lines of that file:
body,brain,name
1.35,8.1,Mountain beaver
465,423,Cow
36.33,119.5,Grey wolf
27.66,115,Goat
1.04,5.5,Guinea pig
As is evident, the first line of the file contains the names of the three variables, separated (delimited) by commas. Each subsequent line (row) contains the body weight, brain weight, and name of a specific terrestrial animal.
This file is accessible at the URL https://www.finley-lab.com/files/data/BrainAndBody.csv. The read.table function is used to read these data into an R data frame. A data frame is a common rectangular (i.e., two-dimensional) data structure in R that consists of rows and columns. We will cover data frames extensively in Chapter 4.
link.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.csv"
BrainBody <- read.table(file = link.bb, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)
str(BrainBody)
## 'data.frame': 28 obs. of 3 variables:
## $ body : num 1.35 465 36.33 27.66 1.04 ...
## $ brain: num 8.1 423 119.5 115 5.5 ...
## $ name : chr "Mountain beaver" "Cow" "Grey wolf" "Goat" ...
head(BrainBody)
## body brain name
## 1 1.35 8.1 Mountain beaver
## 2 465.00 423.0 Cow
## 3 36.33 119.5 Grey wolf
## 4 27.66 115.0 Goat
## 5 1.04 5.5 Guinea pig
## 6 11700.00 50.0 Dipliodocus
The read.table arguments and associated values used above include:

- file = link.bb identifies the file location. In this case the string https://www.finley-lab.com/files/data/BrainAndBody.csv giving the location is rather long, so it was first assigned to the object link.bb.
- header = TRUE specifies that the column names are held in the file's first row, known as the "header".
- sep = "," defines the comma as the character delimiter between the file's columns.
- stringsAsFactors = FALSE tells R not to convert character vectors to factors. While we have yet to discuss the different types of vectors in R, prior to R 4.0.0 stringsAsFactors = TRUE was the default behavior, and so we will often explicitly include this argument in our call to read.table to avoid any differences across R versions. The new default value is FALSE, so we could have chosen not to include this argument in the code.
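These same arguments work on any text connection, so their effect can be checked without an external file. Below is a small self-contained sketch (the data lines are made up for illustration):

```r
# Build a tiny comma-delimited dataset in memory rather than in a file
txt <- "body,brain,name
1.35,8.1,Mountain beaver
465,423,Cow"
d <- read.table(textConnection(txt), header = TRUE, sep = ",",
                stringsAsFactors = FALSE)
nrow(d)   # 2: the header line is not counted as data
names(d)  # "body" "brain" "name"
```

Because stringsAsFactors = FALSE, the name column is read as a character vector rather than a factor.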
When we call read.table or read.csv, the resulting object is read into R and represented as a data frame. The str and head functions are two useful functions for looking at the structure of a data frame (or any R object). The str function displays the structure of the object BrainBody that contains the data from the BrainAndBody.csv file. Notice the object is a data frame with 28 observations (i.e., rows) and 3 variables (i.e., columns). The column names are displayed following the $. The head function displays the first six rows and all columns of the data frame. An alternative way to view a data frame is to use the View function, which opens up a spreadsheet interface in R. Try this out by running View(BrainBody) in your console.
The function read.csv is the same as read.table except the default separator is a comma, whereas the default separator for read.table is whitespace.
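This equivalence is easy to verify on a small in-memory example (a sketch with made-up data):

```r
txt <- "x,y
1,2
3,4"
d1 <- read.csv(textConnection(txt))
d2 <- read.table(textConnection(txt), header = TRUE, sep = ",")
identical(d1, d2)  # TRUE: the same data frame results either way
```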
The file BrainAndBody.tsv contains the same data, except a tab is used in place of a comma to separate fields. The only change needed to read in the data in this file is in the sep argument (and of course the file argument, since the data are stored in a different file):
file.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.tsv"
BrainBody2 <- read.table(file = file.bb, header = TRUE, sep = "\t",
                         stringsAsFactors = FALSE)
head(BrainBody2)
## body brain name
## 1 1.35 8.1 Mountain beaver
## 2 465.00 423.0 Cow
## 3 36.33 119.5 Grey wolf
## 4 27.66 115.0 Goat
## 5 1.04 5.5 Guinea pig
## 6 11700.00 50.0 Dipliodocus
File extensions, e.g., .csv or .tsv, are naming conventions only and are there to remind us how the columns are separated. In other words, they have no influence on R's file read functions.
A third file, BrainAndBody.txt, contains the same data, but also contains a few lines of explanatory text above the names of the variables. It also uses whitespace rather than a comma or a tab as a separator. Here are the first several lines of the file.
This file contains data
on brain and body
weights of several terrestrial animals
"body" "brain" "name"
1.35 8.1 "Mountain beaver"
465 423 "Cow"
36.33 119.5 "Grey wolf"
27.66 115 "Goat"
1.04 5.5 "Guinea pig"
11700 50 "Dipliodocus"
2547 4603 "Asian elephant"
Notice that in this file the values of name are put inside quotation marks. This is necessary since otherwise R would (reasonably) assume the first line of data contained the values of four variables, the values being 1.35, 8.1, Mountain, and beaver, while in reality there are only three values desired, with Mountain beaver being the third.
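We can confirm that the quotation marks do what we want with a small in-memory sketch (made-up lines in the same format as the file):

```r
txt <- 'body brain name
1.35 8.1 "Mountain beaver"'
d <- read.table(textConnection(txt), header = TRUE, stringsAsFactors = FALSE)
ncol(d)  # 3: the quoted two-word name is read as a single value
d$name   # "Mountain beaver"
```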
To read in this file we need to tell R to skip the lines of explanatory text at the top of the file and to use whitespace as the separator. The skip argument handles the first, and the sep argument the second. First let's see what happens if we don't use the skip argument.
file.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
BrainBody3 <- read.table(file.bb, header = TRUE, sep = " ",
                         stringsAsFactors = FALSE)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 5 elements
R assumed the first line of the file contained the variable names, since header = TRUE was specified, and counted four: This, file, contains, and data. So in the first line of actual data, R expected four columns containing data plus possibly a fifth column containing row names for the data set, and complained that "line 1 did not have 5 elements." The error message is somewhat mysterious, since it starts with "Error in scan." This happens because read.table actually uses a more basic R function called scan to do the work.
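Calling scan directly shows the lower-level behavior: by default it reads whitespace-separated numeric values into a single vector (a minimal sketch with made-up values):

```r
txt <- "1.35 8.1
465 423"
v <- scan(textConnection(txt))
v  # c(1.35, 8.1, 465, 423)
```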
Here’s how to read in the file correctly.
file.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
BrainBody3 <- read.table(file.bb, header = TRUE, sep = " ",
                         stringsAsFactors = FALSE, skip = 4)
head(BrainBody3)
## body brain name
## 1 1.35 8.1 Mountain beaver
## 2 465.00 423.0 Cow
## 3 36.33 119.5 Grey wolf
## 4 27.66 115.0 Goat
## 5 1.04 5.5 Guinea pig
## 6 11700.00 50.0 Dipliodocus
6.2 Importing data with missing observations
Missing data are represented in many ways. Sometimes a missing data point is just that, i.e., the place where it should be in the file is blank. Other times specific numbers such as \(-9999\) or specific symbols are used. The read.table() function has an argument na.strings that allows the user to specify how missing data are indicated in the source file.
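The argument accepts a vector of codes, so several missing-data conventions can be handled in a single read. A self-contained sketch with made-up data:

```r
txt <- "id,temp
1,14
2,-9999
3,?"
# Treat both -9999 and ? as missing-data codes
d <- read.csv(textConnection(txt), na.strings = c("-9999", "?"))
d$temp  # 14 NA NA
```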
The site http://www.wunderground.com/history/ makes weather data available for locations around the world for dates going back to 1945. The file WeatherKLAN2014.csv contains weather data for Lansing, Michigan for the year 2014. Here are the first few lines of that file:
EST,Max TemperatureF,Min TemperatureF, Events
1/1/14,14,9,Snow
1/2/14,13,-3,Snow
1/3/14,13,-11,Snow
1/4/14,31,13,Snow
1/5/14,29,16,Fog-Snow
1/6/14,16,-12,Fog-Snow
1/7/14,2,-13,Snow
1/8/14,17,-1,Snow
1/9/14,21,2,Snow
1/10/14,39,21,Fog-Rain-Snow
1/11/14,41,32,Fog-Rain
1/12/14,39,31,
Look at the last line, and notice that instead of an Events entry such as Snow or Fog-Snow there is nothing after the comma. This observation is missing, but rather than using an explicit code such as NA, the site just leaves that entry blank. To read these data into R we will supply the argument na.strings = "", which tells R the file indicates missing data by leaving the appropriate entry blank. We then print a subset of the resulting data frame to make sure the offending missing values were replaced with NA.
u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014.csv"
WeatherKLAN2014 <- read.csv(u.weather, header = TRUE,
                            stringsAsFactors = FALSE, na.strings = "")
head(WeatherKLAN2014, n = 15)
## EST Max.TemperatureF Min.TemperatureF
## 1 1/1/14 14 9
## 2 1/2/14 13 -3
## 3 1/3/14 13 -11
## 4 1/4/14 31 13
## 5 1/5/14 29 16
## 6 1/6/14 16 -12
## 7 1/7/14 2 -13
## 8 1/8/14 17 -1
## 9 1/9/14 21 2
## 10 1/10/14 39 21
## 11 1/11/14 41 32
## 12 1/12/14 39 31
## 13 1/13/14 44 34
## 14 1/14/14 37 26
## 15 1/15/14 27 18
## Events
## 1 Snow
## 2 Snow
## 3 Snow
## 4 Snow
## 5 Fog-Snow
## 6 Fog-Snow
## 7 Snow
## 8 Snow
## 9 Snow
## 10 Fog-Rain-Snow
## 11 Fog-Rain
## 12 <NA>
## 13 Rain
## 14 Rain-Snow
## 15 Snow
Notice the use of the argument n = 15 in the head function, which displays the first 15 rows of the WeatherKLAN2014 data frame.
6.3 Data export
Data stored in external text files are rarely in the format required for data analysis and summarization. We may need to create new variables in the data set, transform variables to different scales, rearrange the data, or manipulate the data to fix incorrect observations. The process of identifying and ultimately fixing incorrect observations in a data set is called data cleaning, while data wrangling is the process of obtaining and transforming data to answer a specific data analysis question. R has a vast suite of tools for data cleaning and data wrangling that we will cover extensively in Section 6.7. After cleaning and wrangling the data in R, we may want to create a new external text file that contains the data in the desired data analysis format for use in a separate project, submission to an agency, or archiving on a data repository. The R function write.table allows us to take a data set in R and save it as an external file.
To illustrate write.table, let's create a new CSV file in our current working directory named WeatherKLAN2014New.csv from the WeatherKLAN2014 object we previously created. We do this with the following code.
write.table(x = WeatherKLAN2014, file = 'WeatherKLAN2014New.csv', sep = ',',
            row.names = FALSE, na = 'NA')
Let's walk through each of the arguments and associated values used in the previous code chunk:

- x = WeatherKLAN2014 is the R object we want to write to an external file, in this case the WeatherKLAN2014 data set.
- file = 'WeatherKLAN2014New.csv' is the name of the file in which we want to store the data specified in argument x. The file will be created in our current working directory, since this is the directory R uses for relative paths.
- sep = ',' defines the comma as the character delimiter between the file's columns, just as we saw with read.table.
- row.names = FALSE tells R not to create a column in the external data set that contains the row names of the associated R object x. This argument is by default row.names = TRUE. We do not recommend storing data in the row names of R objects, and so we usually set row.names = FALSE.
- na = 'NA' tells R to use the value NA to represent missing data in the resulting text file.
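A quick way to convince ourselves these arguments behave as described is a round trip through a temporary file (a sketch; the data are made up):

```r
d <- data.frame(x = 1:3, y = c("a", NA, "c"), stringsAsFactors = FALSE)
f <- tempfile(fileext = ".csv")
write.table(d, file = f, sep = ",", row.names = FALSE, na = "NA")
d2 <- read.csv(f, stringsAsFactors = FALSE)
identical(d, d2)  # TRUE: the data survive the round trip, NA included
unlink(f)         # clean up the temporary file
```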
Similar changes can be made to write.table to create tab-separated files, plain text files, or other simple text files by changing the sep argument and the file extension in the file argument.
6.4 Summarizing Data Frames
Some common data tasks include variable summaries such as means or standard deviations, transforming an existing variable, and creating new variables. As with many tasks, there are several ways to accomplish each of these.
6.4.1 Column (and Row) Summaries
The file WeatherKLAN2014Full.csv contains a more complete set of weather data variables than WeatherKLAN2014.csv, from the same source, http://www.wunderground.com/history.
u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE,
                                stringsAsFactors = FALSE,
                                na.strings = "")
names(WeatherKLAN2014Full)
## [1] "EST"
## [2] "Max.TemperatureF"
## [3] "Mean.TemperatureF"
## [4] "Min.TemperatureF"
## [5] "Max.Dew.PointF"
## [6] "MeanDew.PointF"
## [7] "Min.DewpointF"
## [8] "Max.Humidity"
## [9] "Mean.Humidity"
## [10] "Min.Humidity"
## [11] "Max.Sea.Level.PressureIn"
## [12] "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn"
## [14] "Max.VisibilityMiles"
## [15] "Mean.VisibilityMiles"
## [16] "Min.VisibilityMiles"
## [17] "Max.Wind.SpeedMPH"
## [18] "Mean.Wind.SpeedMPH"
## [19] "Max.Gust.SpeedMPH"
## [20] "PrecipitationIn"
## [21] "CloudCover"
## [22] "Events"
## [23] "WindDirDegrees"
How can we compute the mean for each variable? One possibility is to do this a variable at a time:
mean(WeatherKLAN2014Full$Mean.TemperatureF)
## [1] 45.78
mean(WeatherKLAN2014Full$Min.TemperatureF)
## [1] 36.25
mean(WeatherKLAN2014Full$Max.TemperatureF)
## [1] 54.84
##Et Cetera
This is pretty inefficient. Fortunately there is a colMeans() function which computes the mean of each column (or a specified subset of columns) in a data frame. Some columns in the current data frame are not numeric, and obviously we don't want to ask R to compute means for these columns. We use str() to investigate.
str(WeatherKLAN2014Full)
## 'data.frame': 365 obs. of 23 variables:
## $ EST : chr "2014-1-1" "2014-1-2" "2014-1-3" "2014-1-4" ...
## $ Max.TemperatureF : int 14 13 13 31 29 16 2 17 21 39 ...
## $ Mean.TemperatureF : int 12 5 1 22 23 2 -5 8 12 30 ...
## $ Min.TemperatureF : int 9 -3 -11 13 16 -12 -13 -1 2 21 ...
## $ Max.Dew.PointF : int 9 7 2 27 27 11 -6 7 18 37 ...
## $ MeanDew.PointF : int 4 4 -5 18 21 -4 -13 1 8 28 ...
## $ Min.DewpointF : int 0 -8 -14 3 11 -18 -18 -6 0 19 ...
## $ Max.Humidity : int 88 76 83 92 92 80 78 88 88 100 ...
## $ Mean.Humidity : int 76 70 68 73 86 73 72 78 75 92 ...
## $ Min.Humidity : int 63 63 53 53 80 65 65 67 62 84 ...
## $ Max.Sea.Level.PressureIn : num 30.4 30.4 30.5 30.1 30 ...
## $ Mean.Sea.Level.PressureIn: num 30.3 30.2 30.4 30 29.9 ...
## $ Min.Sea.Level.PressureIn : num 30.2 30.1 30.1 29.9 29.7 ...
## $ Max.VisibilityMiles : int 10 9 10 10 4 10 10 10 10 9 ...
## $ Mean.VisibilityMiles : int 4 4 10 6 1 2 6 10 7 3 ...
## $ Min.VisibilityMiles : int 1 0 5 1 0 0 1 8 2 0 ...
## $ Max.Wind.SpeedMPH : int 17 22 23 28 22 31 25 15 14 17 ...
## $ Mean.Wind.SpeedMPH : int 9 13 10 15 11 18 15 7 6 10 ...
## $ Max.Gust.SpeedMPH : int 22 30 32 36 30 40 31 18 17 22 ...
## $ PrecipitationIn : chr "0.08" "0.01" "0.00" "0.12" ...
## $ CloudCover : int 8 7 1 5 8 8 6 5 7 8 ...
## $ Events : chr "Snow" "Snow" "Snow" "Snow" ...
## $ WindDirDegrees : int 43 24 205 203 9 262 220 236 147 160 ...
It isn't surprising that EST and Events are not numeric, but it is surprising that PrecipitationIn, which measures precipitation in inches, is character rather than numeric. Let's investigate further.
WeatherKLAN2014Full$PrecipitationIn[1:50]
## [1] "0.08" "0.01" "0.00" "0.12" "0.78" "0.07" "T"
## [8] "T" "0.01" "0.39" "0.16" "0.00" "0.00" "0.01"
## [15] "T" "0.08" "T" "T" "T" "0.01" "0.00"
## [22] "0.05" "T" "T" "0.07" "0.23" "0.04" "T"
## [29] "T" "0.03" "T" "0.37" "T" "0.00" "T"
## [36] "0.27" "0.01" "T" "0.04" "0.03" "T" "0.00"
## [43] "0.00" "T" "T" "0.00" "0.02" "0.15" "0.08"
## [50] "0.01"
Now it's more clear. The original data file included T in the precipitation column to represent a "trace" of precipitation, which is precipitation greater than \(0\) but less than \(0.01\) inches. One possibility would be to set all these values to "0", and then to convert the column to numeric. For now we will just leave the PrecipitationIn column out of the columns for which we request the mean.
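If we did later want the trace values as numbers, the substitution described above could look like this (a sketch on a handful of made-up values):

```r
precip <- c("0.08", "T", "0.00", "T", "0.12")
# Replace the "trace" code with "0", then convert the vector to numeric
precip.num <- as.numeric(ifelse(precip == "T", "0", precip))
precip.num  # 0.08 0.00 0.00 0.00 0.12
```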
colMeans(WeatherKLAN2014Full[,c(2:19, 21, 23)])
## Max.TemperatureF Mean.TemperatureF
## 54.838 45.781
## Min.TemperatureF Max.Dew.PointF
## 36.255 41.800
## MeanDew.PointF Min.DewpointF
## 36.395 30.156
## Max.Humidity Mean.Humidity
## 88.082 70.392
## Min.Humidity Max.Sea.Level.PressureIn
## 52.200 30.130
## Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
## 30.015 29.904
## Max.VisibilityMiles Mean.VisibilityMiles
## 9.896 8.249
## Min.VisibilityMiles Max.Wind.SpeedMPH
## 4.825 19.101
## Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
## 8.679 NA
## CloudCover WindDirDegrees
## 4.367 205.000
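Rather than listing the numeric columns by hand, they can be identified programmatically; a small sketch with made-up data:

```r
d <- data.frame(x = c(1, 2), y = c("a", "b"), z = c(3.5, 4.5))
num.cols <- sapply(d, is.numeric)  # TRUE for x and z, FALSE for y
colMeans(d[, num.cols])            # x = 1.5, z = 4.0
```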
6.4.2 The apply() Function
R also has functions rowMeans(), colSums(), and rowSums(). But what if we want to compute the median or standard deviation of columns of data, or some other summary statistic? For this the apply() function can be used. This function applies a user-chosen function to either the rows or columns (or both) of a data frame. The arguments are:

- X: the data frame of interest
- MARGIN: specifying either rows (MARGIN = 1) or columns (MARGIN = 2)
- FUN: the function to be applied.
apply(X = WeatherKLAN2014Full[,c(2:19, 21, 23)], MARGIN = 2, FUN = sd)
## Max.TemperatureF Mean.TemperatureF
## 22.2130 20.9729
## Min.TemperatureF Max.Dew.PointF
## 20.2597 19.5167
## MeanDew.PointF Min.DewpointF
## 20.0311 20.8511
## Max.Humidity Mean.Humidity
## 8.1910 9.3660
## Min.Humidity Max.Sea.Level.PressureIn
## 13.9462 0.2032
## Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
## 0.2159 0.2360
## Max.VisibilityMiles Mean.VisibilityMiles
## 0.5790 2.1059
## Min.VisibilityMiles Max.Wind.SpeedMPH
## 3.8168 6.4831
## Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
## 3.8863 NA
## CloudCover WindDirDegrees
## 2.7798 90.0673
As with any R function the arguments don’t need to be named as long as they are specified in the correct order, so
apply(WeatherKLAN2014Full[,c(2:19, 21, 23)], 2, sd)
has the same result.
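With MARGIN = 1 the function is applied across rows instead of columns; a small made-up example:

```r
d <- data.frame(a = c(1, 4, 7), b = c(2, 5, 8))
apply(d, 2, median)  # column medians: a = 4, b = 5
apply(d, 1, sum)     # row sums: 3 9 15
```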
6.4.3 Practice Problems
- Notice the output value of NA for the column Max.Gust.SpeedMPH. Why does this happen? Figure out a way to make the apply() function return a numeric value for this column.
- The apply() family of functions is extremely important in R, so it gets two Practice Problems :) Use the apply() function to compute the median values for all numeric columns in the iris data set.
6.4.4 Saving Typing Using with()
Consider calculating the mean of the maximum temperature values for those days where the cloud cover is less than 4 and when the maximum humidity is over 85. We can do this using subsetting.
mean(WeatherKLAN2014Full$Max.TemperatureF[
  WeatherKLAN2014Full$CloudCover < 4 &
  WeatherKLAN2014Full$Max.Humidity > 85])
## [1] 69.39
While this works, it requires a lot of typing, since each time we refer to a variable in the data set we need to preface its name with WeatherKLAN2014Full$. The with() function tells R that we are working with a particular data frame, so we don't need to keep typing the name of the data frame.
with(WeatherKLAN2014Full,
mean(Max.TemperatureF[CloudCover < 4 & Max.Humidity > 85]))
## [1] 69.39
6.5 Transforming a Data Frame
Variables are often added to, removed from, changed in, or rearranged within a data frame. The subsetting features of R make this reasonably easy. We will investigate this in the context of the gapminder data frame. If the gapminder library is not yet installed, use install.packages("gapminder") to install it locally.
library(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
6.5.1 Adding Variables
The data frame contains per capita GDP and population, and it might be interesting to create a variable that gives the total GDP by multiplying these two variables. (If we were interested in an accurate value for the total GDP we would probably be better off getting this information directly, since it is likely that the per capita GDP values in the data frame are rounded substantially.)
gapminder$TotalGDP <- gapminder$gdpPercap * gapminder$pop
str(gapminder)
## tibble [1,704 × 7] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
## $ TotalGDP : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
Analogous to the with() function, there is a function within() which can simplify the syntax. Whereas with() does not change the data frame, within() can. Note, below we first remove the altered gapminder data frame using rm() and then bring a clean copy back in by reloading the gapminder package.
rm(gapminder)
library(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
gapminder <- within(gapminder, TotalGDP <- gdpPercap * pop)
str(gapminder)
## tibble [1,704 × 7] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
## $ TotalGDP : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
A nice feature of within() is its ability to add more than one variable at a time to a data frame. In this case the two or more formulas creating new variables must be enclosed in braces.
gapminder <- within(gapminder, {TotalGDP <- gdpPercap * pop
                                lifeExpMonths <- lifeExp * 12})
str(gapminder)
## tibble [1,704 × 8] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap : num [1:1704] 779 821 853 836 740 ...
## $ TotalGDP : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
## $ lifeExpMonths: num [1:1704] 346 364 384 408 433 ...
6.5.2 Removing Variables
After reflection we may realize the new variables we added to the gapminder data frame are not useful, and should be removed.
str(gapminder)
## tibble [1,704 × 8] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap : num [1:1704] 779 821 853 836 740 ...
## $ TotalGDP : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
## $ lifeExpMonths: num [1:1704] 346 364 384 408 433 ...
gapminder <- gapminder[1:6]
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
The same result could be obtained via gapminder <- gapminder[, 1:6]. The first method uses the fact that a data frame is also a list, and uses list subsetting methods. It is slightly preferable, since even if only one variable is retained, the object will still be a data frame, while the other method can return a vector in this case. Note this difference in the resulting x variable below (again, this behavior can be frustrating at times if it is not anticipated).
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"),
                z = seq(from = 1, to = 2, length = 3))
a
## x y z
## 1 1 dog 1.0
## 2 2 cat 1.5
## 3 3 pig 2.0
a <- a[1]
a
## x
## 1 1
## 2 2
## 3 3
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"),
                z = seq(from = 1, to = 2, length = 3))
a
## x y z
## 1 1 dog 1.0
## 2 2 cat 1.5
## 3 3 pig 2.0
a <- a[,1]
a
## [1] 1 2 3
One can also use a negative sign in front of the variable number(s). For example, a[-(2:3)] would drop the last two columns of a. Some care is needed when removing variables using the negative sign.
An alternative approach is to set the variables you'd like to remove to NULL. For example, a[c("y", "z")] <- NULL and a[, 2:3] <- NULL produce the same result as above.
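To see the NULL approach in action, here is a short sketch on the same small data frame (recreated so the example is self-contained):

```r
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"),
                z = seq(from = 1, to = 2, length = 3))
a[c("y", "z")] <- NULL  # drop both columns in one step
names(a)                # "x"
class(a)                # still "data.frame"
```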
6.5.3 Practice Problem
What happens if you write a[-2:3] instead of a[-(2:3)]? Why are the parentheses important here?
6.5.4 Transforming Variables
Consider the gapminder data again. Possibly we don’t want to add a new variable that gives life expectancy in months, but rather want to modify the existing variable to measure life expectancy in months. Here are two ways to accomplish this.
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.80 30.33 32.00 34.02 36.09
gapminder$lifeExp <- gapminder$lifeExp * 12
gapminder$lifeExp[1:5]
## [1] 345.6 364.0 384.0 408.2 433.1
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.80 30.33 32.00 34.02 36.09
gapminder <- within(gapminder, lifeExp <- lifeExp * 12)
gapminder$lifeExp[1:5]
## [1] 345.6 364.0 384.0 408.2 433.1
6.5.5 Rearranging Variables
Consider the full weather data set again.
u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE,
                                stringsAsFactors = FALSE,
                                na.strings = "")
names(WeatherKLAN2014Full)
## [1] "EST"
## [2] "Max.TemperatureF"
## [3] "Mean.TemperatureF"
## [4] "Min.TemperatureF"
## [5] "Max.Dew.PointF"
## [6] "MeanDew.PointF"
## [7] "Min.DewpointF"
## [8] "Max.Humidity"
## [9] "Mean.Humidity"
## [10] "Min.Humidity"
## [11] "Max.Sea.Level.PressureIn"
## [12] "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn"
## [14] "Max.VisibilityMiles"
## [15] "Mean.VisibilityMiles"
## [16] "Min.VisibilityMiles"
## [17] "Max.Wind.SpeedMPH"
## [18] "Mean.Wind.SpeedMPH"
## [19] "Max.Gust.SpeedMPH"
## [20] "PrecipitationIn"
## [21] "CloudCover"
## [22] "Events"
## [23] "WindDirDegrees"
If we want the wind speed variables to come right after the date, we can again use subsetting.
WeatherKLAN2014Full <- WeatherKLAN2014Full[c(1, 17, 18, 19, 2:16, 20:23)]
names(WeatherKLAN2014Full)
## [1] "EST"
## [2] "Max.Wind.SpeedMPH"
## [3] "Mean.Wind.SpeedMPH"
## [4] "Max.Gust.SpeedMPH"
## [5] "Max.TemperatureF"
## [6] "Mean.TemperatureF"
## [7] "Min.TemperatureF"
## [8] "Max.Dew.PointF"
## [9] "MeanDew.PointF"
## [10] "Min.DewpointF"
## [11] "Max.Humidity"
## [12] "Mean.Humidity"
## [13] "Min.Humidity"
## [14] "Max.Sea.Level.PressureIn"
## [15] "Mean.Sea.Level.PressureIn"
## [16] "Min.Sea.Level.PressureIn"
## [17] "Max.VisibilityMiles"
## [18] "Mean.VisibilityMiles"
## [19] "Min.VisibilityMiles"
## [20] "PrecipitationIn"
## [21] "CloudCover"
## [22] "Events"
## [23] "WindDirDegrees"
6.6 Reshaping Data
A data set can be represented in several different formats. Consider a (fictitious) data set on incomes of three people during three different years. Here is one representation of the data:
yearlyIncomeWide
## name income1990 income2000 income2010
## 1 John Smith 29784 39210 41213
## 2 Jane Doe 56789 89321 109321
## 3 Albert Jones 2341 34567 56781
Here is another representation of the same data:
yearlyIncomeLong
## name year income
## 1 John Smith income1990 29784
## 2 Jane Doe income1990 56789
## 3 Albert Jones income1990 2341
## 4 John Smith income2000 39210
## 5 Jane Doe income2000 89321
## 6 Albert Jones income2000 34567
## 7 John Smith income2010 41213
## 8 Jane Doe income2010 109321
## 9 Albert Jones income2010 56781
For hopefully obvious reasons, the first representation is called a wide representation of the data, and the second is called a long representation. Each has its merits. The first representation is probably easier for people to read, while the second is often the form needed for analysis by statistical software such as R. There are of course other representations. For example the rows and columns could be interchanged to create a different wide representation, or the long representation, which currently groups data by year, could group by name instead.
Whatever the relative merits of wide and long representations of data, transforming data from wide to long or long to wide is often required. As with many tasks, there are several ways to accomplish this. We will focus on a library called tidyr, written by Hadley Wickham, that performs these transformations and more.
6.6.1 tidyr
The R library tidyr has functions for converting data between formats. To illustrate its use, we examine a simple data set that explores the relationship between religion and income in the United States. The data come from a Pew survey, and are used in the tidyr documentation to illustrate transforming data from wide to long format.
u.rel <- "https://www.finley-lab.com/files/data/religion2.csv"
religion <- read.csv(u.rel, header = TRUE, stringsAsFactors = FALSE)
head(religion)
## religion under10k btw10and20k btw20and30k
## 1 Agnostic 27 34 60
## 2 Atheist 12 27 37
## 3 Buddhist 27 21 30
## 4 Catholic 418 617 732
## 5 DoNotKnowOrRefused 15 14 15
## 6 EvangelicalProt 575 869 1064
## btw30and40k btw40and50k btw50and75k btw75and100k
## 1 81 76 137 122
## 2 52 35 70 73
## 3 34 33 58 62
## 4 670 638 1116 949
## 5 11 10 35 21
## 6 982 881 1486 949
## btw100and150k over150k DoNotKnowOrRefused
## 1 109 84 96
## 2 59 74 76
## 3 39 53 54
## 4 792 633 1489
## 5 17 18 116
## 6 723 414 1529
As given, the columns include religion and income level, and there are counts for each of the combinations of religion and income level. For example, there are 27 people who are Agnostic and whose income is less than 10 thousand dollars, and there are 617 people who are Catholic and whose income is between 10 and 20 thousand dollars.
The pivot_longer()
function can transform data from wide to long format.
library(tidyr)
religionLong <- pivot_longer(data = religion, cols = 2:11,
                             names_to = 'IncomeLevel', values_to = 'Frequency')
head(religionLong)
## # A tibble: 6 × 3
## religion IncomeLevel Frequency
## <chr> <chr> <int>
## 1 Agnostic under10k 27
## 2 Agnostic btw10and20k 34
## 3 Agnostic btw20and30k 60
## 4 Agnostic btw30and40k 81
## 5 Agnostic btw40and50k 76
## 6 Agnostic btw50and75k 137
tail(religionLong)
## # A tibble: 6 × 3
## religion IncomeLevel Frequency
## <chr> <chr> <int>
## 1 Unaffiliated btw40and50k 341
## 2 Unaffiliated btw50and75k 528
## 3 Unaffiliated btw75and100k 407
## 4 Unaffiliated btw100and150k 321
## 5 Unaffiliated over150k 258
## 6 Unaffiliated DoNotKnowOrRefused 597
To use pivot_longer()
we specified the data frame (data = religion
), the columns we want to pivot into longer format (cols = 2:11
), the name we want to give the column created from the income levels (names_to = 'IncomeLevel'
), and the name we want to give to the column containing the frequency values (values_to = 'Frequency'
).
Columns to be pivoted into longer format can be specified by name also, and we can also specify which columns should be omitted using a negative sign in front of the name(s). So the following creates an equivalent data frame:
religionLong <- pivot_longer(data = religion, cols = -religion,
                             names_to = 'IncomeLevel', values_to = 'Frequency')
head(religionLong)
## # A tibble: 6 × 3
## religion IncomeLevel Frequency
## <chr> <chr> <int>
## 1 Agnostic under10k 27
## 2 Agnostic btw10and20k 34
## 3 Agnostic btw20and30k 60
## 4 Agnostic btw30and40k 81
## 5 Agnostic btw40and50k 76
## 6 Agnostic btw50and75k 137
We now use the pivot_wider()
function to widen the religionLong
data set back into the original wide format.
religionWide <- pivot_wider(data = religionLong,
                            names_from = IncomeLevel,
                            values_from = Frequency)
head(religionWide)
## # A tibble: 6 × 11
## religion under10k btw10and20k btw20and30k btw30and40k
## <chr> <int> <int> <int> <int>
## 1 Agnostic 27 34 60 81
## 2 Atheist 12 27 37 52
## 3 Buddhist 27 21 30 34
## 4 Catholic 418 617 732 670
## 5 DoNotKn… 15 14 15 11
## 6 Evangel… 575 869 1064 982
## # … with 6 more variables: btw40and50k <int>,
## # btw50and75k <int>, btw75and100k <int>,
## # btw100and150k <int>, over150k <int>,
## # DoNotKnowOrRefused <int>
Here we specify the data frame (data = religionLong
), the column (names_from = IncomeLevel
) to get the name of the output column, and the column of values (values_from = Frequency
) to get the cell values from. As can be seen, this particular call to pivot_wider() yields the original data frame.
tidyr
provides two other useful functions to separate and unite variables based on some delimiter. Consider again the yearlyIncomeLong
table. Say we want to split the name
variable into first and last name. This can be done using the separate()
function.
firstLast <- separate(data = yearlyIncomeLong, col = name,
                      into = c("first", "last"), sep="\\s")
print(firstLast)
## first last year income
## 1 John Smith income1990 29784
## 2 Jane Doe income1990 56789
## 3 Albert Jones income1990 2341
## 4 John Smith income2000 39210
## 5 Jane Doe income2000 89321
## 6 Albert Jones income2000 34567
## 7 John Smith income2010 41213
## 8 Jane Doe income2010 109321
## 9 Albert Jones income2010 56781
Now say you’re not happy with that and want to combine the name columns again, but this time separate the first and last name with an underscore. This is done using the unite()
function.
unite(firstLast, col=name, first, last, sep="_")
## name year income
## 1 John_Smith income1990 29784
## 2 Jane_Doe income1990 56789
## 3 Albert_Jones income1990 2341
## 4 John_Smith income2000 39210
## 5 Jane_Doe income2000 89321
## 6 Albert_Jones income2000 34567
## 7 John_Smith income2010 41213
## 8 Jane_Doe income2010 109321
## 9 Albert_Jones income2010 56781
6.6.2 Practice Problem
Consider the data.birds
data frame produced below
birds <- c(10, 38, 29, 88, 42, 177, 200)
recordingInfo <- c("LA-2017-01-01", "DF-2011-03-02", "OG-2078-05-11", "YA-2000-11-18",
                   "LA-2019-03-17", "OG-2016-10-10", "YA-2001-03-22")
data.birds <- data.frame(birds, recordingInfo)
data.birds
## birds recordingInfo
## 1 10 LA-2017-01-01
## 2 38 DF-2011-03-02
## 3 29 OG-2078-05-11
## 4 88 YA-2000-11-18
## 5 42 LA-2019-03-17
## 6 177 OG-2016-10-10
## 7 200 YA-2001-03-22
The separate()
function from the tidyr
library is especially useful when working with real data as multiple pieces of information can be combined into one column in a data set.
The column recordingInfo contains the site where the data were collected, as well as the year, month, and day the data were recorded. The data are coded as follows: site-year-month-day
. Write a line of code that will extract the desired data from the data.birds
data frame into separate columns named site
, year
, month
and day
.
6.7 Manipulating Data with dplyr
Much of the effort (a figure of 80% is sometimes suggested) in data analysis is spent cleaning the data and getting it ready for analysis. Having effective tools for this task can save substantial time and effort. The R package dplyr
written by Hadley Wickham is designed, in Hadley’s words, to be “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Casting data analysis tasks in terms of “grammar” should be familiar from our work with the ggplot2
package, which was also authored by Hadley. Functions provided by dplyr
do in fact capture key data analysis actions (i.e., verbs). These functions include
mutate() adds new variables that are functions of existing variables
select() picks variables based on their names
filter() picks cases based on their values
summarize() reduces multiple values down to a single summary
arrange() changes the ordering of the rows.
These all combine naturally with a group_by()
function that allows you to perform any operation grouped by values of one or more variables. All the tasks done using dplyr
can be accomplished using tools already covered in this text; however, dplyr
’s functions provide a potentially more efficient and convenient framework to accomplish these tasks. RStudio provides a convenient data wrangling cheat sheet that covers many aspects of the tidyr
and dplyr
packages.
This somewhat long section on dplyr
adapts the nice introduction by Jenny Bryan, available at https://stat545.com/dplyr-intro.html.
6.7.1 Improved Data Frames
The dplyr
package provides two functions that offer improvements on data frames. The first, tibble(), creates a tibble from a series of vectors39. A tibble has two advantages over a data frame. When printed, it shows only the first ten rows and the columns that fit on the page, along with some additional information about the table’s dimensions, the data type of each variable, and any non-printed columns. In addition, recall that subsetting a data frame can sometimes return a vector rather than a data frame (if only one row or column results from the subset); a tibble does not have this behavior. The second function, as_tibble(), converts an existing data frame or list into a tibble. Here is an example using the religionWide
data frame.40
library(dplyr)
head(religionWide)
## # A tibble: 6 × 11
## religion under10k btw10and20k btw20and30k btw30and40k
## <chr> <int> <int> <int> <int>
## 1 Agnostic 27 34 60 81
## 2 Atheist 12 27 37 52
## 3 Buddhist 27 21 30 34
## 4 Catholic 418 617 732 670
## 5 DoNotKn… 15 14 15 11
## 6 Evangel… 575 869 1064 982
## # … with 6 more variables: btw40and50k <int>,
## # btw50and75k <int>, btw75and100k <int>,
## # btw100and150k <int>, over150k <int>,
## # DoNotKnowOrRefused <int>
religionWide[, 1]
## # A tibble: 18 × 1
## religion
## <chr>
## 1 Agnostic
## 2 Atheist
## 3 Buddhist
## 4 Catholic
## 5 DoNotKnowOrRefused
## 6 EvangelicalProt
## 7 Hindu
## 8 HistoricallyBlackProt
## 9 JehovahsWitness
## 10 Jewish
## 11 MainlineProt
## 12 Mormon
## 13 Muslim
## 14 Orthodox
## 15 OtherChristian
## 16 OtherFaiths
## 17 OtherWorldReligions
## 18 Unaffiliated
religionWideTbl <- as_tibble(religionWide)
head(religionWideTbl)
## # A tibble: 6 × 11
## religion under10k btw10and20k btw20and30k btw30and40k
## <chr> <int> <int> <int> <int>
## 1 Agnostic 27 34 60 81
## 2 Atheist 12 27 37 52
## 3 Buddhist 27 21 30 34
## 4 Catholic 418 617 732 670
## 5 DoNotKn… 15 14 15 11
## 6 Evangel… 575 869 1064 982
## # … with 6 more variables: btw40and50k <int>,
## # btw50and75k <int>, btw75and100k <int>,
## # btw100and150k <int>, over150k <int>,
## # DoNotKnowOrRefused <int>
religionWideTbl[, 1]
## # A tibble: 18 × 1
## religion
## <chr>
## 1 Agnostic
## 2 Atheist
## 3 Buddhist
## 4 Catholic
## 5 DoNotKnowOrRefused
## 6 EvangelicalProt
## 7 Hindu
## 8 HistoricallyBlackProt
## 9 JehovahsWitness
## 10 Jewish
## 11 MainlineProt
## 12 Mormon
## 13 Muslim
## 14 Orthodox
## 15 OtherChristian
## 16 OtherFaiths
## 17 OtherWorldReligions
## 18 Unaffiliated
As seen above, note that once the data frame is reduced to one dimension by subsetting to one column, it is no longer a data frame and has been simplified to a vector. This might not seem like a big deal; however, it can be very frustrating and potentially break your code when you expect an object to behave like a data frame and it doesn’t because it’s now a vector. Alternatively, once we convert religionWide
to a tibble
via the as_tibble()
function the object remains a data frame even when subsetting down to one dimension (there is no automatic simplification). Converting data frames using as_tibble()
is not required for using dplyr
but is convenient. Also, it is important to note that tibble
is simply a wrapper around a data frame that provides some additional behaviors. The newly formed tibble
object will still behave like a data frame (because it technically still is a data frame) but will have some added niceties (some of which are illustrated below).
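The simplification a tibble avoids can also be seen in base R directly. As a quick aside (a sketch on a made-up two-column data frame), the drop = FALSE argument forces a base data frame to stay two-dimensional, much as a tibble does by default:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Subsetting to one column simplifies to a vector by default
is.data.frame(df[, 1])                # FALSE: df[, 1] is an integer vector

# drop = FALSE keeps the data frame structure, as a tibble would
is.data.frame(df[, 1, drop = FALSE])  # TRUE
```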
6.7.2 Filtering Data by Row
Recall the gapminder
data. These data are available in tab-separated format in gapminder.tsv
, and can be read in using read.delim()
(or the related read functions described previously). The read.delim()
function defaults to header = TRUE
so this doesn’t need to be specified explicitly. In this section we will be working with the gapminder
data often, so we will use a short name for the data frame to save typing.
u.gm <- "https://www.finley-lab.com/files/data/gapminder.tsv"
gm <- read.delim(u.gm)
gm <- as_tibble(gm)
str(gm)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head(gm)
## # A tibble: 6 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8.43e6 Asia 28.8 779.
## 2 Afghanistan 1957 9.24e6 Asia 30.3 821.
## 3 Afghanistan 1962 1.03e7 Asia 32.0 853.
## 4 Afghanistan 1967 1.15e7 Asia 34.0 836.
## 5 Afghanistan 1972 1.31e7 Asia 36.1 740.
## 6 Afghanistan 1977 1.49e7 Asia 38.4 786.
Filtering helps us to examine subsets of the data such as data from a particular country, from several specified countries, from certain years, from countries with certain populations, etc. Some examples:
filter(gm, country == "Brazil")
## # A tibble: 12 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Brazil 1952 56602560 Americas 50.9 2109.
## 2 Brazil 1957 65551171 Americas 53.3 2487.
## 3 Brazil 1962 76039390 Americas 55.7 3337.
## 4 Brazil 1967 88049823 Americas 57.6 3430.
## 5 Brazil 1972 100840058 Americas 59.5 4986.
## 6 Brazil 1977 114313951 Americas 61.5 6660.
## 7 Brazil 1982 128962939 Americas 63.3 7031.
## 8 Brazil 1987 142938076 Americas 65.2 7807.
## 9 Brazil 1992 155975974 Americas 67.1 6950.
## 10 Brazil 1997 168546719 Americas 69.4 7958.
## 11 Brazil 2002 179914212 Americas 71.0 8131.
## 12 Brazil 2007 190010647 Americas 72.4 9066.
filter(gm, country %in% c("Brazil", "Mexico"))
## # A tibble: 24 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Brazil 1952 56602560 Americas 50.9 2109.
## 2 Brazil 1957 65551171 Americas 53.3 2487.
## 3 Brazil 1962 76039390 Americas 55.7 3337.
## 4 Brazil 1967 88049823 Americas 57.6 3430.
## 5 Brazil 1972 100840058 Americas 59.5 4986.
## 6 Brazil 1977 114313951 Americas 61.5 6660.
## 7 Brazil 1982 128962939 Americas 63.3 7031.
## 8 Brazil 1987 142938076 Americas 65.2 7807.
## 9 Brazil 1992 155975974 Americas 67.1 6950.
## 10 Brazil 1997 168546719 Americas 69.4 7958.
## # … with 14 more rows
filter(gm, country %in% c("Brazil", "Mexico") & year %in% c(1952, 1972))
## # A tibble: 4 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Brazil 1952 56602560 Americas 50.9 2109.
## 2 Brazil 1972 100840058 Americas 59.5 4986.
## 3 Mexico 1952 30144317 Americas 50.8 3478.
## 4 Mexico 1972 55984294 Americas 62.4 6809.
filter(gm, pop > 300000000)
## # A tibble: 25 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 China 1952 5.56e8 Asia 44 400.
## 2 China 1957 6.37e8 Asia 50.5 576.
## 3 China 1962 6.66e8 Asia 44.5 488.
## 4 China 1967 7.55e8 Asia 58.4 613.
## 5 China 1972 8.62e8 Asia 63.1 677.
## 6 China 1977 9.43e8 Asia 64.0 741.
## 7 China 1982 1.00e9 Asia 65.5 962.
## 8 China 1987 1.08e9 Asia 67.3 1379.
## 9 China 1992 1.16e9 Asia 68.7 1656.
## 10 China 1997 1.23e9 Asia 70.4 2289.
## # … with 15 more rows
filter(gm, pop > 300000000 & year == 2007)
## # A tibble: 3 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 China 2007 1.32e9 Asia 73.0 4959.
## 2 India 2007 1.11e9 Asia 64.7 2452.
## 3 United Stat… 2007 3.01e8 Americas 78.2 42952.
Notice the full results are not printed. For example, when we asked for the data for Brazil and Mexico, only the first ten rows were printed. This is an effect of using the as_tibble()
function. Of course if we wanted to analyze the results (as we will below) the full set of data would be available.
6.7.3 Selecting variables by column
Continuing with the gapminder
data, another common task is to restrict attention to some subset of variables in the data set. The select()
function does this.
select(gm, country, year, lifeExp)
## # A tibble: 1,704 × 3
## country year lifeExp
## <chr> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # … with 1,694 more rows
select(gm, 2:4)
## # A tibble: 1,704 × 3
## year pop continent
## <int> <dbl> <chr>
## 1 1952 8425333 Asia
## 2 1957 9240934 Asia
## 3 1962 10267083 Asia
## 4 1967 11537966 Asia
## 5 1972 13079460 Asia
## 6 1977 14880372 Asia
## 7 1982 12881816 Asia
## 8 1987 13867957 Asia
## 9 1992 16317921 Asia
## 10 1997 22227415 Asia
## # … with 1,694 more rows
select(gm, -c(2,3,4))
## # A tibble: 1,704 × 3
## country lifeExp gdpPercap
## <chr> <dbl> <dbl>
## 1 Afghanistan 28.8 779.
## 2 Afghanistan 30.3 821.
## 3 Afghanistan 32.0 853.
## 4 Afghanistan 34.0 836.
## 5 Afghanistan 36.1 740.
## 6 Afghanistan 38.4 786.
## 7 Afghanistan 39.9 978.
## 8 Afghanistan 40.8 852.
## 9 Afghanistan 41.7 649.
## 10 Afghanistan 41.8 635.
## # … with 1,694 more rows
select(gm, starts_with("c"))
## # A tibble: 1,704 × 2
## country continent
## <chr> <chr>
## 1 Afghanistan Asia
## 2 Afghanistan Asia
## 3 Afghanistan Asia
## 4 Afghanistan Asia
## 5 Afghanistan Asia
## 6 Afghanistan Asia
## 7 Afghanistan Asia
## 8 Afghanistan Asia
## 9 Afghanistan Asia
## 10 Afghanistan Asia
## # … with 1,694 more rows
Notice a few things. Variables can be selected by name or column number. As usual, a negative sign tells R to leave something out. And there are special functions such as starts_with
that provide ways to match part of a variable’s name.
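Base R offers a rough counterpart to these helper functions. As a sketch on a small made-up data frame, startsWith() can match column names much as starts_with() does:

```r
df <- data.frame(country = "Chile", continent = "Americas", year = 1952)

# Keep only the columns whose names start with "c"
df[, startsWith(names(df), "c"), drop = FALSE]
```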
6.7.4 Practice Problem
Use the contains()
function to select only the columns that contain a c
in the gapminder
data set.
6.7.5 Pipes
Consider selecting the country, year, and population for countries in Asia or Europe. One possibility is to nest a filter()
function inside a select()
function.
select(filter(gm, continent %in% c("Asia", "Europe")), country, year, pop)
## # A tibble: 756 × 3
## country year pop
## <chr> <int> <dbl>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # … with 746 more rows
Even a two-step process like this becomes hard to follow in this nested form, and often we will want to perform more than two operations. There is a nice feature in dplyr
that allows us to “feed” results of one function into the first argument of a subsequent function. Another way of saying this is that we are “piping” the results into another function. The %>%
operator does the piping. Here we again restrict attention to country, year, and population for countries in Asia or Europe41.
gm %>%
  filter(continent %in% c("Asia", "Europe")) %>%
  select(country, year, pop)
## # A tibble: 756 × 3
## country year pop
## <chr> <int> <dbl>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # … with 746 more rows
It can help to think of %>%
as representing the word “then”. The above can be read as, “Start with the data frame gm
then filter it to select data from the continents Asia and Europe then select the variables country, year, and population from these data”.
The pipe operator %>%
is not restricted to functions in dplyr
. In fact the pipe operator itself was introduced in another package called magrittr
, but is included in dplyr
as a convenience.
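Note that R itself (version 4.1 and later) also ships a native pipe operator, |>, which covers simple left-to-right chains. A quick sketch on the built-in mtcars data:

```r
# Native pipe: feed the left-hand side into the first argument
mtcars |> subset(cyl == 4) |> nrow()

# Equivalent nested call
nrow(subset(mtcars, cyl == 4))
```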
6.7.6 Arranging Data by Row
By default the gapminder
data are arranged by country and then by year.
head(gm, 15)
## # A tibble: 15 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8.43e6 Asia 28.8 779.
## 2 Afghanistan 1957 9.24e6 Asia 30.3 821.
## 3 Afghanistan 1962 1.03e7 Asia 32.0 853.
## 4 Afghanistan 1967 1.15e7 Asia 34.0 836.
## 5 Afghanistan 1972 1.31e7 Asia 36.1 740.
## 6 Afghanistan 1977 1.49e7 Asia 38.4 786.
## 7 Afghanistan 1982 1.29e7 Asia 39.9 978.
## 8 Afghanistan 1987 1.39e7 Asia 40.8 852.
## 9 Afghanistan 1992 1.63e7 Asia 41.7 649.
## 10 Afghanistan 1997 2.22e7 Asia 41.8 635.
## 11 Afghanistan 2002 2.53e7 Asia 42.1 727.
## 12 Afghanistan 2007 3.19e7 Asia 43.8 975.
## 13 Albania 1952 1.28e6 Europe 55.2 1601.
## 14 Albania 1957 1.48e6 Europe 59.3 1942.
## 15 Albania 1962 1.73e6 Europe 64.8 2313.
Possibly arranging the data by year and then country would be desired. The arrange()
function makes this easy. We will again use pipes.
gm %>%
  arrange(year, country)
## # A tibble: 1,704 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8.43e6 Asia 28.8 779.
## 2 Albania 1952 1.28e6 Europe 55.2 1601.
## 3 Algeria 1952 9.28e6 Africa 43.1 2449.
## 4 Angola 1952 4.23e6 Africa 30.0 3521.
## 5 Argentina 1952 1.79e7 Americas 62.5 5911.
## 6 Australia 1952 8.69e6 Oceania 69.1 10040.
## 7 Austria 1952 6.93e6 Europe 66.8 6137.
## 8 Bahrain 1952 1.20e5 Asia 50.9 9867.
## 9 Bangladesh 1952 4.69e7 Asia 37.5 684.
## 10 Belgium 1952 8.73e6 Europe 68 8343.
## # … with 1,694 more rows
How about the data for Rwanda, arranged in order of life expectancy.
gm %>%
  filter(country == "Rwanda") %>%
  arrange(lifeExp)
## # A tibble: 12 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Rwanda 1992 7290203 Africa 23.6 737.
## 2 Rwanda 1997 7212583 Africa 36.1 590.
## 3 Rwanda 1952 2534927 Africa 40 493.
## 4 Rwanda 1957 2822082 Africa 41.5 540.
## 5 Rwanda 1962 3051242 Africa 43 597.
## 6 Rwanda 2002 7852401 Africa 43.4 786.
## 7 Rwanda 1987 6349365 Africa 44.0 848.
## 8 Rwanda 1967 3451079 Africa 44.1 511.
## 9 Rwanda 1972 3992121 Africa 44.6 591.
## 10 Rwanda 1977 4657072 Africa 45 670.
## 11 Rwanda 1982 5507565 Africa 46.2 882.
## 12 Rwanda 2007 8860588 Africa 46.2 863.
Possibly we want these data to be in decreasing (descending) order. Here, desc()
is one of many dplyr
helper functions.
gm %>%
  filter(country == "Rwanda") %>%
  arrange(desc(lifeExp))
## # A tibble: 12 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Rwanda 2007 8860588 Africa 46.2 863.
## 2 Rwanda 1982 5507565 Africa 46.2 882.
## 3 Rwanda 1977 4657072 Africa 45 670.
## 4 Rwanda 1972 3992121 Africa 44.6 591.
## 5 Rwanda 1967 3451079 Africa 44.1 511.
## 6 Rwanda 1987 6349365 Africa 44.0 848.
## 7 Rwanda 2002 7852401 Africa 43.4 786.
## 8 Rwanda 1962 3051242 Africa 43 597.
## 9 Rwanda 1957 2822082 Africa 41.5 540.
## 10 Rwanda 1952 2534927 Africa 40 493.
## 11 Rwanda 1997 7212583 Africa 36.1 590.
## 12 Rwanda 1992 7290203 Africa 23.6 737.
Possibly we want to include only the year and life expectancy, to make the message more stark.
gm %>%
  filter(country == "Rwanda") %>%
  select(year, lifeExp) %>%
  arrange(desc(lifeExp))
## # A tibble: 12 × 2
## year lifeExp
## <int> <dbl>
## 1 2007 46.2
## 2 1982 46.2
## 3 1977 45
## 4 1972 44.6
## 5 1967 44.1
## 6 1987 44.0
## 7 2002 43.4
## 8 1962 43
## 9 1957 41.5
## 10 1952 40
## 11 1997 36.1
## 12 1992 23.6
For analyzing data in R, the order shouldn’t matter. But for presentation to human eyes, the order is important.
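For comparison, the base-R idiom behind arrange() is the order() function, which returns a permutation of the row indices; decreasing = TRUE plays the role of desc(). A sketch on a small made-up data frame:

```r
df <- data.frame(year = c(1982, 2007, 1977), lifeExp = c(46.2, 46.2, 45))

# Sort rows by decreasing life expectancy, like arrange(df, desc(lifeExp))
df[order(df$lifeExp, decreasing = TRUE), ]
```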
6.7.7 Practice Problem
It is worth your while to get comfortable with using pipes. Here is some hard-to-read code. Convert it into easier-to-read code by using pipes.
arrange(select(filter(gm, country == "Afghanistan"),
c("year", "lifeExp")), desc(lifeExp))
6.7.8 Renaming Variables
The dplyr
package has a rename
function that makes renaming variables in a data frame quite easy.
gm <- rename(gm, population = pop)
head(gm)
## # A tibble: 6 × 6
## country year population continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Afghani… 1952 8425333 Asia 28.8 779.
## 2 Afghani… 1957 9240934 Asia 30.3 821.
## 3 Afghani… 1962 10267083 Asia 32.0 853.
## 4 Afghani… 1967 11537966 Asia 34.0 836.
## 5 Afghani… 1972 13079460 Asia 36.1 740.
## 6 Afghani… 1977 14880372 Asia 38.4 786.
6.7.9 Data Summaries and Grouping
The summarize()
function computes summary statistics, using user-provided functions, for one or more columns of data in a data frame.
summarize(gm, meanpop = mean(population), medpop = median(population))
## # A tibble: 1 × 2
## meanpop medpop
## <dbl> <dbl>
## 1 29601212. 7023596.
## or
gm %>%
  summarize(meanpop = mean(population), medpop = median(population))
## # A tibble: 1 × 2
## meanpop medpop
## <dbl> <dbl>
## 1 29601212. 7023596.
Often we want summaries for specific components of the data. For example, we might want the median life expectancy for each continent separately. One option is subsetting:
median(gm$lifeExp[gm$continent == "Africa"])
## [1] 47.79
median(gm$lifeExp[gm$continent == "Asia"])
## [1] 61.79
median(gm$lifeExp[gm$continent == "Europe"])
## [1] 72.24
median(gm$lifeExp[gm$continent == "Americas"])
## [1] 67.05
median(gm$lifeExp[gm$continent == "Oceania"])
## [1] 73.66
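Base R's tapply() can condense such repeated calls into one. As a sketch using the built-in iris data (which requires no download), here is a per-group median in a single call:

```r
# Median sepal length for each species in one call
tapply(iris$Sepal.Length, iris$Species, median)
##     setosa versicolor  virginica
##        5.0        5.9        6.5
```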
The group_by()
function makes this easier, and makes the output more useful.
gm %>%
  group_by(continent) %>%
  summarize(medLifeExp = median(lifeExp))
## # A tibble: 5 × 2
## continent medLifeExp
## <chr> <dbl>
## 1 Africa 47.8
## 2 Americas 67.0
## 3 Asia 61.8
## 4 Europe 72.2
## 5 Oceania 73.7
Or if we want the results ordered by the median life expectancy:
gm %>%
  group_by(continent) %>%
  summarize(medLifeExp = median(lifeExp)) %>%
  arrange(medLifeExp)
## # A tibble: 5 × 2
## continent medLifeExp
## <chr> <dbl>
## 1 Africa 47.8
## 2 Asia 61.8
## 3 Americas 67.0
## 4 Europe 72.2
## 5 Oceania 73.7
As another example, we calculate the number of observations we have per continent (using the n()
helper function), and then, among continents, how many distinct countries are represented (using n_distinct()
).
gm %>%
  group_by(continent) %>%
  summarize(numObs = n())
## # A tibble: 5 × 2
## continent numObs
## <chr> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
gm %>%
  group_by(continent) %>%
  summarize(n_obs = n(), n_countries = n_distinct(country))
## # A tibble: 5 × 3
## continent n_obs n_countries
## <chr> <int> <int>
## 1 Africa 624 52
## 2 Americas 300 25
## 3 Asia 396 33
## 4 Europe 360 30
## 5 Oceania 24 2
Here is a bit more involved example that calculates the minimum and maximum life expectancies for countries in Africa by year.
gm %>%
  filter(continent == "Africa") %>%
  group_by(year) %>%
  summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
## # A tibble: 12 × 3
## year min_lifeExp max_lifeExp
## <int> <dbl> <dbl>
## 1 1952 30 52.7
## 2 1957 31.6 58.1
## 3 1962 32.8 60.2
## 4 1967 34.1 61.6
## 5 1972 35.4 64.3
## 6 1977 36.8 67.1
## 7 1982 38.4 69.9
## 8 1987 39.9 71.9
## 9 1992 23.6 73.6
## 10 1997 36.1 74.8
## 11 2002 39.2 75.7
## 12 2007 39.6 76.4
This is interesting, but the results don’t include the countries that achieved the minimum and maximum life expectancies. Here is one way to achieve that. We will start with the minimum life expectancy. Note the rank of the minimum value will be 1.
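To see why filtering on the rank works, recall how rank() behaves on its own (with desc(), the ranking is simply reversed):

```r
x <- c(30.5, 10.2, 20.9)
rank(x)       # smallest value gets rank 1
## [1] 3 1 2
rank(x) == 1  # TRUE exactly at the minimum
## [1] FALSE  TRUE FALSE
```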
gm %>%
  select(country, continent, year, lifeExp) %>%
  group_by(year) %>%
  arrange(year) %>%
  filter(rank(lifeExp) == 1)
## # A tibble: 12 × 4
## # Groups: year [12]
## country continent year lifeExp
## <chr> <chr> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Afghanistan Asia 1957 30.3
## 3 Afghanistan Asia 1962 32.0
## 4 Afghanistan Asia 1967 34.0
## 5 Sierra Leone Africa 1972 35.4
## 6 Cambodia Asia 1977 31.2
## 7 Sierra Leone Africa 1982 38.4
## 8 Angola Africa 1987 39.9
## 9 Rwanda Africa 1992 23.6
## 10 Rwanda Africa 1997 36.1
## 11 Zambia Africa 2002 39.2
## 12 Swaziland Africa 2007 39.6
Next we add the maximum life expectancy. Here we need to better understand the desc()
function, which transforms a vector into a numeric vector that sorts in descending order. Here are some examples.
desc(1:5)
## [1] -1 -2 -3 -4 -5
desc(c(2,3,1,5,6,-4))
## [1] -2 -3 -1 -5 -6 4
desc(c("a", "c", "b", "w", "e"))
## [1] -1 -3 -2 -5 -4
We now use this to extract the maximum life expectancy. Recall that |
represents “or”. Also by default only the first few rows of a tibble
object will be printed. To see all the rows we pipe the output to print(n = 24)
to ask for all 24 rows to be printed.
gm %>%
  select(country, continent, year, lifeExp) %>%
  group_by(year) %>%
  arrange(year) %>%
  filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>%
  print(n=24)
## # A tibble: 24 × 4
## # Groups: year [12]
## country continent year lifeExp
## <chr> <chr> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Norway Europe 1952 72.7
## 3 Afghanistan Asia 1957 30.3
## 4 Iceland Europe 1957 73.5
## 5 Afghanistan Asia 1962 32.0
## 6 Iceland Europe 1962 73.7
## 7 Afghanistan Asia 1967 34.0
## 8 Sweden Europe 1967 74.2
## 9 Sierra Leone Africa 1972 35.4
## 10 Sweden Europe 1972 74.7
## 11 Cambodia Asia 1977 31.2
## 12 Iceland Europe 1977 76.1
## 13 Japan Asia 1982 77.1
## 14 Sierra Leone Africa 1982 38.4
## 15 Angola Africa 1987 39.9
## 16 Japan Asia 1987 78.7
## 17 Japan Asia 1992 79.4
## 18 Rwanda Africa 1992 23.6
## 19 Japan Asia 1997 80.7
## 20 Rwanda Africa 1997 36.1
## 21 Japan Asia 2002 82
## 22 Zambia Africa 2002 39.2
## 23 Japan Asia 2007 82.6
## 24 Swaziland Africa 2007 39.6
6.7.10 Creating New Variables
The $
notation provides a simple way to create new variables in a data frame. The mutate()
function provides another, sometimes cleaner way to do this. We will use mutate()
along with the lag()
function to investigate changes in life expectancy over five years for the gapminder
data. We’ll do this in a few steps. First, we create a variable that measures the change in life expectancy and remove the population and GDP variables that are not of interest. We have to be careful to first group by country, since we want to calculate the change in life expectancy by country.
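In base-R terms, lag() shifts a vector down one position and pads the front with NA, so each difference lines up with the later year. A sketch with made-up numbers:

```r
lifeExp <- c(28.8, 30.3, 32.0)

# Same idea as dplyr::lag(lifeExp): previous element, NA in front
lagged <- c(NA, lifeExp[-length(lifeExp)])
lifeExp - lagged  # NA, then the change since the previous measurement
```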
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap))
## # A tibble: 1,704 × 5
## # Groups: country [142]
## country year continent lifeExp changeLifeExp
## <chr> <int> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 Asia 28.8 NA
## 2 Afghanistan 1957 Asia 30.3 1.53
## 3 Afghanistan 1962 Asia 32.0 1.66
## 4 Afghanistan 1967 Asia 34.0 2.02
## 5 Afghanistan 1972 Asia 36.1 2.07
## 6 Afghanistan 1977 Asia 38.4 2.35
## 7 Afghanistan 1982 Asia 39.9 1.42
## 8 Afghanistan 1987 Asia 40.8 0.968
## 9 Afghanistan 1992 Asia 41.7 0.852
## 10 Afghanistan 1997 Asia 41.8 0.0890
## # … with 1,694 more rows
Next, summarize by computing the largest drop in life expectancy.
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap)) %>%
  summarize(largestDropLifeExp = min(changeLifeExp))
## # A tibble: 142 × 2
## country largestDropLifeExp
## <chr> <dbl>
## 1 Afghanistan NA
## 2 Albania NA
## 3 Algeria NA
## 4 Angola NA
## 5 Argentina NA
## 6 Australia NA
## 7 Austria NA
## 8 Bahrain NA
## 9 Bangladesh NA
## 10 Belgium NA
## # … with 132 more rows
Oops. We forgot that since we don’t have data from before 1952, the first change for each country will be NA. Let’s try again.
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap)) %>%
  summarize(largestDropLifeExp = min(changeLifeExp, na.rm = TRUE))
## # A tibble: 142 × 2
## country largestDropLifeExp
## <chr> <dbl>
## 1 Afghanistan 0.0890
## 2 Albania -0.419
## 3 Algeria 1.31
## 4 Angola -0.0360
## 5 Argentina 0.492
## 6 Australia 0.170
## 7 Austria 0.490
## 8 Bahrain 0.840
## 9 Bangladesh 1.67
## 10 Belgium 0.5
## # … with 132 more rows
That’s not quite what we wanted. We could arrange the results by the life expectancy drop, but it would be good to have both the continent and year printed out also. So we’ll take a slightly different approach, by arranging the results in increasing order.
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap)) %>%
  arrange(changeLifeExp)
## # A tibble: 1,704 × 5
## # Groups: country [142]
## country year continent lifeExp changeLifeExp
## <chr> <int> <chr> <dbl> <dbl>
## 1 Rwanda 1992 Africa 23.6 -20.4
## 2 Zimbabwe 1997 Africa 46.8 -13.6
## 3 Lesotho 2002 Africa 44.6 -11.0
## 4 Swaziland 2002 Africa 43.9 -10.4
## 5 Botswana 1997 Africa 52.6 -10.2
## 6 Cambodia 1977 Asia 31.2 -9.10
## 7 Namibia 2002 Africa 51.5 -7.43
## 8 South Africa 2002 Africa 53.4 -6.87
## 9 Zimbabwe 2002 Africa 40.0 -6.82
## 10 China 1962 Asia 44.5 -6.05
## # … with 1,694 more rows
That's still not quite right. The result is still grouped by country (note the Groups: line in the output), and we are done with the grouped computation, so we remove the grouping with ungroup() before arranging. We also print the 20 largest drops.
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap)) %>%
  ungroup() %>%
  arrange(changeLifeExp) %>%
  print(n = 20)
## # A tibble: 1,704 × 5
## country year continent lifeExp changeLifeExp
## <chr> <int> <chr> <dbl> <dbl>
## 1 Rwanda 1992 Africa 23.6 -20.4
## 2 Zimbabwe 1997 Africa 46.8 -13.6
## 3 Lesotho 2002 Africa 44.6 -11.0
## 4 Swaziland 2002 Africa 43.9 -10.4
## 5 Botswana 1997 Africa 52.6 -10.2
## 6 Cambodia 1977 Asia 31.2 -9.10
## 7 Namibia 2002 Africa 51.5 -7.43
## 8 South Africa 2002 Africa 53.4 -6.87
## 9 Zimbabwe 2002 Africa 40.0 -6.82
## 10 China 1962 Asia 44.5 -6.05
## 11 Botswana 2002 Africa 46.6 -5.92
## 12 Zambia 1997 Africa 40.2 -5.86
## 13 Iraq 1992 Asia 59.5 -5.58
## 14 Liberia 1992 Africa 40.8 -5.23
## 15 Cambodia 1972 Asia 40.3 -5.10
## 16 Kenya 1997 Africa 54.4 -4.88
## 17 Somalia 1992 Africa 39.7 -4.84
## 18 Zambia 1992 Africa 46.1 -4.72
## 19 Swaziland 2007 Africa 39.6 -4.26
## 20 Uganda 1997 Africa 44.6 -4.25
## # … with 1,684 more rows
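As an aside, if your installed dplyr is version 1.0.0 or later, the slice_min() function offers a more compact way to pull out the rows with the smallest changeLifeExp values. A sketch, assuming the gm tibble from above:

```r
# Sketch assuming dplyr >= 1.0.0 (which provides slice_min())
# and the gm tibble defined earlier in the chapter.
gm %>%
  group_by(country) %>%
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
  select(-c(population, gdpPercap)) %>%
  ungroup() %>%
  slice_min(changeLifeExp, n = 20)  # the 20 rows with the largest drops
```

Rows where changeLifeExp is NA sort last, so they never appear among the 20 smallest values.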
6.7.11 Practice Problem
As you progress in your data science career, we are sure you will find dplyr one of the most useful packages for your initial data exploration. Here is one more practice problem to get more comfortable with the syntax.
Recall the iris data set we have worked with multiple times. We want to look at the ratio of Sepal.Length to Petal.Length in the three different species. Write a series of dplyr statements that groups the data by species, creates a new column called s.p.ratio that is Sepal.Length divided by Petal.Length, then computes the mean of this column for each species in a column called mean.ratio. Display the data in descending order of mean.ratio.
6.8 Exercises
Exercise 7 Learning objectives: introduce the with(), tapply(), and cut() functions; summarize data using the table() function with logical subsetting; practice using factor data types.
Exercise 8 Learning objectives: work with messy data; import data from an external spreadsheet; practice using functions in tidyr and graphing functions.
Exercise 9 Learning objectives: work with several key dplyr functions; manipulate data frames (actually tibbles); summarize and visualize data from large data files.
References
Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau. 2008. “Extensible Markup Language (XML) 1.0 (Fifth Edition).”
Burrill, Elizabeth A., Andrea M. Wilson, Jeffery A. Turner, Scott A. Pugh, James Menlove, Glenn Christiansen, Barbara L. Conkling, and Winnie David. 2018. “The Forest Inventory and Analysis Database: Database Description and User Guide Version 8.0 for Phase 2.” U.S. Department of Agriculture, Forest Service. 946 p. http://www.fia.fs.fed.us/library/database-documentation/.
Jones, Matthew, Margaret O’Brien, Bryce Mecum, Carl Boettiger, Mark Schildhauer, Mitchell Maier, Timothy Whiteaker, Stevan Earl, and Steven Chong. 2019. “Ecological Metadata Language Version 2.2.0.” https://doi.org/10.5063/f11834t2.
Pezoa, Felipe, Juan L. Reutter, Fernando Suarez, Martín Ugarte, and Domagoj Vrgoč. 2016. “Foundations of JSON Schema.” In Proceedings of the 25th International Conference on World Wide Web, 263–73. International World Wide Web Conferences Steering Committee.
Spector, Phil. 2008. Data Manipulation with R. Use R! New York: Springer.
These data come from the MASS R library.↩︎
Recall NA stands for “Not Available”.↩︎
Look at the help page for head. The default value is n = 6.↩︎
Reminds me of The Trouble with Tribbles.↩︎
The text printed immediately after library(dplyr) means the stats and base packages, which are automatically loaded when you start R, have functions with the same names as functions in dplyr. So, for example, if you call the filter() or lag() functions, R will use dplyr’s functions. Use the :: operator to explicitly identify which package’s function you want, e.g., if you want stats’s lag() then call stats::lag().↩︎
Notice the indentation used in the code. This is not necessary, as the code could be all on one line, but I often find it easier to read in this more organized format.↩︎