Chapter 4 Data Structures
A data structure is a format for organizing and storing data. The structure is designed so that data can be accessed and worked with in specific ways. Statistical software and programming languages have methods (or functions) designed to operate on different kinds of data structures.
This chapter’s focus is on data structures. To help initial understanding, the data in this chapter will be relatively modest in size and complexity. The ideas and methods, however, generalize to larger and more complex data sets.
The base data structures in R are vectors, matrices, arrays, data frames, and lists. The first three, vectors, matrices, and arrays, require all elements to be of the same type or homogeneous, e.g., all numeric or all character. Data frames and lists allow elements to be of different types or heterogeneous, e.g., some elements of a data frame may be numeric while other elements may be character. These base structures can also be organized by their dimensionality, i.e., 1-dimensional, 2-dimensional, or N-dimensional, as shown in Table 4.1.
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1 | Atomic vector | List |
2 | Matrix | Data frame |
N | Array |
R has no scalar (0-dimensional) types. Individual numbers or strings are actually vectors of length one.
An efficient way to understand what comprises a given object is to use the str() function. str() is short for structure and prints a compact, human-readable description of any R data structure. For example, in the code below, we prove to ourselves that what we might think of as a scalar value is actually a vector of length one.
a <- 1
str(a)
## num 1
is.vector(a)
## [1] TRUE
length(a)
## [1] 1
Here we assigned a the scalar value one. The call str(a) prints num 1, which says a is numeric of length one. Then, just to be sure, we used the function is.vector() to test if a is in fact a vector. Finally, just for fun, we asked for the length of a, which again returns one. There is a set of similar logical tests for the other base data structures, e.g., is.matrix(), is.array(), is.data.frame(), and is.list(). These will all come in handy as we encounter different R objects.
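As a quick illustration of these tests, the sketch below builds one small object of each kind and checks it. The object names (m, arr, df, lst) are made up for this example and are not used elsewhere in the chapter.

```r
# Hypothetical objects, one per base data structure
m   <- matrix(1:6, nrow = 2)            # 2 x 3 matrix
arr <- array(1:24, dim = c(2, 3, 4))    # 3-dimensional array
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
lst <- list(1, "a", TRUE)

is.matrix(m)        # TRUE
is.array(arr)       # TRUE
is.array(m)         # TRUE: a matrix is a 2-dimensional array
is.data.frame(df)   # TRUE
is.list(lst)        # TRUE
is.list(df)         # TRUE: a data frame is stored as a list of columns
is.vector(m)        # FALSE: the dim attribute means m is not a plain vector
```

The last three lines hint that these structures are related: a matrix is a special case of an array, and a data frame is built on top of a list.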
4.1 Vectors
Think of a vector as a structure to represent one variable in a data set. For example, a vector might hold the weights, in pounds, of 7 people in a data set. Or another vector might hold the genders of those 7 people. The c() function in R is useful for creating (small) vectors and for modifying existing vectors. Think of c as standing for “combine”.
weight <- c(123, 157, 205, 199, 223, 140, 105)
weight
## [1] 123 157 205 199 223 140 105
gender <- c("female", "female", "male", "female", "male",
            "male", "female")
gender
## [1] "female" "female" "male" "female" "male"
## [6] "male" "female"
Notice that elements of a vector are separated by commas when using the c() function to create a vector. Also notice that character values are placed inside quotation marks.
The c() function also can be used to add to an existing vector. For example, if an eighth person, a male weighing 194 pounds, was included in the data set, the existing vectors could be modified as follows.
weight <- c(weight, 194)
gender <- c(gender, "male")
weight
## [1] 123 157 205 199 223 140 105 194
gender
## [1] "female" "female" "male" "female" "male"
## [6] "male" "female" "male"
4.1.1 Types, Conversion, Coercion
Clearly it is important to distinguish between different types of vectors. For example, it makes sense to ask R to calculate the mean of the weights stored in weight, but does not make sense to ask R to compute the mean of the genders stored in gender. Vectors in R may have one of six different “types”: character, double, integer, logical, complex, and raw. We will not encounter the complex and raw types in everyday data analysis, and so we focus on the first four data types.
character: consists of letters or words. Our vector gender is a character vector because it consists of the genders for each person in our dataset.
typeof(gender)
## [1] "character"
double: a numeric object that can be an integer or non-integer value (e.g., 10, 4.2). Our vector weight is a double vector.
typeof(weight)
## [1] "double"
integer: a numeric object that can only be an integer. It may be surprising to see that the variable weight is of type double, even though its values are all integers. By default, R creates a double type vector when numeric values are given via the c function. We can create an integer vector of weights by placing the letter L immediately after each of the numbers when we place it in the vector:
weight.int <- c(123L, 157L, 205L, 199L, 223L, 140L, 105L, 194L)
typeof(weight.int)
## [1] "integer"
logical: used to represent variables that can take values TRUE or FALSE. To illustrate logical vectors, imagine that each of the eight people in the data set was asked whether they were taking blood pressure medication, and the responses were coded as TRUE if the person answered yes, and FALSE if the person answered no.
bp <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
bp
## [1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
typeof(bp)
## [1] "logical"
When it makes sense, it is possible to convert vectors to a different type. Consider the following examples.
weight.int <- as.integer(weight)
weight.int
## [1] 123 157 205 199 223 140 105 194
typeof(weight.int)
## [1] "integer"
weight.char <- as.character(weight)
weight.char
## [1] "123" "157" "205" "199" "223" "140" "105" "194"
bp.double <- as.double(bp)
bp.double
## [1] 1 1 0 1 0 0 1 1
gender.oops <- as.double(gender)
## Warning: NAs introduced by coercion
gender.oops
## [1] NA NA NA NA NA NA NA NA
sum(bp)
## [1] 5
The integer version of weight doesn’t look any different, but it is stored differently, which can be important both for computational efficiency and for interfacing with other languages such as C++. As noted above, however, we will not worry about the distinction between integer and double types. Converting weight to character goes as expected: the character representations of the numbers replace the numbers themselves. Converting the logical vector bp to double is pretty straightforward too: FALSE is converted to zero, and TRUE is converted to one. Now think about converting the character vector gender to a numeric double vector. It’s not at all clear how to represent “female” and “male” as numbers. In this case R issues a warning and creates a double vector with each element set to NA, which is the representation of missing data. Finally consider the code sum(bp). Here bp is a logical vector, but when R sees that we are asking to sum this logical vector, it automatically converts it to a numeric vector and then adds the zeros and ones representing FALSE and TRUE.
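A few more conversions may help make these rules concrete. The sketch below uses a made-up logical vector, answers, which is an assumption for illustration and not part of the health data.

```r
# Hypothetical logical vector
answers <- c(TRUE, FALSE, TRUE, TRUE)

sum(answers)     # 3: the number of TRUE values
mean(answers)    # 0.75: the proportion of TRUE values

as.integer(c(1.9, -1.9))   # 1 -1: conversion to integer drops the fractional part
as.logical(c(0, 1, 5))     # FALSE TRUE TRUE: zero is FALSE, any other number is TRUE
```

Because TRUE counts as one and FALSE as zero, sum() and mean() of a logical vector give a count and a proportion, respectively.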
R also has functions to test whether a vector is of a particular type.
is.double(weight)
## [1] TRUE
is.character(weight)
## [1] FALSE
is.integer(weight.int)
## [1] TRUE
is.logical(bp)
## [1] TRUE
4.1.1.1 Coercion
Consider the following examples.
xx <- c(1, 2, 3, TRUE)
xx
## [1] 1 2 3 1
yy <- c(1, 2, 3, "dog")
yy
## [1] "1" "2" "3" "dog"
zz <- c(TRUE, FALSE, "cat")
zz
## [1] "TRUE" "FALSE" "cat"
weight + bp
## [1] 124 158 205 200 223 140 106 195
Vectors in R can only contain elements of one type. If more than one type is included in a c() function, R silently coerces the vector to be of one type. The examples illustrate the hierarchy: if any element is a character, then the whole vector is character. If some elements are numeric (either integer or double) and other elements are logical, the whole vector is numeric. Note what happened when R was asked to add the numeric vector weight to the logical vector bp. The logical vector was silently coerced to be numeric, so that FALSE became zero and TRUE became one, and then the two numeric vectors were added.
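One way to see the coercion hierarchy directly is to ask typeof() about small mixed vectors. This is only a sketch; the values are arbitrary.

```r
# typeof() shows which type wins when types are mixed inside c()
typeof(c(TRUE, 2L))      # "integer": logical is coerced to integer
typeof(c(2L, 3.5))       # "double": integer is coerced to double
typeof(c(3.5, "dog"))    # "character": double is coerced to character
typeof(c(TRUE, "cat"))   # "character": logical is coerced to character
```

The pattern is logical, then integer, then double, then character: the result takes the type furthest along that sequence.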
4.1.2 Accessing Specific Elements of Vectors
To access and possibly change specific elements of vectors, refer to the position of the element in square brackets. For example, weight[4] refers to the fourth element of the vector weight. Note that R starts the numbering of elements at 1, i.e., the first element of a vector x is x[1].
weight
## [1] 123 157 205 199 223 140 105 194
weight[5]
## [1] 223
weight[1:3]
## [1] 123 157 205
length(weight)
## [1] 8
weight[length(weight)]
## [1] 194
weight[]
## [1] 123 157 205 199 223 140 105 194
weight[3] <- 202
weight
## [1] 123 157 202 199 223 140 105 194
Note that including nothing in the square brackets results in the whole vector being returned.
Negative numbers in the square brackets tell R to omit the corresponding value. And a zero as a subscript returns nothing (more precisely, it returns a length zero vector of the appropriate type).
weight[-3]
## [1] 123 157 199 223 140 105 194
weight[-length(weight)]
## [1] 123 157 202 199 223 140 105
lessWeight <- weight[-c(1,3,5)]
lessWeight
## [1] 157 199 140 105 194
weight[0]
## numeric(0)
weight[c(0,2,1)]
## [1] 157 123
weight[c(-1, 2)]
## Error in weight[c(-1, 2)]: only 0's may be mixed with negative subscripts
Note that mixing zero and other nonzero subscripts is allowed, but mixing negative and positive subscripts is not allowed.
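A few further subscript patterns follow from the same rules. The sketch below uses a made-up vector v so that it does not disturb weight.

```r
# Hypothetical vector for illustration
v <- c(10, 20, 30, 40, 50)

v[c(1, 5)]       # 10 50: any vector of positions may be used
v[c(2, 2, 2)]    # 20 20 20: positions may be repeated
v[5:1]           # 50 40 30 20 10: positions may be in any order
v[-(1:2)]        # 30 40 50: omit the first two elements
v[7]             # NA: asking for a position past the end gives a missing value
```

Note the parentheses in -(1:2). Without them, -1:2 would mean the sequence -1, 0, 1, 2, which mixes negative and positive subscripts and produces an error.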
What about the (usual) case where we don’t know the positions of the elements we want? For example possibly we want the weights of all females in the data. Later we will learn how to subset using logical indices, which is a very powerful way to access desired elements of a vector.
4.1.3 Practice Problem
A bad programming technique that often plagues beginners is a technique called hardcoding. Consider the following simple vector containing data on the number of tree species found at ten different sites.
tree.sp <- c(10, 13, 15, 8, 2, 9, 10, 20, 9, 11)
Suppose we are interested in the second to last value of the data set. Since we know there are ten values in the data set, we do this as follows:
tree.sp[10 - 1]
## [1] 9
This is an example of hardcoding. But what if we attempt to use the same code on a second vector of tree species data that only has six sites?
tree.sp <- c(8, 4, 3, 2, 19, 3)
tree.sp[10 - 1]
## [1] NA
That’s clearly not what we want. Fix this code so we can always extract the second to last value in the vector, regardless of the length of the vector.
4.2 Factors
Categorical variables can be represented as character vectors. In many cases this simple representation is sufficient. Consider, however, two other categorical variables, one representing age via categories youth, young adult, middle age, senior, and another representing income via categories lower, middle, and upper. Suppose that for the small health data set, all the people are either middle aged or senior citizens. If we just represented the variable via a character vector, there would be no way to know that there are two other categories, representing youth and young adults, which happen not to be present in the data set. And for the income variable, the character vector representation does not explicitly indicate that there is an ordering of the levels.
Factors in R provide a more sophisticated way to represent categorical variables. Factors explicitly contain all possible levels, and allow ordering of levels.
age <- c("middle age", "senior", "middle age", "senior",
         "senior", "senior", "senior", "middle age")
income <- c("lower", "lower", "upper", "middle", "upper",
            "lower", "lower", "middle")
age
## [1] "middle age" "senior" "middle age" "senior"
## [5] "senior" "senior" "senior" "middle age"
income
## [1] "lower" "lower" "upper" "middle" "upper"
## [6] "lower" "lower" "middle"
age <- factor(age, levels=c("youth", "young adult", "middle age",
                            "senior"))
age
## [1] middle age senior middle age senior
## [5] senior senior senior middle age
## Levels: youth young adult middle age senior
income <- factor(income, levels=c("lower", "middle", "upper"),
                 ordered = TRUE)
income
## [1] lower lower upper middle upper lower lower
## [8] middle
## Levels: lower < middle < upper
In the factor version of age the levels are explicitly listed, so it is clear that the two included levels are not all the possible levels. And in the factor version of income, the ordering is explicit.
In many cases the character vector representation of a categorical variable is sufficient and easier to work with. In this book, factors will not be used extensively. It is important to note that versions of R before 4.0.0 by default created a factor when character data were read in, and in those versions it is sometimes necessary to use the argument stringsAsFactors = FALSE to explicitly tell R not to do this. This is shown later in the chapter when data frames are introduced.
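To see what the extra information in a factor buys us, the sketch below builds two small factors that mirror the age and income examples. The names age.f and income.f and their values are assumptions made for illustration only.

```r
# Hypothetical factors mirroring the age and income examples
age.f <- factor(c("middle age", "senior", "senior"),
                levels = c("youth", "young adult", "middle age", "senior"))
levels(age.f)    # all four levels, including the two with no observations
table(age.f)     # counts per level; youth and young adult have a count of 0

income.f <- factor(c("lower", "upper", "middle"),
                   levels = c("lower", "middle", "upper"),
                   ordered = TRUE)
income.f < "upper"    # TRUE FALSE TRUE: comparisons respect the level order
```

A plain character vector could not report the empty categories in table(), and comparing character strings with < would use alphabetical order rather than the intended lower, middle, upper ordering.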
4.3 Missing Data, Infinity, etc.
Most real-world data sets have variables where some observations are missing. In a longitudinal study participants may drop out. In a survey, participants may decide not to respond to certain questions. Statistical software should be able to represent missing data and to analyze data sets in which some data are missing.
In R, the value NA is used for a missing data value. Since missing values may occur in numeric, character, and other types of data, and since R requires that a vector contain only elements of one type, there are different types of NA values. Usually R determines the appropriate type of NA value automatically. It is worth noting that the default type for NA is logical, and that NA is NOT the same as the character string "NA".
missingCharacter <- c("dog", "cat", NA, "pig", NA, "horse")
missingCharacter
## [1] "dog" "cat" NA "pig" NA "horse"
is.na(missingCharacter)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
missingCharacter <- c(missingCharacter, "NA")
missingCharacter
## [1] "dog" "cat" NA "pig" NA "horse"
## [7] "NA"
is.na(missingCharacter)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE
allMissing <- c(NA, NA, NA)
typeof(allMissing)
## [1] "logical"
How should missing data be treated in computations, such as finding the mean or standard deviation of a variable? One possibility is to return NA. Another is to remove the missing value(s) and then perform the computation.
mean(c(1,2,3,NA,5))
## [1] NA
mean(c(1,2,3,NA,5), na.rm=TRUE)
## [1] 2.75
As this example shows, the default behavior for the mean() function is to return NA. If removal of the missing values and then computing the mean is desired, the argument na.rm is set to TRUE. Different R functions have different default behaviors, and there are other possible actions. Consulting the help for a function provides the details.
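Several other summary functions accept the same na.rm argument. The sketch below uses a made-up vector v and also shows a common way to count missing values.

```r
# Hypothetical vector with two missing values
v <- c(4, NA, 7, 1, NA)

sum(is.na(v))           # 2: is.na() gives TRUE/FALSE, and sum() counts the TRUEs
sum(v)                  # NA: by default the missing values propagate
sum(v, na.rm = TRUE)    # 12
max(v, na.rm = TRUE)    # 7
```

Counting the missing values first is a useful habit, since na.rm = TRUE silently changes how many observations a summary is based on.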
4.3.1 Practice Problem
Collecting data is often a messy process resulting in multiple errors in the data. Consider the following small vector representing the weights of 10 adults in pounds.
my.weights <- c(150, 138, 289, 239, 12, 103, 310, 200, 218, 178)
As far as I know, it’s not possible for an adult to weigh 12 pounds, so that is most likely an error. Change this value to NA, and then find the standard deviation of the weights after removing the NA value.
4.3.2 Infinity and NaN
What happens if R code requests division by zero, or results in a number that is too large to be represented? Here are some examples.
x <- 0:4
x
## [1] 0 1 2 3 4
1/x
## [1] Inf 1.0000 0.5000 0.3333 0.2500
x/x
## [1] NaN 1 1 1 1
y <- c(10, 1000, 10000)
2^y
## [1] 1.024e+03 1.072e+301 Inf
Inf and -Inf represent infinity and negative infinity (and numbers which are too large in magnitude to be represented as floating point numbers). NaN represents the result of a calculation where the result is undefined, such as dividing zero by zero. All of these are common to a variety of programming languages, including R.
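R provides test functions for these special values. The sketch below applies them to a made-up vector z that contains one of each.

```r
# Hypothetical vector holding an ordinary number and each special value
z <- c(1, Inf, -Inf, NaN, NA)

is.finite(z)     # TRUE FALSE FALSE FALSE FALSE
is.infinite(z)   # FALSE TRUE TRUE FALSE FALSE
is.nan(z)        # FALSE FALSE FALSE TRUE FALSE
is.na(z)         # FALSE FALSE FALSE TRUE TRUE: NaN also counts as missing
```

Note the asymmetry in the last two lines: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.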
4.4 Data Frames
Commonly, data is rectangular in form, with variables as columns and cases as rows. Continuing with the (contrived) data on weight, gender, and blood pressure medication, each of those variables would be a column of the data set, and each person’s measurements would be a row. In R, such data are represented as a data frame.
healthData <- data.frame(Weight = weight, Gender=gender,
                         bp.meds = bp,
                         stringsAsFactors=FALSE)
healthData
## Weight Gender bp.meds
## 1 123 female TRUE
## 2 157 female TRUE
## 3 202 male FALSE
## 4 199 female TRUE
## 5 223 male FALSE
## 6 140 male FALSE
## 7 105 female TRUE
## 8 194 male TRUE
names(healthData)
## [1] "Weight" "Gender" "bp.meds"
colnames(healthData)
## [1] "Weight" "Gender" "bp.meds"
names(healthData) <- c("Wt", "Gdr", "bp")
healthData
## Wt Gdr bp
## 1 123 female TRUE
## 2 157 female TRUE
## 3 202 male FALSE
## 4 199 female TRUE
## 5 223 male FALSE
## 6 140 male FALSE
## 7 105 female TRUE
## 8 194 male TRUE
rownames(healthData)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
names(healthData) <- c("Weight", "Gender", "bp.meds")
The data.frame function can be used to create a data frame (although it’s more common to read a data frame into R from an external file, something that will be introduced later). The names of the variables in the data frame are given as arguments, as are the vectors of data that make up the variable’s values. The argument stringsAsFactors=FALSE asks R not to convert character vectors into factors. As of version 4.0.0, R does not automatically convert character vectors into factors. However, before that version, R would automatically convert strings to factors (i.e., stringsAsFactors = TRUE), and so to avoid confusion we will typically display stringsAsFactors=FALSE throughout most of the book. Names of the columns (variables) can be extracted and set via either names or colnames. In the example, the variable names are changed to Wt, Gdr, bp and then changed back to the original Weight, Gender, bp.meds in this way. Rows can be named also. In this case, since specific row names were not provided, the default row names of "1", "2", etc. are used.
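A few other functions are handy for inspecting a data frame and for setting row names. The sketch below uses a small made-up data frame, pets, rather than healthData.

```r
# Hypothetical data frame for illustration
pets <- data.frame(name = c("Rex", "Tom", "Ava"),
                   legs = c(4, 4, 2),
                   stringsAsFactors = FALSE)

dim(pets)     # 3 2: three rows and two columns
nrow(pets)    # 3
ncol(pets)    # 2
str(pets)     # one line per column, giving its type and first few values

# Replace the default row names "1", "2", "3"
rownames(pets) <- c("dog", "cat", "bird")
rownames(pets)
```

The str() function introduced at the start of the chapter is especially useful here, since it reports the type of every column at once.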
In the next example a built-in dataset called mtcars is made available by the data function, and then the first and last six rows are displayed using head and tail.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
## Valiant 18.1 6 225 105 2.76 3.460 20.22
## vs am gear carb
## Mazda RX4 0 1 4 4
## Mazda RX4 Wag 0 1 4 4
## Datsun 710 1 1 4 1
## Hornet 4 Drive 1 0 3 1
## Hornet Sportabout 0 0 3 2
## Valiant 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1
## am gear carb
## Porsche 914-2 1 5 2
## Lotus Europa 1 5 2
## Ford Pantera L 1 5 4
## Ferrari Dino 1 5 6
## Maserati Bora 1 5 8
## Volvo 142E 1 4 2
Note that the mtcars data frame does have non-default row names which give the make and model of the cars.
4.4.1 Accessing Specific Elements of Data Frames
Data frames are two-dimensional, so to access a specific element (or elements) we need to specify both the row and column.
mtcars[1,4]
## [1] 110
mtcars[1:3, 3]
## [1] 160 160 108
mtcars[1:3, 2:3]
## cyl disp
## Mazda RX4 6 160
## Mazda RX4 Wag 6 160
## Datsun 710 4 108
mtcars[, 1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
## [21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
Note that mtcars[,1] returns ALL elements in the first column. This agrees with the behavior for vectors, where leaving a subscript out of the square brackets tells R to return all values. In this case we are telling R to return all rows, and the first column.
For a data frame there is another way to access specific columns, using the $ notation.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
## [21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars$cyl
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8
## [26] 4 4 4 8 6 8 4
mpg
## Error in eval(expr, envir, enclos): object 'mpg' not found
cyl
## Error in eval(expr, envir, enclos): object 'cyl' not found
weight
## [1] 123 157 202 199 223 140 105 194
Notice that typing the variable name, such as mpg, without the name of the data frame (and a dollar sign) as a prefix, does not work. This is sensible. There may be several data frames that have variables named mpg, and just typing mpg doesn’t provide enough information to know which is desired. But if there is a vector named mpg that is created outside a data frame, it will be retrieved when mpg is typed, which is why typing weight does work, since weight was created outside of a data frame, although ultimately it was incorporated into the healthData data frame.
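Rows and columns can also be selected by name inside the square brackets. The sketch below compares three ways of extracting the mpg column of mtcars; the object names by.dollar, by.name, and by.position are made up for this example.

```r
data(mtcars)

# Three ways to pull the same column out of a data frame
by.dollar   <- mtcars$mpg
by.name     <- mtcars[, "mpg"]
by.position <- mtcars[, 1]
identical(by.dollar, by.name)       # TRUE
identical(by.dollar, by.position)   # TRUE

# Row names work as subscripts too
mtcars["Valiant", "hp"]             # 105
```

Selecting by name tends to be safer than selecting by position, because the code keeps working if columns are later added or reordered.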
4.5 Lists
The third main data structure we will work with is a list. Technically a list is a vector, but one in which elements can be of different types. For example a list may have one element that is a vector, one element that is a data frame, and another element that is a function. Consider designing a function that fits a simple linear regression model to two quantitative variables. We might want that function to compute and return several things such as
- The fitted slope and intercept (a numeric vector with two components)
- The residuals (a numeric vector with \(n\) components, where \(n\) is the number of data points)
- Fitted values for the data (a numeric vector with \(n\) components, where \(n\) is the number of data points)
- The names of the dependent and independent variables (a character vector with two components)
In fact R has a function, lm, which does this (and much more).
mpgHpLinMod <- lm(mpg ~ hp, data=mtcars)
mode(mpgHpLinMod)
## [1] "list"
names(mpgHpLinMod)
## [1] "coefficients" "residuals" "effects"
## [4] "rank" "fitted.values" "assign"
## [7] "qr" "df.residual" "xlevels"
## [10] "call" "terms" "model"
mpgHpLinMod$coefficients
## (Intercept) hp
## 30.09886 -0.06823
mpgHpLinMod$residuals
## Mazda RX4 Mazda RX4 Wag
## -1.59375 -1.59375
## Datsun 710 Hornet 4 Drive
## -0.95363 -1.19375
## Hornet Sportabout Valiant
## 0.54109 -4.83489
## Duster 360 Merc 240D
## 0.91707 -1.46871
## Merc 230 Merc 280
## -0.81717 -2.50678
## Merc 280C Merc 450SE
## -3.90678 -1.41777
## Merc 450SL Merc 450SLC
## -0.51777 -2.61777
## Cadillac Fleetwood Lincoln Continental
## -5.71206 -5.02978
## Chrysler Imperial Fiat 128
## 0.29364 6.80421
## Honda Civic Toyota Corolla
## 3.84901 8.23598
## Toyota Corona Dodge Challenger
## -1.98072 -4.36462
## AMC Javelin Camaro Z28
## -4.66462 -0.08293
## Pontiac Firebird Fiat X1-9
## 1.04109 1.70421
## Porsche 914-2 Lotus Europa
## 2.10991 8.01093
## Ford Pantera L Ferrari Dino
## 3.71340 1.54109
## Maserati Bora Volvo 142E
## 7.75761 -1.26198
The lm function returns a list (which in the code above has been assigned to the object mpgHpLinMod). One component of the list is the length 2 vector of coefficients, while another component is the length 32 vector of residuals. The code also illustrates that named components of a list can be accessed using the dollar sign notation, as with data frames.
The list function is used to create lists.
temporaryList <- list(first=weight, second=healthData,
                      pickle=list(a = 1:10, b=healthData))
temporaryList
## $first
## [1] 123 157 202 199 223 140 105 194
##
## $second
## Weight Gender bp.meds
## 1 123 female TRUE
## 2 157 female TRUE
## 3 202 male FALSE
## 4 199 female TRUE
## 5 223 male FALSE
## 6 140 male FALSE
## 7 105 female TRUE
## 8 194 male TRUE
##
## $pickle
## $pickle$a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $pickle$b
## Weight Gender bp.meds
## 1 123 female TRUE
## 2 157 female TRUE
## 3 202 male FALSE
## 4 199 female TRUE
## 5 223 male FALSE
## 6 140 male FALSE
## 7 105 female TRUE
## 8 194 male TRUE
Here, for illustration, I assembled a list to hold some of the R data structures we have been working with in this chapter. The first list element, named first, holds the weight vector we created in Section 4.1; the second list element, named second, holds the healthData data frame; and the third list element, named pickle, holds a list with elements named a and b that hold a vector of values 1 through 10 and another copy of the healthData data frame, respectively. As this example shows, a list can contain another list.
4.5.1 Accessing Specific Elements of Lists
We already have seen the dollar sign notation works for lists. In addition, the square bracket subsetting notation can be used. There is an added, somewhat subtle wrinkle—using either single or double square brackets.
temporaryList$first
## [1] 123 157 202 199 223 140 105 194
mode(temporaryList$first)
## [1] "numeric"
temporaryList[[1]]
## [1] 123 157 202 199 223 140 105 194
mode(temporaryList[[1]])
## [1] "numeric"
temporaryList[1]
## $first
## [1] 123 157 202 199 223 140 105 194
mode(temporaryList[1])
## [1] "list"
Note the dollar sign and double bracket notation return a numeric vector, while the single bracket notation returns a list. Notice also the difference in results below.
temporaryList[c(1,2)]
## $first
## [1] 123 157 202 199 223 140 105 194
##
## $second
## Weight Gender bp.meds
## 1 123 female TRUE
## 2 157 female TRUE
## 3 202 male FALSE
## 4 199 female TRUE
## 5 223 male FALSE
## 6 140 male FALSE
## 7 105 female TRUE
## 8 194 male TRUE
temporaryList[[c(1,2)]]
## [1] 157
The single bracket form returns the first and second elements of the list, while the double bracket form returns the second element in the first element of the list. Generally, do not put a vector of indices or names in a double bracket, because you will likely get unexpected results. See, for example, the results below.
temporaryList[[c(1,2,3)]]
## Error in temporaryList[[c(1, 2, 3)]]: recursive indexing failed at level 2
So, in summary, there are two main differences between using the single bracket [] and double bracket [[]]. First, the single bracket returns a list that holds the object(s) at the given indices or names, whereas the double bracket returns the actual object held at the given index or name. Second, a single bracket can access a range of list elements, while a double bracket can access only a single element.
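The distinction is easiest to see on a small example. The sketch below uses a made-up nested list, lst, and shows how the bracket and dollar sign notations can be chained to reach inner elements.

```r
# Hypothetical nested list for illustration
lst <- list(nums = c(10, 20, 30), inner = list(a = 1:3, b = "text"))

class(lst["nums"])       # "list": the single bracket keeps the list wrapper
class(lst[["nums"]])     # "numeric": the double bracket extracts the element
lst[["nums"]][2]         # 20: extract the vector, then subset it
lst$inner$a              # 1 2 3: dollar signs can be chained
lst[["inner"]][["b"]]    # "text": so can double brackets
length(lst[c("nums", "inner")])   # 2: the single bracket can select several elements
```

Chaining one double bracket per level, as in lst[["inner"]][["b"]], is usually clearer than placing a vector of indices inside a single pair of double brackets.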
4.6 Comparison and logical operators
Comparison operators are binary operators that test a comparative condition between the operands and return a logical value to indicate the test result. We often use comparison operators to gain access to only part of an R object that passes some logical test. You’re likely already familiar with many comparison operators.
The basic idea of comparison operators is quite simple. We have a logical test (e.g., what weights are greater than 200) and want to determine what values in a vector (or some other R object) pass the test. When we apply a comparison operator, the results are logical values that indicate whether the specific element in the vector passes the test (TRUE) or not (FALSE).
Let’s walk through the comparison operators available in R. We’ll present the operator and its definition, followed by an example using the weight and gender vectors created in Section 4.1. First, let’s recall the values held in these vectors.
weight
## [1] 123 157 202 199 223 140 105 194
gender
## [1] "female" "female" "male" "female" "male"
## [6] "male" "female" "male"
== the equality operator: The “double equals sign” tests if operands are equal. Below we perform a logical test to determine which gender vector elements equal male.
gender == "male"
## [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
Not surprisingly, the third, fifth, sixth, and eighth elements return TRUE and all other elements return FALSE. Notice we’re using the == sign, not the = sign. Mixing up the comparison operator == and assignment operator = is a common error.
!= the inequality operator: Tests if operands are not equal, and is thus the inverse of ==. We see this by testing which gender vector elements do not equal male.
gender != "male"
## [1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
<, <=, >, >= the less than, less than or equal to, greater than, and greater than or equal to operators, respectively. Using the weight vector, determine which elements are greater than 194 and then greater than or equal to 194.
weight > 194
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
weight >= 194
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
Suppose we want to know which weight vector elements are greater than 194 and less than 210. Answering this question requires use of two comparison operators, i.e., \(<\) and \(>\). In such cases, logical operators are used to combine multiple comparison operations into a single logical statement. We consider the following logical operators: “and”, “or”, “xor”, and “negation”.
Importantly, in order of operation, comparison operators precede logical operators. The Syntax manual page (i.e., run ?Syntax on the Console) lists R operators’ order of operation, where you’ll notice the comparison operators are listed before the logical operators in the precedence groups under the Details Section.
Let’s walk through each of the logical operators:
& the “and” operator: A comparison using the & operator returns TRUE when both operands are TRUE and FALSE otherwise. The & operator works elementwise for operand vectors. Consider the following example.
weight < 210
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
weight > 194
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
weight < 210 & weight > 194
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
First we show the results of weight < 210 and weight > 194 separately. When combining the comparison operations using the & operator, R first performs weight < 210 and weight > 194, then applies & elementwise on the logical vector operands. The elementwise & returns TRUE when the element in the weight < 210 vector is TRUE and the element in the weight > 194 vector is TRUE. The key point to remember is that & returns TRUE only if both operands are TRUE.
| the “or” operator: A comparison using the | operator returns TRUE if at least one operand is TRUE and FALSE otherwise. Similar to the & operator, the | operator works element by element. Let’s use the same example as before, but now we’ll identify individuals with a weight less than 210 or a weight greater than 194.
weight < 210 | weight > 194
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Not surprisingly, this operation returns TRUE for all elements, because all elements in weight are either greater than 194 or less than 210.
xor the “exclusive or” operator: A comparison using the xor function returns TRUE if exactly one of the operands is TRUE and FALSE otherwise.
xor(weight < 210, weight > 194)
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
While we can imagine cases where this operator would be handy, we’ve never found the occasion to use it in our own code.
! the “negation” or “not” operator: The exclamation point ! (called “bang” in programmer’s slang) reverses a logical value, i.e., !TRUE is FALSE and !FALSE is TRUE. The code below returns TRUE for weight values not greater than 194 (while not required, the parentheses emphasize the order of operation).
!(weight > 194)
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
There are “&&” and “||” variants of “&” and “|”, respectively. These “double” operators are meant for single logical values rather than element-by-element comparison: older versions of R examined only the first element of each operand vector, and since version 4.3.0 R signals an error if an operand has length greater than one. Using && and || is useful when writing conditional statements in functions (see, e.g., Chapter 7); however, we’ll generally not use them in this book.
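The double operators are easiest to understand with single values. The sketch below uses a made-up value w; it is an illustration of the general behavior, not code taken from Chapter 7.

```r
# Hypothetical single value to check
w <- 205

is.numeric(w) && w > 200      # TRUE: both conditions hold
is.character(w) || w > 200    # TRUE: the second condition holds

# && stops as soon as the answer is known, so the right-hand side
# is not evaluated when the left-hand side is FALSE
is.character(w) && nchar(w) > 3   # FALSE
```

This stopping behavior is what makes && and || convenient in conditional statements: a cheap check on the left can guard a check on the right that would otherwise be meaningless or fail.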
4.6.1 The %in% operator
Suppose we want to identify the weight
vector elements equal to 123, 199, or 140. We can do this using the equality operator ==
and the |
operator as follows.
weight
## [1] 123 157 202 199 223 140 105 194
weight == 123 | weight == 199 | weight == 140
## [1] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
However, this is a little clunky, involves a lot of typing, and generally makes code hard to read. Lucky for us, R has the “in” operator, %in%
, to accomplish this task in a more intuitive and easy-to-read manner.
weight %in% c(123, 199, 140)
## [1] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
In the spirit of coding techniques to promote efficient and reproducible code, we’ll use the %in%
operator throughout the book.
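Two common follow-ups to %in% are negating the test and counting the matches. The sketch below shows both, again re-creating weight from the printed values; targets is a name we introduce for illustration.

```r
weight  <- c(123, 157, 202, 199, 223, 140, 105, 194)
targets <- c(123, 199, 140)   # illustrative name, not from the text

# TRUE for the elements of weight that are NOT among the targets
!(weight %in% targets)
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

# Summing a logical vector counts its TRUE values
sum(weight %in% targets)
## [1] 3
```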
Comparison and logical operators are invaluable for identifying subsets of data that meet specified conditions. The next section explores how comparison and logical operators facilitate subsetting vectors, data frames, and lists.
4.7 Subsetting with Logical Vectors
Consider the healthData
data frame. How can we access only those weights which are more than 200? How can we access the genders of those whose weights are more than 200? How can we compute the mean weight of males and the mean weight of females? Or consider the mtcars
data frame. How can we obtain the miles per gallon for all six cylinder cars? Both of these data sets are small enough that it would not be too onerous to extract the values by hand. But for larger or more complex data sets, this would be very difficult or impossible to do in a reasonable amount of time, and would likely result in errors.
R has a powerful method for solving these sorts of problems using a variant of the subsetting methods that we already have learned. When given a logical vector in square brackets, R will return the values corresponding to TRUE
.
To begin, focus on the weight
and gender
vectors created in Section 4.1.
The R code weight > 200
returns a TRUE
for each value of weight
which is more than 200, and a FALSE
for each value of weight
which is less than or equal to 200. Similarly gender == "female"
returns TRUE
or FALSE
depending on whether an element of gender
is equal to female
.
weight
## [1] 123 157 202 199 223 140 105 194
weight > 200
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
gender[weight > 200]
## [1] "male" "male"
weight[weight > 200]
## [1] 202 223
gender == "female"
## [1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
weight[gender == "female"]
## [1] 123 157 199 105
Consider the lines of R code one by one.
- weight instructs R to display the values in the vector weight.
- weight > 200 instructs R to check whether each value in weight is greater than 200, and to return TRUE if so, and FALSE otherwise.
- The next line, gender[weight > 200], does two things. First, inside the square brackets, it does the same thing as the second line, namely, returning TRUE or FALSE depending on whether a value of weight is or is not greater than 200. Second, each element of gender is matched with the corresponding TRUE or FALSE value, and is returned if and only if the corresponding value is TRUE. For example, the first value of gender is gender[1]. Since the first TRUE or FALSE value is FALSE, the first value of gender is not returned. Only the third and fifth values of gender, both of which happen to be male, are returned. Briefly, this line returns the genders of those people whose weight is over 200 pounds.
- The fourth line of code, weight[weight > 200], again begins by returning TRUE or FALSE depending on whether elements of weight are larger than 200. Then those elements of weight corresponding to TRUE values are returned. So this line returns the weights of those people whose weights are more than 200 pounds.
- The fifth line returns TRUE or FALSE depending on whether elements of gender are equal to female or not.
- The sixth line returns the weights of those whose gender is female.
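The questions posed at the start of this section can now be answered with the same idiom. The sketch below assumes that weight and gender hold the values shown in the output above (the gender values are reconstructed from the gender == "female" result and agree with the healthData printout later in this section), and it uses the built-in mtcars data frame.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female",
            "male", "male", "female", "male")

# Mean weight within each gender
mean(weight[gender == "male"])
## [1] 189.75
mean(weight[gender == "female"])
## [1] 146

# Miles per gallon for all six cylinder cars in mtcars
mtcars$mpg[mtcars$cyl == 6]
## [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
```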
4.7.1 Modifying or Creating Objects via Subsetting
The results of subsetting can be assigned to a new (or existing) R object, and subsetting on the left side of an assignment is a common way to modify an existing R object.
weight
## [1] 123 157 202 199 223 140 105 194
light.weight <- weight[weight < 200]
light.weight
## [1] 123 157 199 140 105 194
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x[x < 5] <- 0
x
## [1] 0 0 0 0 5 6 7 8 9 10
y <- -3:9
y
## [1] -3 -2 -1 0 1 2 3 4 5 6 7 8 9
y[y < 0] <- NA
y
## [1] NA NA NA 0 1 2 3 4 5 6 7 8 9
rm(x)
rm(y)
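A common use of assignment through a logical subset is recoding values in place. The following sketch works on a copy so that the original vector is untouched. The name weight.capped is hypothetical, and the cap of 200 is chosen only for illustration.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)

weight.capped <- weight                    # work on a copy
weight.capped[weight.capped > 200] <- 200  # replace only the selected elements
weight.capped
## [1] 123 157 200 199 200 140 105 194

weight   # the original vector is unchanged
## [1] 123 157 202 199 223 140 105 194
```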
4.7.2 Logical Subsetting and Data Frames
First consider the small and simple healthData
data frame.
healthData
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 3    202   male   FALSE
## 4    199 female    TRUE
## 5    223   male   FALSE
## 6    140   male   FALSE
## 7    105 female    TRUE
## 8    194   male    TRUE
healthData$Weight[healthData$Gender == "male"]
## [1] 202 223 140 194
healthData[healthData$Gender == "female", ]
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 4    199 female    TRUE
## 7    105 female    TRUE
healthData[healthData$Weight > 190, 2:3]
##   Gender bp.meds
## 3   male   FALSE
## 4 female    TRUE
## 5   male   FALSE
## 8   male    TRUE
The first example is really just subsetting a vector, since the $
notation extracts a column as a vector. The last two examples return subsets of the whole data frame. Note that the logical vector subsets the rows of the data frame, choosing those rows where the gender is female in the first of these examples, and those rows where the weight is more than 190 in the second. Note also that the specification for the columns (after the comma) is left blank in the former, telling R to return all the columns, while in the latter the second and third columns are requested explicitly.
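Row conditions can also be combined with the logical operators from the previous section. The sketch below re-creates healthData from the printout above so that it runs on its own. The particular conditions are ours, chosen for illustration.

```r
# healthData re-created from the printed values
healthData <- data.frame(
  Weight  = c(123, 157, 202, 199, 223, 140, 105, 194),
  Gender  = c("female", "female", "male", "female",
              "male", "male", "female", "male"),
  bp.meds = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Rows for males weighing more than 190, all columns
healthData[healthData$Gender == "male" & healthData$Weight > 190, ]
##   Weight Gender bp.meds
## 3    202   male   FALSE
## 5    223   male   FALSE
## 8    194   male    TRUE

# A logical column can itself select elements: weights of those on bp meds
healthData$Weight[healthData$bp.meds]
## [1] 123 157 199 105 194
```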
Next consider the much larger and more complex WorldBank
data frame. Recall, the str
function displays the “structure” of an R object. Here is a look at the structure of several R objects.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
str(temporaryList)
## List of 3
## $ first : num [1:8] 123 157 202 199 223 140 105 194
## $ second:'data.frame': 8 obs. of 3 variables:
## ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
## ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
## ..$ bp.meds: logi [1:8] TRUE TRUE FALSE TRUE FALSE FALSE ...
## $ pickle:List of 2
## ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
## ..$ b:'data.frame': 8 obs. of 3 variables:
## .. ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
## .. ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
## .. ..$ bp.meds: logi [1:8] TRUE TRUE FALSE TRUE FALSE FALSE ...
str(WorldBank)
## 'data.frame': 11880 obs. of 15 variables:
## $ iso2c : chr "AD" "AD" "AD" "AD" ...
## $ country : chr "Andorra" "Andorra" "Andorra" "Andorra" ...
## $ year : int 1978 1979 1977 2007 1976 2011 2012 2008 1980 1972 ...
## $ fertility.rate : num NA NA NA 1.18 NA NA NA 1.25 NA NA ...
## $ life.expectancy : num NA NA NA NA NA NA NA NA NA NA ...
## $ population : num 33746 34819 32769 81292 31781 ...
## $ GDP.per.capita.Current.USD : num 9128 11820 7751 39923 7152 ...
## $ X15.to.25.yr.female.literacy: num NA NA NA NA NA NA NA NA NA NA ...
## $ iso3c : chr "AND" "AND" "AND" "AND" ...
## $ region : chr "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" ...
## $ capital : chr "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" ...
## $ longitude : num 1.52 1.52 1.52 1.52 1.52 ...
## $ latitude : num 42.5 42.5 42.5 42.5 42.5 ...
## $ income : chr "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" ...
## $ lending : chr "Not classified" "Not classified" "Not classified" "Not classified" ...
First we see that mtcars
is a data frame which has 32 observations (rows) on each of 11 variables (columns). The names of the variables are given, along with their types (in this case, all numeric) and the first few values of each variable.
Second we see that temporaryList
is a list with three components. Each of the components is described separately, with the first few values again given.
Third we examine the structure of WorldBank
. It is a data frame with 11880 observations on each of 15 variables. Some of these are character variables, some are numeric, and one (year
) is integer. Looking at the first few values we see that some variables have missing values.
Consider creating a data frame which only has the observations from one year, say 1971. That’s relatively easy. Just choose rows for which year
is equal to 1971.
WorldBank1971 <- WorldBank[WorldBank$year == 1971, ]
dim(WorldBank1971)
## [1] 216 15
The dim
function returns the dimensions of a data frame, i.e., the number of rows and the number of columns. From dim
we see that there are 216 cases from 1971.
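When only one of the two dimensions is needed, nrow() and ncol() return it directly. The sketch below uses the built-in mtcars data frame so that it runs without the WorldBank data, and also shows that summing a logical vector counts matching rows without building a new data frame.

```r
dim(mtcars)    # rows, then columns
## [1] 32 11
nrow(mtcars)   # same as dim(mtcars)[1]
## [1] 32
ncol(mtcars)   # same as dim(mtcars)[2]
## [1] 11

# Number of six cylinder cars, counted directly from the logical vector
sum(mtcars$cyl == 6)
## [1] 7
```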
Next, how can we create a data frame which only contains data from 1971, and also only contains cases for which there are no missing values in the fertility rate variable? R has a built-in function is.na
which returns TRUE
if the observation is missing and returns FALSE
otherwise. And !is.na
returns the negation, i.e., it returns FALSE
if the observation is missing and TRUE
if the observation is not missing.
WorldBank1971$fertility.rate[1:25]
## [1] NA 6.512 7.671 3.517 4.933 3.118 7.264 3.104
## [9] NA 2.200 2.961 2.788 4.479 2.260 2.775 2.949
## [17] 6.942 2.210 6.657 2.100 6.293 7.329 6.786 NA
## [25] 5.771
!is.na(WorldBank1971$fertility.rate[1:25])
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [17] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [25] TRUE
WorldBank1971 <- WorldBank1971[!is.na(WorldBank1971$fertility.rate), ]
dim(WorldBank1971)
## [1] 193 15
From dim
we see that there are 193 cases from 1971 with non-missing fertility rate data.
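The two filtering steps can also be combined into a single subsetting expression with &. Because the WorldBank data are not reproduced here, the sketch uses a small made-up data frame, toyBank, whose values are invented purely for illustration. The name keep is also ours.

```r
# Hypothetical stand-in for WorldBank; all values are made up
toyBank <- data.frame(
  country        = c("A", "B", "C", "A", "B", "C"),
  year           = c(1971, 1971, 1971, 1972, 1972, 1972),
  fertility.rate = c(NA, 6.5, 3.1, 2.9, NA, 3.0)
)

# Keep rows from 1971 whose fertility rate is not missing
keep <- toyBank$year == 1971 & !is.na(toyBank$fertility.rate)
toyBank[keep, ]
##   country year fertility.rate
## 2       B 1971            6.5
## 3       C 1971            3.1
```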
Return attention now to the original WorldBank
data frame with data not only from 1971. How can we extract only those cases (rows) which have NO missing data? Consider the following simple example:
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
                                 V2 = c(NA, 1, 4, 5, NA),
                                 V3 = c(1, 2, 3, 5, 7))
temporaryDataFrame
## V1 V2 V3
## 1 1 NA 1
## 2 2 1 2
## 3 3 4 3
## 4 4 5 5
## 5 NA NA 7
is.na(temporaryDataFrame)
## V1 V2 V3
## [1,] FALSE TRUE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,] TRUE TRUE FALSE
rowSums(is.na(temporaryDataFrame))
## [1] 1 0 0 0 2
First notice that is.na
will test each element of a data frame for missingness. Also recall that if R is asked to sum a logical vector, it will first convert the logical vector to numeric and then compute the sum, which effectively counts the number of elements in the logical vector which are TRUE
. The rowSums
function computes the sum of each row. So rowSums(is.na(temporaryDataFrame))
returns a vector with as many elements as there are rows in the data frame. If an element is zero, the corresponding row has no missing values. If an element is greater than zero, the value is the number of variables which are missing in that row. This gives a simple method to return all the cases which have no missing data.
dim(WorldBank)
## [1] 11880 15
WorldBankComplete <- WorldBank[rowSums(is.na(WorldBank)) == 0, ]
dim(WorldBankComplete)
## [1] 564 15
Out of the 11880 rows in the original data frame, only 564 have no missing observations!
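R also provides the complete.cases() function (in the stats package, which is loaded by default) for this task. The sketch below checks, on the small temporaryDataFrame example, that it selects the same rows as the rowSums() approach. It is offered as an alternative, not as the method used in the text.

```r
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
                                 V2 = c(NA, 1, 4, 5, NA),
                                 V3 = c(1, 2, 3, 5, 7))

# TRUE for rows with no missing values
complete.cases(temporaryDataFrame)
## [1] FALSE  TRUE  TRUE  TRUE FALSE

# Same selection as the rowSums() approach
all(complete.cases(temporaryDataFrame) ==
      (rowSums(is.na(temporaryDataFrame)) == 0))
## [1] TRUE

temporaryDataFrame[complete.cases(temporaryDataFrame), ]
##   V1 V2 V3
## 2  2  1  2
## 3  3  4  3
## 4  4  5  5
```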
4.8 Patterned Data
Sometimes it is useful to generate all the integers from 1 through 20, to generate a sequence of 100 points equally spaced between 0 and 1, etc. The R functions seq()
and rep()
as well as the “colon operator” :
help to generate such sequences.
The colon operator generates a sequence of values with increments of \(1\) or \(-1\).
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
-5:3
## [1] -5 -4 -3 -2 -1 0 1 2 3
10:4
## [1] 10 9 8 7 6 5 4
pi:7
## [1] 3.142 4.142 5.142 6.142
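One point worth checking when the colon operator is mixed with arithmetic is operator precedence: the colon is evaluated before addition and subtraction. The short sketch below shows the difference that parentheses make. The variable n is ours, chosen for illustration.

```r
n <- 5   # illustrative value

1:n - 1     # parsed as (1:n) - 1
## [1] 0 1 2 3 4

1:(n - 1)   # parentheses give the sequence from 1 to n - 1
## [1] 1 2 3 4
```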
The seq()
function generates either a sequence of pre-specified length or a sequence with pre-specified increments.
seq(from = 0, to = 1, length = 11)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 5, by = 1/3)
## [1] 1.000 1.333 1.667 2.000 2.333 2.667 3.000 3.333
## [9] 3.667 4.000 4.333 4.667 5.000
seq(from = 3, to = -1, length = 10)
## [1] 3.0000 2.5556 2.1111 1.6667 1.2222 0.7778
## [7] 0.3333 -0.1111 -0.5556 -1.0000
The rep()
function replicates the values in a given vector.
rep(c(1,2,4), length = 9)
## [1] 1 2 4 1 2 4 1 2 4
rep(c(1,2,4), times = 3)
## [1] 1 2 4 1 2 4 1 2 4
rep(c("a", "b", "c"), times = c(3, 2, 7))
## [1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"
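The rep() function also accepts an each argument, and seq_len() is a close relative of the colon operator. The sketch below illustrates both. These examples are additions of ours and not part of the original set.

```r
# each = repeats every element before moving to the next
rep(c(1, 2, 4), each = 2)
## [1] 1 1 2 2 4 4

# each and times can be combined
rep(c("a", "b"), each = 2, times = 2)
## [1] "a" "a" "b" "b" "a" "a" "b" "b"

# seq_len(n) behaves like 1:n, but gives an empty vector when n is zero
seq_len(3)
## [1] 1 2 3
length(seq_len(0))
## [1] 0
```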
4.8.1 Practice Problem
Often when using R you will want to simulate data from a specific probability distribution (e.g., normal/Gaussian, binomial, Poisson). R has a vast suite of functions for working with statistical distributions. Each function that generates values from a statistical distribution has a name beginning with an “r” followed by an abbreviation of the probability distribution. For example, to simulate from the three distributions mentioned above, we can use the functions rnorm()
, rbinom()
, and rpois()
.
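As a brief illustration of the naming convention, the sketch below draws a few values with rbinom() and rpois(). The seed, the number of draws, and the parameter values are arbitrary choices of ours. No output is shown because the values depend on the random number generator.

```r
set.seed(42)   # arbitrary seed so the draws are reproducible

# Five draws from each distribution; parameter values are arbitrary
binom.draws <- rbinom(n = 5, size = 10, prob = 0.5)  # Binomial(10, 0.5)
pois.draws  <- rpois(n = 5, lambda = 3)              # Poisson with mean 3

binom.draws
pois.draws
```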
Use the rnorm()
function to generate 10,000 values from the standard normal distribution (the normal distribution with mean = 0 and variance = 1). Consult the help page for rnorm()
if you need to. Save these values in a vector named sim.vals
. Then use the hist()
function to draw a histogram of the simulated data. Does the data look like it follows a normal distribution?
4.9 Exercises
Exercise 3 Learning objectives: create, subset, and manipulate vector contents and attributes; summarize vector data using R table()
and other functions; generate basic graphics using vector data.
Exercise 4 Learning objectives: use functions to describe data frame characteristics; summarize and generate basic graphics for variables held in data frames; apply the subset function with logical operators; illustrate NA
, NaN
, Inf
, and other special values; recognize the implications of using floating point arithmetic with logical operators.
Exercise 5 Learning objectives: practice with lists, data frames, and associated functions; summarize variables held in lists and data frames; work with R’s linear regression lm()
function output; review logical subsetting of vectors for partitioning and assigning of new values; generate and visualize data from mathematical functions.
Technically the objects described in this section are “atomic” vectors (all elements of the same type), since lists, to be described below, also are actually vectors. This will not be an important issue, and the shorter term vector will be used for atomic vectors below.↩︎
Missing data will be discussed in more detail later in the chapter.↩︎
The mode function returns the type or storage mode of an object.↩︎
Try this example using only single brackets\(\ldots\) it will return a list holding elements first, second, and pickle.↩︎