Tips for working with big data
Hunter Stanke, Jeffrey W. Doser
2020 (last updated February 6, 2025)
Source: vignettes/bigData.Rmd
Larger-than-RAM methods
The sheer size of the FIA Database can present a serious challenge for many users interested in performing regional studies (requiring a large subset of the database). Recent updates to rFIA are intended to reduce these barriers. Namely, we’ve implemented “larger-than-RAM” methods for all rFIA estimator functions. In short, behind the scenes we read the necessary tables for individual states into RAM one at a time and summarize to the estimation unit level (always sub-state and mutually exclusive populations, hence additive properties apply). We save the estimation-unit-level results for each state in RAM, and combine them into the final output once we’ve iterated over all states. This may sound complicated, but fortunately these “larger-than-RAM” methods use the exact same syntax as normal “in-memory” operations.
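To make that pattern concrete, here is a rough conceptual sketch of the state-by-state workflow. It is a sketch only, not rFIA’s actual internals (variance estimation and the combination step are more involved), and it assumes data for RI and CT have already been downloaded to 'path/to/save/' as in the example below.
library(rFIA)
# Conceptual sketch only: process one state at a time to limit RAM use
states <- c('RI', 'CT')
stateResults <- lapply(states, function(s) {
  # Read a single state's tables into memory
  db <- readFIA(dir = 'path/to/save/', states = s)
  # Population totals are additive across the mutually exclusive
  # estimation units, so state-level summaries can be combined later
  biomass(db, totals = TRUE)
})
# rFIA then combines these state-level summaries into one final output,
# roughly by summing totals and re-deriving per-acre values and variances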
To get started, we simply have to set up a Remote.FIA.Database in place of our regular in-memory FIA.Database by setting inMemory = FALSE in our call to readFIA():
library(rFIA)
# Download data for two small states
getFIA(c('RI', 'CT'), dir = 'path/to/save/', load = FALSE)
# Now set up a Remote.FIA.Database with readFIA by setting inMemory = FALSE
# Instead of reading in the data now, readFIA will simply save a pointer
# and allow the estimator functions to read/process the data state-by-state
fia <- readFIA('path/to/save/', inMemory = FALSE)
class(fia)
## [1] "Remote.FIA.Database"
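Because readFIA() records only where the data live on disk (plus a bit of metadata), rather than the tables themselves, the remote object is tiny. If you are curious, you can confirm this with base R's object.size():
# The remote "database" is just a lightweight pointer to files on disk
print(object.size(fia), units = 'Kb')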
Once set up, our Remote.FIA.Database will work exactly the same as we are used to. That is, we can use the same syntax we have been using for normal, in-memory operations. For example, to estimate biomass using our Remote.FIA.Database:
# Estimate biomass with Remote.FIA.Database
biomass(db = fia)
## # A tibble: 19 × 8
## YEAR BIO_ACRE CARB_ACRE BIO_ACRE_SE CARB_ACRE_SE nPlots_TREE nPlots_AREA
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 2005 70.6 34.0 3.19 3.19 219 223
## 2 2006 71.1 34.2 2.65 2.66 325 331
## 3 2007 71.0 34.2 2.26 2.26 436 442
## 4 2008 71.8 34.6 2.21 2.22 425 430
## 5 2009 73.6 35.4 2.24 2.24 419 425
## 6 2010 75.1 36.1 2.21 2.21 422 426
## 7 2011 76.1 36.6 2.18 2.18 435 441
## 8 2012 77.1 37.1 2.15 2.15 441 446
## 9 2013 77.3 37.2 2.12 2.11 438 443
## 10 2014 77.8 37.4 2.12 2.12 439 444
## 11 2015 78.3 37.7 2.10 2.10 442 448
## 12 2016 79.2 38.1 2.10 2.10 440 445
## 13 2017 79.4 38.2 2.11 2.11 438 443
## 14 2018 79.6 38.3 2.15 2.15 434 440
## 15 2019 80.4 38.7 2.26 2.26 430 436
## 16 2020 81.4 39.1 2.26 2.26 428 436
## 17 2021 81.3 39.1 2.36 2.36 426 435
## 18 2022 82.3 39.6 2.23 2.23 428 436
## 19 2023 82.2 39.5 2.63 2.63 301 306
## # ℹ 1 more variable: N <int>
# All the extra goodies work the same:
# By species
biomass(fia, bySpecies = TRUE)
## # A tibble: 1,235 × 11
## YEAR SPCD COMMON_NAME SCIENTIFIC_NAME BIO_ACRE CARB_ACRE BIO_ACRE_SE
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2005 10 fir spp. Abies spp. 0.000761 0.000366 100.
## 2 2005 43 Atlantic white-ce… Chamaecyparis … 0.00398 0.00189 99.9
## 3 2005 68 eastern redcedar Juniperus virg… 0.517 0.269 29.4
## 4 2005 91 Norway spruce Picea abies 0.00445 0.00213 100.
## 5 2005 97 red spruce Picea rubens 0.0155 0.00745 100.
## 6 2005 125 red pine Pinus resinosa 0.0300 0.0160 100.
## 7 2005 126 pitch pine Pinus rigida 0.417 0.199 58.4
## 8 2005 129 eastern white pine Pinus strobus 3.78 1.92 20.3
## 9 2005 261 eastern hemlock Tsuga canadens… 2.58 1.24 22.9
## 10 2005 315 striped maple Acer pensylvan… 0.00956 0.00457 77.1
## # ℹ 1,225 more rows
## # ℹ 4 more variables: CARB_ACRE_SE <dbl>, nPlots_TREE <int>, nPlots_AREA <int>,
## # N <int>
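Domain specification carries over unchanged as well. As a small illustrative sketch (DIA is the tree diameter column in the TREE table, recorded in inches), we can restrict the estimate to larger stems with treeDomain:
# Only trees larger than 12 inches DBH
biomass(fia, treeDomain = DIA > 12)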
# Alternative estimators (linear moving average)
biomass(fia, method = 'LMA')
## # A tibble: 19 × 8
## YEAR BIO_ACRE CARB_ACRE BIO_ACRE_SE CARB_ACRE_SE nPlots_TREE nPlots_AREA
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 2005 70.8 34.1 2.97 2.97 219 223
## 2 2006 71.5 34.4 2.52 2.51 320 326
## 3 2007 72.3 34.8 2.20 2.20 435 441
## 4 2008 73.6 35.4 2.38 2.37 412 417
## 5 2009 74.3 35.7 2.45 2.44 417 423
## 6 2010 76.2 36.6 2.29 2.28 421 425
## 7 2011 76.0 36.6 2.45 2.44 435 441
## 8 2012 77.6 37.4 2.52 2.52 441 446
## 9 2013 77.2 37.1 2.49 2.49 438 443
## 10 2014 78.6 37.8 2.50 2.49 439 444
## 11 2015 79.0 37.9 2.33 2.32 442 448
## 12 2016 80.9 38.9 2.38 2.36 440 445
## 13 2017 80.7 38.8 2.51 2.54 438 443
## 14 2018 79.9 38.4 2.57 2.59 434 440
## 15 2019 81.8 39.4 2.89 2.91 430 436
## 16 2020 81.9 39.4 2.86 2.86 428 436
## 17 2021 82.5 39.7 2.93 2.92 426 435
## 18 2022 82.8 39.8 2.56 2.55 428 436
## 19 2023 82.2 39.5 2.96 2.94 301 306
## # ℹ 1 more variable: N <int>
# Grouped estimates work just the same, here by stand origin and site class
biomass(fia, grpBy = c(STDORGCD, SITECLCD))
## # A tibble: 145 × 10
## YEAR STDORGCD SITECLCD BIO_ACRE CARB_ACRE BIO_ACRE_SE CARB_ACRE_SE
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2005 0 3 105. 51.3 12.4 11.9
## 2 2005 0 4 86.5 41.6 7.52 7.53
## 3 2005 0 5 72.9 35.1 3.45 3.45
## 4 2005 0 6 55.5 26.7 7.37 7.36
## 5 2005 0 7 45.6 22.1 9.93 10.3
## 6 2005 1 3 84.7 42.1 0 0
## 7 2005 1 6 10.2 4.94 6.25 6.06
## 8 2006 0 3 103. 50.5 11.4 11.0
## 9 2006 0 4 80.9 39.0 6.96 6.89
## 10 2006 0 5 75.3 36.2 2.89 2.93
## # ℹ 135 more rows
## # ℹ 3 more variables: nPlots_TREE <int>, nPlots_AREA <int>, N <int>
In addition, you can still specify spatial-temporal subsets on Remote.FIA.Database objects using clipFIA():
# A most recent subset with the Remote.FIA.Database
fiaMR <- clipFIA(fia)
# Biomass in most recent inventory
biomass(fiaMR)
## # A tibble: 1 × 8
## YEAR BIO_ACRE CARB_ACRE BIO_ACRE_SE CARB_ACRE_SE nPlots_TREE nPlots_AREA
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 2023 81.4 39.2 2.29 2.29 427 435
## # ℹ 1 more variable: N <int>
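Spatial subsets work through the same function. As a sketch, the snippet below masks the remote database with countiesRI, the example county-boundary layer included with rFIA; substitute any polygon covering your area of interest:
# Spatial subset: keep only plots falling within a polygon mask
data('countiesRI', package = 'rFIA')
fiaClip <- clipFIA(fia, mask = countiesRI, mostRecent = TRUE)
biomass(fiaClip)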
In practice, rFIA's new larger-than-RAM methods make it possible for nearly anyone to work with very large subsets of the FIA Database. In our testing, we have run tpa(), biomass(), dwm(), and carbon() for the entire continental US on a machine with just 16 GB of RAM (where the FIA data total ~50 GB).
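For reference, that continental-scale workflow looks just like the two-state example above, only with more states (and a much larger download). A sketch, assuming ample disk space and using 'path/to/conus/' as a placeholder directory:
# Download the lower 48 states without loading anything into RAM
conus <- setdiff(datasets::state.abb, c('AK', 'HI'))
getFIA(states = conus, dir = 'path/to/conus/', load = FALSE)
# Point a Remote.FIA.Database at the directory; estimation then proceeds
# state-by-state, keeping memory use modest even at this scale
fiaCONUS <- readFIA('path/to/conus/', inMemory = FALSE)
tpa(fiaCONUS)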
The only challenge that the Remote.FIA.Database presents is that it becomes difficult for users to modify variables in FIA tables (e.g., to make tree size classes). However, it is possible to read in, modify, and save tables of interest prior to setting up a Remote.FIA.Database. For example, we can extend our example above to produce estimates of live tree biomass grouped by stand age classes, where stand age classes can be computed with makeClasses().
# Rather than read all tables into memory, just read those of interest
# In this case, we just need the COND table
modTables <- readFIA(dir = 'path/to/save/', tables = 'COND',
states = c('RI', 'CT'), inMemory = TRUE)
# Now we can modify the COND table in any way we like
# Here we just add a variable that we will want to group by later
modTables$COND$STANDAGEGROUP <- makeClasses(modTables$COND$STDAGE, interval = 50)
# Now we can save our changes to the modified tables on disk with writeFIA
# This will overwrite the COND tables previously stored in our target directory
# And allow us to use our new variables in a subsequent 'Remote.FIA.Database'
writeFIA(modTables, dir = 'path/to/save/', byState = TRUE)
# Now set up the Remote database again
fia <- readFIA('path/to/save/', inMemory = FALSE)
# And produce estimates grouped by our new variable
biomass(fia, grpBy = STANDAGEGROUP)
## # A tibble: 78 × 9
## YEAR STANDAGEGROUP BIO_ACRE CARB_ACRE BIO_ACRE_SE CARB_ACRE_SE nPlots_TREE
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 2005 [-49,1) 12.4 5.88 80.2 80.1 3
## 2 2005 [1,51) 33.6 16.2 14.2 14.2 35
## 3 2005 [101,151) 83.2 39.9 7.39 7.33 18
## 4 2005 [51,101) 74.7 35.9 3.19 3.20 178
## 5 2006 [-49,1) 12.9 6.11 67.9 67.8 3
## 6 2006 [1,51) 35.7 17.2 10.5 10.5 57
## 7 2006 [101,151) 79.9 38.3 6.64 6.58 23
## 8 2006 [51,101) 76.4 36.8 2.65 2.66 261
## 9 2007 [-49,1) 6.19 2.93 91.0 90.9 3
## 10 2007 [1,51) 35.5 17.1 9.45 9.43 74
## # ℹ 68 more rows
## # ℹ 2 more variables: nPlots_AREA <int>, N <int>
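The same read-modify-write pattern applies to any other table. For instance, the tree size classes mentioned above could be added to the TREE table like so (a sketch; SIZECLASS is simply a name we choose here):
# Read only the TREE tables for our states into memory
treeTables <- readFIA(dir = 'path/to/save/', tables = 'TREE',
                      states = c('RI', 'CT'), inMemory = TRUE)
# Add 2-inch diameter classes based on DIA (inches)
treeTables$TREE$SIZECLASS <- makeClasses(treeTables$TREE$DIA, interval = 2)
# Write the modified TREE tables back to disk, rebuild the remote
# database, and group estimates by the new variable
writeFIA(treeTables, dir = 'path/to/save/', byState = TRUE)
fia <- readFIA('path/to/save/', inMemory = FALSE)
biomass(fia, grpBy = SIZECLASS)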
Simple, easy parallelization
All rFIA estimator functions (as well as readFIA() and getFIA()) can be run in parallel using the nCores argument. By default, processing is serial with nCores = 1, although users may find substantial gains in efficiency by increasing nCores.
Parallelization is implemented with the parallel package, using a snow-type cluster on Windows and multicore forking on Unix-alikes (Linux, macOS). Parallel processing may substantially decrease free memory during processing, particularly on Windows, so users should be cautious when running in parallel and consider serial processing (nCores = 1) if computational resources are limited.
# Check the number of cores available on your machine
# Requires the parallel package
parallel::detectCores()
## [1] 16
# On our machine, we have a full 16 cores to play with.
# To speed processing, we will split the workload
# across 3 of these cores using nCores = 3
tpaRI_par <- tpa(fiaRI, nCores = 3)
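(The tpa() call above uses fiaRI, the in-memory example FIA.Database for Rhode Island that ships with rFIA.) The nCores argument works the same way for reading and downloading data, for example:
# Read previously downloaded data using multiple cores
fia <- readFIA('path/to/save/', nCores = 3)
# Or download and load new data in parallel
getFIA(c('RI', 'CT'), dir = 'path/to/save/', nCores = 3)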