# 1 Introduction

Geospatial data can help social scientists and public policy researchers answer many important questions. Often, social researchers are just interested in aggregating geospatial information for use in standard statistical tests (rather than employing tests designed for geospatial data like spatial regression models). As simple as that sounds in theory, in practice it can be made difficult by: (1) managing bugs and technical difficulties (in particular when a person is unfamiliar with these forms of data); (2) finding efficient ways to execute processing power intensive tasks; and (3) ensuring that what you’ve done is reproducible and can be easily checked over by other researchers.

This document seeks to accomplish two goals. First, it serves as documentation for how I generate a dataset that I go on to use in some of my dissertation research. Unfortunately, I cannot make provide actual replication materials for this step of my research process as I’m using some datasets that are free but proprietary. Second, this document will hopefully provide a helpful example for others who are looking for guidance as they perform similar tasks in their own work.

To give a few more details, this document walks through the data generation process for a study comparing effectiveness of two kinds of “sustainable use” protected land areas in Brazil: Extractive Reserves (RESEX) and Sustainable Development Reserves (SDR). I’m interested in whether RESEX areas (which are created through a bottom-up process of local demand) are more effective at slowing forest loss than SDR areas, which are typically created through a more top-down process (some of these can also be created through bottom-up processes, which I account for).

To conduct that analaysis, at a bare minimum I need estimates of forest loss in RESEX and SDR areas over time. It would also be nice to have some control variables available (although I go on to use a fixed-effects specification that renders some of them irrelevant). I’ll provide examples below of how I’ve generated both time-variant and time-invariant controls for each sustainable use area based on various sources of geospatial data. Note that all of the geosptial data I draw on are free to the public (though some require registration and have restrictions on their distribution).

## 1.1 Packages used

Many packages are needed to execute this code. What I have below is my standard preamble for geospatial tasks in R. Not every package below is used in this document, but I’ve found that it’s easier to just migrate the preamble rather than tailor it to the packages I’m using in every specific script.

library(rmarkdown)
library(velox)
library(foreign)
library(xlsx)
library(rgdal)
library(rgeos)
library(maptools)
library(sp)
library(raster)
library(sf)
library(tidyverse)
library(gdalUtils)
library(tmap)
library(pcse)
library(ncdf4)
library(magrittr)
library(lubridate)

## 1.2 Support functions

To help my code read more smoothly and limit my opportunities to make errors, I have a number of manually written “support functions” that I’ve written and stored in a separate R script. Comments in that script describe what they do in more detail.

The snippet of code below loads these support function into R’s environment from where they are stored in the replication folder. This script is available for anyone who wants to take a look. Just download the replication materials for this project.

source(paste(getwd(),"/Support_functions.R",sep=""))
lsf.str() #prints a list of these support functions
## alt_extract : function (polygon, raster)
## alt_extract_avg : function (polygon, raster)
## alt_intersect : function (extentobj, obj)
## assignCRS : function (object, CRS)
## buildyeardata_polygon : function (polygon, tilenames, days)
## buildyeardata_polygon_b : function (polygon, tilenames, days, buffer)
## chooselags_glm_max10 : function (df, thresh)
## gendayvar : function (cday)
## generate_CRS_fromcode : function (code)
## generate_UTM_code : function (object)
## generate_UTM_CRS : function (object)
## loadTerrai_year : function (year, dir)
## manual_velox_mean : function (veloxobj, shapefiles)
## manual_velox_sum : function (veloxobj, shapefiles)
## overlap_tiles : function (polygon, listalltiles)
## pull_rain_tile : function (day, year)
## pull_temp_tile : function (day, year)
## road_density : function (clipped, polygon)
## zonalstats_tab : function (listtiles, days)

Some of the code below involves assigning coordinate reference systems (CRS). Doing this is easier if EPSG codes for those reference systems are loaded in the environment already:

EPSG <- make_EPSG()

I draw shapefiles for the different sustinable use areas from the WDPA dataset managed by Protected Planet. I downloaded shapefiles for the different legal designations separately, and gave them simpler informative names.

resex <- readOGR(dsn = paste(getwd(),"/raw data/wdpa",sep=""), layer = "resex")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\schul\sustainable-use-areas-Brazil\raw data\wdpa", layer: "resex"
## with 93 features
## It has 28 fields
sdr <- readOGR(dsn = paste(getwd(),"/raw data/wdpa",sep=""), layer = "sdr")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\schul\sustainable-use-areas-Brazil\raw data\wdpa", layer: "sdr"
## with 38 features
## It has 28 fields
brazil <- readOGR(dsn = paste(getwd(),"/raw data/gadm36_BRA_shp",sep=""), layer = "gadm36_BRA_0")
## OGR data source with driver: ESRI Shapefile
## with 1 features
## It has 2 fields

Below, I’m reassigning the unique ID variable for each polygon in these shapefiles to be a character variable, rather than a factor variable. This is just a personal preference.

resex$WDPA_PID <- as.character(resex$WDPA_PID)
sdr$WDPA_PID <- as.character(sdr$WDPA_PID)

Next, I’m going to set the start and end years for data collection here at the beginning of my code. These objects “startyr” and “endyr” are used to determine the range of years over which the code below will calculate time-variant outcome measures and covariates.

startyr <- 2004
endyr <- 2016

Finally, note that I’ll also be collecting some variables in a 20km buffer around these project sites. So, I’ll go ahead and set the buffer size here as well (in base unit meters). Creating a buffer in meters involves converting spatial objects to a CRS that uses meters as its base unit. I use the UTM coordinate system, which can provide accurate distances in meters but only within pre-specified zones. I’ve written a function that automatically detects the appropriate UTM zone for an object on the fly based on it’s central point. See the support functions R script mentioned above.

buffer <- 20000

# 3 Terra-i forest loss data (outcome measure)

Depending on what part of the world you’re interested in, there are a number of different sources of information on forest loss available. Generally, these datasets are available in what are called raster layers (e.g., .tif files). These raster layers are created by researchers who identify forest loss patterns within a georeferenced grid system by training an algorithim to process a time series of satellite images. The spatial resolution of the grid is the length (not area!) of one grid cell. Raster layers treat information as if it is uniform within grid cells, but variant across grid cells.

I’ll be using estimates produced by the Terra-i project as my outcome variable. The Terra-i project estimates forest loss every 16 days in cells with a 250m resolution (which means that each cell covers 62500 square meters). Although this is coarser than popular alternatives like the Global Forest Change dataset (which has a resolution of 30m), the tradeoff is that Terra-i estimates provide significantly more longitudinal variation.

To estimate forest loss in each WDPA shapefile every 16 days, we’ll need an object representing the day number for the end of each of these 16 day periods.

days <- seq(1,365,16)
days
##  [1]   1  17  33  49  65  81  97 113 129 145 161 177 193 209 225 241 257 273 289
## [20] 305 321 337 353

Using these data requires downloading each of the separate tiles that collectively cover Brazil, once for each year. This is a little time consuming, but not enough to really require automation. However, the need to effeciently aggregate information across so many different files DOES make it tricky to write a script that efficiently processes all these data. Another complication to keep in mind is the way Terra-i data is stored: each non-zero cell value represents the last day of the 16-day period in which forest cover in a cell was cleared.

The script below walks through a pipeline that adds up the amount of forest loss every 16 days in each of the SDR shapefiles between 2004 and 2016. I employ lapply() and purrr::map to iterate tasks across days, years, and shapefiles, rather than use something like a for() loop that could end up being much messier.

Note that in my case this code still takes some time to run (~30 minutes), even with the effort I put into streamlining. This is mostly due to the amount of time it takes R to crop/mask the Terra-i tiles to the boundaries of each polygon. While it’s likely that there are quicker solutions I haven’t thought of, it’s also important to notice that geospatial tasks can sometimes be unavoidably time-intensive. Using other software like QGIS or the Earth Engine could speed things up more (I HIGHLY recommend checking out the Earth Engine in particular), but that loses the advantages of keeping as much of the data preparation as possible contained in one script.

#store sdr as an object called shapefiles
#makes applying this to RESEX as well easier
shapefiles <- sdr

#begin with list of years between the start and end years
sdrdat <- as.list(startyr:endyr) %>%

#apply the code within ~{  } to each element in the list separately,
#with .x representing individual elements of the list of years
purrr::map(~{

#support function that loads all Terra-i tiles
#needed to cover Brazil in a particular year

#uses lapply to iterate tasks within the inner { }
#across elements in a list of the different polygons (SDRs)
#in the object "shapefiles." Here, z represents individual polygons.
lapply( split(shapefiles,f=(c(1:nrow(shapefiles)))), function(z)

#start a new pipe within a pipe,
#to represent the data processing tasks performed for each
#polygon within a particular year
{ z %>%

#a support function I wrote to create a frequency table
#representing the number of loss events within a polygon-year.
#the table is organized based on the end day of the 16-day period
#forest loss was observed within. output is a tibble object.
buildyeardata_polygon(tiles, days) %>%

#converts the variable "name" to numeric,
#creates a year variable
dplyr::mutate(name = as.numeric(z$WDPA_PID), year=.x) } ) %>% #stack elements from separate polygons into one object reduce(bind_rows) } ) %>% #stack elements from separate years into one final object reduce(bind_rows) sdrdat ## # A tibble: 11,856 x 4 ## value freq name year ## <dbl> <dbl> <dbl> <int> ## 1 0 52491 555576393 2004 ## 2 1 0 555576393 2004 ## 3 17 0 555576393 2004 ## 4 33 0 555576393 2004 ## 5 49 0 555576393 2004 ## 6 65 0 555576393 2004 ## 7 81 0 555576393 2004 ## 8 97 0 555576393 2004 ## 9 113 0 555576393 2004 ## 10 129 0 555576393 2004 ## # ... with 11,846 more rows The column “value” in the output above represents the day numbers for the end of each period. The column “name” represents an identifier code for the SDR polygon. This snippet shows that the first polygon did not see any forest clearing (at least at a large-enough scale to appear in the Terra-i data) during the first 129 days of 2004. I store the end result of the code above on my machine as an RDS object, so that the raw output can be examined later. saveRDS( sdrdat, paste(getwd(),"/data for analysis/preprocessed/SDRdat_", startyr,"_",endyr,".rds",sep="") ) I now repeat this process for the RESEX areas. As you can see, the only change needed is reassigning the object “shapefiles” in the first line. When repeating similar tasks, it’s important to make as few changes in the code as possible. #substitute in a new object here shapefiles <- resex #same as above resexdat <- as.list(startyr:endyr) %>% purrr::map(~{ loadTerrai_year(.x,paste(getwd(),"/raw data/terrai",sep="")) lapply( split(shapefiles,f=(c(1:nrow(shapefiles)))), function(z) { z %>% buildyeardata_polygon(tiles, days) %>% dplyr::mutate(name = as.numeric(z$WDPA_PID), year=.x) } ) %>%

reduce(bind_rows)

} ) %>%

reduce(bind_rows)

resexdat
## # A tibble: 29,016 x 4
##    value   freq   name  year
##    <dbl>  <dbl>  <dbl> <int>
##  1     0 114058 478420  2004
##  2     1      1 478420  2004
##  3    17      0 478420  2004
##  4    33      0 478420  2004
##  5    49      4 478420  2004
##  6    65      1 478420  2004
##  7    81      1 478420  2004
##  8    97      1 478420  2004
##  9   113      1 478420  2004
## 10   129      0 478420  2004
## # ... with 29,006 more rows
saveRDS( resexdat, paste(getwd(),"/data for analysis/preprocessed/RESEXdat_",
startyr,"_",endyr,".rds",sep="") )

# 4 Initial data preparation

Before continuing with the other geospatial data, I’ll do some initial setup using the shapefiles and the outcome measures I just generated. The code below creates a temporary object that stores some basic identifying information for use in a merge command.

#polygon ID codes and official status year for each SDR
temp <- as_tibble( cbind( as.numeric(sdr$WDPA_PID), sdr$STATUS_YR ) )
colnames(temp) <- c("name","startyear")

#polygon ID codes and status year for RESEX areas.
temp2 <- as_tibble( cbind( as.numeric(resex$WDPA_PID), resex$STATUS_YR ) )
colnames(temp2) <- c("name","startyear")

temp <- bind_rows(temp, temp2)

The next section of code takes that temporary object (“temp”) and combines it with the objects “resexdat” and “sdrdat” to create the skeleton of my final dataset.

#adding an indicator that equals 1 for RESEX areas
resexdat <- dplyr::mutate(resexdat, resex = 1)

#creating an object "dat" as a modification of "sdrdat"
dat <- sdrdat %>%

#the indicator equals 0 for SDR areas
dplyr::mutate(resex = 0) %>%

#stacking the SDR and RESEX data on top of each other
dplyr::bind_rows(resexdat) %>%

#joining in the identifying information
dplyr::left_join(temp) %>%

#a few other manipulations
dplyr::rename(day = value) %>%

dplyr::mutate(active = ifelse(year >= startyear,1,0)) %>%

dplyr::select(-startyear)

#the easiest way to get a variable for the number of cells in each polygon,
#at least within this pipeline
dat <- dat %>%

dplyr::filter(year == startyr) %>%

group_by(name) %>%

dplyr::summarise(ncell = sum(freq)) %>%

#joining this new variable to the dataset created above
dplyr::right_join(dat) %>%

#renaming the outcome measure
dplyr::rename(losscells = freq) %>%

dplyr::group_by(name) %>%

#filtering out observations with day=0,
#which are no longer necessary now that
#the "ncells" variable is created
dplyr::filter(day > 0) %>%

#some final manipulations
dplyr::mutate(period = row_number(),
active_resex = active*resex)

#simpler numeric ids for each sustainable use area
dat$unit <- as.numeric(as.factor(dat$name))

It will be helpful later to have a variable representing the year a sustainable use area’s current status was officially passed into law (in most cases, this represents when it was created). We can construct this from the shapefiles, which are based on information provided to Protected Planet by the Brazilian government.

startyears <- as_tibble(
rbind(
cbind(
sdr$STATUS_YR, as.numeric(as.character(sdr$WDPA_PID))
),
cbind(
resex$STATUS_YR, as.numeric(as.character(resex$WDPA_PID))
)
)
)

colnames(startyears) <- c("startyear","name")

startyears
## # A tibble: 131 x 2
##    startyear      name
##        <dbl>     <dbl>
##  1      2006 555576393
##  2      2005    352158
##  3      2006    478539
##  4      2002    478528
##  5      2003    352059
##  6      2008 555576217
##  7      2013    478502
##  8      2008 555576229
##  9      2016 555636599
## 10      2013    478537
## # ... with 121 more rows
dat <- dat %>%

dplyr::left_join(startyears, by="name")

Dummy variables indicating the state of Brazil a sustainable use area is in and whether it is federally or state-managed will also be useful. This information is in the Protected Planet shapefiles as well.

#transforming factor to character variables
sdr$GOV_TYPE <- as.character(sdr$GOV_TYPE)
resex$GOV_TYPE <- as.character(resex$GOV_TYPE)

dat <- rbind(cbind(sdr$GOV_TYPE,as.character(sdr$WDPA_PID)),
cbind(resex$GOV_TYPE,as.character(resex$WDPA_PID))) %>%

#convert to tibble
as_tibble %>%

#modify this information into dummy variable form
dplyr::mutate(federal = ifelse(V1=="Sub-national ministry or agency",0,1),
name = as.numeric(V2)) %>%

set_colnames(c("remove", "delete", "federal", "name")) %>%

#subset by columns
dplyr::select(c(federal, name)) %>%

#join into our dataset
dplyr::right_join(dat)

#this repeats a similar process for state names
dat <- rbind(cbind(as.character(sdr$SUB_LOC),as.character(sdr$WDPA_PID)),
cbind(as.character(resex$SUB_LOC),as.character(resex$WDPA_PID))) %>%

as_tibble %>%

dplyr::mutate(name = as.numeric(V2)) %>%

set_colnames(c("state", "remove", "name")) %>%

dplyr::select(c(state, name)) %>%

dplyr::right_join(dat)

#trimming the character strings
dat$state <- gsub("BRA-","",gsub("BR-", "", dat$state))

#converting to a matrix of dummy variables
dummies <- as_tibble(fastDummies::dummy_cols(dat$state, split=",")) colnames(dummies)[1] <- "state" colnames(dummies) <- gsub(".data_","",colnames(dummies)) #and joining into the main dataset dat <- dat %>% dplyr::left_join(distinct(dummies)) dat ## # A tibble: 39,169 x 34 ## state name federal ncell day losscells year resex active period ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> ## 1 AM 5.56e8 0 52492 1 0 2004 0 0 1 ## 2 AM 5.56e8 0 52492 17 0 2004 0 0 2 ## 3 AM 5.56e8 0 52492 33 0 2004 0 0 3 ## 4 AM 5.56e8 0 52492 49 0 2004 0 0 4 ## 5 AM 5.56e8 0 52492 65 0 2004 0 0 5 ## 6 AM 5.56e8 0 52492 81 0 2004 0 0 6 ## 7 AM 5.56e8 0 52492 97 0 2004 0 0 7 ## 8 AM 5.56e8 0 52492 113 0 2004 0 0 8 ## 9 AM 5.56e8 0 52492 129 0 2004 0 0 9 ## 10 AM 5.56e8 0 52492 145 0 2004 0 0 10 ## # ... with 39,159 more rows, and 24 more variables: active_resex <dbl>, ## # unit <dbl>, startyear <dbl>, AM <int>, PA <int>, MG <int>, SP <int>, ## # ES <int>, AP <int>, SC <int>, RJ <int>, RN <int>, AC <int>, RO <int>, ## # MA <int>, GO <int>, AL <int>, CE <int>, BA <int>, PB <int>, PE <int>, ## # PI <int>, TO <int>, RR <int> Now that this is ready, I’ll save it to return to later when the rest of the variables are available to merge in. I’ll also clean up some temporary objects from the environment that aren’t useful anymore. #saving to return to this object later saveRDS( dat, paste(getwd(),"/data for analysis/preprocessed/base_dataset.rds",sep="") ) rm(dummies, startyears, temp, temp2) # 5 Time variant covariates To account for seasonal variation over time across these sustainable use areas (which is important to consider as a proxy for seasonal economic activity), I’ll collect information on rainfall and temperature within the same time periods for which I’ve collected forest loss data. I draw on the PERSIANN-CDR Precipitation Dataset dataset, and the CPC Global Daily Temperature Dataset to do so. Each of these are stored in a raster (grid) format like the Terra-i forest loss data. I need a slightly different “days” object now representing every day of the year: days <- seq(1,365,1) ## 5.1 Rainfall I start with rainfall. My code for this step follows a similar structure to above. Once again, I walk through the different steps in some comments below: #to minimize chances for error, I simply replace the #shapefiles object again as above shapefiles <- sdr #beginning our pipe with a list of years sdrdat <- as.list(startyr:endyr) %>% #applying functions separate to each year in this list purrr::map( ~{ #within each year, applying a series of function separately for each day lapply( as.list(days), function(z) { #running a manually written function that pulls rainfall tiles #for each day/year combination pull_rain_tile(z,.x) %>% #ensuring a CRS match assignCRS(crs(shapefiles)) %>% #converting the rainfall tiles to a "velox" object velox() %>% #a manual function to calculate zonal statistics #for velox objects within a pipe much more quickly #than raster::extract() would allow manual_velox_sum(shapefiles) %>% #some other manipulations as_tibble() %>% dplyr::rename(rain = 2) %>% base::replace(.,is.na(.), 0) %>% dplyr::mutate(ID_sp = as.numeric(as.character(ID_sp))) %>% #aggregating rainfall information in cases where #one SDR area is actually represented by multiple polygons dplyr::group_by(ID_sp) %>% dplyr::summarise(sumrain = sum(rain)) %>% dplyr::mutate(name = as.numeric(shapefiles$WDPA_PID[ID_sp+1])) %>%

dplyr::select(-ID_sp) %>%

#adding in year and day variables
dplyr::mutate(year = .x,
day = z)

} ) %>%

#stacking together days
reduce(bind_rows)

} ) %>%

#stacking together years
reduce(bind_rows)

saveRDS( sdrdat, paste(getwd(),"/data for analysis/preprocessed/PREAGR_rain_SDRdat_",
startyr,"_",endyr,".rds",sep="") )

sdrdat %>%

arrange(name, year, day)
## # A tibble: 180,310 x 4
##    sumrain  name  year   day
##      <dbl> <dbl> <int> <dbl>
##  1    83.1 19769  2004     1
##  2   161.  19769  2004     2
##  3   430.  19769  2004     3
##  4   605.  19769  2004     4
##  5    20.3 19769  2004     5
##  6   185.  19769  2004     6
##  7   748.  19769  2004     7
##  8   211.  19769  2004     8
##  9   639.  19769  2004     9
## 10    59.5 19769  2004    10
## # ... with 180,300 more rows

This code has produced data on the amount of rainfall in each SDR area every day between 2004 and 2016. The final merging code towards the end of the document will join this information into the data on forest loss during the 16 day Terra-i periods.

The next snippet does the same thing for RESEX areas.

shapefiles <- resex

resexdat <- as.list(startyr:endyr) %>%

purrr::map( ~{

lapply( as.list(days), function(z) {

pull_rain_tile(z,.x) %>%

assignCRS(crs(shapefiles)) %>%

velox() %>%

manual_velox_sum(shapefiles) %>%

as_tibble() %>%

dplyr::rename(rain = 2) %>%

base::replace(.,is.na(.), 0) %>%

dplyr::mutate(ID_sp = as.numeric(as.character(ID_sp))) %>%

dplyr::group_by(ID_sp) %>%

dplyr::summarise(sumrain = sum(rain)) %>%

dplyr::mutate(name = as.numeric(shapefiles$WDPA_PID[ID_sp+1])) %>% dplyr::select(-ID_sp) %>% dplyr::mutate(year = .x, day = z) } ) %>% reduce(bind_rows) } ) %>% reduce(bind_rows) saveRDS( resexdat, paste(getwd(),"/data for analysis/preprocessed/PREAGR_rain_RESEXdat_", startyr,"_",endyr,".rds",sep="") ) ## 5.2 Temperature I now use a similar procedure to aggregate information on the average temperature in each SDR and RESEX area every day between 2004 and 2016. The code is similar enough that I have left it mostly uncommented. shapefiles <- sdr sdrdat <- as.list(startyr:endyr) %>% purrr::map( ~{ lapply( as.list(days), function(z) { pull_temp_tile(z,.x) %>% #one important change for temperature is that we need #to "rotate" the coordinate system to overlay the #shapefiles over the temperature grid properly rotate() %>% assignCRS(crs(shapefiles)) %>% velox() %>% manual_velox_mean(shapefiles) %>% as_tibble() %>% dplyr::rename(temp = 2) %>% base::replace(.,is.na(.), 0) %>% dplyr::mutate(ID_sp = as.numeric(as.character(ID_sp))) %>% dplyr::group_by(ID_sp) %>% #taking the average temperature within a shapefile, #rather than the sum of rainfall as above dplyr::summarise(avgtemp = mean(temp)) %>% dplyr::mutate(name = as.numeric(shapefiles$WDPA_PID[ID_sp+1])) %>%

dplyr::select(-ID_sp) %>%

dplyr::mutate(year = .x,
day = z)

} ) %>%

reduce(bind_rows)

} ) %>%

reduce(bind_rows)

saveRDS( sdrdat, paste(getwd(),"/data for analysis/preprocessed/PREAGR_temp_SDRdat_",
startyr,"_",endyr,".rds",sep="") )

sdrdat %>%

arrange(name, year, day)
## # A tibble: 180,310 x 4
##    avgtemp  name  year   day
##      <dbl> <dbl> <int> <dbl>
##  1    31.2 19769  2004     1
##  2    31.8 19769  2004     2
##  3    33.8 19769  2004     3
##  4    29.6 19769  2004     4
##  5    34.0 19769  2004     5
##  6    30.9 19769  2004     6
##  7    33.2 19769  2004     7
##  8    32.0 19769  2004     8
##  9    29.6 19769  2004     9
## 10    26.0 19769  2004    10
## # ... with 180,300 more rows
shapefiles <- resex

resexdat <- as.list(startyr:endyr) %>%

purrr::map( ~{

lapply( as.list(days), function(z) {

pull_temp_tile(z,.x) %>%

rotate() %>%

assignCRS(crs(shapefiles)) %>%

velox() %>%

manual_velox_mean(shapefiles) %>%

as_tibble() %>%

dplyr::rename(temp = 2) %>%

base::replace(.,is.na(.), 0) %>%

dplyr::mutate(ID_sp = as.numeric(as.character(ID_sp))) %>%

dplyr::group_by(ID_sp) %>%

dplyr::summarise(avgtemp = mean(temp)) %>%

dplyr::mutate(name = as.numeric(shapefiles$WDPA_PID[ID_sp+1])) %>% dplyr::select(-ID_sp) %>% dplyr::mutate(year = .x, day = z) } ) %>% reduce(bind_rows) } ) %>% reduce(bind_rows) saveRDS( resexdat, paste(getwd(),"/data for analysis/preprocessed/PREAGR_temp_RESEXdat_", startyr,"_",endyr,".rds",sep="") ) resexdat %>% dplyr::arrange(name, year, day) ## # A tibble: 441,285 x 4 ## avgtemp name year day ## <dbl> <dbl> <int> <dbl> ## 1 27.9 31774 2004 1 ## 2 30.3 31774 2004 2 ## 3 32.2 31774 2004 3 ## 4 33.2 31774 2004 4 ## 5 30.5 31774 2004 5 ## 6 28.5 31774 2004 6 ## 7 29.8 31774 2004 7 ## 8 32.4 31774 2004 8 ## 9 31.9 31774 2004 9 ## 10 31.9 31774 2004 10 ## # ... with 441,275 more rows ## 5.3 Alternative forest loss measure The last time variant factor I collect data on is actually an alternative outcome measure: yearly estimates of forest clearing in each sustainable use area drawn from the Global Forest Change dataset mentioned earlier. In this case, I have outsourced calculations to the Google Earth Engine for processing time reasons, but I load the results into this pipeline so I can merge the raw Earth Engine output with the rest of my data. It will be interesting to compare these yearly estimates (generated from higher resolution satellite images) to yearly estimates generated based on the Terra-i forest loss data. The script I run in the Earth Engine to produce these data is available in the replication materials for this study. #load files in the folder containing yearly forest loss estimates, #derived from the GFC landsat data, each in a separate .csv. files <- list.files(path = paste(getwd(),"/raw data/hansen/earth engine export", sep=""), pattern = "*.csv", full.names = T) tbl <- lapply(files, read_csv) hansen <- tbl %>% #subsetting to just the relevent columns lapply(function(z) { z <- z[,colnames(z) %in% c("sum","WDPA_PID","Year")] }) %>% #joining the separate sets of estimates together row by row reduce(bind_rows) #converting to square kilometers from square meters hansen$hansen_loss <- hansen$sum / 1000000 #some other minor modifications hansen <- hansen %>% dplyr::rename(name = WDPA_PID, year = Year) %>% dplyr::select(-c(sum)) saveRDS( hansen, paste(getwd(),"/data for analysis/preprocessed/hansen.rds",sep="") ) hansen ## # A tibble: 2,227 x 3 ## name year hansen_loss ## <dbl> <dbl> <dbl> ## 1 352151 2001 0.0336 ## 2 555576283 2001 0.361 ## 3 352135 2001 3.00 ## 4 352163 2001 0.00176 ## 5 33714 2001 0.0799 ## 6 352160 2001 0.00393 ## 7 352166 2001 0.00601 ## 8 352164 2001 0.0403 ## 9 352171 2001 0.0166 ## 10 352167 2001 0.000522 ## # ... with 2,217 more rows # 6 Time invariant covariates Lastly, I collect information on a number of factors that I do not allow to vary over time, either because recording longitudinal information accurately would not feasible (as with shapefiles of roads, for instance, which are inconsistently updated) or to avoid posttreatment bias in my statistical tests (as with nighttime lights data as a proxy for economic development). ## 6.1 Road length I use the Global Roads Open Access Dataset to calculate the length of roads (in km) within 20 km of each sustainable use area. The more roads there are in the area, the more easily accessible forests are to large scale clearing. #loading the roads shapefile roads <- readOGR(dsn = paste(getwd(),"/raw data/groads-v1-americas-shp",sep=""), layer = "gROADS-v1-americas") ## OGR data source with driver: ESRI Shapefile ## Source: "C:\Users\schul\sustainable-use-areas-Brazil\raw data\groads-v1-americas-shp", layer: "gROADS-v1-americas" ## with 23781 features ## It has 36 fields #I start by "splitting" sdr into a list of separate polygons roaddat <- split(sdr, f=(1:nrow(sdr))) %>% #adding a list of resex polygons as well append(split(resex,f=(39:(38+nrow(resex))))) %>% #applying a set of transformations to every element of this list purrr::map(~{ utmcrs <- generate_UTM_CRS(.x) #reassigns the shapefile to a UTM crs, which allows #creating a buffer in meters spTransform(.x, utmcrs) %>% raster::buffer(width=buffer,dissolve=TRUE) %>% #keeps sections of roads within this buffer, drops all others #manually written function to use intersect() in a pipe. #it just takes the arguments in a different order alt_intersect(spTransform(roads, utmcrs)) %>% #a manually writen function that calculates the length #of a clipped shapefile of roads roadlength() }) %>% #create a tibble purrr::map(~{tibble::enframe(.x,name=NULL)}) %>% #stack the results for different polygons reduce(bind_rows) %>% dplyr::mutate( name = c(as.numeric(sdr$WDPA_PID),as.numeric(resex$WDPA_PID)) ) rm(roads) roaddat$value <- roaddat$value/1000 saveRDS( roaddat, paste(getwd(),"/data for analysis/preprocessed/road_dat_", startyr,"_",endyr,".rds",sep="") ) roaddat ## # A tibble: 131 x 2 ## value name ## <dbl> <dbl> ## 1 0 555576393 ## 2 0 352158 ## 3 0 478539 ## 4 56.5 478528 ## 5 145. 352059 ## 6 49.8 555576217 ## 7 71.3 478502 ## 8 62.0 555576229 ## 9 26.3 555636599 ## 10 118. 478537 ## # ... with 121 more rows ## 6.2 Population Next, I collect estimates of population within 20km of these sustainable use areas. This is particularly important to collect in a reasonably large buffer, since people living outside the sustainable use area may clear forest land inside its boundaries. I use a dataset of population estimates in raster format: the Gridded Populations of the World. Both these data and the roads dataset are hosted by NASA’s Socioeconomic Data and Applications Center. Some of this code is similar to above. I use comments to point out anything new. brazilpop <- raster(paste(getwd(),"/raw data/population/brazilpop_32718.tif",sep="")) popdat <- split(sdr, f=(1:nrow(sdr))) %>% append(split(resex,f=(39:(38+nrow(resex))))) %>% purrr::map(~{ spTransform(.x, generate_UTM_CRS(.x)) %>% raster::buffer(width=buffer,dissolve=TRUE) %>% sp::spTransform(crs(brazilpop)) %>% #I used a manually written function here #to reverse the argument order of raster::extract, #which lets me use it in a pipe. This is a handy trick. alt_extract(brazilpop) }) %>% purrr::map(~{as_tibble(.x)}) %>% reduce(bind_rows) %>% dplyr::mutate( name = c(as.numeric(sdr$WDPA_PID),as.numeric(resex$WDPA_PID)) ) %>% dplyr::select(-(ID)) %>% dplyr::rename(estpop2000 = 1) rm(brazilpop) saveRDS( popdat, paste(getwd(),"/data for analysis/preprocessed/pop_dat_", startyr,"_",endyr,".rds",sep="") ) popdat ## # A tibble: 131 x 2 ## estpop2000 name ## <dbl> <dbl> ## 1 18093. 555576393 ## 2 159. 352158 ## 3 16067. 478539 ## 4 114219. 478528 ## 5 11244. 352059 ## 6 8686. 555576217 ## 7 71278. 478502 ## 8 10417. 555576229 ## 9 17926. 555636599 ## 10 41182. 478537 ## # ... with 121 more rows ## 6.3 Nighttime lights Next, I aggregate information on nighttime luminosity around these sustainable use areas, as a measure of infrastructure development in the surrounding area. Once again, these measures of nighttime luminosity are initially stored in a raster format, so I need to take the average luminosity within each sustainable use area. I use data from the DMPS DMSP-OLS Version 4 dataset provided by the NOAA (in particular their cloud free composites). lights <- raster( paste(getwd(),"/raw data/lights/lightclip.tif",sep="") ) lightdat <- split(sdr, f=(1:nrow(sdr))) %>% append(split(resex,f=(39:(38+nrow(resex))))) %>% purrr::map(~{ spTransform(.x, generate_UTM_CRS(.x)) %>% raster::buffer(width=buffer,dissolve=TRUE) %>% sp::spTransform(crs(lights)) %>% alt_extract_avg(lights) }) %>% purrr::map(~{as_tibble(.x)}) %>% reduce(bind_rows) %>% dplyr::mutate( name = c(as.numeric(sdr$WDPA_PID),as.numeric(resex$WDPA_PID)) ) %>% dplyr::select(-(ID)) %>% dplyr::rename(lights2004 = 1) rm(lights) saveRDS( lightdat, paste(getwd(),"/data for analysis/preprocessed/light_dat_", startyr,"_",endyr,".rds",sep="") ) lightdat ## # A tibble: 131 x 2 ## lights2004 name ## <dbl> <dbl> ## 1 37.5 555576393 ## 2 43.4 352158 ## 3 38.2 478539 ## 4 55.9 478528 ## 5 56.4 352059 ## 6 51.4 555576217 ## 7 29.9 478502 ## 8 50.5 555576229 ## 9 60.5 555636599 ## 10 54.6 478537 ## # ... with 121 more rows ## 6.4 Distance to towns/cities Finally, I draw on municipality coordinates in Brazil provided by the World Bank to estimate the distance between each sustainable use area and the nearest population center. In this case, I take the centroid of each sustainable use area polygon, and identify the distance (in kilometers) between it and the nearest city/town in this list of coordinates. cities <- readOGR(dsn = paste(getwd(),"/raw data/city",sep=""), layer = "cities") ## OGR data source with driver: ESRI Shapefile ## Source: "C:\Users\schul\sustainable-use-areas-Brazil\raw data\city", layer: "cities" ## with 5565 features ## It has 32 fields ## Integer64 fields read as strings: tessellate extrude visibility drawOrder citydat <- split(sdr, f=(1:nrow(sdr))) %>% append(split(resex,f=(39:(38+nrow(resex))))) %>% purrr::map(~{ utmcrs <- generate_UTM_CRS(.x) spTransform(.x, utmcrs) %>% #fixes some "orphaned holes" that would cause an error below raster::buffer(width=0,dissolve=TRUE) %>% gDistance(spTransform(cities, utmcrs),byid=TRUE) %>% sort(.,decreasing=FALSE) %>% first() }) %>% purrr::map(~{tibble::enframe(.x,name=NULL)}) %>% reduce(bind_rows) %>% dplyr::mutate( name = c(as.numeric(sdr$WDPA_PID),as.numeric(resex$WDPA_PID)) ) rm(cities) citydat$value <- citydat$value/1000 colnames(citydat)[1] <- "kmtocity" saveRDS( citydat, paste(getwd(),"/data for analysis/preprocessed/city_dat_", startyr,"_",endyr,".rds",sep="") ) citydat ## # A tibble: 131 x 2 ## kmtocity name ## <dbl> <dbl> ## 1 3.31 555576393 ## 2 72.8 352158 ## 3 9.57 478539 ## 4 7.63 478528 ## 5 32.9 352059 ## 6 19.0 555576217 ## 7 12.1 478502 ## 8 13.7 555576229 ## 9 3.51 555636599 ## 10 11.1 478537 ## # ... with 121 more rows # 7 Merging The next step is joining all of the other variables with the forest loss outcome measure (the Terra-i data) to create a final dataset. For the rain and temperature date, this requires aggregating the daily information over 16 day periods. I calculate the total amount of rain in each window of time, but for temperature I take an average. Let’s start by reloading those RDS files that store the results from the code above: days <- seq(1,353,16) sdr_rain <- readRDS(paste(getwd(),"/data for analysis/preprocessed/PREAGR_rain_SDRdat_2004_2016.rds",sep="")) resex_rain <- readRDS(paste(getwd(),"/data for analysis/preprocessed/PREAGR_rain_RESEXdat_2004_2016.rds",sep="")) sdr_temp <- readRDS(paste(getwd(),"/data for analysis/preprocessed/PREAGR_temp_SDRdat_2004_2016.rds",sep="")) resex_temp <- readRDS(paste(getwd(),"/data for analysis/preprocessed/PREAGR_temp_RESEXdat_2004_2016.rds",sep="")) Next, I use another manually written function called “gendayvar()” to create variables I can use to properly summarise data with the 16 day periods used by my outcome measure. I do this for both the rainfall and temperature measures across the SDR and RESEX areas. sdr_rain <- sdr_rain %>% dplyr::arrange(name, year, day) sdr_rain$dayagr <- unlist(gendayvar(sdr_rain$day)) sdr_rain$newyear <- sdr_rain$year sdr_rain$newyear <- ifelse(sdr_rain$day > 353, sdr_rain$newyear+1,
sdr_rain$newyear) sdr_rain <- sdr_rain %>% dplyr::group_by(name, newyear, dayagr) %>% dplyr::summarise(sumrain = sum(sumrain)) %>% dplyr::rename(year = newyear, day = dayagr) resex_rain <- resex_rain %>% dplyr::arrange(name, year, day) resex_rain$dayagr <- unlist(gendayvar(resex_rain$day)) resex_rain$newyear <- resex_rain$year resex_rain$newyear <- ifelse(resex_rain$day > 353, resex_rain$newyear+1,
resex_rain$newyear) resex_rain <- resex_rain %>% dplyr::group_by(name, newyear, dayagr) %>% dplyr::summarise(sumrain = sum(sumrain)) %>% dplyr::rename(year = newyear, day = dayagr) Then I load the rest of the data: citydat <- readRDS(paste(getwd(), "/data for analysis/preprocessed/city_dat_2004_2016.rds", sep="")) lightdat <- readRDS(paste(getwd(), "/data for analysis/preprocessed/light_dat_2004_2016.rds", sep="")) popdat <- readRDS(paste(getwd(), "/data for analysis/preprocessed/pop_dat_2004_2016.rds", sep="")) sdr_buffer_loss <- readRDS(paste(getwd(), "/data for analysis/preprocessed/bSDRdat_2004_2016.rds", sep="")) sdr_buffer_loss <- sdr_buffer_loss %>% dplyr::filter(value > 0) %>% dplyr::rename(day = value, blosscells = freq) resex_buffer_loss <- readRDS(paste(getwd(), "/data for analysis/preprocessed/bRESEXdat_2004_2016.rds", sep="")) resex_buffer_loss <- resex_buffer_loss %>% dplyr::filter(value > 0) %>% dplyr::rename(day = value, blosscells = freq) roaddat <- readRDS(paste(getwd(), "/data for analysis/preprocessed/road_dat_2004_2016.rds", sep="")) roaddat <- roaddat %>% dplyr::rename(roads=value) hansen <- readRDS(paste(getwd(), "/data for analysis/preprocessed/hansen.rds", sep="")) hansen <- hansen %>% filter(year <= 2016) #data on tree cover in 2000 (in sq km) in each of these areas #created alongside the alternative DV above. tc2000 <- read.csv(paste(getwd(), "/raw data/hansen/tc2000.csv", sep="")) %>% as_tibble() dat <- readRDS(paste(getwd(), "/data for analysis/preprocessed/base_dataset.rds", sep="")) Now, it’s time to actually join everything together. dat <- dat %>% left_join(bind_rows(sdr_rain,resex_rain)) %>% left_join(bind_rows(sdr_temp,resex_temp)) %>% left_join(citydat) %>% left_join(lightdat) %>% left_join(popdat) %>% left_join(roaddat) %>% left_join(tc2000) %>% left_join(hansen) %>% left_join(sdr_buffer_loss) %>% left_join(resex_buffer_loss) dat ## # A tibble: 39,169 x 43 ## state name federal ncell day losscells year resex active period ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 AM 5.56e8 0 52492 1 0 2004 0 0 1 ## 2 AM 5.56e8 0 52492 17 0 2004 0 0 2 ## 3 AM 5.56e8 0 52492 33 0 2004 0 0 3 ## 4 AM 5.56e8 0 52492 49 0 2004 0 0 4 ## 5 AM 5.56e8 0 52492 65 0 2004 0 0 5 ## 6 AM 5.56e8 0 52492 81 0 2004 0 0 6 ## 7 AM 5.56e8 0 52492 97 0 2004 0 0 7 ## 8 AM 5.56e8 0 52492 113 0 2004 0 0 8 ## 9 AM 5.56e8 0 52492 129 0 2004 0 0 9 ## 10 AM 5.56e8 0 52492 145 0 2004 0 0 10 ## # ... with 39,159 more rows, and 33 more variables: active_resex <dbl>, ## # unit <dbl>, startyear <dbl>, AM <int>, PA <int>, MG <int>, SP <int>, ## # ES <int>, AP <int>, SC <int>, RJ <int>, RN <int>, AC <int>, RO <int>, ## # MA <int>, GO <int>, AL <int>, CE <int>, BA <int>, PB <int>, PE <int>, ## # PI <int>, TO <int>, RR <int>, sumrain <dbl>, avgtemp <dbl>, kmtocity <dbl>, ## # lights2004 <dbl>, estpop2000 <dbl>, roads <dbl>, tc2000sqkm <dbl>, ## # hansen_loss <dbl>, blosscells <dbl> There are some variables here that are more useful when transformed to log scale. However, this isn’t possible in cases where variables have values of 0 (for instance, they might not have been rain for some 16 day periods). In those cases, I use a cube root transformation. dat <- dat %>% dplyr::mutate(lpop2000 = log(estpop2000), ltc2000 = log(tc2000sqkm), crdistcity = (kmtocity)^(1/3), crroads = (roads)^(1/3)) saveRDS( dat, paste(getwd(),"/data for analysis/postprocessed/largesample.rds",sep="") ) # 8 Final data preparation tasks There are a few more things to take care of before these data are ready for exploration and analysis ## 8.1 Trimming I’m going to use a “within” regression model (i.e., fixed effects) to compare the pre/post change in forest loss in SDR areas to the pre/post change in RESEX areas. This helps me control for time-invariant differences across units that would otherwise bias my estimates. But, using a within model the way I plan to requires outcome information both before and after a sustainable use area was established. So, I trim my sample down to the cases I can actually use in a fixed effects regression. #keep those starting after 2002 law #standardizing governance of SDR and RESEX areas ldat <- dat %>% dplyr::filter(startyear >= 2002) #identify units whose active status doesn't change #in the range of Terra-i data availabiility constant <- as.list(unique(ldat$unit)) %>%

purrr::map(~{ifelse(length(unique(ldat$active[ldat$unit == .x]))>1,1,0)}) %>%

unlist()

#modify the list into a more useful form
constant <- constant*unique(ldat$unit) constant <- unique(constant) constant <- sort(constant, decreasing = FALSE) constant <- constant[-1] #subset data based on the list ldat <- ldat[ldat$unit %in% constant,]

rm(constant)

## 8.2 Independent coding of cases

What I really care about is the difference between sustainable use areas created through a bottom-up process of local mobilization, and those created from the top-down. RESEX areas are guaranteed to be bottom up. In contrast, while most SDR areas were created through a top down process, a few were actually the result of a local social movement or petition as well.

That problem in mind, I researched the history of each sustainable use area still remaining (after trimming by outcome data availability above), to identify those SDR areas that needed to be coded as bottom-up instead. While doing this, I collected some additional information, too. The Protected Planet data provides the year a sustainable use area’s legal status was updated, but not the precise day. But I have outcome measures every 16 days, which means that some observations are falsely coded as “active” right now! So, while looking at the histories of these sustainable use areas, I also recorded more precise start dates. Finally, I used this as an opportunity to double check that each sustainable use area in my dataset was actually created to slow down forest clearing. I identified a number of cases that neeed to be dropped from my sample because they were created to accomplish different goals.

Below, I load information from this coding process (stored in a .csv), use it to further trim my data, and then create a proper treatment variable for my study (“topdown”).

#loading coding of each area type

#creating a list of cases to exclude since they
#aren't created to slow deforestation
exlist <- c(indep$name[indep$exclude==1],indep2$name[indep2$exclude==1])
ldat <- ldat[!(ldat$name %in% exlist),] #assigning a top down dummy based on my independent coding ldat$topdown <- ifelse(ldat$resex==0,1,0) ldat$topdown[ldat$name %in% indep$name[indep$topdown==0]] <- 0 rm(exlist) I now need to recreate the dummy variable for periods in which a sustainable use area is active. Before, this variable was simply based on the year an area was created. I accomplish this task with the code below. #converting to tibble and subsetting indep <- indep %>% as_tibble() %>% dplyr::select(name,creation) %>% dplyr::mutate(creation = as.character(creation)) #converting to tibble and subsetting indep2 <- indep2 %>% as_tibble() %>% dplyr::select(name,creation) %>% dplyr::mutate(creation = as.character(creation)) #merging together RESEX and SDR information indep <- indep %>% dplyr::bind_rows(indep2) rm(indep2) #joining the more accurate creation #date into my dataset ldat <- dplyr::left_join(ldat, indep, by="name") #initializing some new variables ldat$startyear <- year(dmy(ldat$creation)) ldat$activeyr <- ifelse(ldat$year < ldat$startyear,0,1)
ldat$active <- ifelse(ldat$year < ldat$startyear,0,1) #a for() loop that modifies these new variables #to be consistent with the more accurate creation date i <- 1 for (i in 1:nrow(ldat)) { if ( ldat$year[i] == ldat$startyear[i] ) { if ( ldat$period[i] < lubridate::yday(dmy(ldat$creation[i])) ) { ldat$active[i] <- 0

}

}

}

#a dummy for the interaction of "active" with "topdown"
ldat$active_topdown <- ldat$active * ldat$topdown rm(indep) # 9 Wrap-up Let’s take a look at the number of cases that are left. This illustrates that although these data are detailed longitudinally, restricting the sample for a reliable analysis leads to a dataset with limited cross-sectional variation. This influences my choice of estimation strategies. #bottom up length(unique(ldat$name[ldat$topdown==0])) ## [1] 24 #top down length(unique(ldat$name[ldat\$topdown==1]))
## [1] 14

Finally, the most important task of all: saving the finished product.

saveRDS( ldat, paste(getwd(),"/data for analysis/postprocessed/dataset.rds",sep="") )