Handling vector data

Last Update: 7 October 2022
Download RMarkdown: GRWG22_VectorData.Rmd

Overview

This tutorial covers how to manipulate geospatial vector datasets in R. A dataset with polygon geometries representing wildfire perimeters is joined with a dataset with point geometries representing air quality monitoring stations to determine which stations observed unhealthy concentrations of small particulate matter (PM2.5) in the atmosphere around the time of the Camp Fire in northern CA in 2018.

This tutorial assumes you are running this Rmarkdown file in RStudio Server. The easiest way to do that is with Open OnDemand (OoD) on Ceres or Atlas. Select the following parameter values when requesting a RStudio Server app to be launched depending on which cluster you choose. All other values can be left to their defaults. Note: on Atlas, we are using the development partition so that we have internet access to download files since the regular compute nodes on the atlas partition do not have internet access.

Ceres:

Slurm Partition: short
R Version: 4.2.0
Number of hours: 1
Number of cores: 2

Atlas:

R Version: 4.1.0
Partition Name: development
QOS: normal
Number of hours: 1
Number of tasks: 2

To download the Rmarkdown file for this tutorial to either cluster within OoD, you can use the following lines:

library(httr)
tutorial_name <- 'GRWG22_VectorData.Rmd'
GET(paste0('https://geospatial.101workbook.org/tutorials/',tutorial_name), 
    write_disk(tutorial_name))

Language: R

Primary Libraries/Packages:

Name	Description	Link
`sf`	Simple features for R	https://cran.r-project.org/web/packages/sf/index.html
`USAboundaries`	Historical and Contemporary Boundaries of the United States of America	https://cran.r-project.org/web/packages/USAboundaries/index.html
`ggplot2`	Create Elegant Data Visualisations Using the Grammar of Graphics	https://cran.r-project.org/web/packages/ggplot2/index.html

Nomenclature

Vector data: Spatial data defined by simple geometry types (points, lines, and polygons) where each geometry feature can be assigned non-spatial attributes.
CRS: Coordinate Reference System, also known as a spatial reference system. A system for defining geospatial coordinates.
Spatial join: Combining two spatial datasets by the relationship between their geometries.

Data Details

Data: National Interagency Fire Center’s Historic Perimeters dataset
Link: https://data-nifc.opendata.arcgis.com/datasets/nifc::historic-perimeters-combined-2000-2018-geomac/explore
Other Details: The original dataset contains perimeters of wildfires in the US from 2000-2018 as a polygon feature collection. For this tutorial, the wildfire perimeters in CA during 2018 were extracted.
Data: US EPA’s Air Quality System (AQS) database
Link: https://aqs.epa.gov/aqsweb/documents/data_api.html
Other Details: PM2.5 concentration data from this database covering CA in 2018 were retrieved and pre-processed for this tutorial.

Analysis Steps

Fire perimeter data - read in and visualize the wildfire perimeter data
- Read in geojson file
- Visualize perimeters on map of CA
- Visualize non-spatial attributes
Air quality data - read in the shapefile
Buffer and spatial join - find the air quality stations within 200km of the fire perimeter
Visualize - air quality impact around the Camp Fire

Step 0: Import Libraries/Packages

For this tutorial, we will use the sf package for handling vector data, the USAboundaries for including state boundaries in our maps, dplyr for general tabular data manipulations, ggplot2 for creating visuals, and lubridate for handling dates. Each package except USAboundaries is available from the site libraries accessible to RStudio Server on OoD for both clusters. If you have not used USAboundaries before, you may install it from CRAN with install.packages(USAboundaries) from within RStudio Server on OoD. To learn more about installing packages on Ceres, see this guide.

library(sf)               # Handling vector data
library(USAboundaries)    # Mapping administrative boundaries
library(dplyr)            # General data manipulation
library(ggplot2)          # Visualizations
library(lubridate)        # Date manipulation

Step 1: Read in fire perimeter data and visualize

The National Interagency Fire Center’s Historic Perimeters dataset has 23,776 polygons representing wildfire perimeters from 2000-2018. A version of the dataset filtered to wildfires in CA in 2018, since that is when the destructive Camp Fire occurred, will be used in this tutorial.

We will transform to planar coordinates for distance calculations. The 5070 EPSG code is for the Equal Area CONUS Albers. If you kept the dataset in it’s original CRS, WGS 84, sf would warn you about distance calculations not being accurate.

fire_f <- 'Historic_Perimeters_Combined_2000-2018_GeoMAC_CA2018.geojson'
dnld_url <- 'https://geospatial.101workbook.org/ExampleGeoWorkflows/assets/'
httr::GET(paste0(dnld_url,fire_f),
          httr::write_disk(fire_f,
                           overwrite=TRUE))

fire_CA2018 <- st_read(fire_f) %>%
  st_transform(5070)

To get an idea of what the data look like, we can create a map.

# CA boundary
CA <- us_states() %>%
            filter(state_abbr == 'CA') %>%
  st_transform(st_crs(fire_CA2018))

# The Camp Fire feature
camp_fire <- fire_CA2018 %>%
  filter(incidentname == 'CAMP')

# Plot a map of all wildfires and 
# highlight the Camp Fire.
fire_CA2018 %>%
  ggplot() +
  geom_sf(data = CA) +
  geom_sf(fill = 'black',
          color = NA) +
  geom_sf(fill = 'red',
          color = NA,
          data = camp_fire)

fire_map

Since this feature collection has several attributes, we can also visualize, for example, when the fires occurred during the year. Note that the dates are when the fires’ perimeters are established, not when the fires start. Since fires can endure for many days and change their spatial extent over that time, this date is when the maximum extent of the fire was determined, which needs to be at the end of the fire.

# Plotting when wildfires occurred throughout 2018
fire_CA2018 %>%
  select(perimeterdatetime) %>%
  ggplot(aes(perimeterdatetime)) +
  geom_histogram() +
  scale_x_datetime(breaks = '1 month',
                   date_labels = '%b')

fire_time

Step 2: Read in air quality data

We will read in an air quality dataset to showcase combining multiple vector datasets. The geometry type is point, representing sampling stations. This dataset can be downloaded with an API, but you need to register an account, so the download process has been done already and the data are available in our ‘session6’ folder. Some pre-processing for this dataset was done to make it into a feature collection.

aq_base <- 'air_quality_CA2018'
aq_zip <- paste0(aq_base,'.zip')
#dnld_url <- 'https://github.com/HeatherSavoy-USDA/geospatialworkbook/raw/master/ExampleGeoWorkflows/assets/'
httr::GET(paste0(dnld_url,aq_zip),
          httr::write_disk(aq_zip,
                           overwrite=TRUE))
unzip(aq_zip)

ca_PM25 <- st_read(paste0(aq_base,'.shp')) %>%
  st_transform(st_crs(fire_CA2018))

Step 3: Find the air quality stations within 200km of the fire perimeter

We will buffer the Camp Fire polygon and join that to the air quality data to find the nearby stations. Note that the direction of joining is important: the geometry of the left dataset, in this case the air quality points, will be retained and the attributes of intersecting features of the right dataset will be retained.

camp_zone <- st_buffer(camp_fire, 200000) # meters, buffer is in unit of CRS

air_fire <- st_join(ca_PM25, camp_zone)

Step 4: Visualize air quality around the Camp Fire

‘Around’ in this case should be in both space and time. We already found the air quality stations within 200 km from the fire perimeter. To make sure we are looking at the right time of year, we will relate the non-spatial attributes of the two feature collections. Both datasets come with date columns: when the air quality was recorded and when the perimeter was determined (so the latter is an approximate time near the end of the fire). We will filter our spatially- joined dataset to within 30 days of the fire perimeter date so we only consider air quality within a 2-month window around the time of the fire.

We can then plot the air quality metric of PM2.5 concentration (lower is better) as a function of time before/after the fire perimeter date. Each line is an individual air quality station. The figure indicates that there were several stations that recorded unhealthy to hazardous PM2.5 concentrations around the time of this wildfire.

# Filter to 30 days from fire perimeter date
air_near_fire <- air_fire %>%
  mutate(dat_lcl = ymd(dat_lcl), # local date for air quality measurement
         perimeterdate = as.Date(perimeterdatetime), # fire perimeter date
         date_shift = dat_lcl - perimeterdate,
         PM25 = as.numeric(arthmt_),
         station_id = paste(stat_cd, cnty_cd, st_nmbr)) %>% 
  filter(abs(date_shift) <= 30)
  
# Define bounds of Camp Fire dates for illustrative purposes
# Camp Fire burned for 17 days- add a polygon to graph to visualize temporal overlap with low air quality
camp_dates <- ymd(c('2018-11-08','2018-11-25'))

# Camp Fire's maximum perimeter established at end of fire
fp_date <- unique(air_near_fire$perimeterdate)


air_near_fire %>%
  tidyr::complete(station_id, dat_lcl) %>%
  ggplot(aes(dat_lcl,PM25)) +
  annotate('rect', xmin = camp_dates[1], xmax = camp_dates[2],
            ymin = -Inf, ymax = Inf,
            fill = 'firebrick', alpha = 0.5) +
  geom_line(aes(group = station_id)) +
  geom_vline(xintercept = fp_date, 
             color = 'red',
             lwd = 1,
             linetype = 'dashed') +
  scale_x_date(name = '',date_breaks = "1 week",date_labels = "%m-%d-%y") +
  scale_y_continuous(name = 'PM2.5 [micrograms/cubic meter]')

join_plot

We can then map where those stations were - quite an extensive impact on air quality!

# 101 micrograms/cubic meter and higher is considered
# unhealthy for sensitive groups 
unhealthy_air <- air_near_fire %>%
  filter(PM25 > 100)  

unhealthy_air %>%
  ggplot() +
  geom_sf(data = CA) +
  geom_sf(color = 'red') +
  geom_sf(fill = 'firebrick',
          color = NA,
          data = camp_fire) +
  geom_sf(color = 'firebrick',
          fill = NA,
          data = camp_zone)

unhealthy