Last Update: 28 September 2022
Download RMarkdown: GRWG22_ZonalStats_wSLURM.Rmd
Overview
This tutorial covers how to:
- calculate zonal statistics (i.e., extract summaries of raster values intersecting polygons) in R, and
- use SLURM job arrays to execute an R script with different inputs across multiple cores.
We will use 21 years of the PRISM gridded dataset’s annual precipitation variable and the US Census counties polygon dataset to calculate the mean annual precipitation in each county per year. We will ask SLURM to distribute the 21 years of input data across as many cores, running our zonal statistics R script once per year.
If you prefer to have an R script handle looping over your data inputs and submitting many job submission scripts, see this tutorial.
Language: R
Primary Libraries/Packages:
Name | Description | Link |
---|---|---|
`rgeocdl` | R interface for SCINet GeoCDL API | https://github.com/USDA-SCINet/rgeocdl |
`terra` | Methods for spatial data analysis with raster and vector data | https://rspatial.org/terra/pkg/index.html |
Nomenclature
- GeoCDL: Geospatial Common Data Library, a collection of commonly used raster datasets accessible from an API running on SCINet’s Ceres cluster
- SLURM Workload Manager: The software on Ceres and Atlas that allocates compute cores to users for their submitted jobs.
- Zonal statistics: Calculating summary statistics, e.g. mean, of cell values from a raster in each region, where regions can be defined by an overlapping polygon feature collection.
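As a minimal, self-contained illustration of this idea (separate from the tutorial workflow below), the sketch here builds a small random raster and one square polygon, then asks `terra` for the mean cell value inside that polygon. All values and coordinates are invented for demonstration.

```r
# Toy zonal statistics example: mean of raster cells falling inside one polygon.
# All values and coordinates below are made up for illustration.
library(terra)

r <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10)
values(r) <- runif(ncell(r))                      # fill with random cell values
names(r) <- 'example'

p <- vect('POLYGON ((2 2, 2 6, 6 6, 6 2, 2 2))',  # one square "zone"
          crs = crs(r))

# Mean raster value within the polygon (one row per zone)
extract(r, p, fun = mean, na.rm = TRUE)
```

The analysis in this tutorial does the same thing, just with PRISM precipitation rasters and US county polygons.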
Data Details
- Data: US Census Cartographic Boundary Files
- Link: https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html
- Other Details: The cartographic boundary files are simplified representations of selected geographic areas from the Census Bureau’s MAF/TIGER geographic database. These boundary files are specifically designed for small scale thematic mapping.
- Data: PRISM
- Link: https://prism.oregonstate.edu/
- Other Details: The PRISM Climate Group gathers climate observations from a wide range of monitoring networks, applies sophisticated quality control measures, and develops spatial climate datasets to reveal short- and long-term climate patterns. The resulting datasets incorporate a variety of modeling techniques and are available at multiple spatial/temporal resolutions, covering the period from 1895 to the present.
Analysis Steps
- Write serial R script - this script will accept a year argument, open the raster file associated with that year, open the polygon dataset, calculate the mean value per polygon, and write a new shapefile with the mean values.
- Write and save a SLURM job submission script - Create a batch script with SLURM commands requesting resources to execute your calculations on multiple cores.
- Submit your job - Submit your batch script to SLURM
- Check results - Monitor the SLURM queue until your job is complete and then ensure your job executed successfully.
Step 0: Install packages and download data
For this tutorial, we are not running our spatial analyses via Open OnDemand, so we do not have access to the same breadth of geospatial R packages in the site library. If you have not used `terra` or `rgeocdl` outside of Open OnDemand on Ceres before, you can install them in a shell with the code chunks below. The first three `module load` commands load versions of the non-R libraries on which `terra` and `rgeocdl` depend. Then we load R 3.6, a version recent enough to install `terra` and one for which some of `rgeocdl`'s installation dependencies are already available in the site library on Ceres. You may use a newer version of R, but you may need to do some additional package installations yourself.
module load gdal
module load geos
module load udunits
module load r/3.6
R
This opens R, where you can run the following lines to install `terra`. The second line installs the SCINet `rgeocdl` package used for downloading the example data for this tutorial.
# Install packages
install.packages('terra') # if needed
devtools::install_github('USDA-SCINet/rgeocdl') # if needed
The code chunk below will download the example data. For our US county polygons, we are downloading a zipped folder containing a shapefile. The folder is then unzipped. For our precipitation data, we are using the SCINet GeoCDL’s R package `rgeocdl` to download annual precipitation for 2000-2020 in a bounding box approximately covering the state of Florida. Feel free to change the latitude and longitude to your preferred area, but any mentions of processing times below will reflect the bounds provided.
# Load packages for downloading data
library(rgeocdl)
# Download vector data - US Counties shapefile from US Census
vector_f <- 'us_counties2021'
vector_zip <- paste0(vector_f,'.zip')
httr::GET('https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_county_20m.zip',
httr::write_disk(vector_zip,
overwrite=TRUE))
unzip(vector_zip,
exdir = vector_f)
# Download raster data - PRISM's annual precipitation in box around Florida
download_polygon_subset(dsvars = data.frame(ds = 'PRISM',var = 'ppt'),
years = '2000:2020',
t_geom = data.frame(x = c(-87.5,-79), y = c(31,24.5)),
req_name = 'ppt_for_zonal_stats')
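Optionally, before exiting R you can confirm the downloads are on disk where the later steps expect them. This is just a quick sanity check; the exact name of the GeoCDL request folder (referenced as `ppt_for_zonal_stats-1` in Step 1) may differ on your system.

```r
# Optional: confirm the example data are on disk before exiting R.
list.files('us_counties2021')                # expect cb_2021_us_county_20m.shp and companion files
list.files(pattern = 'ppt_for_zonal_stats')  # expect a request folder containing yearly PRISM GeoTIFFs
```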
You may now exit R by typing:
q()
Step 1: Write and save a serial R script that accepts command line arguments
Save the lines below as `zonal_stats.R` on Ceres. It is an R script that:
- Takes one argument from the command line to specify the data input year
- Uses the `terra` package to open the raster file associated with that year
- Opens the vector data and transforms it to match the raster
- Extracts the mean raster value per polygon and writes the results to a new shapefile
# Read in command line arguments. Expecting a year value.
args = commandArgs(trailingOnly=TRUE)
if (length(args)==0) {
stop("Enter the year of data you want extracted",
call.=FALSE)
}
# Load terra library for spatial analyses
library(terra)
# Read in the raster corresponding to the year argument
my_raster <- rast(paste0('ppt_for_zonal_stats-1/PRISM_ppt_',
args[1],
'.tif'))
# Read in the polygon shapefile and transform it to match raster
my_polygons <- crop(project(vect('us_counties2021/cb_2021_us_county_20m.shp'),
my_raster),
my_raster)
# Extract mean raster value per polygon
my_polygons$mean <- extract(my_raster,
my_polygons,
na.rm = TRUE,
fun = mean)$ppt
# Save extracted mean values in a shapefile
writeVector(my_polygons, filename = paste0('stats_', args[1], '.shp'))
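Before submitting the whole array, it can be worth testing the script on a single year so that any path or package problems surface once rather than 21 times. From a shell this is `Rscript --vanilla zonal_stats.R 2000`; an equivalent call from an R session is sketched below, assuming `Rscript` is on your PATH (e.g. after `module load r/3.6`) and your working directory contains the script and the data from Step 0.

```r
# Optional single-year test run of the serial script before submitting the job array.
# Assumes Rscript is on the PATH and the data folders from Step 0 are in the working directory.
system2('Rscript', args = c('--vanilla', 'zonal_stats.R', '2000'))

# If it worked, a stats_2000.shp shapefile should now exist.
file.exists('stats_2000.shp')
```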
Step 2: Write and save a SLURM job submission script
Now that we have our R script that accepts a year as input, we will write a SLURM job batch script to request that R script be called over an array of years. This kind of job submission is known as a ‘job array’. Each ‘task’ in the job array, each year in our case, will be treated like its own job in the SLURM queue, but with the benefit of only having to submit one submission batch script.
Save the lines below as `zonal_stats.sh`. The lines starting with `#SBATCH` are instructions for SLURM about how long your job will take to run, how many cores you need, etc. The lines that follow are like any other batch script: here, we are loading the required modules and then executing our R script.
#!/bin/bash
#SBATCH --time=00:30:00 # walltime limit (HH:MM:SS)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # 1 processor core(s) per node X 2 threads per core
#SBATCH --partition=short # standard node(s)
#SBATCH --array=2000-2020 # your script input (the years)
#SBATCH --output=slurm_%A_%a.out # format of output filename
# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
module load r/3.6
# Execute the R script and pass the year (task ID) as argument
Rscript --vanilla zonal_stats.R ${SLURM_ARRAY_TASK_ID}
The meaning of our parameter choices:
- `--time=00:30:00`: Our tasks will take up to 30 minutes to run.
- `--nodes=1`: We only need one node. If you are just getting started with parallel processing, you will likely only need one node.
- `--ntasks-per-node=2`: We want two logical cores on our one node, i.e. each task will use one physical core. Our individual tasks are serial and only need one core.
- `--partition=short`: We will use the ‘short’ partition on Ceres, the collection of nodes dedicated to shorter walltime jobs not requiring extensive memory. See this guide for more information about the available partitions on Ceres.
- `--array=2000-2020`: This is the parameter that tells SLURM we want a job array: although we are submitting one job script, it is treated as an array of many tasks. Those tasks have IDs in the range 2000-2020 to represent the years of data we want analyzed.
- `--output=slurm_%A_%a.out`: Save any output from R (e.g. printed messages, warnings, and errors) to a file with a name in the format slurm_JOBID_TASKID.out. The JOBID is assigned when the job is submitted (see the next step) and the TASKID will be in the range of our tasks in the array, 2000-2020. This way, if a particular year runs into an error, it can be easily found.
Note: there are additional SLURM parameters you may use, including setting your own job name or specifying memory requirements. Check out the Ceres job script generator to see more examples of how to populate job submission scripts on Ceres.
Step 3: Submit your job
Now that we have our packages, data, R script, and job submission script prepared, we can finally submit the job. To submit a batch job to SLURM, we use the command `sbatch` followed by the job submission script filename in our shell. After you run this line, you will see a message with your job ID. You can use this ID to identify your job in the queue.
sbatch zonal_stats.sh
Submitted batch job 8145006
Step 4: Check results
To see the status of your job, you can view the SLURM queue. The queue lists all of the jobs currently submitted, who submitted them, the job status, and what nodes are allocated to the job. Since this can be a very long list, it is easiest to find your jobs if you filter the queue to only the jobs you submitted. The command to view the queue is `squeue` and you can filter it to a specific user with the `-u` parameter followed by their SCINet account name.
squeue -u firstname.lastname
Below is a partial output for the queue showing this job array. The JOBID column shows IDs that start with the submitted job ID printed above, followed by `_TASKID` (here, the year) identifying each task in our job array.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8145006_2000 short zonal_st heather. R 0:01 1 ceres19-compute-57
8145006_2001 short zonal_st heather. R 0:01 1 ceres20-compute-3
8145006_2002 short zonal_st heather. R 0:01 1 ceres20-compute-3
8145006_2003 short zonal_st heather. R 0:01 1 ceres20-compute-3
8145006_2004 short zonal_st heather. R 0:01 1 ceres20-compute-3
If you see jobs listed in the queue: you have jobs currently in the queue and the status column will indicate if that job is pending, running, or completing. If you do NOT see jobs listed in the queue: you do not have jobs currently in the queue. If you submitted jobs but they are not listed, then they completed - either successfully or unsuccessfully.
To determine if the job executed successfully, you may check whether your anticipated output was created. In our case, we would expect to see new shapefiles of the format stats_YYYY.shp. If you do not see your anticipated results, you can read the contents of the slurm_JOBID_TASKID.out files to check for error messages.
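If you prefer to do this check from R rather than by eye, a sketch along these lines counts the expected shapefiles and flags any SLURM output files that mention an error (run it from the directory where the job ran):

```r
# Count the per-year shapefiles the array should have produced (2000-2020 -> 21 files)
shp_files <- list.files(pattern = '^stats_[0-9]{4}\\.shp$')
length(shp_files)

# Flag any SLURM output files that mention an error
out_files <- list.files(pattern = '^slurm_.*\\.out$')
for (f in out_files) {
  if (any(grepl('error', readLines(f), ignore.case = TRUE))) message('Check ', f)
}
```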
Here is a visual of our 2020 result in stats_2020.shp showing the mean 2020 total precipitation per county in Florida. Note: if you want to run these lines and create the plot, it is easiest to do so in RStudio Server via Open OnDemand.
library(terra)
result <- vect('stats_2020.shp')
plot(result, 'mean')
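As an optional follow-up, the 21 yearly shapefiles can be stacked into a single table of county means for further analysis. The sketch below assumes the `NAME` and `GEOID` attributes from the Census shapefile carried through to the output shapefiles, which they should since the script writes out the full attribute table.

```r
# Optional: gather all 21 yearly county means into one data frame.
# Assumes the NAME and GEOID attributes carried through from the Census shapefile.
library(terra)

years <- 2000:2020
all_means <- do.call(rbind, lapply(years, function(yr) {
  v <- vect(paste0('stats_', yr, '.shp'))
  data.frame(year = yr, geoid = v$GEOID, county = v$NAME, mean_ppt = v$mean)
}))
head(all_means)
```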