Last Update: 4 January 2024
Download Jupyter Notebook: GRWG23_SpatialInterpolation_python.ipynb
Overview
This tutorial will implement and compare two machine learning approaches that incorporate spatial proximity as predictors, applied to two spatial modeling tasks:
- Spatial interpolation from point observations
- Spatial prediction from point observations and gridded covariates
Language: Python
Primary Libraries/Packages:
Name | Description | Link |
---|---|---|
pandas | Dataframes and other datatypes for data analysis and manipulation | https://pandas.pydata.org/ |
geopandas | Extends the datatypes used by pandas to allow spatial operations on geometric types | https://geopandas.org/en/stable/ |
scikit-learn | Machine learning in Python | https://scikit-learn.org/stable/ |
plotnine | A plotting library for Python modeled after R’s ggplot2 | https://plotnine.readthedocs.io/en/v0.12.3/ |
Nomenclature
- (Spatial) Interpolation: Using observations of dependent and independent variables to estimate the value of the dependent variable at unobserved independent variable values. For spatial applications, this can be the case of having point observations (i.e., variable observations at known x-y coordinates) and then predicting a gridded map of the variable (i.e., estimating the variable at the remaining x-y cells in the study area).
- Random Forest: A supervised machine learning algorithm that uses an ensemble of decision trees for regression or classification.
Tutorial Steps
- 1. Read in and visualize point observations
- 2. Random Forest for spatial interpolation
Use Random Forest to interpolate zinc concentrations across the study area in two ways:
- 2.a. RFsp: distance to all observations
- 2.b. RFSI: n observed values and distance to those n observation locations
- 3. Bringing in gridded covariates
Step 0: Load libraries and define function
First, we will import required packages and set a large default figure size. We will also define a function to print model metrics.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from scipy.stats import randint
import geopandas as gpd
import pandas as pd
import numpy as np
import plotnine as pn
pn.options.figure_size = (10, 10)
def print_metrics(y_test, y_pred):
    """
    Given observed and predicted y values, print the R^2 and RMSE metrics.

    y_test (Series): The observed y values.
    y_pred (Series): The predicted y values.
    """
    # Coefficient of determination - R^2
    r2 = r2_score(y_test, y_pred)
    # Root mean squared error - RMSE
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    print("R^2 = {0:3.2f}, RMSE = {1:5.0f}".format(r2, rmse))
Step 1: Read in and visualize point observations
We will open three vector datasets representing the point observations, the grid across which we want to predict, and the location of the river surrounding the study area.
This dataset gives locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the river Meuse, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15 m x 15 m. The data were extracted from the sp R package.
dnld_url = 'https://geospatial.101workbook.org/SpatialModeling/assets/'
# Point observations and study area grid
meuse_obs = gpd.read_file(dnld_url + 'meuse_obs.zip')
meuse_grid = gpd.read_file(dnld_url + 'meuse_grid.zip')
# Extra information for visualization:
xmin, ymin, xmax, ymax = meuse_grid.total_bounds
meuse_riv = gpd.read_file(dnld_url + 'meuse_riv.zip')
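Optionally, before plotting, it can be worth confirming that the three layers share a coordinate reference system and peeking at the attribute table (a quick sanity check, not part of the original workflow):
# All layers should share one CRS so distances and overlays are meaningful
print(meuse_obs.crs == meuse_grid.crs == meuse_riv.crs)
# Peek at the observation attributes (includes the 'zinc' column used below)
meuse_obs.head()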
Let’s take a quick look at the dataset. Below is a map of the study area grid and the observation locations, plus the location of the river Meuse for reference. We can see that observed zinc concentrations tend to be higher closer to the river.
(pn.ggplot()
+ pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)
+ pn.geom_map(meuse_grid, fill = None, size = 0.05)
+ pn.geom_map(meuse_obs, pn.aes(fill = 'zinc'), color = None, size = 3)
+ pn.scale_y_continuous(limits = [ymin, ymax])
+ pn.scale_fill_continuous(trans = 'log1p')
+ pn.theme_minimal()
+ pn.coord_fixed()
)
Step 2: Random Forest for spatial interpolation
We will explore two methods from recent literature that combine spatial proximity information as variables in fitting Random Forest models for spatial interpolation.
Step 2a: RFsp: distance to all observations
First, we will implement the RFsp method from Hengl et al 2018. This method involves using the distance to every observation as a predictor. For example, if there are 10 observations of the target variable, then there would be 10 predictor variables with the ith predictor variable representing the distance to the ith observation.
If you want to learn more about this approach, see the thengl/GeoMLA GitHub repo or the Spatial and spatiotemporal interpolation using Ensemble Machine Learning site from the same creators. Note: these resources are for R, but the latter does mention the scikit-learn library in Python that we will be using in this tutorial.
To start, we will generate two DataFrames of distances:
- One with rows representing observations and columns representing observations (these will be our data for fitting and testing the model)
- One with rows representing grid cell centers and columns representing observations (these will be how we estimate maps of our target variable with the final model)
# Get coordinates of grid cell centers - these are our locations at which we will
# want to interpolate our target variable.
grid_centers = meuse_grid.centroid
# Generate a grid of distances to each observation
grid_distances = pd.DataFrame()
# We also need the distance values among our observations
obs_distances = pd.DataFrame()
# We need a dataframe with rows representing prediction grid cells
# (or observations)
# and columns representing observations
for obs_index in range(meuse_obs.geometry.size):
    cur_obs = meuse_obs.geometry.iloc[obs_index]
    obs_name = 'obs_' + str(obs_index)
    # Distance from every grid cell center to the current observation
    cell_to_obs = grid_centers.distance(cur_obs).rename(obs_name)
    grid_distances = pd.concat([grid_distances, cell_to_obs], axis=1)
    # Distance from every observation to the current observation
    obs_to_obs = meuse_obs.distance(cur_obs).rename(obs_name)
    obs_distances = pd.concat([obs_distances, obs_to_obs], axis=1)
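The loop above is easy to follow but scales poorly. As a side note, the same two matrices can be built in one step with scipy's cdist; this is a sketch assuming the geometries are simple points (so their x/y coordinates can be extracted directly), and it produces the same values up to index labels:
from scipy.spatial.distance import cdist

# x/y coordinate arrays, one row per location
obs_xy = np.column_stack([meuse_obs.geometry.x, meuse_obs.geometry.y])
grid_xy = np.column_stack([grid_centers.x, grid_centers.y])
obs_names = ['obs_' + str(i) for i in range(len(meuse_obs))]

# Pairwise Euclidean distances: observation-to-observation and grid-to-observation
obs_distances = pd.DataFrame(cdist(obs_xy, obs_xy), columns=obs_names)
grid_distances = pd.DataFrame(cdist(grid_xy, obs_xy), columns=obs_names)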
Before moving on to model fitting, let’s take a look at the distance matrices we created.
obs_distances
(Output: a 155 × 155 DataFrame of pairwise distances between observation locations, in map units (meters for this dataset). Columns are labeled obs_0 through obs_154, one per observation, and the diagonal is 0 since each observation is at distance 0 from itself. 155 rows × 155 columns.)
For fitting our model, we will use the distances among observations as our predictors and observed zinc concentration at those observations as our target variable.
# matrix of distance to observations as predictors
RFsp_X = obs_distances
# vector of observed zinc concentration as target variable
y = meuse_obs['zinc']
# We need to split our dataset into train and test datasets. We'll use 80% of
# the data for model training.
RFsp_X_train, RFsp_X_test, RFsp_y_train, RFsp_y_test = train_test_split(RFsp_X, y, train_size=0.8)
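Note that train_test_split draws a new random split every time it runs, so the metrics you see below will differ slightly from those printed here. If you want a reproducible split, you can pass a seed (the value 0 here is arbitrary):
# Optional: seed the split for reproducibility (seed value is arbitrary)
RFsp_X_train, RFsp_X_test, RFsp_y_train, RFsp_y_test = train_test_split(
    RFsp_X, y, train_size=0.8, random_state=0)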
Machine learning algorithms typically have hyperparameters that can be tuned per application. Here, we will tune the number of trees in the random forest and the maximum depth of those trees. We use the training subset of our data for this fitting and tuning process. The best parameter values found by the search are printed after the code chunk below.
r_state = 0
# Define the parameter space that will be searched over.
param_distributions = {'n_estimators': randint(1, 100),
'max_depth': randint(5, 10)}
# Now create a searchCV object and fit it to the data.
tuned_RFsp = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=r_state),
n_iter=10,
param_distributions=param_distributions,
random_state=r_state).fit(RFsp_X_train, RFsp_y_train)
tuned_RFsp.best_params_
{'max_depth': 7, 'n_estimators': 37}
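Beyond best_params_, the fitted searcher also exposes the cross-validated score of the best candidate and the full table of sampled parameter sets, which can be handy while tuning. A brief sketch (for regressors the default scorer is R2):
# Mean cross-validated score (R^2 by default for regressors) of the best candidate
print(tuned_RFsp.best_score_)
# All sampled parameter sets with their mean test scores and ranks
pd.DataFrame(tuned_RFsp.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]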
We can now use our testing subset of the data to quantify model performance, i.e., how well the model predicts the held-out observed values. There are many potential metrics - see all the metrics scikit-learn supports here. The two we show below are the coefficient of determination (R2) and the root mean square error (RMSE), two metrics that are likely familiar from outside machine learning as well.
print_metrics(RFsp_y_test, tuned_RFsp.predict(RFsp_X_test))
R^2 = 0.50, RMSE = 286
Our R2 is not awesome! We typically want R2 values closer to 1 and RMSE values closer to 0. Note: RMSE is in the units of the target variable, here zinc concentration. You can see the range of zinc concentrations in the legend of the figure above, which gives a sense of the size of our error.
Exercise: Modify the param_distributions and n_iter values above - can you improve the metrics? Note that you may also increase the processing time.
Once we are happy with (or at least curious about!) the model, we can predict and visualize our zinc concentration field.
# Predict the value from all grid cells using their distances we determined above.
meuse_grid['pred_RFsp'] = tuned_RFsp.predict(grid_distances)
(pn.ggplot()
+ pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)
+ pn.geom_map(meuse_grid, pn.aes(fill = 'pred_RFsp'), color = 'white', size = 0.05)
+ pn.geom_map(meuse_obs)
+ pn.scale_y_continuous(limits = [ymin, ymax])
+ pn.scale_fill_distiller(type = 'div', palette = 'RdYlBu',trans = 'log1p')
+ pn.theme_minimal()
+ pn.coord_fixed()
)
Step 2b: RFSI: n observed values and distance to those n observation locations
Now, we will try the RFSI method from Sekulić et al 2020. In this method, instead of using distances to all observations as our predictors, we will use the distances to the n closest observations as well as the observed values at those locations as our predictors.
Below, we define a function to find the n closest observations and record their distances and values.
def nclosest_dist_value(dist_ij, obs_i, n = 3):
    """
    Given a distance matrix among i locations and j observation
    locations, the j observed values, and the number of close
    observations desired, generate a dataframe of distances to
    and values at the n closest observations for each of the i
    locations.

    dist_ij (DataFrame): distance matrix among i locations and j
        observation locations
    obs_i (Series): The observed values at the j observation locations
    n (int): The desired number of closest observations
    """
    # Which observations are the n closest?
    # But do not include distance to oneself.
    # Note: ranks start at 1, not 0.
    nclosest_dist_ij = dist_ij.replace(0.0, np.nan).rank(axis = 1, method = 'first') <= n
    nclosest = pd.DataFrame()
    # For each location, find the n nearest observations and
    # record the distance and target variable pairs
    for i in range(dist_ij.shape[0]):
        # Which obs are the n closest to the ith location?
        nclosest_j_indices = np.where(nclosest_dist_ij.iloc[i,:])
        # Save the distance to and observed value at the n closest
        # observations from the ith location, sorted nearest first
        i_loc_dist = dist_ij.iloc[i].iloc[nclosest_j_indices]
        sort_indices = i_loc_dist.values.argsort()
        i_loc_dist = i_loc_dist.iloc[sort_indices]
        i_loc_dist.rename(lambda x: 'dist' + str(np.where(x == i_loc_dist.index)[0][0]), inplace=True)
        i_loc_value = obs_i.iloc[nclosest_j_indices]
        i_loc_value = i_loc_value.iloc[sort_indices]
        i_loc_value.rename(lambda x: 'obs' + str(np.where(x == i_loc_value.index)[0][0]), inplace=True)
        i_loc = pd.concat([i_loc_dist, i_loc_value], axis = 0)
        nclosest = pd.concat([nclosest, pd.DataFrame(i_loc).transpose()], axis = 0)
    return nclosest
Let’s now use that function to find and describe the n closest observations to each observation and each grid cell. Note that we are taking advantage of the obs_distances and grid_distances variables we created for the RFsp approach.
n = 10
obs_nclosest_obs = nclosest_dist_value(obs_distances, meuse_obs['zinc'], n)
grid_nclosest_obs = nclosest_dist_value(grid_distances, meuse_obs['zinc'], n)
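As a quick sanity check, each result should have 2 × n columns: n distances (nearest first) followed by the n observed values at those same locations, with one row per observation or grid cell:
# Expect (155, 20) for the observations and (number of grid cells, 20) for the grid
print(obs_nclosest_obs.shape, grid_nclosest_obs.shape)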
Let’s take a closer look at our new distance matrices.
obs_nclosest_obs
(Output: a 155 × 20 DataFrame with one row per observation location. The first 10 columns, dist0 through dist9, hold the distances to the 10 closest other observations, nearest first; the last 10 columns, obs0 through obs9, hold the zinc concentrations observed at those same locations. 155 rows × 20 columns.)
We will then use the same model fitting process as for RFsp.
# matrix of distances to and observed values at the n closest observations as predictors
RFSI_X = obs_nclosest_obs
# We need to split our dataset into train and test datasets. We'll use 80% of
# the data for model training.
RFSI_X_train, RFSI_X_test, RFSI_y_train, RFSI_y_test = train_test_split(RFSI_X, y, train_size=0.8)
param_distributions = {'n_estimators': randint(1, 100),
'max_depth': randint(5, 10)}
tuned_RFSI = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=r_state),
n_iter=10,
param_distributions=param_distributions,
random_state=r_state).fit(RFSI_X_train, RFSI_y_train)
tuned_RFSI.best_params_
{'max_depth': 9, 'n_estimators': 59}
print_metrics(RFSI_y_test, tuned_RFSI.predict(RFSI_X_test))
R^2 = 0.45, RMSE = 302
How do RFSI’s metrics compare to RFsp’s? What if you modify n, the number of closest observations? What if you modify the param_distributions and n_iter values as above?
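Keep in mind that each call to train_test_split above draws an independent random split, so part of any difference between the RFsp and RFSI metrics is just split-to-split noise. One way to make the comparison more direct (a sketch, not part of the original workflow) is to split row positions once and reuse them for both predictor matrices:
# Split row positions once, then index both predictor matrices the same way
idx_train, idx_test = train_test_split(np.arange(len(y)), train_size=0.8, random_state=0)
RFsp_shared_train, RFsp_shared_test = RFsp_X.iloc[idx_train], RFsp_X.iloc[idx_test]
RFSI_shared_train, RFSI_shared_test = RFSI_X.iloc[idx_train], RFSI_X.iloc[idx_test]
y_shared_train, y_shared_test = y.iloc[idx_train], y.iloc[idx_test]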
Let’s visualize the two maps from these two methods together. To do so, we will need to transform our meuse_grid DataFrame into a longer format for plotting with facets in plotnine.
meuse_grid['pred_RFSI'] = tuned_RFSI.predict(grid_nclosest_obs)
meuse_grid_long = pd.melt(meuse_grid, id_vars= 'geometry', value_vars=['pred_RFsp','pred_RFSI'])
(pn.ggplot()
+ pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)
+ pn.geom_map(meuse_grid_long, pn.aes(fill = 'value'), color = 'white', size = 0.05)
+ pn.geom_map(meuse_obs)
+ pn.scale_y_continuous(limits = [ymin, ymax])
+ pn.scale_fill_distiller(type = 'div', palette = 'RdYlBu',trans = 'log1p')
+ pn.theme_minimal()
+ pn.coord_fixed()
+ pn.facet_wrap('variable')
)
Step 3: Bringing in gridded covariates
This dataset has three covariates supplied with the grid and the observations:
- dist: the distance to the river
- ffreq: a category describing the flooding frequency
- soil: a category of soil type
We can extend this spatial interpolation task into a more general spatial prediction task by including these co-located observations and gridded covariates. Let’s visualize the flooding frequency:
(pn.ggplot()
+ pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)
+ pn.geom_map(meuse_grid, pn.aes(fill = 'ffreq'), size = 0.05)
+ pn.geom_map(meuse_obs, size = 2)
+ pn.scale_y_continuous(limits = [ymin, ymax])
+ pn.theme_minimal()
+ pn.coord_fixed()
)
Also visualize the other covariates, either one at a time or by melting like above to use facets. Do you expect these variables to improve the model?
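For example, a single-covariate version of the map above, swapping the fill to the continuous distance-to-river column (one possible starting point, not the only way):
# Map one covariate at a time, here the distance to the river
(pn.ggplot()
 + pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)
 + pn.geom_map(meuse_grid, pn.aes(fill = 'dist'), size = 0.05)
 + pn.geom_map(meuse_obs, size = 2)
 + pn.scale_y_continuous(limits = [ymin, ymax])
 + pn.theme_minimal()
 + pn.coord_fixed()
)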
Adding these covariates to either RF method is straightforward; we will stick to the RFSI model here. All that needs to be done is to concatenate the three covariate columns to our distance (and observed values) dataset and repeat the model fitting process.
# matrix of distances to and observed values at the n closest observations,
# plus the three covariates, as predictors
# (drop=True avoids carrying the old row index along as a spurious constant column)
RFSI_wcov_X = pd.concat([obs_nclosest_obs.reset_index(drop=True), meuse_obs[['dist','ffreq','soil']]], axis=1)
# We need to split our dataset into train and test datasets. We'll use 80% of
# the data for model training.
RFSI_wcov_X_train, RFSI_wcov_X_test, RFSI_wcov_y_train, RFSI_wcov_y_test = train_test_split(RFSI_wcov_X, y, train_size=0.8)
param_distributions = {'n_estimators': randint(1, 100),
'max_depth': randint(5, 10)}
tuned_RFSI_wcov = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=r_state),
n_iter=10,
param_distributions=param_distributions,
random_state=r_state).fit(RFSI_wcov_X_train, RFSI_wcov_y_train)
tuned_RFSI_wcov.best_params_
{'max_depth': 5, 'n_estimators': 68}
print_metrics(RFSI_wcov_y_test, tuned_RFSI_wcov.predict(RFSI_wcov_X_test))
R^2 = 0.56, RMSE = 260
How did the new covariates change our metrics? Was it as you’d expect?
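One way to probe which predictors drove that change (a sketch; these are the impurity-based importances that scikit-learn's random forest computes during fitting):
# Impurity-based feature importances from the best forest found by the search
importances = pd.Series(tuned_RFSI_wcov.best_estimator_.feature_importances_,
                        index=RFSI_wcov_X.columns).sort_values(ascending=False)
print(importances.head(10))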
grid_nclosest_obs_wcov = pd.concat([grid_nclosest_obs.reset_index(drop=True), meuse_grid[['dist','ffreq','soil']]], axis=1)
meuse_grid['pred_RFSI_wcov'] = tuned_RFSI_wcov.predict(grid_nclosest_obs_wcov)
meuse_grid_long = pd.melt(meuse_grid, id_vars= 'geometry', value_vars=['pred_RFsp','pred_RFSI','pred_RFSI_wcov'])
(pn.ggplot()
+ pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)
+ pn.geom_map(meuse_grid_long, pn.aes(fill = 'value'), color = 'white', size = 0.05)
+ pn.geom_map(meuse_obs, size = 0.25)
+ pn.scale_y_continuous(limits = [ymin, ymax])
+ pn.scale_fill_distiller(type = 'div', palette = 'RdYlBu',trans = 'log1p')
+ pn.theme_minimal()
+ pn.coord_fixed()
+ pn.facet_wrap('variable')
)
Use the covariates to create an RFsp_wcov model and add it to the figure. How do the metrics compare to the other results?
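A possible starting point for that exercise (a sketch following the same pattern as the RFSI_wcov code above; tuning and evaluation then proceed exactly as in Step 2a):
# Distances to all observations plus the three covariates as predictors
RFsp_wcov_X = pd.concat([obs_distances.reset_index(drop=True),
                         meuse_obs[['dist', 'ffreq', 'soil']]], axis=1)
# Matching predictor table for the prediction grid
grid_distances_wcov = pd.concat([grid_distances.reset_index(drop=True),
                                 meuse_grid[['dist', 'ffreq', 'soil']]], axis=1)
RFsp_wcov_X_train, RFsp_wcov_X_test, RFsp_wcov_y_train, RFsp_wcov_y_test = train_test_split(
    RFsp_wcov_X, y, train_size=0.8)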