\n", "**Download Jupyter Notebook**: [GRWG23_SpatialInterpolation_python.ipynb](https://geospatial.101workbook.org/tutorials/GRWG23_SpatialInterpolation_python.ipynb)" ] }, { "cell_type": "markdown", "id": "stone-albany", "metadata": {}, "source": [ "## Overview\n", "\n", "This tutorial will implement and compare machine learning techniques with two approaches to including spatial proximity for spatial modeling tasks:\n", " * Spatial interpolation from point observations\n", " * Spatial prediction from point observations and gridded covariates\n", "\n", "*Language*: `Python`\n", "\n", "*Primary Libraries/Packages*:\n", "\n", "|Name|Description|Link|\n", "|-|-|-|\n", "| `pandas` | Dataframes and other datatypes for data analysis and manipulation | https://pandas.pydata.org/ |\n", "| `geopandas` | Extends datatypes used by pandas to allow spatial operations on geometric types | https://geopandas.org/en/stable/ |\n", "| `scikit-learn` | Machine Learning in Python | https://scikit-learn.org/stable/ |\n", "| `plotnine` | A plotting library for Python modeled after R's [ggplot2](https://ggplot2.tidyverse.org/) | https://plotnine.readthedocs.io/en/v0.12.3/ |\n", "\n", "\n", "## Nomenclature\n", "\n", " * *(Spatial) Interpolation*: Using observations of dependent and\n", " independent variables to estimate the value of the dependent\n", " variable at unobserved independent variable values. For spatial\n", " applications, this can be the case of having point observations\n", " (i.e., variable observations at known x-y coordinates) and then\n", " predicting a gridded map of the variable (i.e., estimating the\n", " variable at the remaining x-y cells in the study area).\n", " * *Random Forest*: A supervised machine learning algorithm that \n", " uses an ensemble of decision trees for regression or \n", " classification. \n", "\n", "## Tutorial Steps\n", "\n", " * 1\\. **[Read in and visualize point observations](#step_1)**\n", " * 2\\. **[Random Forest for spatial interpolation](#step_2)**\n", " Use Random Forest to interpolate zinc concentrations across the study area in two ways:\n", " * 2.a. *RFsp*: distance to all observations\n", " * 2.b. *RFSI*: n observed values and distance to those n observation locations\n", " * 3\\. **[Bringing in gridded covariates](#step_3)**" ] }, { "cell_type": "markdown", "id": "economic-orchestra", "metadata": {}, "source": [ "## Step 0: Load libraries and define function\n", "\n", "First, we will import required packages and set a large default figure size. We will also define a function to print model metrics." ] }, { "cell_type": "code", "execution_count": null, "id": "answering-external", "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import r2_score, mean_squared_error\n", "from sklearn.model_selection import train_test_split, RandomizedSearchCV\n", "from scipy.stats import randint\n", "import geopandas as gpd\n", "import pandas as pd\n", "import numpy as np\n", "import plotnine as pn\n", "\n", "pn.options.figure_size = (10, 10)" ] }, { "cell_type": "code", "execution_count": null, "id": "2ce5d3d7", "metadata": {}, "outputs": [], "source": [ "def print_metrics(y_test, y_pred):\n", " \"\"\"\n", " Given observed and predicted y values, print the R^2 and RMSE metrics.\n", "\n", " y_test (Series): The observed y values.\n", " y_pred (Series): The predicted y values.\n", " \"\"\"\n", " # R^2\n", " r2 = r2_score(y_test, y_pred)\n", " # Root mean squared error - RMSE\n", " rmse = mean_squared_error(y_test, y_pred, squared = False)\n", " print(\"R^2 = {0:3.2f}, RMSE = {1:5.0f}\".format(r2,rmse))" ] }, { "cell_type": "markdown", "id": "necessary-marijuana", "metadata": {}, "source": [ "\n", "## Step 1: Read in and visualize point observations\n", "\n", "We will open three vector datasets representing the point observations, the grid across which we want to predict, and the location of the river surrounding the study area. \n", "\n", "This dataset gives locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the river Meuse, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15 m x 15 m. The data were extracted from the [`sp` R package](https://cran.r-project.org/web/packages/sp/index.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "external-andorra", "metadata": {}, "outputs": [], "source": [ "# Point observations and study area grid\n", "dnld_url = 'https://geospatial.101workbook.org/SpatialModeling/assets/'\n", "meuse_obs = gpd.read_file(dnld_url + 'meuse_obs.shp')\n", "meuse_grid = gpd.read_file(dnld_url + 'meuse_grid.shp')\n", "\n", "# Extra information for visualization:\n", "xmin, ymin, xmax, ymax = meuse_grid.total_bounds\n", "meuse_riv = gpd.read_file(dnld_url + 'meuse_riv.shp')" ] }, { "cell_type": "markdown", "id": "d05a36f4", "metadata": {}, "source": [ "Let’s take a quick look at the dataset. Below is a map of the study area grid and the observation locations, plus the location of the river Meuse for reference. We can see that observed zinc concentrations tend to be higher closer to the river." ] }, { "cell_type": "code", "execution_count": null, "id": "a1683625", "metadata": {}, "outputs": [], "source": [ "(pn.ggplot()\n", " + pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)\n", " + pn.geom_map(meuse_grid, fill = None, size = 0.05)\n", " + pn.geom_map(meuse_obs, pn.aes(fill = 'zinc'), color = None, size = 3)\n", " + pn.scale_y_continuous(limits = [ymin, ymax]) \n", " + pn.scale_fill_continuous(trans = 'log1p') \n", " + pn.theme_minimal() \n", " + pn.coord_fixed()\n", ")" ] }, { "cell_type": "markdown", "id": "earned-broadway", "metadata": {}, "source": [ "\n", "## Step 2: Random Forest for spatial interpolation\n", "\n", "We will explore two methods from recent literature that combine spatial proximity information as variables in fitting Random Forest models for spatial interpolation. " ] }, { "cell_type": "markdown", "id": "3a6d5d39", "metadata": {}, "source": [ "### Step 2a: *RFsp*: distance to all observations\n", "\n", "First, we will implement the *RFsp* method from [Hengl et al 2018](https://doi.org/10.7717/peerj.5518). This method involves using the distance to every observation as a predictor. For example, if there are 10 observations of the target variable, then there would be 10 predictor variables with the ith predictor variable representing the distance to the ith observation. \n", "\n", "If you want to learn more about this approach, see the [thengl/GeoMLA](https://github.com/thengl/GeoMLA) GitHub repo or the [Spatial and spatiotemporal interpolation using Ensemble Machine Learning](https://opengeohub.github.io/spatial-prediction-eml/index.html) site from the same creators. Note: these resources are for R, but the latter does mention the `scikit-learn` library in Python that we will be using in this tutorial.\n", "\n", "To start, we will generate two DataFrames of distances:\n", "* One with rows representing observations and columns representing observations (these will be our data for fitting and testing the model)\n", "* One with rows representing grid cell centers and columns representing observations (these will be how we estimate maps of our target variable with the final model)" ] }, { "cell_type": "code", "execution_count": null, "id": "surgical-parameter", "metadata": {}, "outputs": [], "source": [ "# Get coordinates of grid cell centers - these are our locations at which we will\n", "# want to interpolate our target variable.\n", "grid_centers = meuse_grid.centroid\n", "\n", "# Generate a grid of distances to each observation\n", "grid_distances = pd.DataFrame()\n", "# We also need the distance values among our observations\n", "obs_distances = pd.DataFrame()\n", "\n", "# We need a dataframe with rows representing prediction grid cells\n", "# (or observations)\n", "# and columns representing observations\n", "for obs_index in range(meuse_obs.geometry.size):\n", " cur_obs = meuse_obs.geometry.iloc[obs_index]\n", " obs_name = 'obs_' + str(obs_index)\n", " \n", " cell_to_obs = grid_centers.distance(cur_obs).rename(obs_name)\n", " grid_distances = pd.concat([grid_distances, cell_to_obs], axis=1)\n", "\n", " obs_to_obs = meuse_obs.distance(cur_obs).rename(obs_name)\n", " obs_distances = pd.concat([obs_distances, obs_to_obs], axis=1)\n", " " ] }, { "cell_type": "markdown", "id": "a440ee18-6b7f-4bfb-b282-032c71d489c6", "metadata": {}, "source": [ "Before moving on to model fitting, let's take a look at the distance matrices we created." ] }, { "cell_type": "code", "execution_count": null, "id": "b114265f-0615-4bfd-bcbb-52f5fb71299c", "metadata": {}, "outputs": [], "source": [ "obs_distances" ] }, { "cell_type": "markdown", "id": "686dc861", "metadata": {}, "source": [ "For fitting our model, we will use the distances among observations as our predictors and observed zinc concentration at those observations as our target variable." ] }, { "cell_type": "code", "execution_count": null, "id": "81c831c1", "metadata": {}, "outputs": [], "source": [ "# matrix of distance to observations as predictors\n", "RFsp_X = obs_distances\n", "# vector of observed zinc concentration as target variable\n", "y = meuse_obs['zinc']\n", "\n", "# We need to split our dataset into train and test datasets. We'll use 80% of\n", "# the data for model training.\n", "RFsp_X_train, RFsp_X_test, RFsp_y_train, RFsp_y_test = train_test_split(RFsp_X, y, train_size=0.8)\n" ] }, { "cell_type": "markdown", "id": "a72d7324", "metadata": {}, "source": [ "Machine learning algorithms typically have hyperparameters that can be tuned per application. Here, we will tune the number of trees in the random forest model and the maximum depth of the trees in the the random forest model. We use the training subset of our data for this fitting and tuning process. After the code chunk below, the best parameter values from our search are printed. " ] }, { "cell_type": "code", "execution_count": null, "id": "17e27615", "metadata": {}, "outputs": [], "source": [ "r_state = 0\n", "\n", "# Define the parameter space that will be searched over.\n", "param_distributions = {'n_estimators': randint(1, 100),\n", " 'max_depth': randint(5, 10)}\n", "\n", "# Now create a searchCV object and fit it to the data.\n", "tuned_RFsp = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=r_state),\n", " n_iter=10,\n", " param_distributions=param_distributions,\n", " random_state=r_state).fit(RFsp_X_train, RFsp_y_train)\n", "tuned_RFsp.best_params_" ] }, { "cell_type": "markdown", "id": "bf952281", "metadata": {}, "source": [ "We can now use our testing subset of the data to quantify the model performance, i.e. how well did the model predict the remaining observed values? There are many potential metrics - see all the metrics `scikit-learn` supports [here](https://scikit-learn.org/stable/modules/model_evaluation.html#). The two we show below are the coefficient of determination ($R^2$) and the root mean square error ($RMSE$), two metrics that are likely familiar from outside machine learning as well." ] }, { "cell_type": "code", "execution_count": null, "id": "8536f77b", "metadata": {}, "outputs": [], "source": [ "print_metrics(RFsp_y_test, tuned_RFsp.predict(RFsp_X_test))" ] }, { "cell_type": "markdown", "id": "372f3e1a", "metadata": {}, "source": [ "Our $R^2$ is not awesome! We typically want $R^2$ values closer to $1$ and RMSE values closer to $0$. Note: $RMSE$ is in the units of the target variable, so our zinc concentrations. You can see the range of values of zinc concentrations in the legend in the figure above, from which you can get a sense of our error. \n", "\n", "**Excercise:** Modify the `param_distributions` and `n_iter` values above - can you improve the metrics? Note that you may also increase the processing time." ] }, { "cell_type": "markdown", "id": "2ab2eb71", "metadata": {}, "source": [ "Once we are happy with (or at least curious about!) the model, we can predict and visualize our zinc concentration field." ] }, { "cell_type": "code", "execution_count": null, "id": "955a1c71", "metadata": {}, "outputs": [], "source": [ "# Predict the value from all grid cells using their distances we determined above. \n", "meuse_grid['pred_RFsp'] = tuned_RFsp.predict(grid_distances)\n", "\n", "(pn.ggplot()\n", " + pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)\n", " + pn.geom_map(meuse_grid, pn.aes(fill = 'pred_RFsp'), color = 'white', size = 0.05)\n", " + pn.geom_map(meuse_obs)\n", " + pn.scale_y_continuous(limits = [ymin, ymax]) \n", " + pn.scale_fill_distiller(type = 'div', palette = 'RdYlBu',trans = 'log1p') \n", " + pn.theme_minimal() \n", " + pn.coord_fixed()\n", ")" ] }, { "cell_type": "markdown", "id": "ca40657a", "metadata": {}, "source": [ "### 2b: *RFSI*: n observed values and distance to those n observation locations\n", "\n", "Now, we will try the the *RFSI* method from [Sekulić et al 2020](https://doi.org/10.3390/rs12101687). In this method, instead of using distances to *all* observations as our predictors, we will use distances to the _n_ closest observations as well as the observed values at those locations as our predictors.\n", "\n", "Below, we define a function to find the _n_ closest observations and record their distances and values." ] }, { "cell_type": "code", "execution_count": null, "id": "4db5e98b", "metadata": {}, "outputs": [], "source": [ "def nclosest_dist_value(dist_ij, obs_i, n = 3):\n", " \"\"\"\n", " Given a distance matrix among i locations and j observation \n", " locations, j observed values, and the number of close\n", " observations desired, generates a dataframe of distances to\n", " and values at n closest observations for each of the i \n", " locations.\n", "\n", " dist_ij (DataFrame): distance matrix among i locations and j \n", " observation locations\n", " obs_i (Series): The i observed values\n", " n (int): The desired number of closest observations\n", " \"\"\"\n", " # Which observations are the n closest? \n", " # But do not include distance to oneself. \n", " # Note: ranks start at 1, not 0.\n", " nclosest_dist_ij = dist_ij.replace(0.0,np.nan).rank(axis = 1, method = 'first') <= n\n", " \n", " nclosest = pd.DataFrame()\n", "\n", " # For each observation, find the n nearest observations and\n", " # record the distance and target variable pairs\n", " for i in range(dist_ij.shape[0]):\n", " # Which obs are the n closest to the ith location?\n", " nclosest_j_indices = np.where(nclosest_dist_ij.iloc[i,:])\n", "\n", " # Save the distance to and observed value at the n closest\n", " # observations from the ith location\n", " i_loc_dist = dist_ij.iloc[i].iloc[nclosest_j_indices]\n", " sort_indices = i_loc_dist.values.argsort()\n", " i_loc_dist = i_loc_dist.iloc[sort_indices]\n", " i_loc_dist.rename(lambda x: 'dist' + str(np.where(x == i_loc_dist.index)[0][0]), inplace=True)\n", " \n", " i_loc_value = obs_i.iloc[nclosest_j_indices]\n", " i_loc_value = i_loc_value.iloc[sort_indices]\n", " i_loc_value.rename(lambda x: 'obs' + str(np.where(x == i_loc_value.index)[0][0]), inplace=True)\n", " i_loc = pd.concat([i_loc_dist,i_loc_value],axis = 0)\n", " nclosest = pd.concat([nclosest, pd.DataFrame(i_loc).transpose()], axis = 0)\n", "\n", " return nclosest" ] }, { "cell_type": "markdown", "id": "f70cbd7a", "metadata": {}, "source": [ "Let's now use that function to find and describe the n closest observations to each obsersevation and each grid cell. Note that we are taking advantage of the `obs_distances` and `grid_distances` variables we created for the *RFsp* approach. " ] }, { "cell_type": "code", "execution_count": null, "id": "f1b67e6c", "metadata": {}, "outputs": [], "source": [ "n = 10\n", "obs_nclosest_obs = nclosest_dist_value(obs_distances, meuse_obs['zinc'], n)\n", "grid_nclosest_obs = nclosest_dist_value(grid_distances, meuse_obs['zinc'], n)" ] }, { "cell_type": "markdown", "id": "87928652-ddf4-40f5-a100-1e32473d8f7e", "metadata": {}, "source": [ "Let's take a closer look at our new distance matrices." ] }, { "cell_type": "code", "execution_count": null, "id": "ee343194-b3f5-45ee-b54f-0204fc001ffc", "metadata": {}, "outputs": [], "source": [ "obs_nclosest_obs" ] }, { "cell_type": "markdown", "id": "9c267736", "metadata": {}, "source": [ "We will then use the same model fitting process as for *RFsp*." ] }, { "cell_type": "code", "execution_count": null, "id": "2d0baebe", "metadata": {}, "outputs": [], "source": [ "# matrix of distances to and observed values at the n closest observations as predictors\n", "RFSI_X = obs_nclosest_obs\n", "\n", "# We need to split our dataset into train and test datasets. We'll use 80% of\n", "# the data for model training.\n", "RFSI_X_train, RFSI_X_test, RFSI_y_train, RFSI_y_test = train_test_split(RFSI_X, y, train_size=0.8)\n", "\n", "param_distributions = {'n_estimators': randint(1, 100),\n", " 'max_depth': randint(5, 10)}\n", "tuned_RFSI = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=r_state),\n", " n_iter=10,\n", " param_distributions=param_distributions,\n", " random_state=r_state).fit(RFSI_X_train, RFSI_y_train)\n", "tuned_RFSI.best_params_\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a39806d6", "metadata": {}, "outputs": [], "source": [ "print_metrics(RFSI_y_test, tuned_RFSI.predict(RFSI_X_test))" ] }, { "cell_type": "markdown", "id": "c17a24b8", "metadata": {}, "source": [ "How does *RFSI*'s metrics compare to *RFsp*'s? What if you modify n, the number of closest observations? What if you modify the `param_distributions` and `n_iter` values like above?" ] }, { "cell_type": "markdown", "id": "9cbeabe4", "metadata": {}, "source": [ "Let's visualize the two maps from these two methods together. To do so, we will need to transform our `meuse_grid` DataFrame into a longer format for plotting with facets in `plotnine`." ] }, { "cell_type": "code", "execution_count": null, "id": "6bb06848", "metadata": {}, "outputs": [], "source": [ "meuse_grid['pred_RFSI'] = tuned_RFSI.predict(grid_nclosest_obs)\n", "meuse_grid_long = pd.melt(meuse_grid, id_vars= 'geometry', value_vars=['pred_RFsp','pred_RFSI'])\n", "\n", "(pn.ggplot()\n", " + pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)\n", " + pn.geom_map(meuse_grid_long, pn.aes(fill = 'value'), color = 'white', size = 0.05)\n", " + pn.geom_map(meuse_obs)\n", " + pn.scale_y_continuous(limits = [ymin, ymax]) \n", " + pn.scale_fill_distiller(type = 'div', palette = 'RdYlBu',trans = 'log1p') \n", " + pn.theme_minimal() \n", " + pn.coord_fixed() \n", " + pn.facet_wrap('variable')\n", ")" ] }, { "cell_type": "markdown", "id": "9148ff0b", "metadata": {}, "source": [ "\n", "## Step 3: Bringing in gridded covariates\n", "\n", "This dataset has three covariates supplied with the grid and the observations:\n", "* dist: the distance to the river\n", "* ffreq: a category describing the flooding frequency\n", "* soil: a category of soil type\n", "\n", "We can extend this spatial interpolation task into a more general spatial prediction task by including these co-located observations and gridded covariates. Let's visualize the flooding frequency: " ] }, { "cell_type": "code", "execution_count": null, "id": "fe5a97e7", "metadata": {}, "outputs": [], "source": [ "(pn.ggplot()\n", " + pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)\n", " + pn.geom_map(meuse_grid, pn.aes(fill = 'ffreq'), size = 0.05)\n", " + pn.geom_map(meuse_obs, size = 2)\n", " + pn.scale_y_continuous(limits = [ymin, ymax]) \n", " + pn.theme_minimal() \n", " + pn.coord_fixed()\n", ")" ] }, { "cell_type": "markdown", "id": "cc1ab6dd", "metadata": {}, "source": [ "Also visualize the other covariates. (Either one at a time, or try `melt`ing like above to use facets!). Do you expect these variables to improve the model?" ] }, { "cell_type": "markdown", "id": "293c9aa4", "metadata": {}, "source": [ "Adding these covariates to the RF model, either method, is straightforward. We will stick to just the *RFSI* model here. All that needs to be done is to concatenate these three columns to our distance (and observed values) dataset and repeat the modeling fitting process." ] }, { "cell_type": "code", "execution_count": null, "id": "c7bb455e", "metadata": {}, "outputs": [], "source": [ "# matrix of distances to and observed values at the n closest observations as predictors\n", "RFSI_wcov_X = pd.concat([obs_nclosest_obs.reset_index(),meuse_obs[['dist','ffreq','soil']]], axis=1)\n", "\n", "# We need to split our dataset into train and test datasets. We'll use 80% of\n", "# the data for model training.\n", "RFSI_wcov_X_train, RFSI_wcov_X_test, RFSI_wcov_y_train, RFSI_wcov_y_test = train_test_split(RFSI_wcov_X, y, train_size=0.8)\n", "\n", "param_distributions = {'n_estimators': randint(1, 100),\n", " 'max_depth': randint(5, 10)}\n", "tuned_RFSI_wcov = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=r_state),\n", " n_iter=10,\n", " param_distributions=param_distributions,\n", " random_state=r_state).fit(RFSI_wcov_X_train, RFSI_wcov_y_train)\n", "tuned_RFSI_wcov.best_params_" ] }, { "cell_type": "code", "execution_count": null, "id": "8874dfc0", "metadata": {}, "outputs": [], "source": [ "print_metrics(RFSI_wcov_y_test, tuned_RFSI_wcov.predict(RFSI_wcov_X_test))" ] }, { "cell_type": "markdown", "id": "b83ace15", "metadata": {}, "source": [ "How did the new covariates change our metrics? Was it as you'd expect? " ] }, { "cell_type": "code", "execution_count": null, "id": "c2e0744f", "metadata": {}, "outputs": [], "source": [ "grid_nclosest_obs_wcov = pd.concat([grid_nclosest_obs.reset_index(),meuse_grid[['dist','ffreq','soil']]], axis=1)\n", "meuse_grid['pred_RFSI_wcov'] = tuned_RFSI_wcov.predict(grid_nclosest_obs_wcov)\n", "\n", "meuse_grid_long = pd.melt(meuse_grid, id_vars= 'geometry', value_vars=['pred_RFsp','pred_RFSI','pred_RFSI_wcov'])\n", "\n", "\n", "(pn.ggplot()\n", " + pn.geom_map(meuse_riv, fill = '#1e90ff', color = None)\n", " + pn.geom_map(meuse_grid_long, pn.aes(fill = 'value'), color = 'white', size = 0.05)\n", " + pn.geom_map(meuse_obs, size = 0.25)\n", " + pn.scale_y_continuous(limits = [ymin, ymax]) \n", " + pn.scale_fill_distiller(type = 'div', palette = 'RdYlBu',trans = 'log1p') \n", " + pn.theme_minimal() \n", " + pn.coord_fixed() \n", " + pn.facet_wrap('variable')\n", ")" ] }, { "cell_type": "markdown", "id": "655ae3db", "metadata": {}, "source": [ "Use the covariates to create a `RFsp_wcov` model and add it to the figure. How do the metrics compare to the other results?" ] } ], "metadata": { "kernelspec": { "display_name": "workshop", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }