{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--\n",
    "---\n",
    "title: Handling vector data\n",
    "layout: single\n",
    "author: Heather Savoy\n",
    "author_profile: true\n",
    "header:\n",
    "  overlay_color: \"444444\"\n",
    "  overlay_image: /assets/images/margaret-weir-GZyjbLNOaFg-unsplash_dark.jpg\n",
    "---\n",
    "-->\n",
    "\n",
    "**Last Update:** 7 October 2022 <br />\n",
    "**Download Jupyter Notebook**: [GRWG22_VectorData.ipynb](https://geospatial.101workbook.org/tutorials/GRWG22_VectorData.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "This tutorial covers how to manipulate geospatial vector datasets in python. A \n",
    "dataset with polygon geometries representing wildfire perimeters is joined with \n",
    "a dataset with point geometries representing air quality monitoring stations to\n",
    "determine which stations observed unhealthy concentrations of small particulate \n",
    "matter (PM2.5) in the atmosphere around the time of the Camp Fire in northern CA \n",
    "in 2018.  \n",
    "\n",
    "\n",
    "*Language*: `Python`\n",
    "\n",
    "*Primary Libraries/Packages*:\n",
    "\n",
    "|Name|Description|Link|\n",
    "|-|-|-|\n",
    "| geopandas | GeoPandas is an open source project to make working with geospatial data in python easier. | https://geopandas.org/en/stable/ |\n",
    "| plotnine | A Grammar of Graphics for Python | https://plotnine.readthedocs.io/en/stable/ |\n",
    "\n",
    "\n",
    "## Nomenclature\n",
    "\n",
    "* *Vector data:* Spatial data defined by simple geometry types (points, lines, \n",
    "  and polygons) where each geometry feature can be assigned non-spatial attributes.\n",
    "* *CRS:* Coordinate Reference System, also known as a spatial reference system. A\n",
    "  system for defining geospatial coordinates.\n",
    "* *Spatial join:* Combining two spatial datasets by the relationship between their\n",
    "  geometries.\n",
    "\n",
    "## Data Details\n",
    "\n",
    "* Data: National Interagency Fire Center's Historic Perimeters dataset\n",
    "* Link: [https://data-nifc.opendata.arcgis.com/datasets/nifc::historic-perimeters-combined-2000-2018-geomac/explore](https://data-nifc.opendata.arcgis.com/datasets/nifc::historic-perimeters-combined-2000-2018-geomac/explore)\n",
    "* Other Details: The original dataset contains perimeters of wildfires in the US \n",
    "  from 2000-2018 as a polygon feature collection. For this tutorial, the wildfire\n",
    "  perimeters in CA during 2018 were extracted. \n",
    "\n",
    "* Data: US EPA's Air Quality System (AQS) database\n",
    "* Link: [https://aqs.epa.gov/aqsweb/documents/data_api.html](https://aqs.epa.gov/aqsweb/documents/data_api.html)\n",
    "* Other Details: PM2.5 concentration data from this database covering CA in 2018 \n",
    "  were retrieved and pre-processed for this tutorial. \n",
    "\n",
    "## Analysis Steps\n",
    "\n",
    "  1. **Fire perimeter data**\n",
    "    - read in and visualize the wildfire perimeter data\n",
    "       - Read in geojson file\n",
    "       - Visualize perimeters on map of CA\n",
    "       - Visualize non-spatial attributes\n",
    "  2. **Air quality data**\n",
    "    - read in the shapefile\n",
    "  3. **Buffer and spatial join**\n",
    "    - find the air quality stations within 200km of the fire perimeter\n",
    "  4. **Visualize**\n",
    "    -  air quality around the Camp Fire"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 0: Import Libraries / Packages\n",
    "\n",
    "Below are commands to run to create a new Conda environment named 'geoenv' that contains the packages used in this tutorial series. To learn more about using Conda environments on Ceres, see [this guide](https://scinet.usda.gov/guide/conda/). NOTE: If you have used other Geospatial Workbook tutorials from the SCINet Geospatial Research Working Group Workshop 2022, you may have aleady created this environment and may skip to launching JupyterHub.\n",
    "\n",
    "First, we allocate resources on a compute (Ceres) or development (Atlas) node so we do not burden the login node with the conda installations. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On Ceres:\n",
    "```bash\n",
    "salloc\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On Atlas (you will need to replace `yourProjectName` with one of your project's name):\n",
    "```bash\n",
    "srun -A yourProjectName -p development --pty --preserve-env bash\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we load the `miniconda` conda module available on Ceres and Atlas to access the `Conda` commands to create environments, activate them, and install Python and packages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "source": [
    "```bash\n",
    "module load miniconda\n",
    "conda create --name geoenv\n",
    "source activate geoenv\n",
    "conda install geopandas rioxarray rasterstats plotnine ipython ipykernel dask dask-jobqueue -c conda-forge\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To have JupyterLab use this conda environment, we will make a kernel.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "source": [
    "```bash\n",
    "ipython kernel install --user --name=geo_kernel\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial assumes you are running this python notebook in JupyterLab. The \n",
    "easiest way to do that is with Open OnDemand (OoD) on [Ceres](http://ceres-ood.scinet.usda.gov/)\n",
    "or [Atlas](https://atlas-ood.hpc.msstate.edu/). \n",
    "Select the following parameter values when requesting a JupyterLab\n",
    "app to be launched depending on which cluster you choose. All other values can \n",
    "be left to their defaults. Note: on Atlas, we are using the development partition\n",
    "so that we have internet access to download files since the regular compute nodes\n",
    "on the `atlas` partition do not have internet access.\n",
    "\n",
    "Ceres:\n",
    "* `Slurm Partition`: short\n",
    "* `Number of hours`: 1\n",
    "* `Number of cores`: 2\n",
    "\n",
    "Atlas:\n",
    "* `Partition Name`: development \n",
    "* `QOS`: normal\n",
    "* `Number of hours`: 1\n",
    "* `Number of tasks`: 2\n",
    "\n",
    "To download the python notebook file for this tutorial to either cluster within OoD, \n",
    "you can use the following lines in the python console:\n",
    "\n",
    "```python\n",
    "import urllib.request\n",
    "tutorial_name = 'GRWG22_VectorData.ipynb'\n",
    "urllib.request.urlretrieve('https://geospatial.101workbook.org/tutorials/' + tutorial_name, \n",
    "                           tutorial_name)\n",
    "```\n",
    "\n",
    "Once you are in JupyterLab with this notebook open, select your kernel by clicking on the *Switch kernel* button in the top right corner of the editor. A pop-up will appear with a dropdown menu containing the *geo_kernel* kernel we made above. Click on the *geo_kernel* kernel and click the *Select* button. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import geopandas as gpd\n",
    "import numpy as np\n",
    "from plotnine import ggplot, geom_map, aes, theme, geom_histogram, scale_x_datetime, geom_line, ylab, xlab, annotate, geom_vline\n",
    "from datetime import datetime, date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='fire'></a>\n",
    "### Step 1: Read in fire perimeter data and visualize\n",
    "\n",
    "The National Interagency Fire Center's Historic Perimeters dataset has 23,776 \n",
    "polygons representing wildfire perimeters from 2000-2018. A version of the dataset \n",
    "filtered to wildfires in CA in 2018, since that is when the \n",
    "destructive Camp Fire occurred, will be used in this tutorial. \n",
    "\n",
    "We will transform to planar coordinates for distance calculations. The 5070 EPSG\n",
    "code is for the Equal Area CONUS Albers. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fire_f = 'Historic_Perimeters_Combined_2000-2018_GeoMAC_CA2018.geojson'\n",
    "dnld_url = 'https://geospatial.101workbook.org/ExampleGeoWorkflows/assets/'\n",
    "all_fires = gpd.read_file(dnld_url + fire_f)\n",
    "fire_CA2018 = all_fires.to_crs(5070)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get an idea of what the data look like, we can create a map. Since this \n",
    "feature collection has several attributes, we can also visualize, for example, \n",
    "when the fires occurred during the year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CA boundary\n",
    "census_states = gpd.read_file('https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_state_20m.zip')\n",
    "CA = census_states.loc[census_states['NAME'] == 'California'].to_crs(fire_CA2018.crs)\n",
    "\n",
    "# The Camp Fire feature\n",
    "camp_fire = fire_CA2018.loc[fire_CA2018['incidentname'] == 'CAMP']\n",
    "\n",
    "# Plot a map of all wildfires and \n",
    "# highlight the Camp Fire.\n",
    "ggplot() + \\\n",
    "    geom_map(CA, fill = 'white') + \\\n",
    "    geom_map(fire_CA2018, fill = 'black', color = None) + \\\n",
    "    geom_map(camp_fire, fill = 'red', color = None) + \\\n",
    "    theme(figure_size = (6,8))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since this feature collection has several attributes, we can also visualize, \n",
    "for example, when the fires occurred during the year. Note that the dates are\n",
    "when the fires' perimeters are established, not when the fires start. Since fires \n",
    "can endure for many days and change their spatial extent over that time, this\n",
    "date is when the maximum extent of the fire was determined, which needs to be \n",
    "at the end of the fire."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plotting when wildfires occurred throughout 2018\n",
    "ggplot(fire_CA2018, aes('perimeterdatetime')) + \\\n",
    "  geom_histogram() + \\\n",
    "  scale_x_datetime(breaks = '1 month',\n",
    "                   date_labels = '%b')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='air'></a>\n",
    "### Step 2: Read in air quality data\n",
    "\n",
    "We will read in an air quality dataset to showcase combining multiple vector \n",
    "datasets. The geometry type is point, representing sampling stations. This \n",
    "dataset can be downloaded with an API, but you need to register an account, so \n",
    "the download process has been done already and the data are available in our \n",
    "GitHub repository. Some pre-processing for this dataset was done to make it into \n",
    "a feature collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CA_PM25 = gpd.read_file(dnld_url + 'air_quality_CA2018.zip').to_crs(fire_CA2018.crs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "<a id='merge'></a>\n",
    "### Step 3: Find the air quality stations within 200km of the fire perimeter\n",
    "\n",
    "We will buffer the Camp Fire polygon and join that to the air quality data to \n",
    "find the nearby stations. Note that the direction of joining is important: the \n",
    "geometry of the left dataset, in this case the air quality points, will be \n",
    "retained and the attributes of intersecting features of the right dataset will \n",
    "be retained. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "camp_zone = camp_fire.copy()   \n",
    "camp_zone['geometry'] = camp_fire.buffer(200000) # meters, buffer is in unit of CRS\n",
    "air_fire = CA_PM25.sjoin(camp_zone, how=\"inner\", predicate='intersects')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='merge_plot'></a>\n",
    "### Step 4: Visualize air quality around the Camp Fire\n",
    "\n",
    "'Around' in this case should be in both space and time. We already found the \n",
    "air quality stations within 200 km from the fire perimeter. To make sure we are\n",
    "looking at the right time of year, we will relate the non-spatial attributes of\n",
    "the two feature collections. Both datasets come with date columns: when the air\n",
    "quality was recorded and when the perimeter was determined (so the latter is \n",
    "an approximate time near the end of the fire). We will filter our spatially-\n",
    "joined dataset to within 30 days of the fire perimeter date so we only consider\n",
    "air quality within a 2-month window around the time of the fire. \n",
    "\n",
    "We can then plot the air quality metric of PM2.5 concentration (lower is better) \n",
    "as a function of time before/after the fire perimeter date. Each line is an \n",
    "individual air quality station. The figure indicates that there were several \n",
    "stations that recorded unhealthy to hazardous PM2.5 concentrations around the\n",
    "time of this wildfire. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter to 30 days from fire perimeter date -\n",
    "# local date for air quality measurement\n",
    "air_fire['dat_local'] = [datetime.strptime(dt, '%Y-%m-%d').date() for dt in list(air_fire['dat_lcl'])] \n",
    "# fire perimeter date\n",
    "air_fire['perimeterdate'] = [dt.date() for dt in air_fire['perimeterdatetime']] \n",
    "air_fire['date_shift'] = air_fire['dat_local'] - air_fire['perimeterdate']\n",
    "# Arithmetic mean of PM2.5 concentration\n",
    "air_fire['arthmt_'] = air_fire['arthmt_'].astype('float64')\n",
    "air_fire['station_id'] = air_fire['stat_cd'] + air_fire['cnty_cd'] + air_fire['st_nmbr']\n",
    "air_near_fire = air_fire.loc[abs(air_fire['date_shift']).astype('timedelta64[D]') < 30]\n",
    "\n",
    "# Define bounds of Camp Fire dates for illustrative purposes\n",
    "# Camp Fire burned for 17 days- add a polygon to graph to visualize temporal overlap with low air quality\n",
    "camp_dates = [datetime.strptime(dt, '%Y-%m-%d').date() for dt in ['2018-11-08','2018-11-25']]\n",
    "\n",
    "# Camp Fire's maximum perimeter established at end of fire\n",
    "fp_date = air_fire['perimeterdate'][0]\n",
    "\n",
    "ggplot(air_near_fire, aes('dat_local','arthmt_')) + \\\n",
    "  annotate('rect', xmin = camp_dates[0], xmax = camp_dates[1],\n",
    "           ymin = -np.inf, ymax = np.inf,\n",
    "           fill = 'firebrick', alpha = 0.5) + \\\n",
    "  geom_line(aes(group='station_id')) + \\\n",
    "  geom_vline(xintercept = fp_date, \n",
    "             color = 'red',\n",
    "             size = 1,\n",
    "             linetype = 'dashed') + \\\n",
    "  scale_x_datetime(name = 'Date', date_breaks = \"10 days\", date_labels = \"%m-%d-%y\") + \\\n",
    "  ylab('PM2.5 [micrograms/cubic meter]')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then map where those stations were - quite an \n",
    "extensive impact on air quality!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 101 micrograms/cubic meter and higher is considered\n",
    "# unhealthy for sensitive groups \n",
    "unhealthy_air = air_near_fire.loc[air_near_fire['arthmt_'] > 100]\n",
    "\n",
    "ggplot() + \\\n",
    "   geom_map(CA, fill = 'white') + \\\n",
    "   geom_map(camp_zone, fill = None, color = 'red') + \\\n",
    "   geom_map(camp_fire, fill = 'red', color = None) + \\\n",
    "   geom_map(unhealthy_air, color = 'red') + \\\n",
    "   theme(figure_size = (6,8))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.13 ('workshop')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "a5fd0ed268e7e1b4f4ff0b0f7170b45930872f3469ffae2abb34377fde66ce90"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}