Last Update: 17 September 2023
Overview
This tutorial explores raster file properties, how those properties affect file sizes, and how they affect geospatial processing execution times. Commands shown are for direct use of the Geospatial Data Abstraction Library (GDAL) which is a library that is often used by other geospatial software including R packages, Python packages, QGIS, and ESRI products.
Both Atlas and Ceres have GDAL installed as a module. To start using GDAL on either cluster, load the module with the following command:
module load gdal
Language: bash
Primary Libraries/Packages:
Name | Description | Link |
---|---|---|
GDAL | Geospatial Data Abstraction Library | https://gdal.org/ |
Nomenclature
- Raster: Data defined on a grid of geospatial coordinates
- Block: The configuration of how raster values are read from the file.
- Data type: If raster values are stored as integers or floating point numbers, if they are signed, and how many bits are used to store a value.
Sections
- Print file properties of raster datasets
- Effects of data types on file sizes
- Effects of tiles on cropping execution time
- Effects of virtual rasters on cropping execution time
Section 1: Print file properties of raster datasets
First, we will look into what kind of raster file properties can be provided by GDAL.
There is a command called gdalinfo
which will print such information for whichever
raster filename you pass to it.
gdalinfo <raster_filepath>
To illustrate a variety of raster file property combinations, we will look at the datasets availabile in the Geospatial Common Data Library (GeoCDL) on Ceres (see this tutorial to learn more).
For example, calling gdalinfo
on one of the source PRISM data files in the GeoCDL,
the output looks like this:
gdalinfo prism/PRISM_ppt_stable_4kmM3_200807_bil.bil
Driver: EHdr/ESRI .hdr Labelled
Files: prism/PRISM_ppt_stable_4kmM3_200807_bil.bil
prism/PRISM_ppt_stable_4kmM3_200807_bil.bil.aux.xml
prism/PRISM_ppt_stable_4kmM3_200807_bil.hdr
prism/PRISM_ppt_stable_4kmM3_200807_bil.stx
prism/PRISM_ppt_stable_4kmM3_200807_bil.prj
Size is 1405, 621
Coordinate System is:
GEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]],
ID["EPSG",6269]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["longitude",east,
ORDER[1],
ANGLEUNIT["Degree",0.0174532925199433]],
AXIS["latitude",north,
ORDER[2],
ANGLEUNIT["Degree",0.0174532925199433]]]
Data axis to CRS axis mapping: 1,2
Origin = (-125.020833333333357,49.937499999999751)
Pixel Size = (0.041666666666700,-0.041666666666700)
Corner Coordinates:
Upper Left (-125.0208333, 49.9375000) (125d 1'15.00"W, 49d56'15.00"N)
Lower Left (-125.0208333, 24.0625000) (125d 1'15.00"W, 24d 3'45.00"N)
Upper Right ( -66.4791667, 49.9375000) ( 66d28'45.00"W, 49d56'15.00"N)
Lower Right ( -66.4791667, 24.0625000) ( 66d28'45.00"W, 24d 3'45.00"N)
Center ( -95.7500000, 37.0000000) ( 95d45' 0.00"W, 37d 0' 0.00"N)
Band 1 Block=1405x1 Type=Float32, ColorInterp=Undefined
Min=0.000 Max=556.523
Minimum=0.000, Maximum=556.523, Mean=72.033, StdDev=61.444
NoData Value=-9999
Metadata:
STATISTICS_MAXIMUM=556.52301025391
STATISTICS_MEAN=72.033263630928
STATISTICS_MINIMUM=0
STATISTICS_STDDEV=61.444120607318
This output has some recognizable geospatial metadata such as the coordinate reference
system (Coordinate System
) and the spatial resolution (Pixel Size
). The raster file
properties discernable from this output that we will be focusing on are:
- File format (
Driver
,Files
): This is discerned by the file extension of the raster file being described and potentially also the GDAL driver used to read that type of file. Here, the PRISM file has an extension.bil
for which GDAL uses itsEHdr/ESRI .hdr Labelled
driver to read. The.bil
format is developed by ESRI, along with other similar raster file formats, so GDAL uses a combined driver to read these formats. When reading the GDAL documentation about what format-specific options there might be, the content will be organized by driver. For example, there is a GeoTIFF driver. - Data type (
Type
): If raster values are stored as integers or floating point numbers, if they are signed, and how many bits are used to store a value. - X-Y dimensions (
Size
): This is the number of columns and rows of pixels in the image discretizing space (typically horizontal space). - Block size (
Block
): The dimensions (columns and rows) of the tile or strip (or scanline) that represent the file access chunk size. When blocks are configured as square subsets of a raster, they are referred to as tiles.
Here is a table summarizing these properties from several of the GeoCDL datasets:
Dataset | File format | Data type | X-Y dimensions | Block size |
---|---|---|---|---|
DaymetV4 | .nc | Float32 | 7814, 8075 | 1000x1000 |
GTOPO | .dem | Int16 | 4800, 60001 | 4800x1 |
NASS CDL | .tif | Byte | 153811, 96523 | 512x512 |
NLCD | .img | Byte | 161190, 104424 | 512x512 |
PRISM | .bil | Float32 | 1405, 621 | 1405x1 |
SMAP-HB 1km | .nc | Float32 | 6996, 3120 | 1400x624 |
SRTM | .bil | Int16 | 3601,36011 | 3601x1 |
VIP | .hdf | Int16, UInt16, Int8 | 3600, 7200 | 512x512 |
The following sections will explore some file storage implications for data types and computation time implications for block sizes.
Section 2: Effects of data types on file sizes
Data types indicate the range and precision of values that can be stored in the file. There are three main components to differentiate data types:
- Integer or floating point: is it a whole number (integer) or a decimal (floating point)? It takes more bits to store the extra precision of 1.0001 than 1, and thus you can store a greater range of values as integers than floats or consume less disk space storing similar ranges of values as integers than floats.
- Signed or unsigned: do the values range from negative to positive, or are they strictly non-negative? This effects the range of values that can be stored.
- Number of bits: how many bits are used to store a value? This is typically 8-bit to 64-bit and dictates the ultimate file size.
The values that are able to be stored in a raster depend on the data type.
For example, the UInt16
(unsigned 16-bit integer) data type can
hold a range of 2^16 integers within [0, 65535]. The Int16
(signed 16-bit integer) data type also covers 2^16 integer values, but within
[-32768, 32767].
The data types that GDAL supports in ascending order of size are:
Byte
, Int8
< UInt16
, Int16
< UInt32
, Int32
, Float32
< UInt64
, Int64
, Float64
.
The support for some of these data types will depend on the version of GDAL
installed. Based on our table of GeoCDL datasets, the Byte
, Int16
, and
Float32
are common data types. Each band in a raster data file can have
its own data type, e.g., VIP has multiple data types in the table above.
Note that these data type names are from GDAL’s naming scheme. Other
programs might have slightly different names, e.g. ‘uint8’
instead of ‘Byte’ for an unsigned 8-bit integer.
Some datasets, like VIP,
use scaling factors (e.g., dividing NDVI by 0.0001)
to help reduce file size with the use of integer instead of float data types.
We can explore this concept a bit further to see the effect of modifying a
raster’s data type (with and without scaling) on the file size and data loss.
For example, let’s use our example PRISM data file again, which holds precipitation
(ppt) values as the Float32
data type. Since precipitation is typically a measurement
of height and a continuous number, storing it as a floating point number makes sense.
However, depending on the precision needed, disk space could be saved by storing
these precipitation values as scaled integers. In the gdalinfo
output printed above,
the range of precipitation values in the file is [0, 556.523] mm. If we scale those values
by 100 ([0, 55652.3]) and then save them as integers ([0, 55652]), we could store them as
UInt16
, which would take up half the disk space (going from 32-bit to 16-bit data type)
while also only losing precision around 0.001 mm of precipitation. We could not, however,
successfully store all values as Int16
since 55652 is outside the supported range
of [-32768, 32767] (pixels with scaled values >32767 would become no-data pixels).
Also, precipitation itself is non-negative, so an unsigned data type is sensible.
The table below summarizes a range of scaling and data type conversion outcomes for
this PRISM file. Ultimately, scaling by 100 and saving the result as a UInt16
has
the best balance of reducing file size and preventing data loss out of all scenarios
considered.
Data type | Writing error | File size (MB) | Scaling factor -> RMSE from original file (% pixels lost) |
---|---|---|---|
Byte |
Warning: detected values outside of the limits of datatype INT1U (for all scaling factors) |
0.873 | 1 (None) -> 0.56 (1.2%) 10 -> 0.054 (70%) 100 -> 0.0042 (92%) |
UInt16 |
1.7 | 1 (None) -> 0.56 10 -> 0.056 100 -> 0.0057 |
|
Int16 |
Warning: detected values outside of the limits of datatype INT2S (for scaling factor = 100) |
1.7 | 1 (None) -> 0.56 10 -> 0.056 100 -> 0.0057 (0.3%) |
UInt32 |
3.5 | 1 (None) -> 0.56 10 -> 0.056 100 -> 0.0057 |
|
Int32 |
3.5 | 1 (None) -> 0.56 10 -> 0.056 100 -> 0.0057 |
|
Float32 (original data type) |
3.5 | N/A |
Section 3: Effects of tiles on cropping execution time
Modifying the block size of a raster can improve the computation time of certain geospatial analyses. For example, we can look at the effect of setting tiles for a GeoTIFF so that values are read in by square (typically of dimensions that are multiples of 16) chunks.
In GDAL, you can create tiles or change tile sizes for a raster with the gdal_translate
command with the following parameters:
gdal_translate -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 input.tif interim.tif
Or remove or not create tiles:
gdal_translate -co TILED=NO input.tif interim.tif
To illustrate the effect of tiles on computation time, lets take the NASS CDL dataset.
From our dataset table above, we see that it is already tiled with blocks of 512x512 pixels.
We can use the gdal_translate -co TILED=NO
line above to make a copy that is not tiled,
which results in a block size of 153811x1 (one row of pixels across the whole image.)
This dataset covers CONUS and we will ask GDAL to crop it to some small county-sized area
specified by a shapefile:
gdalwarp -crop_to_cutline -cutline my_county.shp interim.tif output.tif
The table below shows how long that cropping command took to execute with the without the tiles. With tiles, it took half the time! This is because GDAL had to read in a smaller amount of the entire raster to cover the small cropping area, a small number of local 512x512 tiles instead of many 153811x1 strips across CONUS. So this computation time benefit will depend on the relative sizes of the blocks, the cropping area, and the original raster extent.
Tile/block size | Computation time (s) |
---|---|
512x512 | ~0.8 |
153811x1 | ~1.5 |
Section 4: Effects of virtual rasters on cropping execution time
Another raster file concept that can save computation time is the GDAL Virtual Format raster (VRT). This is essentially a way to store interim raster operation instructions, as opposed to the actualy interim raster values, which can save processing time if the total sequence of all raster operations needed can be calculated in one combined GDAL operation.
For example, let’s revisit the process of creating tiled GeoTIFFs to improve cropping execution times. If we request first a tiled geoTIFF, then crop it:
gdal_translate -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 input.tif interim.tif
gdalwarp -crop_to_cutline -cutline input.shp interim.tif output.tif
that can still take several minutes to tile if that GeoTIFF is large, e.g. several GB.
What we can do instead is request a tiled VRT and pass that (essentially instructions, not data) to the cropping command:
gdal_translate -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 input.tif interim.vrt
gdalwarp -crop_to_cutline -cutline input.shp interim.vrt output.tif
This generates the same result as making the interim GeoTIFF instead of the VRT, but the VRT method takes less than 1 second instead of minutes.
VRT | Task | Computation time (s) |
---|---|---|
N | Creating the tiled GeoTIFF (several GB) | ~ 4.5 min |
N | Cropping the tiled GeoTIFF | < 1 sec |
Y | Creating the tiled VRT (37 KB) | < 1 sec |
Y | Cropping the tiled VRT: | < 1 sec |