1. stars data model

Edzer Pebesma

For a better version of the stars vignettes see https://r-spatial.github.io/stars/articles/

This vignette explains the data model of stars objects, illustrated using artificial and real datasets.

Stars objects

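stars objects consist of a named list of arrays, each with a named dim attribute, and an attribute called dimensions, of class dimensions, that carries the dimension metadata.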

A dimensions object is a named list of dimension elements, each describing the semantics of a dimension of the data arrays (space, time, type etc). In addition to that, a dimensions object has an attribute called raster of class stars_raster, which is a named list with three elements: affine, dimensions and curvilinear.

The affine and curvilinear values are only relevant in case of raster data, which is indicated by dimensions having non-NA values.

A dimension object describes a single dimension; it is a list with the named elements from, to, offset, delta, refsys, point and values.

Clearly, offset and delta only apply to regularly discretized dimensions, and are NA if this is not the case. from and to will usually be 1 and the dimension size, but from may be larger than 1 in case a regular sub-grid got cut out or was cropped. Rectilinear and curvilinear grids need grid values in values; this can be irregularly spaced coordinate values, or coordinate intervals of irregular width, or spatial geometries encoded in an sfc vector (“list-column”), or a matrix with grid cell centre values (longitude or latitude) for curvilinear grids.
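The relation between from, to, offset and delta can be sketched in base R. The helper cell_edges below is hypothetical (it is not part of stars), and the sketch assumes that a cropped sub-grid keeps the offset and delta of the full grid, so that only from and to change:

```r
# Hypothetical helper (not a stars function): cell boundaries of a regular
# dimension, computed from its from, to, offset and delta fields.
cell_edges = function(from, to, offset, delta) {
  offset + ((from - 1):to) * delta
}

# a full dimension with 5 cells: offset 0, delta 1, indexes 1..5
edges_full = cell_edges(from = 1, to = 5, offset = 0, delta = 1)

# a sub-grid cut out at indexes 3..5 (assumed to keep offset and delta)
edges_crop = cell_edges(from = 3, to = 5, offset = 0, delta = 1)

print(edges_full)
print(edges_crop)
```

A dimension with n cells has n + 1 boundaries, which is why the first call returns six values for five cells.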

Grid type

Regular grids

With a very simple object created from a \(5 \times 4\) matrix

suppressPackageStartupMessages(library(stars))
m = matrix(1:20, nrow = 5, ncol = 4)
dim(m) = c(x = 5, y = 4) # named dim
(s = st_as_stars(m))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##       A1        
##  Min.   : 1.00  
##  1st Qu.: 5.75  
##  Median :10.50  
##  Mean   :10.50  
##  3rd Qu.:15.25  
##  Max.   :20.00  
## dimension(s):
##   from to offset delta refsys point values    
## x    1  5      0     1     NA FALSE   NULL [x]
## y    1  4      0     1     NA FALSE   NULL [y]

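we see that the matrix rows (5) are mapped to the first dimension, x, and the columns (4) to the second dimension, y, and that the from and to fields of each dimension span the corresponding array dimension: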

dim(s[[1]])
## x y 
## 5 4

When we plot this object, using the image method for stars objects,

image(s, text_values = TRUE, axes = TRUE)

we see that \((0,0)\) is the origin of the grid (grid corner), and that the coordinate value increases by \(1\) from one index (row, col) to the next. It means that consecutive matrix columns represent grid lines, going from south to north. Grids defined this way are regular: grid cell size is constant everywhere.

Many actual grid datasets have y coordinates (grid rows) going from North to South (top to bottom); this is realised with a negative value for delta. We see that the grid origin \((0,0)\) did not change:

attr(s, "dimensions")[[2]]$delta = -1
image(s, text_values = TRUE, axes = TRUE)
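As a worked check, assume that the starting edge of the cell with (1-based) index \(j\) is the offset \(o_y\) plus \(j-1\) times the delta \(d_y\). With \(o_y = 0\) and \(d_y = -1\), the four cells along y have the boundaries

\[y_j = o_y + (j-1) d_y = -(j-1), \quad j = 1, \ldots, 5\]

that is \(0, -1, -2, -3, -4\). The first row of cells now lies at the top, between \(y = 0\) and \(y = -1\), and the grid spans \(y\) from \(-4\) to \(0\).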

An example is the GeoTIFF carried in the package, which, as probably all data sources read through GDAL, has a negative delta for the y-coordinate:

tif = system.file("tif/L7_ETMs.tif", package = "stars")
st_dimensions(read_stars(tif))["y"]
##   from  to  offset delta                       refsys point values
## y    1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE   NULL

Raster attributes, rotated and sheared grids

Dimension tables of stars objects carry a raster attribute:

str(attr(st_dimensions(s), "raster"))
## List of 3
##  $ affine     : num [1:2] 0 0
##  $ dimensions : chr [1:2] "x" "y"
##  $ curvilinear: logi FALSE
##  - attr(*, "class")= chr "stars_raster"

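which is a list that holds the two affine parameters (affine), the names of the two dimensions that form the raster (dimensions), and a logical value indicating whether the grid is curvilinear (curvilinear).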

These fields are needed at this level, because they describe properties of the array at a higher level than individual dimensions do: a pair of dimensions forms a raster, both affine and curvilinear describe how x and y as a pair are derived from grid indexes (see below) when this cannot be done on a per-dimension basis.

With two affine parameters \(a_1\) and \(a_2\), \(x\) and \(y\) coordinates are derived from (1-based) grid indexes \(i\) and \(j\), grid offset values \(o_x\) and \(o_y\), and grid cell sizes \(d_x\) and \(d_y\) by

\[x = o_x + (i-1) d_x + (j-1) a_1\]

\[y = o_y + (i-1) a_2 + (j-1) d_y\]

Clearly, when \(a_1=a_2=0\), \(x\) and \(y\) are entirely derived from their respective index, offset and cellsize.

Note that for integer indexes, the coordinates are those of the starting edge of a grid cell; to get the grid cell center of the top left grid cell (in case of a negative \(d_y\)), use \(i=1.5\) and \(j=1.5\).
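The two equations can be evaluated directly in base R. The function grid_xy below is a hypothetical helper written for this illustration (it is not a stars function); its arguments are offset = c(o_x, o_y), delta = c(d_x, d_y) and affine = c(a_1, a_2):

```r
# Hypothetical helper (not a stars function): evaluate the two equations above.
grid_xy = function(i, j, offset, delta, affine = c(0, 0)) {
  x = offset[1] + (i - 1) * delta[1] + (j - 1) * affine[1]
  y = offset[2] + (i - 1) * affine[2] + (j - 1) * delta[2]
  c(x = x, y = y)
}

# starting corner of cell (1, 1): equal to the offset
corner = grid_xy(1, 1, offset = c(0, 0), delta = c(1, -1))

# centre of cell (1, 1): half a cell further along both indexes
centre = grid_xy(1.5, 1.5, offset = c(0, 0), delta = c(1, -1))

# with non-zero affine parameters, x also depends on j and y also on i
mixed = grid_xy(2, 3, offset = c(0, 0), delta = c(1, -1), affine = c(0.1, 0.2))

print(rbind(corner, centre, mixed))
```

With both affine parameters equal to zero, the helper reduces to the per-dimension rule of offset plus index times delta.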

We can rotate grids by setting \(a_1\) and \(a_2\) to a non-zero value:

attr(attr(s, "dimensions"), "raster")$affine = c(0.1, 0.1)
plot(st_as_sf(s, as_points = FALSE), axes = TRUE, nbreaks = 20)

The rotation angle, in degrees, is

atan2(0.1, 1) * 180 / pi
## [1] 5.710593

Sheared grids are obtained when the two rotation coefficients, \(a_1\) and \(a_2\), are unequal:

attr(attr(s, "dimensions"), "raster")$affine = c(0.1, 0.2)
plot(st_as_sf(s, as_points = FALSE), axes = TRUE, nbreaks = 20)

Now the y-axis and x-axis are rotated by different angles, which are, in degrees and respectively,

atan2(c(0.1, 0.2), 1) * 180 / pi
## [1]  5.710593 11.309932

Rectilinear grids

Rectilinear grids have orthogonal axes, but do not have congruent (equally sized and shaped) cells: each axis has its own irregular subdivision.

We can define a rectilinear grid by specifying the cell boundaries, meaning for every dimension we specify one more value than the dimension size:

x = c(0, 0.5, 1, 2, 4, 5)  # 6 numbers: boundaries!
y = c(0.3, 0.5, 1, 2, 2.2) # 5 numbers: boundaries!
(r = st_as_stars(list(m = m), dimensions = st_dimensions(x = x, y = y)))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##        m        
##  Min.   : 1.00  
##  1st Qu.: 5.75  
##  Median :10.50  
##  Mean   :10.50  
##  3rd Qu.:15.25  
##  Max.   :20.00  
## dimension(s):
##   from to offset delta refsys point                values    
## x    1  5     NA    NA     NA FALSE     [0,0.5),...,[4,5) [x]
## y    1  4     NA    NA     NA FALSE [0.3,0.5),...,[2,2.2) [y]
st_bbox(r)
## xmin ymin xmax ymax 
##  0.0  0.3  5.0  2.2
image(r, axes = TRUE, col = grey((1:20)/20))

If we leave out the last value, then stars may come up with a different cell boundary for the last cell, as this is now derived from the width of the last-but-one cell:

x = c(0, 0.5, 1, 2, 4)  # 5 numbers: offsets only!
y = c(0.3, 0.5, 1, 2)   # 4 numbers: offsets only!
(r = st_as_stars(list(m = m), dimensions = st_dimensions(x = x, y = y)))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##        m        
##  Min.   : 1.00  
##  1st Qu.: 5.75  
##  Median :10.50  
##  Mean   :10.50  
##  3rd Qu.:15.25  
##  Max.   :20.00  
## dimension(s):
##   from to offset delta refsys point              values    
## x    1  5     NA    NA     NA FALSE   [0,0.5),...,[4,6) [x]
## y    1  4     NA    NA     NA FALSE [0.3,0.5),...,[2,3) [y]
st_bbox(r)
## xmin ymin xmax ymax 
##  0.0  0.3  6.0  3.0
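The upper boundaries 6 and 3 in this output can be reproduced in base R. The helper complete_boundaries is hypothetical (not a stars function) and only mimics the rule described above, namely that the last cell gets the width of the last-but-one cell:

```r
# Hypothetical helper (not a stars function): append an upper boundary for the
# last cell, giving it the same width as the last-but-one cell.
complete_boundaries = function(b) {
  n = length(b)
  c(b, b[n] + (b[n] - b[n - 1]))
}

x = c(0, 0.5, 1, 2, 4)  # 5 cell start values
y = c(0.3, 0.5, 1, 2)   # 4 cell start values

xb = complete_boundaries(x)  # appended boundary: 4 + (4 - 2)
yb = complete_boundaries(y)  # appended boundary: 2 + (2 - 1)

print(range(xb))
print(range(yb))
```

The resulting ranges agree with the bounding box printed above, and differ from the upper boundaries 5 and 2.2 that were given explicitly in the first example.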

This is not problematic if cells have a constant width, in which case the boundaries are reduced to an offset and delta value, irrespective of whether an upper boundary is given:

x = c(0, 1, 2, 3, 4)  # 5 numbers: offsets only!
y = c(0.5, 1, 1.5, 2)   # 4 numbers: offsets only!
(r = st_as_stars(list(m = m), dimensions = st_dimensions(x = x, y = y)))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##        m        
##  Min.   : 1.00  
##  1st Qu.: 5.75  
##  Median :10.50  
##  Mean   :10.50  
##  3rd Qu.:15.25  
##  Max.   :20.00  
## dimension(s):
##   from to offset delta refsys point values    
## x    1  5      0     1     NA FALSE   NULL [x]
## y    1  4    0.5   0.5     NA FALSE   NULL [y]
st_bbox(r)
## xmin ymin xmax ymax 
##  0.0  0.5  5.0  2.5

Alternatively, one can also set the cell midpoints by specifying the argument cell_midpoints in the st_dimensions call:

x = st_as_stars(matrix(1:9, 3, 3), 
                st_dimensions(x = c(1, 2, 3), y = c(2, 3, 10), cell_midpoints = TRUE))

When the dimension is regular, this results in offset being shifted back by half a delta, or else in intervals derived from the distances between cell centers. This should obviously not be done when cell boundaries are specified.
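As a worked example of the regular case, take the x values \(1, 2, 3\) from the call above as cell midpoints, so that \(d_x = 1\). Shifting the first midpoint back by half a delta gives

\[o_x = 1 - \tfrac{1}{2} d_x = 0.5\]

and the cell boundaries become \(0.5, 1.5, 2.5, 3.5\), with each midpoint in the centre of its cell. The y values \(2, 3, 10\) are not equally spaced, so they fall under the second case, where intervals are derived from the distances between cell centers.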

Curvilinear grids

Curvilinear grids are grids whose grid lines are not straight. Rather than describing the curvature parametrically, the typical (HDF5 or NetCDF) files in which they are found have two raster layers, one with the latitudes and one with the longitudes for every grid cell. Using the GDAL interface, we need to compose a stars object from the array and a long and lat matrix. The matrices with longitude and latitude values are stored in the values field of their respective dimensions.

As an example, we will use a Sentinel 5P dataset available from package starsdata; this package can be installed with

install.packages("starsdata", repos = "http://pebesma.staff.ifgi.de", type = "source") 

The dataset is found here:

(s5p = system.file("sentinel5p/S5P_NRTI_L2__NO2____20180717T120113_20180717T120613_03932_01_010002_20180717T125231.nc", package = "starsdata"))
## [1] "/home/edzer/R/x86_64-pc-linux-gnu-library/3.6/starsdata/sentinel5p/S5P_NRTI_L2__NO2____20180717T120113_20180717T120613_03932_01_010002_20180717T125231.nc"

We can construct the curvilinear stars raster by calling read_stars on the data array, passing the longitude and latitude arrays through the curvilinear argument:

nit.c = read_stars(s5p, sub = "//PRODUCT/SUPPORT_DATA/DETAILED_RESULTS/nitrogendioxide_summed_total_column",
    curvilinear = c("//PRODUCT/longitude", "//PRODUCT/latitude"), driver = NULL)
## Warning in CPL_read_gdal(as.character(x), as.character(options),
## as.character(driver), : GDAL Message 1: The dataset has several variables
## that could be identified as vector fields, but not all share the same primary
## dimension. Consequently they will be ignored.

## /PRODUCT/longitude,
## /PRODUCT/latitude,
## /PRODUCT/SUPPORT_DATA/DETAILED_RESULTS/nitrogendioxide_summed_total_column,
## Warning in CPL_read_gdal(as.character(x), as.character(options),
## as.character(driver), : GDAL Message 1: dimension #2 (ground_pixel) is not a
## Longitude/X dimension.
## Warning in CPL_read_gdal(as.character(x), as.character(options),
## as.character(driver), : GDAL Message 1: dimension #1 (scanline) is not a
## Latitude/Y dimension.
## Warning in CPL_read_gdal(as.character(x), as.character(options),
## as.character(driver), : GDAL Message 1: No 1D variable is indexed by dimension
## time
if (inherits(nit.c[[1]], "units")) {
    threshold = units::set_units(9e+36, mol/m^2)
} else {
    threshold = 9e+36
}
nit.c[[1]][nit.c[[1]] > threshold] = NA
st_crs(nit.c) = 4326
nit.c
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##  nitrogendioxide_summed_total_column [mol/m^2]
##  Min.   :0e+00                                
##  1st Qu.:1e-04                                
##  Median :1e-04                                
##  Mean   :1e-04                                
##  3rd Qu.:1e-04                                
##  Max.   :5e-04                                
##  NA's   :330                                  
## dimension(s):
##      from  to offset delta refsys point                                 values
## x       1 450     NA    NA WGS 84    NA [450x278] -5.81066 [°],...,30.9468 [°]
## y       1 278     NA    NA WGS 84    NA  [450x278] 28.3605 [°],...,51.4686 [°]
## time    1   1     NA    NA     NA    NA                                   NULL
##         
## x    [x]
## y    [y]
## time    
## curvilinear grid
plot(nit.c, breaks = "equal", reset = FALSE, axes = TRUE, as_points = TRUE, 
         pch = 16,  logz = TRUE, key.length = 1)
maps::map('world', add = TRUE, col = 'red')

plot(nit.c, breaks = "equal", reset = FALSE, axes = TRUE, as_points = FALSE, 
         border = NA, logz = TRUE, key.length = 1)
maps::map('world', add = TRUE, col = 'red')

We can downsample the data by

(nit.c = stars:::st_downsample(nit.c, 8))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##  nitrogendioxide_summed_total_column [mol/m^2]
##  Min.   :0.00005                              
##  1st Qu.:0.00008                              
##  Median :0.00008                              
##  Mean   :0.00009                              
##  3rd Qu.:0.00009                              
##  Max.   :0.00023                              
##  NA's   :37                                   
## dimension(s):
##      from to offset delta refsys point                               values    
## x       1 57     NA    NA WGS 84    NA [57x35] -5.81066 [°],...,30.8108 [°] [x]
## y       1 35     NA    NA WGS 84    NA  [57x35] 28.6622 [°],...,51.4679 [°] [y]
## time    1  1     NA    NA     NA    NA                                 NULL    
## curvilinear grid
plot(nit.c, breaks = "equal", reset = FALSE, axes = TRUE, as_points = TRUE, 
         pch = 16, logz = TRUE, key.length = 1)
maps::map('world', add = TRUE, col = 'red')

which doesn’t look nice, but plotting the cells as polygons looks better:

plot(nit.c, breaks = "equal", reset = FALSE, axes = TRUE, as_points = FALSE, 
         border = NA, logz = TRUE, key.length = 1)
maps::map('world', add = TRUE, col = 'red')
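The dimension sizes after downsampling (57 by 35, down from 450 by 278) are consistent with keeping every 8th index along x and y, starting at the first. The base R sketch below only checks this arithmetic; that the downsampling works this way is an assumption inferred from the printed output, and other versions of stars may interpret the rate differently:

```r
# Assumption inferred from the output above: keep every 8th index,
# starting at index 1, along each of the two raster dimensions.
keep_x = seq(1, 450, by = 8)
keep_y = seq(1, 278, by = 8)

print(length(keep_x))
print(length(keep_y))

# the last retained x index is not the last original index
print(max(keep_x))
```

That the last retained index falls short of 450 would also explain why the last longitude value in the downsampled dimension table differs slightly from the one in the full grid.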