starschemar Transformation Operations

Jose Samos (jsamos@ugr.es)

2020-09-24

Introduction

Package starschemar offers operations to transform flat tables into star schemas and also to export or exploit them through queries.

  1. From a flat table, we define a dimensional model classifying its attributes as facts or dimensions (dimensional modelling).

  2. From a flat table and a dimensional model we obtain a star schema that we can transform; from various star schemas we can define a constellation (star schema and constellation definition).

  3. Dimensions contain rows without duplicates, we can apply operations to perform data cleaning and to conform them (cleaning and conforming data).

  4. When new data is obtained, it is necessary to refresh the existing data with them by means of incremental refresh operations (incremental refresh).

  5. The results obtained can be exported to be consulted with other tools (exporting results).

  6. Finally, basic queries can be performed from R (query functions), especially to select the data to export.

In this document, these operations available in the package are briefly presented.

Dimensional modelling

Starting from a flat table, a dimensional model is defined specifying the attributes that make up each of the dimensions and the measurements in the facts. The result is a dimensional_model object. It is carried out through the following functions:

dm <- dimensional_model()
dm <- dimensional_model() |>
  define_dimension(name = "When",
                   attributes = c("Week Ending Date",
                                  "WEEK",
                                  "Year"))
dm <- dimensional_model() |>
  define_fact(
    name = "mrs_age",
    measures = c("Deaths"),
    agg_functions = c("SUM"),
    nrow_agg = "nrow_agg"
  )

dm <- dimensional_model() |>
  define_fact(name = "Factless fact")

Star schema and constellation definition

A dimensional model is implemented using a star schema. We can have several related star schemas through common dimensions that together form a fact constellation.

Star schema definition

A star schema is defined from a flat table and a dimensional model definition. Once defined, a star schema can be transformed by defining role playing dimensions, changing the writing style of element names or the type of dimension attributes. These operations are carried out through the following functions:

st <- star_schema(mrs_age, dm_mrs_age)
st <- star_schema(mrs_age, dm_mrs_age) |>
  role_playing_dimension(
    dim_names = c("when", "when_available"),
    name = "When Common",
    attributes = c("Date", "Week", "Year")
  )
st <- star_schema(mrs_age, dm_mrs_age) |>
  snake_case()
st <- star_schema(mrs_age, dm_mrs_age) |>
  character_dimensions()

Star schema rename

Once a star schema is defined, we can rename its elements. It is necessary to be able to rename attributes of dimensions and measures of facts because the definition operations only allowed us to select columns of a flat table. For completeness also dimensions and facts can be renamed. To carry out these operations, the following functions are available:

st <- st_mrs_age |>
  rename_dimension(name = "when", new_name = "when_happened")
attribute_names <- 
  st_mrs_age |> get_dimension_attribute_names("when")
st <-
  st_mrs_age |> rename_dimension_attributes(
    name = "when",
    attributes = c("when_happened_week", "when_happened_year"),
    new_names = c("week", "year")
  )
st <- st_mrs_age |> rename_fact("age") 
measure_names <- 
  st_mrs_age |> get_measure_names()
st <-
  st_mrs_age |> rename_measures(measures = c("n_deaths"),
                                 new_names = c("num_deaths"))

Constellation definition

Based on various star schemas, a constellation can be defined in which star schemas share common dimensions. Dimensions with the same name must be shared. It is defined by the following function:

ct <- constellation(list(st_mrs_age, st_mrs_cause), name = "mrs")

Cleaning and conforming data

Once star schemas and fact constellations are defined, data cleaning operations can be carried out on dimensions. There are three groups of functions:

  1. One to obtain components of star schemas and constellations.

  2. Another to define data cleaning operations over dimensions.

  3. One more to apply operations to star schemas or constellations.

Obtaining components

We can obtain dimensions from a star schema or conformed dimensions from a fact constellation. Available functions in both cases are similar.

Star schema

  • get_dimension_names(): Get the names of the dimensions of a star schema. Role playing dimensions are not considered. Example:
dn <- st_mrs_age |>
  get_dimension_names()
  • get_dimension(): Get a dimension of a star schema given its name. Role dimensions can be obtained but not role playing dimensions. Example:
where <- st_mrs_age |>
  get_dimension("where")

Constellation

  • get_conformed_dimension_names(): Get the names of the conformed dimensions of a constellation. Example:
dn <- ct_mrs |>
  get_conformed_dimension_names()
  • get_conformed_dimension(): Get a conformed dimension of a constellation given its name. Example:
when <- ct_mrs |>
  get_conformed_dimension("when")
  • get_star_schema_names(): Get the names of the star schemas of a constellation. Example:
stn <- ct_mrs |>
  get_star_schema_names()
  • get_star_schema(): Get a star schema of a constellation given its name. Example:
age <- ct_mrs |>
  get_star_schema("mrs_age")

Definition of updates

Modifications are defined on dimension rows in various ways based exclusively on the values of the dimension fields. Although the surrogate key intervenes in the definition, the result, internally, does not depend on it so that it can be applied more generally in other star schemas.

updates <- record_update_set()
updates <- record_update_set() |>
  match_records(dimension = where,
                old = 1,
                new = 2)
updates <- record_update_set() |>
  update_record(
    dimension = who,
    old = 1,
    values = c("1: <1 year")
  )
updates <- record_update_set() |>
  update_selection(
    dimension = where,
    columns = c("city"),
    old_values = c("Bridgepor"),
    new_values = c("Bridgeport")
  )
updates <- record_update_set() |>
  update_selection_general(
    dimension = where,
    columns_old = c("state", "city"),
    old_values = c("CT", "Bridgepor"),
    columns_new = c("city"),
    new_values = c("Bridgeport")
  )

Updates application

Defined updates can be applied on a star schema or on the conformed dimension of a fact constellation.

Star schema

  • modify_dimension_records(): Given a list of dimension record update operations, they are applied on the dimensions of the star_schema object. Update operations must be defined with the set of functions available for that purpose. Example:
st <- st_mrs_age |>
  modify_dimension_records(updates_st_mrs_age)

Constellation

  • modify_conformed_dimension_records(): Given a list of dimension record update operations, they are applied on the conformed dimensions of the constellation object. Update operations must be defined with the set of functions available for that purpose. Example:
ct <- ct_mrs |>
  modify_conformed_dimension_records(updates_st_mrs_age)

Dimension enrichment

To enrich a dimension with new attributes related to others already included in it, first, we export the attributes on which the new ones depend, then we define the new attributes, and import the table with all the attributes to be added to the dimension.

tb <-
  enrich_dimension_export(st_mrs_age,
                          name = "when_common",
                          attributes = c("week", "year"))
tb <-
  enrich_dimension_export(st_mrs_age,
                          name = "when_common",
                          attributes = c("week", "year"))

# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")

st <- enrich_dimension_import(st_mrs_age, name = "when_common", tb)
tb <-
  enrich_dimension_export(st_mrs_age,
                          name = "when_common",
                          attributes = c("week", "year"))

# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")[-1, ]

tb2 <- enrich_dimension_import_test(st_mrs_age, name = "when_common", tb)

Incremental refresh

When new data is obtained, an incremental refresh of the data can be carried out, both of the dimensions and of the facts. Incremental refresh can be applied to both star schema and fact constellation, using the following functions.

Star schema

st <- st_mrs_age |>
  incremental_refresh_star_schema(st_mrs_age_w10, existing = "replace")

Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that has stopped being analysed but it can also be for other reasons. This data can be selected using the following function:

st <- st_mrs_age |>
  filter_fact_rows(name = "when", when_happened_week <= "03") |>
  filter_fact_rows(name = "where", city == "Bridgeport")

st2 <- st_mrs_age |>
  incremental_refresh_star_schema(st, existing = "delete")

Once the fact data is removed (using the other incremental refresh functions), we can remove the data for the dimensions that are no longer needed using the following function:

st3 <- st2 |>
  purge_dimensions_star_schema()

Constellation

ct <- ct_mrs |>
  incremental_refresh_constellation(st_mrs_age_w10, existing = "replace")
ct <- ct_mrs |>
  purge_dimensions_constellation()

Exporting results

Once the data has been properly structured and transformed, it can be exported to be consulted with other tools or with R. Various export formats have been defined, both for star schemas and for constellations, using the following functions.

Star schema

ft <- st_mrs_age |>
  star_schema_as_flat_table()
ms <- st_mrs_age |>
  star_schema_as_multistar()
tl <- st_mrs_age |>
  star_schema_as_tibble_list(include_role_playing = TRUE)

Constellation

ms <- ct_mrs |>
  constellation_as_multistar()
tl <- ct_mrs |>
  constellation_as_tibble_list(include_role_playing = TRUE)

multistar

ft <- ms_mrs |>
  multistar_as_flat_table(fact = "mrs_age")

Query functions

There are many multidimensional query tools available. The exported data, once stored in files, can be used directly from them. You can also perform basic queries from R on data in the multistar format, mainly for selecting the data to export, using the following functions:

ms_mrs <- ct_mrs |>
  constellation_as_multistar()

dq <- dimensional_query(ms_mrs)
dq <- dimensional_query(ms_mrs) |>
  select_fact(
    name = "mrs_age",
    measures = c("n_deaths"),
    agg_functions = c("MAX")
  )

dq <- dimensional_query(ms_mrs) |>
  select_fact(name = "mrs_age",
              measures = c("n_deaths"))

dq <- dimensional_query(ms_mrs) |>
  select_fact(name = "mrs_age")
dq <- dimensional_query(ms_mrs) |>
  select_dimension(name = "where",
                   attributes = c("city", "state")) |>
  select_dimension(name = "when")
dq <- dimensional_query(ms_mrs) |>
  filter_dimension(name = "when", when_happened_week <= "03") |>
  filter_dimension(name = "where", city == "Boston")
ms <- dimensional_query(ms_mrs) |>
  select_dimension(name = "where",
                   attributes = c("city", "state")) |>
  select_dimension(name = "when",
                   attributes = c("when_happened_year")) |>
  select_fact(
    name = "mrs_age",
    measures = c("n_deaths")
  ) |>
  select_fact(
    name = "mrs_cause",
    measures = c("pneumonia_and_influenza_deaths", "other_deaths")
  ) |>
  filter_dimension(name = "when", when_happened_week <= "03") |>
  filter_dimension(name = "where", city == "Boston") |>
  run_query()

Conclusions

starschemar package offers a set of operations that allow us to transform flat tables into star schemas. Star schemas support the definition of role playing and role dimensions. Additional transformation operations can be applied to each star schema to adapt the format of the data. From several star schemas you can define fact constellation with conformed dimensions.

Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation. In this document the available operations have been briefly presented.