An overview of features

As the introductory vignette shows, the R code needed to use eider consists of little more than a call to run_pipeline(). Most of the effort in using this library goes into defining the features themselves in JSON.

Features as JSON

Features are JSON objects: collections of key-value pairs enclosed in curly braces. Keys are always strings, and values can be strings, numbers, booleans, arrays, or objects themselves, as shown in the example object below. Conceptually, JSON objects are similar to named R lists.

{
  "key_1": "a string",
  "key_2": 1,
  "key_3": true,
  "key_4": [1, 2, 3],
  "key_5": {
    "nested_key_1": "a string",
    "nested_key_2": 1
  }
}

To be correctly parsed by eider, each feature must contain a specific set of keys, several of which are shared across all features.
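
As a minimal sketch, a complete feature definition might look like the following. The key names used here (source_table, transformation_type, grouping_column, absent_default_value, output_feature_name), as well as the table and column names, are assumptions made for illustration and should be checked against the package reference before use.

```json
{
  "source_table": "ae",
  "transformation_type": "count",
  "grouping_column": "id",
  "absent_default_value": 0,
  "output_feature_name": "total_ae_attendances"
}
```

Under these assumptions, the object names the table to read, the calculation to perform, the column holding the IDs, the value assigned to IDs that do not appear in the table, and the name of the resulting feature.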

Transformation types

The available transformation types can be split into a few groups:

Counting

Transformation types: "count", "present"

The two simplest features are "count", which counts the number of occurrences of each ID in the dataset, and "present", which outputs 1 if the ID was found in the dataset and 0 if not.
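
To illustrate the difference between the two, a "present" feature could be sketched as below. The key names and the table and column names are assumed for illustration rather than taken from the package reference.

```json
{
  "source_table": "ltc",
  "transformation_type": "present",
  "grouping_column": "id",
  "absent_default_value": 0,
  "output_feature_name": "has_ltc_record"
}
```

Changing the transformation type in this sketch to "count" would give the number of rows per ID instead of a 0/1 indicator.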

Examples of the "count" feature type are provided in A&E features 1 and 2, as well as SMR04 feature 1.

The "present" feature type is showcased in A&E feature 3, as well as LTC features 2 and 3.

Summaries

Transformation types: "sum", "nunique", "mean", "median", "sd", "first", "last", "min", "max"

As these features act on the values in a specific column, they require a single extra key that names this column. The feature is calculated for each unique ID by aggregating the values in this column.
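
A summary feature could be sketched as follows, assuming that the extra key is called aggregation_column. This key name, the other key names, and the table and column names (pis, id, number_of_items) are all assumptions for illustration.

```json
{
  "source_table": "pis",
  "transformation_type": "sum",
  "grouping_column": "id",
  "aggregation_column": "number_of_items",
  "absent_default_value": 0,
  "output_feature_name": "total_items_prescribed"
}
```

In this sketch, the values of number_of_items would be summed within each ID; replacing "sum" with another summary type such as "max" or "nunique" would change only the aggregation applied.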

Example features with summary functions include all the PIS features, and also SMR04 features 2, 3, and 4. These cover the transformation types "nunique", "sum", and "max".

Time-based

Transformation types: "time_since"

The "time_since" transformation type calculates the period of time between a given date and the first (or last) date in the dataset for each ID. It requires a few more keys than the transformation types described above.

Examples of "time_since" features are given in A&E feature 4 and LTC feature 1.
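
As a rough sketch, a "time_since" feature might be written as below. The extra keys shown here (date_column, cutoff_date, from_first, time_units), their value formats, and the table and column names are assumptions for illustration; the worked examples referenced above are the authoritative source.

```json
{
  "source_table": "ae",
  "transformation_type": "time_since",
  "grouping_column": "id",
  "date_column": "attendance_date",
  "cutoff_date": "2021-03-25",
  "from_first": false,
  "time_units": "days",
  "absent_default_value": 0,
  "output_feature_name": "days_since_last_ae_attendance"
}
```

Under these assumptions, the feature would measure the number of days between each ID's last attendance date and the given date; setting from_first to true would measure from the first date instead.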

Combination features

Transformation types: "combine_linear", "combine_min", "combine_max"

Combination features combine the results of multiple features into a single feature. They have a slightly different structure to the rest: broadly speaking, these transformation types require a subfeature key, whose value is itself an object containing the features to be combined.
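
The nesting described above could be sketched as follows for "combine_max". Only the existence of the subfeature key and the transformation type names come from the text; the labels ae_count and smr04_count, the other key names, and the set of keys each nested feature needs are assumptions for illustration.

```json
{
  "output_feature_name": "max_of_ae_and_smr04_counts",
  "transformation_type": "combine_max",
  "subfeature": {
    "ae_count": {
      "source_table": "ae",
      "transformation_type": "count",
      "grouping_column": "id",
      "absent_default_value": 0
    },
    "smr04_count": {
      "source_table": "smr04",
      "transformation_type": "count",
      "grouping_column": "id",
      "absent_default_value": 0
    }
  }
}
```

In this sketch, each nested object is an ordinary feature, and the outer feature would take the larger of the two results for each ID.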

Combination features are covered in a separate vignette.

Preprocessing and filtering

While the above may seem like a large number of possible calculations, on their own they offer no way of controlling which parts of the input data are to be considered.

In addition to the keys shown above, (non-combination) features may also contain the preprocess and filter keys, which transform the input table before the feature is calculated from it. Preprocessing modifies values within the table, whereas filtering leaves the values unchanged and only restricts the calculation to rows that pass a set of criteria.

Preprocessing is performed prior to filtering: thus, if both are specified, filtering is performed on the already-preprocessed values.

Both the preprocess and filter keys are themselves JSON objects, and are detailed respectively in the preprocessing and filtering vignettes.
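
To give a flavour of how such a key sits inside a feature, the sketch below adds a filter object to a "count" feature. The keys inside filter (column, type, value), the other key names, and the table, column, and code values are assumptions for illustration only; the filtering vignette defines the actual structure.

```json
{
  "source_table": "ae",
  "transformation_type": "count",
  "grouping_column": "id",
  "absent_default_value": 0,
  "output_feature_name": "ae_attendances_with_selected_diagnosis",
  "filter": {
    "column": "diagnosis",
    "type": "in",
    "value": [101, 102]
  }
}
```

Under these assumptions, only rows whose diagnosis is one of the listed values would be counted, and IDs with no such rows would receive the absent default value.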