RWDataPlyr Workflow

Alan Butlerr

2020-04-16

The RWDataPlyr package provides an interface to read, aggregate, and summarize data from one or more RiverWareTM simulations and then work with those data in a dplyr pipeline. As RiverWare (and RiverSMART) produce data in multiple formats, and these data can be combined using RiverSMART in multiple ways, there are several different workflows that may make sense depending on the goals of the analysis. This vignette provides details on the different workflows that are covered by v0.6.0 of RWDataPlyr.

As of version 0.6.0, the package assumes the following:

Note that only the summary and aggregation functions include these implicit assumptions, i.e., reading rdfs should work regardless of timestep size, start timestep, and end timestep. However, only data conforming to these standards has been tested.

Background

RiverWareTM is a generalized river system modeling tool that represents physical processes and basin features, while allowing for complex reservoir operation policies to be expressed in rule-based logic. RiverWare is widely used across the U.S. by the Bureau of Reclamation, Tennessee Valley Authority, and U.S. Army Corps to assist their water management missions, and has applications worldwide, e.g., the Nile Basin.

One of the strengths of RiverWare is its “multiple run” mode, where many unique realizations, or traces, of hypothetical operations can be simulated with different input assumptions, e.g., hydrology. Multiple traces can be grouped together to form a single scenario, and it is often desirable to compare multiple scenarios together to understand how changes in assumptions affect model results. The RiverSMART tool is a wrapper to RiverWare (and other tools) that helps in the simulation and analysis of multiple scenarios.

RiverWare stores important data in “slots” on “objects” in the RiverWare model file. Any subset of these data can be output and saved in ASCII text files, known as RiverWare data files, with a .rdf extension; csv files; and/or NetCDF files. When multiple scenarios are run using RiverSMART, these output files are saved in unique scenario folders. RiverSMART can also combine csv output from multiple scenarios into a single csv file to expedite the analysis. This setup is summarized in figure 1.

RiverWare and RiverSMART output flow chart. Part A shows a single RiverWare run; part B shows how RiverSMART wraps individual RiverSMART runs to produce output, with those output becoming input to RiverSMART for post-processing. The color of the file extensions correspond to the color of the functions that read those file types; yellow indicates that reading NetCDF files is not yet supported by RWDataPlyr.

Figure 1: RiverWare and RiverSMART output flow chart. Part A shows a single RiverWare run; part B shows how RiverSMART wraps individual RiverSMART runs to produce output, with those output becoming input to RiverSMART for post-processing. The color of the file extensions correspond to the color of the functions that read those file types; yellow indicates that reading NetCDF files is not yet supported by RWDataPlyr.

When using RiverSMART to simulate multiple scenarios, each with hundreds or thousands of traces of data, it is increasingly easy to generate large amounts of data from RiverWare simulations. This package provides a tool to read these data into R, manipulate and summarize the data, and aggregate multiple scenarios’ data together.

As there are multiple ways RiverWare and RiverSMART can generate the data, there are also different workflows that may expedite the analysis process. This vignette documents these different workflows.

RWDataPlyr Workflows

RiverWare and RiverSMART can create data (rdf, csv, nc) for one or more scenarios, and depending on the analysis different workflows will provide advantages to the user. There are currently three different workflows supported by RWDataPlyr that are presented in the following sections.

  1. Reading and manipulating a single scenario
    • Fast
    • Best for inspecting a single slot
    • If comparing scenarios, must manually repeat for each scenario
    • Relies on read_rdf() and read_rw_csv()
  2. Summarizing multiple slots of data from a single scenario
    • Repeatable; allows user to process many slots at once
    • Best for producing “polished” analyses of a single scenario
    • Relies on rdf_aggregate() and user specified rwd_agg object
  3. Aggregating and summarizing many scenarios
    • Repeatable; allows user to process many slots for many scenarios at once
    • Repeats summary of a single scenario on multiple scenarios and combines results together
    • Relies on rw_scen_aggregate() and user specified rwd_agg object

As of v0.6.0, RWDataPlyr can only read rdf and csv files, and the summary and aggregation can only be performed on rdf files.

Reading and Manipulating a Single Scenario

Both rdf and csv files can be read into R using read_rdf() and read_rw_csv().

read_rdf() returns an rdf object; to be useful, the rdf object must be converted to a tbl_df using rdf_to_rwtbl2() or individual slot data can be obtained using rdf_get_slot(). rdf_get_slot() creates a timestep by traces matrix for a single slot. Several useful summary functions are available to summarize a single slot: rwslot_annual_sum(), rwslot_annual_min(), rwslot_annual_max(). A typical workflow to obtain the minimum annual Mead pool elevation from the included sample data might be:

read_rw_csv() returns a tbl_df that can easily be manipulated using the dplyr framework. There are no analogous functions to rdf_get_slot() or rwslot_annual_*(), but dplyr::filter() and dplyr::summarise() work well instead. The following will obtain the same minimum annual Mead pool elevation data from the example csv file with the added benefit of having years and trace number (note that the output is converted to a data.frame to avoid the rounding in the current tibble print method):

The fields that are stored in csv files are controlled by the user in RiverWare. Depending on which fields the user specifies in RiverWare and the keep_cols options provided by the user, read_rw_csv(x) is approximately equal to rdf_to_rwtbl2(read_rdf(y), keep_cols = TRUE) if x is a csv file and y is an rdf file that contain the same RiverWare slots.

Summarizing Multiple Slots of Data From a Single Scenario

To summarize and process multiple slots of data from a single scenario at once, the user can define a RiverWare data aggregator object (rwd_agg) and provide that to rdf_aggregate(), which returns a tbl_df of the summarized data. Slots can be from different rdf files, but it is up to the user to know which slots come from which rdf file.

As of v0.6.0, the aggregation and summarization will only work for rdf files.

rwd_agg

rwd_agg objects are data.frames with specific column names and expected values in those columns. The expected column names are file, slot, period, summary, eval, t_s, and variable in that order. It is up to the user to specify which rdf file contains each slot. In a general case, the user specifies the slot that is found in a specific rdf file (file). A summary function is applied to a subset period of the slot data, and then compared (eval) to a threshold (t_s) and saved as the variable.

rwd_agg objects can be specified in three ways:

  1. By providing a data.frame with the expected column names (see below).
  2. By providing a vector of rdf file names using the rdfs parameter. If specified in this manor, all of the slots in each rdf file will be read into a rwtbl, but will not be aggregated/summarized.
  3. By reading in a csv file, with the expected column names (see below), using read_rdf_agg(). The user may first use rdf_agg_template() to create a blank template, to help ensure it has the correct column names.

In order to specify how each slot should be aggregated, each column should include specific keywords, which are described below.

  • file: specifies the rdf file that contains the slot.
  • slot: the full RiverWare slot name, i.e., “Object.Slot”.
  • period: the period that the slot should be summarized over. This should either be a function name, a full month name (found in month.name), or the keyword “asis”.
    • function name: Specifying a function name allows for pre-specified or custom functions to group together several months in the period. This package provides the following functions: cy(), wy(), eocy(), and eowy(). cy() indicates the data will be summarized over the calendar year, i.e., January - December, while wy() summarizes over the water year, i.e., October - September. eocy() selects the end of the calendar year, and eowy() selects the end of the water year. When specified in the slot_agg object, leave off the parenthesis, i.e., only specify the function name. If wy() is specified, the function will remove data for any water years that have less than 7 months of data. This “tolerance” is specified by the rwdataplyr.wy_month_tol option, and can be modified by updating this option to another number. For standard monthly data that starts in January and ends in December, this results in keeping the first water year, since it includes 9 months of data, and removing the final water year, since it includes only three months of data. Setting this option to 0 will result in keeping any water year data that has at least one month of data; setting this option to 11, ensures that there must be a full water year of data for that year to be kept.

      This can also be a user specified custom function; see the Custom Period Functions section for details on constructing the custom functions.
    • full month name: When the full month name is specified, data will be filtered to only include data for that particular month. To select multiple months of data, use a function as described above. If the month specified is not found in month.name, an error will occur.
    • asis: If the keyword “asis” is specified, the data is returned for its native timestep, i.e, monthly data will return monthly data and annual data will return annual.
  • summary: the summary function that should be applied to the period specified as a function name, or NA. If the period specified is “asis” or returns only one month, e.g., eocy(), then the summary should be NA. The summary function should only return one value; for that reason, most of the Summary S4groupGenerics work. Notably, range() will not since it returns two values. There is no reason that a custom function will not work here, but it has not been tested.
  • eval: the comparison operator to use (see Compare S4groupGenerics). If no comparison is desired, then NA should be used. If eval is specified the value returned from applying the summary to the period will be compared to the threshold specified by t_s. The results of the comparison are returned as 0 and 1 instead of TRUE and FALSE.
  • t_s: either the threshold to be compared to if eval is not NA or a value to scale the result by, e.g,. 0.001 to convert from acre-ft to thousand acre-ft. NA can also be specified to not scale the data.
  • variable: the variable name that will be used to identify the results of applying the period, summary, comparison/scaling to. All variable names should be unique.

Custom Period Functions

The period in a slot_agg object that a slot will be summarized over can be a function that is provided by RWDataPlyr, e.g., wy() for summarizing over a water year, or it can be a user defined custom function to allow for more complex time aggregations. The custom function should return a list with three elements:

  • fun - a function that will modify a rwtbl and properly determine the new Year based on the custom period. Necessary for more complex periods that might span different calendar years, e.g., water years. This function should modify the Year column so that months that should be grouped together have the same year value.
  • filter_months - the months that should be grouped together.
  • group_tbl - how to group the returned rwtbl; likely either c("Year") or c("Year", "Month")

For example, a user might want to aggregate all summer months together (June - August). A relatively simple custom function for this would be:

The fun in the summer example simply returns the original data, because it does not need to provide any complex logic for grouping different years together. A more complex example would be summarizing all winter months together (December - February). In this example December from year 0 should be grouped with January and February of year 1 for the winter of year 1 summary. The djf() custom function would be:

In this example, fun now includes a function that takes in the original data (and expects a Timestep column). The time step is converted to a yearmon object, and then the years are re-assigned based on their water year (since December - February are in the same water year).

rdf_aggregate() Example

To summarize multiple slots at once, the first step is to define the rwd_agg object. In this example, the following summaries are desired:

  • a flag of each occurrence that Mead’s minimum annual elevation is below elevation 1,060’
  • the end-of-calendar year Mead elevation
  • the winter release from Powell (sum of December - February) in thousands acre-ft (this uses our custom period function defined in the previous section)
  • the July release from Powell
  • the water year total release from Powell in million acre-ft
  • the shortage flag (already annual no summary required)

For this, we want to create the following table:

file slot period summary eval t_s variable
KeySlots.rdf Mead.Pool Elevation cy min <= 1,060 meadLt1060
KeySlots.rdf Mead.Pool Elevation eocy NA NA NA meadEocy
KeySlots.rdf Powell.Outflow djf sum NA 0.001 powellDjfRel
KeySlots.rdf Powell.Outflow July NA NA NA powellJulRel
KeySlots.rdf Powell.Outflow wy sum NA 1e-06 powellWyRel
SystemConditions.rdf SummaryOutputData.LBShortageConditions asis NA NA NA short

Often times, it will be easier to define the table in Excel and then read it in as a csv using read_rwd_agg(x) but below shows an example of defining it directly in R.

After the rwd_agg is specified the object is passed to rdf_aggregate() along with a few parameters specifying where the data are stored, and a tbl_df is returned:

To keep all of the columns (attribute info), specify keep_cols = TRUE:

Aggregating and Summarizing Many Scenarios

rw_scen_aggregate() will aggregate and summarize multiple scenarios, essentially calling rdf_aggregate() for each scenario. Similar to rdf_aggregate() it relies on a user specified rwd_agg object to know how to summarize and process the scenarios.

rw_scen_aggregate() is setup to work with a directory structure typical to a RiverSMART study. RiverSMART manages the output of multiple scenarios by creating unique scenario folders in a “Scenario” directory. The scen_dir argument should point to this scenario directory in a typical study setup. For example, if the file system is setup as shown in figure 2, scen_dir should point to “C:/user/crss/CRSS.Jan2017/Scenario” (or the correct relative path to the same location).

Sample directory structure

Figure 2: Sample directory structure

rw_scen_aggregate() also needs to know the scenario folders and allows the user to specify scenario names that may differ from the folder name so they are easier to use in R. In this example, the scenarios argument would be c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002"). The scenario names can be specified either through the scen_names argument, or by naming the scenarios vector. The following will obtain the same results as defined by the earlier rwd_agg object for both of these scenarios.

To compare results from multiple studies, i.e., the different top level CRSS.JanNNNN folders in figure 2, scen_dir should point to “C:/user/crss” and scenarios should be assigned to c("CRSS.Jan2017/Scenario/ISM1988_2014,2007Dems,IG,Most", "CRSS.Jan2018/Scenario/ISM1988_2014,2007Dems,IG,Most").

If scenarios has a length of one, it will match the results of rdf_aggregate() if the scenario parameter is used and the same as the scen_names parameter in rw_scen_aggregate().

Legacy Workflow

getDataForAllScens() is the legacy function that preceded rw_scen_aggregate() and rdf_aggregate() in RWDataPlyr v0.5.0 and earlier. It works similarly to rw_scen_aggregate() and relies on a slot_agg_list, which is analogous to the rwd_agg object, though it only allows for certain pre-specified aggregation methods. The help pages include information on these functions, though rw_scen_aggregate() is superior and preferred as of v0.6.0.

getDataForAllScens() treats NaNs as 0s in its aggregation methods. The nans_are parameter defaults to "0" in rdf_aggregate() and rw_scen_aggregate() and thus matches the way getDataForAllScens() treats NaNs.

Utilizing RWDataPlyr within RiverSMART

RiverSMART includes a plugin that allows R scripts to be executed from RiverSMART for post-processing scenario(s). The exact configuration of the script and the RiverSMART R plugin is beyond the scope of this vignette, but a user can pass scenario folders and names to rw_scen_aggregate() from within RiverSMART, thus streamlining the post-processing steps.