Before using VWPre

Before using this package, a few things must already be in place. First, your eye gaze data must have been collected using an SR Research EyeLink eye tracker. Second, your data must have been exported using SR Research DataViewer software, with an interest period defined relative to the onset of the critical auditory stimulus. A Sample Report should be exported with all available columns; this ensures that all of the columns required by the functions in this package are present. Additionally, it is preferable to export to a .txt file rather than a .xlsx file.

The following procedure assumes that, in your experiment, interest area IDs and labels were assigned consistently to the object types (i.e., the target was always in interest area 1, the competitor was always in interest area 2, etc.). This is typically done by dynamically moving the interest areas trial-by-trial to correspond with the position of the objects.
If, instead, your interest areas were static and you have columns indicating the location of each object by trial, you will need to relabel your interest areas (sample code is available from the maintainer upon request). After that, you can follow the steps in this vignette. Additionally, the functions presented here can handle data with a maximum of 8 interest areas. If you have more than 8 interest areas, it is necessary to adjust the source code to accommodate the number needed.

Lastly, the functions included here internally make use of dplyr for manipulating and restructuring data. For more information about dplyr, please refer to its reference manual and extensive collection of vignettes.

Loading the package and the data

First, load the sample report. By default, DataViewer assigns "." to missing values; it is therefore important to include this in the na.strings parameter so that R knows how to handle any missing data.

library(VWPre)
VWdat <- read.table("1000HzData.txt", header = T, sep = "\t", na.strings = c(".", "NA"))

However, for the purposes of this vignette we will use the sample dataset included in the package.

data(VWdat)

Preparing the data

Verifying the data and making necessary columns

In order for the functions in the package to work appropriately, the data need to be in a specific format. The prep_data function first converts the data frame to a data table object. Second, it examines the class of specific columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, RIGHT_INTEREST_AREA_LABEL, TIMESTAMP, and TRIAL_INDEX) to ensure they are appropriately assigned (e.g., categorical variables are encoded as factors).

Additionally, the function looks for the column whose name is given in the Subject parameter. Typical DataViewer output contains a column called RECORDING_SESSION_LABEL, which holds the subject identifier. The function will rename it Subject and will ensure it is encoded as a factor.

If your data contain a column corresponding to an item identifier, please supply its name in the Item parameter. The function will rename this column Item and will ensure it is encoded as a factor. If you don't have an item identifier column, leave the parameter at its default value of NA.

Lastly, a new column called Event will be created, which indexes each unique series of samples corresponding to a combination of Subject and TRIAL_INDEX. This Event variable is needed internally for subsequent operations. Should you choose to define the Event variable differently (via the EventColumns parameter), you can override the default; do so cautiously, however, as this may affect how subsequent operations perform. Upon completion, the function prints a summary indicating the results.

dat0 <- prep_data(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid")

[1] "Step 1 of 9..."
[1] "RECORDING_SESSION_LABEL renamed to Subject."
[1] "Subject converted to factor."
[1] "Step 2 of 9..."
[1] "itemid renamed to Item."
[1] "Item converted to factor"
[1] "Step 3 of 9..."
[1] "LEFT_INTEREST_AREA_ID converted to numeric."
[1] "Step 4 of 9..."
[1] "RIGHT_INTEREST_AREA_ID already numeric."
[1] "Step 5 of 9..."
[1] "LEFT_INTEREST_AREA_LABEL converted to factor."
[1] "Step 6 of 9..."
[1] "RIGHT_INTEREST_AREA_LABEL already factor."
[1] "Step 7 of 9..."
[1] "TIMESTAMP already numeric."
[1] "Step 8 of 9..."
[1] "TRIAL_INDEX already numeric."
[1] "Step 9 of 9..."
[1] "Event variable created from Subject and TRIAL_INDEX"
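For illustration, the default Event variable is conceptually equivalent to the following sketch (prep_data creates the column internally; this is not its actual implementation):

# Sketch: one factor level per unique Subject-by-trial combination
dat0$Event <- interaction(dat0$Subject, dat0$TRIAL_INDEX, drop = TRUE)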

Remove unnecessary columns

At this point, it is safe to remove the columns which were output by DataViewer but are not needed by the functions in this package. Removing them will help the functions perform faster and will result in data that consume less disk space. This is done straightforwardly using dplyr::select, which can accommodate both column names and regular expressions for matching. If you are using the sample data set included in this package, this step is not necessary, as these columns have already been removed.

dat0 <- select(dat0, -starts_with("AVERAGE"), -starts_with("DATA_"), 
               -starts_with("HTARGET"), -starts_with("IP"), 
               -starts_with("LEFT_ACCELLERATION"), -starts_with("LEFT_GAZE"), 
               -starts_with("LEFT_IN_"), -starts_with("LEFT_PUPIL"), 
               -starts_with("LEFT_VELOCITY"), -starts_with("RESOLUTION"), 
               -starts_with("RIGHT_ACCELLERATION"), -starts_with("RIGHT_GAZE"), 
               -starts_with("RIGHT_IN_"), -starts_with("RIGHT_PUPIL"), 
               -starts_with("RIGHT_VELOCITY"), -starts_with("SAMPLE"), 
               -starts_with("TARGET_"), -starts_with("TRIAL_START"), 
               -starts_with("VIDEO"))

Relabel NA samples as outside any interest area

When the data were loaded, samples that fell outside of any interest area were labeled as NA. The relabel_na function examines the interest area columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, and RIGHT_INTEREST_AREA_LABEL) for cells containing NAs. It then assigns 0 to the ID columns and "Outside" to the LABEL columns, indicating eye gaze samples that fell outside the interest areas defined in the study. The number of interest areas you defined in your experiment should be supplied to the NoIA parameter.

dat1 <- relabel_na(data = dat0, NoIA = 4)

[1] "LEFT_INTEREST_AREA_LABEL: Number of levels DO NOT match NoIA."
[1] "RIGHT_INTEREST_AREA_LABEL: Number of levels match NoIA."
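For illustration, the relabelling amounts to something like the following sketch for one pair of columns (relabel_na handles all four columns internally; this is not its actual implementation):

# Sketch: samples outside all interest areas receive ID 0 and label "Outside"
dat0$RIGHT_INTEREST_AREA_ID[is.na(dat0$RIGHT_INTEREST_AREA_ID)] <- 0
levels(dat0$RIGHT_INTEREST_AREA_LABEL) <- c(levels(dat0$RIGHT_INTEREST_AREA_LABEL), "Outside")
dat0$RIGHT_INTEREST_AREA_LABEL[is.na(dat0$RIGHT_INTEREST_AREA_LABEL)] <- "Outside"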

Creating a meaningful time series

The function create_time_series creates a meaningful time series which can be used for visualizing and modeling the data. It is common to export a period of time prior to the onset of the stimulus as a baseline. In that case, an offset equal to the duration of the baseline period must be applied to the time series; this is specified in the Offset parameter. In the example below, the data were exported with a 100 ms pre-stimulus interval. The function creates a new column called Time.

dat2 <- create_time_series(data = dat1, Offset = 100)

The function check_time_series can be used to verify that a meaningful time series was created and that each Event begins at the same standardized time point relative to the stimulus.

check_time_series(data = dat2)

[1] -100
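Conceptually, the resulting Time column corresponds to the following sketch (illustration only, not the internal implementation): each sample's timestamp is made relative to the first sample of its event and then shifted back by the offset.

# Sketch: with a 100 ms baseline, the first sample of each event gets Time = -100
dat1 %>%
  group_by(Event) %>%
  mutate(Time = TIMESTAMP - min(TIMESTAMP) - 100)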

Selecting which eye to use

Depending on the design of the study, right, left, or both eyes may have been recorded during the experiment. DataViewer outputs gaze data by placing it in separate columns for each eye (LEFT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, RIGHT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_LABEL). However, it is preferable to have gaze data in a single set of columns, regardless of which eye was recorded during the experiment. The function select_recorded_eye provides the functionality for this purpose, returning two new columns (IA_ID and IA_LABEL).

The function select_recorded_eye requires the parameter Recording to be specified. This parameter tells the function which eye(s) were used to record the gaze data. It takes one of four possible strings: "LandR", "LorR", "L", or "R". "LandR" should be used when any participant had both eyes recorded; "LorR" when some participants had their left eye recorded and others their right; "L" when all participants had their left eye recorded; and "R" when all participants had their right eye recorded.

If in doubt, use the function check_eye_recording, which performs a quick check to see whether LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID contain data, and then suggests the appropriate Recording setting. When in complete doubt, use "LandR". The "LandR" setting requires an additional parameter (WhenLandR), which tells the function whether to select the right eye or the left eye when data exist for both.

check_eye_recording(data = dat2)

[1] "The dataset contains recordings for ONLY the right eye. Set the Recording parameter in select_recorded_eye() to 'R'."

After operating, the function prints a summary of the output. While check_eye_recording indicated that the parameter Recording should be set to "R", the example below sets the parameter to "LandR", which can act as a catch-all. Consequently, the summary shows that there were recordings in only the right eye.

dat3 <- select_recorded_eye(data = dat2, Recording = "LandR", WhenLandR = "Right")

[1] "Gaze data summary for 320 events:"
[1] "0 event(s) contained gaze data for both eyes, for which the Right eye has been selected."
[1] "The final data frame contains 319 event(s) using gaze data from the right eye."
[1] "The final data frame contains 0 event(s) using gaze data from the left eye."
[1] "The final data frame contains 1 event(s) with no samples falling within any interest area during the given time series."

Binning the data

In order to obtain proportion looks, it is necessary to bin the data: that is, to group samples into chunks of time, count the number of samples falling in each interest area, and calculate proportions based on those counts. For the function to do this correctly, it needs to know the sampling rate at which the eye gaze data were recorded. With EyeLink trackers, this is typically 250 Hz, 500 Hz, or 1000 Hz. If in doubt, use the function check_samplingrate to determine it. The sampling rate is then supplied as a parameter to the function bin_prop.

check_samplingrate(dat3)

[1] "Sampling rate(s) present in the data are: 1000 Hz."

Note that the check_samplingrate function returns a printed message indicating the sampling rate(s) present in the data. Optionally, it can return a new column called SamplingRate if the parameter ReturnData is set to TRUE. In the event that data were collected at different sampling rates, this column can be used to subset the dataset by sampling rate before proceeding to the next processing step.
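For example, if data at multiple rates were present, one could keep the column and subset with dplyr (the 1000 Hz cutoff below is purely illustrative):

# Return the SamplingRate column, then keep only the 1000 Hz data
dat3 <- check_samplingrate(dat3, ReturnData = TRUE)
dat3 <- filter(dat3, SamplingRate == 1000)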

The function bin_prop calculates the proportion of looks (samples) to each interest area within a given span of time (the bin size). In order to do this, it is necessary to supply the parameters BinSize and SamplingRate. BinSize is specified in milliseconds and represents the chunk of time within which to calculate the proportions.
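To make the arithmetic concrete, here is a conceptual sketch of what bin_prop computes for a single interest area, assuming 1000 Hz data and 20 ms bins (illustration only, not the internal implementation; the actual bin alignment may differ):

# Sketch: with 1000 Hz data, a 20 ms bin holds 20 samples; the proportion of
# looks to interest area 1 is the share of samples in the bin with IA_ID == 1
dat3 %>%
  mutate(Bin = floor(Time / 20) * 20) %>%
  group_by(Event, Bin) %>%
  summarise(IA_1_C = sum(IA_ID == 1),   # count of samples in IA 1
            IA_1_P = mean(IA_ID == 1))  # proportion of samples in IA 1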

Not all bin sizes will work for all sampling rates. If unsure which are appropriate for your current sampling rate, use the ds_options function. When provided with the current sampling rate in SamplingRate (see above), the function will return a printed summary of the bin size options and their corresponding downsampled rate.

ds_options(SamplingRate = 1000)

[1] "Bin size: 1 ms; Downsampled rate: 1000 Hz"
[1] "Bin size: 2 ms; Downsampled rate: 500 Hz"
[1] "Bin size: 4 ms; Downsampled rate: 250 Hz"
[1] "Bin size: 5 ms; Downsampled rate: 200 Hz"
[1] "Bin size: 8 ms; Downsampled rate: 125 Hz"
[1] "Bin size: 10 ms; Downsampled rate: 100 Hz"
[1] "Bin size: 20 ms; Downsampled rate: 50 Hz"
[1] "Bin size: 25 ms; Downsampled rate: 40 Hz"
[1] "Bin size: 40 ms; Downsampled rate: 25 Hz"
[1] "Bin size: 50 ms; Downsampled rate: 20 Hz"
[1] "Bin size: 100 ms; Downsampled rate: 10 Hz"

The SamplingRate parameter in bin_prop should be specified in Hertz (see check_samplingrate), representing the original sampling rate of the data, and BinSize should be specified in milliseconds (see ds_options), representing the span of time over which to calculate the proportion. The bin_prop function returns new columns corresponding to each interest area ID (e.g., IA_1_C, IA_1_P). The extension '_C' indicates the count of samples in the bin and the extension '_P' indicates the proportion.

dat4 <- bin_prop(dat3, NoIA = 4, BinSize = 20, SamplingRate = 1000)

[1] "Sampling rate OK. You're good to go!"

In performing the calculation, the function effectively downsamples the data. To verify this and to determine the new sampling rate, simply call the function check_samplingrate again.

check_samplingrate(dat4)

[1] "Sampling rate(s) present in the data are: 50 Hz."

Empirical logits

Proportions are inherently bounded between 0 and 1 and are therefore not suitable for many types of analysis. Logits provide a transformation resulting in an unbounded measure. However, the logit of a proportion of exactly 0 or 1 is negative or positive infinity; the empirical logit transformation therefore adds a constant to the calculation, thus avoiding plus/minus infinity.

In order to calculate empirical logits, it is necessary to know the number of samples in each bin. This depends on your original sampling rate and bin size: for example, 1000 Hz data (one sample per millisecond) binned into 20 ms bins yields 20 samples per bin. To determine this, use the function check_samples_per_bin.

check_samples_per_bin(dat4)

[1] "There are 20 samples in each bin."
[1] "One data point every 20 millisecond(s)"

The function transform_to_elogit transforms the proportions to empirical logits and also calculates a weight for each value. The weights estimate the variance in each bin (because the variance of the logit depends on the mean). This is particularly important for regression analyses, where the weights should be specified in the model call (e.g., weight = 1 / IA_1_wts).

These calculations are taken from: Barr, D. J. (2008). Analyzing 'visual world' eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59(4), 457-474. Note that by default the calculation uses a constant of 0.5 (as indicated by Barr); however, a different value can be used by specifying it in the Constant parameter.
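For reference, the calculation can be sketched as follows, for y looks out of N samples in a bin and a constant const (following Barr, 2008):

# Empirical logit and its weight for y looks out of N samples per bin
y <- 15; N <- 20; const <- 0.5
elog <- log((y + const) / (N - y + const))      # empirical logit
wt   <- 1 / (y + const) + 1 / (N - y + const)   # variance estimate used as weight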

dat5 <- transform_to_elogit(dat4, NoIA = 4, SamplesPerBin = 20)

Binomial data

Some researchers may prefer to perform a binomial analysis. Therefore, the function create_binomial uses the (previously calculated) sample counts to create a success/failure column for each interest area. This column is then suitable as a response variable in logistic regression, as sketched below.

dat5a <- create_binomial(data = dat4, NoIA = 4)
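A hedged sketch of how such a column might enter a model (the column name IA_1_Looks is an assumption for illustration; check names(dat5a) for the actual success/failure columns in your data):

# Hypothetical model call; IA_1_Looks is an assumed column name
mod <- glm(IA_1_Looks ~ Time, data = dat5a, family = binomial)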

Fasttrack function

For advanced users who have worked with the package functions before and are familiar with the required steps and output, there is a meta-function called fasttrack, which runs through the previous functions and outputs a data frame containing either empirical logit or binomial data. Note that using this function still requires the user to manually remove unneeded columns (see above). This meta-function takes as parameters all of the required arguments to the component functions. Again, it is recommended only for users who have previously worked with visual world data and with the functions contained in this package.

dat5b <- fasttrack(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid", 
    EventColumns = c("Subject", "TRIAL_INDEX"), NoIA = 4, Offset = 100, Recording = "LandR", 
    WhenLandR = "Right", BinSize = 20, SamplingRate = 1000,
    SamplesPerBin = 20, Constant = 0.5, Output = "ELogit")

[1] "Preparing data..."
[1] "Step 1 of 9..."
[1] "RECORDING_SESSION_LABEL renamed to Subject."
[1] "Subject converted to factor."
[1] "Step 2 of 9..."
[1] "itemid renamed to Item."
[1] "Item converted to factor"
[1] "Step 3 of 9..."
[1] "LEFT_INTEREST_AREA_ID converted to numeric."
[1] "Step 4 of 9..."
[1] "RIGHT_INTEREST_AREA_ID already numeric."
[1] "Step 5 of 9..."
[1] "LEFT_INTEREST_AREA_LABEL converted to factor."
[1] "Step 6 of 9..."
[1] "RIGHT_INTEREST_AREA_LABEL already factor."
[1] "Step 7 of 9..."
[1] "TIMESTAMP already numeric."
[1] "Step 8 of 9..."
[1] "TRIAL_INDEX already numeric."
[1] "Step 9 of 9..."
[1] "Event variable created from Subject and TRIAL_INDEX"
[1] "Relabelling outside of 4 interest areas..."
[1] "LEFT_INTEREST_AREA_LABEL: Number of levels DO NOT match NoIA."
[1] "RIGHT_INTEREST_AREA_LABEL: Number of levels match NoIA."
[1] "Creating time series with 100 ms offset..."
[1] -100
[1] "The dataset contains recordings for ONLY the right eye. Set the Recording parameter in select_recorded_eye() to 'R'."
[1] "Selecting recorded eye..."
[1] "Gaze data summary for 320 events:"
[1] "0 event(s) contained gaze data for both eyes, for which the Right eye has been selected."
[1] "The final data frame contains 319 event(s) using gaze data from the right eye."
[1] "The final data frame contains 0 event(s) using gaze data from the left eye."
[1] "The final data frame contains 1 event(s) with no samples falling within any interest area during the given time series."
[1] "Sampling rate(s) present in the data are: 1000 Hz."
[1] "Binning 1000 Hz data into 20 ms bins..."
[1] "Calculating proportions..."
[1] "Sampling rate OK. You're good to go!"
[1] "Sampling rate(s) present in the data are: 50 Hz."
[1] "There are 20 samples in each bin."
[1] "One data point every 20 millisecond(s)"
[1] "Preparing ELogit output..."

Renaming Interest Area Columns

Some users may wish to rename the interest area columns created by the functions to something more meaningful than the numeric coding scheme. To do so, use the function rename_columns. This will convert column names like IA_1_C and IA_2_P to IA_Target_C and IA_Rhyme_P, respectively. The operation is performed on all IA_ columns for up to 8 interest areas.

dat6 <- rename_columns(dat5, Labels = c(IA1="Target", IA2="Rhyme", 
                                       IA3="OnsetComp", IA4="Distractor")) 

[1] "Renaming 4 interest areas."

Plotting the data

It is often desirable to visualize the proportion (or empirical logit) data, either as a grand average or by condition. In some cases it is even necessary to visualize the trend in the data over a continuous predictor. The functions plot_avg and plot_avg_contour provide straightforward plotting options for such cases. These functions internally calculate the average(s) and plot the results. The plotting is powered by ggplot2, so further customization (plot titles, custom themes, etc.) is still possible. For more information about ggplot2, please refer to its reference manual and extensive documentation.

Grand average

Using the function plot_avg, it is possible to plot the grand average of the data by interest area. The parameter type specifies which type of plot to create: proportion or empirical logit. In IAColumns, list the column names of the interest area proportions (here we have used the default names) along with the desired labels.

plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000), 
    IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp", 
                  IA_4_P = "Distractor"),
    Condition1 = NA, Condition2 = NA, Cond1Labels = NA, Cond2Labels = NA,
    ErrorBar = TRUE, VWPreTheme = TRUE) 

To add a title to the plot, simply add ggtitle from ggplot2.

plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000), 
    IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp", 
                  IA_4_P = "Distractor"),
    Condition1 = NA, Condition2 = NA, Cond1Labels = NA, Cond2Labels = NA,
    ErrorBar = TRUE, VWPreTheme = TRUE) + ggtitle("Grand Average Plot")

To customize the appearance of a plot (e.g., font, size, color, margins), the VWPreTheme parameter can be set to FALSE, which reverts to the default theming in ggplot2. In doing so, the user can apply a custom theme to the plot. Detailed information about creating themes can be found in the ggplot2 theme vignette. For the purpose of illustration, the default ggplot2 theme has been applied below, with the axis text elements increased in size.

plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000), 
    IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp", 
                  IA_4_P = "Distractor"),
    Condition1 = NA, Condition2 = NA, Cond1Labels = NA, Cond2Labels = NA,
    ErrorBar = TRUE, VWPreTheme = FALSE) + theme(axis.text = element_text(size = 15))

Conditional averages

The function plot_avg can also be used to plot averages for different conditions, based on a factor variable in the data. If the labels of the factor levels in the data are not suitable for plotting, specify new labels using a list in Cond1Labels.

Specifying Condition1 will stack the plots.

plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000), 
    IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp", 
                  IA_4_P = "Distractor"), Condition1 = "talker", 
    Condition2 = NA, Cond1Labels = c(CH1 = "Chinese 1", CH10 = "Chinese 3", 
                                     CH9 = "Chinese 2", EN3 = "English 1"),
    Cond2Labels = NA, ErrorBar = TRUE, VWPreTheme = TRUE)

Alternatively, specifying just Condition2 will plot the same information, but align it horizontally.

plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000), 
    IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp", 
                  IA_4_P = "Distractor"), Condition1 = NA, 
    Condition2 = "talker", Cond1Labels = NA, Cond2Labels = c(CH1 = "Chinese 1", 
                                                             CH10 = "Chinese 3", 
                                                             CH9 = "Chinese 2", 
                                                             EN3 = "English 1"), 
    ErrorBar = TRUE, VWPreTheme = TRUE)

For a 2x2 design, it is possible to specify both conditions. This will create a grid-style plot.

plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000), 
    IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp", 
                  IA_4_P = "Distractor"), Condition1 = "talker", 
    Condition2 = "Exp", Cond1Labels = c(CH1 = "Chinese 1", CH10 = "Chinese 3", 
                                     CH9 = "Chinese 2", EN3 = "English 1"),
    Cond2Labels = c(High = "High Exp", Low = "Low Exp"), ErrorBar = TRUE, 
    VWPreTheme = TRUE)

Difference plots

The function plot_avg_diff can also be used to plot the average difference between looks to two interest areas. As with plot_avg, up to two conditions can be supplied for conditional plotting.

plot_avg_diff(data = dat4, xlim = c(0, 1000), DiffCols = c(IA_1_P = "Target", IA_2_P = "Rhyme"), 
            Condition1 = NA, Condition2 = NA, Cond1Labels = NA,
            Cond2Labels = NA, ErrorBar = TRUE, VWPreTheme = TRUE)

plot_avg_diff(data = dat4, xlim = c(0, 1000), DiffCols = c(IA_1_P = "Target", IA_2_P = "Rhyme"), 
            Condition1 = "talker", Condition2 = NA, Cond1Labels = c(CH1 = "Chinese 1", 
            CH10 = "Chinese 3", CH9 = "Chinese 2", EN3 = "English 1"),
            Cond2Labels = NA, ErrorBar = TRUE, VWPreTheme = TRUE)

Conditional contour surface

In some cases, studies do not employ a factorial design; rather, they aim to investigate continuous variables. Using the function plot_avg_contour, it is therefore also possible to create a contour plot representing looks to one interest area as a surface over a continuous variable and Time. This function calculates the average time series at each value of the continuous variable and applies a 3D smooth (utilizing gam) over the surface. The function then plots the result as a contour plot. The example here plots looks to the target as a function of Rating and Time.

plot_avg_contour(data = dat4, IA = "IA_1_P", type = "proportion", Var = "Rating", 
    VarLabel = "Accent Rating", xlim = c(0, 1000), Theme = FALSE, 
    Color = c("gray20", "gray90"))

It is possible to change the contour colors and add a title. ggplot2 accepts predefined palette colors, RGB values, and hexadecimal codes, among others.

plot_avg_contour(data = dat4, IA = "IA_1_P", type = "proportion", Var = "Rating", 
    VarLabel = "Accent Rating", xlim = c(0, 1000), Theme = FALSE, 
    Color = c("red", "green")) + ggtitle("Looks to target")

Shiny app plots for data inspection

There are two functions which provide diagnostic Shiny apps for inspecting the data: plot_var_app and plot_indiv_app. These are interactive and allow the user to inspect variability among subjects and items as well as individual averages compared to the grand average. In this way, the user can determine if there are particular subjects or items that might need to be removed from the dataset.

The function plot_var_app allows the user to view by-subject or by-item Z-scores with respect to the overall mean. For this, the user provides the desired interest area and time window. The length of the line indicates how far above or below the mean a particular subject or item is within the window. Additionally, the gray circles indicate the SD within each subject or item.

plot_var_app(dat4)

(Screenshot of the plot_var_app Shiny interface.)

The function plot_indiv_app allows the user to view by-subject or by-item averages for all interest areas alongside the grand average. For this, the user provides the desired interest areas and time window.

plot_indiv_app(dat4)

(Screenshot of the plot_indiv_app Shiny interface.)

Saving the data

Subsetting and ordering

Before embarking on a statistical analysis, it will likely be necessary to take a couple of steps, such as paring down the data to include only the columns that will be needed later and ensuring the data are ordered appropriately. This is straightforward using dplyr.

FinalDat <- dat5 %>% 
  # Undo any previous groupings
  ungroup() %>%
  # Select just the columns you want
  select(., Subject, Item, Time, starts_with("IA"), Event, TRIAL_INDEX, Rating, 
         InteractChinese, Exp, target, rhymecomp, onsetcomp, distractor) %>%
  # Order the data by Subject, Trial, and Time
  arrange(., Subject, TRIAL_INDEX, Time)

Saving to a file

Save the resulting dataset to a .rda file and use compression to make it more compact.

save(FinalDat, file = "FinalDat.rda", compress = "xz")
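The saved file can later be loaded back into an R session with:

load("FinalDat.rda")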