Before using this package, a number of prerequisites must be met. First, your eye gaze data must have been collected using an SR Research Eyelink eye tracker. Second, your data must have been exported using SR Research DataViewer software. For this, specify an interest period relative to the onset of the critical auditory stimulus and export a Sample Report with all available columns (this ensures that you have all of the columns needed by the functions contained in this package). Additionally, it is preferable to export to a .txt file rather than a .xlsx file.
The following procedure assumes that, in your experiment, interest area IDs and labels were assigned consistently to the object types (i.e., the target was always in interest area 1, the competitor was always in interest area 2, etc.). This is typically done by dynamically moving the interest areas trial-by-trial to correspond with the position of the objects.
If, instead, your interest areas were static and you have columns indicating the location of each object by trial, you will need to relabel your interest areas (sample code available from maintainer, upon request). After that, you can follow the steps in this vignette. Additionally, the functions presented here are capable of handling data with a maximum of 8 interest areas. If you have more than 8 interest areas, it is necessary to adjust the source to accommodate the number needed.
Lastly, the functions included here internally make use of dplyr for manipulating and restructuring data. For more information about dplyr, please refer to its reference manual and extensive collection of vignettes.
First, load the sample report. By default, DataViewer will assign “.” to missing values; therefore it is important to include this in the na.strings parameter, so R will know how to handle any missing data.
library(VWPre)
VWdat <- read.table("1000HzData.txt", header = TRUE, sep = "\t", na.strings = c(".", "NA"))
However, for the purposes of this vignette we will use the sample dataset included in the package.
data(VWdat)
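To see what the na.strings argument does when reading DataViewer output, here is a minimal sketch using base R only. The three-row table is made up for illustration and is not taken from the sample dataset.

```r
# Toy tab-separated text mimicking a Sample Report; the values are invented.
txt <- paste0(
  "RECORDING_SESSION_LABEL\tTIMESTAMP\tRIGHT_INTEREST_AREA_ID\n",
  "s01\t1000\t1\n",
  "s01\t1001\t.\n",
  "s01\t1002\t2\n"
)

# Without na.strings, the "." forces the whole column to be read as character.
raw <- read.table(text = txt, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
class(raw$RIGHT_INTEREST_AREA_ID)

# With na.strings, "." becomes NA and the column stays numeric.
ok <- read.table(text = txt, header = TRUE, sep = "\t",
                 na.strings = c(".", "NA"), stringsAsFactors = FALSE)
class(ok$RIGHT_INTEREST_AREA_ID)
sum(is.na(ok$RIGHT_INTEREST_AREA_ID))
```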
In order for the functions in the package to work appropriately, the data need to be in a specific format. The prep_data
function first converts the data frame to a data table object. Second, it examines the class of specific columns (LEFT_INTEREST_AREA_ID
, RIGHT_INTEREST_AREA_ID
, LEFT_INTEREST_AREA_LABEL
, RIGHT_INTEREST_AREA_LABEL
, TIMESTAMP
, and TRIAL_INDEX
) to ensure they are appropriately assigned (e.g., categorical variables are encoded as factors).
Additionally, the function will look for the column the name of which is specified in the Subject
parameter. Typical DataViewer output contains a column called RECORDING_SESSION_LABEL
which is the name of the column containing the subject identifier. The function will rename it Subject
and will ensure it is encoded as a factor.
If your data contain a column corresponding to an item identifier please specify it in the Item
parameter. By doing so, the function will rename the column as Item
and will ensure it is encoded as a factor. If you don’t have an item identifier column, by default the value of this parameter is NA.
Lastly, a new column called Event will be created which indexes each unique series of samples corresponding to the combination of Subject and TRIAL_INDEX. This Event variable will be needed internally for subsequent operations. Should you choose to define the Event variable differently, you can override the default; however, do so cautiously, as this may affect how subsequent operations perform. Upon completion, the function prints a summary indicating the results.
dat0 <- prep_data(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid")
[1] “Step 1 of 9…”
[1] “RECORDING_SESSION_LABEL renamed to Subject.”
[1] “Subject converted to factor.”
[1] “Step 2 of 9…”
[1] “itemid renamed to Item.”
[1] “Item converted to factor”
[1] “Step 3 of 9…”
[1] “LEFT_INTEREST_AREA_ID converted to numeric.”
[1] “Step 4 of 9…”
[1] “RIGHT_INTEREST_AREA_ID already numeric.”
[1] “Step 5 of 9…”
[1] “LEFT_INTEREST_AREA_LABEL converted to factor.”
[1] “Step 6 of 9…”
[1] “RIGHT_INTEREST_AREA_LABEL already factor.”
[1] “Step 7 of 9…”
[1] “TIMESTAMP already numeric.”
[1] “Step 8 of 9…”
[1] “TRIAL_INDEX already numeric.”
[1] “Step 9 of 9…”
[1] “Event variable created from Subject and TRIAL_INDEX”
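As a rough illustration of what an Event index built from Subject and TRIAL_INDEX looks like, the sketch below uses base R on a made-up data frame. It is an assumption about the idea, not the code prep_data actually runs, and the level naming ("s01.1", etc.) is simply what interaction() produces.

```r
# Toy samples: two subjects, with subject s01 contributing two trials.
toy <- data.frame(
  Subject     = factor(c("s01", "s01", "s01", "s02", "s02")),
  TRIAL_INDEX = c(1, 1, 2, 1, 1)
)

# One factor level per unique Subject-by-trial combination.
toy$Event <- interaction(toy$Subject, toy$TRIAL_INDEX, drop = TRUE)

nlevels(toy$Event)   # number of distinct events in the toy data
table(toy$Event)     # number of samples per event
```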
At this point, it is safe to remove the columns which were output by DataViewer but are not needed for the functions to operate. Removing these will help the functions run faster and result in data that consume less disk space. This is done straightforwardly using dplyr::select, which accepts column names as well as matching helpers such as starts_with (or regular expressions via matches). If using the sample dataset included in this package, it is not necessary to do this step, as these columns have already been removed.
library(dplyr)  # provides select() and starts_with()
# Note: the acceleration columns are spelled ACCELERATION in DataViewer output.
dat0 <- select(dat0, -starts_with("AVERAGE"), -starts_with("DATA_"),
               -starts_with("HTARGET"), -starts_with("IP"),
               -starts_with("LEFT_ACCELERATION"), -starts_with("LEFT_GAZE"),
               -starts_with("LEFT_IN_"), -starts_with("LEFT_PUPIL"),
               -starts_with("LEFT_VELOCITY"), -starts_with("RESOLUTION"),
               -starts_with("RIGHT_ACCELERATION"), -starts_with("RIGHT_GAZE"),
               -starts_with("RIGHT_IN_"), -starts_with("RIGHT_PUPIL"),
               -starts_with("RIGHT_VELOCITY"), -starts_with("SAMPLE"),
               -starts_with("TARGET_"), -starts_with("TRIAL_START"),
               -starts_with("VIDEO"))
When the data were loaded, samples that were outside of any interest area were labeled as NA. The relabel_na function examines the interest area columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, and RIGHT_INTEREST_AREA_LABEL) for cells containing NAs. It then assigns 0 to the ID columns and “Outside” to the LABEL columns to indicate those eye gaze samples which fell outside of the interest areas defined in the study. The number of interest areas you defined in your experiment should be supplied to the parameter NoIA.
dat1 <- relabel_na(data = dat0, NoIA = 4)
[1] “LEFT_INTEREST_AREA_LABEL: Number of levels DO NOT match NoIA.”
[1] “RIGHT_INTEREST_AREA_LABEL: Number of levels match NoIA.”
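The relabelling described above can be pictured with a small base R sketch. The toy data frame is invented, and this is only an illustration of the idea (NA becomes 0 or "Outside"), not the internals of relabel_na.

```r
# Toy right-eye columns with one sample outside every interest area.
toy <- data.frame(
  RIGHT_INTEREST_AREA_ID    = c(1, NA, 2),
  RIGHT_INTEREST_AREA_LABEL = factor(c("Target", NA, "Rhyme"))
)

# ID column: NA -> 0.
toy$RIGHT_INTEREST_AREA_ID[is.na(toy$RIGHT_INTEREST_AREA_ID)] <- 0

# LABEL column: go through character so that "Outside" can become a level.
lab <- as.character(toy$RIGHT_INTEREST_AREA_LABEL)
lab[is.na(lab)] <- "Outside"
toy$RIGHT_INTEREST_AREA_LABEL <- factor(lab)

toy
```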
The function create_time_series creates a meaningful time series which can be used for visualizing and modeling the data. It is common to export a period of time prior to the onset of the stimulus as a baseline. In this case, an offset (equal to the duration of the baseline period) must be applied to the time series, specified in the Offset parameter. In the example below, the data were exported with a 100 ms pre-stimulus interval. The function creates a new column called Time.
dat2 <- create_time_series(data = dat1, Offset = 100)
The function check_time_series
can be used to verify that a meaningful time series was created and that each Event begins at the same standardized time point relative to the stimulus.
check_time_series(data = dat2)
[1] -100
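The effect of the offset can be sketched in base R as follows. The toy timestamps and the 2 ms offset are made up, and the arithmetic is an assumption about the idea (each event is aligned to its first sample and then shifted by the baseline duration), not necessarily how create_time_series is implemented.

```r
# Two toy events with raw tracker timestamps (in ms).
toy <- data.frame(
  Event     = rep(c("s01.1", "s01.2"), each = 4),
  TIMESTAMP = c(5000:5003, 9000:9003)
)

Offset <- 2  # hypothetical pre-stimulus interval in ms

# Align each event to its own first sample, then shift by the offset so that
# Time 0 corresponds to stimulus onset.
toy$Time <- toy$TIMESTAMP - ave(toy$TIMESTAMP, toy$Event, FUN = min) - Offset

# Every event should now start at the same (negative) time point.
tapply(toy$Time, toy$Event, min)
```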
Depending on the design of the study, right, left, or both eyes may have been recorded during the experiment. DataViewer outputs gaze data by placing it in separate columns for each eye (LEFT_INTEREST_AREA_ID
, LEFT_INTEREST_AREA_LABEL
, RIGHT_INTEREST_AREA_ID
, RIGHT_INTEREST_AREA_LABEL
). However, it is preferable to have gaze data in a single set of columns, regardless of which eye was recorded during the experiment. The function select_recorded_eye
provides the functionality for this purpose, returning two new columns (IA_ID
and IA_LABEL
).
The function select_recorded_eye requires that the parameter Recording be specified. This parameter tells the function which eye(s) were used to record the gaze data. It takes one of four possible strings: “LandR”, “LorR”, “L”, or “R”. “LandR” should be used when any participant had both eyes recorded. “LorR” should be used when some participants had their left eye recorded and others had their right eye recorded. “L” should be used when all participants had their left eye recorded. “R” should be used when all participants had their right eye recorded.
If in doubt, use the function check_eye_recording, which will do a quick check to see if LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID contain data. It will then suggest the appropriate Recording parameter setting. When in complete doubt, use “LandR”. The “LandR” setting requires an additional parameter (WhenLandR) to be specified. This instructs the function to select either the right eye or the left eye when data exist for both.
check_eye_recording(data = dat2)
[1] “The dataset contains recordings for ONLY the right eye. Set the Recording parameter in select_recorded_eye() to ‘R’.”
After operating, the function prints a summary of the output. Because the function check_eye_recording indicated that the parameter Recording should be set to “R”, the example below uses that setting (“LandR” could also be used as a “catch-all”). Consequently, the summary shows that there were only recordings in the right eye.
dat3 <- select_recorded_eye(data = dat2, Recording = "R", WhenLandR = "Right")
[1] “Gaze data summary for 320 events:”
[1] “The final data frame contains 319 event(s) using gaze data from the right eye.”
[1] “The final data frame contains 1 event(s) with no samples falling within any interest area during the given time series.”
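The idea of collapsing two per-eye columns into a single IA_ID column can be sketched as below. This is a sample-level simplification written for illustration: the helper pick_eye is hypothetical, its prefer argument only plays the role that WhenLandR plays in the text, and the package itself reports its selection by event.

```r
# Hypothetical helper: choose one interest area ID per sample.
# A value of 0 means "outside all interest areas" (see relabel_na above).
pick_eye <- function(left, right, prefer = c("Right", "Left")) {
  prefer <- match.arg(prefer)
  has_left  <- left > 0
  has_right <- right > 0
  use_right <- has_right & (prefer == "Right" | !has_left)
  ifelse(use_right, right, left)
}

toy <- data.frame(
  LEFT_INTEREST_AREA_ID  = c(0, 0, 3, 0),
  RIGHT_INTEREST_AREA_ID = c(1, 0, 2, 0)
)

toy$IA_ID <- pick_eye(toy$LEFT_INTEREST_AREA_ID,
                      toy$RIGHT_INTEREST_AREA_ID,
                      prefer = "Right")
toy
```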
In order to obtain proportion looks, it is necessary to bin the data. That is, group samples in chunks of time, count the number of samples in each of the interest areas, and calculate the proportions based on the counts. For the function to do this correctly, it needs to know the sampling rate at which the eye gaze data were recorded. With Eyelink trackers, this is typically 250Hz, 500Hz, or 1000Hz.
If in doubt, use the function check_samplingrate
to determine it. The sampling rate will then be supplied as a parameter to the function bin_prop
.
check_samplingrate(dat3)
[1] “Sampling rate(s) present in the data are: 1000 Hz.”
Note that the check_samplingrate function returns a printed message indicating the sampling rate(s) present in the data. Optionally, it can return a new column called SamplingRate by specifying the parameter ReturnData as TRUE. In the event that data were collected at different sampling rates, this column can be used to subset the dataset by sampling rate before proceeding to the next processing step.
The function bin_prop
calculates the proportion of looks (samples) to each interest area in a particular span of time (bin size). In order to do this, it is necessary to supply the parameters BinSize
and SamplingRate
. BinSize
should be specified in milliseconds, representing the chunk of time within which to calculate the proportions.
Not all bin sizes will work for all sampling rates. If unsure which are appropriate for your current sampling rate, use the ds_options
function. When provided with the current sampling rate in SamplingRate
(see above), the function will return a printed summary of the bin size options and their corresponding downsampled rate.
ds_options(SamplingRate = 1000)
[1] “Bin size: 1 ms; Downsampled rate: 1000 Hz”
[1] “Bin size: 2 ms; Downsampled rate: 500 Hz”
[1] “Bin size: 4 ms; Downsampled rate: 250 Hz”
[1] “Bin size: 5 ms; Downsampled rate: 200 Hz”
[1] “Bin size: 8 ms; Downsampled rate: 125 Hz”
[1] “Bin size: 10 ms; Downsampled rate: 100 Hz”
[1] “Bin size: 20 ms; Downsampled rate: 50 Hz”
[1] “Bin size: 25 ms; Downsampled rate: 40 Hz”
[1] “Bin size: 40 ms; Downsampled rate: 25 Hz”
[1] “Bin size: 50 ms; Downsampled rate: 20 Hz”
[1] “Bin size: 100 ms; Downsampled rate: 10 Hz”
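The relationship between bin size and downsampled rate in the listing above is simple arithmetic: the rate is 1000 divided by the bin size in milliseconds. The sketch below reproduces the printed options for 1000 Hz under two assumptions that are mine rather than documented behavior of ds_options: a bin must be a whole multiple of the sample duration, and only bins up to 100 ms that yield a whole-number rate are listed.

```r
SamplingRate <- 1000                 # original rate in Hz
sample_ms    <- 1000 / SamplingRate  # duration of one sample in ms

# Candidate bins: multiples of the sample duration, capped at 100 ms (assumed).
candidates <- seq(sample_ms, 100, by = sample_ms)

# Keep bins that divide one second evenly, so the new rate is a whole number.
bins <- candidates[1000 %% candidates == 0]

data.frame(BinSize_ms = bins, Downsampled_Hz = 1000 / bins)
```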
The SamplingRate
parameter in bin_prop
should be specified in Hertz (see check_samplingrate
), representing the original sampling rate of the data and the BinSize
should be specified in milliseconds (see ds_options
), representing the span of time over which to calculate the proportion. The bin_prop
function returns new columns corresponding to each interest area ID (e.g., IA_1_C
, IA_1_P
). The extension ‘_C’ indicates the count of samples in the bin and the extension ‘_P’ indicates the proportion.
dat4 <- bin_prop(dat3, NoIA = 4, BinSize = 20, SamplingRate = 1000)
[1] “Sampling rate OK. You’re good to go!”
In performing the calculation, the function effectively downsamples the data. To check this and to know the new sampling rate, simply call the function check_samplingrate
again.
check_samplingrate(dat4)
[1] “Sampling rate(s) present in the data are: 50 Hz.”
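The binning step can be pictured with a short base R sketch: group samples into 20 ms chunks, count samples per interest area, and divide by the number of samples per bin. The toy data are invented and the code illustrates the idea only; it is not the implementation of bin_prop.

```r
# 40 toy samples at 1000 Hz (one per ms) with an interest area ID each.
toy <- data.frame(
  Time  = 0:39,
  IA_ID = c(rep(1, 15), rep(0, 5), rep(2, 20))
)

BinSize       <- 20                            # ms
SamplingRate  <- 1000                          # Hz
SamplesPerBin <- BinSize * SamplingRate / 1000

# Label each sample with the start time of its bin.
toy$Bin <- floor(toy$Time / BinSize) * BinSize

# Counts (_C) and proportions (_P) per bin for interest areas 1 and 2.
binned <- data.frame(
  Time   = sort(unique(toy$Bin)),
  IA_1_C = as.vector(tapply(toy$IA_ID == 1, toy$Bin, sum)),
  IA_2_C = as.vector(tapply(toy$IA_ID == 2, toy$Bin, sum))
)
binned$IA_1_P <- binned$IA_1_C / SamplesPerBin
binned$IA_2_P <- binned$IA_2_C / SamplesPerBin

binned
```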
Proportions are inherently bounded between 0 and 1 and are therefore not suitable for many types of analysis. Logits provide a transformation resulting in an unbounded measure. However, the logit of a proportion of exactly 0 or 1 is negative or positive infinity; the empirical logit transformation adds a constant to the counts, thus avoiding +/- infinity.
In order to calculate empirical logits, it is necessary to know the number of samples in each bin. This will vary depending on your original sampling rate and bin size. To determine this, use the function check_samples_per_bin
.
check_samples_per_bin(dat4)
[1] “There are 20 samples in each bin.”
[1] “One data point every 20 millisecond(s)”
The function transform_to_elogit transforms the proportions to empirical logits and also calculates a weight for each value. The weights estimate the variance in each bin (because the variance of the logit depends on the mean). This is particularly important for regression analyses and should be specified in the model call (e.g., weights = 1 / IA_1_wts).
These calculations are taken from: Barr, D. J., (2008) Analyzing ‘visual world’ eyetracking data using multilevel logistic regression, Journal of Memory and Language, 59(4), 457–474. Note that by default the calculation uses a constant of 0.5 (as indicated by Barr); however, a different value can be used by specifying it in the parameter Constant
.
dat5 <- transform_to_elogit(dat4, NoIA = 4, SamplesPerBin = 20)
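For reference, the empirical logit and its weight as given by Barr (2008) can be computed by hand. The sketch below uses made-up counts with 20 samples per bin and the default constant of 0.5; the variable names are illustrative and not the column names produced by transform_to_elogit.

```r
y        <- c(0, 5, 10, 20)  # toy counts of samples in one interest area
N        <- 20               # samples per bin
Constant <- 0.5

# Empirical logit: finite even when y is 0 or N.
elogit <- log((y + Constant) / (N - y + Constant))

# Weight (variance estimate) for each empirical logit.
wts <- 1 / (y + Constant) + 1 / (N - y + Constant)

data.frame(y, prop = y / N, elogit, wts)
```

Note that the variance estimate is largest for counts near 0 or N, which is why it is passed to the model as an inverse weight.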
Some researchers may prefer to perform a binomial analysis. Therefore the function create_binomial
uses (previously calculated) sample counts to create a success/failure column for each IA. This column is then suitable as a response variable in logistic regression.
dat5a <- create_binomial(data = dat4, NoIA = 4)
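A success/failure response of this kind is typically passed to a logistic regression as a two-column matrix. The sketch below builds one from made-up counts and fits it with glm from base R; the column name Off and the layout are assumptions for illustration, not the columns that create_binomial returns.

```r
# Toy counts of looks to interest area 1 in four consecutive 20 ms bins.
toy <- data.frame(
  Time   = c(0, 20, 40, 60),
  IA_1_C = c(2, 8, 15, 19)
)
SamplesPerBin <- 20

# Failures are the samples in the bin that were not on this interest area.
toy$Off <- SamplesPerBin - toy$IA_1_C

# cbind(successes, failures) is the standard binomial response form in glm().
fit <- glm(cbind(IA_1_C, Off) ~ Time, data = toy, family = binomial)
coef(fit)
```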
For advanced users who have worked with the package functions before and who are familiar with the required steps and output, there is a meta-function, called fasttrack
, which runs through the previous functions and outputs a dataframe with either empirical logits or binomial data. Note that using this function will still require the user to manually remove unneeded columns (see above). This meta-function takes as parameters all the required arguments to the component functions. Again, this is only recommended for users who have previously worked with visual world data and the functions contained in this package.
dat5b <- fasttrack(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid",
EventColumns = c("Subject", "TRIAL_INDEX"), NoIA = 4, Offset = 100, Recording = "LandR",
WhenLandR = "Right", BinSize = 20, SamplingRate = 1000,
SamplesPerBin = 20, Constant = 0.5, Output = "ELogit")
[1] “Preparing data…”
[1] “Step 1 of 9…”
[1] “RECORDING_SESSION_LABEL renamed to Subject.”
[1] “Subject converted to factor.”
[1] “Step 2 of 9…”
[1] “itemid renamed to Item.”
[1] “Item converted to factor”
[1] “Step 3 of 9…”
[1] “LEFT_INTEREST_AREA_ID converted to numeric.”
[1] “Step 4 of 9…”
[1] “RIGHT_INTEREST_AREA_ID already numeric.”
[1] “Step 5 of 9…”
[1] “LEFT_INTEREST_AREA_LABEL converted to factor.”
[1] “Step 6 of 9…”
[1] “RIGHT_INTEREST_AREA_LABEL already factor.”
[1] “Step 7 of 9…”
[1] “TIMESTAMP already numeric.”
[1] “Step 8 of 9…”
[1] “TRIAL_INDEX already numeric.”
[1] “Step 9 of 9…”
[1] “Event variable created from Subject and TRIAL_INDEX”
[1] “Relabelling outside of 4 interest areas…”
[1] “LEFT_INTEREST_AREA_LABEL: Number of levels DO NOT match NoIA.”
[1] “RIGHT_INTEREST_AREA_LABEL: Number of levels match NoIA.”
[1] “Creating time series with 100 ms offset…”
[1] -100
[1] “The dataset contains recordings for ONLY the right eye. Set the Recording parameter in select_recorded_eye() to ‘R’.”
[1] “Selecting recorded eye…”
[1] “Gaze data summary for 320 events:”
[1] “0 event(s) contained gaze data for both eyes, for which the Right eye has been selected.”
[1] “The final data frame contains 319 event(s) using gaze data from the right eye.”
[1] “The final data frame contains 0 event(s) using gaze data from the left eye.”
[1] “The final data frame contains 1 event(s) with no samples falling within any interest area during the given time series.”
[1] “Sampling rate(s) present in the data are: 1000 Hz.”
[1] “Binning 1000 Hz data into 20 ms bins…”
[1] “Calculating proportions…”
[1] “Sampling rate OK. You’re good to go!”
[1] “Sampling rate(s) present in the data are: 50 Hz.”
[1] “There are 20 samples in each bin.”
[1] “One data point every 20 millisecond(s)”
[1] “Preparing ELogit output…”
Some may wish to rename the interest area columns created by the functions to something more meaningful than the numeric coding scheme. To do so, use the function rename_columns. This will convert column names like IA_1_C and IA_2_P to IA_Target_C and IA_Rhyme_P, respectively. The operation is performed on all the IA_ columns for up to 8 interest areas.
dat6 <- rename_columns(dat5, Labels = c(IA1="Target", IA2="Rhyme",
IA3="OnsetComp", IA4="Distractor"))
[1] “Renaming 4 interest areas.”
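The renaming pattern described above (IA_1_C becomes IA_Target_C, and so on) can be sketched on a plain vector of column names. This base R loop is an illustration of the mapping only and makes no claim about how rename_columns works internally.

```r
cols   <- c("Subject", "Time", "IA_1_C", "IA_1_P", "IA_2_C", "IA_2_P")
labels <- c(IA1 = "Target", IA2 = "Rhyme")

for (key in names(labels)) {
  id <- sub("^IA", "", key)  # "IA1" -> "1"
  # Replace the numeric ID in the column prefix with the chosen label.
  cols <- sub(paste0("^IA_", id, "_"),
              paste0("IA_", labels[[key]], "_"),
              cols)
}

cols
```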
It’s often desirable to visualize the proportion (or empirical logit) data, either as a grand average or by condition. In some cases it is even necessary to visualize the trend in the data over a continuous predictor. So, the functions plot_avg
and plot_avg_contour
provide straightforward plotting options for such cases. These functions internally calculate the average(s) and plot the results. The plotting is powered by ggplot2
, so further customization (plot titles, custom themes, etc) is still possible. For more information about ggplot2
, please refer to its reference manual and extensive documentation.
Using the function plot_avg, it is possible to plot the grand average of the data by interest area. The parameter type specifies which type of plot to create: proportion or empirical logit. In IAColumns, list the column names of the interest area proportions (here we have used the default names) along with the desired labels.
plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000),
IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp",
IA_4_P = "Distractor"),
Condition1 = NA, Condition2 = NA, Cond1Labels = NA, Cond2Labels = NA,
ErrorBar = TRUE, VWPreTheme = TRUE)
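To make concrete what a grand average over time means, the sketch below averages a proportion column at each time point using base R. The toy values are invented, and whether plot_avg first averages within subjects or pools all events is not stated here, so treat this as the simplest possible version of the idea.

```r
# Toy binned proportions for two subjects at three time points.
toy <- data.frame(
  Subject = rep(c("s01", "s02"), each = 3),
  Time    = rep(c(0, 20, 40), times = 2),
  IA_1_P  = c(0.2, 0.4, 0.6,
              0.4, 0.6, 0.8)
)

# Mean proportion of looks to interest area 1 at each time point.
avg <- aggregate(IA_1_P ~ Time, data = toy, FUN = mean)
avg
```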
To add a title to the plot, simply add the ggtitle function from ggplot2.
library(ggplot2)  # provides ggtitle() and theme()
plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000),
IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp",
IA_4_P = "Distractor"),
Condition1 = NA, Condition2 = NA, Cond1Labels = NA, Cond2Labels = NA,
ErrorBar = TRUE, VWPreTheme = TRUE) + ggtitle("Grand Average Plot")
To customize the appearance of a plot (e.g., font, size, color, margins, etc.), the VWPreTheme parameter can be set to FALSE, which reverts to the default theming in ggplot2. In doing so, the user can apply a custom theme to the plot. Detailed information about creating themes can be found in the ggplot2 Theme Vignette. For the purpose of illustration, the default ggplot2 theme has been applied, with the axis text elements increased in size.
plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000),
IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp",
IA_4_P = "Distractor"),
Condition1 = NA, Condition2 = NA, Cond1Labels = NA, Cond2Labels = NA,
ErrorBar = TRUE, VWPreTheme = FALSE) + theme(axis.text = element_text(size = 15))
The function plot_avg can also be used to plot averages for different conditions, based on a factor variable in the data. If the labels of the factor levels in the data are not suitable for plotting, specify new labels using a named vector in Cond1Labels.
Specifying Condition1 will stack the plots.
plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000),
IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp",
IA_4_P = "Distractor"), Condition1 = "talker",
Condition2 = NA, Cond1Labels = c(CH1 = "Chinese 1", CH10 = "Chinese 3",
CH9 = "Chinese 2", EN3 = "English 1"),
Cond2Labels = NA, ErrorBar = TRUE, VWPreTheme = TRUE)
Alternatively, specifying just Condition2 will plot the same information, but align it horizontally.
plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000),
IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp",
IA_4_P = "Distractor"), Condition1 = NA,
Condition2 = "talker", Cond1Labels = NA, Cond2Labels = c(CH1 = "Chinese 1",
CH10 = "Chinese 3",
CH9 = "Chinese 2",
EN3 = "English 1"),
ErrorBar = TRUE, VWPreTheme = TRUE)
For a 2x2 design, it is possible to specify both conditions. This will create a grid-style plot.
plot_avg(data = dat4, type = "proportion", xlim = c(0, 1000),
IAColumns = c(IA_1_P = "Target", IA_2_P = "Rhyme", IA_3_P = "OnsetComp",
IA_4_P = "Distractor"), Condition1 = "talker",
Condition2 = "Exp", Cond1Labels = c(CH1 = "Chinese 1", CH10 = "Chinese 3",
CH9 = "Chinese 2", EN3 = "English 1"),
Cond2Labels = c(High = "High Exp", Low = "Low Exp"), ErrorBar = TRUE,
VWPreTheme = TRUE)
The function plot_avg_diff can be used to plot the average difference between looks to two interest areas. As with plot_avg, up to two conditions can be supplied for conditional plotting.
plot_avg_diff(data = dat4, xlim = c(0, 1000), DiffCols = c(IA_1_P = "Target", IA_2_P = "Rhyme"),
Condition1 = NA, Condition2 = NA, Cond1Labels = NA,
Cond2Labels = NA, ErrorBar = TRUE, VWPreTheme = TRUE)
plot_avg_diff(data = dat4, xlim = c(0, 1000), DiffCols = c(IA_1_P = "Target", IA_2_P = "Rhyme"),
Condition1 = "talker", Condition2 = NA, Cond1Labels = c(CH1 = "Chinese 1",
CH10 = "Chinese 3", CH9 = "Chinese 2", EN3 = "English 1"),
Cond2Labels = NA, ErrorBar = TRUE, VWPreTheme = TRUE)
In some cases, studies have not employed a factorial design; rather they aim to investigate continuous variables. Therefore, using the function plot_avg_contour
it is also possible to create a contour plot representing the looks to one interest area as a surface over the continuous variable and Time. This function calculates the average time series at each value of the continuous variable and applies a 3D smooth (utilizing gam
) over the surface. The function then plots the result as a contour plot. Here, the example plots looks to the target as a function of Rating and Time.
plot_avg_contour(data = dat4, IA = "IA_1_P", type = "proportion", Var = "Rating",
VarLabel = "Accent Rating", xlim = c(0,1000), Theme = FALSE,
Color = c("gray20", "gray90"))
It is possible to change the contour colors and add a title. ggplot2
accepts predefined palette colors, RGB, hexadecimal, among others.
plot_avg_contour(data = dat4, IA = "IA_1_P", type = "proportion", Var = "Rating",
VarLabel = "Accent Rating", xlim = c(0,1000), Theme = FALSE,
Color = c("red", "green")) + ggtitle("Looks to target")
There are two functions which provide diagnostic Shiny apps for inspecting the data: plot_var_app
and plot_indiv_app
. These are interactive and allow the user to inspect variability among subjects and items as well as individual averages compared to the grand average. In this way, the user can determine if there are particular subjects or items that might need to be removed from the dataset.
The function plot_var_app
allows the user to view by-subject or by-item Z-scores with respect to the overall mean. For this the user provides the desired interest area and time window. The length of the line indicates how far above or below the mean a particular subject or item is within the window. Additionally, the gray circles indicate the SD within each subject or item.
plot_var_app(dat4)
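As a rough picture of the by-subject Z-scores mentioned above, the sketch below computes each subject's mean proportion within a time window and standardizes it against the other subjects. The data, the window, and the exact formula are assumptions for illustration; the app's own computation may differ.

```r
# Toy binned proportions for three subjects.
toy <- data.frame(
  Subject = rep(c("s01", "s02", "s03"), each = 3),
  Time    = rep(c(0, 20, 500), times = 3),
  IA_1_P  = c(0.2, 0.4, 0.9,
              0.5, 0.5, 0.9,
              0.6, 0.8, 0.9)
)

# Restrict to a hypothetical window of interest (0 to 100 ms).
win <- toy[toy$Time >= 0 & toy$Time <= 100, ]

# Mean per subject, then distance from the mean of subject means in SD units.
subj_mean <- tapply(win$IA_1_P, win$Subject, mean)
z <- (subj_mean - mean(subj_mean)) / sd(subj_mean)
z
```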
The function plot_indiv_app allows the user to view by-subject or by-item averages for all interest areas, alongside the grand average. For this the user provides the desired interest areas and time window.
plot_indiv_app(dat4)
Before embarking on a statistical analysis, it is usually necessary to take a couple of steps, such as paring down the data to only include the columns which will be needed later and ensuring the data are ordered appropriately. This is straightforward using dplyr.
library(dplyr)  # provides %>%, ungroup(), select(), and arrange()
FinalDat <- dat5 %>%
  # Undo any previous groupings
  ungroup() %>%
  # Select just the columns you want
  select(., Subject, Item, Time, starts_with("IA"), Event, TRIAL_INDEX, Rating,
         InteractChinese, Exp, target, rhymecomp, onsetcomp, distractor) %>%
  # Order the data by Subject, Trial, and Time
  arrange(., Subject, TRIAL_INDEX, Time)
Save the resulting dataset to a .rda file and use compression to make it more compact.
save(FinalDat, file = "FinalDat.rda", compress = "xz")
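To read the saved file back in a later session, base R's load can be used. The sketch below does a round trip through a temporary directory with a tiny stand-in data frame, so it can be run without the real dataset.

```r
# Stand-in for the processed data; the real FinalDat has many more columns.
FinalDat <- data.frame(Subject = factor("s01"), Time = c(0, 20))

path <- file.path(tempdir(), "FinalDat.rda")
save(FinalDat, file = path, compress = "xz")

# load() restores objects under their original names; loading into a fresh
# environment avoids overwriting anything in the workspace.
env    <- new.env()
loaded <- load(path, envir = env)

loaded                             # names of the restored objects
identical(env$FinalDat, FinalDat)  # the data survive the round trip
```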