How to use the volkeR package?

First, load the package, set the plot theme and get some data.

# Load the package
library(volker)

# Set the basic plot theme
theme_set(theme_vlkr())

# Load an example dataset ds from the package
ds <- volker::chatgpt

How to generate tables and plots?

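Decide whether your data is categorical or metric and choose the appropriate function: the functions ending in _counts (for example tab_counts() and plot_counts()) handle categorical variables, and those ending in _metrics (for example tab_metrics() and plot_metrics()) handle metric variables.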

The column selection determines whether to analyse single variables, item lists or to compare and correlate multiple variables. Try it out!

Categorical variables

# A single variable
tab_counts(ds, use_private)
Usage: in private context n p
never 12 12%
rarely 40 40%
several times a month 30 30%
several times a week 15 15%
almost daily 4 4%
total 101 100%
# A list of variables
tab_counts(ds, c(use_private, use_work))
Usage never rarely several times a month several times a week almost daily total
in private context 12% (12) 40% (40) 30% (30) 15% (15) 4% (4) 100% (101)
in professional context 38% (38) 21% (21) 15% (15) 17% (17) 10% (10) 100% (101)
# Variables matched by a pattern
tab_counts(ds, starts_with("use_"))
Usage never rarely several times a month several times a week almost daily total
in private context 12% (12) 40% (40) 30% (30) 15% (15) 4% (4) 100% (101)
in professional context 38% (38) 21% (21) 15% (15) 17% (17) 10% (10) 100% (101)

Metric variables

# One metric variable
tab_metrics(ds, sd_age)
Age value
min 18
q1 27
median 38
q3 52
max 68
mean 39.7
sd 13.8
n 101
# Multiple metric items
tab_metrics(ds, starts_with("cg_adoption_"))
Expectations min q1 median q3 max mean sd n
ChatGPT has clear advantages compared to similar offerings. 1 3 4 4 5 3.4 1.0 97
Using ChatGPT brings financial benefits. 1 2 3 4 5 2.7 1.2 97
Using ChatGPT is advantageous in many tasks. 1 3 4 4 5 3.6 1.1 97
Compared to other systems, using ChatGPT is more fun. 1 3 4 4 5 3.5 1.0 97
Much can go wrong when using ChatGPT. 1 2 3 4 5 3.1 1.1 97
There are legal issues with using ChatGPT. 1 2 3 4 5 3.1 1.2 97
The security of user data is not guaranteed with ChatGPT. 1 3 3 4 5 3.2 1.0 97
Using ChatGPT could bring personal disadvantages. 1 2 3 3 5 2.7 1.1 97
In my environment, using ChatGPT is standard. 1 2 2 3 5 2.5 1.1 97
Almost everyone in my environment uses ChatGPT. 1 1 2 3 5 2.4 1.2 97
Not using ChatGPT is considered being an outsider. 1 1 2 3 5 2.0 1.2 97
Using ChatGPT brings me recognition from my environment. 1 1 2 3 5 2.3 1.2 97

4 missing case(s) omitted.

Cross tabulation and group comparison

Provide a grouping column in the third parameter to compare different groups.

tab_counts(ds, adopter, sd_gender)
Innovator type total female male diverse
I try new offers immediately 15% (15) 2% (2) 12% (12) 1% (1)
I try new offers rather quickly 62% (63) 25% (25) 38% (38) 0% (0)
I wait until offers establish themselves 22% (22) 13% (13) 9% (9) 0% (0)
I only use new offers when I have no other choice 1% (1) 0% (0) 1% (1) 0% (0)
total 100% (101) 40% (40) 59% (60) 1% (1)

For metric variables, you can compare the mean values.

# Compare the means across the groups of one grouping variable (including the confidence interval)
tab_metrics(ds, sd_age, sd_gender, ci = TRUE)
Gender min q1 median q3 max mean sd ci low ci high n
female 18 25.8 38.0 44.2 63 37.5 13.4 33.2 41.8 40
male 19 32.5 38.5 52.0 68 41.2 14.0 37.6 44.8 60
diverse 33 33.0 33.0 33.0 33 33.0 1
total 18 27.0 38.0 52.0 68 39.7 13.8 37.0 42.4 101
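The confidence limits in this table are consistent with a t-based interval around the mean. The following base R sketch recomputes them from the mean, sd and n values printed above. The helper name ci_mean is made up for illustration, and whether the package computes the interval exactly this way is an assumption.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
ci_mean <- function(mean, sd, n, level = 0.95) {
  # Critical t value times the standard error of the mean
  margin <- qt(1 - (1 - level) / 2, df = n - 1) * sd / sqrt(n)
  c(ci_low = mean - margin, ci_high = mean + margin)
}

# Values taken from the table above
female <- round(ci_mean(37.5, 13.4, 40), 1)
male <- round(ci_mean(41.2, 14.0, 60), 1)
print(female)  # 33.2 41.8
print(male)    # 37.6 44.8
```

With only one observation (the diverse group), the standard deviation is undefined, which is why no interval can be computed for that row.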

By default, the crossing variable is treated as categorical. You can change this behaviour using the metric parameter to calculate correlations:

# Correlate two metric variables
tab_metrics(ds, sd_age, use_work, metric = TRUE, ci = TRUE)
Item 1 Item 2 n Pearson’s r ci low ci high
Age Usage: in professional context 101 -0.2 -0.38 0
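To see what such a row contains, here is a minimal base R sketch that computes Pearson's r with a 95% confidence interval using cor.test(). The two vectors are made-up toy values, not the package's example data, and the sketch is not meant to mirror the package internals.

```r
# Toy data, invented for illustration only
age <- c(18, 22, 27, 31, 38, 44, 52, 57, 63, 68)
use_work <- c(5, 4, 4, 3, 4, 2, 3, 2, 1, 2)

# cor.test() returns the estimate and its confidence interval
ct <- cor.test(age, use_work, method = "pearson", conf.level = 0.95)
r <- unname(ct$estimate)
ci <- ct$conf.int

result <- round(c(n = length(age), r = r, ci_low = ci[1], ci_high = ci[2]), 2)
print(result)
```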

Each table function has a corresponding plot function with parameters to customise the result. See the function help (F1 key) to learn the options. For example, the prop parameter scales the bars to 100%, and the numbers parameter prints frequencies and percentages onto the bars.

ds |> 
  filter(sd_gender != "diverse") |> 
  plot_counts(adopter, sd_gender, prop="rows", numbers=c("p","n"))
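To make the effect of prop = "rows" concrete, the following base R sketch computes row percentages from the adopter by gender counts (female and male respondents only) that appear in the report table of the custom tab sheets section. It only illustrates the arithmetic and is not the package's implementation. The shortened row names are abbreviations introduced here.

```r
# Counts taken from the report table in this document (female, male)
counts <- matrix(
  c( 2, 12,
    25, 38,
    13,  9,
     0,  1),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    adopter = c("immediately", "rather quickly", "wait", "no other choice"),
    gender = c("female", "male")
  )
)

# margin = 1 divides each cell by its row total, so every row adds up to 100%
row_pct <- round(100 * prop.table(counts, margin = 1))
print(row_pct)
```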

Further, the effect functions conduct statistical tests:

ds |> 
  filter(sd_gender != "diverse") |> 
  effect_counts(adopter, sd_gender)
Statistic Value
Cramer’s V 0.28
Number of cases 100
Degrees of freedom
Chi-squared 7.87
p value 0.030
stars *
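The chi-squared statistic and Cramér's V can be checked by hand. This base R sketch uses the adopter by gender counts (female and male respondents only) from the report table in this document and the textbook formula V = sqrt(chi² / (n * (min(rows, cols) - 1))). It is a sketch under those assumptions, not the package's code, and it does not try to reproduce the p value because the method the package uses for it is not stated here.

```r
# Counts taken from the report table in this document (female, male)
counts <- matrix(
  c( 2, 12,
    25, 38,
    13,  9,
     0,  1),
  ncol = 2, byrow = TRUE
)

# Small expected counts trigger a warning about the approximation
test <- suppressWarnings(chisq.test(counts))
chi2 <- unname(test$statistic)

# Cramér's V from the chi-squared statistic
n <- sum(counts)
v <- sqrt(chi2 / (n * (min(dim(counts)) - 1)))

print(round(c(chi_squared = chi2, cramers_v = v, n = n), 2))
# chi_squared 7.87, cramers_v 0.28, n 100
```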

Automatically generate reports

Getting started

Reports combine plots, tables and effect calculations. Optionally, for item batteries, an index, clusters or factors are calculated and reported.

To see an example or develop your own reports, use the volker report template in RStudio:

  • Create a new R Markdown document from the main menu
  • In the popup select the “From Template” option
  • Select the volker template.
  • The template contains a working example. Just click knit to see the result.

Have fun developing your own reports!

Custom reports

To generate a volker report from any R Markdown document, add volker::html_report to the output options of your Markdown document:

---
title: "How to create reports?"
output: 
  volker::html_report
---

Then, you can generate combined outputs using the report functions. One advantage of the report functions is that plots are automatically scaled to fit the page. See the function help for further options (F1 key).

ds %>% 
  filter(sd_gender != "diverse") %>% 
  report_metrics(starts_with("cg_adoption_"), sd_gender, box=TRUE, ci=TRUE)

Expectations

4 missing case(s) omitted.

Expectations total female male
ChatGPT has clear advantages compared to similar offerings. 3.4 (1.0) 3.6 (1.0) 3.3 (1.0)
Using ChatGPT brings financial benefits. 2.7 (1.2) 2.6 (1.2) 2.7 (1.2)
Using ChatGPT is advantageous in many tasks. 3.6 (1.1) 3.7 (1.0) 3.5 (1.1)
Compared to other systems, using ChatGPT is more fun. 3.5 (1.0) 3.6 (1.0) 3.5 (1.0)
Much can go wrong when using ChatGPT. 3.1 (1.1) 3.1 (1.0) 3.1 (1.2)
There are legal issues with using ChatGPT. 3.1 (1.2) 3.0 (1.0) 3.1 (1.3)
The security of user data is not guaranteed with ChatGPT. 3.2 (1.0) 3.0 (1.0) 3.3 (1.1)
Using ChatGPT could bring personal disadvantages. 2.7 (1.1) 2.5 (0.9) 2.8 (1.2)
In my environment, using ChatGPT is standard. 2.5 (1.1) 2.5 (0.9) 2.5 (1.3)
Almost everyone in my environment uses ChatGPT. 2.4 (1.2) 2.4 (1.0) 2.3 (1.3)
Not using ChatGPT is considered being an outsider. 2.0 (1.2) 1.8 (1.0) 2.1 (1.3)
Using ChatGPT brings me recognition from my environment. 2.3 (1.2) 2.4 (1.2) 2.3 (1.3)

4 missing case(s) omitted.

Custom tab sheets

By default, a header and tabsheets are automatically created. You can mix in custom content.

  • If you want to add content before the report outputs, set the title parameter to FALSE and add your own title.
  • A good place for methodological details is a custom tabsheet next to the “Plot” and the “Table” buttons. You can add a tab by setting the close-parameter to FALSE and adding a new header on the fifth level (5 x # followed by the tab name). Close your custom new tabsheet with #### {-} (4 x #).

Putting it all together, the following pattern generates the report output shown beneath it:

#> ### Adoption types
#> 
#> ```{r echo=FALSE}
#> ds %>% 
#>   filter(sd_gender != "diverse") %>% 
#>   report_counts(adopter, sd_gender, prop="rows", title=FALSE, close=FALSE, box=TRUE, ci=TRUE)
#> ```
#>
#> ##### Method
#> Basis: Only male and female respondents.
#> 
#> #### {-}

Adoption types

Innovator type total female male
I try new offers immediately 100% (14) 14% (2) 86% (12)
I try new offers rather quickly 100% (63) 40% (25) 60% (38)
I wait until offers establish themselves 100% (22) 59% (13) 41% (9)
I only use new offers when I have no other choice 100% (1) 0% (0) 100% (1)
total 100% (100) 40% (40) 60% (60)

Basis: Only male and female respondents.

Theming

Plot and table functions share a number of parameters that can be used to customise the outputs. Look up the available parameters in the help of the specific function.

The theme_vlkr() function lets you customise colors:

theme_set(theme_vlkr(
  base_fill = c("#F0983A","#3ABEF0","#95EF39","#E35FF5","#7A9B59"),
  base_gradient = c("#FAE2C4","#F0983A")
))

Custom labels

Labels used in plots and tables are stored in the comment attribute of the variable. You can inspect all labels using the codebook() function:

codebook(ds)
#> # A tibble: 94 × 6
#>    item_name     item_group item_class item_label         value_name value_label
#>    <chr>         <chr>      <chr>      <chr>              <chr>      <chr>      
#>  1 case          case       numeric    case               <NA>       <NA>       
#>  2 sd_age        sd         numeric    Age                <NA>       <NA>       
#>  3 cg_activities cg         character  Activities with C… <NA>       <NA>       
#>  4 adopter       adopter    factor     Innovator type     I try new… I try new …
#>  5 adopter       adopter    factor     Innovator type     I try new… I try new …
#>  6 adopter       adopter    factor     Innovator type     I wait un… I wait unt…
#>  7 adopter       adopter    factor     Innovator type     I only us… I only use…
#>  8 adopter       adopter    factor     Innovator type     [no answe… [no answer]
#>  9 sd_gender     sd         factor     Gender             female     female     
#> 10 sd_gender     sd         factor     Gender             male       male       
#> # ℹ 84 more rows
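Because the labels live in the standard comment attribute, they can also be read or set with base R. A minimal sketch with a small made-up data frame (the column names mimic the example dataset, the values are invented for illustration):

```r
# Toy data frame; values are made up
toy <- data.frame(sd_age = c(23, 35, 41), use_private = c(1, 3, 2))

# comment() is base R: it sets and reads the "comment" attribute
comment(toy$sd_age) <- "Age"
comment(toy$use_private) <- "Usage: in private context"

# Collect the label of every column into a named character vector
labels <- vapply(toy, function(col) comment(col), character(1))
print(labels)
```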

You can set specific column labels by providing a named list to the items parameter of labs_apply():

ds %>%
  labs_apply(
    items = list(
      "cg_adoption_advantage_01" = "Allgemeine Vorteile",
      "cg_adoption_advantage_02" = "Finanzielle Vorteile",
      "cg_adoption_advantage_03" = "Vorteile bei der Arbeit",
      "cg_adoption_advantage_04" = "Macht mehr Spaß"
    )
  ) %>% 
  tab_metrics(starts_with("cg_adoption_advantage_"))
Item min q1 median q3 max mean sd n
Allgemeine Vorteile 1 3 4 4 5 3.5 1.0 99
Finanzielle Vorteile 1 2 3 4 5 2.7 1.2 99
Vorteile bei der Arbeit 1 3 4 4 5 3.6 1.1 99
Macht mehr Spaß 1 3 4 4 5 3.5 1.0 99

2 missing case(s) omitted.

Labels for values inside a column can be adjusted by providing a named list to the values parameter of labs_apply(). In addition, use the cols parameter to select the columns where value labels should be changed:


ds %>%
  labs_apply(
    cols=starts_with("cg_adoption"),  
    values = list(
      "1" = "Stimme überhaupt nicht zu",
      "2" = "Stimme nicht zu",
      "3" = "Unentschieden",
      "4" = "Stimme zu",
      "5" =  "Stimme voll und ganz zu"
    ) 
  ) %>% 
  plot_metrics(starts_with("cg_adoption"))

To conveniently manage all labels of a dataset, save the result of codebook() to an Excel file, change the labels manually in a copy of the Excel file, and finally call labs_apply() with your revised codebook.


library(readxl)
library(writexl)

# Save codebook to a file
codes <- codebook(ds)
write_xlsx(codes,"codebook.xlsx")

# Load and apply a codebook from a file
codes <- read_xlsx("codebook_revised.xlsx")
ds <- labs_apply(ds, codes)

Be aware that some data operations such as mutate() from the tidyverse lose labels along the way. In this case, store the labels (in the codebook attribute of the data frame) before the operation and restore them afterwards:

ds %>%
  labs_store() %>%
  mutate(sd_age = 2024 - sd_age) %>% 
  labs_restore() %>% 
  tab_metrics(sd_age)
Age value
min 1956
q1 1972
median 1986
q3 1997
max 2006
mean 1984.3
sd 13.8
n 101
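The same store and restore idea can be shown on a single vector in base R. This is only an analogy under the assumption that labels sit in the comment attribute. It does not show how labs_store() and labs_restore() work internally, and the values are made up.

```r
# Toy vector with a label in the comment attribute
age <- c(23, 35, 41)
comment(age) <- "Age"

# Store the label before an operation that drops attributes
label <- comment(age)

# as.integer() strips attributes, so the label is gone afterwards
birth_year <- as.integer(2024 - age)
label_lost <- is.null(comment(birth_year))

# Restore the label
comment(birth_year) <- label
print(comment(birth_year))
```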

Index calculation for item batteries

You can calculate mean indexes from a bunch of items using add_index(). A new column is created with the average value of all selected columns for each case.

Reliability and number of items are calculated with psych::alpha() and stored as a column attribute named “psych.alpha”. The reliability values are printed by tab_metrics().
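To illustrate what a mean index and its reliability are, here is a base R sketch with three made-up items. It uses rowMeans() for the index and the standard formula for Cronbach's alpha based on item variances. The data are invented, and details such as the handling of missing values in add_index() are not covered.

```r
# Toy item battery, values invented for illustration
items <- data.frame(
  item_1 = c(1, 2, 3, 4, 5, 3),
  item_2 = c(2, 2, 3, 5, 4, 3),
  item_3 = c(1, 3, 3, 4, 5, 2)
)

# Mean index: average of all items for each case
idx <- rowMeans(items)

# Cronbach's alpha: k / (k - 1) * (1 - sum of item variances / variance of sum score)
k <- ncol(items)
item_vars <- vapply(items, var, numeric(1))
total_var <- var(rowSums(items))
alpha <- k / (k - 1) * (1 - sum(item_vars) / total_var)

print(round(idx, 2))
print(round(alpha, 2))
```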

Add a single index

ds %>%
  add_index(starts_with("cg_adoption_")) %>%
  tab_metrics(idx_cg_adoption)
Index: cg_adoption value
min 1
q1 2.4
median 2.8
q3 3.2
max 5
mean 2.9
sd 0.6
n 97
items 12
alpha 0.81

4 missing case(s) omitted.

Compare the index values by group

ds %>%
  add_index(starts_with("cg_adoption_")) %>%
  tab_metrics(idx_cg_adoption, adopter)
Innovator type min q1 median q3 max mean sd n items alpha
I try new offers immediately 1.5 3.2 3.3 4.1 5.0 3.5 0.9 15 12 0.81
I try new offers rather quickly 1.8 2.5 2.8 3.1 3.8 2.8 0.5 61 12 0.81
I wait until offers establish themselves 1.0 2.4 2.7 3.0 3.8 2.7 0.6 20 12 0.81
I only use new offers when I have no other choice 2.4 2.4 2.4 2.4 2.4 2.4 1 12 0.81
total 1.0 2.4 2.8 3.2 5.0 2.9 0.6 97 12 0.81

4 missing case(s) omitted.

Add multiple indices and summarize them

ds %>%
  add_index(starts_with("cg_adoption_")) %>%
  add_index(starts_with("cg_adoption_advantage")) %>%
  add_index(starts_with("cg_adoption_fearofuse")) %>%
  add_index(starts_with("cg_adoption_social")) %>%
  tab_metrics(starts_with("idx_cg_adoption"))
Item min q1 median q3 max mean sd n items alpha
Index: cg_adoption 1 2.4 2.8 3.2 5 2.9 0.6 97 12 0.81
Index: cg_adoption_advantage_0 1 3.0 3.5 3.8 5 3.3 0.9 97 4 0.8
Index: cg_adoption_fearofuse_0 1 2.5 3.0 3.5 5 3.0 0.8 97 4 0.7
Index: cg_adoption_social_0 1 1.5 2.0 3.0 5 2.3 1.0 97 4 0.84

4 missing case(s) omitted.

Factor and cluster analysis

The easiest way to conduct a factor analysis or a cluster analysis is to use the respective parameters in the report_metrics() function.

ds |> 
  report_metrics(starts_with("cg_adoption"), factors = TRUE, clusters = TRUE)

Expectations

4 missing case(s) omitted.

Expectations min q1 median q3 max mean sd n
ChatGPT has clear advantages compared to similar offerings. 1 3 4 4 5 3.4 1.0 97
Using ChatGPT brings financial benefits. 1 2 3 4 5 2.7 1.2 97
Using ChatGPT is advantageous in many tasks. 1 3 4 4 5 3.6 1.1 97
Compared to other systems, using ChatGPT is more fun. 1 3 4 4 5 3.5 1.0 97
Much can go wrong when using ChatGPT. 1 2 3 4 5 3.1 1.1 97
There are legal issues with using ChatGPT. 1 2 3 4 5 3.1 1.2 97
The security of user data is not guaranteed with ChatGPT. 1 3 3 4 5 3.2 1.0 97
Using ChatGPT could bring personal disadvantages. 1 2 3 3 5 2.7 1.1 97
In my environment, using ChatGPT is standard. 1 2 2 3 5 2.5 1.1 97
Almost everyone in my environment uses ChatGPT. 1 1 2 3 5 2.4 1.2 97
Not using ChatGPT is considered being an outsider. 1 1 2 3 5 2.0 1.2 97
Using ChatGPT brings me recognition from my environment. 1 1 2 3 5 2.3 1.2 97

4 missing case(s) omitted.

Expectations Component 1 Component 2 Component 3 communality
ChatGPT has clear advantages compared to similar offerings. 0.1 0.9 0.0 0.8
Using ChatGPT brings financial benefits. 0.5 0.5 0.3 0.6
Using ChatGPT is advantageous in many tasks. 0.2 0.8 0.0 0.7
Compared to other systems, using ChatGPT is more fun. 0.2 0.8 0.0 0.7
Much can go wrong when using ChatGPT. -0.1 -0.2 0.8 0.7
There are legal issues with using ChatGPT. 0.2 0.2 0.6 0.5
The security of user data is not guaranteed with ChatGPT. 0.1 0.1 0.7 0.6
Using ChatGPT could bring personal disadvantages. 0.2 -0.1 0.7 0.6
In my environment, using ChatGPT is standard. 0.9 0.2 0.0 0.8
Almost everyone in my environment uses ChatGPT. 0.8 0.2 0.1 0.7
Not using ChatGPT is considered being an outsider. 0.7 0.0 0.3 0.6
Using ChatGPT brings me recognition from my environment. 0.8 0.2 0.0 0.6

4 missing case(s) omitted.

Component Eigenvalue Proportion of variance Cumulative proportion of variance
Component 1 3.0 0.3 0.3
Component 2 2.5 0.2 0.5
Component 3 2.2 0.2 0.6
Test Statistic value
KMO Test Cases 97
KMO Test Variables 12
KMO Test Cases-to-Variables Ratio 8.08
KMO Test Overall MSA 0.74
Bartlett Test Chi-squared 463.54
Bartlett Test df 66
Bartlett Test p 0.000
Bartlett Test stars ***
Eigenvalues for scree plot
Component Eigenvalue
1 4.2
2 2.1
3 1.4
4 0.8
5 0.7
6 0.6
7 0.5
8 0.5
9 0.4
10 0.4
11 0.3
12 0.2

Automatically selected k=3 by comparing eigenvalues with random data.

4 missing case(s) omitted.

Expectations total Cluster 1 Cluster 2
ChatGPT has clear advantages compared to similar offerings. 3.4 (1.0) 3.1 (1.0) 3.8 (0.9)
Using ChatGPT brings financial benefits. 2.7 (1.2) 2.0 (0.9) 3.4 (1.0)
Using ChatGPT is advantageous in many tasks. 3.6 (1.1) 3.1 (1.2) 4.0 (0.7)
Compared to other systems, using ChatGPT is more fun. 3.5 (1.0) 3.2 (1.0) 3.9 (0.8)
Much can go wrong when using ChatGPT. 3.1 (1.1) 3.2 (1.1) 3.0 (1.1)
There are legal issues with using ChatGPT. 3.1 (1.2) 3.0 (1.2) 3.2 (1.1)
The security of user data is not guaranteed with ChatGPT. 3.2 (1.0) 3.0 (1.1) 3.4 (1.0)
Using ChatGPT could bring personal disadvantages. 2.7 (1.1) 2.6 (1.0) 2.9 (1.2)
In my environment, using ChatGPT is standard. 2.5 (1.1) 1.7 (0.6) 3.4 (0.9)
Almost everyone in my environment uses ChatGPT. 2.4 (1.2) 1.6 (0.6) 3.3 (0.9)
Not using ChatGPT is considered being an outsider. 2.0 (1.2) 1.4 (0.6) 2.6 (1.3)
Using ChatGPT brings me recognition from my environment. 2.3 (1.2) 1.7 (0.8) 3.0 (1.3)

4 missing case(s) omitted.

Cluster n p
Cluster 1 50 52%
Cluster 2 47 48%
total 97 100%
Statistic Value
Within-Cluster Sum of Squares 910.04
Between-Cluster Sum of Squares 241.96
Within-Cluster Sum of Squares for Scree Plot
Clusters k WSS
1 1152.0
2 910.0
3 803.0
4 724.4
5 666.7
6 632.6
7 605.7
8 579.7
9 524.8
10 513.8

Automatically selected k=2 by the elbow criterion.

Currently, cluster analysis is performed using k-means, and factor analysis is a principal component analysis. Setting the parameters to TRUE automatically generates scree plots and selects the number of factors or clusters. Alternatively, you can explicitly specify the numbers.

If you want to work with the results, use add_factors() and add_clusters() respectively. For factor analysis, new columns prefixed with “fct_” are created to store the factor loadings based on the specified number of factors. For clustering, an additional column prefixed with “cls_” is added that assigns each observation to a cluster number. In the next step, you can use the new columns as shown below.

To automatically determine the optimal number of factors or clusters based on diagnostics, set k = NULL.
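For readers who want to see the underlying ideas without the package, the following base R sketch runs a principal component analysis and k-means on the built-in iris measurements. It stands in for the survey items and is not the package's implementation: it uses prcomp() without rotation and a fixed k, and the scree values are only computed, not plotted.

```r
# Standardise the four numeric columns of the built-in iris data
x <- scale(iris[, 1:4])

# Principal component analysis; squared sdev gives the eigenvalues for a scree plot
pca <- prcomp(x)
eigenvalues <- pca$sdev^2

# Within-cluster sum of squares for k = 1..6, the basis of an elbow plot
set.seed(1)
wss <- vapply(
  1:6,
  function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss,
  numeric(1)
)

# Assign each observation to one of three clusters
km <- kmeans(x, centers = 3, nstart = 10)
clusters <- km$cluster

print(round(eigenvalues, 2))
print(round(wss, 1))
print(table(clusters))
```

With standardised variables, the eigenvalues add up to the number of variables, and the sum of squares for k = 1 equals (n - 1) times the number of variables. The same relation explains the first WSS value in the report output above: 96 cases minus one degree of freedom... that is, (97 - 1) * 12 = 1152.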

Add factor analysis results

ds |> 
  add_factors(starts_with("cg_adoption"), k = 3)  |>
  report_metrics(fct_cg_adoption_1, fct_cg_adoption_2, metric = TRUE)

Component 1

4 missing case(s) omitted.

Item 1 Item 2 n Pearson’s r
1 2 97 0

4 missing case(s) omitted.

Automatically determine the number of factors

ds |> 
  add_factors(starts_with("cg_adoption"), k = NULL) |>
  factor_tab(starts_with("fct_cg_adoption"))
Expectations Component 1 Component 2 Component 3 communality
ChatGPT has clear advantages compared to similar offerings. 0.1 0.9 0.0 0.8
Using ChatGPT brings financial benefits. 0.5 0.5 0.3 0.6
Using ChatGPT is advantageous in many tasks. 0.2 0.8 0.0 0.7
Compared to other systems, using ChatGPT is more fun. 0.2 0.8 0.0 0.7
Much can go wrong when using ChatGPT. -0.1 -0.2 0.8 0.7
There are legal issues with using ChatGPT. 0.2 0.2 0.6 0.5
The security of user data is not guaranteed with ChatGPT. 0.1 0.1 0.7 0.6
Using ChatGPT could bring personal disadvantages. 0.2 -0.1 0.7 0.6
In my environment, using ChatGPT is standard. 0.9 0.2 0.0 0.8
Almost everyone in my environment uses ChatGPT. 0.8 0.2 0.1 0.7
Not using ChatGPT is considered being an outsider. 0.7 0.0 0.3 0.6
Using ChatGPT brings me recognition from my environment. 0.8 0.2 0.0 0.6
Component Eigenvalue Proportion of variance Cumulative proportion of variance
Component 1 3.0 0.3 0.3
Component 2 2.5 0.2 0.5
Component 3 2.2 0.2 0.6
Test Statistic value
KMO Test Cases 97
KMO Test Variables 12
KMO Test Cases-to-Variables Ratio 8.08
KMO Test Overall MSA 0.74
Bartlett Test Chi-squared 463.54
Bartlett Test df 66
Bartlett Test p 0.000
Bartlett Test stars ***
Eigenvalues for scree plot
Component Eigenvalue
1 4.2
2 2.1
3 1.4
4 0.8
5 0.7
6 0.6
7 0.5
8 0.5
9 0.4
10 0.4
11 0.3
12 0.2

Compare values by cluster

ds |>
  add_clusters(starts_with("cg_adoption"), k = 3) |>
  report_counts(sd_gender, cls_cg_adoption, prop = "cols")

Gender

4 missing case(s) omitted.

Gender total Cluster 1 Cluster 2 Cluster 3
female 38% (37) 32% (9) 45% (18) 34% (10)
male 61% (59) 68% (19) 55% (22) 62% (18)
diverse 1% (1) 0% (0) 0% (0) 3% (1)
total 100% (97) 100% (28) 100% (40) 100% (29)

4 missing case(s) omitted.

What’s behind the scenes?

The volker package is based on standard methods for data handling and visualisation. You can produce all outputs with a handful of functions. The package just keeps your code DRY (don’t repeat yourself) and wraps frequently used snippets into a simple interface.

The package provides print and knit functions that polish console and markdown output. To make this work, the cleaned data, produced plots, tables and markdown snippets gain new classes (vlkr_df, vlkr_plt, vlkr_tbl, vlkr_list, vlkr_rprt).

Basically, all table values are calculated by two tidyverse functions:

To shape the data frames, two essential functions come into play:

Plots are generated by ggplot().

Statistical tests, clustering and factor analysis are largely based on the stats, psych, car and effectsize packages.

Thanks to all the maintainers, authors and contributors of the packages that make the world of data a magical place.