This vignette shows what a codebook generated from a dataset with rich metadata looks like. The dataset contains mock data for a short German Big Five personality inventory and an age variable. It follows the format created when importing data from formr.org, but data imported with the haven package carry similar metadata. You can also add such metadata yourself, or use the codebook package on unannotated datasets.
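If your data arrive without annotations, one way to add this kind of metadata yourself is the labelled package, which this vignette also loads further down. The following is a minimal sketch on a made-up data frame; the names `toy` and `extra_1` and the label texts are hypothetical and not part of the bfi data.

```r
library(labelled)

# Hypothetical toy data: one Likert-type item answered by five people
toy <- data.frame(extra_1 = c(1, 3, 5, 2, 4))

# Value labels: what the numeric codes mean
val_labels(toy$extra_1) <- c("Disagree strongly" = 1, "Agree strongly" = 5)

# Variable label: a human-readable description of the item
var_label(toy$extra_1) <- "I see myself as someone who is outgoing"

# Inspect what was stored on the column
var_label(toy$extra_1)
val_labels(toy$extra_1)
```

Labels attached this way are stored as attributes on the column, so they travel with the data frame.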

As you can see below, the codebook package automatically computes reliabilities for multi-item inventories, generates nicely labelled plots, and outputs summary statistics. The same information is also stored in a table, which you can export to various formats. Additionally, codebook can show you different kinds of (labelled) missing values and common missingness patterns. The codebook package also generates JSON-LD metadata for the dataset, which is invisible to readers but readable by search engines. If you share your codebook as an HTML file online, this metadata should make it easier for others to find your data. See what Google sees here.
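To make "different kinds of missing values" concrete, here is a small sketch that assumes the data use haven's tagged `NA` convention, which is one common way to encode such values. The vector and the label texts are hypothetical.

```r
library(haven)

# Hypothetical responses: two valid answers and two different kinds of missing
raw <- c(2, 4, tagged_na("a"), tagged_na("b"))

# Both tagged values behave like ordinary missing values ...
is.na(raw)

# ... but the tags still tell the two kinds apart
na_tag(raw)

# Attach labels so that each tag carries a meaning
x <- labelled(raw, labels = c("Item not shown" = tagged_na("a"),
                              "Refused to answer" = tagged_na("b")))
x
```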

knit_by_pkgdown <- !is.null(knitr::opts_chunk$get("fig.retina"))
knitr::opts_chunk$set(warning = FALSE, message = TRUE, error = FALSE)
ggplot2::theme_set(ggplot2::theme_bw())

library(codebook)
data("bfi", package = 'codebook')
if (!knit_by_pkgdown) {
  library(dplyr)
  bfi <- bfi %>% select(-starts_with("BFIK_extra"),
                        -starts_with("BFIK_open"),
                        -starts_with("BFIK_consc"))
}
set.seed(1)
bfi$age <- rpois(nrow(bfi), 30)
library(labelled)
var_label(bfi$age) <- "Alter"

By default, we only set the required metadata attributes name and description to sensible values. However, there are a number of other attributes you can set to describe the data better. Find out more.

metadata(bfi)$name <- "MOCK Big Five Inventory dataset (German metadata demo)"
metadata(bfi)$description <- "a small mock Big Five Inventory dataset"
metadata(bfi)$identifier <- "doi:10.5281/zenodo.1326520"
metadata(bfi)$datePublished <- "2016-06-01"
metadata(bfi)$creator <- list(
  "@type" = "Person",
  givenName = "Ruben", familyName = "Arslan",
  email = "ruben.arslan@gmail.com",
  affiliation = list("@type" = "Organization",
                     name = "MPI Human Development, Berlin"))
metadata(bfi)$citation <- "Arslan (2016). Mock BFI data."
metadata(bfi)$url <- "https://rubenarslan.github.io/codebook/articles/codebook.html"
metadata(bfi)$temporalCoverage <- "2016" 
metadata(bfi)$spatialCoverage <- "Goettingen, Germany" 
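To give a rough idea of what such attributes look like once serialised as JSON-LD, the sketch below builds a schema.org `Dataset` description by hand with jsonlite. This is an illustrative approximation only, not the codebook package's own output; the object names `dataset_ld` and `json_ld` are hypothetical.

```r
library(jsonlite)

# Hand-built approximation of schema.org Dataset markup (an assumption for
# illustration); the JSON-LD that codebook generates contains more fields.
dataset_ld <- list(
  "@context" = "https://schema.org/",
  "@type" = "Dataset",
  name = "MOCK Big Five Inventory dataset (German metadata demo)",
  description = "a small mock Big Five Inventory dataset",
  datePublished = "2016-06-01",
  creator = list(
    "@type" = "Person",
    givenName = "Ruben",
    familyName = "Arslan"
  )
)

# auto_unbox = TRUE writes length-one vectors as scalars rather than arrays
json_ld <- toJSON(dataset_ld, auto_unbox = TRUE, pretty = TRUE)
cat(json_ld)
```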
# We don't want to look at the code in the codebook.
knitr::opts_chunk$set(warning = TRUE, message = TRUE, echo = FALSE)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reliability CIs
## could not be computed for BFIK_neuro
## Warning in doTryCatch(return(expr), name, parentenv, handler): missing value
## where TRUE/FALSE needed
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Loading required namespace: GPArotation

Metadata

Description

Dataset name: MOCK Big Five Inventory dataset (German metadata demo)

a small mock Big Five Inventory dataset

Metadata for search engines

| name | value |
| @type | Person |
| givenName | Ruben |
| familyName | Arslan |
| email | ruben.arslan@gmail.com |
| affiliation | list(@type = "Organization", name = "MPI Human Development, Berlin") |
x
session
created
modified
ended
expired
BFIK_agree_4R
BFIK_agree_1R
BFIK_neuro_2R
BFIK_agree_3R
BFIK_neuro_3
BFIK_neuro_4
BFIK_agree_2
BFIK_agree
BFIK_neuro
age

Survey overview

28 completed rows, 28 who entered any information, 0 only viewed the first page. There are 0 expired rows (people who did not finish filling out in the requested time frame). In total, there are 28 rows including unfinished and expired rows.

There were 28 unique participants, of which 28 finished filling out at least one survey.

This survey was not repeated.

The first session started on 2016-07-08 09:54:16, the last session on 2016-11-02 21:19:50.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Starting date times

People took on average 127.36 minutes (median 1.48) to answer the survey.

## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Duration people took for answering the survey

Variables

Scale: BFIK_agree

Overview

Reliability: ω_ordinal [95% CI] = 0.61 [0.37; 0.84].

Missing: 0.

Likert plot of scale BFIK_agree items