joinspy works with base R data frames, tibbles, and data.tables. The
join wrappers (left_join_spy(), join_strict(),
etc.) detect the input class and dispatch to the right engine
automatically. The diagnostic layer (join_spy(),
key_check(), join_explain(), and friends) is
backend-agnostic: it runs the same analysis regardless of what class the
inputs are.
We walk through detection, explicit overrides, and class preservation below.
When we call left_join_spy() or
join_strict() without specifying a backend, joinspy
inspects the class of x and y and picks the
backend according to a fixed priority: data.table > tibble > base
R.
data.table takes priority because its merge implementation depends on key handling, indexing, and reference semantics that a dplyr join would discard. dplyr, on the other hand, handles a coerced data.table without issues. Both inputs are checked – if one side is a tibble and the other a plain data frame, dplyr is selected. If a mixed-class call selects a backend whose package is not installed, joinspy falls back to base R with a warning.
Here is the detection in action with each input type:
# Base R data frames: auto-detects "base"
orders_df <- data.frame(
id = c(1, 2, 3),
amount = c(100, 250, 75),
stringsAsFactors = FALSE
)
customers_df <- data.frame(
id = c(1, 2, 4),
name = c("Alice", "Bob", "Diana"),
stringsAsFactors = FALSE
)
result_base <- left_join_spy(orders_df, customers_df, by = "id", .quiet = TRUE)
class(result_base)
#> [1] "data.frame"
# Tibbles: auto-detects "dplyr"
orders_tbl <- dplyr::tibble(
id = c(1, 2, 3),
amount = c(100, 250, 75)
)
customers_tbl <- dplyr::tibble(
id = c(1, 2, 4),
name = c("Alice", "Bob", "Diana")
)
result_dplyr <- left_join_spy(orders_tbl, customers_tbl, by = "id", .quiet = TRUE)
class(result_dplyr)
#> [1] "tbl_df" "tbl" "data.frame"
# data.tables: auto-detects "data.table"
orders_dt <- data.table::data.table(
id = c(1, 2, 3),
amount = c(100, 250, 75)
)
customers_dt <- data.table::data.table(
id = c(1, 2, 4),
name = c("Alice", "Bob", "Diana")
)
result_dt <- left_join_spy(orders_dt, customers_dt, by = "id", .quiet = TRUE)
class(result_dt)
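#> [1] "data.table" "data.frame"
When the two inputs have different classes, the higher-priority class wins: a data.table joined to a tibble uses the data.table backend, and a tibble joined to a plain data frame uses dplyr.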
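The priority rule can be sketched in a few lines of base R. `pick_backend()` and its `is_installed` argument are hypothetical names chosen for illustration. This is not joinspy's internal code, only one way the documented behaviour (data.table > tibble > base R, with a warned fallback to base R) could be expressed.

```r
# Hypothetical sketch of the documented priority rule; not joinspy's source.
pick_backend <- function(x, y,
                         is_installed = function(pkg) requireNamespace(pkg, quietly = TRUE)) {
  # Either input can raise the priority, so inspect both.
  has_class <- function(cls) inherits(x, cls) || inherits(y, cls)

  if (has_class("data.table")) {
    choice <- "data.table"
    pkg <- "data.table"
  } else if (has_class("tbl_df")) {
    choice <- "dplyr"
    pkg <- "dplyr"
  } else {
    return("base")
  }

  # Auto-detection falls back to base R (with a warning) if the package is missing.
  if (!is_installed(pkg)) {
    warning("Package '", pkg, "' is not installed; falling back to base R.")
    return("base")
  }
  choice
}

# Class vectors are set by hand so the sketch runs without dplyr or data.table.
plain <- data.frame(id = 1:3)
tbl   <- structure(data.frame(id = 1:3), class = c("tbl_df", "tbl", "data.frame"))
dt    <- structure(data.frame(id = 1:3), class = c("data.table", "data.frame"))

always <- function(pkg) TRUE         # pretend every package is installed
pick_backend(plain, plain, always)   # "base"
pick_backend(plain, tbl, always)     # "dplyr"
pick_backend(tbl, dt, always)        # "data.table"
```

Passing the installation check in as an argument is a choice made for this sketch only: it lets the fallback branch be exercised without uninstalling anything.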
All join wrappers and join_strict() accept a
backend argument that overrides auto-detection. The three
valid values are "base", "dplyr", and
"data.table".
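We can force dplyr on plain data frames. The join then runs through dplyr, but the result keeps the class of the inputs, so it is still a plain data frame rather than a tibble: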
result <- left_join_spy(orders_df, customers_df, by = "id",
backend = "dplyr", .quiet = TRUE)
class(result)
#> [1] "data.frame"
Or force base R to sidestep dplyr’s many-to-many warning when we already know the expansion is intentional:
# These have a legitimate many-to-many relationship
tags <- dplyr::tibble(
item_id = c(1, 1, 2),
tag = c("red", "large", "small")
)
prices <- dplyr::tibble(
item_id = c(1, 2, 2),
currency = c("USD", "USD", "EUR")
)
# Force base R to avoid dplyr's many-to-many warning
result <- left_join_spy(tags, prices, by = "item_id",
backend = "base", .quiet = TRUE)
nrow(result)
#> [1] 4
Or force data.table on plain data frames for speed on large inputs:
result <- left_join_spy(orders_df, customers_df, by = "id",
backend = "data.table", .quiet = TRUE)
class(result)
#> [1] "data.table" "data.frame"
The package behind an explicit backend must be installed. Requesting
backend = "dplyr" without dplyr will error, not silently
fall back – auto-detection is a convenience, but an explicit override is
a contract.
Setting backend = "base" is also a way to guarantee
reproducibility across environments where dplyr may or may not be
installed.
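The difference between auto-detection and an explicit override can be illustrated with a small validation helper. `resolve_backend()` and `is_installed` are hypothetical names, and the error message is invented for the sketch. The only behaviour taken from the text above is that an explicit request for a missing package is an error rather than a silent fallback.

```r
# Hypothetical sketch; resolve_backend() is not part of joinspy.
resolve_backend <- function(backend = c("base", "dplyr", "data.table"),
                            is_installed = function(pkg) requireNamespace(pkg, quietly = TRUE)) {
  backend <- match.arg(backend)  # rejects anything but the three valid values

  # Base R needs no extra package; the other two backends share their package's name.
  if (backend != "base" && !is_installed(backend)) {
    stop("backend = \"", backend, "\" was requested but package '",
         backend, "' is not installed.", call. = FALSE)
  }
  backend
}

resolve_backend("base")

# Simulate a machine without dplyr: the explicit request errors instead of falling back.
outcome <- tryCatch(
  resolve_backend("dplyr", is_installed = function(pkg) FALSE),
  error = function(e) conditionMessage(e)
)
outcome
```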
joinspy preserves input class through the full diagnostic-repair-join cycle:
The diagnostic functions (join_spy(), key_check(), etc.) accept any data frame subclass and return report objects without modifying the input.
The repair function (join_repair()) operates on key columns in place and returns the same class it received.
Here is a full cycle with base R data frames:
messy_df <- data.frame(
code = c("A-1 ", "B-2", " C-3"),
value = c(10, 20, 30),
stringsAsFactors = FALSE
)
lookup_df <- data.frame(
code = c("A-1", "B-2", "C-3"),
label = c("Alpha", "Beta", "Gamma"),
stringsAsFactors = FALSE
)
# 1. Diagnose
report <- join_spy(messy_df, lookup_df, by = "code")
# 2. Repair
repaired_df <- join_repair(messy_df, by = "code")
#> ✔ Repaired 2 value(s)
class(repaired_df) # still data.frame
#> [1] "data.frame"
# 3. Join
joined_df <- left_join_spy(repaired_df, lookup_df, by = "code", .quiet = TRUE)
class(joined_df) # still data.frame
#> [1] "data.frame"
joined_df
#>   code value label
#> 1  A-1    10 Alpha
#> 2  B-2    20  Beta
#> 3  C-3    30 Gamma
The same cycle with tibbles:
messy_tbl <- dplyr::tibble(
code = c("A-1 ", "B-2", " C-3"),
value = c(10, 20, 30)
)
lookup_tbl <- dplyr::tibble(
code = c("A-1", "B-2", "C-3"),
label = c("Alpha", "Beta", "Gamma")
)
repaired_tbl <- join_repair(messy_tbl, by = "code")
#> ✔ Repaired 2 value(s)
class(repaired_tbl) # still tbl_df
#> [1] "tbl_df" "tbl" "data.frame"
joined_tbl <- left_join_spy(repaired_tbl, lookup_tbl, by = "code", .quiet = TRUE)
class(joined_tbl) # still tbl_df
#> [1] "tbl_df" "tbl" "data.frame"
joined_tbl
#> # A tibble: 3 × 3
#>   code  value label
#>   <chr> <dbl> <chr>
#> 1 A-1      10 Alpha
#> 2 B-2      20 Beta
#> 3 C-3      30 Gamma
And with data.tables:
messy_dt <- data.table::data.table(
code = c("A-1 ", "B-2", " C-3"),
value = c(10, 20, 30)
)
lookup_dt <- data.table::data.table(
code = c("A-1", "B-2", "C-3"),
label = c("Alpha", "Beta", "Gamma")
)
repaired_dt <- join_repair(messy_dt, by = "code")
#> ✔ Repaired 2 value(s)
class(repaired_dt) # still data.table
#> [1] "data.table" "data.frame"
joined_dt <- left_join_spy(repaired_dt, lookup_dt, by = "code", .quiet = TRUE)
class(joined_dt) # still data.table
#> [1] "data.table" "data.frame"
joined_dt
#> Key: <code>
#>      code value  label
#>    <char> <num> <char>
#> 1:    A-1    10  Alpha
#> 2:    B-2    20   Beta
#> 3:    C-3    30  Gamma
When join_repair() receives both x and
y, it returns a list with $x and
$y, each preserving the class of the corresponding
input.
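To make the two-table return shape concrete, here is a base R sketch of a repair step that returns a list with `$x` and `$y`. `trim_keys()` is a hypothetical stand-in, not `join_repair()` itself. It handles only leading and trailing whitespace, the kind of problem in the examples above, and the real function may do more.

```r
# Hypothetical stand-in for a two-table repair; not joinspy's join_repair().
trim_keys <- function(x, y, by) {
  # Assigning with [[<- replaces column values but leaves class(x) untouched.
  for (col in by) {
    x[[col]] <- trimws(x[[col]])
    y[[col]] <- trimws(y[[col]])
  }
  list(x = x, y = y)
}

messy <- data.frame(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)

repaired <- trim_keys(messy, lookup, by = "code")
names(repaired)     # "x" "y"
repaired$x$code     # "A-1" "B-2" "C-3"
class(repaired$x)   # "data.frame"
```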
join_strict() also preserves class – the cardinality
check runs before the join, so a satisfied constraint returns the native
class and a violated one errors before any output is produced.
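The check-then-join ordering can be sketched with base R alone. `strict_left_merge()` and its `expect` values are hypothetical and written for plain data frames. They are not the interface of `join_strict()`, whose arguments are documented in its help page.

```r
# Hypothetical sketch of a check-then-join wrapper; not joinspy's join_strict().
strict_left_merge <- function(x, y, by, expect = c("1:1", "m:1")) {
  expect <- match.arg(expect)
  dup_x <- anyDuplicated(x[by]) > 0
  dup_y <- anyDuplicated(y[by]) > 0

  # Both constraints supported here require unique keys on the right side.
  if (dup_y) {
    stop("Right table has duplicate keys; expected ", expect, ".", call. = FALSE)
  }
  if (expect == "1:1" && dup_x) {
    stop("Left table has duplicate keys; expected 1:1.", call. = FALSE)
  }

  # Only reached when the constraint holds, so no partial output is ever produced.
  merge(x, y, by = by, all.x = TRUE)
}

orders <- data.frame(id = c(1, 2, 3), amount = c(100, 250, 75))
customers <- data.frame(id = c(1, 2, 4), name = c("Alice", "Bob", "Diana"),
                        stringsAsFactors = FALSE)

ok <- strict_left_merge(orders, customers, by = "id", expect = "1:1")
nrow(ok)   # 3

# A duplicated key on the right violates the constraint and stops before merge().
dup_customers <- rbind(customers, data.frame(id = 2, name = "Bobby",
                                             stringsAsFactors = FALSE))
failed <- tryCatch(
  strict_left_merge(orders, dup_customers, by = "id"),
  error = function(e) "stopped before joining"
)
failed     # "stopped before joining"
```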
The one exception is an explicit backend override that does not match
the input class. Passing backend = "data.table" on a tibble
returns a data.table, because that is what the data.table engine
produces.
The diagnostic functions (join_spy(),
key_check(), key_duplicates(),
join_explain(), detect_cardinality(),
check_cartesian()) operate purely on column values and
never call a join engine. They produce identical results regardless of
input class.
This means we can diagnose on data.tables and join with dplyr, or diagnose in a base-R script and pass the data to a Shiny app that uses dplyr internally.
# Diagnose on data.tables
orders_dt <- data.table::data.table(
id = c(1, 2, 3),
amount = c(100, 250, 75)
)
customers_dt <- data.table::data.table(
id = c(1, 2, 4),
name = c("Alice", "Bob", "Diana")
)
report <- join_spy(orders_dt, customers_dt, by = "id")
# Join with dplyr (convert first)
orders_tbl <- dplyr::as_tibble(orders_dt)
customers_tbl <- dplyr::as_tibble(customers_dt)
result <- left_join_spy(orders_tbl, customers_tbl, by = "id", .quiet = TRUE)
class(result)
#> [1] "tbl_df" "tbl" "data.frame"
The report object is structurally identical across backends –
$issues, $expected_rows, and
$match_analysis contain the same values. This also means we
can write unit tests for key quality using plain data frames even when
production code uses data.table.
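As a rough illustration of why key-quality checks need no join engine, the sketch below computes a few key statistics from column values only. `key_summary()` and its field names are hypothetical and do not mirror the structure of a joinspy report. The point is only that checks of this kind can be asserted in a unit test on plain data frames.

```r
# Hypothetical key-quality summary computed from column values only.
key_summary <- function(x, y, by) {
  kx <- x[[by]]
  ky <- y[[by]]
  # Number of right-table rows matching each left-table key.
  matches <- vapply(kx, function(k) sum(ky == k, na.rm = TRUE), integer(1))
  list(
    unmatched_left = kx[matches == 0],
    duplicated_right = unique(ky[duplicated(ky)]),
    # A left join keeps each left row once, or once per match if there are several.
    expected_left_rows = sum(pmax(matches, 1L))
  )
}

orders <- data.frame(id = c(1, 2, 3), amount = c(100, 250, 75))
customers <- data.frame(id = c(1, 2, 4), name = c("Alice", "Bob", "Diana"),
                        stringsAsFactors = FALSE)

s <- key_summary(orders, customers, by = "id")
s$unmatched_left       # 3
s$expected_left_rows   # 3

# Unit-test style assertions on key quality, no join performed.
stopifnot(length(s$duplicated_right) == 0)
stopifnot(s$expected_left_rows == nrow(orders))
```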
The three backends differ in a few ways worth noting:
Column name conflicts: base R and dplyr add .x/.y suffixes; data.table prefixes right-table columns with i. (for example, i.value).
If we switch backends mid-project, it is worth checking that column references and row-order assumptions still hold.
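A quick way to see what such a check might look like is to inspect column names and row order right after the join. The sketch uses base R's `merge()` only, so it runs anywhere; the dplyr and data.table behaviour is not exercised here. The table and column names are made up for the example.

```r
# Two tables that share a non-key column name ("value").
left <- data.frame(id = c(2, 1), value = c("b", "a"), stringsAsFactors = FALSE)
right <- data.frame(id = c(1, 2), value = c("A", "B"), stringsAsFactors = FALSE)

merged <- merge(left, right, by = "id", all.x = TRUE)

names(merged)   # "id" "value.x" "value.y"
merged$id       # 1 2: merge() sorts on the key columns by default

# Guard downstream code against a backend change that renames columns.
stopifnot(all(c("value.x", "value.y") %in% names(merged)))
```

Assertions like the last line are cheap to keep next to any code that refers to suffixed columns by name.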
vignette("quickstart") for a quick introduction to joinspy
vignette("common-issues") for a catalogue of join problems and solutions
?left_join_spy, ?join_strict for backend parameter documentation