A join runs without error but the row count is off – rows vanished,
or they multiplied. The columns look fine. The key values look identical
in the console. str() reveals nothing useful.
The problem is almost never the join logic. It is the strings. R’s
merge() and dplyr’s *_join() compare key
values byte-for-byte. When keys fail to match, they are genuinely
different – the difference is just invisible: a trailing space, a case
mismatch, a zero-width Unicode character with no visual
representation.
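The byte-for-byte rule is easy to demonstrate with base R alone (a standalone sketch; the example strings mirror the ones used later in this vignette):

```r
"CUST-1002" == "CUST-1002 "              # trailing space
#> [1] FALSE
nchar("India") == nchar("India\u200B")   # zero-width space adds a code point
#> [1] FALSE
```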
This vignette walks through five scenarios where joins fail for string-level reasons that resist casual inspection. The data is synthetic; the patterns come from real pipelines.
A retail analytics team receives monthly sales data from a
distribution partner as a CSV exported from Excel. They join it against
their internal customer database on customer_id. For two
quarters, everything works. Then one month, 30% of the sales records
stop matching. Nobody changed the code or the customer database. The
partner’s IDs are all present in the internal system – or so it
appears.
partner_sales <- data.frame(
customer_id = c("CUST-1001", "CUST-1002 ", "CUST-1003",
" CUST-1004", "CUST-1005 ", "CUST-1006"),
amount = c(2500, 1800, 3200, 950, 4100, 1600),
stringsAsFactors = FALSE
)
internal_db <- data.frame(
customer_id = c("CUST-1001", "CUST-1002", "CUST-1003",
"CUST-1004", "CUST-1005", "CUST-1006", "CUST-1007"),
region = c("West", "East", "West", "South", "East", "North", "West"),
stringsAsFactors = FALSE
)

Nothing in str() or print() reveals the
issue – trailing spaces are invisible in console output.
join_spy() catches it:
Three of the six partner IDs carry whitespace.
"CUST-1002 " is a different string from
"CUST-1002" as far as R is concerned.
join_repair() trims both tables at once:
repaired <- join_repair(partner_sales, internal_db, by = "customer_id")
#> ✔ Repaired 3 value(s)
partner_fixed <- repaired$x
internal_fixed <- repaired$y

We can verify the repair worked:
key_check(partner_fixed, internal_fixed, by = "customer_id")
#> ✔ Key check passed: no issues detected

And now the join gives us what we expected:
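For completeness, the repaired merge in base R (the row count follows from the tables above, since all six trimmed partner IDs exist in the internal table):

```r
nrow(merge(partner_fixed, internal_fixed, by = "customer_id"))
#> [1] 6
```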
The root cause was an Excel CONCATENATE formula that
preserved trailing spaces from a variable-width source column. Excel
renders "CUST-1002" and "CUST-1002 "
identically, so nobody noticed. Trailing whitespace is the single most
common join failure we see in practice.
A SaaS company wants to join CRM profiles to clickstream events for a churn analysis. The CRM stores email addresses in uppercase (a database migration decision from the late 1990s). The web app stores them in lowercase. Both systems are internally consistent.
crm_profiles <- data.frame(
email = c("ALICE@ACME.COM", "BOB@ACME.COM", "CAROL@ACME.COM",
"DAVE@ACME.COM", "EVE@ACME.COM"),
plan = c("enterprise", "starter", "pro", "enterprise", "starter"),
stringsAsFactors = FALSE
)
click_events <- data.frame(
email = c("alice@acme.com", "bob@acme.com", "carol@acme.com",
"dave@acme.com", "frank@acme.com"),
page_views = c(47, 12, 89, 33, 5),
stringsAsFactors = FALSE
)

An inner join returns zero rows. R’s string comparison is
case-sensitive, so every key pair fails. join_spy() flags
the situation before the join:
suggest_repairs() generates the fix:
suggest_repairs(report)
#>
#> ── Suggested Repairs ───────────────────────────────────────────────────────────
#> # Standardize case:
#> x$email <- tolower(x$email)
#> y$email <- tolower(y$email)

Or we can use join_repair() directly, specifying case
standardization:
repaired <- join_repair(
crm_profiles, click_events,
by = "email",
standardize_case = "lower"
)
#> ✔ Repaired 5 value(s)

After lowercasing both sides, the inner join returns four matched rows
(everyone except Eve, who has no click data, and Frank, who is not in
the CRM):
result <- merge(repaired$x, repaired$y, by = "email")
nrow(result)
#> [1] 4
result
#>            email       plan page_views
#> 1 alice@acme.com enterprise         47
#> 2   bob@acme.com    starter         12
#> 3 carol@acme.com        pro         89
#> 4  dave@acme.com enterprise         33

The domain part of an email address is case-insensitive, and in
practice mail systems treat the local part case-insensitively as well,
so lowercasing is a safe normalization here. For other identifier types
(product codes, country abbreviations), "upper" may be more
appropriate.
A public health researcher compiles data from multiple sources for a systematic review. A few studies published supplementary tables only as PDF, so she copies the table from the PDF viewer, pastes into a spreadsheet, cleans up the columns, and reads the CSV into R. The data looks perfect – every country name is spelled correctly. But half the countries fail to match a reference population table.
# Simulating PDF copy-paste artifacts:
# \u00A0 is non-breaking space, \u200B is zero-width space
pdf_data <- data.frame(
country = c("Brazil", "India\u200B", "Germany",
"Japan\u00A0", "Canada", "France\u200B"),
prevalence = c(12.3, 8.7, 5.1, 3.9, 6.2, 4.8),
stringsAsFactors = FALSE
)
reference <- data.frame(
country = c("Brazil", "India", "Germany", "Japan",
"Canada", "France", "Italy"),
population_m = c(214, 1408, 84, 125, 38, 68, 59),
stringsAsFactors = FALSE
)

Printing the PDF data shows nothing wrong:
The zero-width space after “India” and “France” occupies zero pixels.
The non-breaking space after “Japan” renders like a regular space but is
U+00A0, not U+0020 – trimws() will not always remove it.
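Base R can strip these characters once you know they are there. A hedged sketch (the character class in the last line covers the invisibles in this example, not every Unicode oddity):

```r
x <- c("India\u200B", "Japan\u00A0")
trimws(x)                            # default whitespace class misses both
trimws(x, whitespace = "[\\h\\v]")   # \h catches U+00A0, but not U+200B
gsub("[\u00A0\u200B\uFEFF]", "", x)  # strip NBSP, zero-width space, BOM explicitly
#> [1] "India" "Japan"
```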
The merge shows the damage:
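A base-R count of surviving rows, using the tables defined above:

```r
nrow(merge(pdf_data, reference, by = "country"))
#> [1] 3
```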
Three of six countries match. join_spy() detects the
invisible characters:
repaired <- join_repair(pdf_data, reference, by = "country")
#> ✔ Repaired 3 value(s)
nrow(merge(repaired$x, repaired$y, by = "country"))
#> [1] 6

Six matches. PDF copy-paste is the most common source of these
artifacts, but web scraping, OCR output, and legacy mainframe exports
can produce them too. One useful debugging trick outside of joinspy:
nchar("India\u200B") returns 6, not 5. But that requires
already suspecting the problem.
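A complementary base-R trick is to inspect the code points directly, which makes the invisible character unambiguous:

```r
utf8ToInt("India\u200B")   # last code point is 8203, i.e. U+200B
```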
An e-commerce pipeline joins transaction records to a product catalogue. The pipeline ran cleanly for months, then match rates started drifting: 99% in January, 97% in February, 94% in March. Nobody noticed until finance flagged a margin discrepancy in April.
The code had not changed. A new data entry clerk had joined the
warehouse team in December. The canonical product code format was
"WDG-100" – uppercase prefix, dash, three-digit suffix. The
new clerk sometimes omitted the dash, sometimes typed lowercase. The
warehouse system did fuzzy matching internally, so it accepted the
codes. The ETL join did not.
# Product catalogue (canonical format)
catalogue <- data.frame(
product_code = c("WDG-100", "WDG-101", "WDG-102",
"WDG-103", "WDG-104", "WDG-105"),
product_name = c("Widget A", "Widget B", "Widget C",
"Widget D", "Widget E", "Widget F"),
margin = c(0.35, 0.28, 0.42, 0.31, 0.39, 0.25),
stringsAsFactors = FALSE
)
# Recent transactions (mix of old and new clerk entries)
transactions <- data.frame(
product_code = c("WDG-100", "WDG-101", "WDG102",
"wdg-103", "WDG-104", "wdg105",
"WDG-100", "WDG103"),
quantity = c(5, 3, 7, 2, 4, 6, 1, 8),
stringsAsFactors = FALSE
)

The damage is partial – some codes match, some do not – which makes it
harder to spot than a complete failure.
Here is where this scenario differs from the previous ones.
join_repair() can fix the case issue, but it cannot insert
the missing dashes – that requires domain knowledge about the code
format.
We can do a dry run to see what join_repair() would
fix:
join_repair(transactions, catalogue,
by = "product_code",
standardize_case = "upper",
dry_run = TRUE)
#>
#> ── Repair Preview (Dry Run) ────────────────────────────────────────────────────
#>
#> ── Left table (x) ──
#>
#> ℹ product_code: upper case (2)

After applying the mechanical fixes:
repaired <- join_repair(transactions, catalogue,
by = "product_code",
standardize_case = "upper")
#> ✔ Repaired 2 value(s)

The case issues are resolved, but the missing dashes remain. A manual
transformation handles those:
# Manual fix: insert dash if missing in product codes matching the pattern
fix_codes <- function(codes) {
gsub("^([A-Z]{3})(\\d)", "\\1-\\2", codes)
}
repaired$x$product_code <- fix_codes(repaired$x$product_code)

All eight transactions match. join_repair() handles
context-free transformations (trimming, case normalization, stripping
invisible characters). Inserting a dash into "WDG102"
requires knowing the canonical format – that fix has to come from
someone who understands the data.
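One way to keep this class of failure from recurring is to validate incoming codes against the canonical format at load time. A minimal sketch, assuming the "three uppercase letters, dash, three digits" format described above:

```r
codes <- c("WDG-100", "WDG102", "wdg-103")
bad <- codes[!grepl("^[A-Z]{3}-[0-9]{3}$", codes)]
bad   # flag these for a human before the join runs
#> [1] "WDG102"  "wdg-103"
```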
Two government datasets need to be linked: regional economic indicators and regional population estimates, keyed on region and year. The year column is numeric and matches without trouble. The region column has a whitespace problem that affects only some records.
economics <- data.frame(
region = c("North America", "Europe", "Asia Pacific ",
"North America", "Europe", "Asia Pacific "),
year = c(2022, 2022, 2022, 2023, 2023, 2023),
gdp_growth = c(2.1, 1.8, 4.2, 1.9, 0.9, 3.8),
stringsAsFactors = FALSE
)
population <- data.frame(
region = c("North America", "Europe", "Asia Pacific",
"North America", "Europe", "Asia Pacific"),
year = c(2022, 2022, 2022, 2023, 2023, 2023),
pop_millions = c(580, 450, 4300, 585, 448, 4350),
stringsAsFactors = FALSE
)

In a compound key join, both columns must match. The year column is
fine everywhere, but "Asia Pacific " with a trailing space
will not match "Asia Pacific".
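Running the merge confirms it (base R, with the tables defined above):

```r
nrow(merge(economics, population, by = c("region", "year")))
#> [1] 4
```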
Four rows instead of six. North America and Europe link correctly; Asia Pacific silently vanishes. Two-thirds of the data is present – just enough to look plausible.
join_spy() pinpoints which column in the compound key
has the problem:
repaired <- join_repair(economics, population, by = c("region", "year"))
#> ✔ Repaired 2 value(s)
result <- merge(repaired$x, repaired$y, by = c("region", "year"))
nrow(result)
#> [1] 6

Six rows. Compound keys are more fragile than single keys because the
failure is partial: one clean column and one dirty column produce a
result that looks reasonable but is quietly incomplete. The more
columns in the key, the more places an invisible character can hide.
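Outside of joinspy, a quick base-R check can isolate which column of a compound key carries the whitespace (a sketch using the economics table above):

```r
dirty <- sapply(economics[c("region", "year")],
                function(col) any(grepl("^\\s|\\s$", as.character(col))))
dirty   # region is TRUE (dirty key), year is FALSE (clean)
```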
These five scenarios share three properties. The data looks correct
to standard inspection tools – str(),
summary(), print() all render the values
identically. The failure is silent – R does not warn when a merge drops
rows because of invisible whitespace; it just returns fewer rows. And
the fix is mechanical once the cause is known – trimming, lowercasing,
or stripping invisible Unicode are all one-line operations.
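For reference, those one-line fixes in base R. The Unicode class in the last line covers the invisibles seen in these scenarios; it is not exhaustive:

```r
keys <- c(" CUST-1004", "ALICE@ACME.COM", "India\u200B")
keys <- trimws(keys)                            # strip leading/trailing whitespace
keys <- tolower(keys)                           # normalize case
keys <- gsub("[\u00A0\u200B\uFEFF]", "", keys)  # drop NBSP, zero-width space, BOM
keys
#> [1] "cust-1004"      "alice@acme.com" "india"
```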
join_spy() closes the gap between a thirty-second fix
and the hours of confusion that precede it, particularly with data from
external sources, manual entry, PDF extraction, or cross-system
integrations.