Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2019-09-07

Often you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
library(fuzzyjoin)
data(misspellings)

misspellings
## # A tibble: 4,505 x 2
##    misspelling correct   
##    <chr>       <chr>     
##  1 abandonned  abandoned 
##  2 aberation   aberration
##  3 abilties    abilities 
##  4 abilty      ability   
##  5 abondon     abandon   
##  6 abbout      about     
##  7 abotu       about     
##  8 abouta      about a   
##  9 aboutit     about it  
## 10 aboutthe    about the 
## # … with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)

words
## # A tibble: 20,137 x 2
##    word  syllables
##    <chr>     <dbl>
##  1 hm            1
##  2 hmm           1
##  3 hmmm          1
##  4 hmph          1
##  5 mmhmm         2
##  6 mmhm          2
##  7 mm            1
##  8 mmm           1
##  9 mmmm          1
## 10 pff           1
## # … with 20,127 more rows

As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 728 x 4
##    misspelling correct word    syllables
##    <chr>       <chr>   <chr>       <dbl>
##  1 sould       should  could           1
##  2 sould       should  should          1
##  3 sould       should  sold            1
##  4 sould       should  soul            1
##  5 sould       should  sound           1
##  6 sould       should  would           1
##  7 fiels       feels   field           1
##  8 fiels       feels   fils            1
##  9 conscent    consent consent         2
## 10 fleed       freed   bleed           1
## # … with 718 more rows

Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## # A tibble: 455 x 3
##    misspelling correct          n
##    <chr>       <chr>        <int>
##  1 abritrary   arbitrary        1
##  2 accademic   academic         1
##  3 accension   ascension        2
##  4 accessable  accessible       1
##  5 accidant    accident         1
##  6 accidentaly accidentally     1
##  7 accordeon   accordion        1
##  8 addopt      adopt            1
##  9 addtional   additional       1
## 10 admendment  amendment        1
## # … with 445 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## # A tibble: 455 x 4
## # Groups:   misspelling [444]
##    misspelling correct      guesses one_correct
##    <chr>       <chr>          <int> <lgl>      
##  1 abritrary   arbitrary          1 TRUE       
##  2 accademic   academic           1 TRUE       
##  3 accension   ascension          2 TRUE       
##  4 accessable  accessible         1 TRUE       
##  5 accidant    accident           1 TRUE       
##  6 accidentaly accidentally       1 FALSE      
##  7 accordeon   accordion          1 TRUE       
##  8 addopt      adopt              1 TRUE       
##  9 addtional   additional         1 TRUE       
## 10 admendment  amendment          1 TRUE       
## # … with 445 more rows
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8527473
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 294

Not bad.

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## # A tibble: 1,273 x 4
##    misspelling  correct      word   syllables
##    <chr>        <chr>        <chr>      <dbl>
##  1 charactors   characters   <NA>          NA
##  2 Brasillian   Brazilian    <NA>          NA
##  3 sould        should       could          1
##  4 sould        should       should         1
##  5 sould        should       sold           1
##  6 sould        should       soul           1
##  7 sould        should       sound          1
##  8 sould        should       would          1
##  9 belligerant  belligerent  <NA>          NA
## 10 incorportaed incorporated <NA>          NA
## # … with 1,263 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 545 x 4
##    misspelling  correct      word  syllables
##    <chr>        <chr>        <chr>     <dbl>
##  1 charactors   characters   <NA>         NA
##  2 Brasillian   Brazilian    <NA>         NA
##  3 belligerant  belligerent  <NA>         NA
##  4 incorportaed incorporated <NA>         NA
##  5 awya         away         <NA>         NA
##  6 occuring     occurring    <NA>         NA
##  7 surveilence  surveillance <NA>         NA
##  8 abondoned    abandoned    <NA>         NA
##  9 alledges     alleges      <NA>         NA
## 10 deliberatly  deliberately <NA>         NA
## # … with 535 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## # A tibble: 7,691 x 4
##    misspelling correct    word       syllables
##    <chr>       <chr>      <chr>          <dbl>
##  1 charactors  characters character          3
##  2 charactors  characters charactery         4
##  3 Brasillian  Brazilian  <NA>              NA
##  4 sould       should     auld               1
##  5 sould       should     bold               1
##  6 sould       should     bound              1
##  7 sould       should     cold               1
##  8 sould       should     could              1
##  9 sould       should     fold               1
## 10 sould       should     foul               1
## # … with 7,681 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 264 x 4
##    misspelling   correct       word  syllables
##    <chr>         <chr>         <chr>     <dbl>
##  1 Brasillian    Brazilian     <NA>         NA
##  2 belligerant   belligerent   <NA>         NA
##  3 occuring      occurring     <NA>         NA
##  4 abondoned     abandoned     <NA>         NA
##  5 correponding  corresponding <NA>         NA
##  6 archeaologist archaeologist <NA>         NA
##  7 emmediately   immediately   <NA>         NA
##  8 possessess    possesses     <NA>         NA
##  9 unahppy       unhappy       <NA>         NA
## 10 Guilio        Giulio        <NA>         NA
## # … with 254 more rows

Most of the missing words here simply aren’t in our dictionary.

You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.