# Fuzzy string matching for R

`levitate` is based on the Python fuzzywuzzy package for fuzzy string matching. An R port, `fuzzywuzzyR`, already exists, but unlike that port, `levitate` is written entirely in R and does not depend on `reticulate` or Python. It also offers a couple of extra bells and whistles in the form of vectorised functions.

View the docs at https://lewinfox.github.io/levitate/.

## Why “`levitate`”?

A common measure of string similarity is the Levenshtein distance, and the name was available on CRAN.

## Installation

Install the development version from GitHub:

```r
devtools::install_github("lewinfox/levitate")
```

## Examples

### `lev_distance()`

The edit distance is the number of single-character insertions, deletions or substitutions needed to transform one string into another. Base R provides the `adist()` function to compute this. `levitate` provides `lev_distance()`, which is powered by the `stringdist` package.

```r
lev_distance("cat", "bat")
#> [1] 1

lev_distance("rat", "rats")
#> [1] 1

lev_distance("cat", "rats")
#> [1] 2
```
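For comparison, the same three distances can be reproduced with base R's `adist()`. This is a small sketch using base R only (the variable names are illustrative). Note that `adist()` always returns a matrix, so a single pair comes back as a 1 x 1 matrix.

```r
# Base R only: adist() (from utils) computes the generalised Levenshtein distance.
d1 <- adist("cat", "bat")   # one substitution (c -> b)
d2 <- adist("rat", "rats")  # one insertion (s)
d3 <- adist("cat", "rats")  # one substitution plus one insertion

# Drop the matrix dimensions to get plain numbers.
distances <- c(drop(d1), drop(d2), drop(d3))
print(distances)
#> [1] 1 1 2
```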

The function can accept vectorised input. When both inputs have a `length()` greater than 1, the results are returned as a vector of element-wise comparisons unless `pairwise = FALSE`, in which case a matrix of all combinations is returned.

```r
lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 1 1 2

lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"), pairwise = FALSE)
#>      rat log frog
#> cat    1   3    4
#> dog    3   1    2
#> clog   4   1    2
```
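The relationship between the two output shapes can be illustrated with base R's `adist()`, which always returns the full matrix. The element-wise result is the diagonal of that matrix. This is only a sketch of the idea, not a description of how `levitate` computes either form.

```r
a <- c("cat", "dog", "clog")
b <- c("rat", "log", "frog")

# All combinations: one row per element of `a`, one column per element of `b`.
all_pairs <- adist(a, b)
dimnames(all_pairs) <- list(a, b)
print(all_pairs)

# Element-wise: a[i] compared with b[i], i.e. the diagonal of the matrix.
elementwise <- diag(all_pairs)
print(elementwise)
```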

If either input is scalar (length 1), the result is a vector. The elements of the vector are named after the longer input unless `useNames = FALSE`.

```r
lev_distance(c("cat", "dog", "clog"), "rat")
#>  cat  dog clog
#>    1    3    4

lev_distance("cat", c("rat", "log", "frog", "other"))
#>   rat   log  frog other
#>     1     3     4     5

lev_distance("cat", c("rat", "log", "frog", "other"), useNames = FALSE)
#> [1] 1 3 4 5
```

### `lev_ratio()`

`lev_ratio()` is often more useful than the raw edit distance because it returns a score between 0 and 1, which makes it easier to compare similarity across different strings. Identical strings get a score of 1 and entirely dissimilar strings get a score of 0.

This function handles vectorised input in the same way as `lev_distance()`:

```r
lev_ratio("cat", "bat")
#> [1] 0.6666667

lev_ratio("rat", "rats")
#> [1] 0.75

lev_ratio("cat", "rats")
#> [1] 0.5

lev_ratio(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 0.6666667 0.6666667 0.5000000
```
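The scores above are consistent with dividing the edit distance by the length of the longer string and subtracting the result from 1. The sketch below reproduces them in base R. `ratio_sketch()` is a hypothetical helper written for this illustration; `levitate` itself delegates to `stringdist`, so treat this as an approximation of the idea rather than the package's implementation.

```r
# Hypothetical helper: similarity = 1 - (edit distance / length of longer string).
# Handles a single pair of strings only.
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)  # two empty strings are identical
  1 - drop(adist(a, b)) / longest
}

ratio_sketch("cat", "bat")   # 1 - 1/3
ratio_sketch("rat", "rats")  # 1 - 1/4
ratio_sketch("cat", "rats")  # 1 - 2/4
```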

### `lev_partial_ratio()`

If `a` and `b` are different lengths, this function compares all the substrings of the longer string that are the same length as the shorter string and returns the highest `lev_ratio()` of all of them. E.g. when comparing `"actor"` and `"tractor"` we would compare `"actor"` with `"tract"`, `"racto"` and `"actor"` and return the highest score (in this case 1).

```r
lev_partial_ratio("actor", "tractor")
#> [1] 1

# What's actually happening is the max() of this result is being returned
lev_ratio("actor", c("tract", "racto", "actor"))
#> tract racto actor
#>   0.2   0.6   1.0
```
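The sliding-window idea can be sketched in base R as follows. Both helpers are hypothetical and exist only for this illustration: `ratio_sketch()` assumes the ratio is 1 minus the edit distance divided by the longer length, and `partial_ratio_sketch()` handles a single pair of strings.

```r
# Hypothetical helper: 1 - (edit distance / length of longer string).
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(adist(a, b)) / longest
}

# Hypothetical helper: best ratio between the shorter string and every
# same-length window of the longer string.
partial_ratio_sketch <- function(a, b) {
  if (nchar(a) <= nchar(b)) {
    short <- a
    long <- b
  } else {
    short <- b
    long <- a
  }
  n <- nchar(short)
  if (n == 0) return(ratio_sketch(short, long))

  # Start positions of every window of length n in the longer string.
  starts <- seq_len(nchar(long) - n + 1)
  windows <- substring(long, starts, starts + n - 1)

  scores <- vapply(windows, function(w) ratio_sketch(short, w), numeric(1))
  max(scores)
}

partial_ratio_sketch("actor", "tractor")
#> [1] 1
```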

### `lev_token_sort_ratio()`

The inputs are tokenised and the tokens are sorted alphabetically, then the resulting strings are compared.

```r
x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"

# Because the order of words is different the simple approach gives a low match ratio.
lev_ratio(x, y)
#> [1] 0.3529412

# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)
#> [1] 0.9354839
```
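To see why sorting helps, here is a rough base R sketch of the tokenise-and-sort step. The exact tokenisation rules used by `levitate` are not described here, so `token_sort_sketch()` is a hypothetical helper that assumes lower-casing and splitting on runs of non-alphanumeric characters.

```r
# Hypothetical helper; the tokenisation rules are an assumption.
token_sort_sketch <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop empty tokens from leading separators
  # method = "radix" sorts in the C locale, so the order is reproducible.
  paste(sort(tokens, method = "radix"), collapse = " ")
}

sorted_x <- token_sort_sketch("Episode IV - Star Wars: A New Hope")
sorted_y <- token_sort_sketch("Star Wars Episode IV - New Hope")

print(sorted_x)
#> [1] "a episode hope iv new star wars"
print(sorted_y)
#> [1] "episode hope iv new star wars"

# Only the extra token "a" (plus a space) now separates the two strings.
drop(adist(sorted_x, sorted_y))
#> [1] 2
```

Under these assumptions the remaining distance is 2 over a longest length of 31 characters, which lines up with the high score shown above.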

### `lev_token_set_ratio()`

Similar to `lev_token_sort_ratio()`, this function breaks the input down into tokens. It then identifies the tokens common to both strings and creates three new strings:

```
x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}
```

and performs three pairwise `lev_ratio()` calculations between them (`x` vs `y`, `y` vs `z` and `x` vs `z`). The highest of those three ratios is returned.

```r
x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"

lev_ratio(x, y)
#> [1] 0.2916667

lev_token_sort_ratio(x, y)
#> [1] 0.6458333

lev_token_set_ratio(x, y)
#> [1] 0.7435897
```
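The three-string construction can be sketched in base R. Everything below is hypothetical and for illustration only: the helpers assume lower-casing, splitting on non-alphanumeric characters, de-duplicating tokens and sorting them, and `ratio_sketch()` assumes the ratio is 1 minus the edit distance divided by the longer length. `levitate` may differ in any of these details.

```r
# Hypothetical helpers; tokenisation and ratio rules are assumptions.
tokenise <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(adist(a, b)) / longest
}

token_set_sketch <- function(a, b) {
  ta <- tokenise(a)
  tb <- tokenise(b)
  join <- function(tokens) paste(sort(tokens, method = "radix"), collapse = " ")

  common <- join(intersect(ta, tb))
  x <- common                                      # common tokens only
  y <- trimws(paste(common, join(setdiff(ta, tb))))  # plus tokens unique to a
  z <- trimws(paste(common, join(setdiff(tb, ta))))  # plus tokens unique to b

  # Highest of the three pairwise ratios.
  max(ratio_sketch(x, y), ratio_sketch(y, z), ratio_sketch(x, z))
}

a <- "the quick brown fox jumps over the lazy dog"
b <- "my lazy dog was jumped over by a quick brown fox"
score <- token_set_sketch(a, b)
print(score)
```

With these assumptions the best pair is `x` vs `y`: the common tokens form a 29-character string and adding `jumps the` costs 10 insertions, giving 1 - 10/39, which agrees with the output above.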

## Porting code from `fuzzywuzzy` or `fuzzywuzzyR`

Results differ between `levitate` and `fuzzywuzzy`, not least because `stringdist` offers several possible similarity measures. Be careful if you are porting code that relies on hard-coded or learned cutoffs for similarity measures.