Using a cluster for record linkage

Jan van der Laan

Introduction

reclin2 can use a cluster created by parallel or snow for record linkage. This has a couple of advantages. First, record linkage can be a computationally intensive problem, as all records from both datasets have to be compared to each other. Splitting the computation over multiple cores or CPUs can give a substantial speed benefit, and the problem is easy to parallelize. Second, when using a snow cluster, the computation can be distributed over multiple machines, allowing reclin2 to use the memory of all these machines. Besides being computationally intensive, record linkage can also be memory intensive, as all pairs are stored in memory.

Parallelization over k cluster nodes is realised by randomly splitting the first dataset x into k equally sized parts and distributing these parts over the nodes. The second dataset y is copied to each of the nodes. Therefore, it is beneficial for memory consumption if the first dataset is the larger of the two. On each node the local y is compared to the local x and a local set of pairs is generated. For most operations there exist methods for cluster_pairs. These usually consist of running the corresponding operation for regular pairs on each of the nodes.
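The partitioning described above can be sketched with the parallel package alone. This is a minimal illustration, not the code reclin2 uses internally: the toy data frames x and y, the seed, and the names group, x_parts, x_local and y_local are assumptions made for the example.

```r
library(parallel)

# Toy datasets standing in for x (first dataset) and y (second dataset)
x <- data.frame(id = 1:6)
y <- data.frame(id = 1:5)
k <- 2  # number of cluster nodes

# Randomly assign the records of x to k (roughly) equally sized parts
set.seed(1)
group <- sample(rep(seq_len(k), length.out = nrow(x)))
x_parts <- split(x, group)

# Send one part of x to each node; y is passed in full to every node
cl <- makeCluster(k)
n_pairs <- clusterApply(cl, x_parts, function(x_local, y_local) {
  # Without blocking, a node would compare every local x to every y
  nrow(x_local) * nrow(y_local)
}, y_local = y)
stopCluster(cl)

# Summed over the nodes this is nrow(x) * nrow(y)
total_pairs <- sum(unlist(n_pairs))
total_pairs  # 30
```

Each node only sees its own part of x but all of y, which is why the memory use per node is lowest when x is the larger dataset.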

Below an example is given using a small cluster. It is assumed that the reader has read the introduction vignette and knows the general procedure of record linkage.

Basic example

This section repeats the example from the introduction vignette, this time using a cluster.

> library(reclin2)

We will work with a pair of data sets with artificial data. They are tiny, but that allows us to see what happens. In this example we will perform ‘classic’ probabilistic record linkage.

> data("linkexample1", "linkexample2")
> print(linkexample1)
  id lastname firstname    address sex postcode
1  1    Smith      Anna 12 Mainstr   F  1234 AB
2  2    Smith    George 12 Mainstr   M  1234 AB
3  3  Johnson      Anna 61 Mainstr   F  1234 AB
4  4  Johnson   Charles 61 Mainstr   M  1234 AB
5  5  Johnson    Charly 61 Mainstr   M  1234 AB
6  6 Schwartz       Ben  1 Eaststr   M  6789 XY
> print(linkexample2)
  id lastname firstname       address  sex postcode
1  2    Smith    Gearge 12 Mainstreet <NA>  1234 AB
2  3   Jonson        A. 61 Mainstreet    F  1234 AB
3  4  Johnson   Charles    61 Mainstr    F  1234 AB
4  6 Schwartz       Ben        1 Main    M  6789 XY
5  7 Schwartz      Anna     1 Eaststr    F  6789 XY

We first have to start a cluster. Pairs can then be generated using any of the cluster_pair_* functions.

> library(parallel)
> cl <- makeCluster(2)
> pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y
    <int> <int>
 1:     3     3
 2:     5     2
 3:     5     1
 4:     1     2
 5:     3     2
 6:     2     1
 7:     4     3
 8:     4     2
 9:     4     1
10:     2     2

The print function collects a few (max 6) pairs from each of the nodes and shows those. Other cluster_pair_* functions are cluster_pair and cluster_pair_minsim.
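The total of 17 pairs reported above can be checked by hand. With blocking on postcode, only records that share a postcode are paired, so the number of pairs is the sum over the blocks of the number of x records times the number of y records. The sketch below uses the postcodes printed for the two example datasets; the variable names are assumptions made for the example.

```r
# Postcodes of the records in linkexample1 (x) and linkexample2 (y)
postcode_x <- c(rep("1234 AB", 5), "6789 XY")
postcode_y <- c(rep("1234 AB", 3), rep("6789 XY", 2))

# Number of records per block in each dataset
n_x <- table(postcode_x)
n_y <- table(postcode_y)

# Only blocks that occur in both datasets produce pairs
blocks <- intersect(names(n_x), names(n_y))
n_pairs_blocking <- sum(as.integer(n_x[blocks]) * as.integer(n_y[blocks]))

# Without blocking every x record is paired with every y record
n_pairs_full <- length(postcode_x) * length(postcode_y)

n_pairs_blocking  # 17
n_pairs_full      # 30
```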

The cluster_pair_* functions return an object of type cluster_pairs. Most other methods work the same as for regular pairs. For example, to compare the pairs on variables:

> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"), 
+   default_comparator = cmp_jarowinkler(0.9), inplace = TRUE)
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y lastname firstname   address   sex
    <int> <int>    <num>     <num>     <num> <num>
 1:     1     2 0.000000 0.5833333 0.8641026     1
 2:     1     1 1.000000 0.4722222 0.9230769    NA
 3:     3     1 0.447619 0.4722222 0.8641026    NA
 4:     1     3 0.447619 0.4642857 0.9333333     1
 5:     5     1 0.447619 0.5555556 0.8641026    NA
 6:     2     1 1.000000 0.8888889 0.9230769    NA
 7:     2     3 0.447619 0.5396825 0.9333333     0
 8:     4     3 1.000000 1.0000000 1.0000000     0
 9:     6     5 1.000000 0.5277778 1.0000000     0
10:     6     4 1.000000 1.0000000 0.6111111     1

The code above was copied from the introduction vignette, where the argument inplace = TRUE adds the new variables to the existing pairs. One difference between regular pairs and cluster_pairs is that most methods for cluster_pairs always modify the existing pairs in place. The inplace argument is therefore ignored here, and we can simply write:

> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+   default_comparator = cmp_jarowinkler(0.9))
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y lastname firstname   address   sex
    <int> <int>    <num>     <num>     <num> <num>
 1:     5     1 0.447619 0.5555556 0.8641026    NA
 2:     5     2 0.952381 0.0000000 0.9230769     0
 3:     3     2 0.952381 0.5833333 0.9230769     1
 4:     1     3 0.447619 0.4642857 0.9333333     1
 5:     1     2 0.000000 0.5833333 0.8641026     1
 6:     6     4 1.000000 1.0000000 0.6111111     1
 7:     6     5 1.000000 0.5277778 1.0000000     0
 8:     4     1 0.447619 0.6428571 0.8641026    NA
 9:     2     1 1.000000 0.8888889 0.9230769    NA
10:     4     2 0.952381 0.0000000 0.9230769     0

Most methods for cluster_pairs have a new_name argument. When it is set, a new set of pairs is generated on the cluster nodes under that name. For example, the following code will generate a new set of pairs and will not modify the existing pairs:

> pairs2 <- compare_pairs(pairs, on = 
+   c("lastname", "firstname", "address", "sex"), new_name = "pairs2")
> print(pairs2)
  Cluster 'pairs2' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y lastname firstname address    sex
    <int> <int>   <lgcl>    <lgcl>  <lgcl> <lgcl>
 1:     3     1    FALSE     FALSE   FALSE     NA
 2:     3     3     TRUE     FALSE    TRUE   TRUE
 3:     1     1     TRUE     FALSE   FALSE     NA
 4:     1     2    FALSE     FALSE   FALSE   TRUE
 5:     5     1    FALSE     FALSE   FALSE     NA
 6:     4     3     TRUE      TRUE    TRUE  FALSE
 7:     2     3    FALSE     FALSE   FALSE  FALSE
 8:     2     1     TRUE     FALSE   FALSE     NA
 9:     4     2    FALSE     FALSE   FALSE  FALSE
10:     4     1    FALSE     FALSE   FALSE     NA
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y lastname firstname   address   sex
    <int> <int>    <num>     <num>     <num> <num>
 1:     1     1 1.000000 0.4722222 0.9230769    NA
 2:     3     3 1.000000 0.4642857 1.0000000     1
 3:     5     2 0.952381 0.0000000 0.9230769     0
 4:     5     3 1.000000 0.8492063 1.0000000     0
 5:     1     3 0.447619 0.4642857 0.9333333     1
 6:     4     1 0.447619 0.6428571 0.8641026    NA
 7:     4     3 1.000000 1.0000000 1.0000000     0
 8:     2     3 0.447619 0.5396825 0.9333333     0
 9:     6     5 1.000000 0.5277778 1.0000000     0
10:     2     2 0.000000 0.0000000 0.8641026     0

The function compare_vars offers more flexibility than compare_pairs. It can for example compare multiple variables at the same time (e.g. compare birth day and month allowing for swaps) or generate multiple results from comparing on one variable. This method also works on cluster_pairs.

The next step in the process is to determine which pairs of records belong to the same entity and which do not. As in the introduction vignette, we will use the classic method. Again, we hardly need to change the code from the introduction:

> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> print(m)
M- and u-probabilities estimated by the EM-algorithm:
  Variable M-probability U-probability
  lastname     0.9990000   0.001152679
 firstname     0.1999999   0.000100000
   address     0.8999206   0.285831118
       sex     0.3002011   0.285427112

Matching probability: 0.5885595.
> pairs <- predict(m, pairs = pairs, add = TRUE)
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y lastname firstname   address   sex    weights
    <int> <int>    <num>     <num>     <num> <num>      <num>
 1:     5     3 1.000000 0.8492063 1.0000000     0  8.5458257
 2:     3     3 1.000000 0.4642857 1.0000000     1  7.9350221
 3:     1     2 0.000000 0.5833333 0.8641026     1 -5.9463949
 4:     5     1 0.447619 0.5555556 0.8641026    NA  0.6717426
 5:     1     1 1.000000 0.4722222 0.9230769    NA  7.7103862
 6:     2     3 0.447619 0.5396825 0.9333333     0  0.7937508
 7:     2     2 0.000000 0.0000000 0.8641026     0 -6.3177171
 8:     6     5 1.000000 0.5277778 1.0000000     0  7.9139248
 9:     6     4 1.000000 1.0000000 0.6111111     1 14.6796595
10:     4     3 1.000000 1.0000000 1.0000000     0 15.4915816
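For a pair whose comparison scores are all exactly 0 or 1, the weight shown above can be reproduced by hand as a sum of log-likelihood ratios: log(m/u) for each variable that agrees and log((1-m)/(1-u)) for each variable that disagrees. The sketch below does this for the pair .x = 4, .y = 3, using the rounded probabilities from the printed model. It is only a check on the printed numbers, under the assumption that natural logarithms are used; it does not cover how scores between 0 and 1 or missing values are handled.

```r
# m- and u-probabilities as printed by print(m) (rounded)
m <- c(lastname = 0.9990000, firstname = 0.1999999,
       address = 0.8999206, sex = 0.3002011)
u <- c(lastname = 0.001152679, firstname = 0.000100000,
       address = 0.285831118, sex = 0.285427112)

# Pair .x = 4, .y = 3: agreement on everything except sex
agree <- c(lastname = TRUE, firstname = TRUE, address = TRUE, sex = FALSE)

# Per-variable contribution: log(m/u) on agreement,
# log((1 - m)/(1 - u)) on disagreement
w <- ifelse(agree, log(m / u), log((1 - m) / (1 - u)))

weight <- sum(w)
weight  # approximately 15.49, in line with the weight printed for this pair
```

Agreement on firstname contributes most here, because its u-probability is very small: chance agreement on first name is rare among non-matches.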

We can then select the pairs with a weight above a threshold.

> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 8)
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
       .x    .y lastname firstname   address   sex    weights threshold
    <int> <int>    <num>     <num>     <num> <num>      <num>    <lgcl>
 1:     1     1 1.000000 0.4722222 0.9230769    NA  7.7103862     FALSE
 2:     1     3 0.447619 0.4642857 0.9333333     1  0.8042090     FALSE
 3:     5     1 0.447619 0.5555556 0.8641026    NA  0.6717426     FALSE
 4:     3     2 0.952381 0.5833333 0.9230769     1  4.0674910     FALSE
 5:     1     2 0.000000 0.5833333 0.8641026     1 -5.9463949     FALSE
 6:     4     1 0.447619 0.6428571 0.8641026    NA  0.7713174     FALSE
 7:     4     3 1.000000 1.0000000 1.0000000     0 15.4915816      TRUE
 8:     2     2 0.000000 0.0000000 0.8641026     0 -6.3177171     FALSE
 9:     4     2 0.952381 0.0000000 0.9230769     0  3.6961688     FALSE
10:     2     3 0.447619 0.5396825 0.9333333     0  0.7937508     FALSE

And this is roughly where we have to stop working with cluster_pairs. The subset of selected pairs remaining should now be small enough that we can comfortably work locally. The most computationally intensive steps have been done. When we are not sure exactly what the threshold should be, we can also work with a more conservative threshold. That should still give us enough of a reduction in pairs that we can work locally. Using cluster_collect we can copy the selected pairs (or all pairs) locally:

> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 0)
> local_pairs <- cluster_collect(pairs, "threshold")
> print(local_pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 15 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address   sex    weights threshold
    <int> <int>    <num>     <num>     <num> <num>      <num>    <lgcl>
 1:     1     1 1.000000 0.4722222 0.9230769    NA  7.7103862      TRUE
 2:     1     3 0.447619 0.4642857 0.9333333     1  0.8042090      TRUE
 3:     3     1 0.447619 0.4722222 0.8641026    NA  0.6017106      TRUE
 4:     3     2 0.952381 0.5833333 0.9230769     1  4.0674910      TRUE
 5:     3     3 1.000000 0.4642857 1.0000000     1  7.9350221      TRUE
 6:     5     1 0.447619 0.5555556 0.8641026    NA  0.6717426      TRUE
 7:     5     2 0.952381 0.0000000 0.9230769     0  3.6961688      TRUE
 8:     5     3 1.000000 0.8492063 1.0000000     0  8.5458257      TRUE
 9:     2     1 1.000000 0.8888889 0.9230769    NA  8.6064218      TRUE
10:     2     3 0.447619 0.5396825 0.9333333     0  0.7937508      TRUE
11:     4     1 0.447619 0.6428571 0.8641026    NA  0.7713174      TRUE
12:     4     2 0.952381 0.0000000 0.9230769     0  3.6961688      TRUE
13:     4     3 1.000000 1.0000000 1.0000000     0 15.4915816      TRUE
14:     6     4 1.000000 1.0000000 0.6111111     1 14.6796595      TRUE
15:     6     5 1.000000 0.5277778 1.0000000     0  7.9139248      TRUE
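The effect of the two thresholds can be reproduced with a plain comparison on the weights. The sketch below lists the 17 weights printed in this vignette (the 15 collected pairs plus the two pairs with a negative weight) and assumes that a pair is selected when its weight is strictly above the threshold; no weight equals a threshold here, so that assumption does not affect the counts.

```r
# Weights of all 17 pairs, taken from the printed output
weights <- c(7.7103862, 0.8042090, 0.6017106, 4.0674910, 7.9350221,
             0.6717426, 3.6961688, 8.5458257, 8.6064218, 0.7937508,
             0.7713174, 3.6961688, 15.4915816, 14.6796595, 7.9139248,
             -5.9463949, -6.3177171)

# Number of pairs selected with the strict and the conservative threshold
n_above_8 <- sum(weights > 8)
n_above_0 <- sum(weights > 0)

n_above_8  # 4
n_above_0  # 15
```

The conservative threshold of 0 drops only the two pairs with a negative weight, which is why 15 of the 17 pairs are collected.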

local_pairs is a regular pairs object (and therefore a data.table) which can be operated upon like any other pairs object. cluster_collect also has the option clear which, when TRUE, deletes the pairs on the cluster nodes. After this we can use the code from the introduction vignette:

> local_pairs <- compare_vars(local_pairs, "truth", on_x = "id", on_y = "id")
> local_pairs <- select_n_to_m(local_pairs, "weights", variable = "ntom", threshold = 0)
> table(local_pairs$truth, local_pairs$ntom)
       
        FALSE TRUE
  FALSE    11    0
  TRUE      0    4
> linked_data_set <- link(local_pairs, selection = "ntom")
> print(linked_data_set)
  Total number of pairs: 4 pairs

Key: <.y>
      .y    .x  id.x lastname.x firstname.x  address.x  sex.x postcode.x   .id
   <int> <int> <int>     <fctr>      <fctr>     <fctr> <fctr>     <fctr> <int>
1:     1     2     2      Smith      George 12 Mainstr      M    1234 AB     2
2:     2     3     3    Johnson        Anna 61 Mainstr      F    1234 AB     3
3:     3     4     4    Johnson     Charles 61 Mainstr      M    1234 AB     4
4:     4     6     6   Schwartz         Ben  1 Eaststr      M    6789 XY     6
    id.y lastname.y firstname.y     address.y  sex.y postcode.y
   <int>     <fctr>      <fctr>        <fctr> <fctr>     <fctr>
1:     2      Smith      Gearge 12 Mainstreet   <NA>    1234 AB
2:     3     Jonson          A. 61 Mainstreet      F    1234 AB
3:     4    Johnson     Charles    61 Mainstr      F    1234 AB
4:     6   Schwartz         Ben        1 Main      M    6789 XY
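To get a feeling for what a one-to-one selection does, the sketch below applies a simple greedy rule to the 15 collected pairs: go through the pairs in order of decreasing weight and keep a pair when neither of its records has been used yet. This greedy rule is an assumption made for illustration; it is not claimed to be the procedure select_n_to_m uses, which may optimise the total weight instead. The function name greedy_one_to_one is hypothetical.

```r
# The 15 collected pairs with their weights, from the printed output
collected <- data.frame(
  x = c(1, 1, 3, 3, 3, 5, 5, 5, 2, 2, 4, 4, 4, 6, 6),
  y = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3, 4, 5),
  weights = c(7.7103862, 0.8042090, 0.6017106, 4.0674910, 7.9350221,
              0.6717426, 3.6961688, 8.5458257, 8.6064218, 0.7937508,
              0.7713174, 3.6961688, 15.4915816, 14.6796595, 7.9139248)
)

# Greedy one-to-one selection: highest weight first, each record used once
greedy_one_to_one <- function(pairs) {
  selected <- logical(nrow(pairs))
  used_x <- integer(0)
  used_y <- integer(0)
  for (i in order(pairs$weights, decreasing = TRUE)) {
    if (!(pairs$x[i] %in% used_x) && !(pairs$y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, pairs$x[i])
      used_y <- c(used_y, pairs$y[i])
    }
  }
  selected
}

collected$greedy <- greedy_one_to_one(collected)
result <- collected[collected$greedy, c("x", "y")]
result[order(result$y), ]
```

On this small example the greedy rule keeps the pairs (x, y) = (2, 1), (3, 2), (4, 3) and (6, 4), the same four pairs that appear in the linked dataset above. On larger problems a greedy rule and an optimising rule can give different selections.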

Internals

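The cluster_pairs object is a list with two elements: cluster, the cluster object on which the pairs are stored, and name, the name under which the pairs are stored on the cluster nodes.

On each cluster node there exists an environment (reclin2:::reclin_env). For each set of pairs, an environment containing the pairs is created inside that environment. To demonstrate, let us get the first pair on each of the nodes: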

> clusterCall(pairs$cluster, function(name) {
+   pairs <- reclin2:::reclin_env[[name]]$pairs
+   head(pairs, 1)
+ }, name = pairs$name)
[[1]]
  First data set:  3 records
  Second data set: 5 records
  Total number of pairs: 1 pairs
  Blocking on: 'postcode'

      .x    .y lastname firstname   address   sex  weights threshold
   <int> <int>    <num>     <num>     <num> <num>    <num>    <lgcl>
1:     1     1        1 0.4722222 0.9230769    NA 7.710386      TRUE

[[2]]
  First data set:  3 records
  Second data set: 5 records
  Total number of pairs: 1 pairs
  Blocking on: 'postcode'

      .x    .y lastname firstname   address   sex  weights threshold
   <int> <int>    <num>     <num>     <num> <num>    <num>    <lgcl>
1:     1     1        1 0.8888889 0.9230769    NA 8.606422      TRUE
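The storage scheme described above, and the way a new_name argument can leave the original pairs untouched, can be imitated locally with plain R environments. The sketch below is an illustration under assumptions, not reclin2's implementation: node_env stands in for the environment on one node, and store_pairs and modify_pairs are hypothetical helpers.

```r
# Hypothetical stand-in for the package environment on one cluster node
node_env <- new.env()

# Store a set of pairs under a name, each set in its own environment
store_pairs <- function(name, pairs) {
  e <- new.env()
  e$pairs <- pairs
  assign(name, e, envir = node_env)
  invisible(NULL)
}

# Apply f to the pairs stored under 'name' and store the result under
# 'new_name'; by default the original pairs are overwritten
modify_pairs <- function(name, f, new_name = name) {
  pairs <- get(name, envir = node_env)$pairs
  store_pairs(new_name, f(pairs))
}

store_pairs("default", data.frame(.x = c(1L, 1L), .y = c(1L, 2L)))

# Add a column, but write the result to a new set of pairs
modify_pairs("default", function(p) {
  p$score <- c(0.9, 0.1)
  p
}, new_name = "pairs2")

names(node_env[["default"]]$pairs)  # ".x" ".y"
names(node_env[["pairs2"]]$pairs)   # ".x" ".y" "score"
```

Because the pairs live on the node under a name, the object on the main R session only needs to carry the cluster and that name.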

Some specific methods for cluster_pairs

Regular pairs are also a data.table, which makes it easy to manually create columns, select rows or aggregate. For cluster_pairs this is more difficult, as the pairs are distributed over the cluster nodes. To help with this, reclin2 has two helper functions: cluster_call and cluster_modify_pairs.

You can pass cluster_call the cluster_pairs object and a function. This function will be called on each cluster node and will be passed the pairs object, the local x and y (in that order). This can be used to modify the pairs, or calculate statistics from the pairs. The result of the function calls is returned by cluster_call. Therefore, if the sole goal is to modify the pairs, make sure to return NULL (or at least something small). Below we use cluster_call to make a random stratified sample of pairs:

> compare_vars(pairs, "id")
> cluster_call(pairs, function(pairs, ...) {
+   sel1 <- sample(which(pairs$id), 2)
+   sel2 <- sample(which(!pairs$id), 2)
+   pairs[, sample := FALSE]
+   pairs[c(sel1, sel2), sample := TRUE]
+   NULL
+ })
> sample <- cluster_collect(pairs, "sample")
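The sampling logic inside the function above can be tried out on a plain logical vector. The sketch below uses a made-up vector id (an assumption for illustration, not data from the example) in place of the id column of the pairs on one node, and base R assignment instead of data.table's := operator.

```r
set.seed(1)

# Made-up comparison result for eight pairs on one node: TRUE = same id
id <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Stratified sample: two pairs with equal ids and two with different ids
sel1 <- sample(which(id), 2)
sel2 <- sample(which(!id), 2)

# Flag the sampled pairs
in_sample <- logical(length(id))
in_sample[c(sel1, sel2)] <- TRUE

table(id, in_sample)
```

Since sample(v, 2) fails when v has fewer than two elements, the function as written assumes that every node holds at least two pairs of each kind.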

cluster_modify_pairs is very similar to cluster_call, but is mainly meant for modifying the pairs object (although in the previous example we used cluster_call for that as well). When the function passed to cluster_modify_pairs returns a data.table, this data.table will overwrite the pairs object. cluster_modify_pairs also accepts a new_name argument. When it is set, a new pairs object will be created.

Let’s use the sample from above to estimate a model and then use cluster_modify_pairs to add the predictions to the pairs:

> mglm  <- glm(id ~ lastname + firstname, data = sample)
> cluster_modify_pairs(pairs, function(pairs, model, ...) {
+   pairs$pmodel <- predict(model, newdata = pairs, type = "response")
+   pairs
+ }, model = mglm)
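What the function passed to cluster_modify_pairs does on a single node can be tried out locally on an ordinary data frame. The sketch below is an illustration only: it rebuilds the lastname and firstname scores of the 15 collected pairs printed earlier, derives id from the record ids in the two example datasets, fits the same kind of model on all 15 pairs instead of on a sample, and adds the predictions as a new column. The names local_df, id_y and add_predictions are assumptions.

```r
# The 15 collected pairs with their lastname and firstname scores
local_df <- data.frame(
  .x = c(1, 1, 3, 3, 3, 5, 5, 5, 2, 2, 4, 4, 4, 6, 6),
  .y = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3, 4, 5),
  lastname = c(1, 0.447619, 0.447619, 0.952381, 1, 0.447619, 0.952381,
               1, 1, 0.447619, 0.447619, 0.952381, 1, 1, 1),
  firstname = c(0.4722222, 0.4642857, 0.4722222, 0.5833333, 0.4642857,
                0.5555556, 0, 0.8492063, 0.8888889, 0.5396825,
                0.6428571, 0, 1, 1, 0.5277778)
)

# In linkexample1 the id equals the row number; these are the ids of
# the rows of linkexample2
id_y <- c(2, 3, 4, 6, 7)
local_df$id <- local_df$.x == id_y[local_df$.y]

# Same model formula as in the text, fitted here on all collected pairs
model <- glm(id ~ lastname + firstname, data = local_df)

# The function one would pass to cluster_modify_pairs: add a column
# with predictions and return the modified pairs
add_predictions <- function(pairs, model, ...) {
  pairs$pmodel <- predict(model, newdata = pairs, type = "response")
  pairs
}

local_df <- add_predictions(local_df, model)
head(local_df)
```

On the cluster the same function runs once per node, and the fitted model is shipped to every node as an extra argument.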

And stop the cluster.

> stopCluster(cl)