Finnish personal ID number data toolkit for R (hetu)

Pyry Kantanen, Jussi Paananen, Mans Magnusson, Leo Lahti

2022-05-20

The hetu R package provides tools to work with Finnish personal identity numbers (hetu, short for the Finnish term “henkilötunnus”). Some functions can also be used with Finnish Business ID numbers (y-tunnus).

Where possible, we have unified the syntax with sweidnumbr.

Installation

Install the current devel version in R:

devtools::install_github("ropengov/hetu")

Test the installation by loading the library:

library(hetu)

We also recommend setting the UTF-8 encoding:

Sys.setlocale(locale="UTF-8") 

Introduction

Finnish personal identification numbers (Finnish: henkilötunnus, hetu for short) are used to identify citizens. A hetu PIN consists of eleven characters: DDMMYYCZZZQ, where DDMMYY is the day, month and year of birth, C is the century marker, ZZZ is the individual number and Q is the control character.
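As a minimal sketch of this layout, the following base R snippet splits a PIN by character position. The helper name parse_pin_sketch is hypothetical and not part of the hetu package. The century marker mapping ("+" for the 1800s, "-" for the 1900s, "A" for the 2000s) is an assumption of this sketch rather than something taken from the package code.

```r
# Hypothetical helper: split one PIN into its parts by position (DDMMYYCZZZQ).
parse_pin_sketch <- function(pin) {
  stopifnot(is.character(pin), length(pin) == 1, nchar(pin) == 11)
  # Assumed century markers: "+" = 1800s, "-" = 1900s, "A" = 2000s
  centuries <- c("+" = 1800, "-" = 1900, "A" = 2000)
  marker <- substr(pin, 7, 7)
  list(
    day = as.integer(substr(pin, 1, 2)),
    month = as.integer(substr(pin, 3, 4)),
    # An unknown marker yields NA for the year
    year = unname(centuries[marker]) + as.integer(substr(pin, 5, 6)),
    century = marker,
    p.num = substr(pin, 8, 10),
    ctrl.char = substr(pin, 11, 11)
  )
}

str(parse_pin_sketch("111111-111C"))
```

For real data the hetu function described later in this vignette should be preferred, since it also validates the parts.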

Males have an odd and females an even individual number. The control character is determined by reading DDMMYYZZZ as a nine-digit number, dividing it by 31 and using the remainder (modulo 31) to pick the corresponding character from the string “0123456789ABCDEFHJKLMNPRSTUVWXY”, counting positions from zero. For example, if the remainder is 0, the control character is 0 and if the remainder is 12, the control character is C.
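The modulo 31 rule can be illustrated with a few lines of base R. This is only a sketch of the rule as described above; ctrl_char_sketch is a hypothetical name and not the package's own implementation.

```r
# Hypothetical helper: compute the expected control character of a PIN.
ctrl_char_sketch <- function(pin) {
  # Join the birth date digits (DDMMYY) and the individual number (ZZZ),
  # skipping the century marker at position 7
  digits <- paste0(substr(pin, 1, 6), substr(pin, 8, 10))
  remainder <- as.numeric(digits) %% 31
  # Split the lookup string into single characters; R indexes from 1
  alphabet <- strsplit("0123456789ABCDEFHJKLMNPRSTUVWXY", "")[[1]]
  alphabet[remainder + 1]
}

ctrl_char_sketch(c("111111-111C", "010101-0101"))
#> [1] "C" "1"
```

For "111111-111C" the number 111111111 leaves a remainder of 12 when divided by 31, which selects C as in the example above.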

A valid individual number is between 002 and 899. Individual numbers 900-999 are not in normal use and are used only for temporary or artificial PINs. These temporary PINs are sometimes used in different organizations, such as insurance companies or hospitals, if the individual is not a Finnish citizen or a permanent resident, or if the exact identity of the individual cannot be determined at the time. Artificial or temporary PINs are not intended for continuous, long-term use and they are not usually accepted by PIN validity checking algorithms.

Temporary PINs provide the same information about an individual’s birth date and sex as regular PINs. Temporary PINs can also be safely used for testing purposes, as such a number cannot be linked to any real person.
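The rules for the individual number can be summarised in a short base R sketch. The function classify_p_num_sketch and the column in.normal.range are hypothetical names used only for illustration; the other column names mirror those used by the package output later in this vignette.

```r
# Hypothetical helper: classify individual numbers (the ZZZ part of a PIN).
classify_p_num_sketch <- function(p_num) {
  n <- as.integer(p_num)
  data.frame(
    p.num = p_num,
    # Odd individual numbers denote males, even ones females
    sex = ifelse(n %% 2 == 1, "Male", "Female"),
    # 900-999 are reserved for temporary or artificial PINs
    is.temp = n >= 900,
    # 002-899 is the range in normal use
    in.normal.range = n >= 2 & n <= 899,
    stringsAsFactors = FALSE
  )
}

classify_p_num_sketch(c("111", "010", "900", "001"))
```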

Personal identification numbers (HETU)

The basic hetu function can be used to view the information included in a Finnish personal identification number. The result is returned as a data frame.

example_pin <- "111111-111C"
hetu(example_pin)
#>          hetu  sex p.num ctrl.char       date day month year century valid.pin
#> 1 111111-111C Male   111         C 1911-11-11  11    11 1911       -      TRUE

The output can be made prettier, for example by using knitr:

knitr::kable(hetu(example_pin))
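|hetu        |sex  |p.num |ctrl.char |date       | day| month| year|century |valid.pin |
|:-----------|:----|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|
|111111-111C |Male |111   |C         |1911-11-11 |  11|    11| 1911|-       |TRUE      |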

The hetu function also accepts vectors with several identification numbers as input:

example_pins <- c("010101-0101", "111111-111C")
knitr::kable(hetu(example_pins))
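|hetu        |sex    |p.num |ctrl.char |date       | day| month| year|century |valid.pin |
|:-----------|:------|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|
|010101-0101 |Female |010   |1         |1901-01-01 |   1|     1| 1901|-       |TRUE      |
|111111-111C |Male   |111   |C         |1911-11-11 |  11|    11| 1911|-       |TRUE      |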

The hetu function does not print warning messages if the input vector contains invalid PINs. The validity of each PIN can be checked from the valid.pin column.

hetu(c("010101-0102", "111311-111C", "010101-0101"))
#>          hetu    sex p.num ctrl.char       date day month year century
#> 1 010101-0102 Female   010         2 1901-01-01   1     1 1901       -
#> 2 111311-111C   Male   111         C       <NA>  11    NA 1911       -
#> 3 010101-0101 Female   010         1 1901-01-01   1     1 1901       -
#>   valid.pin
#> 1     FALSE
#> 2     FALSE
#> 3      TRUE

Extracting specific information

Information contained in the PIN can be extracted with the generic extract parameter. Valid values for extraction are hetu, sex, p.num, ctrl.char, date, day, month, year, century, valid.pin and is.temp.

is.temp can be extracted only if allow.temp is set to TRUE. If allow.temp is set to FALSE (default), temporary PINs are filtered from the output and information provided by is.temp would be meaningless.

hetu(example_pins, extract = "sex")
#> [1] "Female" "Male"
hetu(example_pins, extract = "ctrl.char")
#> [1] "1" "C"

Some fields can be extracted with specialized functions. Extracting sex with the hetu_sex function:

hetu_sex(example_pins)
#> [1] "Female" "Male"

Extracting age at the current date and at a given date with the hetu_age function:

hetu_age(example_pins)
#> The age in years has been calculated at 2022-05-20.
#> [1] 121 110
hetu_age(example_pins, date = "2012-01-01")
#> The age in years has been calculated at 2012-01-01.
#> [1] 111 100
hetu_age(example_pins, timespan = "months")
#> The age in months has been calculated at 2022-05-20.
#> [1] 1456 1326
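To make the age arithmetic concrete, the following base R sketch counts completed years between a birth date and a reference date. It is an illustration under stated assumptions, not the method used by hetu_age; the name age_years_sketch is hypothetical.

```r
# Hypothetical helper: completed years between birth dates and a reference date.
age_years_sketch <- function(birth, at) {
  birth <- as.POSIXlt(as.Date(birth))
  at <- as.POSIXlt(as.Date(at))
  years <- at$year - birth$year
  # Subtract one year if the birthday has not yet occurred in the reference year
  before_birthday <- (at$mon < birth$mon) |
    (at$mon == birth$mon & at$mday < birth$mday)
  years - as.integer(before_birthday)
}

age_years_sketch(c("1901-01-01", "1911-11-11"), "2012-01-01")
#> [1] 111 100
```

The result agrees with the hetu_age output for the same reference date shown above.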

Dates (birth dates) also have their own function, hetu_date.

hetu_date(example_pins)
#> [1] "1901-01-01" "1911-11-11"

Validity checking

The basic hetu function output includes information on the validity of each PIN, which can be extracted by calling the hetu function with valid.pin as the extract parameter.

The validity of the PINs can also be determined by using the hetu_ctrl function, which produces a vector:

hetu_ctrl(c("010101-0101", "111111-111C")) # TRUE TRUE
#> [1] TRUE TRUE
hetu_ctrl("010101-1010") # FALSE
#> [1] FALSE

Artificial and temporary personal identification numbers

The package functions can be made to accept artificial or temporary personal identification numbers. Such PINs are handled like regular PINs when they are allowed through the allow.temp parameter.

example_temp_pin <- "010101A900R"
knitr::kable(hetu(example_temp_pin, allow.temp = TRUE))
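|hetu        |sex    |p.num |ctrl.char |date       | day| month| year|century |valid.pin |is.temp |
|:-----------|:------|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|:-------|
|010101A900R |Female |900   |R         |2001-01-01 |   1|     1| 2001|A       |TRUE      |TRUE    |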

If allow.temp is not set to TRUE, a vector that mixes regular and temporary PINs returns only the regular PINs. The temporary PINs are omitted silently, without a warning or error message, so users who want to work with temporary PINs need to be careful.

If temporary PINs are not explicitly allowed and the input vector consists of temporary PINs only, the function will return an error.

example_temp_pins <- c("010101A900R", "010101-0101")
hetu_ctrl("010101A900R", allow.temp = FALSE)
#> [1] NA
knitr::kable(hetu(example_temp_pins))
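|   |hetu        |sex    |p.num |ctrl.char |date       | day| month| year|century |valid.pin |
|:--|:-----------|:------|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|
|2  |010101-0101 |Female |010   |1         |1901-01-01 |   1|     1| 1901|-       |TRUE      |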

When allow.temp is set to TRUE, all PINs are handled as if they were regular PINs.

knitr::kable(hetu(example_temp_pins, allow.temp = TRUE))
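|hetu        |sex    |p.num |ctrl.char |date       | day| month| year|century |valid.pin |is.temp |
|:-----------|:------|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|:-------|
|010101A900R |Female |900   |R         |2001-01-01 |   1|     1| 2001|A       |TRUE      |TRUE    |
|010101-0101 |Female |010   |1         |1901-01-01 |   1|     1| 1901|-       |TRUE      |FALSE   |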
hetu_ctrl("010101A900R", allow.temp = TRUE)
#> [1] TRUE

Validation function hetu_ctrl produces a FALSE for every artificial / temporary PIN, if they are not explicitly allowed.

knitr::kable(hetu(example_temp_pins)) #FALSE TRUE
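|   |hetu        |sex    |p.num |ctrl.char |date       | day| month| year|century |valid.pin |
|:--|:-----------|:------|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|
|2  |010101-0101 |Female |010   |1         |1901-01-01 |   1|     1| 1901|-       |TRUE      |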
knitr::kable(hetu(example_temp_pins, allow.temp = TRUE)) #TRUE TRUE
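|hetu        |sex    |p.num |ctrl.char |date       | day| month| year|century |valid.pin |is.temp |
|:-----------|:------|:-----|:---------|:----------|---:|-----:|----:|:-------|:---------|:-------|
|010101A900R |Female |900   |R         |2001-01-01 |   1|     1| 2001|A       |TRUE      |TRUE    |
|010101-0101 |Female |010   |1         |1901-01-01 |   1|     1| 1901|-       |TRUE      |FALSE   |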

Generating random PINs

Random PINs can be generated by using the rhetu function.

rhetu(n = 4)
#> [1] "070502-3401" "030388-1862" "290391-7615" "151219A8600"
rhetu(n = 4, start_date = "1990-01-01", end_date = "2005-01-01")
#> [1] "151190-6358" "040494-121Y" "021297-2170" "280899-296L"

The proportion of males in the generated sample can be changed with the parameter p.male. The default is 0.4.

random_sample <- rhetu(n = 4, p.male = 0.8)
table(random_sample)
#> random_sample
#> 030799+449L 120845-060R 220783-518Y 260661-539R 
#>           1           1           1           1

The default proportion of artificial / temporary PINs is 0.0, meaning that no artificial / temporary PINs are generated by default.

temp_sample <- rhetu(n = 4, p.temp = 0.5)
table(hetu(temp_sample, allow.temp = TRUE, extract = "is.temp"))
#> 
#> FALSE 
#>     4

Diagnostics

In addition to the information mentioned in the section Extracting specific information, the user can choose to print additional columns containing information about the checks done on PINs. The diagnostic checks produce a TRUE or FALSE for the following categories: valid.p.num, valid.ctrl.char, correct.ctrl.char, valid.date, valid.day, valid.month, valid.year, valid.length and valid.century. FALSE means that the PIN fails that particular check.

diagnosis_example <- c("010101-0102", "111111-111Q", 
"010101B0101", "320101-0101", "011301-0101", 
"010101-01010", "010101-0011")
head(hetu(diagnosis_example, diagnostic = TRUE), 3)
#>          hetu    sex p.num ctrl.char       date day month year century
#> 1 010101-0102 Female   010         2 1901-01-01   1     1 1901       -
#> 2 111111-111Q   Male   111         Q 1911-11-11  11    11 1911       -
#> 3 010101B0101 Female   010         1       <NA>   1     1   NA       B
#>   valid.pin valid.p.num valid.ctrl.char correct.ctrl.char valid.date valid.day
#> 1     FALSE        TRUE            TRUE             FALSE       TRUE      TRUE
#> 2     FALSE        TRUE           FALSE             FALSE       TRUE      TRUE
#> 3     FALSE        TRUE            TRUE              TRUE      FALSE      TRUE
#>   valid.month valid.year valid.length valid.century
#> 1        TRUE       TRUE         TRUE          TRUE
#> 2        TRUE       TRUE         TRUE          TRUE
#> 3        TRUE       TRUE         TRUE         FALSE
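To show what such checks involve, here is a base R sketch that reproduces three of them (length, century marker and date). It is not the package's implementation: diagnose_sketch is a hypothetical name, and the set of accepted century markers ("+", "-", "A") is an assumption of this sketch.

```r
# Hypothetical helper: a few diagnostic checks on a vector of PINs.
diagnose_sketch <- function(pin) {
  marker <- substr(pin, 7, 7)
  # Assumed century markers and the first two digits of the year they imply
  centuries <- c("+" = "18", "-" = "19", "A" = "20")
  valid_century <- marker %in% names(centuries)
  # Build YYYY-MM-DD; as.Date() returns NA for impossible dates
  date_string <- paste0(
    unname(centuries[marker]), substr(pin, 5, 6), "-",
    substr(pin, 3, 4), "-", substr(pin, 1, 2)
  )
  valid_date <- !is.na(as.Date(date_string, format = "%Y-%m-%d"))
  data.frame(
    hetu = pin,
    valid.length = nchar(pin) == 11,
    valid.century = valid_century,
    valid.date = valid_century & valid_date,
    stringsAsFactors = FALSE
  )
}

diagnose_sketch(c("010101-0102", "010101B0101", "320101-0101", "010101-01010"))
```

Each check is computed independently, so a PIN can fail one check (for example valid.length) while passing the others.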

Diagnostic information can be examined more closely by using subset or by using the separate hetu_diagnostic function. The user can print all diagnostic information for all PINs in the dataset:

tail(hetu_diagnostic(diagnosis_example), 3)
#>           hetu is.temp valid.p.num valid.ctrl.char correct.ctrl.char valid.date
#> 5  011301-0101   FALSE        TRUE            TRUE             FALSE      FALSE
#> 6 010101-01010   FALSE        TRUE            TRUE              TRUE       TRUE
#> 7  010101-0011   FALSE       FALSE            TRUE             FALSE       TRUE
#>   valid.day valid.month valid.year valid.length valid.century
#> 5      TRUE       FALSE       TRUE         TRUE          TRUE
#> 6      TRUE        TRUE       TRUE        FALSE          TRUE
#> 7      TRUE        TRUE       TRUE         TRUE          TRUE

By using the extract parameter, the user can choose which columns will be printed in the output table. Valid extract values are listed in the function’s help file.

# Names must match the diagnostic column names shown above; an unknown name
# such as "correct.checksum" stops with "Trying to extract invalid diagnostic(s)"
hetu_diagnostic(diagnosis_example, extract = c("valid.century", "correct.ctrl.char"))

Because of the way PINs are handled inside the hetu function, the diagnostic function can show unexpected warning messages or introduce NAs by coercion if the date part of the PIN is too long. This may make it impossible to handle the PIN at all.

# Faulty example
hetu_diagnostic(c("01011901-01010"))

Business Identity Codes (BID)

The package can also generate Finnish Business ID codes (y-tunnus) and check their validity. Unlike personal identification numbers, Business IDs contain no additional information that could be extracted.

Generating random BIDs

Similar to hetu PINs, random Finnish Business IDs (y-tunnus) can be generated by using the rbid function.

bid_sample <- rbid(3)
bid_sample
#> [1] "0991107-0" "8377128-0" "1286283-9"

BID validity checking

The validity of Finnish Business Identity Codes can be checked with a similar function to hetu_ctrl: bid_ctrl.

bid_ctrl(c("0737546-2", "1572860-0")) # TRUE TRUE
#> [1] TRUE TRUE
bid_ctrl("0737546-1") # FALSE
#> [1] FALSE
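The vignette does not describe how the check digit is computed, so the following base R sketch relies on the commonly documented rule for Finnish Business IDs, stated here as an assumption: the seven digits are multiplied by the weights 7, 9, 10, 5, 8, 4, 2, the products are summed and the sum is taken modulo 11. A remainder of 0 gives check digit 0, a remainder of 1 means no valid ID exists, and otherwise the check digit is 11 minus the remainder. The name bid_check_sketch is hypothetical and this is not the code behind bid_ctrl.

```r
# Hypothetical helper: verify the check digit of Business IDs such as "0737546-2".
bid_check_sketch <- function(bid) {
  # Expect seven digits, a hyphen and one check digit
  well_formed <- grepl("^[0-9]{7}-[0-9]$", bid)
  weights <- c(7, 9, 10, 5, 8, 4, 2)  # assumed weights, see the text above
  vapply(seq_along(bid), function(i) {
    if (!well_formed[i]) return(FALSE)
    digits <- as.integer(strsplit(substr(bid[i], 1, 7), "")[[1]])
    remainder <- sum(digits * weights) %% 11
    if (remainder == 1) return(FALSE)  # no valid check digit exists
    expected <- if (remainder == 0) 0 else 11 - remainder
    expected == as.integer(substr(bid[i], 9, 9))
  }, logical(1))
}

bid_check_sketch(c("0737546-2", "1572860-0", "0737546-1"))
#> [1]  TRUE  TRUE FALSE
```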

Various examples

Data frames generated by the hetu function also work well with tidyverse/dplyr workflows.

library(hetu)
library(tidyverse)
library(dplyr)

# Generate data for this example
hdat <- tibble(pin = rhetu(n = 4, start_date = "1990-01-01", end_date = "2005-01-01"))

# Extract all the hetu information to tibble format
hdat <- hdat %>%
  mutate(result = map(.x = pin, .f = hetu::hetu)) %>%
  unnest(cols = c(result))
hdat

Licensing and Citations

This work can be freely used, modified and distributed under the open license specified in the DESCRIPTION file.

Kindly cite the work as follows:

citation("hetu")
#> 
#> Kindly cite the hetu R package as follows:
#> 
#>   Pyry Kantanen, Mans Magnusson, Jussi Paananen and Leo Lahti (rOpenGov
#>   2022). hetu: Structural Handling of Finnish Personal Identity Codes.
#>   R package version 1.0.7 URL: http://github.com/rOpenGov/hetu
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Misc{,
#>     title = {hetu: Structural Handling of Finnish Personal Identity Codes},
#>     author = {Pyry Kantanen and Mans Magnusson and Jussi Paananen and Leo Lahti},
#>     url = {https://github.com/rOpenGov/hetu},
#>     year = {2022},
#>     note = {R package version 1.0.7},
#>   }
#> 
#> Many thanks for all contributors!

Session info

This vignette was created with

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur/Monterey 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/fi_FI.UTF-8/fi_FI.UTF-8/C/fi_FI.UTF-8/fi_FI.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] hetu_1.0.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] lubridate_1.8.0 digest_0.6.29   R6_2.5.1        backports_1.4.1
#>  [5] jsonlite_1.8.0  magrittr_2.0.3  evaluate_0.15   highr_0.9      
#>  [9] stringi_1.7.6   rlang_1.0.2     cli_3.3.0       jquerylib_0.1.4
#> [13] bslib_0.3.1     generics_0.1.2  checkmate_2.1.0 rmarkdown_2.14 
#> [17] tools_4.2.0     stringr_1.4.0   parallel_4.2.0  xfun_0.31      
#> [21] yaml_2.3.5      fastmap_1.1.0   compiler_4.2.0  htmltools_0.5.2
#> [25] knitr_1.39      sass_0.4.1