Navigating the shift of clinical laboratory data from primary everyday clinical use to secondary research purposes presents a significant challenge. Given the substantial time and expertise required to preprocess and clean these data, and the lack of all-in-one tools tailored for this need, we developed our algorithm lab2clean as an open-source R package. The lab2clean package is set to automate and standardize the intricate process of cleaning clinical laboratory results. With a keen focus on improving the data quality of laboratory result values and units, our goal is to equip researchers with a straightforward, plug-and-play tool, making it smoother for them to unlock the true potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development.
The lab2clean package contains four key functions. Two functions to clean & validate result values (Version 1.0) are described in detail in Zayed et al. (2024) [https://doi.org/10.1186/s12911-024-02652-7]. The clean_lab_result() function cleans and standardizes the laboratory results, and the validate_lab_result() function performs validation to ensure the plausibility of these results. The other two functions to standardize & harmonize result units (added in Version 2.0) are described in detail in Zayed et al. (2025) [https://doi.org/10.1016/j.ijmedinf.2025.106131]. The standardize_lab_unit() function cleans and standardizes the formats of laboratory units of measurement according to the Unified Code for Units of Measure (UCUM), and the harmonize_lab_unit() function harmonizes the units found in a laboratory dataset to reference units following either SI or Conventional units, converting the numeric result values accordingly.

This vignette aims to explain the theoretical background, usage, and customization of these functions.
You can install and load the lab2clean package directly in R:
#install.packages("lab2clean")
After installation, load the package:
library(lab2clean)
The clean_lab_result() function has five arguments:

- lab_data: A dataset containing laboratory data.
- raw_result: The column in lab_data that contains the raw result values to be cleaned.
- locale: A string representing the locale of the laboratory data. Defaults to "NO".
- report: If TRUE, a report is written to the console. Defaults to TRUE.
- n_records: In case you are loading a grouped list of distinct results, you can assign n_records to the column that contains the frequency of each distinct result. Defaults to NA.
Let us demonstrate the clean_lab_result() function using Function_1_dummy and inspect the first six rows:
data("Function_1_dummy", package = "lab2clean")
head(Function_1_dummy,6)
raw_result | frequency |
---|---|
? | 108 |
* | 243 |
[ | 140 |
_ | 268 |
1.1 x 10^9 | 284 |
2.34 x 10E12 | 42 |
This dataset (for demonstration purposes) contains two columns: raw_result and frequency. The raw_result column holds raw laboratory results, and frequency indicates how often each result appeared. Let's explore the report and n_records arguments:
cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = TRUE, n_records = "frequency")
#> Step 1: Handling records with extra variables stored with the result value (removing interpretative flags or units)
#> ==========================================================================================
#> ⚠ 8 distinct results (8.742% of the total result records) with interpretative flags (e.g. positive, negative, H, L) -> flags removed with cleaning comment added (flag).
#> ⚠ 17 distinct results (20.043% of the total result records) with unit (%, exponents, or other units) -> units removed with cleaning comment added (Percent, Exponent, or Units).
#> Step 2: Classify and standardize different scale types - part 1
#> ==========================================================================================
#> ✔ 3 distinct results (5.373% of the total result records) of scale type 'Ord.2', which describes grades of positivity (e.g. 2+, 3+).
#> ✔ 7 distinct results (7.966% of the total result records) of scale type 'Qn.2', which describes inequality results (e.g. >120, <1).
#> ✔ 4 distinct results (6.233% of the total result records) of scale type 'Qn.3', which describes numeric range results (e.g. 2-4).
#> ✔ 4 distinct results (3.092% of the total result records) of scale type 'Qn.4', which describes titer results (e.g. 1/40).
#> ✔ 55 distinct results (61.335% of the total result records) of scale type 'Qn.1', which describes numeric results (e.g. 56, 5.6, 5600).
#> ⚠ 4 distinct results (4.853% of the total result records) with numeric result values that cannot be determined without a predefined locale setting (US or DE) -> cleaning comment added (locale_check).
#> ✔ 4 distinct results (4.888% of the total result records) of scale type 'Ord.1', which describes positive or negative results (Neg, Pos, or Normal).
#> ✔ 1 distinct results (1.019% of the total result records) of scale type 'Nom.1', which describes blood groups (e.g. A+, AB).
#> Last Step: Classifying non-standard text records
#> ==========================================================================================
#> ⚠ 0 distinct results (0% of the total result records) with multiple result values (e.g. positive X & negative Y) -> cleaning comment added (multiple_results).
#> ⚠ 0 distinct results (0% of the total result records) with words about sample or specimen (e.g. sample not found) -> cleaning comment added (test_not_performed).
#> ⚠ 8 distinct results (8.777% of the total result records) with meaningless inputs (e.g. =, .) -> cleaning comment added (No_result).
#> ⚠ 1 distinct results (1.317% of the total result records) that could not be standardized or classified -> cleaning comment added (not_standardized).
#> ==========================================================================================
#> ✔ 78 distinct results (89.906% of the total result records) were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>
The report argument provides a detailed account of how the whole cleaning process is carried out and offers some descriptive insights into the process. The n_records argument adds percentages to each of the aforementioned steps to enhance the reporting. For simplicity, we will use report = FALSE in the rest of this tutorial:
cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = FALSE)
#> ✔ 78 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>
cleaned_results
raw_result | frequency | clean_result | scale_type | cleaning_comments |
---|---|---|---|---|
? | 108 | NA | NA | No_result |
* | 243 | NA | NA | No_result |
[ | 140 | NA | NA | No_result |
_ | 268 | NA | NA | No_result |
1.1 x 10^9 | 284 | 1.1 | Qn.1 | Exponents |
2.34 x 10E12 | 42 | 2.34 | Qn.1 | Exponents |
2,34 X 10^12 | 173 | 2.34 | Qn.1 | Exponents |
3.14159 * 10^30 | 271 | 3.142 | Qn.1 | Exponents |
1.1x10+9 | 179 | 1.1 | Qn.1 | Exponents |
2,34X10^12 | 153 | 2.34 | Qn.1 | Exponents |
3.14159*10^30 | 288 | 3.142 | Qn.1 | Exponents |
3.142*10^30 | 152 | 3.142 | Qn.1 | Exponents |
1,1 x 10e9 | 213 | 1.1 | Qn.1 | Exponents |
3 | 185 | 3 | Qn.1 | NA |
1.1 x 10^-9 | 58 | 1.1 | Qn.1 | Exponents |
2.34 X 10-12 | 273 | 2.34 | Qn.1 | Exponents |
3.14159E-30 | 96 | 3.142 | Qn.1 | Exponents |
1x10^9 | 41 | 1 | Qn.1 | Exponents |
1E9 | 119 | 1 | Qn.1 | Exponents |
2+ | 288 | 2+ | Ord.2 | NA |
+ | 270 | 1+ | Ord.2 | NA |
+++ | 217 | 3+ | Ord.2 | NA |
0-1 | 203 | 0-1 | Qn.3 | NA |
1-2 | 298 | 1-2 | Qn.3 | NA |
1- | 207 | 1 | Qn.1 | flag |
01-02 | 221 | 1-2 | Qn.3 | NA |
1 -2 | 177 | 1-2 | Qn.3 | NA |
3 - 2 | 190 | NA | NA | not_standardized |
- | 108 | Neg | Ord.1 | NA |
+ 230 | 70 | 230 | Qn.1 | flag |
100* | 290 | 100 | Qn.1 | NA |
+56 | 274 | 56 | Qn.1 | flag |
- 5 | 216 | 5 | Qn.1 | flag |
80% | 245 | 80 | Qn.1 | Percent |
-5 | 37 | -5 | Qn.1 | NA |
> 12 | 159 | >12 | Qn.2 | NA |
<1050 | 235 | <1050 | Qn.2 | NA |
< 02 | 88 | <2 | Qn.2 | NA |
>= 20.3 | 116 | >=20.3 | Qn.2 | NA |
>1:40 | 93 | >1:40 | Qn.4 | NA |
1/80 | 69 | 1:80 | Qn.4 | NA |
<1/20 | 142 | <1:20 | Qn.4 | NA |
< 1/020 | 142 | <1:020 | Qn.4 | NA |
= | 130 | NA | NA | No_result |
/ | 71 | NA | NA | No_result |
0.2 | 67 | 0.2 | Qn.1 | NA |
33 Normal | 93 | 33 | Qn.1 | flag |
negative 0.1 | 156 | 0.1 | Qn.1 | flag |
H 256 | 102 | 256 | Qn.1 | flag |
30% | 262 | 30 | Qn.1 | Percent |
23 % | 42 | 23 | Qn.1 | Percent |
1056 | 149 | 1056 | Qn.1 | NA |
1056040 | 246 | 1056040 | Qn.1 | NA |
3560 | 63 | 3560 | Qn.1 | NA |
0,3 | 181 | 0.3 | Qn.1 | NA |
15,6 | 86 | 15.6 | Qn.1 | NA |
2.9 | 64 | 2.9 | Qn.1 | NA |
02.9 | 233 | 2.9 | Qn.1 | NA |
2.90 | 272 | 2.9 | Qn.1 | NA |
250 | 131 | 250 | Qn.1 | NA |
1.025 | 210 | 1.025 | Qn.1 | locale_check |
1.025 | 56 | 1.025 | Qn.1 | locale_check |
1025 | 134 | 1025 | Qn.1 | NA |
1025 | 104 | 1025 | Qn.1 | NA |
1025.7 | 250 | 1025.7 | Qn.1 | NA |
1.025,7 | 151 | 1025.7 | Qn.1 | NA |
1.025,36 | 249 | 1025.36 | Qn.1 | NA |
1,025.36 | 249 | 1025.36 | Qn.1 | NA |
>1.025,36 | 244 | >1025.36 | Qn.2 | NA |
<=1,025.36 | 149 | <=1025.36 | Qn.2 | NA |
1.015 | 234 | 1.015 | Qn.1 | locale_check |
1,060 | 200 | 1,060 | Qn.1 | locale_check |
2,5 | 222 | 2.5 | Qn.1 | NA |
2.5 | 30 | 2.5 | Qn.1 | NA |
>3,48 | 158 | >3.48 | Qn.2 | NA |
3.48 | 89 | 3.48 | Qn.1 | NA |
93 | 133 | 93 | Qn.1 | NA |
,825 | 195 | 0.825 | Qn.1 | NA |
0,825 | 125 | 0.825 | Qn.1 | NA |
1.256894 | 60 | 1.257 | Qn.1 | NA |
. | 96 | NA | NA | No_result |
, | 210 | NA | NA | No_result |
Négatif 0.3 | 143 | 0.3 | Qn.1 | flag |
Négatif | 243 | Neg | Ord.1 | NA |
Pøsitivo | 58 | Pos | Ord.1 | NA |
A+ | 147 | A | Nom.1 | NA |
pos & negative Y | 296 | Neg | Ord.1 | NA |
This function creates three different columns:

1- clean_result: The cleaned version of the raw_result column. For example, "?" is converted to NA.

2- scale_type: Categorizes the cleaned results into specific types like Quantitative (Qn), Ordinal (Ord), or Nominal (Nom), with further subcategories for nuanced differences, such as differentiating simple numeric results (Qn.1) from inequalities (Qn.2), range results (Qn.3), or titer results (Qn.4) within the Quantitative scale.

3- cleaning_comments: Provides insights on how the results were cleaned.
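For a quick overview of how the cleaned results are distributed across these new columns, plain base R tabulation of the cleaned_results object created above works well; for example:

table(cleaned_results$scale_type)
table(cleaned_results$cleaning_comments, useNA = "ifany")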
The process above provided a generic description of how the clean_lab_result() function operates. It is useful to delve into more detail on the exact way that some specific raw results are cleaned:
Locale variable:

In the clean_lab_result() function, we have an argument named locale. It addresses the variations in number formats with different decimal and thousand separators that arise from locale-specific settings used internationally. We chose to standardize these varying languages and locale-specific settings so that the cleaned results follow the English (US) format. If the user does not specify the locale of the dataset, the default is NO, which means not specified. For example, for rows 71 and 72 there is a locale_check in the cleaning_comments, and the results are 1.015 and 1,060 respectively. This means that either the "US" or "DE" locale should be specified to interpret these result values. If we specify the locale as US or DE, we see different values as follows:
Function_1_dummy_subset <- Function_1_dummy[c(71,72),, drop = FALSE]
cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "US")
#> ✔ 2 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>
cleaned_results
 | raw_result | frequency | clean_result | scale_type | cleaning_comments |
---|---|---|---|---|---|
71 | 1.015 | 234 | 1.015 | Qn.1 | |
72 | 1,060 | 200 | 1060 | Qn.1 |
cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "DE")
#> ✔ 2 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>
cleaned_results
 | raw_result | frequency | clean_result | scale_type | cleaning_comments |
---|---|---|---|---|---|
71 | 1.015 | 234 | 1015 | Qn.1 | |
72 | 1,060 | 200 | 1.06 | Qn.1 |
Language in common words:

In the clean_lab_result() function, we support 19 distinct languages in representing frequently used terms such as "high", "low", "positive", and "negative". For example, the word Pøsitivo is included in the common words and will be cleaned as Pos.
Let us see how this data table works in our function:
data("common_words", package = "lab2clean")
common_words
Language | Positive | Negative | Not_detected | High | Low | Normal | Sample | Specimen |
---|---|---|---|---|---|---|---|---|
English | Positive | Negative | Not detected | High | Low | Normal | Sample | Specimen |
Spanish | Positivo | Negativo | No detectado | Alto | Bajo | Normal | Muestra | Especimen |
Portuguese | Positivo | Negativo | Nao detectado | Alto | Baixo | Normal | Amostra | Especime |
French | Positif | Negatif | Non detecte | Eleve | Bas | Normal | Echantillon | Specimen |
German | Positiv | Negativ | Nicht erkannt | Hoch | Niedrig | Normal | Probe | Probe |
Italian | Positivo | Negativo | Non rilevato | Alto | Basso | Normale | Campione | Campione |
Dutch | Positief | Negatief | Niet gedetecteerd | Hoog | Laag | Normaal | Staal | Monster |
Polish | Dodatni | Ujemny | Nie wykryto | Wysoki | Niski | Normalny | Probka | Probka |
Swedish | Positiv | Negativ | Inte upptackt | Hog | Lag | Normal | Prov | Prov |
Danish | Positiv | Negativ | Ikke opdaget | Hoj | Lav | Normal | Prove | Prove |
Norwegian | Positiv | Negativ | Ikke oppdaget | Hoy | Lav | Normal | Prove | Prove |
Finnish | Positiivinen | Negatiivinen | Ei havaittu | Korkea | Matala | Normaali | Nayte | Nayte |
Czech | Pozitivni | Negativni | Nezjisteno | Vysoky | Nizky | Normalni | Vzorek | Vzorek |
Hungarian | Pozitiv | Negativ | Nem eszlelt | Magas | Alacsony | Normal | Mintavetel | Mintadarab |
Croatian | Pozitivan | Negativan | Nije otkriveno | Visok | Nizak | Normalan | Uzorak | Uzorak |
Slovak | Pozitivny | Negativny | Nezistene | Vysoky | Nizky | Normalny | Vzorka | Vzorka |
Slovenian | Pozitiven | Negativen | Ni zaznano | Visok | Nizek | Normalno | Vzorec | Vzorec |
Estonian | Positiivne | Negatiivne | Ei tuvastatud | Korge | Madal | Normaalne | Proov | Proov |
Lithuanian | Teigiamas | Neigiamas | Neaptiktas | Aukstas | Zemas | Normalus | Pavyzdys | Pavyzdys |
As seen in this data, there are 19 languages for 8 common words. If the word is positive or negative, the result is cleaned to Pos or Neg, unless the result also contains a numeric value, in which case the word is removed and a flag is added to the cleaning_comments. For example, Négatif 0.3 is cleaned as 0.3 and 33 Normal is cleaned as 33. Finally, if the result contains one of the words "Sample" or "Specimen", a cleaning comment is added indicating that the test was not performed (test_not_performed).
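To see this behaviour in action, you can run the function on a couple of toy values (a small, hypothetical input data frame constructed here purely for illustration; the column name raw_text is arbitrary):

toy_words <- data.frame(raw_text = c("Négatif 0.3", "33 Normal", "sample not found"))
clean_lab_result(toy_words, raw_result = "raw_text", report = FALSE)

Based on the rules above, the first two values should be cleaned to 0.3 and 33 with a flag comment, and the last one should receive a test_not_performed comment.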
Flag creation:

In addition to the common words, when there is a space between a numeric value and a minus character, this also creates a flag. For example, the result - 5 is cleaned as 5 with a flag, but the result -5 is cleaned as -5 and no flag is created, because we can assume it was a negative value.
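A quick way to check this distinction yourself (again with a small, hypothetical two-row input):

minus_example <- data.frame(raw_text = c("- 5", "-5"))
clean_lab_result(minus_example, raw_result = "raw_text", report = FALSE)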
The validate_lab_result() function has seven arguments:

- lab_data: A data frame containing laboratory data.
- result_value: The column in lab_data with quantitative result values for validation.
- result_unit: The column in lab_data with result units in a UCUM-valid format.
- loinc_code: The column in lab_data indicating the LOINC code of the laboratory test.
- patient_id: The column in lab_data indicating the identifier of the tested patient.
- lab_datetime: The column in lab_data with the date or datetime of the laboratory test.
- report: If TRUE, a report is written to the console. Defaults to TRUE.
Let us check how our package validates the results using the validate_lab_result() function. Let us consider the Function_2_dummy data that contains 86,863 rows and inspect its first 6 rows:
data("Function_2_dummy", package = "lab2clean")
head(Function_2_dummy, 6)
patient_id | lab_datetime1 | loinc_code | result_value | result_unit |
---|---|---|---|---|
10000003 | 2023-08-09 | 1975-2 | 19 | umol/L |
10000003 | 2023-08-09 | 1968-7 | 20 | umol/L |
10000003 | 2023-09-09 | 1975-2 | 19 | mmol/L |
10000003 | 2023-09-09 | 1968-7 | 20 | umol/L |
10000003 | 2023-09-09 | 1968-7 | 20 | umol/L |
10000011 | 2023-10-09 | 1975-2 | 19 | umol/L |
Let us apply the validate_lab_result() function and see its functionality:
validate_results <- validate_lab_result(Function_2_dummy,
result_value="result_value",
result_unit="result_unit",
loinc_code="loinc_code",
patient_id = "patient_id" ,
lab_datetime="lab_datetime1")
#> Preprocessing Step for Duplicate Records
#> ===============================================================================================
#> ⚠ 166 duplicate records were flagged.
#> These are multiple records of the same test for the same patient at the same result timestamp.
#> Check 1: Reportable Limits Check
#> ===============================================================================================
#> ⚠ 5 extremely low result records were flagged (low_unreportable).
#> ⚠ 2 extremely high records were flagged (high_unreportable).
#> Check 2: Logic Consistency Checks
#> ===============================================================================================
#> ⚠ 7 result records were flagged for violating relational logic rules (logic_flag).
#> Check 3: Delta Change Limits Checks
#> ===============================================================================================
#> ⚠ 55 records were flagged for having extreme change values from previous results within 7 days (delta_flag_7d).
#> ⚠ 15 records were flagged for having extreme change values from previous results within 8-90 days (delta_flag_8_90d).
#> ===============================================================================================
#> ✔ 99.712% of the lab data records were validated with no flag detected.
#> ⏰ Time taken is 0 min, 1.8 sec
#>
The validate_lab_result() function generates a flag column, with different checks:
head(validate_results, 6)
loinc_code | result_unit | patient_id | lab_datetime1 | result_value | flag |
---|---|---|---|---|---|
13457-7 | mg/dL | 1e+07 | 2023-09-09 | 100.0 | NA |
13457-7 | mg/dL | 1e+07 | 2023-10-09 | 100.0 | logic_flag |
1751-7 | g/dl | 1e+07 | 2023-08-09 | 3.1 | NA |
1751-7 | g/dl | 1e+07 | 2023-09-09 | 7.5 | logic_flag |
1751-7 | g/dl | 1e+07 | 2023-10-09 | 7.5 | NA |
18262-6 | mg/dL | 1e+07 | 2023-11-09 | 100.0 | NA |
levels(factor(validate_results$flag))
#> [1] "delta_flag_7d"     "delta_flag_8_90d"  "duplicate"
#> [4] "high_unreportable" "logic_flag"        "low_unreportable"
We can now subset specific patients to explain the flags:
subset_patients <- validate_results[validate_results$patient_id %in% c("14236258", "10000003", "14499007"), ]
subset_patients
loinc_code | result_unit | patient_id | lab_datetime1 | result_value | flag |
---|---|---|---|---|---|
13457-7 | mg/dL | 10000003 | 2023-09-09 | 100.0 | NA |
13457-7 | mg/dL | 10000003 | 2023-10-09 | 100.0 | logic_flag |
1751-7 | g/dl | 10000003 | 2023-08-09 | 3.1 | NA |
1751-7 | g/dl | 10000003 | 2023-09-09 | 7.5 | logic_flag |
1751-7 | g/dl | 10000003 | 2023-10-09 | 7.5 | NA |
18262-6 | mg/dL | 10000003 | 2023-11-09 | 100.0 | NA |
1968-7 | umol/L | 10000003 | 2023-08-09 | 20.0 | logic_flag |
1968-7 | umol/L | 10000003 | 2023-09-09 | 20.0 | duplicate |
1968-7 | umol/L | 10000003 | 2023-09-09 | 20.0 | duplicate |
1968-7 | umol/L | 10000003 | 2023-10-09 | 20.0 | NA |
1975-2 | umol/L | 10000003 | 2023-08-09 | 19.0 | logic_flag |
1975-2 | mmol/L | 10000003 | 2023-09-09 | 19.0 | NA |
2085-9 | mg/dL | 10000003 | 2023-09-09 | 130.0 | NA |
2085-9 | mg/dL | 10000003 | 2023-10-09 | 130.0 | logic_flag |
2085-9 | mg/dL | 10000003 | 2023-11-09 | 130.0 | NA |
2093-3 | mg/dL | 10000003 | 2023-08-09 | 230.0 | NA |
2093-3 | mg/dL | 10000003 | 2023-09-09 | 230.0 | duplicate |
2093-3 | mg/dL | 10000003 | 2023-09-09 | 215.0 | duplicate |
2093-3 | mg/dL | 10000003 | 2023-10-09 | 230.0 | logic_flag |
2093-3 | ng/dL | 10000003 | 2023-11-09 | 230.0 | NA |
2885-2 | g/dl | 10000003 | 2023-08-09 | 7.0 | NA |
2885-2 | g/dl | 10000003 | 2023-09-09 | 7.0 | logic_flag |
2885-2 | mg/dl | 10000003 | 2023-10-09 | 7.0 | NA |
2160-0 | mg/dL | 14236258 | 2180-11-23 22:30:00 | 13.2 | NA |
2160-0 | mg/dL | 14236258 | 2181-02-22 08:10:00 | 13.1 | NA |
2160-0 | mg/dL | 14236258 | 2181-03-07 11:00:00 | 9.4 | NA |
2160-0 | mg/dL | 14236258 | 2181-03-24 16:35:00 | 27.2 | delta_flag_8_90d |
2160-0 | mg/dL | 14236258 | 2181-03-25 06:25:00 | 16.8 | delta_flag_7d |
2160-0 | mg/dL | 14236258 | 2181-03-26 06:10:00 | 19.0 | NA |
2160-0 | mg/dL | 14236258 | 2181-04-02 10:00:00 | 9.7 | delta_flag_7d |
2160-0 | mg/dL | 14236258 | 2181-06-29 14:00:00 | 16.9 | delta_flag_8_90d |
2160-0 | mg/dL | 14236258 | 2181-06-30 05:32:00 | 10.8 | delta_flag_7d |
2160-0 | mg/dL | 14236258 | 2181-07-10 22:44:00 | 10.0 | NA |
2160-0 | mg/dL | 14236258 | 2181-07-10 23:25:00 | 10.3 | NA |
2160-0 | mg/dL | 14236258 | 2181-07-11 10:00:00 | 11.6 | NA |
2160-0 | mg/dL | 14236258 | 2181-07-12 02:30:00 | 13.6 | NA |
2160-0 | mg/dL | 14236258 | 2181-10-17 17:10:00 | 10.6 | NA |
2160-0 | mg/dL | 14236258 | 2181-10-18 06:40:00 | 12.6 | NA |
2160-0 | mg/dL | 14236258 | 2181-11-30 07:00:00 | 19.7 | delta_flag_8_90d |
2160-0 | mg/dL | 14236258 | 2181-12-17 06:44:00 | 12.1 | delta_flag_8_90d |
2160-0 | mg/dL | 14499007 | 2180-06-02 07:10:00 | 1.0 | NA |
2160-0 | mg/dL | 14499007 | 2180-10-26 15:00:00 | 0.8 | NA |
2160-0 | mg/dL | 14499007 | 2180-10-27 05:53:00 | 1.0 | NA |
2160-0 | mg/dL | 14499007 | 2180-10-27 15:15:00 | 0.0 | low_unreportable |
2160-0 | mg/dL | 14499007 | 2180-10-28 06:35:00 | 0.9 | NA |
2160-0 | mg/dL | 14499007 | 2180-10-29 05:52:00 | 1.0 | NA |
2160-0 | mg/dL | 14499007 | 2180-10-30 12:26:00 | 0.9 | NA |
2160-0 | mg/dL | 14499007 | 2180-10-31 03:11:00 | 0.8 | NA |
2160-0 | mg/dL | 14499007 | 2180-11-01 06:20:00 | 1.0 | NA |
2160-0 | mg/dL | 14499007 | 2180-11-02 04:22:00 | 1.0 | NA |
Patient 14236258 has several records flagged with delta_flag_7d and delta_flag_8_90d, which are calculated using lower and upper percentiles set to 0.0005 and 0.9995, respectively. While the delta check is effective in identifying potentially erroneous result values, we acknowledge that it may also flag clinically relevant changes. Therefore, it is crucial that users interpret these flagged results in conjunction with the patient's clinical context.
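A convenient way to pull out all flagged records for such a manual review is plain base R subsetting of the validate_results object created above, for example:

flagged_records <- validate_results[!is.na(validate_results$flag), ]
table(flagged_records$flag)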
Let us also explain two tables that we used for the validation function. Let us begin with the reportable interval table.
data("reportable_interval", package = "lab2clean")
reportable_interval_subset <- reportable_interval[reportable_interval$interval_loinc_code == "2160-0", ]
reportable_interval_subset
interval_loinc_code | UCUM_unit | low_reportable_limit | high_reportable_limit |
---|---|---|---|
2160-0 | mg/dL | 1e-04 | 120 |
Patient 14499007 has a record flagged as low_unreportable. As we can see, for the "2160-0" LOINC code, the result value was 0.0, which is not in the reportable range (0.0001, 120). On a similar note, patient 17726236 has a record flagged as high_unreportable.
Logic rules ensure that related test results are consistent:
data("logic_rules", package = "lab2clean")
logic_rules <- logic_rules[logic_rules$rule_id == 3, ]
logic_rules
rule_id | rule_index | rule_part | rule_part_type |
---|---|---|---|
3 | 1 | 2093-3 | loinc_code |
3 | 2 | >( | operator |
3 | 3 | 2085-9 | loinc_code |
3 | 4 | + | operator |
3 | 5 | 13457-7 | loinc_code |
3 | 6 | ) | operator |
Patient 10000003 has records flagged with logic_flag and duplicate. The duplicate flag means that this patient has a duplicate row, whereas the logic_flag should be interpreted as follows. For the LOINC code "2093-3", which is total cholesterol, the logic rules table requires that "2093-3" > "2085-9" + "13457-7", or equivalently total cholesterol > HDL cholesterol + LDL cholesterol. For patient 10000003, LDL ("13457-7") equals 100.0, HDL ("2085-9") equals 130.0, and total cholesterol ("2093-3") equals 230.0. The rule "2093-3" > "2085-9" + "13457-7" is therefore not satisfied, because 230 > 100 + 130 reduces to 230 > 230, which is clearly false, and thus a logic flag is created.
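To make the arithmetic of this rule explicit, here is a quick hand check in R using the three values from the table above:

# logic rule 3: total cholesterol must be strictly greater than HDL + LDL
230 > 130 + 100   # FALSE, so the record receives a logic_flag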
The standardize_lab_unit() function has four arguments:

- lab_data: A dataset containing laboratory data.
- raw_unit: The column in lab_data that contains the raw units to be cleaned.
- report: If TRUE, a report is written to the console. Defaults to TRUE.
- n_records: In case you are loading a grouped list of distinct units, you can assign n_records to the column that contains the frequency of each distinct unit. Defaults to NA.
Let us check how our package standardizes the units of measurement using the standardize_lab_unit() function. Let us consider the Function_3_dummy data that contains 32 rows and inspect its first 6 rows:
data("Function_3_dummy", package = "lab2clean")
head(Function_3_dummy, 6)
unit_raw | n_records | note |
---|---|---|
mg / dl | 42 | spaces+case |
Mg/Dl | 15 | case |
mcg/L | 30 | mcg alias |
µg/L | 27 | micro sign U+00B5 |
<U+00B5>g/L | 10 | textual unicode |
μg/L | 11 | greek mu U+03BC |
This dataset (for demonstration purposes) contains three columns: unit_raw, n_records, and note. The unit_raw column holds raw laboratory units as reported in the database, n_records indicates how often each unit appeared, and note details the different cases handled by our function.
standardized_units <- standardize_lab_unit(Function_3_dummy, raw_unit = "unit_raw", n_records = "n_records")
#> Step 1: Preprocessing unit strings
#> ==========================================================================================
#> ⚠ 1 distinct unit strings (5.155% of the total result records) with no units after pre-processing -> cleaning comment added (No unit).
#> Step 2: Lookup in common units database
#> ==========================================================================================
#> ✔ 19 distinct units (61.856% of the total result records) were matched to ucum codes.
#> Step 3: Check Syntax Integrity of units with no UCUM match
#> ==========================================================================================
#> ⚠ 0 distinct unit strings (0% of the total result records) with not valid syntax -> detailed cleaning comments added (not_valid - reason).
#> Step 4: Parsing of units which passed checks (tokenize and classify)
#> ==========================================================================================
#> ⚠ 2 distinct unit strings (1.804% of the total result records) with unrecognisable text -> cleaning comment added (not_valid - unrecognisable text).
#> Step 5: Restructuring of parsed units (apply correction rules & final validation)
#> ==========================================================================================
#> ⚠ 0 distinct unit strings (0% of the total result records) with space characters (not corrected) -> cleaning comment added (not_valid - space characters).
#> ✔ 10 distinct units (31.186% of the total result records) were transformed and validated to ucum codes.
#> ==========================================================================================
#> ✔ 29 distinct results (93.041% of the total result records) were standardized to UCUM.
#> ⏰ Time taken is 0 min, 0.1 sec
#>
This function creates two new columns:
head(standardized_units, 10)
 | unit_raw | n_records | note | cleaning_comments | ucum_code |
---|---|---|---|---|---|
1 | | 20 | empty | No unit | NA |
2 | % | 22 | percent only | NA | % |
4 | 10^3/uL | 7 | caret exponent | NA | 10*3/uL |
5 | 10exp3/uL | 8 | 10exp exponent | NA | 10*3/uL |
6 | 10E3/uL | 9 | 10E exponent | NA | 10*3/uL |
7 | 10*3/ul | 16 | lowercase uL + star | NA | 10*3/uL |
9 | U.I./mL | 8 | U.I. form | NA | [IU]/mL |
10 | I.U./mL | 8 | I.U. form | NA | [IU]/mL |
14 | IU/L | 18 | plain IU | NA | [IU]/L |
16 | meq/l | 13 | meq alias | NA | meq/L |
1- ucum_code: Cleaned and standardized units according to UCUM syntax.

2- cleaning_comments: Comments about the cleaning process for each unit.
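To review the unit strings that could not be mapped, you can filter on the new ucum_code column (a base R sketch, assuming, as in the output above, that units which could not be standardized are left with a missing ucum_code):

standardized_units[is.na(standardized_units$ucum_code), ]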
The harmonize_lab_unit() function has six arguments:

- lab_data: A data frame containing laboratory data.
- result_value: The column in lab_data with quantitative result values.
- result_unit: The column in lab_data with result units in a UCUM-valid format.
- loinc_code: The column in lab_data indicating the LOINC code of the laboratory test.
- preferred_unit_system: A string representing the user's preference for the unit system used for harmonization. Defaults to "SI"; the other option is "Conventional".
- report: If TRUE, a report is written to the console. Defaults to TRUE.
Let us demonstrate the harmonize_lab_unit() function using Function_4_dummy and inspect the first six rows:
data("Function_4_dummy", package = "lab2clean")
head(Function_4_dummy,6)
loinc_code | result_value | result_unit |
---|---|---|
26444-0 | 0.00 | x1000/<U+00B5>L |
14334-7 | 0.90 | meq/L |
785-6 | 30.00 | pg |
2160-0 | 0.81 | mg/dL |
14679-5 | 13.10 | <U+00B5>g/dL |
1963-8 | 27.00 | mmol/l31 mOsm/l |
This dataset (for demonstration purposes) contains three columns: loinc_code, result_value, and result_unit.
harmonized_units <- harmonize_lab_unit(Function_4_dummy,
loinc_code="loinc_code",
result_value="result_value",
result_unit="result_unit")
#> Step 1: Extracting unit parameters (dimension & magnitude)
#> Step 2: Setting reference unit (LOINC-UCUM mapping)
#> Step 3: Check compatibility between reported unit and reference unit
#> Step 4: Executing regular conversion
#> Step 5: Executing mass<>molar conversion
#> Step 6: Checking LOINC codes
#> ===============================================================================================
#> Reporting Results:
#> ✔ 37.5% of the lab data records were harmonized to reference units.
#> ✔ 4 records had reported units same as reference units -> result value not converted.
#> ✔ 2 records had different reported units, but equivalent to reference units -> result value not converted.
#> ✔ 5 records harmonized to reference units by regular conversions.
#> ✔ 7 records harmonized to reference units by mass to mole conversions.
#> ✔ 0 records harmonized to reference units by mole to mass conversions.
#> ===============================================================================================
#> ⚠ 62.5% of the lab data records could NOT be harmonized to reference units.
#> ⚠ 5 records not harmonized: no conversion between units of different dimensions.
#> ⚠ 1 records not harmonized: no conversion between arbitrary units and non-arbitrary units.
#> ⚠ 0 records not harmonized: no conversion between different arbitrary units.
#> ⚠ 4 records not harmonized: no reference unit available for the given loinc codes.
#> ⚠ 5 records not harmonized: no molecular weight available for the given analytes (loinc codes).
#> ⚠ 15 records not harmonized: reported units are not ucum-valid.
#> ⚠ 0 records not harmonized: reported units require special conversion.
#> ⚠ 0 records not harmonized: result values are not numeric.
#> ⚠ 0 records not harmonized: no reported units.
#> ===============================================================================================
#> ⚠ 10 records with updated loinc code to match the harmonized unit system.
#> ⏰ Time taken is 0 min, 0.1 sec
#>
This function creates six different columns:
head(harmonized_units, 6)
loinc_code | result_value | result_unit | new_loinc_code | new_value | harmonized_unit | OMOP_concept_id | property_group_id | cleaning_comments |
---|---|---|---|---|---|---|---|---|
14334-7 | 0.90 | meq/L | 3719-2 | 0.90000 | mmol/L | 8753 | NA | harmonized: different_unit_same_value, loinc_unit_mismatch |
785-6 | 29.00 | pg | 785-6 | 29.00000 | pg | 8564 | NA | harmonized: source = reference unit |
785-6 | 30.00 | pg | 785-6 | 30.00000 | pg | 8564 | NA | harmonized: source = reference unit |
2160-0 | 0.70 | mg/dL | 14682-9 | 61.88123 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
2160-0 | 0.82 | mg/dL | 14682-9 | 72.48944 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
2160-0 | 0.81 | mg/dL | 14682-9 | 71.60542 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
1- harmonized_unit: Harmonized units according to the preferred unit system.

2- OMOP_concept_id: The concept id of the harmonized unit, necessary for databases standardized to the OMOP Common Data Model.

3- new_value: The result value after the conversion.

4- new_loinc_code: The unit conversion can lead to a LOINC code different from the reported one in two cases:
* If the reported unit does not match the property of the given LOINC code; for example, "mmol/L" reported for a LOINC code with a mass concentration property -> loinc_unit_mismatch is added to the cleaning comments.
* If a mass<>molar conversion was executed.
harmonized_units[which(harmonized_units$loinc_code != harmonized_units$new_loinc_code), ]
 | loinc_code | result_value | result_unit | new_loinc_code | new_value | harmonized_unit | OMOP_concept_id | property_group_id | cleaning_comments |
---|---|---|---|---|---|---|---|---|---|
1 | 14334-7 | 0.900 | meq/L | 3719-2 | 0.90000 | mmol/L | 8753 | NA | harmonized: different_unit_same_value, loinc_unit_mismatch |
4 | 2160-0 | 0.700 | mg/dL | 14682-9 | 61.88123 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
5 | 2160-0 | 0.820 | mg/dL | 14682-9 | 72.48944 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
6 | 2160-0 | 0.810 | mg/dL | 14682-9 | 71.60542 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
9 | 2991-8 | 0.001 | nmol/L | 14914-6 | 1.00000 | pmol/L | 8729 | LG100-4/LG2970-4 | harmonized: regular_conversion, loinc_unit_mismatch |
10 | 2991-8 | 0.005 | nmol/L | 14914-6 | 5.00000 | pmol/L | 8729 | LG100-4/LG2970-4 | harmonized: regular_conversion, loinc_unit_mismatch |
11 | 2991-8 | 0.003 | nmol/L | 14914-6 | 3.00000 | pmol/L | 8729 | LG100-4/LG2970-4 | harmonized: regular_conversion, loinc_unit_mismatch |
16 | 53049-3 | 88.000 | mg/dL | 40193-5 | 4.88455 | mmol/L | 8753 | NA | harmonized: mass_to_mole_conversion |
5- property_group_id: The code of the LOINC group (parent group ID / group ID).

6- cleaning_comments: Comments about the harmonization and conversion process for each lab result, with two main cases:
* Success: harmonized with the same value or with a converted new value:
  - No conversion in case of similar or equivalent source and reference units.
  - Conversion, with the method clarified as either a regular or a mass<>molar conversion.
* Failure: not harmonized, with a detailed reason for each failure case.
levels(factor(harmonized_units$cleaning_comments))
#> [1] "harmonized: different_unit_same_value"
#> [2] "harmonized: different_unit_same_value, loinc_unit_mismatch"
#> [3] "harmonized: mass_to_mole_conversion"
#> [4] "harmonized: mass_to_mole_conversion, loinc_unit_mismatch"
#> [5] "harmonized: regular_conversion"
#> [6] "harmonized: regular_conversion, loinc_unit_mismatch"
#> [7] "harmonized: source = reference unit"
#> [8] "not_harmonized: Non UCUM unit"
#> [9] "not_harmonized: between arbitrary units and non-arbitrary units"
#> [10] "not_harmonized: different dimensions"
#> [11] "not_harmonized: different dimensions, loinc_unit_mismatch"
#> [12] "not_harmonized: no molecular weight available"
#> [13] "not_harmonized: no_reference_unit_available"
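As a sanity check on the mass-to-mole conversion shown earlier for creatinine (LOINC code 2160-0), here is a hand calculation, assuming a molecular weight of roughly 113.12 g/mol for creatinine (the exact value used internally by the package may differ slightly):

# 0.70 mg/dL -> 7.0 mg/L -> 7.0/113.12 mmol/L -> umol/L
0.70 * 10 / 113.12 * 1000   # ~61.88 umol/L, matching new_value in the table above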
In the harmonize_lab_unit() function, we have an argument named preferred_unit_system. According to the user preference, the reference units may change from SI units (usually molar concentration) to conventional units commonly used in practice (usually mass concentration) through mass<>molar conversions. For LOINC codes that do not have a mass<>molar equivalent, the conventional and SI units are considered the same. For some LOINC codes, the molar concentration is the one used conventionally. Examples of the differences when using different preferred_unit_system values are detailed as follows:

Function_4_dummy_subset <- Function_4_dummy[c(27, 15, 38, 45),, drop = FALSE]
harmonized_units <- harmonize_lab_unit(Function_4_dummy_subset,
loinc_code="loinc_code",
result_value="result_value",
result_unit="result_unit",
report = FALSE,
preferred_unit_system = "SI")
#> ⏰ Time taken is 0 min, 0.1 sec
#>
harmonized_units
loinc_code | result_value | result_unit | new_loinc_code | new_value | harmonized_unit | OMOP_concept_id | property_group_id | cleaning_comments |
---|---|---|---|---|---|---|---|---|
2160-0 | 0.820 | mg/dL | 14682-9 | 72.48944 | umol/L | 8749 | LG100-4/LG2923-3 | harmonized: mass_to_mole_conversion |
2991-8 | 0.001 | nmol/L | 14914-6 | 1.00000 | pmol/L | 8729 | LG100-4/LG2970-4 | harmonized: regular_conversion, loinc_unit_mismatch |
2951-2 | 142.000 | mmol/L | 2951-2 | 142.00000 | mmol/L | 8753 | LG100-4/LG11363-5 | harmonized: source = reference unit |
786-4 | 32.000 | g/dL | 786-4 | 320.00000 | g/L | 8636 | NA | harmonized: regular_conversion |
harmonized_units <- harmonize_lab_unit(Function_4_dummy_subset,
loinc_code="loinc_code",
result_value="result_value",
result_unit="result_unit",
report = FALSE,
preferred_unit_system = "conventional")
#> ⏰ Time taken is 0 min, 0.1 sec
#>
harmonized_units
loinc_code | result_value | result_unit | new_loinc_code | new_value | harmonized_unit | OMOP_concept_id | property_group_id | cleaning_comments |
---|---|---|---|---|---|---|---|---|
2160-0 | 0.820 | mg/dL | 2160-0 | 8.2000000 | mg/L | 8751 | LG100-4/LG6657-3 | harmonized: regular_conversion |
2991-8 | 0.001 | nmol/L | 2991-8 | 0.2883998 | ng/L | 8725 | LG100-4/LG11447-6 | harmonized: mole_to_mass_conversion, loinc_unit_mismatch |
2951-2 | 142.000 | mmol/L | 2951-2 | 142.0000000 | mmol/L | 8753 | LG100-4/LG11363-5 | harmonized: source = reference unit |
786-4 | 32.000 | g/dL | 786-4 | 320.0000000 | g/L | 8636 | NA | harmonized: regular_conversion |
We fully acknowledge the importance of customization to accommodate diverse user needs and tailor the functions to specific datasets. To this end, the data in common_words, logic_rules, reportable_interval, RWD_units_to_UCUM_V2, annotable_strings, and loinc_reference_unit_v1 are not hard-coded within the function scripts but are instead provided as separate data files in the "data" folder of the package. This approach allows users to benefit from the default data we have included, which reflects our best knowledge, while also providing the flexibility to append or modify the data as needed.
For example, users can easily customize the common_words data file by adding phrases that are used across different languages and laboratory settings. This allows the clean_lab_result() function to better accommodate the specific linguistic and contextual nuances of their datasets. Similarly, users can adjust the logic_rules and reportable_interval data files for the validate_lab_result() function to reflect the unique requirements or standards of their research or clinical environment. Additionally, users can extend the RWD_units_to_UCUM_V2 data file by adding locally used units or strings (especially those with non-English letters or abbreviations) together with their UCUM-valid equivalents, customizing the output of the standardize_lab_unit() function. Similarly, the annotable_strings data file can be extended by adding non-English strings for analytes locally used in units. Finally, the harmonize_lab_unit() function can be customized by adding reference units for LOINC codes that were not covered in loinc_reference_unit_v1, or even by editing the reference units for some existing LOINC codes (though this is not recommended).
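As an illustration of the general pattern, here is a minimal sketch of appending rows to two of these tables. The column layouts follow the tables shown earlier in this vignette, the new rows are made-up examples, and how the modified tables are saved back into the package's data folder is left to the user:

# load the bundled lookup tables
data("common_words", package = "lab2clean")
data("reportable_interval", package = "lab2clean")

# example: append a (hypothetical) extra language row to common_words
common_words <- rbind(common_words,
                      data.frame(Language = "Romanian", Positive = "Pozitiv", Negative = "Negativ",
                                 Not_detected = "Nedetectat", High = "Ridicat", Low = "Scazut",
                                 Normal = "Normal", Sample = "Proba", Specimen = "Proba"))

# example: append a (hypothetical) reportable interval for an additional test
reportable_interval <- rbind(reportable_interval,
                             data.frame(interval_loinc_code = "2345-7", UCUM_unit = "mg/dL",
                                        low_reportable_limit = 10, high_reportable_limit = 1000))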
By providing these customizable data files, we aim to ensure that the lab2clean package is not only powerful but also adaptable to the varied needs of the research and clinical communities.