This package was commissioned by the NHS-R community to web scrape the NHS Data Dictionary website for useful lookup tables; the community have been pivotal in getting it off the ground.
The package is maintained by Gary Hutson, Head of Advanced Analytics at Arden and GEM Commissioning Support Unit; to contact the maintainer directly, you can navigate to this site.
Additionally, the package has been developed with generic web scraping functionality to allow other websites containing data tables and elements to be scraped.
To load the package, along with the supporting packages used in these examples, use the below commands:
library(NHSDataDictionaRy)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
library(tibble)
This brings in the functions needed to work with the package. The following subsections show how to use the package as intended.
The nhs_data_elements() function takes no arguments and queries the NHS Data Dictionary website for the most recent list of data elements and their associated lookups. It returns a tibble of all the links currently on the NHS Data Dictionary website:
nhs_tibble <- NHSDataDictionaRy::nhs_data_elements()
print(head(nhs_tibble))
#> # A tibble: 6 x 6
#> link_name url full_url xpath_nat_code xpath_default_co~ xpath_also_known
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ABBREVIAT~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
#> 2 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
#> 3 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
#> 4 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
#> 5 ABLATIVE ~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
#> 6 ABNORMALI~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
This tibble lists all the lookups and their associated xpath expressions, i.e. direct references to HTML elements, which is the standard way of extracting content from the HTML DOM. This is where the other functions in the package become powerful.
The NHSDataDictionaRy package provides a number of Microsoft Excel-style convenience functions for working with text data. These are: left_xl(), right_xl(), mid_xl() and len_xl().
I will demonstrate how these can be used on the tibble extracted from the previous example in the following sub sections.
The left_xl function expects two parameters: the first is the text to work with and the second is the number of characters to take from the left of the string:
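A minimal sketch, assuming left_xl() behaves like Excel's LEFT function as described above:
# Keep the leftmost seven characters of the string ("General")
NHSDataDictionaRy::left_xl("General Surgery Service", 7)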
The right_xl function works the same way, but takes characters from the right of the text instead:
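Another minimal sketch, assuming right_xl() mirrors Excel's RIGHT function:
# Keep the rightmost seven characters of the string ("Service")
NHSDataDictionaRy::right_xl("General Surgery Service", 7)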
The mid_xl function takes a slightly different approach and expects three input parameters: the first is the text to work with, the second is the position at which to start extracting, and the third is the number of characters to extract, as the output below shows:
# Grab a subset of the data frame
df <- nhs_tibble[10, ]
original <- df$link_name
# Extract 20 characters from the original string, starting at position 12
result <- NHSDataDictionaRy::mid_xl(df$link_name, 12, 20)
print(original); print(result)
#> [1] "ACCESSIBLE INFORMATION SPECIFIC INFORMATION FORMAT CODE (SNOMED CT)"
#> [1] "INFORMATION SPECIFIC"
class(result)
#> [1] "character"
This is a simple but useful function: len_xl gets the length of a string:
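A minimal sketch, assuming len_xl() mirrors Excel's LEN function:
# Returns the number of characters in the string (23)
NHSDataDictionaRy::len_xl("General Surgery Service")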
This function analyses a website and returns all of its current hyperlinks. It underpins the nhs_data_elements() function, which calls it to analyse all the current hyperlinks on the NHS Data Dictionary website; the example below instead scrapes the NHS-R community website to access its links:
# Analyse all the links on a website
website_url <- "https://nhsrcommunity.com/home/webinars/"
results <- NHSDataDictionaRy::linkScrapeR(website_url)
print(head(results, 20))
#> # A tibble: 20 x 2
#> link_name url
#> <chr> <chr>
#> 1 "\n\t\t\t" https://nhsrcommunity.com/
#> 2 "\n\t\t\t" https://nhsrcommunity.com/
#> 3 "Home" https://nhsrcommunity.com/
#> 4 "About" #
#> 5 "About" https://nhsrcommunity.com/about/
#> 6 "Patients" https://nhsrcommunity.com/about/ppi/
#> 7 "Recordings" https://nhsrcommunity.com/learn-r/workshops/
#> 8 "Conferences" #
#> 9 "NHS-R Conference – 202~ https://nhsrcommunity.com/nhsr-conference-2020/
#> 10 "NHS-R Conference – 201~ https://nhsrcommunity.com/nhs-r-conference-2019/
#> 11 "NHS-R Conference – 201~ https://nhsrcommunity.com/nhs-r-conference-9-octobe~
#> 12 "Events" #
#> 13 "Events" https://nhsrcommunity.com/events/
#> 14 "Webinars" https://nhsrcommunity.com/home/webinars/
#> 15 "Blog" #
#> 16 "NHS-R Blog" https://nhsrcommunity.com/blog/
#> 17 "R tips" https://nhsrcommunity.com/blog/category/r-tips/
#> 18 "Authors" https://nhsrcommunity.com/authors/
#> 19 "R Groups" https://nhsrcommunity.com/r-near-me/
#> 20 "Contact" https://nhsrcommunity.com/contact/
To navigate to a specific URL from the results, you can use the utils::browseURL command:
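For example, to open the twentieth scraped link (the Contact page above) in your default browser:
# Open the link in the system's default browser
utils::browseURL(results$url[20])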
This package provides functionality for working with the data elements extracted from the NHS Data Dictionary website. The two most useful functions for extracting elements are tableR and xpathTextR; these work with the returned tibble to extract useful lookups.
The scrapeR function is the workhorse, but tableR wraps its results in a tidy tibble output. The following shows how to take the returned tibble and pass its values through tableR to scrape a tibble that can be used for lookups:
# Filter by a specific lookup required
reduced_tibble <-
  dplyr::filter(nhs_tibble, link_name == "ACTIVITY TREATMENT FUNCTION CODE")
# Use the tableR function to query the NHS Data Dictionary website and return the associated tibble
national_codes <- NHSDataDictionaRy::tableR(
  url = reduced_tibble$full_url,
  xpath = reduced_tibble$xpath_nat_code,
  title = "NHS Hospital Activity Treatment Function National Codes")
default_codes <- NHSDataDictionaRy::tableR(
  url = reduced_tibble$full_url,
  xpath = reduced_tibble$xpath_default_code,
  title = "NHS Hospital Activity Treatment Function Default Codes")
# Here you could merge the codes - as you will have national and default codes
merged_frame <- national_codes %>%
  dplyr::bind_rows(default_codes)
# The query has returned results; if the URL does not have a lookup table, an error will be thrown
print(head(national_codes,10))
#> # A tibble: 10 x 4
#> Code Description Dict_Type DttmExtracted
#> <chr> <chr> <chr> <dttm>
#> 1 100 General Surgery Service NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 2 101 Urology Service NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 3 102 Transplant Surgery Serv~ NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 4 103 Breast Surgery Service NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 5 104 Colorectal Surgery Serv~ NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 6 105 Hepatobiliary and Pancr~ NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 7 106 Upper Gastrointestinal ~ NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 8 107 Vascular Surgery Service NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 9 108 Spinal Surgery Service NHS Hospital Activity Tre~ 2021-02-17 14:16:44
#> 10 109 Bariatric Surgery Servi~ NHS Hospital Activity Tre~ 2021-02-17 14:16:44
print(head(default_codes, 10))
#> # A tibble: 2 x 4
#> Code Description Dict_Type DttmExtracted
#> <chr> <chr> <chr> <dttm>
#> 1 199 Non-UK provider; TREATMENT F~ NHS Hospital Activity~ 2021-02-17 14:16:44
#> 2 499 Non-UK provider; TREATMENT F~ NHS Hospital Activity~ 2021-02-17 14:16:44
print(head(merged_frame))
#> # A tibble: 6 x 4
#> Code Description Dict_Type DttmExtracted
#> <chr> <chr> <chr> <dttm>
#> 1 100 General Surgery Service NHS Hospital Activity Trea~ 2021-02-17 14:16:44
#> 2 101 Urology Service NHS Hospital Activity Trea~ 2021-02-17 14:16:44
#> 3 102 Transplant Surgery Serv~ NHS Hospital Activity Trea~ 2021-02-17 14:16:44
#> 4 103 Breast Surgery Service NHS Hospital Activity Trea~ 2021-02-17 14:16:44
#> 5 104 Colorectal Surgery Serv~ NHS Hospital Activity Trea~ 2021-02-17 14:16:44
#> 6 105 Hepatobiliary and Pancr~ NHS Hospital Activity Trea~ 2021-02-17 14:16:44
Not all lookups have associated national code tables; if none is returned, you will receive a message saying the lookup table is not available for this NHS Data Dictionary type.
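If you need a longer pipeline to keep running when a table is missing, one option is to wrap the call in tryCatch(). This is a sketch only, assuming the error behaviour noted in the code comment earlier; the xpath_nat_code column is reused purely for illustration:
# Sketch: return NULL instead of failing when no lookup table exists
safe_codes <- tryCatch(
  NHSDataDictionaRy::tableR(url = reduced_tibble$full_url,
                            xpath = reduced_tibble$xpath_nat_code,
                            title = "National Codes"),
  error = function(e) {
    message("Lookup table not available: ", conditionMessage(e))
    NULL # downstream code can test is.null(safe_codes)
  }
)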
Some lookups are needed time and again; one such case is mapping a specialty code to the description of the specialty unit. I will show an example with a made-up data frame to illustrate the use case for these lookups and the value of having them up to date:
act_aggregations <- tibble(SpecCode = as.character(c(101, 102, 103, 104, 105)),
                           ActivityCounts = round(rnorm(5, 250, 3), 0),
                           Month = rep("May", 5))
# Use dplyr to join the NHS activity by specialty code
act_aggregations %>%
  left_join(merged_frame, by = c("SpecCode" = "Code"))
#> # A tibble: 5 x 6
#> SpecCode ActivityCounts Month Description Dict_Type DttmExtracted
#> <chr> <dbl> <chr> <chr> <chr> <dttm>
#> 1 101 247 May Urology Servi~ NHS Hospital~ 2021-02-17 14:16:44
#> 2 102 252 May Transplant Su~ NHS Hospital~ 2021-02-17 14:16:44
#> 3 103 243 May Breast Surger~ NHS Hospital~ 2021-02-17 14:16:44
#> 4 104 251 May Colorectal Su~ NHS Hospital~ 2021-02-17 14:16:44
#> 5 105 252 May Hepatobiliary~ NHS Hospital~ 2021-02-17 14:16:44
# This easily joins the lookup on to your data
The benefit of having this in an R package is that you instantly have the most relevant and up-to-date NHS lookups, removing the need for a massive data warehouse to capture this information.
This function has been provided to return elements other than HTML tables from a website, as the previous functions work predominantly with tables. The below example shows how this can be implemented, but it requires retrieving the xpath via the Inspect command in Google Chrome (CTRL + SHIFT + I):
url <- "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
xpath_element <- '//*[@id="element_abbreviated_mental_test_score.description"]'
# Run the xpathTextR function to retrieve details of the matched element
result_list <- NHSDataDictionaRy::xpathTextR(url, xpath_element)
print(result_list)
#> $result
#> [1] "Description\n \n \n \n \n ABBREVIATED MENTAL TEST SCORE\n is the \n PERSON SCORE\n where the \n ASSESSMENT TOOL TYPE\n is \n 'Abbreviated Mental Test Score'. \n The score is in the range 0 to 10.\n \n\n"
#>
#> $website_passed
#> [1] "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
#>
#> $xpath_passed
#> [1] "//*[@id=\"element_abbreviated_mental_test_score.description\"]"
#>
#> $html_node_result
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:whc="http://www.oxygenxml.com/webhelp/components" xml:lang="en" lang="en" whc:version="21.1">
#> [1] <head>\n<link rel="shortcut icon" href="../oxygen-webhelp%5Ctemplate%5Cre ...
#> [2] <body class="wh_topic_page frmBody">\n <a href="#wh_topic_body" cl ...
#>
#> $datetime_access
#> [1] "2021-02-17 14:16:44 GMT"
#>
#> $person_accessed
#> [1] "GARYH - LAPTOP-GE3S96EI"
This provides the result (the text retrieved live from the website, which will need some cleaning), the website passed to the function, the xpath used, the result of the node search, the date and time the list was generated, and the person and domain accessing it.
The example below shows how the text could be cleaned once it is retrieved:
# Use the returned result and do some text processing
clean_text <- trimws(unlist(result_list$result))
clean_text <- clean_text %>%
  gsub("[\r\n]", "", .) %>% # Remove newlines and carriage returns
  trimws() %>%              # Get rid of any whitespace
  as.character()            # Cast to a character vector
print(clean_text)
#> [1] "Description ABBREVIATED MENTAL TEST SCORE is the PERSON SCORE where the ASSESSMENT TOOL TYPE is 'Abbreviated Mental Test Score'. The score is in the range 0 to 10."
I have used trimws() to extract and tidy the result element from the list returned by the previous function. I then pipe to gsub() to remove newlines, call trimws() again to make sure the spacing is sorted, and convert (cast) the result to a character vector. Finally, the results are printed.
There are lots of use cases for this, and I would like to keep iterating on this tool, so please contact me with suggestions for what could be included in future versions.