1.0 Data merge

Attention:
Each line of code has been validated, but to save time when knitting the R Markdown document, the next section is display only. If you are not merging data (section 1.0) or preparing data (section 2.0), feel free to skip to section 3.0 Initial flags.

1.1 Download ALA data

Download ALA data into a new file in the DataPath. You should first create an ALA account in order to download your data: https://auth.ala.org.au/userdetails/registration/createAccount

  BeeBDC::atlasDownloader(path = DataPath,
           userEmail = "your@email.edu.au",
           atlas = "ALA",
           ALA_taxon = "Apiformes")

1.2 Import and merge ALA, SCAN, iDigBio, and GBIF data

Supply the path to where the data are stored. The save_type is either “csv_files” or “R_file”.

  DataImp <- BeeBDC::repoMerge(path = DataPath, 
                  occ_paths = BeeBDC::repoFinder(path = DataPath),
                  save_type = "R_file")

If there is an error in finding a file, run repoFinder() by itself to troubleshoot. For example:

            #BeeBDC::repoFinder(path = DataPath)
            #OUTPUT:
            #$ALA_data
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/ALA_galah_path/galah_download_2022-09-15/data.csv"
  
            #$GBIF_data
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0000165-220831081235567/occurrence.txt"
            #[2] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436695-210914110416597/occurrence.txt"
            #[3] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436697-210914110416597/occurrence.txt"
            #[4] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436704-210914110416597/occurrence.txt"
            #[5] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436732-210914110416597/occurrence.txt"
            #[6] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436733-210914110416597/occurrence.txt"
            #[7] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436734-210914110416597/occurrence.txt"
                    
            #$iDigBio_data
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/iDigBio_webDL_30Aug2022/5aa5abe1-62e0-4d8c-bebf-4ac13bd9e56f/occurrence_raw.csv"
  
            #$SCAN_data
            #character(0)
            #Failing because SCAN_data seems to be missing. Downloaded separately from the one drive
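A quick way to spot this kind of problem is to check which elements of the returned list are empty. The sketch below is only an illustration: it uses a hand-made list with placeholder paths that mimics the structure shown above. With real data you would inspect the result of repoFinder() instead.

```r
# Hypothetical stand-in for the list returned by repoFinder();
# the paths are placeholders, not real files.
occ_paths <- list(
  ALA_data     = "BeeData/ALA_galah_path/data.csv",
  GBIF_data    = c("BeeData/GBIF/a/occurrence.txt",
                   "BeeData/GBIF/b/occurrence.txt"),
  iDigBio_data = "BeeData/iDigBio/occurrence_raw.csv",
  SCAN_data    = character(0)
)

# Names of the repositories for which no file was found
missing_repos <- names(occ_paths)[lengths(occ_paths) == 0]
missing_repos
```

Any name listed in missing_repos points to a download folder that is absent or not named as expected.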

Load in the most-recent version of these data if needed. This will return a list with:

The occurrence dataset with attributes (.$Data_WebDL)

The appended eml file (.$eml_files)

DataImp <- BeeBDC::importOccurrences(path = DataPath,
                       fileName = "BeeData_")

1.3 Import USGS Data

The USGS_formatter() will find, import, format, and create metadata for the USGS dataset. The pubDate must be in day-month-year format.

  USGS_data <- BeeBDC::USGS_formatter(path = DataPath, pubDate = "19-11-2022")
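Because the date format matters, it can help to check a date string before passing it on. The helper below, is_dmy(), is hypothetical (it is not part of BeeBDC) and assumes zero-padded day and month values, as in the example above.

```r
# Hypothetical helper: TRUE when x is a valid day-month-year date
# such as "19-11-2022" (zero-padded day and month assumed).
is_dmy <- function(x) {
  grepl("^[0-9]{2}-[0-9]{2}-[0-9]{4}$", x) &
    !is.na(as.Date(x, format = "%d-%m-%Y"))
}

is_dmy("19-11-2022")  # day-month-year
is_dmy("2022-11-19")  # year-month-day, rejected by this check
is_dmy("31-02-2022")  # right shape, but not a real calendar date
```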

1.4 Formatted Source Importer

Use this importer to find files that have been formatted and need to be added to the larger data file.

The attributes file must contain “attribute” in its name, and the occurrence file must not.

  Complete_data <- BeeBDC::formattedCombiner(path = DataPath, 
                                strings = c("USGS_[a-zA-Z_]+[0-9]{4}-[0-9]{2}-[0-9]{2}"), 
                                  # This should be the list-format with eml attached
                                existingOccurrences = DataImp$Data_WebDL,
                                existingEMLs = DataImp$eml_files)
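To see what the strings pattern would pick up, you can test it against a few file names. This is a rough sketch only: the file names are invented for illustration, and the split on “attribute” simply mirrors the naming rule described above rather than the internals of formattedCombiner().

```r
# Hypothetical file names; only the pattern comes from the call above
files <- c("USGS_formatted_2022-11-19.csv",
           "USGS_attribute_files_2022-11-19.xml",
           "Fin_BeeData_combined_2022-11-19.csv")
pattern <- "USGS_[a-zA-Z_]+[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Files whose names match the pattern
hits <- files[grepl(pattern, files)]

# Apply the "attribute" naming rule to separate the two file types
attribute_files  <- hits[grepl("attribute", hits)]
occurrence_files <- hits[!grepl("attribute", hits)]
```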

In the column catalogNumber, remove ".*specimennumber:" as what comes after should be the USGS number to match for duplicates.

  Complete_data$Data_WebDL <- Complete_data$Data_WebDL %>%
    dplyr::mutate(catalogNumber = stringr::str_replace(catalogNumber,
                                                       pattern = ".*\\| specimennumber:",
                                                       replacement = ""))
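The effect of that pattern can be checked on a small vector. The sketch below uses base R's sub(), which, like stringr::str_replace(), replaces only the first match. The catalogNumber values are made up for illustration.

```r
# Hypothetical catalogNumber values
catalogNumber <- c("someCode | specimennumber:USGS_DRO123456",
                   "USGS_DRO654321",
                   NA)

# Drop everything up to and including "| specimennumber:"
cleaned <- sub(".*\\| specimennumber:", "", catalogNumber)
cleaned
```

Values that do not contain the prefix, and missing values, pass through unchanged.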

1.5 Save data

Choose the type of data format you want to use in saving your work in 1.x.

  BeeBDC::dataSaver(path = DataPath,# The main path to look for data in
       save_type = "CSV_file", # "R_file" OR "CSV_file"
       occurrences = Complete_data$Data_WebDL, # The existing datasheet
       eml_files = Complete_data$eml_files, # The existing EML files
       file_prefix = "Fin_") # The prefix for the fileNames
rm(Complete_data, DataImp)

2.0 Data preparation

The data preparation section of the script relates mostly to integrating bee occurrence datasets and corrections and so may be skipped by many general taxon users.

2.1 Standardise datasets

You may either use:

1. the bdc import method (works well with general datasets) or
2. the jbd import method (works well with above data merge)

a. bdc import

The bdc import is NOT truly supported here, but provided as an example. Please go to section 2.1b below. Read in the bdc metadata and standardise the dataset to bdc.

        bdc_metadata <- readr::read_csv(paste(DataPath, "out_file", "bdc_integration.csv", sep = "/"))
        # ?issue — datasetName is a darwinCore field already!
        # Standardise the dataset to bdc
        db_standardized <- bdc::bdc_standardize_datasets(
          metadata = bdc_metadata,
          format = "csv",
          overwrite = TRUE,
          save_database = TRUE)
        # read in configuration description file of the column header info
        config_description <- readr::read_csv(paste(DataPath, "Output", "bdc_configDesc.csv",
                                                    sep = "/"), 
                                              show_col_types = FALSE, trim_ws = TRUE)

b. jbd import

Find the path, read in the file, and add the database_id column.

  occPath <- BeeBDC::fileFinder(path = DataPath, fileName = "Fin_BeeData_combined_")


  db_standardized <- readr::read_csv(occPath, 
                                       # Use the basic ColTypeR function to determine types
                                     col_types = BeeBDC::ColTypeR(), trim_ws = TRUE) %>%
                                     dplyr::mutate(database_id = paste("Dorey_data_", 
                                     1:nrow(.), sep = ""),
                                     .before = family)
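The database_id step amounts to giving every row a unique, sequential identifier placed ahead of the other columns. A minimal base R sketch of the same idea, using a toy table rather than the real occurrence file:

```r
# Toy occurrence table for illustration only
occ <- data.frame(
  family = c("Apidae", "Halictidae", "Colletidae"),
  scientificName = c("Apis mellifera", "Homalictus fijiensis",
                     "Pharohylaeus lactiferus"),
  stringsAsFactors = FALSE
)

# seq_len() is used rather than 1:nrow() so that an empty table
# yields no ids instead of the sequence c(1, 0)
occ <- cbind(database_id = paste0("Dorey_data_", seq_len(nrow(occ))), occ)
occ$database_id
```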

c. optional thin

You can thin the dataset for TESTING ONLY!

         check_pf <- check_pf %>%
           # take every 100th record
           dplyr::filter(dplyr::row_number() %% 100 == 1)
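To make the thinning rule concrete: row number modulo 100 equal to 1 keeps rows 1, 101, 201, and so on. A base R sketch on a toy table:

```r
# Toy table with 250 rows
toy <- data.frame(row = 1:250)

# Keep rows whose position leaves a remainder of 1 when divided by 100
thinned <- toy[seq_len(nrow(toy)) %% 100 == 1, , drop = FALSE]
thinned$row
```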

2.2 Paige dataset

Paige Chesshire’s cleaned American dataset — https://doi.org/10.1111/ecog.06584

Import data

If you haven’t figured it out by now, don’t worry about the column name warning — not all columns occur here.

  PaigeNAm <- readr::read_csv(paste(DataPath, "Paige_data", "NorAmer_highQual_only_ALLfamilies.csv",
                                    sep = "/"), col_types = BeeBDC::ColTypeR()) %>%
     # Change the column name from Source to dataSource to match the rest of the data.
    dplyr::rename(dataSource = Source) %>%
     # EXTRACT WAS HERE
      # add a NEW database_id column
    dplyr::mutate(
      database_id = paste0("Paige_data_", 1:nrow(.)),
      .before = scientificName)

Attention:
It is recommended to run the below code on the full bee dataset with more than 16GB RAM. Robert ran this on a laptop with 16GB RAM and an Intel(R) Core(TM) i7-8550U processor (4 cores and 8 threads) — it struggled.

Merge Paige’s data with downloaded data

  db_standardized <- BeeBDC::PaigeIntegrater(
      db_standardized = db_standardized,
      PaigeNAm = PaigeNAm,
        # A list of column sets by which to match Paige's data to the most-recent download.
        # Each vector will be matched individually
      columnStrings = list(
        c("decimalLatitude", "decimalLongitude", 
          "recordNumber", "recordedBy", "individualCount", "samplingProtocol",
          "associatedTaxa", "sex", "catalogNumber", "institutionCode", "otherCatalogNumbers",
          "recordId", "occurrenceID", "collectionID"),         # Iteration 1
        c("catalogNumber", "institutionCode", "otherCatalogNumbers",
          "recordId", "occurrenceID", "collectionID"), # Iteration 2
        c("decimalLatitude", "decimalLongitude", 
          "recordedBy", "genus", "specificEpithet"),# Iteration 3
        c("id", "decimalLatitude", "decimalLongitude"),# Iteration 4
        c("recordedBy", "genus", "specificEpithet", "locality"), # Iteration 5
        c("recordedBy", "institutionCode", "genus", 
          "specificEpithet","locality"),# Iteration 6
        c("occurrenceID","decimalLatitude", "decimalLongitude"),# Iteration 7
        c("catalogNumber","decimalLatitude", "decimalLongitude"),# Iteration 8
        c("catalogNumber", "locality") # Iteration 9
      ) )

Remove spent data.

  rm(PaigeNAm)

2.3 USGS

The USGS dataset also partially occurs on GBIF from BISON. However, the occurrence codes are in a silly place… We will correct these here to help identify duplicates later.

    db_standardized <- db_standardized %>%
          # Remove the discoverlife html if it is from USGS
      dplyr::mutate(occurrenceID = dplyr::if_else(
        stringr::str_detect(occurrenceID, "USGS_DRO"),
        stringr::str_remove(occurrenceID, "http://www\\.discoverlife\\.org/mp/20l\\?id="),
        occurrenceID)) %>%
          # Use otherCatalogNumbers when occurrenceID is empty AND when USGS_DRO is detected there
      dplyr::mutate(
        occurrenceID = dplyr::if_else(
          stringr::str_detect(otherCatalogNumbers, "USGS_DRO") & is.na(occurrenceID),
          otherCatalogNumbers, occurrenceID)) %>%
           # Make sure that no eventIDs have snuck into the occurrenceID columns 
           # For USGS_DRO, codes with <6 digits are event ids
      dplyr::mutate(
        occurrenceID = dplyr::if_else(stringr::str_detect(occurrenceID, "USGS_DRO", negate = TRUE),
             # Keep occurrenceID if it's NOT USGS_DRO
           occurrenceID, 
             # If it IS USGS_DRO and it has >= 6 digits, keep it, else, NA
          dplyr::if_else(stringr::str_detect(occurrenceID, "USGS_DRO[0-9]{6,10}"),
                         occurrenceID, NA_character_)),
        catalogNumber = dplyr::if_else(stringr::str_detect(catalogNumber, "USGS_DRO", negate = TRUE),
             # Keep catalogNumber if it's NOT USGS_DRO
          catalogNumber, 
             # If it IS USGS_DRO and it has >= 6 digits, keep it, else, NA
          dplyr::if_else(stringr::str_detect(catalogNumber, "USGS_DRO[0-9]{6,10}"),
                         catalogNumber, NA_character_)))
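The logic above can be traced on a handful of identifiers. The sketch below re-expresses the occurrenceID steps in base R on invented values. It omits the otherCatalogNumbers fallback and is not a replacement for the pipeline itself.

```r
# Hypothetical identifiers covering the cases handled above
occurrenceID <- c("http://www.discoverlife.org/mp/20l?id=USGS_DRO654321",
                  "USGS_DRO123456",     # six digits: kept
                  "USGS_DRO1234",       # fewer than six digits: event id
                  "urn:catalog:XYZ:1",  # not USGS_DRO: left untouched
                  NA)

is_usgs <- grepl("USGS_DRO", occurrenceID)

# Strip the Discover Life URL prefix from USGS records only
stripped <- ifelse(is_usgs,
                   sub("http://www\\.discoverlife\\.org/mp/20l\\?id=", "",
                       occurrenceID),
                   occurrenceID)

# Keep non-USGS values, and USGS values with at least six digits
keep <- !is_usgs | grepl("USGS_DRO[0-9]{6,10}", stripped)
cleaned <- ifelse(keep, stripped, NA_character_)
cleaned
```

Note that the pattern is not anchored at the end, so a code with more than ten digits would still match.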

2.4 Additional datasets

Import additional and potentially private datasets.

Note: Private dataset functions are provided but the data itself is not integrated here until those datasets become freely available.

There will be some warnings where a few rows may not be formatted correctly or where dates fail to parse. This is normal.

a. EPEL

Guzman, L. M., Kelly, T. & Elle, E. A data set for pollinator diversity and their interactions with plants in the Pacific Northwest. Ecology, e3927 (2022). https://doi.org/10.1002/ecy.3927

EPEL_Data <- BeeBDC::readr_BeeBDC(dataset = "EPEL",
                                path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/bee_data_canada.csv",
                      outFile = "jbd_EPEL_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

b. Allan Smith-Pardo

Data from Allan Smith-Pardo

ASP_Data <- BeeBDC::readr_BeeBDC(dataset = "ASP",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Allan_Smith-Pardo_Dorey_ready2.csv",
                      outFile = "jbd_ASP_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

c. Minckley

Data from Robert Minckley

BMin_Data <- BeeBDC::readr_BeeBDC(dataset = "BMin",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Bob_Minckley_6_1_22_ScanRecent-mod_Dorey.csv",
                        outFile = "jbd_BMin_data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

d. BMont

Delphia, C. M. Bumble bees of Montana. https://www.mtent.org/projects/Bumble_Bees/bombus_species.html. (2022)

BMont_Data <- BeeBDC::readr_BeeBDC(dataset = "BMont",
                                 path = paste0(DataPath, "/Additional_Datasets"),
                          inFile = "/InputDatasets/Bombus_Montana_dorey.csv",
                          outFile = "jbd_BMont_data.csv",
                          dataLicense = "https://creativecommons.org/licenses/by-sa/4.0/")

e. Ecd

Ecdysis. Ecdysis: a portal for live-data arthropod collections, https://ecdysis.org/index.php (2022).

Ecd_Data <- BeeBDC::readr_BeeBDC(dataset = "Ecd",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Ecdysis_occs.csv",
                      outFile = "jbd_Ecd_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

f. Gai

Gaiarsa, M. P., Kremen, C. & Ponisio, L. C. Pollinator interaction flexibility across scales affects patch colonization and occupancy. Nature Ecology & Evolution 5, 787-793 (2021). https://doi.org/10.1038/s41559-021-01434-y

Gai_Data <- BeeBDC::readr_BeeBDC(dataset = "Gai",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/upload_to_scan_Gaiarsa et al_Dorey.csv",
                      outFile = "jbd_Gai_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

g. CAES

From the Connecticut Agricultural Experiment Station.

Zarrillo, T. A., Stoner, K. A. & Ascher, J. S. Biodiversity of bees (Hymenoptera: Apoidea: Anthophila) in Connecticut (USA). Zootaxa (Accepted).

Ecdysis. Occurrence dataset (ID: 16fca9c2-f622-4cb1-aef0-3635a7be5aeb). https://ecdysis.org/content/dwca/CAES-CAES_DwC-A.zip. (2023)

CAES_Data <- BeeBDC::readr_BeeBDC(dataset = "CAES",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/CT_BEE_DATA_FROM_PBI.xlsx",
                        outFile = "jbd_CT_Data.csv",
                        sheet = "Sheet1",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

h. GeoL

GeoL_Data <- BeeBDC::readr_BeeBDC(dataset = "GeoL",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Geolocate and BELS_certain and accurate.xlsx",
                        outFile = "jbd_GeoL_Data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

i. EaCO

EaCO_Data <- BeeBDC::readr_BeeBDC(dataset = "EaCO",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Eastern Colorado bee 2017 sampling.xlsx",
                        outFile = "jbd_EaCo_Data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

j. FSCA

Florida State Collection of Arthropods

FSCA_Data <- BeeBDC::readr_BeeBDC(dataset = "FSCA",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/fsca_9_15_22_occurrences.csv",
                        outFile = "jbd_FSCA_Data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

k. Texas SMC

Published or unpublished data from Texas literature not in an online database, usually copied into spreadsheet from document format, or otherwise copied from a very differently-formatted spreadsheet. Unpublished or partially published data were obtained with express permission from the lead author.

SMC_Data <- BeeBDC::readr_BeeBDC(dataset = "SMC",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/TXbeeLitOccs_31Oct22.csv", 
                      outFile = "jbd_SMC_Data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

l. Texas Bal

Data with GPS coordinates (missing accidentally from records on Dryad) from Ballare, K. M., Neff, J. L., Ruppel, R. & Jha, S. Multi-scalar drivers of biodiversity: local management mediates wild bee community response to regional urbanization. Ecological Applications 29, e01869 (2019), https://doi.org/10.1002/eap.1869. The version on Dryad is missing site GPS coordinates (by accident). Kim is okay with these data being made public as long as her paper is referenced. - Elinor Lichtenberg

Bal_Data <- BeeBDC::readr_BeeBDC(dataset = "Bal",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Beedata_ballare.xlsx", 
                      outFile = "jbd_Bal_Data.csv",
                      sheet = "animal_data",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

m. Palouse Lic

Elinor Lichtenberg’s canola data: Lichtenberg, E. M., Milosavljević, I., Campbell, A. J. & Crowder, D. W. Differential effects of soil conservation practices on arthropods and crop yields. Journal of Applied Entomology, (2023) https://doi.org/10.1111/jen.13188. These are the data I will be putting on SCAN. - Elinor Lichtenberg

Lic_Data <- BeeBDC::readr_BeeBDC(dataset = "Lic",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Lichtenberg_canola_records.csv", 
                      outFile = "jbd_Lic_Data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

n. Arm

Data from Armando Falcon-Brindis from the University of Kentucky.

Arm_Data <- BeeBDC::readr_BeeBDC(dataset = "Arm",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Bee database Armando_Final.xlsx",
                      outFile = "jbd_Arm_Data.csv",
                      sheet = "Sheet1",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

o. Dor

From several papers:

Dorey, J. B., Fagan-Jeffries, E. P., Stevens, M. I., & Schwarz, M. P. (2020). Morphometric comparisons and novel observations of diurnal and low-light-foraging bees. Journal of Hymenoptera Research, 79, 117–144. doi:https://doi.org/10.3897/jhr.79.57308
Dorey, J. B. (2021). Missing for almost 100 years: the rare and potentially threatened bee Pharohylaeus lactiferus (Hymenoptera, Colletidae). Journal of Hymenoptera Research, 81, 165-180. doi:https://doi.org/10.3897/jhr.81.59365
Dorey, J. B., Schwarz, M. P., & Stevens, M. I. (2019). Review of the bee genus Homalictus Cockerell (Hymenoptera: Halictidae) from Fiji with description of nine new species. Zootaxa, 4674(1), 1–46. doi:https://doi.org/10.11646/zootaxa.4674.1.1

  Dor_Data <- BeeBDC::readr_BeeBDC(dataset = "Dor",
                    path = paste0(DataPath, "/Additional_Datasets"),
                    inFile = "/InputDatasets/DoreyData.csv",
                    outFile = "jbd_Dor_Data.csv",
                    dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")

p. VicWam

These data are originally from the Victorian Museum and Western Australian Museum in Australia. However, in their current form they are from Dorey et al. 2021.

PADIL. (2020). PaDIL. https://www.padil.gov.au
Houston, T. F. (2000). Native bees on wildflowers in Western Australia. Western Australian Insect Study Society.
Dorey, J. B., Rebola, C. M., Davies, O. K., Prendergast, K. S., Parslow, B. A., Hogendoorn, K., . . . Caddy-Retalic, S. (2021). Continental risk assessment for understudied taxa post catastrophic wildfire indicates severe impacts on the Australian bee fauna. Global Change Biology, 27(24), 6551-6567. doi:https://doi.org/10.1111/gcb.15879

 VicWam_Data <- BeeBDC::readr_BeeBDC(dataset = "VicWam",
                    path = paste0(DataPath, "/Additional_Datasets"),
                    inFile = "/InputDatasets/Combined_Vic_WAM_databases.xlsx",
                    outFile = "jbd_VicWam_Data.csv",
                    dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/",
                    sheet = "Combined")

2.5 Merge all

Remove these spent datasets.

  rm(EPEL_Data, ASP_Data, BMin_Data, BMont_Data, Ecd_Data, Gai_Data, CAES_Data, 
  GeoL_Data, EaCO_Data, FSCA_Data, SMC_Data, Bal_Data, Lic_Data, Arm_Data, Dor_Data,
  VicWam_Data)

Read in and merge all. readr_BeeBDC() supports more datasets than are implemented here; the others are datasets that will be publicly released in the future. See ‘?readr_BeeBDC()’ for details.

db_standardized <- db_standardized %>%
  dplyr::bind_rows(
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_ASP_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_EPEL_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_BMin_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_BMont_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Ecd_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Gai_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_CT_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_GeoL_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_EaCo_Data.csv"), col_types = BeeBDC::ColTypeR()), 
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_SMC_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Bal_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Lic_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Arm_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Dor_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_VicWam_Data.csv"), col_types = BeeBDC::ColTypeR())) %>% 
    # END bind_rows
  suppressWarnings(classes = "warning") # End suppressWarnings — due to col_types
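Before a long merge like this, it can be worth confirming that every expected output file exists, since a single missing or misnamed file will stop the whole call. Note that the file names mix “_data” and “_Data”, which matters on case-sensitive file systems. The sketch below is an assumption-laden illustration: it uses a temporary folder in place of paste0(DataPath, "/Additional_Datasets") and only three of the file names.

```r
# A few of the expected output files (names taken from the calls above)
expected <- c("jbd_ASP_data.csv", "jbd_EPEL_data.csv", "jbd_CT_Data.csv")

# Temporary folder standing in for the Additional_Datasets directory
add_dir <- file.path(tempdir(), "Additional_Datasets_demo")
dir.create(add_dir, showWarnings = FALSE)

# Create only two of the three files to simulate a missing dataset
write.csv(data.frame(database_id = "ASP_data_1"),
          file.path(add_dir, "jbd_ASP_data.csv"), row.names = FALSE)
write.csv(data.frame(database_id = "EPEL_data_1"),
          file.path(add_dir, "jbd_EPEL_data.csv"), row.names = FALSE)

# Report any expected file that is not present
missing_files <- expected[!file.exists(file.path(add_dir, expected))]
missing_files
```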

2.6 Match database_id

If you would like the database_ids in the current run to match those from a prior run, you may use the script below to try to match them.

Read in a prior run of choice.

  priorRun <- BeeBDC::fileFinder(path = DataPath,
                          fileName = "01_prefilter_database_9Aug22.csv") %>%
    readr::read_csv(file = ., col_types = BeeBDC::ColTypeR())

This function will attempt to find the database_ids from prior runs.

  db_standardized <- BeeBDC::idMatchR(
  currentData = db_standardized,
  priorData = priorRun,
    # First matches will be given preference over later ones
  matchBy = tibble::lst(c("gbifID", "dataSource"),
                        c("catalogNumber", "institutionCode", "dataSource", "decimalLatitude",
                          "decimalLongitude"),
                        c("occurrenceID", "dataSource","decimalLatitude","decimalLongitude"),
                        c("recordId", "dataSource","decimalLatitude","decimalLongitude"),
                        c("id", "dataSource","decimalLatitude","decimalLongitude"),
                        # Because INHS was entered as its own dataset but is now included in the GBIF download...
                        c("catalogNumber", "institutionCode", "dataSource",
                          "decimalLatitude","decimalLongitude")),
    # You can exclude datasets from prior by matching their prefixes (before the first underscore):
  excludeDataset = c("ASP", "BMin", "BMont", "CAES", "EaCO", "Ecd", "EcoS",
                     "Gai", "KP", "EPEL", "FSCA", "SMC", "Lic", "Arm",
                     "VicWam"))

 # Remove redundant files
rm(priorRun)
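The idea of matching on successive column sets, with earlier sets taking preference, can be sketched in base R. This is a simplified illustration on toy data and is not how idMatchR() is implemented. It only shows why the order of matchBy matters: rows matched by an earlier set are never overwritten by a later one.

```r
# Toy current and prior runs (hypothetical values)
current <- data.frame(
  gbifID = c("1", "2", NA, NA),
  catalogNumber = c("A", "B", "C", "D"),
  dataSource = c("GBIF", "GBIF", "SCAN", "SCAN"),
  stringsAsFactors = FALSE)
prior <- data.frame(
  database_id = c("old_1", "old_2", "old_3"),
  gbifID = c("1", NA, NA),
  catalogNumber = c("A", "B", "C"),
  dataSource = c("GBIF", "GBIF", "SCAN"),
  stringsAsFactors = FALSE)

# Column sets, tried in order
matchBy <- list(c("gbifID", "dataSource"),
                c("catalogNumber", "dataSource"))

matched_id <- rep(NA_character_, nrow(current))
for (cols in matchBy) {
  # Only rows not yet matched, and with no missing key values, take part
  cur_ok <- stats::complete.cases(current[, cols, drop = FALSE]) &
    is.na(matched_id)
  pri_ok <- stats::complete.cases(prior[, cols, drop = FALSE])
  # Build one composite key per row from the chosen columns
  cur_key <- do.call(paste, c(current[cur_ok, cols, drop = FALSE], sep = "\r"))
  pri_key <- do.call(paste, c(prior[pri_ok, cols, drop = FALSE], sep = "\r"))
  hit <- match(cur_key, pri_key)
  matched_id[cur_ok] <- prior$database_id[pri_ok][hit]
}
matched_id
```

Here the first row is matched by gbifID, the next two by catalogNumber, and the last has no counterpart in the prior run.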

Save the dataset.

  db_standardized %>%
    readr::write_excel_csv(.,
                     paste(OutPath_Intermediate, "00_prefilter_database.csv",
                           sep = "/"))

3.0 Initial flags

Read data back in if needed. OutPath_Intermediate (and a few other directories) should have been created and saved to the global environment by dirMaker().

if(!exists("db_standardized")){
  db_standardized <- readr::read_csv(paste(OutPath_Intermediate, "00_prefilter_database.csv",
                                    sep = "/"), col_types = BeeBDC::ColTypeR())}

Normally, you would use the full dataset, as read in above. But, for the sake of this vignette, we will use a combination of two example datasets. These example datasets can further be very useful for testing functions if you’re ever feeling a bit confused and overwhelmed!

data("bees3sp", package = "BeeBDC")
data("beesRaw", package = "BeeBDC")
db_standardized <- dplyr::bind_rows(beesRaw, 
                                      # Only keep a subset of columns from bees3sp
                             bees3sp %>% dplyr::select(tidyselect::all_of(colnames(beesRaw)), countryCode))

For more details about the bdc package, please see their tutorial.

3.1 SciName

Flag occurrences without scientificName provided.

check_pf <- bdc::bdc_scientificName_empty(data = db_standardized, sci_name = "scientificName")
## 
## bdc_scientificName_empty:
## Flagged 0 records.
## One column was added to the database.
# now that this is saved, remove it to save space in memory
rm(db_standardized)

3.2 MissCoords

Flag occurrences with missing decimalLatitude and decimalLongitude.

check_pf <- bdc::bdc_coordinates_empty(data = check_pf, lat = "decimalLatitude",
    lon = "decimalLongitude")
## 
## bdc_coordinates_empty:
## Flagged 42 records.
## One column was added to the database.

3.3 OutOfRange

Flag occurrences that are not on Earth (outside of -180 to 180 or -90 to 90 degrees).

check_pf <- bdc::bdc_coordinates_outOfRange(data = check_pf, lat = "decimalLatitude",
    lon = "decimalLongitude")
## 
## bdc_coordinates_outOfRange:
## Flagged 0 records.
## One column was added to the database.
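The range test itself is simple to express. The sketch below shows the underlying check in base R on toy coordinates. In this sketch missing coordinates yield NA and are left to the separate missing-coordinates check; that handling is an assumption of the example, not a description of bdc's behaviour.

```r
# Toy coordinates: valid, latitude too large, longitude too small, missing
coords <- data.frame(decimalLatitude  = c(-34.9, 95, 10, NA),
                     decimalLongitude = c(138.6, 20, -200, 50))

# TRUE when both values fall within the valid ranges for Earth
in_range <- abs(coords$decimalLatitude) <= 90 &
  abs(coords$decimalLongitude) <= 180
in_range
```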

3.4 Source

Flag occurrences that don’t match the basisOfRecord types below.

check_pf <- bdc::bdc_basisOfRecords_notStandard(
  data = check_pf,
  basisOfRecord = "basisOfRecord",
  names_to_keep = c(
    # Keep all plus some at the bottom.
    "Event",
    "HUMAN_OBSERVATION",
    "HumanObservation",
    "LIVING_SPECIMEN",
    "LivingSpecimen",
    "MACHINE_OBSERVATION",
    "MachineObservation",
    "MATERIAL_SAMPLE",
    "O",
    "Occurrence",
    "MaterialSample",
    "OBSERVATION",
    "Preserved Specimen",
    "PRESERVED_SPECIMEN",
    "preservedspecimen Specimen",
    "Preservedspecimen",
    "PreservedSpecimen",
    "preservedspecimen",
    "S",
    "Specimen",
    "Taxon",
    "UNKNOWN",
    "",
    NA,
    "NA",
    "LITERATURE", 
    "None", "Pinned Specimen", "Voucher reared", "Emerged specimen"
  ))
## 
## bdc_basisOfRecords_notStandard:
## Flagged 1 of the following specific nature:
##  MATERIAL_CITATION 
## One column was added to the database.
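Conceptually, this flag is a membership test against names_to_keep. The base R sketch below uses a shortened list and toy values. Whether bdc uses %in% internally is an assumption, but the example shows why NA is listed explicitly: with %in%, a missing value only passes if NA is itself in the keep list.

```r
# Toy basisOfRecord values and a shortened keep list
basisOfRecord <- c("PRESERVED_SPECIMEN", "HumanObservation",
                   "MATERIAL_CITATION", NA)
names_to_keep <- c("PRESERVED_SPECIMEN", "HumanObservation", NA)

# TRUE = record passes; FALSE = record would be flagged
keep <- basisOfRecord %in% names_to_keep
keep
```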

3.5 CountryName

Try to harmonise country names.

a. prepare dataset

Fix up country names based on the common problems listed below and extract ISO2 codes for occurrences.

check_pf_noNa <- BeeBDC::countryNameCleanR(
  data = check_pf,
    # Create a Tibble of common issues in country names and their replacements
  commonProblems = dplyr::tibble(problem = c('U.S.A.', 'US','USA','usa','UNITED STATES',
                                              'United States','U.S.A','MX','CA','Bras.','Braz.',
                                              'Brasil','CNMI','USA TERRITORY: PUERTO RICO'),
                                  fix = c('United States of America','United States of America',
                                          'United States of America','United States of America',
                                          'United States of America','United States of America',
                                          'United States of America','Mexico','Canada','Brazil',
                                          'Brazil','Brazil','Northern Mariana Islands','PUERTO.RICO'))
  )
##  - Using default country names and codes from https:en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - static version from July 2022.
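The commonProblems table is a simple lookup of problem strings and their replacements. A rough base R sketch of how such a lookup can be applied follows; it illustrates the idea only and is not the internals of countryNameCleanR().

```r
# Shortened lookup table, taken from the values above
commonProblems <- data.frame(
  problem = c("USA", "MX", "Brasil"),
  fix = c("United States of America", "Mexico", "Brazil"),
  stringsAsFactors = FALSE)

# Toy country column
country <- c("USA", "Mexico", "Brasil", "Canada", NA)

# Replace values found in the lookup; leave everything else as it is
idx <- match(country, commonProblems$problem)
country_fixed <- ifelse(is.na(idx), country, commonProblems$fix[idx])
country_fixed
```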

b. run function

Get country name from coordinates using a wrapper around the jbd_country_from_coordinates() function. Because our dataset is much larger than those used to design bdc, we have made it so that you can analyse data in smaller pieces. Additionally, like some other functions in BeeBDC, we have implemented parallel operations (using mc.cores = #cores, with stepSize = #rowsPerOperation); see ‘?jbd_CfC_chunker()’ for details. NOTE: In an actual run you should use scale = “large”.

suppressWarnings(
  countryOutput <- BeeBDC::jbd_CfC_chunker(data = check_pf_noNa,
                                   lat = "decimalLatitude",
                                   lon = "decimalLongitude",
                                   country = "country",
                                    # How many rows to process at a time
                                   stepSize = 1000000,
                                    # Start row
                                   chunkStart = 1,
                                   path = OutPath_Intermediate,
                                    # Normally, please use scale = "large"
                                   scale = "medium",
                                   mc.cores = 1),
  classes = "warning")
##  - Starting parallel operation. Unlike the serial operation (mc.cores = 1) , a parallel operation will not provide running feedback. Please be patient  as this function may take some time to complete. Each chunk will be run on  a seperate thread so also be aware of RAM usage.
##  - We have updated the country names of 39 occurrences that previously had no country name assigned.

c. re-merge

Join these datasets.

check_pf <- dplyr::left_join(check_pf, countryOutput, by = "database_id", suffix = c("",
    "CO")) %>%
    # Take the new country name if the original is NA
dplyr::mutate(country = dplyr::if_else(is.na(country), countryCO, country)) %>%
    # Remove duplicates if they arose from left_join!
dplyr::distinct()
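The re-merge fills in a country only where the original value is missing. The same rule can be sketched in base R on toy data. Unlike left_join(), match() takes only the first matching id, so no duplicate rows arise in this version.

```r
# Toy main table and toy output from the country lookup
check <- data.frame(database_id = c("a", "b", "c"),
                    country = c("Australia", NA, NA),
                    stringsAsFactors = FALSE)
countryOut <- data.frame(database_id = "b",
                         country = "Fiji",
                         stringsAsFactors = FALSE)

# Look up each record's id in the new output
idx <- match(check$database_id, countryOut$database_id)

# Take the new country name only if the original is NA
check$country <- ifelse(is.na(check$country),
                        countryOut$country[idx],
                        check$country)
check$country
```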

Save the dataset.

check_pf %>%
    readr::write_excel_csv(., paste(OutPath_Intermediate, "01_prefilter_database.csv",
        sep = "/"))

Read in if needed.

if (!exists("check_pf")) {
    check_pf <- readr::read_csv(paste(DataPath, "Output", "Intermediate", "01_prefilter_database.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
}

Remove these interim datasets.

rm(check_pf_noNa, countryOutput)

3.6 StandardCoNames

Run the function, which standardises country names and adds ISO2 codes, if needed.

  # Standardise country names and add ISO2 codes if needed
check_pf <- bdc::bdc_country_standardized(
  # Remove the countryCode and country_suggested columns to avoid an error 
    # where two "countryCode" and "country_suggested" columns exist (i.e. if the dataset has  
    # been run before)
  data = check_pf %>% dplyr::select(!tidyselect::any_of(c("countryCode", "country_suggested"))),
  country = "country"
) 
## Loading auxiliary data: country names
## Standardizing country names
## country found: Argentina
## country found: Australia
## country found: Belgium
## country found: Brazil
## country found: Canada
## country found: Colombia
## country found: Costa Rica
## country found: Ecuador
## country found: Estonia
## country found: Finland
## country found: France
## country found: Germany
## country found: Ireland
## country found: Mexico
## country found: Norway
## country found: South Africa
## country found: Sweden
## country found: Switzerland
## 
## bdc_country_standardized:
## The country names of 5 records were standardized.
## Two columns ('country_suggested' and 'countryCode') were added to the database.

3.7 TranspCoords

Flag and correct records when decimalLatitude and decimalLongitude appear to be transposed. We created this chunked version of bdc::bdc_coordinates_transposed() because the original is very RAM-heavy when run on our large bee dataset. Like many of our other ‘jbd_…’ functions, it includes other improvements, e.g., parallel running.

NOTE: Usually you would use scale = “large”, which requires rnaturalearthhires

check_pf <- BeeBDC::jbd_Ctrans_chunker(
  # bdc_coordinates_transposed inputs
  data = check_pf,
  id = "database_id",
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country",
  countryCode = "countryCode",
  border_buffer = 0.2, # in decimal degrees (~22 km at the equator)
  save_outputs = TRUE,
  sci_names = "scientificName",
  # chunker inputs
  stepSize = 1000000,  # How many rows to process at a time
  chunkStart = 1,  # Start row
  append = FALSE,  # If FALSE it may overwrite existing dataset
  progressiveSave = FALSE,
    # In a normal run, please use scale = "large"
  scale = "medium",
  path = OutPath_Check,
  mc.cores = 1
) 
##  - Running chunker with:
## stepSize = 1,000,000
## chunkStart = 1
## chunkEnd = 1,000,000
## append = FALSE
##  - Starting chunk 1...
## From 1 to 1,000,000
##  - Finished chunk 1 of 1. Total records examined: 205

Get a quick summary of the number of transposed records.

table(check_pf$coordinates_transposed, useNA = "always")
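
The logic of a transposition check can be illustrated with a rough bounding box: a record looks transposed when it falls outside its stated country as given, but inside once latitude and longitude are swapped. This is a simplified sketch, not the bdc or BeeBDC implementation (which works with country polygons and a border buffer); the bounding box values are approximate and the inBox() helper is hypothetical.

```r
# Approximate bounding box for Australia (illustration only)
bbox <- list(latMin = -44, latMax = -10, lonMin = 112, lonMax = 154)

# Hypothetical helper: is each point inside the box?
inBox <- function(lat, lon, box) {
  lat >= box$latMin & lat <= box$latMax &
    lon >= box$lonMin & lon <= box$lonMax
}

# Toy records, all labelled as Australian
toy <- data.frame(
  decimalLatitude = c(-33.9, 151.2, 10),
  decimalLongitude = c(151.2, -33.9, 20)
)

asIs <- inBox(toy$decimalLatitude, toy$decimalLongitude, bbox)
swapped <- inBox(toy$decimalLongitude, toy$decimalLatitude, bbox)
looksTransposed <- !asIs & swapped

# Correct only the records that look transposed
fixed <- toy
fixed$decimalLatitude[looksTransposed] <- toy$decimalLongitude[looksTransposed]
fixed$decimalLongitude[looksTransposed] <- toy$decimalLatitude[looksTransposed]
```

The third toy record fails both tests, so it is left untouched; it is wrong in some other way that swapping cannot explain.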

Save the dataset.

check_pf %>%
    readr::write_excel_csv(., paste(OutPath_Intermediate, "01_prefilter_database.csv",
        sep = "/"))

Read the data in again if needed.

if (!exists("check_pf")) {
    check_pf <- readr::read_csv(paste(OutPath_Intermediate, "01_prefilter_database.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
}

3.8 Coord-country

Collect all country names in the country_suggested column. We rebuilt a bdc function to flag occurrences where the coordinates are inconsistent with the provided country name.

check_pf <- BeeBDC::jbd_coordCountryInconsistent(data = check_pf, lon = "decimalLongitude",
    lat = "decimalLatitude", scale = 50, pointBuffer = 0.01)
##  - Downloading naturalearth map...
## Spherical geometry (s2) switched off
##  - Extracting initial country names without buffer...
##  - Buffering naturalearth map by pointBuffer...
## dist is assumed to be in decimal degrees (arc_degrees).
##  - Extracting FAILED country names WITH buffer...
## 
## jbd_coordinates_country_inconsistent:
## Flagged 2 records.
## The column, '.coordinates_country_inconsistent', was added to the database.
##  - Completed in 2.12 secs

Save the dataset.

check_pf %>%
    readr::write_excel_csv(., paste(OutPath_Intermediate, "01_prefilter_database.csv",
        sep = "/"))

3.9 GeoRefIssue

This function identifies records whose coordinates can potentially be extracted from locality information, which must be manually checked later.

xyFromLocality <- bdc::bdc_coordinates_from_locality(data = check_pf, locality = "locality",
    lon = "decimalLongitude", lat = "decimalLatitude", save_outputs = FALSE)
## 
## bdc_coordinates_from_locality 
## Found 38 records missing or with invalid coordinates but with potentially useful information on locality.

# Save the resultant data
xyFromLocality %>%
    readr::write_excel_csv(paste(OutPath_Check, "01_coordinates_from_locality.csv",
        sep = "/"))
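
As a rough idea of what this check looks for, the sketch below picks out toy records whose coordinates are missing or out of range but which still carry a non-empty locality string. It is an illustration under those assumptions and not the bdc implementation; the toy data are hypothetical.

```r
# Toy records (hypothetical values)
toy <- data.frame(
  database_id = c("a1", "a2", "a3", "a4"),
  decimalLatitude = c(-35.3, NA, NA, 95),
  decimalLongitude = c(149.1, NA, 150.9, 10),
  locality = c("Canberra", "5 km N of Wollongong", NA, "")
)

# Coordinates are missing or fall outside the valid ranges
missingOrInvalid <- is.na(toy$decimalLatitude) | is.na(toy$decimalLongitude) |
  abs(toy$decimalLatitude) > 90 | abs(toy$decimalLongitude) > 180

# Locality is present and not just whitespace
hasLocality <- !is.na(toy$locality) & nzchar(trimws(toy$locality))

# Candidates for manual georeferencing
toCheck <- toy[missingOrInvalid & hasLocality, ]
```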

Remove spent data.

rm(xyFromLocality)

3.10 Flag Absent

Flag the records marked as “absent”.

check_pf <- BeeBDC::flagAbsent(data = check_pf, PresAbs = "occurrenceStatus")
## \.occurrenceAbsent:
##  Flagged 8 absent records:
##  One column was added to the database.

3.11 flag License

Flag the records that may not be used according to their license information.

check_pf <- BeeBDC::flagLicense(data = check_pf,
                    strings_to_restrict = "all",
                    # DON'T flag if in the following dataSource(s)
                    excludeDataSource = NULL)
## \.unLicensed:
##  Flagged 0 records that may NOT be used.
##  One column was added to the database.

3.12 GBIF issue

Flag select issues that are flagged by GBIF.

check_pf <- BeeBDC::GBIFissues(data = check_pf, issueColumn = "issue", GBIFflags = c("COORDINATE_INVALID",
    "ZERO_COORDINATE"))
##  - jbd_GBIFissues:
## Flagged 0 
##   The .GBIFflags column was added to the database.
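
The sketch below shows one way such a flag could be derived, assuming the issue column holds semicolon-separated GBIF issue codes and that FALSE marks a flagged record. It is an illustration on toy data and not the GBIFissues() source.

```r
# Toy issue strings (hypothetical records)
issue <- c(
  "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84",
  "ZERO_COORDINATE;COUNTRY_DERIVED_FROM_COORDINATES",
  NA,
  ""
)
GBIFflags <- c("COORDINATE_INVALID", "ZERO_COORDINATE")

# Split each string into codes and test whether any code is in GBIFflags
hasIssue <- vapply(
  strsplit(issue, ";", fixed = TRUE),
  function(codes) any(codes %in% GBIFflags),
  logical(1)
)

# FALSE = flagged; records with no issue string pass
flagColumn <- !hasIssue
```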

3.13 Flag Reports

a. Save flags

Save the flags so far. This function will make sure that you keep a copy of everything that has been flagged up until now. This will be updated throughout the script and can be accessed at the end, so be wary of moving files around manually. However, these data will also still be maintained in the main running file, so this is an optional fail-safe.

flagFile <- BeeBDC::flagRecorder(
  data = check_pf,
  outPath = paste(OutPath_Report, sep =""),
  fileName = paste0("flagsRecorded_", Sys.Date(),  ".csv"),
    # These are the columns that will be kept along with the flags
  idColumns = c("database_id", "id", "catalogNumber", "occurrenceID", "dataSource"),
    # TRUE if you want to find a file from a previous part of the script to append to
  append = FALSE)

Update the .summary column

check_pf <- BeeBDC::summaryFun(
  data = check_pf,
    # Don't filter these columns (or NULL)
  dontFilterThese = NULL,
    # Remove the filtering columns?
  removeFilterColumns = FALSE,
    # Filter to ONLY cleaned data?
  filterClean = FALSE)
##  - We will flag all columns starting with '.'
##  - summaryFun:
## Flagged 52 
##   The .summary column was added to the database.
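
Conceptually, the .summary column combines every flag column (those starting with '.') into one: a record passes only if none of its flags is FALSE. A base R sketch on toy data is below; treating NA as a pass is an assumption made for this example, and the flag columns shown are just examples.

```r
# Toy table with two example flag columns (FALSE = flagged)
toy <- data.frame(
  database_id = c("a1", "a2", "a3"),
  .coordinates_empty = c(TRUE, FALSE, TRUE),
  .occurrenceAbsent = c(TRUE, TRUE, NA),
  check.names = FALSE
)

# Find every column whose name starts with "."
flagCols <- names(toy)[startsWith(names(toy), ".")]
flagMatrix <- as.matrix(toy[, flagCols, drop = FALSE])

# Assumption for this sketch: an untested (NA) flag counts as a pass
flagMatrix[is.na(flagMatrix)] <- TRUE

# A record passes only when no flag is FALSE
toy$.summary <- rowSums(!flagMatrix) == 0
```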

c. Reporting

Use bdc to generate reports.

(report <- bdc::bdc_create_report(data = check_pf, database_id = "database_id", workflow_step = "prefilter",
    save_report = TRUE))

3.14 Save

Save the intermediate dataset.

check_pf %>%
    readr::write_excel_csv(., paste(OutPath_Intermediate, "01_prefilter_output.csv",
        sep = "/"))

4.0 Taxonomy

For more information about the corresponding bdc functions used in this section, see their tutorial.

Read in the filtered dataset or rename the 3.x dataset for 4.0.

if (!exists("check_pf")) {
    database <- readr::read_csv(paste(OutPath_Intermediate, "01_prefilter_output.csv",
        sep = "/"), col_types = BeeBDC::ColTypeR())
} else {
    # OR rename and remove
    database <- check_pf
    # Remove spent dataset
    rm(check_pf)
}

Remove names_clean if it already exists (i.e. you have run the following functions on this dataset before).

database <- database %>%
    dplyr::select(!tidyselect::any_of("names_clean"))

4.1 Prep data names

This step cleans the database’s scientificName column.

! MAC: You might need to install gnparser through the terminal using Homebrew:

  brew tap gnames/gn
  brew install gnparser

Attention:
This can be difficult for a Windows install. Ensure you have the most recent version of R, RStudio, and R packages. Also, check that the package ‘rgnparser’ is installed correctly. If you still cannot get the code below to work, you may have to download the latest version of ‘gnparser’ from here. You may then need to install it manually and edit your system’s PATH environment variable so that it can locate ‘gnparser.exe’. See here.

parse_names <- bdc::bdc_clean_names(sci_names = database$scientificName, save_outputs = FALSE)

## The latest gnparser version is v1.7.4
## gnparser has been installed to /home/runner/bin
## 
## >> Family names prepended to scientific names were flagged and removed from 0 records.
## >> Terms denoting taxonomic uncertainty were flagged and removed from 0 records.
## >> Other issues, capitalizing the first letter of the generic name, replacing empty names by NA, and     removing extra spaces, were flagged and corrected or removed from 1 records.
## >> Infraspecific terms were flagged and removed from 0 records.
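
To give a feel for the kinds of corrections reported above, here is a heavily simplified base R sketch. It only handles two uncertainty terms ("cf." and "aff."), extra spaces, capitalisation, and empty names; the real function relies on gnparser and covers far more cases. The toy names and the pattern are assumptions made for this example.

```r
# Toy scientific names (hypothetical input)
rawNames <- c("apis  mellifera", "Bombus cf. terrestris", " ",
              "Lasioglossum aff. lanarium")

# Assumed pattern for terms denoting taxonomic uncertainty
uncertaintyPattern <- "\\b(cf|aff)\\."
uncertain <- grepl(uncertaintyPattern, rawNames, ignore.case = TRUE, perl = TRUE)

# Remove the uncertainty terms, then collapse and trim whitespace
cleaned <- gsub(uncertaintyPattern, " ", rawNames, ignore.case = TRUE, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))

# Capitalise the first letter of the generic name
cleaned <- paste0(toupper(substr(cleaned, 1, 1)),
                  substr(cleaned, 2, nchar(cleaned)))

# Replace empty names with NA
cleaned[!nzchar(cleaned)] <- NA
```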

Keep only the .uncer_terms and names_clean columns.

parse_names <- parse_names %>%
    dplyr::select(.uncer_terms, names_clean)

Merge names with the complete dataset.

  # Column-bind the cleaned names onto the dataset (rows must be in the same order)
database <- dplyr::bind_cols(database, parse_names)
rm(parse_names)
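
Note that dplyr::bind_cols() matches rows by position, not by an identifier, so both inputs must have the same number of rows in the same order. A small sketch with toy stand-ins for the two objects (all values hypothetical):

```r
# Toy stand-ins for the occurrence data and the parsed names
database <- data.frame(
  database_id = c("a1", "a2"),
  scientificName = c("apis mellifera", "Bombus cf. terrestris")
)
parse_names <- data.frame(
  .uncer_terms = c(TRUE, FALSE),
  names_clean = c("Apis mellifera", "Bombus terrestris")
)

# Guard against a silent misalignment before binding by position
stopifnot(nrow(database) == nrow(parse_names))
database <- dplyr::bind_cols(database, parse_names)
```

If the row order of either object has changed since the names were parsed, re-run the parsing step instead of binding.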

4.2 Harmonise taxonomy

Download the custom taxonomy file from the BeeBDC package and Discover Life website.

taxonomyFile <- BeeBDC::beesTaxonomy()

Attention:
As of version 1.1.0, BeeBDC has a new function that can download taxonomies using the taxadb package and transform them into the BeeBDC format. The function, BeeBDC::taxadbToBeeBDC(), allows the user to choose their desired provider (e.g., “gbif”, “itis”…), version, taxon name, and rank, and whether to save the taxonomy as a readable csv. For example, for the bee genus Apis:

ApisTaxonomy <- BeeBDC::taxadbToBeeBDC(
  name = "Apis",
  rank = "Genus",
  provider = "gbif",
  version = "22.12",
  outPath = getwd(),
  fileName = "ApisTaxonomy.csv"
  )

Harmonise the names in the occurrence tibble. This flags the occurrences without a matched name and matches names to their correct name according to Discover Life. You can also use multiple cores to achieve this. See ‘?harmoniseR()’ for details.

database <- BeeBDC::harmoniseR(path = DataPath, # The path to a folder where the output can be saved
                       taxonomy = taxonomyFile, # The formatted taxonomy file
                       data = database,
                       mc.cores = 1)
##  - Formatting taxonomy for matching...
## The names_clean column was not found and will be temporarily copied from scientificName
## 
##  - Harmonise the occurrence data with unambiguous names...
## 
##  - Attempting to harmonise the occurrence data with ambiguous names...
##  - Formatting merged datasets...
## Removing the names_clean column...
##  - We matched valid names to 196 of 205 occurrence records. This leaves a total of 9 unmatched occurrence records.
## 
## harmoniseR:
## 9
## records were flagged.
## The column, '.invalidName' was added to the database.
##  - We updated the following columns: scientificName, species, family, subfamily, genus, subgenus, specificEpithet, infraspecificEpithet, and scientificNameAuthorship. The previous scientificName column was converted to verbatimScientificName
##  - Completed in 1.27 secs
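
The core idea of harmonisation (match each name against a taxonomy, replace synonyms with the accepted name, and flag anything unmatched) can be sketched in base R. This is a toy illustration, not the harmoniseR() algorithm, and the two-column taxonomy used here (name, acceptedName) is a hypothetical layout rather than the BeeBDC taxonomy format.

```r
# Hypothetical mini-taxonomy: every known name maps to an accepted name
taxonomy <- data.frame(
  name = c("Apis mellifera", "Apis mellifica", "Bombus terrestris"),
  acceptedName = c("Apis mellifera", "Apis mellifera", "Bombus terrestris")
)

# Toy occurrences; the last name is deliberately made up
occ <- data.frame(
  database_id = c("a1", "a2", "a3"),
  scientificName = c("Apis mellifica", "Bombus terrestris", "Notabee fictus")
)

idx <- match(occ$scientificName, taxonomy$name)

# Keep the name as originally supplied
occ$verbatimScientificName <- occ$scientificName

# FALSE = no match found in the taxonomy
occ$.invalidName <- !is.na(idx)

# Replace matched names with their accepted name; leave the rest as-is
occ$scientificName <- ifelse(is.na(idx), occ$scientificName,
                             taxonomy$acceptedName[idx])
```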

You don’t need this file any more…

rm(taxonomyFile)

Save the harmonised file.

database %>%
    readr::write_excel_csv(., paste(DataPath, "Output", "Intermediate", "02_taxonomy_database.csv",
        sep = "/"))

4.3 Save flags

Save the flags so far. This will find the most recent flag file and append your new data to it. You can double-check the data and the number of columns if you’d like to be thorough and sure that all of the data are intact.

flagFile <- BeeBDC::flagRecorder(data = database, outPath = paste(OutPath_Report,
    sep = ""), fileName = paste0("flagsRecorded_", Sys.Date(), ".csv"), idColumns = c("database_id",
    "id", "catalogNumber", "occurrenceID", "dataSource"), append = TRUE, printSummary = TRUE)
