User Guide

Sheeja Manchira Krishnan

2019-10-15

library(IPDFileCheck)

IPDFileCheck

IPDFileCheck is a package for checking the data file from a randomised controlled trial (RCT). The standard checks on an RCT data file are:

1. To check that the file exists and is readable
2. To check if a column exists
3. To get the column number if the column name is known
4. To test column contents, i.e. do they contain specific items in a given list?
5. To test whether the column names of the data differ from those specified
6. To check the format of the column ‘age’ in the data
7. To check the format of the column ‘gender’ in the data
8. To check the format of column contents, numeric or string
9. To check the format of a numeric column
10. To return the column number if the pattern is contained in the column names of the data
11. To return descriptive statistics: sum, number of observations, mean, mode, median, range, standard deviation and standard error
12. To present the mean and SD of a data set in the form Mean (SD)
13. To return a subgroup when a certain variable equals the given value while omitting those with NA
14. To estimate the standard error of the mean and the mode
15. To find the number and percentages of categories
16. To represent categorical data in the form number (percentage)
17. To calculate age from date of birth and year of birth

Data

For demonstration purposes, two simulated data sets are used: ‘rctdata’ with valid data and ‘rctdata_error’ with invalid data. Each represents the intervention and control arms of a randomised controlled trial.

set.seed(17)
rctdata <- data.frame(age = abs(rnorm(10, 60, 20)),
                      sex = c("M", "F", "M", "F", "M", "F", "M", "F", "F", "F"),
                      yob = sample(seq(1930, 2000), 10, replace = TRUE),
                      dob = c("07/12/1969", "16/02/1962", "03/09/1978", "17/02/1969",
                              "25/11/1960", "17/04/1970", "18/03/1997", "30/01/1988",
                              "03/02/1990", "25/09/1978"),
                      arm = c("Control", "Intervention", "Control", "Intervention", "Control",
                              "Intervention", "Control", "Intervention", "Control", "Intervention"),
                      stringsAsFactors = FALSE)

rctdata_error <- data.frame(age = runif(10, -60, 120),
                            sex = c("M", "F", "M", "F", "M", "F", "M", "F", "F", "F"),
                            yob = sample(seq(1930, 2000), 10, replace = TRUE),
                            dob = c("1997 May 28", "1987-June-18", NA, "2015/July/09", "1997 May 28",
                                    "1987-June-18", NA, "2015/July/09", "1997 May 28", "1987-June-18"),
                            arm = c("Control", "Intervention", "Control", "Intervention", "Control",
                                    "Intervention", "Control", "Intervention", "Control", "Intervention"),
                            stringsAsFactors = FALSE)

Examples - IPDFileCheck

1. To check the file exists and readable

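The function testFileExistenceReadability tests if the user provided file or directory exists and is readable. It returns 0 for success and -1 for failure. In the example, the current working directory is tested first and returns 0. A path to a subdirectory named “test”, which does not exist, is then built and stored in ‘nodir’; testing it returns -1 with a warning.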

  thisdir=getwd()
  testFileExistenceReadability(thisdir)
#> [1] 0
  nodir=paste(thisdir,"test",sep="/")
  testFileExistenceReadability(nodir)
#> Warning in testFileExistenceReadability(nodir): Invalid directory or file
#> [1] -1
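For readers who want to see what such a check involves, here is a minimal base R sketch of the same idea. The helper name check_path_readable is hypothetical and is not part of IPDFileCheck; it only mirrors the 0 and -1 return convention described above.

```r
# Hypothetical helper, not the package's implementation.
check_path_readable <- function(path) {
  # file.access() returns 0 when the requested access (mode 4 = read) is allowed
  if (file.exists(path) && file.access(path, mode = 4) == 0) {
    return(0)
  }
  warning("Invalid directory or file")
  -1
}

existing <- tempdir()
missing <- file.path(existing, "no_such_subdirectory")

check_path_readable(existing)
suppressWarnings(check_path_readable(missing))
```

The temporary directory is used here only because it is guaranteed to exist in any R session.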

2. To check if the column exists

The function checkColumnExists tests if a column with the user specified name exists in the data. For example, in the simulated data set ‘rctdata’ above, a column named ‘sex’ exists but a column named ‘gender’ does not. The function therefore returns 0 for ‘sex’ and -1 for ‘gender’.

checkColumnExists("sex",rctdata)
#> [1] 0
checkColumnExists("gender",rctdata)
#> Warning in checkColumnExists("gender", rctdata): Data does not contain the
#> column with the specfied column name
#> [1] -1

3. To get the column number if the column name is known

The function getColumnNoForNames returns the column number of the column with the user specified name. For example, in the simulated data set ‘rctdata’ above, the column ‘sex’ exists and is the 2nd column, but ‘gender’ does not exist. The function therefore returns 2 for ‘sex’ and -1 for ‘gender’.

getColumnNoForNames(rctdata,"sex")
#> [1] 2
getColumnNoForNames(rctdata,"gender")
#> Warning in getColumnNoForNames(rctdata, "gender"): Column name does not
#> exist
#> [1] -1
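The two lookups above (does a column exist, and at which position) can be sketched in base R with match(). The helper column_number and the small data frame are illustrative assumptions, not code from the package.

```r
demo <- data.frame(age = c(34, 51), sex = c("M", "F"), stringsAsFactors = FALSE)

# Hypothetical helper: match() gives the position of the first exact match,
# or NA when the name is absent.
column_number <- function(data, column_name) {
  position <- match(column_name, colnames(data))
  if (is.na(position)) -1 else position
}

column_number(demo, "sex")     # 2
column_number(demo, "gender")  # -1
```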

4. To test column contents, i.e. do they contain specific items in a given list?

The function testColumnContents tests if the column contents come from a list provided by the user. The user can also give an optional code that corresponds to non response in the data. In the simulated data shown above, the column ‘sex’ contains ‘M’ and ‘F’ as entries, and this can be tested as shown below. If all entries are in the given list the function returns 0; otherwise it returns a negative error code (-2 in the example below).

testColumnContents(rctdata,"sex",c("M","F"),NA)
#> [1] 0
testColumnContents(rctdata,"sex",c("M","F"))
#> [1] 0
testColumnContents(rctdata,"sex",c("Male","Female"))
#> Warning in testColumnContents(rctdata, "sex", c("Male", "Female")): Invalid
#> entry in column
#> [1] -2
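The underlying test is a set membership check that ignores the non response code. A minimal sketch in base R follows; contents_allowed is a hypothetical name, and it returns TRUE or FALSE rather than the package's numeric codes.

```r
# Hypothetical helper, not the package's implementation.
contents_allowed <- function(values, allowed, nrcode = NA) {
  # Drop non-response entries before comparing with the allowed codes
  answered <- values[!(values %in% nrcode)]
  all(answered %in% allowed)
}

sex <- c("M", "F", NA, "F")
contents_allowed(sex, c("M", "F"))          # TRUE
contents_allowed(sex, c("Male", "Female"))  # FALSE
```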

5. To test whether the column names of the data differ from those specified

The function testDataColumnNames tests if the column names in the data are those provided by the user. The simulated data ‘rctdata’ has the columns ‘age’, ‘sex’, ‘yob’, ‘dob’ and ‘arm’. As the first example below shows, the names do not have to be given in the same order as in the data. If the names match the function returns 0; otherwise it returns -1 to indicate an error, as in the second example (where ‘sex’ is missing and ‘age’ is repeated) and the third (where ‘gender’ is given instead of ‘sex’).

testDataColumnNames(c("age","sex","dob","yob","arm"),rctdata)
#> [1] 0
testDataColumnNames(c("arm","age","yob","dob","age"),rctdata)
#> Warning in testDataColumnNames(c("arm", "age", "yob", "dob", "age"),
#> rctdata): One or other column may have different names
#> [1] -1
testDataColumnNames(c("arm","gender","yob","dob","age"),rctdata)
#> Warning in testDataColumnNames(c("arm", "gender", "yob", "dob", "age"), :
#> One or other column may have different names
#> [1] -1
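An order-independent comparison of names, consistent with the three results above, can be sketched with setequal(). This is an assumption about the behaviour inferred from the output, not the package's code; same_names and the one-row data frame are illustrative.

```r
demo <- data.frame(age = 1, sex = "M", yob = 1980, dob = "01/01/1980",
                   arm = "Control", stringsAsFactors = FALSE)

# Hypothetical helper: same number of names and same set of names, in any order
same_names <- function(expected, data) {
  length(expected) == ncol(data) && setequal(expected, colnames(data))
}

same_names(c("age", "sex", "dob", "yob", "arm"), demo)     # TRUE
same_names(c("arm", "age", "yob", "dob", "age"), demo)     # FALSE
same_names(c("arm", "gender", "yob", "dob", "age"), demo)  # FALSE
```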

6. To check the format of column ‘age’ in data

The function testAge tests if the contents of the column ‘age’ are valid. The user provides the name of the column and an optional code for non response. Age should be numeric and within the limits of 0 and 150. In the simulated data ‘rctdata’ the contents of the ‘age’ column are valid, so the function returns 0. In ‘rctdata_error’ the age can be negative, so the function returns a negative error code (-2 in the example below).

testAge(rctdata,"age",NA)
#> [1] 0
testAge(rctdata_error,"age",NA)
#> Warning in testAge(rctdata_error, "age", NA): Invalid entry in age column
#> [1] -2
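The rule stated above (numeric, between 0 and 150, non response excluded) can be written as a short base R sketch. The helper valid_age is hypothetical and returns a logical value instead of the package's 0 or negative code.

```r
# Hypothetical helper, not the package's implementation.
valid_age <- function(age, nrcode = NA, lower = 0, upper = 150) {
  if (!is.numeric(age)) {
    return(FALSE)
  }
  # Exclude non-response entries, then check every remaining value is in range
  answered <- age[!(age %in% nrcode)]
  all(answered >= lower & answered <= upper)
}

valid_age(c(39.7, 58.4, NA, 94.3))  # TRUE
valid_age(c(39.7, -12.5, 120))      # FALSE
```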

7. To check the format of column ‘gender’ in data

The function testGender tests if the contents of the gender column are valid. The user provides the name of the gender column, how it is coded, and an optional code for non response. In the simulated data ‘rctdata’ the gender column is named ‘sex’ and coded as ‘M’ and ‘F’, so the function returns 0. If the user states that gender is coded as “Male” and “Female”, the function returns a negative error code (-2 in the example below).

testGender(rctdata,c("M","F"),"sex",NA)
#> [1] 0
testGender(rctdata,c("Male","Female"),"sex",NA)
#> Warning in testGender(rctdata, c("Male", "Female"), "sex", NA): Invalid
#> entry in gender column
#> [1] -2

8. To check the format of column contents - numeric or string

The function testDataNumeric tests if the column contents are numeric and within a given range. The user provides the minimum and maximum values allowed in the column, along with an optional code for non response. If the entries are numeric and within the range the function returns 0; otherwise it returns a negative error code. In ‘rctdata’ above the ages lie between 0 and 100, so the function returns 0, while the year of birth column “yob” has values greater than 100, so the function returns -2.

testDataNumeric("age",rctdata,NA,0,100)
#> [1] 0
testDataNumeric("yob",rctdata,NA,0,100)
#> Warning in testDataNumeric("yob", rctdata, NA, 0, 100): Invalid ranges in
#> column
#> [1] -2

The function testDataNumericNorange tests if the column contents are numeric, with no range provided. The user can provide an optional code for non response. If the entries are numeric the function returns 0; otherwise it returns a negative error code. The columns “age” and “yob” are numeric and return 0. As the column “arm” has no numeric data in ‘rctdata’, the function returns -2 to indicate an error.

testDataNumericNorange("age",rctdata,NA)
#> [1] 0
testDataNumericNorange("yob",rctdata,NA)
#> [1] 0
testDataNumericNorange("arm",rctdata,NA)
#> Warning in testDataNumericNorange("arm", rctdata, NA): Some values-other
#> than NR code is not numeric
#> [1] -2
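One common way to test whether every entry of a column can be read as a number is to attempt the conversion and look for failures. The sketch below does this in base R; all_numeric is a hypothetical helper and is not taken from the package.

```r
# Hypothetical helper, not the package's implementation.
all_numeric <- function(values, nrcode = NA) {
  answered <- values[!(values %in% nrcode)]
  # as.numeric() yields NA (with a warning) for text that is not a number
  converted <- suppressWarnings(as.numeric(as.character(answered)))
  !anyNA(converted)
}

all_numeric(c("12", "7.5", NA))            # TRUE
all_numeric(c("Control", "Intervention"))  # FALSE
```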

The function testDataString tests if the column contents are strings. The user can provide an optional code for non response. If the entries are strings the function returns 0; otherwise it returns a negative error code. As the column “arm” has no numeric data in ‘rctdata’, the function returns 0, while ‘yob’, which holds numeric data, returns -2 to indicate an error.

testDataString(rctdata,"arm",NA)
#> [1] 0
testDataString(rctdata,"yob",NA)
#> Warning in testDataString(rctdata, "yob", NA): Numeric entry in column
#> [1] -2

The function testDataStringRestriction tests if the column contents are strings drawn from a given set of allowed values. The user can provide an optional code for non response. If the entries are strings and all belong to the allowed values the function returns 0; otherwise it returns a negative error code. The column “arm” in ‘rctdata’ has no numeric data and contains only the entries specified, so the function returns 0. The column ‘sex’ contains “M” and “F” rather than “Male” and “Female”, so the last call returns -2.

testDataStringRestriction(rctdata,"arm",NA,c("Intervention","Control"))
#> [1] 0
testDataStringRestriction(rctdata,"sex",NA,c("M","F"))
#> [1] 0
testDataStringRestriction(rctdata,"sex",NA,c("Male","Female"))
#> Warning in testDataStringRestriction(rctdata, "sex", NA, c("Male",
#> "Female")): Invalid entry in column
#> [1] -2

9. To return the column number if the pattern is contained in the column names of the data

The function getColumnNoForPatternInColumnname returns the column numbers of the columns whose names contain a user specified pattern. For example, in the simulated data set ‘rctdata’ above, the columns ‘yob’ and ‘dob’ contain the pattern ‘ob’ and are the 3rd and 4th columns, so the function returns 3 and 4. The pattern ‘gender’ does not occur in any column name, so the function returns a negative error code (-2 in the example below).

getColumnNoForPatternInColumnname("ob",colnames(rctdata))
#> [1] 3 4
getColumnNoForPatternInColumnname("gender",rctdata)
#> Warning in getColumnNoForPatternInColumnname("gender", rctdata): The
#> pattern does not form any part of columnnames
#> [1] -2
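In base R, the same kind of lookup is available through grep(), shown here as a point of comparison rather than as the package's implementation. The vector of names is copied from ‘rctdata’; note that grep() returns an empty integer vector, not an error code, when nothing matches.

```r
column_names <- c("age", "sex", "yob", "dob", "arm")

# fixed = TRUE treats the pattern as a literal string, not a regular expression
ob_columns <- grep("ob", column_names, fixed = TRUE)
gender_columns <- grep("gender", column_names, fixed = TRUE)

ob_columns              # 3 4
length(gender_columns)  # 0
```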

10. To return descriptive statistics: sum, number of observations, mean, mode, median, range, standard deviation and standard error

The function descriptiveStatisticsDataColumn returns the descriptive statistics of the column with the user specified name. These include the sum, mean, standard deviation, median, mode, standard error of the mean, minimum and maximum values, number of observations, lower and upper quartiles, and the 95% confidence interval. If the column contents are not numeric, or if any of the quantities cannot be calculated, the function returns -1 to indicate an error. For example, the column ‘age’ is numeric and the descriptive statistics are returned, but the column ‘sex’ is not numeric, so the function returns -1.

descriptiveStatisticsDataColumn(rctdata, "age")
#>          Sum     Mean       SD Median     Mode       SE  Minimum  Maximum
#> age 635.4561 63.54561 16.54699 61.756 39.69983 5.232617 39.69983 94.33068
#>     Count       LQ       UQ 95%CI.low 95%CI.high
#> age    10 55.67713 73.41427  40.58966   90.98421
descriptiveStatisticsDataColumn(rctdata, "sex")
#> Warning in testDataNumericNorange(column.name, data, nrcode): Some values-
#> other than NR code is not numeric
#> Warning in descriptiveStatisticsDataColumn(rctdata, "sex"): Non numeric
#> columns, cant estimate the descriptive statistics
#> [1] -1
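Most of these quantities can be reproduced with base R, which may help when checking the output by hand. The sketch below uses the ten ages printed elsewhere in this guide, rounded to five decimals. It is not the package's code. One observation from the output above: the values in the 95%CI.low and 95%CI.high columns coincide with the 2.5th and 97.5th percentiles of the data under R's default quantile() method, so that is what the sketch computes; whether the package calculates them this way is an assumption. The mode is left out because base R has no function for the statistical mode.

```r
age <- c(39.69983, 58.40727, 55.34026, 43.65464, 75.44182,
         56.68776, 79.45749, 94.33068, 65.10474, 67.33162)
n <- length(age)

stats <- c(
  Sum = sum(age), Mean = mean(age), SD = sd(age), Median = median(age),
  SE = sd(age) / sqrt(n),  # standard error of the mean
  Minimum = min(age), Maximum = max(age), Count = n,
  LQ = unname(quantile(age, 0.25)), UQ = unname(quantile(age, 0.75)),
  P2.5 = unname(quantile(age, 0.025)), P97.5 = unname(quantile(age, 0.975))
)
round(stats, 4)
```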

11. To present the mean and sd of a data set in the form Mean (SD)

The function presentMeanSdRemoveNAText returns the mean and SD in the form mean (SD). If the column contents are not numeric, or if the calculation fails, the function returns -1 to indicate an error. For example, the column ‘age’ is numeric and the mean and SD are returned, but the column ‘sex’ is not numeric, so the function returns -1.

presentMeanSdRemoveNAText(rctdata, "age")
#> [1] "63.55 (16.55)"
presentMeanSdRemoveNAText(rctdata, "sex")
#> Warning in testDataNumericNorange(column.name, data, nrcode): Some values-
#> other than NR code is not numeric
#> Warning in descriptiveStatisticsDataColumn(data, column.name, nrcode): Non
#> numeric columns, cant estimate the descriptive statistics
#> Warning in presentMeanSdRemoveNAText(rctdata, "sex"): Error or no data to
#> analyse
#> [1] -1
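The "mean (SD)" text can be assembled in base R with formatC(), as in the sketch below. The helper mean_sd_text is hypothetical, and the ages are those printed in this guide, rounded to five decimals.

```r
age <- c(39.69983, 58.40727, 55.34026, 43.65464, 75.44182,
         56.68776, 79.45749, 94.33068, 65.10474, 67.33162)

# Hypothetical helper, not the package's implementation.
mean_sd_text <- function(values, digits = 2) {
  values <- values[!is.na(values)]  # omit missing values
  paste0(formatC(mean(values), format = "f", digits = digits),
         " (", formatC(sd(values), format = "f", digits = digits), ")")
}

mean_sd_text(age)  # "63.55 (16.55)"
```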

12. To return a subgroup when certain variable equals the given value while omitting those with NA

The function returnSubgroupOmitNA returns the subgroup of the data in which a given variable equals a given value, while omitting any non response values. For example, the first command below returns all the females in the data. The second command asks for the arm “control” in lower case; the arm is coded as “Control” in ‘rctdata’, so no rows match and an empty data frame is returned. The value must be given exactly as it is coded in the data.

returnSubgroupOmitNA(rctdata, "sex","F")
#>         age sex  yob        dob          arm
#> 2  58.40727   F 1973 16/02/1962 Intervention
#> 4  43.65464   F 1987 17/02/1969 Intervention
#> 6  56.68776   F 1936 17/04/1970 Intervention
#> 8  94.33068   F 1970 30/01/1988 Intervention
#> 9  65.10474   F 1997 03/02/1990      Control
#> 10 67.33162   F 1945 25/09/1978 Intervention
returnSubgroupOmitNA(rctdata, "arm","control")
#> [1] age sex yob dob arm
#> <0 rows> (or 0-length row.names)
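The empty result above comes from an exact, case sensitive comparison. A base R sketch of subsetting with NA removal makes this visible; the subgroup helper and the four-row data frame are illustrative assumptions, not the package's code.

```r
demo <- data.frame(
  sex = c("M", "F", NA, "F"),
  arm = c("Control", "Intervention", "Control", "Control"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the column is not NA and equals the value
subgroup <- function(data, column, value) {
  keep <- !is.na(data[[column]]) & data[[column]] == value
  data[keep, , drop = FALSE]
}

nrow(subgroup(demo, "sex", "F"))        # 2
nrow(subgroup(demo, "arm", "control"))  # 0, the comparison is case sensitive
nrow(subgroup(demo, "arm", "Control"))  # 3
```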

13. To find the number and percentages of categorical data

The function representCategoricalDataText returns the descriptive statistics of a categorical column as number and percentage. The user provides the data, the column name and, as in the examples below, an optional non response code. For example, it returns the number and percentage of “M” and “F” in the column “sex”, or the number and percentage of “Intervention” and “Control” in the column “arm”.

representCategoricalDataText(rctdata, "sex",NA)
#> [1] "4 (40)" "6 (60)"
representCategoricalDataText(rctdata, "arm",NA)
#> [1] "5 (50)" "5 (50)"
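Counts and percentages of this kind can be produced in base R with table() and prop.table(). The sketch below is not the package's code; note that table() orders the categories alphabetically, so the order of the results may differ from the package output above.

```r
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "F", "F")

counts <- table(sex, useNA = "no")          # NA entries are not counted
percent <- round(100 * prop.table(counts))  # share of each category in percent

summary_text <- setNames(paste0(counts, " (", percent, ")"), names(counts))
summary_text
```

Keeping the category names on the result makes it clear which figure belongs to which category.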

14. To calculate age from date of birth and year of birth

The function calculateAgeFromDob returns the age calculated from a given date of birth. The user provides the name of the column containing the date of birth, the format of the date, and an optional non response code. For example, in the ‘rctdata’ shown above, the ‘dob’ column has dates in the format “%d/%m/%y”. Dates must be given as numeric values separated by - or /. In ‘rctdata_error’ the dates mix numbers and text, so the function returns a negative error code (-2 in the example below).

calculateAgeFromDob(rctdata,"dob","%d/%m/%y",NA)
#>         age sex  yob        dob          arm calc.age.dob
#> 1  39.69983   M 1992 07/12/1969      Control     49.85479
#> 2  58.40727   F 1973 16/02/1962 Intervention     57.66027
#> 3  55.34026   M 1982 03/09/1978      Control     41.11507
#> 4  43.65464   F 1987 17/02/1969 Intervention     50.65753
#> 5  75.44182   M 1994 25/11/1960      Control     58.89013
#> 6  56.68776   F 1936 17/04/1970 Intervention     49.49589
#> 7  79.45749   M 1970 18/03/1997      Control     22.57808
#> 8  94.33068   F 1970 30/01/1988 Intervention     31.70685
#> 9  65.10474   F 1997 03/02/1990      Control     29.69589
#> 10 67.33162   F 1945 25/09/1978 Intervention     41.05479
calculateAgeFromDob(rctdata_error,  "dob",0,NA)
#> Warning in convertStdDateFormat(entry, index, monthfirst = NULL): Date not
#> in numeric formats
#> Warning in calculateAgeFromDob(rctdata_error, "dob", 0, NA): Date format is
#> not right- use numeric values for dates separated by - or /
#> [1] -2
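As a rough cross-check of the idea, dates can be parsed with as.Date() in base R and the age approximated from the number of elapsed days. This sketch is not the package's method: the reference date (the date of this guide) and the 365.25 day year are assumptions, so the results will differ slightly from the calc.age.dob column above. A four digit year is parsed with %Y in base R.

```r
dob <- c("07/12/1969", "16/02/1962", "1997 May 28")
reference_date <- as.Date("2019-10-15")  # assumed reference date

# Entries that do not follow day/month/year in numeric form become NA
parsed <- as.Date(dob, format = "%d/%m/%Y")

# Approximate age in years; 365.25 is an assumed average year length
age_years <- as.numeric(reference_date - parsed) / 365.25
round(age_years, 1)
```

The NA produced for the third entry is the base R counterpart of the "Date not in numeric formats" warning shown above.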

The function calculateAgeFromBirthYear returns the age calculated from a given year of birth. The user provides the name of the column containing the year of birth and an optional non response code. For example, in the ‘rctdata’ shown above, the ‘yob’ column holds the birth year.

calculateAgeFromBirthYear(rctdata,"yob",NA)
#>         age sex  yob        dob          arm calc.age.yob
#> 1  39.69983   M 1992 07/12/1969      Control           27
#> 2  58.40727   F 1973 16/02/1962 Intervention           46
#> 3  55.34026   M 1982 03/09/1978      Control           37
#> 4  43.65464   F 1987 17/02/1969 Intervention           32
#> 5  75.44182   M 1994 25/11/1960      Control           25
#> 6  56.68776   F 1936 17/04/1970 Intervention           83
#> 7  79.45749   M 1970 18/03/1997      Control           49
#> 8  94.33068   F 1970 30/01/1988 Intervention           49
#> 9  65.10474   F 1997 03/02/1990      Control           22
#> 10 67.33162   F 1945 25/09/1978 Intervention           74
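The values in calc.age.yob above are consistent with subtracting the birth year from the year in which this guide was built (2019). A sketch under that assumption, using three birth years from ‘rctdata’:

```r
yob <- c(1992, 1973, 1936)

# Assumed reference year; as.integer(format(Sys.Date(), "%Y")) gives the current year
reference_year <- 2019

age_from_yob <- reference_year - yob
age_from_yob  # 27 46 83
```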