Getting Started with pdfsearch

Brandon LeBeau

2019-01-09

This package defines a few useful functions for keyword searching using the pdftools package developed by rOpenSci.

Basic Usage

The package currently has two main functions. The first, keyword_search, takes a single pdf and searches it for keywords. The second, keyword_directory, runs the same search over a directory of pdfs.

keyword_search Example

The package comes with two pdf files from arXiv to use as test cases. Below is an example of using the keyword_search function.

library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('measurement', 'error'),
            path = TRUE)
head(result)
#> # A tibble: 6 x 5
#>   keyword     page_num line_num line_text token_text
#>   <chr>          <int>    <int> <list>    <list>    
#> 1 measurement        1        2 <chr [1]> <list [1]>
#> 2 measurement        1        4 <chr [1]> <list [1]>
#> 3 measurement        1       10 <chr [1]> <list [1]>
#> 4 measurement        1       12 <chr [1]> <list [1]>
#> 5 measurement        1       15 <chr [1]> <list [1]>
#> 6 measurement        1       17 <chr [1]> <list [1]>
head(result$line_text, n = 2)
#> [[1]]
#> [1] "Reiter, Maria DeYoreo* arXiv:1610.00147v1 [stat.ME] 1 Oct 2016 Abstract Often in surveys, key items are subject to measurement errors. "
#> 
#> [[2]]
#> [1] "In some settings, however, analysts have access to a data source on different individuals with high quality measurements of the error-prone survey items. "

The location of the keyword match, including page number and line number, and the actual line of text are returned by default.

Surrounding lines of text

It may be useful to extract not just the line of text in which the keyword is found, but also the surrounding text, to give additional context when looking at the keyword results. This can be requested with the surround_lines argument.
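
A minimal sketch of such a call, reusing the bundled example file from the previous section. Setting surround_lines = 1 requests one additional line of context around each match; the output is not reproduced here.

```r
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

# Same search as before, now asking for one surrounding line of context
result <- keyword_search(file,
            keyword = c('measurement', 'error'),
            path = TRUE, surround_lines = 1)
head(result)

# Each element of line_text now holds the matching line plus its neighbours
head(result$line_text, n = 2)
```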

Combine hyphenated words

Typeset PDF files commonly contain words that wrap from one line to the next and are hyphenated. An example of this is shown in the following image.

hyphenated example

A word that is hyphenated across a line break is treated as two words, so the keyword search can miss a match that it would have found had the word not been split. Fortunately, the keyword_search function has a remove_hyphen argument that removes the hyphen at the end of a line and combines the word fragment with the rest of the word on the next line in the document. By default this argument is set to TRUE.
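
The effect can be checked by running the same search with and without hyphen removal and comparing the number of matches. The following is a sketch using the bundled example file; the object names result_hyphen and result_no_hyphen are chosen here for illustration, and the resulting counts are not reproduced.

```r
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

# Search with end-of-line hyphens left in place
result_hyphen <- keyword_search(file,
            keyword = c('measurement'),
            path = TRUE, remove_hyphen = FALSE)

# Same search with hyphenated line breaks rejoined (the default)
result_no_hyphen <- keyword_search(file,
            keyword = c('measurement'),
            path = TRUE, remove_hyphen = TRUE)

# Compare the number of matches under each setting
nrow(result_hyphen)
nrow(result_no_hyphen)
```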

For the example file, removing the hyphens adds a few additional keyword matches to the results. These are cases where the word “measurement” wrapped across two lines and was hyphenated (see the image above for an example of this).

One caveat applies to multiple column PDF files: hyphen removal in these files is still experimental and often does not work well. Use the remove_hyphen argument with caution on multiple column PDF files.

Split document into words

Using the tokenizers R package, it is also possible to split the document into individual words. This may be most useful when the interest is in performing a text analysis rather than a keyword search. Below is an example showing the first page of the text converted to words. By default, hyphenation at the ends of lines is removed (see the previous section for a description of this).

token_result <- convert_tokens(file, path = TRUE)[[1]]
head(token_result)
#> [[1]]
#>   [1] "data"           "fusion"         "for"            "correcting"    
#>   [5] "measurement"    "errors"         "tracy"          "schifeling"    
#>   [9] "jerome"         "p"              "reiter"         "maria"         
#>  [13] "deyoreo"        "arxiv"          "1610.00147v1"   "stat.me"       
#>  [17] "1"              "oct"            "2016"           "abstract"      
#>  [21] "often"          "in"             "surveys"        "key"           
#>  [25] "items"          "are"            "subject"        "to"            
#>  [29] "measurement"    "errors"         "given"          "just"          
#>  [33] "the"            "data"           "it"             "can"           
#>  [37] "be"             "difficult"      "to"             "determine"     
#>  [41] "the"            "distribution"   "of"             "this"          
#>  [45] "error"          "process"        "and"            "hence"         
#>  [49] "to"             "obtain"         "accurate"       "inferences"    
#>  [53] "that"           "involve"        "the"            "error"         
#>  [57] "prone"          "variables"      "in"             "some"          
#>  [61] "settings"       "however"        "analysts"       "have"          
#>  [65] "access"         "to"             "a"              "data"          
#>  [69] "source"         "on"             "different"      "in"            
#>  [73] "dividuals"      "with"           "high"           "quality"       
#>  [77] "measurements"   "of"             "the"            "error"         
#>  [81] "prone"          "survey"         "items"          "we"            
#>  [85] "present"        "a"              "data"           "fusion"        
#>  [89] "framework"      "for"            "leveraging"     "this"          
#>  [93] "information"    "to"             "improve"        "infer"         
#>  [97] "ences"          "in"             "the"            "error"         
#> [101] "prone"          "survey"         "the"            "basic"         
#> [105] "idea"           "is"             "to"             "posit"         
#> [109] "models"         "about"          "the"            "rates"         
#> [113] "at"             "which"          "individuals"    "make"          
#> [117] "errors"         "coupled"        "with"           "models"        
#> [121] "for"            "the"            "values"         "reported"      
#> [125] "when"           "errors"         "are"            "made"          
#> [129] "this"           "can"            "avoid"          "the"           
#> [133] "unrealistic"    "assumption"     "of"             "conditional"   
#> [137] "independence"   "typically"      "used"           "in"            
#> [141] "data"           "fusion"         "we"             "apply"         
#> [145] "the"            "approach"       "on"             "the"           
#> [149] "re"             "ported"         "values"         "of"            
#> [153] "educational"    "attainments"    "in"             "the"           
#> [157] "american"       "community"      "survey"         "using"         
#> [161] "the"            "national"       "survey"         "of"            
#> [165] "college"        "graduates"      "as"             "the"           
#> [169] "high"           "quality"        "data"           "source"        
#> [173] "in"             "doing"          "so"             "we"            
#> [177] "account"        "for"            "the"            "informative"   
#> [181] "sampling"       "design"         "used"           "to"            
#> [185] "select"         "the"            "national"       "survey"        
#> [189] "of"             "college"        "graduates"      "we"            
#> [193] "also"           "present"        "a"              "process"       
#> [197] "for"            "assessing"      "the"            "sensitivity"   
#> [201] "of"             "various"        "analyses"       "to"            
#> [205] "different"      "choices"        "for"            "the"           
#> [209] "measurement"    "error"          "models"         "supplemental"  
#> [213] "material"       "is"             "available"      "online"        
#> [217] "key"            "words"          "fusion"         "imputation"    
#> [221] "measurement"    "error"          "missing"        "survey"        
#> [225] "this"           "research"       "was"            "supported"     
#> [229] "by"             "the"            "national"       "science"       
#> [233] "foundation"     "under"          "award"          "ses"           
#> [237] "11"             "31897"          "the"            "authors"       
#> [241] "wish"           "to"             "thank"          "seth"          
#> [245] "sanders"        "for"            "his"            "input"         
#> [249] "on"             "informative"    "prior"          "specifications"
#> [253] "and"            "mauricio"       "sadinle"        "for"           
#> [257] "discussion"     "that"           "improved"       "the"           
#> [261] "strategy"       "for"            "accounting"     "for"           
#> [265] "the"            "informative"    "sample"         "design"        
#> [269] "1"

Another use of the convert_tokens function is to convert the result text to tokens. This could be interesting when used in tandem with the surround_lines argument as input to a text analysis. These tokens are included by default when calling the keyword_search function.
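
To give a sense of what the tokens for a single result line look like, the sketch below calls tokenize_words from the tokenizers package directly on one of the result lines shown earlier. Whether convert_tokens uses exactly these default options internally is an assumption made for illustration.

```r
library(tokenizers)

# A result line copied from the keyword_search output shown earlier
line_text <- "In some settings, however, analysts have access to a data source on different individuals with high quality measurements of the error-prone survey items. "

# Split into lowercase word tokens with punctuation stripped
# (tokenize_words returns a list with one element per input string)
tokens <- tokenize_words(line_text)
head(tokens[[1]])
```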

keyword_directory Example

The keyword_directory function is useful when you have a directory of many pdf files that you want to search for a series of keywords in a single function call. This can be particularly useful in the context of a research synthesis or to screen studies for characteristics to include in a meta-analysis.

The two arXiv files that come with the package are stored in a single directory, which is used as the example below.

directory <- system.file('pdf', package = 'pdfsearch')

result <- keyword_directory(directory,
                           keyword = c('repeated measures', 'mixed effects',
                                       'error'),
                           surround_lines = 1, full_names = TRUE)
head(result)
#>   ID       pdf_name           keyword page_num line_num
#> 1  1 1501.00450.pdf repeated measures        1        9
#> 2  1 1501.00450.pdf repeated measures        1       30
#> 3  1 1501.00450.pdf repeated measures        2       57
#> 4  1 1501.00450.pdf repeated measures        2       59
#> 5  1 1501.00450.pdf repeated measures        2       69
#> 6  1 1501.00450.pdf repeated measures        3      165
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                line_text
#> 1                                                                                                                                                                                                                                                                                                                                                                                                                                                        We             Running under powered experiments have many perils. , Not introduce more sophisticated experimental designs, specifi-           only would we miss potentially beneficial effects, we may also cally the repeated measures design, including the crossover           get false confidence about lack of negative effects. , Statistical design and related variants, to increase KPI sensitivity with         power increases with larger effect size, and smaller variances. the same traffic size and duration of experiment. 
#> 2                                                                                                                                                                                                                                                          This poses  a limitation to any online experimentation platform, where       within-subject variation. , We also discuss practical considfast iterations and testing many ideas can reap the most         erations to repeated measures design, with variants to the rewards.                                                         crossover design to study the carry over effect, including the “re-randomized” design (row 5 in table 1). , 1.1     Motivation To improve sensitivity of measurement, apart from accurate       1.2     Main Contributions implementation and increase sample size and duration, we         In this paper, we propose a framework called FORME (Flexcan employ statistical methods to reduce variance. 
#> 3                                                                                                                                                                                                                                                                                                                                                                                                                                                           In the Table 1: Repeated Measures Designs                     following section we assume the minimum experimentation “period” to be one full week, and may extend to up to two In this paper we extend the idea further by employing the        weeks. , To facilitate our illustration, in all the derivation repeated measures design in different stages of treatment        in this section we assume all users appear in all periods, assignment. , The traditional A/B test can be analyzed us-         i.e. no missing measurement. 
#> 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             The traditional A/B test can be analyzed us-         i.e. no missing measurement. , We also restrict ourselves ing the repeated measures analysis, reporting a “per week”       to metrics that are defined as simple average and assume treatment effect, as show in row 3 “parallel” design in ta-      treatment and control have the same sample size. , We furble 1. 
#> 5                                                                                                                                                                                                                                                                                       This way            average treatment effect (ATE) d = µT - µC which is a each user serves as his/her own control in the measurement.      fixed effects in the model in this section. , This way, various In fact, the crossover design is a type of repeated measures     designs considered can be examined in the same framework design commonly used in biomedical research to control for       and easily compared.  , We will proceed to show, with theoretical derivations, that            2.1     Two Sample T-test given the same total traffic                                           Let X denote the observed average metric value in control group and Y denote that in the treatment group. 
#> 6 5.     , FLEXIBLE AND SCALABLE REPEATED One way to see measurements are not missing at random is                   MEASURES ANALYSIS VIA FORME to realize infrequent users are more likely to have missing         5.1 Review of Existing Methods values and the absence in a specific time window can still          It is common to analyze data from repeated measures design provide information on the user behavior and in reality there       with the repeated measures ANOVA model and the F-test, might be other factors causing user to be missing that are          under certain assumptions, such as normality, sphericity (honot even observed. , Instead of throwing away data points             mogeneity of variances in differences between each pair of where user appeared in only one period and is exposed to            within-subject values), equal time points between subjects, only one of the two treatments, in practice, we included an         and no missing data. 
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          token_text
#> 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       we, running, under, powered, experiments, have, many, perils, not, introduce, more, sophisticated, experimental, designs, specifi, only, would, we, miss, potentially, beneficial, effects, we, may, also, cally, the, repeated, measures, design, including, the, crossover, get, false, confidence, about, lack, of, negative, effects, statistical, design, and, related, variants, to, increase, kpi, sensitivity, with, power, increases, with, larger, effect, size, and, smaller, variances, the, same, traffic, size, and, duration, of, experiment
#> 2                                                                                                                                                                                                                                                                                                                           this, poses, a, limitation, to, any, online, experimentation, platform, where, within, subject, variation, we, also, discuss, practical, considfast, iterations, and, testing, many, ideas, can, reap, the, most, erations, to, repeated, measures, design, with, variants, to, the, rewards, crossover, design, to, study, the, carry, over, effect, including, the, re, randomized, design, row, 5, in, table, 1, 1.1, motivation, to, improve, sensitivity, of, measurement, apart, from, accurate, 1.2, main, contributions, implementation, and, increase, sample, size, and, duration, we, in, this, paper, we, propose, a, framework, called, forme, flexcan, employ, statistical, methods, to, reduce, variance
#> 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 in, the, table, 1, repeated, measures, designs, following, section, we, assume, the, minimum, experimentation, period, to, be, one, full, week, and, may, extend, to, up, to, two, in, this, paper, we, extend, the, idea, further, by, employing, the, weeks, to, facilitate, our, illustration, in, all, the, derivation, repeated, measures, design, in, different, stages, of, treatment, in, this, section, we, assume, all, users, appear, in, all, periods, assignment, the, traditional, a, b, test, can, be, analyzed, us, i.e, no, missing, measurement
#> 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   the, traditional, a, b, test, can, be, analyzed, us, i.e, no, missing, measurement, we, also, restrict, ourselves, ing, the, repeated, measures, analysis, reporting, a, per, week, to, metrics, that, are, defined, as, simple, average, and, assume, treatment, effect, as, show, in, row, 3, parallel, design, in, ta, treatment, and, control, have, the, same, sample, size, we, furble, 1
#> 5                                                                                                                                                                                                                                                                                                                                  this, way, average, treatment, effect, ate, d, µt, µc, which, is, a, each, user, serves, as, his, her, own, control, in, the, measurement, fixed, effects, in, the, model, in, this, section, this, way, various, in, fact, the, crossover, design, is, a, type, of, repeated, measures, designs, considered, can, be, examined, in, the, same, framework, design, commonly, used, in, biomedical, research, to, control, for, and, easily, compared, we, will, proceed, to, show, with, theoretical, derivations, that, 2.1, two, sample, t, test, given, the, same, total, traffic, let, x, denote, the, observed, average, metric, value, in, control, group, and, y, denote, that, in, the, treatment, group
#> 6 5, flexible, and, scalable, repeated, one, way, to, see, measurements, are, not, missing, at, random, is, measures, analysis, via, forme, to, realize, infrequent, users, are, more, likely, to, have, missing, 5.1, review, of, existing, methods, values, and, the, absence, in, a, specific, time, window, can, still, it, is, common, to, analyze, data, from, repeated, measures, design, provide, information, on, the, user, behavior, and, in, reality, there, with, the, repeated, measures, anova, model, and, the, f, test, might, be, other, factors, causing, user, to, be, missing, that, are, under, certain, assumptions, such, as, normality, sphericity, honot, even, observed, instead, of, throwing, away, data, points, mogeneity, of, variances, in, differences, between, each, pair, of, where, user, appeared, in, only, one, period, and, is, exposed, to, within, subject, values, equal, time, points, between, subjects, only, one, of, the, two, treatments, in, practice, we, included, an, and, no, missing, data

The full_names argument is set to TRUE here so that the full file path is used to access the pdf files. If the search is run directly from the repository (e.g. when using an R project in RStudio), then full_names could be set to FALSE.
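
The difference between bare file names and full paths can be illustrated with the full.names argument of base R's list.files. That keyword_directory passes full_names through to list.files is an assumption here; the directory and file names below are hypothetical placeholders, not real PDFs.

```r
# Create a temporary directory holding two empty placeholder files
pdf_dir <- file.path(tempdir(), "pdf_demo")
dir.create(pdf_dir, showWarnings = FALSE)
file.create(file.path(pdf_dir, c("a.pdf", "b.pdf")))

# Bare file names: only resolvable when the working directory is pdf_dir
short_names <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = FALSE)

# Full paths: resolvable from any working directory
long_names <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

short_names
file.exists(long_names)
```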

Limitations

Currently there are a handful of limitations, mostly around how pdfs are read into R using the pdftools R package. When pdfs are created in a multiple column layout, a line in the pdf consists of the entire line across both columns. This can lead to fragmented text that may not give the full contents, even when using the surround_lines argument.

Another limitation arises when searching for keywords made up of multiple words or phrases. If the match is on a single line, the match is returned. However, if the phrase spans multiple lines, the current implementation will not return a result for it.
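
The multi-line limitation can be reproduced with a few lines of base R. This is only a sketch: grepl stands in for the package's matching, the text is made up, and joining the lines is one possible workaround rather than something the package does.

```r
# Two consecutive lines of text, with one phrase split across the break
text_lines <- c("We analyze the data with a repeated",
                "measures design and a mixed effects model.")

# Line-by-line search: the phrase that spans the line break is missed
split_hit <- grepl("repeated measures", text_lines, fixed = TRUE)

# A phrase contained within a single line is found on that line
single_hit <- grepl("mixed effects", text_lines, fixed = TRUE)

# Possible workaround: join the lines before searching, at the cost of
# losing the line number of the match
joined_hit <- grepl("repeated measures",
                    paste(text_lines, collapse = " "), fixed = TRUE)

split_hit
single_hit
joined_hit
```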

Shiny App

The package also has a simple Shiny app that can be launched with the following command:

run_shiny()