| Type: | Package |
| Title: | Full Corpus Support for the 'koRpus' Package |
| Description: | Enhances 'koRpus' text object classes and methods to also support large corpora. Hierarchical ordering of corpus texts into arbitrary categories will be preserved. Provided classes and methods also improve the ability of using the 'koRpus' package together with the 'tm' package. To ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please subscribe to the koRpus-dev mailing list (https://korpusml.reaktanz.de). |
| Author: | m.eik michalke [aut, cre] |
| Maintainer: | m.eik michalke <meik.michalke@hhu.de> |
| Depends: | R (≥ 3.5.0), koRpus (≥ 0.13-1), sylly (≥ 0.1-6) |
| Imports: | methods, parallel, tm, NLP |
| Suggests: | koRpus.lang.en, testthat, knitr, rmarkdown |
| VignetteBuilder: | knitr |
| URL: | https://reaktanz.de/?c=hacking&s=koRpus |
| BugReports: | https://github.com/unDocUMeantIt/tm.plugin.koRpus/issues |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| LazyLoad: | yes |
| Version: | 0.4-2 |
| Date: | 2021-05-17 |
| RoxygenNote: | 7.1.1 |
| Collate: | '01_class_01_kRp.corpus.R' '02_method_01_kRp.corpus-class_readability.R' '02_method_02_kRp.corpus-class_hyphen.R' '02_method_03_kRp.corpus-class_lex.div.R' '02_method_04_kRp.corpus-class_read.corp.custom.R' '02_method_05_kRp.corpus-class_freq.analysis.R' '02_method_06_kRp.corpus-class_summary.R' '02_method_07_kRp.corpus-class_correct.R' '02_method_08_kRp.corpus-class_query.R' '02_method_09_kRp.corpus-class_filterByClass.R' '02_method_10_kRp.corpus-class_jumbleWords.R' '02_method_11_kRp.corpus-class_clozeDelete.R' '02_method_12_kRp.corpus-class_cTest.R' '02_method_13_kRp.corpus-class_textTransform.R' '02_method_14_kRp.corpus-class_docTermMatrix.R' '02_method_15_kRp.corpus-class_split_by_doc_id.R' '02_method_20_kRp.corpus_get_set_is.R' '02_method_21_kRp.corpus-class_show.R' 'corpus_files.R' 'deprecated.R' 'kRpSource.R' 'readCorpus.R' 'tm.plugin.koRpus-internal.R' 'tm.plugin.koRpus-package.R' |
| NeedsCompilation: | no |
| Packaged: | 2021-05-18 11:08:16 UTC; m |
| Repository: | CRAN |
| Date/Publication: | 2021-05-18 12:50:02 UTC |
Full Corpus Support for the 'koRpus' Package
Description
Enhances 'koRpus' text object classes and methods to also support large corpora. Hierarchical ordering of corpus texts into arbitrary categories will be preserved. Provided classes and methods also improve the ability of using the 'koRpus' package together with the 'tm' package. To ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please subscribe to the koRpus-dev mailing list (<https://korpusml.reaktanz.de>).
Details
The DESCRIPTION file:
| Package: | tm.plugin.koRpus |
| Type: | Package |
| Version: | 0.4-2 |
| Date: | 2021-05-17 |
| Depends: | R (>= 3.5.0), koRpus (>= 0.13-1), sylly (>= 0.1-6) |
| Encoding: | UTF-8 |
| License: | GPL (>= 3) |
| LazyLoad: | yes |
| URL: | https://reaktanz.de/?c=hacking&s=koRpus |
Author(s)
m.eik michalke [aut, cre]
Maintainer: m.eik michalke <meik.michalke@hhu.de>
See Also
Useful links:
Report bugs at https://github.com/unDocUMeantIt/tm.plugin.koRpus/issues
Apply cTest() to all texts in kRp.corpus objects
Description
This method calls cTest on all tagged text objects
inside the given obj object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
cTest(obj, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
| obj | An object of class kRp.corpus. |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| ... | Options to pass through to cTest. |
Value
An object of the same class as obj.
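The mc.cores default shown in the usage can be illustrated with a minimal base R sketch. It only demonstrates how getOption("mc.cores", 1L) resolves and how mclapply applies a function to each element of a list; the toy texts and the token counting function are assumptions made for this illustration and are not part of the package. Note that mclapply supports more than one core only on platforms that can fork processes.

```r
library(parallel)

# toy stand-ins for the texts of a corpus (hypothetical data)
texts <- list(a="one two three", b="four five")

# same default as in the usage above: option "mc.cores", else 1 core
n_cores <- getOption("mc.cores", 1L)

# apply a function to each text; with one core this behaves like lapply()
token_counts <- mclapply(
  texts,
  function(txt) length(strsplit(txt, " ", fixed=TRUE)[[1]]),
  mc.cores=n_cores
)
print(unlist(token_counts))
```

Setting options(mc.cores=2) before the call would change the default for all methods that use this argument.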
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
taggedText(myCorpus)[20:30,]
myCorpus <- cTest(myCorpus)
taggedText(myCorpus)[20:30,]
} else {}
Apply clozeDelete() to all texts in kRp.corpus objects
Description
This method calls clozeDelete on all tagged text objects
inside the given obj object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
clozeDelete(obj, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
| obj | An object of class kRp.corpus. |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| ... | Options to pass through to clozeDelete. |
Value
An object of the same class as obj.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
myCorpus <- clozeDelete(myCorpus)
head(taggedText(myCorpus), n=10)
} else {}
Deprecated functions and methods
Description
These functions were used in earlier versions of the package but have since been either replaced or removed.
Usage
corpusTagged(obj, ...)
corpusTTR(obj, ...)
corpusLevel(...)
corpusCategory(...)
corpusID(...)
corpusPath(...)
Arguments
| obj | No longer used. |
| ... | No longer used. |
Get a comprehensive data frame describing the files of your corpus
Description
The function translates the given hierarchy definition into a data frame with one row for each file, including the generated document ID.
Usage
corpus_files(
dir,
hierarchy = list(),
fsep = .Platform$file.sep,
full_list = FALSE
)
Arguments
| dir | File path to the root directory of the text corpus, or a TIF[1] compliant data frame. |
| hierarchy | A named list of named character vectors describing the directory hierarchy level by level. If […] |
| fsep | Character string defining the path separator to use. |
| full_list | Logical, see return value. |
Value
Either a data frame with columns doc_id, file,
path and one further factor
column for each hierarchy level,
or (if full_list=TRUE) a list containing that data frame
(all_files) and also data frames describing the hierarchy by given names (hier_names),
directories (hier_dirs) and relative paths (hier_paths).
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
myCorpusFiles <- corpus_files(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus"
),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
)
)
Methods to correct kRp.corpus objects
Description
These methods enable you to correct errors that occurred during automatic processing, e.g., wrong hyphenation.
Usage
## S4 method for signature 'kRp.corpus'
correct.hyph(obj, word = NULL, hyphen = NULL, cache = TRUE)
Arguments
| obj | An object of class kRp.corpus. |
| word | A character string, the (possibly incorrectly hyphenated) […] |
| hyphen | A character string, the new manually hyphenated version of word. |
| cache | Logical, if […] |
Details
For details on what these methods do on a per text object basis, please refer to the
documentation of correct.hyph in the sylly
package.
Value
An object of the same class as obj.
Generate a document-term matrix from a corpus object
Description
Calculates a sparse document-term matrix from a given object of class
kRp.corpus and adds it to the object's feature list.
You can also calculate the term frequency-inverse document frequency value (tf-idf) for each
term.
Usage
## S4 method for signature 'kRp.corpus'
docTermMatrix(
obj,
terms = "token",
case.sens = FALSE,
tfidf = FALSE,
as.feature = TRUE
)
Arguments
| obj | An object of class kRp.corpus. |
| terms | A character string defining the […] |
| case.sens | Logical, whether terms should be counted case-sensitively. |
| tfidf | Logical, if […] |
| as.feature | Logical, whether the output should be just the sparse matrix or the input object with that matrix added as a feature. Use […] |
Details
The settings of terms, case.sens,
and tfidf will be stored in the object's meta slot,
so you can use corpusMeta(..., "doc_term_matrix") to fetch it.
See the examples to learn how to limit the analysis to desired word classes.
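To make the tfidf option more concrete, here is a minimal base R sketch on a toy count matrix. It assumes the common weighting of term counts by log10(N / df), where N is the number of documents and df the number of documents containing the term. The exact formula and normalization used by docTermMatrix are not stated here, so the numbers are illustrative only, and all object names are made up for this sketch.

```r
# toy document-term counts: rows are documents, columns are terms
toy_dtm <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    3, 0, 0),
  nrow=3, byrow=TRUE,
  dimnames=list(c("doc1", "doc2", "doc3"), c("corpus", "plugin", "token"))
)

# document frequency: in how many documents does each term occur?
doc_freq <- colSums(toy_dtm > 0)

# inverse document frequency (assumed formula: log10 of N / df)
idf <- log10(nrow(toy_dtm) / doc_freq)

# weight each column of counts by the idf of its term
tf_idf <- sweep(toy_dtm, MARGIN=2, STATS=idf, FUN="*")
print(round(tf_idf, 3))
```

A term that occurs in every document, like "corpus" in this toy matrix, gets an idf of zero under this weighting and therefore drops out of the weighted matrix.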
Value
Either an object of the input class or a sparse matrix of class
dgCMatrix.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# get the document-term frequencies in a sparse matrix
myDTMatrix <- docTermMatrix(myCorpus, as.feature=FALSE)
# combine with filterByClass() to, e.g., exclude all punctuation
myDTMatrix <- docTermMatrix(filterByClass(myCorpus), as.feature=FALSE)
# instead of absolute frequencies, get the tf-idf values
myDTMatrix <- docTermMatrix(
filterByClass(myCorpus),
tfidf=TRUE,
as.feature=FALSE
)
} else {}
Apply filterByClass() to all texts in kRp.corpus objects
Description
This method calls filterByClass on all tagged text objects
inside the given txt object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
filterByClass(txt, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
| txt | An object of class kRp.corpus. |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| ... | Options to pass through to filterByClass. |
Value
An object of the same class as txt.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
# remove all punctuation
myCorpus <- filterByClass(myCorpus)
head(taggedText(myCorpus), n=10)
} else {}
Apply freq.analysis() to all texts in kRp.corpus objects
Description
This method calls freq.analysis on all tagged text objects
inside the given txt.file object.
Usage
## S4 method for signature 'kRp.corpus'
freq.analysis(txt.file, ...)
Arguments
| txt.file | An object of class kRp.corpus. |
| ... | Options to pass through to freq.analysis. |
Details
If corp.freq was not specified but a valid object of class kRp.corp.freq
is found in the freq slot of txt.file,
it is used automatically. That is the case if you called
read.corp.custom on the object previously.
Value
An object of the same class as txt.file.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- read.corp.custom(myCorpus)
myCorpus <- freq.analysis(myCorpus)
corpusFreq(myCorpus)
} else {}
Apply hyphen() to all texts in kRp.corpus objects
Description
This method calls hyphen on all tagged text objects
inside the given words object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
hyphen(words, mc.cores = getOption("mc.cores", 1L), quiet = TRUE,
...)
Arguments
| words | An object of class kRp.corpus. |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| quiet | Logical, if […] |
| ... | Options to pass through to hyphen. |
Value
An object of the same class as words.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_new"
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- hyphen(myCorpus)
} else {}
Apply jumbleWords() to all texts in kRp.corpus objects
Description
This method calls jumbleWords on all tagged text objects
inside the given words object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
jumbleWords(words, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
| words | An object of class kRp.corpus. |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| ... | Options to pass through to jumbleWords. |
Value
An object of the same class as words.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
myCorpus <- jumbleWords(myCorpus)
head(taggedText(myCorpus), n=10)
} else {}
S4 Class kRp.corpus
Description
Objects of this class can contain full text corpora in a hierarchical structure. The class supports both the tm package's
Corpus class and koRpus' own object classes and stores them in separate slots.
Details
Objects should be created using the readCorpus function.
Slots
lang: A character string, naming the language that is assumed for the tokenized texts in this object.
desc: A named list of descriptive statistics of the tagged texts.
meta: A named list. Can be used to store meta information. Currently, no particular format is defined.
raw: A list of objects of class Corpus.
tokens: A data frame as used for the tokens slot in objects of class kRp.text. In addition to the columns usually found in those objects, this data frame also has a factor column for each hierarchical category defined (if any).
features: A named logical vector, indicating which features are available in this object's feat_list slot. Common features are listed in the description of the feat_list slot.
feat_list: A named list with optional analysis results or other content as used by the defined features:
    hierarchy: A named list of named character vectors describing the directory hierarchy level by level.
    hyphen: A named list of objects of class kRp.hyphen.
    readability: A named list of objects of class kRp.readability.
    lex_div: A named list of objects of class kRp.TTR.
    freq: The freq.analysis slot of a kRp.txt.freq class object after freq.analysis was called.
    corp_freq: An object of class kRp.corp.freq, e.g., results of a call to read.corp.custom.
    diff: A named list of diff features of a kRp.text object after a method like textTransform was called.
    summary: A summary data frame for the full corpus, including descriptive statistics on all texts, as well as results of analyses like readability and lexical diversity, if available.
    doc_term_matrix: A sparse document-term matrix, as produced by docTermMatrix.
    stopwords: A numeric vector with the total number of stopwords in each text, if stopwords were analyzed during tokenizing or POS tagging.
See the getter and setter methods for easy access to these sub-slots. There can actually be any number of additional features; the above is just a list of those already defined by this package.
Constructor function
Should you need to manually generate objects of this class (which should rarely be the case),
the constructor function
kRp.corpus(...) can be used instead of
new("kRp.corpus", ...). Whenever possible, stick to
readCorpus.
Note
There are also getter and setter methods for objects of this class.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
} else {}
# manual creation
emptyCorpus <- kRp.corpus()
A source function for tm
Description
A rather untested attempt to sketch a Source function for tm.
Supposed to be used to translate tagged koRpus objects into tm objects.
Usage
kRpSource(obj, encoding = "UTF-8")
Arguments
| obj | An object of class […] |
| encoding | Character string, defining the character encoding of the object. |
Details
Also provided are the methods getElem and pGetElem for S3 class kRpSource.
Value
An object of class Source,
also inheriting class kRpSource.
Apply lex.div() to all texts in kRp.corpus objects
Description
This method calls lex.div on all tagged text objects
inside the given txt object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
lex.div(
txt,
summary = TRUE,
mc.cores = getOption("mc.cores", 1L),
char = "",
quiet = TRUE,
...
)
Arguments
| txt | An object of class kRp.corpus. |
| summary | Logical, determines if the […] |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| char | Character vector to specify measures of which characteristics should be computed, see […] |
| quiet | Logical, if […] |
| ... | Options to pass through to lex.div. |
Value
An object of the same class as txt.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- lex.div(myCorpus)
corpusSummary(myCorpus)
} else {}
Apply query() to all texts in kRp.corpus objects
Description
This method calls query on all tagged text objects
inside the given object.
Usage
## S4 method for signature 'kRp.corpus'
query(
obj,
var,
query,
rel = "eq",
as.df = TRUE,
ignore.case = TRUE,
perl = FALSE,
regexp_var = "token"
)
Arguments
| obj | An object of class kRp.corpus. |
| var | A character string naming a column in the tagged text. If set to […] |
| query | A character vector (for words), regular expression, or single number naming values to be matched in the variable. Can also be a vector of two numbers to query a range of frequency data, or a list of named lists for multiple queries (see "Query lists" section of […]). |
| rel | A character string defining the relation of the queried value and desired results. Must either be […] |
| as.df | Logical, if […] |
| ignore.case | Logical, passed through to […] |
| perl | Logical, passed through to […] |
| regexp_var | A character string naming the column to query if […] |
Value
Depending on the arguments, might include whole objects, lists, single values etc.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# all tokens with more than seven letters; the threshold is given as a number
query(myCorpus, var="lttr", query=7, rel="gt")
} else {}
Apply read.corp.custom() to all texts in kRp.corpus objects
Description
This method calls read.corp.custom on all tagged text objects
inside the given corpus object.
Usage
## S4 method for signature 'kRp.corpus'
read.corp.custom(corpus, caseSens = TRUE, log.base = 10,
keep_dtm = FALSE, ...)
Arguments
| corpus | An object of class kRp.corpus. |
| caseSens | Logical. If […] |
| log.base | A numeric value defining the base of the logarithm used for inverse document frequency (idf). See […] |
| keep_dtm | Logical. If […] |
| ... | Options to pass through to the […] |
Details
Since the analysis is based on a document-term matrix,
a pre-existing matrix stored as a feature of the corpus object
will be used if it matches the case sensitivity setting. Otherwise a new matrix will be generated (but will not replace the
existing one). If no document-term matrix is present yet,
one will be generated as well, and it can be kept as an additional feature
of the resulting object.
Value
An object of the same class as corpus.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- read.corp.custom(myCorpus)
corpusCorpFreq(myCorpus)
} else {}
Create kRp.corpus objects from text files or data frames
Description
You can either read a corpus from text files (one file per text, also see the Hierarchy section below) or from TIF compliant data frames (see the Data frames section below).
Usage
readCorpus(
dir,
hierarchy = list(),
lang = "kRp.env",
tagger = "kRp.env",
encoding = "",
pattern = NULL,
recursive = FALSE,
ignore.case = FALSE,
mode = "text",
format = "file",
mc.cores = getOption("mc.cores", 1L),
id = "",
...
)
Arguments
| dir | Either a file path to the root directory of the text corpus, or a TIF compliant data frame. If a directory path (character string), texts can be recursively ordered into subfolders named exactly as defined by hierarchy. |
| hierarchy | A named list of named character vectors describing the directory hierarchy level by level. If […] |
| lang | A character string naming the language of the analyzed corpus. See […] |
| tagger | A character string pointing to the tokenizer/tagger command you want to use for basic text analysis. Defaults to […] |
| encoding | Character string describing the current encoding. See […] |
| pattern | A regular expression for file matching. See […] |
| recursive | Logical, indicates whether directories should be read recursively. See […] |
| ignore.case | Logical, indicates whether […] |
| mode | Character string defining the reading mode. See […] |
| format | Either "file" or "obj", depending on whether you want to scan files or analyze the text in a given object, like a character vector. If the latter and […] |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| id | A character string describing the main subject/purpose of the text corpus. |
| ... | Additional options which are passed through to the defined tagger. |
Value
An object of class kRp.corpus.
Hierarchy
To import a hierarchically structured text corpus you must categorize all texts in a directory
structure that resembles the hierarchy. If, for example, you would like to import a corpus on two
different topics and two different sources, your hierarchy has two nested levels (topic and source).
The root directory dir would then need to have two subdirectories (one for each topic),
each of which in turn must have two subdirectories (one for each source),
and the actual text files are found in those.
To use this hierarchical structure in readCorpus, the hierarchy argument is used.
It is a named list, where each list item represents one hierarchical level (here again topic and source),
and its value is a named character vector describing the actual topics and sources to be used. It is
important to understand how these character vectors are treated: The names of elements must exactly match
the corresponding subdirectory name, whereas the value is a free text description. The names of the
list items, however, describe the hierarchical level and are not matched with directory names.
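The directory layout described above can be sketched with base R alone. The following block builds the expected two-level tree for the topics and sources used in the examples of this manual; the root directory toy_corpus inside tempdir() is a hypothetical location chosen for this illustration, and no text files are created.

```r
# hypothetical root directory for the toy corpus
root <- file.path(tempdir(), "toy_corpus")

hierarchy <- list(
  Topic=c(Winner="Reality Winner", Edwards="Natalie Edwards"),
  Source=c(Wikipedia_prev="Wikipedia (old)", Wikipedia_new="Wikipedia (new)")
)

# only the element names are used as directory names,
# the values are free text descriptions
layout <- expand.grid(lapply(hierarchy, names), stringsAsFactors=FALSE)
corpus_dirs <- file.path(root, layout[["Topic"]], layout[["Source"]])

# create one directory per topic/source combination
for (d in corpus_dirs) {
  dir.create(d, recursive=TRUE, showWarnings=FALSE)
}

# list the resulting tree relative to the root directory
print(list.dirs(root, full.names=FALSE, recursive=TRUE))
```

Text files placed in the four innermost directories would then belong to the respective topic and source.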
Data frames
In order to import a corpus from a data frame,
the object must be in Text Interchange Format (TIF)
as described by [1]. As a minimum, the data frame must have two character columns,
doc_id and text.
You can provide additional information on hierarchical categories by using further columns,
where the column name must match the category name (hierarchical level). The order of those
columns in the data frame is not important,
as you must still fully define the hierarchical structure
using the hierarchy argument. Columns you leave out of the hierarchy definition are ignored,
but the values used in the hierarchy list and in the respective columns must match,
as rows with unmatched category levels will also be ignored.
Note that the special column names path and file will also be imported automatically.
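As an illustration of the requirements above, the following base R sketch builds a small data frame with the two mandatory character columns and two category columns. The document IDs and texts are invented. The sketch also assumes that the category values in the columns correspond to the element names of the hierarchy vectors, as directory names do; the text above does not spell this out.

```r
# a minimal TIF compliant data frame: character columns doc_id and text
myCorpus_df <- data.frame(
  doc_id=c("winner_old", "winner_new", "edwards_old", "edwards_new"),
  text=c(
    "First toy text.", "Second toy text.",
    "Third toy text.", "Fourth toy text."
  ),
  # optional category columns, named like the hierarchy levels
  Topic=c("Winner", "Winner", "Edwards", "Edwards"),
  Source=c("Wikipedia_prev", "Wikipedia_new", "Wikipedia_prev", "Wikipedia_new"),
  stringsAsFactors=FALSE
)

hierarchy <- list(
  Topic=c(Winner="Reality Winner", Edwards="Natalie Edwards"),
  Source=c(Wikipedia_prev="Wikipedia (old)", Wikipedia_new="Wikipedia (new)")
)

# assumption: column values must be among the names of each hierarchy level
matched <- vapply(
  names(hierarchy),
  function(level) all(myCorpus_df[[level]] %in% names(hierarchy[[level]])),
  logical(1)
)
print(matched)
```

A check like this before calling readCorpus with format="obj" can help to spot rows that would otherwise be ignored because of unmatched category levels.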
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
# "flat" corpus, parse all texts in the given dir
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_prev"
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# corpus with one category named "Source"
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
tagger="tokenize",
lang="en"
)
# two hierarchical levels, "Topic" and "Source"
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
tagger="tokenize",
lang="en"
)
# get hierarchy from directory tree
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=TRUE,
tagger="tokenize",
lang="en"
)
## Not run:
# if the same corpus is available as TIF compliant data frame
myCorpus <- readCorpus(
dir=myCorpus_df,
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
lang="en",
format="obj"
)
## End(Not run)
} else {}
Apply readability() to all texts in kRp.corpus objects
Description
This method calls readability on all tagged text objects
inside the given txt.file object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
readability(
txt.file,
summary = TRUE,
mc.cores = getOption("mc.cores", 1L),
quiet = TRUE,
...
)
Arguments
| txt.file | An object of class kRp.corpus. |
| summary | Logical, determines if the […] |
| mc.cores | The number of cores to use for parallelization, see mclapply. |
| quiet | Logical, if […] |
| ... | Options to pass through to readability. |
Value
An object of the same class as txt.file.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# assign the result back to myCorpus so the summary below includes readability
myCorpus <- readability(myCorpus)
corpusSummary(myCorpus)
} else {}
Show methods for kRp.corpus objects
Description
Show methods for S4 objects of class kRp.corpus.
Usage
## S4 method for signature 'kRp.corpus'
show(object)
Arguments
| object | An object of class kRp.corpus. |
Turn a kRp.corpus object into a list of kRp.text objects
Description
For some analysis steps it might be important to have individual tagged texts instead of one large corpus object. This method produces just that.
Usage
## S4 method for signature 'kRp.corpus'
split_by_doc_id(obj, keepFeatures = TRUE)
Arguments
| obj | An object of class kRp.corpus. |
| keepFeatures | Either logical, whether to keep all features or drop them, or a character vector of names of features to keep if present. |
Value
A named list of objects of class kRp.text.
Elements are named by their doc_id.
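The shape of the return value can be pictured with a base R analogy. The sketch below splits a toy tokens data frame by its doc_id column, which yields a list named by document ID. It is not the implementation of split_by_doc_id, whose result holds kRp.text objects rather than plain data frames, and the toy data is invented.

```r
# toy stand-in for a tokens data frame with a doc_id column
toy_tokens <- data.frame(
  doc_id=c("a.txt", "a.txt", "b.txt"),
  token=c("Hello", "world", "Goodbye"),
  stringsAsFactors=FALSE
)

# one data frame per document, named by doc_id
by_doc <- split(toy_tokens, toy_tokens[["doc_id"]])
print(names(by_doc))
```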
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpusList <- split_by_doc_id(myCorpus)
} else {}
Apply summary() to all texts in kRp.corpus objects
Description
This method performs a summary call on all text objects inside the given
object object. Contrary to what other summary methods do, this method
always returns the full object with an updated summary slot.
Usage
## S4 method for signature 'kRp.corpus'
summary(object, missing = NA, ...)
corpusSummary(obj)
## S4 method for signature 'kRp.corpus'
corpusSummary(obj)
corpusSummary(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusSummary(obj) <- value
Arguments
| object | An object of class kRp.corpus. |
| missing | Character string to use for missing values. |
| ... | Used for internal processes. |
| obj | An object of class kRp.corpus. |
| value | The new value to replace the current with. |
Details
The summary slot contains a data.frame with aggregated information of
all texts that the respective object contains.
corpusSummary is a simple method to get or set the summary slot
in kRp.corpus objects directly.
Value
An object of the same class as object.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# calculate readability, but prevent a summary table from being added
myCorpus <- readability(myCorpus, summary=FALSE)
corpusSummary(myCorpus)
# add summaries
myCorpus <- summary(myCorpus)
corpusSummary(myCorpus)
} else {}
Getter/setter methods for kRp.corpus objects
Description
These methods should be used to get or set values of text objects
generated by functions like readCorpus.
Usage
## S4 method for signature 'kRp.corpus'
taggedText(obj)
## S4 replacement method for signature 'kRp.corpus'
taggedText(obj) <- value
## S4 method for signature 'kRp.corpus'
doc_id(obj, has_id = NULL)
## S4 method for signature 'kRp.corpus'
describe(obj, doc_id = NULL, simplify = TRUE, ...)
## S4 replacement method for signature 'kRp.corpus'
describe(obj, doc_id = NULL, ...) <- value
## S4 method for signature 'kRp.corpus'
language(obj)
## S4 replacement method for signature 'kRp.corpus'
language(obj) <- value
## S4 method for signature 'kRp.corpus'
hasFeature(obj, feature = NULL)
## S4 replacement method for signature 'kRp.corpus'
hasFeature(obj, feature) <- value
## S4 method for signature 'kRp.corpus'
feature(obj, feature, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
feature(obj, feature) <- value
## S4 method for signature 'kRp.corpus'
corpusReadability(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
corpusReadability(obj) <- value
corpusTm(obj)
## S4 method for signature 'kRp.corpus'
corpusTm(obj)
corpusTm(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusTm(obj) <- value
corpusMeta(obj, meta = NULL, fail = TRUE)
## S4 method for signature 'kRp.corpus'
corpusMeta(obj, meta = NULL, fail = TRUE)
corpusMeta(obj, meta = NULL) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusMeta(obj, meta = NULL) <- value
## S4 method for signature 'kRp.corpus'
corpusHyphen(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
corpusHyphen(obj) <- value
## S4 method for signature 'kRp.corpus'
corpusLexDiv(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
corpusLexDiv(obj) <- value
## S4 method for signature 'kRp.corpus'
corpusFreq(obj)
## S4 replacement method for signature 'kRp.corpus'
corpusFreq(obj) <- value
## S4 method for signature 'kRp.corpus'
corpusCorpFreq(obj)
## S4 replacement method for signature 'kRp.corpus'
corpusCorpFreq(obj) <- value
corpusHierarchy(obj, ...)
## S4 method for signature 'kRp.corpus'
corpusHierarchy(obj)
corpusHierarchy(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusHierarchy(obj) <- value
corpusFiles(obj, paths = FALSE, ...)
## S4 method for signature 'kRp.corpus'
corpusFiles(obj, paths = FALSE)
corpusFiles(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusFiles(obj) <- value
corpusDocTermMatrix(obj, ...)
## S4 method for signature 'kRp.corpus'
corpusDocTermMatrix(obj)
corpusDocTermMatrix(obj, terms = NULL, case.sens = NULL, tfidf = NULL) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusDocTermMatrix(obj, terms = NULL, case.sens = NULL,
tfidf = NULL) <- value
## S4 method for signature 'kRp.corpus'
corpusStopwords(obj)
## S4 replacement method for signature 'kRp.corpus'
corpusStopwords(obj) <- value
## S4 method for signature 'kRp.corpus'
diffText(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
diffText(obj) <- value
## S4 method for signature 'kRp.corpus'
originalText(obj)
is.corpus(obj)
## S4 method for signature 'kRp.corpus,ANY,ANY,ANY'
x[i, j, ..., drop = TRUE]
## S4 replacement method for signature 'kRp.corpus,ANY,ANY,ANY'
x[i, j, ...] <- value
## S4 method for signature 'kRp.corpus'
x[[i, doc_id = NULL, ...]]
## S4 replacement method for signature 'kRp.corpus'
x[[i, doc_id = NULL, ...]] <- value
## S4 method for signature 'kRp.corpus'
tif_as_tokens_df(tokens)
tif_as_corpus_df(corpus)
## S4 method for signature 'kRp.corpus'
tif_as_corpus_df(corpus)
Arguments
obj |
An object of class |
value |
A new value to replace the current with. |
has_id |
A character vector with |
doc_id |
A character vector to limit the scope to one or more particular document IDs. |
simplify |
If |
... |
Additional arguments to pass through, depending on the method. |
feature |
Character string naming the object feature to look for. |
meta |
If not NULL, the |
fail |
Logical,
whether the method should fail with an error if |
paths |
Logical,
indicates for |
terms |
A character string defining the |
case.sens |
Logical, whether terms were counted case-sensitively. Stored in the object's meta data slot. |
tfidf |
Logical,
use |
x |
See |
i |
Defines the row selector ( |
j |
Defines the column selector in the tokens slot. |
drop |
See |
tokens |
An object of class |
corpus |
An object of class |
Details
taggedText() returns the tokens slot.
describe() returns the desc slot.
hasFeature() returns TRUE or FALSE, depending on whether the requested feature is present or not.
feature() returns the list entry of the feat_list slot for the requested feature.
corpusReadability() returns the list of kRp.readability objects.
corpusTm() returns the VCorpus object.
corpusMeta() returns the list with meta information.
corpusHyphen() returns the list of kRp.hyphen objects.
corpusLexDiv() returns the list of kRp.TTR objects.
corpusFiles() returns the character vector of file names of the object.
corpusFreq() returns the frequency analysis data from the feat_list slot.
corpusCorpFreq() returns the kRp.corp.freq object of the feat_list slot.
corpusHierarchy() returns the corpus' hierarchy structure.
corpusDocTermMatrix() returns the sparse document term matrix of the feat_list slot.
corpusStopwords() returns the number of stopwords found in each text (if analyzed) from the feat_list slot.
diffText() returns the diff element of the feat_list slot.
originalText() regenerates the original text before text transformations and returns it as a data frame.
[/[[ can be used as a shortcut to index the results of taggedText().
tif_as_corpus_df() returns the whole corpus in a single TIF[1] compliant data.frame.
tif_as_tokens_df() returns the tokens slot in a TIF[1] compliant data.frame, i.e., doc_id is not a factor but a character vector.
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_new"
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
taggedText(myCorpus)
corpusMeta(myCorpus, "note") <- "an interesting read!"
# export object to TIF compliant data frame
myCorpus_df <- tif_as_corpus_df(myCorpus)
} else {}
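The Details state that tif_as_tokens_df() returns doc_id as a character vector rather than a factor. The following base R sketch shows what that difference means for a small token table. The data frame and the helper to_tif_tokens() are hypothetical and made up for illustration; the real tokens slot holds more columns, and the package's own conversion may do more than this.

```r
# hypothetical token table with doc_id stored as a factor;
# the real tokens slot of a kRp.corpus object has more columns
tokens <- data.frame(
  doc_id = factor(c("text1", "text1", "text2")),
  token = c("Hello", "world", "Goodbye"),
  stringsAsFactors = FALSE
)

# hypothetical helper: turn the doc_id factor into a character vector,
# leaving all other columns untouched
to_tif_tokens <- function(df) {
  df[["doc_id"]] <- as.character(df[["doc_id"]])
  df
}

tif_tokens <- to_tif_tokens(tokens)

# compare the column types before and after the conversion
class(tokens[["doc_id"]])
class(tif_tokens[["doc_id"]])
```

as.character() on a factor returns the level labels, not the underlying integer codes, so the document IDs survive the conversion unchanged.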
Apply textTransform() to all texts in kRp.corpus objects
Description
This method calls textTransform on all tagged text objects
inside the given txt object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
textTransform(txt, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
txt |
An object of class |
mc.cores |
The number of cores to use for parallelization,
see |
... |
Options to pass through to |
Value
An object of the same class as txt.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
myCorpus <- textTransform(myCorpus, scheme="minor")
head(taggedText(myCorpus), n=10)
} else {}
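The Description mentions that the transformation is applied to each text via mclapply. A minimal sketch of that per-text pattern, using only the parallel package, might look as follows. The list texts and the function lowercase_tokens() are hypothetical stand-ins; they do not reproduce any textTransform scheme or the internal structure of a kRp.corpus object.

```r
library(parallel)

# hypothetical stand-in for the tagged texts of a corpus:
# one character vector of tokens per document
texts <- list(
  doc1 = c("The", "Quick", "Fox"),
  doc2 = c("Jumps", "Over")
)

# hypothetical per-text transformation (plain lower-casing);
# not one of the schemes offered by textTransform()
lowercase_tokens <- function(tokens) {
  tolower(tokens)
}

# apply the transformation to each text; mc.cores uses the same
# default as shown in the Usage section above
transformed <- mclapply(
  texts,
  lowercase_tokens,
  mc.cores = getOption("mc.cores", 1L)
)

transformed
```

With the default of one core, mclapply behaves like lapply. Note that on Windows mclapply does not support more than one core, so mc.cores should be left at 1 there.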