| Title: | Comparing Automated Subject Indexing Methods in R |
| Version: | 0.3.3 |
| Description: | Evaluate automated subject indexing results. The main focus of the package is to enable efficient computation of set retrieval and ranked retrieval metrics across multiple dimensions of a dataset, e.g. document strata or subsets of the label set. The package can also compute bootstrap confidence intervals for all major metrics, with seamless integration of parallel computation, and offers propensity-scored variants of the standard metrics. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.1 |
| Imports: | dplyr (≥ 1.1.1), furrr, purrr, rsample, tidyr, rlang, collapse (≥ 2.1.0), stringr, options, withr |
| Suggests: | testthat (≥ 3.0.0), tibble, tidyverse, ggplot2, future |
| Config/testthat/edition: | 3 |
| Depends: | R (≥ 4.1.0) |
| LazyData: | true |
| URL: | https://deutsche-nationalbibliothek.github.io/casimir/ |
| NeedsCompilation: | no |
| Packaged: | 2025-11-13 07:50:18 UTC; maximilian |
| Author: | Maximilian Kähler |
| Maintainer: | Maximilian Kähler <m.kaehler@dnb.de> |
| Repository: | CRAN |
| Date/Publication: | 2025-11-17 21:30:07 UTC |
casimir: Comparing Automated Subject Indexing Methods in R
Description
Functions for evaluating automated subject indexing results. The main focus of the package is to enable efficient computation of set retrieval and ranked retrieval metrics across multiple dimensions of a dataset, e.g. document strata or subsets of the label set. The package can also compute bootstrap confidence intervals for all major metrics, with seamless integration of parallel computation, and offers propensity-scored variants of the standard metrics.
Author(s)
Maintainer: Maximilian Kähler m.kaehler@dnb.de (ORCID)
Authors:
Markus Schumacher m.schumacher@dnb.de
Other contributors:
Deutsche Nationalbibliothek [copyright holder]
See Also
Useful links:
- https://deutsche-nationalbibliothek.github.io/casimir/
Filter predictions based on score and rank
Description
Helper function for filtering predictions with score above a certain threshold or rank below some limit rank.
Usage
apply_threshold(threshold, limit = NA_real_, base_compare)
Arguments
threshold: A numeric threshold between 0 and 1.
limit: An integer cutoff >= 1 for rank-based thresholding. Requires a column rank in base_compare.
base_compare: A data.frame as created by create_comparison().
Value
A data.frame with observations that satisfy (score >=
threshold AND (if applicable) rank <= limit) OR gold ==
TRUE. A new logical column suggested indicates TRUE if score
>= threshold AND (if applicable) rank <= limit, and FALSE for
false negative observations (that may have no score, a score below the
threshold or rank above the limit).
Examples
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id, ~score,
"A", "a", 0.9,
"A", "d", 0.7,
"A", "f", 0.3,
"A", "c", 0.1,
"B", "a", 0.8,
"B", "e", 0.6,
"B", "d", 0.1,
"C", "f", 0.1,
"C", "c", 0.2,
"C", "e", 0.2
)
base_compare <- create_comparison(pred, gold)
res_0 <- apply_threshold(
threshold = 0.3,
base_compare = base_compare
)
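As a small follow-up (not part of the original example), the filtering rule described in the Value section can be inspected through the returned logical columns suggested and gold:
table(suggested = res_0$suggested, gold = res_0$gold)
# rows with suggested == FALSE and gold == TRUE are the retained false negatives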
Compute bootstrap replica of pr auc
Description
A wrapper for use within bootstrap computation of pr auc which covers the repeated application of:
- join with resampled doc_ids
- summarise_intermediate_results
- postprocessing of curve data
- auc computation
Usage
boot_worker_fn(
sampled_id_list,
intermed_res,
propensity_scored,
replace_zero_division_with
)
Arguments
sampled_id_list: A list of all doc_ids of the examples drawn in each bootstrap iteration.
intermed_res: Intermediate results as produced by compute_intermediate_results().
propensity_scored: Logical, whether to use propensity scores as weights.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
Value
A data.frame with a column "pr_auc" and optional
grouping_vars.
Coerce id columns to character
Description
Internal helper function designed to ensure that id columns are not passed as
factor variables. Factor variables in id columns may cause undesired
behaviour with the drop_empty_group argument.
Usage
check_id_vars(df)
Arguments
df: An input data.frame.
Value
The input data.frame df with the id columns being no
longer factor variables.
Coerce column to character
Description
Check an arbitrary column in a data.frame for factor type and coerce to character.
Usage
check_id_vars_col(df, col)
Arguments
df: An input data.frame.
col: The name of the column to check.
Value
The input data.frame df with the specified column being no
longer a factor variable.
Check for inconsistent relevance values
Description
Internal helper function to check a comparison matrix for inconsistent relevance values of gold standard and predicted labels.
Usage
check_repair_relevance_compare(
gold_vs_pred,
ignore_inconsistencies = options::opt("ignore_inconsistencies")
)
Arguments
gold_vs_pred: As created by create_comparison().
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
Value
A valid comparison matrix with possibly corrected relevance values,
being compatible with compute_intermediate_results.
Check for inconsistent relevance values
Description
Internal helper function to check a data.frame with predicted labels for a valid relevance column.
Usage
check_repair_relevance_pred(
predicted,
ignore_inconsistencies = options::opt("ignore_inconsistencies")
)
Arguments
predicted: Multi-label prediction results. Expects a data.frame with columns doc_id, label_id and relevance.
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
Value
A valid predicted data.frame with possibly eliminated missing
values.
Compute intermediate set retrieval results per group
Description
Compute intermediate set retrieval results per group such as number of gold standard and predicted labels, number of true positives, false positives and false negatives, precision, R-precision, recall and F1 score.
Usage
compute_intermediate_results(
gold_vs_pred,
grouping_var,
propensity_scored = FALSE,
cost_fp = NULL,
drop_empty_groups = options::opt("drop_empty_groups"),
check_group_names = options::opt("check_group_names")
)
compute_intermediate_results_dplyr(
gold_vs_pred,
grouping_var,
propensity_scored = FALSE,
cost_fp = NULL
)
Arguments
gold_vs_pred: A data.frame with logical columns gold and suggested, as created by create_comparison().
grouping_var: A character vector of grouping variables that must be present in gold_vs_pred.
propensity_scored: Logical, whether to use propensity scores as weights.
cost_fp: A numeric value > 0, defaults to NULL.
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
check_group_names: Perform replacement of dots in grouping columns. Disable for faster computation if you can make sure that all columns used for grouping ("doc_id", "label_id", "doc_groups", "label_groups") do not contain dots. (Defaults to TRUE)
Value
A list of two elements:
- results_table: A data.frame with columns "n_gold", "n_suggested", "tp", "fp", "fn", "prec", "rprec", "rec", "f1".
- grouping_var: The input vector grouping_var.
Functions
- compute_intermediate_results_dplyr(): Variant with dplyr based internals rather than collapse internals.
Examples
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "d",
"A", "f",
"B", "a",
"B", "e",
"C", "f"
)
gold_vs_pred <- create_comparison(pred, gold)
compute_intermediate_results(gold_vs_pred, "doc_id")
Compute intermediate ranked retrieval results per group
Description
Compute intermediate ranked retrieval results per group such as Discounted Cumulative Gain (DCG), Ideal Discounted Cumulative Gain (IDCG), Normalised Discounted Cumulative Gain (NDCG) and Label Ranking Average Precision (LRAP).
Usage
compute_intermediate_results_rr(
gold_vs_pred,
grouping_var,
drop_empty_groups = options::opt("drop_empty_groups")
)
Arguments
gold_vs_pred: A data.frame as generated by create_comparison().
grouping_var: A character vector of grouping variables that must be present in gold_vs_pred.
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
Value
A data.frame with columns "dcg", "idcg", "ndcg", "lrap".
Examples
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"A", "d",
"A", "e",
)
pred <- tibble::tribble(
~doc_id, ~label_id, ~score,
"A", "f", 0.3277,
"A", "e", 0.32172,
"A", "b", 0.13517,
"A", "g", 0.10134,
"A", "h", 0.09152,
"A", "a", 0.07483,
"A", "i", 0.03649,
"A", "j", 0.03551,
"A", "k", 0.03397,
"A", "c", 0.03364
)
gold_vs_pred <- create_comparison(pred, gold)
compute_intermediate_results_rr(
gold_vs_pred,
rlang::syms(c("doc_id"))
)
Compute area under precision-recall curve
Description
Compute the area under the precision-recall curve with support for
bootstrap-based confidence intervals and different stratification and
aggregation modes for the underlying precision and recall aggregation.
Precision is calculated as the best value at a given level of recall over all
possible thresholds on score and limits on rank. In essence,
compute_pr_auc performs a two-dimensional optimisation over thresholds
and limits, applying both a threshold-based and a rank-based cutoff.
Usage
compute_pr_auc(
predicted,
gold_standard,
doc_groups = NULL,
label_groups = NULL,
mode = "doc-avg",
steps = 100,
thresholds = NULL,
limit_range = NA_real_,
compute_bootstrap_ci = FALSE,
n_bt = 10L,
seed = NULL,
graded_relevance = FALSE,
rename_metrics = FALSE,
propensity_scored = FALSE,
label_distribution = NULL,
cost_fp_constant = NULL,
replace_zero_division_with = options::opt("replace_zero_division_with"),
drop_empty_groups = options::opt("drop_empty_groups"),
ignore_inconsistencies = options::opt("ignore_inconsistencies"),
verbose = options::opt("verbose"),
progress = options::opt("progress")
)
Arguments
predicted: Multi-label prediction results. Expects a data.frame with columns doc_id, label_id and score (and optionally rank).
gold_standard: Expects a data.frame with columns doc_id and label_id.
doc_groups: A two-column data.frame with a column doc_id and a second column assigning each document to a group.
label_groups: A two-column data.frame with a column label_id and a second column assigning each label to a group.
mode: One of the following aggregation modes: "doc-avg", "subj-avg", "micro".
steps: Number of breaks to divide the interval [0, 1] into when building the threshold grid.
thresholds: Alternatively to steps, one can manually set the thresholds to be used to build the pr curve. Defaults to the quantiles of the true positive suggestions' score distribution.
limit_range: A vector of limit values to apply on the rank column. Defaults to NA, applying no cutoff on the predictions' label rank.
compute_bootstrap_ci: A logical indicator for computing bootstrap CIs.
n_bt: An integer number of resamples to be used for bootstrapping.
seed: Pass a seed to make bootstrap replication reproducible.
graded_relevance: A logical indicator for graded relevance. Defaults to FALSE; if TRUE, a column relevance is expected in the predictions.
rename_metrics: If set to TRUE, metric names are adapted to the chosen options, e.g. prefixed with "ps-" or "g-" and suffixed with "@k" (see rename_metrics).
propensity_scored: Logical, whether to use propensity scores as weights.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
cost_fp_constant: Constant cost assigned to false positives. Defaults to NULL.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
verbose: Verbose reporting of computation steps for debugging. (Defaults to FALSE)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
Value
A data.frame with columns "pr_auc" and (if applicable)
"ci_lower", "ci_upper" and additional stratification variables.
See Also
compute_set_retrieval_scores,
compute_pr_auc_from_curve
Examples
library(ggplot2)
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id, ~score, ~rank,
"A", "a", 0.9, 1,
"A", "d", 0.7, 2,
"A", "f", 0.3, 3,
"A", "c", 0.1, 4,
"B", "a", 0.8, 1,
"B", "e", 0.6, 2,
"B", "d", 0.1, 3,
"C", "f", 0.1, 3,
"C", "c", 0.2, 1,
"C", "e", 0.2, 1
)
auc <- compute_pr_auc(pred, gold, mode = "doc-avg")
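A further hedged sketch (not part of the original example): stratifying the PR AUC by a document-level grouping. The group assignments below are made up purely for illustration.
doc_strata <- tibble::tibble(
  doc_id = c("A", "B", "C"),
  subject_area = c("fiction", "non-fiction", "non-fiction")
)
auc_by_group <- compute_pr_auc(
  pred, gold,
  mode = "doc-avg",
  doc_groups = doc_strata
)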
Compute area under precision-recall curve
Description
Compute the area under the precision-recall curve from pr curve data. This
function is mainly intended for use with the plot data generated by
compute_pr_curve; for direct computation of the area under the curve, use
compute_pr_auc. The function uses a simple trapezoidal rule approximation
along the steps of the generated curve data.
Usage
compute_pr_auc_from_curve(
pr_curve_data,
grouping_vars = NULL,
drop_empty_groups = options::opt("drop_empty_groups")
)
Arguments
pr_curve_data: A data.frame as produced by compute_pr_curve(), e.g. the plot_data element of its result.
grouping_vars: Additional columns of the input data to group by.
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
Value
A data.frame with a column "pr_auc" and optional
grouping_vars.
See Also
compute_pr_curve
Examples
library(ggplot2)
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id, ~score, ~rank,
"A", "a", 0.9, 1,
"A", "d", 0.7, 2,
"A", "f", 0.3, 3,
"A", "c", 0.1, 4,
"B", "a", 0.8, 1,
"B", "e", 0.6, 2,
"B", "d", 0.1, 3,
"C", "f", 0.1, 3,
"C", "c", 0.2, 1,
"C", "e", 0.2, 1
)
pr_curve <- compute_pr_curve(
pred,
gold,
mode = "doc-avg",
optimize_cutoff = TRUE
)
auc <- compute_pr_auc_from_curve(pr_curve$plot_data)
# note that pr curves take the cummax(prec), not the precision
ggplot(pr_curve$plot_data, aes(x = rec, y = prec_cummax)) +
geom_point(
data = pr_curve$opt_cutoff,
aes(x = rec, y = prec_cummax),
color = "red",
shape = "star"
) +
geom_text(
data = pr_curve$opt_cutoff,
aes(
x = rec + 0.2, y = prec_cummax,
label = paste("f1_opt =", round(f1_max, 3))
),
color = "red"
) +
geom_path() +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
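The trapezoidal rule mentioned in the description amounts to roughly the following; this is an illustrative sketch, not the package's internal code.
trapezoid_auc <- function(rec, prec) {
  o <- order(rec)
  sum(diff(rec[o]) * (head(prec[o], -1) + tail(prec[o], -1)) / 2)
}
trapezoid_auc(rec = c(0, 0.5, 1), prec = c(1, 0.8, 0.6))
# 0.8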
Compute precision-recall curve
Description
Compute the precision-recall curve for a given step size and limit range.
Usage
compute_pr_curve(
predicted,
gold_standard,
doc_groups = NULL,
label_groups = NULL,
mode = "doc-avg",
steps = 100,
thresholds = NULL,
limit_range = NA_real_,
optimize_cutoff = FALSE,
graded_relevance = FALSE,
propensity_scored = FALSE,
label_distribution = NULL,
cost_fp_constant = NULL,
replace_zero_division_with = options::opt("replace_zero_division_with"),
drop_empty_groups = options::opt("drop_empty_groups"),
ignore_inconsistencies = options::opt("ignore_inconsistencies"),
verbose = options::opt("verbose"),
progress = options::opt("progress")
)
Arguments
predicted: Multi-label prediction results. Expects a data.frame with columns doc_id, label_id and score (and optionally rank).
gold_standard: Expects a data.frame with columns doc_id and label_id.
doc_groups: A two-column data.frame with a column doc_id and a second column assigning each document to a group.
label_groups: A two-column data.frame with a column label_id and a second column assigning each label to a group.
mode: One of the following aggregation modes: "doc-avg", "subj-avg", "micro".
steps: Number of breaks to divide the interval [0, 1] into when building the threshold grid.
thresholds: Alternatively to steps, one can manually set the thresholds to be used to build the pr curve. Defaults to the quantiles of the true positive suggestions' score distribution.
limit_range: A vector of limit values to apply on the rank column. Defaults to NA, applying no cutoff on the predictions' label rank.
optimize_cutoff: Logical. If TRUE, the threshold and limit combination that maximises F1 is determined and returned in the opt_cutoff element of the result.
graded_relevance: A logical indicator for graded relevance. Defaults to FALSE; if TRUE, a column relevance is expected in the predictions.
propensity_scored: Logical, whether to use propensity scores as weights.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
cost_fp_constant: Constant cost assigned to false positives. Defaults to NULL.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
verbose: Verbose reporting of computation steps for debugging. (Defaults to FALSE)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
Value
A list of three elements:
- plot_data: A data.frame with full pr curves and columns "searchspace_id", "prec", "rec", "prec_cummax", "mode".
- opt_cutoff: A data.frame with optimal cutoffs and columns "thresholds", "limits", "searchspace_id", "f1_max", "prec", "rec", "prec_cummax", "mode".
- all_cutoffs: A data.frame with all cutoffs and columns "thresholds", "limits", "searchspace_id", "metric", "value", "support", "f1_max", "prec", "rec", "prec_cummax", "mode".
All three data.frames may contain additional stratification variables passed with doc_groups and label_groups. The latter two data.frames are non-empty only if optimize_cutoff == TRUE.
Examples
library(ggplot2)
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id, ~score, ~rank,
"A", "a", 0.9, 1,
"A", "d", 0.7, 2,
"A", "f", 0.3, 3,
"A", "c", 0.1, 4,
"B", "a", 0.8, 1,
"B", "e", 0.6, 2,
"B", "d", 0.1, 3,
"C", "f", 0.1, 1,
"C", "c", 0.2, 2,
"C", "e", 0.2, 2
)
pr_curve <- compute_pr_curve(
pred,
gold,
mode = "doc-avg",
optimize_cutoff = TRUE
)
auc <- compute_pr_auc_from_curve(pr_curve$plot_data)
# note that pr curves take the cummax(prec), not the precision
ggplot(pr_curve$plot_data, aes(x = rec, y = prec_cummax)) +
geom_point(
data = pr_curve$opt_cutoff,
aes(x = rec, y = prec_cummax),
color = "red",
shape = "star"
) +
geom_text(
data = pr_curve$opt_cutoff,
aes(
x = rec + 0.2, y = prec_cummax,
label = paste("f1_opt =", round(f1_max, 3))
),
color = "red"
) +
geom_path() +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
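An additional hedged sketch (not part of the original example): sweeping a rank-based cutoff as well, via limit_range, on the same data.
pr_curve_lim <- compute_pr_curve(
  pred, gold,
  mode = "doc-avg",
  limit_range = 1:4
)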
Compute inverse propensity scores
Description
Compute inverse propensity scores based on a label distribution. Propensity scores for extreme multi-label learning are proposed in Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-Aug, 935–944. doi:10.1145/2939672.2939756.
Usage
compute_propensity_scores(label_distribution, a = 0.55, b = 1.5)
Arguments
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
a: A numeric parameter for the propensity score calculation, defaults to 0.55.
b: A numeric parameter for the propensity score calculation, defaults to 1.5.
Value
A data.frame with columns "label_id", "label_weight".
Examples
library(tidyverse)
library(casimir)
label_distribution <- dnb_label_distribution
compute_propensity_scores(label_distribution)
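For orientation, the inverse-propensity formula from Jain et al. (2016) can be sketched as follows; this illustrates the cited formula and is not necessarily the exact internals of compute_propensity_scores.
inverse_propensity <- function(label_freq, n_docs, a = 0.55, b = 1.5) {
  # C = (log N - 1) * (B + 1)^A, weight = 1 + C * (n_l + B)^(-A)
  c_const <- (log(n_docs) - 1) * (b + 1)^a
  1 + c_const * (label_freq + b)^(-a)
}
inverse_propensity(label_freq = c(10000, 100, 1), n_docs = 10100)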
Compute ranked retrieval scores
Description
This function computes the ranked retrieval scores Discounted Cumulative Gain (DCG), Ideal Discounted Cumulative Gain (IDCG), Normalised Discounted Cumulative Gain (NDCG) and Label Ranking Average Precision (LRAP). Ranked retrieval, unlike set retrieval, assumes ordered predictions. Unlike set retrieval metrics, ranked retrieval metrics are logically bound to a document-wise evaluation. Thus, only the aggregation mode "doc-avg" is available for these scores.
Usage
compute_ranked_retrieval_scores(
predicted,
gold_standard,
doc_groups = NULL,
drop_empty_groups = options::opt("drop_empty_groups"),
progress = options::opt("progress")
)
Arguments
predicted: Multi-label prediction results. Expects a data.frame with columns doc_id, label_id and score.
gold_standard: Expects a data.frame with columns doc_id and label_id.
doc_groups: A two-column data.frame with a column doc_id and a second column assigning each document to a group.
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
Value
A data.frame with columns "metric", "mode", "value", "support"
and optional grouping variables supplied in doc_groups. Here,
support is defined as number of documents that contribute to the
document average in aggregation of the overall result.
Examples
# some dummy results
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"A", "d",
"A", "e",
)
pred <- tibble::tribble(
~doc_id, ~label_id, ~score,
"A", "f", 0.3277,
"A", "e", 0.32172,
"A", "b", 0.13517,
"A", "g", 0.10134,
"A", "h", 0.09152,
"A", "a", 0.07483,
"A", "i", 0.03649,
"A", "j", 0.03551,
"A", "k", 0.03397,
"A", "c", 0.03364
)
results <- compute_ranked_retrieval_scores(
pred,
gold
)
Compute multi-label metrics
Description
Compute multi-label metrics precision, recall, F1 and R-precision for subject indexing results.
Usage
compute_set_retrieval_scores(
predicted,
gold_standard,
k = NULL,
mode = "doc-avg",
compute_bootstrap_ci = FALSE,
n_bt = 10L,
doc_groups = NULL,
label_groups = NULL,
graded_relevance = FALSE,
rename_metrics = FALSE,
seed = NULL,
propensity_scored = FALSE,
label_distribution = NULL,
cost_fp_constant = NULL,
replace_zero_division_with = options::opt("replace_zero_division_with"),
drop_empty_groups = options::opt("drop_empty_groups"),
ignore_inconsistencies = options::opt("ignore_inconsistencies"),
verbose = options::opt("verbose"),
progress = options::opt("progress")
)
compute_set_retrieval_scores_dplyr(
predicted,
gold_standard,
k = NULL,
mode = "doc-avg",
compute_bootstrap_ci = FALSE,
n_bt = 10L,
doc_groups = NULL,
label_groups = NULL,
graded_relevance = FALSE,
rename_metrics = FALSE,
seed = NULL,
propensity_scored = FALSE,
label_distribution = NULL,
cost_fp_constant = NULL,
ignore_inconsistencies = FALSE,
verbose = FALSE,
progress = FALSE
)
Arguments
predicted: Multi-label prediction results. Expects a data.frame with columns doc_id and label_id; depending on the options, additional columns such as score, rank or relevance may be required.
gold_standard: Expects a data.frame with columns doc_id and label_id.
k: An integer limit on the number of predictions per document to consider. Requires a column score in the predictions.
mode: One of the following aggregation modes: "doc-avg", "subj-avg", "micro".
compute_bootstrap_ci: A logical indicator for computing bootstrap CIs.
n_bt: An integer number of resamples to be used for bootstrapping.
doc_groups: A two-column data.frame with a column doc_id and a second column assigning each document to a group.
label_groups: A two-column data.frame with a column label_id and a second column assigning each label to a group.
graded_relevance: A logical indicator for graded relevance. Defaults to FALSE; if TRUE, a column relevance is expected in the predictions (see the example below).
rename_metrics: If set to TRUE, metric names are adapted to the chosen options, e.g. prefixed with "ps-" or "g-" and suffixed with "@k" (see rename_metrics).
seed: Pass a seed to make bootstrap replication reproducible.
propensity_scored: Logical, whether to use propensity scores as weights.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
cost_fp_constant: Constant cost assigned to false positives. Defaults to NULL.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
verbose: Verbose reporting of computation steps for debugging. (Defaults to FALSE)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
Value
A data.frame with columns "metric", "mode", "value", "support" and optional grouping variables supplied in doc_groups or label_groups. Here, support is defined for each mode as:
- mode == "doc-avg": The number of tested documents.
- mode == "subj-avg": The number of labels contributing to the subj-average.
- mode == "micro": The number of doc-label pairs contributing to the denominator of the respective metric, e.g. tp + fp for precision, tp + fn for recall, tp + (fp + fn)/2 for F1 and min(tp + fp, tp + fn) for R-precision.
Functions
- compute_set_retrieval_scores_dplyr(): Variant with internal usage of dplyr rather than collapse library. Tends to be slower, but more stable.
Examples
library(tidyverse)
library(casimir)
library(furrr)
library(future)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f",
)
pred <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "d",
"A", "f",
"B", "a",
"B", "e",
"C", "f",
)
plan(sequential) # or whatever resources you have
a <- compute_set_retrieval_scores(
pred, gold,
mode = "doc-avg",
compute_bootstrap_ci = TRUE,
n_bt = 100L
)
ggplot(a, aes(x = metric, y = value)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) +
facet_wrap(vars(metric), scales = "free")
# example with graded relevance
pred_w_relevance <- tibble::tribble(
~doc_id, ~label_id, ~relevance,
"A", "a", 1.0,
"A", "d", 0.0,
"A", "f", 0.0,
"B", "a", 1.0,
"B", "e", 1 / 3,
"C", "f", 1.0,
)
b <- compute_set_retrieval_scores(
pred_w_relevance, gold,
mode = "doc-avg",
graded_relevance = TRUE
)
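A further hedged sketch (not part of the original examples): propensity-scored variants of the metrics, using a small made-up label distribution with the documented columns label_id, label_freq and n_docs.
label_distribution <- tibble::tribble(
  ~label_id, ~label_freq, ~n_docs,
  "a", 10000, 10100,
  "b", 1000, 10100,
  "c", 100, 10100,
  "d", 1, 10100,
  "e", 1, 10100,
  "f", 2, 10100
)
ps_res <- compute_set_retrieval_scores(
  pred, gold,
  mode = "doc-avg",
  propensity_scored = TRUE,
  label_distribution = label_distribution
)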
Join gold standard and predicted results
Description
Join the gold standard and the predicted results in one table based on the document id and the label id.
Usage
create_comparison(
predicted,
gold_standard,
doc_groups = NULL,
label_groups = NULL,
graded_relevance = FALSE,
propensity_scored = FALSE,
label_distribution = NULL,
ignore_inconsistencies = options::opt("ignore_inconsistencies")
)
Arguments
predicted: Multi-label prediction results. Expects a data.frame with columns doc_id and label_id; additional columns such as score, rank or relevance are used where required.
gold_standard: Expects a data.frame with columns doc_id and label_id.
doc_groups: A two-column data.frame with a column doc_id and a second column assigning each document to a group.
label_groups: A two-column data.frame with a column label_id and a second column assigning each label to a group.
graded_relevance: A logical indicator for graded relevance. Defaults to FALSE; if TRUE, a column relevance is expected in the predictions.
propensity_scored: Logical, whether to use propensity scores as weights.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
Value
A data.frame with columns "label_id", "doc_id", "suggested",
"gold".
Examples
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "d",
"A", "f",
"B", "a",
"B", "e",
"C", "f"
)
create_comparison(pred, gold)
Create a rank column
Description
Create a rank per document id based on score.
Usage
create_rank_col(df)
create_rank_col_dplyr(df)
Arguments
df: A data.frame with columns doc_id and score.
Value
The input data.frame df with an additional column
"rank".
Functions
- create_rank_col_dplyr(): Variant with internal usage of dplyr rather than collapse library.
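For orientation, the operation can be sketched in plain dplyr as follows; this is an illustration only, not the package's internal code, and tie handling may differ.
library(dplyr)
pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score,
  "A", "a", 0.9,
  "A", "b", 0.7,
  "B", "a", 0.8
)
pred %>%
  group_by(doc_id) %>%
  mutate(rank = rank(desc(score), ties.method = "first")) %>%
  ungroup()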
Helper function for document-wise computation of ranked retrieval scores
Description
Helper function for document-wise computation of ranked retrieval scores DCG, NDCG and LRAP. Implemented as in Annif https://github.com/NatLibFi/Annif/blob/master/annif/eval.py. Reference implementation of DCG to test against.
Usage
dcg_score(gold_vs_pred, limit = NULL)
Arguments
gold_vs_pred: A data.frame as generated by create_comparison().
limit: An integer cutoff value for DCG@N.
Value
The numeric value of DCG.
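For reference, the standard DCG formula can be sketched as follows; this is an illustration only, and dcg_score's exact internals may differ.
dcg <- function(relevance_in_rank_order) {
  # relevance at rank i is discounted by log2(i + 1)
  sum(relevance_in_rank_order / log2(seq_along(relevance_in_rank_order) + 1))
}
dcg(c(1, 0, 1, 0)) # gold labels found at ranks 1 and 3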
DNB gold standard data for computing evaluation metrics
Description
A subset of documents found in the catalogue of the DNB with intellectually
assigned subject labels from the GND subject vocabulary.
The document ids match those in the dnb_test_predictions dataset.
Usage
dnb_gold_standard
Format
dnb_gold_standard
A data.frame with 337 rows and 2 columns:
- doc_id: DNB identifier of a document in the catalogue.
- label_id: DNB identifier of a concept in the GND subject vocabulary.
DNB label distribution for computing propensity scored metrics
Description
A subset of labels used in the catalogue of the DNB along with their
frequencies of occurrence. The label_ids match those in the
dnb_gold_standard and dnb_test_predictions datasets.
Usage
dnb_label_distribution
Format
dnb_label_distribution
A data frame with 7,772 rows and 3 columns:
- label_id: DNB identifier of a concept in the GND subject vocabulary.
- label_freq: Number of occurrences of the specified label in the overall catalogue.
- n_docs: Overall number of documents in the ground truth dataset.
DNB test predictions for computing evaluation metrics
Description
A subset of documents found in the catalogue of the DNB with predictions
generated with some arbitrary indexing method. The document ids match those
in the dnb_gold_standard dataset.
Usage
dnb_test_predictions
Format
dnb_test_predictions
A data frame with 100,000 rows and 3 columns:
- doc_id: DNB identifier of a document in the catalogue.
- label_id: DNB identifier of a concept in the GND subject vocabulary.
- score: A confidence score in [0, 1] generated by the indexing method.
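A hedged end-to-end sketch (not part of the original manual) evaluating the bundled predictions against the bundled gold standard; it may take a moment to run on the full data.
res <- compute_set_retrieval_scores(
  dnb_test_predictions,
  dnb_gold_standard,
  mode = "doc-avg"
)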
Compute the denominator for R-precision
Description
Compute the denominator for R-precision based on propensity scored ranking of gold standard labels.
Usage
find_ps_rprec_deno(gold_vs_pred, grouping_var, cost_fp)
find_ps_rprec_deno_dplyr(gold_vs_pred, grouping_var, cost_fp)
Arguments
gold_vs_pred: A data.frame with logical columns gold and suggested, as created by create_comparison().
grouping_var: A character vector of grouping variables that must be present in gold_vs_pred.
cost_fp: A numeric value > 0, defaults to NULL.
Value
A data.frame with columns "n_gold", "n_suggested", "tp", "fp",
"fn", "delta_relevance", "rprec_deno".
Functions
- find_ps_rprec_deno_dplyr(): Variant with dplyr based internals rather than collapse internals.
Compute bootstrap replica of pr auc
Description
Helper function which performs the major bootstrap operation and wraps the
repeated application of summarise_intermediate_results and
compute_pr_auc_from_curve for each bootstrap run.
Usage
generate_pr_auc_replica(
intermed_res_all_thrsld,
seed,
n_bt,
propensity_scored,
replace_zero_division_with = options::opt("replace_zero_division_with"),
progress = options::opt("progress")
)
Arguments
intermed_res_all_thrsld: Intermediate results for all thresholds, as produced by compute_intermediate_results().
seed: Pass a seed to make bootstrap replication reproducible.
n_bt: An integer number of resamples to be used for bootstrapping.
propensity_scored: Logical, whether to use propensity scores as weights.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
Value
A data.frame with columns "boot_replicate", "pr_auc".
Compute bootstrapping results
Description
Wrapper for computing n_bt bootstrap replicates, combining the
functionality of compute_intermediate_results and
summarise_intermediate_results.
Usage
generate_replicate_results(
base_compare,
n_bt,
grouping_var,
seed = NULL,
ps_flags = list(intermed = FALSE, summarise = FALSE),
label_distribution = NULL,
cost_fp = NULL,
replace_zero_division_with = options::opt("replace_zero_division_with"),
drop_empty_groups = options::opt("drop_empty_groups"),
progress = options::opt("progress")
)
generate_replicate_results_dplyr(
base_compare,
n_bt,
grouping_var,
seed = NULL,
label_distribution = NULL,
ps_flags = list(intermed = FALSE, summarise = FALSE),
cost_fp = NULL,
progress = FALSE
)
Arguments
base_compare: A data.frame as generated by create_comparison().
n_bt: An integer number of resamples to be used for bootstrapping.
grouping_var: A character vector of variables that must be present in base_compare.
seed: A seed passed to the resampling step for reproducibility.
ps_flags: A list as returned by set_ps_flags().
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
cost_fp: A numeric value > 0, defaults to NULL.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
Value
A data.frame containing n_bt bootstrap replicates of the results
returned by compute_intermediate_results and
summarise_intermediate_results.
Functions
- generate_replicate_results_dplyr(): Variant with dplyr based internals rather than collapse internals.
Calculate bootstrapping results for one sample
Description
Internal wrapper for computing bootstrapping results on one sample, combining
the functionality of compute_intermediate_results and
summarise_intermediate_results.
Usage
helper_f(
sampled_id_list,
compare_cpy,
grouping_var,
label_distribution = NULL,
ps_flags = list(intermed = FALSE, summarise = FALSE),
cost_fp = NULL,
replace_zero_division_with = options::opt("replace_zero_division_with"),
drop_empty_groups = options::opt("drop_empty_groups")
)
Arguments
sampled_id_list: A list of all doc_ids of this bootstrap sample.
compare_cpy: As created by create_comparison().
grouping_var: A vector of variables to be used for aggregation.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
ps_flags: A list as returned by set_ps_flags().
cost_fp: A numeric value > 0, defaults to NULL.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
Value
A data.frame as returned by summarise_intermediate_results.
Calculate bootstrapping results for one sample
Description
Internal wrapper for computing bootstrapping results on one sample, combining
the functionality of compute_intermediate_results and
summarise_intermediate_results.
Usage
helper_f_dplyr(
sampled_id_list,
compare_cpy,
grouping_var,
ps_flags = list(intermed = FALSE, summarise = FALSE),
label_distribution = NULL,
cost_fp = NULL
)
Arguments
sampled_id_list: A list of all doc_ids of this bootstrap sample.
compare_cpy: As created by create_comparison().
grouping_var: A vector of variables to be used for aggregation.
ps_flags: A list with logical flags intermed and summarise, as returned by set_ps_flags().
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
cost_fp: A numeric value > 0, defaults to NULL.
Value
A data.frame as returned by
summarise_intermediate_results_dplyr.
Join propensity scores
Description
Helper function to perform a secure join of a comparison matrix with propensity scores.
Usage
join_propensity_scores(input_data, label_weights)
join_propensity_scores_dplyr(input_data, label_weights)
Arguments
input_data: A data.frame containing at least the column label_id.
label_weights: Expects a data.frame with columns label_id and label_weight.
Value
The input data.frame input_data with an additional column
"label_weight".
Functions
- join_propensity_scores_dplyr(): Variant with dplyr based internals rather than collapse internals.
Examples
library(casimir)
gold <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "b",
"A", "c",
"B", "a",
"B", "d",
"C", "a",
"C", "b",
"C", "d",
"C", "f"
)
pred <- tibble::tribble(
~doc_id, ~label_id,
"A", "a",
"A", "d",
"A", "f",
"B", "a",
"B", "e",
"C", "f"
)
label_distribution <- tibble::tribble(
~label_id, ~label_freq, ~n_docs,
"a", 10000, 10100,
"b", 1000, 10100,
"c", 100, 10100,
"d", 1, 10100,
"e", 1, 10100,
"f", 2, 10100,
"g", 0, 10100
)
comp <- create_comparison(pred, gold)
label_weights <- compute_propensity_scores(label_distribution)
comp_w_label_weights <- join_propensity_scores(
input_data = comp,
label_weights = label_weights
)
Helper function for document-wise computation of ranked retrieval scores
Description
Helper function for document-wise computation of ranked retrieval scores DCG, NDCG and LRAP. Implemented as in Annif https://github.com/NatLibFi/Annif/blob/master/annif/eval.py. Reference implementation for Label Ranking Average Precision.
Usage
lrap_score(gold_vs_pred)
Arguments
gold_vs_pred: A data.frame as generated by create_comparison().
Value
The numeric value of LRAP.
Helper function for document-wise computation of ranked retrieval scores
Description
Helper function for document-wise computation of ranked retrieval scores DCG, NDCG and LRAP. Implemented as in Annif https://github.com/NatLibFi/Annif/blob/master/annif/eval.py. Reference implementation for NDCG to test against.
Usage
ndcg_score(gold_vs_pred, limit = NULL)
Arguments
gold_vs_pred: A data.frame as generated by create_comparison().
limit: An integer cutoff value for NDCG@N.
Value
The numeric value of NDCG.
Declaration of options to be used as identical function arguments
Description
Declaration of options to be used as identical function arguments
Arguments
check_group_names: Perform replacement of dots in grouping columns. Disable for faster computation if you can make sure that all columns used for grouping ("doc_id", "label_id", "doc_groups", "label_groups") do not contain dots. (Defaults to TRUE)
ignore_inconsistencies: Warnings about data inconsistencies will be silenced. (Defaults to FALSE)
drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation? (Defaults to TRUE)
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
progress: Display progress bars for iterated computations (like bootstrap CI or pr curves). (Defaults to FALSE)
verbose: Verbose reporting of computation steps for debugging. (Defaults to FALSE)
casimir Options
Description
Internally used, package-specific options. All options will prioritize R options() values, and fall back to environment variables if undefined. If neither the option nor the environment variable is set, a default value is used.
Checking Option Values
Option values specific to casimir can be
accessed by passing the package name to env.
options::opts(env = "casimir")
options::opt(x, default, env = "casimir")
Options
- ignore_inconsistencies: Warnings about data inconsistencies will be silenced.
  default: FALSE
  option: casimir.ignore_inconsistencies
  envvar: R_CASIMIR_IGNORE_INCONSISTENCIES (evaluated if possible, raw string otherwise)
- progress: Display progress bars for iterated computations (like bootstrap CI or pr curves).
  default: FALSE
  option: casimir.progress
  envvar: R_CASIMIR_PROGRESS (evaluated if possible, raw string otherwise)
- verbose: Verbose reporting of computation steps for debugging.
  default: FALSE
  option: casimir.verbose
  envvar: R_CASIMIR_VERBOSE (evaluated if possible, raw string otherwise)
- check_group_names: Perform replacement of dots in grouping columns. Disable for faster computation if you can make sure that all columns used for grouping ("doc_id", "label_id", "doc_groups", "label_groups") do not contain dots.
  default: TRUE
  option: casimir.check_group_names
  envvar: R_CASIMIR_CHECK_GROUP_NAMES (evaluated if possible, raw string otherwise)
- drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation?
  default: TRUE
  option: casimir.drop_empty_groups
  envvar: R_CASIMIR_DROP_EMPTY_GROUPS (evaluated if possible, raw string otherwise)
- replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1.
  default: NULL
  option: casimir.replace_zero_division_with
  envvar: R_CASIMIR_REPLACE_ZERO_DIVISION_WITH (evaluated if possible, raw string otherwise)
See Also
options(), getOption(), Sys.setenv(), Sys.getenv()
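For example, the progress option documented above can be set either through options() (which takes priority) or through its environment variable:
options(casimir.progress = TRUE)
# or, equivalently, via the environment variable fallback:
Sys.setenv(R_CASIMIR_PROGRESS = "TRUE")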
Postprocessing of pr curve data
Description
Reshape pr curve data to a format that is easier for plotting.
Usage
pr_curve_post_processing(results_summary)
Arguments
results_summary: As produced by summarise_intermediate_results().
Value
A data.frame with columns "searchspace_id", "prec", "rec",
"prec_cummax" and possible additional stratification variables.
Process cost for false positives
Description
Calculate the cost for false positives depending on the chosen
cost_fp_constant.
Usage
process_cost_fp(cost_fp_constant, gold_vs_pred)
Arguments
cost_fp_constant: Constant cost assigned to false positives. Defaults to NULL.
gold_vs_pred: A data.frame with logical columns gold and suggested, as created by create_comparison().
Value
A numeric value > 0.
Rename metrics
Description
Rename metric names for generalised precision etc. The output will be renamed if:
- graded_relevance == TRUE: prefixed with "g-" to indicate that metrics are computed with graded relevance.
- propensity_scored == TRUE: prefixed with "ps-" to indicate that metrics are computed with propensity scores.
- !is.null(k): suffixed with "@k" to indicate that metrics are limited to the top k predictions.
Usage
rename_metrics(
res_df,
k = NULL,
propensity_scored = FALSE,
graded_relevance = FALSE
)
Arguments
res_df: A data.frame with a column metric.
k: An integer limit on the number of predictions per document to consider. Requires a column score in the predictions.
propensity_scored: Logical, whether to use propensity scores as weights.
graded_relevance: A logical indicator for graded relevance. Defaults to FALSE.
Value
The input data.frame res_df with renamed metrics for
generalised precision etc.
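The renaming scheme described above can be sketched as follows; this is an illustration only, not the package's internal code.
rename_sketch <- function(metric, k = NULL,
                          propensity_scored = FALSE,
                          graded_relevance = FALSE) {
  if (graded_relevance) metric <- paste0("g-", metric)
  if (propensity_scored) metric <- paste0("ps-", metric)
  if (!is.null(k)) metric <- paste0(metric, "@", k)
  metric
}
rename_sketch(c("prec", "rec", "f1"), k = 5, propensity_scored = TRUE)
# "ps-prec@5" "ps-rec@5" "ps-f1@5"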
Set grouping variables
Description
Determine the appropriate grouping variables for each aggregation mode.
Usage
set_grouping_var(mode, doc_groups, label_groups, var = NULL)
Arguments
mode: One of the following aggregation modes: "doc-avg", "subj-avg", "micro".
doc_groups: A two-column data.frame with a column doc_id and a second column assigning each document to a group.
label_groups: A two-column data.frame with a column label_id and a second column assigning each label to a group.
var: Additional variables to include.
Value
A character vector of variables determining the grouping structure.
Set flags for propensity scores
Description
Generate flags if propensity scores should be applied to intermediate results or summarised results.
Usage
set_ps_flags(mode, propensity_scored)
Arguments
mode: One of the following aggregation modes: "doc-avg", "subj-avg", "micro".
propensity_scored: Logical, whether to use propensity scores as weights.
Value
A list containing logical flags "intermed" and
"summarise".
Compute the mean of intermediate results
Description
Compute the mean of intermediate results created by
compute_intermediate_results.
Usage
summarise_intermediate_results(
intermediate_results,
propensity_scored = FALSE,
label_distribution = NULL,
set = FALSE,
replace_zero_division_with = options::opt("replace_zero_division_with")
)
Arguments
intermediate_results: As produced by compute_intermediate_results().
propensity_scored: Logical, whether to use propensity scores as weights.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
set: Logical. Allow in-place modification of intermediate_results.
replace_zero_division_with: In macro-averaged results (doc-avg, subj-avg), it may occur that some instances have no predictions or no gold standard. In these cases, calculating precision and recall may lead to division by zero. CASIMiR by default removes these missing values from macro averages, leading to a smaller support (count of instances that were averaged). Other implementations of macro-averaged precision and recall default to 0 in these cases. This option allows you to control that behaviour; set any value between 0 and 1. (Defaults to NULL)
Value
A data.frame with columns "metric", "value".
Compute the mean of intermediate results
Description
Compute the mean of intermediate results created by
compute_intermediate_results. Variant with dplyr based internals
rather than collapse internals.
Usage
summarise_intermediate_results_dplyr(
intermediate_results,
propensity_scored = FALSE,
label_distribution = NULL
)
Arguments
intermediate_results: As produced by compute_intermediate_results().
propensity_scored: Logical, whether to use propensity scores as weights.
label_distribution: Expects a data.frame with columns label_id, label_freq and n_docs.
Value
A data.frame with columns "metric", "value".