
The cooccure R package builds co-occurrence networks from multiple data formats. It accepts six input formats and supports multiple similarity measures, scaling methods, fractional counting, group-level splitting, and flexible filtering. Results are returned as a tidy edge data frame (`from`, `to`, `weight`, `count`) convertible to `igraph`, `tidygraph`, `cograph`, and `Nestimate` objects. The main function `cooccurrence()` is also available under the short alias `co()`.
```r
# CRAN release
install.packages("cooccure")

# Development version
remotes::install_github("mohsaqr/cooccure")
```

cooccure auto-detects the input format from the arguments provided. Six formats are supported (delimited field, multi-column delimited, long/bipartite, binary matrix, wide sequence, and list of character vectors), covering the most common shapes data comes in.
A delimited field is a single column where multiple items are stored as one string, separated by a consistent character such as `;`, `,`, `|`, or a space. This is the most common format in bibliometrics and text analysis, where each row represents a document. Use the `field` argument to specify the column with the relevant values, and `sep` to specify the delimiter.
```r
df <- data.frame(
  id = 1:3,
  keywords = c("network; graph; matrix",
               "graph; algebra",
               "network; algebra; graph")
)
cooccurrence(df, field = "keywords", sep = ";")
```

Whitespace around the separator is automatically trimmed (`" network "` becomes `"network"`). Empty strings and `NA`s are dropped. Duplicate items within a row are de-duplicated.
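This cleaning behaviour can be approximated in base R. The sketch below is illustrative only, not the package's actual internals:

```r
# Split a delimited field, trim whitespace, drop empties, de-duplicate
# (a base-R approximation of the cleaning cooccure performs per row)
raw <- "network; graph ; ; network"
items <- strsplit(raw, ";", fixed = TRUE)[[1]]
items <- trimws(items)                       # " graph " -> "graph"
items <- items[!is.na(items) & items != ""]  # drop NAs and empty strings
items <- unique(items)                       # de-duplicate within the row
items
#> [1] "network" "graph"
```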
A multi-column delimited format is used when items are spread across multiple columns, for example author keywords and index keywords in a Scopus export, or authors and affiliations. Values from all specified columns are pooled per row. Use the `field` argument to specify the columns with the relevant values, and `sep` to specify the delimiter.
```r
df <- data.frame(
  author_kw = c("machine learning; nlp", "deep learning", "nlp"),
  index_kw = c("classification", "image recognition", "text mining")
)
cooccurrence(df, field = c("author_kw", "index_kw"), sep = ";")
```

A long or bipartite format has one row per item-document pair, common in relational databases, survey data, and tidy data pipelines. Use the `field` argument to specify the column containing the items, and the `by` argument to specify which column groups them into transactions.
```r
citations <- data.frame(
  paper_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  reference = c("Smith2020", "Jones2019", "Lee2021",
                "Jones2019", "Lee2021",
                "Smith2020", "Lee2021", "Park2022")
)
cooccurrence(citations, field = "reference", by = "paper_id")
```

Use the `weight_by` argument to pass a numeric weight column for weighted long format, for example LDA topic-document probabilities where each document contributes a topic with a given probability:
```r
theta <- data.frame(
  doc   = c("d1","d1","d1","d2","d2","d3","d3"),
  topic = c("T1","T2","T3","T1","T3","T2","T3"),
  prob  = c(0.6, 0.3, 0.1, 0.4, 0.6, 0.5, 0.5)
)
cooccurrence(theta, field = "topic", by = "doc", weight_by = "prob")
```

In weighted long format, the co-occurrence between items \(i\) and \(j\) is computed as \(\sum_d w_{id} \, w_{jd}\) (the sum of the products of their weights across all shared transactions) rather than a simple binary count. The `count` column still reports the number of transactions where both items appear together.
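As a hand check of the weighted formula, the T1–T3 weight in the `theta` example can be reproduced in base R (the helper `w()` is just for illustration, not part of the package API):

```r
theta <- data.frame(
  doc   = c("d1","d1","d1","d2","d2","d3","d3"),
  topic = c("T1","T2","T3","T1","T3","T2","T3"),
  prob  = c(0.6, 0.3, 0.1, 0.4, 0.6, 0.5, 0.5)
)

# Weight of one doc-topic cell (illustrative helper)
w <- function(d, t) theta$prob[theta$doc == d & theta$topic == t]

# Transactions (docs) where both T1 and T3 appear
shared <- intersect(theta$doc[theta$topic == "T1"],
                    theta$doc[theta$topic == "T3"])

# sum_d w_id * w_jd = 0.6 * 0.1 (d1) + 0.4 * 0.6 (d2)
weight_T1_T3 <- sum(sapply(shared, function(d) w(d, "T1") * w(d, "T3")))
weight_T1_T3
#> [1] 0.3
```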
A binary matrix is a document-term matrix where columns are items and values are 0/1 (absence or presence). This format is auto-detected when all values are 0 or 1 and no `field`, `by`, or `sep` arguments are provided.
```r
dtm <- matrix(c(1, 1, 0, 1,
                0, 1, 1, 0,
                1, 0, 1, 1), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("network", "graph", "algebra", "matrix")))
cooccurrence(dtm)
```

Works with both `matrix` and `data.frame` inputs. Columns without names are auto-named `V1`, `V2`, etc.
A wide sequence format is used for non-binary data frames or matrices where each row is a sequence or record, and the unique values in each row form a transaction. This is the native format for sequence analysis tools like TraMineR and tna. Pass `field = "all"` to treat every column as a time point.
```r
sequences <- data.frame(
  t1 = c("A", "B", "A"),
  t2 = c("B", "C", "C"),
  t3 = c("C", NA, NA)
)
cooccurrence(sequences, field = "all")
```

`NA`s, empty strings, and TraMineR void markers (`%`, `*`) are automatically removed.

A list of character vectors is the most direct format, where each list element is a transaction containing a set of categorical items.
```r
baskets <- list(
  c("bread", "milk", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs", "butter"),
  c("bread", "milk", "eggs", "butter")
)
cooccurrence(baskets)
```

The `similarity` argument controls how raw co-occurrence counts are normalized into a similarity or association measure. All similarity measures are based on two inputs: the co-occurrence count between two items (\(C_{ij}\)), and how frequently each item appears individually across all transactions (\(f_i\), \(f_j\)).
```r
# Jaccard similarity
cooccurrence(baskets, similarity = "jaccard")

# Association strength
cooccurrence(papers, field = "keywords", sep = ";", similarity = "association")
```

**Exploratory work:** Start with `"none"` to see raw counts and understand the data, then try `"jaccard"` or `"cosine"` for a balanced view.
**General purpose:** `"jaccard"` is a good default choice. It normalizes co-occurrences by the union of transactions containing either item, applying a balanced penalty for non-overlap.

**Bibliometric and scientometric networks:** `"association"` is recommended by van Eck & Waltman (2009) because it correctly accounts for the expected number of co-occurrences under independence. Two items that are both very frequent will naturally co-occur often; association strength discounts this, revealing which pairs co-occur more than chance alone would predict.

**Detecting hierarchical/subset structure:** `"inclusion"` (Simpson coefficient) reveals when one item almost always appears with another, useful for finding items that are subsets of broader categories or dependency relationships.

**Binary presence/absence networks:** `"dice"` when you only care whether items co-occur, not how often. It applies a less severe penalty for partial overlap than `"jaccard"`.

**Scale-invariant comparison:** `"cosine"` is invariant to absolute frequency, useful when comparing co-occurrence patterns across datasets of different sizes or when frequent items should not dominate the network.

**Strict filtering:** `"equivalence"` (cosine squared) amplifies differences, pushing pairs with weak overlap closer to zero and retaining only the strongest associations.

**Asymmetric tendencies:** `"relative"` normalizes each row so that edge weights sum to 1, capturing the relative tendency of one item to appear with another rather than absolute co-occurrence counts.
| Method | Formula | Description | Best for |
|---|---|---|---|
| `"none"` | \(C_{ij}\) | Raw co-occurrence count | Exploratory analysis |
| `"jaccard"` | \(\frac{C_{ij}}{f_i + f_j - C_{ij}}\) | Divides co-occurrences by the total number of transactions containing either item | General purpose |
| `"cosine"` | \(\frac{C_{ij}}{\sqrt{f_i \cdot f_j}}\) | Divides co-occurrences by the geometric mean of item frequencies (Salton's cosine) | Scale-invariant comparison |
| `"inclusion"` | \(\frac{C_{ij}}{\min(f_i, f_j)}\) | Divides co-occurrences by the frequency of the rarer item (Simpson coefficient) | Subset and hierarchical relationships |
| `"association"` | \(\frac{C_{ij}}{f_i \cdot f_j}\) | Divides co-occurrences by the product of item frequencies, discounting chance co-occurrences (van Eck & Waltman, 2009) | Bibliometric networks |
| `"dice"` | \(\frac{2 C_{ij}}{f_i + f_j}\) | Divides co-occurrences by the arithmetic mean of item frequencies | Binary presence/absence networks |
| `"equivalence"` | \(\frac{C_{ij}^2}{f_i \cdot f_j}\) | Squares the co-occurrence count before dividing by the product of frequencies (cosine squared) | Strict filtering |
| `"relative"` | Row-normalized (each row sums to 1) | Normalizes each row so that all edge weights from an item sum to 1 | Asymmetric tendencies |
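As a sanity check of the Jaccard formula, the bread–milk weight for the `baskets` example can be reproduced by hand in base R (an illustrative computation, not package code):

```r
baskets <- list(
  c("bread", "milk", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs", "butter"),
  c("bread", "milk", "eggs", "butter")
)

# f_i: transactions containing each item; C_ij: transactions containing both
f_bread <- sum(sapply(baskets, function(b) "bread" %in% b))                 # 3
f_milk  <- sum(sapply(baskets, function(b) "milk" %in% b))                  # 3
C_bm    <- sum(sapply(baskets, function(b) all(c("bread", "milk") %in% b))) # 2

# Jaccard: C_ij / (f_i + f_j - C_ij)
jaccard <- C_bm / (f_bread + f_milk - C_bm)
jaccard
#> [1] 0.5
```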
The `counting` argument controls how much each transaction contributes to the co-occurrence count. Under full counting (the default), each co-occurring pair adds 1 regardless of how many items are in the transaction. Under fractional counting, each pair adds \(1/(n-1)\), where \(n\) is the number of items in the transaction, preventing large transactions from dominating the network. For example, a document with 10 keywords creates 45 pairs under full counting but contributes only 1/9 per pair under fractional counting.
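The 10-keyword arithmetic checks out directly in R:

```r
n <- 10        # keywords in one document
choose(n, 2)   # pairs created under full counting
#> [1] 45

1 / (n - 1)    # per-pair contribution under fractional counting (1/9)
#> [1] 0.1111111
```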
```r
# Full counting (default): each co-occurring pair adds 1
co(data, field = "keywords", sep = ";")

# Fractional: each pair adds 1/(n-1) where n = items in the transaction
co(data, field = "keywords", sep = ";", counting = "fractional")
```

The `scale` argument applies a transformation to the weights after similarity normalization, useful for visualization, thresholding, or feeding into downstream models.
```r
# Log-scaled Jaccard similarity
cooccurrence(baskets, similarity = "jaccard", scale = "log")

# Min-max scaled for visualization
cooccurrence(baskets, similarity = "cosine", scale = "minmax")
```

| Method | Transformation | Description | Use case |
|---|---|---|---|
| `"minmax"` | Scale to \([0, 1]\) | Rescales all weights to the range \([0, 1]\) | Visualization and cross-network comparison |
| `"log"` | \(\log(1 + w)\) | Applies a natural log transformation | Compressing heavy-tailed distributions |
| `"log10"` | \(\log_{10}(1 + w)\) | Same as log but base 10 | When base 10 interpretation is preferred |
| `"sqrt"` | \(\sqrt{w}\) | Square root transformation | Mild compression of skewed weights |
| `"binary"` | 1 if \(w > 0\), else 0 | Converts all positive weights to 1 | Presence/absence networks |
| `"zscore"` | \((w - \mu) / \sigma\) | Standardizes weights to mean 0 and standard deviation 1 | Statistical comparison across networks |
| `"proportion"` | \(w / \sum w\) | Divides each weight by the total sum of weights | Expressing edges as relative importance |
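For intuition, the minmax and log transformations amount to the following base-R sketch on a toy weight vector (illustrative only):

```r
w <- c(1, 3, 5, 9)

# minmax: rescale all weights to [0, 1]
mm <- (w - min(w)) / (max(w) - min(w))
mm
#> [1] 0.00 0.25 0.50 1.00

# log: compress a heavy-tailed weight distribution
log(1 + w)
```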
Three filtering arguments control which edges appear in the result:

- `min_occur`: drops any entity appearing in fewer than a specified number of transactions, removing rare items before co-occurrences are computed.
- `threshold`: keeps only edges with a weight at or above a specified value, applied after similarity normalization and scaling.
- `top_n`: keeps only the \(n\) strongest edges by weight.

```r
# Drop entities appearing in fewer than 3 transactions
cooccurrence(baskets, min_occur = 3)

# Keep only edges with weight >= 0.5 (applied after similarity + scaling)
cooccurrence(baskets, similarity = "jaccard", threshold = 0.5)

# Keep only the 10 strongest edges
cooccurrence(baskets, top_n = 10)
```

All three can be combined for fine-grained control over the network size and density:
```r
cooccurrence(papers, field = "keywords", sep = ";",
             similarity = "association", min_occur = 2,
             threshold = 0.01, top_n = 50)
```

The `split_by` argument computes a separate co-occurrence network for each level of a grouping variable and returns them in a single data frame with a `group` column. This is useful for comparing co-occurrence patterns across time periods, disciplines, journals, or any categorical variable. Each group gets its own similarity computation, meaning item frequencies are group-specific. All other parameters (`similarity`, `scale`, `threshold`, `min_occur`, `top_n`) apply per group.
```r
papers <- data.frame(
  year = c(2020, 2020, 2020, 2021, 2021, 2021),
  keywords = c("network; graph; matrix", "graph; algebra",
               "network; algebra; graph",
               "deep learning; nlp", "nlp; transformers",
               "deep learning; transformers; nlp")
)
co(papers, field = "keywords", sep = ";", split_by = "year",
   similarity = "jaccard")
#> # cooccurrence: 7 nodes, 8 edges | split_by: year (2 groups) | similarity: jaccard
#>          from           to    weight count group
#>       algebra        graph 0.6666667     2  2020
#>         graph      network 0.6666667     2  2020
#> deep learning          nlp 0.6666667     2  2021
#>           nlp transformers 0.6666667     2  2021
#> ...
```

`cooccurrence()` returns a tidy data frame of class `cooccurrence` that can be piped, filtered, and joined like any standard data frame. The raw co-occurrence count is always preserved in the `count` column regardless of similarity or scaling, meaning the original counts can always be traced back.
The full matrix, item frequencies, and all parameters are stored as attributes on the returned data frame, making it easy to access the underlying data for further analysis or inspection.
```r
attr(result, "matrix")         # Normalized weight matrix
attr(result, "raw_matrix")     # Raw count matrix (diagonal zeroed)
attr(result, "items")          # Character vector of all items
attr(result, "frequencies")    # Named vector of item frequencies
attr(result, "similarity")     # Similarity measure used
attr(result, "scale")          # Scaling method used
attr(result, "n_transactions") # Number of transactions
attr(result, "n_items")        # Number of unique items
```

The `cooccurrence` object can also be printed, summarized, and plotted directly as a co-occurrence network (Saqr et al., 2023):
```r
# Summary statistics
summary(result)

# Heatmap
plot(result)

# Network plot (requires igraph)
plot(result, type = "network")
```

The `output` argument controls the format returned directly:

- `"default"`: tidy data frame with `from`, `to`, `weight`, `count` columns
- `"gephi"`: Gephi-ready format with `Source`, `Target`, `Weight`, `Type`, `Count` columns, writable directly to CSV
- `"igraph"`: igraph object for network analysis and visualization
- `"cograph"`: cograph object for use with the cograph package
- `"matrix"`: square co-occurrence matrix

```r
# Default: tidy data frame with from, to, weight, count
co(data, field = "keywords", sep = ";")

# Gephi-ready: Source, Target, Weight, Type, Count columns
co(data, field = "keywords", sep = ";", output = "gephi")
#> Source  Target Weight       Type Count
#>  graph network      3 Undirected     3

# igraph object
g <- co(data, field = "keywords", sep = ";", output = "igraph")

# cograph object
net <- co(data, field = "keywords", sep = ";", output = "cograph")

# Square matrix
mat <- co(data, field = "keywords", sep = ";", output = "matrix")
```

The Gephi output can be written directly to CSV for import:
```r
write.csv(co(data, field = "keywords", sep = ";", output = "gephi"),
          "network.csv", row.names = FALSE)
```

A `cooccurrence` result can be converted to other network formats using the built-in converter functions. All converter packages are optional: install only what you need.

```r
# Normalized similarity matrix
as_matrix(result)

# Raw co-occurrence count matrix
as_matrix(result, type = "raw")
```

```r
# install.packages("igraph")
g <- as_igraph(result)
plot(g, edge.width = igraph::E(g)$weight * 3)
igraph::degree(g)
igraph::betweenness(g)
```

```r
# install.packages("tidygraph")
tg <- as_tidygraph(result)
# Use with ggraph
```

```r
# remotes::install_github("mohsaqr/cograph")
net <- as_cograph(result)
cograph::splot(net)
cograph::communities(net)
```

```r
# remotes::install_github("mohsaqr/Nestimate")
net <- as_netobject(result)
Nestimate::centrality(net)
Nestimate::bootstrap_network(net)
```

Regardless of input format, the internal pipeline is:
1. Parse the input into transactions.
2. Drop entities appearing in fewer than `min_occur` transactions (frequency filter).
3. Compute the co-occurrence count matrix via `crossprod()`.
4. Normalize counts with the chosen `similarity` measure.
5. Transform the weights if `scale` is specified.
6. Filter edges by `threshold` and keep the strongest `top_n`.

The computation is vectorized throughout: no loops in the hot path. `crossprod()` delegates to optimized BLAS routines for the matrix multiplication.
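The `crossprod()` step is easy to see on the binary matrix from earlier: the diagonal of `crossprod(dtm)` holds the item frequencies \(f_i\), and the off-diagonal entries hold the raw co-occurrence counts \(C_{ij}\):

```r
dtm <- matrix(c(1, 1, 0, 1,
                0, 1, 1, 0,
                1, 0, 1, 1), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("network", "graph", "algebra", "matrix")))

cp <- crossprod(dtm)     # t(dtm) %*% dtm in one optimized BLAS call
cp["network", "matrix"]  # raw co-occurrence count for this pair
#> [1] 2

diag(cp)                 # item frequencies f_i
```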
| Argument | Type | Description | Default |
|---|---|---|---|
| `data` | various | Input data (data.frame, matrix, or list) | required |
| `field` | character | Column(s) containing entities (nodes) | `NULL` |
| `by` | character | Column grouping entities into transactions | `NULL` |
| `weight_by` | character | Column with numeric weights for long format (e.g. LDA topic probabilities) | `NULL` |
| `sep` | character | Delimiter for splitting delimited fields | `NULL` |
| `split_by` | character | Column to split data by (separate network per group) | `NULL` |
| `similarity` | character | Normalization measure | `"none"` |
| `counting` | character | `"full"` or `"fractional"` | `"full"` |
| `scale` | character | Weight scaling method | `NULL` |
| `threshold` | numeric | Minimum edge weight (after normalization + scaling) | `0` |
| `min_occur` | integer | Minimum entity frequency (transactions) | `1` |
| `top_n` | integer | Keep only the top N edges by weight (per group if split) | `NULL` |
| `output` | character | Output format: `"default"`, `"gephi"`, `"igraph"`, `"cograph"`, `"matrix"` | `"default"` |
Saqr, M., López-Pernas, S., Conde, M. Á., & Hernández-García, Á. (2023). Social Network Analysis: A Primer, a Guide and a Tutorial in R. In Saqr, M. & López-Pernas, S. (Eds.), Learning Analytics Methods and Tutorials: A Practical Guide Using R. Springer. https://lamethods.org/book1/chapters/ch15-sna/ch15-sna.html
van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651.
Mohammed Saqr — University of Eastern Finland · saqr.me
Sonsoles López-Pernas — University of Eastern Finland · sonsoles.me
Kamila Misiejuk — FernUniversität in Hagen · kamilamisiejuk.com
MIT