cooccure

The cooccure R package enables building co-occurrence networks from multiple data formats. It accepts six input formats and supports multiple similarity measures, scaling methods, fractional counting, group-level splitting, and flexible filtering. Results are returned as a tidy edge data frame (from, to, weight, count) convertible to igraph, tidygraph, cograph, and Nestimate objects.

The main function cooccurrence() is also available as the short alias co().

Installation

# CRAN release
install.packages("cooccure")

# Development version
remotes::install_github("mohsaqr/cooccure")

Input formats

cooccure auto-detects the input format from the arguments provided. Six formats are supported (delimited field, multi-column delimited, long/bipartite, binary matrix, wide sequence, and list of character vectors), covering the most common shapes data comes in.

1. Delimited field

A delimited field is a single column where multiple items are stored as one string, separated by a consistent character such as ;, ,, |, or a space. This is the most common format in bibliometrics and text analysis, where each row represents a document.

Use the field argument to specify the column with the relevant values, and sep to specify the delimiter.

df <- data.frame(
  id = 1:3,
  keywords = c("network; graph; matrix",
               "graph; algebra",
               "network; algebra; graph")
)
cooccurrence(df, field = "keywords", sep = ";")

Whitespace around the separator is automatically trimmed (" network " becomes "network"). Empty strings and NAs are dropped. Duplicate items within a row are de-duplicated.
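The cleaning described above can be mimicked in a few lines of base R (a sketch of the documented behavior, not the package's internal code):

```r
# Split on the delimiter, trim whitespace, drop empties, de-duplicate
kw <- " network ; graph;; network"
parts <- trimws(strsplit(kw, ";", fixed = TRUE)[[1]])
parts <- unique(parts[parts != ""])
parts
#> [1] "network" "graph"
```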

2. Multi-column delimited

A multi-column delimited format is used when items are spread across multiple columns - for example, author keywords and index keywords in a Scopus export, or authors and affiliations. Values from all specified columns are pooled per row.

Use the field argument to specify the columns with the relevant values, and sep to specify the delimiter.

df <- data.frame(
  author_kw = c("machine learning; nlp", "deep learning", "nlp"),
  index_kw  = c("classification", "image recognition", "text mining")
)
cooccurrence(df, field = c("author_kw", "index_kw"), sep = ";")

3. Long / bipartite

A long or bipartite format has one row per item-document pair, common in relational databases, survey data, and tidy data pipelines.

Use the field argument to specify the column containing the items, and the by argument to specify which column groups them into transactions.

citations <- data.frame(
  paper_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  reference = c("Smith2020", "Jones2019", "Lee2021",
                "Jones2019", "Lee2021",
                "Smith2020", "Lee2021", "Park2022")
)
cooccurrence(citations, field = "reference", by = "paper_id")

Use the weight_by argument to pass a numeric weight column for weighted long format — for example, LDA topic-document probabilities where each document contributes a topic with a given probability:

theta <- data.frame(
  doc   = c("d1","d1","d1","d2","d2","d3","d3"),
  topic = c("T1","T2","T3","T1","T3","T2","T3"),
  prob  = c(0.6, 0.3, 0.1, 0.4, 0.6, 0.5, 0.5)
)
cooccurrence(theta, field = "topic", by = "doc", weight_by = "prob")

In weighted long format, the co-occurrence between items i and j is computed as sum_d w_id * w_jd (the sum of the products of their weights across all shared transactions) rather than a simple binary count. The count column still reports the number of transactions where both items appear together.
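As a sanity check, the weighted formula can be reproduced in base R by pivoting to a document-by-topic weight matrix and calling crossprod() (a standalone sketch, not the package's internals):

```r
theta <- data.frame(
  doc   = c("d1","d1","d1","d2","d2","d3","d3"),
  topic = c("T1","T2","T3","T1","T3","T2","T3"),
  prob  = c(0.6, 0.3, 0.1, 0.4, 0.6, 0.5, 0.5)
)
W <- xtabs(prob ~ doc + topic, data = theta)  # documents x topics weight matrix
C <- crossprod(W)                             # C[i, j] = sum_d w_id * w_jd
diag(C) <- 0
C["T1", "T3"]                                 # 0.6*0.1 + 0.4*0.6 = 0.30
```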

4. Binary matrix

A binary matrix is a document-term matrix where columns are items and values are 0/1 (absence or presence). This format is auto-detected when all values are 0 or 1 and no field, by, or sep arguments are provided.

dtm <- matrix(c(1,1,0,1,
                0,1,1,0,
                1,0,1,1), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("network", "graph", "algebra", "matrix")))
cooccurrence(dtm)

Works with both matrix and data.frame inputs. Columns without names are auto-named V1, V2, etc.

5. Wide sequence

A wide sequence format is used for non-binary data frames or matrices where each row is a sequence or record, and the unique values in each row form a transaction. This is the native format for sequence analysis tools like TraMineR and tna. Pass field = "all" to treat every column as a time point.

sequences <- data.frame(
  t1 = c("A", "B", "A"),
  t2 = c("B", "C", "C"),
  t3 = c("C", NA,  NA)
)
cooccurrence(sequences, field = "all")

NAs, empty strings, and TraMineR void markers (%, *) are automatically removed.

6. List of character vectors

A list of character vectors is the most direct format, where each list element is a transaction containing a set of categorical items.

baskets <- list(
  c("bread", "milk", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs", "butter"),
  c("bread", "milk", "eggs", "butter")
)
cooccurrence(baskets)

Similarity measures

The similarity argument controls how raw co-occurrence counts are normalized into a similarity or association measure. All similarity measures are based on two inputs: the co-occurrence count between two items (\(C_{ij}\)), and how frequently each item appears individually across all transactions (\(f_i\), \(f_j\)).

# Jaccard similarity
cooccurrence(baskets, similarity = "jaccard")

# Association strength
cooccurrence(papers, field = "keywords", sep = ";", similarity = "association")

Which similarity to use?

Similarity method overview

Method Formula Description Best for
"none" \(C_{ij}\) Raw co-occurrence count Exploratory analysis
"jaccard" \(\frac{C_{ij}}{f_i + f_j - C_{ij}}\) Divides co-occurrences by the total number of transactions containing either item General purpose
"cosine" \(\frac{C_{ij}}{\sqrt{f_i \cdot f_j}}\) Divides co-occurrences by the geometric mean of item frequencies (Salton’s cosine) Scale-invariant comparison
"inclusion" \(\frac{C_{ij}}{\min(f_i, f_j)}\) Divides co-occurrences by the frequency of the rarer item (Simpson coefficient) Subset and hierarchical relationships
"association" \(\frac{C_{ij}}{f_i \cdot f_j}\) Divides co-occurrences by the product of item frequencies, discounting chance co-occurrences (van Eck & Waltman, 2009) Bibliometric networks
"dice" \(\frac{2 C_{ij}}{f_i + f_j}\) Divides co-occurrences by the arithmetic mean of item frequencies Binary presence/absence networks
"equivalence" \(\frac{C_{ij}^2}{f_i \cdot f_j}\) Squares the co-occurrence count before dividing by the product of frequencies (cosine squared) Strict filtering
"relative" Row-normalized (each row sums to 1) Normalizes each row so that all edge weights from an item sum to 1 Asymmetric tendencies

Counting

The counting argument controls how much each transaction contributes to the co-occurrence count. Under full counting (default), each co-occurring pair adds 1 regardless of how many items are in the transaction. Under fractional counting, each pair adds \(1/(n-1)\) where \(n\) is the number of items in the transaction, preventing large transactions from dominating the network. For example, a document with 10 keywords creates 45 pairs under full counting but contributes only 1/9 per pair under fractional counting.

# Full counting (default): each co-occurring pair adds 1
co(data, field = "keywords", sep = ";")

# Fractional: each pair adds 1/(n-1) where n = items in the transaction
co(data, field = "keywords", sep = ";", counting = "fractional")
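The arithmetic behind the 10-keyword example works out as follows (a quick base-R check of the counting rules, not package code):

```r
n       <- 10
n_pairs <- choose(n, 2)   # 45 pairs from one document under full counting
frac    <- 1 / (n - 1)    # each pair adds 1/9 under fractional counting
n_pairs * frac            # total contribution: 45/9 = 5, i.e. n/2
```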

Scaling

The scale argument applies a transformation to the weights after similarity normalization, useful for visualization, thresholding, or feeding into downstream models.

# Log-scaled Jaccard similarity
cooccurrence(baskets, similarity = "jaccard", scale = "log")

# Min-max scaled for visualization
cooccurrence(baskets, similarity = "cosine", scale = "minmax")

Scaling method overview

Method Transformation Description Use case
"minmax" Scale to \([0, 1]\) Rescales all weights to the range \([0, 1]\) Visualization and cross-network comparison
"log" \(\log(1 + w)\) Applies a natural log transformation Compressing heavy-tailed distributions
"log10" \(\log_{10}(1 + w)\) Same as log but base 10 When base 10 interpretation is preferred
"sqrt" \(\sqrt{w}\) Square root transformation Mild compression of skewed weights
"binary" 1 if \(w > 0\), else 0 Converts all positive weights to 1 Presence/absence networks
"zscore" \((w - \mu) / \sigma\) Standardizes weights to mean 0 and standard deviation 1 Statistical comparison across networks
"proportion" \(w / \sum w\) Divides each weight by the total sum of weights Expressing edges as relative importance

Filtering

Three filtering arguments control which entities and edges appear in the result:

# Drop entities appearing in fewer than 3 transactions
cooccurrence(baskets, min_occur = 3)

# Keep only edges with weight >= 0.5 (applied after similarity + scaling)
cooccurrence(baskets, similarity = "jaccard", threshold = 0.5)

# Keep only the 10 strongest edges
cooccurrence(baskets, top_n = 10)

All three can be combined for fine-grained control over the network size and density:

cooccurrence(papers, field = "keywords", sep = ";",
             similarity = "association", min_occur = 2,
             threshold = 0.01, top_n = 50)

Splitting by groups

The split_by argument computes a separate co-occurrence network for each level of a grouping variable and returns them in a single data frame with a group column. This is useful for comparing co-occurrence patterns across time periods, disciplines, journals, or any categorical variable. Each group gets its own similarity computation, meaning item frequencies are group-specific. All other parameters (similarity, scale, threshold, min_occur, top_n) apply per group.

papers <- data.frame(
  year = c(2020, 2020, 2020, 2021, 2021, 2021),
  keywords = c("network; graph; matrix", "graph; algebra",
               "network; algebra; graph",
               "deep learning; nlp", "nlp; transformers",
               "deep learning; transformers; nlp")
)

co(papers, field = "keywords", sep = ";", split_by = "year",
   similarity = "jaccard")
#> # cooccurrence: 7 nodes, 8 edges | split_by: year (2 groups) | similarity: jaccard
#>           from           to    weight count group
#>        algebra        graph 0.6666667     2  2020
#>          graph      network 0.6666667     2  2020
#>  deep learning          nlp 0.6666667     2  2021
#>            nlp transformers 0.6666667     2  2021
#>            ...

Output

cooccurrence() returns a tidy data frame of class cooccurrence that can be piped, filtered, and joined like any standard data frame. The raw co-occurrence count is always preserved in the count column regardless of similarity or scaling, so the original counts can always be recovered.

The full matrix, item frequencies, and all parameters are stored as attributes on the returned data frame, making it easy to access the underlying data for further analysis or inspection.

attr(result, "matrix")          # Normalized weight matrix
attr(result, "raw_matrix")      # Raw count matrix (diagonal zeroed)
attr(result, "items")           # Character vector of all items
attr(result, "frequencies")     # Named vector of item frequencies
attr(result, "similarity")      # Similarity measure used
attr(result, "scale")           # Scaling method used
attr(result, "n_transactions")  # Number of transactions
attr(result, "n_items")         # Number of unique items

The cooccurrence object can also be printed, summarized, and plotted directly as a co-occurrence network (Saqr et al., 2023):

# Summary statistics
summary(result)

# Heatmap
plot(result)

# Network plot (requires igraph)
plot(result, type = "network")

Output formats

The output argument controls the format returned directly:

# Default: tidy data frame with from, to, weight, count
co(data, field = "keywords", sep = ";")

# Gephi-ready: Source, Target, Weight, Type, Count columns
co(data, field = "keywords", sep = ";", output = "gephi")
#>   Source  Target Weight       Type Count
#>    graph network      3 Undirected     3

# igraph object
g <- co(data, field = "keywords", sep = ";", output = "igraph")

# cograph object
net <- co(data, field = "keywords", sep = ";", output = "cograph")

# Square matrix
mat <- co(data, field = "keywords", sep = ";", output = "matrix")

The Gephi output can be written directly to CSV for import:

write.csv(co(data, field = "keywords", sep = ";", output = "gephi"),
          "network.csv", row.names = FALSE)

Converters

A cooccurrence result can be converted to other network formats using the built-in converter functions. All converter packages are optional — install only what you need.

Matrix

# Normalized similarity matrix
as_matrix(result)

# Raw co-occurrence count matrix
as_matrix(result, type = "raw")

igraph

# install.packages("igraph")
g <- as_igraph(result)
plot(g, edge.width = igraph::E(g)$weight * 3)
igraph::degree(g)
igraph::betweenness(g)

tidygraph

# install.packages("tidygraph")
tg <- as_tidygraph(result)
# Use with ggraph

cograph

# remotes::install_github("mohsaqr/cograph")
net <- as_cograph(result)
cograph::splot(net)
cograph::communities(net)

Nestimate

# remotes::install_github("mohsaqr/Nestimate")
net <- as_netobject(result)
Nestimate::centrality(net)
Nestimate::bootstrap_network(net)

How it works

Regardless of input format, the internal pipeline is:

  1. Parse input into a list of character vectors (transactions)
  2. Filter entities below min_occur frequency
  3. Build a binary transaction matrix \(B\) (rows = transactions, columns = items)
  4. Compute raw co-occurrence: \(C = B^\top B\) via crossprod()
  5. Normalize using the chosen similarity measure
  6. Scale weights if scale is specified
  7. Filter edges below threshold and keep top_n
  8. Return upper triangle as a tidy sorted edge data frame

The computation is vectorized throughout - no loops in the hot path. crossprod() delegates to optimized BLAS routines for the matrix multiplication.
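Steps 3-8 of the pipeline can be sketched in a few lines of base R (an illustration of the idea, not the package's actual implementation):

```r
# Build binary transaction matrix B, compute C = t(B) %*% B, normalize
# with jaccard, and return the upper triangle as a sorted edge list
transactions <- list(c("network", "graph", "matrix"),
                     c("graph", "algebra"),
                     c("network", "algebra", "graph"))
items <- sort(unique(unlist(transactions)))
B <- t(vapply(transactions, function(tr) as.numeric(items %in% tr),
              numeric(length(items))))
colnames(B) <- items
C <- crossprod(B)                  # raw co-occurrence counts
f <- colSums(B)                    # item frequencies
J <- C / (outer(f, f, "+") - C)    # jaccard: C_ij / (f_i + f_j - C_ij)
diag(J) <- 0
idx <- which(upper.tri(J) & J > 0, arr.ind = TRUE)
edges <- data.frame(from   = items[idx[, 1]],
                    to     = items[idx[, 2]],
                    weight = J[cbind(idx[, 1], idx[, 2])])
edges[order(-edges$weight), ]
```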

Full parameter reference

Argument Type Description Default
data various Input data (data.frame, matrix, or list) required
field character Column(s) containing entities (nodes) NULL
by character Column grouping entities into transactions NULL
weight_by character Column with numeric weights for long format (e.g. LDA topic probabilities) NULL
sep character Delimiter for splitting delimited fields NULL
split_by character Column to split data by (separate network per group) NULL
similarity character Normalization measure "none"
counting character "full" or "fractional" "full"
scale character Weight scaling method NULL
threshold numeric Minimum edge weight (after normalization + scaling) 0
min_occur integer Minimum entity frequency (transactions) 1
top_n integer Keep only the top N edges by weight (per group if split) NULL
output character Output format: "default", "gephi", "igraph", "cograph", "matrix" "default"

References

Saqr, M., López-Pernas, S., Conde, M. Á., & Hernández-García, Á. (2023). Social Network Analysis: A Primer, a Guide and a Tutorial in R. In Saqr, M. & López-Pernas, S. (Eds.), Learning Analytics Methods and Tutorials: A Practical Guide Using R. Springer. https://lamethods.org/book1/chapters/ch15-sna/ch15-sna.html

van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651.

Authors

Mohammed Saqr — University of Eastern Finland · saqr.me

Sonsoles López-Pernas — University of Eastern Finland · sonsoles.me

Kamila Misiejuk — FernUniversität in Hagen · kamilamisiejuk.com

License

MIT