cooccure

The cooccure R package enables building co-occurrence networks from multiple data formats. It accepts six input formats and supports multiple similarity measures, scaling methods, fractional counting, group-level splitting, and flexible filtering. Results are returned as a tidy edge data frame (from, to, weight, count) convertible to igraph, tidygraph, cograph, and Nestimate objects.

The main function cooccurrence() is also available as the short alias co().

Installation

# CRAN release
install.packages("cooccure")

# Development version
remotes::install_github("mohsaqr/cooccure")

Input formats

cooccure auto-detects the input format from the arguments provided. Six formats are supported (delimited field, multi-column delimited, long/bipartite, binary matrix, wide sequence, and list of character vectors), covering the most common shapes data comes in.

1. Delimited field

A delimited field is a single column where multiple items are stored as one string, separated by a consistent character such as ;, ,, |, or a space. This is the most common format in bibliometrics and text analysis, where each row represents a document.

Use the field argument to specify the column with the relevant values, and sep to specify the delimiter.

df <- data.frame(
  id = 1:3,
  keywords = c("network; graph; matrix",
               "graph; algebra",
               "network; algebra; graph")
)
cooccurrence(df, field = "keywords", sep = ";")

Whitespace around the separator is automatically trimmed (" network " becomes "network"). Empty strings and NAs are dropped. Duplicate items within a row are de-duplicated.
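The cleaning described above can be mimicked in a few lines of base R (a sketch of the documented behavior, not the package's internal code):

```r
# Split on the delimiter, trim whitespace, drop empties, de-duplicate
kw <- " network ; graph;; network"
parts <- trimws(strsplit(kw, ";", fixed = TRUE)[[1]])
parts <- unique(parts[parts != ""])
parts
#> [1] "network" "graph"
```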

2. Multi-column delimited

A multi-column delimited format is used when items are spread across multiple columns - for example, author keywords and index keywords in a Scopus export, or authors and affiliations. Values from all specified columns are pooled per row.

Use the field argument to specify the columns with the relevant values, and sep to specify the delimiter.

df <- data.frame(
  author_kw = c("machine learning; nlp", "deep learning", "nlp"),
  index_kw  = c("classification", "image recognition", "text mining")
)
cooccurrence(df, field = c("author_kw", "index_kw"), sep = ";")

3. Long / bipartite

A long or bipartite format has one row per item-document pair, common in relational databases, survey data, and tidy data pipelines.

Use the field argument to specify the column containing the items, and the by argument to specify which column groups them into transactions.

citations <- data.frame(
  paper_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  reference = c("Smith2020", "Jones2019", "Lee2021",
                "Jones2019", "Lee2021",
                "Smith2020", "Lee2021", "Park2022")
)
cooccurrence(citations, field = "reference", by = "paper_id")

Use the weight_by argument to pass a numeric weight column for weighted long format — for example, LDA topic-document probabilities where each document contributes a topic with a given probability:

theta <- data.frame(
  doc   = c("d1","d1","d1","d2","d2","d3","d3"),
  topic = c("T1","T2","T3","T1","T3","T2","T3"),
  prob  = c(0.6, 0.3, 0.1, 0.4, 0.6, 0.5, 0.5)
)
cooccurrence(theta, field = "topic", by = "doc", weight_by = "prob")

In weighted long format, the co-occurrence between items i and j is computed as sum_d w_id * w_jd (the sum of the products of their weights across all shared transactions) rather than a simple binary count. The count column still reports the number of transactions where both items appear together.
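As a sanity check, the weighted formula can be reproduced in base R by pivoting to a document-by-topic weight matrix and calling crossprod() (a standalone sketch, not the package's internals):

```r
theta <- data.frame(
  doc   = c("d1","d1","d1","d2","d2","d3","d3"),
  topic = c("T1","T2","T3","T1","T3","T2","T3"),
  prob  = c(0.6, 0.3, 0.1, 0.4, 0.6, 0.5, 0.5)
)
W <- xtabs(prob ~ doc + topic, data = theta)  # documents x topics weight matrix
C <- crossprod(W)                             # C[i, j] = sum_d w_id * w_jd
diag(C) <- 0
C["T1", "T3"]                                 # 0.6*0.1 + 0.4*0.6 = 0.30
```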

4. Binary matrix

A binary matrix is a document-term matrix where columns are items and values are 0/1 (absence or presence). This format is auto-detected when all values are 0 or 1 and no field, by, or sep arguments are provided.

dtm <- matrix(c(1,1,0,1,
                0,1,1,0,
                1,0,1,1), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("network", "graph", "algebra", "matrix")))
cooccurrence(dtm)

Works with both matrix and data.frame inputs. Columns without names are auto-named V1, V2, etc.

5. Wide sequence

A wide sequence format is used for non-binary data frames or matrices where each row is a sequence or record, and the unique values in each row form a transaction. This is the native format for sequence analysis tools like TraMineR and tna. Pass field = "all" to treat every column as a time point.

sequences <- data.frame(
  t1 = c("A", "B", "A"),
  t2 = c("B", "C", "C"),
  t3 = c("C", NA,  NA)
)
cooccurrence(sequences, field = "all")

NAs, empty strings, and TraMineR void markers (%, *) are automatically removed.

6. List of character vectors

A list of character vectors is the most direct format, where each list element is a transaction containing a set of categorical items.

baskets <- list(
  c("bread", "milk", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs", "butter"),
  c("bread", "milk", "eggs", "butter")
)
cooccurrence(baskets)

Similarity measures

The similarity argument controls how raw co-occurrence counts are normalized into a similarity or association measure. All similarity measures are based on two inputs: the co-occurrence count between two items (\(C_{ij}\)), and how frequently each item appears individually across all transactions (\(f_i\), \(f_j\)).

# Jaccard similarity
cooccurrence(baskets, similarity = "jaccard")

# Association strength
cooccurrence(papers, field = "keywords", sep = ";", similarity = "association")

Which similarity to use?

Similarity method overview

Method Formula Description Best for
"none" \(C_{ij}\) Raw co-occurrence count Exploratory analysis
"jaccard" \(\frac{C_{ij}}{f_i + f_j - C_{ij}}\) Divides co-occurrences by the total number of transactions containing either item General purpose
"cosine" \(\frac{C_{ij}}{\sqrt{f_i \cdot f_j}}\) Divides co-occurrences by the geometric mean of item frequencies (Salton’s cosine) Scale-invariant comparison
"inclusion" \(\frac{C_{ij}}{\min(f_i, f_j)}\) Divides co-occurrences by the frequency of the rarer item (Simpson coefficient) Subset and hierarchical relationships
"association" \(\frac{C_{ij}}{f_i \cdot f_j}\) Divides co-occurrences by the product of item frequencies, discounting chance co-occurrences (van Eck & Waltman, 2009) Bibliometric networks
"dice" \(\frac{2 C_{ij}}{f_i + f_j}\) Divides co-occurrences by the arithmetic mean of item frequencies Binary presence/absence networks
"equivalence" \(\frac{C_{ij}^2}{f_i \cdot f_j}\) Squares the co-occurrence count before dividing by the product of frequencies (cosine squared) Strict filtering
"relative" Row-normalized (each row sums to 1) Normalizes each row so that all edge weights from an item sum to 1 Asymmetric tendencies

Counting

The counting argument controls how much each transaction contributes to the co-occurrence count. Under full counting (default), each co-occurring pair adds 1 regardless of how many items are in the transaction. Under fractional counting, each pair adds \(1/(n-1)\) where \(n\) is the number of items in the transaction, preventing large transactions from dominating the network. For example, a document with 10 keywords creates 45 pairs under full counting but contributes only 1/9 per pair under fractional counting.

# Full counting (default): each co-occurring pair adds 1
co(data, field = "keywords", sep = ";")

# Fractional: each pair adds 1/(n-1) where n = items in the transaction
co(data, field = "keywords", sep = ";", counting = "fractional")
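The arithmetic behind the 10-keyword example works out as follows (a quick base-R check of the counting rules, not package code):

```r
n       <- 10
n_pairs <- choose(n, 2)   # 45 pairs from one document under full counting
frac    <- 1 / (n - 1)    # each pair adds 1/9 under fractional counting
n_pairs * frac            # total contribution: 45/9 = 5, i.e. n/2
```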

Scaling

The scale argument applies a transformation to the weights after similarity normalization, useful for visualization, thresholding, or feeding into downstream models.

# Log-scaled Jaccard similarity
cooccurrence(baskets, similarity = "jaccard", scale = "log")

# Min-max scaled for visualization
cooccurrence(baskets, similarity = "cosine", scale = "minmax")

Scaling method overview

Method Transformation Description Use case
"minmax" Scale to \([0, 1]\) Rescales all weights to the range \([0, 1]\) Visualization and cross-network comparison
"log" \(\log(1 + w)\) Applies a natural log transformation Compressing heavy-tailed distributions
"log10" \(\log_{10}(1 + w)\) Same as log but base 10 When base 10 interpretation is preferred
"sqrt" \(\sqrt{w}\) Square root transformation Mild compression of skewed weights
"binary" 1 if \(w > 0\), else 0 Converts all positive weights to 1 Presence/absence networks
"zscore" \((w - \mu) / \sigma\) Standardizes weights to mean 0 and standard deviation 1 Statistical comparison across networks
"proportion" \(w / \sum w\) Divides each weight by the total sum of weights Expressing edges as relative importance

Filtering

Three filtering arguments control which entities and edges appear in the result:

# Drop entities appearing in fewer than 3 transactions
cooccurrence(baskets, min_occur = 3)

# Keep only edges with weight >= 0.5 (applied after similarity + scaling)
cooccurrence(baskets, similarity = "jaccard", threshold = 0.5)

# Keep only the 10 strongest edges
cooccurrence(baskets, top_n = 10)

All three can be combined for fine-grained control over the network size and density:

cooccurrence(papers, field = "keywords", sep = ";",
             similarity = "association", min_occur = 2,
             threshold = 0.01, top_n = 50)

Splitting by groups

The split_by argument computes a separate co-occurrence network for each level of a grouping variable and returns them in a single data frame with a group column. This is useful for comparing co-occurrence patterns across time periods, disciplines, journals, or any categorical variable. Each group gets its own similarity computation, meaning item frequencies are group-specific. All other parameters (similarity, scale, threshold, min_occur, top_n) apply per group.

papers <- data.frame(
  year = c(2020, 2020, 2020, 2021, 2021, 2021),
  keywords = c("network; graph; matrix", "graph; algebra",
               "network; algebra; graph",
               "deep learning; nlp", "nlp; transformers",
               "deep learning; transformers; nlp")
)

co(papers, field = "keywords", sep = ";", split_by = "year",
   similarity = "jaccard")
#> # cooccurrence: 7 nodes, 8 edges | split_by: year (2 groups) | similarity: jaccard
#>           from           to    weight count group
#>        algebra        graph 0.6666667     2  2020
#>          graph      network 0.6666667     2  2020
#>  deep learning          nlp 0.6666667     2  2021
#>            nlp transformers 0.6666667     2  2021
#>            ...

Output

cooccurrence() returns a tidy data frame of class cooccurrence that can be piped, filtered, and joined like any standard data frame. The raw co-occurrence count is always preserved in the count column regardless of similarity or scaling, so the original counts can always be recovered.

The full matrix, item frequencies, and all parameters are stored as attributes on the returned data frame, making it easy to access the underlying data for further analysis or inspection.

attr(result, "matrix")          # Normalized weight matrix
attr(result, "raw_matrix")      # Raw count matrix (diagonal zeroed)
attr(result, "items")           # Character vector of all items
attr(result, "frequencies")     # Named vector of item frequencies
attr(result, "similarity")      # Similarity measure used
attr(result, "scale")           # Scaling method used
attr(result, "n_transactions")  # Number of transactions
attr(result, "n_items")         # Number of unique items

The cooccurrence object can also be printed, summarized, and plotted directly as a co-occurrence network (Saqr et al., 2023):

# Summary statistics
summary(result)

# Heatmap
plot(result)

# Network plot (requires igraph)
plot(result, type = "network")

Output formats

The output argument controls the format returned directly:

# Default: tidy data frame with from, to, weight, count
co(data, field = "keywords", sep = ";")

# Gephi-ready: Source, Target, Weight, Type, Count columns
co(data, field = "keywords", sep = ";", output = "gephi")
#>   Source  Target Weight       Type Count
#>    graph network      3 Undirected     3

# igraph object
g <- co(data, field = "keywords", sep = ";", output = "igraph")

# cograph object
net <- co(data, field = "keywords", sep = ";", output = "cograph")

# Square matrix
mat <- co(data, field = "keywords", sep = ";", output = "matrix")

The Gephi output can be written directly to CSV for import:

write.csv(co(data, field = "keywords", sep = ";", output = "gephi"),
          "network.csv", row.names = FALSE)

Converters

A cooccurrence result can be converted to other network formats using the built-in converter functions. All converter packages are optional — install only what you need.

Matrix

# Normalized similarity matrix
as_matrix(result)

# Raw co-occurrence count matrix
as_matrix(result, type = "raw")

igraph

# install.packages("igraph")
g <- as_igraph(result)
plot(g, edge.width = igraph::E(g)$weight * 3)
igraph::degree(g)
igraph::betweenness(g)

tidygraph

# install.packages("tidygraph")
tg <- as_tidygraph(result)
# Use with ggraph

cograph

# remotes::install_github("mohsaqr/cograph")
net <- as_cograph(result)
cograph::splot(net)
cograph::communities(net)

Nestimate

# remotes::install_github("mohsaqr/Nestimate")
net <- as_netobject(result)
Nestimate::centrality(net)
Nestimate::bootstrap_network(net)

How it works

Regardless of input format, the internal pipeline is:

  1. Parse input into a list of character vectors (transactions)
  2. Filter entities below min_occur frequency
  3. Build a binary transaction matrix \(B\) (rows = transactions, columns = items)
  4. Compute raw co-occurrence: \(C = B^\top B\) via crossprod()
  5. Normalize using the chosen similarity measure
  6. Scale weights if scale is specified
  7. Filter edges below threshold and keep top_n
  8. Return upper triangle as a tidy sorted edge data frame

The computation is vectorized throughout - no loops in the hot path. crossprod() delegates to optimized BLAS routines for the matrix multiplication.
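Steps 3-8 of the pipeline can be sketched in a few lines of base R (an illustration of the idea, not the package's actual implementation):

```r
# Build binary transaction matrix B, compute C = t(B) %*% B, normalize
# with jaccard, and return the upper triangle as a sorted edge list
transactions <- list(c("network", "graph", "matrix"),
                     c("graph", "algebra"),
                     c("network", "algebra", "graph"))
items <- sort(unique(unlist(transactions)))
B <- t(vapply(transactions, function(tr) as.numeric(items %in% tr),
              numeric(length(items))))
colnames(B) <- items
C <- crossprod(B)                  # raw co-occurrence counts
f <- colSums(B)                    # item frequencies
J <- C / (outer(f, f, "+") - C)    # jaccard: C_ij / (f_i + f_j - C_ij)
diag(J) <- 0
idx <- which(upper.tri(J) & J > 0, arr.ind = TRUE)
edges <- data.frame(from   = items[idx[, 1]],
                    to     = items[idx[, 2]],
                    weight = J[cbind(idx[, 1], idx[, 2])])
edges[order(-edges$weight), ]
```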

Full parameter reference

Argument Type Description Default
data various Input data (data.frame, matrix, or list) required
field character Column(s) containing entities (nodes) NULL
by character Column grouping entities into transactions NULL
weight_by character Column with numeric weights for long format (e.g. LDA topic probabilities) NULL
sep character Delimiter for splitting delimited fields NULL
split_by character Column to split data by (separate network per group) NULL
similarity character Normalization measure "none"
counting character "full" or "fractional" "full"
scale character Weight scaling method NULL
threshold numeric Minimum edge weight (after normalization + scaling) 0
min_occur integer Minimum entity frequency (transactions) 1
top_n integer Keep only the top N edges by weight (per group if split) NULL
output character Output format: "default", "gephi", "igraph", "cograph", "matrix" "default"

References

Saqr, M., López-Pernas, S., Conde, M. Á., & Hernández-García, Á. (2023). Social Network Analysis: A Primer, a Guide and a Tutorial in R. In Saqr, M. & López-Pernas, S. (Eds.), Learning Analytics Methods and Tutorials: A Practical Guide Using R. Springer. https://lamethods.org/book1/chapters/ch15-sna/ch15-sna.html

van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651.

Authors

Mohammed Saqr — University of Eastern Finland · saqr.me

Sonsoles López-Pernas — University of Eastern Finland · sonsoles.me

Kamila Misiejuk — FernUniversität in Hagen · kamilamisiejuk.com

License

MIT