Title: Optimized Data Analysis System for AI-Based Text Processing
Version: 0.1.0
Description: Extracts machine-readable variables from natural language text using AI APIs. Optimized for speed and cost efficiency through parallel processing and direct CSV-formatted responses from language models. Supports multiple AI providers with robust error handling and automatic retry mechanisms for failed extractions.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.3
Imports: stringr, openai, groqR, dplyr, rlang, parallel, future, future.apply
Suggests: testthat (>= 3.0.0), irr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2026-03-05 23:16:55 UTC; gelon
Author: Gabriel Lönn [aut, cre], Sebastian Schutte [ctb] (original code and package idea)
Maintainer: Gabriel Lönn <gablon@prio.org>
Repository: CRAN
Date/Publication: 2026-03-10 20:40:02 UTC

rapidcodeR: Optimized Data Analysis System for AI-Based Text Processing

Description

A high-performance R package for extracting machine-readable variables from natural language text using AI APIs. The package is optimized for speed and cost efficiency through parallel processing and direct CSV-formatted responses from language models.

Funding

This work is financed in large part by the Research Council of Norway (grant #324931).

Key Features

  - Parallel processing across multiple CPU cores for speed

  - Direct CSV-formatted responses from language models for cost efficiency

  - Support for multiple AI providers (OpenAI and GROQ)

  - Robust error handling and automatic retry mechanisms for failed extractions

Main Functions

parallel_execute

Main function for parallel text processing

set_api_specs

Configure API specifications for AI services

set_coding_instruction

Define how AI should extract variables

set_parameters

Configure processing parameters

calculate_overlap

Assess reliability across multiple runs

Workflow

  1. Set up API specifications using set_api_specs()

  2. Configure processing parameters with set_parameters() (optional)

  3. Define extraction instructions with set_coding_instruction()

  4. Process data with parallel_execute()

  5. Optionally assess reliability with calculate_overlap()

Data Format

Input data should contain an ID column and a text column at the positions specified via set_parameters() (id_column and text_column).

Output contains configurable columns: id, language, text, and N extracted variables (Var1-VarN).
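A minimal sketch of the expected shapes (column names and values here are illustrative, not prescribed by the package):

```r
# Input: a data frame whose ID and text columns are declared via set_parameters()
my_text_data <- data.frame(
  id   = c(101, 102, 103),
  text = c("Great product!", "Terrible service.", "It was okay."),
  stringsAsFactors = FALSE
)

# Output (illustrative): one row per post, with the N extracted variables appended
#   id   language  text               Var1       Var2
#   101  en        Great product!     positive   high
```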

Author(s)

Maintainer: Gabriel Lönn gablon@prio.org

Other contributors:

  Sebastian Schutte (original code and package idea) [contributor]

Examples

## Not run: 
# Basic workflow
library(rapidcodeR)

# 1. Set up API specifications
set_api_specs(provider = "OpenAI", model = "gpt-4", temp = 0.7, api_key = "your_openai_key")

# 2. Configure processing parameters (optional)
set_parameters(n_variables = 4, id_column = 1, text_column = 2, sep = ",")

# 3. Define what to extract
instruction <- paste(
  "Extract sentiment and topic from each post.",
  "Return as CSV: id,language,text,sentiment,confidence,topic,relevance"
)
set_coding_instruction(instruction)

# 4. Process your data
results <- parallel_execute(
  test_data = my_text_data,  # Provider determined by set_api_specs()
  slicing_n = 240,           # Process 240 random posts
  cores = 4                  # Use 4 CPU cores
)

# 5. Check reliability (optional)
run1 <- parallel_execute(my_data, 100, cores = 4, seed = 123)
run2 <- parallel_execute(my_data, 100, cores = 4, seed = 123)
agreement <- calculate_overlap(list(run1, run2))

## End(Not run)


Send Prompt to GROQ AI API

Description

This function sends a text prompt to the GROQ AI API and returns the model's response. It includes error handling and uses the groqR package for API communication.

Usage

ask_groq(prompt, temp, topp = 1, model, api_key)

Arguments

prompt

Character. The text prompt to send to the AI model.

temp

Numeric. Temperature parameter controlling randomness (0-1). Lower values make output more deterministic. Required (typically from set_api_specs()).

topp

Numeric. Top-p parameter controlling diversity (0-1). Default is 1.

model

Character. GROQ model to use. Required (typically from set_api_specs()).

api_key

Character. GROQ API key. Must be provided.

Details

The function uses the groqR package to communicate with GROQ's API. It includes error handling that returns NA if the API call fails. The function is optimized for text processing tasks with a maximum token limit of 5000.
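The return-NA-on-failure behaviour described above can be sketched as follows (a simplified illustration, not the package's internal code):

```r
# Wrap an API call so any failure yields NA instead of raising an error
safe_call <- function(do_request) {
  tryCatch(do_request(), error = function(e) NA)
}

safe_call(function() stop("simulated API failure"))  # returns NA
safe_call(function() "Paris")                        # returns "Paris"
```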

Value

Character. The AI model's response to the prompt, or NA if an error occurs.

See Also

ask_openai()

Examples

## Not run: 
# Store the API key in an environment variable
Sys.setenv(GROQ_API_KEY = "your_api_key")

# temp, model, and api_key are required (typically supplied via set_api_specs())
response <- ask_groq(
  prompt = "What is the capital of France?",
  temp = 0.7,
  model = "llama-3.3-70b-versatile",
  api_key = Sys.getenv("GROQ_API_KEY")
)

# Lower temperature and top-p for more deterministic output
response <- ask_groq(
  prompt = "Analyze this text",
  temp = 0.3,
  topp = 0.9,
  model = "llama-3.3-70b-versatile",
  api_key = Sys.getenv("GROQ_API_KEY")
)

## End(Not run)


Send Prompt to OpenAI API

Description

This function sends a text prompt to the OpenAI API and returns the model's response. It includes robust error handling and uses the openai package for API communication.

Usage

ask_openai(prompt, temp = NULL, topp = 1, model = NULL)

Arguments

prompt

Character. The text prompt to send to the AI model.

temp

Numeric. Temperature parameter controlling randomness (0-1). Lower values make output more deterministic. If NULL, retrieved from set_api_specs().

topp

Numeric. Top-p parameter controlling diversity (0-1). Default is 1.

model

Character. OpenAI model to use. If NULL, retrieved from set_api_specs().

Details

The function uses the openai package to communicate with OpenAI's API. It includes comprehensive error handling that returns NA if the API call fails or returns an unexpected format. The function expects the OPENAI_API_KEY environment variable to be set via set_api_specs().

Value

Character. The AI model's response to the prompt, or NA if an error occurs.

See Also

ask_groq(), set_api_specs()

Examples

## Not run: 
# Set API specifications first
set_api_specs(provider = "OpenAI", model = "gpt-4", temp = 0.7, api_key = "your_api_key")

# Send a simple prompt
response <- ask_openai("What is the capital of France?")

# Use different parameters
response <- ask_openai("Analyze this text", temp = 0.3, topp = 0.9)

## End(Not run)


Calculate Inter-Rater Agreement Across Multiple Datasets

Description

This function calculates the percentage overlap (agreement) between multiple datasets containing the same variables. It's designed to assess reliability and consistency when the same data is processed multiple times or by different systems/raters.

Usage

calculate_overlap(datasets, alpha = FALSE)

Arguments

datasets

List. A list of data frames to compare. Each data frame should have the same structure with variable columns (Var1, Var2, etc.).

alpha

Logical. If TRUE, also compute Krippendorff's Alpha (nominal) for each variable across the provided datasets. Defaults to FALSE.

Details

The function performs a comprehensive overlap analysis across all provided datasets.

It expects exactly the data structure produced by the main processing functions: N coded variables (Var1-VarN). Agreement is calculated as the percentage of cases where values match exactly.
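The per-variable agreement described here can be illustrated with a toy computation (a sketch of the idea, not the package's implementation):

```r
# Two runs coding the same four posts on one variable
run1 <- c("pos", "neg", "pos", "neu")
run2 <- c("pos", "neg", "neg", "neu")

# Percentage of cases where the values match exactly
round(mean(run1 == run2) * 100, 2)  # 75: three of the four codes agree
```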

Value

Matrix. If alpha = FALSE: one-row matrix with average percentage agreement for each variable (columns), rounded to 2 decimals. If alpha = TRUE: two-row matrix with row "Overlap" (percent agreement) and row "Alpha" (Krippendorff's Alpha for nominal data; 0-1), rounded to 3 decimals for Alpha.

See Also

parallel_execute()

Examples

## Not run: 
# Compare three processing runs
run1 <- parallel_execute(my_data, slicing_n = 100, cores = 4)
run2 <- parallel_execute(my_data, slicing_n = 100, cores = 4)
# Switch to GROQ for comparison
set_api_specs(provider = "Groq", model = "llama-3.3-70b-versatile",
  temp = 0.3, api_key = "your_groq_key")
run3 <- parallel_execute(my_data, slicing_n = 100, cores = 4)

overlap_scores <- calculate_overlap(list(run1, run2, run3))
# Returns a one-row matrix like: Var1 = 85.6, Var2 = 92.1, ...

## End(Not run)


Process Text Data Using OpenAI GPT API

Description

This function sends text data to the OpenAI GPT API for variable extraction, optimized for speed and cost efficiency. It formats the data for API consumption and parses the CSV-formatted response with robust error handling.

Usage

gpt_func(data_subset, n_post, coding_instruction, worker_env = NULL)

Arguments

data_subset

Data frame. Subset of text data to process. Must contain columns specified by id_column and text_column parameters (set via set_parameters()).

n_post

Integer. Number of posts to process in this batch.

coding_instruction

Character. Instructions for the AI model specifying how to extract and format variables from the text.

worker_env

Environment or NULL. When running in a parallel worker, the worker's package environment. If NULL (default), .package_env is used.

Details

The function performs several optimizations for speed and cost efficiency.

It includes specific error handling for API failures and will return NULL if the API returns an error or an invalid response.

Value

Data frame. Each row contains the AI's response for one post, with columns named Var1, Var2, etc. Returns NULL if no posts remain after filtering or if API errors occur.

See Also

ask_openai(), groq_func()


Process Text Data Using GROQ AI API

Description

This function sends text data to the GROQ AI API for variable extraction, optimized for speed and cost efficiency. It formats the data for API consumption and parses the CSV-formatted response.

Usage

groq_func(data_subset, n_post, coding_instruction, worker_env = NULL)

Arguments

data_subset

Data frame. Subset of text data to process. Must contain columns specified by id_column and text_column parameters (set via set_parameters()).

n_post

Integer. Number of posts to process in this batch.

coding_instruction

Character. Instructions for the AI model specifying how to extract and format variables from the text.

worker_env

Environment or NULL. When running in a parallel worker, the worker's package environment. If NULL (default), .package_env is used.

Details

The function performs several optimizations for speed and cost efficiency.

It expects the AI to return one line per input post in CSV format. Trailing semicolons are automatically removed from the response.

Value

Data frame. Each row contains the AI's response for one post, with columns named Var1, Var2, etc. Returns NULL if no posts remain after filtering.

See Also

ask_groq(), gpt_func()


Main Text Processing Function for AI-Based Variable Extraction

Description

This function processes batches of text data using AI models to extract machine-readable variables. It implements robust error handling and retry logic to ensure reliable processing even with API failures.

Usage

main_func(df, to_code_max_id, n_post, provider, worker_env = NULL)

Arguments

df

Data frame. Input data subset containing text to be processed.

to_code_max_id

Integer. Maximum number of posts to process in this batch.

n_post

Integer. Number of posts to process per API call (typically 15).

provider

Character. AI provider to use, either "OpenAI" or "Groq".

worker_env

Environment or NULL. When running in a parallel worker, the worker's package environment. If NULL (default), .package_env is used.

Details

The function implements a robust processing loop with retry logic for batches that fail.

The AI models are instructed to return data in CSV format for efficient parsing. The function expects responses with exactly N columns, as set by set_parameters().
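The "exactly N columns" expectation can be sketched as a simple validation step (has_n_fields is a hypothetical helper, not a package function):

```r
# Check that a response line splits into exactly n fields on the separator
has_n_fields <- function(line, n, sep = ";") {
  length(strsplit(line, sep, fixed = TRUE)[[1]]) == n
}

has_n_fields("101;en;Great product!;positive", 4)  # TRUE
has_n_fields("101;en;Great product!", 4)           # FALSE
```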

Value

List of data frames. Each element contains successfully processed results with extracted variables. Returns empty list if no successful processing.

See Also

gpt_func(), groq_func(), make_value_row()

Examples

## Not run: 
# For set_parameters(n_variables = 6):
instruction <- "Extract variables: var1,var2,var3,var4,var5,var6"
set_coding_instruction(instruction)
result <- main_func(
  df = my_data_subset,
  to_code_max_id = 30,
  n_post = 15,
  provider = "OpenAI"
)
processed_data <- dplyr::bind_rows(result)  # Combine the list of data frames

## End(Not run)


Parse AI Response into Formatted Row String

Description

This function takes a single AI response string and formats it into a standardized SQL-style row format for consistent data processing. It handles text sanitization and ensures the response has exactly N elements (N set by set_parameters()).

Usage

make_value_row(ai_response)

Arguments

ai_response

Character. A single response string from an AI model, typically containing semicolon-separated values.

Details

The function performs several formatting operations, including text sanitization and wrapping the values into a quoted, parenthesized row string.

It expects exactly N values, as set by set_parameters(); responses with a different number of values fail validation.

Value

Character. A formatted row string in SQL format like "('val1','val2',...)", or NA if validation fails.

See Also

main_func()

Examples

## Not run: 
# AI response with semicolon-separated values (assuming set_parameters(n_variables = 6))
ai_response <- "123;en;Hello world;positive;high;topic1"
formatted_row <- make_value_row(ai_response)
# Returns: "('123','en','hello world','positive','high','topic1')"

## End(Not run)


Process Missing Data with Parallel Execution

Description

This function handles data that failed to process in the initial parallel run by reprocessing missing IDs with optimized batch sizes and parallel execution.

Usage

missingness_func(
  subset_trial,
  missing_ids,
  n_post,
  cores,
  provider,
  multi_core = TRUE,
  verbose = TRUE
)

Arguments

subset_trial

Data frame. The original data subset containing all posts.

missing_ids

Numeric vector. IDs that failed processing and need retry.

n_post

Integer. Number of posts to process in each API call.

cores

Integer. Number of CPU cores available for parallel processing.

provider

Character. AI provider to use, either "OpenAI" or "Groq".

multi_core

Logical. If TRUE, uses multicore backend for parallel processing. If FALSE, uses multisession backend.

verbose

Logical. If TRUE (default), progress and status messages are shown via message(). If FALSE, suppressed.

Details

The function implements an adaptive retry strategy, continuing until the number of missing items is reduced to a manageable level.

The function uses future-based parallel processing with automatic cleanup and includes timing information for performance monitoring.
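The first step of the retry strategy, identifying which IDs failed, can be sketched as follows (find_missing_ids is a hypothetical helper, not the package's internal code):

```r
# IDs present in the original subset but absent from the processed results
find_missing_ids <- function(original_ids, processed_ids) {
  setdiff(original_ids, processed_ids)
}

find_missing_ids(original_ids = 1:6, processed_ids = c(1, 2, 4, 6))  # 3 5
```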

Value

Data frame. Successfully processed results from the missing data, or NULL if no missing data was recovered. The function also outputs debugging information about filtered posts with 4 or fewer characters.

See Also

parallel_execute(), track_progress()


Execute Parallel Text Processing with AI Models

Description

This is the main function that orchestrates parallel processing of text data using AI APIs for variable extraction. It optimizes for speed and cost by distributing work across multiple cores and handling missing data efficiently.

Usage

parallel_execute(
  test_data,
  slicing_n,
  n_post = 15,
  cores = 8,
  seed = NULL,
  multi_core = FALSE,
  benchmarking = FALSE,
  verbose = TRUE
)

Arguments

test_data

Data frame. Input data containing text to be processed.

slicing_n

Integer. Number of rows to sample from the input data for processing. The batch size will be automatically calculated by dividing this by the number of cores.

n_post

Integer. Number of posts to process in each API call. Default is 15. Must be between 1 and 1000. Larger values may be more efficient but use more API credits.

cores

Integer. Number of CPU cores to use for parallel processing. Default is 8. Note that the AI provider (OpenAI or GROQ) is determined by set_api_specs(), not by an argument to this function.

seed

Integer or NULL. Random seed for reproducible sampling. If NULL (default), sampling is random. If set to a number, ensures identical datasets across runs.

multi_core

Logical. If TRUE, uses multicore backend for parallel processing (faster but Unix/Mac only). If FALSE (default), uses multisession backend (works on all platforms including Windows).

benchmarking

Logical. If TRUE, returns processing time in seconds instead of the result data frame. Default is FALSE.

verbose

Logical. If TRUE (default), progress and status messages are printed via message(). If FALSE, such messages are suppressed.

Details

The function implements a multi-stage parallel processing workflow:

  1. Validates parameters and tests API connectivity

  2. Randomly samples data and splits into batches for parallel processing

  3. Executes parallel processing using futures and tracks progress

  4. Handles missing/failed extractions with a secondary processing stage

  5. Combines and deduplicates results

The function requires a coding instruction to be set via set_coding_instruction() before execution. This instruction tells the AI how to format its responses.

Value

Data frame. Processed results containing extracted variables, with duplicates removed (when multi_response=FALSE) and sorted by ID.

See Also

set_api_specs(), set_coding_instruction(), set_parameters()

Examples

## Not run: 
# Set up API specifications and coding instruction
set_api_specs(provider = "OpenAI", model = "gpt-4", temp = 0.7, api_key = "your_key")
set_coding_instruction("Extract sentiment: id,sentiment,confidence")

# Process data with random sampling (default)
results <- parallel_execute(
  test_data = my_data,
  slicing_n = 240,
  cores = 4
)

# Process data with reproducible sampling (for calculate_overlap)
results1 <- parallel_execute(
  test_data = my_data,
  slicing_n = 240,
  cores = 4,
  seed = 123
)

# Switch to GROQ for comparison
set_api_specs(provider = "Groq", model = "llama-3.3-70b-versatile",
  temp = 0.3, api_key = "your_groq_key")
results2 <- parallel_execute(
  test_data = my_data,
  slicing_n = 240,
  cores = 4,
  seed = 123
)

# Calculate overlap between identical datasets
overlap_scores <- calculate_overlap(list(results1, results2))


## End(Not run)


Set API Specifications for AI Models

Description

This function sets the API specifications including provider, model, temperature, and API key for AI model interactions. It replaces the need for separate set_api_keys calls and allows for more flexible configuration.

Usage

set_api_specs(provider, model, temp, api_key, multi_response = FALSE)

Arguments

provider

Character. The AI provider to use, either "OpenAI" or "Groq".

model

Character. The specific model to use (e.g., "gpt-4", "gpt-3.5-turbo", "llama-3.3-70b-versatile").

temp

Numeric. Temperature parameter controlling randomness (0-1). Lower values make output more deterministic.

api_key

Character. The API key for the specified provider.

multi_response

Logical. If TRUE, allows AI to return multiple rows per input (only works with n_post=1). If FALSE (default), AI must return exactly one row per input.

Details

This function stores the API specifications in the package's internal environment and sets the appropriate environment variables for API authentication. The specifications are automatically retrieved by processing functions.

Value

Character. A confirmation message.

See Also

set_coding_instruction(), set_parameters()

Examples

## Not run: 
# Set OpenAI specifications
set_api_specs(
  provider = "OpenAI",
  model = "gpt-4",
  temp = 0.7,
  api_key = "your_openai_key"
)

# Set GROQ specifications
set_api_specs(
  provider = "Groq",
  model = "llama-3.3-70b-versatile",
  temp = 0.3,
  api_key = "your_groq_key"
)

## End(Not run)


Set Coding Instructions for AI Models

Description

This function sets the coding instructions that will be used by AI models to extract machine-readable variables from natural language text. The instruction defines how the AI should format and structure its responses.

Usage

set_coding_instruction(instruction)

Arguments

instruction

Character. A string containing the coding instruction that tells the AI model how to process and format the text data. Should specify the expected output format (typically CSV-structured responses).

Details

The coding instruction is stored in the package's internal environment and is automatically retrieved by processing functions. It typically defines which variables to extract and the expected (CSV-structured) output format.

Value

Character. A confirmation message indicating the instruction has been set.

Examples

## Not run: 
instruction <- "Extract sentiment and topics from posts. Return as CSV: id,sentiment,topic"
set_coding_instruction(instruction)

## End(Not run)


Set Processing Parameters for AI Models

Description

This function configures the parameters that will be used by AI models to extract machine-readable variables from natural language text.

Usage

set_parameters(n_variables = 9, id_column = 1, text_column = 2, sep = ";")

Arguments

n_variables

Integer. Number of columns to extract (default is 9). This determines the total number of columns in the final dataset. Must be between 1 and 20.

id_column

Integer. The column number containing the unique ID for each observation (default is 1). Must be a positive integer.

text_column

Integer. The column number containing the text content to be processed (default is 2). Must be a positive integer.

sep

Character. Separator used in API response parsing (default is ";"). Must be one of: ";" (semicolon), "," (comma), or "|" (vertical bar).

Details

The function sets the expected parameters in the package's internal environment. This affects which input columns are read (id_column, text_column), how many variables each response is expected to contain (n_variables), and the separator used when parsing API responses (sep).

The package automatically adds a unique internal ID column to each row during processing. This internal ID is used for tracking and is removed from the final output.

The total number of expected columns equals n_variables.
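The role of the sep parameter in response parsing can be illustrated as follows (parse_fields is a hypothetical helper, not a package function):

```r
# Split one response line into fields using the configured separator
parse_fields <- function(line, sep = ";") {
  strsplit(line, sep, fixed = TRUE)[[1]]
}

parse_fields("101;en;positive;high")             # "101" "en" "positive" "high"
parse_fields("101,en,positive,high", sep = ",")  # same fields, comma-separated
```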

Value

Character. A confirmation message indicating the parameters have been set.

Examples

## Not run: 
# Extract 4 columns, ID in column 1, text in column 2, semicolon separator
set_parameters(n_variables = 4, id_column = 1, text_column = 2, sep = ";")

# Extract 10 columns, ID in column 3, text in column 5, comma separator
set_parameters(n_variables = 10, id_column = 3, text_column = 5, sep = ",")

# Extract 6 columns, ID in column 2, text in column 3, vertical bar separator
set_parameters(n_variables = 6, id_column = 2, text_column = 3, sep = "|")

## End(Not run)


Track Progress of Parallel Processing Tasks

Description

This function wraps the main processing function with progress tracking, error handling, and rate limiting. It's designed to be called from parallel workers to process data batches while providing feedback on progress.

Usage

track_progress(
  df,
  index,
  total_tasks,
  n_post,
  batch_size,
  provider,
  worker_env = NULL,
  verbose = TRUE
)

Arguments

df

Data frame. A subset of data to process in this task.

index

Integer. The current task number (for progress tracking).

total_tasks

Integer. Total number of tasks in the parallel job.

n_post

Integer. Number of posts to process in each API call.

provider

Character. AI provider to use, either "OpenAI" or "Groq".

worker_env

Environment or NULL. When running in a parallel worker, the worker's package environment so config (e.g. coding_instruction, id_column) is available. If NULL (default), the package's .package_env is used.

verbose

Logical. If TRUE (default), progress messages are shown. If FALSE, suppressed.

Details

The function provides progress tracking, error handling, and rate limiting for each processed batch.

This function is primarily used internally by the parallel processing framework and calls main_func() with the n_post parameter passed by the caller.
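The progress feedback can be sketched as a simple per-task message (the format shown is illustrative; the actual messages may differ):

```r
# Report completion of one task out of the total parallel job
report_progress <- function(index, total_tasks, verbose = TRUE) {
  if (verbose) {
    message(sprintf("Task %d of %d complete (%.0f%%)",
                    index, total_tasks, 100 * index / total_tasks))
  }
}

report_progress(4, 8)  # Task 4 of 8 complete (50%)
```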

Value

Data frame with successfully processed results from main_func(), or NULL if an error occurs during processing.

See Also

main_func(), parallel_execute()