Structured data

When using an LLM to extract data from text or images, you can ask the chatbot to format it in JSON or any other format that you like. This works well most of the time, but there’s no guarantee that you’ll get the exact format you want. In particular, if you’re trying to get JSON, you’ll find that it’s typically wrapped in a ```json code fence, and you’ll occasionally get text that isn’t valid JSON at all. To avoid these problems, you can use a recent LLM feature: structured data (aka structured output). With structured data, you supply the type specification that defines the object structure you want, and the LLM ensures that’s what you’ll get back.

library(ellmer)

Structured data basics

To extract structured data you call $chat_structured() instead of the $chat() method. You’ll also need to define a type specification that describes the structure of the data that you want (more on that shortly). Here’s a simple example that extracts two specific values from a string:

chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$chat_structured(
  "My name is Susan and I'm 13 years old",
  type = type_object(
    name = type_string(),
    age = type_number()
  )
)
#> $name
#> [1] "Susan"
#> 
#> $age
#> [1] 13

The same basic idea works with images too:

chat$chat_structured(
  content_image_url("https://www.r-project.org/Rlogo.png"),
  type = type_object(
    primary_shape = type_string(),
    primary_colour = type_string()
  )
)
#> $primary_shape
#> [1] "the image consists of a large gray oval with a bold blue letter 'R' overlaid on it"
#> 
#> $primary_colour
#> [1] "gray (oval), blue (letter R)"

If you need to extract data from multiple prompts, you can use the same techniques with parallel_chat_structured(). It takes the same arguments as $chat_structured() with two exceptions: it needs a chat object since it’s a standalone function, not a method, and it can take a vector of prompts.

prompts <- list(
  "I go by Alex. 42 years on this planet and counting.",
  "Pleased to meet you! I'm Jamal, age 27.",
  "They call me Li Wei. Nineteen years young.",
  "Fatima here. Just celebrated my 35th birthday last week.",
  "The name's Robert - 51 years old and proud of it.",
  "Kwame here - just hit the big 5-0 this year."
)
parallel_chat_structured(
  chat,  
  prompts,
  type = type_object(
    name = type_string(),
    age = type_number()
  )
)
#> [working] (0 + 0) -> 5 -> 1 | ■■■■■■                            17%
#> [working] (0 + 0) -> 0 -> 6 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100%
#>     name age
#> 1   Alex  42
#> 2  Jamal  27
#> 3 Li Wei  19
#> 4 Fatima  35
#> 5 Robert  51
#> 6  Kwame  50

Data types basics

To define your desired type specification (also known as a schema), you use the type_() functions. (You might already be familiar with these if you’ve done any function calling, as discussed in vignette("function-calling").) The type functions can be divided into three main groups:

- Scalars represent single values. There are five types: type_boolean(), type_integer(), type_number(), type_string(), and type_enum(), which represent a single logical, integer, double, string, and factor value respectively.
- Arrays, created with type_array(), represent any number of values of the same type.
- Objects, created with type_object(), represent a collection of named values, which can themselves be scalars, arrays, or other objects.

Using these type specifications ensures that the LLM will return JSON. But ellmer goes one step further and converts the results to the closest R analog. Currently, this converts arrays of booleans, integers, numbers, and strings into logical, integer, numeric, and character vectors, and arrays of objects into data frames. You can opt out of this and get plain lists by setting convert = FALSE in $chat_structured().
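For example, a minimal sketch of opting out of conversion (the prompt here is hypothetical, and the output is not shown):

chat$chat_structured(
  "List the first five even numbers",
  type = type_array(items = type_integer()),
  convert = FALSE
)
# With convert = TRUE (the default), the result would be an integer
# vector; with convert = FALSE, it's a plain list parsed from the JSON.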

In addition to defining types, you need to provide the LLM with some information about what you actually want. This is the purpose of the first argument, description, which is a string that describes the data that you want. This is a good place to ask nicely for other attributes you’d like the value to have (e.g. minimum or maximum values, date formats, …). There’s no guarantee that these requests will be honoured, but the LLM will usually make a best effort to do so.

type_person <- type_object(
  "A person",
  name = type_string("Name"),
  age = type_integer("Age, in years."),
  hobbies = type_array(
    "List of hobbies. Should be exclusive and brief.",
    items = type_string()
  )
)

Now we’ll dive into some examples before coming back to talk more about the details of data types.

Examples

The following examples, which are closely inspired by the Claude documentation, hint at some of the ways you can use structured data extraction.

Example 1: Article summarisation

text <- readLines(system.file("examples/third-party-testing.txt", package = "ellmer"))
# url <- "https://www.anthropic.com/news/third-party-testing"
# html <- rvest::read_html(url)
# text <- rvest::html_text2(rvest::html_element(html, "article"))

type_summary <- type_object(
  "Summary of the article.",
  author = type_string("Name of the article author"),
  topics = type_array(
    'Array of topics, e.g. ["tech", "politics"]. Should be as specific as possible, and can overlap.',
    type_string()
  ),
  summary = type_string("Summary of the article. One or two paragraphs max"),
  coherence = type_integer("Coherence of the article's key points, 0-100 (inclusive)"),
  persuasion = type_number("Article's persuasion score, 0.0-1.0 (inclusive)")
)

chat <- chat_openai()
#> Using model = "gpt-4.1".
data <- chat$chat_structured(text, type = type_summary)
cat(data$summary)
#> This article argues that effective, broadly trusted third-party testing and evaluation regimes are essential for the safe deployment and governance of frontier AI systems. Anthropic proposes that self-governance and company-led safety policies, like their Responsible Scaling Policy (RSP), are necessary but not sufficient; independent, third-party testing—spanning government agencies, academia, and private contractors—is needed to ensure AI models do not cause accidental or deliberate harm. The article details reasons for needing such a regime, including AI's general-purpose nature, emergent risks, and the precedent from other sectors (food, medicine, etc.). 
#> 
#> Anthropic outlines the principles for creating fair and effective third-party testing systems: precise scope to avoid over-burdening small companies, application mainly to high-compute frontier models, and the inclusion of various actors (government, academia, private). They emphasize that such a system will help avoid catastrophic incidents and knee-jerk regulation, foster public trust, and mitigate risks like regulatory capture and undue barriers to competition. The article also discusses the importance of openly accessible models, scenario planning, and the challenges around regulation and open-source AI. Anthropic commits to proactive steps: prototyping testing, red teaming, supporting government funding of institutions like NIST, and advocating for balanced, minimal but effective policy. Ultimately, they view third-party testing as a cornerstone for responsible AI oversight and hope to inspire broader societal standards and critique.

str(data)
#> List of 5
#>  $ author    : chr "Anthropic Policy Team (unspecified individual)"
#>  $ topics    : chr [1:11] "AI policy" "third-party testing" "AI safety" "regulation" ...
#>  $ summary   : chr "This article argues that effective, broadly trusted third-party testing and evaluation regimes are essential fo"| __truncated__
#>  $ coherence : int 92
#>  $ persuasion: num 0.82

Example 2: Named entity recognition

text <- "
  John works at Google in New York. He met with Sarah, the CEO of
  Acme Inc., last week in San Francisco.
"

type_named_entity <- type_object(
  name = type_string("The extracted entity name."),
  type = type_enum("The entity type", c("person", "location", "organization")),
  context = type_string("The context in which the entity appears in the text.")
)
type_named_entities <- type_array(items = type_named_entity)

chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$chat_structured(text, type = type_named_entities)
#>            name         type
#> 1          John       person
#> 2        Google organization
#> 3      New York     location
#> 4         Sarah       person
#> 5     Acme Inc. organization
#> 6 San Francisco     location
#>                                                              context
#> 1                John is mentioned as working at Google in New York.
#> 2               Google is mentioned as the company where John works.
#> 3      New York is mentioned as the city where John works at Google.
#> 4         Sarah is mentioned as the CEO of Acme Inc., whom John met.
#> 5      Acme Inc. is mentioned as the company where Sarah is the CEO.
#> 6 San Francisco is mentioned as the place where John met with Sarah.

Example 3: Sentiment analysis

text <- "
  The product was okay, but the customer service was terrible. I probably
  won't buy from them again.
"

type_sentiment <- type_object(
  "Extract the sentiment scores of a given text. Sentiment scores should sum to 1.",
  positive_score = type_number("Positive sentiment score, ranging from 0.0 to 1.0."),
  negative_score = type_number("Negative sentiment score, ranging from 0.0 to 1.0."),
  neutral_score = type_number("Neutral sentiment score, ranging from 0.0 to 1.0.")
)

chat <- chat_openai()
#> Using model = "gpt-4.1".
str(chat$chat_structured(text, type = type_sentiment))
#> List of 3
#>  $ positive_score: num 0.05
#>  $ negative_score: num 0.75
#>  $ neutral_score : num 0.2

Note that while we’ve asked nicely for the scores to sum to 1, which they do in this example (at least when I ran the code), this is not guaranteed.
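If you need the scores to sum to exactly 1, one option is to normalise them yourself after extraction. A minimal sketch in base R:

scores <- chat$chat_structured(text, type = type_sentiment)
total <- scores$positive_score + scores$negative_score + scores$neutral_score
# Rescale each score so the three values sum to exactly 1
scores <- lapply(scores, function(x) x / total)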

Example 4: Text classification

text <- "The new quantum computing breakthrough could revolutionize the tech industry."

type_classification <- type_array(
  "Array of classification results. The scores should sum to 1.",
  type_object(
    name = type_enum(
      "The category name",
      values = c(
        "Politics",
        "Sports",
        "Technology",
        "Entertainment",
        "Business",
        "Other"
      )
    ),
    score = type_number(
      "The classification score for the category, ranging from 0.0 to 1.0."
    )
  )
)

chat <- chat_openai()
#> Using model = "gpt-4.1".
data <- chat$chat_structured(text, type = type_classification)
data
#>         name score
#> 1 Technology     1

Example 5: Working with unknown keys

type_characteristics <- type_object(
  "All characteristics",
  .additional_properties = TRUE
)

prompt <- "
  Given a description of a character, your task is to extract all the characteristics of that character.

  <description>
  The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket.
  </description>
"

chat <- chat_anthropic()
#> Using model = "claude-sonnet-4-20250514".
str(chat$chat_structured(prompt, type = type_characteristics))
#>  list()

This example only works with Claude, not GPT or Gemini, because only Claude supports adding additional, arbitrary properties.
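If you need something similar from a provider that doesn’t support arbitrary properties, one workaround (a sketch of my own, not part of the example above) is to ask for an array of name-value pairs instead, which every provider can represent:

type_characteristic <- type_object(
  name = type_string("Name of the characteristic."),
  value = type_string("Value of the characteristic.")
)
type_characteristics_kv <- type_array(items = type_characteristic)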

Example 6: Extracting data from an image

The final example comes from Dan Nguyen, whose write-up shows several other interesting applications. The goal is to extract structured data from this screenshot:

Screenshot of schedule A: a table showing assets and “unearned” income

Even without any descriptions, ChatGPT does pretty well:

type_asset <- type_object(
  asset_name = type_string(),
  owner = type_string(),
  location = type_string(),
  asset_value_low = type_integer(),
  asset_value_high = type_integer(),
  income_type = type_string(),
  income_low = type_integer(),
  income_high = type_integer(),
  tx_gt_1000 = type_boolean()
)
type_assets <- type_array(items = type_asset)

chat <- chat_openai()
#> Using model = "gpt-4.1".
image <- content_image_file("congressional-assets.png")
data <- chat$chat_structured(image, type = type_assets)
data
#>                             asset_name owner
#> 1  11 Zinfandel Lane - Home & Vineyard    JT
#> 2 25 Point Lobos - Commercial Property    SP
#>                              location asset_value_low asset_value_high
#> 1             St. Helena/Napa, CA, US         5000001         25000000
#> 2 San Francisco/San Francisco, CA, US         5000001         25000000
#>   income_type income_low income_high tx_gt_1000
#> 1 Grape Sales     100001     1000000      FALSE
#> 2        Rent     100001     1000000      FALSE

str(data)
#> 'data.frame':    2 obs. of  9 variables:
#>  $ asset_name      : chr  "11 Zinfandel Lane - Home & Vineyard" "25 Point Lobos - Commercial Property"
#>  $ owner           : chr  "JT" "SP"
#>  $ location        : chr  "St. Helena/Napa, CA, US" "San Francisco/San Francisco, CA, US"
#>  $ asset_value_low : int  5000001 5000001
#>  $ asset_value_high: int  25000000 25000000
#>  $ income_type     : chr  "Grape Sales" "Rent"
#>  $ income_low      : int  100001 100001
#>  $ income_high     : int  1000000 1000000
#>  $ tx_gt_1000      : logi  FALSE FALSE

Advanced data types

Now that you’ve seen a few examples, it’s time to get into more specifics about data type declarations.

Required vs optional

By default, all components of an object are required. If you want to make some optional, set required = FALSE. This is a good idea if you don’t expect your text to always contain every field, as LLMs may hallucinate data in order to fulfill your spec.

For example, here the LLM invents a value for the date field even though there isn’t one in the text:

type_article <- type_object(
  "Information about an article written in markdown",
  title = type_string("Article title"),
  author = type_string("Name of the author"),
  date = type_string("Date written in YYYY-MM-DD format.")
)

prompt <- "
  Extract data from the following text:

  <text>
  # Structured Data
  By Hadley Wickham

  When using an LLM to extract data from text or images, you can ask the chatbot to nicely format it, in JSON or any other format that you like.
  </text>
"

chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$chat_structured(prompt, type = type_article)
#> $title
#> [1] "Structured Data"
#> 
#> $author
#> [1] "Hadley Wickham"
#> 
#> $date
#> [1] ""

Note that I’ve used a more explicit prompt here. For this example, I found that this generated better results, and that the prompt is a useful place to put additional instructions.

If I let the LLM know that the fields are all optional, it’ll return NULL for the missing fields:

type_article <- type_object(
  "Information about an article written in markdown",
  title = type_string("Article title", required = FALSE),
  author = type_string("Name of the author", required = FALSE),
  date = type_string("Date written in YYYY-MM-DD format.", required = FALSE)
)
chat$chat_structured(prompt, type = type_article)
#> $title
#> [1] "Structured Data"
#> 
#> $author
#> [1] "Hadley Wickham"
#> 
#> $date
#> NULL

Data frames

If you want to define a data frame-like object, you might be tempted to create a definition similar to what R uses: an object (i.e., a named list) made up of multiple vectors (i.e., arrays):

type_my_df <- type_object(
  name = type_array(items = type_string()),
  age = type_array(items = type_integer()),
  height = type_array(items = type_number()),
  weight = type_array(items = type_number())
)

This, however, is not quite right because there’s no way to specify that each array should have the same length. Instead, you’ll need to turn the data structure “inside out” and create an array of objects:

type_my_df <- type_array(
  items = type_object(
    name = type_string(),
    age = type_integer(),
    height = type_number(),
    weight = type_number()
  )
)

If you’re familiar with the terms row-oriented and column-oriented data frames, this is the same idea. Since most languages don’t possess vectorisation like R, row-oriented structures tend to be much more common in the wild.
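Because ellmer converts arrays of objects into data frames (as described above), a sketch like the following (with a hypothetical prompt; output not shown) returns a familiar column-oriented data frame:

chat$chat_structured(
  "Generate fake data for three people: name, age, height in cm, and weight in kg.",
  type = type_my_df
)
# Returns a data frame with columns name, age, height, and weight,
# one row per person.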

Token usage

provider    model             input  output  price
OpenAI      gpt-4.1            9325    1899  $0.03
OpenAI      gpt-4.1-nano        501     108  $0.00
Anthropic   claude-sonnet-4    1283    2043  $0.03