LLMs vs Schema: Why Parsing Fails – and How to Fix It

Brian Hansford

TL;DR

  • LLMs can read JSON-LD, but they don’t validate Schema.org rules.

  • Most failures are pipeline issues: discovery → syntax → graph → trust.

  • Ship server-rendered JSON-LD, stable @id links, tight graphs, and deterministic validation in CI. If your facts aren’t seen, linked, and trusted, you won’t be cited.

What it means when someone says “LLMs can’t parse your schema”

When folks say “LLMs can’t parse your schema,” they’re not claiming JSON-LD or Schema.org is broken. They’re pointing at the AI ingestion pipeline: crawler → fetcher → parser → entity linker → trust scorer – and noting that somewhere along that conveyor belt your facts weren’t seen, couldn’t be read, didn’t resolve to known entities, or weren’t trusted.

LLMs aren’t deterministic schema validators. They can read JSON-LD as text, but they don’t reliably enforce the formal rules your schema requires. And if your JSON-LD is unseen, malformed, unlinked, or contradictory, the ingestion pipeline discards it, even when an LLM could ‘read’ the text. Your job is to fix both the plumbing and the guardrails.

Why This Matters Now

Generative systems turn the web into answers. If your facts don’t make it into those systems, your brand fades from AI Overviews, assistant responses, and zero-click results. JSON-LD is your machine-readable business card, but it only works when pipelines can find it, parse it, connect it, and believe it – and when your generation process doesn’t produce pretty-but-invalid schema.

Four Failure Modes Behind “LLMs Can’t Parse Your Schema”

1) Discovery failure: the crawler never saw your JSON-LD

  • Blocked/throttled AI user-agents in robots.txt or WAF rules.
  • Client-side only insertion (after JavaScript, hydration, or tag managers); many crawlers don’t execute JS consistently.
  • Wrong place/type: markup in comments, Shadow DOM, wrong MIME type, or in an external file the crawler never fetched.
  • Conflicting duplicates across templates; the pipeline shrugs and moves on.

Fix the plumbing

  • Render one authoritative <script type="application/ld+json"> block per entity per URL in the initial HTML (no tag-manager injection).
  • Confirm AI user-agents can fetch the page without executing JavaScript and without WAF blocks or stray 429 responses.
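
To check the first bullet without a browser, you can parse the HTML exactly as the server returned it and see whether any JSON-LD is there. The sketch below is one way to do that with Python’s standard library; the class and function names are illustrative, and it assumes you have already fetched the raw HTML by some other means.

```python
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = dict(attrs).get("type") or ""
            if type_attr.strip().lower() == "application/ld+json":
                self._in_jsonld = True
                self.blocks.append("")

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def jsonld_blocks(raw_html):
    """Return the JSON-LD script bodies present in the HTML as served (no JS run)."""
    parser = JsonLdExtractor()
    parser.feed(raw_html)
    parser.close()
    return [block.strip() for block in parser.blocks]
```

An empty result for a page you expect to carry schema usually means the markup is injected client-side. Markup that only appears inside an HTML comment is not counted, which matches the “wrong place” failure above.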

2) Syntax failure: machines bail before semantics

  • Invalid JSON (smart quotes, trailing commas, comments, mixed encodings).
  • Wrong @context/@type; properties outside vocabulary.
  • Arrays vs. objects flipped; malformed or relative-only URLs.

Fix the bytes

  • Treat JSON-LD like code: lint it with a strict JSON parser before any “rich result” tool.
  • Keep the syntax ASCII-clean (straight quotes, no invisible characters), serve it in one encoding, and minify it.
  • Validate types/properties against Schema.org.
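
A strict lint pass can be very small. The sketch below uses Python’s built-in json module, which already rejects smart quotes, trailing commas, and comments, and adds two extra checks for duplicate keys and non-standard constants such as NaN. The function name and the exact messages are illustrative, and the @context check is a deliberately loose assumption (it only looks for “schema.org” somewhere in the context value).

```python
import json


def _reject_duplicate_keys(pairs):
    # json.loads silently keeps the last duplicate key; a strict lint should not.
    keys = [key for key, _ in pairs]
    duplicates = {key for key in keys if keys.count(key) > 1}
    if duplicates:
        raise ValueError(f"duplicate keys: {sorted(duplicates)}")
    return dict(pairs)


def _reject_constant(name):
    # Called for NaN, Infinity and -Infinity, which are not valid JSON.
    raise ValueError(f"non-standard JSON constant: {name}")


def lint_jsonld(text):
    """Return a list of problems; an empty list means the block passed these checks."""
    try:
        data = json.loads(
            text,
            object_pairs_hook=_reject_duplicate_keys,
            parse_constant=_reject_constant,
        )
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return [f"invalid JSON: {exc}"]

    problems = []
    nodes = data if isinstance(data, list) else [data]
    for index, node in enumerate(nodes):
        if not isinstance(node, dict):
            problems.append(f"node {index}: expected an object")
            continue
        if "schema.org" not in json.dumps(node.get("@context")):
            problems.append(f"node {index}: missing or non-Schema.org @context")
        if "@type" not in node and "@graph" not in node:
            problems.append(f"node {index}: missing @type or @graph")
    return problems
```

This only covers the byte-level and top-level checks; confirming that each property belongs to its type still needs a vocabulary-aware validator.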

3) Semantic/graph failure: valid JSON-LD that says nothing useful

  • No stable IDs: nodes lack persistent @id IRIs; the graph can’t resolve.
  • Orphaned nodes in @graph with no link to the primary entity.
  • Vague types (Thing) instead of specific (Organization, Product, Service).
  • Missing disambiguators: no sameAs, unclear ownership between Organization and Services/Products.

Fix the graph

  • Assign stable @id IRIs to each major entity (e.g., https://example.com/#org) and link related nodes through those IDs using provider, brand, or publisher relationships.
  • Add sameAs to strong, consistent profiles to anchor your entity.
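
The graph checks above can be automated in a few lines. The following sketch takes the nodes of one page’s @graph and reports nodes without an @id and nodes that have no path to the primary entity. The function name is hypothetical, and it makes two simplifying assumptions: links are treated as undirected, and references to entities defined on other pages are ignored rather than flagged.

```python
def graph_problems(nodes, primary_id):
    """nodes: list of JSON-LD node dicts (e.g. the contents of @graph)."""
    defined = {node["@id"] for node in nodes if "@id" in node}
    problems = [
        f"node without @id: {node.get('@type', 'unknown type')}"
        for node in nodes
        if "@id" not in node
    ]
    if primary_id not in defined:
        return problems + [f"primary entity {primary_id} is not defined"]

    def references(value):
        """Yield every @id found anywhere inside a property value."""
        if isinstance(value, dict):
            if "@id" in value:
                yield value["@id"]
            for inner in value.values():
                yield from references(inner)
        elif isinstance(value, list):
            for item in value:
                yield from references(item)

    # Build an undirected adjacency map between nodes defined on this page.
    neighbours = {node_id: set() for node_id in defined}
    for node in nodes:
        source = node.get("@id")
        if source is None:
            continue
        properties = [value for key, value in node.items() if key != "@id"]
        for target in references(properties):
            if target in defined:
                neighbours[source].add(target)
                neighbours[target].add(source)

    # Walk outward from the primary entity; anything unreached is orphaned.
    seen, queue = {primary_id}, [primary_id]
    while queue:
        current = queue.pop()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)

    problems += [f"orphaned node: {node_id}" for node_id in sorted(defined - seen)]
    return problems
```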

4) Contradiction/credibility failure: the pipeline doesn’t trust it

  • Markup conflicts with visible page copy, OpenGraph/Twitter meta, or other JSON-LD (different names, prices, or phone numbers).
  • Boilerplate stuffed with marketing fluff instead of concise facts.
  • Graph sprawl: huge @graphs full of half-populated nodes.

Fix the trust signals

  • Keep schema tight, factual, and consistent with what users see.
  • One site-wide Organization; per-page entities (Product, Service, Article, Event) that match the page’s purpose.
  • Prefer fewer, higher-quality facts over sprawling graphs.
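
Contradictions are easy to detect once the facts from each source sit side by side. The sketch below assumes you have already extracted a few fields (name, price, phone, and so on) from the JSON-LD, the OpenGraph tags, and the visible copy into plain dictionaries; the source labels and the function name are illustrative. It normalises whitespace and case, then reports any field whose sources disagree.

```python
def conflicting_facts(sources):
    """sources: {source name: {field: value}}. Returns fields with more than one value."""

    def normalise(value):
        # Collapse whitespace and ignore case so cosmetic differences don't count.
        return " ".join(str(value).split()).casefold()

    seen = {}  # field -> {normalised value: [source names]}
    for source_name, facts in sources.items():
        for field, value in facts.items():
            seen.setdefault(field, {}).setdefault(normalise(value), []).append(source_name)

    return {field: values for field, values in seen.items() if len(values) > 1}


page = {
    "json-ld": {"name": "Acme Robotics", "price": "1999"},
    "open-graph": {"name": "acme  robotics"},
    "visible-copy": {"name": "Acme Robotics", "price": "2499"},
}
conflicts = conflicting_facts(page)
```

Here the name agrees across all three sources after normalisation, while the price is reported as a conflict between the markup and the visible copy.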

LLMs vs. Parsers: Who Does What?

LLMs are phenomenal at language understanding and pattern matching. They are not deterministic schema validators. That distinction drives a lot of confusion.

| Task | What LLMs do well | Where “can’t parse” shows up |
| --- | --- | --- |
| Parsing | Read JSON-LD text and extract fields. | Don’t enforce schema rules; treat JSON as text unless a formal tool is in the loop. |
| Schema semantics | Recognize labels like Recipe or Service. | Don’t natively enforce formal constraints (e.g., prepTime must be an ISO 8601 duration such as PT30M). |
| Validation | Notice obvious typos or missing values sometimes. | Can’t reliably confirm required properties/types per Schema.org profiles. |
| Reasoning | Summarize or infer from surrounding copy. | Error-prone for rule-based filters (“find Product with reviewCount > 10”) vs. a database/graph query. |

Implication: LLMs can generate or read schema-looking JSON, but you still need deterministic checks to guarantee compliance with the vocabulary’s rules, data types, and relationships.

Going Deeper: To enforce Schema.org’s rules deterministically, you need tools like JSON Schema (for basic structure and type checking), or the more advanced SHACL (Shapes Constraint Language) or ShEx (Shape Expressions) for complex graph shape validation. These are what should be wired into your CI/CD pipeline.
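
To make “deterministic” concrete, here is a stdlib-only stand-in for what a JSON Schema or SHACL shape would express. It is a sketch, not a replacement for those tools. The RULES table is hypothetical: Schema.org itself does not mark properties as required, so the required lists stand in for whichever profile you validate against, and the code assumes @type is a single string.

```python
import re

# ISO 8601 durations such as PT30M or P1DT2H; the week form (P2W) is matched separately.
_DURATION = re.compile(
    r"^P(?=\d|T\d)(?:\d+Y)?(?:\d+M)?(?:\d+D)?"
    r"(?:T(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?$"
)
_WEEKS = re.compile(r"^P\d+W$")
_PRICE = re.compile(r"^\d+(\.\d+)?$")   # plain decimal, no symbols or separators
_CURRENCY = re.compile(r"^[A-Z]{3}$")   # three-letter code such as USD

# Hypothetical per-type rules; replace with the profile you actually target.
RULES = {
    "Recipe": {"required": ["name", "recipeIngredient"],
               "durations": ["prepTime", "cookTime", "totalTime"]},
    "Offer": {"required": ["price", "priceCurrency"], "durations": []},
}


def validate_node(node):
    """Return a list of rule violations for one JSON-LD node."""
    node_type = node.get("@type")
    rules = RULES.get(node_type) if isinstance(node_type, str) else None
    if rules is None:
        return [f"no rules defined for type {node_type!r}"]

    errors = [
        f"{node_type}: missing required property {name}"
        for name in rules["required"]
        if name not in node
    ]
    for name in rules["durations"]:
        value = node.get(name)
        if value is None:
            continue
        valid = isinstance(value, str) and (_DURATION.match(value) or _WEEKS.match(value))
        if not valid:
            errors.append(f"{node_type}.{name}: {value!r} is not an ISO 8601 duration")
    if "price" in node and not _PRICE.match(str(node["price"])):
        errors.append(f"{node_type}.price: {str(node['price'])!r} is not a plain decimal number")
    if "priceCurrency" in node and not _CURRENCY.match(str(node["priceCurrency"])):
        errors.append(f"{node_type}.priceCurrency: {str(node['priceCurrency'])!r} is not a three-letter code")
    return errors
```

The point of the design is that the same input always produces the same verdict, which is what a prompt-based “does this look right?” check cannot promise.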

 

Practical Risks Without Guardrails

  • Schema generation via LLMs can produce syntactically valid JSON that violates type constraints, omits required properties, or formats durations/prices incorrectly.
  • Data validation done only by “it looks right” prompts leads to quiet failures: pipelines down-weight or ignore your markup.
  • Complex reasoning (filters, joins, thresholds) belongs to databases or graph engines, not free-form language models.
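
As an illustration of the last point, the “find Product with reviewCount > 10” filter is a few lines of ordinary code once the JSON-LD is parsed. This is a minimal sketch with a hypothetical function name; it assumes reviewCount sits on the node’s aggregateRating and skips nodes where the count is missing or not a number.

```python
def products_with_min_reviews(nodes, threshold):
    """Return the names of Product nodes whose reviewCount exceeds the threshold."""
    matches = []
    for node in nodes:
        if node.get("@type") != "Product":
            continue
        rating = node.get("aggregateRating")
        if not isinstance(rating, dict):
            continue
        try:
            count = int(rating.get("reviewCount", 0))
        except (TypeError, ValueError):
            continue  # unparseable count: treat as no evidence, not as a match
        if count > threshold:
            matches.append(node.get("name"))
    return matches
```

In a larger catalog the same filter would normally be a database or graph query; the logic stays exact either way.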

Mitigation

  • Wire validators into CI: reject invalid JSON, wrong types/properties, or missing required fields. Schema that ‘looks right’ isn’t good enough; fail the build rather than ship “pretty but wrong.”
  • Use tests for canonical pages (e.g., homepage Organization, top Product/Service pages).
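
A canonical-page test can be as simple as a map from URL to the @type that page must expose. The sketch below assumes the JSON-LD text for each page has already been extracted from the built HTML; the function names and the URL-to-type mapping are illustrative, not part of any particular CI system.

```python
import json
import sys


def check_canonical_pages(pages, expected_types):
    """pages: {url: JSON-LD text}; expected_types: {url: @type that must be present}."""
    failures = []
    for url, expected in expected_types.items():
        text = pages.get(url)
        if text is None:
            failures.append(f"{url}: no JSON-LD found")
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            failures.append(f"{url}: invalid JSON ({exc.msg})")
            continue

        # Flatten top-level objects, arrays, and @graph containers into one node list.
        nodes = []
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict):
                nodes.extend(item.get("@graph", [item]))

        found = set()
        for node in nodes:
            if isinstance(node, dict):
                node_type = node.get("@type", [])
                found.update([node_type] if isinstance(node_type, str) else node_type)

        if expected not in found:
            failures.append(f"{url}: expected a {expected} node, found {sorted(found)}")
    return failures


def ci_exit_code(pages, expected_types):
    """Print failures to stderr and return a process exit code (non-zero fails the build)."""
    failures = check_canonical_pages(pages, expected_types)
    for failure in failures:
        print(failure, file=sys.stderr)
    return 1 if failures else 0
```

In a pipeline, the last line of the script would be something like sys.exit(ci_exit_code(pages, expected_types)), so any failure stops the deploy.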

 

“Good Bones” Examples

1. Site-wide Organization (homepage or layout template)

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://example.com/#org",
  "name": "Acme Robotics",
  "url": "https://example.com/",
  "logo": "https://example.com/assets/logo.png",
  "sameAs": [
    "https://www.linkedin.com/company/acme-robotics/",
    "https://x.com/acmerobotics"
  ],
  "contactPoint": [{
    "@type": "ContactPoint",
    "contactType": "sales",
    "email": "sales@example.com"
  }]
}
</script>

 

2. Service page linked to the Organization

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Service",
  "@id": "https://example.com/services/workflow-automation/#service",
  "name": "Workflow Automation for Enterprise Teams",
  "provider": { "@id": "https://example.com/#org" },
  "areaServed": "US",
  "audience": { "@type": "BusinessAudience", "name": "Enterprise" },
  "offers": {
    "@type": "Offer",
    "price": "1999",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock",
    "url": "https://example.com/pricing/"
  },
  "url": "https://example.com/services/workflow-automation/"
}
</script>

Notes

  • Stable @id IRIs let pipelines resolve relationships.
  • Keep properties factual, specific, and aligned with visible copy.
  • Don’t over-scope: include only entities the page truly presents.

 

Anti-Patterns That Quietly Nuke Parsing

  • Multiple Organizations across templates (“Marketing Site LLC” vs. “MSite, Inc.”).
  • JS-injected schema from a tag manager that sometimes fails to fire.
  • Microdata + JSON-LD telling different stories.
  • Copy-pasted examples with placeholder values (“Your Company”).
  • Graph sprawl: dozens of loosely defined nodes per page.

 

Five-Minute Triage

  1. Fetch it like a bot (discovery test)

curl -s -A GPTBot https://example.com/ | grep -i 'ld+json' || echo 'No JSON-LD found'

If this prints ‘No JSON-LD found’, you’re injecting via JS or blocking the crawler.

  2. Validate the JSON
    Run it through a strict JSON parser (catch encoding/quote/comma errors) before schema tools.
  3. Check types & properties
    Ensure @type is specific; confirm every property belongs to that type.
  4. Check the graph
    Stable @id for each major node. Link Service/Product back to Organization.
  5. Check alignment
    Spot-check facts (name, phone, price, durations) against visible copy and meta tags.

Architecture That Pays Off

  • Server-render schema in templates; avoid client-only injection.
  • One Organization site-wide with a stable @id; reuse that ID in related nodes.
  • Per-page entities that match page intent (Article vs. Product vs. Service).
  • Short, factual fields – numbers, names, dates, URLs – over prose.
  • Consistent naming across site, socials, directories.
  • Change control: when marketing updates prices or contacts, update schema and page copy in the same PR.
  • Deterministic validation in CI/CD; fail builds on schema errors.

 

FAQ

Is JSON-LD better than microdata for LLMs?
Usually yes in practice: it’s easier to keep clean and server-rendered, reducing discovery and syntax failures.

Do I need a massive @graph on every page?
No. Smaller, accurate graphs beat bloated ones. Include only entities the page truly represents.

Will perfect schema guarantee AI citations?
No. Schema improves discoverability, clarity, and trust, but citations also depend on content quality, corroboration across the web, and the assistant’s product logic.

Can I let an LLM “validate” my schema?
Use LLMs for drafting, never for final validation. Always run deterministic checks (JSON validators, schema validators, SHACL/ShEx/JSON Schema).

What about durations, prices, and required fields?
These are classic failure points. For example, prepTime on a Recipe must be ISO 8601 (“PT30M”). Don’t rely on LLMs to enforce that. Validate it.