Cloudflare Content Signals Policy for robots.txt

Brian Hansford
[Image: robots.txt showing Content-signal: search=yes, ai-train=no, ai-input=no, which allows classic search while blocking AI training and AI inputs.]


Cloudflare’s Content Signals in robots.txt: Precise Controls for Search, AI Training, and AI Inputs

Summary: Traditional robots.txt tells crawlers where they can go. Cloudflare’s Content Signals Policy adds a simple, machine-readable line that allows you to specify how your content may be used after access, including classic search indexing, AI training, and real-time AI inputs (e.g., AI Overviews, RAG). Signals are preferences, not hard blocks, so pair them with enforcement (WAF/Bot tools). Industry support and standards are evolving.

Why this matters (for SEOs and digital marketers)

  • Zero-click generative answers are here. Content is ingested to power AI summaries and assistants, often without a click-through. Publishers need a way to allow search while restricting AI reuse.

  • robots.txt alone is too blunt. It regulates crawling, not secondary uses (training, AI answers). Cloudflare’s Content Signals add that missing layer in a standard place (robots.txt) with both human-readable comments and machine-readable directives.

 

What Cloudflare actually added to robots.txt

Cloudflare introduces three content signals you can express under any User-agent group:

  • search — permission to build a search index and show links/snippets. Excludes AI-generated search summaries.

  • ai-input — permission to use your content as input to models (e.g., RAG/grounding/AI Overviews).

  • ai-train — permission to use your content to train or fine-tune models.

Cloudflare describes Content Signals as “a new addition to robots.txt” that expresses preferences about use after access. It’s not a ratified protocol extension. (Source: The Cloudflare Blog.)

User-agent: *
Content-signal: search=yes, ai-train=no, ai-input=no
Allow: /
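To make the format concrete, here is a minimal Python sketch that reads Content-signal lines and groups them by User-agent. It is not an official Cloudflare parser; the function name parse_content_signals is hypothetical, and the grouping logic assumes the usual robots.txt convention that consecutive User-agent lines share one group. A signal that is absent from the result means no preference was stated.

```python
from typing import Dict, List


def parse_content_signals(robots_txt: str) -> Dict[str, Dict[str, bool]]:
    """Map each User-agent token to its Content-signal preferences.

    Hypothetical helper for illustration only; not part of any Cloudflare API.
    """
    result: Dict[str, Dict[str, bool]] = {}
    current_agents: List[str] = []
    in_rules = False  # True once the current group has a non-User-agent line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop human-readable comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            result.setdefault(value.lower(), {})
            continue

        in_rules = True
        if field != "content-signal":
            continue
        for item in value.split(","):
            if "=" not in item:
                continue
            name, setting = (s.strip().lower() for s in item.split("=", 1))
            if setting in ("yes", "no"):
                for agent in current_agents:
                    result[agent][name] = setting == "yes"
    return result


example = """\
User-agent: *
Content-signal: search=yes, ai-train=no, ai-input=no
Allow: /
"""
print(parse_content_signals(example))
```

Returning booleans keyed by signal name keeps "no" distinct from "not stated", which matters because an omitted signal expresses no preference either way.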

 

Quick start: minimal, safe defaults

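If you want classic search but no model training or AI inputs, scope a global rule: a User-agent: * group containing Content-signal: search=yes, ai-train=no, ai-input=no, followed by Allow: /.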

  • The Content-signal line is machine-readable; comments above it (optional) are human-readable policy text.

  • You can set different signals inside other User-agent groups if you need per-bot exceptions.
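Per-bot exceptions can be sketched as a small Python generator that renders one group per crawler. This is an illustration under assumptions: render_robots is a hypothetical helper, and ExampleSearchBot is a made-up user-agent name standing in for whichever crawler you want to treat differently.

```python
from typing import Dict, List, Tuple

# (user agents, content signals, crawl rules) for one robots.txt group
Group = Tuple[List[str], Dict[str, bool], List[str]]


def render_robots(groups: List[Group]) -> str:
    """Render groups as robots.txt text, one Content-signal line per group."""
    blocks = []
    for agents, signals, rules in groups:
        lines = [f"User-agent: {agent}" for agent in agents]
        if signals:
            pairs = ", ".join(
                f"{name}={'yes' if allowed else 'no'}"
                for name, allowed in signals.items()
            )
            lines.append(f"Content-signal: {pairs}")
        lines.extend(rules)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


policy: List[Group] = [
    # Global default: classic search only.
    (["*"], {"search": True, "ai-train": False, "ai-input": False}, ["Allow: /"]),
    # Hypothetical exception: one named crawler may also use content as AI input.
    (["ExampleSearchBot"],
     {"search": True, "ai-train": False, "ai-input": True},
     ["Allow: /"]),
]
print(render_robots(policy))
```

Generating the file from one policy table makes it harder for the global group and the per-bot groups to drift apart when you change your mind later.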

 

Managed robots.txt defaults (Cloudflare)

If you enable Cloudflare’s managed robots.txt, Cloudflare prepends their policy and, by default, sets:

User-Agent: *
Content-signal: search=yes, ai-train=no
Allow: /

Cloudflare intentionally doesn’t set ai-input for you (they don’t assume your preference). The managed block also includes Disallow: rules for many AI crawler user-agents (e.g., GPTBot, Google-Extended, ClaudeBot), which you can customize. (Source: Cloudflare Docs.)

Note: Some sites see Google Search Console warnings like “Syntax not understood” for newer directives; Cloudflare reports no crawl rate impact from these warnings.
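One way to see why an unrecognized directive need not change crawl behavior is to run a robots.txt file through an existing parser. The sketch below uses Python's standard-library urllib.robotparser, which skips fields it does not know. It says nothing about how Google's own parser behaves, and the example.com URL and the file contents are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: a global group with a Content-signal line, plus one
# Disallow group for an AI crawler named in Cloudflare's managed block.
ROBOTS_TXT = """\
User-agent: *
Content-signal: search=yes, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The Content-signal line is ignored; Allow and Disallow still apply.
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
```

Note that this parser matches on the crawler's product token (for example GPTBot), not on a full browser-style User-Agent header.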

 

Signals are preferences – here’s how to enforce them

  • Signals ≠ blocks. Cloudflare is explicit: some companies may ignore signals. Combine signals with WAF/Bot Management or the AI Crawl Control feature to block, allow, or even return 402 (Payment Required) for specific AI crawlers.

  • AI Crawl Control uses WAF custom rules under the hood (order of precedence matters). You can configure per-crawler actions and response codes (403 or 402) and track robots.txt violations in the dashboard. 

  • Cloudflare’s recent push includes default blocking of known AI crawlers for new domains and a Pay-Per-Crawl model in private beta – evidence the stack is moving beyond voluntary compliance. 
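The enforcement idea above (allow, block with 403, or charge with 402 per crawler) can be sketched as a small decision function. This is not Cloudflare's implementation or API: CRAWLER_POLICY, ACTION_STATUS, and decide are hypothetical names, the crawler-to-action mapping is an arbitrary example, and matching on the User-Agent header alone is weak because that header can be spoofed.

```python
from typing import Dict, Optional, Tuple

# Hypothetical policy table: lowercase User-Agent substring -> action.
CRAWLER_POLICY: Dict[str, str] = {
    "gptbot": "block",
    "claudebot": "charge",
    "googlebot": "allow",
}

# HTTP status returned for each action; None means serve the page normally.
ACTION_STATUS: Dict[str, Optional[int]] = {
    "allow": None,
    "block": 403,   # Forbidden
    "charge": 402,  # Payment Required
}


def decide(user_agent: str) -> Tuple[str, Optional[int]]:
    """Return (action, status) for a request based on its User-Agent header."""
    ua = user_agent.lower()
    for token, action in CRAWLER_POLICY.items():
        if token in ua:
            return action, ACTION_STATUS[action]
    return "allow", None  # unknown clients pass through in this sketch


print(decide("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(decide("ClaudeBot/1.0"))
print(decide("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"))
```

In practice a managed rule engine can also weigh signals that are harder to forge than a header, which is why the sketch should be read as the shape of the decision rather than a complete control.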

 

Will Google (and others) honor this?

  • Trade press coverage is clear: Google has not committed to honoring Content Signals (e.g., for AI Overviews). Plan for uneven compliance and test. 

  • Real-world behavior varies. Recent investigations have shown that some AI agents evade controls (e.g., stealthy user agents, IP rotation), which underscores the need for technical enforcement.

 

Standards status (don’t over-promise yet)

  • Cloudflare says they’ll work in standards bodies; meanwhile, the IETF has chartered the AIPREF (AI Preferences) Working Group to standardize mechanisms for expressing AI usage preferences (e.g., drafts for HTTP-level signaling). This is a work in progress, not a finalized standard.

 

Recommended implementation pattern

  1. Decide your policy

    • Most publishers start with: search=yes, ai-train=no, then decide on ai-input depending on business model and tolerance for AI summaries. 

  2. Express it in robots.txt

    • Add Cloudflare’s comment block (optional but helpful) and your Content-signal line(s) under appropriate User-agent groups.

  3. Turn on enforcement

    • Use AI Crawl Control to block/allow/charge specific crawlers; set 403 or 402 responses and a human-readable message that points to your licensing terms. 

  4. Monitor & iterate

    • Watch robots.txt violations and request logs; adjust per-crawler rules. Cloudflare exposes violations per crawler in the dashboard. 
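If you also want to check your own request logs, the monitoring step can be approximated with Python's standard-library robots.txt parser. This is a rough sketch under assumptions: count_violations is a hypothetical helper, the log is assumed to be already reduced to (crawler token, path) pairs, and only Disallow violations are detectable this way. Whether fetched content was later used for training or AI answers cannot be seen in access logs.

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.robotparser import RobotFileParser


def count_violations(
    robots_txt: str, requests: Iterable[Tuple[str, str]]
) -> Counter:
    """Count requests per crawler that hit paths robots.txt disallows for it.

    `requests` yields (crawler_token, path) pairs, e.g. ("GPTBot", "/pricing").
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations: Counter = Counter()
    for crawler, path in requests:
        if not parser.can_fetch(crawler, path):
            violations[crawler] += 1
    return violations


robots = """\
User-agent: *
Content-signal: search=yes, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /
"""

# Assumed, pre-normalized log sample for illustration.
log_sample = [
    ("GPTBot", "/blog/post-1"),
    ("Googlebot", "/blog/post-1"),
    ("GPTBot", "/pricing"),
]
print(count_violations(robots, log_sample))
```

A crawler that keeps appearing in this count is a candidate for a stricter per-crawler rule.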

 

Code examples you can copy

Global allow-search, deny-training, deny-AI-inputs

User-agent: *
Content-signal: search=yes, ai-train=no, ai-input=no
Allow: /

Managed robots.txt (Cloudflare default when enabled)

# (Cloudflare prepends their policy comments here)
User-Agent: *
Content-signal: search=yes, ai-train=no
Allow: /
# …plus Disallow entries for many AI crawler UAs

 

FAQ 

Does search=yes permit AI summaries?
No. Cloudflare’s definition of search explicitly excludes AI-generated search summaries. Those are governed by ai-input instead.

Can I rely on Content Signals in robots.txt alone?
No. Signals are preferences, not technical controls. Pair with WAF/Bot Management or AI Crawl Control for real enforcement and analytics.

Will this hurt my SEO?
Not when configured correctly. You can allow classic search (search=yes) while denying ai-train and ai-input. Cloudflare notes GSC may show “Syntax not understood” warnings without affecting crawl behavior. 

Is this a web standard yet?
No. The IETF’s AIPREF group is actively working on standardizing AI usage preferences; treat Cloudflare’s approach as pragmatic, interim signaling.

 

Bottom line for technical SEOs

  • Adopt signals to clearly state allowed uses, then

  • Enforce with AI Crawl Control/WAF, then

  • Monitor violations and update per-crawler rules.

This lets you say “yes to search” while drawing a bright line on training and real-time AI reuse—without bluntly blocking all crawlers (and losing discoverability).