
Technical Guide

AI Search Visibility: Getting Cited by ChatGPT & Perplexity

Rogue AI · 2026-05-22 · 11 min read

[Image: a source document being cited in an AI answer panel]

A growing share of the people looking for what you do never see a list of blue links any more. They ask ChatGPT, Perplexity, or Google's AI Overview a question and read the answer it synthesises. If your site isn't one of the sources that answer is built from, you are invisible to them, no matter where you rank in classic search. Generative engine optimization (GEO) is the discipline of being the source the model quotes, and this guide covers what actually moves that needle, drawn from making this very portfolio citable.

GEO Is Not SEO, But It Rhymes

Classic SEO optimises a URL to rank in a list of links. GEO optimises your content to be retrieved, trusted, and quoted inside a synthesised answer that may never show a link at all. The user's journey ends at the answer, not at your page, so the win is being named as a source, not being clicked.

Most of the foundations carry over. The new part is narrow but real:

Carries over from SEO

Be crawlable, fast, technically clean, and authoritative. A page a search engine cannot index is a page an answer engine cannot cite. The base layer is the same one good SEO already builds.

New: citability

Can a model lift one correct, self-contained claim from your page without the surrounding context? Ranking rewards whole pages; citation rewards extractable sentences. They are not the same craft.

New: machine access

AI crawlers are a separate set of user-agents with their own rules. Whether you appear in an AI answer is decided partly by access choices that classic SEO never had to make.

Step Zero: Let the Right AI Crawlers In

You cannot be quoted by a system that was never allowed to read you. But "AI crawler" is not one thing, and the distinction decides whether blocking one quietly removes you from an answer surface.

Training crawlers: GPTBot, ClaudeBot, Google-Extended, CCBot

They gather corpora for model training. Blocking them is a content-licensing decision about whether your text trains future models, not, by itself, a decision about whether you show up in today's answers.

Retrieval crawlers: OAI-SearchBot, PerplexityBot, ChatGPT-User

These fetch live pages to answer a user's question right now. Block them and you opt out of being cited in ChatGPT search and Perplexity. This is the set most sites accidentally lock out.

The Google nuance most people get wrong

Google AI Overviews are served from the normal Googlebot index. Disallowing Google-Extended controls Gemini training only; it does not remove you from AI Overviews. If you are indexed by Google, you are eligible.

Decide per crawler. Do not inherit a blanket Disallow from a copied template, and do not blanket-allow without knowing what each agent does. A deliberate stance looks like this:

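# Invite the retrieval crawlers explicitly.
# A crawler follows only the most specific group that names it, so the
# sensitive path is repeated here; the * group below would not apply.
User-agent: OAI-SearchBot
Disallow: /private/
Allow: /

User-agent: PerplexityBot
Disallow: /private/
Allow: /

# Keep sensitive paths out of every other agent
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml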

One hard rule underneath all of it: whatever a crawler can fetch, it can quote. A noindex tag stops a page from being listed in search results; it does not stop a retrieval crawler from fetching the page and reading it into an answer. If a page should never be quoted, disallow it for these agents at the path level rather than merely de-indexing it, and put anything truly sensitive behind authentication, because robots.txt is honoured only by crawlers that choose to comply.
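A robots.txt stance is easy to get subtly wrong, so it can help to test it before deploying. The sketch below uses Python's standard `urllib.robotparser` to check which agents may fetch which URLs. The sample rules, the `access_matrix` helper, and the example.com URLs are illustrative assumptions, not part of any crawler's tooling. The sample repeats the `/private/` rule inside each named group, because a crawler that matches a named group generally ignores the `*` group.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only. Disallow is listed before Allow so that
# parsers applying rules in file order (such as Python's) and parsers using
# longest-match precedence reach the same answer.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /private/
Allow: /

User-agent: PerplexityBot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /private/
"""


def access_matrix(robots_txt, agents, urls):
    """Return {(agent, url): allowed} for every agent/url pair."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in agents
        for url in urls
    }


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "PerplexityBot", "GPTBot"]
    urls = [
        "https://example.com/blog/geo",
        "https://example.com/private/notes",
    ]
    for (agent, url), allowed in access_matrix(ROBOTS_TXT, agents, urls).items():
        print(f"{agent:15} {'allow' if allowed else 'block'}  {url}")
```

This only models how a compliant parser reads the file. It says nothing about whether a given crawler actually honours it, which is what the server logs are for.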

llms.txt: A Map You Hand to Language Models

llms.txt is an emerging convention: a single Markdown file at your site root that hands a model a clean, link-rich index of what you publish and where the canonical version of each thing lives. Where robots.txt tells a crawler what it may not touch, llms.txt tells it what is worth reading and how it fits together.

The pattern that works: treat llms.txt as a navigational index (a short description plus grouped links with a one-line summary each), with an optional llms-full.txt as the full-text companion that inlines the actual content for models that want it in one fetch.

Only list what you are happy to see quoted

These are plain, fetchable, AI-facing files. A page you keep out of your nav and sitemap for a reason but then link from llms.txt, or inline into llms-full.txt, is no longer hidden. The full-text file in particular re-publishes whatever you feed it; keep private or identity-bearing pages out of both.
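One way to keep the index and the "only what you are happy to see quoted" rule in sync is to generate llms.txt from a page list that carries an explicit flag. The following is a minimal sketch under that assumption. The `Page` type, its `quotable` field, the `build_llms_txt` helper, and all URLs are hypothetical names chosen for illustration; the output layout (a title, a short blockquote description, then grouped links with one-line summaries) follows the navigational-index pattern described above.

```python
from typing import NamedTuple


class Page(NamedTuple):
    title: str
    url: str
    summary: str
    section: str
    quotable: bool = True  # hypothetical flag: False keeps the page out of the file


def build_llms_txt(site_name, site_summary, pages):
    """Render a navigational llms.txt index as Markdown text."""
    lines = [f"# {site_name}", "", f"> {site_summary}"]

    # Group pages by section, skipping anything not cleared for quoting.
    sections = {}
    for page in pages:
        if page.quotable:
            sections.setdefault(page.section, []).append(page)

    for section, members in sections.items():
        lines += ["", f"## {section}", ""]
        lines += [f"- [{p.title}]({p.url}): {p.summary}" for p in members]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = [
        Page("GEO guide", "https://example.com/blog/geo",
             "How to get cited by answer engines", "Guides"),
        Page("Drafts", "https://example.com/private/drafts",
             "Internal notes", "Guides", quotable=False),
    ]
    print(build_llms_txt("Example Site", "Guides on AI search visibility.", pages))
```

Filtering at generation time means a page marked as not quotable cannot reach the file by accident.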

Write So a Model Can Lift a Single Paragraph

Retrieval works on chunks, not whole pages. The model pulls the few passages that match the question and builds an answer from them. Your job is to make each passage correct and complete on its own.

Answer first, then expand

Lead each section with the direct answer in one or two sentences, then add the nuance. Inverted pyramid. A model that finds the answer in the first line of a chunk quotes that line.

Self-contained sections

Avoid "as mentioned above" and pronouns that point at other paragraphs. Each section should make sense lifted out of the page entirely, because that is exactly what happens to it.

Concrete, checkable claims

"Cut document review from two hours to three minutes" is quotable; "significantly faster" is not. Specific numbers, dates, and names give a model something it can attribute and a reader something it can verify.

Headings as questions, comparisons as tables

Phrase headings the way people ask, because the heading is a retrieval signal. Put comparisons in tables and lists; structured data is easier to extract cleanly than the same facts buried in prose.
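Some of these habits can be checked mechanically. The sketch below is a rough lint for a single section treated as a standalone chunk: it flags back-references such as "as mentioned above" and vague comparatives such as "significantly faster". The phrase lists and the `lint_passage` name are assumptions to tune for your own content, and a clean result does not mean the passage is actually quotable.

```python
import re

# Illustrative phrase lists; extend them for your own writing habits.
BACK_REFERENCES = re.compile(
    r"\b(as (mentioned|noted|discussed|described) (above|earlier|previously)"
    r"|see (above|below)|the (former|latter))\b",
    re.IGNORECASE,
)
VAGUE_CLAIMS = re.compile(
    r"\b(significantly|dramatically|substantially|much) "
    r"(faster|slower|better|cheaper|more|less)\b",
    re.IGNORECASE,
)


def lint_passage(text):
    """Return warnings for one section read in isolation."""
    warnings = []
    for match in BACK_REFERENCES.finditer(text):
        warnings.append(f"back-reference: {match.group(0)!r}")
    for match in VAGUE_CLAIMS.finditer(text):
        warnings.append(f"vague claim: {match.group(0)!r}")
    return warnings


if __name__ == "__main__":
    sample = "As mentioned above, the new pipeline is significantly faster."
    for warning in lint_passage(sample):
        print(warning)
```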

Structured Data Is Machine-Readable Authorship

JSON-LD does not change what a human reads, but it tells a machine who said this, when, and on whose authority as explicit data rather than something it has to infer from your prose. For GEO the high-value types are few:

Organization or Person: your identity

One entity with a stable @id and a sameAs array linking your profiles. This is what lets a model treat your site, your LinkedIn, and your GitHub as one author.

Article or BlogPosting: your content

With author, publisher, datePublished, and dateModified. Recency and clear authorship are signals an answer engine weighs when it decides whom to trust.

speakable and FAQPage: the extractable parts

speakable marks the passages best suited to being read aloud; FAQPage turns genuine question-and-answer content into structured pairs a model can lift whole. Use them where they are true, not as decoration.

The discipline that matters most here is consistency: the same canonical @id for your entity on every page. Three pages claiming three different identities give a model nothing to consolidate. One identity, repeated, compounds.
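Generating the markup from one constant is a simple way to keep the entity `@id` identical on every page. The sketch below builds a JSON-LD `@graph` with a `Person` and a `BlogPosting` that points back to it by `@id`. The `ENTITY_ID` value, the author name, the profile URLs, and the helper names are placeholders assumed for illustration; the property names (`sameAs`, `author`, `publisher`, `datePublished`, `dateModified`) are standard schema.org vocabulary.

```python
import json

# Hypothetical canonical identifier: define it once, reuse it on every page.
ENTITY_ID = "https://example.com/#author"


def author_entity():
    """The single identity that every page refers to."""
    return {
        "@type": "Person",
        "@id": ENTITY_ID,
        "name": "Example Author",
        "url": "https://example.com/",
        "sameAs": [
            "https://www.linkedin.com/in/example-author",
            "https://github.com/example-author",
        ],
    }


def blog_posting(url, headline, published, modified):
    """Article markup that references the entity by @id instead of redefining it."""
    return {
        "@type": "BlogPosting",
        "@id": f"{url}#article",
        "headline": headline,
        "datePublished": published,
        "dateModified": modified,
        "author": {"@id": ENTITY_ID},
        "publisher": {"@id": ENTITY_ID},
        "mainEntityOfPage": url,
    }


def page_json_ld(url, headline, published, modified):
    """Serialise both nodes into one JSON-LD document."""
    graph = {
        "@context": "https://schema.org",
        "@graph": [author_entity(), blog_posting(url, headline, published, modified)],
    }
    return json.dumps(graph, indent=2)


if __name__ == "__main__":
    print(page_json_ld("https://example.com/blog/geo",
                       "AI Search Visibility", "2026-05-22", "2026-05-22"))
```

The resulting string would go inside a `<script type="application/ld+json">` element in the page head.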

Authority: Be an Entity, Not Just a Page

Answer engines bias hard toward sources they can attribute and trust. These are the same experience, expertise, authority, and trust signals classic search rewards, now read by a machine that wants a name to put next to a claim. The practical moves:

Put a real byline and a dated, maintained publication history on your content. Link your profiles consistently so sameAs resolves to one coherent entity. Cite your own sources: a page that shows its working is easier to trust and quote than one that merely asserts. And be specific about what you have actually done; demonstrated, dated experience is the signal that no amount of keyword tuning fakes.

Measuring Something You Cannot Rank-Track

There is no "position three" in a synthesised answer, so the old rank-tracking playbook does not transfer. You measure GEO from several noisier angles instead:

Ask the assistants directly

Run the questions your buyers ask through ChatGPT, Perplexity, and Gemini on a schedule. Are you named? Is what they say about you right? This is crude but it is the closest thing to a live ranking.

Watch referral traffic from AI surfaces

Citations carry links, and clicks on them show up as referrers from chatgpt.com, perplexity.ai, and similar. Small numbers today, but the trend line is the signal.

Read your server logs for AI user-agents

Are OAI-SearchBot and PerplexityBot actually fetching you, how often, and which pages? Crawl activity is the leading indicator that precedes any citation at all.

Monitor brand and claim mentions

Track where your name and your specific claims surface across the web, the corpus answer engines draw from. More accurate mentions in more places tend to come before more citations.

Be honest with yourself about the data: attribution here is genuinely immature in 2026. You will not get the precision of a classic analytics funnel, and anyone selling you a tidy "AI visibility score" is smoothing over that fact. Treat the numbers as a direction, not a dial.
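Reading server logs for AI user-agents can be sketched in a few lines. The example below counts hits per agent and path from access-log lines in the common "combined" format. The `AI_AGENTS` tuple, the regular expression, and the sample log lines are assumptions for illustration; real user-agent strings vary, and they can be spoofed, so the counts are approximate.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent field (names taken from this guide).
AI_AGENTS = ("OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
             "GPTBot", "ClaudeBot", "CCBot")

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format access log line. Adjust for your server's log format.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_hits(lines):
    """Return a Counter keyed by (agent, path) for lines from known AI agents."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        user_agent = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[(agent, match.group("path"))] += 1
                break
    return hits


if __name__ == "__main__":
    # Made-up sample lines; the user-agent strings are illustrative only.
    sample = [
        '203.0.113.5 - - [22/May/2026:10:00:00 +0000] "GET /blog/geo HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
        '203.0.113.7 - - [22/May/2026:10:02:00 +0000] "GET / HTTP/1.1" '
        '200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
    ]
    for (agent, path), count in count_ai_hits(sample).items():
        print(agent, path, count)
```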

Where to Start

You do not need a GEO platform or a new budget line. The first pass is deliberately small:

One. Set your crawler stance per agent in robots.txt, deliberately: allow the retrieval crawlers you want, and gate the paths you do not want quoted.

Two. Ship an llms.txt index. Add llms-full.txt only with content you are happy to see quoted verbatim.

Three. Rewrite your five most important pages answer-first, with concrete, self-contained claims and question-shaped headings.

Four. Add JSON-LD with one stable entity @id and a sameAs array, plus Article markup with real dates and author.

Five. Watch your logs for AI crawler hits, and ask the assistants your buyers' questions once a month. Iterate on what they get wrong.

GEO in 2026 is early, and that is the opportunity: the foundations are unglamorous and most sites still haven't done them. Be crawlable, be quotable, be a consistent and dated entity, and you are already ahead of every page still optimising only for a list of links nobody reads.

Quick Reference

AI crawler access: who fetches your site, where it shows up

Engine	Retrieval user-agent (live fetch)	Training user-agent (corpus)	Citation surface
ChatGPT web	OAI-SearchBot, ChatGPT-User	GPTBot	Source pills under the answer
Perplexity	PerplexityBot	None published (PerplexityBot is documented as a search crawler)	Inline numbered citation cards
Claude web	Claude-SearchBot, Claude-User	ClaudeBot	Inline links in the answer
Google AI Overviews	Googlebot (standard index)	Google-Extended (Gemini training only)	Collapsed source list, paraphrases often
Bing Copilot	Bingbot	(uses same crawler)	Numbered citations in sidebar
Apple Intelligence	Applebot	Applebot-Extended (robots.txt opt-out token for training use; does not crawl itself)	Cited inside AI features and on-device summaries

Citation behaviour by platform: qualitative tiers observed in 2025

Platform	Citation density	Direct-quote tendency	Practical note
Perplexity	Highest, almost every answer carries sources	Medium-high, short quoted passages	Most rewarding surface to optimise for first
ChatGPT Search	High, browse mode cites consistently	Medium, paraphrase with linked pills	Citation rate rose sharply through 2025
Bing Copilot	Medium-high, sidebar always shows sources	Medium, numbered references	Tight integration with Bing index
Claude web search	Medium, cites when fetching live URLs	Low-medium, links rather than quotes	Newer surface; behaviour still evolving
Google AI Overviews	Lower, often paraphrases without an obvious link-out	Low, sources frequently aggregated	Standard Google ranking still dominates eligibility

Frequently Asked Questions

What is Generative Engine Optimization (GEO)?

GEO is the discipline of making your content extractable, trustworthy, and quotable inside the synthesised answers AI engines like ChatGPT, Perplexity, and Google AI Overviews return, instead of in a ranked list of links. It overlaps with SEO at the foundations (crawlability, page speed, authority signals) but diverges on what wins: ranking rewards whole pages, citation rewards self-contained passages a model can lift on their own.

How do AI crawlers differ from Googlebot?

AI crawlers split into two families. Training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) fetch corpora for model training. Retrieval crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) fetch live pages to answer a user's question right now. Blocking a training crawler is a content-licensing decision; blocking a retrieval crawler removes you from that engine's answer surface today. Googlebot remains the standard index that also feeds Google AI Overviews.

Does noindex block AI crawlers?

No. The noindex meta tag stops a page from appearing in search-engine results, but retrieval crawlers can still fetch the page, read its content, and quote it inside an AI answer. If a page must not be cited by an AI engine, disallow it at the path level in robots.txt, and do not link it from anywhere a crawler would discover it, including llms.txt, sitemap, navigation, or JSON-LD author/publisher URLs.

What is llms.txt?

llms.txt is an emerging convention: a single Markdown file at your site root that hands language models a clean, link-rich index of what you publish. Treat it as a navigational index: a short description plus grouped links with a one-line summary each. An optional companion llms-full.txt inlines the full text of those pages for models that want them in one fetch. Anything you list in either file should be content you are happy to see quoted verbatim.

Should I block GPTBot to protect my content?

Decide deliberately, not by reflex. Blocking GPTBot stops your text from being added to OpenAI's training corpus for future models, which is a content-licensing decision. It does not, by itself, remove you from ChatGPT's current answers, which are served by the separate OAI-SearchBot retrieval crawler. Many sites accidentally block both by allow-listing only Googlebot; the GEO move is to make each choice deliberately, per user-agent.

How do you measure GEO without a ranking position?

Run the questions your buyers ask through ChatGPT, Perplexity, and Gemini on a schedule and record whether you are named. Watch your analytics for referrer traffic from chatgpt.com, perplexity.ai, copilot.microsoft.com, and similar. Read your server logs for AI crawler user-agents, because crawl activity is the leading indicator that precedes any citation. Treat the numbers as direction, not precision; AI-attribution tooling is genuinely immature in 2026.
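Watching referrer traffic can start as a small classifier over the referrer values your analytics or logs already record. The sketch below maps a referrer URL to an AI surface by hostname. The `AI_REFERRER_HOSTS` mapping and the `ai_surface` name are assumptions for illustration, limited to the three domains named above; extend the mapping as other surfaces appear in your own data.

```python
from urllib.parse import urlparse

# Hostnames taken from the domains named in this guide; extend as needed.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "copilot.microsoft.com": "Copilot",
}


def ai_surface(referrer):
    """Return the AI surface a referrer URL belongs to, or None."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, surface in AI_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain, but not lookalikes.
        if host == domain or host.endswith("." + domain):
            return surface
    return None


if __name__ == "__main__":
    for ref in ("https://chatgpt.com/",
                "https://www.perplexity.ai/search?q=geo",
                "https://www.google.com/"):
        print(ref, "->", ai_surface(ref))
```

Many visits arrive with no referrer at all, so this undercounts; it is a trend signal rather than a total.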