Technical Guide

AI Search Visibility: Getting Cited by ChatGPT & Perplexity

Rogue AI
11 min read

A growing share of the people looking for what you do never see a list of blue links any more. They ask ChatGPT, Perplexity, or Google's AI Overview a question and read the answer it synthesises. If your site is not one of the sources that answer is built from, you are invisible to them, no matter where you rank in classic search. Generative engine optimization (GEO) is the discipline of being the source the model quotes. This guide covers what actually moves that needle, drawn from the work of making this portfolio citable.

GEO Is Not SEO, But It Rhymes

Classic SEO optimises a URL to rank in a list of links. GEO optimises your content to be retrieved, trusted, and quoted inside a synthesised answer that may never show a link at all. The user's journey ends at the answer, not at your page — so the win is being named as a source, not being clicked.

Most of the foundations carry over. The new part is narrow but real:

Carries over from SEO

Be crawlable, fast, technically clean, and authoritative. A page a search engine cannot index is a page an answer engine cannot cite. The base layer is the same one good SEO already builds.

New: citability

Can a model lift one correct, self-contained claim from your page without the surrounding context? Ranking rewards whole pages; citation rewards extractable sentences. They are not the same craft.

New: machine access

AI crawlers are a separate set of user-agents with their own rules. Whether you appear in an AI answer is decided partly by access choices that classic SEO never had to make.

Step Zero: Let the Right AI Crawlers In

You cannot be quoted by a system that was never allowed to read you. But "AI crawler" is not one thing, and the distinction decides whether blocking one quietly removes you from an answer surface.

Training crawlers — GPTBot, ClaudeBot, Google-Extended, CCBot

They gather corpora for model training. Blocking them is a content-licensing decision about whether your text trains future models — not, by itself, a decision about whether you show up in today's answers.

Retrieval crawlers — OAI-SearchBot, PerplexityBot, ChatGPT-User

These fetch live pages to answer a user's question right now. Block them and you opt out of being cited in ChatGPT search and Perplexity. This is the set most sites accidentally lock out.

The Google nuance most people get wrong

Google AI Overviews are served from the normal Googlebot index. Disallowing Google-Extended limits how Gemini uses your content (model training and grounding in the Gemini apps); it does not remove you from AI Overviews or from Search. If you are indexed by Google, you are eligible.

Decide per crawler. Do not inherit a blanket Disallow from a copied template, and do not blanket-allow without knowing what each agent does. A deliberate stance looks like this:

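# Invite the retrieval crawlers explicitly.
# A crawler obeys only the most specific group that names it, so the
# sensitive-path rule is repeated inside each named group.
User-agent: OAI-SearchBot
Disallow: /private/
Allow: /

User-agent: PerplexityBot
Disallow: /private/
Allow: /

# Keep sensitive paths away from every other agent
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml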

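One hard rule underneath all of it: whatever a crawler can fetch, it can quote. A noindex tag keeps a page out of a search index; it does not stop a retrieval crawler from fetching the page and reading it into an answer. If a page should never be quoted, it must be unreachable to these agents at the path level, not merely de-indexed. For anything genuinely sensitive that means authentication rather than robots.txt alone, because robots.txt is a request that well-behaved crawlers honour, not an access control.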
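
One way to check a crawler stance before deploying it is to parse the rules locally and ask what each agent may fetch. The sketch below uses Python's standard-library urllib.robotparser. The two rule sets, the agent name SomeOtherBot, and the example.com URLs are invented for illustration. It also surfaces a detail that is easy to miss: a crawler with its own named group ignores the `User-agent: *` group entirely, so a wildcard Disallow does not protect a path from an agent you have invited by name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets for illustration; example.com is a placeholder.
# LEAKY relies on the wildcard group to protect /private/.
LEAKY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

# TIGHT repeats the Disallow inside the named group. It is listed before
# Allow because Python's parser applies the first matching rule in a group.
TIGHT = """\
User-agent: OAI-SearchBot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /private/
"""


def allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    private = "https://example.com/private/notes"
    public = "https://example.com/blog/post"
    for name, rules in (("leaky", LEAKY), ("tight", TIGHT)):
        for agent in ("OAI-SearchBot", "SomeOtherBot"):
            print(name, agent,
                  "private:", allowed(rules, agent, private),
                  "public:", allowed(rules, agent, public))
```

A check like this only tells you what a rule-abiding crawler would do. It says nothing about agents that ignore robots.txt, which is why sensitive paths still need real access control.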

llms.txt: A Map You Hand to Language Models

llms.txt is an emerging convention: a single Markdown file at your site root that hands a model a clean, link-rich index of what you publish and where the canonical version of each thing lives. Where robots.txt tells a crawler what it may not touch, llms.txt tells it what is worth reading and how it fits together.

The pattern that works: treat llms.txt as a navigational index — a short description plus grouped links with a one-line summary each — and an optional llms-full.txt as the full-text companion that inlines the actual content for models that want it in one fetch.

Only list what you are happy to see quoted

These are plain, fetchable, AI-facing files. A page you keep out of your nav and sitemap for a reason but then link from llms.txt — or inline into llms-full.txt — is no longer hidden. The full-text file in particular re-publishes whatever you feed it; keep private or identity-bearing pages out of both.
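
A small generator makes the "only list what you are happy to see quoted" rule hard to break by accident. The sketch below renders an index in the shape the llms.txt proposal describes: an H1 title, a blockquote summary, then H2 groups of links with a one-line description each. The Page class, the quotable flag, and every name and URL in it are hypothetical, and the convention itself is still settling, so treat the layout as an assumption rather than a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    title: str
    url: str
    summary: str
    section: str
    quotable: bool = True  # set False for anything that must stay out of the file


def build_llms_txt(site_name: str, description: str, pages: list[Page]) -> str:
    """Render a Markdown index: H1 title, blockquote summary, H2 link groups."""
    lines = [f"# {site_name}", "", f"> {description}"]
    sections: dict[str, list[Page]] = {}
    for page in pages:
        if page.quotable:  # non-quotable pages never reach the output
            sections.setdefault(page.section, []).append(page)
    for section, members in sections.items():
        lines += ["", f"## {section}", ""]
        lines += [f"- [{p.title}]({p.url}): {p.summary}" for p in members]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = [
        Page("GEO guide", "https://example.com/blog/geo",
             "How to get cited by answer engines.", "Guides"),
        Page("Client notes", "https://example.com/private/notes",
             "Internal only.", "Guides", quotable=False),
    ]
    print(build_llms_txt("Example Site", "Hypothetical portfolio used for illustration.", pages))
```

Making exclusion a field on the page record, rather than a separate blocklist, keeps the decision next to the content it governs. The same filter can feed llms-full.txt.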

Write So a Model Can Lift a Single Paragraph

Retrieval works on chunks, not whole pages. The model pulls the few passages that match the question and builds an answer from them. Your job is to make each passage correct and complete on its own.

Answer first, then expand

Lead each section with the direct answer in one or two sentences, then add the nuance. Inverted pyramid. A model that finds the answer in the first line of a chunk quotes that line.

Self-contained sections

Avoid "as mentioned above" and pronouns that point at other paragraphs. Each section should make sense lifted out of the page entirely, because that is exactly what happens to it.

Concrete, checkable claims

"Cut document review from two hours to three minutes" is quotable; "significantly faster" is not. Specific numbers, dates, and names give a model something it can attribute and a reader something they can verify.

Headings as questions, comparisons as tables

Phrase headings the way people ask — the heading is a retrieval signal. Put comparisons in tables and lists; structured data is easier to extract cleanly than the same facts buried in prose.
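
The self-containment rule is easy to state and easy to break, so it can help to lint for it. The sketch below splits a page on blank lines and flags passages that lean on text outside themselves. The phrase list and the pronoun check are crude, hand-picked heuristics chosen for illustration; expect false positives (a passage opening with "This guide" is flagged even though it is fine) and treat each warning as a prompt to reread, not a verdict.

```python
import re

# Phrases that point outside the current passage. A hand-picked list for
# illustration, not an exhaustive or validated set.
BACK_REFERENCES = re.compile(
    r"\b(as (mentioned|noted|discussed|described) (above|earlier|previously)"
    r"|see (above|below)|the (former|latter))\b",
    re.IGNORECASE,
)
# Openers whose referent often lives in a previous paragraph.
DANGLING_OPENERS = re.compile(r"^(this|that|these|those|it|they)\b", re.IGNORECASE)


def lint_passage(text: str) -> list[str]:
    """Return human-readable warnings for one paragraph-sized passage."""
    warnings = []
    match = BACK_REFERENCES.search(text)
    if match:
        warnings.append(f"back-reference: {match.group(0)!r}")
    if DANGLING_OPENERS.match(text.strip()):
        warnings.append("opens with a pronoun or demonstrative")
    return warnings


def lint_page(page: str) -> dict[int, list[str]]:
    """Split on blank lines and lint each passage; keys are passage indexes."""
    passages = [p for p in re.split(r"\n\s*\n", page) if p.strip()]
    report = {}
    for index, passage in enumerate(passages):
        found = lint_passage(passage)
        if found:
            report[index] = found
    return report


if __name__ == "__main__":
    sample = "GPTBot gathers training data.\n\nAs mentioned above, they respect robots.txt."
    print(lint_page(sample))
```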

Structured Data Is Machine-Readable Authorship

JSON-LD does not change what a human reads, but it tells a machine who said this, when, and on whose authority — as data, not something it has to infer from your prose. For GEO the high-value types are few:

Organization or Person — your identity

One entity with a stable @id and a sameAs array linking your profiles. This is what lets a model treat your site, your LinkedIn, and your GitHub as one author.

Article or BlogPosting — your content

With author, publisher, datePublished, and dateModified. Recency and clear authorship are signals an answer engine weighs when it decides whom to trust.

speakable and FAQPage — the extractable parts

speakable marks the passages safe to read aloud; FAQPage turns genuine question-and-answer content into structured pairs a model can lift whole. Use them where they are true, not as decoration.

The discipline that matters most here is consistency: the same canonical @id for your entity on every page. Three pages claiming three different identities give a model nothing to consolidate. One identity, repeated, compounds.
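
Generating the markup from one place is the simplest way to keep that @id consistent. The sketch below builds a Person node and a BlogPosting node that points back to it by @id, then wraps both in a JSON-LD script tag. The domain, the author name, the profile URLs, and the "#person" and "#article" fragment conventions are all placeholders picked for illustration; only the schema.org type and property names are real.

```python
import json

SITE = "https://example.com"     # placeholder domain
PERSON_ID = f"{SITE}/#person"    # the one canonical @id, reused on every page


def person_node() -> dict:
    """The single entity every page refers to."""
    return {
        "@type": "Person",
        "@id": PERSON_ID,
        "name": "Jane Example",  # hypothetical author
        "url": SITE,
        "sameAs": [
            "https://www.linkedin.com/in/jane-example",
            "https://github.com/jane-example",
        ],
    }


def blog_posting_node(path: str, headline: str, published: str, modified: str) -> dict:
    """One article, attributed to the entity by reference rather than by copy."""
    url = f"{SITE}{path}"
    return {
        "@type": "BlogPosting",
        "@id": f"{url}#article",
        "headline": headline,
        "url": url,
        "datePublished": published,  # ISO 8601 date strings
        "dateModified": modified,
        "author": {"@id": PERSON_ID},
        "publisher": {"@id": PERSON_ID},
    }


def json_ld_script(*nodes: dict) -> str:
    """Wrap nodes in a @graph and return an embeddable script tag."""
    graph = {"@context": "https://schema.org", "@graph": list(nodes)}
    # Escape "</" so a stray "</script>" inside a string cannot end the tag.
    body = json.dumps(graph, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">' + body + "</script>"


if __name__ == "__main__":
    print(json_ld_script(
        person_node(),
        blog_posting_node("/blog/geo", "GEO guide", "2026-01-10", "2026-02-01"),
    ))
```

Because author and publisher are references, changing a profile link means editing one function instead of hunting through every template.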

Authority: Be an Entity, Not Just a Page

Answer engines bias hard toward sources they can attribute and trust — the same experience, expertise, authority, and trust signals classic search rewards, read by a machine that wants a name to put next to a claim. The practical moves:

Put a real byline and a dated, maintained publication history on your content. Link your profiles consistently so sameAs resolves to one coherent entity. Cite your own sources — a page that shows its working is easier to trust and quote than one that asserts. And be specific about what you have actually done; demonstrated, dated experience is the signal that no amount of keyword tuning fakes.

Measuring Something You Cannot Rank-Track

There is no "position three" in a synthesised answer, so the old rank-tracking playbook does not transfer. You measure GEO from several noisier angles instead:

Ask the assistants directly

Run the questions your buyers ask through ChatGPT, Perplexity, and Gemini on a schedule. Are you named? Is what they say about you right? This is crude but it is the closest thing to a live ranking.

Watch referral traffic from AI surfaces

Citations carry links, and clicks on them show up as referrers from chatgpt.com, perplexity.ai, and similar. Small numbers today, but the trend line is the signal.

Read your server logs for AI user-agents

Are OAI-SearchBot and PerplexityBot actually fetching you, how often, and which pages? Crawl activity is the leading indicator that precedes any citation at all.

Monitor brand and claim mentions

Track where your name and your specific claims surface across the web — the corpus answer engines draw from. More accurate mentions in more places is the upstream cause of more citations.

Be honest with yourself about the data: attribution here is genuinely immature in 2026. You will not get the precision of a classic analytics funnel, and anyone selling you a tidy "AI visibility score" is smoothing over that fact. Treat the numbers as a direction, not a dial.
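
Of the angles above, server logs are the easiest to automate. The sketch below counts requests per AI agent and path. It assumes the common "combined" access-log layout and matches agents by substring in the User-Agent header; the sample lines, IP addresses, and user-agent strings are invented for illustration, so check them against what your own logs actually contain. User-agent strings can be spoofed, and verifying callers against each vendor's published IP ranges is left out here.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent header. Google-Extended is left
# out because it is a robots.txt token, not a user-agent that sends requests.
AI_AGENTS = ("OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "GPTBot", "ClaudeBot", "CCBot")

# Assumed layout: ip - - [time] "METHOD path proto" status bytes "referer" "user-agent"
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Invented sample lines for illustration.
SAMPLE = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /blog/geo HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2026:11:00:00 +0000] "GET /blog/geo HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:11:05:00 +0000] "GET /about HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
]


def count_ai_hits(log_lines) -> Counter:
    """Count (agent, path) pairs for requests whose user-agent names an AI agent."""
    hits = Counter()
    for line in log_lines:
        match = LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        user_agent = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[(agent, match.group("path"))] += 1
                break
    return hits


if __name__ == "__main__":
    for (agent, path), count in count_ai_hits(SAMPLE).most_common():
        print(agent, path, count)
```

Run against a real log with `count_ai_hits(open("access.log"))`, where the file name is whatever your server writes. The same regular expression already captures the referer field, so counting visits from chatgpt.com or perplexity.ai is a small extension.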

Where to Start

You do not need a GEO platform or a new budget line. The first pass is deliberately small:

One. Set your crawler stance per agent in robots.txt, deliberately — allow the retrieval crawlers you want, gate the paths you do not want quoted.

Two. Ship an llms.txt index. Add llms-full.txt only with content you are happy to see quoted verbatim.

Three. Rewrite your five most important pages answer-first, with concrete, self-contained claims and question-shaped headings.

Four. Add JSON-LD with one stable entity @id and a sameAs array, plus Article markup with real dates and author.

Five. Watch your logs for AI crawler hits, and ask the assistants your buyers' questions once a month. Iterate on what they get wrong.

GEO in 2026 is early, and that is the opportunity: the foundations are unglamorous and most sites have not done them. Be crawlable, be quotable, be a consistent and dated entity — and you are already ahead of the pages still optimising only for a list of links nobody reads.

Related reading: LLM integration for business systems, AI agent orchestration, and why most AI projects fail before production.
