LLMs do not cite sources because those sources are authoritative. They cite them because the sources make it through a technical pipeline of retrieval, ranking, passage-level extraction, and safety filters. Understanding the seven signals that govern this pipeline is the difference between investing optimization budget in real levers and burning it on industry superstitions.
By 2026 there was enough public evidence to separate measurable signals from noise. The paper Evaluating Verifiability in Generative Search Engines by Liu, Zhang, and Liang (Stanford, EMNLP 2023) set the baseline with a data point that frames everything else: only 51.5% of sentences generated by AI search engines are fully supported by the citations they report, and only 74.5% of citations actually support the sentence they are attached to. Three years later, a recent replication on fourteen modern models confirmed the asymmetry: link validity above 94%, but factual accuracy between 39% and 77%. The links are there, but they often do not confirm the claims they are attached to.
This article works inside that frame. The seven signals that follow are ordered by available evidence, from the most grounded in academic literature to the most empirical, ending with one anti-signal that deserves to be named.
Signal 1: passing retrieval and fan-out
Before talking about authority, you need to pass the first technical gate. An LLM does not cite a source it does not retrieve, and retrieval is not a black box. In September 2024 Anthropic published the post Contextual Retrieval, describing what has become the standard architecture in 2026: contextual embeddings combined with contextual BM25 and a reranker. With all three components in place, the stack reduces the retrieval failure rate on the top-twenty chunks by 67%, from 5.7% to 1.9%.
There is an intermediate step many GEO analyses ignore and that changes everything: query fan-out. When a user submits a question to ChatGPT, the model does not run a single search. It expands the query into multiple sub-queries to cover the search space before generating the answer. The study The Fan-Out Effect by AirOps, published in April 2026, quantified the phenomenon: 88.6% of queries generate exactly two fan-out sub-queries, only 8.8% generate none (typically simple product or entity queries), and 2.5% generate four or more (complex comparative queries).
The data point that closes the discussion on the relevance of retrieval comes from the same study: a page at position 1 in retrieval has a 58% citation rate, while a page at position 10 stops at 14%. The median rank of pages cited across all three test runs is 2.5, while for pages never cited it rises to 13. In practice, retrieval rank is the dominant factor in this data, and content quality alone cannot close that gap.
Two technical details matter for content optimization. The first is that embeddings capture semantic relationships but can miss exact matches, which is why the stack pairs them with BM25. Calling a product by its correct name in every paragraph therefore matters more than doing it only in the title. The second is that the reranker only reorders chunks that have already been retrieved. Breaking into the top twenty is the real quality jump, while moving from position five to position three depends on factors the reranker considers, not the embedder.
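To make the exact-match point concrete, here is a minimal sketch of BM25-style lexical scoring in plain Python. It is a textbook approximation for illustration, not Anthropic's implementation; the product name "Acme Router X200" and both paragraphs are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # lexical scoring only rewards literal matches
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical paragraphs: same fact, but only the first names the product.
paragraphs = [
    "Acme Router X200 supports mesh networking out of the box.",
    "It also supports mesh networking out of the box.",
]
scores = bm25_scores("Acme Router X200 mesh", paragraphs)
print(scores)
```

In this toy corpus the paragraph that names the product outscores the one that says "It", because the second paragraph matches only the shared term "mesh".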
Fan-out also explains why a single brand can be cited for one query and disappear for an apparently synonymous variant: the generated sub-queries are different, the retrievals are different, and the final citation depends on the aggregate. Refinea monitors fan-out across ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews for every prompt in the customer’s panel, surfacing which actual sub-queries the model runs behind the scenes. Without this visibility, a GEO strategy optimizes one-third of the problem.
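The aggregation effect can be sketched with reciprocal rank fusion, a common way to merge several ranked lists. How ChatGPT or any other engine actually aggregates its sub-query results is not public, so the fusion method, the sub-queries, and the domains below are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: each URL earns 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stability.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical fan-out for one user prompt.
fanout_a = {
    "best crm for startups": ["a.example", "brand.example", "b.example"],
    "crm pricing comparison": ["brand.example", "c.example", "a.example"],
}
# Hypothetical fan-out for an apparently synonymous prompt.
fanout_b = {
    "top crm tools small business": ["a.example", "c.example", "d.example"],
    "crm software reviews": ["c.example", "a.example", "b.example"],
}

fused_a = reciprocal_rank_fusion(fanout_a.values())
fused_b = reciprocal_rank_fusion(fanout_b.values())
print(fused_a)
print(fused_b)
```

With these made-up rankings, brand.example leads the fused list for the first prompt and does not appear at all for the second, even though a human would read the two prompts as equivalent.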
The operational implication is banal but rarely respected. Entity density and consistent brand-name repetition in the body of the content, not just in headings, remain the zero-cost optimization with the highest ROI. Add to these a less obvious action: optimize content for the real variants that fan-out generates, not just for the main query.
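A quick way to audit entity density is to look for long stretches of body text with no mention of the entity. The sketch below is one possible check, assuming a 200-word maximum gap as the threshold and a simple case-insensitive string match; the brand name "Acme" and the sample text are hypothetical.

```python
import re


def entity_gaps(body, entity, max_gap=200):
    """Return (start_word, end_word) ranges longer than max_gap with no mention."""
    pattern = re.compile(re.escape(entity), re.IGNORECASE)
    # Word index of each mention = number of words before the match starts.
    mentions = [len(body[: m.start()].split()) for m in pattern.finditer(body)]
    total_words = len(body.split())

    boundaries = [0] + mentions + [total_words]
    gaps = []
    for start, end in zip(boundaries, boundaries[1:]):
        if end - start > max_gap:
            gaps.append((start, end))
    return gaps


# Hypothetical body: a mention, 150 filler words, a mention, 250 filler words.
sample_body = "Acme " + "word " * 150 + "Acme " + "word " * 250
gaps = entity_gaps(sample_body, "Acme")
print(gaps)
```

Each tuple marks a span of body text where the entity goes unnamed for more than the chosen threshold, which is where a pronoun or generic noun can be swapped for the full name.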
Signal 2: granularity is the paragraph, not the page
Anthropic’s official Citations API documentation says it explicitly: documents are chunked to define the minimum citation granularity, and for plain text and PDFs the default chunking is at the sentence level. Claude cites the single sentence, or concatenates multiple consecutive sentences to cite a paragraph. Never a whole page, never an isolated heading.
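To illustrate what sentence-level granularity means in practice, the sketch below splits a text into sentence spans with character offsets and merges consecutive spans into a larger citation. The splitter is deliberately naive (it would break on decimals such as "51.5%") and is not Anthropic's chunker; the field names are chosen to resemble a character-location citation and should be treated as an assumption.

```python
import re


def sentence_spans(text):
    """Split text into sentence-like spans with character offsets.

    Naive rule: a sentence ends at '.', '!' or '?'. Trailing text without
    terminal punctuation is ignored.
    """
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]+", text):
        raw = match.group()
        start = match.start() + (len(raw) - len(raw.lstrip()))
        end = match.end()
        spans.append(
            {
                "cited_text": text[start:end],
                "start_char_index": start,
                "end_char_index": end,
            }
        )
    return spans


def merge_consecutive(text, spans, first, last):
    """Combine spans first..last (inclusive) into one paragraph-sized citation."""
    start = spans[first]["start_char_index"]
    end = spans[last]["end_char_index"]
    return {
        "cited_text": text[start:end],
        "start_char_index": start,
        "end_char_index": end,
    }


document = "The grass is green. The sky is blue."
spans = sentence_spans(document)
merged = merge_consecutive(document, spans, 0, 1)
print(spans)
print(merged)
```

The point of the exercise is that every span has to make sense on its own: a sentence that depends on the previous paragraph for its subject is a weak citation candidate.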
This changes how content needs to be written. A three-thousand-word article structured as a monolith, with arguments developed across multiple paragraphs, produces fewer citations than a fifteen-hundred-word article organized into ten units of one hundred fifty words, each self-sufficient in answering a single question.
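A rough audit of block size can be automated. The sketch below assumes blocks are separated by blank lines and uses the 150-word target from the paragraph above; whether a block is actually self-sufficient still needs a human read.

```python
def audit_blocks(article, max_words=150):
    """Return (block_index, word_count, within_limit) for each blank-line-separated block."""
    blocks = [b.strip() for b in article.split("\n\n") if b.strip()]
    report = []
    for index, block in enumerate(blocks):
        word_count = len(block.split())
        report.append((index, word_count, word_count <= max_words))
    return report


# Hypothetical article: one short block and one oversized block.
sample_article = "one two three\n\n" + "w " * 151
report = audit_blocks(sample_article)
print(report)
```

Blocks flagged as over the limit are the first candidates for splitting into separate question-and-answer units.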
The BLUF pattern (Bottom Line Up Front) is its operational expression. The first paragraph of every section must contain the answer. The rest is expansion. Atomic citations reward this structure because the chunker does not read the section. It reads the sentence.
Signal 3: quotations, statistics, and external sources
The foundational GEO paper (Aggarwal et al., SIGKDD 2024) systematically tested nine optimization tactics across ten thousand queries and showed that GEO can boost visibility by up to 40% in generative engine responses. Table 1 of the paper breaks this figure down by tactic and produces the most important hierarchy academic literature has published on the topic.
The three tactics with the biggest lift on the position-adjusted word count metric, as reported by the authors, are:
- Quotation Addition: +27.8%
- Statistics Addition: +25.9%
- Cite Sources: +24.9%
At the bottom of the ranking is keyword stuffing, at 17.8% on the same metric, far behind the three leaders. The authors explicitly describe many traditional SEO tactics as offering “little to no performance improvement” in the generative context. The hierarchy is clear: content that includes specific numbers, attributed quotes, and links to sources gets cited significantly more often than content that relies on argumentative text alone.
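For readers unfamiliar with the metric, the idea behind a position-adjusted word count is that a source earns credit for the words in the sentences that cite it, with earlier sentences weighted more heavily. The sketch below is a simplified reading of that idea using an exponential position decay; the exact normalization in the paper may differ, and the sample answer is hypothetical.

```python
import math


def position_adjusted_word_share(sentences, source_id):
    """Share of answer words credited to source_id, discounted by sentence position.

    `sentences` is a list of (text, set_of_cited_source_ids) in answer order.
    """
    total_words = sum(len(text.split()) for text, _ in sentences)
    n = len(sentences)
    weighted = 0.0
    for position, (text, cited) in enumerate(sentences):
        if source_id in cited:
            # Earlier sentences (small position) keep more of their weight.
            weighted += len(text.split()) * math.exp(-position / n)
    return weighted / total_words


# Hypothetical two-sentence answer citing two different sources.
answer = [
    ("alpha beta gamma delta", {1}),
    ("epsilon zeta eta theta", {2}),
]
share_first = position_adjusted_word_share(answer, 1)
share_second = position_adjusted_word_share(answer, 2)
print(share_first, share_second)
```

Both sources contribute four words here, but the one cited in the opening sentence ends up with the larger share, which is why placement inside the answer matters as much as being cited at all.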
For Refinea this is the signal that justifies the existence of the Brand Memory module, which catalogues Proof Points, Expert Voices, and Facts precisely to make every generated piece of content dense with verifiable citations, statistics, and sources.
Signal 4: co-citation clusters, not traditional backlinks
The authority LLMs recognize is not the domain authority classical SEO has measured for twenty years. The most solid analyses published in 2025 and 2026 converge on one point: the domains AI engines cite most form a tight, recurring cluster. Reddit, Wikipedia, LinkedIn, Forbes, and Medium dominate consistently across the engines studied.
Profound’s analysis of 680 million citations quantifies the concentration. Wikipedia covers 7.8% of total ChatGPT citations, and 47.9% of top citations when looking only at the most recurring sources. Reddit covers 6.6% of Perplexity citations, 2.2% of Google AI Overviews, and reaches 46.7% of the top ten sources cited by Perplexity.
The critical point comes from Anthropic itself. In the post Multi-Agent Research System the engineers report that their early agents consistently chose SEO-optimized content farms over authoritative but less highly ranked sources, like academic PDFs or personal blogs. The evaluation rubric they introduced afterward explicitly prefers primary sources over lower-quality secondary sources.
The conclusion is sharp. Traditional domain authority does not automatically translate into AI citation rate. Being cited or mentioned inside the Wikipedia-Reddit-LinkedIn-Forbes cluster is worth more than hundreds of backlinks from mid-low domains.
Signal 5: recency as a first-class field
Anthropic’s official Web Search Tool documentation shows exactly what the model sees when it receives a search result. Every result includes a URL, a title, and a page_age field indicating when the site was last updated, and each citation carries up to 150 characters of cited text. Recency is not an implicit factor. It is a structured input the model reads alongside the content.
Crawler-side data confirms the preference. Seer Interactive analyzed crawl logs of three ChatGPT bots on more than five thousand URLs and found that 65% of hits land on content published in the past year, and 79% on content from the past two years. The study measures crawl behaviour, not final citation rate, but the pattern is consistent: engines invest crawl resources in recent sources, and recent sources are the ones then proposed to the model.
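As an illustration of recency arriving as structured data, the sketch below models search results as dictionaries with url, title, and page_age, then orders them newest first. The sample results are hypothetical, and the "Month day, year" date format is an assumption about how page_age is rendered; the sorting itself is only a demonstration of what a consumer of that field could do, not a description of the model's behaviour.

```python
from datetime import datetime


def parse_page_age(value):
    """Parse a page_age string; return None if missing or in an unexpected format."""
    if not value:
        return None
    try:
        return datetime.strptime(value, "%B %d, %Y")  # assumed format
    except ValueError:
        return None


def newest_first(results):
    """Sort results by page_age, most recent first; undated results go last."""
    def sort_key(result):
        parsed = parse_page_age(result.get("page_age"))
        if parsed is None:
            return (True, 0)
        return (False, -parsed.toordinal())

    return sorted(results, key=sort_key)


# Hypothetical search results.
results = [
    {"url": "https://example.com/old", "title": "Old guide", "page_age": "January 5, 2023"},
    {"url": "https://example.com/new", "title": "New guide", "page_age": "April 30, 2025"},
    {"url": "https://example.com/unknown", "title": "Undated", "page_age": None},
]
ordered = newest_first(results)
print([r["url"] for r in ordered])
```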
The operational implication is precise. Substantially updating an evergreen article every six months, changing the numbers and enriching dated passages, produces more GEO value than publishing two new articles in the same time. The change must be substantial: cosmetic date bumps do not work because the model sees page_age but also the actual content.
Signal 6: the hallucination-risk asymmetry
The Stanford 2023 finding that only 51.5% of sentences are fully supported by their citations does not describe a temporary bug. It points to a persistent defensive behaviour of the model. When an LLM has to generate a verifiable answer, it prefers to cite sources that minimize the risk of fabricating a fact.
This explains Wikipedia’s dominance across every published study. Wikipedia covers 47.9% of top ChatGPT citations according to Profound’s analysis. The technical reason is not that Wikipedia has the highest journalistic quality on the web. It is that Wikipedia has the rarest combination: dense facts, verifiable internal citations, predictable structure, consistent formatting. A model that generates with citations prefers sources where it can easily anchor its statements to attributed sentences.
Operationalizing this for brands is less obvious than it sounds. It does not mean writing “like Wikipedia.” It means every important claim must be accompanied by a verifiable citation to a primary source, by a statistic with a reference, or by a quote attributed to a real person. Content without factual anchors is treated as a hallucination risk and tends to be passed over for citation even when it has been retrieved.
Signal 7: schema markup does not move AI citations
Ahrefs published in May 2026 the most rigorous study available on the effect of schema markup on AI citations. They added JSON-LD to 1,885 pages between August 2025 and March 2026, comparing them with a control group of 4,000 pages, measuring citations before and after. The deltas were: +2.4% on Google AI Mode, +2.2% on ChatGPT, −4.6% on Google AI Overviews. The first two variations are statistically indistinguishable from zero.
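One common way to read a before/after design with a control group is a difference-in-differences calculation: the change in the treated group minus the change in the control group. Whether Ahrefs used this exact estimator is not stated here, and the citation counts below are hypothetical, so treat this purely as a sketch of the reasoning.

```python
def relative_change(before, after):
    """Percentage change from before to after."""
    return (after - before) * 100 / before


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Treated-group change minus control-group change, in percentage points."""
    treated = relative_change(treated_before, treated_after)
    control = relative_change(control_before, control_after)
    return treated - control


# Hypothetical citation totals for pages with schema added vs. untouched pages.
effect = diff_in_diff(
    treated_before=1000, treated_after=1050,
    control_before=2000, control_after=2060,
)
print(f"Estimated effect: {effect:+.1f} percentage points")
```

The control group is what keeps a general rise or fall in AI citations over the study window from being mistaken for an effect of the markup.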
The study has an important caveat that needs to be reported honestly: the analysis was limited to pages already cited by AI engines (100+ citation baseline). For pages without consolidated AI visibility, schema markup might still help in the first retrieval wave. But for those already visible, JSON-LD does not produce the lift the GEO industry has been selling for two years.
This does not mean removing schema from the site. Schema continues to be relevant for traditional Google rich results. It means presenting it as a primary GEO lever is dishonest toward those paying for that consulting.
The asymmetry few name
One final technical note deserves to be made explicitly. Of the three main AI search providers, only Anthropic publishes meaningful technical documentation on its retrieval and citation mechanisms. The Contextual Retrieval page, the multi-agent system post, the API docs for Citations and Web Search are all verifiable primary sources.
OpenAI and Perplexity have no public equivalents. Their retrieval architectures for ChatGPT Search and Sonar are deliberately opaque. Everything you read online about their internal workings comes from reverse engineering, leaks, or third-party speculation. For those planning evidence-based GEO strategies, this asymmetry matters: most of what we actually know about LLM citation behaviour derives from Anthropic documentation and engineering posts and from academic research.
What to do Monday morning
The seven signals translate into a concrete action list anyone can start on within the next seven days.
Entity density in the body. Verify that the brand name and main product names appear in full at least once every two hundred words in key content. Not in headings, in the body.
Content atomization. Identify the three most important articles on the blog and rewrite them in one-hundred-fifty-word blocks, each self-sufficient in answering a sub-question.
Citation and statistics density. Add to every strategic article at least five attributed quotes, three statistics with source, and five links to authoritative external sources.
Citation source map. Submit ten prompts representative of your category to ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews. Note the cited sources. Build the next quarter’s editorial plan starting from there.
Recency audit. Identify the ten evergreen articles with the highest historical traffic. Substantially update them over the next ninety days, changing numbers, rewriting dated sections, adding new data.
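The citation source map step above can be tallied with a few lines of Python once the cited URLs have been collected by hand. The sketch below counts cited domains across engines; the engine names are keys of your choosing and the URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(citations_by_engine):
    """Count how often each domain is cited across all engines."""
    counts = Counter()
    for urls in citations_by_engine.values():
        for url in urls:
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]  # treat www and bare domains as the same source
            counts[host] += 1
    return counts


# Hypothetical citations noted by hand after running the prompts.
collected = {
    "chatgpt": [
        "https://en.wikipedia.org/wiki/Example",
        "https://www.reddit.com/r/example/comments/1",
    ],
    "perplexity": [
        "https://www.reddit.com/r/example/comments/2",
        "https://www.forbes.com/sites/example",
    ],
}
counts = domain_counts(collected)
for domain, total in counts.most_common():
    print(domain, total)
```

The domains at the top of the tally are the ones where a mention or a citation is most likely to carry into AI answers for the category, which makes them the natural starting point for the editorial plan.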
Refinea automates each of these steps at scale. But the framework works manually too for those who have patience. For the complete strategic frame, we have published the operational guide to Generative Engine Optimization. To see the same measurement logic applied publicly, Refinea Analysis measures entire Italian industries using the same protocol the platform applies to individual brands.
The seven signals are the technical layer. Above them, strategy. Below them, nothing.
