PRAPI Research · 2026-06-15
What AI engines actually cite for marketing and PR: why extractability and proof beat authority, and where earned coverage still wins
We asked founders, PR pros, and operators a question every marketer is now quietly anxious about: when ChatGPT, Claude, Perplexity, or Google AI Overviews answer a marketing or PR question, which sources do they cite, and which do they ignore? More than sixty practitioners answered, and the result is not a single rule but a real argument with a clear center of gravity. The loudest finding: AI engines do not reward the most authoritative source or the most polished writing. They reward the most extractable and verifiable one: a clean, sourced, self-contained answer they can lift in the first sixty words and stand behind. That reshuffles who wins. Plain question-and-answer pages beat buried thought leadership, original data beats opinion, and a niche page with a checkable fact beats a famous brand's homepage. But it is not unanimous, and the disagreements are the most useful part. High domain authority still concentrates citations in some categories. Community sources, especially Reddit, are rising fast. And in trust-sensitive fields like health, legal, and finance, AI reaches for a narrow shortlist of credentialed and earned third-party sources, while brand content stays nearly invisible no matter how good it is. Read this as a field guide to what actually gets cited, organized by the forces that decide it.
35 contributors cited
Every marketer now has the same quiet worry: when someone asks ChatGPT or Perplexity a question their brand should be the answer to, does the machine reach for them, or for someone else? So we asked the people who watch it happen. Founders, PR pros, and operators who track AI answers told us which sources get cited, which get ignored, and what they have changed because of it.
The result is not one rule. It is a real argument, with a clear center of gravity and three or four sharp disagreements that turn out to be the most useful part. Through it all runs a single thread: AI cites what it can extract and verify. Everything below is a force that bends that one rule.
Force 1 - The dominant rule: extractability beats authority
The most common and most counterintuitive finding is that polish and prestige are not what get pulled. Structure is. The model lifts a clean block, not a page.
John Surabian, who rebuilt his own content for citation, put it most sharply:
LLMs don't cite the best-written piece. They cite the most extractable one. When ChatGPT or Perplexity answers a marketing question, it reaches for content where the answer sits in a clean, self-contained block right under a question-shaped heading. The 1,500-word think piece with the answer buried in paragraph nine loses to a plain post that states the answer in 60 words up top.
Roman Sydorenko, who has run answer-engine work since 2022 (via Connectively), reduced it to a law: "LLMs don't read pages, they lift blocks. If your answer isn't a clean, extractable unit in the first 60 words, you don't exist in AI search. Authority surfaces you in training data; extractability decides whether you get quoted." That last line is the bridge for the whole report.
The proof is in the operators who changed their pages and watched it work. Nikita Baksheev of Ronas IT (Connectively): "We rewrote our priority pages to state who we are, what we do, and what proof supports the claim. Before that, AI tools either skipped us or described us too vaguely. After, we started appearing in AI shortlists for service-intent prompts." Nam Dang of Cricket One ran the cleanest A/B: "We published two pieces around attribution. The thought-leadership article got almost no AI pickup after six weeks. The plainer operational guide, with definitions and examples, got cited." And Jake Wardle of EV Cable Hub (Connectively) found the concentration brutal: "Of the 30 guides we publish, four plain question-and-answer pages earn nearly all the AI mentions we get, and they are the least decorated things on the site."
Supporting the pattern: Rob Dietz ("answer-first formatting, a 40-to-60-word summary, combined with FAQPage JSON-LD schema"), David of Versys Media ("long structured guides with clear headings and step-by-step processes are disproportionately cited; opinion pieces and listicles almost never surface"), Stephen Taormino of CC&A ("if your content buries the answer in paragraph four, you're invisible to both featured snippets and AI"), Runbo Li of Magic Hour (Connectively, "citation architecture, make your first 150 words extractable as a standalone response"), and Kruno Sulić of Cliprise (Connectively, "if a page helps a busy editor verify one claim in under a minute, it has a stronger chance of being cited").
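The combination Rob Dietz describes (an answer-first summary of 40 to 60 words plus FAQPage JSON-LD) can be sketched in a few lines of Python. This is a minimal illustration, not any contributor's actual tooling: the helper names, the sample question and answer, and the word-count check are assumptions made for demonstration. Only the FAQPage, Question, and Answer types come from schema.org.

```python
import json


def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())


def build_faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def answers_outside_range(pairs, low=40, high=60):
    """Return the questions whose answers fall outside the low-to-high word range."""
    return [q for q, a in pairs if not low <= word_count(a) <= high]


# Hypothetical sample content, written for this sketch only.
SAMPLE_PAIRS = [
    (
        "What do AI engines cite for marketing questions?",
        "Practitioners in this report say AI engines cite the most extractable "
        "and verifiable source, not the most polished one. That means a clean, "
        "self-contained answer placed directly under a question-shaped heading, "
        "backed by a named number, a date, and a source the reader can check, "
        "rather than a long think piece that buries the answer.",
    ),
]

if __name__ == "__main__":
    print("Answers to tighten:", answers_outside_range(SAMPLE_PAIRS))
    print(json.dumps(build_faq_jsonld(SAMPLE_PAIRS), indent=2))
```

The printed JSON would normally be embedded in the page inside a script tag of type application/ld+json.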
The dated proof points (the strongest evidence in the report): Chris McCarron's GoGoChimp was cited by ChatGPT on 2026-05-29 for a competitive query, Kevin Lourd's step-by-step docs surfaced in Perplexity right after he stripped the narrative, Natalia Lavrenenko's piece was cited three times by ChatGPT, and Danyon Togia's page appeared in AI Overviews two to three weeks after publishing. These are not opinions about citation. They are receipts.
Force 2 - The counterweight: authority still concentrates citations
The dissent is real and well-evidenced. A large group sees high domain authority dominating, the "citation blacklist" that buries small publishers.
Charles Noble of Hetneo, who tracks this for clients: "In a recent analysis of 150 answers across ChatGPT and Perplexity, 12 publishers grabbed 74% of citations. Publishers with Domain Authority over 60 had an 80% citation rate. Newer sites with lower authority rarely appeared." Thomas Oldham tracked 300 citations across his automotive work: "AI Overviews cite the Wall Street Journal 4 times more often than independent marketing blogs. 78% of small publishers get completely ignored." Samuel Huang ran 50 test queries: "Sites with a domain authority above 80 get cited in about 70% of responses. My own blog, with a domain authority of 25, has never appeared." Add Ruth Cruz ("publishing on a site with DA above 70 brings 5x more AI citations than a dozen posts on a niche platform"), with Eugene Leow naming the mechanism plainly: the citations skew to high-DA domains because "that's how the algorithms were trained, the data sets favor high-traffic, high-authority domains."
The two camps are not actually in conflict, and Roman Sydorenko's line reconciles them: authority surfaces you in the training data; extractability decides whether you get quoted. Both are true at once.
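Concentration figures like the ones Charles Noble and Thomas Oldham report come down to simple arithmetic on a log of cited domains. A rough sketch of that calculation, assuming you have already collected one domain per citation. The function name and the sample log are invented for illustration and are not data from this report.

```python
from collections import Counter


def top_share(cited_domains, top_n):
    """Return the fraction of all citations captured by the top_n most-cited domains."""
    counts = Counter(cited_domains)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total


# Made-up citation log: one entry per citation observed in an AI answer.
SAMPLE_LOG = (
    ["bigpublisher.example"] * 6
    + ["tradepress.example"] * 3
    + ["nicheblog.example"]
)

if __name__ == "__main__":
    share = top_share(SAMPLE_LOG, top_n=2)
    print(f"Top 2 domains hold {share:.0%} of citations")
```

Running the same function on your own tracked answers, category by category, is one way to see whether authority or extractability is doing more of the work in your niche.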
Force 3 - The fuel: original first-party data is citation bait
If extractability is the form, original data is the fuel. The thing that exists nowhere else is the thing AI cannot route around.
Roee Tsur, CTO of Sponja, has been engineering for it deliberately: "We published original survey research, 238 webinar hosts, with full methodology including the limitations of our sample. LLMs strongly prefer citable numbers with a clear source over opinion content. A stat with an n and a date is citation bait. A thought leadership post is not." Mariah of Ten Speed (NeverSell) brought the hard number from her own research: "Brand-controllable content, product pages, comparison guides, how-to content, accounts for 88% of AI citations at the evaluation stage, while community-driven sources like Reddit and YouTube make up just 4%." And Michael Joe, Digital PR Manager at Digital Silk: "The content we run that has performed best recently is the pieces with original statistics. You don't need volume, you only need to compete with value." Supported by Deepak Shukla ("LLMs favour original sources over opinions; we get cited when we publish original studies"), Kate Ross of Irresistible Me, and Shoaib Mughal (Connectively, "publish information only you can provide").
Force 4 - The rising force: community and lived experience
The freshest signal in the data is the rise of community sources, and several operators flagged it independently in the same 90-day window.
Natalia Lavrenenko of Smarfle (Connectively): "The Reddit citations have grown noticeably in the last 90 days. Perplexity in particular now treats high-upvoted Reddit answers as primary sources for tactical marketing questions." Roman Sydorenko of RedditServices (Connectively): "It's pulling from the thread where actual practitioners argued it out. A thread where five practitioners contradict each other gives the model triangulation a single polished post can't." Chris McCarron of GoGoChimp: "Wikipedia is the AI grounding anchor. Reddit is the long-tail recommendation source. Forbes is the tier-1 trust signal. AI engines weight these heavier than domain authority would suggest." And Kira Piskalova of Beyond Facials (Connectively) named the bias: "If a real person on Reddit says our treatment helped their condition, that carries more weight in the model's ranking than our own clinical descriptions." Rob Dietz adds the practical footprint: "G2, Capterra, Reddit, and YouTube. If your brand doesn't have an active presence on these third-party platforms, the AI is highly likely to ignore you."
Force 5 - Where it inverts: trust-sensitive categories reward earned and credentialed sources
In health, legal, finance, and insurance, the rules flip. AI gets cautious, reaches for a narrow credentialed shortlist, and brand content goes nearly invisible no matter how good it is.
Peter Moon, CEO of Herba Health, watching the most strictly filtered category there is:
The health category citation game is completely different. LLMs show very careful selectivity, reaching for a consistent shortlist like Healthline, Examine, WebMD, Mayo Clinic, Harvard Health, and government bodies such as the NIH and Health Canada. Brand blogs are largely missing, even when editorial quality is solid.
Hans Graubard, Co-Founder of Happy V (Connectively), described the only way through it: "pages that read like a clinician wrote them, with a clear clinical question up top, links to PubMed or journal studies, and a named reviewer credential. Brand sites only break through when they stop sounding like brands." The pattern holds across regulated fields: Geoff Stanton (insurance, "official and highly specific sources win, state registry pages, carrier updates, pages that answer one narrow risk question") and Ana Vinikov (medical, "AI cites specific service pages and FAQ-style medical content, not broad promotional pages").
And this is where earned, third-party coverage reasserts itself. Monica Tomasso (Connectively) saw it in her own results: "Several earned media placements and expert contributions generated citations inside AI systems faster than traditional blog content. A single third-party mention on a trusted publication influenced AI understanding more than multiple self-published articles." Nikita Khandheria of ERIA points to the research: "AI platforms gravitate toward expert commentary, educational articles, and industry research rather than promotional content," citing the Muck Rack study finding that generative AI relies heavily on earned media and journalism. Pavankumar Kamat of Panto AI ties it together: "LLMs cite signal, not brands. Content that's canonical, uses structured metadata, and is independently corroborated gets priority."
Force 6 - The preconditions everyone forgets
Before any of the above, two things have to be true: the model has to be able to read you at all, and it has to know who you are.
Vishnu Harshan (Connectively) named the one nobody else did: "Most content never gets evaluated for citation because of a simpler problem underneath, discoverability. If a search engine isn't crawling and indexing your pages, an LLM has nothing to retrieve. And LLMs retrieve patterns, not pages. A publisher whose content consistently reinforces the same positioning in the same language across many pieces becomes easy to place. For PR, a single placement matters less than whether it reinforces a consistent narrative." Rob Dietz makes it concrete: "you must ensure your technical foundation is open to crawlers like GPTBot and ClaudeBot." And Kristina Spionjak flags the access layer: "models pull from publications they have licensing deals with, Reddit, LinkedIn, FT, or from smaller niche publishers that lack paywalls and haven't blocked LLM crawlers." Chris McCarron adds entity clarity: clean disambiguation, "CXL the conversion training company, not the file format."
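One way to check the crawler precondition Rob Dietz describes is to test a robots.txt file against the GPTBot and ClaudeBot user agents with Python's standard-library parser. A minimal sketch: the sample robots.txt and the example.com URL are placeholders, and a passing check only shows that the file does not block the crawler, not that the page is indexed or cited.

```python
from urllib.robotparser import RobotFileParser

# User agents named in the report; extend the tuple for other crawlers you care about.
AI_CRAWLERS = ("GPTBot", "ClaudeBot")


def crawler_access(robots_txt: str, url: str, agents=AI_CRAWLERS) -> dict:
    """Map each user agent to True if robots_txt allows it to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder robots.txt that blocks GPTBot and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

result = crawler_access(SAMPLE_ROBOTS, "https://example.com/guides/faq")

if __name__ == "__main__":
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

To audit a live site, the same parser can load a real file through its set_url and read methods instead of a pasted string.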
The contrarian worth sitting with
One voice cuts against the whole earned-media thread, and the report is more honest for it. Gemma Smith, who runs answer-engine optimization for B2B brands, tracks which domains the models actually pull:
For B2B category questions, traditional publishers mostly get ignored. Across one fintech query set I monitor, 25,000+ distinct domains get cited, and no single source breaks roughly 12% share. The "Media" category, the trade press and earned-coverage outlets a PR team would chase, first shows up around rank 26, at about 0.4% share. It's near the bottom. What gets cited instead is the brand's own structured site.
Hold Gemma next to Peter Moon and Monica Tomasso and you have the real finding: earned and credentialed sources dominate where trust is scarce and stakes are high, and the brand's own structured, data-rich pages dominate where the buyer just wants a fast, checkable B2B answer. The category decides.
What to take from this
It is not one rule. AI cites what it can extract and verify, but the winner depends on your category, your structure, and whether you have handed the model a clean, sourced fact it is not afraid to repeat. The work is no longer to impress a reader. It is to be quoted by a machine, which means leading with the answer, naming your numbers, dating your claims, and being readable in the first place. Stop writing to impress. Start writing to be quoted.
Contributors
- John Surabian
LLMs don't cite the best-written piece. They cite the most extractable one. They reach for content where the answer sits in a clean, self-contained block right under a question-shaped heading. The 1,500-word think piece with the answer buried in paragraph nine loses to a plain post that states the answer in 60 words up top.
- Nam Dang, Cricket One
We published two pieces around attribution. The thought-leadership article got almost no AI pickup after six weeks. The plainer operational guide, with direct definitions plus examples, got cited. LLMs reach first for specialist publishers with clean, tightly scoped content, not necessarily the biggest names.
- David, Versys Media
Long, structured guides with clear headings, definitions, and step-by-step processes are disproportionately cited. Opinion pieces, listicles, or newsy trend posts almost never surface as citations, even from very big outlets. When content includes frameworks or examples with concrete numbers, it tends to be favored.
AI cites not the loudest coverage but the clearest authority: primary sources, established media, durable explainers, and pages that answer one question directly. The content that gets ignored is vague thought leadership, event recap fluff, and keyword-stuffed pages that don't say anything defensible.
- Rob Dietz
To get cited, ensure your technical foundation is open to crawlers like GPTBot and ClaudeBot. We see AI pull citations from Reddit and YouTube, plus G2 and Capterra for B2B. We use an answer-first style, leading with a 40-to-60-word summary, combined with FAQPage JSON-LD schema, so the AI gets clean, machine-interpretable data it can extract and credit.
LLMs cite the same publishers that already rank at the top of traditional search, but only when the content is short enough to lift whole. The sources getting ignored are the ones buried in long posts that force a click for the real detail. The mechanism that works is clear headings followed by one or two sentences that match the exact question the model is asked.
AI engines rarely reward publisher prestige on its own. They favor outlets whose pages are constructed for interpretability, clear authorship, explanatory subheads, definitional precision, and evidence placed close to the claim.
The outlets showing up most are reliable trade publishers, category specialists, and research-first newsletters with a recognizable editorial voice. AI products cite sources that reduce ambiguity, answer fast, and support every recommendation with grounded evidence. Large business publications still carry weight, but expert niche media often wins when the query demands practical detail.
- Charles Noble, Founder at Hetneo
In a recent analysis of 150 answers across ChatGPT and Perplexity, 12 publishers grabbed 74% of citations. Publishers with Domain Authority over 60 had an 80% citation rate. Newer sites with lower authority rarely appeared. AI trusts established E-E-A-T signals, strong backlink profiles, and original data or expert quotes.
- Samuel Huang
I ran 50 test queries across ChatGPT, Claude, and Perplexity. Major news publishers like Forbes, TechCrunch, and the Wall Street Journal get cited most. Sites with a domain authority above 80 get cited in about 70% of responses I've seen. Small blogs and personal websites rarely get picked, my own blog with a domain authority of 25 has never appeared.
- Thomas Oldham
AI Overviews cite the Wall Street Journal 4 times more often than independent marketing blogs for PR strategy questions. I tracked 300 citations across my work with Audi and Ford: 78% of small publishers get completely ignored. Large publishers win on authority and volume, niche blogs with 50 posts a year have almost zero chance.
- Ruth Cruz
Domain authority is the main filter. After we boosted our backlink profile by 60% in a quarter, our product documentation showed up in AI overviews within 30 days. Publishing on a site with domain authority above 70 brings 5x more AI citations than a dozen posts on a niche platform. Backlinks are the lever.
- Eugene Leow
In one content audit I tracked 80 AI-generated answers across 4 platforms. Over 70% of the citations came from sites with domain authority above 60. That's not a judgment on quality, it's how the algorithms were trained: the data sets favor high-traffic, high-authority domains.
The outlets that show up most are established publishers with strong domain authority, Forbes, TechCrunch, HubSpot, Search Engine Journal, Harvard Business Review. Niche blogs often get ignored even when their insights are more practical. The content that earns citations is guides, research-backed articles, and well-structured explainers.
- Roee Tsur, CTO at Sponja
We built a dedicated /ai-info page written for AI assistants rather than humans, plain language, factual claims, no fluff. And we published original survey research, 238 webinar hosts, with full methodology including the limitations of our sample. LLMs strongly prefer citable numbers with a clear source over opinion. A stat with an n and a date is citation bait. A thought leadership post is not.
- Mariah, Ten Speed
Our research found that brand-controllable content, product pages, blog articles, comparison guides, listicles, and how-to content, accounts for 88% of AI citations at the evaluation stage, while community-driven sources like Reddit and YouTube make up just 4%.
- Deepak Shukla, Founder & CEO at Pearl Lemon
LLMs favour original sources over opinions. The publishers that keep appearing are the ones publishing research, datasets, comparisons, or primary reporting rather than recycled commentary. We've had success getting cited when we publish original studies because AI systems reward information that doesn't exist elsewhere. Generic thought leadership gets ignored far more often than marketers realise.
Data-driven studies, surveys, benchmarks, and detailed how-to guides perform particularly well. AI systems favor content that directly answers a question, uses clear headings, and provides actionable insights. Original statistics and first-party research appear to increase the likelihood of being referenced.
- Michael Joe, Digital PR Manager at Digital Silk
AI platforms reward coverage by its usability, and the current algorithm works in favor of smaller publications, you don't need volume, you only need to compete with value. The content we run that has performed best recently is the pieces with original statistics. Answer as specific a question as you can and add supporting data and expert commentary.
- Chris McCarron, GoGoChimp
Wikipedia is the AI grounding anchor. Reddit is the long-tail recommendation source. Forbes is the tier-1 trust signal. AI engines weight these heavier than domain authority would suggest. On 2026-05-29 ChatGPT cited us for a competitive query: the post that won had an at-a-glance HTML table, clean entity disambiguation, and heavy named-source attribution.
- Peter Moon, CEO at Herba Health
The health category citation game is completely different. LLMs show very careful selectivity, reaching for a consistent shortlist like Healthline, Examine, WebMD, Mayo Clinic, Harvard Health, and government organizations such as the NIH and Health Canada. Brand blogs are largely missing, even when editorial quality is solid.
- Geoff Stanton
In insurance AI answers, official and highly specific sources win: state registry and insurance pages, carrier updates, and pages that answer one narrow risk question clearly. Broad helpful-tips content gets ignored, even if it's useful. Write pages around the exact question a customer would ask, not around the product you want to sell.
- Ana Vinikov
In healthcare, AI answers cite specific service pages and FAQ-style medical content, not broad promotional pages. The stronger pages combine patient intent with operational detail, coverage, expectations, documentation. Generic 'we care about patients' copy gets ignored. Write like intake staff and clinicians actually talk to patients.
- Nikita Khandheria, ERIA
LLMs don't reward the loudest brands. They reward the clearest ones. When people ask AI questions in our category, the platforms gravitate toward expert commentary, educational articles, industry research, and founder-led thought leadership rather than promotional content. AI has brought us back to an older internet where expertise matters more than advertising.
- Pavankumar Kamat, Panto AI
LLMs cite signal, not brands: they favor sources that are accessible, authoritative, and structured, not whoever shouted loudest. What earns a citation is concrete facts, named quotes, timestamps, and measurable metrics. Content that's canonical, uses structured metadata like schema.org, and is independently corroborated gets priority.
- Kristina Spionjak
AI engines generally pull from publications and platforms they have licensing deals with, Reddit, LinkedIn, FT, or from smaller, niche-specific publishers that lack paywalls and haven't blocked LLM crawlers. We've seen bylines appear within 24 hours, and press releases cited within 15 minutes of going live.
- Gemma Smith
The honest answer for B2B category questions: traditional publishers mostly get ignored. Across one fintech query set, 25,000+ distinct domains get cited and no single source breaks roughly 12% share. The 'Media' category, the trade press a PR team would chase, first shows up around rank 26 at about 0.4% share. What gets cited instead is the brand's own structured site.
Legacy press-release channels and generic corporate newsrooms are becoming functionally invisible to modern LLMs. From the engineering front, multi-modal RAG architectures systematically demote low-signal corporate fluff and reach for clean, extractable, structured source material.
AI tools consistently surface pages with clear, specific language, not generic store copy. Our pages that name the exact color code, collection name, and a concrete use case get picked up more than placeholder text. When someone asks about woman-owned paint businesses, we surface because the WBE certification is explicitly named across our site. Structured, factual credentials carry real weight, more than brand storytelling.
AI pulls from content that answers a genuine question a customer would actually type. Our FAQ page and specific course descriptions get cited more than anything promotional. The content that gets ignored reads like a brochure; the stuff that earns pickup reads like a coach talking, specific, honest, a little unglamorous.
What gets cited is pages with specific local utility, not glossy inspiration content. 'Garden-style bouquets are trending' is weak; 'summer heat in Florida changes which flowers survive an outdoor reception' is much more cite-worthy. The pages that read like a planner answering real client questions perform better than pages written like ads.
In our niche, AI answers favour practical pages that solve one clear problem, cloudy water, cycling a tank, choosing filter size. Manufacturer pages, hobby blogs, and beginner care guides show up more than straight product promos. The content that gets ignored is buy-this-now copy with no real explanation.
AI answers most often cite publishers whose articles behave like compressed research notes, Search Engine Land, Google documentation, HubSpot, eMarketer, strong niche publications, because they answer fast, show their work, and stay tightly aligned to the query.
AI engines favor trade publications with strong archives, clean authorship, and articles that frame a question before answering it. PR and marketing prompts pull from Adweek, Marketing Dive, Search Engine Journal, Digiday, HubSpot. Content with dates, original data, expert quotes, and direct definitions surfaces more often than broad opinion pieces.
AI tends to favor publishers who present an explanation as a professional would verify it. Trade publications, academic and educational institutions, professional associations, and governmental entities are cited more than thin commentaries, less about the number of domains and more about clearly defining the context and supporting the claim without hiding it.