Methodology · May 24, 2026 · 11 min read

The AI tools we use to verify and source 1,290+ claims

The database has 1,290+ entries, twelve source types, six languages, and primary documents going back to the 1940s. No human team reads all of that. This is the AI stack we built to keep up — and the rules we use to keep it honest.

They Knew Editorial Team · Published May 24, 2026

People ask how a small editorial team verifies and dates over a thousand claims spanning eight decades of declassified history. The honest answer is: we don't, on our own. We use AI as a research accelerator at every stage — OCR, translation, timeline reconciliation, citation extraction — and then we apply a strict rule that no model output reaches a published claim without a human pulling the original document and reading it themselves.

This piece walks through the actual stack, the actual prompts, the actual rejection rates, and the parts of the job AI still cannot do. If you run a fact-checking operation, a research desk, or a brief at scale, the trade-offs are the same.

Why we needed an AI stack at all

A single declassified CIA file release can run to 800,000 pages. The MKUltra FOIA cache alone is roughly 20,000 typewritten pages with bleed-through, hand annotations, and redaction bars. A French intelligence claim might first surface in Le Canard Enchaîné in 1987 and never appear in English-language press until a 2011 academic paper cites it. A Soviet-era cable might exist only as a scanned Russian-language fax stored at the Wilson Center archive.

The job is not “read one document carefully.” The job is “triangulate one claim across twelve document types, four languages, and forty years of contested chronology, then dig out the exact sentence that supports it.” That second job is where machine reading earns its keep.

The stack, in one breath

Our verification pipeline uses Claude Sonnet 4.6 for long-document reasoning and cross-language semantic matching, Claude Haiku 4.5 for batched classification and meta generation, GPT-4o for structured citation extraction, Whisper for testimony and hearing audio transcription, and Perplexity Pro for live cross-source triangulation. We picked each tool after head-to-head testing on real claims — not vendor blog posts.

The model choices look opinionated because they are. They came from running the same task on every credible model with a paid account and watching which one survived contact with a 50-year-old grainy PDF. The comparator we used to pick our stack is run by the same founder behind They Knew: every model is benchmarked on a paid plan, with real documents, on real research tasks, which is how we ended up with this combination instead of a single “general-purpose” assistant.

OCR for documents nobody wanted scanned

The first verification layer is just reading the page. That sounds trivial. It is not. The Church Committee final reports from 1975 are typewritten, photocopied multiple times, and riddled with redaction blocks where the agency painted over names with a black marker that bled into adjacent words. The FBI files on Martin Luther King Jr. include hand-written marginalia in pencil on top of typed memos. Tobacco industry litigation discovery has thermal-paper faxes from the 1980s that faded before the trial even ended.

We pipe scans through Claude Sonnet's vision endpoint with a structured extraction prompt: page text, marginalia (flagged separately), redaction zones (counted, not guessed), and a confidence score per paragraph. Anything below 92% page confidence gets routed to a human reviewer. The model is specifically instructed never to hallucinate text under a redaction bar; it must return the literal phrase [REDACTED, N chars] instead. We test this monthly with a control set where we know the redacted contents, and the rate of fabricated fills is currently zero.
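As an illustration only, the routing rule above might be sketched like this. The `PageResult` shape, the function names, and the choice to treat a page as only as reliable as its weakest paragraph are all assumptions made for the example, not a description of the production pipeline. The 92% threshold and the `[REDACTED, N chars]` placeholder come from the text.

```python
import re
from dataclasses import dataclass

# Threshold from the text: pages below 92% confidence go to a human reviewer.
CONFIDENCE_THRESHOLD = 0.92

# Any bracketed span that starts with REDACTED, well-formed or not.
ANY_REDACTION = re.compile(r"\[REDACTED[^\]]*\]")
# The only accepted placeholder form: [REDACTED, N chars]
VALID_REDACTION = re.compile(r"\[REDACTED, \d+ chars\]")


@dataclass
class PageResult:
    """Hypothetical shape of the model output for one scanned page."""
    page_number: int
    text: str
    paragraph_confidences: list[float]


def page_confidence(page: PageResult) -> float:
    # Assumption: the page score is the lowest paragraph score.
    return min(page.paragraph_confidences, default=0.0)


def needs_human_review(page: PageResult) -> bool:
    if page_confidence(page) < CONFIDENCE_THRESHOLD:
        return True
    # A malformed placeholder suggests the redaction rule was not followed.
    markers = ANY_REDACTION.findall(page.text)
    return any(VALID_REDACTION.fullmatch(m) is None for m in markers)


def count_redactions(page: PageResult) -> int:
    # Redaction zones are counted, never filled in.
    return len(VALID_REDACTION.findall(page.text))
```

A format check like this only catches placeholders that are malformed. It cannot detect a model that silently writes plausible text where a black bar was, which is why a control set with known redacted contents is still needed.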

Cross-language matching: the leak that broke in Der Spiegel

A large share of intelligence-related claims first leak in the non-English press. The BND's domestic surveillance programs were first detailed in Der Spiegel. French SDECE/DGSE operations regularly surfaced in Le Canard Enchaîné. Japanese postwar CIA-funded political operations were documented in Asahi Shimbun archives years before American academic coverage. If we searched only English sources, we would publish a fundamentally incomplete record.

We use Sonnet for cross-lingual semantic matching: given a claim in English, find candidate primary-source articles in French, German, Spanish, Russian, Italian, Portuguese, or Japanese that describe the same operation. The model returns candidates with the original-language headline, publication date, and a quoted excerpt in the source language. We never publish the machine-translated text. An editor with reading proficiency in that language — or, failing that, a translator we trust — reads the original. The AI is a search index for sentences, not a substitute for understanding them.

Timeline construction: the “when did they know” problem

The most contested data point on any claim is the timeline. When did the agency know? When was the first internal memo? When was the first denial? When did the first credible external warning appear, and was it dismissed? The dating of that sequence is what separates an honest mistake from a cover-up.

For each claim, we extract every date mention from every cited source — document creation date, mailing date, hearing date, publication date, reference date inside the body text — and we ask the model to assemble them into a single chronology with the source quoted next to each entry. When two sources conflict on a date, the claim is flagged for manual reconciliation and the conflicting sources are both shown in the published timeline rather than silently resolved. Roughly one in eight claims hits a date conflict serious enough to trigger this; we count it as a feature, not a bug.
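The reconciliation step can be pictured with a small sketch. It assumes each extracted date has already been labelled with the event it refers to. The `DateMention` record and `build_timeline` function are hypothetical names for the example. Conflicting mentions are kept side by side and flagged rather than resolved.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateMention:
    """Hypothetical record: one dated event as stated by one source."""
    event: str    # e.g. "first internal memo"
    when: date
    source: str   # citation label
    quote: str    # sentence the date was taken from


def build_timeline(mentions: list[DateMention]):
    """Group mentions by event and return (timeline, conflicting_events)."""
    by_event = defaultdict(list)
    for mention in mentions:
        by_event[mention.event].append(mention)

    timeline, conflicts = [], []
    for event, group in by_event.items():
        group.sort(key=lambda m: m.when)
        # More than one distinct date for the same event is a conflict.
        has_conflict = len({m.when for m in group}) > 1
        timeline.append({"event": event, "mentions": group, "conflict": has_conflict})
        if has_conflict:
            conflicts.append(event)

    # Order events by the earliest date any source gives for them.
    timeline.sort(key=lambda entry: entry["mentions"][0].when)
    return timeline, conflicts
```

Every entry keeps all of its source quotes, so a flagged event can be published with both dates visible while an editor works out which source is right.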

A worked example: BlackRock cross-ownership

Our BlackRock, Vanguard, State Street piece claims that the three asset managers collectively hold major stakes in roughly 90% of S&P 500 companies. That sentence is one number. Verifying it cleanly required four steps.

Step one: pull the latest 13F filings from SEC EDGAR for all three firms. That's nine PDFs totaling roughly 4,000 pages of tabular holdings data. Step two: use GPT-4o with a tightly typed JSON schema to extract holdings rows into a database, with ticker, share count, reporting date, and filer for each row. Step three: join that table against the current S&P 500 constituent list and compute the percentage of companies where at least one of the three appears in the top five shareholders. Step four: spot-check twenty randomly sampled companies by hand against Bloomberg and Yahoo Finance.
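Step three is simple arithmetic once the data is in place, and a rough sketch makes the definition concrete. It assumes a ranked holder list per ticker is already available; ranking holders needs ownership data beyond the three firms' own filings, so the sketch takes `top_holders` as given. The function name and input shapes are illustrative.

```python
BIG_THREE = frozenset({"BlackRock", "Vanguard", "State Street"})


def big_three_coverage(constituents: list[str], top_holders: dict[str, list[str]]) -> float:
    """Percent of constituents with at least one Big Three firm among the top five holders.

    top_holders maps ticker -> holder names ranked largest first (assumed input).
    """
    if not constituents:
        raise ValueError("constituent list is empty")
    hits = sum(
        1
        for ticker in constituents
        # Only the first five ranked holders count; missing tickers count as no hit.
        if BIG_THREE & set(top_holders.get(ticker, [])[:5])
    )
    return 100.0 * hits / len(constituents)
```

With made-up tickers, `big_three_coverage(["AAA", "BBB"], {"AAA": ["Vanguard"]})` gives 50.0: one of two constituents has a Big Three firm in its top five, and the ticker with no holder data counts against the total and is not dropped.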

The AI-extracted figure came back at 91.2%. The hand-checked spot sample suggested the real number was closer to 89.4%, a 1.8-point gap that we traced to one filing where the extraction had counted a fund-of-funds layer twice. We corrected the script, re-ran the join, and published the 89% figure with a footnote. That round trip took roughly six hours, instead of the two weeks it would have taken to pull and tabulate 4,000 pages of 13F data by hand. The AI did not give us the answer. It gave us a first draft that a human could finish.

The hallucination problem: verifying the AI itself

The most dangerous failure mode in AI-assisted research is the confident fake citation. A model will return a perfectly formatted source — correct author, plausible journal, real-looking DOI — that simply does not exist. Or it will quote a sentence that is not in the document it claims to be quoting. We have caught both of these enough times that the entire pipeline assumes the model is lying until the source is proved.

Our defense is mechanical, not based on prompt engineering. Every AI-surfaced source must be re-fetched at its declared URL by a separate, non-AI script. Every quoted passage must be found verbatim in the fetched document via plain text search. If either fails, the citation is rejected and the claim is flagged for the editor. Our current rejection rate for Haiku-proposed citations is roughly 22%, for Sonnet roughly 6%, and for GPT-4o roughly 11%. None of these numbers are low enough to trust the model unsupervised. All of them are low enough that the human is doing rejection work, not from-scratch research, which is the entire point.
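A minimal sketch of such a check, using only the Python standard library, might look like the following. The function names are illustrative, and the whitespace and Unicode normalization is an assumption added so that line wraps in the fetched text do not cause false rejections. Real sources would also need HTML stripping or PDF text extraction before the search, which is left out here.

```python
import re
import unicodedata
from urllib.request import Request, urlopen


def fetch_text(url: str, timeout: float = 20.0) -> str:
    """Re-fetch the declared URL with no model involved."""
    request = Request(url, headers={"User-Agent": "citation-checker/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def normalize(text: str) -> str:
    # Fold compatibility characters and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()


def quote_is_verbatim(quote: str, document_text: str) -> bool:
    """Plain substring search; an empty quote never passes."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(document_text)


def verify_citation(url: str, quote: str, fetch=fetch_text) -> bool:
    """Reject if the source cannot be fetched or the quote is not found in it."""
    try:
        document_text = fetch(url)
    except (OSError, ValueError):
        # Network and HTTP errors subclass OSError; a malformed URL raises ValueError.
        return False
    return quote_is_verbatim(quote, document_text)
```

Passing `fetch` as a parameter keeps the matching logic testable without network access. The check is deliberately strict: a quote that is off by a single word fails, and the failure goes to the editor.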

Two further rules: we never let the AI paraphrase a source on the published page — if the claim depends on what a document said, we quote the document. And we never let the AI produce the final verification status. The Verified green badge on a claim is earned through community votes plus an editor reading the cited primary source. The model proposes; an editor disposes.

What AI still cannot do

Four things, and they matter. AI cannot judge the credibility of a single anonymous source — the call between “disgruntled middle manager with an axe” and “insider risking everything to tell the truth” is a human judgment built on context the model does not have. AI cannot tell the difference between confirmed and acknowledged — an agency saying “a program existed” is not the same as documenting what that program did, and a model will collapse the two if you let it.

AI cannot weigh whether a redacted page matters — sometimes the black bar covers a name that changes everything, sometimes it covers a phone number. And AI does not have the lived political memory to spot when a story is being seeded as a limited hangout — a small true admission designed to make the larger lie harder to question. All four of those judgments are the editor's job. The tools just clear the desk so we have time to make them.

“The model proposes. The editor disposes. Anything else is a press release with footnotes.”

If you're building a research operation and weighing the same trade-offs, the headline lesson is unromantic: AI is extraordinary at the boring half of fact-checking — the reading, the indexing, the cross-referencing — and still bad at the half that actually matters, which is judgment. Treat it accordingly and the throughput goes up without the standard going down. Treat it as an oracle and you will publish a fake citation within the month.