RealityCheck

Role: Solo Builder (pipeline architecture, claim extraction, evidence retrieval, NLI verification framework, answer synthesis, evaluation)

Python, NLP, Natural Language Inference, Sentence Transformers, XGBoost, Wikipedia MediaWiki API, IBM WatsonX, Meta Llama, MistralAI, HuggingFace, TruthfulQA

Overview

Every LLM hallucinates. ChatGPT, Llama, and Mistral all generate factually wrong content with complete confidence, and most users have no way to know. RealityCheck is a modular, model-agnostic, six-phase hallucination correction pipeline that sits as an external verification layer over any LLM. It takes the LLM's response, breaks it into atomic factual claims, retrieves Wikipedia-grounded evidence, verifies each claim through NLI, semantic alignment, and rule-based reasoning, and delivers a corrected response, all before the answer reaches the user. Evaluated on TruthfulQA across IBM WatsonX Granite, Meta Llama, and MistralAI Mistral, it improved answer accuracy by up to 30 percentage points and raised hallucination recall from 37–45% to 78–83%.

The Problem

LLMs don't know what they don't know. They generate factually wrong content with the same confident tone as correct content, and the linguistic fluency of the output makes it nearly impossible for users to tell the difference. In healthcare, law, and academic research, this isn't a nuisance. It's dangerous. Existing approaches either intervene at generation time (RAG) and can't correct errors already in the output, or operate as post-hoc evaluators that diagnose hallucinations but don't fix them. RealityCheck was built to close that gap: inline correction before the answer reaches the user.

How It Was Built

The Problem With Every Existing Approach

Retrieval-Augmented Generation (RAG) is the most common answer to hallucination. But RAG intervenes at generation time: it informs what the model writes but cannot retroactively fix erroneous claims already in the output. NLI-based detection systems can classify whether a claim is hallucinated but don't repair it. Post-hoc evaluators like FActScore measure factuality but hand the problem back to the user unsolved.

None of them do inline correction. That's the gap RealityCheck fills.

How The Pipeline Works

RealityCheck intercepts every LLM response before it reaches the user and runs it through six sequential phases:

Phase 1 - LLM Response Generation: The user's query is forwarded to their chosen LLM (IBM WatsonX Granite, Meta Llama, or MistralAI Mistral). The raw response is stripped of markdown artefacts and checked for quality problems: empty, evasive, or excessively short responses are flagged before proceeding.

Phase 2 - Claim Extraction & Semantic Decomposition: The response is broken into discrete, atomic factual claims using a rule-based linguistic pipeline. Each sentence is assessed for checkability: is it a verifiable factual assertion, or is it subjective, hedged, or opinion-based? Checkable claims are enriched with context to resolve anaphoric references before retrieval. Each claim is labelled essential or supporting based on its position and lexical significance.
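The checkability filter can be illustrated with a minimal sketch. The cue lists, the regex sentence splitter, and the "first sentence is essential" rule below are illustrative assumptions, not the project's actual rules, and anaphora resolution is left out entirely.

```python
import re
from dataclasses import dataclass

# Hypothetical cue lists; a real rule set would be far richer.
NON_FACTUAL_WORDS = {"might", "may", "could", "possibly", "perhaps", "probably", "best", "worst"}
NON_FACTUAL_PHRASES = ("i think", "i believe", "in my opinion")


@dataclass
class Claim:
    text: str
    checkable: bool
    role: str  # "essential" or "supporting"


def split_sentences(response: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [part for part in parts if part]


def is_checkable(sentence: str) -> bool:
    """Checkable unless the sentence is a question, hedged, or opinion-like."""
    lowered = sentence.lower()
    if lowered.endswith("?"):
        return False
    words = set(re.findall(r"[a-z']+", lowered))
    if words & NON_FACTUAL_WORDS:
        return False
    return not any(phrase in lowered for phrase in NON_FACTUAL_PHRASES)


def extract_claims(response: str) -> list[Claim]:
    """One claim per sentence; position alone decides the role (an assumption)."""
    claims = []
    for index, sentence in enumerate(split_sentences(response)):
        role = "essential" if index == 0 else "supporting"
        claims.append(Claim(sentence, is_checkable(sentence), role))
    return claims
```

Calling `extract_claims` on a response that mixes a factual statement with an "I think ..." sentence marks only the factual sentence as checkable, so only that one would proceed to retrieval.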

Phase 3 - Wikipedia-Grounded Evidence Retrieval: For each checkable claim, relevant evidence is retrieved through a cache-first three-layer architecture. First, a local seed evidence bank pre-populated with curated Wikipedia content for common TruthfulQA categories is consulted. Second, if the bank is insufficient, the Wikipedia MediaWiki API is queried with the enriched claim. Third, if a TruthfulQA question carries a direct Wikipedia source URL, that page is fetched as a high-priority source. Retrieved pages are chunked into fixed-size overlapping segments, embedded using a sentence transformer (multi-qa-MiniLM-L6-dot-v1), and ranked by semantic similarity to the claim. The top-k chunks are retained as the evidence set.
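The chunk-and-rank step can be sketched as follows. To keep the example self-contained, a toy bag-of-words embedding (`toy_embed`, a hypothetical helper) stands in for the sentence transformer; in practice it would be swapped for the model's `encode` call. The window size and overlap are illustrative values, not the project's settings.

```python
import math
import re


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size word windows that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text: str) -> dict[str, float]:
    """Stand-in embedding: L2-normalised word counts (NOT a sentence transformer)."""
    counts: dict[str, float] = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts[word] = counts.get(word, 0.0) + 1.0
    norm = math.sqrt(sum(value * value for value in counts.values())) or 1.0
    return {word: value / norm for word, value in counts.items()}


def dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def top_k_chunks(claim: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every chunk against the claim and keep the k highest-scoring ones."""
    claim_vec = toy_embed(claim)
    scored = [(dot(claim_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one window, at the cost of embedding some words twice.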

Phase 4 - Claim Verification & Contradiction Analysis: Each claim is verified against its evidence set through a four-signal fusion framework rather than a single classifier. Signal 1: semantic similarity as a relevance gate, preventing irrelevant evidence from influencing the decision. Signal 2: NLI classification (cross-encoder/nli-deberta-v3-base), scoring each evidence-claim pair as entailment, contradiction, or neutral. Signal 3: rule-based overrides tuned to known NLI failure modes: misattributed quotations, historical misconceptions, and negated factual assertions. Signal 4: Noisy-OR aggregation across multiple evidence chunks, so that several medium-strength signals can jointly support a claim that no single chunk confirms on its own. A dedicated temporal sub-module handles age and date claims via deterministic arithmetic rather than NLI, which is known to perform poorly on such claims.
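As an illustration of the temporal sub-module's idea, an age claim can be checked by direct date arithmetic. This is a minimal sketch that assumes the birth date and reference date have already been extracted from the evidence; the function names are hypothetical.

```python
from datetime import date


def age_on(birth: date, on: date) -> int:
    """Whole years elapsed between `birth` and `on`."""
    years = on.year - birth.year
    # Subtract one if the birthday has not yet occurred in the `on` year.
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


def check_age_claim(claimed_age: int, birth: date, on: date) -> str:
    """Return a Phase 4 label for an age claim by computation, with no NLI call."""
    return "Supported" if age_on(birth, on) == claimed_age else "Contradicted"
```

For example, with a birth date of 14 March 1879 and a reference date of 18 April 1955, a claimed age of 76 is labelled Supported and a claimed age of 75 is labelled Contradicted.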

Each claim exits Phase 4 with one of four labels: Supported, Contradicted, Insufficient Evidence, or Non-checkable.
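The relevance gate and Noisy-OR aggregation can be sketched as below. Noisy-OR combines per-chunk probabilities as 1 minus the product of (1 - p_i). The gate value, the decision threshold, and the `EvidenceScore` structure are assumptions made for illustration, and the rule-based overrides are omitted. Non-checkable is not produced here because it is decided earlier, during claim extraction.

```python
from dataclasses import dataclass


@dataclass
class EvidenceScore:
    similarity: float     # claim-chunk similarity, assumed normalised to [0, 1]
    entailment: float     # NLI entailment probability for this chunk
    contradiction: float  # NLI contradiction probability for this chunk


def noisy_or(probabilities: list[float]) -> float:
    """P(at least one signal fires) = 1 - prod(1 - p_i)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining


def verify(scores: list[EvidenceScore],
           relevance_gate: float = 0.5,        # hypothetical threshold
           decision_threshold: float = 0.7) -> str:  # hypothetical threshold
    """Fuse per-chunk scores into one of three evidence-based labels."""
    # Signal 1: drop chunks that are not relevant enough to the claim.
    relevant = [s for s in scores if s.similarity >= relevance_gate]
    # Signals 2 and 4: aggregate NLI probabilities across the remaining chunks.
    support = noisy_or([s.entailment for s in relevant])
    refute = noisy_or([s.contradiction for s in relevant])
    if max(support, refute) < decision_threshold:
        return "Insufficient Evidence"
    return "Supported" if support >= refute else "Contradicted"
```

With three relevant chunks that each entail the claim at 0.4, the aggregate support is 1 - 0.6^3 = 0.784, which clears a 0.7 threshold even though no single chunk does. A highly contradicting chunk that fails the relevance gate has no effect on the verdict.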

Phase 5 - Reasoning Override & Final Verdict: Heuristic rules, contradiction handling, and confidence scoring are applied to finalise each claim-level verdict.

Phase 6 - Evidence-Grounded Answer Synthesis: The corrected response is reconstructed deterministically, with no second LLM call and therefore no additional hallucination risk. Supported claims are preserved as-is. Contradicted claims are repaired using domain-aware correction templates derived from the best contradicting evidence. Insufficient-evidence claims are epistemically hedged: rewritten to preserve informational value without asserting an unconfirmed fact. Non-checkable claims pass through unchanged.
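A minimal sketch of label-driven, deterministic synthesis follows. The hedging template and the `VerifiedClaim` structure are hypothetical, and the correction text is assumed to have been filled in from a template upstream. The example claims in the usage note are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerifiedClaim:
    text: str
    label: str                        # one of the four Phase 4 labels
    correction: Optional[str] = None  # template-filled repair, if one exists


def hedge(sentence: str) -> str:
    """Hypothetical hedging template: qualify the claim instead of deleting it."""
    body = sentence.rstrip(".")
    return f'The claim "{body}" could not be verified against the available evidence.'


def synthesise(claims: list[VerifiedClaim]) -> str:
    """Rebuild the response from claim labels alone, with no LLM call."""
    rendered = []
    for claim in claims:
        if claim.label == "Contradicted":
            # Assumption: fall back to hedging when no correction is available.
            rendered.append(claim.correction or hedge(claim.text))
        elif claim.label == "Insufficient Evidence":
            rendered.append(hedge(claim.text))
        else:
            # Supported and Non-checkable claims pass through unchanged.
            rendered.append(claim.text)
    return " ".join(rendered)
```

Because the output is a pure function of the labelled claims, running `synthesise` twice on the same input yields the same text, which is what makes the correction step reproducible.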

The Key Design Decisions

No LLM in the correction loop: Answer synthesis is entirely deterministic. This eliminates the risk of secondary hallucination during correction and ensures full reproducibility. Most competing approaches invoke an LLM to rewrite the corrected response, which simply moves the hallucination problem one step downstream.

Epistemic hedging over deletion: When evidence is insufficient, RealityCheck doesn't delete the claim. It rewrites it with appropriate qualifiers. WikiChat, the closest comparable system, simply omits unverifiable claims, sacrificing informational completeness. RealityCheck preserves it.

Multi-signal verification over single classifier: NLI alone fails on paraphrased assertions, negated facts, and compound multi-entity claims. The rule-based override layer specifically targets these known failure modes.

The ablation study confirms that every phase contributes meaningfully: retrieval alone gives +5 points, NLI adds +7, rule overrides add +6, and with full synthesis the gain reaches +24. No single component carries the pipeline; it is the principled composition that makes it work.

72% of all responses are either validated or actively improved. Only 5% receive an insufficient-evidence outcome. Overcorrection (incorrectly modifying a claim that was originally correct) stays below 7% across all three models.

Architecture Note

The pipeline is fully modular. The Wikipedia retrieval backend is replaceable with any knowledge source. The LLM integration layer is model-agnostic. The NLI model is swappable. Each phase has well-defined inputs and outputs. This isn't a demo; it's deployable verification infrastructure.
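One way to express this kind of swappability in Python is with structural interfaces. The sketch below is an assumption about how such boundaries could look, not the project's actual code: all class and method names are hypothetical, and the pipeline is reduced to a single claim for brevity.

```python
from typing import Protocol


class LLMClient(Protocol):
    def generate(self, query: str) -> str: ...


class EvidenceSource(Protocol):
    def retrieve(self, claim: str, k: int) -> list[str]: ...


class NLIModel(Protocol):
    def classify(self, evidence: str, claim: str) -> dict[str, float]: ...


class Pipeline:
    """Wires three swappable components together (simplified to one claim)."""

    def __init__(self, llm: LLMClient, source: EvidenceSource, nli: NLIModel) -> None:
        self.llm = llm
        self.source = source
        self.nli = nli

    def run(self, query: str, k: int = 3) -> list[dict[str, float]]:
        response = self.llm.generate(query)
        # Simplification: the whole response is treated as a single claim.
        passages = self.source.retrieve(response, k)
        return [self.nli.classify(passage, response) for passage in passages]


# Trivial stand-ins: any object with the right methods plugs in.
class CannedLLM:
    def generate(self, query: str) -> str:
        return "Canberra is the capital of Australia."


class StaticSource:
    def __init__(self, passages: list[str]) -> None:
        self.passages = passages

    def retrieve(self, claim: str, k: int) -> list[str]:
        return self.passages[:k]


class ConstantNLI:
    def classify(self, evidence: str, claim: str) -> dict[str, float]:
        return {"entailment": 1.0, "contradiction": 0.0, "neutral": 0.0}
```

Swapping the knowledge source or the NLI model then means passing a different object to `Pipeline`, with no change to the pipeline code itself.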

Results & Impact

RealityCheck improved answer-level accuracy by up to 30 percentage points and raised hallucination recall from 37–45% to 78–83% across three LLMs, without modifying a single model weight, without adding another LLM call, and with overcorrection rates kept below 7%.