RealityCheck

Role: Solo Builder (pipeline architecture, claim extraction, evidence retrieval, NLI verification framework, answer synthesis, evaluation)

Python, NLP, Natural Language Inference, Sentence Transformers, XGBoost, Wikipedia MediaWiki API, IBM WatsonX, Meta Llama, MistralAI, HuggingFace, TruthfulQA

View on GitHub

Overview

Every LLM hallucinates. ChatGPT, Llama, and Mistral all generate factually wrong content with complete confidence, and most users have no way to know. RealityCheck is a modular, model-agnostic, six-phase hallucination correction pipeline that sits as an external verification layer over any LLM. It takes the LLM's response, breaks it into atomic factual claims, retrieves Wikipedia-grounded evidence, verifies each claim through NLI, semantic alignment, and rule-based reasoning, and delivers a corrected response, all before the answer reaches the user. Evaluated on TruthfulQA across IBM WatsonX Granite, Meta Llama, and MistralAI Mistral, it improved answer accuracy by up to 30 percentage points and raised hallucination recall from 37–45% to 78–83%.

The Problem

LLMs don't know what they don't know. They generate factually wrong content with the same confident tone as correct content, and the linguistic fluency of the output makes it nearly impossible for users to tell the difference. In healthcare, law, and academic research, this isn't a nuisance. It's dangerous. Existing approaches either intervene at generation time (RAG) and can't correct errors already in the output, or operate as post-hoc evaluators that diagnose hallucinations but don't fix them. RealityCheck was built to close that gap: inline correction before the answer reaches the user.

How It Was Built

RAG intervenes at generation time: it informs what a model writes but cannot fix erroneous claims already in the output. NLI-based detectors can classify hallucinations but don't repair them. Post-hoc evaluators like FActScore measure factuality but hand the problem back to the user unsolved. None of them performs inline correction. That's the gap RealityCheck fills.

The Pipeline

RealityCheck intercepts every LLM response before it reaches the user and runs it through six sequential phases.

Phase	Name	What Happens
1	LLM Response Generation	Query forwarded to chosen LLM (WatsonX Granite, Llama, or Mistral). Raw response cleaned and quality-flagged.
2	Claim Extraction	Response decomposed into discrete, atomic factual claims. Each assessed for checkability. Anaphoric references resolved. Claims labelled essential or supporting.
3	Evidence Retrieval	Wikipedia-grounded, three-layer cache-first retrieval. Pages chunked, embedded via sentence transformer, ranked by semantic similarity to claim.
4	Claim Verification	Four-signal fusion: semantic similarity gate → NLI classification → rule-based overrides → Noisy-OR aggregation. Temporal sub-module handles date claims via arithmetic.
5	Reasoning Override	Heuristic rules, contradiction handling, and confidence scoring finalise each claim-level verdict.
6	Answer Synthesis	Corrected response reconstructed deterministically, with no second LLM call. Claims repaired, hedged, or preserved based on verdict.
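
The six sequential phases can be pictured as a chain of functions passing one shared state object along. The following is a minimal sketch, not the project's actual code: `PipelineState`, `run_pipeline`, and every phase function here are hypothetical names, and the phase bodies are placeholders that only show where each step would sit.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Data handed from one phase to the next (field names are illustrative)."""
    query: str
    response: str = ""
    claims: List[str] = field(default_factory=list)
    evidence: Dict[str, List[str]] = field(default_factory=dict)
    verdicts: Dict[str, str] = field(default_factory=dict)
    final_answer: str = ""
    trace: List[str] = field(default_factory=list)


Phase = Callable[[PipelineState], PipelineState]


def run_pipeline(state: PipelineState, phases: List[Phase]) -> PipelineState:
    """Apply each phase in order and record the name of every phase that ran."""
    for phase in phases:
        state = phase(state)
        state.trace.append(phase.__name__)
    return state


# Placeholder phases: each one stands in for the real logic in the table above.
def generate_response(state: PipelineState) -> PipelineState:
    state.response = "First claim. Second claim."  # canned text, no LLM call
    return state


def extract_claims(state: PipelineState) -> PipelineState:
    # Naive sentence split; the real phase also resolves references and labels claims.
    state.claims = [s.strip() for s in state.response.split(".") if s.strip()]
    return state


def retrieve_evidence(state: PipelineState) -> PipelineState:
    state.evidence = {claim: [] for claim in state.claims}
    return state


def verify_claims(state: PipelineState) -> PipelineState:
    state.verdicts = {claim: "Insufficient Evidence" for claim in state.claims}
    return state


def apply_overrides(state: PipelineState) -> PipelineState:
    return state  # no rules in this sketch


def synthesise_answer(state: PipelineState) -> PipelineState:
    state.final_answer = ". ".join(state.claims) + "."
    return state


PHASES: List[Phase] = [
    generate_response,
    extract_claims,
    retrieve_evidence,
    verify_claims,
    apply_overrides,
    synthesise_answer,
]

if __name__ == "__main__":
    result = run_pipeline(PipelineState(query="example question"), PHASES)
    print(result.trace)
```

Keeping each phase as a plain function over one state object is one way to get the well-defined inputs and outputs the pipeline relies on; any phase can then be replaced without touching the others.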

Phase 3 — Evidence Retrieval in Detail

Retrieved Wikipedia pages are chunked into fixed-size overlapping segments, embedded using multi-qa-MiniLM-L6-dot-v1, and ranked by semantic similarity to the claim. Top-k chunks are retained as the evidence set. A local seed evidence bank is consulted first, before hitting the MediaWiki API.
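
To make the chunk-and-rank step concrete, here is a small sketch under stated assumptions. The chunk size, overlap, and `k` values are illustrative, not the project's settings, and `toy_embed` is a bag-of-words stand-in so the example runs without downloading a model. In the real pipeline the embedding would come from multi-qa-MiniLM-L6-dot-v1, whose name indicates it is meant to be scored with a dot product, which is why this sketch ranks by dot product as well.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into fixed-size word windows that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text: str) -> Vector:
    """Stand-in embedder: lowercase bag-of-words counts (NOT the real model)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two sparse vectors."""
    return sum(value * b.get(key, 0.0) for key, value in a.items())


def top_k_chunks(
    claim: str,
    chunks: List[str],
    k: int = 3,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[Tuple[float, str]]:
    """Score every chunk against the claim and keep the k highest-scoring ones."""
    claim_vec = embed(claim)
    scored = [(dot(claim_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    page = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
    for score, chunk in top_k_chunks("gamma delta", chunk_words(page, 4, 2), k=2):
        print(score, chunk)
```

Overlapping windows mean a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.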

Phase 4 — Four-Signal Verification

NLI alone fails on paraphrased assertions, negated facts, and compound multi-entity claims. RealityCheck uses four signals in sequence.

Signal	Method	Purpose
1	Semantic similarity	Relevance gate: prevents irrelevant evidence from influencing the decision
2	NLI classification	cross-encoder/nli-deberta-v3-base scores each evidence-claim pair
3	Rule-based overrides	Targets known NLI failure modes: misattributed quotes, negated facts, historical misconceptions
4	Noisy-OR aggregation	Multiple medium-strength signals can jointly support a claim no single chunk alone confirms

Each claim exits Phase 4 labelled: Supported, Contradicted, Insufficient Evidence, or Non-checkable.
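
The Noisy-OR step is easiest to see with numbers. Given per-chunk support probabilities p_i, Noisy-OR returns 1 - (1 - p_1)(1 - p_2)...(1 - p_n), so three chunks that each support a claim at 0.5 combine to 0.875. The sketch below pairs that formula with a relevance gate. It is an illustration only: the function names, the (similarity, probability) input shape, and the 0.5 threshold are assumptions, not the project's actual values.

```python
from typing import List, Tuple


def noisy_or(probabilities: List[float]) -> float:
    """Combine independent support probabilities: 1 - prod(1 - p_i)."""
    remaining = 1.0  # probability that no chunk supports the claim
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        remaining *= 1.0 - p
    return 1.0 - remaining


def aggregate_support(
    evidence: List[Tuple[float, float]],
    relevance_threshold: float = 0.5,  # hypothetical gate value
) -> float:
    """Aggregate support for one claim.

    `evidence` holds one (similarity, entailment_probability) pair per chunk.
    Chunks below the relevance threshold are dropped before aggregation.
    """
    gated = [p for similarity, p in evidence if similarity >= relevance_threshold]
    return noisy_or(gated)


if __name__ == "__main__":
    # Two relevant medium-strength chunks plus one irrelevant but confident chunk.
    print(aggregate_support([(0.9, 0.5), (0.8, 0.5), (0.1, 0.99)]))
```

Gating before aggregating matters because Noisy-OR only ever increases with more inputs; without the gate, a single off-topic chunk with a high entailment score could push an unsupported claim over the line.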

Phase 6 — Deterministic Answer Synthesis

Claim Verdict	Action
Supported	Preserved as-is
Contradicted	Repaired using domain-aware correction templates from best contradicting evidence
Insufficient Evidence	Epistemically hedged — rewritten to preserve informational value without asserting an unconfirmed fact
Non-checkable	Passed through unchanged
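
The verdict-to-action mapping above can be sketched as a plain dispatch with no model call anywhere in it. Everything named here is hypothetical: `VerifiedClaim`, `hedge`, and the hedging wording are illustrative, and the real system uses domain-aware correction templates that this sketch does not attempt to reproduce. One extra assumption is flagged in the code: a contradicted claim with no available correction is hedged instead of being left as-is.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerifiedClaim:
    text: str                         # the claim sentence as the LLM wrote it
    verdict: str                      # one of the four Phase 4 labels
    correction: Optional[str] = None  # replacement sentence built from evidence


def hedge(sentence: str) -> str:
    """Hypothetical hedging template: keep the claim, add a qualifier."""
    body = sentence.rstrip(".")
    return f"{body} (this could not be confirmed against available evidence)."


def synthesise(claims: List[VerifiedClaim]) -> str:
    """Rebuild the answer claim by claim using fixed rules only."""
    parts: List[str] = []
    for claim in claims:
        if claim.verdict == "Contradicted" and claim.correction is not None:
            parts.append(claim.correction)
        elif claim.verdict in ("Contradicted", "Insufficient Evidence"):
            # Assumption: a contradiction without a usable correction is hedged.
            parts.append(hedge(claim.text))
        else:
            # Supported and Non-checkable claims are kept unchanged.
            parts.append(claim.text)
    return " ".join(parts)


if __name__ == "__main__":
    answer = synthesise([
        VerifiedClaim("A is B.", "Supported"),
        VerifiedClaim("C is D.", "Contradicted", correction="C is E."),
        VerifiedClaim("F is G.", "Insufficient Evidence"),
    ])
    print(answer)
```

Because the function depends only on its inputs, the same verdicts always produce the same corrected text, which is the reproducibility property the deterministic design is after.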

Key Design Decisions

No LLM in the Correction Loop

Answer synthesis is entirely deterministic. This eliminates secondary hallucination risk during correction and ensures full reproducibility. Most competing approaches invoke an LLM to rewrite the corrected response, which moves the hallucination problem one step downstream.

Epistemic Hedging Over Deletion

When evidence is insufficient, RealityCheck doesn't delete the claim. It rewrites it with appropriate qualifiers. WikiChat, the closest comparable system, simply omits unverifiable claims, sacrificing informational completeness. RealityCheck preserves it.

Ablation Study

Every phase contributes meaningfully to the final result.

Component Added	Accuracy Gain
Retrieval only	+5 points
+ NLI classification	+7 points
+ Rule-based overrides	+6 points
+ Full synthesis (complete system)	+24 points

No single component carries the pipeline. It is the principled composition that makes it work.

Results

Evaluated across three LLMs: IBM WatsonX Granite, Meta Llama, and MistralAI Mistral.

Metric	Result
Responses validated or actively improved	72%
Insufficient evidence outcome	5%
Overcorrection rate	< 7%

RealityCheck isn't a hallucination detector. It's a hallucination corrector. The distinction matters: detection tells you something is wrong, correction fixes it before the user ever sees it.

Architecture

The pipeline is fully modular. The Wikipedia retrieval backend is replaceable with any knowledge source. The LLM integration layer is model-agnostic. The NLI model is swappable. Each phase has well-defined inputs and outputs. This isn't a demo. It's deployable verification infrastructure.
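
One way to get a replaceable knowledge source is to have the pipeline depend on a small interface and nothing else. The sketch below assumes such an interface; `EvidenceSource`, `InMemorySource`, and `FallbackSource` are hypothetical names, not the project's classes. The fallback wrapper also shows how a local seed bank could be consulted before a remote backend, with an in-memory dictionary standing in for the remote call.

```python
from typing import Dict, List, Protocol


class EvidenceSource(Protocol):
    """Anything that can return evidence passages for a claim."""

    def fetch(self, claim: str) -> List[str]:
        ...


class InMemorySource:
    """Dictionary-backed source, usable as a seed bank or as a test double."""

    def __init__(self, passages: Dict[str, List[str]]) -> None:
        self._passages = passages

    def fetch(self, claim: str) -> List[str]:
        return list(self._passages.get(claim, []))


class FallbackSource:
    """Query sources in order and return the first non-empty result."""

    def __init__(self, sources: List[EvidenceSource]) -> None:
        self._sources = sources

    def fetch(self, claim: str) -> List[str]:
        for source in self._sources:
            passages = source.fetch(claim)
            if passages:
                return passages
        return []


if __name__ == "__main__":
    seed_bank = InMemorySource({"claim one": ["seed passage"]})
    remote = InMemorySource({"claim two": ["remote passage"]})  # stand-in for an API
    source = FallbackSource([seed_bank, remote])
    print(source.fetch("claim one"), source.fetch("claim two"))
```

Since `Protocol` uses structural typing, a new backend only needs a matching `fetch` method; it does not have to inherit from anything in the pipeline.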

Results & Impact

Improved answer-level accuracy by up to 30 percentage points and raised hallucination recall from 37–45% to 78–83% across three LLMs, without modifying a single model weight, without adding another LLM call, and with overcorrection rates kept below 7%.
