
SATORI

Role: Solo Builder (RAG pipeline architecture, PDF extraction and equation handling, hybrid retrieval system, dual-mode answer synthesis, session management, full-stack deployment)

Python, ChromaDB, BGE-large Embeddings, Cross-Encoder Reranking, IBM WatsonX, Meta Llama 3.3 70B, PyMuPDF, Tesseract OCR, React, TypeScript, Node.js, Express.js, Tailwind CSS, Vite

Overview

Most RAG systems force a choice: answer only from your documents (safe but limited) or answer from an LLM (powerful but hallucination-prone). SATORI refuses the tradeoff. Upload up to 20 PDFs, including scanned documents and equations, and SATORI builds a personal, session-isolated knowledge bank using BGE-large embeddings and ChromaDB. In Strict mode, every answer comes only from your PDFs with source citations and page numbers. In LLM Tutor mode, your PDF excerpts are sent as grounding context to Llama 3.3 70B on IBM WatsonX, which expands and elaborates without losing the document anchor. Context-based recall and follow-up detection make it feel like a conversation, not a search engine.

The Problem

Standard RAG systems have a well-known ceiling problem. They retrieve from documents accurately but can't synthesise, explain, or elaborate beyond what's literally written. Pure LLM tutors are fluid and expansive but hallucinate freely, especially on technical or domain-specific material. Every student using AI to study faces this exact frustration: the document-grounded answer is too narrow, and the LLM answer can't be trusted. SATORI was built to resolve this at the architecture level: two modes, one knowledge bank, zero compromise on accuracy.

How It Was Built

The Architecture Problem With Most RAG Systems

A basic RAG system does three things: chunk a document, embed the chunks, retrieve the most relevant ones when a question is asked. This works fine for simple factual lookups. It fails spectacularly when:

- The question requires synthesis across multiple concepts spread across different pages
- The document contains scanned pages where text isn't machine-readable
- The question is a follow-up that relies on context from the previous exchange
- The user needs elaboration or explanation that goes slightly beyond what the document states

SATORI was designed to address all four failure modes explicitly.

The Dual-Mode Design

SATORI gives users two answer modes that share the same knowledge bank but differ in how they synthesise answers.

Strict Mode operates as a pure document retrieval system. Every answer is constructed entirely from chunks extracted from the user's PDFs. Sources, page numbers, and similarity scores are shown alongside every answer. There is no LLM in the loop, so there is no risk of model hallucination. This is the mode you use when accuracy is non-negotiable: exam preparation, technical reference, legal or medical documents.

LLM Tutor Mode activates Llama 3.3 70B on IBM WatsonX. Critically, the model does not answer from its parametric knowledge alone. The top-ranked PDF chunks are sent as grounding context in the prompt, and the LLM is instructed to use the documents as its primary reference and to expand beyond them only where the document knowledge is genuinely insufficient. The model acts as a tutor, not a generator. Users get depth and explanation without losing the document anchor.
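The grounding step can be pictured as plain prompt assembly. The sketch below is illustrative only: the `Chunk` type, the `build_tutor_prompt` name, and the instruction wording are assumptions, not SATORI's actual prompt.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # PDF filename (hypothetical field name)
    page: int    # 1-based page number


def build_tutor_prompt(question: str, chunks: list[Chunk]) -> str:
    """Assemble a grounded prompt: numbered excerpts first, then the question."""
    excerpts = "\n\n".join(
        f"[{i}] ({c.source}, p. {c.page})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    # The instruction text is an assumed wording of the behaviour described above.
    return (
        "You are a tutor. Use the excerpts below as your primary reference "
        "and cite them by number. Go beyond them only where they are "
        "insufficient, and say so when you do.\n\n"
        f"Excerpts:\n{excerpts}\n\n"
        f"Question: {question}"
    )
```

Numbering the excerpts gives the model something concrete to cite, which keeps the answer traceable to a source and page.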

The PDF Extraction Pipeline

This is where SATORI genuinely earns its complexity. Most RAG systems assume clean, digital PDFs. Real-world study material doesn't look like that.

PyMuPDF handles digital text extraction. For scanned pages, identified when the extracted text falls below 80 characters, Tesseract OCR runs full-page recognition. Equations present a special challenge: they render incorrectly as text in almost every extraction pipeline. SATORI crops equation regions as image assets, stores them separately, and serves them inline in the chat UI alongside the text answer. If an answer references an equation, the user sees the actual rendered equation from the original PDF, not a garbled LaTeX string.
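A minimal sketch of the scanned-page fallback, assuming PyMuPDF (`fitz`), `pytesseract`, and Pillow are installed. The function names, the 300 DPI render, and the choice to strip whitespace before counting characters are assumptions; only the 80-character threshold comes from the description above.

```python
import io

OCR_THRESHOLD = 80  # characters; the cutoff described above


def needs_ocr(extracted_text: str, threshold: int = OCR_THRESHOLD) -> bool:
    """Treat a page with almost no extractable text as scanned."""
    # Assumption: whitespace is ignored when counting characters.
    return len(extracted_text.strip()) < threshold


def extract_pages(pdf_path: str) -> list[str]:
    """Return one text string per page, using OCR for scanned pages."""
    # Third-party imports are local so needs_ocr stays usable without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to an image and run full-page OCR on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return pages
```

Deciding per page rather than per document means a mostly digital PDF with a few scanned inserts only pays the OCR cost where it is needed.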

The chunker is topic-aware and works across page boundaries. Most chunkers split at fixed character counts, which frequently cuts a concept in half mid-explanation. SATORI's chunker lets a chunk span a page break, so a concept is not orphaned just because the page ended.
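One way to picture chunking across page boundaries is to treat the document as a continuous stream of paragraphs and close a chunk only when it would grow too large. This is a simplified sketch, not SATORI's chunker: it uses blank-line paragraphs as a stand-in for topic detection, and `chunk_across_pages` and `max_chars` are hypothetical names.

```python
def chunk_across_pages(pages: list[str], max_chars: int = 1200) -> list[dict]:
    """Group paragraphs into chunks without forcing a break at page ends.

    Each chunk records the first and last page it draws from.
    """
    # Flatten the document into (page_number, paragraph) pairs.
    paragraphs = [
        (page_no, para.strip())
        for page_no, text in enumerate(pages, start=1)
        for para in text.split("\n\n")
        if para.strip()
    ]

    chunks: list[dict] = []
    current: list[str] = []
    first_page = last_page = 0
    size = 0
    for page_no, para in paragraphs:
        # Start a new chunk only when adding this paragraph would overflow.
        if current and size + len(para) > max_chars:
            chunks.append({"text": "\n\n".join(current),
                           "pages": (first_page, last_page)})
            current, size = [], 0
        if not current:
            first_page = page_no
        current.append(para)
        last_page = page_no
        size += len(para)
    if current:
        chunks.append({"text": "\n\n".join(current),
                       "pages": (first_page, last_page)})
    return chunks
```

Keeping the page span on each chunk is what allows a citation such as "pp. 3-4" when an explanation starts on one page and finishes on the next.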

The Hybrid Retrieval System

SATORI uses a three-signal retrieval pipeline rather than pure dense vector similarity.

BGE-large-en-v1.5 (BAAI), a widely used open embedding model for semantic retrieval, generates the dense embeddings. Initial retrieval pulls the top candidates by cosine similarity from ChromaDB. A cross-encoder (ms-marco-MiniLM-L-6-v2) then reranks the candidates. Cross-encoders score each query-document pair jointly, capturing nuanced relevance signals that bi-encoder retrieval misses. A keyword score boost is applied on top to ensure that exact terminology matches (critical in technical subjects) aren't deprioritised by semantic distance alone.
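The rerank-plus-boost stage might look roughly like the sketch below. The additive formula, the `keyword_weight` value, and the function names are assumptions for illustration; the cross-encoder is passed in as a plain callable so the combination logic can be read on its own.

```python
import re
from typing import Callable


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the text."""
    terms = set(re.findall(r"\w+", query.lower()))
    if not terms:
        return 0.0
    words = set(re.findall(r"\w+", text.lower()))
    return len(terms & words) / len(terms)


def rerank(
    query: str,
    candidates: list[str],
    cross_score: Callable[[str, str], float],
    keyword_weight: float = 0.2,  # illustrative value, not a tuned one
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Order dense-retrieval candidates by cross-encoder score plus keyword boost."""
    scored = [
        (text, cross_score(query, text) + keyword_weight * keyword_overlap(query, text))
        for text in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In a real pipeline `cross_score` would wrap the cross-encoder model. Because raw cross-encoder scores are not guaranteed to lie in a fixed range, the boost weight would need tuning against them.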

The result: retrieval that understands what you mean and what you said simultaneously.

Conversational Context and Follow-Up Detection

Single-turn RAG (one question, one answer, no memory) is adequate for search but breaks down as a study tool. SATORI maintains a three-turn conversation window. More importantly, it actively detects follow-up questions: short or ambiguous queries like "Can you elaborate?" or "Give me an example of that" are automatically enriched with context from the previous exchange before retrieval runs. The user doesn't need to repeat context; SATORI infers it.
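A rough sketch of follow-up detection and query enrichment, under stated assumptions: the word list, the length cutoff, and the strategy of prepending the previous question are illustrative heuristics, and `Conversation` and `is_follow_up` are hypothetical names. Only the three-turn window comes from the description above.

```python
import re
from collections import deque

# Assumed cue words that signal a reference to the previous exchange.
REFERENTIAL = {"it", "that", "this", "those", "these", "they",
               "elaborate", "more", "again"}


def is_follow_up(query: str, max_words: int = 8) -> bool:
    """Flag short queries that point back at something unstated."""
    words = re.findall(r"[a-z']+", query.lower())
    return len(words) <= max_words and any(w in REFERENTIAL for w in words)


class Conversation:
    def __init__(self, window: int = 3):
        # deque(maxlen=...) silently drops the oldest turn when full.
        self.turns = deque(maxlen=window)

    def add_turn(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def retrieval_query(self, query: str) -> str:
        """Prepend the previous question when the query looks like a follow-up."""
        if self.turns and is_follow_up(query):
            previous_question, _ = self.turns[-1]
            return f"{previous_question} {query}"
        return query
```

With a previous turn of "What is the Carnot cycle?", the query "Can you elaborate?" is sent to retrieval as "What is the Carnot cycle? Can you elaborate?", while a self-contained question such as "What is entropy?" passes through unchanged.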

Session Isolation and Lifecycle Management

Every user gets a fully isolated session environment with their own ChromaDB instance, their own uploaded PDFs, and their own conversation history. Sessions carry a 24-hour TTL with automatic cleanup, so no data accumulates between sessions and no user's knowledge bank bleeds into another's. Subsequent PDF uploads within a session are incremental: only new or changed files are reprocessed, with SHA-256 hashing used to detect unchanged documents.
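The incremental-upload check and the TTL test can be sketched with the standard library alone. The function names and the in-memory `known_hashes` mapping are assumptions; how SATORI actually stores hashes and schedules cleanup is not described here.

```python
import hashlib
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL described above


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def files_to_process(uploads: dict[str, bytes],
                     known_hashes: dict[str, str]) -> list[str]:
    """Return names of new or changed files and update known_hashes in place."""
    changed = []
    for name, data in uploads.items():
        digest = sha256_of(data)
        if known_hashes.get(name) != digest:
            changed.append(name)
            known_hashes[name] = digest
    return changed


def is_expired(created_at: float, now: Optional[float] = None) -> bool:
    """True once a session is at least SESSION_TTL_SECONDS old."""
    now = time.time() if now is None else now
    return now - created_at >= SESSION_TTL_SECONDS
```

Hashing file contents rather than comparing names or timestamps means a re-uploaded but identical PDF is skipped, while an edited file with the same name is reprocessed.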

Architecture Summary

The React + TypeScript frontend (Vite, Tailwind CSS) provides a Claude.ai-style chat layout with typing indicator, mode badges, and markdown rendering. The Node.js + Express backend handles session management and file uploads (Multer, 50MB PDF limit, up to 20 files), and bridges to the Python RAG engine via subprocess. The Python RAG Brain runs the full pipeline: extraction, chunking, embedding, retrieval, strict synthesis, and LLM tutor integration. ChromaDB provides persistent vector storage per session.

Results & Impact

A production-grade RAG system that handles scanned PDFs, embedded equations, multi-turn conversation, and session isolation, with a dual-mode architecture that gives users document accuracy and LLM depth in the same interface.