SATORI

Role: Solo Builder (RAG pipeline architecture, PDF extraction and equation handling, hybrid retrieval system, dual-mode answer synthesis, session management, full-stack deployment)

Python, ChromaDB, BGE-large Embeddings, Cross-Encoder Reranking, IBM WatsonX, Meta Llama 3.3 70B, PyMuPDF, Tesseract OCR, React, TypeScript, Node.js, Express.js, Tailwind CSS, Vite

View on GitHub

Overview

Most RAG systems force a choice: answer only from your documents (safe but limited) or answer from an LLM (powerful but hallucination-prone). SATORI refuses the tradeoff. Upload up to 20 PDFs, including scanned documents and equations, and SATORI builds a personal, session-isolated knowledge bank using BGE-large embeddings and ChromaDB. In Strict mode, every answer comes only from your PDFs with source citations and page numbers. In LLM Tutor mode, your PDF excerpts are sent as grounding context to Llama 3.3 70B on IBM WatsonX, which expands and elaborates without losing the document anchor. Context-based recall and follow-up detection make it feel like a conversation, not a search engine.

The Problem

Standard RAG systems have a well-known ceiling problem. They retrieve from documents accurately but can't synthesise, explain, or elaborate beyond what's literally written. Pure LLM tutors are fluid and expansive but hallucinate freely, especially on technical or domain-specific material. Every student using AI to study faces this exact frustration: the document-grounded answer is too narrow, and the LLM answer can't be trusted. SATORI was built to resolve this at the architecture level: two modes, one knowledge bank, zero compromise on accuracy.

How It Was Built

A basic RAG system chunks a document, embeds the chunks, and retrieves the most relevant ones at query time. This works for simple factual lookups. It fails when questions require synthesis across multiple concepts, documents contain scanned pages, follow-ups depend on prior context, or users need explanation that goes slightly beyond what the document states. SATORI was designed to address all four failure modes explicitly.

Failure Mode	SATORI's Answer
Multi-concept synthesis	Hybrid retrieval with cross-encoder reranking
Scanned / non-machine-readable pages	Tesseract OCR fallback + equation image cropping
Follow-up questions without context	3-turn conversation window with follow-up detection
Need for explanation beyond the document	LLM Tutor Mode with document-grounded prompting

Dual-Mode Design

Both modes share the same knowledge bank but differ in how they synthesise answers.

Strict Mode

Operates as a pure document retrieval system. Every answer is constructed entirely from chunks extracted from the user's PDFs. Sources, page numbers, and similarity scores are shown alongside every answer. There is no LLM in the loop, so nothing in the answer is generated text that could be hallucinated. This is the mode for when accuracy is non-negotiable: exam preparation, technical reference, legal or medical documents.
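As a rough illustration of what "constructed entirely from chunks" can look like, the sketch below assembles an answer from retrieved passages and attaches the source, page number, and similarity score to each one. The `Chunk` structure, the `strict_answer` name, and the `min_score` cutoff are assumptions made for this example, not details taken from SATORI's code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved passage plus its provenance (hypothetical structure)."""
    text: str
    source: str   # PDF file name
    page: int     # 1-based page number
    score: float  # similarity score reported by retrieval


def strict_answer(chunks: list[Chunk], min_score: float = 0.3) -> str:
    """Build an answer only from retrieved text, one citation per passage.

    min_score is an illustrative cutoff, not a documented SATORI setting.
    """
    kept = sorted(
        (c for c in chunks if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    if not kept:
        # No generative fallback in this mode: report the miss instead of guessing.
        return "No matching passage found in the uploaded PDFs."
    return "\n\n".join(
        f"{c.text.strip()} [{c.source}, p. {c.page}, score {c.score:.2f}]"
        for c in kept
    )
```

Because the function only filters, sorts, and quotes, any error in the output traces back to retrieval or to the source document rather than to text generation.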

LLM Tutor Mode

Activates IBM WatsonX Llama 3.3 70B. Critically, the model does not answer from its parametric knowledge alone. Top-ranked PDF chunks are sent as grounding context in the prompt, and the model is instructed to use the documents as its primary reference — only expanding beyond them where document knowledge is genuinely insufficient. Users get depth and explanation without losing the document anchor.
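One way to picture document-grounded prompting is a small prompt builder that places the top-ranked excerpts ahead of the question and tells the model to treat them as its primary reference. The function name, the excerpt tuple layout, and the instruction wording below are all assumptions for illustration; the call to WatsonX itself is left out.

```python
def build_tutor_prompt(question: str, excerpts: list[tuple[str, int, str]]) -> str:
    """Assemble a grounded prompt from (source, page, text) excerpts.

    The instruction text is a hypothetical example, not SATORI's actual prompt.
    """
    context = "\n\n".join(
        f"[{i}] {source}, p. {page}\n{text.strip()}"
        for i, (source, page, text) in enumerate(excerpts, start=1)
    )
    return (
        "You are a tutor. Use the document excerpts below as your primary "
        "reference and cite them by number. Go beyond them only where they "
        "are insufficient, and say so when you do.\n\n"
        f"Document excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the excerpts gives the model something concrete to cite, which makes it easier to check afterwards which parts of an answer came from the documents.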

PDF Extraction Pipeline

Most RAG systems assume clean, digital PDFs. Real-world study material is often neither.

Condition	Handling
Digital text PDF	PyMuPDF handles direct text extraction
Scanned page (< 80 chars extracted)	Tesseract OCR runs full-page recognition
Equation regions	Cropped as image assets, stored separately, served inline in the chat UI alongside text
Chunk boundaries	Topic-aware chunker spans page boundaries — concepts are never orphaned mid-explanation
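The scanned-page rule in the table can be sketched as a simple length check: try direct extraction first and fall back to OCR when fewer than 80 characters come back. To keep the example free of external dependencies, the two extractors are passed in as callables; the function name and return shape are assumptions for this sketch.

```python
from typing import Callable

MIN_TEXT_CHARS = 80  # threshold stated in the table above


def extract_page_text(
    extract_text: Callable[[], str],
    run_ocr: Callable[[], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> tuple[str, str]:
    """Return (text, method), using OCR only when direct extraction is too thin."""
    text = extract_text().strip()
    if len(text) >= min_chars:
        return text, "text"
    # Fewer than min_chars characters: treat the page as scanned.
    return run_ocr().strip(), "ocr"
```

In a real pipeline, `extract_text` might wrap PyMuPDF's `page.get_text()` and `run_ocr` might wrap a Tesseract binding such as `pytesseract.image_to_string` applied to a rendered page image. The choice of binding is an assumption here, since the source names only Tesseract.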

Most chunkers split at fixed character counts, which frequently cuts a concept in half. SATORI's chunker detects topic continuity across page boundaries to preserve explanatory coherence.
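The source does not say how topic continuity is detected, so the sketch below substitutes a much simpler stand-in: if a page ends without sentence-final punctuation, its last paragraph is stitched to the first paragraph of the next page before chunks are packed. The function name, the punctuation heuristic, and the `max_chars` value are all assumptions.

```python
import re

# A paragraph "ends cleanly" if it finishes with ., ! or ? (optionally followed by ) or ]).
SENTENCE_END = re.compile(r"[.!?][)\]]*$")


def chunk_pages(pages: list[str], max_chars: int = 1200) -> list[dict]:
    """Pack paragraphs into chunks, stitching paragraphs split by a page break."""
    paragraphs: list[dict] = []  # each: {"text", "first_page", "last_page"}
    for page_no, page_text in enumerate(pages, start=1):
        parts = [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]
        for i, part in enumerate(parts):
            prev = paragraphs[-1] if paragraphs else None
            continues = (
                i == 0
                and prev is not None
                and prev["last_page"] == page_no - 1
                and not SENTENCE_END.search(prev["text"])
            )
            if continues:
                # Previous page stopped mid-sentence: same paragraph, new page.
                prev["text"] += " " + part
                prev["last_page"] = page_no
            else:
                paragraphs.append(
                    {"text": part, "first_page": page_no, "last_page": page_no}
                )

    chunks: list[dict] = []
    for para in paragraphs:
        last = chunks[-1] if chunks else None
        if last and len(last["text"]) + len(para["text"]) + 2 <= max_chars:
            last["text"] += "\n\n" + para["text"]
            last["last_page"] = para["last_page"]
        else:
            # A paragraph longer than max_chars is kept whole rather than cut.
            chunks.append(dict(para))
    return chunks
```

Each chunk records the page range it covers, so a citation can still point to the right pages when a passage straddles a page break.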

Hybrid Retrieval

SATORI uses a three-signal retrieval pipeline rather than pure dense vector similarity.

Signal	Method	Purpose
1	BGE-large-en-v1.5 (BAAI) dense embeddings + cosine similarity via ChromaDB	Semantic retrieval — understands what you mean
2	Cross-encoder reranking (ms-marco-MiniLM-L-6-v2)	Scores query-document pairs jointly, capturing nuanced relevance signals bi-encoders miss
3	Keyword score boost	Ensures exact terminology matches aren't deprioritised by semantic distance alone

The result is retrieval that accounts for both what you mean and what you said.
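How the three signals are combined is not spelled out, so the following is only one plausible arrangement: a weighted sum of the dense similarity, the cross-encoder score squashed to the range 0 to 1, and a keyword-overlap fraction. The weights, the function names, and the candidate dictionary keys are hypothetical, and the dense similarity and reranker logit are assumed to have been computed already by ChromaDB and the cross-encoder.

```python
import math
import re


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the text."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    t_terms = set(re.findall(r"\w+", text.lower()))
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def sigmoid(x: float) -> float:
    """Numerically stable logistic function; maps a reranker logit into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hybrid_score(
    dense_sim: float,
    rerank_logit: float,
    keyword: float,
    w_dense: float = 0.3,    # illustrative weights, not SATORI's values
    w_rerank: float = 0.6,
    w_keyword: float = 0.1,
) -> float:
    return w_dense * dense_sim + w_rerank * sigmoid(rerank_logit) + w_keyword * keyword


def rank(query: str, candidates: list[dict]) -> list[dict]:
    """Sort candidates (dicts with 'text', 'dense_sim', 'rerank_logit') by hybrid score."""
    scored = [
        (
            hybrid_score(
                c["dense_sim"], c["rerank_logit"], keyword_overlap(query, c["text"])
            ),
            c,
        )
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

With equal dense and reranker scores, the keyword term acts as a tie-breaker in favour of the passage that uses the query's exact terminology.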

Conversational Context

Single-turn RAG breaks down as a study tool. SATORI maintains a 3-turn conversation window and actively detects follow-up questions. Short or ambiguous queries like "Can you elaborate?" or "Give me an example of that" are automatically enriched with context from the previous exchange before retrieval runs. The user never has to repeat context — SATORI infers it.
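A minimal sketch of the conversation window and follow-up enrichment might look like the class below. The cue-word list, the six-word limit, and the choice to prepend the previous question are assumptions for illustration; the source says only that short or ambiguous queries are enriched from the previous exchange.

```python
import re
from collections import deque

# Hypothetical cue words that suggest a query leans on earlier context.
FOLLOW_UP_CUES = {"elaborate", "example", "explain", "more", "that", "this", "it", "why"}


class Conversation:
    def __init__(self, max_turns: int = 3):
        # deque drops the oldest turn automatically once the window is full.
        self.turns: deque = deque(maxlen=max_turns)

    def add_turn(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def is_follow_up(self, query: str, max_words: int = 6) -> bool:
        words = re.findall(r"[a-z']+", query.lower())
        return (
            bool(self.turns)
            and len(words) <= max_words
            and any(w in FOLLOW_UP_CUES for w in words)
        )

    def retrieval_query(self, query: str) -> str:
        """Return the query to embed, enriched with the previous question if needed."""
        if not self.is_follow_up(query):
            return query
        prev_question, _ = self.turns[-1]
        return f"{prev_question} {query}"
```

For example, after a turn asking "What is entropy?", the query "Can you elaborate?" would be sent to retrieval as "What is entropy? Can you elaborate?", while a long, self-contained question passes through unchanged.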

Session Isolation

Every user gets a fully isolated session environment — their own ChromaDB instance, their own uploaded PDFs, their own conversation history.

Feature	Detail
Session TTL	24 hours, automatic cleanup
Data isolation	No user's knowledge bank bleeds into another's
Incremental uploads	Only changed files are reprocessed — SHA-256 hashing detects unchanged documents
File limits	Up to 20 PDFs, 50 MB per file
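The incremental-upload and TTL rules in the table can be sketched with the standard library alone. The function names and the in-memory `known` mapping are hypothetical, and reading 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
import hashlib
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 60 * 60   # 24-hour session lifetime
MAX_FILES = 20
MAX_FILE_BYTES = 50 * 1024 * 1024    # assumes "50 MB" means 50 MiB


def file_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def files_to_process(uploads: dict[str, bytes], known: dict[str, str]) -> list[str]:
    """Return names of uploads that are new or changed, updating `known` in place."""
    if len(uploads) > MAX_FILES:
        raise ValueError(f"at most {MAX_FILES} PDFs per session")
    changed = []
    for name, data in uploads.items():
        if len(data) > MAX_FILE_BYTES:
            raise ValueError(f"{name} exceeds the per-file size limit")
        digest = file_digest(data)
        if known.get(name) != digest:
            # New file, or same name with different content: reprocess it.
            changed.append(name)
            known[name] = digest
    return changed


def is_expired(created_at: float, now: Optional[float] = None) -> bool:
    """True once a session is older than the TTL and eligible for cleanup."""
    current = time.time() if now is None else now
    return current - created_at > SESSION_TTL_SECONDS
```

Hashing file contents rather than comparing names or timestamps means a re-uploaded but unchanged PDF is skipped, while an edited file with the same name is picked up.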

Tech Stack

Layer	Technology
Frontend	React + TypeScript (Vite, Tailwind CSS) — Claude.ai-style chat layout with typing indicator, mode badges, markdown rendering
Backend	Node.js + Express — session management, file uploads via Multer, bridges to Python engine via subprocess
RAG Engine	Python — full pipeline: extraction, chunking, embedding, retrieval, strict synthesis, LLM tutor integration
Vector Store	ChromaDB — persistent vector storage, one instance per session
LLM	IBM WatsonX Llama 3.3 70B
OCR	Tesseract
PDF Parsing	PyMuPDF

Results & Impact

SATORI is a production-grade RAG system that handles scanned PDFs, embedded equations, multi-turn conversation, and session isolation, with a dual-mode architecture that gives users document accuracy and LLM depth in the same interface.
