RAG Done Simply

September 22, 2025 · 3 min read

What is RAG (and why you might want it)

Retrieval-Augmented Generation (RAG) is a lightweight pattern that lets an LLM “look things up” before it answers. Instead of hoping the model remembers everything, you index your own docs (PDFs, notes, reports) into a vector database. At question time, you:

  1. Embed the user query,
  2. Retrieve the most similar chunks,
  3. Pass that context to the LLM.

You get answers that are grounded in your data and easier to audit (you can show the source chunks).
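
In code, that loop is only a few lines. The sketch below is generic: embed, search, and generate are placeholders for whichever embedding model, vector store client, and LLM you plug in.

```python
from typing import Callable

# embed, search, and generate are placeholders for your embedding model,
# vector store client, and LLM call.
def rag_answer(question: str, embed: Callable, search: Callable,
               generate: Callable, top_k: int = 5) -> str:
    query_vector = embed(question)        # 1. embed the user query
    chunks = search(query_vector, top_k)  # 2. retrieve the most similar chunks
    context = "\n\n".join(chunks)
    prompt = f"--- CONTEXT ---\n{context}\n\n--- QUESTION ---\n{question}"
    return generate(prompt)               # 3. let the LLM answer with that context
```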

What we set out to accomplish

Keep it simple: a small local embedding model, a local vector DB (Qdrant), and a small local instruct LLM. No orchestration frameworks—just a clean, minimal RAG that:

  • Ingests PDFs,
  • Splits and cleans text,
  • Embeds + stores in Qdrant,
  • Retrieves top-K context,
  • Builds a focused prompt and generates an answer.

Buffett Data

How we collected data

The data is a collection of PDFs written by Warren Buffett that were available for download online.

Pre-processing

PDF text is noisy: hard line breaks, hyphenation, empty blocks. We:

  • Normalize line breaks and whitespace,
  • De-hyphenate across line breaks,
  • Drop very short or non-alphanumeric paragraphs,
  • Split on double newlines to get paragraph-like chunks.

This yields chunks that are big enough for semantic similarity but small enough to be specific. Consider storing basic metadata per chunk (source filename, page range, a stable chunk_id) to support citations.
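
A minimal version of that cleaning-and-chunking pass might look like the sketch below; the length and alphanumeric-ratio thresholds and the chunk dictionary shape are illustrative choices, not fixed rules.

```python
import re

def clean_and_chunk(raw_text: str, source: str) -> list[dict]:
    """Split raw PDF text into paragraph-like chunks with basic metadata."""
    text = raw_text.replace("\r\n", "\n")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)       # de-hyphenate across line breaks
    text = re.sub(r"[ \t]+", " ", text)                # collapse runs of spaces/tabs
    paragraphs = re.split(r"\n\s*\n", text)            # split on blank lines

    chunks = []
    for i, para in enumerate(paragraphs):
        para = re.sub(r"\s*\n\s*", " ", para).strip()  # join hard-wrapped lines
        # Drop very short or mostly non-alphanumeric blocks (thresholds are arbitrary).
        if len(para) < 40 or sum(c.isalnum() for c in para) / len(para) < 0.5:
            continue
        chunks.append({"text": para, "source": source, "chunk_id": f"{source}-{i}"})
    return chunks
```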

Embeddings and the vector DB

We use mlx-community/embeddinggemma-300m-8bit to create dense vectors and store them in a local Qdrant collection with cosine distance. At query time, we embed the question with the matching query prefix and retrieve the top-K similar chunks. Consistency in the embedding task/prefix between document and query is key to good recall.
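
As a sketch of the indexing side, assuming qdrant-client in local (embedded) mode: the embed callable stands in for however the MLX embedding model is loaded, and the 768-dimension size and the prefix string are assumptions to verify against the model card.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

COLLECTION = "buffett_letters"  # illustrative collection name
DOC_PREFIX = ""                 # document-task prefix from the embedding model card
VECTOR_SIZE = 768               # assumed embeddinggemma-300m output size; verify

def index_chunks(chunks: list[dict], embed, client: QdrantClient) -> None:
    """Embed each chunk (with the document prefix) and upsert it into Qdrant."""
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )
    vectors = embed([DOC_PREFIX + c["text"] for c in chunks])
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=i, vector=list(vec), payload=chunk)
            for i, (vec, chunk) in enumerate(zip(vectors, chunks))
        ],
    )
```

QdrantClient(path="./qdrant_data") gives a local, file-backed instance; pointing the client at a running Qdrant server works the same way.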

LLM

Prompting (lean but structured)

We keep a compact system instruction that sets role/tone and then clearly separates:

  • --- CONTEXT --- (retrieved chunks)
  • --- QUESTION --- (user query)
  • A short directive to use the context and refrain from guessing outside it.

The goal isn’t a perfect “system prompt,” but a reliable schema that the model can follow.
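
In code, the prompt builder can be a plain function; the wording of the system instruction and the closing directive below is illustrative and easy to tweak.

```python
SYSTEM = (
    "You are a careful assistant. Answer using only the provided context. "
    "If the context does not contain the answer, say so instead of guessing."
)

def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble the CONTEXT / QUESTION schema described above."""
    context = "\n\n".join(context_chunks)
    return (
        f"{SYSTEM}\n\n"
        f"--- CONTEXT ---\n{context}\n\n"
        f"--- QUESTION ---\n{question}\n\n"
        "Answer based on the context above."
    )
```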

Adding context (retrieval from the DB)

We embed the user query with the same embedding model (and matching query prefix) used for the documents, retrieve the top-K results from Qdrant, join them with clear separators, and slot them into the prompt. If there’s no strong match (low scores), we can either answer generically or ask for clarification. Showing sources (filenames / page hints) builds trust.
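
The retrieval side mirrors the indexing sketch above. The score cutoff is an arbitrary illustrative value, and QUERY_PREFIX again stands in for whatever query prompt the embedding model expects.

```python
from qdrant_client import QdrantClient

COLLECTION = "buffett_letters"  # same illustrative name as in the indexing sketch
QUERY_PREFIX = ""               # query-task prefix from the embedding model card
MIN_SCORE = 0.35                # arbitrary cutoff for "no strong match"

def retrieve(question: str, embed, client: QdrantClient, top_k: int = 5):
    """Return (context_chunks, sources), or empty lists if nothing matches well."""
    query_vector = embed([QUERY_PREFIX + question])[0]
    hits = client.search(
        collection_name=COLLECTION, query_vector=query_vector, limit=top_k
    )
    hits = [h for h in hits if h.score >= MIN_SCORE]
    chunks = [h.payload["text"] for h in hits]
    sources = [h.payload.get("source", "unknown") for h in hits]
    return chunks, sources
```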

Conclusion

By augmenting the LLM with retrieved chunks, we keep answers grounded in our PDFs. The stack stays small (MLX embeddings + Qdrant + a small MLX LLM), the code is readable, and the pattern scales: better chunking → better retrieval → better answers. That’s a minimal, practical RAG.

The code can be found on GitHub.