Case Study · AI Engineering

Ragify

A production-grade Retrieval-Augmented Generation pipeline built from scratch — no LangChain abstractions, no managed vector DB, no tutorial shortcuts.

Python · FastAPI · Gemini · Groq · pgvector · Supabase · React
768 · Embedding dims
~800 · Tok/s on Groq
2k · Chunk size (chars)
0.2 · LLM temperature
5 · Default top-k
Live Demo · Architecture · GitHub

01 — The Problem

Most RAG demos are toys.

The typical tutorial route — LangChain RetrievalQA, Pinecone managed vectors, OpenAI embeddings — gets you a working demo in an afternoon. But it teaches you almost nothing about what's actually happening inside a retrieval pipeline, and it produces a system where every design parameter is hidden behind an abstraction.

I wanted to understand — and be able to explain to an interviewer — exactly how each stage works, why each decision was made, and what breaks when parameters change. That requires building it yourself.

The goal wasn't "build a chatbot." It was: build something where I can defend every technical decision from the scraper to the prompt template.

02 — The Pipeline

Five stages, zero magic.

01
Scrape
BeautifulSoup · requests
Fetch any HTML URL, strip boilerplate tags (nav, footer, script, style), extract readable text. No headless browser — if it requires JS to render, it's not a valid knowledge source for this pipeline.
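A minimal sketch of this step, assuming requests plus BeautifulSoup; the function name and exact tag list are illustrative, not the project's literal code:

```python
import requests
from bs4 import BeautifulSoup

BOILERPLATE_TAGS = ["nav", "footer", "script", "style"]

def scrape_text(url: str) -> str:
    """Fetch a URL and return readable text with boilerplate tags stripped."""
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Remove navigation, footer, script, and style elements before extraction
    for tag in soup(BOILERPLATE_TAGS):
        tag.decompose()
    # Collapse whitespace into single spaces for clean downstream chunking
    return " ".join(soup.get_text(separator=" ").split())
```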
02
Chunk
2,048 chars · 200-char overlap
Split text at word boundaries — never mid-word. Overlap ensures that a sentence crossing a chunk boundary isn't lost. Chunk size and overlap are both configurable from the UI settings tab.
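A sketch of word-boundary chunking with character overlap; the signature and carry-over logic are a reconstruction of the described behavior, not the exact implementation:

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 200) -> list[str]:
    """Split text into ~chunk_size character chunks, breaking only at word boundaries."""
    words = text.split()
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for word in words:
        # +1 accounts for the joining space
        if current and length + len(word) + 1 > chunk_size:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` characters of trailing words into the next chunk
            carried: list[str] = []
            carried_len = 0
            for w in reversed(current):
                if carried_len + len(w) + 1 > overlap:
                    break
                carried.insert(0, w)
                carried_len += len(w) + 1
            current, length = carried, carried_len
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```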
03
Embed
Gemini gemini-embedding-001 · 768 dims · batched 50
Each chunk is embedded into a 768-dimensional vector. Batched in groups of 50 to stay inside API rate limits. Deduplication runs before embedding — identical chunks from re-ingesting the same URL are skipped.
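A sketch of the batched ingest-side embedding call, assuming the google-generativeai SDK; the exact SDK, call shape, and parameter names may differ from the project's code, and the pre-embedding deduplication step is omitted here:

```python
import google.generativeai as genai

BATCH_SIZE = 50  # stay inside the embedding API's rate limits

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed document chunks in batches of 50 with the retrieval-document task type."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        result = genai.embed_content(
            model="models/gemini-embedding-001",
            content=batch,                   # a list requests batched embeddings
            task_type="RETRIEVAL_DOCUMENT",  # asymmetric: documents at ingest time
            output_dimensionality=768,
        )
        vectors.extend(result["embedding"])
    return vectors
```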
04
Store
pgvector via Supabase · cosine similarity
Vectors stored in a PostgreSQL table with a pgvector column. Retrieval uses a Supabase RPC call — the cosine similarity math runs inside the database, not in Python. The similarity threshold is configurable.
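A sketch of the storage and retrieval calls using supabase-py; the table name, column names, and the match_chunks RPC are assumptions:

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def store_chunk(content: str, embedding: list[float], source_url: str) -> None:
    # The pgvector column accepts the embedding as a plain list of floats
    supabase.table("chunks").insert(
        {"content": content, "embedding": embedding, "source_url": source_url}
    ).execute()

def retrieve(query_embedding: list[float], top_k: int = 5, threshold: float = 0.3) -> list[dict]:
    # Cosine similarity runs inside Postgres via an RPC; the function name is illustrative
    response = supabase.rpc(
        "match_chunks",
        {"query_embedding": query_embedding, "match_threshold": threshold, "match_count": top_k},
    ).execute()
    return response.data
```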
05
Generate
Llama 3.3 70B via Groq · temperature 0.2 · ~800 tok/s
Top-k retrieved chunks are injected into a structured prompt with citation markers. Llama 3.3 70B runs on Groq's inference infrastructure — at ~800 tok/s it's fast enough that streaming felt unnecessary for the demo. Temperature 0.2 keeps answers grounded without being robotic.

The full conversation history is passed on every query turn — multi-turn follow-ups work naturally. Inline citations are tied to source URLs, not just chunk text.
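Putting the generation step together: a sketch of the Groq call with numbered, URL-tied citation blocks and the full conversation history; the prompt wording and function shape are illustrative:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "Cite sources inline as [1], [2], ... matching the numbered context blocks."
)

def answer(question: str, chunks: list[dict], history: list[dict]) -> str:
    # Number each retrieved chunk so inline citations map back to source URLs
    context = "\n\n".join(
        f"[{i + 1}] ({c['source_url']})\n{c['content']}" for i, c in enumerate(chunks)
    )
    messages = (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history  # full prior conversation, passed on every turn
        + [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}]
    )
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        temperature=0.2,
    )
    return completion.choices[0].message.content
```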

03 — Design Decisions

Every tradeoff, documented.

These are the four choices that shaped the architecture — not the ones tutorials make for you, but the ones you actually think about when building for production.

D-01 · Vector database
Considered: Pinecone (managed) · Chosen: pgvector on Supabase

Pinecone is a dependency you don't control — it adds latency, cost, and a third-party failure mode. pgvector keeps vectors inside the same Postgres instance that already holds the rest of the data. Cosine similarity runs as a Supabase RPC, so the vector math never leaves the database.
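For reference, such an RPC can be defined as a SQL function applied as a Supabase migration; here it is held in a Python string, with assumed table, column, and function names (<=> is pgvector's cosine-distance operator):

```python
# Illustrative migration: a SQL function so similarity search runs inside Postgres.
# Table, column, and function names are assumptions, not the project's actual schema.
MATCH_CHUNKS_SQL = """
create or replace function match_chunks(
  query_embedding vector(768),
  match_threshold float,
  match_count int
) returns table (content text, source_url text, similarity float)
language sql stable as $$
  select content,
         source_url,
         1 - (embedding <=> query_embedding) as similarity  -- <=> is cosine distance
  from chunks
  where 1 - (embedding <=> query_embedding) > match_threshold
  order by embedding <=> query_embedding
  limit match_count;
$$;
"""
```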

D-02 · Embedding model
Considered: OpenAI text-embedding-3-small · Chosen: Gemini gemini-embedding-001

Gemini's gemini-embedding-001 takes a task_type parameter: RETRIEVAL_DOCUMENT at ingest time, RETRIEVAL_QUERY at query time. This asymmetric embedding is designed specifically for retrieval workloads and produces meaningfully better recall than symmetric embeddings on factual Q&A tasks.
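At query time the same model is called with the query-side task type; a minimal sketch mirroring the ingest example above, with the same SDK assumption:

```python
import google.generativeai as genai

def embed_query(query: str) -> list[float]:
    """Embed a user query with the asymmetric RETRIEVAL_QUERY task type."""
    result = genai.embed_content(
        model="models/gemini-embedding-001",
        content=query,
        task_type="RETRIEVAL_QUERY",  # counterpart to RETRIEVAL_DOCUMENT at ingest
        output_dimensionality=768,
    )
    return result["embedding"]
```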

D-03 · LLM inference
Considered: OpenAI GPT-4o · Chosen: Llama 3.3 70B via Groq

Groq's LPU hardware runs Llama 3.3 70B at ~800 tok/s — comparable to GPT-3.5 in speed, closer to GPT-4 in quality, and significantly cheaper per token. For a RAG use case where the model is primarily synthesizing retrieved context rather than doing complex reasoning, 70B is more than sufficient.

D-04 · RAG framework
Considered: LangChain / LlamaIndex · Chosen: Raw API calls in Python

LangChain would have hidden every interesting decision behind a .from_chain_type() call. Implementing scraping, chunking, embedding, retrieval, and generation manually means I understand what's happening at every step — and can explain it. The total implementation is ~400 lines across 3 files.

04 — Parameters

What you can tune, and why.

The Settings tab in the UI exposes the retrieval parameters directly. Understanding what each one does is the difference between operating a RAG system and just using one.

top_k
Default 5 · Range 1–20

How many chunks are retrieved per query. Higher = more context, higher cost, more noise. Lower = faster, potentially missing relevant context.

similarity_threshold
Default 0.3 · Range 0.0–1.0

Minimum cosine similarity score for a chunk to be included. 0.3 is permissive — good for exploratory queries. Raise to 0.6+ for high-precision factual retrieval.

chunk_size
Default 2048 · Range 512–4096

Character length of each text chunk at ingest time. Smaller chunks = more precise retrieval. Larger chunks = more context per chunk, fewer total vectors.

chunk_overlap
Default 200 · Range 0–512

Character overlap between adjacent chunks. Prevents sentences at chunk boundaries from being split across two separate retrieval units.

temperature
Default 0.2 · Range 0.0–1.0

LLM generation temperature. 0.0 = deterministic and literal. 0.2 = grounded but slightly natural. Above 0.5, the model starts adding information not in the retrieved context.
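One way these settings could be modeled on the server, e.g. as a Pydantic model with the same defaults and bounds; the class and field names are assumptions, not the actual FastAPI code:

```python
from pydantic import BaseModel, Field

class RetrievalSettings(BaseModel):
    """Tunable retrieval and generation parameters exposed in the Settings tab."""
    top_k: int = Field(default=5, ge=1, le=20)
    similarity_threshold: float = Field(default=0.3, ge=0.0, le=1.0)
    chunk_size: int = Field(default=2048, ge=512, le=4096)
    chunk_overlap: int = Field(default=200, ge=0, le=512)
    temperature: float = Field(default=0.2, ge=0.0, le=1.0)
```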

05 — What's Next

Known gaps and next moves.

Production-readiness isn't a checkbox — it's a continuous process. Here's what I'd build next with more time.

Streaming responses

Add stream=True on the Groq client and SSE on FastAPI. ~30 lines. Makes the demo feel dramatically faster to users — senior engineers notice this immediately.
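A sketch of what that could look like, using Groq's OpenAI-style streaming and a FastAPI StreamingResponse emitting SSE frames; the endpoint shape is illustrative and the retrieval wiring is omitted:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from groq import Groq

app = FastAPI()
client = Groq()  # GROQ_API_KEY from the environment

@app.get("/ask")
def ask(question: str):
    def event_stream():
        completion = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[{"role": "user", "content": question}],
            temperature=0.2,
            stream=True,  # yield tokens as they are generated
        )
        for chunk in completion:
            token = chunk.choices[0].delta.content
            if token:
                yield f"data: {token}\n\n"  # one SSE frame per token batch
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```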

Hybrid retrieval

Combine pgvector cosine similarity with BM25 keyword search (full-text search in Postgres). Hybrid retrieval outperforms pure vector search on precise factual queries.
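The fusion side of that can be as simple as reciprocal rank fusion over the two ranked lists; a sketch where the keyword hits would come from Postgres full-text search and each hit is assumed to carry an "id" key:

```python
def reciprocal_rank_fusion(
    vector_hits: list[dict], keyword_hits: list[dict], k: int = 60
) -> list[dict]:
    """Merge vector-search and full-text-search result lists by reciprocal rank."""
    scores: dict[str, float] = {}
    by_id: dict[str, dict] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, hit in enumerate(hits):
            doc_id = hit["id"]
            by_id[doc_id] = hit
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_id[doc_id] for doc_id in ranked]
```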

Re-ranking

Add a cross-encoder re-ranker (Cohere or a local model) after initial vector retrieval. Bi-encoder retrieval is fast but approximate — re-ranking on top-20, returning top-5, improves precision.
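A sketch using a local cross-encoder from sentence-transformers; the model choice and hit shape are illustrative:

```python
from sentence_transformers import CrossEncoder

# A small local cross-encoder; the specific model is an illustrative choice
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], keep: int = 5) -> list[dict]:
    """Score (query, chunk) pairs jointly and keep only the top few."""
    pairs = [(query, c["content"]) for c in chunks]
    scores = reranker.predict(pairs)  # higher score = more relevant
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```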

Evaluation harness

Build a test suite using RAGAS metrics — faithfulness, answer relevancy, context precision. Right now quality is assessed manually; a proper eval loop would make parameter tuning data-driven.
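A sketch of what that harness could look like with RAGAS; the column names follow the 0.1-style API, which may differ by version, and RAGAS also needs a judge LLM configured:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# A tiny illustrative eval set; in practice this would be a curated suite of
# question / retrieved-context / answer / reference rows.
rows = {
    "question": ["What embedding model does the pipeline use?"],
    "answer": ["It uses Gemini gemini-embedding-001 at 768 dimensions."],
    "contexts": [["Each chunk is embedded into a 768-dimensional vector with gemini-embedding-001."]],
    "ground_truth": ["gemini-embedding-001 with 768-dimensional vectors."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```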

06 — Takeaways

What this project actually taught me.

01
Chunking strategy matters more than model choice
Poor chunking produces poor retrieval regardless of how good your embedding model is. Getting the chunk boundaries right — word-aligned, with overlap — improved answer quality more than switching models did.
02
Vector math belongs in the database
Running cosine similarity in Python on the application server means you load all vectors into memory first. Pushing it into Postgres as an RPC is architecturally correct and dramatically more scalable.
03
Temperature is a grounding dial, not a creativity dial
0.2 is the right temperature for RAG — the model should mostly synthesize retrieved context, not hallucinate. Above 0.4 the model starts generating plausible-sounding information that isn't in the retrieved chunks.
04
Framework abstractions hide the interesting problems
Every difficulty I ran into (overlap edge cases, API rate limiting on batch embedding, prompt injection via malicious URLs) would have been hidden behind LangChain's abstractions. Building from scratch means understanding those problems firsthand.
Want to talk through the architecture?
I can walk through any part of the pipeline in detail.
Live Demo · Get in touch