Case Study · AI Engineering

Ragify

A production-grade Retrieval-Augmented Generation pipeline built from scratch — no LangChain abstractions, no managed vector DB, no tutorial shortcuts.

Python · FastAPI · Gemini · Groq · pgvector · Supabase · React
768 · Embedding dims
~800 · Tok/s on Groq
2k · Chunk size (chars)
0.2 · LLM temperature
5 · Default top-k
Live Demo · Architecture · GitHub

01 — The Problem

Most RAG demos are toys.

The typical tutorial route — LangChain RetrievalQA, Pinecone managed vectors, OpenAI embeddings — gets you a working demo in an afternoon. But it teaches you almost nothing about what's actually happening inside a retrieval pipeline, and it produces a system where every design parameter is hidden behind an abstraction.

I wanted to understand — and be able to explain to an interviewer — exactly how each stage works, why each decision was made, and what breaks when parameters change. That requires building it yourself.

The goal wasn't "build a chatbot." It was: build something where I can defend every technical decision from the scraper to the prompt template.

02 — The Pipeline

Five stages, zero magic.

01
Scrape
BeautifulSoup · requests
Fetch any HTML URL, strip boilerplate tags (nav, footer, script, style), extract readable text. No headless browser — if it requires JS to render, it's not a valid knowledge source for this pipeline.
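A minimal sketch of this step, assuming requests plus BeautifulSoup; the function name and exact tag list are illustrative, not the project's literal code:

```python
import requests
from bs4 import BeautifulSoup

BOILERPLATE_TAGS = ["nav", "footer", "script", "style"]

def scrape_text(url: str) -> str:
    """Fetch a URL and return readable text with boilerplate tags stripped."""
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Remove navigation, footer, script, and style elements before extraction
    for tag in soup(BOILERPLATE_TAGS):
        tag.decompose()
    # Collapse whitespace into single spaces for clean downstream chunking
    return " ".join(soup.get_text(separator=" ").split())
```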
02
Chunk
2,048 chars · 200-char overlap
Split text at word boundaries — never mid-word. Overlap ensures that a sentence crossing a chunk boundary isn't lost. Chunk size and overlap are both configurable from the UI settings tab.
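A sketch of word-boundary chunking with character overlap; the signature and carry-over logic are a reconstruction of the described behavior, not the exact implementation:

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 200) -> list[str]:
    """Split text into ~chunk_size character chunks, breaking only at word boundaries."""
    words = text.split()
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for word in words:
        # +1 accounts for the joining space
        if current and length + len(word) + 1 > chunk_size:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` characters of trailing words into the next chunk
            carried: list[str] = []
            carried_len = 0
            for w in reversed(current):
                if carried_len + len(w) + 1 > overlap:
                    break
                carried.insert(0, w)
                carried_len += len(w) + 1
            current, length = carried, carried_len
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```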
03
Embed
Gemini gemini-embedding-001 · 768 dims · batched 50
Each chunk is embedded into a 768-dimensional vector. Batched in groups of 50 to stay inside API rate limits. Deduplication runs before embedding — identical chunks from re-ingesting the same URL are skipped.
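A sketch of the batched ingest-side embedding call, assuming the google-generativeai SDK; the exact SDK, call shape, and parameter names may differ from the project's code, and the pre-embedding deduplication step is omitted here:

```python
import google.generativeai as genai

BATCH_SIZE = 50  # stay inside the embedding API's rate limits

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed document chunks in batches of 50 with the retrieval-document task type."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        result = genai.embed_content(
            model="models/gemini-embedding-001",
            content=batch,                   # a list requests batched embeddings
            task_type="RETRIEVAL_DOCUMENT",  # asymmetric: documents at ingest time
            output_dimensionality=768,
        )
        vectors.extend(result["embedding"])
    return vectors
```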
04
Store
pgvector via Supabase · cosine similarity
Vectors stored in a PostgreSQL table with a pgvector column. Retrieval uses a Supabase RPC call — the cosine similarity math runs inside the database, not in Python. The similarity threshold is configurable.
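A sketch of the storage and retrieval calls using supabase-py; the table name, column names, and the match_chunks RPC are assumptions:

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def store_chunk(content: str, embedding: list[float], source_url: str) -> None:
    # The pgvector column accepts the embedding as a plain list of floats
    supabase.table("chunks").insert(
        {"content": content, "embedding": embedding, "source_url": source_url}
    ).execute()

def retrieve(query_embedding: list[float], top_k: int = 5, threshold: float = 0.3) -> list[dict]:
    # Cosine similarity runs inside Postgres via an RPC; the function name is illustrative
    response = supabase.rpc(
        "match_chunks",
        {"query_embedding": query_embedding, "match_threshold": threshold, "match_count": top_k},
    ).execute()
    return response.data
```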
05
Generate
Llama 3.3 70B via Groq · temperature 0.2 · ~800 tok/s
Top-k retrieved chunks are injected into a structured prompt with citation markers. Llama 3.3 70B runs on Groq's inference infrastructure — at ~800 tok/s it's fast enough that streaming felt unnecessary for the demo. Temperature 0.2 keeps answers grounded without being robotic.

The full conversation history is passed on every query turn — multi-turn follow-ups work naturally. Inline citations are tied to source URLs, not just chunk text.
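Putting the generation step together: a sketch of the Groq call with numbered, URL-tied citation blocks and the full conversation history; the prompt wording and function shape are illustrative:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "Cite sources inline as [1], [2], ... matching the numbered context blocks."
)

def answer(question: str, chunks: list[dict], history: list[dict]) -> str:
    # Number each retrieved chunk so inline citations map back to source URLs
    context = "\n\n".join(
        f"[{i + 1}] ({c['source_url']})\n{c['content']}" for i, c in enumerate(chunks)
    )
    messages = (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history  # full prior conversation, passed on every turn
        + [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}]
    )
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        temperature=0.2,
    )
    return completion.choices[0].message.content
```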

03 — Design Decisions

Every tradeoff, documented.

These are the four choices that shaped the architecture — not the ones tutorials make for you, but the ones you actually think about when building for production.

D-01 · Vector database
Considered: Pinecone (managed) · Chosen: pgvector on Supabase

Pinecone is a dependency you don't control — it adds latency, cost, and a third-party failure mode. pgvector keeps vectors inside the same Postgres instance that already holds the rest of the data. Cosine similarity runs as a Supabase RPC, so the vector math never leaves the database.
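For reference, such an RPC can be defined as a SQL function applied as a Supabase migration; here it is held in a Python string, with assumed table, column, and function names (<=> is pgvector's cosine-distance operator):

```python
# Illustrative migration: a SQL function so similarity search runs inside Postgres.
# Table, column, and function names are assumptions, not the project's actual schema.
MATCH_CHUNKS_SQL = """
create or replace function match_chunks(
  query_embedding vector(768),
  match_threshold float,
  match_count int
) returns table (content text, source_url text, similarity float)
language sql stable as $$
  select content,
         source_url,
         1 - (embedding <=> query_embedding) as similarity  -- <=> is cosine distance
  from chunks
  where 1 - (embedding <=> query_embedding) > match_threshold
  order by embedding <=> query_embedding
  limit match_count;
$$;
"""
```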

D-02 · Embedding model
Considered: OpenAI text-embedding-3-small · Chosen: Gemini gemini-embedding-001

Gemini's gemini-embedding-001 takes a task_type parameter: RETRIEVAL_DOCUMENT at ingest time, RETRIEVAL_QUERY at query time. This asymmetric embedding is designed specifically for retrieval workloads and produces meaningfully better recall than symmetric embeddings on factual Q&A tasks.
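At query time the same model is called with the query-side task type; a minimal sketch mirroring the ingest example above, with the same SDK assumption:

```python
import google.generativeai as genai

def embed_query(query: str) -> list[float]:
    """Embed a user query with the asymmetric RETRIEVAL_QUERY task type."""
    result = genai.embed_content(
        model="models/gemini-embedding-001",
        content=query,
        task_type="RETRIEVAL_QUERY",  # counterpart to RETRIEVAL_DOCUMENT at ingest
        output_dimensionality=768,
    )
    return result["embedding"]
```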

D-03 · LLM inference
Considered: OpenAI GPT-4o · Chosen: Llama 3.3 70B via Groq

Groq's LPU hardware runs Llama 3.3 70B at ~800 tok/s — comparable to GPT-3.5 in speed, closer to GPT-4 in quality, and significantly cheaper per token. For a RAG use case where the model is primarily synthesizing retrieved context rather than doing complex reasoning, 70B is more than sufficient.

D-04 · RAG framework
Considered: LangChain / LlamaIndex · Chosen: Raw API calls in Python

LangChain would have hidden every interesting decision behind a .from_chain_type() call. Implementing scraping, chunking, embedding, retrieval, and generation manually means I understand what's happening at every step — and can explain it. The total implementation is ~400 lines across 3 files.

04 — Parameters

What you can tune, and why.

The Settings tab in the UI exposes the retrieval parameters directly. Understanding what each one does is the difference between operating a RAG system and just using one.

top_k
Default 5 · Range 1–20

How many chunks are retrieved per query. Higher = more context, higher cost, more noise. Lower = faster, potentially missing relevant context.

similarity_threshold
Default 0.3 · Range 0.0–1.0

Minimum cosine similarity score for a chunk to be included. 0.3 is permissive — good for exploratory queries. Raise to 0.6+ for high-precision factual retrieval.

chunk_size
Default 2048 · Range 512–4096

Character length of each text chunk at ingest time. Smaller chunks = more precise retrieval. Larger chunks = more context per chunk, fewer total vectors.

chunk_overlap
Default 200 · Range 0–512

Character overlap between adjacent chunks. Prevents sentences at chunk boundaries from being split across two separate retrieval units.

temperature
Default 0.2 · Range 0.0–1.0

LLM generation temperature. 0.0 = deterministic and literal. 0.2 = grounded but slightly natural. Above 0.5, the model starts adding information not in the retrieved context.
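One way these settings could be modeled on the server, e.g. as a Pydantic model with the same defaults and bounds; the class and field names are assumptions, not the actual FastAPI code:

```python
from pydantic import BaseModel, Field

class RetrievalSettings(BaseModel):
    """Tunable retrieval and generation parameters exposed in the Settings tab."""
    top_k: int = Field(default=5, ge=1, le=20)
    similarity_threshold: float = Field(default=0.3, ge=0.0, le=1.0)
    chunk_size: int = Field(default=2048, ge=512, le=4096)
    chunk_overlap: int = Field(default=200, ge=0, le=512)
    temperature: float = Field(default=0.2, ge=0.0, le=1.0)
```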

05 — What's Next

Known gaps and next moves.

Production-readiness isn't a checkbox — it's a continuous process. Here's what I'd build next with more time.

Streaming responses

Add stream=True on the Groq client and SSE on FastAPI. ~30 lines. Makes the demo feel dramatically faster to users — senior engineers notice this immediately.
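A sketch of what that could look like, using Groq's OpenAI-style streaming and a FastAPI StreamingResponse emitting SSE frames; the endpoint shape is illustrative and the retrieval wiring is omitted:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from groq import Groq

app = FastAPI()
client = Groq()  # GROQ_API_KEY from the environment

@app.get("/ask")
def ask(question: str):
    def event_stream():
        completion = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[{"role": "user", "content": question}],
            temperature=0.2,
            stream=True,  # yield tokens as they are generated
        )
        for chunk in completion:
            token = chunk.choices[0].delta.content
            if token:
                yield f"data: {token}\n\n"  # one SSE frame per token batch
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```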

Hybrid retrieval

Combine pgvector cosine similarity with BM25 keyword search (full-text search in Postgres). Hybrid retrieval outperforms pure vector search on precise factual queries.
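The fusion side of that can be as simple as reciprocal rank fusion over the two ranked lists; a sketch where the keyword hits would come from Postgres full-text search and each hit is assumed to carry an "id" key:

```python
def reciprocal_rank_fusion(
    vector_hits: list[dict], keyword_hits: list[dict], k: int = 60
) -> list[dict]:
    """Merge vector-search and full-text-search result lists by reciprocal rank."""
    scores: dict[str, float] = {}
    by_id: dict[str, dict] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, hit in enumerate(hits):
            doc_id = hit["id"]
            by_id[doc_id] = hit
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_id[doc_id] for doc_id in ranked]
```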

Re-ranking

Add a cross-encoder re-ranker (Cohere or a local model) after initial vector retrieval. Bi-encoder retrieval is fast but approximate — re-ranking on top-20, returning top-5, improves precision.
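A sketch using a local cross-encoder from sentence-transformers; the model choice and hit shape are illustrative:

```python
from sentence_transformers import CrossEncoder

# A small local cross-encoder; the specific model is an illustrative choice
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], keep: int = 5) -> list[dict]:
    """Score (query, chunk) pairs jointly and keep only the top few."""
    pairs = [(query, c["content"]) for c in chunks]
    scores = reranker.predict(pairs)  # higher score = more relevant
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```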

Evaluation harness

Build a test suite using RAGAS metrics — faithfulness, answer relevancy, context precision. Right now quality is assessed manually; a proper eval loop would make parameter tuning data-driven.
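A sketch of what that harness could look like with RAGAS; the column names follow the 0.1-style API, which may differ by version, and RAGAS also needs a judge LLM configured:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# A tiny illustrative eval set; in practice this would be a curated suite of
# question / retrieved-context / answer / reference rows.
rows = {
    "question": ["What embedding model does the pipeline use?"],
    "answer": ["It uses Gemini gemini-embedding-001 at 768 dimensions."],
    "contexts": [["Each chunk is embedded into a 768-dimensional vector with gemini-embedding-001."]],
    "ground_truth": ["gemini-embedding-001 with 768-dimensional vectors."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```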

06 — Takeaways

What this project actually taught me.

01
Chunking strategy matters more than model choice
Poor chunking produces poor retrieval regardless of how good your embedding model is. Getting the chunk boundaries right — word-aligned, with overlap — improved answer quality more than switching models did.
02
Vector math belongs in the database
Running cosine similarity in Python on the application server means you load all vectors into memory first. Pushing it into Postgres as an RPC is architecturally correct and dramatically more scalable.
03
Temperature is a grounding dial, not a creativity dial
0.2 is the right temperature for RAG — the model should mostly synthesize retrieved context, not hallucinate. Above 0.4 the model starts generating plausible-sounding information that isn't in the retrieved chunks.
04
Framework abstractions hide the interesting problems
Every difficulty I ran into (overlap edge cases, API rate limiting on batch embedding, prompt injection via malicious URLs) would have been hidden behind LangChain's abstractions. Building from scratch means understanding those problems firsthand.
Want to talk through the architecture?
I can walk through any part of the pipeline in detail.
Live Demo · Get in touch