Architecture

System overview

flowchart LR
  subgraph Ingestion
    D[Documents<br>PDF / MD / TXT] --> C[Chunker]
    C --> E[Embedder<br>TF-IDF]
    E --> I[(Index<br>pickle)]
  end
  subgraph Query
    Q[User Query] --> R[Retriever]
    R --> I
    R --> G[Answer Generator]
    G --> A[Response + Citations]
  end
  subgraph API Layer
    F[FastAPI Server]
    F --> Auth[Auth Manager]
    F --> RL[Rate Limiter]
    F --> M[Prometheus Metrics]
  end

Pipeline stages

1. Ingestion

Documents go through three steps:

Parsing - Extract text from PDF (via pypdf), Markdown, or plain text files.
Chunking - Split the text into overlapping windows (default: 220 characters with a 40-character overlap). The overlap preserves context across chunk boundaries.
Indexing - Vectorize the chunks with scikit-learn's TF-IDF. The resulting sparse matrix and chunk metadata are serialized to a pickle file.

The chunker handles edge cases like empty documents, single-character files, and very long lines. Chunk boundaries snap to sentence ends when possible.
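The windowing described above can be sketched in a few lines. This is a minimal illustration, not the project's chunker: the names `Chunk` and `chunk_text` are hypothetical, and the sketch cuts at fixed character offsets without the sentence-end snapping mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: int  # inclusive character offset in the source document
    end: int    # exclusive character offset in the source document


def chunk_text(text: str, size: int = 220, overlap: int = 40) -> list[Chunk]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap  # each window starts `step` characters after the previous one
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        chunks.append(Chunk(text[start:end], start, end))
        if end == len(text):
            break  # this window already reaches the end of the document
    return chunks
```

With the defaults, a 500-character document yields windows starting at offsets 0, 180, and 360, so each window shares its first 40 characters with the end of the previous one. An empty document yields no chunks, and a single-character file yields exactly one.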

2. Retrieval

When a query comes in:

The query gets vectorized using the same TF-IDF vocabulary
Cosine similarity scores are computed against all chunks
Top-k chunks are returned, ranked by relevance score
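To make the three retrieval steps concrete, here is a dependency-free sketch. It is a hand-rolled stand-in for the scikit-learn vectorizer, not the project's code: the function names are hypothetical, the tokenizer is a simple assumption, and the smoothed IDF formula is chosen to resemble scikit-learn's default.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Weight term counts by IDF, drop out-of-vocabulary terms, then L2-normalize."""
    weights = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def build_index(chunks: list[str]):
    """Return the IDF table (the vocabulary) and one normalized vector per chunk."""
    doc_tokens = [tokenize(c) for c in chunks]
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    n = len(chunks)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in doc_tokens]


def retrieve(query: str, chunks: list[str], idf, vectors, k: int = 3):
    """Score every chunk by cosine similarity to the query and return the k best."""
    q = vectorize(tokenize(query), idf)  # same vocabulary as the index
    # Both sides are unit-length, so the dot product is the cosine similarity.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]
```

Calling `build_index` once at ingestion time and `retrieve` per query mirrors the split between the two pipeline stages. Query terms that never appeared in any chunk are simply ignored, which is the practical consequence of reusing the index vocabulary.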

3. Answer generation

Retrieved chunks are assembled into a grounded answer. Each claim maps back to a citation with:

Source document ID
Character offsets (start/end position in the original document)
Relevance score
Text excerpt

This makes every answer auditable: each statement can be traced to a specific span of a source document, so the answer contains no hallucinated content.
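One way to picture the citation record and the audit it enables is the sketch below. The `Citation` class, its field names, and `is_auditable` are illustrative assumptions based on the four fields listed above, not the project's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    doc_id: str
    start: int      # inclusive offset into the original document
    end: int        # exclusive offset into the original document
    score: float    # relevance score assigned by the retriever
    excerpt: str


def is_auditable(citation: Citation, documents: dict[str, str]) -> bool:
    """True when the excerpt matches the cited span of the source document exactly."""
    source = documents.get(citation.doc_id)
    if source is None:
        return False
    return source[citation.start:citation.end] == citation.excerpt
```

A reviewer, or an automated check, can run `is_auditable` over every citation in a response to confirm that the quoted text really sits at the stated offsets.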

4. Evaluation

The evaluate command runs context precision tests against a golden dataset. This measures how well the retrieval step surfaces the right chunks for known questions, giving you a quantitative quality signal.
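The exact formula behind the evaluate command is not spelled out here, so the sketch below assumes the simplest definition of context precision: the fraction of retrieved chunks that the golden dataset marks as relevant. Other definitions weight hits by rank. The function names and the golden record layout (`question`, `relevant_ids`) are hypothetical.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that appear in the golden relevant set."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def evaluate(golden: list[dict], retrieve_ids) -> float:
    """Average context precision over every question in the golden dataset.

    `retrieve_ids` is any callable that maps a question to a list of chunk IDs.
    """
    scores = [
        context_precision(retrieve_ids(case["question"]), set(case["relevant_ids"]))
        for case in golden
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a score of 1.0 means every returned chunk was one the golden dataset expected, and lower scores indicate that irrelevant chunks are crowding the top-k results.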

Security model

Layer	Implementation
Authentication	Bearer token with role-based access (admin/user)
Authorization	Role checks on sensitive endpoints (ingest, evaluate, admin)
Rate limiting	In-memory per-IP sliding window
Observability	Prometheus metrics + structured logging
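The in-memory per-IP sliding window from the table could look roughly like this. It is a sketch under assumptions: the class name, the default limit of 60 requests per 60 seconds, and the injectable clock are all hypothetical choices made for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit: int = 60, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # ip -> timestamps of recent accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Because the state lives in a plain dictionary, the counters reset whenever the process restarts and are not shared between processes, which fits a single-process deployment.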

Tokens are configured via environment variables (RAG_ADMIN_TOKEN, RAG_USER_TOKEN), so they are never hard-coded in the source.
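As an illustration of how a bearer token might be mapped to a role, consider the sketch below. Only the two environment variable names come from this document; the function `resolve_role`, its return values, and the header parsing are assumptions.

```python
import hmac
import os


def resolve_role(authorization_header, env=os.environ):
    """Map an `Authorization: Bearer <token>` header value to "admin", "user", or None."""
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    for role, variable in (("admin", "RAG_ADMIN_TOKEN"), ("user", "RAG_USER_TOKEN")):
        expected = env.get(variable)
        # compare_digest avoids leaking token contents through timing differences.
        if expected and hmac.compare_digest(token.encode(), expected.encode()):
            return role
    return None
```

A handler for a sensitive endpoint would then reject the request unless the resolved role is permitted for that endpoint. An unset environment variable never matches, so a missing configuration fails closed.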

Data flow

User --[Bearer token]--> FastAPI
  --> Rate limiter (per-IP)
  --> Auth check (role validation)
  --> Handler (ask/ingest/evaluate)
  --> Pipeline (chunk/embed/retrieve)
  --> Response (answer + citations + rate_limit)

Everything runs in a single process. There are no external services to operate: the only requirements are Python, the packages named above, and the files you ingest. The index is a local pickle file, so there is no database to manage.
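Persisting the index as a single local file can be sketched as follows. The payload layout (a dictionary with `matrix` and `metadata` keys) and the function names are assumptions, and the write-then-rename step is one possible design choice rather than something this document specifies.

```python
import pickle
from pathlib import Path


def save_index(path: Path, matrix, metadata: list[dict]) -> None:
    """Write to a temporary file, then rename, so a crash cannot leave a half-written index."""
    payload = {"matrix": matrix, "metadata": metadata}
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(payload, fh, protocol=pickle.HIGHEST_PROTOCOL)
    tmp.replace(path)


def load_index(path: Path):
    # Only load pickle files this process wrote itself: unpickling can execute arbitrary code.
    with path.open("rb") as fh:
        payload = pickle.load(fh)
    return payload["matrix"], payload["metadata"]
```

Any picklable object works for `matrix`, including a SciPy sparse matrix, so the same two functions cover both a toy list-of-lists index and a real TF-IDF matrix.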