Building an Enterprise RAG Pipeline on Azure Databricks
At Vanderlande, the warehouse automation documentation corpus exceeded 2.3 million pages — equipment manuals, safety procedures, maintenance schedules, engineering specifications, regulatory compliance documents. Engineers spent an average of 47 minutes per query finding the right information. We built a RAG pipeline that reduced this to under 30 seconds with 94% answer accuracy.
Why RAG over Fine-Tuning
Fine-tuning a model on Vanderlande's corpus would have required:
- Re-training every time documentation changed (weekly)
- Handling data governance for proprietary engineering schematics
- Managing model versioning across different business units
RAG separates knowledge (vectorized documents) from reasoning (the LLM). New documents are indexed within minutes. The LLM never sees raw proprietary data during training. Access control is enforced at the retrieval layer, not the model layer.
Architecture
Data Ingestion: Delta Live Tables
The ingestion pipeline runs on Databricks with Delta Live Tables (DLT) for exactly-once processing guarantees:
Bronze Layer: Raw documents ingested as binary blobs with metadata (source system, document ID, last modified timestamp, security classification).
Silver Layer: Document parsing and normalization. PDFs run through Azure Document Intelligence for OCR + layout analysis. We preserve table structures, figure captions, and cross-references — critical context that naive text extraction destroys.
Gold Layer: Chunked, embedded, and ready for indexing. This is where the interesting engineering happens.
Chunking Strategy: The Hard Problem
Chunking is the most consequential decision in a RAG pipeline. Too small, and you lose context. Too large, and retrieval precision drops. We tested four strategies:
| Strategy | Chunk Size | Overlap | Retrieval Accuracy | Latency |
|---|---|---|---|---|
| Fixed-size | 512 tokens | 50 tokens | 71% | 120ms |
| Sentence-based | ~200 tokens | 1 sentence | 78% | 140ms |
| Semantic (embedding similarity) | Variable | Adaptive | 84% | 180ms |
| Hybrid hierarchical | Parent: 2048 / Child: 256 | None | 94% | 160ms |
Hybrid hierarchical chunking won decisively. Here's how it works:
- Split documents into parent chunks (~2048 tokens) at natural section boundaries (headings, page breaks)
- Split each parent into child chunks (~256 tokens) at sentence boundaries
- Embed and index child chunks for high-precision retrieval
- When a child chunk matches, retrieve the parent chunk for generation context
This gives you the precision of small chunks during retrieval and the context of large chunks during generation.
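The parent/child mechanics above can be sketched in a few lines of Python. This is an illustrative sketch, not the production pipeline: token counts are approximated by whitespace-delimited words, real parent splits follow heading and page boundaries rather than fixed word counts, and all function names are hypothetical.

```python
def split_words(text, max_words):
    """Greedily pack whitespace-delimited words into chunks of <= max_words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_hierarchy(document, parent_words=2048, child_words=256):
    """Return (parents, children); each child records its parent's index.

    Children are what gets embedded and indexed. At query time a matching
    child is swapped for its parent to give the LLM wider context.
    """
    parents = split_words(document, parent_words)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_words):
            children.append({"parent_id": parent_id, "text": child})
    return parents, children

def expand_to_parent(match, parents):
    """Retrieval step: a child-chunk hit is expanded to its parent chunk."""
    return parents[match["parent_id"]]
```

Only the children ever enter the vector index; the parents live in a plain lookup table keyed by `parent_id`, so the expansion step is a cheap dictionary read rather than a second search.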
Embedding Model Selection
We evaluated three embedding models on our domain-specific benchmark:
| Model | Dimensions | Domain Accuracy | Throughput | Cost |
|---|---|---|---|---|
| text-embedding-ada-002 | 1536 | 82% | 2,500 docs/min | $0.0001/1K tokens |
| text-embedding-3-large | 3072 | 89% | 1,800 docs/min | $0.00013/1K tokens |
| text-embedding-3-large (1536-dim) | 1536 | 88% | 2,200 docs/min | $0.00013/1K tokens |
text-embedding-3-large with Matryoshka dimension reduction to 1536 dimensions gave us 88% accuracy (one percentage point below the full 3072-dimension model) at 22% higher throughput. The reduced dimensions also halved our vector storage costs.
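Matryoshka-trained embeddings make this reduction trivial: keep the leading components and re-normalize to unit length so cosine similarity stays meaningful. A minimal sketch of the client-side operation (the text-embedding-3 models can also do this server-side via the API's `dimensions` parameter; the function name here is ours):

```python
import math

def truncate_matryoshka(embedding, dims=1536):
    """Keep the first `dims` components of a Matryoshka-trained embedding
    and re-normalize to unit length so cosine similarity is preserved."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```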
Vector Store: Azure AI Search over Pinecone
This was a governance-driven decision. Vanderlande operates under strict EU data residency requirements. Azure AI Search runs within the same Azure subscription, VNet, and region as our Databricks workspace. Data never leaves the compliance boundary.
Beyond governance, Azure AI Search offers hybrid search — combining vector similarity with BM25 keyword matching. For technical documentation, keyword search catches exact part numbers and specification codes that semantic search misses. Hybrid retrieval improved accuracy by 6% over vector-only search.
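Hybrid retrieval works by merging the vector and BM25 ranked lists with Reciprocal Rank Fusion (RRF). A self-contained sketch of the fusion step, assuming plain ranked lists of document IDs — Azure AI Search performs this internally for hybrid queries, and `k=60` is the commonly cited constant:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked result lists.

    Each ranking is a list of document IDs, best first. A document's
    fused score is the sum of 1/(k + rank) across the lists it appears
    in, so items ranked well by either retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A part number that only keyword search finds, or a paraphrased concept that only vector search finds, each still earns a score from its one list — which is exactly why the hybrid setup catches specification codes that semantic search alone misses.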
Unity Catalog: Row-Level Security for RAG
The most complex requirement: different engineers should only retrieve documents they're authorized to see. A maintenance engineer shouldn't find classified R&D specifications. A contractor shouldn't see internal safety incident reports.
We implemented this through Unity Catalog's row-level security:
- Each document chunk inherits the security classification of its source document
- User authentication flows through Azure AD → Databricks → Unity Catalog
- At query time, the retrieval function runs with the user's Unity Catalog permissions
- The vector search is post-filtered by the user's accessible document set
This means the same vector index serves all users, but each user's results are filtered by their authorization level. No separate indexes per role, no data duplication.
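Conceptually, the post-filter reduces to a set-membership check. The sketch below is illustrative only — in the deployment described here the filtering is enforced by Unity Catalog row-level security rather than application code, and the role names and clearance map are invented for the example:

```python
# Hypothetical clearance map: role -> set of readable classifications.
CLEARANCE = {
    "contractor": {"public"},
    "maintenance_engineer": {"public", "internal"},
    "rnd_engineer": {"public", "internal", "rnd_restricted"},
}

def filter_by_clearance(results, role):
    """Drop retrieved chunks whose source-document classification
    the user's role is not cleared to read. Unknown roles see nothing."""
    allowed = CLEARANCE.get(role, set())
    return [r for r in results if r["classification"] in allowed]
```

Because every chunk inherits its source document's classification at ingestion time, this check needs no joins at query time — the classification travels with the chunk through the whole pipeline.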
Generation Layer
Prompt Engineering for Technical Accuracy
The system prompt enforces strict citation requirements:
```
You are a technical documentation assistant for Vanderlande warehouse automation.
Rules:
1. Answer ONLY using the provided context documents
2. Every factual claim must include a citation: [Doc: <source>, Section: <heading>]
3. If the context doesn't contain enough information, say so explicitly
4. Never extrapolate beyond the provided documents
5. For numerical specifications, quote the exact value from the document
```
Grounded generation is critical for enterprise RAG. An engineer acting on a hallucinated torque specification for a conveyor belt motor could cause equipment failure. The citation requirement lets engineers verify every claim against the source document.
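For the model to satisfy the citation rule, each retrieved chunk has to carry its source and section into the context window. One way this formatting might look — the field names are illustrative, not the actual schema:

```python
def build_context(chunks):
    """Format retrieved parent chunks so each carries the metadata the
    model needs to emit [Doc: <source>, Section: <heading>] citations."""
    blocks = []
    for c in chunks:
        blocks.append(f"[Doc: {c['source']}, Section: {c['heading']}]\n{c['text']}")
    # Separator makes chunk boundaries unambiguous to the model.
    return "\n\n---\n\n".join(blocks)
```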
Evaluation Framework
We built a continuous evaluation pipeline using a human-labeled benchmark of 500 question-answer pairs:
| Metric | Definition | Target | Actual |
|---|---|---|---|
| Answer accuracy | Human-judged correctness | > 90% | 94% |
| Citation accuracy | Cited sources contain the claimed fact | > 95% | 97% |
| Retrieval recall@5 | Relevant document in top 5 results | > 85% | 91% |
| Hallucination rate | Claims not supported by retrieved context | < 5% | 2.1% |
| Latency (P95) | End-to-end response time | < 10s | 6.2s |
This evaluation runs nightly against the latest index. Any regression triggers an alert and blocks index promotion to production.
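A retrieval metric like recall@5 reduces to a few lines once the benchmark exists. A sketch, assuming the benchmark is a list of `(question, relevant_doc_id)` pairs and `retrieve` returns a ranked list of document IDs (both assumptions for illustration):

```python
def recall_at_k(benchmark, retrieve, k=5):
    """Fraction of benchmark questions whose labeled relevant document
    appears in the top-k retrieved results."""
    hits = sum(
        1 for question, relevant_id in benchmark
        if relevant_id in retrieve(question)[:k]
    )
    return hits / len(benchmark)
```

The harder metrics (answer accuracy, hallucination rate) still need human judgment or an LLM judge, but automating the retrieval metrics alone is enough to catch most index regressions before promotion.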
What I Learned
Chunking matters more than the LLM. We spent weeks testing GPT-4 vs GPT-4o vs Claude — accuracy varied by roughly 3 points. Switching from fixed-size to hierarchical chunking improved accuracy by 23 percentage points (71% to 94%).
Governance is a feature, not a constraint. Unity Catalog's row-level security wasn't just a compliance checkbox — it enabled us to serve a single platform to all 3,000+ engineers instead of building separate systems per security tier.
Evaluation is the product. Without the 500-QA benchmark and nightly evaluation pipeline, we couldn't have iterated on chunking strategies with confidence. The evaluation pipeline took 3 weeks to build and saved us months of blind experimentation.