Deep Dive: The Architecture of ZeroEntropy v1

May 14, 2025
TL;DR

ZeroEntropy is a full-stack, hybrid retrieval platform that combines sparse (BM25), dense (vector embeddings), and LLMs in the loop to deliver enterprise-grade search over unstructured documents.

At ZeroEntropy we’ve re-imagined every layer of the retrieval stack, from PDF parsing to query execution, to deliver end-to-end document intelligence on par with an entire team of expert search engineers, all behind one simple API.

In the sections below, we’ll dive into each layer of our architecture, from ingestion and indexing to query execution and security, demonstrating how we achieve sub-second latency, 90%+ recall, and enterprise-grade compliance (SOC 2, HIPAA).

Query Architecture Diagram
Ingestion Architecture Diagram

1. High-Level Overview

At its core ZeroEntropy is a hybrid search system combining:

  • Sparse retrieval (BM25) for lightning-fast keyword matching
  • Dense embedding retrieval for semantic relevance
  • LLM-in-the-loop for query understanding, keyword generation, and final result re-ranking

By combining all three, we avoid the “either/or” trade-offs of vanilla search systems.

2. Document Ingestion & Chunking

Render → OCR → VLM diagram tags

  • Why? Many PDFs, DOCXs, and PPTs hide text inside images, so we convert each page to a JPEG and OCR it.
  • Why keep the JPEG? At query time, you can request the original page image alongside your top hits, which is useful if you want to feed the image into a VLM.
  • Why VLM? Diagrams, flowcharts, tables and formulas carry meaning a simple OCR method would miss.

Hierarchical chunking

We detect language, pick the right tokenizer & stemmer, then split into words → sentences → paragraphs. By keeping contextual spans, we try to create meaningful chunks. We currently support two chunk sizes: coarse (around 2,000 chars) and fine (around 200 chars).
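The two-level split above can be sketched in a few lines. This is a minimal illustration only: the sentence splitter here is a naive regex and the chunker greedily packs whole sentences up to a character budget, whereas the real pipeline uses language-specific tokenizers and stemmers. The 2,000- and 200-character targets come from the text; the helper names are ours.

```python
import re

def split_sentences(text):
    # Naive sentence splitter; the production pipeline picks a
    # language-specific tokenizer instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk(sentences, max_chars):
    # Greedily pack whole sentences into chunks up to max_chars,
    # so contextual spans are never cut mid-sentence.
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks

sentences = split_sentences("First sentence. Second one! A third? And a fourth.")
coarse = chunk(sentences, 2000)  # everything fits in one coarse chunk here
fine = chunk(sentences, 20)      # a smaller budget forces finer chunks
```

Because both granularities are built from the same sentence list, a fine chunk always sits inside exactly one coarse chunk, which is what makes the hierarchy useful at query time.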

3. Indexing: Sparse & Dense

| Index Type | What we index | Why it matters |
| --- | --- | --- |
| ParadeDB BM25 | Paragraph & document tokens + LLM-generated keywords | Fuzzy/typo-tolerant keyword recall; lightning-fast wildcard/fuzzy matching via a BK-tree |
| Turbopuffer | Embeddings of every node (sentences → document) | Sub-second semantic search at scale |

4. Query Execution Walkthrough

LLM Rewriters

  • Query Rewriter: Refines your input into a clearer embedding prompt (e.g. “procedure for submitting Form 10-K to the SEC”).
  • Keyword Generator: Scores key terms (e.g. “10-K” = 0.8, “file” = –0.2) to improve matching.
  • Performance Modes:
      • Fast Mode skips LLM steps for sub-500 ms responses
      • Deep Mode runs the full LLM pipeline in 2–3 seconds
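To make the keyword generator's signed scores concrete, here is one way they could bias candidate ranking. The weighting scheme below, multiplying each term's BM25 contribution by its signed weight, is an illustrative assumption on our part, not the documented formula, and `apply_keyword_weights` is a hypothetical helper.

```python
def apply_keyword_weights(doc_term_scores, weights, default=1.0):
    # doc_term_scores: {doc_id: {term: bm25_contribution}}
    # weights: signed weights from the keyword generator,
    #          e.g. {"10-K": 0.8, "file": -0.2}
    ranked = {
        doc_id: sum(score * weights.get(term, default) for term, score in terms.items())
        for doc_id, terms in doc_term_scores.items()
    }
    return sorted(ranked, key=ranked.get, reverse=True)

candidates = {
    "doc-a": {"10-K": 2.0, "file": 1.5},  # strong on the decisive term
    "doc-b": {"file": 3.0},               # matches only the down-weighted term
}
weights = {"10-K": 0.8, "file": -0.2}
print(apply_keyword_weights(candidates, weights))  # → ['doc-a', 'doc-b']
```

The negative weight on "file" is what keeps a document stuffed with generic terms from outranking one that matches the decisive term.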

Tokenization & Typo Correction

We split the raw query into tokens and run them through a BK-tree typo corrector so even “10-Kk” maps back to “10-K.”
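A BK-tree makes this kind of fuzzy lookup cheap: it organizes words by edit distance so that the triangle inequality prunes most of the tree on each query. A self-contained sketch (the vocabulary and helper names are ours, not the production index):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, max_dist):
        # Triangle inequality: only children whose edge distance lies in
        # [d - max_dist, d + max_dist] can contain matches.
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(word, w)
            if d <= max_dist:
                results.append((d, w))
            stack.extend(c for dist, c in children.items()
                         if d - max_dist <= dist <= d + max_dist)
        return sorted(results)

tree = BKTree(["10-K", "10-Q", "form", "file", "filing"])
print(tree.search("10-Kk", 1))  # → [(1, '10-K')]
```

With a max distance of 1, the typo "10-Kk" maps back to "10-K" while unrelated vocabulary is never even compared.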

Sparse + Dense Fan-Out

Dense Recall

We query Turbopuffer embeddings to fetch the top-N semantically related chunks.

Sparse Recall

We use ParadeDB’s BM25 index to retrieve the top-N’ keyword matches.
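The two recall legs are independent, so they can run concurrently and total latency is the slower of the two rather than their sum. A sketch with `asyncio`, where `dense_search` and `sparse_search` are placeholder stubs standing in for the Turbopuffer and ParadeDB calls (they are not real client APIs):

```python
import asyncio

async def dense_search(query, n):
    # Placeholder for the Turbopuffer embedding query.
    return [f"dense-{i}" for i in range(n)]

async def sparse_search(query, n):
    # Placeholder for the ParadeDB BM25 query.
    return [f"sparse-{i}" for i in range(n)]

async def fan_out(query, n=50, n_prime=50):
    # Issue both queries concurrently; latency is max(dense, sparse), not the sum.
    return await asyncio.gather(dense_search(query, n), sparse_search(query, n_prime))

dense_hits, sparse_hits = asyncio.run(fan_out("Form 10-K filing procedure", 3, 3))
```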

Reciprocal Rank Fusion

We merge the sparse and dense rankings using reciprocal rank fusion to select the final top K results, combining complementary signals for up to a 10–15% boost in overall accuracy.
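Reciprocal rank fusion needs only the two rank orderings, no score calibration between BM25 and cosine similarity. Each document scores the sum of 1 / (k + rank) across the lists it appears in; the constant k (60 in the common formulation, an assumption here) damps the influence of any single top rank. A minimal sketch:

```python
def rrf(rankings, k=60, top_k=10):
    # rankings: a list of ranked doc-id lists, e.g. [dense_ids, sparse_ids].
    # A document's fused score is the sum of 1 / (k + rank) over every
    # ranking it appears in, with rank starting at 1.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

dense = ["d3", "d1", "d7"]
sparse = ["d1", "d9", "d3"]
print(rrf([dense, sparse], top_k=3))  # → ['d1', 'd3', 'd9']
```

Documents that both legs agree on ("d1", "d3") rise to the top, which is exactly the complementary-signal effect the fusion step is after.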

5. Security & Deployment

  • SOC 2 Type II compliant.
  • HIPAA compliant, following industry best practices.
  • End-to-end encryption for data in transit and at rest.
  • On-prem deployment available for enterprise users as easy-to-use Docker images.

Wrapping Up

ZeroEntropy isn’t “another vector search.” It’s a full-stack retrieval platform that:

  • Knows how to parse your most fiendish PDFs.
  • Indexes meaningful snippets at every granularity.
  • Blends classical IR, embeddings, and LLM intelligence under the hood.
  • Scales from a single document to billions of nodes without compromising accuracy or speed.

Ready to see it in action?

Explore the docs →

Book a demo →

A note on page rendering: JPEG lets us render at 3–4× the resolution of PNG at the same file size, so we can serve higher-quality page images within a given latency, storage, or bandwidth budget.
