Vector Databases for RAG: Qdrant, Weaviate, and Milvus Compared
Large language models are powerful but they hallucinate, go stale, and have no knowledge of your private data. Retrieval-augmented generation (RAG) fixes this by fetching relevant context from your own documents before sending a query to the LLM. The database that stores and searches those document embeddings -- the vector database -- is a critical piece of the stack.
This guide covers what vector databases actually do, how embedding and indexing work under the hood, and compares the four major options: Qdrant, Weaviate, Milvus, and Chroma. We'll set each one up with Docker and walk through real indexing and querying code.

What Is a Vector Database?
A vector database stores high-dimensional vectors (arrays of floating-point numbers) and enables fast approximate nearest-neighbor (ANN) search over them. Traditional databases search by exact matches on structured fields. Vector databases search by similarity -- "find me the 10 vectors closest to this one."
Each vector is typically an embedding: a numerical representation of text, an image, audio, or code produced by a machine learning model. When you embed a sentence like "how to deploy a Kubernetes pod" and search for similar vectors, the database returns documents about Kubernetes deployment, container orchestration, and pod configuration -- even if they don't share the exact same words.
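"Closest" here usually means highest cosine similarity. A toy illustration (the four-dimensional vectors below are made up for readability; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|) -- 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up 4-dim "embeddings" for three documents
doc_pod    = [0.9, 0.1, 0.0, 0.2]   # "Kubernetes pod scheduling"
doc_deploy = [0.8, 0.2, 0.1, 0.3]   # "Kubernetes deployments"
doc_recipe = [0.0, 0.9, 0.8, 0.1]   # "chocolate cake recipe"

query = [0.85, 0.15, 0.05, 0.25]    # "how to deploy a Kubernetes pod"
scores = sorted(
    [("pod", cosine_similarity(query, doc_pod)),
     ("deploy", cosine_similarity(query, doc_deploy)),
     ("recipe", cosine_similarity(query, doc_recipe))],
    key=lambda s: s[1], reverse=True
)
print(scores)  # both Kubernetes docs rank far above the recipe
```

A vector database does exactly this ranking, but with an approximate index so it never has to compare the query against every stored vector.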
Why Not Just Use PostgreSQL with pgvector?
You can. pgvector adds vector similarity search to PostgreSQL and it's a legitimate option for small-to-medium datasets (up to a few million vectors). The trade-offs:
- pgvector strengths: Familiar SQL interface, ACID transactions, joins with relational data, no new infrastructure
- pgvector weaknesses: Slower at scale, limited indexing strategies, no built-in sharding, fewer ANN tuning knobs
If your dataset is under 1M vectors and you already run PostgreSQL, pgvector is worth considering. Beyond that, or if search latency and recall quality are critical, a dedicated vector database pulls ahead.
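For a sense of what the pgvector route looks like, here is a sketch in plain SQL (table and column names are illustrative; the HNSW index type requires pgvector 0.5 or later):

```sql
-- Enable the extension and store embeddings alongside relational data
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      text,
    source    text,
    embedding vector(1536)   -- must match your embedding model's dimensions
);

-- HNSW index using cosine distance (pgvector also supports ivfflat)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Top-10 nearest neighbors; <=> is the cosine distance operator
SELECT id, source
FROM documents
ORDER BY embedding <=> '[0.012, -0.045, 0.078]'
LIMIT 10;
```

The `<=>` operator and a single `ORDER BY ... LIMIT` is the entire query surface, which is both pgvector's appeal and its limitation.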
Embedding Fundamentals
Before diving into databases, you need to understand embeddings since they're what you'll be storing.
How Embeddings Work
An embedding model converts input (text, images, code) into a fixed-length vector of floating-point numbers. The model is trained so that semantically similar inputs produce vectors that are close together in the high-dimensional space.
from openai import OpenAI

client = OpenAI()

# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Kubernetes pod scheduling and resource limits"
)
vector = response.data[0].embedding

print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 values: {vector[:5]}")
# [0.0123, -0.0456, 0.0789, -0.0234, 0.0567]
Choosing an Embedding Model
| Model | Dimensions | Context Window | Cost | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | $0.13/1M tokens | Better |
| Cohere embed-english-v3.0 | 1024 | 512 tokens | $0.10/1M tokens | Good |
| BGE-large-en-v1.5 (open source) | 1024 | 512 tokens | Free (self-hosted) | Good |
| nomic-embed-text (open source) | 768 | 8,192 tokens | Free (self-hosted) | Good |
For most RAG applications, text-embedding-3-small hits the right balance of cost, quality, and speed. If you want to avoid API dependencies, nomic-embed-text runs well locally via Ollama.
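Embedding cost is easy to estimate up front from the per-million-token prices in the table. A back-of-the-envelope sketch (the corpus numbers are illustrative):

```python
def embedding_cost_usd(total_tokens, price_per_million_tokens):
    """Cost of embedding a corpus at a given per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# Illustrative corpus: 10,000 docs, ~4 chunks each, ~500 tokens per chunk
total_tokens = 10_000 * 4 * 500   # 20M tokens

small = embedding_cost_usd(total_tokens, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(total_tokens, 0.13)  # text-embedding-3-large
print(f"small: ${small:.2f}, large: ${large:.2f}")  # small: $0.40, large: $2.60
```

Even at this scale the API cost is trivial; the real cost difference shows up in vector storage and query memory, since 3072-dim vectors take twice the space of 1536-dim ones.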
Chunking Strategy
Raw documents are too long for embedding models. You need to split them into chunks first. Chunk size affects both retrieval quality and cost:
- Too small (100 tokens): Loses context, retrieves fragments that don't make sense alone
- Too large (2000 tokens): Dilutes the semantic signal, retrieves loosely relevant blocks
- Sweet spot (300-800 tokens): Retains enough context for useful retrieval
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
Overlap between chunks (50-100 tokens) prevents information from being lost at chunk boundaries. The RecursiveCharacterTextSplitter tries to split on paragraph breaks first, then sentences, then words -- preserving natural boundaries.
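The overlap mechanics are easiest to see in a stripped-down sliding-window chunker (an illustration of the concept, not what RecursiveCharacterTextSplitter does internally -- it splits on separators):

```python
def chunk_tokens(tokens, size, overlap):
    """Each chunk starts (size - overlap) tokens after the previous one,
    so content near a boundary appears in two consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(20)]
chunks = chunk_tokens(tokens, size=8, overlap=2)
print([c[0] + ".." + c[-1] for c in chunks])  # ['t0..t7', 't6..t13', 't12..t19']
```

One caveat with the LangChain splitter above: its `chunk_size` counts characters by default, not tokens. If you want token-based sizing, use `RecursiveCharacterTextSplitter.from_tiktoken_encoder()` instead.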
The Four Major Vector Databases
Qdrant
Qdrant is written in Rust, which gives it excellent single-node performance and memory efficiency. It's the most developer-friendly option with a clean API and good documentation.
Architecture: Single binary, gRPC and REST APIs, optional distributed mode. Stores vectors on disk with an in-memory HNSW index. Supports payload (metadata) storage and filtering natively.
Setup with Docker:
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_data:/qdrant/storage \
    qdrant/qdrant
Docker Compose:
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__GRPC_PORT: 6334

volumes:
  qdrant_data:
Creating a collection and inserting vectors:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,  # Must match your embedding model dimensions
        distance=Distance.COSINE
    )
)

# Insert vectors with metadata (payload)
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_vector,  # Your 1536-dim vector
            payload={
                "text": "Kubernetes pods are the smallest deployable units...",
                "source": "k8s-docs.md",
                "section": "pods",
                "date": "2026-01-15"
            }
        ),
        PointStruct(
            id=2,
            vector=another_vector,
            payload={
                "text": "A Deployment manages a set of replica Pods...",
                "source": "k8s-docs.md",
                "section": "deployments",
                "date": "2026-01-15"
            }
        )
    ]
)
Querying with filters:
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.query_points(
    collection_name="documents",
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="source",
                match=MatchValue(value="k8s-docs.md")
            )
        ]
    ),
    limit=5
)

for point in results.points:
    print(f"Score: {point.score:.4f}")
    print(f"Text: {point.payload['text'][:100]}...")
Strengths: Fast single-node performance, clean API, good filtering, low memory footprint, excellent Rust-based reliability, built-in snapshot and backup support.
Weaknesses: Distributed mode is newer and less battle-tested than Milvus. Fewer integrations than Weaviate. Community is growing but smaller.
Weaviate
Weaviate differentiates itself with built-in vectorization modules -- you can send it raw text and it handles the embedding internally. It also supports hybrid search (combining vector similarity with keyword BM25 search).
Architecture: Written in Go. Supports modules for vectorization (OpenAI, Cohere, Hugging Face, etc.), generative AI, and reranking. Schema-based with classes and properties.
Setup with Docker:
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:latest
    ports:
      - "8080:8080"     # REST API
      - "50051:50051"   # gRPC
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"

volumes:
  weaviate_data:
If you want Weaviate to handle embeddings for you, add a vectorizer module:
environment:
  DEFAULT_VECTORIZER_MODULE: "text2vec-openai"
  OPENAI_APIKEY: "${OPENAI_API_KEY}"
  ENABLE_MODULES: "text2vec-openai,generative-openai"
Creating a collection and inserting data:
import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Create a collection (class)
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.none(),  # We'll provide vectors ourselves
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="section", data_type=DataType.TEXT),
    ]
)

# Insert with pre-computed vectors
collection = client.collections.get("Document")
collection.data.insert(
    properties={
        "text": "Kubernetes pods are the smallest deployable units...",
        "source": "k8s-docs.md",
        "section": "pods"
    },
    vector=embedding_vector
)
Hybrid search (vector + keyword):
collection = client.collections.get("Document")
# Hybrid search combines BM25 keyword matching with vector similarity
response = collection.query.hybrid(
query="kubernetes pod resource limits",
vector=query_vector,
alpha=0.5, # 0 = pure keyword, 1 = pure vector
limit=5,
filters=weaviate.classes.query.Filter.by_property("source").equal("k8s-docs.md")
)
for obj in response.objects:
print(f"Text: {obj.properties['text'][:100]}...")
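What `alpha` does is easiest to see in a simplified fusion function. This is a conceptual sketch only -- Weaviate's actual ranked/relative-score fusion algorithms differ in detail -- but the weighting idea is the same:

```python
def fuse_scores(bm25, vector, alpha):
    """Min-max normalize each score set, then blend:
    alpha = 0 -> pure keyword, alpha = 1 -> pure vector."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}
    b, v = normalize(bm25), normalize(vector)
    return {d: (1 - alpha) * b.get(d, 0.0) + alpha * v.get(d, 0.0)
            for d in set(b) | set(v)}

bm25_scores   = {"doc1": 12.0, "doc2": 4.0, "doc3": 8.0}    # keyword (BM25) scores
vector_scores = {"doc1": 0.70, "doc2": 0.95, "doc4": 0.60}  # cosine similarities
fused = fuse_scores(bm25_scores, vector_scores, alpha=0.5)
best = max(fused, key=fused.get)
print(best, round(fused[best], 3))  # doc1 0.643
```

Note that doc1 wins at `alpha=0.5` because it scores well on both signals, even though doc2 has the best vector score -- this is exactly the behavior that makes hybrid search catch results pure vector search misses.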
Strengths: Hybrid search is genuinely useful (catches things pure vector search misses). Built-in vectorization modules reduce pipeline complexity. Good multi-tenancy support. Active community.
Weaknesses: Resource-hungry (the Go core plus separate module and inference containers). Schema-based design is more rigid than Qdrant's schemaless payloads. Module system adds deployment complexity. Query language has a learning curve.
Milvus
Milvus is the most mature option for large-scale deployments. It was built from the start for distributed, high-throughput vector search and handles billions of vectors in production at companies like eBay and Shopee.
Architecture: Cloud-native, microservice-based. Components include proxy, query nodes, data nodes, index nodes, and etcd for coordination. Written in Go and C++. Uses MinIO or S3 for storage.
Standalone setup with Docker (good for development):
services:
etcd:
image: quay.io/coreos/etcd:v3.5.18
environment:
ETCD_AUTO_COMPACTION_MODE: revision
ETCD_AUTO_COMPACTION_RETENTION: "1000"
ETCD_QUOTA_BACKEND_BYTES: "4294967296"
volumes:
- etcd_data:/etcd
minio:
image: minio/minio:latest
environment:
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
command: minio server /minio_data
volumes:
- minio_data:/minio_data
milvus:
image: milvusdb/milvus:latest
command: ["milvus", "run", "standalone"]
environment:
ETCD_ENDPOINTS: etcd:2379
MINIO_ADDRESS: minio:9000
ports:
- "19530:19530" # gRPC
- "9091:9091" # Metrics
volumes:
- milvus_data:/var/lib/milvus
depends_on:
- etcd
- minio
volumes:
etcd_data:
minio_data:
milvus_data:
Creating a collection and inserting vectors:
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect
connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields, description="Document embeddings")

# Create collection
collection = Collection("documents", schema)

# Insert data (column-oriented: one list per non-primary field)
collection.insert([
    ["Kubernetes pods are the smallest deployable units...",
     "A Deployment manages a set of replica Pods..."],  # text
    ["k8s-docs.md", "k8s-docs.md"],                     # source
    [embedding_vector_1, embedding_vector_2]            # embedding
])

# Build an index (required before searching)
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)
collection.load()
Querying:
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    expr='source == "k8s-docs.md"',
    output_fields=["text", "source"]
)

for hits in results:
    for hit in hits:
        print(f"Score: {hit.score:.4f}")
        print(f"Text: {hit.entity.get('text')[:100]}...")
Strengths: Proven at massive scale (billions of vectors). Most indexing algorithm options (HNSW, IVF_FLAT, IVF_SQ8, IVF_PQ, DiskANN). Strong distributed architecture. GPU-accelerated indexing.
Weaknesses: Heavy infrastructure requirements (etcd + MinIO + Milvus). Steep learning curve. The standalone mode works for development but production deployments are complex. Verbose API.
Chroma
Chroma is the lightweight option, designed for quick prototyping and small-scale applications. It's popular in tutorials and getting-started guides because it requires almost no setup.
Architecture: Python-native, runs in-process or as a lightweight server. Uses SQLite for metadata and HNSW for vector indexing. No external dependencies.
Setup (in-process, no Docker needed):
import chromadb
# Ephemeral (in-memory)
client = chromadb.Client()
# Persistent (saved to disk)
client = chromadb.PersistentClient(path="./chroma_data")
Server mode with Docker:
services:
  chroma:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma

volumes:
  chroma_data:
Creating a collection and inserting data:
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")

collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Chroma can auto-embed text if you configure an embedding function,
# or you can provide vectors directly
collection.add(
    ids=["doc1", "doc2"],
    embeddings=[embedding_vector_1, embedding_vector_2],
    documents=[
        "Kubernetes pods are the smallest deployable units...",
        "A Deployment manages a set of replica Pods..."
    ],
    metadatas=[
        {"source": "k8s-docs.md", "section": "pods"},
        {"source": "k8s-docs.md", "section": "deployments"}
    ]
)
Querying:
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    where={"source": "k8s-docs.md"}
)

for doc, distance in zip(results["documents"][0], results["distances"][0]):
    print(f"Distance: {distance:.4f}")
    print(f"Text: {doc[:100]}...")
Strengths: Easiest setup by far. No infrastructure needed. Great for prototyping, demos, and small datasets. Good Python developer experience. Built-in embedding function support.
Weaknesses: Not designed for production scale (struggles past ~1M vectors). No distributed mode. Limited query capabilities. No hybrid search. Single-threaded performance ceiling.
Comparison Table
| Feature | Qdrant | Weaviate | Milvus | Chroma |
|---|---|---|---|---|
| Language | Rust | Go | Go/C++ | Python |
| Max Scale | ~100M vectors | ~100M vectors | Billions | ~1M vectors |
| Index Types | HNSW | HNSW, flat | HNSW, IVF, DiskANN | HNSW |
| Hybrid Search | Sparse vectors | BM25 + vector | Sparse vectors | No |
| Filtering | Excellent | Good | Good | Basic |
| Setup Complexity | Low | Medium | High | Very low |
| Memory Efficiency | Excellent | Moderate | Good | Moderate |
| Multi-tenancy | Per-collection | Native | Per-collection | Per-collection |
| Cloud Managed | Yes | Yes | Yes (Zilliz) | Yes |
| License | Apache 2.0 | BSD-3-Clause | Apache 2.0 | Apache 2.0 |
Indexing Strategies
The index algorithm determines how the database organizes vectors for fast search. Understanding the options helps you tune for your workload.
HNSW (Hierarchical Navigable Small World)
HNSW is the default choice for most vector databases. It builds a multi-layer graph where each layer is a "small world" network. Searching starts at the top layer (sparse) and drills down to the bottom layer (dense).
Key parameters:
- M: Number of connections per node (default 16). Higher M = better recall but more memory and slower inserts.
- efConstruction: Search depth during index building (default 128-256). Higher = better index quality but slower builds.
- ef: Search depth during queries (default 64-128). Higher = better recall but slower queries.
# Qdrant HNSW configuration
from qdrant_client.models import HnswConfigDiff

client.update_collection(
    collection_name="documents",
    hnsw_config=HnswConfigDiff(
        m=16,
        ef_construct=256,
        full_scan_threshold=10000  # Use brute-force below this count
    )
)
When to use: Almost always. HNSW provides the best recall-vs-speed trade-off for most workloads and dataset sizes.
IVF (Inverted File Index)
IVF partitions the vector space into clusters using k-means, then searches only the most relevant clusters at query time.
Key parameters:
- nlist: Number of clusters (typically sqrt(N) where N is total vectors)
- nprobe: Number of clusters to search at query time (higher = better recall, slower)
# Milvus IVF configuration
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index("embedding", index_params)

# At query time
search_params = {"metric_type": "COSINE", "params": {"nprobe": 32}}
When to use: When memory is constrained and you have millions of vectors. IVF uses less memory than HNSW but requires more tuning. IVF_PQ (product quantization) reduces memory further at the cost of recall.
DiskANN
DiskANN is a Microsoft Research algorithm that stores the graph index on SSD, keeping only a small fraction in memory. Milvus supports this natively.
When to use: When your dataset exceeds available RAM. DiskANN can handle billions of vectors on a single node with NVMe storage.
Choosing an Index Strategy
- < 100K vectors: Flat (brute-force) search is fine. No index needed.
- 100K - 10M vectors: HNSW with default parameters. Adjust ef at query time.
- 10M - 100M vectors: HNSW with tuned M and efConstruction, or IVF_FLAT.
- 100M+ vectors: IVF_PQ for memory efficiency, or DiskANN if on SSD. Consider Milvus distributed.
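These guidelines collapse into a rule-of-thumb helper (illustrative only -- real choices also depend on recall targets, latency budgets, and hardware):

```python
def suggest_index(num_vectors, fits_in_ram=True):
    """Rule-of-thumb index choice from the size guidelines above."""
    if num_vectors < 100_000:
        return "FLAT"        # brute force is fast enough, perfect recall
    if not fits_in_ram:
        return "DISKANN"     # graph index on NVMe SSD
    if num_vectors < 100_000_000:
        return "HNSW"        # tune M / efConstruction past ~10M
    return "IVF_PQ"          # quantize to fit 100M+ in memory

print(suggest_index(50_000))                             # FLAT
print(suggest_index(5_000_000))                          # HNSW
print(suggest_index(2_000_000_000, fits_in_ram=False))   # DISKANN
```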
Building a RAG Pipeline
Here's a complete RAG pipeline using Qdrant and OpenAI, from document ingestion to answer generation.
Step 1: Ingest Documents
import uuid
from pathlib import Path

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from langchain.text_splitter import RecursiveCharacterTextSplitter

openai_client = OpenAI()
qdrant_client = QdrantClient(host="localhost", port=6333)

# Create collection
qdrant_client.recreate_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

def ingest_file(file_path: str):
    text = Path(file_path).read_text()
    chunks = splitter.split_text(text)

    # Batch embed
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )

    points = []
    for i, (chunk, embedding_data) in enumerate(zip(chunks, response.data)):
        points.append(PointStruct(
            # Deterministic ID: Python's hash() is salted per process, so it
            # would generate different IDs on every run and break re-ingestion
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_path}:{i}")),
            vector=embedding_data.embedding,
            payload={
                "text": chunk,
                "source": file_path,
                "chunk_index": i
            }
        ))

    qdrant_client.upsert(
        collection_name="knowledge_base",
        points=points
    )
    print(f"Ingested {len(points)} chunks from {file_path}")

# Ingest all markdown files in a directory
for md_file in Path("./docs").glob("**/*.md"):
    ingest_file(str(md_file))
Step 2: Query and Generate
def ask(question: str, top_k: int = 5) -> str:
    # Embed the question
    query_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_vector = query_response.data[0].embedding

    # Search for relevant chunks
    results = qdrant_client.query_points(
        collection_name="knowledge_base",
        query=query_vector,
        limit=top_k
    )

    # Build context from retrieved chunks
    context_parts = []
    sources = set()
    for point in results.points:
        context_parts.append(point.payload["text"])
        sources.add(point.payload["source"])
    context = "\n\n---\n\n".join(context_parts)

    # Generate answer with context
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question "
                    "based on the provided context. If the context doesn't contain "
                    "enough information, say so. Cite sources when possible."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0.1
    )
    answer = response.choices[0].message.content

    source_list = "\n".join(f"- {s}" for s in sources)
    return f"{answer}\n\nSources:\n{source_list}"

# Use it
print(ask("How do I configure resource limits for a Kubernetes pod?"))
Step 3: Evaluate Retrieval Quality
RAG quality depends heavily on retrieval quality. Test it:
def evaluate_retrieval(question: str, expected_source: str, top_k: int = 5) -> dict:
    """Check if the expected source appears in the top-K results."""
    query_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    results = qdrant_client.query_points(
        collection_name="knowledge_base",
        query=query_response.data[0].embedding,
        limit=top_k
    )
    retrieved_sources = [p.payload["source"] for p in results.points]
    hit = expected_source in retrieved_sources
    return {
        "question": question,
        "expected": expected_source,
        "hit": hit,
        "top_score": results.points[0].score if results.points else 0,
        "retrieved_sources": retrieved_sources
    }

# Build an evaluation set
eval_set = [
    ("How do I set CPU limits?", "docs/kubernetes/resources.md"),
    ("What is a PersistentVolumeClaim?", "docs/kubernetes/storage.md"),
    ("How do I configure ingress routing?", "docs/kubernetes/networking.md"),
]

results = [evaluate_retrieval(q, s) for q, s in eval_set]
hit_rate = sum(1 for r in results if r["hit"]) / len(results)
print(f"Hit rate @ 5: {hit_rate:.1%}")
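Hit rate ignores where in the top-K the right chunk landed. Mean reciprocal rank (MRR) is a common complement that rewards ranking the right source first; a minimal sketch:

```python
def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant source per query (0 if absent).
    Each item is (expected_source, ordered list of retrieved sources)."""
    total = 0.0
    for expected, retrieved in results:
        for rank, source in enumerate(retrieved, start=1):
            if source == expected:
                total += 1.0 / rank
                break
    return total / len(results)

ranked = [
    ("a.md", ["a.md", "b.md", "c.md"]),   # rank 1 -> 1.0
    ("b.md", ["c.md", "b.md", "a.md"]),   # rank 2 -> 0.5
    ("d.md", ["a.md", "b.md", "c.md"]),   # missing -> 0.0
]
print(f"MRR: {mean_reciprocal_rank(ranked):.3f}")  # MRR: 0.500
```

In a real pipeline you would feed it the `retrieved_sources` lists collected by `evaluate_retrieval` above. An MRR well below your hit rate means the right chunks are being found but ranked low, which argues for reranking or chunking changes rather than a different embedding model.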
Performance Benchmarks
Real-world performance depends on hardware, dataset size, vector dimensions, and index configuration. These benchmarks give you a rough sense of relative performance on a single node with 1M 1536-dimensional vectors.
| Metric | Qdrant | Weaviate | Milvus | Chroma |
|---|---|---|---|---|
| Insert throughput | ~8K vec/s | ~5K vec/s | ~10K vec/s | ~3K vec/s |
| Query latency (p50) | ~2ms | ~5ms | ~3ms | ~8ms |
| Query latency (p99) | ~8ms | ~15ms | ~10ms | ~25ms |
| Memory (1M vectors) | ~3 GB | ~5 GB | ~4 GB | ~4 GB |
| Recall @ 10 | 0.98 | 0.97 | 0.98 | 0.95 |
Notes: Benchmarks are approximate and depend heavily on tuning. Qdrant and Milvus tend to lead on raw performance. Weaviate's overhead comes from Go runtime and module system. Chroma's numbers degrade faster at scale.
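Rather than trusting published numbers, measure on your own hardware and data. A minimal latency-harness sketch (the lambda is a stand-in workload; swap in a real call such as your client's query method):

```python
import time
import statistics

def measure_latencies(run_query, n=200):
    """Time n query calls and report p50/p99/max in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(0.99 * (len(latencies) - 1))],
        "max": latencies[-1],
    }

# Stand-in workload; replace with e.g. a qdrant_client.query_points(...) call
stats = measure_latencies(lambda: sum(range(10_000)))
print({k: round(v, 3) for k, v in stats.items()})
```

Run it against a warmed-up server with realistic query vectors and filters; p99 under concurrent load, not p50 in isolation, is usually what determines whether a database fits your latency budget.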
Integration with LLM Frameworks
All four databases integrate with the major LLM orchestration frameworks.
LangChain
# Qdrant + LangChain
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="langchain_docs"
)

# Use as retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("How do I deploy to Kubernetes?")
LlamaIndex
# Qdrant + LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="llama_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("How do I configure pod resource limits?")
print(response)
Production Considerations
Backup and Disaster Recovery
- Qdrant: Built-in snapshots via REST API (POST /collections/{name}/snapshots)
- Weaviate: Backup modules for S3, GCS, filesystem
- Milvus: Backup tool (milvus-backup) supports S3 and local storage
- Chroma: Copy the persistent directory (SQLite + HNSW files)
Monitoring
All four expose metrics for monitoring:
# Qdrant metrics (Prometheus format)
curl http://localhost:6333/metrics
# Milvus metrics
curl http://localhost:9091/metrics
# Weaviate metrics
curl http://localhost:2112/metrics
Key metrics to watch: query latency percentiles, index build time, memory usage, collection size, and error rates.
Security
In production, enable authentication:
# Qdrant - API key auth
docker run -p 6333:6333 \
    -e QDRANT__SERVICE__API_KEY=your-secret-key \
    qdrant/qdrant

# Client-side
client = QdrantClient(host="localhost", port=6333, api_key="your-secret-key")

# Weaviate - API key auth (environment variables)
AUTHENTICATION_APIKEY_ENABLED=true
AUTHENTICATION_APIKEY_ALLOWED_KEYS=your-secret-key
AUTHENTICATION_APIKEY_USERS=admin@example.com
When to Use Each
Choose Qdrant if you want the best developer experience with strong single-node performance. It's the right default choice for most RAG applications with up to tens of millions of vectors. Rust's memory safety and performance give it an edge on reliability and efficiency.
Choose Weaviate if hybrid search (vector + keyword) is important to your use case, or if you want built-in vectorization modules to simplify your pipeline. Its module ecosystem is a genuine advantage if you need reranking, generative modules, or multi-modal search.
Choose Milvus if you're operating at massive scale (hundreds of millions to billions of vectors) and have the infrastructure team to manage a distributed deployment. Milvus is the most battle-tested option for large-scale production. The managed cloud version (Zilliz) reduces operational burden.
Choose Chroma for prototyping, demos, hackathons, and small applications under 500K vectors. Its zero-configuration setup gets you from idea to working RAG pipeline in minutes. Plan to migrate to Qdrant or Milvus when you outgrow it.
Recommendations
- Start with Chroma for prototyping, then migrate to Qdrant or Milvus for production
- Use HNSW as your default index type -- it works well for nearly all workloads
- Chunk at 300-800 tokens with 50-100 token overlap for text documents
- Evaluate retrieval quality separately from generation quality -- bad retrieval is the most common RAG failure mode
- Store the original text in the payload/metadata so you can reconstruct context without a separate document store
- Version your embeddings -- when you change embedding models, you need to re-embed everything
- Monitor recall and latency in production, not just during development