
Vector Databases for RAG: Qdrant, Weaviate, and Milvus Compared

Databases · 2026-02-15 · 13 min read · vector-database · rag · embeddings · qdrant · weaviate · milvus · chroma · llm · ai


Large language models are powerful, but they hallucinate, go stale, and have no knowledge of your private data. Retrieval-augmented generation (RAG) addresses this by fetching relevant context from your own documents before sending a query to the LLM. The database that stores and searches those document embeddings -- the vector database -- is a critical piece of the stack.

This guide covers what vector databases actually do, how embedding and indexing work under the hood, and compares the four major options: Qdrant, Weaviate, Milvus, and Chroma. We'll set each one up with Docker and walk through real indexing and querying code.


What Is a Vector Database?

A vector database stores high-dimensional vectors (arrays of floating-point numbers) and enables fast approximate nearest-neighbor (ANN) search over them. Traditional databases search by exact matches on structured fields. Vector databases search by similarity -- "find me the 10 vectors closest to this one."

Each vector is typically an embedding: a numerical representation of text, an image, audio, or code produced by a machine learning model. When you embed a sentence like "how to deploy a Kubernetes pod" and search for similar vectors, the database returns documents about Kubernetes deployment, container orchestration, and pod configuration -- even if they don't share the exact same words.
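"Closest" here almost always means cosine similarity (or the equivalent dot product on normalized vectors). A toy illustration with made-up 4-dimensional vectors -- real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.8, 0.3, 0.0]
doc_pods = [0.2, 0.7, 0.4, 0.1]     # hypothetical "Kubernetes pods" embedding
doc_cooking = [0.9, 0.0, 0.1, 0.8]  # hypothetical unrelated-topic embedding

print(cosine_similarity(query, doc_pods) > cosine_similarity(query, doc_cooking))  # True
```

The database's job is to answer this "which stored vector is closest?" question quickly without comparing the query against every vector, which is where the ANN indexes discussed later come in.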

Why Not Just Use PostgreSQL with pgvector?

You can. pgvector adds vector similarity search to PostgreSQL and is a legitimate option for small-to-medium datasets (up to a few million vectors). The trade-offs: query latency and index build times lag dedicated engines as datasets grow, index options are limited to HNSW and IVFFlat, and heavy vector workloads compete with your transactional queries for resources on the same instance.

If your dataset is under 1M vectors and you already run PostgreSQL, pgvector is worth considering. Beyond that, or if search latency and recall quality are critical, a dedicated vector database pulls ahead.

Embedding Fundamentals

Before diving into databases, you need to understand embeddings since they're what you'll be storing.

How Embeddings Work

An embedding model converts input (text, images, code) into a fixed-length vector of floating-point numbers. The model is trained so that semantically similar inputs produce vectors that are close together in the high-dimensional space.

from openai import OpenAI

client = OpenAI()

# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Kubernetes pod scheduling and resource limits"
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 values: {vector[:5]}")
# [0.0123, -0.0456, 0.0789, -0.0234, 0.0567]

Choosing an Embedding Model

| Model | Dimensions | Context Window | Cost | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | $0.13/1M tokens | Better |
| Cohere embed-v4 | 1024 | 512 tokens | $0.10/1M tokens | Good |
| BGE-large-en-v1.5 (open source) | 1024 | 512 tokens | Free (self-hosted) | Good |
| nomic-embed-text (open source) | 768 | 8,192 tokens | Free (self-hosted) | Good |

For most RAG applications, text-embedding-3-small hits the right balance of cost, quality, and speed. If you want to avoid API dependencies, nomic-embed-text runs well locally via Ollama.
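Embedding cost is usually a rounding error next to generation cost. A rough estimate, with hypothetical corpus numbers:

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, price_per_million_tokens):
    # Total tokens embedded, priced at the per-million-token rate
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical corpus: 100k chunks of ~400 tokens with text-embedding-3-small
print(f"${embedding_cost_usd(100_000, 400, 0.02):.2f}")  # $0.80
```

Even a corpus ten times that size embeds for a few dollars, so model quality and dimensionality (which drives storage and memory) matter more than API price for most teams.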

Chunking Strategy

Raw documents are too long for embedding models. You need to split them into chunks first. Chunk size affects both retrieval quality and cost: smaller chunks match queries more precisely but carry less surrounding context, while larger chunks preserve context but dilute relevance and consume more of the LLM's context window per retrieved result.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document_text)

Overlap between chunks (50-100 tokens) prevents information from being lost at chunk boundaries. The RecursiveCharacterTextSplitter tries to split on paragraph breaks first, then sentences, then words -- preserving natural boundaries.
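A stripped-down fixed-size chunker shows the mechanics of overlap (hypothetical sizes; RecursiveCharacterTextSplitter adds the boundary-aware splitting on top):

```python
def chunk_with_overlap(text, chunk_size=20, overlap=5):
    # Step forward by (chunk_size - overlap) so each chunk repeats
    # the tail of the previous one
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Pods request CPU and memory via resource limits."
chunks = chunk_with_overlap(text)
# The characters at each boundary appear in two consecutive chunks,
# so a query matching a boundary phrase can retrieve either chunk.
```

Without the overlap, a sentence split exactly at a chunk boundary would be half-represented in each chunk's embedding and might match neither.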

The Four Major Vector Databases

Qdrant

Qdrant is written in Rust, which gives it excellent single-node performance and memory efficiency. It's the most developer-friendly option with a clean API and good documentation.

Architecture: Single binary, gRPC and REST APIs, optional distributed mode. Stores vectors on disk with an in-memory HNSW index. Supports payload (metadata) storage and filtering natively.

Setup with Docker:

docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_data:/qdrant/storage \
  qdrant/qdrant

Docker Compose:

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__GRPC_PORT: 6334

volumes:
  qdrant_data:

Creating a collection and inserting vectors:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,           # Must match your embedding model dimensions
        distance=Distance.COSINE
    )
)

# Insert vectors with metadata (payload)
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_vector,      # Your 1536-dim vector
            payload={
                "text": "Kubernetes pods are the smallest deployable units...",
                "source": "k8s-docs.md",
                "section": "pods",
                "date": "2026-01-15"
            }
        ),
        PointStruct(
            id=2,
            vector=another_vector,
            payload={
                "text": "A Deployment manages a set of replica Pods...",
                "source": "k8s-docs.md",
                "section": "deployments",
                "date": "2026-01-15"
            }
        )
    ]
)

Querying with filters:

from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.query_points(
    collection_name="documents",
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="source",
                match=MatchValue(value="k8s-docs.md")
            )
        ]
    ),
    limit=5
)

for point in results.points:
    print(f"Score: {point.score:.4f}")
    print(f"Text: {point.payload['text'][:100]}...")

Strengths: Fast single-node performance, clean API, good filtering, low memory footprint, excellent Rust-based reliability, built-in snapshot and backup support.

Weaknesses: Distributed mode is newer and less battle-tested than Milvus. Fewer integrations than Weaviate. Community is growing but smaller.

Weaviate

Weaviate differentiates itself with built-in vectorization modules -- you can send it raw text and it handles the embedding internally. It also supports hybrid search (combining vector similarity with keyword BM25 search).

Architecture: Written in Go. Supports modules for vectorization (OpenAI, Cohere, Hugging Face, etc.), generative AI, and reranking. Schema-based with classes and properties.

Setup with Docker:

services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:latest
    ports:
      - "8080:8080"   # REST API
      - "50051:50051" # gRPC
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"

volumes:
  weaviate_data:

If you want Weaviate to handle embeddings for you, add a vectorizer module:

    environment:
      DEFAULT_VECTORIZER_MODULE: "text2vec-openai"
      OPENAI_APIKEY: "${OPENAI_API_KEY}"
      ENABLE_MODULES: "text2vec-openai,generative-openai"

Creating a collection and inserting data:

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Create a collection (class)
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.none(),  # We'll provide vectors ourselves
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="section", data_type=DataType.TEXT),
    ]
)

# Insert with pre-computed vectors
collection = client.collections.get("Document")
collection.data.insert(
    properties={
        "text": "Kubernetes pods are the smallest deployable units...",
        "source": "k8s-docs.md",
        "section": "pods"
    },
    vector=embedding_vector
)

Hybrid search (vector + keyword):

collection = client.collections.get("Document")

# Hybrid search combines BM25 keyword matching with vector similarity
response = collection.query.hybrid(
    query="kubernetes pod resource limits",
    vector=query_vector,
    alpha=0.5,      # 0 = pure keyword, 1 = pure vector
    limit=5,
    filters=weaviate.classes.query.Filter.by_property("source").equal("k8s-docs.md")
)

for obj in response.objects:
    print(f"Text: {obj.properties['text'][:100]}...")
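Weaviate's actual fusion (ranked or relative-score fusion) normalizes scores per result set; the toy function below only illustrates what the alpha weighting does, using hypothetical pre-normalized scores:

```python
def hybrid_score(bm25_score, vector_score, alpha=0.5):
    # alpha has the same meaning as Weaviate's parameter:
    # 0 = pure keyword, 1 = pure vector.
    # Assumes both scores are already normalized to [0, 1].
    return (1 - alpha) * bm25_score + alpha * vector_score

# With a keyword-leaning alpha, an exact term match can outrank
# a semantically closer document
keyword_heavy = hybrid_score(bm25_score=0.9, vector_score=0.3, alpha=0.2)
semantic_heavy = hybrid_score(bm25_score=0.1, vector_score=0.9, alpha=0.2)
print(keyword_heavy > semantic_heavy)  # True
```

This is why hybrid search helps with queries containing rare exact tokens (error codes, product SKUs, function names) that embeddings represent poorly.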

Strengths: Hybrid search is genuinely useful (catches things pure vector search misses). Built-in vectorization modules reduce pipeline complexity. Good multi-tenancy support. Active community.

Weaknesses: Resource-hungry, especially when separate inference-module containers run alongside the Go core. Schema-based design is more rigid than Qdrant's schemaless payloads. The module system adds deployment complexity, and the query language has a learning curve.

Milvus

Milvus is the most mature option for large-scale deployments. It was built from the start for distributed, high-throughput vector search and handles billions of vectors in production at companies like eBay and Shopee.

Architecture: Cloud-native, microservice-based. Components include proxy, query nodes, data nodes, index nodes, and etcd for coordination. Written in Go and C++. Uses MinIO or S3 for storage.

Standalone setup with Docker (good for development):

services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.18
    environment:
      ETCD_AUTO_COMPACTION_MODE: revision
      ETCD_AUTO_COMPACTION_RETENTION: "1000"
      ETCD_QUOTA_BACKEND_BYTES: "4294967296"
    volumes:
      - etcd_data:/etcd

  minio:
    image: minio/minio:latest
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    command: minio server /minio_data
    volumes:
      - minio_data:/minio_data

  milvus:
    image: milvusdb/milvus:latest
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    ports:
      - "19530:19530"  # gRPC
      - "9091:9091"    # Metrics
    volumes:
      - milvus_data:/var/lib/milvus
    depends_on:
      - etcd
      - minio

volumes:
  etcd_data:
  minio_data:
  milvus_data:

Creating a collection and inserting vectors:

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility

# Connect
connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields, description="Document embeddings")

# Create collection
collection = Collection("documents", schema)

# Insert data
collection.insert([
    ["Kubernetes pods are the smallest deployable units...",
     "A Deployment manages a set of replica Pods..."],      # text
    ["k8s-docs.md", "k8s-docs.md"],                        # source
    [embedding_vector_1, embedding_vector_2]                # embedding
])

# Build an index (required before searching)
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)
collection.load()

Querying:

results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    expr='source == "k8s-docs.md"',
    output_fields=["text", "source"]
)

for hits in results:
    for hit in hits:
        print(f"Score: {hit.score:.4f}")
        print(f"Text: {hit.entity.get('text')[:100]}...")

Strengths: Proven at massive scale (billions of vectors). Most indexing algorithm options (HNSW, IVF_FLAT, IVF_SQ8, IVF_PQ, DiskANN). Strong distributed architecture. GPU-accelerated indexing.

Weaknesses: Heavy infrastructure requirements (etcd + MinIO + Milvus). Steep learning curve. The standalone mode works for development but production deployments are complex. Verbose API.

Chroma

Chroma is the lightweight option, designed for quick prototyping and small-scale applications. It's popular in tutorials and getting-started guides because it requires almost no setup.

Architecture: Python-native, runs in-process or as a lightweight server. Uses SQLite for metadata and HNSW for vector indexing. No external dependencies.

Setup (in-process, no Docker needed):

import chromadb

# Ephemeral (in-memory)
client = chromadb.Client()

# Persistent (saved to disk)
client = chromadb.PersistentClient(path="./chroma_data")

Server mode with Docker:

services:
  chroma:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma

volumes:
  chroma_data:

Creating a collection and inserting data:

import chromadb

client = chromadb.PersistentClient(path="./chroma_data")

collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Chroma can auto-embed text if you configure an embedding function,
# or you can provide vectors directly
collection.add(
    ids=["doc1", "doc2"],
    embeddings=[embedding_vector_1, embedding_vector_2],
    documents=[
        "Kubernetes pods are the smallest deployable units...",
        "A Deployment manages a set of replica Pods..."
    ],
    metadatas=[
        {"source": "k8s-docs.md", "section": "pods"},
        {"source": "k8s-docs.md", "section": "deployments"}
    ]
)

Querying:

results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    where={"source": "k8s-docs.md"}
)

for doc, distance in zip(results["documents"][0], results["distances"][0]):
    print(f"Distance: {distance:.4f}")
    print(f"Text: {doc[:100]}...")

Strengths: Easiest setup by far. No infrastructure needed. Great for prototyping, demos, and small datasets. Good Python developer experience. Built-in embedding function support.

Weaknesses: Not designed for production scale (struggles past ~1M vectors). No distributed mode. Limited query capabilities. No hybrid search. Single-threaded performance ceiling.

Comparison Table

| Feature | Qdrant | Weaviate | Milvus | Chroma |
|---|---|---|---|---|
| Language | Rust | Go | Go/C++ | Python |
| Max Scale | ~100M vectors | ~100M vectors | Billions | ~1M vectors |
| Index Types | HNSW | HNSW, flat | HNSW, IVF, DiskANN | HNSW |
| Hybrid Search | Sparse vectors | BM25 + vector | Sparse vectors | No |
| Filtering | Excellent | Good | Good | Basic |
| Setup Complexity | Low | Medium | High | Very low |
| Memory Efficiency | Excellent | Moderate | Good | Moderate |
| Multi-tenancy | Per-collection | Native | Per-collection | Per-collection |
| Cloud Managed | Yes | Yes | Yes (Zilliz) | Yes |
| License | Apache 2.0 | BSD-3-Clause | Apache 2.0 | Apache 2.0 |

Indexing Strategies

The index algorithm determines how the database organizes vectors for fast search. Understanding the options helps you tune for your workload.

HNSW (Hierarchical Navigable Small World)

HNSW is the default choice for most vector databases. It builds a multi-layer graph where each layer is a "small world" network. Searching starts at the top layer (sparse) and drills down to the bottom layer (dense).

Key parameters:

# Qdrant HNSW configuration
from qdrant_client.models import HnswConfigDiff

client.update_collection(
    collection_name="documents",
    hnsw_config=HnswConfigDiff(
        m=16,                      # Edges per node: higher = better recall, more memory
        ef_construct=256,          # Build-time candidate list: higher = better index, slower build
        full_scan_threshold=10000  # Use brute-force below this count
    )
)

When to use: Almost always. HNSW provides the best recall-vs-speed trade-off for most workloads and dataset sizes.
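The drill-down can be sketched as a greedy walk on a toy one-layer neighbor graph (real HNSW keeps a beam of `ef` candidates and multiple layers; this hand-built graph is purely illustrative):

```python
import math

# Toy graph: node id -> (vector, neighbor ids)
graph = {
    0: ((0.0, 0.0), [1, 2]),
    1: ((1.0, 0.0), [0, 3]),
    2: ((0.0, 1.0), [0, 3]),
    3: ((1.0, 1.0), [1, 2, 4]),
    4: ((2.0, 2.0), [3]),
}

def greedy_search(query, entry=0):
    # Repeatedly hop to whichever neighbor is closer to the query;
    # stop at a local minimum of distance
    current = entry
    while True:
        vec, neighbors = graph[current]
        best = min(neighbors, key=lambda n: math.dist(graph[n][0], query))
        if math.dist(graph[best][0], query) < math.dist(vec, query):
            current = best
        else:
            return current

print(greedy_search((1.9, 2.1)))  # walks 0 -> 2 -> 3 -> 4
```

HNSW's upper layers exist to pick a good entry point fast, and its candidate beam guards against the local minima a purely greedy walk can get stuck in.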

IVF (Inverted File Index)

IVF partitions the vector space into clusters using k-means, then searches only the most relevant clusters at query time.

Key parameters:

# Milvus IVF configuration
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}   # Number of k-means clusters to partition the space into
}
collection.create_index("embedding", index_params)

# At query time: nprobe = clusters to scan (higher = better recall, slower)
search_params = {"metric_type": "COSINE", "params": {"nprobe": 32}}

When to use: When memory is constrained and you have millions of vectors. IVF uses less memory than HNSW but requires more tuning. IVF_PQ (product quantization) reduces memory further at the cost of recall.
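A toy sketch of the IVF idea, with hand-picked centroids standing in for k-means output:

```python
import math

def nearest(point, candidates):
    return min(range(len(candidates)), key=lambda i: math.dist(candidates[i], point))

# In a real index, centroids come from k-means over nlist clusters
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
vectors = [(0.5, 0.2), (9.8, 10.1), (0.1, 9.7), (10.2, 9.9), (0.3, 0.4)]

# Build inverted lists: cluster id -> ids of vectors assigned to it
lists = {i: [] for i in range(len(centroids))}
for vid, v in enumerate(vectors):
    lists[nearest(v, centroids)].append(vid)

def ivf_search(query, nprobe=1):
    # Rank clusters by centroid distance, scan only the top nprobe lists
    ranked = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], query))
    candidates = [vid for c in ranked[:nprobe] for vid in lists[c]]
    return min(candidates, key=lambda vid: math.dist(vectors[vid], query))

print(ivf_search((9.9, 10.0), nprobe=1))  # 1
```

Recall suffers when the true nearest neighbor sits in a cluster you didn't probe; raising nprobe trades speed back for recall.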

DiskANN

DiskANN is a Microsoft Research algorithm that stores the graph index on SSD, keeping only a small fraction in memory. Milvus supports this natively.

When to use: When your dataset exceeds available RAM. DiskANN can handle billions of vectors on a single node with NVMe storage.
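A quick back-of-the-envelope calculation shows why: raw float32 vectors alone outgrow RAM well before index overhead is counted.

```python
def vector_storage_gb(num_vectors, dims, bytes_per_float=4):
    # Raw float32 storage only; graph/index overhead comes on top
    return num_vectors * dims * bytes_per_float / 1024**3

print(f"{vector_storage_gb(1_000_000, 1536):.1f} GB")     # ~5.7 GB: fits in RAM
print(f"{vector_storage_gb(1_000_000_000, 1536):.0f} GB")  # ~5722 GB: needs disk
```

At a billion 1536-dimensional vectors you are into terabytes before indexing, which is the regime DiskANN (or quantization) was designed for.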

Choosing an Index Strategy

As a rule of thumb: default to HNSW, move to IVF variants when memory is the binding constraint and you can invest in tuning, and consider DiskANN when the dataset no longer fits in RAM.

Building a RAG Pipeline

Here's a complete RAG pipeline using Qdrant and OpenAI, from document ingestion to answer generation.

Step 1: Ingest Documents

import os
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from langchain.text_splitter import RecursiveCharacterTextSplitter

openai_client = OpenAI()
qdrant_client = QdrantClient(host="localhost", port=6333)

# Create collection
qdrant_client.recreate_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

import hashlib

def ingest_file(file_path: str):
    text = Path(file_path).read_text()
    chunks = splitter.split_text(text)

    # Batch embed
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )

    points = []
    for i, (chunk, embedding_data) in enumerate(zip(chunks, response.data)):
        # Derive a stable 63-bit ID from path and chunk index. Python's
        # built-in hash() is randomized per process, which would break
        # upsert idempotency across runs.
        point_id = int(hashlib.sha256(f"{file_path}:{i}".encode()).hexdigest(), 16) % (2**63)
        points.append(PointStruct(
            id=point_id,
            vector=embedding_data.embedding,
            payload={
                "text": chunk,
                "source": file_path,
                "chunk_index": i
            }
        ))

    qdrant_client.upsert(
        collection_name="knowledge_base",
        points=points
    )
    print(f"Ingested {len(points)} chunks from {file_path}")

# Ingest all markdown files in a directory
for md_file in Path("./docs").glob("**/*.md"):
    ingest_file(str(md_file))

Step 2: Query and Generate

def ask(question: str, top_k: int = 5) -> str:
    # Embed the question
    query_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_vector = query_response.data[0].embedding

    # Search for relevant chunks
    results = qdrant_client.query_points(
        collection_name="knowledge_base",
        query=query_vector,
        limit=top_k
    )

    # Build context from retrieved chunks
    context_parts = []
    sources = set()
    for point in results.points:
        context_parts.append(point.payload["text"])
        sources.add(point.payload["source"])

    context = "\n\n---\n\n".join(context_parts)

    # Generate answer with context
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question "
                    "based on the provided context. If the context doesn't contain "
                    "enough information, say so. Cite sources when possible."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0.1
    )

    answer = response.choices[0].message.content
    source_list = "\n".join(f"- {s}" for s in sources)
    return f"{answer}\n\nSources:\n{source_list}"

# Use it
print(ask("How do I configure resource limits for a Kubernetes pod?"))

Step 3: Evaluate Retrieval Quality

RAG quality depends heavily on retrieval quality. Test it:

def evaluate_retrieval(question: str, expected_source: str, top_k: int = 5) -> dict:
    """Check if the expected source appears in the top-K results."""
    query_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )

    results = qdrant_client.query_points(
        collection_name="knowledge_base",
        query=query_response.data[0].embedding,
        limit=top_k
    )

    retrieved_sources = [p.payload["source"] for p in results.points]
    hit = expected_source in retrieved_sources

    return {
        "question": question,
        "expected": expected_source,
        "hit": hit,
        "top_score": results.points[0].score if results.points else 0,
        "retrieved_sources": retrieved_sources
    }

# Build an evaluation set
eval_set = [
    ("How do I set CPU limits?", "docs/kubernetes/resources.md"),
    ("What is a PersistentVolumeClaim?", "docs/kubernetes/storage.md"),
    ("How do I configure ingress routing?", "docs/kubernetes/networking.md"),
]

results = [evaluate_retrieval(q, s) for q, s in eval_set]
hit_rate = sum(1 for r in results if r["hit"]) / len(results)
print(f"Hit rate @ 5: {hit_rate:.1%}")

Performance Benchmarks

Real-world performance depends on hardware, dataset size, vector dimensions, and index configuration. These benchmarks give you a rough sense of relative performance on a single node with 1M 1536-dimensional vectors.

| Metric | Qdrant | Weaviate | Milvus | Chroma |
|---|---|---|---|---|
| Insert throughput | ~8K vec/s | ~5K vec/s | ~10K vec/s | ~3K vec/s |
| Query latency (p50) | ~2ms | ~5ms | ~3ms | ~8ms |
| Query latency (p99) | ~8ms | ~15ms | ~10ms | ~25ms |
| Memory (1M vectors) | ~3 GB | ~5 GB | ~4 GB | ~4 GB |
| Recall @ 10 | 0.98 | 0.97 | 0.98 | 0.95 |

Notes: Benchmarks are approximate and depend heavily on tuning. Qdrant and Milvus tend to lead on raw performance. Weaviate's overhead comes from Go runtime and module system. Chroma's numbers degrade faster at scale.
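Recall@10 measures the fraction of the true 10 nearest neighbors (from exact brute-force search) that the ANN index actually returned. A sketch with hypothetical ID lists:

```python
def recall_at_k(ann_ids, exact_ids, k=10):
    # Fraction of the true top-k that the ANN result recovered
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # ground truth from brute force
ann   = [1, 2, 3, 5, 4, 6, 7, 9, 8, 42]   # ANN missed id 10, returned 42
print(recall_at_k(ann, exact))  # 0.9
```

Note that order within the top-k doesn't matter for recall, only membership; a database can have high recall while ranking results slightly differently than exact search.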

Integration with LLM Frameworks

All four databases integrate with the major LLM orchestration frameworks.

LangChain

# Qdrant + LangChain
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="langchain_docs"
)

# Use as retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("How do I deploy to Kubernetes?")

LlamaIndex

# Qdrant + LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="llama_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("How do I configure pod resource limits?")
print(response)

Production Considerations

Backup and Disaster Recovery

Each database has its own mechanism: Qdrant exposes per-collection snapshot APIs, Weaviate offers backup modules (filesystem, S3, GCS), Milvus persists data in its object storage layer and has a separate milvus-backup tool, and a Chroma persistent directory can simply be copied while the process is stopped. Whichever you use, test restores regularly -- losing the index means re-embedding the entire corpus.

Monitoring

All four expose metrics for monitoring:

# Qdrant metrics (Prometheus format)
curl http://localhost:6333/metrics

# Milvus metrics
curl http://localhost:9091/metrics

# Weaviate metrics (requires PROMETHEUS_MONITORING_ENABLED=true)
curl http://localhost:2112/metrics

Key metrics to watch: query latency percentiles, index build time, memory usage, collection size, and error rates.
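The percentile math itself is simple; a nearest-rank sketch over hypothetical latency samples:

```python
import math

def percentile(samples, pct):
    # Nearest-rank percentile: value at ceil(pct/100 * n) in sorted order
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [2, 3, 2, 4, 2, 3, 5, 2, 3, 48]  # one slow outlier
print(percentile(latencies_ms, 50))  # 3
print(percentile(latencies_ms, 99))  # 48
```

The gap between p50 and p99 is the number to watch: tail latency spikes often signal index rebuilds, memory pressure, or filter-heavy queries falling back to slow paths.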

Security

In production, enable authentication:

# Qdrant - API key auth
docker run -p 6333:6333 \
  -e QDRANT__SERVICE__API_KEY=your-secret-key \
  qdrant/qdrant

# Client-side
client = QdrantClient(host="localhost", port=6333, api_key="your-secret-key")

# Weaviate - API key auth (environment variables)
AUTHENTICATION_APIKEY_ENABLED=true
AUTHENTICATION_APIKEY_ALLOWED_KEYS=your-secret-key
AUTHENTICATION_APIKEY_USERS=admin@example.com

When to Use Each

Choose Qdrant if you want the best developer experience with strong single-node performance. It's the right default choice for most RAG applications with up to tens of millions of vectors. Rust's memory safety and performance give it an edge on reliability and efficiency.

Choose Weaviate if hybrid search (vector + keyword) is important to your use case, or if you want built-in vectorization modules to simplify your pipeline. Its module ecosystem is a genuine advantage if you need reranking, generative modules, or multi-modal search.

Choose Milvus if you're operating at massive scale (hundreds of millions to billions of vectors) and have the infrastructure team to manage a distributed deployment. Milvus is the most battle-tested option for large-scale production. The managed cloud version (Zilliz) reduces operational burden.

Choose Chroma for prototyping, demos, hackathons, and small applications under 500K vectors. Its zero-configuration setup gets you from idea to working RAG pipeline in minutes. Plan to migrate to Qdrant or Milvus when you outgrow it.

Recommendations

To condense the above: prototype with Chroma, move to Qdrant for most production RAG services, pick Weaviate when hybrid search or built-in vectorization is decisive, and adopt Milvus only when you genuinely operate at hundreds of millions of vectors or more.