Databases in the AI Era - A Complete Guide to Vector Databases

What is a vector database?

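A vector database is purpose-built to store, index, and query high-dimensional vectors, which makes it core infrastructure for AI applications.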

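💡 How it works

• An embedding model converts text into vectors
• The vectors are stored in the database
• Similarity is computed with cosine similarity or Euclidean distance
• The Top-K most similar results are returned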
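The steps above can be sketched as a brute-force search in plain Python. This is only an illustration under simplifying assumptions: the three-dimensional vectors and the `cosine_similarity` and `top_k` helpers are hypothetical, and a real vector database would use an approximate index instead of scanning every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, stored, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional "embeddings"; real models emit hundreds of dimensions.
stored = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "car": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], stored, k=2)
print(results)  # "cat" ranks first, then "dog"; "car" is dropped
```

Scanning every vector costs time linear in the collection size, which is why production systems rely on approximate nearest-neighbor indexes.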

🎯 Use cases

• RAG (retrieval-augmented generation)
• Recommendation systems
• Image retrieval
• Semantic search
• Similar-question matching
• Code search

🔄 RAG application flow

📄 Document chunking → 🧠 Embedding → 💾 Vector storage → 🔍 Similarity search → ✨ LLM generation
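The first stage of this flow, document chunking, can be sketched as follows. This is a minimal illustration, not a recommended splitter: the `chunk_words` helper is hypothetical, it counts whitespace-separated words rather than model tokens, and the small sizes are chosen only to keep the output readable.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_words("a b c d e f g h", chunk_size=5, overlap=2)
print(chunks)  # ['a b c d e', 'd e f g h']
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some text twice.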

🔝 Popular vector databases

🌲 Pinecone

A cloud-hosted, fully managed vector database that requires no operations work

✨ Advantages

✅ Cloud-hosted, no operations work
✅ Works out of the box
✅ Developer-friendly API
✅ Automatic scaling

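🔧 When to use it

• Rapid prototyping
• Applications for small and medium-sized businesses
• Teams that want to avoid operations costs
• SaaS products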

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("products")

# Insert vectors as (id, values) tuples; "..." stands for the remaining components
index.upsert(vectors=[
    ("id1", [0.1, 0.2, 0.3, ...]),
    ("id2", [0.4, 0.5, 0.6, ...])
])

# Query the 5 nearest neighbors and return their metadata
results = index.query(
    vector=[0.1, 0.2, 0.3, ...],
    top_k=5,
    include_metadata=True
)

🚀 Qdrant

A high-performance open-source vector database implemented in Rust

✨ Advantages

✅ Open source and free
✅ Excellent performance
✅ REST and gRPC APIs
✅ Filter support

🔧 When to use it

• High performance requirements
• Self-hosted (private) deployment
• Enterprise applications
• Custom requirements

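from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)

# Create a collection of 384-dimensional vectors compared by cosine similarity
client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Insert a point; "..." stands for the remaining vector components
client.upsert(
    collection_name="products",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, ...],
                    payload={"name": "Product A"})
    ]
)

# Search for the 5 nearest points
# (recent qdrant-client releases deprecate search() in favor of query_points())
results = client.search(
    collection_name="products",
    query_vector=[0.1, 0.2, ...],
    limit=5
)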

🦄 Weaviate

An end-to-end vector platform with a GraphQL API

✨ Advantages

✅ GraphQL interface
✅ Built-in vectorization
✅ Automatic classification
✅ Multi-cloud support

🔧 When to use it

• End-to-end solutions
• GraphQL ecosystems
• Multi-cloud deployment
• Enterprise applications

📊 Milvus

A distributed, cloud-native vector database

✨ Advantages

✅ Distributed architecture
✅ Horizontal scaling
✅ Handles massive datasets
✅ LF AI & Data Foundation graduated project

🔧 When to use it

• Large-scale applications
• Distributed environments
• Enterprise platforms
• Cloud-native deployment

🏗️ Hands-on RAG example

A RAG system with Python + Qdrant + DeepSeek

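from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Initialize the vector store, the embedding model, and the LLM client
client = QdrantClient(host="localhost", port=6333)
encoder = SentenceTransformer('BAAI/bge-large-zh-v1.5')  # full Hugging Face model id
llm = OpenAI(api_key="sk-xxx", base_url="https://api.deepseek.com")

# Create the collection once, sized to the encoder's output dimension
if not client.collection_exists("knowledge"):
    client.create_collection(
        collection_name="knowledge",
        vectors_config=VectorParams(
            size=encoder.get_sentence_embedding_dimension(),
            distance=Distance.COSINE
        )
    )

# 1. Embed the documents and store them
def add_documents(documents):
    vectors = encoder.encode(documents)
    # Ids are list positions, so a later call overwrites points with the same ids
    points = [
        PointStruct(id=i, vector=vector.tolist(), payload={"text": text})
        for i, (text, vector) in enumerate(zip(documents, vectors))
    ]
    client.upsert(collection_name="knowledge", points=points)

# 2. Vector retrieval
def retrieve(query, top_k=3):
    query_vector = encoder.encode(query)
    results = client.search(
        collection_name="knowledge",
        query_vector=query_vector.tolist(),
        limit=top_k
    )
    return [result.payload['text'] for result in results]

# 3. RAG generation
def rag_generate(user_query):
    contexts = retrieve(user_query)
    # The prompt stays in Chinese to match the Chinese embedding model and corpus:
    # "Answer the question based on the following context: ... Question: ... Answer:"
    prompt = f"""基于以下上下文回答问题：

{' '.join(contexts)}

问题：{user_query}

答案："""

    response = llm.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Usage (the query means "What is machine learning?")
answer = rag_generate("什么是机器学习？")
print(answer)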

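📊 Database comparison

Database	Type	Vector support	Best suited for
Pinecone	Managed	✅	Rapid prototyping
Qdrant	Open source	✅	High performance
Weaviate	Open source	✅	End-to-end solutions
Milvus	Open source	✅	Large scale
PostgreSQL + pgvector	Extension	✅	Extending an existing relational database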

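📋 Vector database best practices

✅ Recommended

• Choose a vector dimension that matches your embedding model (for example, 768 or 1536)
• Use cosine similarity or dot product
• Set a reasonable chunk size (for example, 512 tokens)
• Add metadata filtering
• Update the vector index regularly
• Monitor retrieval quality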
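The choice between cosine similarity and dot product matters less than it may seem: once vectors are scaled to unit length, the two give the same score. The sketch below illustrates this with hypothetical two-dimensional vectors and helper names (`normalize`, `dot`, `cosine`) that are not taken from any library.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
cosine_score = cosine(a, b)
dot_of_normalized = dot(normalize(a), normalize(b))
print(cosine_score, dot_of_normalized)  # both are 0.96 up to floating-point rounding
```

Normalizing once at insert time therefore lets a dot-product index return cosine rankings without recomputing norms on every query.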

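❌ Practices to avoid

• Do not store raw, unprocessed text directly
• Avoid overly long chunks
• Do not skip reranking
• Avoid relying on a single vector database
• Do not skip evaluation metrics
• Do not handle sensitive data carelessly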
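As one way to act on the evaluation advice, retrieval quality can be tracked with recall@k over a small set of labeled queries. The sketch below is an assumption-laden illustration: the `recall_at_k` helper, the document ids, and the relevance labels are all hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

# Hypothetical labeled query: ids returned by the retriever vs. ids judged relevant.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}
score_at_3 = recall_at_k(retrieved, relevant, k=3)
score_at_5 = recall_at_k(retrieved, relevant, k=5)
print(score_at_3, score_at_5)  # one of three relevant docs in the top 3, two of three in the top 5
```

Recomputing this number after changing the chunk size, the embedding model, or the reranker shows whether the change actually helped retrieval.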