Project · Shipped

Vault Search

  • shipped
  • Voyage AI
  • ChromaDB

A semantic search CLI over my personal Obsidian vault. Built explicitly to learn embeddings, chunking strategies, vector databases, and the shape of a RAG pipeline end-to-end — the vault is just the pretext.

What it is

Vault Search is a Python CLI that answers natural-language questions against the markdown notes in my vault. "What did I decide about the portfolio site stack?" instead of grep -ri portfolio 20-projects/. It indexes markdown files into a local vector store, mirrors chunk metadata into the same SQLite database that powers my personal OS dashboard, and returns top-k semantic matches with file paths and snippets.

This is a learning-sandbox project and the page should say so plainly. The goal wasn't the world's best vault search. It was to get hands on every moving part of a RAG pipeline once, end-to-end, so the next time I meet embeddings and vector stores in a real project I'm reading code instead of reading docs.

Architecture

Three scripts, one database, one vector store.

  • chunker.py — markdown-aware chunking. H2 sections are the natural boundary; very short sections get merged forward so a six-line note doesn't become six low-signal chunks.
  • index.py — embeds each chunk with Voyage AI's voyage-3-lite over raw HTTP and writes vectors into ChromaDB in embedded mode via PersistentClient.
  • search.py — embeds the query, pulls top-k from Chroma, prints matches with file paths and snippets (sketched just after this list).
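
To make the query path concrete, here is a minimal sketch of what search.py does. The collection name, the path metadata key, and the voyage_embed() helper (sketched under "Raw HTTP against Voyage" below) are illustrative assumptions, not lifted from the real scripts.

```python
# Sketch of the search.py query path. Collection name, metadata keys, and the
# voyage_client module are assumptions for illustration only.
import chromadb

from voyage_client import voyage_embed  # hypothetical module; see the HTTP sketch below


def search(query: str, k: int = 5):
    client = chromadb.PersistentClient(path=".chroma")    # storage path is illustrative
    collection = client.get_collection("vault")           # collection name is an assumption
    query_vec = voyage_embed([query])[0]                  # embed the query text

    results = collection.query(
        query_embeddings=[query_vec],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        # Print distance, source file, and a short snippet of the matched chunk
        print(f"{dist:.3f}  {meta['path']}  {doc[:120]}")
```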

Chunk metadata lands in two tables inside os.db: search_chunks (one row per chunk with file path, heading, token count) and search_log (one row per query with latency and the top hit). Same database as the dashboard, so both pipelines are browsable from one TablePlus connection. Current index: 50 chunks across 18 files (verified 2026-04-20). Tiny on purpose — this is a personal vault, not a knowledge base. The point is the pipeline shape, not the corpus size.
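
For orientation, a rough sketch of the two mirror tables. Only the columns named above come from the project; the ids, timestamp, and exact column names are guesses for illustration.

```python
# Rough shape of the two mirror tables in os.db. Column names beyond the ones
# described above (path, heading, token count; query, latency, top hit) are guesses.
import sqlite3

conn = sqlite3.connect("os.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS search_chunks (
    id          INTEGER PRIMARY KEY,
    path        TEXT NOT NULL,       -- vault-relative markdown file path
    heading     TEXT,                -- H2 heading the chunk came from
    token_count INTEGER
);
CREATE TABLE IF NOT EXISTS search_log (
    id          INTEGER PRIMARY KEY,
    query       TEXT NOT NULL,
    latency_ms  REAL,                -- end-to-end query latency
    top_hit     TEXT,                -- path of the best-scoring chunk
    ran_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
conn.commit()
```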

Hard-won bits

H2 chunking with short-section merging

Naive chunking by H2 produced a long tail of one-paragraph chunks that drowned real sections out of the cosine-similarity ranking. Merging short sections forward cut chunk count ~30% and noticeably improved top-k quality for queries that should have matched a larger parent section. The lesson generalizes: chunk boundaries should match the semantic unit a reader would cite, not the markup structure.
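
A minimal sketch of the merge-forward pass, assuming the note has already been split into (heading, text) pairs on H2 boundaries. The token threshold and function names are illustrative, not the values chunker.py actually uses.

```python
# Illustrative merge-forward pass: sections too short to stand alone are
# carried forward and folded into the next section. Threshold is a placeholder.
def merge_short_sections(sections, count_tokens, min_tokens=200):
    merged = []
    carry_heading, carry_text = None, ""
    for heading, text in sections:
        if carry_text:
            # A short earlier section is riding along: prepend it to this one
            heading = carry_heading
            text = carry_text + "\n\n" + text
            carry_heading, carry_text = None, ""
        if count_tokens(text) < min_tokens:
            carry_heading, carry_text = heading, text   # still too small, keep carrying
        else:
            merged.append((heading, text))
    if carry_text:
        merged.append((carry_heading, carry_text))      # trailing short section stays as-is
    return merged
```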

Raw HTTP against Voyage

The Voyage Python SDK doesn't support Python 3.14 yet. A ~40-line requests wrapper covers embeddings, batch embeddings, and rerank — and made the auth and batching contract more visible than an SDK would have. I'll keep this as a default pattern for any API I'm learning: write the thin client once, understand the wire, adopt the SDK later if it earns its place.
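
Roughly what that wrapper looks like for the embeddings call. The endpoint and payload shape follow Voyage's embeddings API as I understand it; treat this as a sketch and check the current docs before reusing it.

```python
# Thin requests wrapper for Voyage embeddings. Endpoint, payload, and response
# fields reflect my reading of the Voyage API docs; verify before relying on it.
import os
import requests

VOYAGE_URL = "https://api.voyageai.com/v1/embeddings"


def voyage_embed(texts, model="voyage-3-lite"):
    """Return one embedding vector per input string."""
    resp = requests.post(
        VOYAGE_URL,
        headers={"Authorization": f"Bearer {os.environ['VOYAGE_API_KEY']}"},
        json={"input": texts, "model": model},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    # Sort by index so vectors line up with the input order
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]
```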

Orphan cleanup on reindex

The first version left stale chunks behind when notes were deleted — search would happily return a hit on a file that no longer existed. Reindex now diffs the vault against search_chunks and drops orphans before inserting new ones. Also handled: duplicate H2 headings within a single note, and empty-body sections that used to slip through as zero-token chunks.
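
The orphan sweep, sketched. It assumes chunk metadata stores the vault-relative file path under a path key in both the SQLite mirror and the Chroma collection; all names and paths here are illustrative.

```python
# Illustrative orphan sweep: drop chunks whose source file no longer exists,
# from both the SQLite mirror and the Chroma collection.
import sqlite3
from pathlib import Path

import chromadb


def drop_orphans(vault_dir: str, db_path: str = "os.db"):
    live = {str(p.relative_to(vault_dir)) for p in Path(vault_dir).rglob("*.md")}

    conn = sqlite3.connect(db_path)
    indexed = {row[0] for row in conn.execute("SELECT DISTINCT path FROM search_chunks")}
    orphans = sorted(indexed - live)
    if not orphans:
        return

    # Remove stale rows from the SQLite mirror...
    conn.executemany("DELETE FROM search_chunks WHERE path = ?", [(p,) for p in orphans])
    conn.commit()

    # ...and the matching vectors from Chroma
    collection = chromadb.PersistentClient(path=".chroma").get_collection("vault")
    collection.delete(where={"path": {"$in": orphans}})
```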

Status

Shipped as v1. Planned expansions, explicitly not blocking completion: an MCP server so any Claude session can query the vault as a tool, auto-reindex hooked into the vault's post-commit chain, hybrid search (vector + keyword) for queries that benefit from exact-term recall, and a small comparison study across embedding providers (Voyage / Ollama / OpenAI) once there's enough corpus to make the comparison meaningful.