SEC filings are hundreds of pages of tables, footnotes, and legal text. Manual search through PDFs doesn’t scale. I built a RAG system to query this data in natural language: a weekend project to test retrieval accuracy on structured financial documents.
Document management
The system integrates with the SEC EDGAR API. Search for companies by ticker or name, browse their filing history, filter by type (10-K, 10-Q, 8-K, proxy statements), and download directly. Documents get parsed and indexed automatically on import.
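The EDGAR side is just JSON over HTTP. Here’s a minimal sketch of listing a company’s recent filings of one form type, assuming the public `data.sec.gov` submissions endpoint (SEC asks for a descriptive User-Agent); the `listFilings` helper and the agent string are illustrative, not the actual implementation:

```javascript
// Sketch: list recent filings of a given form type for one company via EDGAR.
const SEC_USER_AGENT = "example-rag-app contact@example.com"; // placeholder

async function listFilings(cik, formType = "10-K") {
  const paddedCik = String(cik).padStart(10, "0");
  const res = await fetch(`https://data.sec.gov/submissions/CIK${paddedCik}.json`, {
    headers: { "User-Agent": SEC_USER_AGENT },
  });
  const data = await res.json();
  const recent = data.filings.recent; // parallel arrays: form, filingDate, accessionNumber, ...
  return recent.form
    .map((form, i) => ({
      form,
      filingDate: recent.filingDate[i],
      accessionNumber: recent.accessionNumber[i],
      primaryDocument: recent.primaryDocument[i],
    }))
    .filter((f) => f.form === formType);
}
```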
You can organize tickers into collections (BIG TECH, ENERGY, whatever makes sense). In queries, use @MSFT to filter by ticker or #COLLECTION to search across grouped companies. The system is org-based: invite teammates, make chats private or share them with your organization.
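The exact filter parsing isn’t the interesting part, but roughly, @ticker and #collection tokens get pulled out of the query before anything is embedded. A sketch of that idea, with illustrative names and regexes rather than the real code:

```javascript
// Sketch: extract @TICKER and #COLLECTION filters from a raw query string.
function parseQueryFilters(rawQuery) {
  const tickers = [...rawQuery.matchAll(/@([A-Za-z.]{1,6})\b/g)].map((m) => m[1].toUpperCase());
  const collections = [...rawQuery.matchAll(/#([\w-]+)\b/g)].map((m) => m[1].toUpperCase());
  // Strip the filter tokens so only the natural-language question gets embedded
  const text = rawQuery.replace(/[@#][\w.-]+/g, "").replace(/\s+/g, " ").trim();
  return { text, tickers, collections };
}

// parseQueryFilters("@MSFT what were R&D expenses in the latest 10-K?")
// -> { text: "what were R&D expenses in the latest 10-K?", tickers: ["MSFT"], collections: [] }
```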
No manual PDF hunting or upload workflows. The interface handles discovery, retrieval, and processing in one flow.
The approach
Standard RAG pipeline: parse documents, chunk text, generate embeddings, store in a vector database, retrieve relevant chunks at query time, feed them to an LLM for synthesis.
The stack:
- Document parsing: `pdf-parse` for text extraction, with optional Tesseract.js OCR for scanned filings
- Chunking: sentence-aware splitting at ~512 tokens with 50-token overlap (sketched after this list)
- Embeddings: OpenAI’s `text-embedding-3-small` via OpenRouter
- Vector store: PostgreSQL + pgvector
- LLM: Claude 3.5 Sonnet for answer generation
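The chunker is the least exotic piece. A rough sketch of sentence-aware splitting with overlap, using a crude characters-divided-by-four token estimate instead of a real tokenizer:

```javascript
// Sketch: split text into ~512-token chunks on sentence boundaries,
// carrying ~50 tokens of overlap into the next chunk.
const approxTokens = (s) => Math.ceil(s.length / 4);

function chunkText(text, maxTokens = 512, overlapTokens = 50) {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)/g) ?? [text];
  const chunks = [];
  let current = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const t = approxTokens(sentence);
    if (currentTokens + t > maxTokens && current.length > 0) {
      chunks.push(current.join("").trim());
      // Keep trailing sentences as overlap for the next chunk
      while (current.length > 0 && currentTokens > overlapTokens) {
        currentTokens -= approxTokens(current.shift());
      }
    }
    current.push(sentence);
    currentTokens += t;
  }
  if (current.length > 0) chunks.push(current.join("").trim());
  return chunks;
}
```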
I went with pgvector over dedicated vector databases (Pinecone, Weaviate) for simplicity. One fewer service to deploy and monitor. IVFFlat indexing handles the scale I needed without noticeable accuracy loss.
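The storage layer is plain SQL. A sketch of the schema and IVFFlat index, assuming 1536-dimensional vectors from `text-embedding-3-small`; table and column names here are illustrative:

```javascript
// Sketch: pgvector schema and IVFFlat index via node-postgres.
import pg from "pg";

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

await pool.query(`
  CREATE EXTENSION IF NOT EXISTS vector;

  CREATE TABLE IF NOT EXISTS chunks (
    id         bigserial PRIMARY KEY,
    filing_id  bigint NOT NULL,
    content    text NOT NULL,
    embedding  vector(1536) NOT NULL   -- text-embedding-3-small dimension
  );

  -- IVFFlat trades a little recall for speed; 'lists' gets tuned to corpus size
  CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
`);
```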
The hard part: follow-up queries
Basic RAG works fine for standalone questions. “What was Apple’s Q4 revenue?” retrieves the right chunks and generates a decent answer.
The problem is conversations. User asks about Q4 revenue, then follows up with “how does that compare to the previous quarter?” That query, taken alone, retrieves nothing useful: there’s no context about which company or which metric.
The fix: query reformulation. Before embedding a query, I check if it looks like a follow-up (short query, contains pronouns, starts with “and” or “what about”). If so, I send the last few conversation exchanges to the LLM and ask it to rewrite the query as a standalone question.
“How does that compare to the previous quarter?” becomes “How does Apple’s Q4 2024 revenue compare to Q3 2024 revenue?”
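The reformulation itself is one small LLM call. A sketch against OpenRouter’s chat-completions endpoint; the prompt wording and helper name are illustrative, not the exact production prompt:

```javascript
// Sketch: rewrite a follow-up question into a standalone query using
// the last few conversation turns as context.
async function reformulateQuery(query, recentTurns) {
  const history = recentTurns.map((t) => `${t.role}: ${t.content}`).join("\n");

  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "anthropic/claude-3.5-sonnet",
      messages: [
        {
          role: "user",
          content:
            `Conversation so far:\n${history}\n\n` +
            `Rewrite the follow-up question below as a standalone question, ` +
            `keeping company names, metrics, and time periods explicit.\n\n` +
            `Follow-up: ${query}`,
        },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content.trim();
}
```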
The heuristics avoid calling the LLM on every query:
// Heuristic: decide whether a query needs reformulation at all
function isLikelyFollowUp(query) {
  const wordCount = query.trim().split(/\s+/).length;
  // Very short queries (≤3 words) are likely follow-ups
  if (wordCount <= 3) return true;
  // Pronouns, follow-up phrases, or comparison words
  const strongFollowUpIndicators = [
    /^(and|also|what about|how about)\s/i,
    /^(this|that|it)\s/i,
    /\b(compared|versus|vs|difference)\b/i,
  ];
  return strongFollowUpIndicators.some((re) => re.test(query.trim()));
}
This solved most of the conversational reliability issues.
Search: not just vectors
Pure semantic search has gaps. If someone asks for “AAPL 10-K”, vector similarity might retrieve Apple content, but it might also surface other tech companies with similar business descriptions.
Hybrid search helps: 30% weight on keyword matching (PostgreSQL full-text search), 70% on semantic similarity. The keyword component catches exact ticker symbols and form types.
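Blending the two scores can live in a single query. A sketch of what that looks like with pgvector and Postgres full-text search; the 0.3/0.7 split mirrors the weights above, and the table and column names are illustrative:

```javascript
// Sketch: hybrid retrieval = 30% full-text rank + 70% cosine similarity.
async function hybridSearch(pool, queryText, queryEmbedding, limit = 10) {
  const { rows } = await pool.query(
    `
    SELECT id, content,
           0.3 * ts_rank(to_tsvector('english', content),
                         plainto_tsquery('english', $1))
         + 0.7 * (1 - (embedding <=> $2::vector)) AS score
    FROM chunks
    ORDER BY score DESC
    LIMIT $3
    `,
    [queryText, JSON.stringify(queryEmbedding), limit]
  );
  return rows;
}
```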
For SEC filings specifically, I added recency weighting. Financial data ages fast; a 2024 filing should rank higher than a 2019 filing when semantic similarity is close:
combinedScore = semanticSimilarity * 0.85 + recencyBonus * 0.15;
// recencyBonus decays linearly over 365 days
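One reading of that decay, assuming the bonus falls linearly to zero at 365 days (the exact curve and field names in the project may differ):

```javascript
// Sketch: linear recency bonus in [0, 1], fully decayed after 365 days.
function recencyBonus(filingDate, now = new Date()) {
  const ageDays = (now - new Date(filingDate)) / (1000 * 60 * 60 * 24);
  return Math.max(0, 1 - ageDays / 365);
}

// combinedScore = semanticSimilarity * 0.85 + recencyBonus(filing.filingDate) * 0.15
```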
What I’d do differently
The chunking strategy is naive. SEC filings have structure (tables, section headers, exhibits) that sentence-based chunking ignores. A table split across chunks loses meaning. Structured extraction (identifying tables, preserving them as units) would improve retrieval accuracy for numerical questions.
The 30/70 hybrid weighting was picked somewhat arbitrarily. It works, but I didn’t systematically tune it. A proper evaluation set with relevance judgments would help here.
Outcome
The system handles conversational queries over SEC filings. Ask “What were Microsoft’s R&D expenses in 2023?” and get “$27.2B, up 15% from 2022.” Follow up with “Show the 3-year trend” and it retrieves and formats 2021-2023 data without re-searching.
Complex multi-hop questions still have gaps, but for single-company financial queries and basic comparisons, it works reliably. Deployed on Railway, single PostgreSQL instance, Docker Compose stack. One service, no separate vector infrastructure.
The project is open-source if you want to dig into the implementation or run it yourself.