04
AI Knowledge Workflows
OCR ingestion, structured document processing, retrieval-based search, and LLM agents so legal teams can query and reuse internal knowledge without re-reading 800 pages.
What I built
- OCR pipeline that survives bad scans: deskewing, layout reconstruction, table extraction.
- Chunking + embedding strategy tuned for legal documents (long clauses, numbered sections, citations).
- Retrieval layer with hybrid sparse + dense search and per-tenant vector namespaces.
- LLM agents that cite their sources and refuse to answer when retrieval confidence is low.
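As a rough illustration of how the hybrid retrieval and the low-confidence refusal could fit together, here is a minimal sketch. It is not the project's actual code: reciprocal rank fusion is swapped in as one common way to merge sparse and dense rankings, and the names `reciprocal_rank_fusion`, `select_context`, the `min_similarity` threshold, and the chunk ids are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into (chunk_id, score) pairs.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    ranked highly by both sparse and dense search rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_context(sparse_ranking, dense_hits, min_similarity=0.35, top_n=3):
    """Return the top fused chunk ids, or None when confidence is too low.

    sparse_ranking: chunk ids from keyword search, best first.
    dense_hits: (chunk_id, cosine_similarity) pairs, best first.
    min_similarity is an assumed threshold; it would need tuning on real data.
    """
    dense_ranking = [chunk_id for chunk_id, _ in dense_hits]
    fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
    top = [chunk_id for chunk_id, _ in fused[:top_n]]

    similarity = dict(dense_hits)
    best = max((similarity.get(chunk_id, 0.0) for chunk_id in top), default=0.0)
    if best < min_similarity:
        return None  # caller should answer "I don't know" instead of guessing
    return top
```

Returning `None` rather than an empty list keeps the refusal path explicit, so the agent layer cannot mistake "nothing trustworthy found" for "answer from no context".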
What I learned
- RAG quality lives or dies on chunking. Spend 80% of the work there before tuning prompts.
- Agents that say 'I don't know' are far more useful in legal work than agents that hallucinate.
- Vector search is fast. The bottleneck is reranking and context window management.
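The reranking and context-window point can be sketched as a two-step pass: reorder the retrieved chunks, then greedily pack the best ones into a fixed token budget. This is an illustrative sketch under stated assumptions, not the production setup: `overlap_score` is a toy stand-in for a real reranker such as a cross-encoder, and whitespace splitting stands in for the model's tokenizer. All function names here are hypothetical.

```python
def overlap_score(query, chunk):
    """Toy relevance score: fraction of query terms that appear in the chunk."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0


def rerank(query, chunks, score=overlap_score):
    """Order chunks from most to least relevant under the given scorer."""
    return sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)


def pack_context(reranked_chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked chunks that fit in the token budget.

    Chunks that would overflow the budget are skipped, so a shorter chunk
    further down the ranking can still be included.
    """
    selected, used = [], 0
    for chunk in reranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue
        selected.append(chunk)
        used += cost
    return selected, used


chunks = [
    "Either party may terminate with a notice period of thirty days",
    "Governing law is the law of England",
    "Termination notice period is ninety days",
]
ordered = rerank("termination notice period", chunks)
context, tokens_used = pack_context(ordered, max_tokens=17)
```

Here the budget admits the two clauses about notice periods and leaves out the governing-law clause, which is the behavior one would want before the text reaches the prompt.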
Stack
Python, FastAPI, Postgres, pgvector, Tesseract OCR, LangChain, RAG, LLM Agents