04

AI Knowledge Workflows

Role: Backend · AI
Status: In production · NDA

A pipeline combining OCR ingestion, structured document processing, retrieval-based search, and LLM agents, so legal teams can query and reuse internal knowledge without re-reading 800 pages.

What I built

  • OCR pipeline that survives bad scans: deskewing, layout reconstruction, table extraction.
  • Chunking + embedding strategy tuned for legal documents (long clauses, numbered sections, citations).
  • Retrieval layer with hybrid sparse + dense search and per-tenant vector namespaces.
  • LLM agents that cite their sources and refuse to answer when retrieval confidence is low.
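
The hybrid retrieval and low-confidence refusal above can be sketched roughly as follows. This is an illustration, not the production code: it assumes reciprocal rank fusion (one common way to merge sparse and dense rankings) and a dense-similarity threshold as the confidence signal. The names fuse_rankings and select_sources, the constant k=60, and the 0.35 threshold are all hypothetical.

```python
from collections import defaultdict


def fuse_rankings(sparse_ids, dense_ids, k=60):
    """Merge two ranked lists of chunk IDs with reciprocal rank fusion.

    Each chunk scores the sum of 1 / (k + rank) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in (sparse_ids, dense_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def select_sources(fused_ids, dense_similarity, min_similarity=0.35, top_n=3):
    """Return the chunk IDs to cite, or an empty list to signal a refusal.

    dense_similarity maps chunk ID -> similarity to the query (assumed to be
    higher-is-better). The threshold value is an assumption for illustration.
    """
    best = max((dense_similarity.get(cid, 0.0) for cid in fused_ids), default=0.0)
    if best < min_similarity:
        return []  # caller answers "I don't know" instead of guessing
    return fused_ids[:top_n]


if __name__ == "__main__":
    fused = fuse_rankings(["a", "b", "c"], ["b", "d", "a"])
    print(fused)
    print(select_sources(fused, {"a": 0.2, "b": 0.1}))
```

With per-tenant namespaces, both input rankings would be produced inside a single tenant's namespace, so the fusion step never sees another tenant's chunks.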

What I learned

  • RAG quality lives or dies on chunking. Spend 80% of the effort there before tuning prompts.
  • Agents that say 'I don't know' are far more useful in legal work than agents that hallucinate.
  • Vector search is fast. The bottlenecks are reranking and context window management.
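
Two of the points above, section-aware chunking and context window management, can be illustrated with a small sketch. It is a simplified stand-in under stated assumptions: the regex for numbered headings is a heuristic, whitespace word count is used as a rough token proxy, and the names split_by_section and pack_context are hypothetical.

```python
import re

# Heuristic: a line starting with "1. ", "2.1 ", "10.3.2. " and so on.
# Real legal documents need more patterns (Roman numerals, "Article", "§").
SECTION_START = re.compile(r"^\d+\.(?:\d+\.?)*\s+", re.MULTILINE)


def split_by_section(text):
    """Split text at numbered section headings so clauses stay intact."""
    starts = [m.start() for m in SECTION_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


def pack_context(ranked_chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the best-ranked chunks that fit the token budget.

    ranked_chunks is assumed to be sorted best-first by a reranker.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too large; a smaller, lower-ranked chunk may still fit
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    doc = 'Preamble text.\n1. Definitions\n"Term" means X.\n2. Payment\n2.1 Fees are due in 30 days.\n'
    for chunk in split_by_section(doc):
        print(repr(chunk))
    print(pack_context(["a b c", "d e f g", "h"], max_tokens=4))
```

Splitting on section boundaries rather than fixed character windows keeps a clause and its number together, which helps when an answer has to cite a specific section.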

Stack

Python · FastAPI · Postgres · pgvector · Tesseract OCR · LangChain · RAG · LLM Agents