OCR + LLM scanner for small-business tax filing
Built by Nicholas Falshaw · OCR + LLM for small-business bookkeeping · Production since 2025
Small businesses drown in receipts, invoices, and PDF statements. Manual categorization against a German chart of accounts (SKR03 / SKR04) takes hours every month. Generic OCR services dump raw text, forcing the accountant to re-type line items anyway.
The scanner is a document intake pipeline that accepts PDFs, images, and email attachments. It runs layout-aware OCR, uses an LLM to extract structured line items, validates them against business rules, categorizes them to the chart of accounts, and exports an accountant-ready batch as DATEV-compatible CSV or pre-filled booking PDFs.
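As a rough illustration of how such a pipeline can be wired together, the sketch below passes one document object through an ordered list of stages. The `Document` shape, the `run_pipeline` helper, and the placeholder stages are all hypothetical; they stand in for the real OCR, LLM, and rule engines and are not taken from the production code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """State carried through the pipeline (hypothetical shape)."""
    source: str                                      # e.g. "upload", "email", "folder"
    raw_bytes: bytes
    text: str = ""                                   # filled by the OCR stage
    line_items: list = field(default_factory=list)   # filled by LLM extraction
    issues: list = field(default_factory=list)       # filled by validation


Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Apply each stage in order; every stage returns the updated document."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Placeholder stages standing in for real OCR, LLM, and rule engines.
def fake_ocr(doc: Document) -> Document:
    doc.text = doc.raw_bytes.decode("utf-8", errors="replace")
    return doc


def fake_extract(doc: Document) -> Document:
    doc.line_items = [{"description": line} for line in doc.text.splitlines() if line]
    return doc


def fake_validate(doc: Document) -> Document:
    if not doc.line_items:
        doc.issues.append("no line items extracted")
    return doc


result = run_pipeline(
    Document(source="upload", raw_bytes=b"Paper 10.00\nToner 49.90"),
    [fake_ocr, fake_extract, fake_validate],
)
```

Keeping each stage as a plain function makes it straightforward to swap one OCR engine for another or to test validation without a running model.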
Ingestion
Web upload, email attachment, or folder watcher; MIME detection and virus scan
OCR layer
Tesseract for simple receipts, PaddleOCR for complex multi-column invoices, German-language models
LLM extraction
Ollama-hosted model with structured-output prompts to emit JSON line items (date, counterparty, VAT rate, net/gross, account hint)
Validation
Deterministic rules for VAT plausibility, duplicate detection, date-range checks
Storage
PostgreSQL with full-text search across extracted documents
Export
DATEV CSV, accountant-ready PDF summary, or direct push to a bookkeeping system
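The LLM extraction component listed above could look roughly like the following sketch. It builds a request body for Ollama's `/api/generate` endpoint with `format` set to `"json"` and parses the reply into typed line items using the fields named in the list (date, counterparty, VAT rate, net/gross, account hint). The model name, the prompt wording, the `line_items` wrapper key, and the `LineItem` class are assumptions made for illustration, and the sample response is invented.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Assumed prompt wording; the production prompt is not documented here.
PROMPT_TEMPLATE = (
    "Extract all line items from the following German receipt or invoice text. "
    'Answer with JSON only, shaped as {"line_items": [{"date": "YYYY-MM-DD", '
    '"counterparty": "...", "vat_rate": "19", "net": "0.00", "gross": "0.00", '
    '"account_hint": "..."}]}.\n\n'
)


def build_request(ocr_text: str, model: str = "llama3.1") -> dict:
    """Build the JSON body for a non-streaming request (model name is assumed)."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE + ocr_text,
        "format": "json",   # ask Ollama to constrain the output to valid JSON
        "stream": False,
    }


@dataclass(frozen=True)
class LineItem:
    booking_date: date
    counterparty: str
    vat_rate: Decimal   # percent, e.g. 19
    net: Decimal
    gross: Decimal
    account_hint: str


def parse_line_items(response_body: str) -> list[LineItem]:
    """Parse the HTTP response body returned by /api/generate."""
    generated = json.loads(response_body)["response"]  # model output arrives as a string
    payload = json.loads(generated)
    return [
        LineItem(
            booking_date=date.fromisoformat(raw["date"]),
            counterparty=raw["counterparty"],
            vat_rate=Decimal(str(raw["vat_rate"])),
            net=Decimal(str(raw["net"])),
            gross=Decimal(str(raw["gross"])),
            account_hint=raw["account_hint"],
        )
        for raw in payload["line_items"]
    ]


# Invented sample reply, used here instead of a live model call.
sample_model_output = json.dumps({"line_items": [{
    "date": "2025-03-14", "counterparty": "Bürobedarf Müller GmbH",
    "vat_rate": "19", "net": "100.00", "gross": "119.00",
    "account_hint": "office supplies",
}]})
sample_body = json.dumps({"response": sample_model_output, "done": True})
items = parse_line_items(sample_body)
```

Amounts are parsed into `Decimal` rather than `float` so that later VAT checks compare exact cent values. A malformed reply raises `KeyError` or `ValueError` here, which a real pipeline would presumably catch and route to manual review.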
Monthly bookkeeping prep drops from hours to minutes. The pipeline works on German receipts and invoices. The accountant receives a pre-categorized batch with confidence scores and a flag queue for anything the pipeline couldn't auto-resolve.
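One way the deterministic validation rules and the flag queue could fit together is sketched below: a VAT plausibility check, a simple duplicate key, and a confidence threshold decide whether a booking is auto-accepted or flagged. The `Booking` class, the tolerance, the threshold value, and the duplicate key are assumptions for illustration, not the pipeline's actual rules.

```python
from dataclasses import dataclass
from decimal import Decimal

GERMAN_VAT_RATES = {Decimal("0"), Decimal("7"), Decimal("19")}  # percent
TOLERANCE = Decimal("0.02")   # assumed rounding tolerance in EUR
MIN_CONFIDENCE = 0.85         # assumed auto-accept threshold


@dataclass(frozen=True)
class Booking:
    doc_id: str
    counterparty: str
    booking_date: str   # ISO date string
    vat_rate: Decimal   # percent
    net: Decimal
    gross: Decimal
    confidence: float   # extraction confidence between 0 and 1


def vat_issues(b: Booking) -> list[str]:
    """Check that the rate is a known one and that net + VAT matches gross."""
    issues = []
    if b.vat_rate not in GERMAN_VAT_RATES:
        issues.append(f"unusual VAT rate {b.vat_rate}%")
    expected = (b.net * (1 + b.vat_rate / 100)).quantize(Decimal("0.01"))
    if abs(expected - b.gross) > TOLERANCE:
        issues.append(f"gross {b.gross} does not match net + VAT ({expected})")
    return issues


def triage(bookings: list[Booking]):
    """Split bookings into auto-accepted ones and (booking, issues) flag entries."""
    accepted, flagged = [], []
    seen = set()
    for b in bookings:
        issues = vat_issues(b)
        key = (b.counterparty.casefold(), b.booking_date, b.gross)
        if key in seen:
            issues.append("possible duplicate")
        seen.add(key)
        if b.confidence < MIN_CONFIDENCE:
            issues.append(f"low confidence {b.confidence:.2f}")
        if issues:
            flagged.append((b, issues))
        else:
            accepted.append(b)
    return accepted, flagged


# Invented sample data: one clean booking, one duplicate, one VAT mismatch.
batch = [
    Booking("a1", "Müller GmbH", "2025-03-14", Decimal("19"), Decimal("100.00"), Decimal("119.00"), 0.97),
    Booking("a2", "müller gmbh", "2025-03-14", Decimal("19"), Decimal("100.00"), Decimal("119.00"), 0.96),
    Booking("a3", "Bäckerei Schmidt", "2025-03-15", Decimal("7"), Decimal("50.00"), Decimal("55.00"), 0.95),
]
accepted, flagged = triage(batch)
```

Each flagged entry carries its list of reasons, so the accountant sees why a booking needs attention and does not have to re-check the whole document.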