Tax Document AI
OCR + LLM scanner for small-business tax filing
Built by Rogue AI · OCR + LLM for small-business bookkeeping · Production since 2026
First Tesseract-only prototype: January 2026. PaddleOCR + llava:13b layer added in February once Tesseract gave up on phone-snapped German invoices with table layouts. DATEV export and SKR03/SKR04 mapping landed in March. Eighty-plus commits, active through April 2026.
The problem
Small businesses drown in receipts, invoices, and PDF statements. Manual categorization against a German chart of accounts (SKR03 / SKR04) takes hours every month. Generic OCR services dump raw text, forcing the accountant to re-type line items anyway.
What I built
A document intake pipeline that accepts PDFs, images, and email attachments. It runs layout-aware OCR, uses an LLM to extract structured line items, validates them against business rules, categorizes them to the chart of accounts, and exports an accountant-ready batch as DATEV-compatible CSV or pre-filled booking PDFs.
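As a rough illustration of how these stages chain together, here is a minimal sketch. The names `LineItem`, `Document`, `run_pipeline`, and the two stub stages are hypothetical and not taken from the actual codebase. The stubs stand in for the real OCR and LLM calls so the example runs on its own.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable, List, Optional


@dataclass
class LineItem:
    description: str
    net: Decimal
    vat: Decimal
    gross: Decimal
    account: Optional[str] = None      # chart-of-accounts code, set by categorization
    confidence: float = 0.0            # categorization confidence, 0.0 to 1.0
    flags: List[str] = field(default_factory=list)


@dataclass
class Document:
    source: str                        # file path or attachment name
    text: str = ""                     # raw OCR output
    items: List[LineItem] = field(default_factory=list)


# A stage takes a document and returns the (possibly enriched) document.
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order: OCR, extraction, validation, and so on."""
    for stage in stages:
        doc = stage(doc)
    return doc


def ocr_stub(doc: Document) -> Document:
    # Stand-in for the OCR stage; a real stage would call the OCR engine.
    doc.text = "Druckerpapier;10.00;1.90;11.90"
    return doc


def extract_stub(doc: Document) -> Document:
    # Stand-in for LLM extraction; here the "layout" is a trivial semicolon split.
    desc, net, vat, gross = doc.text.split(";")
    doc.items.append(LineItem(desc, Decimal(net), Decimal(vat), Decimal(gross)))
    return doc


result = run_pipeline(Document(source="receipt.jpg"), [ocr_stub, extract_stub])
```

Keeping every stage behind the same signature means validation, categorization, and export can be added or reordered without touching the runner.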
Architecture
Tech stack
What broke first
- Tesseract on phone-snapped German receipts is hopeless: diacritics, blur, and rotation murder it. PaddleOCR is heavier but the accuracy gap on real-world inputs is enormous.
- llava:13b sometimes confidently invents totals on noisy invoices. Added a deterministic VAT-plausibility check that flags when extracted net + VAT does not match extracted gross within tolerance.
- SKR03 vs SKR04 mapping is where the LLM earns its keep. Hardcoded rules covered ~60% of cases; the LLM closed the gap, but only with a confidence threshold and a flag-queue for the rest.
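The VAT-plausibility check described above could look roughly like the sketch below. The function name `vat_plausible` and the default tolerance of two cents are assumptions for illustration; the source does not state the tolerance actually used.

```python
from decimal import Decimal


def vat_plausible(net: Decimal, vat: Decimal, gross: Decimal,
                  tolerance: Decimal = Decimal("0.02")) -> bool:
    """Return True when net + VAT matches gross within the tolerance.

    The tolerance default is an assumed value, meant to absorb rounding
    differences between line-level and document-level totals.
    """
    return abs(net + vat - gross) <= tolerance


# A consistent invoice passes; an invented total is caught and can be flagged.
ok = vat_plausible(Decimal("100.00"), Decimal("19.00"), Decimal("119.00"))
bad = vat_plausible(Decimal("100.00"), Decimal("19.00"), Decimal("129.00"))
```

Using `Decimal` rather than `float` keeps the comparison exact for currency amounts, so the tolerance only has to cover genuine rounding on the invoice, not binary floating-point error.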
Outcome
Monthly bookkeeping prep drops from hours to minutes. Works on German receipts and invoices. Accountant receives a pre-categorized batch with confidence scores and a flag-queue for anything the pipeline couldn't auto-resolve.
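The confidence scores and flag-queue can be pictured as a rules-first categorizer with an LLM fallback. This is a minimal sketch under stated assumptions: the `RULES` keywords, the placeholder account codes (deliberately not real SKR03/SKR04 numbers), the 0.8 threshold, and the `llm_guess` callable are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keyword rules. The account codes are placeholders,
# not real SKR03/SKR04 account numbers.
RULES: Dict[str, str] = {
    "porto": "ACCT_POSTAGE",
    "büro": "ACCT_OFFICE",
}

# An LLM guess returns (account code, confidence between 0.0 and 1.0).
LlmGuess = Callable[[str], Tuple[str, float]]


def categorize(description: str, llm_guess: LlmGuess,
               threshold: float = 0.8) -> Tuple[str, float, bool]:
    """Return (account, confidence, flagged).

    Hardcoded rules win when a keyword matches. Otherwise the LLM guess is
    used, and anything below the (assumed) threshold is flagged for review.
    """
    text = description.lower()
    for keyword, account in RULES.items():
        if keyword in text:
            return account, 1.0, False
    account, confidence = llm_guess(description)
    return account, confidence, confidence < threshold


# Stub LLMs standing in for the real model call.
by_rule = categorize("Porto Deutsche Post", lambda d: ("ACCT_MISC", 0.10))
unsure = categorize("Sonstiges 12/2025", lambda d: ("ACCT_MISC", 0.55))
confident = categorize("Hosting Rechnung", lambda d: ("ACCT_IT", 0.93))
```

Items that come back with `flagged` set would land in the accountant's review queue; the rest go straight into the export batch with their confidence attached.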
Honest limits
Handwritten receipts still go to the human queue. DATEV export is one-way, with no round-trip sync. Trained against German receipt layouts: it works on EN/DE invoices but struggles on the few non-German receipt layouts I have tested.
