Tax Document AI

OCR + LLM scanner for small-business tax filing

Built by Rogue AI · OCR + LLM for small-business bookkeeping · Production since 2026

First Tesseract-only prototype: January 2026. PaddleOCR + llava:13b layer added in February once Tesseract gave up on phone-snapped German invoices with table layouts. DATEV export and SKR03/SKR04 mapping landed in March. Eighty-plus commits, active through April 2026.

The problem

Small businesses drown in receipts, invoices, and PDF statements. Manual categorization against a German chart of accounts (SKR03 / SKR04) takes hours every month. Generic OCR services dump raw text, forcing the accountant to re-type line items anyway.

What I built

A document intake pipeline that accepts PDFs, images, and email attachments, runs layout-aware OCR, uses an LLM to extract structured line items, validates against business rules, categorizes to the chart of accounts, and exports an accountant-ready batch — DATEV-compatible CSV or pre-filled booking PDFs.

Architecture

  • Ingestion: web upload, email attachment, or folder watcher; MIME detection and virus scan
  • OCR layer: Tesseract for simple receipts, PaddleOCR for complex multi-column invoices, German-language models
  • LLM extraction: Ollama-hosted model with structured-output prompts to emit JSON line items (date, counterparty, VAT rate, net/gross, account hint)
  • Validation: deterministic rules for VAT plausibility, duplicate detection, date-range checks
  • Storage: PostgreSQL with full-text search across extracted documents
  • Export: DATEV CSV, accountant-ready PDF summary, or direct push to a bookkeeping system
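
To make the extraction and validation stages above more concrete, here is a minimal sketch of how a JSON line item could be parsed into a typed record and run through the three deterministic checks. It is an illustration under stated assumptions, not the project's actual code: the JSON field names, the one-cent tolerance, the flag names, and the duplicate key (date, counterparty, gross) are all hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Assumption: one cent of rounding slack; the real tolerance is not documented.
TOLERANCE = Decimal("0.01")


@dataclass(frozen=True)
class LineItem:
    doc_date: date
    counterparty: str
    vat_rate: Decimal  # in percent, e.g. Decimal("19")
    net: Decimal
    gross: Decimal
    account_hint: str


def parse_line_item(raw: str) -> LineItem:
    """Parse one JSON object as the LLM might emit it (field names are assumed)."""
    data = json.loads(raw)
    return LineItem(
        doc_date=date.fromisoformat(data["date"]),
        counterparty=data["counterparty"].strip(),
        # Go through str() so JSON floats do not leak binary rounding into Decimal.
        vat_rate=Decimal(str(data["vat_rate"])),
        net=Decimal(str(data["net"])),
        gross=Decimal(str(data["gross"])),
        account_hint=data.get("account_hint", ""),
    )


def vat_is_plausible(item: LineItem, tolerance: Decimal = TOLERANCE) -> bool:
    """True when net + VAT matches the extracted gross within the tolerance."""
    expected_gross = item.net * (Decimal(1) + item.vat_rate / Decimal(100))
    return abs(expected_gross - item.gross) <= tolerance


def duplicate_key(item: LineItem) -> tuple:
    """Hypothetical identity of a booking: same day, same party, same gross."""
    return (item.doc_date, item.counterparty.casefold(), item.gross)


def validate(items: list[LineItem], start: date, end: date) -> list[tuple[LineItem, list[str]]]:
    """Attach a list of flag names to every item; an empty list means clean."""
    seen: set[tuple] = set()
    results = []
    for item in items:
        flags = []
        if not vat_is_plausible(item):
            flags.append("vat_mismatch")
        if not (start <= item.doc_date <= end):
            flags.append("date_out_of_range")
        key = duplicate_key(item)
        if key in seen:
            flags.append("possible_duplicate")
        seen.add(key)
        results.append((item, flags))
    return results
```

Using `Decimal` rather than `float` keeps the net-plus-VAT comparison exact to the cent, so the tolerance only has to absorb rounding on the invoice itself. Items that come back with a non-empty flag list would go to the human queue instead of the export.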

Tech stack

React 19 · FastAPI · Python 3.11 · PostgreSQL 16 · Tesseract · PaddleOCR · Ollama · llava:13b

What broke first

  • Tesseract on phone-snapped German receipts is hopeless — diacritics, blur, and rotation murder it. PaddleOCR is heavier but the accuracy gap on real-world inputs is enormous.

  • llava:13b sometimes confidently invents totals on noisy invoices. Added a deterministic VAT-plausibility check that flags when extracted net + VAT does not match extracted gross within tolerance.

  • SKR03 vs SKR04 mapping is where the LLM earns its keep. Hardcoded rules covered ~60% of cases; the LLM closed the gap, but only with a confidence threshold and a flag-queue for the rest.
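
The rules-first, LLM-second routing described in the last point can be sketched roughly as follows. Everything named here is hypothetical: the keyword rules, the symbolic account labels (stand-ins, not real SKR03/SKR04 numbers), the 0.85 threshold, and the `llm_guess` callable, which stands in for whatever returns an account suggestion together with a confidence score.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: the actual threshold is not documented; 0.85 is a placeholder.
CONFIDENCE_THRESHOLD = 0.85

# Hypothetical keyword rules with symbolic account labels.
RULES = {
    "toner": "OFFICE_SUPPLIES",
    "telekom": "TELECOM",
}


@dataclass(frozen=True)
class Categorization:
    account: Optional[str]
    confidence: float
    source: str  # "rule", "llm", or "flagged"


def categorize_by_rule(description: str) -> Optional[str]:
    """Return an account label if any hardcoded keyword matches, else None."""
    text = description.casefold()
    for keyword, account in RULES.items():
        if keyword in text:
            return account
    return None


def categorize(
    description: str,
    llm_guess: Callable[[str], tuple[str, float]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Categorization:
    """Try deterministic rules first, then the LLM, then the flag queue."""
    account = categorize_by_rule(description)
    if account is not None:
        return Categorization(account, 1.0, "rule")

    account, confidence = llm_guess(description)
    if confidence >= threshold:
        return Categorization(account, confidence, "llm")

    # Below the threshold: keep the suggestion, but route it to a human.
    return Categorization(account, confidence, "flagged")
```

Keeping the low-confidence suggestion on the flagged result means the reviewer sees what the model proposed and only has to confirm or correct it, rather than categorize from scratch.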

Outcome

Monthly bookkeeping prep drops from hours to minutes. Works on German receipts and invoices. Accountant receives a pre-categorized batch with confidence scores and a flag-queue for anything the pipeline couldn't auto-resolve.

Honest limits

Handwritten receipts still go to the human queue. DATEV export is one-way; there is no round-trip sync. Trained against German receipt layouts: it works on EN/DE invoices but struggles on the non-German receipt layouts I have occasionally tested.
