What a Private-AI Pilot Actually Looks Like, Week by Week

A private-AI pilot is a short, fixed-scope engagement that puts a working AI assistant over one of your document sets, running entirely on infrastructure you control, then proves it on one real workflow before anything scales. It runs about three weeks. One document set, one measurable workflow, and no data leaving your perimeter. It is deliberately not a demo, not a platform rollout, and not a six-month build.
That narrow shape is the whole point. MIT’s 2025 study of 300 AI deployments found that 95% of generative-AI pilots delivered zero measurable return, and the projects that did pay off were the tightly scoped ones run with a specialist rather than the sprawling internal builds. Gartner expects 30% of generative-AI projects to be abandoned after the proof of concept by the end of 2025. A pilot done right is the cheapest way to land on the right side of those numbers before you commit a real budget. Here is what each week actually involves, what tends to break, and what “done” really means.
| A private-AI pilot is | A private-AI pilot is not |
|---|---|
| One document set, one workflow | A whole-company rollout |
| About three weeks, fixed scope | An open-ended build |
| Running on your hardware, zero data egress | A cloud-API trial |
| A go or no-go decision backed by a measured result | A demo that looks impressive |
| Built and hardened by a security engineer | A model bake-off |
Before Week One: Scope It Down Until It Hurts
The single biggest predictor of a pilot that works is a scope narrow enough to actually finish. Before any code, we pick one document set and one workflow that a real person does often and quietly dreads.
That means one folder, share, or export: contracts, policies, standard procedures, claims files, case notes. And one workflow with a baseline you can measure against, such as “answering a policy question currently takes a junior analyst twenty minutes of searching.” Everything else gets parked, in writing. The 95% failure number is mostly a scope problem wearing a technology costume. Gartner puts the rest down to weak data quality and unclear business value, which is the same disease. A pilot beats it by refusing to do more than one thing. It is the same lesson behind why most AI projects fail before production.
Week 1: Get the Data in, Locally
Week one is ingestion, and nothing calls out to a third party. Your documents move into a store running on hardware you control, get processed, and become searchable. The mess that surfaces here comes from the documents, not from the model.
Scanned PDFs need OCR. Inconsistent formats need cleaning. Documents get chunked and embedded into a self-hosted vector store. The honest truth of week one is that it is plumbing, not magic, and the things that break are boring: a third of the PDFs are scans with no text layer, the permissions on the share are a mess, and half the “important” documents turn out to be duplicates nobody had noticed. Getting this right locally, with zero egress, is most of the security story already done. The retrieval mechanics behind it are the same ones in a production RAG pipeline, and the document side is covered in OCR plus LLM document processing.
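To make the plumbing concrete, here is a minimal Python sketch of three of those week-one chores: spotting duplicates, flagging pages with no text layer, and chunking text before embedding. It uses only the standard library. The function names, the word-window chunking, and the thresholds are illustrative assumptions, not the pipeline used in any particular pilot. OCR and embedding themselves are out of scope here.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash whitespace-normalised, lower-cased text so trivial copies collide."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_duplicates(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first document seen for each distinct content hash."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for name, text in docs.items():
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique[name] = text
    return unique


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Flag a page whose text layer is empty or nearly so (a likely scan).

    The 20-character threshold is an assumption; tune it on your own files.
    """
    return len(extracted_text.strip()) < min_chars


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Exact-hash matching only catches copies that differ in whitespace or case. Near-duplicates such as re-scanned or lightly edited versions need a fuzzier comparison, which is a separate decision.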
Week 2: Make the Answers Trustworthy
Week two turns retrieval into answers you can defend. Every answer cites the exact passage it came from, and every question and answer lands in an immutable audit log. In regulated work, an answer with no traceable source is worse than no answer, because someone will act on it.
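One common way to make a log tamper-evident is to chain entries by hash, so that changing any earlier record invalidates everything after it. The sketch below is an assumption about how such a log could look, not a description of a specific product. The names `append_entry` and `verify_chain` are hypothetical, and timestamps and user identity are left out to keep it short.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], question: str, answer: str, sources: list[str]) -> dict:
    """Append a question/answer record linked to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"question": question, "answer": answer, "sources": sources}
    entry = {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A hash chain detects tampering but does not prevent it. Preventing it takes write-once storage or an external anchor for the latest hash.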
The rule is simple: no source, no claim. We test it against a real question set written by the people who do the work, not three cherry-picked demo questions. We also harden retrieval against instructions hidden inside the documents themselves. That is a real attack once an assistant can read anything in the folder, and it is covered in securing RAG against prompt injection. What breaks here is predictable: confident answers built on nothing, retrieval that quietly misses the one relevant clause, and the occasional poisoned document. Catching those is the job, and it is why evaluation before you trust the system is built in from the start, not bolted on at the end.
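The "no source, no claim" rule can be enforced mechanically. The following Python sketch assumes the model returns an answer plus a list of citations, each naming a retrieved chunk and quoting it. The `Citation` type, the `enforce_sources` function, and the refusal text are all hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    chunk_id: str  # identifier of the retrieved passage
    quote: str     # text the answer claims to rely on


REFUSAL = "No supporting passage found in the document set."


def _norm(text: str) -> str:
    """Collapse whitespace and case so line breaks do not cause false misses."""
    return " ".join(text.split()).lower()


def enforce_sources(
    answer: str,
    citations: list[Citation],
    retrieved: dict[str, str],
) -> tuple[str, list[Citation]]:
    """Release the answer only if at least one quote appears in its cited chunk."""
    supported = [
        c for c in citations
        if c.quote.strip()
        and c.chunk_id in retrieved
        and _norm(c.quote) in _norm(retrieved[c.chunk_id])
    ]
    if not supported:
        return REFUSAL, []
    return answer, supported
```

This check only confirms that the quoted text really exists in the cited passage. Whether the answer actually follows from that passage still has to be judged by the question set and by the people reviewing it.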
Week 3: Put It in Front of Real Users and Measure
Week three hands the assistant to the people who actually do the workflow and measures it against the baseline you wrote down before any code existed. Adoption and edge cases decide the result, not the demo.
Real users ask the questions you never anticipated. The honest signals are simple: how much time the workflow now takes, whether the answers hold up when someone clicks through to the cited source, and how many real edge cases fall out. The most common surprise is trust. People do not believe the assistant until they click one citation and land on the exact paragraph it used. Once they do, behaviour changes, and that single moment is usually what decides whether a pilot becomes a rollout.
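Those signals reduce to a small calculation. As a rough sketch, assuming you recorded task times in minutes before and during the pilot and counted how many clicked citations held up, a summary could be computed like this. The function and field names are hypothetical, and the median is one reasonable choice among several.

```python
from statistics import median


def summarise_pilot(
    baseline_minutes: list[float],
    pilot_minutes: list[float],
    citations_checked: int,
    citations_upheld: int,
) -> dict[str, float]:
    """Compare median task time against the baseline and report citation accuracy."""
    if not baseline_minutes or not pilot_minutes:
        raise ValueError("need at least one timing on each side")
    base = median(baseline_minutes)
    pilot = median(pilot_minutes)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    hold_rate = citations_upheld / citations_checked if citations_checked else 0.0
    return {
        "baseline_median_min": base,
        "pilot_median_min": pilot,
        "time_saved_pct": round(100 * (base - pilot) / base, 1),
        "citation_hold_rate": round(hold_rate, 3),
    }
```

The median is used instead of the mean so that one unusually slow lookup does not distort a small sample.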
What “Done” Looks Like
A pilot is done when you have a working assistant over one real document set, a measured result against the baseline, a documented architecture, and a clear go or no-go. That is what “ready for a pilot, not a press release” means in practice. You finish with evidence instead of vibes. A go means you know exactly what scaling will cost and why. A no-go means you spent three weeks and a small budget finding out, instead of six months and a large one. Both are wins, which is the part most vendors never tell you.
What a Pilot Is Deliberately Not
A pilot is not a platform rollout, a model comparison exercise, or the purchase of a six-figure appliance. Keeping all of that out is precisely what lets it finish on time. It runs as software on your own hardware or private cloud, so there is no capital outlay and no vendor lock-in, which is the design decision behind the private-AI platform itself. And it is run with a specialist on purpose. The same MIT research found internal-only builds succeed roughly a third as often as projects run with an outside partner. A narrow pilot with someone who has done it before is the boring, high-odds path.
Is Your Team a Fit?
A private-AI pilot fits if you are a regulated or data-sensitive team, you have one document-heavy workflow that genuinely hurts, and you can give access to one document set for three weeks. That covers most of finance under DORA and MiFID II, insurance, healthcare, legal and professional services, data-sensitive small and mid-sized firms, and the public sector.
Where It Goes from Here
If that sounds like your situation, the next step is one short conversation to pick the single document set and the single workflow worth proving. That is exactly what a private-AI pilot is built around.
Related reading: see the private-AI approach behind Vaultic, EU data sovereignty for AI, self-hosted AI versus cloud APIs, and how to test an AI system before production.
Quick Reference
A private-AI pilot, week by week
| Week | Focus | What tends to break |
|---|---|---|
| Week 0 | Scope one document set and one measurable workflow | Scope creep, no baseline to measure against |
| Week 1 | Local ingestion, zero data egress | Scanned PDFs, messy permissions, hidden duplicates |
| Week 2 | Sourced answers plus an immutable audit log | Answers with no source, missed clauses, injected instructions |
| Week 3 | Real users, measured against the baseline | Edge cases, distrust until the first citation is clicked |
Frequently Asked Questions
How long does a private-AI pilot take?
About three weeks for a focused pilot: one week to get the documents in locally, one to make the answers sourced and auditable, and one to put it in front of real users and measure it against a baseline. The fixed, narrow scope is what keeps it to three weeks instead of three months.
Does our data leave our infrastructure during a pilot?
No. A private-AI pilot runs on hardware you control, with local open-weight models and no external API in the answer path. Zero data egress is the design starting point, not an add-on, which is why it suits regulated and data-sensitive work.
What do we need to provide for a pilot?
One document set you can give access to for three weeks (a folder, share, or export), one workflow that genuinely hurts, and a baseline for how long that workflow takes today. That is enough to scope and measure a pilot. You do not need a data-science team or a GPU cluster.
What if the pilot does not work?
Then you have a clear no-go backed by evidence, reached in three weeks on a small budget rather than six months on a large one. A pilot is designed to make the go or no-go decision cheaply. Both outcomes are useful, which is why a narrow pilot beats committing to a full build up front.