Private Data, Cloud AI: Three Architecture Patterns
How to add AI features to applications where the data can't leave your machine but the LLM lives in the cloud
I've been exploring how to build AI features for applications that handle sensitive data. Financial numbers, names, internal metrics. The kind of data where "trust our security" isn't a satisfying answer.
The application I built as a technical exercise runs entirely on my laptop. Local database, local processing, local UI. The only thing that leaves the machine is an anonymized data payload sent to the LLM API (but anonymization is an entirely separate topic).
That single constraint changes everything about how you architect AI features.
The obvious approach: stuff it all in
My first version did what most tutorials suggest. Query the local database upfront, pack everything into context, then send it with the user's question (or a pre-built prompt for guided analysis flows).
App pre-loads all relevant data from local database
→ App builds a context-heavy prompt
→ User asks a question (or app triggers a guided analysis)
→ One API call: context + question → LLM
→ LLM responds with analysis
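Concretely, the pre-loading step looks something like this. The metrics, values, and helper name are hypothetical; the resulting string is what you'd send as the first message to whichever LLM API you use:

```python
# Briefing-room sketch: pre-load everything, pack it into one prompt.
# LOCAL_DB stands in for the local database; the data is illustrative.

LOCAL_DB = {
    "revenue": {"Q3": 1.2e6, "Q4": 1.4e6},
    "headcount": {"Q3": 41, "Q4": 44},
}

def build_briefing_prompt(question: str) -> str:
    """Pack every pre-selected metric into a single context-heavy prompt."""
    lines = ["You are analyzing the following internal metrics:"]
    for metric, by_quarter in LOCAL_DB.items():
        for quarter, value in sorted(by_quarter.items()):
            lines.append(f"- {metric} {quarter}: {value}")
    lines.append(f"\nQuestion: {question}")
    return "\n".join(lines)

prompt = build_briefing_prompt("How did revenue change in Q4?")
# The prompt now carries every metric, whether or not the question needs it.
```

Notice that headcount rides along even for a pure revenue question: that's the upfront-selection problem in miniature.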
This works. The LLM gets a complete picture in a single call. But you're making all the data selection decisions upfront. Which metrics matter? How much history? What comparisons would be useful? You decide before the LLM even sees the question.
For the initial analysis, you end up packing tens of thousands of tokens of context into the prompt because you can't predict what the LLM will need. Follow-up questions are lighter since the conversation carries prior context, but that first call is always a guess about what matters.
If you've heard the term RAG (Retrieval-Augmented Generation), this is it, running entirely locally. You retrieve data from your own database, augment the prompt with it, and generate a response. I call it the "briefing room" pattern: you prepare a full dossier and hand it over.
The upside is real. With all the data in front of it, the LLM can make comparisons and associations you wouldn't have thought to ask for. It spots patterns across metrics, connects this quarter to last quarter, flags anomalies you didn't know to look for. The breadth of context gives it freedom to think laterally.
The downside: context windows have limits. Pack too much data and you hit token ceilings, inflate costs, and dilute the signal with noise. You're constantly balancing completeness against context size.
The pattern I wanted: dual-layer prompt orchestration
Before I knew the term "function calling," I sketched an architecture I called DLPO: Dual-Layer Prompt Orchestration. The idea was two separate channels between the app and the LLM:
- The conversation channel carries the user's question and the LLM's response. Standard stuff.
- The data channel is private, between the app and the LLM only. The LLM uses it to request structured data it needs, and the app decides what to return.
The key insight: the LLM initiates the data requests. It receives a question, decides what it needs to answer well, and asks for it. The app is the gatekeeper.
User asks a question
→ LLM receives the question + available data tools
→ LLM requests: get_quarterly_data(quarter="Q4", metric="revenue")
→ App executes query against LOCAL database
→ Result sent back to LLM
→ LLM requests: get_historical_trend(metric="revenue", periods=4)
→ App executes locally again
→ LLM synthesizes and responds
The data stays on my machine. The LLM never touches the database. It only sees the specific results my code returns, for the specific queries it asked for.
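The app-side half of that loop can be sketched as follows. The tool names match the flow above, but the local data and the scripted request sequence are hypothetical; in a real session, the requests arrive as structured tool-call messages from the LLM API rather than a hardcoded list:

```python
# Function-calling sketch: the app executes tool requests against a local
# store and returns only the bounded results. Everything here is illustrative.

LOCAL_DB = {"revenue": {"Q1": 1.0, "Q2": 1.1, "Q3": 1.2, "Q4": 1.4}}

def get_quarterly_data(quarter: str, metric: str) -> dict:
    return {"metric": metric, "quarter": quarter,
            "value": LOCAL_DB[metric][quarter]}

def get_historical_trend(metric: str, periods: int) -> list:
    quarters = sorted(LOCAL_DB[metric])[-periods:]
    return [{"quarter": q, "value": LOCAL_DB[metric][q]} for q in quarters]

TOOLS = {"get_quarterly_data": get_quarterly_data,
         "get_historical_trend": get_historical_trend}

def handle_tool_call(name: str, args: dict):
    """Gatekeeper: only whitelisted tools run, only their results leave."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

# In a real loop these come back from the LLM between API round-trips;
# here they are scripted to show the shape of the exchange.
requests = [
    ("get_quarterly_data", {"quarter": "Q4", "metric": "revenue"}),
    ("get_historical_trend", {"metric": "revenue", "periods": 4}),
]
results = [handle_tool_call(name, args) for name, args in requests]
```

The gatekeeping lives in `handle_tool_call`: an unknown tool name is rejected outright, and each whitelisted function returns a small, bounded payload rather than raw database access.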
I later discovered this is exactly what the industry calls "function calling" or "tool use." Every major LLM provider supports it natively now. I'd reinvented the wheel without knowing it existed. But coming at it from the architecture side rather than the API side shaped how I think about it: it's not just a feature you enable. It's a design pattern for keeping your application in control of what the LLM sees.
The difference from MCP (Model Context Protocol) matters here. MCP requires running a separate server process that the LLM client connects to, even when everything is local. For my use case, that's extra surface I don't want: another process to run, configure, and reason about. Function calling keeps everything inside my own application code. The only network traffic is the API call itself.
Combining both: the hybrid
In practice, neither pattern works alone. The briefing room gives the LLM breadth but hits context limits. DLPO gives precision but adds round-trips. The interesting architecture combines them.
Start with a briefing room baseline. Pre-load the data you know will be needed: current period metrics, key status indicators, recent history. This covers 80% of questions in a single call.
Add narrow tools for the remaining 20%. Define specific data retrieval functions the LLM can call when the baseline isn't enough. Not "query anything" but "get trend for a specific metric over N periods" or "compare two time ranges." Each tool returns a bounded result.
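"Bounded" can be enforced in the tool definition itself. Most providers accept tool definitions in roughly this JSON-schema shape (exact field names vary by vendor, and the metric list and limits here are illustrative):

```python
# A narrow, bounded tool definition in the common JSON-schema style used
# for function calling. The enum and min/max constraints do the bounding.

TREND_TOOL = {
    "name": "get_trend",
    "description": "Return one named metric over the last N periods.",
    "parameters": {
        "type": "object",
        "properties": {
            "metric": {
                "type": "string",
                "enum": ["revenue", "headcount", "margin"],  # closed set
            },
            "periods": {
                "type": "integer",
                "minimum": 1,
                "maximum": 8,  # caps how much history one call can pull
            },
        },
        "required": ["metric", "periods"],
    },
}
```

The `enum` and `minimum`/`maximum` constraints are the whole point: the model can only pick from a closed list of metrics and a capped window, never compose a free-form query.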
Cache tool results within the session. If the LLM asks for the same data twice, serve it from memory. No redundant queries against the local database.
Log what the LLM requests. After a week, you'll know what data it actually pulls beyond the baseline. Use that to refine what you pre-load next time. The tools teach you what the briefing room was missing.
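Putting the hybrid together, one session object can own all four pieces: the pre-loaded baseline, the tool registry, the cache, and the request log. This is a hypothetical sketch of the shape, not a real client API:

```python
# Hybrid sketch: briefing-room baseline + narrow tools + session cache +
# request log. Names and data are illustrative.
import json

class Session:
    def __init__(self, baseline: dict, tools: dict):
        self.baseline = baseline   # briefing-room payload, sent once upfront
        self.tools = tools         # narrow, bounded retrieval functions
        self.cache = {}            # tool results, served from memory on repeat
        self.request_log = []      # what the LLM pulled beyond the baseline

    def call_tool(self, name: str, args: dict):
        key = (name, json.dumps(args, sort_keys=True))
        self.request_log.append(key)   # log everything, even cache hits
        if key not in self.cache:      # repeat requests never touch the DB
            self.cache[key] = self.tools[name](**args)
        return self.cache[key]

session = Session(
    baseline={"revenue_Q4": 1.4},
    tools={"get_trend": lambda metric, periods: [1.1, 1.2, 1.4][-periods:]},
)
session.call_tool("get_trend", {"metric": "revenue", "periods": 2})
session.call_tool("get_trend", {"metric": "revenue", "periods": 2})  # cached
```

After a session, `request_log` holds two entries but `cache` holds one: the second identical request never hit the database, and the log tells you the trend data might belong in next session's baseline.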
The broader pattern
This isn't just about my experiment. Anyone building AI features for on-premise software, healthcare data, financial services, legal documents, or internal tools hits the same constraint. The data can't leave the building.
The answer isn't "don't use AI." It's "use AI carefully, with your code as the gatekeeper."
Function calling gives you that gatekeeper role natively. Your code decides what to expose, how much to return, and whether to execute at all. The LLM proposes; your code disposes.
The cloud-first AI world assumes you'll upload your data to someone's vector database and let the LLM browse it. For a lot of real-world applications, that's not an option. Local-first AI is a different architecture, not a lesser one.
I've been exploring local-first AI architecture for over a year. It's slower, more constrained, and requires more architectural thinking than the cloud-first approach. But the exercise of building it taught me more about AI security than any whitepaper could.