Local embeddings for AI agent memory: a self-hosted setup guide
Local embeddings are vector representations of your text generated by a model running on your own hardware, not a remote API. For an AI agent, that means semantic memory search can work entirely offline: the agent finds the right note by meaning instead of exact keywords, and none of your memory ever leaves the machine. As of v2026.5.27, OpenClaw ships a core OpenAI-compatible embedding provider, so pointing memory search at a local endpoint is now a first-class setup rather than a plugin workaround.
This guide covers what local embeddings actually do for agent memory, why you might run them yourself, and the practical steps to connect one.
Table of contents
- What local embeddings do for agent memory
- Why run embeddings locally
- How to set up a local embedding provider
- Local vs hosted embeddings
- Choosing an embedding model
- Common pitfalls
- FAQ
What local embeddings do for agent memory
An embedding turns a chunk of text into a list of numbers that captures its meaning. Two notes about the same topic land close together in that number space even if they share no words. That is what makes “what database does my trading project use?” pull back the right fact when you never wrote the word “database” in the note.
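To make "close together in that number space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors and note names are invented for illustration; real models emit hundreds of dimensions, and the source does not say which similarity measure OpenClaw uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (assumption: hand-picked, not produced by a model).
query = [0.9, 0.1, 0.0]          # "what database does my trading project use?"
note_postgres = [0.8, 0.2, 0.1]  # "trading project stores fills in Postgres"
note_grocery = [0.0, 0.1, 0.9]   # "buy oat milk"

# Rank notes by similarity to the query, best match first.
ranked = sorted(
    [("note_postgres", note_postgres), ("note_grocery", note_grocery)],
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```

The Postgres note ranks first even though the query and the note share no keywords, which is the behavior keyword search alone cannot give you.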
OpenClaw already leans on this. Its memory search uses hybrid retrieval: BM25 keyword matching at roughly 30% weight, combined with vector embeddings at roughly 70% weight, stored in SQLite. Files get chunked into about 400-token segments, embedded, and indexed, and both search paths run in parallel with merged scoring. The embedding half of that pipeline is what local embeddings replace. Swap the cloud embedding call for a local one and the retrieval quality stays intact while the network round-trip disappears.
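The merged scoring can be pictured with a small sketch. The 0.3 and 0.7 weights come from the description above; everything else here is an assumption made for illustration, including the divide-by-maximum normalization, the function names, and the chunk IDs. It is not a copy of the actual implementation.

```python
BM25_WEIGHT = 0.3    # "roughly 30%" keyword weight
VECTOR_WEIGHT = 0.7  # "roughly 70%" embedding weight


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores so the best hit is 1.0 (assumed normalization scheme)."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return {key: 0.0 for key in scores}
    return {key: value / top for key, value in scores.items()}


def merge_hybrid(
    bm25: dict[str, float], vector: dict[str, float]
) -> list[tuple[str, float]]:
    """Combine keyword and vector scores per chunk, best match first."""
    bm25_n = normalize(bm25)
    vector_n = normalize(vector)
    merged = {
        chunk_id: BM25_WEIGHT * bm25_n.get(chunk_id, 0.0)
        + VECTOR_WEIGHT * vector_n.get(chunk_id, 0.0)
        for chunk_id in set(bm25_n) | set(vector_n)
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores from the two search paths.
results = merge_hybrid(
    bm25={"chunk-1": 6.0, "chunk-2": 3.0},
    vector={"chunk-2": 0.9, "chunk-3": 0.6},
)
```

In this toy run, chunk-2 wins because it scores on both paths, and chunk-3 (vector-only) outranks chunk-1 (keyword-only) because the vector side carries more weight.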
The model that does the embedding is separate from the model that writes replies. You can run a small, fast embedding model locally even while your main chat model lives in the cloud, or run both locally for a fully offline stack.
Why run embeddings locally
There are three honest reasons, and one non-reason.
Privacy. Every memory you index gets sent to whatever endpoint does the embedding. If that endpoint is a cloud API, your raw notes (calendar details, client names, half-formed ideas) travel there. A user on r/openclaw who switched to local semantic memory put it plainly: all embedding computation is local, nothing leaves the machine, and it stays fast. For anything sensitive, that is the whole argument.
Cost. Embedding is cheap per call but relentless. Re-indexing a growing memory store, re-embedding edited notes, embedding every query: it adds up into a steady metered drip. A local model makes that drip free after the one-time cost of running the hardware you already own.
Resilience. A local embedding endpoint does not rate-limit you, does not deprecate your model out from under you, and works on a plane. If your agent’s memory is load-bearing, removing a network dependency from the recall path is worth something.
The non-reason is raw quality. Top hosted embedding models still edge out small local ones on benchmarks. For personal agent memory the gap rarely matters, because you are searching your own few thousand notes, not running web-scale retrieval. But if you are doing high-stakes retrieval over a large corpus, test before you commit.
How to set up a local embedding provider
The core idea: run a local server that speaks the OpenAI embeddings API, then point OpenClaw at it.
1. Run a local OpenAI-compatible endpoint. The two common choices:
- Ollama serves an OpenAI-compatible API on localhost:11434 automatically whenever the daemon is running. Pull an embedding model and it is ready.
- LM Studio exposes OpenAI-compatible endpoints you can reach by switching an OpenAI client’s base URL to your local instance.
Both implement the same /v1/embeddings contract that hosted APIs use, which is exactly why a single OpenAI-compatible provider can talk to all of them.
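For a feel of what that contract looks like on the wire, here is a minimal standard-library sketch of an OpenAI-style embeddings call. The base URL and model name assume Ollama's default port with nomic-embed-text pulled; `build_embedding_request` and `embed` are hypothetical helper names, not part of any library.

```python
import json
import urllib.request


def build_embedding_request(
    base_url: str, model: str, texts: list[str], api_key: str = "local"
) -> urllib.request.Request:
    """Build a POST to <base_url>/embeddings in the OpenAI request shape."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder key; assumed to be ignored by the local server.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def embed(
    base_url: str, model: str, texts: list[str], timeout: float = 30.0
) -> list[list[float]]:
    """Send the request and return one vector per input, in input order."""
    request = build_embedding_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-style responses carry a "data" list with an "index" per item.
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With the daemon running, `embed("http://localhost:11434/v1", "nomic-embed-text", ["hello world"])` should return a single vector. Swapping the base URL for another OpenAI-compatible server is the only change needed, which is the point of a shared contract.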
2. Point OpenClaw at the endpoint. An OpenAI-compatible embedding provider needs three things, the same triad any OpenAI client uses:
# Illustrative shape: run `openclaw doctor` and check the
# embeddings docs for the exact config keys in your version.
embedding:
  baseURL: "http://localhost:11434/v1"  # your local endpoint
  apiKey: "local"                       # placeholder for local servers
  model: "nomic-embed-text"             # an embedding model you pulled
Local servers ignore the API key, so a placeholder string is fine. The base URL is the part that matters: it is what redirects embedding traffic away from the cloud and onto your box.
3. Verify it. v2026.5.27 added the embedding provider to core with config, doctor, and docs support, so openclaw doctor is the fastest way to confirm the provider resolves and the endpoint answers before you trust it with a re-index. If doctor is happy, trigger a memory search and confirm results come back ranked by relevance.
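If you want an extra sanity check alongside doctor, you can inspect a raw response from the endpoint yourself. The sketch below validates an OpenAI-style embeddings payload; the function name and the sample payload are hypothetical, and this is not a description of what doctor checks internally.

```python
def check_embedding_payload(payload: dict, expected_count: int) -> int:
    """Return the shared vector dimension, or raise ValueError if malformed."""
    data = payload.get("data")
    if not isinstance(data, list) or len(data) != expected_count:
        raise ValueError(f"expected {expected_count} embeddings in 'data'")
    dims = {len(item.get("embedding", [])) for item in data}
    if len(dims) != 1 or 0 in dims:
        raise ValueError(f"inconsistent or empty vectors: dimensions {sorted(dims)}")
    return dims.pop()


# Hypothetical response for two inputs, trimmed to 3 dimensions for brevity.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
    ],
}
dimension = check_embedding_payload(sample, expected_count=2)
```

A failure here usually points at the wrong model name or a server that is not really speaking the embeddings contract, both of which are cheaper to catch before a full re-index.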
One caveat worth knowing: if you change embedding models after indexing, vectors from the old model will not line up with the new one. Plan to re-index when you switch.
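One way to guard against this in your own tooling is to record which model built the index and compare it before searching. This is a sketch under assumptions: `IndexMeta` and `needs_reindex` are hypothetical names, and the source does not say whether OpenClaw tracks this metadata for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexMeta:
    """What produced the stored vectors (hypothetical bookkeeping record)."""
    model: str
    dimension: int


def needs_reindex(stored: Optional[IndexMeta], current: IndexMeta) -> bool:
    """True when there is no index yet or it was built by a different model."""
    return stored is None or stored != current


# Illustrative values: the index was built with one model, config now names another.
stored = IndexMeta(model="nomic-embed-text", dimension=768)
current = IndexMeta(model="mxbai-embed-large", dimension=1024)
must_rebuild = needs_reindex(stored, current)
```

Comparing the model name as well as the dimension matters, because two models can share a vector length and still produce incompatible vectors.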
Local vs hosted embeddings
| Factor | Local embeddings | Hosted embeddings |
|---|---|---|
| Data exposure | Nothing leaves your machine | Every indexed note sent to the API |
| Cost | Free after hardware | Metered per token, forever |
| Setup effort | Run a daemon, pull a model | Paste an API key |
| Offline use | Works fully offline | Needs connectivity |
| Peak quality | Good for personal scale | Slightly higher on benchmarks |
| Rate limits | None | Provider-dependent |
For a self-hosted personal agent the local column wins on the things that usually matter. For a large team retrieving over a huge shared corpus, the hosted column’s quality edge can justify the trade.
Choosing an embedding model
You do not need a big model for embeddings. A few solid local options:
- nomic-embed-text — a common default, small and fast, with a long context window for chunked notes.
- mxbai-embed-large — larger, stronger on retrieval quality, still comfortable on a modern laptop.
- all-minilm — tiny and quick when you want minimal footprint and your memory store is small.
One reported local memory stack ran the entire system, embedding model included, in roughly 150MB of memory. That is the scale you are dealing with: an embedding model is a rounding error next to a chat model. Start small, measure recall on your actual notes, and only move up if results disappoint.
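"Measure recall on your actual notes" can be as simple as a handful of test queries with known answers. The sketch below computes recall at k; the query strings, note IDs, and result lists are invented, and producing the ranked results from your own memory search is left to you.

```python
def recall_at_k(
    results: dict[str, list[str]], expected: dict[str, str], k: int = 5
) -> float:
    """Fraction of queries whose expected note appears in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, note_id in expected.items()
        if note_id in results.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical evaluation set: query -> the note that should come back.
expected = {"q1": "note-a", "q2": "note-b", "q3": "note-c", "q4": "note-d"}
# Hypothetical ranked results returned by memory search for each query.
results = {
    "q1": ["note-a", "note-x"],
    "q2": ["note-x", "note-b"],
    "q3": ["note-x", "note-y", "note-c"],
    "q4": ["note-x"],
}
score_top2 = recall_at_k(results, expected, k=2)
score_top3 = recall_at_k(results, expected, k=3)
```

Run the same query set against each candidate model (re-indexing in between) and only move to a larger model if the number actually improves.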
Common pitfalls
- Mismatched dimensions after a model swap. Different embedding models produce vectors that are not comparable with each other, and often of different lengths. Re-index after changing models; otherwise search either fails outright or quietly degrades.
- Forgetting the daemon. If Ollama or LM Studio is not running, the endpoint is dead and embedding calls fail. Keep it running or autostart it.
- Embedding through a slow model. A general chat model can technically embed, but a dedicated embedding model is far faster and usually better at retrieval. Use a real embedding model.
- Assuming local means lower quality everywhere. For personal-scale memory the difference is hard to notice. Test before you assume you need the cloud.
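The "forgetting the daemon" pitfall is easy to catch with a quick port check before a re-index or at agent startup. A minimal sketch, assuming Ollama's default port 11434; `endpoint_is_up` is a hypothetical helper, and an open port only shows that something is listening, not that an embedding model is loaded.

```python
import socket


def endpoint_is_up(
    host: str = "localhost", port: int = 11434, timeout: float = 1.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `endpoint_is_up()` in a startup script lets you fail with a clear message instead of a confusing embedding error partway through indexing.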
FAQ
Do local embeddings work with a cloud chat model?
Yes. The embedding model and the chat model are independent. Run embeddings locally for privacy while keeping a hosted model for replies.
Will local embeddings slow my agent down?
Usually the opposite. Embedding models are small, and removing the network round-trip often makes local retrieval faster than a cloud call.
Do I lose OpenClaw’s hybrid search?
No. OpenClaw keeps its BM25-plus-vector hybrid retrieval. Local embeddings only change where the vector half is computed, not how search works.
What hardware do I need?
Far less than for a chat model. A small embedding model fits in a few hundred megabytes of memory and runs on a normal laptop CPU.
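The stored vectors are small too. As a back-of-the-envelope sketch, assuming vectors are kept as 32-bit floats and ignoring any database overhead (both assumptions, since the source does not describe the on-disk format), the raw size is chunks times dimensions times four bytes. The chunk count and the 768-dimension figure below are illustrative.

```python
def index_size_mb(num_chunks: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_chunks * dimension * bytes_per_value / 1_000_000


# Illustrative: 5,000 chunks embedded into 768-dimensional float32 vectors.
estimate = index_size_mb(num_chunks=5000, dimension=768)
```

For that example the vectors come to roughly 15 MB, so at personal scale neither the model nor the index is likely to strain a normal laptop.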
Putting local embeddings to work
Local embeddings move the one part of agent memory that quietly leaks your data, the embedding step, back onto hardware you control. With the OpenAI-compatible provider now in core, the setup is short: run a local endpoint, point the provider’s base URL at it, verify with doctor, and re-index. You keep semantic recall, you drop the cloud dependency, and your notes stay yours.
To go deeper on how the memory system around this works, see our OpenClaw memory deep-dive, the practical memory and context configuration guide, and the broader Ollama plus OpenClaw local setup guide.