Local embeddings for AI agent memory: a self-hosted setup guide

Local embeddings are vector representations of your text generated by a model running on your own hardware rather than by a remote API. For an AI agent, that means semantic memory search can work entirely offline: the agent finds the right note by meaning instead of exact keywords, and none of your memory ever leaves the machine. As of v2026.5.27, OpenClaw ships a core OpenAI-compatible embedding provider, so pointing memory search at a local endpoint is now a first-class setup rather than a plugin workaround.

This guide covers what local embeddings actually do for agent memory, why you might run them yourself, and the practical steps to connect one.

What local embeddings do for agent memory

An embedding turns a chunk of text into a list of numbers that captures its meaning. Two notes about the same topic land close together in that number space even if they share no words. That is what makes “what database does my trading project use?” pull back the right fact when you never wrote the word “database” in the note.
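
To make "close together in that number space" concrete, here is a minimal sketch of cosine similarity, the usual way to compare two embeddings. The three-dimensional vectors are hand-made toy values chosen for illustration, not output from any real model, and the note names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values, real models emit hundreds of dimensions).
query = [0.9, 0.1, 0.0]          # "what database does my trading project use?"
note_postgres = [0.8, 0.2, 0.1]  # "trading project stores fills in Postgres"
note_recipe = [0.0, 0.1, 0.9]    # "sourdough starter feeding schedule"

notes = {"postgres": note_postgres, "recipe": note_recipe}
best = max(notes, key=lambda name: cosine_similarity(query, notes[name]))
print(best)  # prints "postgres"
```

The query shares no keywords with the Postgres note, yet its vector points in nearly the same direction, so that note ranks first.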

OpenClaw already leans on this. Its memory search uses hybrid retrieval: BM25 keyword matching at roughly 30% weight, combined with vector embeddings at roughly 70% weight, stored in SQLite. Files get chunked into about 400-token segments, embedded, and indexed, and both search paths run in parallel with merged scoring. The embedding half of that pipeline is what local embeddings replace. Swap the cloud embedding call for a local one and the retrieval quality stays intact while the network round-trip disappears.
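
As a rough illustration of merged scoring, the sketch below blends a keyword score map and a vector score map using the 30/70 weights mentioned above. The function name, the chunk IDs, and the divide-by-max normalization are assumptions made for this example. The actual merge formula used by the memory search is not described here and may differ.

```python
def merge_hybrid_scores(keyword_scores, vector_scores,
                        keyword_weight=0.3, vector_weight=0.7):
    """Blend two {chunk_id: score} maps into one list ranked best-first."""
    def normalize(scores):
        # Scale each score set to [0, 1] so BM25 and cosine values are comparable.
        top = max(scores.values(), default=0.0)
        if top <= 0:
            return {key: 0.0 for key in scores}
        return {key: value / top for key, value in scores.items()}

    kw = normalize(keyword_scores)
    vec = normalize(vector_scores)
    merged = {}
    for chunk_id in set(kw) | set(vec):
        # A chunk found by only one search path scores 0 on the other.
        merged[chunk_id] = (keyword_weight * kw.get(chunk_id, 0.0)
                            + vector_weight * vec.get(chunk_id, 0.0))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical hits from the two search paths.
keyword_hits = {"notes/db.md#2": 4.0, "notes/todo.md#1": 8.0}
vector_hits = {"notes/db.md#2": 0.82, "notes/infra.md#5": 0.41}

ranked = merge_hybrid_scores(keyword_hits, vector_hits)
for chunk_id, score in ranked:
    print(f"{score:.2f}  {chunk_id}")
```

With these inputs the chunk that both paths agree on ranks first, and a chunk that only matched semantically still outranks one that only matched on keywords, which is the practical effect of weighting vectors more heavily.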

The model that does the embedding is separate from the model that writes replies. You can run a small, fast embedding model locally even while your main chat model lives in the cloud, or run both locally for a fully offline stack.

Why run embeddings locally

There are three honest reasons, and one non-reason.

Privacy. Every memory you index gets sent to whatever endpoint does the embedding. If that endpoint is a cloud API, your raw notes (calendar details, client names, half-formed ideas) travel there. A user on r/openclaw who switched to local semantic memory put it plainly: all embedding computation is local, nothing leaves the machine, and it stays fast. For anything sensitive, that is the whole argument.

Cost. Embedding is cheap per call but relentless. Re-indexing a growing memory store, re-embedding edited notes, embedding every query: it adds up to a steady metered drip. A local model stops that drip: once the hardware you already own is running, each additional embedding carries no per-call charge.

Resilience. A local embedding endpoint does not rate-limit you, does not deprecate your model out from under you, and works on a plane. If your agent’s memory is load-bearing, removing a network dependency from the recall path is worth something.

The non-reason is raw quality. Top hosted embedding models still edge out small local ones on benchmarks. For personal agent memory the gap rarely matters, because you are searching your own few thousand notes, not running web-scale retrieval. But if you are doing high-stakes retrieval over a large corpus, test before you commit.

How to set up a local embedding provider

The core idea: run a local server that speaks the OpenAI embeddings API, then point OpenClaw at it.

1. Run a local OpenAI-compatible endpoint. The two common choices:

  • Ollama serves an OpenAI-compatible API on localhost:11434 automatically whenever the daemon is running. Pull an embedding model and it is ready.
  • LM Studio exposes OpenAI-compatible endpoints you can reach by switching an OpenAI client’s base URL to your local instance.

Both implement the same /v1/embeddings contract that hosted APIs use, which is exactly why a single OpenAI-compatible provider can talk to all of them.

2. Point OpenClaw at the endpoint. An OpenAI-compatible embedding provider needs three things, the same triad any OpenAI client uses:

# Illustrative shape: run `openclaw doctor` and check the
# embeddings docs for the exact config keys in your version.
embedding:
  baseURL: "http://localhost:11434/v1"   # your local endpoint
  apiKey: "local"                         # placeholder for local servers
  model: "nomic-embed-text"               # an embedding model you pulled

Local servers ignore the API key, so a placeholder string is fine. The base URL is the part that matters: it is what redirects embedding traffic away from the cloud and onto your box.
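
To see what that triad turns into on the wire, here is a minimal sketch of a direct call to an OpenAI-compatible /v1/embeddings endpoint using only the Python standard library. The base URL, placeholder key, and model name mirror the illustrative config above and are assumptions about your setup. The helper names are hypothetical and are not part of any OpenClaw API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # assumed local endpoint
API_KEY = "local"                       # placeholder, as in the config above
MODEL = "nomic-embed-text"              # assumes you have pulled this model

def build_embedding_request(texts, base_url=BASE_URL, api_key=API_KEY, model=MODEL):
    """Build the POST request for an OpenAI-style embeddings call."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def embed(texts, timeout=30):
    """Send the request and return one vector per input text, in input order."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With the local server running, `embed(["hello world"])` should return a single list of floats. Changing only `BASE_URL` redirects the same request to a different server, which is the whole point of the shared contract.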

3. Verify it. v2026.5.27 added the embedding provider to core with config, doctor, and docs support, so openclaw doctor is the fastest way to confirm the provider resolves and the endpoint answers before you trust it with a re-index. If doctor is happy, trigger a memory search and confirm results come back ranked by relevance.
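
`openclaw doctor` is the check this guide relies on. As a complementary sketch, the function below sanity-checks a raw OpenAI-style embeddings response before you trust the endpoint with a full re-index. The function name and the specific checks are assumptions chosen for illustration, and the sample payload is hand-written rather than captured from a real server.

```python
import math

def check_embedding_payload(payload, expected_count):
    """Return a list of problems found in an OpenAI-style embeddings response."""
    data = payload.get("data")
    if not isinstance(data, list) or not data:
        return ["response has no 'data' list"]

    problems = []
    if len(data) != expected_count:
        problems.append(f"expected {expected_count} vectors, got {len(data)}")

    # Every vector from one model should have the same, non-zero length.
    dims = {len(item.get("embedding", [])) for item in data}
    if 0 in dims:
        problems.append("at least one embedding is empty")
    if len(dims) > 1:
        problems.append(f"inconsistent dimensions: {sorted(dims)}")

    for item in data:
        if not all(math.isfinite(x) for x in item.get("embedding", [])):
            problems.append("embedding contains NaN or infinity")
            break
    return problems

# Hand-written sample in the response shape the embeddings contract uses.
sample = {"data": [{"index": 0, "embedding": [0.1, 0.2, 0.3]},
                   {"index": 1, "embedding": [0.0, 0.5, 0.4]}]}
print(check_embedding_payload(sample, expected_count=2))  # prints []
```

An empty list means the response looks structurally sound. It says nothing about retrieval quality, which is why a real memory search is still the final check.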

One caveat worth knowing: if you change embedding models after indexing, vectors from the old model will not line up with the new one. Plan to re-index when you switch.
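
One way to make that caveat hard to forget is to record which model built the index and compare it on startup. The sketch below is a hypothetical guard, not something OpenClaw is known to expose: `IndexMeta` and `needs_reindex` are names invented for this example, and the dimension figures are the commonly published sizes for these models, so confirm them against your own endpoint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexMeta:
    """Which embedding model produced the stored vectors (hypothetical record)."""
    model: str
    dimensions: int

def needs_reindex(stored: IndexMeta, current: IndexMeta) -> bool:
    """True when stored vectors cannot be compared with freshly embedded queries."""
    # Matching dimensions are not enough: two different models can emit
    # same-length vectors that live in unrelated spaces.
    return stored != current

stored = IndexMeta(model="nomic-embed-text", dimensions=768)
current = IndexMeta(model="mxbai-embed-large", dimensions=1024)

if needs_reindex(stored, current):
    print(f"Re-index required: index built with {stored.model}, "
          f"now configured for {current.model}")
```

Comparing the model name as well as the vector length catches the quieter failure, where a swap keeps the same dimension count and nothing errors but rankings stop making sense.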

Local vs hosted embeddings

| Factor | Local embeddings | Hosted embeddings |
|---|---|---|
| Data exposure | Nothing leaves your machine | Every indexed note sent to the API |
| Cost | Free after hardware | Metered per token, forever |
| Setup effort | Run a daemon, pull a model | Paste an API key |
| Offline use | Works fully offline | Needs connectivity |
| Peak quality | Good for personal scale | Slightly higher on benchmarks |
| Rate limits | None | Provider-dependent |

For a self-hosted personal agent the local column wins on the things that usually matter. For a large team retrieving over a huge shared corpus, the hosted column’s quality edge can justify the trade.

Choosing an embedding model

You do not need a big model for embeddings. A few solid local options:

  • nomic-embed-text — a common default, small and fast, with a long context window for chunked notes.
  • mxbai-embed-large — larger, stronger on retrieval quality, still comfortable on a modern laptop.
  • all-minilm — tiny and quick when you want minimal footprint and your memory store is small.

One reported local memory stack ran the entire system, embedding model included, in roughly 150MB of memory. That is the scale you are dealing with: an embedding model is a rounding error next to a chat model. Start small, measure recall on your actual notes, and only move up if results disappoint.
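
"Measure recall on your actual notes" can be as simple as a handful of test queries with known answers. The sketch below computes recall at k under that assumption. The query strings, note IDs, and the `recall_at_k` helper are hypothetical placeholders for whatever your own memory search returns.

```python
def recall_at_k(results_by_query, relevant_by_query, k=5):
    """Mean share of each query's relevant notes that appear in its top-k results."""
    scores = []
    for query, relevant in relevant_by_query.items():
        if not relevant:
            continue  # a query with no known answer cannot be scored
        top_k = set(results_by_query.get(query, [])[:k])
        scores.append(len(top_k & set(relevant)) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical ranked results from a memory search, best match first.
results = {
    "which database does the trading project use": ["db.md", "infra.md", "todo.md"],
    "when is the client review": ["todo.md", "calendar.md", "db.md"],
}
# The notes you know should come back for each query.
relevant = {
    "which database does the trading project use": {"db.md"},
    "when is the client review": {"calendar.md", "clients.md"},
}

print(recall_at_k(results, relevant, k=2))  # prints 0.75
```

Run the same query set against each candidate model after re-indexing. If the smaller model's score is already close to the larger one's on your notes, there is little reason to move up.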

Common pitfalls

  • Mismatched dimensions after a model swap. Different embedding models produce different-length vectors. Re-index after changing models or search quietly degrades.
  • Forgetting the daemon. If Ollama or LM Studio is not running, the endpoint is dead and embedding calls fail. Keep it running or autostart it.
  • Embedding through a slow model. A general chat model can technically embed, but a dedicated embedding model is far faster and usually better at retrieval. Use a real embedding model.
  • Assuming local means lower quality everywhere. For personal-scale memory the difference is hard to notice. Test before you assume you need the cloud.

FAQ

Do local embeddings work with a cloud chat model?

Yes. The embedding model and the chat model are independent. Run embeddings locally for privacy while keeping a hosted model for replies.

Will local embeddings slow my agent down?

Usually the opposite. Embedding models are small, and removing the network round-trip often makes local retrieval faster than a cloud call.

Do local embeddings change how memory search works?

No. OpenClaw keeps its BM25-plus-vector hybrid retrieval. Local embeddings only change where the vector half is computed, not how search works.

What hardware do I need?

Far less than for a chat model. A small embedding model fits in a few hundred megabytes of memory and runs on a normal laptop CPU.

Putting local embeddings to work

Local embeddings move the one part of agent memory that quietly leaks your data, the embedding step, back onto hardware you control. With the OpenAI-compatible provider now in core, the setup is short: run a local endpoint, point the provider’s base URL at it, verify with doctor, and re-index. You keep semantic recall, you drop the cloud dependency, and your notes stay yours.

To go deeper on how the memory system around this works, see our memory deep-dive, the practical memory and context configuration guide, and the broader Ollama plus local setup.
