Overview
How good is your memory store? The built-in LongMemEval-style benchmark measures Recall@K across all retrieval layers so you can answer that question with numbers instead of guesses.
The benchmark runs a fixed set of probe questions against your live memory store and checks whether the expected answer appears in the top-K retrieved results. It reports three recall figures:
| Metric | Layer | Typical range |
|---|---|---|
| Recall@K — Cognitive | L2 semantic search (HNSW + TF-IDF) | 55–80% for sparse stores |
| Recall@K — Verbatim | L3 raw drawer chunks | 80–97% after document ingestion |
| Recall@K — Combined | Either layer returns a hit | 90–99% |
The combined figure is the number that matters for agent quality. If it is below 80%, add more content or lower your dedup thresholds.
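As a rough illustration of how the three figures relate, the sketch below computes them from per-probe hit flags. The `ProbeResult` structure and `recall_figures` function are hypothetical names invented for this example, not part of VibeCLI; the only assumption taken from the text above is that a combined hit means either layer returned the expected answer in its top K.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one probe question (hypothetical structure for illustration)."""
    query: str
    cognitive_hit: bool  # expected answer found in the top K of the L2 layer
    verbatim_hit: bool   # expected answer found in the top K of the L3 layer


def recall_figures(results: list[ProbeResult]) -> dict[str, float]:
    """Return cognitive, verbatim, and combined recall as percentages."""
    total = len(results)
    if total == 0:
        return {"cognitive": 0.0, "verbatim": 0.0, "combined": 0.0}
    cognitive = sum(r.cognitive_hit for r in results)
    verbatim = sum(r.verbatim_hit for r in results)
    # A probe counts toward combined recall if either layer found the answer.
    combined = sum(r.cognitive_hit or r.verbatim_hit for r in results)
    return {
        "cognitive": 100.0 * cognitive / total,
        "verbatim": 100.0 * verbatim / total,
        "combined": 100.0 * combined / total,
    }


sample = [
    ProbeResult("What databases does this project use?", True, True),
    ProbeResult("What coding style does the user prefer?", False, True),
    ProbeResult("Was the user frustrated with anything?", True, False),
    ProbeResult("What test framework does the user prefer?", False, False),
]
figures = recall_figures(sample)
print(figures)
```

Because a probe only misses the combined figure when both layers miss, combined recall can never be lower than either single-layer figure.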
Time to complete: ~8 minutes
Prerequisites
- VibeCLI 0.5.1 or later
- A populated memory store (at least a few memories; more content gives more meaningful numbers)
- Optionally: some verbatim chunks ingested (see Demo 61)
Step 1: Run the benchmark at k=5
vibecli
> /openmemory benchmark
LongMemEval Benchmark (k=5)
Total memories: 47 Verbatim drawers: 132 Probe cases: 20
Recall@5 — Cognitive (L2): 75.0% ██████████████████████░░░░░░░░
Recall@5 — Verbatim (L3): 90.0% ███████████████████████████░░░
Recall@5 — Combined: 95.0% ████████████████████████████░░
Per-case breakdown:
sector query cognitive verbatim
─────────────────────────────────────────────────────────────────────────────────
episodic What was the last project I worked on? ✓ ✓
semantic What programming languages do I know? ✓ ✓
procedural How do I run the test suite for this project? ✓ ✓
preference What coding style does the user prefer? ✗ ✓
identity What is the user's primary role? ✗ ✓
episodic What was the most recent error we debugged? ✓ ✓
semantic What AI providers are configured? ✓ ✓
procedural What is the deployment pipeline? ✓ ✓
preference How does the user prefer to name variables? ✗ ✓
emotional Was the user frustrated with anything? ✓ ✗
semantic What databases does this project use? ✓ ✓
procedural How do I set up a new dev environment? ✗ ✓
episodic When was the last incident in production? ✓ ✓
reflective What patterns does the user repeat? ✓ ✗
semantic What is the main programming language? ✓ ✓
procedural How do I add a new API endpoint? ✓ ✓
preference What test framework does the user prefer? ✗ ✗
episodic What was discussed in the last session? ✓ ✓
semantic What version of Rust is required? ✗ ✓
procedural How do I run a database migration? ✓ ✓
Summary: 15/20 cognitive · 18/20 verbatim · 19/20 combined
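To see which sectors account for the misses, you could tally the per-case rows by sector. The sketch below assumes each row has the shape shown in the sample output (sector first, then the query, then the cognitive and verbatim marks); the real output format may differ, and `parse_row` and `misses_by_sector` are hypothetical helpers.

```python
from collections import defaultdict


def parse_row(line: str) -> tuple[str, str, bool, bool]:
    """Split one breakdown row into (sector, query, cognitive_hit, verbatim_hit).

    Assumes the row shape from the sample output: sector, query, two marks.
    """
    parts = line.split()
    sector = parts[0]
    cognitive_mark, verbatim_mark = parts[-2], parts[-1]
    query = " ".join(parts[1:-2])
    return sector, query, cognitive_mark == "✓", verbatim_mark == "✓"


def misses_by_sector(lines: list[str]) -> dict[str, dict[str, int]]:
    """Count cognitive and verbatim misses per sector."""
    misses: dict[str, dict[str, int]] = defaultdict(
        lambda: {"cognitive": 0, "verbatim": 0}
    )
    for line in lines:
        sector, _query, cognitive_hit, verbatim_hit = parse_row(line)
        if not cognitive_hit:
            misses[sector]["cognitive"] += 1
        if not verbatim_hit:
            misses[sector]["verbatim"] += 1
    return dict(misses)


rows = [
    "preference What coding style does the user prefer? ✗ ✓",
    "emotional Was the user frustrated with anything? ✓ ✗",
    "preference What test framework does the user prefer? ✗ ✗",
    "semantic What databases does this project use? ✓ ✓",
]
report = misses_by_sector(rows)
print(report)
```

Sectors with no misses are omitted from the report, so a short report is a good sign.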
Step 2: Increase k to check depth
> /openmemory benchmark 10
LongMemEval Benchmark (k=10)
Recall@10 — Cognitive: 85.0% █████████████████████████░░░░░
Recall@10 — Verbatim: 95.0% ████████████████████████████░░
Recall@10 — Combined: 95.0% ████████████████████████████░░
At k=10, the cognitive layer has more chances to surface a relevant memory — useful when a specific answer is ranked 6th–10th rather than in the top 5.
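The effect of k can be sketched with the rank of the first relevant result for each probe. The ranks below are made-up illustration data, and `recall_at_k` is a hypothetical helper, not a VibeCLI function.

```python
from typing import Optional


def recall_at_k(first_hit_ranks: list[Optional[int]], k: int) -> float:
    """Percentage of probes whose first relevant result is ranked within the top k.

    Ranks are 1-based; None means the answer was never retrieved.
    """
    if not first_hit_ranks:
        return 0.0
    hits = sum(1 for rank in first_hit_ranks if rank is not None and rank <= k)
    return 100.0 * hits / len(first_hit_ranks)


# Made-up ranks for ten probes; two answers are missing from the store entirely.
ranks = [1, 3, 7, None, 2, 9, 5, None, 1, 4]
for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k(ranks, k):.1f}%")
```

Recall can only stay flat or rise as k grows, and probes whose answer is absent from the store (rank `None`) never become hits at any k.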
Step 3: Improve recall by adding missing content
One probe case failed both layers:
- “What test framework does the user prefer?” — no matching memory or drawer
Add the missing memory:
> /openmemory add "User prefers Nextest (cargo-nextest) for running Rust tests; faster parallelism than cargo test"
And optionally ingest the relevant config:
> /openmemory chunk file:.config/nextest.toml
Rerun the benchmark:
> /openmemory benchmark
Recall@5 — Cognitive: 80.0% ████████████████████████░░░░░░
Recall@5 — Verbatim: 95.0% ████████████████████████████░░
Recall@5 — Combined: 100.0% ██████████████████████████████
preference What test framework does the user prefer? ✓ ✓ ← was ✗ ✗
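When comparing two runs, it helps to list which probes flipped. The following sketch assumes you have already collected each run's results into a dictionary keyed by query; that data structure and the `diff_runs` helper are hypothetical and chosen only for illustration.

```python
Hits = tuple[bool, bool]  # (cognitive hit, verbatim hit)


def combined_hit(hits: Hits) -> bool:
    """A probe passes overall if either layer found the expected answer."""
    return hits[0] or hits[1]


def diff_runs(
    before: dict[str, Hits], after: dict[str, Hits]
) -> tuple[list[str], list[str]]:
    """Return (fixed, regressed) queries, judged on the combined result."""
    fixed, regressed = [], []
    for query, new_hits in after.items():
        old_hits = before.get(query)
        if old_hits is None:
            continue  # probe was not part of the earlier run
        if not combined_hit(old_hits) and combined_hit(new_hits):
            fixed.append(query)
        elif combined_hit(old_hits) and not combined_hit(new_hits):
            regressed.append(query)
    return fixed, regressed


before = {
    "What test framework does the user prefer?": (False, False),
    "What databases does this project use?": (True, True),
}
after = {
    "What test framework does the user prefer?": (True, True),
    "What databases does this project use?": (True, True),
}
fixed, regressed = diff_runs(before, after)
print("fixed:", fixed)
print("regressed:", regressed)
```

Tracking regressions alongside fixes matters because adding or consolidating memories can change rankings for unrelated probes.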
Step 4: Watch combined recall grow over time
The benchmark is most useful when you run it regularly. A common workflow:
# At the end of each significant coding session
vibecli -e '/openmemory benchmark' >> ~/.vibecli/recall-log.txt
Plot the log to see how recall trends as your store grows.
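A minimal sketch of extracting the trend from such a log follows. It assumes the appended output contains lines shaped like the "Combined" line in the sample output above; the regular expression and the `combined_trend` helper are illustrative guesses, so adjust them if your log looks different.

```python
import re

# Assumed line shape, taken from the sample output in this demo:
#   Recall@5 <dash> Combined: 95.0% <bar>
COMBINED_RE = re.compile(r"Recall@(\d+)\s+\S+\s+Combined:\s+([\d.]+)%")


def combined_trend(log_text: str) -> list[tuple[int, float]]:
    """Return (k, combined recall %) for every benchmark run found in the log."""
    return [(int(k), float(pct)) for k, pct in COMBINED_RE.findall(log_text)]


DASH = "\u2014"  # the dash character used in the benchmark output
sample_log = "\n".join([
    f"Recall@5 {DASH} Cognitive (L2): 75.0%",
    f"Recall@5 {DASH} Verbatim (L3): 90.0%",
    f"Recall@5 {DASH} Combined: 95.0%",
    f"Recall@5 {DASH} Cognitive: 80.0%",
    f"Recall@5 {DASH} Verbatim: 95.0%",
    f"Recall@5 {DASH} Combined: 100.0%",
])
trend = combined_trend(sample_log)
for run, (k, pct) in enumerate(trend, start=1):
    print(f"run {run}: Recall@{k} combined = {pct:.1f}%")
```

The resulting list of pairs can be fed to any plotting tool; keeping k in each pair avoids mixing runs made at different depths.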
Step 5: Interpret low scores
| Pattern | Likely cause | Fix |
|---|---|---|
| Cognitive low, Verbatim high | Memories too short or abstract | Add more specific memories with /openmemory add |
| Verbatim low, Cognitive high | Not enough raw documents ingested | Use /openmemory chunk file:... for key docs |
| Both low | Store nearly empty | Add memories, run /openmemory consolidate |
| Combined 100% but only 2 drawers | Store too small to be meaningful | Ingest more documents, add more memories |
| Preference sector always fails | Preference-type info lives in conversations, not in stored memories | Ingest chat logs with /openmemory chunk file:session.txt |
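The first four rows of the table can be read as a simple decision procedure, sketched below. The 80% cut-off for "low" and the three-drawer minimum are assumptions made for this example (the table does not define exact thresholds), and `diagnose` is a hypothetical helper.

```python
def diagnose(
    cognitive: float,
    verbatim: float,
    combined: float,
    drawers: int,
    low: float = 80.0,      # assumed cut-off for a "low" score
    min_drawers: int = 3,   # assumed minimum for a meaningful store
) -> str:
    """Map benchmark figures to the matching row of the troubleshooting table."""
    cognitive_low = cognitive < low
    verbatim_low = verbatim < low
    if cognitive_low and verbatim_low:
        return "both low: add memories, run /openmemory consolidate"
    if cognitive_low:
        return "cognitive low: add more specific memories with /openmemory add"
    if verbatim_low:
        return "verbatim low: ingest key docs with /openmemory chunk"
    if combined == 100.0 and drawers < min_drawers:
        return "store too small: ingest more documents, add more memories"
    return "no pattern from the table matched"


print(diagnose(cognitive=60.0, verbatim=90.0, combined=95.0, drawers=132))
print(diagnose(cognitive=100.0, verbatim=100.0, combined=100.0, drawers=2))
```

The both-low check comes first so that a nearly empty store is not misreported as a problem with a single layer.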
VibeUI: Benchmark Runner (Drawers Tab)
The VibeUI OpenMemory panel → Drawers tab includes a live benchmark runner at the bottom.
Running the benchmark
- Open the OpenMemory panel (AI sidebar).
- Click the Drawers tab.
- Scroll to LongMemEval Benchmark.
- Set k using the number input (default: 5).
- Click Run.
The results appear in seconds:
Three recall gauges: animated bars for Cognitive, Verbatim, and Combined.
Cognitive (L2) 75% ███████████████████░░░░░░
Verbatim (L3) 90% ███████████████████████░░
Combined 95% ████████████████████████░
Summary line: 15/20 cognitive hits · 18/20 verbatim hits · 47 memories · 132 drawers
Per-case table: each probe case shows its sector, the truncated query, and a green ✓ or grey ✗ for each of the cognitive and verbatim layers. Hover over a query for the full text.
Iterating in the UI
After running the benchmark, if the score is below your target:
- Stay in the Drawers tab and scroll to the Context Preview section.
- Type the failing probe query and click Preview.
- Check whether the expected answer appears in the L2 scoped results or the L3 verbatim results.
- If absent from both: switch to the Memories tab and add a targeted memory.
- Re-run the benchmark to confirm improvement.
Demo Recording
{
"meta": {
"title": "Memory Benchmarking — LongMemEval Recall@K",
"description": "Run the built-in LongMemEval benchmark in CLI and VibeUI, interpret results, and iterate to reach 100% combined recall.",
"duration_seconds": 180,
"version": "1.0.0"
},
"steps": [
{
"id": 1,
"action": "repl",
"commands": [
{ "input": "/openmemory benchmark", "delay_ms": 5000 }
],
"description": "Run recall@5 benchmark — see per-case hit/miss table"
},
{
"id": 2,
"action": "repl",
"commands": [
{ "input": "/openmemory benchmark 10", "delay_ms": 4000 }
],
"description": "Increase k to 10 — cognitive recall improves"
},
{
"id": 3,
"action": "repl",
"commands": [
{ "input": "/openmemory add \"User prefers Nextest (cargo-nextest) for Rust tests\"", "delay_ms": 3000 },
{ "input": "/openmemory chunk file:.config/nextest.toml", "delay_ms": 3000 },
{ "input": "/openmemory benchmark", "delay_ms": 5000 }
],
"description": "Fix a failing probe case and rerun — combined recall reaches 100%"
},
{
"id": 4,
"action": "vibeui",
"panel": "OpenMemory",
"tab": "Drawers",
"description": "Open the Drawers tab in VibeUI — same benchmark runner with animated gauges"
}
]
}
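If you want to inspect a recording like this programmatically, a short script can list the REPL inputs and total the scripted delays. The sketch below embeds a trimmed copy of the recording and relies only on the field names visible above; `summarize` is a hypothetical helper, and nothing here reflects how VibeCLI itself plays recordings.

```python
import json

# Trimmed copy of the recording above, kept small for illustration.
recording_json = """
{
  "meta": {"title": "Memory Benchmarking", "duration_seconds": 180},
  "steps": [
    {"id": 1, "action": "repl",
     "commands": [{"input": "/openmemory benchmark", "delay_ms": 5000}]},
    {"id": 2, "action": "repl",
     "commands": [{"input": "/openmemory benchmark 10", "delay_ms": 4000}]},
    {"id": 4, "action": "vibeui", "panel": "OpenMemory", "tab": "Drawers"}
  ]
}
"""


def summarize(recording: dict) -> tuple[list[str], int]:
    """Return every REPL input in order and the sum of all command delays (ms)."""
    inputs: list[str] = []
    total_delay_ms = 0
    for step in recording["steps"]:
        # UI steps carry no "commands" key, so default to an empty list.
        for command in step.get("commands", []):
            inputs.append(command["input"])
            total_delay_ms += command["delay_ms"]
    return inputs, total_delay_ms


inputs, total_delay_ms = summarize(json.loads(recording_json))
print(inputs)
print(f"scripted delay: {total_delay_ms} ms")
```

Using `step.get("commands", [])` keeps the loop working for the `vibeui` step, which describes a panel and tab rather than REPL commands.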
What’s Next
- Demo 61: Verbatim Drawers & MemPalace — Learn how drawers are ingested and organised
- Demo 48: OpenMemory Engine — Core cognitive memory: sectors, consolidation, knowledge graph
- Memory Guide — Full reference for all three memory layers