MemPalace: The Open-Source AI Memory System That Scores 96.6% on LongMemEval
Milla Jovovich, Ben Sigman | April 10, 2026 | product-announcement | low credibility
Source: github.com/milla-jovovich/mempalace | Author: Milla Jovovich, Ben Sigman | Published: 2026-04-06 | Category: product-announcement | Credibility: low
Executive Summary
- MemPalace is a Python library implementing a hierarchical “memory palace” metaphor (Wings, Rooms, Halls, Closets, Drawers) for organizing AI conversation history, backed by ChromaDB for vector search and SQLite for a temporal knowledge graph, with an MCP server for tool integration.
- The headline “96.6% LongMemEval R@5” benchmark is misleading: independent reproduction confirmed the benchmark runner creates a fresh ChromaDB ephemeral client per question and never exercises the palace architecture; the score measures the `all-MiniLM-L6-v2` embedding model, not MemPalace’s structural contribution.
- The project launched April 6, 2026 with a celebrity co-founder (actress Milla Jovovich) generating 38k+ GitHub stars rapidly, but multiple acknowledged post-launch corrections reveal overclaims across AAAK compression, the palace boost (+34%), and LoCoMo benchmark methodology.
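The five-level hierarchy in the summary above can be pictured as namespace paths over stored memories. The sketch below is purely illustrative (level names and data are assumed, not MemPalace’s actual schema):

```python
from collections import defaultdict

# Illustrative model only: the five-level metaphor reduces to slash-
# delimited namespace paths over stored text.
palace: dict[str, list[str]] = defaultdict(list)
#          wing / room / hall / closet / drawer
palace["personal/kitchen/north/pantry/top"].append("user prefers dark roast")
palace["work/projects/east/archive/2026"].append("sprint retro notes")

def in_scope(path: str, prefix: str) -> bool:
    # A memory is in scope if its path sits at or under the prefix.
    return path == prefix or path.startswith(prefix + "/")

# Scope a lookup to everything under the 'personal' wing:
scoped = [m for p, ms in palace.items() if in_scope(p, "personal") for m in ms]
assert scoped == ["user prefers dark roast"]
```

Modeling the levels as path prefixes also makes it visible that scoping a search to a wing or room is namespace filtering, nothing more exotic.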
Critical Analysis
Claim: “96.6% LongMemEval R@5 — the highest-scoring AI memory system ever benchmarked”
- Evidence quality: vendor-sponsored
- Assessment: Independent reproduction by @gizmax (GitHub issue #39) confirmed the 96.6% score is reproducible but mislabeled. The benchmark runner builds a fresh `chromadb.EphemeralClient()` per question and never exercises wings, rooms, or any palace code path. The result benchmarks ChromaDB’s default `all-MiniLM-L6-v2` embeddings on raw verbatim text retrieval; the “palace architecture” contributes nothing to this score. Additionally, Mastra’s Observational Memory (February 2026) achieves 94.87% R@5 using gpt-5-mini with a completely different text-only, no-vector-DB approach, and OMEGA reports 95.4%, so the claim of “highest-scoring ever” is disputed even on its own metric. For context, GPT-4o-backed Mem0 scores ~49% and Zep ~64% on the same benchmark, making the gap partly a function of a question set that rewards verbatim recall over graph-based reasoning.
- Counter-argument: Verbatim storage genuinely does maximize raw retrieval recall for this type of benchmark. If your primary requirement is “retrieve the exact wording of a past conversation,” storing verbatim text and using a decent embedding model is rational. The architecture may have practical value even if the benchmark headline overstates the “novel” contribution; the problem is that the marketing framing attributes the score to the palace structure when it comes from embedding quality.
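The reproduction finding can be illustrated without ChromaDB at all. In the hypothetical sketch below, `embed` is a toy stand-in for the default `all-MiniLM-L6-v2` model; the point is that a fresh per-question index built from raw turns exercises only the embedding/search layer, with no palace code anywhere on the path:

```python
import re

# Toy stand-in for an embedding model: bag-of-words token sets.
def embed(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def recall_candidates(question: str, turns: list[str], k: int = 5) -> list[str]:
    index = [(t, embed(t)) for t in turns]          # fresh index per question
    q = embed(question)
    ranked = sorted(index, key=lambda pair: -len(q & pair[1]))
    return [t for t, _ in ranked[:k]]               # R@k candidate set

turns = ["I adopted a cat named Miso",
         "We discussed Rust lifetimes",
         "My flight lands Tuesday"]
top = recall_candidates("what is the cat named?", turns)
assert top[0] == "I adopted a cat named Miso"
```

Swap `embed` for a better model and recall improves; add or remove the palace layers and, under this benchmark-runner pattern, nothing changes.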
Claim: “AAAK dialect achieves 30x compression with zero information loss”
- Evidence quality: vendor-sponsored
- Assessment: Multiple lines of independent analysis invalidate this claim. The AAAK code uses regex abbreviations, keyword frequency reduction, and 55-character sentence truncation; the `decode()` method does string-splitting only, with no ability to reconstruct the original text. This is by definition lossy. The authors’ own April 7 correction acknowledges AAAK regresses LongMemEval from 96.6% to 84.2% (a 12.4 percentage point drop). Independent analysis from `lhl/agentic-memory` found a 3.84x compression ratio in practice, not 30x. Token counting uses a crude `len(text)//3` heuristic rather than actual tokenizer calls, making the compression metrics untrustworthy. The project’s own README now correctly labels AAAK as “experimental” and “lossy.”
- Counter-argument: Compression that reduces 96.6% to 84.2% still leaves 84.2%, which beats Mem0 (49%) and Zep (64%) on this benchmark using GPT-4o. A lossy compression that degrades recall by 12 points might still be acceptable if the token savings are significant at scale. The problem is that the “zero information loss” claim was objectively false.
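A toy sketch shows why abbreviation-plus-truncation is lossy by construction (the rules and names below are illustrative assumptions, not the shipped AAAK code):

```python
import re

# Illustrative AAAK-style transforms: regex abbreviations plus a hard
# 55-character cut. Anything past the cut is simply discarded.
ABBREVS = {r"\bbecause\b": "bc", r"\byou\b": "u", r"\bwith\b": "w/"}

def aaak_encode(sentence: str) -> str:
    for pattern, abbrev in ABBREVS.items():
        sentence = re.sub(pattern, abbrev, sentence)
    return sentence[:55]                 # truncation: no decode can undo this

def crude_token_count(text: str) -> int:
    return len(text) // 3                # heuristic, not a real tokenizer call

original = ("I moved to Berlin because the old flat I shared with you "
            "was far too small for the three of us")
encoded = aaak_encode(original)
assert len(encoded) <= 55 and encoded != original   # lossy by construction
```

The same sketch shows why `len(text)//3` token counts are untrustworthy: the ratio it reports depends on character counts, not on what a tokenizer would actually emit.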
Claim: “Palace structure provides +34% retrieval boost”
- Evidence quality: vendor-sponsored
- Assessment: The authors themselves corrected this claim post-launch. The “+34% boost” compares unfiltered ChromaDB search against wing+room metadata filtering — narrowing the search scope from the full collection to a subset. This is standard metadata filtering available in any vector database, not a novel architectural contribution. It is genuinely useful for retrieval quality but is not a “palace boost” — it is the same technique you’d apply with namespaces in Pinecone or collections in Qdrant. The framing implies the spatial metaphor delivers retrieval gains; the reality is namespace scoping delivers retrieval gains.
- Counter-argument: Providing an opinionated, user-friendly abstraction over metadata filtering (rooms, wings, halls) may lower the barrier to using this technique for non-technical users. The concept has practical value even if the claim exaggerates its novelty. A USC computer science professor cited in press coverage called the approach “a general method for organizing information that could scale across AI frameworks.”
- References:
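The technique behind the “+34%” number can be sketched generically; the code below illustrates metadata prefiltering with made-up data, not MemPalace’s implementation:

```python
from math import dist

# Illustrative memory records with wing/room metadata and 2-D "vectors".
memories = [
    {"text": "prefers dark roast",  "wing": "personal", "room": "food",     "vec": (0.9, 0.1)},
    {"text": "sprint retro notes",  "wing": "work",     "room": "meetings", "vec": (0.1, 0.9)},
    {"text": "allergic to peanuts", "wing": "personal", "room": "food",     "vec": (0.8, 0.2)},
]

def search(query_vec, wing=None, room=None, k=5):
    # The "+34% boost" step: narrow the candidate pool by metadata
    # before nearest-neighbour ranking, as with Pinecone namespaces
    # or Qdrant collections.
    pool = [m for m in memories
            if (wing is None or m["wing"] == wing)
            and (room is None or m["room"] == room)]
    return sorted(pool, key=lambda m: dist(m["vec"], query_vec))[:k]

hits = search((0.88, 0.12), wing="personal", room="food")
assert [m["text"] for m in hits] == ["prefers dark roast", "allergic to peanuts"]
```

Scoping excludes off-topic neighbours before ranking, which genuinely improves precision; the point of the analysis above is that this is standard vector-database practice, not a property of the spatial metaphor.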
Claim: “Contradiction detection automatically flags inconsistent facts in the knowledge graph”
- Evidence quality: vendor-sponsored
- Assessment: Independent code review of `knowledge_graph.py` found this feature does not exist in the shipped code. The module only blocks insertion of identical duplicate triples. Conflicting facts (e.g., “user lives in Paris” followed by “user lives in Berlin”) accumulate silently with no contradiction detection. This is a common gap between documentation and implementation in rapidly launched open-source projects, but the claim was present at launch and is misleading.
- Counter-argument: This is a known missing feature that the maintainers have not publicly disputed. It belongs in the roadmap rather than in current-state claims.
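The gap the code review describes fits in a few lines; this `TripleStore` is a minimal stand-in, not the actual `knowledge_graph.py`:

```python
# Minimal sketch of the reviewed behaviour: blocking exact duplicate
# triples is not contradiction detection.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def insert(self, subj: str, pred: str, obj: str) -> bool:
        triple = (subj, pred, obj)
        if triple in self.triples:      # only exact duplicates are rejected
            return False
        self.triples.add(triple)        # conflicting facts accumulate silently
        return True

kg = TripleStore()
kg.insert("user", "lives_in", "Paris")
kg.insert("user", "lives_in", "Berlin")   # contradiction, not flagged
assert ("user", "lives_in", "Paris") in kg.triples
assert ("user", "lives_in", "Berlin") in kg.triples   # both coexist
```

Real contradiction detection would need, at minimum, a per-(subject, predicate) uniqueness or recency policy before insertion, which is exactly what the shipped module lacks.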
Claim: “$0.70/year vs. $507/year — zero cost vs. LLM summarization”
- Evidence quality: anecdotal
- Assessment: The cost comparison is directionally accurate but constructed to maximize the apparent gap. The $507/year figure assumes summarizing 650K tokens with a paid LLM API per session; the $0.70 figure is for “wake-up mode only” (170 tokens loaded at session start, no searches). A more honest comparison including 5 searches/day (their own “everyday user” model at $10/year) is still significantly cheaper than LLM summarization at scale. The local-first, zero-API-cost architecture for writes is a genuine differentiator. However, the comparison ignores that many developers use Claude Code’s built-in CLAUDE.md/MEMORY.md pattern, which is literally free and natively supported.
- Counter-argument: For users currently using LLM-based summarization services (Mem0 cloud, Zep cloud), the cost comparison is valid. For developers using native agent memory features (CLAUDE.md), the comparison is irrelevant.
- References:
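A back-of-envelope version of the cost arithmetic, with the per-million-token price and session count as assumed inputs (the source does not publish the exact inputs behind its $507 and $0.70 figures; only the 650K and 170 token counts come from the comparison above):

```python
# Assumed inputs for illustration: 300 sessions/year, $2.60 per 1M tokens.
def yearly_cost_usd(tokens_per_session: int, sessions_per_year: int,
                    usd_per_million_tokens: float) -> float:
    return tokens_per_session * sessions_per_year * usd_per_million_tokens / 1e6

llm_summaries = yearly_cost_usd(650_000, 300, 2.60)  # summarize full history
wake_up_only  = yearly_cost_usd(170, 300, 2.60)      # load 170 tokens at start
assert wake_up_only < 1 < llm_summaries
```

Whatever prices are plugged in, the three-orders-of-magnitude gap comes from the token counts, which is why the comparison is directionally sound even though the endpoints were chosen to maximize it.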
Credibility Assessment
- Author background: Milla Jovovich is a well-known actress (Resident Evil franchise, The Fifth Element) with no prior technical track record. She describes her role as conceptual: defining memory requirements from a daily AI user’s perspective. Ben Sigman (co-author) is the technical executor and has a software engineering background. A third contributor “Lu” is alleged in community discussions but not credited in the repository. The celebrity name generated significant media coverage and GitHub star velocity disproportionate to the technical maturity.
- Publication bias: Self-published GitHub repository and accompanying mempalace.tech marketing site. The project is heavily promoted in tech media (Bitcoin News, Cybernews, IntelligentLiving) largely because of Jovovich’s name. The corrections published on April 7 demonstrate some degree of good faith, but the overclaims were in the initial README and generated the viral coverage before corrections.
- Verdict: low — Multiple core technical claims (benchmark attribution, AAAK losslessness, palace boost novelty, contradiction detection) were either false or substantially misleading at launch. The project corrected several claims after community pressure. The underlying concept (verbatim storage with hierarchical namespace filtering via MCP) has practical merit for small-scale personal use, but the maturity level (170 commits, 4 test files for 21 modules, 5 days old at launch) does not support the “highest-scoring ever” marketing. Stars inflated by celebrity, not technical validation.
Entities Extracted
| Entity | Type | Catalog Entry |
|---|---|---|
| MemPalace | open-source | data/catalog/frameworks/mempalace.md |
| ChromaDB | open-source | data/catalog/vendors/chromadb.md |
| Model Context Protocol (MCP) | open-source | data/catalog/frameworks/model-context-protocol.md |
| Agent Memory as Infrastructure | pattern | data/catalog/patterns/agent-memory-as-infrastructure.md |