Ferryte — verification for agent forgetting

The Forgetting Report · open benchmark

A reproducible test of delete-after-revoke behaviour across popular agent-memory stacks. We plant a canary, call each stack’s real delete API, then check whether the agent can still surface it — first without Ferryte, then with.

3 stacks tested · 4 scenarios · embedder text-embedding-3-small · summarizer gpt-4o-mini · updated 2026-06-06

The leaderboard

Naive delete vs. Ferryte cascade — % of scenarios each stack passes cleanly.

“Before” is what every framework does today: delete the source and hope the derived memory follows. “After” turns on Ferryte’s lineage cascade: the same harness, run with the --with-ferryte flag.
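As a rough sketch of how each percentage falls out of the four verdicts: we assume here that only PASS counts as "clean", which matches the published rows, where a WARN does not lift the score. The function name is ours for illustration and is not taken from the harness.

```python
from typing import Sequence


def clean_pass_rate(verdicts: Sequence[str]) -> int:
    """Percentage of scenarios scored PASS; WARN, LEAK and BLIND do not count (assumption)."""
    if not verdicts:
        raise ValueError("no scenarios scored")
    passed = sum(v == "PASS" for v in verdicts)
    return round(100 * passed / len(verdicts))


# Verdict order: Source, Cross-tenant, Stale, Poisoning (the AWS Bedrock AgentCore card).
before = ["LEAK", "PASS", "LEAK", "PASS"]
after = ["PASS", "PASS", "WARN", "PASS"]

delta_pp = clean_pass_rate(after) - clean_pass_rate(before)
print(clean_pass_rate(before), clean_pass_rate(after), f"{delta_pp:+d}pp")  # 50 75 +25pp
```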

AWS Bedrock AgentCore

Native framework · semantic long-term memory

+25pp

Without Ferryte (naive delete): 50%
Source: LEAK · Cross-tenant: PASS · Stale: LEAK · Poisoning: PASS

With Ferryte (lineage cascade): 75%
Source: PASS · Cross-tenant: PASS · Stale: WARN · Poisoning: PASS

Ferryte's lineage cascade triggers BatchDeleteMemoryRecords after DeleteEvent — exactly what AWS's own docs recommend. The remaining WARN is the stale-fact bug class (needs versioning, not cascade).

Mem0

Native framework · own LLM fact-extraction

no change

Without Ferryte (naive delete): 25%
Source: LEAK · Cross-tenant: PASS · Stale: LEAK · Poisoning: LEAK

With Ferryte (lineage cascade): 25%
Source: LEAK · Cross-tenant: PASS · Stale: LEAK · Poisoning: LEAK

Honest limitation: Mem0's internal fact-extractor creates derived memories whose IDs aren't returned to the caller, so the lineage graph can't yet enumerate them for cascade. Deeper Mem0 instrumentation is on the Ferryte roadmap.

Vector store + app summary

pgvector · Chroma · Qdrant · in-memory (identical)

+25pp

Without Ferryte (naive delete): 25%
Source: LEAK · Cross-tenant: PASS · Stale: WARN · Poisoning: LEAK

With Ferryte (lineage cascade): 50%
Source: PASS · Cross-tenant: PASS · Stale: WARN · Poisoning: LEAK

The raw row delete is clean — every store behaves the same. The leak is in the summary layer on top; Ferryte's lineage cascade clears the derived summary in lockstep with the source.
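To make "in lockstep with the source" concrete, here is a minimal sketch of a lineage cascade over an in-memory store. It is not Ferryte's implementation: `LineageGraph`, `cascade_delete` and the record IDs are hypothetical names, and a plain dict stands in for the vector store and the summary table.

```python
from __future__ import annotations

from collections import defaultdict


class LineageGraph:
    """Hypothetical lineage index: source record ID -> IDs of memories derived from it."""

    def __init__(self) -> None:
        self._children: dict[str, set[str]] = defaultdict(set)

    def link(self, source_id: str, derived_id: str) -> None:
        self._children[source_id].add(derived_id)

    def descendants(self, source_id: str) -> set[str]:
        """Every transitively derived ID, so a summary of a summary is included."""
        seen: set[str] = set()
        stack = [source_id]
        while stack:
            for child in self._children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


def cascade_delete(store: dict[str, str], graph: LineageGraph, source_id: str) -> set[str]:
    """Delete the source and everything derived from it; return the IDs targeted."""
    doomed = {source_id} | graph.descendants(source_id)
    for record_id in doomed:
        store.pop(record_id, None)
    return doomed


store = {
    "row-1": "Codename is canary-7f3a91",
    "sum-1": "User shared codename canary-7f3a91",
    "digest-1": "Weekly digest: codename canary-7f3a91 discussed",
    "row-2": "Prefers email contact",
}
graph = LineageGraph()
graph.link("row-1", "sum-1")      # summary derived from the source row
graph.link("sum-1", "digest-1")   # digest derived from the summary

removed = cascade_delete(store, graph, "row-1")
print(sorted(removed), list(store))  # ['digest-1', 'row-1', 'sum-1'] ['row-2']
```

The cascade can only clear what the graph knows about, which is why a derived memory whose ID is never recorded stays out of reach.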

Zep

Native framework · knowledge-graph summaries

Without Ferryte (naive delete): pending
Source: soon · Cross-tenant: soon · Stale: soon · Poisoning: soon

With Ferryte (lineage cascade): pending
Source: soon · Cross-tenant: soon · Stale: soon · Poisoning: soon

Zep's self-hosted Community Edition was deprecated and the current zep-cloud SDK is cloud-only, so these cells are pending a hosted-account run.

PASS: forgot cleanly · LEAK: leaked the revoked data · WARN: partial / outranked · BLIND: couldn't verify · soon: run in progress

All scores are reproducible — the 'with Ferryte' column uses the same harness with `--with-ferryte`.

How it’s scored

The raw vector DB isn’t the villain.
The summary layer on top is.

A row delete on pgvector, Chroma, or Qdrant is clean — that’s why they score identically. The leak appears once an LLM summary or knowledge-graph node absorbs the fact and the delete doesn’t propagate. That derived layer is exactly what real agent-memory frameworks add — and exactly what this benchmark measures.
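A toy reproduction of that failure mode, assuming nothing about any real store: a dict stands in for the vector rows and a string join stands in for the LLM summariser. The marker value is made up.

```python
CANARY = "canary-7f3a91"  # hypothetical marker, not one used by the benchmark

rows = {
    "row-1": f"Customer passphrase is {CANARY}",
    "row-2": "Prefers email contact",
}


def summarize(texts):
    """Stand-in for an LLM summariser: it simply joins the facts it was shown."""
    return " | ".join(texts)


summary = summarize(rows.values())  # the derived layer absorbs the canary

del rows["row-1"]                   # naive delete: only the source row goes

row_leak = any(CANARY in text for text in rows.values())
summary_leak = CANARY in summary
print(row_leak, summary_leak)  # False True
```

The row store answers cleanly after the delete; the summary, which nothing told to forget, still carries the marker.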

Real backends, default configs

Each stack runs in its recommended setup on our own deployments — no strawmen, no private systems.

Plant a canary the data can't invent

A unique marker is written for one tenant, through one source, so any later appearance is provably a leak.
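One way to mint such a marker, sketched under our own assumptions: the `Canary` type, the `plant_canary` helper and the sentence template are illustrative and not taken from the harness. A random token from `secrets` makes it very unlikely that a model or a dataset produces the same string by chance.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    marker: str
    tenant_id: str
    source_id: str


def plant_canary(store: dict, tenant_id: str, source_id: str) -> Canary:
    """Write one random marker for one tenant through one source (hypothetical helper)."""
    marker = f"canary-{secrets.token_hex(8)}"  # 16 random hex characters
    store[(tenant_id, source_id)] = f"The project codename is {marker}."
    return Canary(marker, tenant_id, source_id)


store = {}
canary = plant_canary(store, tenant_id="tenant-a", source_id="src-1")
print(canary.marker in store[("tenant-a", "src-1")])  # True
```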

Call the real delete API

We revoke the source the way an app would — then probe retrieval to see what survived.

Score what's left

PASS if the marker is gone everywhere, LEAK if it resurfaces, WARN if it survives only partially or is outranked by other results, BLIND if we honestly couldn't tell.
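The verdict logic can be sketched as a small classifier. This is an assumption-laden illustration rather than the harness's scorer: we treat `None` as "the probe could not run", and we read WARN as "the marker still appears, but another result outranks it".

```python
from typing import Optional, Sequence


def score(marker: str, results: Optional[Sequence[str]]) -> str:
    """Classify one probe. `results` are retrieval hits, best first; None means no probe."""
    if results is None:
        return "BLIND"   # could not verify, so never a silent PASS
    positions = [i for i, text in enumerate(results) if marker in text]
    if not positions:
        return "PASS"    # marker is gone from everything retrieved
    if positions[0] == 0:
        return "LEAK"    # revoked data is the top answer
    return "WARN"        # still present, but outranked (assumed meaning)


marker = "canary-7f3a91"  # hypothetical marker
print(score(marker, ["Prefers email contact"]))                   # PASS
print(score(marker, [f"Codename is {marker}"]))                   # LEAK
print(score(marker, ["New codename set", f"Old one: {marker}"]))  # WARN
print(score(marker, None))                                        # BLIND
```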

Reproduce it yourself

Don’t trust us. Run it.

Every number on this page comes from one command against pinned, open-source backends. Clone the repo, bring up the stores, point it at your own API key.

git clone https://github.com/getferryte/ferryte
cd ferryte/benchmark
cp .env.example .env            # add your OpenAI key
docker compose up -d            # pgvector · qdrant · chroma
pip install -r requirements.txt

# Before: naive delete
python -m benchmark.run --scenarios all \
  --backends mem0,qdrant,chroma,pgvector \
  --embedder openai --summarizer openai

# After: same harness, lineage cascade on
python -m benchmark.run --scenarios all \
  --backends mem0,qdrant,chroma,pgvector \
  --embedder openai --summarizer openai --with-ferryte

Blind spots & the obvious objection

The part most benchmarks hide.

“You sell the fix — of course you found leaks.”

Fair to ask. That's why the entire harness is open, the configs and versions are pinned, and you can falsify any cell yourself. We also publish what PASSES — cross-tenant isolation holds on every stack we've tested.

Did you rig the configs?

No. Backends run in their default / recommended setups on our own deployments. We never touch anyone's private systems, and the in-memory illustrative baseline is kept separate from real-backend results.

What did you NOT test?

Proprietary managed memory we can't self-host, and any behaviour behind a paywall we didn't buy. Where we can't verify, a cell reads BLIND — never a silent PASS.

Isn't a vector DB row-delete enough?

For the raw row, yes. The leak is the derived layer — summaries and graph nodes that absorbed the fact. The vendors document this themselves.

Catch it in CI, not in a customer ticket

Run the same test
against your own stack.

We deleted the data.

The memory kept it.
