Eval tools for Letta

Hey y’all, we’re trying to figure out what kind of eval tooling we want to support.

If you’re interested in giving feedback, could you share your thoughts on:

  1. What tools do you currently use, if any?

  2. What do you want to learn when you use eval tools?

  3. What do you evaluate now? Memory persistence, conversation quality, retrieval accuracy, something else?

  4. What’s broken? Where do existing eval tools fail you when testing stateful agents?

  5. What’s your biggest memory eval pain point? Cross-session consistency? Memory management decisions? Long-term coherence?

  6. What would convince you to adopt new eval tooling? Better memory-specific metrics? Easier integration? Cost?

  7. Where are you headed? Planning production deployments? Need compliance/safety evals?
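To make the memory-persistence case from question 3 concrete, here’s a minimal sketch of what a cross-session eval could look like. `StubAgent` and its `send` method are hypothetical stand-ins for illustration, not Letta’s actual API; the key idea is that two separate sessions share one memory store, and the eval checks whether a fact taught in session 1 is recalled in session 2.

```python
class StubAgent:
    """Toy agent: each instance is one session, but memory persists across sessions."""

    def __init__(self, memory):
        self.memory = memory  # shared store that outlives any single session

    def send(self, message):
        # Naive behavior: store "remember: key=value", answer "what is key?"
        if message.startswith("remember:"):
            key, _, value = message[len("remember:"):].strip().partition("=")
            self.memory[key.strip()] = value.strip()
            return "ok"
        if message.startswith("what is"):
            key = message[len("what is"):].strip(" ?")
            return self.memory.get(key, "I don't know")
        return "..."


def eval_cross_session_persistence(memory_store):
    # Session 1: teach the agent a fact.
    session1 = StubAgent(memory_store)
    session1.send("remember: favorite_color=blue")
    # Session 2: fresh session, same underlying memory store.
    session2 = StubAgent(memory_store)
    answer = session2.send("what is favorite_color?")
    return answer == "blue"


print(eval_cross_session_persistence({}))  # → True
```

A real harness would swap `StubAgent` for an actual agent client and score many fact/question pairs, but even this shape surfaces the questions we care about: what counts as recall, and how to isolate memory from in-context carryover.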

Comments welcome!