Hey y’all, we’re trying to figure out what kind of evals tooling we want to support.
If you’re interested in giving feedback, could you share your thoughts on:
- What tools do you use currently, if any?
- What do you want to know when you use eval tools?
- What do you evaluate now? Memory persistence, conversation quality, retrieval accuracy, something else?
- What’s broken? Where do existing eval tools fail you when testing stateful agents?
- What’s your biggest memory eval pain point? Cross-session consistency? Memory management decisions? Long-term coherence?
- What would convince you to adopt new eval tooling? Better memory-specific metrics? Easier integration? Cost?
- Where are you headed? Planning production deployments? Need compliance/safety evals?
Comments welcome!