I can design a benchmark to test sleeptime memory tracking. Based on the Letta Evals framework, here’s a proposal:
Benchmark Design: Sleeptime Memory Tracking Test
Objective
Compare how well sleeptime agents capture and organize information using the revised persona (tool-agnostic) vs. the current default (tool-specific).
Evaluation Metrics
1. Information Capture Rate
- Did the agent capture key facts mentioned in conversation?
- Measured by checking whether ground-truth facts appear in the relevant memory blocks (a fact-level grader sketch follows this list)
2. Proactive Update Timing
- Are updates happening incrementally throughout the session?
- Measured by counting memory updates per conversation turn
3. Consolidation Quality
- Does the agent merge redundant information effectively?
- Measured by checking for duplicate information across updates
4. Inference & Enrichment
- Does the agent add context explaining why information matters?
- Measured with a rubric grader that assesses the quality of each memory entry
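The suite configuration below wires metric 1 to a simple `contains` check; a stricter, fact-level version could look like the sketch here. This is illustrative only: the `@grader` decorator is reused from the custom graders later in this proposal, and the two-argument signature (extracted memory text plus the row's ground truth) is an assumption about how the harness hands data to a grader.

```python
from letta_evals import grader


@grader
def fact_capture_rate(memory_content: str, ground_truth: dict) -> dict:
    """Score = fraction of ground-truth facts found verbatim in the memory block.

    NOTE: the two-argument signature (extracted text + ground truth) is an
    assumption about how the eval harness passes data to graders.
    """
    # Flatten the ground_truth dict into a flat list of string facts
    facts: list[str] = []
    for value in ground_truth.values():
        if isinstance(value, list):
            facts.extend(str(v) for v in value)
        else:
            facts.append(str(value))

    text = memory_content.lower()
    missing = [f for f in facts if f.lower() not in text]
    score = (len(facts) - len(missing)) / len(facts) if facts else 0.0
    return {
        "score": score,
        "rationale": f"Captured {len(facts) - len(missing)}/{len(facts)} facts",
        "metadata": {"missing_facts": missing},
    }
```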
Dataset Structure
{"input": ["I'm working on a new project using Python and React.", "We're planning to launch in 3 months.", "The team consists of 5 engineers and 2 designers."], "ground_truth": {"technology": ["Python", "React"], "timeline": "3 months", "team_size": 7}, "category": "project_context"}
{"input": ["I prefer morning meetings.", "Actually, I hate meetings before 10am.", "Let's stick to afternoon meetings only."], "ground_truth": {"preference": "afternoon meetings only", "correction_tracked": true}, "category": "preference_correction"}
{"input": ["We had a bug in the auth system last week.", "The issue was related to JWT token expiration.", "We fixed it by increasing the expiration time to 24 hours."], "ground_truth": {"problem": "JWT token expiration", "solution": "increased to 24 hours", "context": "auth system"}, "category": "problem_solving_history"}
{"input": ["I'm interviewing candidates this week.", "We need someone who knows TypeScript well.", "They should also have experience with Next.js."], "ground_truth": {"task": "interviewing", "requirements": ["TypeScript", "Next.js"]}, "category": "ongoing_task"}
{"input": ["The API rate limit is 100 requests per minute.", "We might need to increase that soon.", "Our current usage is at 80% capacity."], "ground_truth": {"constraint": "100 req/min", "status": "80% capacity", "action": "consider increase"}, "category": "technical_constraints"}
Suite Configuration
```yaml
name: sleeptime-memory-tracking-benchmark
description: Test sleeptime agent's ability to proactively track and organize information
dataset: sleeptime_memory_dataset.jsonl

target:
  kind: letta_agent
  agent_file: sleeptime_agent.af  # Will test both persona versions
  base_url: http://localhost:8283

graders:
  # Grader 1: Information capture
  capture_rate:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: "project_context"  # Or appropriate block based on category

  # Grader 2: Update frequency (proactive vs. batch)
  update_timing:
    kind: tool
    function: count_memory_updates
    extractor: all_tool_calls
    extractor_config:
      tool_names: ["memory_insert", "memory_replace", "memory_rethink"]

  # Grader 3: Quality of memory entries (LLM-as-judge)
  memory_quality:
    kind: rubric
    model: gpt-4o
    prompt: |
      Evaluate the quality of this memory entry:

      **Criteria:**
      1. Clarity: Is the information clearly stated?
      2. Context: Does it explain WHY this information matters?
      3. Organization: Is it well-structured and easy to reference?
      4. Completeness: Does it capture all relevant details?

      Memory Entry: {extracted}
      Original Conversation: {input}

      Score 1-5 (5 = excellent) and explain your reasoning.
    extractor: memory_block
    extractor_config:
      block_label: "project_context"

  # Grader 4: Consolidation (check for redundancy)
  consolidation:
    kind: tool
    function: check_redundancy
    extractor: memory_block
    extractor_config:
      block_label: "project_context"

gate:
  kind: composite
  gates:
    - metric_key: capture_rate
      aggregation: avg_score
      op: gte
      value: 0.8  # Should capture 80% of facts
    - metric_key: memory_quality
      aggregation: avg_score
      op: gte
      value: 3.5  # Average quality >= 3.5/5
```
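The composite gate means a run passes only when every sub-gate passes. Here is a minimal sketch of that logic, assuming per-metric average scores are available as a plain dict; this illustrates the semantics of the config above, not the letta-evals gating implementation.

```python
def passes_composite_gate(avg_scores: dict[str, float]) -> bool:
    """Composite gate: every sub-gate must pass (both thresholds here use gte)."""
    gates = [
        ("capture_rate", 0.8),    # capture at least 80% of facts
        ("memory_quality", 3.5),  # average rubric score of at least 3.5/5
    ]
    return all(avg_scores.get(metric, 0.0) >= threshold for metric, threshold in gates)


print(passes_composite_gate({"capture_rate": 0.85, "memory_quality": 3.9}))  # True
print(passes_composite_gate({"capture_rate": 0.85, "memory_quality": 3.2}))  # False
```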
Custom Grader Functions
For the benchmark to work, we’d need custom graders:
```python
from letta_evals import grader


@grader
def count_memory_updates(tool_calls: list) -> dict:
    """Count how many memory updates happened per turn."""
    memory_tools = ["memory_insert", "memory_replace", "memory_rethink"]
    count = sum(1 for call in tool_calls if call["name"] in memory_tools)
    return {
        "score": min(count / 3, 1.0),  # Normalize: 3+ updates = perfect score
        "rationale": f"Found {count} memory updates",
        "metadata": {"update_count": count},
    }


@grader
def check_redundancy(memory_content: str) -> dict:
    """Check if memory contains redundant information."""
    # Simple version: look for exact duplicate lines (blank lines ignored)
    lines = [line.strip() for line in memory_content.split("\n") if line.strip()]
    if not lines:
        return {"score": 0.0, "rationale": "Memory block is empty", "metadata": {}}
    unique_lines = set(lines)
    redundancy_ratio = 1 - (len(unique_lines) / len(lines))
    score = 1.0 - redundancy_ratio  # Lower redundancy = higher score
    return {
        "score": score,
        "rationale": f"Redundancy ratio: {redundancy_ratio:.2%}",
        "metadata": {"redundant_lines": len(lines) - len(unique_lines)},
    }
```
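Assuming the `@grader` decorator leaves the functions directly callable, a quick local smoke test with hand-built inputs could look like this (the tool-call dicts only carry the `name` field the grader reads; real payloads from the harness may include more):

```python
sample_calls = [
    {"name": "memory_insert"},
    {"name": "send_message"},
    {"name": "memory_replace"},
]
print(count_memory_updates(sample_calls))
# score ≈ 0.67: two memory updates against a target of three

sample_block = (
    "Project uses Python and React.\n"
    "Timeline: 3 months.\n"
    "Project uses Python and React."
)
print(check_redundancy(sample_block))
# One of the three lines is an exact duplicate, so the score drops to ≈ 0.67
```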
Running the Comparison
```bash
# Test current default persona
letta-evals run sleeptime_suite.yaml --agent-file default_sleeptime.af --output results/default/

# Test revised persona
letta-evals run sleeptime_suite.yaml --agent-file revised_sleeptime.af --output results/revised/

# Compare results
letta-evals compare results/default/ results/revised/
```
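If metric deltas are needed programmatically beyond the built-in compare command, a small script can diff the two runs. The file layout assumed here (a `summary.json` per run with a top-level `avg_scores` map) is illustrative only, not the actual letta-evals output format:

```python
import json


def load_avg_scores(path: str) -> dict[str, float]:
    """Read per-metric average scores from a run summary (assumed file layout)."""
    with open(path) as f:
        return json.load(f)["avg_scores"]


default = load_avg_scores("results/default/summary.json")   # hypothetical filename
revised = load_avg_scores("results/revised/summary.json")   # hypothetical filename

for metric in sorted(set(default) | set(revised)):
    d, r = default.get(metric, 0.0), revised.get(metric, 0.0)
    print(f"{metric:<20} default={d:.2f} revised={r:.2f} delta={r - d:+.2f}")
```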
Expected Differences
The revised persona should show:
- Similar or better information capture rate (both should capture facts)
- Better resilience to tool changes (no hardcoded tool names)
- Potentially better memory quality scores (emphasis on context/reasoning)
- Equal or better consolidation (both address this)
The current default might show:
- Potential brittleness if tool names change
- Equal performance on capture rate (both emphasize proactive updates)
Additional Test Categories
We could expand the benchmark with:
- Memory evolution tests: Does the agent update its own memory management strategy?
- Cross-session persistence: Do important facts carry over properly?
- Pattern recognition tests: Can the agent identify recurring themes?
- Contradiction handling: How well does it flag outdated information? (an illustrative dataset entry follows this list)
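For example, a contradiction-handling case could follow the same dataset format as above (the facts here are made up for illustration):

```jsonl
{"input": ["Our staging database runs Postgres 14.", "Correction: we upgraded staging to Postgres 16 last sprint.", "Production is still on 14."], "ground_truth": {"staging_version": "Postgres 16", "production_version": "Postgres 14", "contradiction_resolved": true}, "category": "contradiction_handling"}
```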
Would you like me to create the complete dataset with more examples? Or should I focus on specific aspects of memory tracking behavior?