Voice-First Refiner Agent Architecture
Context: Building a voice app over Claude Code (headless CLI). After each turn, a Letta V1 agent (“Refiner”) translates Claude Code’s technical output into personalized conversational speech for TTS.
Two modes:
- Translation — receives: user’s voice transcript + Claude’s thinking blocks + Claude’s text response. Outputs conversational translation based on learned preferences in memory.
- Feedback — receives short strings from a Tapback UI (“too long”, “simpler”, or free-text like “stop reading filenames”). Updates user_preferences memory block. No output needed.
3 memory blocks: user_preferences (r/w, preferences/tone), conversation_memory (read, rotated by sleeptime agent), refinement_patterns (r/w, what translations worked/failed).
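One way to pin down the two input contracts is a pair of TypedDicts. The field names mirror those used in the dispatch code later in this document (`user_transcript`, `claude_thinking`, `claude_response`, `feedback_text`); the sample values are illustrative only:

```python
from typing import TypedDict

class TranslationPayload(TypedDict):
    """Inputs for Translation mode (field names match the pipeline code below)."""
    user_transcript: str   # what the user said (STT output)
    claude_thinking: str   # Claude Code's thinking blocks
    claude_response: str   # Claude Code's final text response

class FeedbackPayload(TypedDict):
    """Inputs for Feedback mode."""
    feedback_text: str     # Tapback string, e.g. "too long"

# Illustrative turn payload
turn: TranslationPayload = {
    "user_transcript": "add a login endpoint",
    "claude_thinking": "Need to touch the auth router...",
    "claude_response": "I added POST /login to the auth router.",
}
```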
1. System Prompt Structure for V1 Agent
Recommended structure for memory-aware V1 agent:
refiner_system_prompt = """
You are the Response Refiner in a voice-first coding pipeline.
# ARCHITECTURE
You receive Claude Code's technical output and translate it to conversational speech.
You operate in two modes: Translation and Feedback.
# MODE 1: TRANSLATION
Input: User transcript + Claude thinking + Claude response
Output: Conversational explanation for TTS
Rules (from user_preferences):
- Check length preference
- Apply tone style
- Filter jargon based on tolerance level
Process:
1. Read Claude's technical output
2. Extract key points
3. Translate to conversational tone
4. Keep it brief (check user_preferences for length)
5. Track success in refinement_patterns
# MODE 2: FEEDBACK
Input: Tapback signal ("too long", "simpler", etc.)
Output: NONE (just update memory)
Process:
1. Interpret feedback
2. Update user_preferences via memory_replace
3. Note pattern in refinement_patterns
4. Acknowledge silently (no speech output)
# MEMORY GUIDELINES
user_preferences (READ/WRITE):
- Durable: Tone, length preference, jargon tolerance
- Update via memory_replace when feedback given
- DO NOT store: specific filenames, code snippets, timestamps
conversation_memory (READ ONLY):
- Context from recent turns
- Managed by sleeptime agent (don't write here)
refinement_patterns (READ/WRITE):
- Successful translations (keep examples)
- Failed translations (learn from mistakes)
- Update via memory_insert after each translation
# MEMORY OPERATIONS
When to use memory_replace (user_preferences):
✓ User says "too long" → Update length preference
✓ User says "simpler" → Update jargon tolerance
✓ User gives tone feedback → Update tone style
When to use memory_insert (refinement_patterns):
✓ After successful translation → Add example
✓ After negative feedback → Record what failed
When to do NOTHING:
✗ Don't store ephemeral details (filenames, timestamps)
✗ Don't update conversation_memory (sleeptime handles it)
# RESPONSE FORMAT
Translation mode: Output conversational text
Feedback mode: No output (memory updates only)
"""
Key patterns:
- Explicit mode descriptions in system prompt
- Memory block descriptions inline (what’s in each)
- Clear memory operation rules (when to use each tool)
- Ephemeral vs durable distinction
2. Two-Mode Agent: One Agent vs Two
Recommendation: Single agent with mode detection
Why:
- Shared memory access (no need to sync)
- Simpler deployment (one agent)
- Context continuity (remembers last translation when processing feedback)
Mode Detection Strategy: Message Format (Recommended)
def send_to_refiner(mode: str, content: dict) -> str:
    """Send a message to the refiner with an explicit mode signal."""
    if mode == "translation":
        message = f"""
[MODE: TRANSLATION]
User said: "{content['user_transcript']}"
Claude thinking:
{content['claude_thinking']}
Claude response:
{content['claude_response']}
Translate to conversational speech.
"""
    elif mode == "feedback":
        message = f"""
[MODE: FEEDBACK]
User feedback: "{content['feedback_text']}"
Update preferences. No output needed.
"""
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    response = client.agents.messages.create(
        agent_id=refiner.id,
        messages=[{"role": "user", "content": message}]
    )
    return extract_response(response)
Agent detects mode via [MODE: ...] marker in message.
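Since the agent keys off the marker, it is worth validating it on the client side before spending an API call. A minimal sketch (`parse_mode` is a hypothetical helper for this pipeline, not a Letta API):

```python
import re

def parse_mode(message: str) -> str:
    """Extract the [MODE: ...] marker from an outgoing refiner message.

    Returns the lowercase mode name, or raises if the marker is missing,
    so malformed dispatches fail loudly before any API call is made.
    """
    match = re.search(r"\[MODE:\s*(\w+)\]", message)
    if not match:
        raise ValueError("message is missing a [MODE: ...] marker")
    return match.group(1).lower()
```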
3. Memory Update Guidance: Preventing Pollution
Prompt patterns to prevent pollution:
memory_guidance = """
# MEMORY HYGIENE - CRITICAL
## Durable vs Ephemeral Classification
DURABLE (store in user_preferences):
✓ Length preference: "Keep it under 30 seconds"
✓ Tone style: "Casual, not academic"
✓ Jargon tolerance: "Avoid terms like 'asynchronous'"
✓ Explanation depth: "Skip implementation details"
EPHEMERAL (do NOT store):
✗ Specific filenames: "auth.py" (will change)
✗ Timestamps: "last updated 3pm"
✗ One-time feedback: "that was good" (not a pattern)
✗ Specific code snippets: "def foo():" (context-specific)
## Decision Tree for Memory Updates
When user gives feedback:
1. Is it about HOW I should respond? → Update user_preferences
2. Is it about THIS specific response? → Track in refinement_patterns
3. Is it just acknowledgment ("ok", "thanks")? → Do nothing
Examples:
"Too long" → Update user_preferences: Reduce length preference
"That was too long" → SAME (implies pattern, not one-off)
"This explanation was too long" → SAME
"Simpler" → Update user_preferences: Lower jargon tolerance
"Explain simpler" → SAME
"That was good" → Track in refinement_patterns (example of success)
"Ok" / "Thanks" → Do nothing (acknowledgment only)
## Memory_replace vs Memory_insert
Use memory_replace for:
✓ Updating existing preferences (user_preferences)
✓ Example: Change "Length: 2-3 sentences" to "Length: 1-2 sentences"
Use memory_insert for:
✓ Adding new patterns (refinement_patterns)
✓ Example: Append "Success: Brief Git explanation worked"
## Testing Your Decision
Before calling memory tool, ask yourself:
- Will this preference apply to FUTURE translations? → Store it
- Is this specific to THIS translation? → Don't store (or note as pattern)
- Is this just noise? → Ignore
If uncertain, prefer NOT storing (can always add later).
"""
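The decision tree can also be pre-applied on the client to skip model calls for pure acknowledgments. A heuristic sketch, where the keyword lists are illustrative assumptions and the agent remains the final judge:

```python
# Illustrative keyword lists -- tune these for your actual Tapback UI.
ACKNOWLEDGMENTS = {"ok", "okay", "thanks", "thank you", "got it"}
PREFERENCE_SIGNALS = ("too long", "too short", "simpler", "more detail", "tone")

def triage_feedback(text: str) -> str:
    """Rough client-side triage mirroring the decision tree:
    'preferences' -> worth a Feedback-mode call (durable HOW feedback)
    'patterns'    -> one-off success/failure example for refinement_patterns
    'ignore'      -> pure acknowledgment, skip the model call entirely
    """
    normalized = text.strip().lower()
    if normalized in ACKNOWLEDGMENTS:
        return "ignore"
    if any(signal in normalized for signal in PREFERENCE_SIGNALS):
        return "preferences"
    return "patterns"
```

Dropping acknowledgments before they reach the agent saves tokens and keeps the refiner's message history free of noise.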
Example Memory Block Structure:
# user_preferences - BEFORE feedback
label: user_preferences
value: |
  # Communication Style
  Length: 2-3 sentences
  Tone: Friendly, conversational
  Jargon: Avoid technical terms

  # Recent Adjustments
  [None yet]

# user_preferences - AFTER "too long" feedback
label: user_preferences
value: |
  # Communication Style
  Length: 1-2 sentences MAX
  Tone: Friendly, conversational
  Jargon: Avoid technical terms

  # Recent Adjustments
  - 2026-02-10: Reduced length preference (feedback: "too long")
Note the date in "Recent Adjustments": it helps the sleeptime agent know which changes are recent.
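Conceptually, a `memory_replace` call performs an exact-substring swap on the block value. The real tool runs inside the agent; the sketch below is only a local model of its effect, useful for testing prompt wording before deployment:

```python
def apply_memory_replace(block_value: str, old_str: str, new_str: str) -> str:
    """Local model of memory_replace semantics: exact-substring swap.

    Raises if old_str is absent, mirroring the fact that a replace
    against stale text would fail rather than silently no-op.
    """
    if old_str not in block_value:
        raise ValueError("old_str not found in block -- replace would fail")
    return block_value.replace(old_str, new_str, 1)

before = "# Communication Style\nLength: 2-3 sentences\nTone: Friendly, conversational\n"
after = apply_memory_replace(before, "Length: 2-3 sentences", "Length: 1-2 sentences MAX")
```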
4. Sleeptime Memory Manager Configuration
Exact configuration for daily archival:
Create Sleeptime Agent
# Create memory manager agent
memory_manager = client.agents.create(
    name="Memory Manager (Sleeptime)",
    agent_type="sleeptime_agent",
    model="anthropic/claude-3-5-haiku",  # fast, cheap
    block_ids=[conversation_memory_block.id],  # attach the shared block (read/write)
    tools=["memory_replace", "archival_memory_insert"],
    system="""
You are the Memory Manager for a voice-first coding assistant.
# JOB
Run daily to:
1. Review conversation_memory block
2. Archive old conversations to archival memory
3. Keep only recent context in conversation_memory
# ARCHIVAL RULES
Archive to archival_memory:
✓ Conversations older than 7 days
✓ Completed projects/tasks
✓ Resolved issues
Keep in conversation_memory:
✓ Last 7 days of activity
✓ Active projects/ongoing tasks
✓ Unresolved issues
# PROCESS
1. Read conversation_memory
2. Extract entries older than 7 days:
   Example: "2026-02-01: Worked on auth bug" (older than the 7-day window)
3. Archive via archival_memory_insert with tags:
   - Tag: date (e.g., "2026-02")
   - Tag: topic (e.g., "auth-bug")
   - Tag: archived
4. Update conversation_memory via memory_replace (remove archived entries)
5. Keep conversation_memory under 3000 chars
# OUTPUT
Summarize what you archived and what remains.
""",
    enable_sleeptime=True
)
Configure Sleeptime Frequency
import requests

# Configure sleeptime frequency (24 hours)
# Note: nest inside manager_config (not top-level)
requests.patch(
    f"{client.base_url}/v1/agents/{memory_manager.id}",
    headers={"Authorization": f"Bearer {client.api_key}"},
    json={
        "manager_config": {
            "sleeptime": {
                "interval_seconds": 86400,  # 24 hours
                "min_messages": 1  # run even with a single new message
            }
        }
    }
)
Alternative: Scheduled Trigger (More Reliable)
# Use external cron to trigger sleeptime agent daily
# crontab entry:
# 0 2 * * * curl -X POST https://api.letta.com/v1/agents/{agent-id}/sleeptime \
# -H "Authorization: Bearer {api-key}"
# Or use Zapier: https://zapier.com/apps/letta/integrations
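If neither cron nor Zapier fits the deployment, the 02:00 trigger can be computed in-process with the stdlib. `next_run_delay` is a hypothetical helper, not part of any Letta SDK:

```python
from datetime import datetime, timedelta

def next_run_delay(now: datetime, hour: int = 2) -> float:
    """Seconds until the next occurrence of `hour`:00 local time."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # already past today's slot; fire tomorrow
    return (target - now).total_seconds()

# A long-running process would then sleep and fire the sleeptime trigger, e.g.:
# time.sleep(next_run_delay(datetime.now())); requests.post(trigger_url, headers=auth)
```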
Memory Block Structure Before/After Sleeptime
# conversation_memory - BEFORE sleeptime
label: conversation_memory
value: |
  # Recent Conversations
  2026-02-01: Worked on auth bug in login.py
  2026-02-03: Refactored database models
  2026-02-05: Added unit tests for auth
  2026-02-08: Started OAuth integration
  2026-02-10: Claude Code generated OAuth flow

# conversation_memory - AFTER sleeptime (Feb 11)
label: conversation_memory
value: |
  # Recent Conversations (Last 7 Days)
  2026-02-05: Added unit tests for auth
  2026-02-08: Started OAuth integration
  2026-02-10: Claude Code generated OAuth flow
  # Older entries archived to archival_memory
Archived entries go to archival_memory with tags:
# What the sleeptime agent does under the hood:
client.agents.passages.create(
    agent_id=refiner.id,  # main agent
    text="2026-02-01: Worked on auth bug in login.py",
    tags=["archived", "2026-02", "auth-bug"]
)
client.agents.passages.create(
    agent_id=refiner.id,
    text="2026-02-03: Refactored database models",
    tags=["archived", "2026-02", "refactoring"]
)
Refiner agent can retrieve via archival_memory_search if needed.
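The 7-day keep/archive split the sleeptime agent performs can be prototyped as a pure function, assuming entries follow the "YYYY-MM-DD: text" format shown in the blocks above:

```python
from datetime import date, timedelta

def split_entries(entries: list[str], today: date, keep_days: int = 7):
    """Partition dated memory entries into (keep, archive) by age."""
    cutoff = today - timedelta(days=keep_days)
    keep, archive = [], []
    for entry in entries:
        # Each entry starts with an ISO date followed by a colon
        entry_date = date.fromisoformat(entry.split(":", 1)[0])
        (keep if entry_date >= cutoff else archive).append(entry)
    return keep, archive

entries = [
    "2026-02-01: Worked on auth bug in login.py",
    "2026-02-03: Refactored database models",
    "2026-02-05: Added unit tests for auth",
    "2026-02-08: Started OAuth integration",
    "2026-02-10: Claude Code generated OAuth flow",
]
keep, archive = split_entries(entries, date(2026, 2, 11))
```

Running this against the BEFORE block reproduces the AFTER block: two entries archived, three kept.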
Complete Implementation Example
from letta_client import Letta

client = Letta(api_key="your-key")

# 1. Create memory blocks
user_prefs = client.blocks.create(
    label="user_preferences",
    value="""
# Communication Style
Length: 2-3 sentences
Tone: Friendly, conversational
Jargon: Avoid technical terms unless user is technical

# Learned Patterns
[Will be updated based on feedback]
""",
    limit=3000
)

conversation_memory = client.blocks.create(
    label="conversation_memory",
    value="# Recent conversations\n[Managed by sleeptime agent]\n",
    limit=5000
)

refinement_patterns = client.blocks.create(
    label="refinement_patterns",
    value="""
# Successful Translations
[Examples of what worked]

# Failed Translations
[Examples of what didn't work]
""",
    limit=5000
)

# 2. Create refiner agent (attach the existing blocks by ID)
refiner = client.agents.create(
    name="Response Refiner",
    agent_type="letta_v1_agent",
    model="anthropic/claude-3-5-sonnet",
    system=refiner_system_prompt,  # from section 1
    block_ids=[
        user_prefs.id,
        conversation_memory.id,
        refinement_patterns.id
    ],
    tools=["memory_replace", "memory_insert"]
)

# 3. Create sleeptime memory manager (shares conversation_memory)
memory_manager = client.agents.create(
    name="Memory Manager",
    agent_type="sleeptime_agent",
    model="anthropic/claude-3-5-haiku",
    system=memory_manager_prompt,  # from section 4
    block_ids=[conversation_memory.id],
    tools=["memory_replace", "archival_memory_insert"],
    enable_sleeptime=True
)

# 4. Configure sleeptime (24 hours) - use the PATCH request from section 4

# 5. Usage - Translation mode
def translate_response(user_transcript, claude_thinking, claude_response):
    message = f"""
[MODE: TRANSLATION]
User: "{user_transcript}"
Claude thinking:
{claude_thinking[:500]}...
Claude output:
{claude_response}
Translate to speech.
"""
    response = client.agents.messages.create(
        agent_id=refiner.id,
        messages=[{"role": "user", "content": message}]
    )
    # Extract conversational response for TTS
    return response.messages[-1].content

# 6. Usage - Feedback mode
def process_feedback(feedback_text):
    message = f"""
[MODE: FEEDBACK]
User feedback: "{feedback_text}"
Update preferences. No output needed.
"""
    client.agents.messages.create(
        agent_id=refiner.id,
        messages=[{"role": "user", "content": message}]
    )
    # No response expected - the agent updates memory silently
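Tying the two entry points together, the app layer only needs to route UI events to the right handler. A minimal dispatcher sketch; the event shape (`type` field with `"claude_turn"` / `"tapback"` values) is an assumption of this document, not a Letta API:

```python
def route_event(event: dict) -> str:
    """Map an app event to the refiner entry point it should invoke.

    Returns the handler name so routing stays testable without network
    calls; the real app would call translate_response / process_feedback.
    """
    if event.get("type") == "claude_turn":
        return "translate_response"   # speak the refined answer via TTS
    if event.get("type") == "tapback":
        return "process_feedback"     # silent memory update
    raise ValueError(f"unknown event type: {event.get('type')!r}")
```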
Key Takeaways
- System prompt: Focus on mechanics (modes, memory rules, operations), not behavior
- Mode switching: Single agent with a [MODE: ...] prefix in messages
- Memory hygiene: Explicit durable vs ephemeral classification in the prompt
- Sleeptime: Configure via manager_config.sleeptime, with interval_seconds nested inside
This architecture gives you a clean separation between real-time translation and background memory management, with clear guardrails against memory pollution.