For 14 days, I let AI agents run the first 90 minutes of my morning. Calendar prep, news brief, email triage, a workout plan tailored to my soreness, even what I should eat. I logged every minute, every API call, and every time something broke.
The outcome surprised me. Not because the agents were magical, but because the failure modes were so specific and so human. Here's the full breakdown - what to copy, what to skip, and what it actually costs.
Key Takeaways
- A 5-agent morning stack saved 38 minutes per day on average across 14 days, but only after a 6-day tuning phase
- Daily token spend stabilized at $1.40 - under $45/month - using a mix of Claude Haiku 4.5 for routine tasks and Sonnet 4.6 for synthesis
- The biggest wins came from agents that summarized existing data; the worst failures came from agents asked to decide without enough context
- Email triage was the highest-value automation; "what should I eat" was the lowest, despite feeling impressive
- An agent that runs 7 days a week without occasional human review will drift - schedule a weekly audit or it gets worse, not better
Short Answer
Did automating my morning routine actually work? Yes, but not the way the influencer videos promise. The agents that read inputs and produced summaries (calendar, news, email) saved real time from day one. The agents that tried to make decisions for me (workouts, meals, priorities) needed constant correction until I narrowed their scope. The honest verdict: AI agents are excellent personal interns and mediocre personal assistants - at least until you spend a week teaching them.
The full prompt set, agent definitions, and cost log from this experiment are open-sourced in the openbooklet/morning-stack repo. Skill packages for each agent are searchable on OpenBooklet.
The Stack: 5 Agents, 1 Orchestrator
I picked 5 high-impact tasks that consumed predictable time every morning. Each became its own agent, and a single orchestrator chained them together.
| Agent | Job | Model | Avg time saved/day |
|---|---|---|---|
| Brief | Pull 8 news sources, summarize what matters in my domain | Haiku 4.5 | 12 min |
| Calendar | Review today's meetings, prep notes per attendee from CRM | Sonnet 4.6 | 9 min |
| Triage | Sort overnight email into respond / archive / ignore | Haiku 4.5 | 11 min |
| Coach | Suggest a workout based on yesterday's strain + sleep | Sonnet 4.6 | 4 min |
| Kitchen | Recommend breakfast given fridge contents + macros | Haiku 4.5 | 2 min |
The orchestrator was a simple workflow that ran at 6:55 AM, fanned out to all 5 agents in parallel, and dropped results into a single markdown file in my Obsidian vault by 7:02.
Why fan-out, not chain
I tried both. Chaining made each agent depend on the previous one's output - clean architecturally, but a single failure killed the whole brief. Fan-out is uglier but resilient. If Coach times out (which happened twice), I still get my email triage and news. The trade-off is worth it for a daily workflow where partial output beats none.
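Stripped to its core, the fan-out looks like this - a minimal sketch rather than my actual script, with two stub agents standing in for the real five:

```python
import asyncio
from datetime import date
from pathlib import Path

# Stand-ins for the real agents - each is an async function returning a markdown section.
async def run_brief() -> str:
    return "- 3 stories matter today ..."

async def run_triage() -> str:
    raise TimeoutError("inbox API hung")  # simulate a bad morning

AGENTS = {"Brief": run_brief, "Triage": run_triage}  # the real stack maps all 5

VAULT = Path.home() / "obsidian" / "morning"

async def run_agent(name: str, fn) -> tuple[str, str]:
    # Cap each agent so one slow call can't hold the whole brief hostage
    try:
        return name, await asyncio.wait_for(fn(), timeout=90)
    except Exception as exc:
        # Degrade gracefully: a failed agent becomes a visible stub, not a crash
        return name, f"_{name} failed this morning: {exc}_"

async def main() -> None:
    # gather() runs every agent concurrently; failures are already absorbed above
    results = await asyncio.gather(*(run_agent(n, f) for n, f in AGENTS.items()))
    VAULT.mkdir(parents=True, exist_ok=True)
    body = "\n\n".join(f"## {name}\n{text}" for name, text in results)
    (VAULT / f"{date.today()}.md").write_text(body)

asyncio.run(main())
```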
"If your agent system can't degrade gracefully, it's not a system - it's a single point of failure with extra steps."
That's a lesson I learned the hard way on day 4, when a malformed JSON response from one agent took down the entire chain and I missed two morning meetings I should've prepped for.
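The guard that would have saved me is a few lines. A minimal sketch, assuming each agent returns a JSON string with a `summary` field - the schema here is illustrative, not a fixed contract:

```python
import json

def safe_section(name: str, raw: str) -> str:
    # Never let one malformed response poison the merged brief
    try:
        payload = json.loads(raw)
        return payload["summary"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fall back to a visible stub so the gap is obvious at 7 AM
        return f"_{name} returned malformed output - check the log_"
```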
The Tuning Phase Nobody Talks About
Days 1–6 were rough. The pipeline technically worked, but the output was practically useless: the agents summarized the wrong things, prioritized the wrong emails, and one of them kept suggesting leg day even when my watch reported I'd just done one.
The fix wasn't in the model. It was in context.
```
# Before - generic, brittle
You are a productivity assistant.
Summarize my calendar for today and prep me.
```

```
# After - scoped, opinionated
You are reviewing today's calendar for Priya, who works in growth marketing
at a B2B SaaS company. For each meeting:
- Pull the attendee from the CRM (provided as context)
- Note any open Linear tickets they own
- Flag if this is a first meeting or recurring
- Skip internal standups unless attendee count > 8
Output: one-line briefs only. No filler.
```
The "after" version cut prep output by 60% but increased usefulness by far more. Generic prompts produce generic output. Specific prompts produce useful output. Anthropic's prompt engineering guidance makes this point repeatedly, and it bit me hard until I followed it.
Treat each agent's prompt as a contract you renegotiate weekly. What you wrote in week 1 will be wrong by week 3 - your context, your priorities, and your input data all shift.
What Worked: Summarization Beats Decision-Making
Looking at the 14-day log, the clearest pattern was this: agents that summarized data I already had performed reliably. Agents that tried to decide on my behalf failed often.
- Brief had a 0% override rate by day 14 - by the end I wasn't editing the news summary at all
- Calendar had a 12% override rate - I added context the agent didn't have access to
- Triage had an 18% override rate - but rarely on what to ignore, mostly on tone of replies
- Coach had a 47% override rate - too many factors it couldn't see
- Kitchen had a 71% override rate - basically a glorified random suggester
The pattern holds straight down that list: the further an agent moved from "compress information" toward "make a value judgment," the more its accuracy collapsed. This matches what GitHub's Octoverse 2024 found across coding agents - humans still trust AI more for explanation than for prescription.
What I'd Do Differently
If I were starting fresh today, I'd cut the stack in half.
- Keep Brief, Calendar, and Triage - they pay for themselves daily
- Drop Coach and Kitchen - they were novelty, not utility
- Add a 7th-day audit - every Sunday, review the week's logs and tune the prompts
- Pin a model version - when Anthropic shipped a new Sonnet mid-experiment, two of my agents started behaving differently overnight (see the pinning sketch after this list)
- Don't chain - fan out - fragility hides until you're already running 7 days deep
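On pinning specifically, here's a minimal sketch using the Anthropic Python SDK. The dated model ID below is illustrative - copy the exact snapshot ID from the current model list rather than trusting mine:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Pin the dated snapshot, not a floating alias - aliases can silently move
# under you mid-experiment. This ID is illustrative; use the one you tuned against.
MODEL = "claude-sonnet-4-5-20250929"

def ask(prompt: str) -> str:
    message = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```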
I'd also build the stack inside a single OpenBooklet skill package instead of scattered scripts. When I tried sharing this with two friends, the fact that it was a hand-rolled mess of cron jobs and Python files killed any chance of them adopting it.
The Real Cost
Here's the 14-day cost breakdown, including the false starts and re-runs.
| Item | Cost |
|---|---|
| Anthropic API (Haiku + Sonnet) | $19.60 |
| OpenAI embeddings (for the brief retrieval) | $1.40 |
| n8n self-hosted (free, but $4 hosting allocation) | $4.00 |
| Total over 14 days | $25.00 |
| Projected monthly (steady state, no tuning) | ~$45 |
For 38 minutes per weekday, that's roughly $0.06 per saved minute. But the comparison isn't really cost-vs-time - it's cost-vs-cognitive-load. The actual win wasn't the minutes saved; it was starting the day with my decisions already prioritized, instead of fighting an inbox.
FAQ
Should I run agents on a cron, or trigger them manually?
It depends on your routine consistency. I run mine on a cron because my mornings start within a 20-minute window. If your schedule varies more, a manual trigger from your phone (an iOS Shortcut or a Telegram bot) is more reliable than a cron that fires when you're not there to consume the output.
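For reference, a weekday 6:55 trigger in crontab looks like this - the paths are illustrative:

```
# m  h  dom mon dow  command
55   6  *   *   1-5  /usr/bin/python3 /home/priya/morning/orchestrator.py >> /home/priya/morning/cron.log 2>&1
```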
What's the smallest version of this someone could try?
Start with one agent: email triage. It's the highest-ROI single automation in this stack. Spend a week tuning it before adding anything else. Most people who try to launch a 5-agent stack from day one give up by day 4 because the maintenance overhead is brutal during the tuning phase.
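If you want a concrete starting point, here's a minimal sketch of that single-agent version with the Anthropic Python SDK. The prompt, buckets, and model ID are illustrative, and it assumes you've already fetched the overnight emails as plain text:

```python
import anthropic

client = anthropic.Anthropic()

TRIAGE_PROMPT = """Sort each email into exactly one bucket: respond / archive / ignore.
respond = needs a reply from me today; archive = keep for reference; ignore = noise.
Return one line per email: <bucket>: <sender> - <subject>."""

def triage(emails: list[str]) -> str:
    # A Haiku-class model is plenty for classification; save the bigger model for synthesis
    message = client.messages.create(
        model="claude-haiku-4-5",  # illustrative - pin the dated snapshot you tuned against
        max_tokens=1024,
        messages=[{"role": "user", "content": TRIAGE_PROMPT + "\n\n" + "\n---\n".join(emails)}],
    )
    return message.content[0].text
```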
Can I do this without coding?
Yes, but with caveats. Tools like Make.com and Zapier now support AI agent steps, and they're sufficient for a 2–3 agent stack. However, once you hit 5+ agents that need shared context (like my CRM integration), you spend more time fighting the visual editor than it saves you.
How do I know when an agent has drifted?
The honest answer: you don't, until something embarrassing happens. The defense is the weekly audit - read 3 random outputs from each agent every Sunday and ask "would I have written this?" If the answer is no, retune the prompt. Drift is not always visible; only deliberate review catches it early.
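The audit itself barely needs tooling. A sketch, assuming each agent appends its daily output to `logs/<agent>/<date>.md` - the layout is illustrative:

```python
import random
from pathlib import Path

LOGS = Path("logs")  # illustrative layout: logs/<agent>/<YYYY-MM-DD>.md

# Surface 3 random outputs per agent for the Sunday read-through
for agent_dir in sorted(p for p in LOGS.iterdir() if p.is_dir()):
    outputs = sorted(agent_dir.glob("*.md"))
    for sample in random.sample(outputs, k=min(3, len(outputs))):
        print(f"--- {agent_dir.name} / {sample.name} ---")
        print(sample.read_text())
```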
Is this worth it for someone with a normal job?
If your morning involves any of: a calendar with 3+ meetings, an inbox with 30+ overnight emails, or content monitoring across multiple sources - yes, the email triage agent alone will pay for the entire stack. If your morning is "coffee, gym, code" with no information work, the answer is no.
Closing Key Takeaways
- Summarization wins, prescription fails - automate "compress this for me" tasks before you automate "decide for me" tasks
- The tuning phase is unavoidable - budget 6+ days before you trust any output
- Fan-out beats chaining for daily resilience - partial output beats none
Further reading: The Solopreneur's AI Stack | 10 AI Agent Skills Every Developer Should Install | Browse the Ideas hub | Explore agent skills on OpenBooklet