20 Emails, 20 Clicks: Gemini Spark Cleaned My Inbox and Proved Why "Agentic" Has Limits
Part 2 of putting Google's always-on assistant to work. This time it got every fact right and refused to do the job on its own — and the reason why says more about agentic AI than any demo.

Last time I asked Gemini Spark to generate a Google Doc. It drifted off-topic, invented a model that doesn't exist, and handed me a beautifully formatted document I had to rewrite. The lesson was: it acts freely, so don't trust the output.
This time I flipped the task. Instead of asking it to create something, I asked it to destroy something: clean a pile of marketing email out of my Gmail. And I got the exact opposite result. Every fact it reported was correct. It never drifted. It never hallucinated. And it almost completely failed to do the job on its own, for a reason that tells you more about agentic assistants than any demo will.
The task
My inbox had been collecting SiriusXM trial-expiry emails for months — the "your trial has ended, come back for $3/month" kind that pile up after a free trial in a new car. All from one sender, noreply@mc.siriusxm.com. I gave Spark a deliberately tight prompt:
Go through my Gmail and find every email from noreply@mc.siriusxm.com.
Tell me how many there are and the date range before doing anything.
Then move all of them to Trash. Do not touch any other sender.
When done, report exactly how many you deleted.
The "count and confirm first" instruction was the whole experiment. I wanted to see two things: would it get the scope right, and would it act on its own?
Where it shone: it got everything right
Spark came back fast and precise:
Total emails: 20
Date range: January 10, 2026 to April 23, 2026
I checked that against Gmail directly. It was exactly right: 20 messages, the same date range, all from that one sender, with nothing miscounted. After the doc-generation post, where it confidently cited a fake product, I was braced for an inflated count or a stray sender swept in. Neither happened. On the read-and-report half of the task, Spark was flawless: accurate, scoped, and honest enough to stop and ask before touching anything.
Then it asked: "Please confirm if you would like me to proceed with moving these 20 emails to the Trash." Good. I replied:
Yes, move all 20 to Trash. Don't touch anything else. Report the final count.
This is where it got interesting.
Where it stalled: the automation evaporates at the destructive step
My chat reply wasn't enough. Spark popped a separate "Confirm Trash" dialog — a Gmail-branded card showing the sender, subject, and a snippet, with Cancel and Allow buttons. So there are two layers of consent: my natural-language "yes," and then a hard UI confirmation. Fine, I thought, that's a sensible guardrail for anything that deletes data. I clicked Allow.
Then it showed me another Confirm Trash dialog. For the next email. And another. And another.
Spark was not batch-deleting 20 emails. It was asking me to individually approve each one — twenty separate "Are you sure?" cards, twenty Allow clicks. The card only ever previews a single message at a time, so I wasn't even reviewing the batch; I was rubber-stamping the same decision twenty times in a row.
It finished correctly. The final report was accurate again:
All 20 emails from noreply@mc.siriusxm.com have been moved to the Trash. No other emails or senders were affected.
Final Count of Deleted Emails: 20
I verified it: zero left in the inbox, all 20 sitting in Trash (recoverable, not permanently deleted), nothing else disturbed. Perfect execution. But the entire point of an "always-on, takes-action-for-you" assistant is that I shouldn't have to sit there clicking. The agentic promise — set a task, close the laptop, find it done — held right up until the moment the task involved removing something, and then it quietly handed all the work back to me.
The pattern worth taking away
Put the two posts side by side and the shape of the thing comes into focus:
The doc task: Spark acted freely and the output was unreliable. It created confidently and wrongly.
The inbox task: Spark's output was completely reliable and it wouldn't act freely. It refused to execute a destructive batch without human approval on every single item.
That's not a contradiction — it's the design. Generative actions (write a doc) are reversible and low-stakes, so Spark runs with scissors. Destructive actions (trash an email) are where it slams on the brakes, because the cost of an over-eager agent deleting the wrong thing is real and the vendor knows it. The guardrail is correct in principle. It's just calibrated so conservatively that it cancels out the automation it's supposedly protecting.
What I'd actually do instead
For a one-off bulk delete, native Gmail beats the agent decisively. Search from:noreply@mc.siriusxm.com, hit Select all, click Delete — all 20 gone in one action, no per-item theater. Spark's per-email gate is the wrong tool for "remove a known set of junk."
Where Spark's caution would actually earn its keep is the opposite shape of task: a fuzzy, judgment-heavy cleanup where you want to eyeball each candidate — "find emails that look like expired free-trial spam across all my senders." There, reviewing one at a time is a feature, not a tax. The mistake is using a careful-by-design agent for a job whose whole appeal is not having to be careful.
Gotchas from this run
Trashing email required per-item UI approval. I could not find an "Allow all" option on the Confirm Trash dialog, so plan for one click per message, not one click per task. I only tested Trash in this run; other destructive actions may behave differently.
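Your chat "yes" is not sufficient consent. Spark treats the natural-language confirmation and the UI Allow as separate gates. Expect to confirm twice for the first item and once more for every item after: for my 20 emails, that was one chat reply plus 20 Allow clicks, 21 confirmations in total.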
It defaults to Trash, not permanent delete. Everything lands in Trash and is recoverable for 30 days, which is the safe default — but it also means the messages aren't actually gone until Gmail purges Trash.
The scope held exactly. "Do not touch any other sender" was respected to the letter. The precise prompt — naming the exact sender and asking for a count first — is what kept it honest. Vague cleanup requests are where I'd expect this to go sideways.
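The click arithmetic behind the first two gotchas can be written down as a toy model. This is only a sketch of the behavior I observed, not Spark's actual implementation: GatePolicy, confirmations_needed, and the "batch" variant are hypothetical names I made up for illustration, and the batch policy describes an approval path that, as far as I could find, does not exist today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Hypothetical consent policy; a model of observed behavior, not Spark's code."""
    chat_confirmation: bool  # the natural-language "yes" typed in chat
    per_item_dialog: bool    # True: one Allow click per message; False: one per batch


def confirmations_needed(n_items: int, policy: GatePolicy) -> int:
    """Count the human confirmations needed to trash n_items messages."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if n_items == 0:
        return 0  # assumption: nothing to trash means nothing to approve
    chat = 1 if policy.chat_confirmation else 0
    dialogs = n_items if policy.per_item_dialog else 1
    return chat + dialogs


# What I observed: a chat "yes" plus a Confirm Trash dialog for every message.
observed = GatePolicy(chat_confirmation=True, per_item_dialog=True)
# A hypothetical "approve this whole batch" path.
hypothetical_batch = GatePolicy(chat_confirmation=True, per_item_dialog=False)

print(confirmations_needed(20, observed))            # 21
print(confirmations_needed(20, hypothetical_batch))  # 2
```

Under the observed policy the human cost grows linearly with the size of the cleanup; under the hypothetical batch policy it stays constant, which is the whole difference between supervising an agent and delegating to one.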
The honest verdict
Gemini Spark did not fail this task — it did it perfectly, and that's the problem. The accuracy was there, the scope was there, the safety was there. What wasn't there was the agency. For destructive work, Spark is less an autonomous assistant and more a very polite intern who reads every item back to you and waits for a nod before each one.
That's reassuring if you're worried about an AI nuking your inbox. It's useless if you wanted the AI to save you the clicks. Until there's a trusted "approve this whole batch" path, the honest call is: let Spark find and plan the cleanup, then do the bulk delete yourself in Gmail. Or, put differently — the same lesson as last time, from the other direction. Last time the warning was don't trust it to act without checking the result. This time it's don't expect it to act at all when the result can't be undone.
Have you hit this wall with Spark or another agentic assistant — careful right up to the point of being unhelpful? I'm curious whether the per-item approval shows up on other destructive actions (archiving, label changes, calendar deletes) or whether email deletion is just where Google decided to be the most paranoid.
