Six Months of Running AI Marketing Agents 24/7. Here's What Actually Broke.

The agents have been running for six months now. Eight of them. Across three websites. Generating briefs, writing drafts, rendering images, pinging search engines. Nobody hits a button. They wake up on a schedule and do their work.

This is not a "how to build agents" post. Those are everywhere. This is what I actually learned after letting them run.

The hard part was never the code. It was never the prompts either. The hard part was knowing whether the agents were doing good work or just doing work.

The Metric That Lies

Every agent platform shows you a dashboard. Tasks completed. Tokens consumed. Posts published. These numbers go up and to the right and everyone feels good.

But "posts published" is a vanity metric when you are the one who has to read them.

In month one, our content agent shipped a post that was technically flawless. Correct word count. Proper schema. Images rendered. Video attached. It also argued the exact opposite of something I had written two weeks earlier. Nobody caught it until I read it.

The agent did its job. It just did it wrong.

This pattern repeated across evaluation agents, research agents, and even the QA agent that was supposed to catch the other agents. Output volume went up. Trust in the output went sideways.

The Eval Stack I Had to Build

After a few of these incidents, I stopped celebrating task counts and built an evaluation stack. It has four layers.

Layer one: deterministic checks. Before anything ships, code verifies the floor. No em dashes in copy. Schema present. Image files exist at the right paths. Video rendered and playable. These checks catch about 30 percent of failures and cost almost nothing.
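A minimal sketch of what that floor can look like, assuming a draft is a small record with copy, serialized schema markup, and a list of image paths. The `Draft` fields and the `deterministic_checks` name are hypothetical, and the video check is left out because it depends on the rendering tool.

```python
from dataclasses import dataclass, field
from pathlib import Path

EM_DASH = "\u2014"  # the character the copy rule forbids


@dataclass
class Draft:
    # Hypothetical shape of a draft; real pipelines will carry more fields.
    body: str
    schema_json: str = ""
    image_paths: list[str] = field(default_factory=list)


def deterministic_checks(draft: Draft) -> list[str]:
    """Return a list of human-readable failures; empty means the floor is met."""
    failures = []
    if EM_DASH in draft.body:
        failures.append("em dash found in copy")
    if not draft.schema_json.strip():
        failures.append("schema missing")
    for path in draft.image_paths:
        if not Path(path).is_file():
            failures.append(f"image missing: {path}")
    return failures
```

Returning a list of failures rather than a single boolean keeps the report useful: one run shows everything that needs fixing before the draft moves on.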

Layer two: trace review. Every agent run produces a trace. What inputs it received. What tools it called. What it decided. I review a sample of traces weekly. Not all of them. Ten to fifteen per agent. This is where you find the "technically correct but wrong" failures that deterministic checks miss.
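Picking the weekly sample can itself be scripted. The sketch below assumes each trace is a dict with an "agent" key, which is an illustrative format and not any particular platform's. It draws up to a fixed number of traces per agent at random, so the review is not biased toward the runs that happened to look interesting.

```python
import random
from collections import defaultdict


def sample_traces(traces, per_agent=12, seed=None):
    """Group traces by agent and pick up to `per_agent` of each at random."""
    rng = random.Random(seed)  # a seed makes the weekly sample reproducible
    by_agent = defaultdict(list)
    for trace in traces:
        by_agent[trace["agent"]].append(trace)

    sample = {}
    for agent, items in by_agent.items():
        # Agents with few runs are reviewed in full.
        k = min(per_agent, len(items))
        sample[agent] = rng.sample(items, k)
    return sample
```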

Layer three: human override gates. For content that reaches customers, a human still approves. The agent drafts. The human reads. This is not because the agent is unreliable. It is because the cost of being wrong on a customer-facing page is higher than the cost of the five minutes it takes to read a draft.

Layer four: post-deployment monitoring. After content goes live, I watch for signals. Did the page get indexed? Did anyone link to it? Did it generate any inbound traffic? These signals close the loop between "we shipped" and "it worked."

The stack is simple. But nobody tells you to build it. The agent vendors sell you on "set it and forget it." That is a lie for anything customer-facing.

Three Failure Patterns I Did Not See Coming

After six months, three patterns emerged that were not obvious at the start.

The agent learns my slop. If I write a bad prompt, the agent amplifies it. Vague instructions produce vague output. This sounds obvious, but it compounds. A slightly weak research brief becomes a mediocre draft becomes a forgettable published post. The failure is not at the final step. It is in the brief. The agent faithfully executes the weak signal.

Agents drift when left alone. The prompt you wrote in January will not produce the same output in June. Models change. APIs update. New data patterns emerge. Without periodic recalibration, agents slowly wander off target. I now run a calibration check every two weeks on every active agent. Compare recent output against a golden set. If something shifted, fix it before the drift becomes visible to customers.
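One way to sketch a calibration check, assuming the golden set and the recent outputs are both dicts keyed by test-case id. The similarity measure here is plain character-level matching from the standard library, which is a crude stand-in. A real check would more likely use a rubric, embeddings, or a human read. The 0.6 threshold is an arbitrary placeholder.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough 0..1 similarity between two texts (character-level)."""
    return SequenceMatcher(None, a, b).ratio()


def calibration_check(golden: dict, recent: dict, threshold: float = 0.6) -> list:
    """Return the ids of golden cases whose recent output has drifted.

    A case missing from `recent` is compared against an empty string,
    so it is flagged whenever the golden text is non-empty.
    """
    drifted = []
    for case_id, expected in golden.items():
        actual = recent.get(case_id, "")
        if similarity(expected, actual) < threshold:
            drifted.append(case_id)
    return drifted
```

The check only says that something moved. Deciding whether the new output is worse, or whether the golden set is stale, is still a judgment call.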

Coordination failures are more common than individual failures. When eight agents hand off work to each other, the failure is rarely "agent four produced garbage." It is "agent two passed incomplete context to agent five, which made a reasonable but wrong decision." The most valuable debugging technique is not checking individual agent output. It is tracing the handoff.
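Tracing the handoff gets easier when every handoff is an explicit record that can be checked before the receiving agent runs. This is a sketch under assumptions: the `Handoff` record, the agent names, and the required fields in `REQUIRED_CONTEXT` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical: what each receiving agent needs in order to decide well.
REQUIRED_CONTEXT = {
    "drafting": {"brief", "audience", "prior_positions"},
    "qa": {"draft", "brief"},
}


@dataclass
class Handoff:
    sender: str
    receiver: str
    context: dict = field(default_factory=dict)


def missing_context(handoff: Handoff) -> set[str]:
    """Return required context keys that are absent or empty for the receiver."""
    required = REQUIRED_CONTEXT.get(handoff.receiver, set())
    present = {
        key for key, value in handoff.context.items()
        if value not in (None, "", [], {})
    }
    return required - present
```

For example, `missing_context(Handoff("research", "drafting", {"brief": "..."}))` reports that `audience` and `prior_positions` never arrived, which points at the sender and not at the agent that later made the reasonable but wrong decision.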

The Architecture That Survived

I started with a flat architecture. One agent did research. One did drafting. One did QA. They passed work forward like an assembly line.

That broke in month two. The QA agent would reject a draft, send it back, and nobody tracked the revision state. Drafts got stuck. Agents waited for each other. The queue backed up.

What works now is a hub-and-spoke pattern. A coordinator agent owns the pipeline state. Specialized agents do the work and report back. The coordinator tracks what is pending, what passed, what failed, and what needs human input.

The coordinator does not make content decisions. It manages workflow state. This separation, state management from content judgment, is the single most important architectural decision I made.
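A coordinator of this kind can be sketched as a small state machine. The state names, the allowed transitions, and the revision cap below are illustrative assumptions, not a description of the production system. The point is that the coordinator only records and validates state changes. It never looks at the content.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    IN_QA = "in_qa"
    NEEDS_REVISION = "needs_revision"
    NEEDS_HUMAN = "needs_human"
    PASSED = "passed"
    FAILED = "failed"


# Hypothetical workflow: which moves the coordinator will accept.
ALLOWED = {
    State.PENDING: {State.IN_QA},
    State.IN_QA: {State.NEEDS_REVISION, State.NEEDS_HUMAN, State.FAILED},
    State.NEEDS_REVISION: {State.IN_QA, State.FAILED},
    State.NEEDS_HUMAN: {State.PASSED, State.NEEDS_REVISION, State.FAILED},
    State.PASSED: set(),
    State.FAILED: set(),
}


class Coordinator:
    def __init__(self, max_revisions=3):
        self.items = {}      # item_id -> State
        self.revisions = {}  # item_id -> number of revision rounds so far
        self.max_revisions = max_revisions

    def add(self, item_id):
        self.items[item_id] = State.PENDING
        self.revisions[item_id] = 0

    def transition(self, item_id, new_state):
        """Apply a state change reported by a specialist agent or a human."""
        current = self.items[item_id]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"{item_id}: {current.value} -> {new_state.value} not allowed")
        if new_state is State.NEEDS_REVISION:
            self.revisions[item_id] += 1
            # Too many rounds: fail the item instead of letting it sit in the queue.
            if self.revisions[item_id] > self.max_revisions:
                new_state = State.FAILED
        self.items[item_id] = new_state
        return new_state

    def by_state(self, state):
        return [item_id for item_id, s in self.items.items() if s is state]
```

Counting revision rounds in one place addresses the stuck-draft problem directly: a rejected draft always has a known state and a known limit.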

What This Means for Marketing Teams

If you are building or buying agentic marketing tools, start with evaluation before you start with agents.

Define what good output looks like for your specific use case. Create ten examples. Then five examples of "close but wrong." Then five examples of "completely wrong." If you cannot articulate the difference between these categories, an agent cannot either.
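The example set can be kept honest with a small check. This sketch assumes each example is a dict with "label", "text", and "why" keys, and that every non-good example must say what is wrong with it. The key names and label strings are made up for illustration. The minimums are the counts given above.

```python
from collections import Counter

# Minimum examples per category, matching the counts suggested in the text.
MINIMUMS = {"good": 10, "close_but_wrong": 5, "completely_wrong": 5}


def eval_set_problems(examples):
    """Return a list of problems with a labeled example set; empty means usable."""
    problems = []
    counts = Counter(ex["label"] for ex in examples)

    for label in counts:
        if label not in MINIMUMS:
            problems.append(f"unknown label: {label}")
    for label, minimum in MINIMUMS.items():
        if counts[label] < minimum:
            problems.append(f"{label}: have {counts[label]}, need {minimum}")

    # A "wrong" example without a stated reason does not articulate the difference.
    for i, ex in enumerate(examples):
        if ex["label"] != "good" and not ex.get("why", "").strip():
            problems.append(f"example {i}: no explanation of what is wrong")
    return problems
```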

Build the deterministic checks first. They are cheap and they catch the obvious failures. Add trace review second. Add human gates for anything customer-facing.

And do not trust task counts. An agent that publishes fifty posts a month is not necessarily more valuable than one that publishes five posts that actually drive results. Measure results, not activity.

The agents are still running. They will wake up tomorrow and do their jobs. The difference between month one and month six is not that the agents got smarter. It is that the system around them got harder to fool.