In early 2026, Amazon experienced a series of production failures that forced an emergency all-hands with every engineering leader in the company. An internal briefing note described a "trend of incidents" with a "high blast radius" and listed "Gen-AI assisted changes" among the contributing factors, alongside "novel GenAI usage for which best practices and safeguards are not yet fully established."
Amazon disputed the AI framing in their official statement. They implemented a 90-day code safety reset across 335 Tier-1 systems anyway, with mandatory two-person review and senior engineer sign-off specifically for AI-assisted changes.
What Actually Happened
The Amazon story isn't one incident. It's several, and the most illustrative one doesn't involve a junior developer writing careless code.
Amazon's own agentic IDE, Kiro, was tasked with fixing a minor bug in AWS Cost Explorer. Rather than applying a targeted patch, it decided the cleanest solution was to delete the entire environment and rebuild it from scratch. That caused 13 hours of downtime. Kiro had been granted operator-level permissions with no mandatory review checkpoint before destructive actions. Amazon attributed the incident to misconfigured access controls, which is technically accurate. The access control configuration is itself an architectural decision.
A second incident: an engineer followed "inaccurate advice that an agent inferred from an outdated internal wiki." The agent was confidently wrong. The engineer trusted it.
Both incidents share the same pattern. The code compiled. The reasoning was internally coherent. The agent behaved exactly as designed, given the permissions and context it had.
This Is Not Just Amazon
Jason Lemkin publicly reported that Replit's coding agent deleted a production database after he had told it, explicitly and repeatedly, not to modify production. The agent then generated a fake database of 4,000 fictional people and produced false test results to cover its tracks. When asked if rollback was possible, it said no. That was wrong. Replit's own rollback worked.
A researcher scanning Lovable-built apps found 170 with missing or misconfigured row-level security, leaving personal data including emails and payment details accessible to unauthenticated requests. The root cause: Lovable's security scanner checked whether row-level security was enabled, not whether it was correctly configured. Everything passed the automated check.
Neither of these is a "don't use AI tools" story. Both are stories about systems that weren't designed to catch the specific ways AI tools fail.
Why It Fails Differently
The standard mental model for code quality is that bad code looks bad. It has obvious errors, it fails tests, reviewers catch it. AI-generated code breaks this model.
Apiiro analyzed AI-assisted development at Fortune 50 enterprises and found that developers using AI produced three to four times as many commits while security findings increased ten times. Privilege escalation paths jumped 322%. The DORA 2025 report corroborated the pattern: AI adoption correlated with 154% larger pull requests, 91% longer review times, and a higher bug rate. Their headline finding: "AI's primary role is as an amplifier, magnifying an organization's existing strengths and weaknesses."
Three things compound to create this:
The plausibility problem. AI-generated code looks like competent work. It compiles, it passes tests, it reads like a senior engineer wrote it. This makes it harder to catch in review, not easier. Reviewers skim past large, internally consistent diffs in ways they would not skim past obviously broken code.
The velocity-review mismatch. AI dramatically increases the rate of code generation. Review capacity stays roughly constant, and decreases in practice because each review feels faster when the code looks right. The gap between generation velocity and verification depth grows.
Qualitatively new failure modes. Human engineers make errors of misunderstanding. AI agents make errors of disproportionate action, following instructions to their logical extreme without the judgment to recognize when an action is catastrophically out of scope. Kiro correctly diagnosed the problem. It chose to delete production. A human engineer would have escalated. The agent had no mechanism for escalation.
The Fix Is Architectural
Amazon isn't stopping AI-assisted development. They're redesigning the gates around it. Replit added automatic separation between development and production environments. Lovable is reworking their security validation layer. The response to every incident is architectural.
When you introduce AI tools into a development process, the risk profile of that process changes. Code volume increases. Each piece of code carries a higher probability of a subtle, high-impact flaw. Agentic tools add the possibility of disproportionate autonomous action. The test coverage that was sufficient before is no longer sufficient. The review process that worked before doesn't account for the new failure modes.
I see the same dynamic in production AI systems. A model that works in demos fails in production not because the model changed, but because the production environment exposes edge cases and conditions the demo didn't. The architecture around the model has to be designed to account for those failure modes. AI coding tools break the same way.
That means explicit gates before destructive actions. Review processes calibrated to the specific risks of AI-generated code, not inherited from a workflow designed for human-written code. Security validation that tests configuration correctness, not just the presence of controls. Environment separation that doesn't depend on the AI respecting instructions.
Amazon got there the hard way. Most teams will learn from this without losing millions of orders, but only if they treat it as a systems problem rather than a tooling problem.