Playwright 1.59 dropped on April 1, 2026, and most of the QA crowd is still treating it like just another minor release. It isn't. This is the first Playwright release explicitly designed for AI agents, not humans, to drive the browser. The headline addition: three agent definitions — planner, generator, and healer — that let an LLM walk through your application, write tests, and fix the failing ones automatically.
I've spent the last two weeks integrating these into two client projects. Both are real production codebases — one fintech dashboard, one e-commerce checkout. This article is the honest breakdown: what each agent actually does, how to wire them into a real workflow, where they fall apart, and whether they justify the time investment.
Short answer: planner and healer are excellent. Generator needs a tight leash. Don't believe the demo videos.
Table of Contents
- What actually shipped in 1.59
- The Planner Agent: surveying your app
- The Generator Agent: writing the tests
- The Healer Agent: fixing failures automatically
- My current real-world workflow
- What breaks (so you don't repeat my mistakes)
- Should you switch to agent-driven testing?
- FAQs
What Actually Shipped in Playwright 1.59
Before the agents, here's the supporting infrastructure that makes them work:
- AI-optimized accessibility snapshots: flat structured trees an LLM can actually parse without choking on noise
- browser.bind(): agents can attach to your already-running browser session instead of cold-starting
- --debug=cli: a debugger surface designed for an LLM to step through, not a human clicking
- Async disposables: await using for cleaner cleanup when agents spawn contexts
- Screencast API: programmatic video so agents can show their work
- browserContext.setStorageState(): agents can persist auth without rebuilding it every run
Every one of these is in service of agentic workflows. The three agent definitions sit on top.
Install once you've upgraded:
npm install -D @playwright/test@1.59
npx playwright install
The agents themselves ship as Markdown definitions you point your coding assistant at. They aren't binaries — they're prompts with structured outputs. That's the part most people miss.
The Planner Agent: Surveying Your App
The planner is the agent I've used most. You point it at a URL — staging, local dev, whatever — and it walks the app like a manual tester would. It clicks buttons, fills forms with throwaway data, and produces a Markdown test plan organized by feature.
Here's roughly what the output looks like for a checkout flow:
# Test Plan: /checkout
## Critical paths
- Empty cart -> redirect to /products
- Single item -> stripe payment success -> /thank-you
- Multiple items with quantity changes -> tax recalculation
## Edge cases discovered
- Decline-card flow shows error inline (no toast)
- Coupon "FRIEND10" applies twice if you click apply rapidly
- Address autocomplete fails on UK postcodes longer than 7 chars
## Skipped (manual review needed)
- 3DS authentication popup - cannot complete without real card
That third section — "skipped, manual review needed" — is the part that tells me the planner is being honest. Earlier AI test-generation tools would just hallucinate a test that pretends 3DS works. The planner says "I can't do this, you handle it."
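Because the plan is plain Markdown, it is easy to post-process before review. The sketch below assumes the heading and bullet layout from the sample above (real planner output may differ) and groups bullets under their section headings so the skipped items can be pulled out for manual follow-up. Both function names are hypothetical helpers, not something Playwright ships.

```typescript
// Hypothetical helper (not part of Playwright): groups the bullets of a
// planner-style Markdown plan under their "## " section headings.
// Assumes the layout of the sample plan; real planner output may differ.
export function parsePlanSections(markdown: string): Map<string, string[]> {
  const sections = new Map<string, string[]>();
  let current: string | null = null;

  for (const rawLine of markdown.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith('## ')) {
      // A new section starts; later bullets belong to it.
      current = line.slice(3).trim();
      sections.set(current, []);
    } else if (line.startsWith('- ') && current !== null) {
      sections.get(current)!.push(line.slice(2).trim());
    }
  }
  return sections;
}

// Returns the bullets from any section whose heading starts with "Skipped".
export function skippedItems(markdown: string): string[] {
  const skipped: string[] = [];
  parsePlanSections(markdown).forEach((items, heading) => {
    if (heading.toLowerCase().startsWith('skipped')) {
      skipped.push(...items);
    }
  });
  return skipped;
}
```

Feeding the skipped list into a ticket or a review checklist keeps the "you handle it" items from getting lost once the plan goes to the generator.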
How I'm running it
I add the planner agent definition to my Claude Code or Cursor config and point it at the local dev server:
npx playwright test --agent=planner --base-url=http://localhost:3000 --output=plan.md
A run takes 3–8 minutes for a medium app. Read the Markdown, edit it, then hand it to the generator.
Where the planner shines
- Discovering edge cases you forgot existed (the coupon double-click was a real bug it found on a client project)
- Forcing you to articulate what "critical path" actually means before writing code
- Catching dead UI flows — pages that exist but no longer link from anywhere
Where it fails
- Auth-gated apps. Without storage state, the planner sees the login page and thinks that's the whole app.
- Anything behind a feature flag. It will write a plan based on the default state and miss 80% of your features.
- Apps with heavy real-time data (Socket.IO, WebRTC). It can't reliably interact with content that changes mid-survey.
Fix for auth: pre-populate storageState using multi-user auth state setup before invoking the planner. Then it walks past the login wall.
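One common way to pre-populate storage state is a setup project that logs in once and saves the session to disk. The fragment below is a minimal sketch using standard Playwright Test configuration. The file path and project names are assumptions, and whether a planner run picks up this file depends on how the agent is launched in your setup.

```typescript
// playwright.config.ts (fragment)
import { defineConfig, devices } from '@playwright/test';

// Assumed location for the saved session; any git-ignored path works.
const AUTH_FILE = 'playwright/.auth/user.json';

export default defineConfig({
  projects: [
    // Runs files such as auth.setup.ts, which should log in through the UI
    // and finish by calling page.context().storageState({ path }) with the
    // same path as AUTH_FILE.
    { name: 'setup', testMatch: /.*\.setup\.ts/ },
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'], storageState: AUTH_FILE },
      // Guarantees the auth file exists before any test in this project starts.
      dependencies: ['setup'],
    },
  ],
});
```

For multiple users, the same pattern repeats with one saved file per role and one project (or fixture) per file.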
The Generator Agent: Writing the Tests
The generator takes the Markdown plan and produces actual .spec.ts files. This is where I'm more cautious.
Run it like this:
npx playwright test --agent=generator --plan=plan.md --output=tests/
You'll get a folder of test files. Most look reasonable. Some are surprisingly good. A few are nonsense.
Here's a real generator output from my fintech client (lightly redacted):
import { test, expect } from '@playwright/test';

test.describe('Checkout - Decline card flow', () => {
  test('shows inline error, preserves cart state', async ({ page }) => {
    await page.goto('/checkout');
    await page.getByRole('textbox', { name: 'Card number' }).fill('4000000000000002');
    await page.getByRole('textbox', { name: 'Expiry' }).fill('12/30');
    await page.getByRole('textbox', { name: 'CVC' }).fill('123');
    await page.getByRole('button', { name: 'Pay' }).click();
    await expect(page.getByRole('alert')).toContainText(/declined/i);
    await expect(page.getByTestId('cart-summary')).toBeVisible();
  });
});
That's clean: role-based locators, web-first assertions, no hardcoded waits. The output follows the practices the Playwright docs promote, presumably because both the agent definition and the underlying model draw on them.
What I always change after generation
- Hardcoded test data. The generator inlines email addresses and card numbers. I move them to fixtures.
- Selectors for dynamic content. If the agent grabbed getByText('$24.99'), that breaks the moment pricing changes. I rewrite to getByTestId('total-price').
- Assertion granularity. Generator over-asserts. It checks 6 things when 2 would tell you the test failed. I prune.
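As an illustration of the first cleanup step, inlined card details can move into one small typed module that specs import. This is a dependency-free sketch; the module layout, the `CardScenario` type, and `cardFor` are hypothetical names, and the values are the ones from the generated spec above.

```typescript
// Hypothetical test-data module; names and layout are illustrative only.
export interface CardDetails {
  number: string;
  expiry: string;
  cvc: string;
}

export type CardScenario = 'declined';

const cards: Record<CardScenario, CardDetails> = {
  // Values copied from the generated spec shown earlier.
  declined: { number: '4000000000000002', expiry: '12/30', cvc: '123' },
};

// Returns a copy so one test cannot mutate data another test relies on.
export function cardFor(scenario: CardScenario): CardDetails {
  return { ...cards[scenario] };
}
```

In a Playwright project the same data could be exposed through a fixture created with test.extend; that wiring is left out here to keep the sketch self-contained.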
Where it ships nonsense
The generator hallucinates routes that don't exist. Twice in my fintech project it produced tests against /account/billing/invoices/download when the actual route was /billing/invoices. The planner had it right; the generator drifted.
Always run the generated tests once before committing: npx playwright test --reporter=list. Anything that fails on the first run is suspect.
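A cheap pre-check for drifted routes is to scan the generated specs for page.goto() targets and compare them with the routes you know exist. The helper below is a rough sketch under that assumption. `findUnknownRoutes` is a hypothetical name, the route list is something you would supply from your router or the plan, and the regex only catches literal string arguments.

```typescript
// Hypothetical helper (not part of Playwright): lists page.goto() paths in a
// spec's source text that are missing from a list of known routes.
export function findUnknownRoutes(
  specSource: string,
  knownRoutes: readonly string[],
): string[] {
  // Matches .goto('<path>') with single, double, or backtick quotes.
  const gotoPattern = /\.goto\(\s*(['"`])([^'"`]+)\1/g;
  const known = new Set(knownRoutes);
  const unknown = new Set<string>();

  let match: RegExpExecArray | null;
  while ((match = gotoPattern.exec(specSource)) !== null) {
    // Ignore query strings and fragments when comparing routes.
    const path = match[2].split(/[?#]/)[0];
    if (!known.has(path)) {
      unknown.add(path);
    }
  }
  return Array.from(unknown);
}
```

Anything this flags still needs a human look: a route built from variables or a base URL helper will slip past a text scan.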
The Healer Agent: Fixing Failures Automatically
This is the agent everyone gets excited about. The pitch: tests fail in CI, healer reads the failure, patches the test, opens a PR. End of flaky-test war.
The reality is more nuanced. Healer is excellent at fixing one specific class of failure: locator drift. A UI change breaks a test that targets data-testid="cart-btn" because someone renamed it to data-testid="checkout-btn". Healer reads the trace, finds the closest match, updates the test, done.
Healer is bad at: timing failures, race conditions, environment differences, anything where the application is genuinely broken. It will sometimes "fix" a real bug by relaxing the assertion until the test passes. That's worse than no automation at all.
How I'm using it safely
npx playwright test --agent=healer --trace=on --max-fix-attempts=2 --dry-run
The two flags that matter:
- --max-fix-attempts=2: stop the agent from infinitely loosening assertions
- --dry-run: output a diff, don't apply it
I never let healer push directly. It opens a PR. A human reviews. If the fix is "this assertion was wrong," I check whether the assertion was wrong or whether the app is wrong. About 30% of the time, healer's "fix" is hiding a real regression.
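Part of that review can be triaged mechanically. The sketch below counts expect( lines removed and added in a unified diff and flags any patch that removes more assertions than it adds. It is a crude heuristic with hypothetical names, and it cannot see a matcher that was merely loosened (for example an exact-text check swapped for a partial one), so it supplements the human review and does not replace it.

```typescript
// Hypothetical review aid (not part of Playwright).
export interface AssertionDelta {
  removed: number;
  added: number;
}

// Counts lines containing "expect(" that a unified diff removes or adds.
export function assertionDelta(unifiedDiff: string): AssertionDelta {
  const delta: AssertionDelta = { removed: 0, added: 0 };
  for (const line of unifiedDiff.split(/\r?\n/)) {
    // Skip the "--- a/file" and "+++ b/file" header lines.
    if (line.startsWith('---') || line.startsWith('+++')) continue;
    if (!line.includes('expect(')) continue;
    if (line.startsWith('-')) delta.removed += 1;
    else if (line.startsWith('+')) delta.added += 1;
  }
  return delta;
}

// True when the patch removes more assertions than it adds.
export function dropsAssertions(unifiedDiff: string): boolean {
  const { removed, added } = assertionDelta(unifiedDiff);
  return removed > added;
}
```

A pure locator rename shows up as one assertion removed and one added, so it passes; a patch that deletes a check outright gets flagged for a closer look.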
My Current Real-World Workflow
After two weeks of trial-and-error, here's how I'm actually shipping with the agents:
- Start of feature work. Run planner against staging. Read the plan. Discuss with the dev team. Edit the plan to remove what's out of scope.
- Generate skeleton tests. Run generator against the edited plan. Commit immediately to a separate branch — don't merge yet.
- Manual cleanup. Move test data to fixtures, fix hallucinated routes, tighten selectors. This takes 30–60 minutes per spec file.
- Run locally. If anything fails on the first run, fix it before pushing.
- CI run. Tests run on PR. If they pass, merge.
- Healer in CI (optional). Wired up to a scheduled job that runs every Sunday on the main branch. Fixes locator drift, opens PRs, I review Monday morning.
Time saved versus my pre-agent workflow: roughly 40% on greenfield test suites, 15% on existing test suites where I'm just adding coverage. Not the 90% the marketing suggests, but real.
What Breaks (So You Don't Repeat My Mistakes)
1. Don't run the planner on production
The planner clicks things. It will create users, post comments, and submit forms. I almost ran it against a client's production CMS in week one. Always staging or local. If you must use production, scope it to read-only routes and prepare to clean up.
2. Storage state goes stale faster than expected
Healer assumes your storageState.json is current. If your auth tokens expired between the last test run and the healer run, every test fails with a 401, and healer either gives up or, worse, "fixes" tests by removing the auth steps. Refresh storage state at the top of every healer run.
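A quick guard before a healer run is to inspect the saved storage state and refuse to start if the auth cookies are missing or about to expire. The sketch below relies on the cookie fields Playwright writes to the storage state file, where expires is a Unix timestamp in seconds and -1 marks a session cookie. The function name, the cookie names you pass in, and the five-minute margin are assumptions, and apps that keep tokens in localStorage need a different check.

```typescript
// Minimal slice of the storage state JSON shape: only the fields used here.
export interface StoredCookie {
  name: string;
  expires: number; // Unix time in seconds; -1 for session cookies
}

export interface StorageStateFile {
  cookies: StoredCookie[];
}

// Hypothetical helper: returns the required cookie names that are missing
// or that expire within marginSeconds of nowSeconds.
export function staleAuthCookies(
  state: StorageStateFile,
  requiredNames: readonly string[],
  nowSeconds: number,
  marginSeconds = 300,
): string[] {
  return requiredNames.filter((name) => {
    const cookie = state.cookies.find((c) => c.name === name);
    if (!cookie) return true; // a missing cookie counts as stale
    if (cookie.expires === -1) return false; // session cookie: no expiry recorded
    return cookie.expires <= nowSeconds + marginSeconds;
  });
}
```

In practice the state would come from JSON.parse on the contents of storageState.json, with Math.floor(Date.now() / 1000) as the current time. A non-empty result means re-run the login setup first.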
3. The generator does not understand business logic
It can write a test that submits a form. It cannot write a test that confirms a transaction posted to the right ledger account, or that a webhook fired in the right order. Anything beyond UI surface needs human authorship.
4. Token costs add up
Each planner run on a medium app burns roughly 80,000–120,000 tokens with a frontier model. At Claude Sonnet 4.6 rates that's $0.20–$0.40 per run. Generator and healer per spec file are smaller (~10–30k tokens). Set a monthly cap or you'll be surprised by your bill.
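The arithmetic behind a monthly cap is simple enough to script. The sketch below uses a single assumed flat price of $3 per million tokens, which is a placeholder and not a quoted rate: real pricing differs between input and output tokens and changes over time. Under that assumption, 80,000 to 120,000 tokens works out to roughly $0.24 to $0.36 per run, inside the range above.

```typescript
// Assumed price, NOT an official rate: replace with your provider's pricing.
const ASSUMED_USD_PER_MILLION_TOKENS = 3;

// Estimated cost of one agent run for a given token count.
export function estimateRunCostUsd(
  tokens: number,
  usdPerMillionTokens: number = ASSUMED_USD_PER_MILLION_TOKENS,
): number {
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

// Number of whole runs that fit inside a monthly budget.
export function runsWithinBudget(
  monthlyBudgetUsd: number,
  costPerRunUsd: number,
): number {
  if (costPerRunUsd <= 0) {
    throw new RangeError('costPerRunUsd must be positive');
  }
  return Math.floor(monthlyBudgetUsd / costPerRunUsd);
}
```

For example, runsWithinBudget(50, estimateRunCostUsd(120_000)) gives the number of worst-case planner runs a $50 cap would cover at the assumed price.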
Should You Switch to Agent-Driven Testing?
If you have an existing healthy Playwright suite: don't rip it out. Add planner for new feature exploration and healer for locator-drift maintenance. Skip generator — write those tests yourself, you'll be faster.
If you're starting a new project: start with planner-driven test creation. The output is good enough that you'll save real time, and the workflow forces you to plan before you code.
If you have no test suite at all: this is the strongest case for agents. Going from zero tests to a baseline regression suite used to be a two-week project. With planner + generator + manual cleanup, I did it for the e-commerce client in two days.
What I will not recommend: replacing your QA engineer with the agents. Healer's 30% "fix that hides a real bug" rate is exactly why a human reviewer is non-negotiable. The agents make a senior QA engineer faster. They do not make a junior QA engineer into a senior one.
FAQs
Do I need a Microsoft AI Foundry account to use the agents?
No. The agent definitions are open prompts. You can wire them into Claude Code, Cursor, GitHub Copilot, or any LLM that supports tool use. I run mine through Claude Sonnet 4.6 because the accessibility-snapshot reasoning is the best I've tested.
What about Playwright 1.60?
As of April 2026, 1.60 is in early release notes but not yet shipped to npm. Expect it to refine the agent prompts, not replace them. Pin to 1.59.x for now.
Can the healer fix tests for languages other than TypeScript?
Yes — Python and Java work, but the prompts are tuned for TypeScript. I've seen 60–70% success on Python projects versus 85%+ on TypeScript. The Java version is rough; I wouldn't depend on it for production.
Does it work with Playwright Component Testing?
Mostly no. Component testing is still experimental and the agents weren't trained on it. See my Component Testing post for the full story.
How does it compare to testRigor or Mabl?
Different category. testRigor and Mabl are full SaaS platforms with their own runtime. Playwright agents are local-first — your tests stay in your repo, run in your CI, cost you only the LLM tokens. I prefer the latter for client work where the IP needs to belong to the client, not the vendor.
Will Microsoft sunset the non-agent CLI?
Nothing in the 1.59 release notes suggests this. The non-agent flow is still the default. Agents are opt-in via --agent= flags.
Can I trust generator output for safety-critical code?
No. For payments, healthcare, anything regulated, write tests by hand and use the agents for exploratory coverage only. Healer should be disabled on those paths entirely.
Does the planner respect robots.txt or rate limits?
It respects rate limits configured in playwright.config.ts. It does not check robots.txt — but you shouldn't be running it on third-party sites anyway.
What if I'm using BDD (Cucumber) on top of Playwright?
The agents output raw Playwright Test files. Generating Cucumber feature files isn't supported in 1.59. playwright-bdd has its own roadmap for this — check their GitHub.
How do I roll back if the agents break my pipeline?
Don't merge agent-generated PRs without manual review. Keep agent runs on a separate branch or scheduled job, never on the main branch's PR gate. Worst case, delete the agent-touched specs and rerun manually.
Wrap-Up
Playwright 1.59 is the most consequential release in two years, but only if you treat the agents as accelerators for a senior QA engineer — not replacements for one. Planner is now part of my standard workflow. Generator I use selectively. Healer runs on a leash.
If your team is considering adopting these agents and you want a sanity check before committing, I do automation framework reviews and setup for teams transitioning to AI-assisted testing. Or book a free call and I'll tell you whether your current setup is a good candidate.
Tayyab Akmal
AI & QA Automation Engineer
6 years of catching critical bugs in fintech, e-commerce, and SaaS — then building the Playwright and Selenium automation that prevents them from shipping again.