Six months ago I inherited a QA suite from a fintech client. 412 Playwright tests. Average run time: 28 minutes. Critical-path coverage: about 60%. Edge-case coverage: barely 15%. Half the tests asserted the same successful login flow with slightly different parameters.
This is the most common state I find when I'm called in to fix a flaky CI pipeline. The problem isn't that there are too few tests. The problem is that there are too many of the wrong ones. And nobody knows which is which because the suite grew organically over three years and nobody's allowed to delete a test "in case it catches something."
That fintech suite is now 187 tests. Coverage is up. Run time is 9 minutes. Flakiness dropped from 8% to 0.4%. Here's the method.
Table of Contents
- Why test suites bloat (and why it's not your fault)
- The 5-step audit framework
- Step 1: Categorize by risk class
- Step 2: Find redundant happy-path tests
- Step 3: Identify dead tests
- Step 4: Spot the missing edge cases
- Step 5: Make the cuts (safely)
- How to measure that you didn't break anything
- FAQs
Why Test Suites Bloat (and Why It's Not Your Fault)
Three forces, all rational individually, compound into bloat:
- Bug-driven test creation. Every customer-facing bug results in "add a test for this so it doesn't regress." Over the years, you end up with 30 tests for the customer onboarding flow because 30 bugs happened in that area, and all 30 tests cover slightly different angles of the same happy path.
- Risk aversion. Nobody gets fired for adding a test. People do get blamed if they delete a test and the bug returns. So tests accrete and never get pruned.
- Lack of a coverage map. Most teams don't have a single document that shows "these are the user journeys we care about and how each is covered." Without one, you can't see what's redundant.
Bloat is a coordination failure, not an engineering failure. The fix requires building a coverage map, which is exactly what the audit produces.
The 5-Step Audit Framework
Block out a full day, ideally with one engineer who knows the product well. The output is a spreadsheet that becomes your coverage map.
Step 1: Categorize By Risk Class
For every test in your suite, assign one of three classes:
- Money path — Anything that touches payment, billing, account creation, or user data write. If it breaks, customers leave or sue.
- Critical path — Core user flow. Login, search, view, navigate. If it breaks, the product is unusable.
- Edge / nice-to-have — Settings, account preferences, secondary features. If it breaks, customers frown but keep using the product.
For the fintech suite, I expected ~30% money path, ~50% critical, ~20% edge. The actual breakdown was: 12% money path, 75% critical, 13% edge. Critical path was wildly over-covered, money path under-covered. That's the gap to close.
Spreadsheet column: risk_class.
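Once the risk_class column is filled in, the percentage breakdown is a simple count. Here is a minimal sketch of that count; the `AuditRow` shape, the `riskBreakdown` helper, and the sample rows are illustrative assumptions, not part of any tool.

```typescript
// Hypothetical shape of one spreadsheet row; only risk_class comes from the audit itself.
type RiskClass = 'money' | 'critical' | 'edge';

interface AuditRow {
  title: string;
  risk_class: RiskClass;
}

// Returns the share of tests in each risk class as whole-number percentages.
function riskBreakdown(rows: AuditRow[]): Record<RiskClass, number> {
  const counts: Record<RiskClass, number> = { money: 0, critical: 0, edge: 0 };
  for (const row of rows) {
    counts[row.risk_class] += 1;
  }
  const total = rows.length || 1; // avoid dividing by zero on an empty sheet
  return {
    money: Math.round((counts.money / total) * 100),
    critical: Math.round((counts.critical / total) * 100),
    edge: Math.round((counts.edge / total) * 100),
  };
}

// Made-up sample rows for illustration.
const sampleRows: AuditRow[] = [
  { title: 'checkout charges card', risk_class: 'money' },
  { title: 'login with valid credentials', risk_class: 'critical' },
  { title: 'search returns results', risk_class: 'critical' },
  { title: 'change avatar', risk_class: 'edge' },
];

console.log(riskBreakdown(sampleRows));
```

Compare the result against the split you expected before you started; the gap between the two is what the rest of the audit tries to close.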
Step 2: Find Redundant Happy-Path Tests
Group tests by the user journey they test. Within each group, look for tests that:
- Use the same setup and the same teardown.
- Differ only in input parameters that have no meaningful business effect.
- Assert the same outcomes.
Example I found in the fintech suite — three separate tests:
import { test } from '@playwright/test';
test('login with valid email and password', async ({ page }) => { /* ... */ });
test('login with valid credentials and remember me', async ({ page }) => { /* same flow + checkbox */ });
test('login with valid credentials returns user object', async ({ page }) => { /* same flow + extra assertion */ });
All three exercise the same login mechanism. The remember-me checkbox is a separate concern (does the cookie persist across browser restarts?) which the second test doesn't actually verify — it just sets the checkbox and never tests the persistence. The third test's "extra assertion" was about the user object structure, which is an API concern, not a UI concern.
Right answer:
- Keep one happy-path login UI test.
- Replace the API assertion with a separate API-only test (faster).
- Replace the remember-me test with one that actually closes and reopens the browser to verify persistence.
Three tests became two, but coverage went up because the new ones actually test what the old ones claimed to.
Spreadsheet columns: journey, redundant_with, action (keep / merge / rewrite / delete).
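A first pass over the journey column can be done mechanically. The sketch below flags tests that share the same journey, setup, and asserted outcomes; the `JourneyTest` shape and its hand-written labels are assumptions for illustration. It only catches exact matches, so near-duplicates still need a human read.

```typescript
// Hypothetical summary of a test, filled in by hand while reading the spec file.
interface JourneyTest {
  title: string;
  journey: string;       // e.g. 'login'
  setup: string;         // short label for the setup the test uses
  assertions: string[];  // outcomes the test checks
}

// Builds a key that is identical for tests with the same journey, setup, and outcomes.
function fingerprint(t: JourneyTest): string {
  return [t.journey, t.setup, t.assertions.slice().sort().join('|')].join('::');
}

// Returns groups of two or more test titles that share a fingerprint.
function redundancyCandidates(tests: JourneyTest[]): string[][] {
  const groups: Record<string, string[]> = {};
  for (const t of tests) {
    const key = fingerprint(t);
    const titles = groups[key] || [];
    titles.push(t.title);
    groups[key] = titles;
  }
  const result: string[][] = [];
  for (const key of Object.keys(groups)) {
    const titles = groups[key];
    if (titles && titles.length > 1) result.push(titles);
  }
  return result;
}

// Labels are made up; they paraphrase the three login tests described above.
const loginTests: JourneyTest[] = [
  { title: 'login with valid email and password', journey: 'login', setup: 'fresh session', assertions: ['lands on dashboard'] },
  { title: 'login with valid credentials and remember me', journey: 'login', setup: 'fresh session', assertions: ['lands on dashboard'] },
  { title: 'login with valid credentials returns user object', journey: 'login', setup: 'fresh session', assertions: ['lands on dashboard', 'user object shape'] },
];

console.log(redundancyCandidates(loginTests));
```

Each group it returns is a starting point for the redundant_with column, not a verdict.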
Step 3: Identify Dead Tests
A test is dead if any of these are true:
- It's been skipped (test.skip) for more than 90 days.
- It tests a feature that no longer exists in the product.
- It's been on the flaky-tests-quarantine list for more than 30 days.
- It tests a route that 404s in current production.
To find these mechanically:
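# List every skipped or fixme'd test with its file and line number
grep -rnE "test\.(skip|fixme)" tests/
# List test files touched in the last 90 days (the empty --pretty format drops commit headers)
git log --since="90 days ago" --name-only --pretty=format: -- tests/ | sort -u
# Test files that are still in the codebase but not in that list = candidates for review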
Don't auto-delete. Move them to a deprecated/ folder, run them in a quarantine project for two weeks, then delete if no one objects. The quarantine period catches the rare "actually that test was important."
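The four criteria for a dead test can also be written down as a small check, which keeps the review consistent across auditors. This is a sketch under assumptions: the `TestHealth` fields are hypothetical and would be filled in from your own skip list, quarantine list, and a production route check.

```typescript
// Hypothetical metadata gathered per test; the field names are assumptions for this sketch.
interface TestHealth {
  title: string;
  daysSkipped: number;       // 0 if the test is not skipped
  daysInQuarantine: number;  // 0 if the test is not on the flaky quarantine list
  featureExists: boolean;    // does the feature still exist in the product?
  routeStatus: number;       // HTTP status of the route under test in production
}

// Returns the reasons a test counts as dead; an empty array means it is alive.
function deadReasons(t: TestHealth): string[] {
  const reasons: string[] = [];
  if (t.daysSkipped > 90) reasons.push('skipped for more than 90 days');
  if (!t.featureExists) reasons.push('feature no longer exists');
  if (t.daysInQuarantine > 30) reasons.push('quarantined for more than 30 days');
  if (t.routeStatus === 404) reasons.push('route returns 404 in production');
  return reasons;
}

// Made-up examples.
const legacyExport: TestHealth = {
  title: 'legacy export page loads',
  daysSkipped: 120,
  daysInQuarantine: 0,
  featureExists: false,
  routeStatus: 404,
};
const checkoutHappyPath: TestHealth = {
  title: 'checkout happy path',
  daysSkipped: 0,
  daysInQuarantine: 0,
  featureExists: true,
  routeStatus: 200,
};

console.log(legacyExport.title, deadReasons(legacyExport));
console.log(checkoutHappyPath.title, deadReasons(checkoutHappyPath));
```

Any test with at least one reason goes to deprecated/ for the quarantine period rather than straight to deletion.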
Step 4: Spot the Missing Edge Cases
Now look at your bug tracker. Last 12 months, every customer-reported bug. For each one, check: is there a test that would have caught it?
For the fintech suite, I found:
- 11 bugs about coupon code validation. Zero tests on coupon edge cases (negative amounts, expired codes, stacking with other promotions).
- 7 bugs about session timeout behavior. Zero tests on session expiration mid-checkout.
- 5 bugs about international addresses. Zero tests on non-US postcodes.
- 3 bugs about decimal precision in currency. Zero tests on rounding edge cases.
Adding the missing edge cases is where bloat reduction creates capacity. Each test you cut frees ~9 seconds of CI time, and each edge-case test you add costs about the same, so a one-for-one swap leaves run time flat while coverage shifts toward what users actually break.
Spreadsheet column: missing_coverage (a list of bug IDs the suite would have missed).
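If the bug tracker can export to a list, the missing_coverage column can be built by grouping uncovered bugs by area. A minimal sketch, assuming a hypothetical `BugRecord` export where someone has already noted which test, if any, would have caught each bug; the ids and areas are made up.

```typescript
// Hypothetical bug-tracker export; ids and areas below are made up for illustration.
interface BugRecord {
  id: string;
  area: string;                 // e.g. 'coupon validation'
  coveringTest: string | null;  // test that would have caught it, or null if none
}

interface CoverageGap {
  area: string;
  bugIds: string[];
}

// Groups bugs that no test would have caught by area, largest gap first.
function missingCoverage(bugs: BugRecord[]): CoverageGap[] {
  const byArea: Record<string, string[]> = {};
  for (const bug of bugs) {
    if (bug.coveringTest !== null) continue;
    const ids = byArea[bug.area] || [];
    ids.push(bug.id);
    byArea[bug.area] = ids;
  }
  return Object.keys(byArea)
    .map((area) => ({ area, bugIds: byArea[area] || [] }))
    .sort((a, b) => b.bugIds.length - a.bugIds.length);
}

const exportedBugs: BugRecord[] = [
  { id: 'BUG-101', area: 'coupon validation', coveringTest: null },
  { id: 'BUG-102', area: 'coupon validation', coveringTest: null },
  { id: 'BUG-103', area: 'session timeout', coveringTest: null },
  { id: 'BUG-104', area: 'login', coveringTest: 'login.spec.ts' },
];

console.log(missingCoverage(exportedBugs));
```

Sorting by gap size puts the areas with the most uncaught bugs at the top, which is a reasonable order for writing the new edge-case tests.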
Step 5: Make the Cuts (Safely)
Make the cuts in three waves:
Wave 1: Move to deprecated/
Move (don't delete) the tests you've classified as redundant or dead. Run the suite. Confirm nothing broke. The remaining tests should still pass.
mkdir -p tests/deprecated  # git mv fails if the target folder doesn't exist yet
git mv tests/checkout/redundant-1.spec.ts tests/deprecated/checkout-1.spec.ts
Add a project to your config that runs deprecated tests separately:
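// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    // ... main projects; give each one testIgnore: /tests\/deprecated\//
    // so the quarantined files stop running in the main suite
    {
      name: 'deprecated-quarantine',
      testMatch: /tests\/deprecated\/.*\.spec\.ts/,
      retries: 0,
    },
  ],
});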
Run the deprecated project nightly for two weeks. If anything fails — the test you're about to delete actually catches a real bug — you've saved yourself.
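The two-week rule is easy to apply by hand, but writing it down removes arguments later. The sketch below turns a quarantined test's nightly results into a verdict; the `Verdict` labels and the `quarantineVerdict` helper are hypothetical names, and the 14-night default simply mirrors the two weeks above.

```typescript
// 'review': a failure showed up, so the test may be catching something real.
// 'delete': the test passed every night for the full quarantine period.
// 'wait':   no failures yet, but the quarantine period is not over.
type Verdict = 'review' | 'delete' | 'wait';

// nightlyPasses holds one entry per nightly run: true for pass, false for fail.
function quarantineVerdict(nightlyPasses: boolean[], requiredNights: number = 14): Verdict {
  if (nightlyPasses.some((passed) => !passed)) return 'review';
  return nightlyPasses.length >= requiredNights ? 'delete' : 'wait';
}

console.log(quarantineVerdict([true, true, false])); // a failure on night three
console.log(quarantineVerdict([true, true, true]));  // clean so far, still early
```

A 'review' verdict does not mean the test stays forever; it means someone looks at the failure before the file is deleted.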
Wave 2: Add the missing edge cases
Write the edge-case tests for the missing-coverage column. Use the bugs you found as test plans. The bug report describes the failure; your test exercises the same scenario and asserts the now-fixed behavior.
Wave 3: Delete deprecated
After two weeks, delete the deprecated/ folder. Commit message: "Remove redundant/dead tests after 2-week quarantine. Coverage map updated in /docs/test-coverage.md."
Yes — write a coverage map document. It's the artifact that prevents bloat from re-accumulating. Each entry maps a user journey to the test that covers it. New tests need a journey assignment before they merge.
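The "journey assignment before merge" rule can be checked automatically if the coverage map is also kept in a machine-readable form. This is a sketch under that assumption; the `CoverageEntry` shape, the helper names, and the file names are all hypothetical.

```typescript
// Hypothetical in-memory form of the coverage map document.
interface CoverageEntry {
  journey: string;  // user journey the team cares about
  tests: string[];  // spec files that cover it
}

// Spec files that are not assigned to any journey; these would block a merge.
function unassignedTests(coverageMap: CoverageEntry[], specFiles: string[]): string[] {
  const assigned: Record<string, boolean> = {};
  for (const entry of coverageMap) {
    for (const file of entry.tests) assigned[file] = true;
  }
  return specFiles.filter((file) => !assigned[file]);
}

// Journeys listed in the map that have no test at all.
function uncoveredJourneys(coverageMap: CoverageEntry[]): string[] {
  return coverageMap.filter((entry) => entry.tests.length === 0).map((entry) => entry.journey);
}

// Made-up map and file list.
const coverageMap: CoverageEntry[] = [
  { journey: 'login', tests: ['login.spec.ts'] },
  { journey: 'checkout with coupon', tests: [] },
];
const specFilesInRepo = ['login.spec.ts', 'profile-avatar.spec.ts'];

console.log(unassignedTests(coverageMap, specFilesInRepo));
console.log(uncoveredJourneys(coverageMap));
```

Run as a CI step, the first list catches new tests that skipped the journey assignment and the second shows journeys that still need a test.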
How to Measure That You Didn't Break Anything
Three metrics, before and after:
- Escape rate — bugs that reach production despite the suite passing. Should stay flat or improve.
- CI run time — should drop substantially (in my fintech case, from 28 min to 9 min).
- Flakiness — flaky-test rate as a percentage of total runs. Often improves dramatically because you've removed the flaky tests that were testing nothing important.
Track these in your engineering dashboard for 30 days post-cleanup. If escape rate goes up, you cut something important; reverse-engineer which test caught what and bring it back. In my experience this happens for about 1 in 50 deletions, which is why the quarantine wave matters.
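A before-and-after comparison of the three metrics might look like the sketch below. The `SuiteSnapshot` fields are assumptions, escape rate is approximated as a raw bug count over equal time windows, and only the run time and flakiness figures mirror the fintech case; the escaped-bug and run counts are made up.

```typescript
// Hypothetical snapshot of the three metrics over a fixed window; field names are assumptions.
interface SuiteSnapshot {
  escapedBugs: number;  // bugs that reached production while the suite was green
  ciMinutes: number;    // average CI run time in minutes
  flakyRuns: number;    // runs counted as flaky by your team's own definition
  totalRuns: number;
}

// Flaky-run rate as a percentage of total runs.
function flakyRate(s: SuiteSnapshot): number {
  return s.totalRuns === 0 ? 0 : (s.flakyRuns * 100) / s.totalRuns;
}

// Names the metrics that got worse after the cleanup; empty means nothing regressed.
function regressions(before: SuiteSnapshot, after: SuiteSnapshot): string[] {
  const worse: string[] = [];
  if (after.escapedBugs > before.escapedBugs) worse.push('escape rate');
  if (after.ciMinutes > before.ciMinutes) worse.push('CI run time');
  if (flakyRate(after) > flakyRate(before)) worse.push('flakiness');
  return worse;
}

const beforeCleanup: SuiteSnapshot = { escapedBugs: 6, ciMinutes: 28, flakyRuns: 80, totalRuns: 1000 };
const afterCleanup: SuiteSnapshot = { escapedBugs: 5, ciMinutes: 9, flakyRuns: 4, totalRuns: 1000 };

console.log(regressions(beforeCleanup, afterCleanup));
```

If 'escape rate' ever shows up in that list, that is the signal to go back through the deleted tests.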
FAQs
What if my team won't let me delete tests?
Move them to deprecated/ instead of deleting. Run them in a separate nightly project. If they fail, you keep them. If they don't fail for 30 days, your team has data to support deletion. Don't argue from intuition; argue from the quarantine results.
How do I know if a test is testing the same thing as another?
Compare three things: setup, the action being tested, and the assertions. If all three overlap by >80%, it's a redundancy candidate. Read the test, not just the title — title overlap doesn't always mean implementation overlap.
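One way to put a number on "overlap by >80%" is to label each test's setup, action, and assertions by hand and compare the labels. The sketch below uses Jaccard similarity (shared labels divided by all distinct labels), which is my choice of measure rather than a standard; the `TestOutline` shape and the labels are hypothetical, and each list is assumed to contain no duplicates.

```typescript
// Hypothetical hand-written outline of a test; each list is assumed to have no duplicates.
interface TestOutline {
  setup: string[];
  action: string[];
  assertions: string[];
}

// Jaccard similarity: shared labels divided by all distinct labels (1 when both are empty).
function overlap(a: string[], b: string[]): number {
  const shared = a.filter((item) => b.indexOf(item) !== -1).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 1 : shared / union;
}

// True only when setup, action, and assertions each overlap by more than the threshold.
function isRedundancyCandidate(a: TestOutline, b: TestOutline, threshold: number = 0.8): boolean {
  return (
    overlap(a.setup, b.setup) > threshold &&
    overlap(a.action, b.action) > threshold &&
    overlap(a.assertions, b.assertions) > threshold
  );
}

// Made-up outlines.
const payToastA: TestOutline = { setup: ['logged-in user', 'item in cart'], action: ['click Pay'], assertions: ['success toast shown'] };
const payToastB: TestOutline = { setup: ['logged-in user', 'item in cart'], action: ['click Pay'], assertions: ['success toast shown'] };
const declinedCard: TestOutline = { setup: ['logged-in user', 'item in cart'], action: ['click Pay'], assertions: ['declined toast shown'] };

console.log(isRedundancyCandidate(payToastA, payToastB));
console.log(isRedundancyCandidate(payToastA, declinedCard));
```

The score only ranks candidates; reading the two tests side by side is still what decides whether one goes.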
What about tests that test "defense in depth"?
Real defense-in-depth means testing the same outcome via different layers — UI test for happy path, API test for the data contract, unit test for the business logic. That's not redundancy; that's good architecture. Redundancy is three UI tests that all click the same button and assert the same toast.
Should I do this audit alone or with the team?
Solo audit, then team review. Solo gets you a draft fast. Team review catches the "actually that test is here for a reason" cases without the bottleneck of consensus on every line.
How often should I re-run the audit?
Every 6 months for active suites. Every 12 months for stable suites. Bloat creeps back. The audit is a maintenance practice, not a one-time event.
What about visual regression tests?
Audit them too — they bloat fastest. One snapshot per page is usually enough. If you have 50 snapshots of the same page in different states, most of them are wallpaper.
Does this work for unit tests too?
Same framework, different specifics. Unit-test bloat is usually about over-testing implementation details. Same audit method works.
Should I cut tests that take longer than 30 seconds?
Don't cut by duration alone — cut by value. Some long tests are valuable (full checkout flows). Some short tests are worthless (asserting that 2+2=4 because it covers a code path no user touches). Time is one input, not the answer.
What's the ROI of doing this?
For the fintech client: 19 minutes saved per CI run × roughly 80 runs per week ≈ 25 hours of compute saved per week, plus engineer wait-time savings and less time spent investigating flaky failures. Probably 30+ hours of engineering capacity per month. Over a year, that is on the order of two months of a senior engineer's time.
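The compute figure is simple arithmetic, sketched here so you can plug in your own numbers. The helper name is hypothetical; the inputs are the fintech figures from this article.

```typescript
// Hours of CI compute saved per week, given minutes saved per run and runs per week.
function weeklyComputeHoursSaved(minutesSavedPerRun: number, runsPerWeek: number): number {
  return (minutesSavedPerRun * runsPerWeek) / 60;
}

const minutesSavedPerRun = 28 - 9; // run time before minus run time after
const hoursSavedPerWeek = weeklyComputeHoursSaved(minutesSavedPerRun, 80);

console.log(hoursSavedPerWeek.toFixed(1)); // prints "25.3"
```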
How do I sell this to my manager?
Frame it as risk reduction, not test reduction. "Our current suite has 400 tests but misses 60% of the bugs that customers report. I'd like to spend a week mapping coverage and rebalancing it." That's a manager-friendly version of "I want to delete a bunch of tests."
Wrap-Up
Test suites bloat for rational reasons but compound into a real cost. The audit is a 1–2 day exercise that pays back in CI time, flakiness reduction, and most importantly, coverage that actually maps to risk. The output is a coverage map — the artifact that prevents the bloat from coming back.
If your team has 300+ tests and a CI suite that takes 25+ minutes, this is exactly the engagement I do under framework cleanup. Or book a free call and we'll triage your top three problem areas on the call.
Tayyab Akmal
AI & QA Automation Engineer
6 years of catching critical bugs in fintech, e-commerce, and SaaS — then building the Playwright and Selenium automation that prevents them from shipping again.