
The Flaky Test Chronicles I: The Reckoning

Why Your CI Fails at Random

December 18th ʼ25 · 17 min · 3323 words

Your CI pipeline's success rate shouldn't look like a coin flip simulation.

You know the feeling. You pick up a Jira ticket. You plan the implementation. You write the code. You even do TDD - actual test-first development, not the “I’ll write tests later” kind.

Everything works.
Tests pass.
Coverage looks good.
You push to GitHub, open a PR, and wait for the pipeline.

Then the GitHub Actions runner finishes:
Red.
Multiple failures.

You check the logs. Some failures are in your new tests - but they passed locally, three times in a row. Others are in tests you've never touched. A payment processing test failed. You changed a notification service. How are these even related?

You click “Re-run failed jobs.”
It passes.
You merge.
You move on.

The PR That Should Have Been Green

Sound familiar? Of course it does. Every developer has lived this moment. Most of us have lived it hundreds of times.

For months, this was our workflow. “Just rerun it” became our unofficial QA process.

PR fails?
Rerun.
Fails again?
Rerun twice.
Eventually it passes.
Ship it.

Then one day, someone actually counted: 76 out of 10,500+ tests were playing Russian roulette with our CI pipeline. That's about 0.72%. Sounds small, right? Here's the thing about 0.72%: when you run tests hundreds of times a day across a team, you're guaranteed to see these failures.

Every.
Single.
Day.
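
To see why a sub-1% share of flaky tests still hurts, here is a back-of-the-envelope sketch. The 1% per-run failure rate is a made-up number for illustration, not a measurement, and real flaky tests are neither equally flaky nor independent of each other.

```php
<?php

// Probability that a whole run is green when $flakyTests tests each fail
// independently with probability $failureRate.
function chanceOfGreenRun(int $flakyTests, float $failureRate): float
{
    return (1.0 - $failureRate) ** $flakyTests;
}

$flakyTests = 76;     // the count quoted above
$failureRate = 0.01;  // ASSUMPTION: illustrative only, not measured

printf(
    "Chance of a green run: %.1f%%\n",
    chanceOfGreenRun($flakyTests, $failureRate) * 100
);
```

Under that assumption, fewer than half of all runs come back green. That is the coin flip from the top of this article.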


How Did We Get Here?

The path to test suite hell is paved with good intentions and bad habits.

It starts innocently:

A test fails.
You rerun it.
It passes.
"Must be timing," you think.
You don't investigate.
Why would you?
You have features to ship, deadlines to meet, and a backlog that won't wait for your test infrastructure.
Then it happens again.
And again.
And suddenly you have a culture.

The culture looks like this:

  • “Flaky” becomes a label, not a problem to solve. Tests get marked as flaky and forgotten. People fix them occasionally - the same tests, multiple times - but they keep coming back like zombies.

  • Trust erodes slowly, then all at once. At some point, your team stops believing test failures mean anything. Red CI? Whatever. Merge it. This is when bugs start sneaking through.

  • Merge requests get blocked by tests that have nothing to do with your changes. You touched a CSS file. A payment processing test failed. Everyone knows it's unrelated. Everyone clicks rerun anyway. Twenty minutes of CI time, gone.

  • Technical debt compounds. Quick fixes mask real issues. Someone weakens an assertion from “equals 3” to “not empty” because the count keeps changing. The test passes now. It also proves nothing.

We lived in this world for too long.


The Numbers That Actually Matter

First, some context. We're talking about a test suite with:

  • 10,500+ tests

  • ~40,000 assertions

  • ~8 minutes CI runtime

That's not a toy project. Changes here ripple.

Here's the thing about test suite statistics: most of them are noise. Test counts go up and down as features ship and code gets deleted. Assertion counts fluctuate with refactoring. In a startup environment where big features ship fast and pivot faster, these numbers tell you almost nothing about test quality.

But some numbers matter. These are the ones we tracked:

Metric                  Before   After
Flaky Failures          76       0
PHPUnit Notices         42       0
Alias/Overload Mocks    53       0
Risky Tests             12       0

Let's talk about what these actually mean.

76 flaky failures went to zero. These were the tests that passed locally and failed in CI. The ones that “just needed a rerun.” Every single one was a real bug - either in the test or in production code. We found the root cause for each.

42 PHPUnit notices eliminated. These were risky tests - tests without assertions. PHPUnit was literally telling us "this test proves nothing" and we ignored it. A test that doesn't assert anything is not a test. It's a lie you tell CI.

53 alias and overload mocks killed. Every one of them was a parallelization time bomb. We replaced them with the container binding pattern (covered in Part 2: Mock Madness). This took the most effort but had the biggest impact on CI reliability.

12 risky tests fixed. Tests that ran but didn't actually verify anything. They probably worked once, then broke silently as the codebase evolved. Nobody noticed because they never failed - they just stopped testing.


The Flakiness Rule

If a test fails intermittently, treat it as a deterministic bug.
Period.
Full stop.
No exceptions.

The bug is either:

  • In the test itself: Bad assumptions, global state leaks, time dependencies, race conditions in test setup, or reliance on execution order.

  • In production code: Hidden side effects, non-deterministic database queries, code paths that behave differently under load or timing.

There's no such thing as “just flaky.” There are only bugs you haven't found yet.

When someone says “that test is flaky,” what they're really saying is “I don’t know why this fails and I don’t want to find out.” And look, I get it. Debugging flaky tests is soul-crushing work. But ignoring them is worse.

Every flaky test is a tiny crack in your test suite's credibility. Enough cracks and the whole thing becomes useless. Why run tests at all if you can't trust the results?


The Six Deadly Sins of Flaky Tests

Across 300+ commits, we cataloged every flaky test we fixed. They all fell into six categories.

Time

Time is the sneakiest cause of flakiness. Your code uses now(), today(), or Date::parse().

Your test runs at 11:59 PM. It passes.
Tomorrow it runs at 12:01 AM. It fails.
Same code. Same data. Different day.

The usual suspects:

  • Timestamp comparisons in payloads

  • TTL and cache expiration logic

  • Business day calculations (“next Wednesday”)

  • Scheduled jobs and cron-dependent code

The fix: Freeze time.

$this->travelTo(now()->startOfMinute());
// ... test logic that involves timestamps ...
$this->travelBack();

Your test now lives in a frozen moment. Time doesn't pass. Timestamps don't drift. Life is predictable. This is the way.
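
To make the midnight failure concrete, here is a minimal sketch in plain PHP. isDueToday() is a hypothetical helper invented for this example, and the clock is passed in as a parameter so the snippet runs without Laravel. In a real test, travelTo() plays the role of the fixed $now.

```php
<?php

// Hypothetical helper: true when $dueAt falls on the same calendar day as $now.
function isDueToday(DateTimeImmutable $dueAt, DateTimeImmutable $now): bool
{
    return $dueAt->format('Y-m-d') === $now->format('Y-m-d');
}

$utc = new DateTimeZone('UTC');
$dueAt = new DateTimeImmutable('2025-12-18 12:00:00', $utc);

// Same code, same data, two different wall-clock times.
$beforeMidnight = new DateTimeImmutable('2025-12-18 23:59:00', $utc);
$afterMidnight = new DateTimeImmutable('2025-12-19 00:01:00', $utc);

var_dump(isDueToday($dueAt, $beforeMidnight)); // bool(true)
var_dump(isDueToday($dueAt, $afterMidnight));  // bool(false)
```

A test that asserts true here is only correct for as long as the clock cooperates. Pinning $now removes the dependency.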

Randomness

“I’ll use $faker->numberBetween(2, 5) so my test covers different cases!”

No. No, you won't.

What you've actually done is create a test that sometimes creates 2 records, sometimes 3, sometimes 4, sometimes 5. Your assertion expects exactly 3. Congratulations, you've invented chaos engineering without any of the benefits.

// This is not testing. This is gambling.
$orders = Order::factory()
    ->times($this->faker->numberBetween(2, 5))
    ->create();

$this->assertCount(3, $orders);  // 25% success rate: only 1 of the 4 possible counts is 3

The fix: Use fixed, deterministic values.

$orders = Order::factory()->times(3)->create();

Boring? Yes. Predictable? Also yes. Reliable? Absolutely yes.

If you want to test multiple cases, use a DataProvider. Test all the cases. Explicitly. Don't roll the dice and hope you're covering edge cases.
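
Here is a rough sketch of what that can look like, assuming PHPUnit 10 or newer (for the #[DataProvider] attribute) and PHP 8. discountPercentFor() and its tiers are invented for the example.

```php
<?php

use PHPUnit\Framework\Attributes\DataProvider;
use PHPUnit\Framework\TestCase;

// Hypothetical function under test: discount tier by number of orders.
function discountPercentFor(int $orderCount): int
{
    return match (true) {
        $orderCount >= 5 => 15,
        $orderCount >= 3 => 10,
        default => 0,
    };
}

final class DiscountTiersTest extends TestCase
{
    // Every case is listed explicitly, so each run covers the same inputs.
    public static function orderCounts(): array
    {
        return [
            'two orders'   => [2, 0],
            'three orders' => [3, 10],
            'four orders'  => [4, 10],
            'five orders'  => [5, 15],
        ];
    }

    #[DataProvider('orderCounts')]
    public function test_discount_matches_order_count(int $orderCount, int $expectedPercent): void
    {
        $this->assertSame($expectedPercent, discountPercentFor($orderCount));
    }
}
```

Each named case shows up separately in the test output, so a failure tells you which count broke instead of leaving you to guess what Faker rolled.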

Ordering & Timestamp Precision

“I created two records at the same time.”

You didn't. You created one record, then a few milliseconds later, you created another. Usually this doesn't matter. Sometimes it does.

// Production code:
$hasNewerSubmission = $user->submissions()
    ->where('created_at', '>', $submission->created_at)
    ->exists();

// Test code: create two records "at the same time"
$sub1 = Submission::factory()->create();
$sub2 = Submission::factory()->create();

// Surprise: sub2->created_at is 3 milliseconds after sub1
// Your boolean just became a coin flip

The fix: Set timestamps explicitly.

$createdAt = now()->startOfSecond();

$submissions = Submission::factory()
    ->for($user)
    ->count(2)
    ->create([
        'created_at' => $createdAt,
        'updated_at' => $createdAt,
    ]);

Now they really are created at the same time. The universe is deterministic again.
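
Why a few milliseconds turn into a coin flip can be shown without a database. The sketch below assumes created_at is stored with whole-second precision (a common default, but check your own schema). storedCreatedAt() and hasNewerSubmission() are hypothetical stand-ins for the column and the Eloquent query, not the real implementation.

```php
<?php

// ASSUMPTION: the column keeps whole seconds, so fractions are dropped on write.
function storedCreatedAt(string $wallClock): string
{
    $utc = new DateTimeZone('UTC');

    return (new DateTimeImmutable($wallClock, $utc))->format('Y-m-d H:i:s');
}

// Hypothetical stand-in for: where('created_at', '>', $submission->created_at)
function hasNewerSubmission(string $currentCreatedAt, string $otherCreatedAt): bool
{
    return strcmp($otherCreatedAt, $currentCreatedAt) > 0;
}

// Both pairs are 3 ms apart; only the second pair straddles a second boundary.
$sameSecond = hasNewerSubmission(
    storedCreatedAt('2025-12-18 10:00:00.500'),
    storedCreatedAt('2025-12-18 10:00:00.503')
);
$nextSecond = hasNewerSubmission(
    storedCreatedAt('2025-12-18 10:00:00.998'),
    storedCreatedAt('2025-12-18 10:00:01.001')
);

var_dump($sameSecond); // bool(false)
var_dump($nextSecond); // bool(true)
```

Which branch you get depends on where in the second the test happened to run. Writing the timestamps yourself takes the clock out of the picture.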

Global State Leaks

The test that works alone but fails when run with others. The classic “it works on my machine, in isolation, when the moon is full” problem.

Common culprits:

  • Locale and timezone: Your code formats dates. Another test changed the locale. Your assertions fail.

  • Config cache: You mocked a config value. Another test has cached config. Your mock gets ignored. (See Part 4: The Cache vs Mock Race)

  • Eloquent events: A test called Model::unsetEventDispatcher(). Your factory relies on a creating event to set defaults. Your model is now incomplete.

// If you do this...
Model::unsetEventDispatcher();

// ...you MUST do this in the same test class
protected function tearDown(): void
{
    Model::setEventDispatcher($this->app['events']);
    
    parent::tearDown();
}
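
The same restore-what-you-changed rule applies to any global, not just the event dispatcher. Here is a plain PHP sketch using the default timezone. formatLocalHour() and withTimezone() are hypothetical helpers made up for illustration; the try/finally does the job tearDown() does in a test class.

```php
<?php

// Hypothetical code under test: formats an instant in the default timezone.
function formatLocalHour(int $unixTimestamp): string
{
    return date('H:i', $unixTimestamp);
}

// Runs $test under a temporary default timezone and always restores the
// previous one, even if $test throws.
function withTimezone(string $timezone, callable $test): mixed
{
    $previous = date_default_timezone_get();
    date_default_timezone_set($timezone);

    try {
        return $test();
    } finally {
        date_default_timezone_set($previous);
    }
}

date_default_timezone_set('UTC');
$noonUtc = gmmktime(12, 0, 0, 12, 18, 2025); // 2025-12-18 12:00:00 UTC

$insideTokyo = withTimezone('Asia/Tokyo', fn () => formatLocalHour($noonUtc));
$afterwards = formatLocalHour($noonUtc);

var_dump($insideTokyo); // string(5) "21:00"
var_dump($afterwards);  // string(5) "12:00"
```

Without the finally block, every test that runs afterwards in the same process would format dates in Tokyo time and fail for reasons that have nothing to do with its own code.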

Parallelization Hazards

Parallel test execution is wonderful until it isn't. The most common failure mode: Mockery alias mocks.

Mockery\Exception\RuntimeException: Could not load mock App\Services\SomeClass, class already exists

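This error happens when a test calls Mockery::mock('alias:...') for a class that is already loaded in the same PHP process. An alias mock has to declare the class itself, and PHP cannot declare a class twice. Under parallel execution, which tests end up sharing a worker process can change from run to run, so whether the real class got loaded first is down to luck. Everything breaks. Your CI turns red.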

The fix: Don't use alias mocks. Use the container binding pattern instead. We dedicated all of Part 2: Mock Madness to this because we had to kill 53 of them.
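
Part 2 has the real details. As a rough idea of the direction only, here is a plain PHP sketch: the code under test asks a container for its dependency, so a test can hand it an ordinary fake object instead of redefining a class at runtime. TinyContainer is a toy stand-in for a service container, and PaymentGateway, FakeGateway, and CheckoutService are invented names.

```php
<?php

interface PaymentGateway
{
    public function charge(int $amountInCents): bool;
}

// Test double: an ordinary object, so no class is declared at runtime.
final class FakeGateway implements PaymentGateway
{
    public array $charges = [];

    public function charge(int $amountInCents): bool
    {
        $this->charges[] = $amountInCents;

        return true;
    }
}

// Hypothetical, minimal stand-in for a service container.
final class TinyContainer
{
    private array $instances = [];

    public function instance(string $abstract, object $concrete): void
    {
        $this->instances[$abstract] = $concrete;
    }

    public function make(string $abstract): object
    {
        return $this->instances[$abstract]
            ?? throw new RuntimeException("Nothing bound for {$abstract}");
    }
}

final class CheckoutService
{
    public function __construct(private TinyContainer $container) {}

    public function pay(int $amountInCents): bool
    {
        // Resolved through the container instead of a static call or `new`.
        return $this->container->make(PaymentGateway::class)->charge($amountInCents);
    }
}

$container = new TinyContainer();
$fake = new FakeGateway();
$container->instance(PaymentGateway::class, $fake);

$paid = (new CheckoutService($container))->pay(1999);

var_dump($paid);          // bool(true)
var_dump($fake->charges); // array with one element: 1999
```

In a Laravel test the equivalent move is typically $this->app->instance(PaymentGateway::class, $fake). Each test gets its own binding, so it does not matter which other tests ran earlier in the same process.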

Factories That Depend on Observers

Your factory creates a User. An observer fires on create and sets their role. Another test disabled event dispatching. Your User has no role. Your test fails.

This is subtle and maddening.

// Another test somewhere called:
Model::unsetEventDispatcher();

// Your factory relies on this observer to set the role:
// UserObserver::creating() { $user->role = UserRole::OWNER; }

// Now your test creates a User...
$user = User::factory()->create();

// ...and the role is null. Test fails.

The fix: Set critical attributes explicitly in factory states. Don't trust events to run.

User::factory()->state(['role' => UserRole::OWNER])->create();

Yes, it's more verbose. Yes, it's more reliable. Pick reliable.


Practical Patterns

Here are the patterns we use daily. Bookmark this section.

Freeze Time for Payload Comparisons

When comparing serialized data that includes timestamps:

$this->travelTo(now()->startOfMinute());

$order = Order::factory()->create();
$expectedPayload = OrderReportDTO::fromOrder($order);

ProcessOrderJob::dispatchSync($order->id);

// Timestamps match because time is frozen
$this->assertDatabaseHas('order_reports', [
    'payload' => $expectedPayload->toString(),
]);

$this->travelBack();

Stabilize created_at Comparisons

When your code checks “is there a newer record?”:

$createdAt = now()->startOfSecond();

$submissions = Submission::factory()
    ->for($user)
    ->count(2)
    ->create([
        'created_at' => $createdAt,
        'updated_at' => $createdAt,
    ]);

// Now the "newer record exists?" check behaves consistently

How to Verify Your Fix Actually Fixed It

You fixed a flaky test. Or did you? Here's how to be sure.

Step 1: Run it many times

for i in {1..100}; do
    php artisan test --filter=it_handles_payment_status || break
done

If it fails once, you haven't fixed it. You've just made it less frequent.

Step 2: Run in random order

./vendor/bin/phpunit --order-by=random

This exposes global state leaks. If your test only passes when run after some other test, you have a dependency bug.

Step 3: Run in parallel

php artisan test --parallel --filter=YourTestClass

This exposes parallelization issues. Race conditions. Alias mock collisions. All the fun stuff.


The Flakiness Checklist

Print this. Tape it to your monitor. Use it when writing or reviewing tests.

  • No random values affect test logic or counts. Faker is for names and emails, not for determining how many records to create.

  • Time is frozen when relevant. If your code uses now() anywhere, your test probably needs travelTo().

  • Time is reset afterwards. Don't let frozen time leak to other tests.

  • Assertions don't rely on implicit ordering. If order matters, assert it explicitly. If it doesn't, don't assume it.

  • Global state changes are restored in tearDown(). Same test class. Same file. No exceptions.

  • No alias/overload mocks. Not even one. Not even “just this once.”

  • Factories set critical attributes explicitly. Don't rely on observers that might be disabled.


What's Next

Part 1 was the why. The pain. The anatomy of a flaky test suite.

Part 2: Mock Madness is where Mockery shows its dark side. We had 53 alias and overload mocks across multiple test files. Every single one of them was a ticking time bomb in parallel execution. We'll cover:

  • Why alias and overload mocks are fundamentally broken for parallel testing

  • The container binding pattern that saved us

  • Why @runInSeparateProcess is not a solution (it's a surrender)

  • PHPUnit mocks vs Mockery mocks, when to use which

See you there. And maybe fix that one flaky test before you continue. You know the one.


The Flaky Test Chronicles is a series documenting what we learned from 300+ commits of test suite cleanup. May your CI be green and your reruns be few.