Testing Code That Uses Feature Flags
Every feature flag is a runtime branch. The function that calls client.getBooleanValue('checkout-redesign', false) has two possible behaviors, and the production rollout decides which one your users see. If your tests only cover one branch, you are shipping code that nothing has ever run.
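For concreteness, here is a minimal sketch of the kind of function in question. The renderCheckout helper and its markup are invented for illustration (they match the test examples later in this article), but the flag call is the standard OpenFeature API:

import { OpenFeature } from '@openfeature/server-sdk';

// Hypothetical application code: one flag check, two runtime branches.
export async function renderCheckout(user: { id: string }): Promise<string> {
  const client = OpenFeature.getClient();
  const useRedesign = await client.getBooleanValue(
    'checkout-redesign',
    false, // the default, and often the only branch unit tests ever run
    { targetingKey: user.id },
  );
  return useRedesign
    ? '<main>redesigned checkout</main>'
    : '<main>classic checkout</main>';
}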
This is the most common feature-flag failure mode: the default is false, the unit tests pass against false, the rollout flips to 1%, and the first user to land in the bucket triggers an exception nobody caught. The fix is not “test more carefully.” It is structural. Every flag check needs both branches tested every time, and the tooling should make that the path of least resistance.
This article walks through the patterns: how to mock OpenFeature without faking the SDK, how to parametrize tests over both branches, what changes for integration and end-to-end tests, and how percentage rollouts fit in.
Don’t mock at the call site
Most people's first instinct is to spy on the OpenFeature client and return whatever the test needs:
jest.spyOn(OpenFeature, 'getClient').mockReturnValue({
getBooleanValue: () => Promise.resolve(true),
} as any);
This works, but it has problems that grow over time:
- The mock replaces the entire client, including hooks, context handling, and error semantics. Your test no longer exercises the real evaluation pipeline.
- A refactor that changes the call (a new context attribute, a switch from getBooleanValue to getBooleanDetails) silently breaks the contract between mock and reality.
- The mock has no opinion about flag keys. A typo in the production code ('chekout-redesign') will still pass any test that mocks the client wholesale.
The better pattern is to swap only the provider, and let the rest of the OpenFeature SDK run for real.
Use the InMemoryProvider
The OpenFeature SDKs ship an InMemoryProvider for exactly this. You give it a map of flag keys to values, register it as the provider, and your application code runs against the real client.
import { OpenFeature, InMemoryProvider } from '@openfeature/server-sdk';
beforeEach(async () => {
await OpenFeature.setProviderAndWait(new InMemoryProvider({
'checkout-redesign': {
defaultVariant: 'off',
variants: { on: true, off: false },
disabled: false,
},
}));
});
afterEach(() => OpenFeature.close());
Same idea in Python:
from openfeature import api
from openfeature.provider.in_memory_provider import InMemoryProvider, InMemoryFlag
api.set_provider(InMemoryProvider({
'checkout-redesign': InMemoryFlag(
default_variant='off',
variants={'on': True, 'off': False},
),
}))
The application code under test does not change. It calls client.getBooleanValue('checkout-redesign', false) exactly like it does in production. Only the source of the value is different.
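A quick sanity check (not something every suite needs) makes the point: the real client runs, and the value it returns comes from the in-memory flag set rather than the code-level default. This assumes the beforeEach setup above:

it('reads the in-memory value through the real client', async () => {
  const client = OpenFeature.getClient();
  // Pass a deliberately different default to prove the value came from the provider.
  await expect(
    client.getBooleanValue('checkout-redesign', true),
  ).resolves.toBe(false); // defaultVariant is 'off' in the setup above
});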
Test both branches, every time
The single most useful habit is to parametrize every flag test over both values. Once you do this consistently, the “tested off, shipped on” failure mode disappears.
Jest:
describe.each([
{ flag: false, expected: 'classic checkout' },
{ flag: true, expected: 'redesigned checkout' },
])('renderCheckout when checkout-redesign=$flag', ({ flag, expected }) => {
beforeEach(async () => {
await OpenFeature.setProviderAndWait(new InMemoryProvider({
'checkout-redesign': {
defaultVariant: flag ? 'on' : 'off',
variants: { on: true, off: false },
},
}));
});
it('renders the expected variant', async () => {
const html = await renderCheckout(testUser);
expect(html).toContain(expected);
});
});
pytest:
@pytest.mark.parametrize('flag,expected', [
(False, 'classic checkout'),
(True, 'redesigned checkout'),
])
def test_render_checkout(flag, expected):
api.set_provider(InMemoryProvider({
'checkout-redesign': InMemoryFlag(
default_variant='on' if flag else 'off',
variants={'on': True, 'off': False},
),
}))
assert expected in render_checkout(test_user)
The discipline is one extra row in the parameter table. The payoff is that no flag check ever ships without coverage on both sides.
For string or JSON flags, the same pattern extends naturally: parametrize over every variant your code branches on. If you have a checkout-variant flag with 'control', 'a', and 'b', your tests should run all three.
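Here is a sketch of that three-variant case, reusing the Jest pattern from above. The checkout-variant key and its variants come from the example; the expected output strings are placeholders, and this assumes a renderCheckout (or similar) that branches on the string flag:

describe.each([
  { variant: 'control', expected: 'classic checkout' },
  { variant: 'a', expected: 'checkout layout A' },
  { variant: 'b', expected: 'checkout layout B' },
])('renderCheckout when checkout-variant=$variant', ({ variant, expected }) => {
  beforeEach(async () => {
    await OpenFeature.setProviderAndWait(new InMemoryProvider({
      'checkout-variant': {
        defaultVariant: variant,
        variants: { control: 'control', a: 'a', b: 'b' },
        disabled: false,
      },
    }));
  });

  it('renders the expected variant', async () => {
    const html = await renderCheckout(testUser);
    expect(html).toContain(expected);
  });
});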
Integration tests: predictable, isolated
Unit tests cover individual functions in isolation. Integration tests cover the whole stack: HTTP routes, database, real SDK initialization. The temptation is to point integration tests at your production Flipswitch project so the SDK code runs end-to-end. Don’t.
Two problems show up the moment you do:
- Anyone who toggles a flag in the dashboard can break CI without realizing it.
- Test traffic shows up in your production evaluation analytics, polluting whatever you measure rollouts against.
Two clean alternatives:
Option A: keep using the InMemoryProvider in integration tests. Same setup as unit tests, registered once during the test harness boot. You lose coverage of the real network path, which most projects do not need to test in CI.
Option B: a dedicated Flipswitch project for tests. Create a separate project (or environment) with its own API key, wire that key into CI, and treat the flags there as fixtures. Nothing else writes to that project, so flag values are stable across runs.
Option A is the default. Reach for B only when the network path itself is part of what you need to verify.
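For Option A, "registered once during the test harness boot" can be as simple as a shared setup file. A sketch assuming Jest with a setupFilesAfterEnv entry; the file name and the shared testFlags fixture module are illustrative:

// jest.setup.ts — listed under setupFilesAfterEnv in jest.config
import { OpenFeature, InMemoryProvider } from '@openfeature/server-sdk';
import { testFlags } from './test-flags'; // shared flag fixtures (hypothetical module)

beforeAll(async () => {
  // One provider registration per test file; individual tests can still swap it.
  await OpenFeature.setProviderAndWait(new InMemoryProvider(testFlags));
});

afterAll(() => OpenFeature.close());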
Percentage rollouts deserve their own thinking
Percentage rollouts are deterministic per targeting key. The same user always lands in the same bucket for the same flag, because the SDK hashes targetingKey + flagKey and compares to the percentage. That makes them testable, but only if you control the targeting key.
Two practical approaches:
// Approach 1: skip percentages in tests entirely.
// Test at 0% (everyone off) and 100% (everyone on). Trust the rollout math
// to do the right thing in between, because that math is the SDK's job, not yours.
// Approach 2: pin specific targeting keys that you know land in known buckets.
const inBucket = { targetingKey: 'test-user-A' };
const outOfBucket = { targetingKey: 'test-user-Z' };
Approach 1 is almost always the right answer. Your application code does not contain the hash function, so testing percentages tests your provider, not your code. Cover the two endpoints (off and on), and let the rollout machinery handle the rest.
If you genuinely need to test percentage behavior (you are writing the provider, or doing a migration where you need to confirm bucket parity between two systems), Approach 2 works. Just commit to keeping the test fixtures stable: changing the targeting key flips the bucket.
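If you do take Approach 2, the pinned contexts above are just evaluation contexts passed into the call. A sketch, assuming a dedicated test project where checkout-redesign has a partial rollout and the two keys have been confirmed (once, manually) to land on opposite sides of it:

it('respects the rollout buckets for the pinned users', async () => {
  const client = OpenFeature.getClient();

  // These assertions hold only as long as the fixture keys stay unchanged.
  await expect(
    client.getBooleanValue('checkout-redesign', false, inBucket),
  ).resolves.toBe(true);
  await expect(
    client.getBooleanValue('checkout-redesign', false, outOfBucket),
  ).resolves.toBe(false);
});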
End-to-end tests against a real environment
Once the code ships, you still want a smoke test that runs against the real Flipswitch project. The pattern that works: don’t ramp up a percentage rollout until a single targeted test user has hit the new path successfully.
The earlier gradual rollouts tutorial covered this from the operational side: enable the flag for an internal targeting rule first, hit the feature manually or with an automated smoke test, and only then start ramping the percentage. From a testing angle, this is your real-environment integration test. Two minutes of work, catches everything that mocked providers can’t (real network paths, real auth, real database state).
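The smoke test itself is mostly plumbing. A sketch, assuming the real provider is already registered during app startup (provider setup omitted) and that an internal targeting rule in Flipswitch enables the flag for a dedicated smoke-test user; the targeting key is illustrative:

it('serves the new checkout to the internal smoke-test user', async () => {
  const client = OpenFeature.getClient();

  // 'smoke-test-user' must match the internal targeting rule configured in Flipswitch.
  const enabled = await client.getBooleanValue('checkout-redesign', false, {
    targetingKey: 'smoke-test-user',
  });
  expect(enabled).toBe(true);

  // Exercise the real path end to end, not just the flag value.
  const html = await renderCheckout({ id: 'smoke-test-user' });
  expect(html).toContain('redesigned checkout');
});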
Cleaning up: tests are part of the flag
A flag’s lifecycle does not end when production code stops branching on it. The tests for the dead branch outlive the flag unless you remove them deliberately.
When you delete a flag check from production:
1. Replace the flag check with the surviving branch.
2. Delete the dead branch.
3. Drop the parametrized test cases for the removed branch.
4. Archive the flag in Flipswitch.
Skipping step 3 leaves test code that exercises code paths nobody can reach. It compiles, it runs, it asserts true things, and it pretends to add coverage. Over time this is how test suites slow down and confuse readers about what the production behavior actually is. Treat it as part of the cleanup checklist, alongside the other practices that keep flags from becoming permanent fixtures.
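A before-and-after sketch of the first two steps, using the hypothetical renderCheckout from the start of the article:

// Before: the flag check is still a runtime branch.
const useRedesign = await client.getBooleanValue(
  'checkout-redesign',
  false,
  { targetingKey: user.id },
);
return useRedesign
  ? '<main>redesigned checkout</main>'
  : '<main>classic checkout</main>';

// After: the redesign won; the flag check and the classic branch are gone (steps 1 and 2).
return '<main>redesigned checkout</main>';

// Step 3: delete the describe.each rows for the 'off' variant.
// Step 4: archive checkout-redesign in Flipswitch.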
Wrapping up
Feature flags are runtime branches, and runtime branches need tests. The InMemoryProvider gives you a real OpenFeature client with controllable values, parametrized tests cover both branches without extra discipline, and percentage rollouts are testable as long as you stick to the endpoints (0% and 100%) and let the SDK handle what’s in between.
The discipline is small: two test rows instead of one, every time you add a flag check. The payoff is never being the person who shipped the path nobody ran.
Try Flipswitch for free; setting up a project and your first flag takes about 5 minutes. The tests will be ready when the flag is.