Most mobile teams assume IAP pricing is untestable without re-submitting to the App Store or Google Play. That’s true for the displayed price tier — tier 8 vs tier 9 is locked at submission. But what your backend returns as the bundle contents for that price is a config value, and config values are exactly what Fleack tests. This recipe walks you through testing the bundle composition for a fixed-price starter pack, measuring both first-purchase conversion and downstream ARPPU so you catch cannibalisation before you ship.

What you’ll measure

Test the bundle a user receives for a $4.99 starter pack SKU (iap_starter_pack_4_99). The price is fixed; the contents vary:
| Variant | Gems given | Bonus content | Hypothesis |
| --- | --- | --- | --- |
| Control | 500 gems | + 1 character skin | Baseline. |
| Variant A | 750 gems (+50%) | + 1 skin | Better gems-per-dollar lifts conversion. |
| Variant B | 500 gems | + 2 skins (+1 from control) | Cosmetic-driven players convert better with bonus content. |
  • Primary metric: Conversion to purchase (binary)
  • Secondary metric: ARPPU (average revenue per paying user) at D14 (continuous scalar)
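To make the two metrics concrete, here is a minimal sketch of how they are computed from exposure and purchase records. The record shapes and field names are invented for illustration — they are not Fleack's export schema:

```python
from datetime import datetime, timedelta

# Hypothetical records; field names are illustrative only.
exposures = [
    {"user": "u1", "exposed_at": datetime(2024, 5, 1)},
    {"user": "u2", "exposed_at": datetime(2024, 5, 1)},
    {"user": "u3", "exposed_at": datetime(2024, 5, 2)},
]
purchases = [
    {"user": "u1", "at": datetime(2024, 5, 1, 12), "usd": 4.99},
    {"user": "u1", "at": datetime(2024, 5, 9), "usd": 19.99},
]

def metrics(exposures, purchases, window_days=14):
    exposed = {e["user"]: e["exposed_at"] for e in exposures}
    revenue = {}  # user -> total revenue inside the observation window
    for p in purchases:
        start = exposed.get(p["user"])
        if start and start <= p["at"] <= start + timedelta(days=window_days):
            revenue[p["user"]] = revenue.get(p["user"], 0.0) + p["usd"]
    conversion = len(revenue) / len(exposed)              # binary primary metric
    arppu = sum(revenue.values()) / max(len(revenue), 1)  # continuous secondary
    return conversion, arppu

conv, arppu = metrics(exposures, purchases)
```

Note that ARPPU divides by paying users only, which is why a conversion win and an ARPPU loss can coexist in the same test.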

Pre-flight check

Confirm that bundle composition is served from a backend config endpoint and not bundled into the app binary. Open the Fleack backoffice, navigate to Endpoints, and look for your shop config endpoint. Check the body sample for paths like:
  • data.shop.bundles[?id=starter_pack].gems
  • data.shop.bundles[?id=starter_pack].bonus_items
  • data.iap.starter_pack.contents
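Those paths assume a response body shaped roughly like the sample below. The field names and the second bundle are illustrative — match the paths against your actual endpoint's body sample:

```json
{
  "data": {
    "shop": {
      "bundles": [
        { "id": "starter_pack", "sku": "iap_starter_pack_4_99",
          "gems": 500, "bonus_items": 1 },
        { "id": "mega_pack", "sku": "iap_mega_pack_19_99",
          "gems": 2500, "bonus_items": 3 }
      ]
    }
  }
}
```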
If bundle contents are hardcoded in the binary, Fleack can’t rewrite them without a backend change first. Move the bundle config to a server-side endpoint, then return here.
Also confirm that purchase validation is a backend call. Fleack uses the conversion endpoint — typically your POST /api/iap/validate (the endpoint your app calls after StoreKit or Play Billing receipt verification) — as the purchase signal. If that call doesn’t exist or doesn’t pass through Fleack’s proxy, the conversion metric won’t fire.

Main workflow

1. Declare the levers

You need two levers, both pointing into the same shop endpoint.

Lever 1 — Gems per starter pack

Click + New lever in the Levers page:
  • Pick the shop config endpoint.
  • Search for gems in the path picker and select data.shop.bundles[?id=starter_pack].gems.
  • Set Label: Starter pack gems, Type: number, Test suggestions: 500, 750, 1000.
Lever 2 — Bonus skin count

Click + New lever again:
  • Same endpoint.
  • Select data.shop.bundles[?id=starter_pack].bonus_items.
  • Set Label: Starter pack bonus skins, Type: number, Test suggestions: 1, 2, 3.
The [?id=starter_pack] filter syntax in the path picker matches the correct object inside an array of bundle configs. Fleack rewrites only that item and leaves the rest of the array untouched.
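Conceptually, the filter-and-rewrite behaves like the sketch below. This illustrates the semantics only — it is not Fleack's implementation, and the body shape is the hypothetical one from the pre-flight check:

```python
import copy

def rewrite_bundle(body, bundle_id, field, value):
    """Rewrite one field on the matching bundle, leaving siblings untouched."""
    out = copy.deepcopy(body)
    for bundle in out["data"]["shop"]["bundles"]:
        if bundle.get("id") == bundle_id:  # the [?id=...] filter
            bundle[field] = value          # only the matching item is rewritten
    return out

body = {"data": {"shop": {"bundles": [
    {"id": "starter_pack", "gems": 500, "bonus_items": 1},
    {"id": "mega_pack", "gems": 2500, "bonus_items": 3},
]}}}
patched = rewrite_bundle(body, "starter_pack", "gems", 750)
```

The mega pack's values pass through unchanged, which is what keeps the two levers independent of the rest of the shop.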
2. Set up the tests

Run two separate tests, not one combined test that varies both levers at once. Combined tests inflate variance, slow down decisions, and make it impossible to attribute the result to a single change.

Gems test:
  • Variants: 500 (control), 750, 1000
  • Allocation: 33% / 33% / 34%
  • Segment: days_since_install ≥ 1 AND has_purchased = false — you’re measuring first-purchase conversion, not upsell behaviour.
  • Primary metric: Conversion on POST /api/iap/validate, conversion window 24 hours after exposure.
  • Secondary metric: Scalar delta on total_revenue_usd, observation window 14 days.
Bonus skins test — same structure, using the bonus_items lever instead.

Click Launch on both. They run independently, each with their own exposure counts and result panels.
3. Watch the results

Pricing tests produce signal more slowly than engagement tests because purchase rates are low — typically 1–3% on a starter pack.

Realistic timelines for a 200K DAU game:
  • 1,000+ exposures per variant within a few hours of launch
  • First conversion events within a day
  • Statistically meaningful conversion read at 5,000–10,000 exposures per variant
  • Decisive ARPPU verdict at 14 days
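The exposure figures above can be sanity-checked with a standard two-proportion sample-size estimate. This is a rough sketch: the z-values assume a two-sided α of 0.05 and 80% power, and your analytics stack may compute this differently:

```python
def exposures_per_variant(p_base, rel_lift):
    """Normal-approximation sample size for a two-proportion test."""
    z_a, z_b = 1.96, 0.84  # two-sided alpha = 0.05, power = 0.8
    p1 = p_base
    p2 = p_base * (1 + rel_lift)
    delta = p2 - p1
    n = ((z_a + z_b) ** 2) * (p1 * (1 - p1) + p2 * (1 - p2)) / delta ** 2
    return int(n) + 1

# At a 2% baseline, detecting a +40% relative lift (2.0% -> 2.8%)
n = exposures_per_variant(0.02, 0.40)
```

At a 2% baseline conversion, a +40% relative lift lands inside the 5,000–10,000 exposure range quoted above; a subtler +20% lift pushes the requirement past 20,000 exposures per variant.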
Watch the conversion rate first, but don’t act on it alone. The ARPPU secondary metric is what matters most — short-term conversion uplift can mask long-term cannibalisation of higher-value purchases.
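A quick way to see why: compare expected revenue per exposed user, not conversion alone. The figures below are invented purely to illustrate the failure mode:

```python
# Control: 2.0% convert, paying users average $12.40 by D14.
# Variant: 2.6% convert (+30% conversion!), but paying users average
# only $8.10 -- the bigger bundle cannibalised later, larger purchases.
control_rpe = 0.020 * 12.40  # revenue per exposed user, control
variant_rpe = 0.026 * 8.10   # revenue per exposed user, variant

# Despite the conversion win, the variant earns less per exposed user.
```

Revenue per exposed user is conversion multiplied by ARPPU, which is exactly why the secondary metric can veto the primary.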
Each test detail page shows per-variant conversion rates, Bayesian win probabilities vs control, and the ARPPU scalar uplift once the 14-day observation window closes.
4. Make the call

Pricing tests have a specific failure mode: Variant A wins on conversion but loses on ARPPU. This means you sold the starter pack to users who would have bought a more expensive bundle later — you traded future revenue for a short-term conversion bump.

Use this decision rule:
| Verdict | Condition | Action |
| --- | --- | --- |
| Promote | Variant wins on both primary AND secondary: ≥ 90% win probability on conversion AND ≥ 5% ARPPU uplift | Click Promote |
| Reject | Variant wins on conversion but ARPPU delta is flat or negative | Stop the test, keep control |
| Run longer | Mixed verdict at 14 days | Extend to 21–28 days for the LTV signal to stabilise |
If both tests declare a winner, promote each independently — the two levers are orthogonal.
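The decision rule above can be encoded as a small helper. The thresholds are the recipe's defaults and the function name is made up:

```python
def verdict(win_prob_conversion, arppu_uplift_pct):
    """Map (conversion win probability, ARPPU uplift %) to a verdict."""
    if win_prob_conversion >= 0.90 and arppu_uplift_pct >= 5.0:
        return "promote"
    if win_prob_conversion >= 0.90 and arppu_uplift_pct <= 0.0:
        return "reject"      # conversion win masking cannibalisation
    return "run longer"      # mixed verdict: extend to 21-28 days
```

Note that a conversion win with a small positive ARPPU uplift (between 0% and 5%) deliberately falls through to "run longer" rather than "promote".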
5. Promote the winner

From the winning test’s detail page, click Promote. Fleack instantly routes 100% of traffic to the new bundle composition on every shipped app version — no binary update required.
If your App Store or Google Play listing copy references the old bundle contents (e.g. “500 gems + 1 skin”), schedule a metadata update for your next release so the listing matches what users actually receive. In-app value adjustments via remote config are generally permitted without a binary update, but verify current platform policy with your IAP partner.

Common pitfalls

  • Don’t test below your floor price. Some regions and currencies have minimum IAP tier amounts enforced by the platform. Test parameters around the price (bundle contents, bonus items) — never the SKU price tier itself.
  • Watch for cohort drift. Long-running pricing tests (21+ days) can be confounded by seasonal spend patterns. Compare users by days_since_install bracket, not by calendar date, to keep cohorts equivalent.
  • Don’t cross-test cosmetics with currency. Cosmetics (skins, avatars) and soft currency (gems, coins) behave very differently — cosmetics are upgrades driven by identity, currency is staples driven by utility. Run them as separate tests on separate user cohorts so the results stay interpretable.
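The cohort-drift pitfall above comes down to bucketing users by install age rather than calendar date. A sketch — the bracket edges are arbitrary and the record shape is invented:

```python
def install_age_bracket(days_since_install):
    """Bucket users by install age so long-running tests compare like with like."""
    if days_since_install < 7:
        return "d1-6"
    if days_since_install < 30:
        return "d7-29"
    return "d30+"

# Aggregate conversion per (variant, bracket) instead of per calendar day.
users = [
    {"variant": "control", "days_since_install": 3, "converted": True},
    {"variant": "control", "days_since_install": 45, "converted": False},
    {"variant": "a", "days_since_install": 5, "converted": False},
]
by_cohort = {}
for u in users:
    key = (u["variant"], install_age_bracket(u["days_since_install"]))
    seen, conv = by_cohort.get(key, (0, 0))
    by_cohort[key] = (seen + 1, conv + u["converted"])
```

Comparing the same bracket across variants removes the seasonal-spend confound that calendar-date slices reintroduce.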

Related recipes

  • A/B test interstitial ad frequency: balance ad revenue and D7 retention by testing three interstitial cadences on a live game.
  • A/B test an onboarding flow: reorder onboarding steps on new installs to lift D1 retention without a binary update.