Once a test is running and exposures are accumulating, Fleack’s results engine computes per-variant statistics and gives you a clear signal: keep running, promote the winner, or stop. Rather than requiring you to calculate p-values or set up your own analytics pipeline, Fleack surfaces a single win probability number for each variant — the probability that this variant genuinely outperforms the control. This page explains how that number is computed and what to do with it.

Metric types

You choose one metric per test when you create it. Fleack supports three types:

Conversion

Question: Did the user call a specific endpoint within a set time window after being exposed to the variant? You pick the conversion endpoint (e.g. POST /api/purchase) and a time window in hours. Fleack counts how many exposed users triggered that endpoint within the window. The win probability is computed using a Bayesian Beta model (Jeffreys prior, Beta(0.5, 0.5)) with Monte Carlo sampling — giving you an interpretable probability rather than a reject/fail-to-reject decision from a p-value. Only users with a non-null user identity are included in the denominator.
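The win probability calculation above can be sketched in a few lines. This is an illustrative sketch, not Fleack's actual implementation; the function name and parameters are hypothetical, but the model matches what the docs describe: a Beta posterior under the Jeffreys prior, compared by Monte Carlo sampling.

```python
import random

def win_probability(conv_c, n_c, conv_v, n_v, draws=100_000, seed=0):
    """Estimate P(variant conversion rate > control conversion rate).

    Beta-Bernoulli model with Jeffreys prior Beta(0.5, 0.5): the posterior
    for each arm is Beta(0.5 + conversions, 0.5 + non-conversions).
    Monte Carlo: draw a rate from each posterior and count how often the
    variant's draw beats the control's.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_c = rng.betavariate(0.5 + conv_c, 0.5 + (n_c - conv_c))
        p_v = rng.betavariate(0.5 + conv_v, 0.5 + (n_v - conv_v))
        if p_v > p_c:
            wins += 1
    return wins / draws

# With 60/500 conversions on the variant vs 40/500 on the control,
# the variant's win probability comes out high (roughly 0.98).
print(win_probability(40, 500, 60, 500))
```

Note that identical arms yield a win probability near 0.5, which is why the verdict thresholds described below sit well away from the midpoint.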

Retention (day-N)

Question: Did the user have any activity on day N after their exposure? You specify the day number (e.g. day 3, day 7). Fleack checks whether each exposed user generated any event on that calendar day. An eligibility cutoff applies: users who were exposed less than N days ago haven’t had the chance to be retained yet, so they’re excluded from the calculation until they’re measurable. The same Bayesian win probability model applies as for conversion.
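The eligibility cutoff is the subtle part of day-N retention. A minimal sketch, assuming per-user exposure dates and activity dates are available (the function name and input shapes are hypothetical, not Fleack's API):

```python
from datetime import date, timedelta

def day_n_retention(exposures, activity_days, n, today):
    """Day-N retention for one variant.

    exposures:     {user_id: date the user was first exposed}
    activity_days: {user_id: set of dates on which the user had any event}

    A user counts only once day N has fully elapsed; users exposed too
    recently are excluded as not yet measurable.
    """
    eligible = retained = 0
    for user, exposed_on in exposures.items():
        day_n = exposed_on + timedelta(days=n)
        if day_n >= today:  # day N hasn't completed yet: excluded
            continue
        eligible += 1
        if day_n in activity_days.get(user, set()):
            retained += 1
    return retained, eligible

exposed = {"a": date(2024, 1, 1), "b": date(2024, 1, 1), "c": date(2024, 1, 9)}
events = {"a": {date(2024, 1, 4)}, "b": {date(2024, 1, 2)}}
# Day-3 retention on Jan 10: user a is retained, b is not,
# c was exposed too recently and is excluded -> (1, 2)
print(day_n_retention(exposed, events, 3, today=date(2024, 1, 10)))
```

The retained/eligible counts then feed the same Beta model as conversion.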

Revenue / scalar

Question: What is the average change in a profile attribute (e.g. arpu) between exposure and now? You pick a scalar profile attribute and an observation window in days. Fleack computes the average delta for each variant relative to the control. This is a continuous metric — Fleack reports uplift percentage rather than a binary win probability. An eligibility cutoff also applies here: users exposed more recently than the observation window are excluded.
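A sketch of the uplift arithmetic, assuming per-user (value at exposure, value now) pairs per variant. The exact formula Fleack uses for "uplift vs control" is not spelled out here; this sketch takes the common reading, the variant's average delta relative to the control's:

```python
def average_delta(users):
    """users: list of (value_at_exposure, value_now) pairs for one variant."""
    deltas = [now - before for before, now in users]
    return sum(deltas) / len(deltas)

def uplift_pct(variant_users, control_users):
    """Percentage uplift of the variant's average delta over the control's."""
    delta_c = average_delta(control_users)
    delta_v = average_delta(variant_users)
    return (delta_v - delta_c) / abs(delta_c) * 100

control = [(0.0, 2.0), (1.0, 3.0)]  # average delta 2.0
variant = [(0.0, 2.5), (1.0, 3.5)]  # average delta 2.5
print(uplift_pct(variant, control))  # 25.0 (% uplift vs control)
```

Users exposed more recently than the observation window would simply be filtered out of both lists before calling these functions.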

Verdict thresholds

Fleack applies the following thresholds to give each variant a verdict:
  • Winner — variant win probability ≥ 90% (binary) or uplift ≥ 5% (scalar)
  • Control wins — control win probability ≥ 90%; the variant is performing worse
  • No difference — neither side has reached a threshold yet; keep running
  • Not enough data — fewer than 30 total exposures; the engine declines to give a verdict
The “not enough data” guard prevents you from acting on noise in the earliest hours of a test. Once you cross 30 exposures, verdicts start appearing.
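The threshold logic above maps directly to a small decision function. This is an illustrative sketch; the names are hypothetical, and it assumes that for a two-arm binary test the control's win probability is 1 minus the variant's:

```python
MIN_EXPOSURES = 30
WIN_THRESHOLD = 0.90
UPLIFT_THRESHOLD = 5.0  # percent, scalar metrics

def verdict(total_exposures, win_prob=None, uplift=None):
    """Map a variant's stats to one of the four verdicts."""
    if total_exposures < MIN_EXPOSURES:
        return "Not enough data"
    if uplift is not None:  # scalar metric: uplift %, no win probability
        return "Winner" if uplift >= UPLIFT_THRESHOLD else "No difference"
    if win_prob >= WIN_THRESHOLD:
        return "Winner"
    if (1 - win_prob) >= WIN_THRESHOLD:  # control's win probability (assumed)
        return "Control wins"
    return "No difference"

print(verdict(10, win_prob=0.99))   # Not enough data
print(verdict(100, win_prob=0.95))  # Winner
print(verdict(100, win_prob=0.05))  # Control wins
print(verdict(100, uplift=8.3))     # Winner
```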

Reading the results panel

On the test detail page you’ll see a per-variant breakdown:
  • Exposures — unique users assigned to this variant
  • Conversions / retention / uplift — the raw metric value
  • Win probability — for binary metrics, the probability this variant beats the control
  • Verdict — Winner, Control wins, No difference, or Not enough data
For scalar metrics, win probability is replaced by an uplift percentage (e.g. “+8.3% arpu vs control”).

When to promote

Click Promote when a variant reaches Winner status — or any time you’re ready to commit. Promoting immediately routes 100% of traffic to the winning variant and marks the test as completed. No code deployment, no app release.
Once you promote, the test is marked completed and the rollout stays in place until you create a new test; there is no undo. Make sure you’re satisfied with the result before promoting.
For revenue metrics, a sustained uplift above 5% is the trigger. For conversion and retention, wait for ≥ 90% win probability to act with confidence. If a test has been running for several days with no verdict, consider whether the expected effect size is large enough to detect — a very small change requires a very large sample.
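To get a feel for "a very small change requires a very large sample", a back-of-envelope estimate helps. Fleack's engine is Bayesian, so this classical two-proportion approximation is only a rough planning tool, and the function and its defaults are hypothetical:

```python
from statistics import NormalDist

def sample_size_per_variant(p_control, p_variant, alpha=0.10, power=0.80):
    """Rough per-variant sample size to detect a conversion-rate change
    from p_control to p_variant (two-proportion normal approximation,
    one-sided alpha)."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return int((z_a + z_b) ** 2 * variance / (p_control - p_variant) ** 2) + 1

# Lifting conversion from 10% to 12% needs an order of magnitude more
# users per variant than lifting it from 10% to 20%.
print(sample_size_per_variant(0.10, 0.12))
print(sample_size_per_variant(0.10, 0.20))
```

If the sample you can realistically collect is far below that estimate, the honest options are to test a bolder change or accept a longer run.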