Flywheel's in-app campaign evaluation tools are a powerful way to analyze the impact of your campaigns on a wide array of business values and outcomes, all in near real-time.
Our Campaign Evaluation is based on a treatment/control experimental framework, otherwise known as frequentist A/B testing - an approach that quantifies the incremental improvements in business outcomes driven by your campaigns and supports them with robust statistical significance testing.
This guide explains the decisions that must be made when designing an experiment and details how Flywheel's evaluation methodology works, from both a general and a technical perspective.
Selecting a Treatment / Control Split
Campaign evaluation begins during audience creation when the treatment/control split is set.
The Treatment / Control experimental design relies on two (or more) experimental groups that are drawn from the same population, meaning the qualification criteria for both groups are the same.
The treatment group receives the marketing or sales intervention, while the control group does not. Because these two groups are drawn from the same population, the only difference between them is the intervention itself. Their behavior can then be compared to isolate the incremental impact of the intervention on any number of business metrics, such as revenue generated, app logins, email opens, etc.
There are several trade-offs to keep in mind when selecting the Treatment / Control Split percentage:
- There is often the temptation to bias the sample towards the treatment group in order to maximize the number of users receiving the marketing or sales intervention. Generally speaking, however, the best experimental comparisons between treatment and control groups will be achieved when the size of the groups is similar.
- That said, the overall size of the audience matters a great deal when using statistical significance testing to evaluate the effectiveness of a campaign. Larger treatment group splits can be warranted if the overall size of the audience is suitably large.
- When in doubt, Flywheel can perform what is called a Statistical Power Analysis - an analysis used to determine the minimum sample sizes and effect size (% incremental benefit in the treatment group) required to achieve statistical significance in an experiment.
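To make the idea of a power analysis more concrete, here is a minimal sketch. It is not Flywheel's actual implementation: it assumes a conversion-style metric, a two-sided two-proportion z-test, and the standard normal-approximation sample size formula. The function name and all parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_sizes(baseline_rate, relative_lift, treatment_share=0.5,
                          confidence=0.95, power=0.80):
    """Approximate (treatment, control) users needed to detect a relative lift.

    Assumes a two-sided two-proportion z-test with unpooled variance.
    """
    p_control = baseline_rate
    p_treatment = baseline_rate * (1 + relative_lift)
    # Number of treatment users per control user implied by the split.
    ratio = treatment_share / (1 - treatment_share)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (p_control * (1 - p_control)
                + p_treatment * (1 - p_treatment) / ratio)
    n_control = math.ceil(
        (z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2
    )
    n_treatment = math.ceil(n_control * ratio)
    return n_treatment, n_control


# Example: 10% baseline conversion, hoping to detect a 20% relative lift.
even_split = required_sample_sizes(0.10, 0.20, treatment_share=0.5)
skewed_split = required_sample_sizes(0.10, 0.20, treatment_share=0.8)
print(even_split, sum(even_split))
print(skewed_split, sum(skewed_split))
```

Under these assumptions, the 80/20 split needs a larger total audience than the 50/50 split to detect the same effect, which is the trade-off described above.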
Common Approaches to Treatment/Control Split:
When evaluating "evergreen" campaigns - campaigns that will be active on an ongoing basis and apply to any new users who qualify - many marketers will opt to split their campaign into two distinct phases.
In the first phase (the assessment phase), a robust control group will be used to ensure a proper determination of the campaign's effectiveness can be achieved.
In the second phase, assuming the assessment phase resulted in a sufficiently impactful incremental performance boost, the size of the control group will be reduced to ~20% (depending on audience size) to maximize the impact of the campaign while still retaining the ability to track the campaign's performance directionally.
Treatment / Control Assignment
The most common approach to splitting users into Treatment and Control groups is Random Assignment. In this methodology, users in the audience are assigned randomly to either the treatment or control group based on the treatment/control split selected during audience creation. By randomly assigning users between treatment and control, we avoid introducing any bias outside of the experimental intervention itself. The end result of this assignment is two experimental groups that match the specified sizing percentages.
Random Assignment is the default approach implemented in the Flywheel app.
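As an illustration only, random assignment that hits the requested split can be sketched by shuffling the audience and cutting it at the treatment share. The function name, the `seed` parameter, and the label strings are assumptions for this example and do not describe Flywheel's internal code.

```python
import random


def assign_groups(user_ids, treatment_share, seed=None):
    """Randomly label each user 'treatment' or 'control' at the given split."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    # The first n_treatment users after shuffling receive the intervention.
    n_treatment = round(len(shuffled) * treatment_share)
    return {
        uid: ("treatment" if i < n_treatment else "control")
        for i, uid in enumerate(shuffled)
    }


labels = assign_groups(range(100), treatment_share=0.8, seed=7)
n_treatment = sum(1 for label in labels.values() if label == "treatment")
print(n_treatment, len(labels) - n_treatment)
```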
A few technical notes on Flywheel's implementation:
1. In the case of ongoing campaigns, when new users are added to the audience, the same random assignment approach is applied to the entire group of newly qualifying users, regardless of whether the current 'active' size of the treatment and control groups matches the split percentage. Functionally, this means that over time the cumulative treatment/control split percentage will match the specifications. At any given time after the audience's creation, however, the 'active' size counts may diverge slightly from the expected ratio, because users may drop from the treatment and control groups at different rates.
2. While Random Assignment is a widely-used and well-understood methodology for group assignment in this experimental framework, it does have several drawbacks in the context of marketing campaign evaluation. Random Assignment works best in controlled experimental settings where all outside influences are mitigated. Marketers will often have several campaigns ongoing at any given time, however, and the audiences for these campaigns may overlap. This leads to situations in which a given user may appear in multiple treatment groups, or in the treatment group for one audience and the control group for another audience, introducing noise into the experiment.
- Because users are randomly assigned between treatment/control in all of the affected campaigns, the experiments remain statistically independent, and as such these overlaps do not invalidate the random assignment approach. However, they do make it more challenging to establish statistically significant incremental effects, because the performance of the control group is inflated by users who are also in the treatment group for other campaigns.
3. Users retain their control/treatment label over the entire duration of their membership in a given audience, meaning that if a user exits and then re-enters a given audience, they will be assigned the same label.
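The notes above on newly qualifying users and persistent labels can be sketched as follows. This is a simplified illustration under our own assumptions (labels stored in a plain dictionary, a hypothetical `assign_new_users` function), not a description of how Flywheel stores assignments.

```python
import random


def assign_new_users(existing_labels, qualifying_user_ids, treatment_share,
                     seed=None):
    """Label only users who have never been assigned; keep all prior labels."""
    rng = random.Random(seed)
    labels = dict(existing_labels)
    # Users seen before (including those who left and came back) are skipped.
    new_users = [uid for uid in qualifying_user_ids if uid not in labels]
    rng.shuffle(new_users)
    # The split is applied to the batch of new users only.
    n_treatment = round(len(new_users) * treatment_share)
    for i, uid in enumerate(new_users):
        labels[uid] = "treatment" if i < n_treatment else "control"
    return labels


previous = {"u1": "control", "u2": "treatment"}
updated = assign_new_users(previous, ["u1", "u3", "u4", "u5", "u6"], 0.5, seed=1)
print(updated["u1"])  # u1 re-enters the audience and keeps its earlier label
```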
In order to contend with the audience overlap issue discussed above, some organizations opt for the use of a Global Control Group in lieu of Random Assignment.
Flywheel can implement a Global Control Group Methodology upon request.
The Global Control methodology functions by randomly reserving a modest percentage of the overall marketable user base, generally somewhere in the neighborhood of 10-20%, and withholding all marketing from these users. As new users are acquired, the same methodology is used to add a percentage of new users to the global control based on the predetermined split.
Audience definition happens the same way, but instead of randomly assigning all qualifying audience members based on some predetermined split, the control group for any given audience consists of the overlap between the qualifying audience and the global control group.
Functionally, this means that no user ever ends up in both a treatment and control group simultaneously - the global control represents a true baseline when evaluating the effect of marketing compared to no marketing at all. We, therefore, would expect less noisy measurement of incremental performance improvements.
However, because the global control group is randomly assigned and generally only ~10-20% of the total population, we would also expect the size of the control group for any given audience to closely match the overall split. Depending on the overall size of the audience in question, this can lead to situations in which the control group is too small to achieve reliable, statistically significant comparisons with the treatment group.
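A minimal sketch of the Global Control idea, using Python sets: the control group for an audience is simply the overlap between the audience and the reserved holdout. The function names and the 10% holdout in the example are illustrative assumptions.

```python
import random


def build_global_control(all_user_ids, holdout_share, seed=None):
    """Randomly reserve a share of the marketable user base as a holdout."""
    rng = random.Random(seed)
    users = list(all_user_ids)
    return set(rng.sample(users, round(len(users) * holdout_share)))


def split_audience(audience_user_ids, global_control):
    """Control is the audience/holdout overlap; everyone else is treated."""
    audience = set(audience_user_ids)
    control = audience & global_control
    treatment = audience - global_control
    return treatment, control


global_control = build_global_control(range(1000), holdout_share=0.10, seed=3)
treatment, control = split_audience(range(200), global_control)
print(len(treatment), len(control))
```

With a 200-user audience and a 10% holdout, the control group will typically contain only around 20 users, which illustrates the sample size concern described above.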
Some questions to consider when choosing between Random Assignment and a Global Control Group:
- How saturated is your marketing environment? How possible is it that a given user will qualify for two marketing campaigns simultaneously? The development of a distinct lifecycle segmentation can help greatly in mitigating the risk of multiple marketing efforts conflicting with one another.
- How large is your overall customer base? How large are the individual audiences that you generally target?
- What effect size/incremental performance boost do you generally expect from your marketing campaigns?
Both approaches have their advantages and disadvantages, along with strategies that can be used to mitigate their shortcomings. We are always happy to engage in a conversation about which approach best suits the needs of your business.
Defining and Calculating Metrics
Once the audience is defined, audience members are split into treatment and control groups, and the treatment group has started receiving marketing or sales intervention, we can start the evaluation.
Because Flywheel's campaign evaluation platform sits directly on top of your data warehouse, we can compare the performance of the treatment and control groups across any event we can tie to the customer level. We call these Metrics, and they come in two different flavors: Aggregations and Conversions.
As the name suggests, Aggregations involve aggregating events (or the values associated with those events) over time. Aggregations first happen at the user level (e.g. how much has each user spent since joining the campaign), and are then aggregated again up to the audience level (e.g. how much have all users in the treatment group spent so far during the campaign). We can either count the number of times some event happened, or sum up all the values associated with events that took place (like the amounts associated with transactions).
Some examples of Aggregation Metrics:
- Total spend in the treatment group vs. control group over the duration of the campaign
- Total number of transactions in the treatment group vs. control group that took place within 60 days of the user being added to the campaign
- Total number of logins in the treatment group vs. control group over the duration of the campaign
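The two-step aggregation described above (first per user, then per group) can be sketched like this. The event format, a list of `(user_id, value)` pairs, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def user_level_totals(events):
    """Return per-user summed values and per-user event counts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, value in events:
        totals[user_id] += value
        counts[user_id] += 1
    return totals, counts


def group_level_total(per_user, group_user_ids):
    """Roll a per-user metric up to a group; users with no events count as 0."""
    return sum(per_user.get(uid, 0) for uid in group_user_ids)


events = [("u1", 10.0), ("u1", 5.0), ("u2", 20.0), ("u3", 7.5)]
totals, counts = user_level_totals(events)
treatment, control = ["u1", "u2"], ["u3", "u4"]
print(group_level_total(totals, treatment), group_level_total(totals, control))
print(group_level_total(counts, treatment), group_level_total(counts, control))
```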
At the customer level, Conversions capture whether an event happened or not. The only possible values for a conversion metric are True or False. At the audience level, conversions represent the number or percent of users in the group that converted.
Some examples of Conversion Metrics:
- Number/Percent of users who made at least one transaction during the campaign in the treatment group vs. control group
- Number/Percent of users who purchased a specific product/service during the campaign in the treatment group vs. control group
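A conversion metric reduces to a True/False flag per user, which is then counted or averaged per group. A minimal sketch, with a hypothetical function name and input format:

```python
def conversion_summary(group_user_ids, converted_user_ids):
    """Return (number converted, conversion rate) for one group."""
    group = set(group_user_ids)
    # Only conversions by members of this group are counted.
    converted = group & set(converted_user_ids)
    rate = len(converted) / len(group) if group else 0.0
    return len(converted), rate


purchasers = ["a", "c", "z"]  # users with at least one transaction
print(conversion_summary(["a", "b", "c", "d"], purchasers))
```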
Understanding Adjusted Control and Lift
As discussed in the 'Selecting a Treatment / Control Split' section above, it is often the case that the Treatment and Control groups are different sizes. When this is the case, we need to normalize our metrics by group size in order to reach an 'apples-to-apples' comparison.
For example, if we have 80 users in the treatment group, 20 in the control group, and everyone spends $10, it's clearly wrong to assume that the $800 total spend in the treatment group outperformed the control group with $200 total spend. If we normalize the spend by the group size, we get $10 spend per user in both treatment and control. In other words, there's no difference in performance.
In order to arrive at the adjusted control version of a metric, we first normalize the control metric by group size, and then multiply it by the size of the treatment group. This allows for an apples-to-apples comparison between the two groups, framed in terms of the actual incremental benefit of the treatment.
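The full calculation for the example above is as follows:
AdjustedControl = (Control Total Spend / Control Group Size) × Treatment Group Size = ($200 / 20) × 80 = $800
The AdjustedControl metric = $800, which is the same as the Treatment total, meaning there was no incremental benefit to this campaign.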
Let's now consider the same example but where the Total Revenue metric for the Treatment group is instead $1000. In this case, when we compare the Total Revenue of Treatment to Total Revenue of AdjustedControl we get an incremental benefit (lift) of $200 for this campaign.
In other words, if we had not run the campaign, we would have expected the users in the treatment group to spend a total of $800, which together with the control group's $200 gives a total of $1000 for the audience as a whole. Instead, we registered $1200 in spend ($1000 from treatment plus $200 from control), meaning the campaign generated an incremental benefit of $200.
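The adjusted control and lift arithmetic from the examples above can be written as two small functions. The function and parameter names are our own, chosen for illustration.

```python
def adjusted_control(control_total, control_size, treatment_size):
    """Scale the control metric to the size of the treatment group."""
    return control_total / control_size * treatment_size


def lift(treatment_total, control_total, control_size, treatment_size):
    """Incremental benefit: treatment total minus adjusted control."""
    return treatment_total - adjusted_control(
        control_total, control_size, treatment_size
    )


# 80 treatment users, 20 control users, control spends $200 in total.
print(adjusted_control(200, 20, 80))  # 800.0
print(lift(800, 200, 20, 80))         # 0.0, no incremental benefit
print(lift(1000, 200, 20, 80))        # 200.0
```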
A few considerations to keep in mind when working with Adjusted Control metrics:
1. Percentages are already normalized by population size, so an additional adjustment isn't necessary.
2. If the control group is very small relative to the treatment group, outliers in the control group can have an outsized impact on the adjusted control metric, because each control user's value is multiplied up to the size of the treatment group. This is only relevant to aggregation metrics, and it's possible to include outlier adjustment on a per-metric basis to account for this.
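To see why a single outlier matters, consider the sketch below. It uses a simple per-user cap as the outlier adjustment; this is only one possible technique and an assumption on our part, since the specific adjustment applied to a metric may differ. The function name and the cap value are hypothetical.

```python
def capped_adjusted_control(control_user_values, treatment_size, cap):
    """Adjusted control after capping each control user's value at `cap`."""
    capped = [min(value, cap) for value in control_user_values]
    return sum(capped) / len(capped) * treatment_size


# 19 control users spend $10 each and one outlier spends $500.
control_spend = [10] * 19 + [500]
print(capped_adjusted_control(control_spend, 80, cap=float("inf")))  # 2760.0
print(capped_adjusted_control(control_spend, 80, cap=50))            # 960.0
```

Without the cap, one user raises the adjusted control from roughly $800 to $2760, which would make almost any treatment result look like negative lift.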
Now that we understand how Lift is calculated, let's take a closer look at Statistical Significance Testing - the methodology used to determine whether an observed difference between the performance of treatment and control groups is real or due to random chance.
Implementing Significance Testing
Why Statistical Significance?
Evaluating campaign success based on performance metrics on their own is helpful in understanding how a campaign is tracking directionally over time, but when it comes time to make an important decision about whether a given campaign should be expanded or whether it should be scrapped, how does one determine what level of performance is good enough? There are a variety of situations that can make it difficult or unreliable to make this assessment based on looking at metrics and lift alone. Inadequate audience/sample sizes, the presence of a few big outliers, and/or an inordinate percentage of your control group happening to be in the Bahamas during your experiment are just a few examples of issues that may lead to misinterpreting the likelihood that a given incremental benefit will be replicable in the future.
Significance testing is the statistical methodology used to assess how likely it is that an observed difference in performance between a treatment and control group is not due to random chance.
It helps give us confidence that the observed difference is actually due to the intervention on the treatment group, and therefore if we continue outreach to this audience using the intervention, we will continue to observe a similar effect.
The two primary inputs to a significance test are the size of the observed difference between treatment and control (the effect size) and the number of users in each group (the sample size). In both cases, more is typically better. If the effect size is large, statistical significance may be reached even for a campaign with a small sample size. Similarly, if the sample size for an audience is very large, we are more likely to be able to conclude that even a small performance difference between treatment and control is not due to chance. (Note that just because a small performance difference is not due to chance, it doesn't necessarily indicate that it was worth the cost to achieve it!)
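The interplay between effect size and sample size can be illustrated with a two-proportion z-test on a conversion metric. This is a minimal sketch using the pooled normal approximation; the function name and the example numbers are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_t, n_t, conv_c, n_c):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    pooled = (conv_t + conv_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same small effect (12% vs. 10%) at two different sample sizes.
print(two_proportion_p_value(60, 500, 50, 500))
print(two_proportion_p_value(600, 5000, 500, 5000))
# Large effect (20% vs. 10%) with a small sample.
print(two_proportion_p_value(40, 200, 20, 200))
```

The first call returns a p-value well above 0.05, while the second and third fall below it: the small effect only becomes detectable with the larger sample, and the large effect is detectable even with 200 users per group.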
Understanding Confidence Intervals
When statisticians talk about statistical testing, it is framed in terms of what is called the null hypothesis. The experimenter's hypothesis is that the treatment intervention is going to have a real impact on the behavior of those who receive it. The null hypothesis, conversely, is the opposite - the base assumption that any difference in treatment vs. control is due to random chance. Significance testing, therefore, is used in an attempt to reject the null hypothesis.
Significance tests always come hand in hand with a confidence level, often loosely referred to as a Confidence Interval. If you've worked with significance testing in the past, you've probably heard something along the lines of... "The effect is significant at a confidence level of 95% or .95." Strictly speaking, this means that if the null hypothesis were true, a difference at least as large as the one observed would be expected less than 5% of the time. The confidence level represents the threshold at which the experimenter is willing to reject the null hypothesis. The higher the confidence level, the more difficult it is to reach statistical significance. Larger effect sizes and sample sizes both make it easier to reach significance at a given confidence level.
Fundamentally, the decision of what confidence level to use is made by the experimenter, or often the organization or marketing department as a whole. There is no right or wrong way to make this determination - it's based on how much tolerance one has for the risk of incorrectly rejecting the null hypothesis, which often depends on the nuances of the experiment or business context. Academic researchers, for example, often have to go through a rigorous review process before their work is accepted for publication by a journal, and as such generally demand a high level of confidence in their findings. Confidence levels of 95% or even 99% are often used. When evaluating marketing or sales campaigns, however, statistical confidence can be harder to come by and the stakes of getting it wrong are generally far lower. As such, confidence levels of 90% or even 80% are not uncommon.
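In practice, the chosen threshold is applied as a simple decision rule on the p-value produced by a significance test. The sketch below assumes the threshold is expressed as a fraction such as 0.95; the function name is hypothetical.

```python
def is_significant(p_value, confidence=0.95):
    """Reject the null hypothesis when the p-value is below 1 - confidence."""
    alpha = 1 - confidence
    return p_value < alpha


# The same result can pass a 90% threshold and fail a 95% one.
p_value = 0.08
print(is_significant(p_value, confidence=0.90))  # True
print(is_significant(p_value, confidence=0.95))  # False
```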
Choosing the Right Significance Test
There is a large variety of Significance Tests that have been developed for this type of experimental validation. This variety exists to handle the nuances presented by different experimental frameworks and metrics.
Flywheel defaults to the use of just a few tests in our campaign evaluation framework:
- t-test / z-test: Parametric tests used to compare two sample means when evaluating normally distributed metrics.
- Mann-Whitney U test: Non-parametric test used to compare two samples when the metric is not normally distributed.
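For readers who want to see what these tests compute, here is a standard-library sketch of both. It is not Flywheel's implementation: the first function is a large-sample z-test on means with unequal variances, and the second is the Mann-Whitney U test using the normal approximation without tie or continuity corrections. In practice, a statistics library such as SciPy (`scipy.stats.ttest_ind`, `scipy.stats.mannwhitneyu`) would normally be used instead.

```python
import math
from statistics import NormalDist, mean, variance


def mean_difference_z_test(treatment, control):
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def mann_whitney_u_test(treatment, control):
    """Two-sided p-value via the normal approximation (no tie correction)."""
    # Tag each value with its group: 0 = treatment, 1 = control.
    combined = sorted(
        (v, g) for g, values in enumerate((treatment, control)) for v in values
    )
    # Assign ranks starting at 1, giving tied values their average rank.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n_t, n_c = len(treatment), len(control)
    rank_sum_t = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u_t = rank_sum_t - n_t * (n_t + 1) / 2
    mu = n_t * n_c / 2
    sigma = math.sqrt(n_t * n_c * (n_t + n_c + 1) / 12)
    z = (u_t - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


treatment_spend = list(range(50, 100))
control_spend = list(range(0, 50))
print(mean_difference_z_test(treatment_spend, control_spend))
print(mann_whitney_u_test(treatment_spend, control_spend))
```

Both functions return a two-sided p-value that can be compared against the chosen significance threshold.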
If you have a more complicated experimental design that necessitates the use of a different statistical test, or just have questions about how statistical testing is being applied to your metrics, don't hesitate to reach out at email@example.com