Video

Speaker/s name

Jon May, Email Marketing, The RAC

Description

Jon is the email marketing manager for the RAC, a UK-based vehicle and breakdown cover provider. He gave a talk about taking email testing to the next level and discussed how A/B testing can be used to make better data-led decisions, reduce the risk of failure, and help internal stakeholders be heard. He also cautioned that A/B testing may not be suitable in all circumstances and that sometimes it is ok to trust that an email will be successful without testing it.

Video URL

https://vimeo.com/532437895

Transcript

Jon May 0:25
Hi, I'm Jon, I'm the email marketing manager for the RAC. Thank you for coming to my Inbox Expo talk, Taking Testing to the Next Level. I'm going to be talking about some of the stuff that we've done at the RAC, how we've tried to make it a bit more systematic in what we're doing, and how that's worked for us.

Some key takeaways that we're going to work through this lunchtime (well, lunchtime in the UK): A/B testing can help us, but it's not necessarily needed all the time, and there are some cases where you might not want to be doing testing. Never let the raw numbers deceive you, and we'll come on to that a little bit later, giving the vital context. You want some vital context to help you understand what's going on, so you can prove that your data is doing what you wanted it to do. Think of some tests that you want to do in the next month or so, and then work towards some kind of regular testing on perhaps the most important emails that you're sending.

So, I'm the email marketing manager for the RAC. A little bit about the RAC: we're a UK-based vehicle and breakdown cover provider, we have just shy of 13 million members across the UK, and we send probably around 100 million emails a year, so it certainly keeps me busy. And while we are an insurance and financial services company, and breakdown cover is an insurance product, in our emails we message it much more like membership of a club than an insurance product.

So why do A/B tests? There are so many reasons why you might want to, and so many benefits. For us, the obvious one is making better data-led decisions. We want to help turn opinions into facts, and we'll come on to opinions in just a second. We want to learn specific things about our audience, and that can be anything from how this audience in particular opens their emails, to how they click through, to specific things about our different audiences within the RAC. It can help us try experimental and unconventional ideas: for example, we rolled out the Gmail annotation tool (from piqueinbox.com, shared by Brian in the Email Geeks Slack group). It's probably not that experimental or unconventional, but it was new to us, and by doing it as an A/B test we reduced the overall risk of failure. If it hadn't worked, or if it had had a huge negative impact, we could have rolled it back quite quickly, and it also meant that only 50% of the pot would have been affected. In quite a big organisation it also helps internal stakeholders be heard, and it helps us feed in information generated from other channels. Email is a key channel for us, but obviously we have lots of different channels, and how they do their testing can influence how we do ours.

But A/B testing is not suitable for everyone in every circumstance; you shouldn't test everything all the time. There are a few reasons why you might not want to do A/B testing. If you don't have enough data or subscribers, it might be difficult to gather enough evidence to prove that something has happened. It also doubles the amount of creative you've got to produce: if you're testing two different emails, you have to build two different emails. If you don't have enough time, or you don't have the resources, it may simply not be feasible for a team of your size.
Not everything needs to be tested all the time, and I have to remind myself of that some of the time: you don't need to test everything, it's okay to let some emails go and just trust that they're all right, because otherwise I would spend my entire life reporting on A/B tests and not doing anything else.

Testing for the sake of testing is really part of the previous point, but just because we can test everything doesn't mean we necessarily should. If you're low on resources and low on time, it's probably better to pour all of that time into making one creative really good, rather than splitting it to make two half-decent ones.

So, I mentioned opinions and turning opinions into facts. No matter whether you're in a business or an agency, you always have to watch out for the HiPPO: the Highest Paid Person's Opinion. In a business, this tends to be anyone above you on the organisational chart; in an agency, it's mostly the client, because obviously they're the person paying for it and they will have an opinion. We all have opinions, and until we do a test and prove it (and we'll come on to how we prove it in a bit), it's not a fact. Opinions are useful in guiding you down certain roads, but facts are the key milestones that help us verify that our opinions are valid and that they do work.

While with A/B testing we obviously want to improve what we do all the time, we also kind of want to fail. I love this quote: "If you have to kiss a lot of frogs to find a prince, find more frogs and kiss them faster and faster." It's from a book called Do It Wrong Quickly. A/B testing definitely helps us find what does work, but more importantly it helps us find what doesn't work, and find it faster, so that we can get to the stuff that does. Inevitably we run A/B tests that fail, or rather, A/B tests that prove our hypothesis was not true, and that information helps us going forward. The more we do it, the more learnings we have.

Generally speaking, there are three types of tests. There are multivariate tests, which I'm not going to talk about, because that's a statistics lesson all unto itself, so I'm going to steer clear of those. We're going to talk solely about A/B short-term and A/B long-term. A/B short-term is where you have two versions: one goes to 10% of the pot, the other goes to another 10%, you wait for a certain amount of time, and then whichever is the winner goes to the remaining 80%. A/B long-term is where we send A and B to 50% of the pot each, regardless. We use A/B long-term on pretty much all of our campaigns. A/B short-term is probably where you'll begin if you're starting out, but the long-term approach can help you learn, and learn a bit quicker as well.

The reason I suggest steering away from short-term if you possibly can is sampling. Essentially, it's guessing that those 10% of people will act the same as the other 80%, but sampling is prone to error. It's a bit like popping down to your local pub and asking who they're all voting for: you'll get some answers, but they won't necessarily be fully representative of the whole population, or in this case our whole audience. The full A/B 50/50 test, on the other hand, helps you learn faster and is the most accurate, because nothing is being sampled: it literally ropes in every audience member and asks them to vote with their open, click or action. Email service providers will randomly sample the data to pick that 10%.
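To make the mechanical difference concrete, here is a minimal Python sketch of the two splits described above: the short-term 10%/10%/80% approach and the long-term 50/50 approach. The function names and the audience are illustrative, not any particular ESP's implementation.

```python
import random

def split_short_term(audience, test_fraction=0.10, seed=42):
    """10% A / 10% B / 80% holdback; the winner is sent to the holdback later."""
    rng = random.Random(seed)          # pseudo-random, not truly random
    shuffled = audience[:]             # copy so we don't reorder the caller's list
    rng.shuffle(shuffled)
    n = int(len(shuffled) * test_fraction)
    return {
        "A": shuffled[:n],
        "B": shuffled[n:2 * n],
        "holdback": shuffled[2 * n:],  # waits for the winning version
    }

def split_long_term(audience, seed=42):
    """50/50: every subscriber 'votes' with an open, click or action."""
    rng = random.Random(seed)
    shuffled = audience[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return {"A": shuffled[:mid], "B": shuffled[mid:]}

audience = [f"subscriber_{i}" for i in range(1000)]
short = split_short_term(audience)
print(len(short["A"]), len(short["B"]), len(short["holdback"]))  # 100 100 800
```

The shuffle here is only pseudo-random, which is worth keeping in mind for the caveat that follows.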
But no computer can be truly random, and the smaller the sample, the less accurate it will be, statistically speaking. Political polls use a similar technique: ask a small group of people for their opinion, multiply it out to the population, and assume the split will hold. In the last few years, at least in Britain, that hasn't gone entirely well. Remain was ahead in the final polls, and then it wasn't; Theresa May (no relation) was heading for a landslide election win, and then she wasn't. Although this issue isn't entirely confined to Britain.

So when we're approaching testing, we've got to think about how we can make it better, and take a more methodical approach. Version A should be what you would send if you only sent one email, or what you currently send, and version B should be something new: a different idea to test, a single specific change. Think of the big question that you want to answer; you're testing a hypothesis, thinking back to science classes, which for me were quite a number of years ago.

A few ideas for starters, and we've done some of these. Do emojis in subject lines improve our open rates? I always see this in blog posts, "adding an emoji improved open rates by 64%", but I've run it for a number of different brands and got different results: sometimes yes, it improves things, and sometimes it massively decreases them, so it's quite useful to see that information for yourself. Does offering a higher discount improve total revenue? If you're doing percentage discounts, does moving that percentage up or down actually improve the total revenue position? Does adding a countdown timer, or some element of urgency (not too much, just a tiny bit), impact the click rate? Does changing a button colour change the click-to-open rate?

There are a number of things we can test, and arguably you can test everything. Subject line and preview text is probably where most people start out, and it's where we do quite a lot of testing, because (and we'll come on to email engagements in a minute) nudging that open rate a bit higher can definitely be beneficial. Sender name, I think, is one of the things that isn't tested enough. I'm not saying make it something other than your company name, but trying different ways of presenting your sender name can be very helpful. Then there's button text, the email body creative, and the images within the creative itself.

Now that we've seen what we can test, let's look at how we might define success. Ultimately, you've got to define what success is. If you pick up a textbook on this kind of thing, you'll hear it called the overall evaluation criterion, or OEC, but normal people call it the primary success metric. You can only have one primary success metric, but you can have several secondary success metrics. These could be opens, clicks, sales, revenue, average order value, and so on. It could be that your primary success metric is for revenue to go up, but your secondary success metric is that the average order value remains static or goes up; in other words, that it just doesn't go down. Because if you get more sales but the order value on each of them is lower, your revenue may actually fall.
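As a rough illustration of the one-primary/many-secondary idea (not the RAC's actual tooling), here is a small Python sketch in which a test only counts as a candidate winner if the primary metric improves and no secondary metric degrades; the metric names and numbers are made up.

```python
from dataclasses import dataclass, field

@dataclass
class TestPlan:
    """One primary success metric; any number of secondary metrics."""
    hypothesis: str
    primary_metric: str
    secondary_metrics: list = field(default_factory=list)

def evaluate(plan, results_a, results_b):
    """Call B a candidate winner only if the primary metric improves
    and no secondary metric gets worse."""
    primary_up = results_b[plan.primary_metric] > results_a[plan.primary_metric]
    secondaries_ok = all(results_b[m] >= results_a[m] for m in plan.secondary_metrics)
    return primary_up and secondaries_ok

plan = TestPlan(
    hypothesis="A higher discount improves total revenue",
    primary_metric="revenue",
    secondary_metrics=["average_order_value"],   # must not fall
)
a = {"revenue": 10_000, "average_order_value": 52.0}
b = {"revenue": 10_800, "average_order_value": 51.0}
print(evaluate(plan, a, b))  # False: revenue rose but average order value dropped
```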
Positive email engagements can be broadly split into three typical types: opens, clicks, and actions. Actions here could be sales, but they could also be signing a petition, donating to a campaign, or registering for an event. I would normally call these conversions, but that would get confusing a bit later on, so I'm going to call them actions here. So: opens, clicks, actions. When we think about the typical email engagement stages, say we've got 1,000 emails that we send out. They're sent, and then I'm hoping that almost all of them are delivered.

Then a percentage of them will open, a smaller percentage will click, and ultimately a smaller percentage still will actually take an action. When we talk about conversions in A/B testing, we're talking about the arrow from one stage to the stage below. So, for example, "do emojis improve open rates?" is looking specifically at the arrow between the number of emails that were delivered (which, arguably, is just the total number of emails sent minus the bounces) and the opens: does the change improve the number of people flowing into that next segment?

Briefly, on open rates; I'm not going into the debate about the merits of open rates, because I'm sure the comments section will be ablaze in a second. Open rates are a measure of success for the envelope content: the sender name, the subject line, and the preheader text. Those three things pretty much sum it up. There is a school of thinking that brand sentiment plays a part too, and I suppose yes, if you've had a bad experience with a company you might be less likely to open their emails, and that has no relation to any of those three things, so brand sentiment could be an invisible push behind it as well. Click rates are a measure of success for many more things: an attractive proposition, the call-to-action button text, whether the content matches the expectation set by the subject line (if I promise free puppies and then you open it up and get 50% off, that might feel a bit deceptive and you might feel a bit conned), a little bit of urgency (not too much), and a clear and present goal, where it's very obvious what the email is trying to do; in this example, it's trying to get you to buy bread.
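Those per-stage "arrows" are just simple ratios. Here is a minimal Python sketch of the funnel maths, using made-up numbers; delivered is taken as sent minus bounces, as above.

```python
def funnel_rates(sent, bounces, opens, clicks, actions):
    """Each rate is the 'arrow' from one stage to the stage below it."""
    delivered = sent - bounces
    return {
        "delivery_rate": delivered / sent,
        "open_rate": opens / delivered,       # envelope content: sender, subject, preheader
        "click_to_open_rate": clicks / opens,
        "action_rate": actions / clicks,
    }

print(funnel_rates(sent=1000, bounces=20, opens=300, clicks=60, actions=12))
# {'delivery_rate': 0.98, 'open_rate': ~0.306, 'click_to_open_rate': 0.2, 'action_rate': 0.2}
```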

So now we've learnt a little bit more about the numbers. When you run a test in your ESP, if the ESP is generating those numbers for you, it's probably lying to you. It's not actually trying to lie to you, but when it calculates the results for the 10% A and 10% B, it simply picks whichever number is slightly higher and sends that one out. It's missing the key context: okay, but what do these numbers mean? One of them will be the winner, sure, but there are a few pieces missing. So let's go through some of the trust checks we've put into our reporting, and then some of the reports we use to help internal stakeholders understand our tests a bit better.

First, do we have enough data? We need enough for a representative sample. We say the minimum is 100 conversions (100 from the top bar to the bar below) per variation, so 200 for an A/B test and 300 for an A/B/C test, simply because we want enough actual data to be able to understand what happened.

We also want reliable data. If it's a campaign that goes out every day, it has to win for five days straight, or we keep repeating it and keep getting the same, or very similar, results.

The lift should be bigger than the background variance. In every A/B test, even if you had exactly the same subject line on both versions, there will be a slight difference: one of them will have slightly more opens than the other. You'd think that, statistically speaking, they should be exactly the same, and you'd be right, but the universe is chaos and it doesn't quite work that way. You can find your audience's background variation by doing what's called an A/A test: send exactly the same creative, split the pot in half, send it out, and discover the nuances and niggles in your audience. For our audience it sits somewhere between 0.4% and 1.6%. If you don't have a big enough pot to do this, a rule of thumb is maybe 2%: if the difference is less than 2%, it might just be background chance deciding which one is the winner.

If all three of these apply, it's probably what's called a meaningful test: the result is meaningful. Technically speaking, that is statistical confidence over 95%, and we'll come on to why we specifically use 95% in just a second.

Finally, we want to make sure the test is actually valid and that nothing has technically gone wrong. We use what's called a guardrail metric to check the test's integrity; usually this might be bounces. If two versions of the same email have a significantly different number of bounces, there might be something underlying causing a problem, and you might have to void the test. So we keep a good check to make sure everything is trustworthy, and I'll come on to some of the reports in just a second.

Why do we use 95% statistical confidence? Essentially, it means that if you were to repeat the test 100 times, 95 times out of 100 you would get the same or very similar results.
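As a rough illustration (not how any particular ESP, or our own reporting, necessarily calculates it), here is a Python sketch that estimates statistical confidence with a two-proportion z-test and then applies the trust checks described above. The thresholds (100 conversions per variation, the ~2% background-variance rule of thumb, 95% confidence, a bounce guardrail) come from the talk; the z-test and the specific bounce tolerance are illustrative assumptions.

```python
from statistics import NormalDist

def confidence(conv_a, n_a, conv_b, n_b):
    """Two-sided confidence (as a %) that A and B really differ,
    estimated with a two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 0.0
    z = abs(p_a - p_b) / se
    return (2 * NormalDist().cdf(z) - 1) * 100

def trust_checks(conv_a, n_a, conv_b, n_b, bounce_a, bounce_b,
                 background_variance_pct=2.0):
    """The checks from the talk; thresholds per the talk, code illustrative."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift_pct = abs(p_b - p_a) / p_a * 100          # relative lift vs version A
    return {
        "enough_data": conv_a >= 100 and conv_b >= 100,   # 100 conversions per variation
        "lift_beats_noise": lift_pct > background_variance_pct,
        "meaningful": confidence(conv_a, n_a, conv_b, n_b) >= 95.0,
        "guardrail_ok": abs(bounce_a / n_a - bounce_b / n_b) < 0.005,  # bounces roughly equal
    }

print(trust_checks(conv_a=300, n_a=5000, conv_b=360, n_b=5000,
                   bounce_a=40, bounce_b=44))
```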
I tend to get asked what the minimum confidence is that we could use. Zero, I guess, but 50% is a coin toss, so that's really the floor, and I wouldn't recommend 50% at all. 70% is a total guess. 85% is probably the absolute minimum that, sucking air through my teeth, I'd call an educated guess: I live in the UK, so I could take an educated guess that it will rain this week, just because I've lived here so long. 90% is a pretty good guess, and if you've got a small pot, 90% might be what you aim for. 95% is "pretty sure", and this is the gold standard; most people will accept 95%. 99% is "sure", but you have to have quite a big pot to get there: if you have 1,000 people, you're going to really struggle to reach 99% statistical confidence.

So there are nine steps to get started and get into this rhythm of testing. First, find the right buttons in your ESP: how your ESP does A/B testing and the nuances of how it does it, because each ESP does it ever so slightly differently, so it's good to find that out. Define the big question, or hypothesis, that you have. Define a primary success metric and any secondary success metrics. Then create the two (or more) different versions of the email, send it live, and wait. The waiting is the hardest bit, because obviously the stakeholders whose emails we're testing want the results as soon as humanly possible, but we need enough data to be able to give them something useful.
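To give a feel for why a small pot struggles at the higher confidence levels, here is a hedged Python sketch using the standard two-proportion sample-size approximation at 80% power; the 20% open rate and 10% relative lift are made-up inputs, not RAC figures.

```python
from statistics import NormalDist

def required_per_variant(p_base, relative_lift, confidence=0.95, power=0.80):
    """Approximate subscribers needed per variant to detect a given
    relative lift in a proportion (e.g. an open rate) at the stated
    confidence and power."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p_test = p_base * (1 + relative_lift)
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return int(((z_alpha + z_beta) ** 2 * variance) / (p_base - p_test) ** 2) + 1

# Detecting a 10% relative lift on a 20% open rate:
print(required_per_variant(0.20, 0.10, confidence=0.95))  # roughly 6,500 per variant
print(required_per_variant(0.20, 0.10, confidence=0.99))  # noticeably more again
```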

Then we analyse the results, which is what I'm going to be talking about for most of the second half of this presentation: how we analyse them and how we get there. Then document and share the learnings, which I think is super important; we don't do it enough, but we are starting to do more of it now. Then decide what's next, and rinse and repeat.

So, analysing the results. As part of that I've built an app called Junction. We're a motoring organisation, so everything looks like British road signs, because we're fun like that. This is the Junction app; I've had to blur out some of what we're actually testing, but you can see some of the numbers and what it looks like. The key things for us were to inform decision-making; to have some email-specific learning resources, because I found out, to my detriment, that most of the material online, and even in textbooks, is aimed at websites and apps, and there are very few email-specific A/B testing resources or easy-to-use testing and sample-size calculators; to inform internal stakeholders by sharing the testing programme, particularly with different teams, so they know what's happening; and to improve our institutional memory. So we built an app to rope this all together.

Say we start a new test: we put in that it's going to be a subject line test and fill it all in, and then we get a bit more specific. What element are we testing? What's our primary success metric? What's our guardrail metric? All so that we make sure these tests are done properly and we can verify the data. When we log it, we add our data in here. It asks, for each version, what the population was and what the conversions were, and as you start typing it in, if there's a problem (this part is a bit more technical, because it's for an internal user), it highlights that maybe you made a typo and that might be why there's a mismatch in the data. Or there might genuinely be something wrong with the data, and you can correct it or interrogate it further before you submit it.

What we want to do is make this more understandable for our internal users. Instead of sending them a report of all the raw test numbers, we send them something a bit more user-friendly. We want them to be able to see that it actually went up by 18% in this instance, and then give them some extra context: the trust checks, and whether we can call it a winner. It's important that there is a winner, that the confidence is over 95%, that there's enough data, and that everything has green ticks; that's what we want. This one shows a small lift of 3.68%, and we've used a little bit of British road signage for hills, ups and downs, to show whether it has actually gone up or gone down; we're fun like that.
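This is not the real Junction app, just a tiny Python sketch of the kind of sanity check and plain-English summary described above; the field names and numbers are purely illustrative.

```python
def validate_entry(population, conversions):
    """Basic sanity checks on a logged variation, in the spirit of the
    typo-highlighting described above (illustrative, not the real app)."""
    problems = []
    if population <= 0:
        problems.append("population must be positive")
    elif conversions > population:
        problems.append("conversions exceed population: likely a typo")
    return problems

def friendly_summary(name_a, pop_a, conv_a, name_b, pop_b, conv_b):
    """Turn raw numbers into the kind of plain-English line a stakeholder sees."""
    rate_a, rate_b = conv_a / pop_a, conv_b / pop_b
    lift = (rate_b - rate_a) / rate_a * 100
    direction = "up" if lift > 0 else "down"
    return f"{name_b} moved the result {direction} by {abs(lift):.2f}% versus {name_a}."

print(validate_entry(population=5000, conversions=5600))
print(friendly_summary("Version A", 5000, 300, "Version B", 5000, 320))
```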
So I hope that was a quick tour of what we've done. What have we covered? A/B testing can help, but it's not necessarily needed all the time. Don't let the raw numbers deceive you, and especially don't take your ESP's numbers, or its declared winner, at face value; get that vital context to help you prove that the changes have actually had an impact. Think of some of the tests that you could do in the next month: maybe try emojis in subject lines; I've tried it several times with varying results for different brands, so it's one to think critically about. And work towards some kind of regular testing on your important emails. Not necessarily every single email (we don't test every single email), but it is useful on the most important ones.

I've got some resources; it's resource time. I'm actually building a public version of Junction at junction.email. It will be free for email geeks to use, hopefully from September, so that you can log and analyse your own A/B tests, because I felt there wasn't very much on the market for email specifically. There are a lot of tools if you're experimenting with PPC ads and that kind of digital marketing, but not for email specifically. If you go to junction.email you can sign up for launch news, which will be coming in September. You can also find these slides at junction.email/#inboxexpo and download them there. There's a blog post on there too that talks more about populations and samples, and how random sampling works. As you can see in this example, the random sample has actually over-sampled the blue people compared with the teal people. So if you download the slides, you can get the link to a post that tells you more about the sampling methods.
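As a small illustration of that sampling point (a sketch, not the blog post itself), this Python snippet shows how a 10% random sample's measured open rate drifts around the rate you'd see from the whole pot; the pot size and open rate are made up.

```python
import random

def sample_error_demo(pot_size=20_000, true_open_rate=0.20,
                      sample_fraction=0.10, trials=5, seed=1):
    """Show how a 10% random sample's measured open rate drifts around
    the rate measured over the whole pot."""
    rng = random.Random(seed)
    # 1 = would open, 0 = would not; fixed for the whole pot
    pot = [1 if rng.random() < true_open_rate else 0 for _ in range(pot_size)]
    full_rate = sum(pot) / pot_size
    for t in range(trials):
        sample = rng.sample(pot, int(pot_size * sample_fraction))
        sample_rate = sum(sample) / len(sample)
        print(f"trial {t + 1}: sample says {sample_rate:.3%}, "
              f"whole pot says {full_rate:.3%}")

sample_error_demo()
```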

And for some further reading, if you're thinking "I absolutely must know more about A/B testing in general and the theory behind it", I'd point you to Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu. You can see I've made a tonne of notes in the margins. These three used to run experimentation at Microsoft, Google, and LinkedIn, so they know their stuff, and the book is very useful for getting a good grounding. It's not for the faint of heart, it's quite technical, but it's very useful for understanding how the bigger companies have approached this, and how you might replicate it on a smaller scale.

If you have any questions, please feel free to tweet me, @jondoesemails, or drop me an email at the RAC. I hope that was a useful insight into the world of A/B testing at the RAC, and hopefully, if you sign up at junction.email, you'll be able to use my free tool from September onwards. So that was Taking Testing to the Next Level. If you have any questions, please feel free to tweet me or email me. Thank you very much for coming to my Inbox Expo talk, and thanks to Andrew, Nely and the team for putting this on; it's been a really slick operation. So thank you very much, and yeah, thank you for coming. Cheers.
