No Baseline? No Benchmarks? No Biggie! An Experimental Approach to Agile Chatbot Development

Lessons learned bringing LLM-based products to production

Today’s post recaps my recent talk on lessons learned trying to bring LLM-based products to production. You can check out the video here.

What happens when you take a working chatbot that’s already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.

It’s well known that evaluating and comparing LLMs is tricky. Benchmark datasets can be hard to come by, and metrics such as BLEU are imperfect. But those are largely academic concerns: How are industry data teams tackling these issues when incorporating LLMs into production projects?

In my work as a Conversational AI Engineer, I’m doing exactly that. And that’s how I ended up centre-stage at a recent data science conference, giving the (optimistically titled) talk, "No baseline? No benchmarks? No biggie!" Today’s post is a recap of this, featuring:

  • The challenges of evaluating an evolving, LLM-powered PoC against a working chatbot
  • How we’re using different types of testing at different stages of the PoC-to-production process
  • Practical pros and cons of different test types

Whether you’re a data leader, product manager, or deep in the trenches building LLM-powered solutions yourself, I hope I can spare you at least some of the mistakes we made. So without further ado, let’s get into it.

The current state of our chat AI, and where we want to go. Source (of all images): Author provided

Setting the Scene

My company – a large telco – already has some pretty advanced conversational AI systems for both voice and text, including a multilingual chatbot which assists thousands of customers a day with question answering, end-to-end use cases, and transfers to real agents. We’re pretty proud of it, but we know Gen AI and LLMs can help us make it better, and implement those improvements in a more scalable manner.

Our vision is a chatbot that can take an entire conversational context, plus company and customer data, to serve diverse use cases according to defined business processes. It should be built on top of a framework that allows us to craft a controlled interaction between user and system – a so-called "on rails" approach – and to add new use cases easily, to continually improve the customer experience.

What we Want vs What we Have

Sounds great, but how on earth are we going to build this? We want to be test-driven, agile, and get feedback fast, making design decisions with confidence, based on clearly defined KPIs. But how? How can we measure our progress, when we have:

  • No benchmarks: For any company doing something like this, the problem is, by definition, unique. No other bot exists to serve the same customers, regarding the same products and services, as ours. This means: no benchmark data to test on.
  • No baseline: We’re also comparing our WIP bot to a working chatbot that’s been iterated on for years, making any early-stage comparison pretty unfair (how do we convince stakeholders of the value of our project, when of course the old bot has a much higher automation rate?!). Our new solution also works completely differently to its predecessor, so we need a whole new way to test it.

A New Way Forward

This has certainly been tricky, but through much trial and error, we’ve figured out a test-driven approach that’s working for us. It focuses on three key processes:

  • Internal testing
  • Customer trials
  • "Simulations"

Let’s now check out the pros, cons, and actionable takeaways of each test type.

The mix of pros and cons of internal teams testing means it needs to be combined with other test types.

Internal Testing:

How it works:

Currently, different parts of the bot are implemented by different domain teams: the billing team builds billing use cases like payment extensions, the admin team implements password resets, and so on. In internal testing, each team defines scenarios based on the use cases they implemented. For example, "you are customer 1234 and you want to extend your payment deadline." The scenarios include a so-called "happy path," which describes how a successful interaction between customer and bot could look. Team members from other teams then pretend to be the customer and try to get the tasks done using the bot. They take notes and give each scenario a rating, and finally, all teams review the scores for their scenarios, and share improvement ideas with the entire group.

Pros:

The key benefit is that the teams have enough general domain knowledge about our company and how the bot works to be able to probe for edge cases (which is, after all, where the problems hide). Yet because the testers didn’t implement the use cases themselves, they can’t accidentally "cheat," by using the same phrasing that the implementers had in mind during development. This helps us identify use cases which fall apart when presented with unusual phrasing of a customer request.

Internal testing also helped us prepare for the customer testing (up next), by revealing the potential diversity and room for confusion in even the most straightforward of customer interactions.

Cons:

This type of testing is manual and time-consuming, meaning we can only test a small number of scenarios. It’s also subjective and prone to misinterpretation: evaluators sometimes even misunderstand the happy path, and thus judge the output incorrectly.

Lessons Learned:

Alignment problems aren’t just for LLMs: Our first round of internal testing featured a simple rating scale: Bad-OK-Good. We found out only afterwards that some people had rated interactions based on how well the bot stuck to the happy path, while others judged the customer experience. For example, if the scenario was supposed to trigger certain logical steps, but the bot instead returned a high-quality RAG [1] answer, then some raters would penalise the bot, while others praised it. Thus, we learned that we need a rating system that captures both bot "behaviour" and bot quality. This way, if the bot is "misbehaving," but this produces a better experience, we can react: rethinking our implementation, and checking for misunderstandings or misalignments in our expectations of what customers want.

Don’t forget to define your test outputs: Our first tests also revealed a need to align on how to write useful comments, else people miss important details, or take fuzzy notes that don’t make sense afterwards. We also needed to agree on how to best preserve the chat transcripts: Some people copied the log output, which is almost too verbose to be usable, while others just took screenshots of the UI, which is impossible to search or process in any kind of automatic way later. We failed to brainstorm such issues in advance, and paid the price in hard-to-manage test outputs afterwards.

Actionable Takeaways:

Define a clear, precise evaluation scheme: We came up with an evaluation matrix featuring three metrics, the exact values that they can take, and a guideline for how to apply them. This makes it easier for testers to test, and for teams to aggregate results afterwards. The goal? Maximum learnings; minimum evaluator effort.
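To illustrate (with purely hypothetical metric names and values, not our actual matrix), such a scheme can be pinned down in code, which also makes it easy to validate and aggregate ratings later:

```python
# Hypothetical evaluation scheme: the metric names, allowed values and
# guidelines below are illustrative only, not our real internal matrix.
EVALUATION_SCHEME = {
    "behaviour": {  # did the bot follow the expected (happy path) logic?
        "values": ["follows_happy_path", "deviates_helpfully", "deviates_harmfully"],
        "guideline": "Judge against the defined business process, not personal taste.",
    },
    "quality": {  # was the customer experience good, regardless of the path taken?
        "values": ["good", "ok", "bad"],
        "guideline": "Judge the experience from the customer's point of view.",
    },
    "accuracy": {  # was the information in the response factually correct?
        "values": ["accurate", "partially_accurate", "inaccurate"],
        "guideline": "Check against company data and the scenario definition.",
    },
}

def validate_rating(rating: dict) -> None:
    """Raise if a tester's rating uses a value outside the agreed scheme."""
    for metric, value in rating.items():
        allowed = EVALUATION_SCHEME[metric]["values"]
        if value not in allowed:
            raise ValueError(f"{value!r} is not a valid value for {metric!r}: {allowed}")
```

Splitting "behaviour" from "quality" is exactly what resolved the rating disagreements described above.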

Have good data management: For us, this means little things, like: adding test case IDs indicating the specific scenario and tester, automatically capturing the transcripts, logs and steps called by the bot behind the scenes, and trying out the Zephyr test management tool, which provides a more structured way to define, test, and re-test different scenarios.

Customer Testing:

How it works:

We invited a mix of customers and non-customers to our offices and had them attempt to accomplish tasks using the bot. The tasks were similar to the scenarios described earlier – such as trying to extend their payment deadline – but testers weren’t given a happy path, or told in any way what to expect. They were recorded and encouraged to speak out loud as they worked, sharing their expectations and impressions as they waited for the bot’s responses.

Pros:

Having a diverse mix of participants leads to varied and unexpected behaviors. The way they interacted with the bot helped us see how tolerant customers are to issues like latency (a major headache with LLMs!), and revealed customer attitudes towards the technology. For example, some testers, including young and tech-savvy ones, were surprisingly cautious and skeptical about getting things done with a bot, and said they wouldn’t trust the outcome without an additional written confirmation.

Cons:

Customer testing was highly time-consuming to organise and execute: participants were chatty and/or slow, so we could only test two or three scenarios each. Such a small sample size means we also have to be careful of outlier feedback: if someone vehemently hates something, that doesn’t mean it’s an absolute no-go.

Again it was also a challenge to figure out what the test observers should note down, and we realised too late that we should have aligned on which aspects would be most useful for drawing actionable insights.

Finally, although we told participants that our bot was a bare-bones PoC, they still complained of missing functionality they’d seen in ChatGPT and similar tools. While that’s interesting to see, we felt it distracted them from some other feedback they could have given us.

Lessons Learned:

Customers are learning from LLMs… in unexpected ways: For example, customers with experience using tools like ChatGPT wrote fluently and conversationally, and expected the bot to handle it. Less experienced testers wrote in a "keyword search" style, fearing they’d confuse the bot otherwise. And some young participants who were familiar with LLMs used this keyword style on purpose, hoping that the bot would respond with similar brevity. That was a completely unexpected and creative attempt to manipulate the bot, based on an understanding that LLMs can be prompted to respond in different styles. It proves to us that our system will need to be robust to many types of interactions, perhaps adapting its behaviour to suit.

Customers don’t want to do things the way you might expect: For example, while the industry rejoices in LLMs and "conversational everything," our test participants weren’t that excited about the prospect. In some cases, such as when presented with a choice of invoices they’d like to delay payment on, they said they’d rather use a button to select, since "it’s faster than typing."

This was quite a reality check, and reminded us that you can’t please everyone. We sometimes received completely opposing feedback from different participants for the same task. This is a challenge with building any kind of consumer product, but it’s good to remember, at least for your own sanity.

Actionable Takeaways:

Design principles are invaluable (at least for an experimental project like ours): We collated our observer feedback into a set of general design principles. For example, we sometimes felt like the bot stuck too closely to our business logic, missing contextual cues from our test participants that should have swayed the process. So, we made it a principle that our bot should always prioritize conversational context when responding. Having this clearly stated helps guide our development, as it can be included in things like future internal tests and story acceptance criteria.

Simulations:

How it works:

We have an annotated dataset of historic chat interactions, which includes customer utterances, actions triggered by the existing system, the domain detected by our classifiers, and a ground truth domain label added later. Each sprint, we run those customer utterances through the latest version of our WIP chatbot, in order to test two things.

First, automation rate: how often does the new bot trigger an end-to-end use case versus a "T2A" (transfer to a call-centre agent)? How does this compare to the existing, live system’s automation rate? Second, how does the classification accuracy compare? We found a way to measure this, despite the two bots functioning completely differently: although the new bot doesn’t actually do domain detection, we can map the commands it triggers back onto the domain labels used by the production bot, giving us an apples-to-apples comparison.
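As a rough sketch of how such a comparison could be computed with pandas (the column names and the command-to-domain mapping below are hypothetical stand-ins for our real schema):

```python
import pandas as pd

# Hypothetical mapping from the new bot's commands to the production bot's domain labels.
COMMAND_TO_DOMAIN = {
    "start_payment_extension": "billing",
    "start_password_reset": "admin",
    "transfer_to_agent": "T2A",
}

def simulation_metrics(results: pd.DataFrame) -> dict:
    """results has one row per utterance, with (assumed) columns
    'triggered_command' (new bot) and 'true_domain' (annotated label)."""
    mapped = results["triggered_command"].map(COMMAND_TO_DOMAIN)

    # Automation rate: share of utterances NOT handed over to a call-centre agent.
    automation_rate = (mapped != "T2A").mean()

    # Classification accuracy: mapped command matches the annotated domain label.
    accuracy = (mapped == results["true_domain"]).mean()

    return {"automation_rate": automation_rate, "accuracy": accuracy}
```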

For the rest of the evaluation, we split the test utterances and WIP bot responses among the domain teams, who then manually review their quality. It may sound like a tonne of work, but we’ve found ways to make things quicker and easier. For example, if the bot’s response is "fixed" (meaning it’s never rephrased by an LLM), then as soon as an evaluator marks that response as "accurate", certain other metrics will automatically be filled. This speeds up the process, reduces decision fatigue, and helps ensure high quality and consistency from evaluators. Afterwards, we aggregate the evaluation scores, and create stories to tackle any specific issues we observed. The scores are also directly linked to KPIs in our development roadmap, enabling us to determine whether we’re satisfied with our latest changes, and to communicate progress to broader stakeholders.
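A sketch of that auto-fill rule, with hypothetical field names: for responses that are never rephrased by an LLM, an "accurate" judgement lets us pre-fill the generation-related checks, since a fixed text was already reviewed when it was authored:

```python
def autofill_metrics(evaluation: dict) -> dict:
    """Pre-fill dependent metrics for fixed (non-LLM-rephrased) bot responses.
    Field names are hypothetical, for illustration only."""
    if evaluation.get("response_type") == "fixed" and evaluation.get("accuracy") == "accurate":
        # A fixed response's wording was reviewed when it was authored,
        # so style- and safety-related checks pass by construction.
        evaluation.setdefault("tone", "ok")
        evaluation.setdefault("hallucination", "none")
    return evaluation
```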

Simulations: a core component of our chatbot development process.

Pros:

Our simulation approach is more scalable than the other test types. Though we still have a lot to improve on, and there’s still a manual evaluation step in the middle, we’ve invested great effort in streamlining the overall process by writing good-quality code in a production-style "pipeline" which orchestrates the different steps: running the utterances through the new bot, preparing the responses for manual evaluation, and computing the results afterwards. Simulations are also quantitative, not just qualitative. Our large-ish dataset (ca. 1000 utterances) is sampled to reflect the typical distribution of use-case domains in production. This more realistically represents the way customers talk to us, and the problems they have.
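The sampling step itself can be as simple as something like this, assuming a hypothetical annotated DataFrame with a 'domain' column:

```python
import pandas as pd

def stratified_sample(annotated: pd.DataFrame, n_total: int = 1000,
                      seed: int = 42) -> pd.DataFrame:
    """Sample utterances so that each domain keeps roughly its production share."""
    shares = annotated["domain"].value_counts(normalize=True)
    samples = []
    for domain, share in shares.items():
        subset = annotated[annotated["domain"] == domain]
        n = min(len(subset), round(share * n_total))
        samples.append(subset.sample(n=n, random_state=seed))
    # Concatenate and shuffle so evaluators don't see one domain in a long block.
    return pd.concat(samples).sample(frac=1, random_state=seed)
```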

Cons:

It’s expensive, thanks to all those LLM calls and, more importantly, the annotator effort. Another issue is that there’s no ground truth for a natural language answer. That makes automated evaluation tricky, and even manual evaluation is subjective and ambiguous.

But a much bigger problem is that we can’t test multi-turn utterances. We’re passing customers’ first utterances to our new bot, and unless it answers very similarly to the old bot (which it ideally won’t), the customers’ historic second utterances will no longer make sense. We could try having an LLM play the customer and chat with our new bot, but it would be expensive and not a particularly realistic test, given that our customers have different speaking styles, dialects, and problems than whatever data ChatGPT and co. have been trained on.

A knock-on effect of the first utterance problem is that we can’t test things like conversation repair, which is when a customer changes their mind during a chat. So we can’t yet get a full picture of how the bot is behaving over entire conversations. There’s also a "log-in barrier," whereby for most first utterances, the appropriate bot response is to have the customer log in. Our WIP bot typically gets this right, but it’s an easy test, which doesn’t teach us much.

Lessons Learned:

Frequent and early communication among testers is critical: Our evaluation sessions are live group efforts, where evaluators share any tricky utterance-response pairs, in order to get second opinions on how to rate them. This helps resolve ambiguities and ensure alignment. We also document the tricky cases alongside our evaluation guidelines, making future evaluations faster and more consistent. It also helps us keep track of where we’re really struggling to implement use cases in a satisfactory way.

Actionable takeaways:

A blend of different test types is key: In addition to the test types described here, we plan to try employee testing: letting colleagues from other departments try out our WIP bot with any scenarios they can think of (rather than the specific scenarios we used in our internal testing). This should provide us with a large, diverse, and more realistic set of test results, given that employees like call centre agents know exactly how customers tend to communicate with us. Gathering feedback will also be cheap and easy, using something like a Google form.

We’d also like to try some automated evaluations such as RAGAS, a suite of LLM answer-quality metrics, some of which are evaluated using other LLMs. Of course, we’ll have to weigh up cost versus reliability and convenience. But at least for the RAG part of the bot, we believe it’s worth a try.
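If we do try it, usage might look roughly like the sketch below. This assumes a recent version of the ragas library together with the Hugging Face datasets package, and an LLM API key configured, since some RAGAS metrics are themselves computed by an LLM judge. The exact metric names and expected dataset columns have changed between ragas versions, so treat this as an illustration, not a recipe:

```python
# Assumes: pip install ragas datasets, plus an LLM API key in the environment,
# because metrics like faithfulness call an LLM under the hood.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_data = Dataset.from_dict({
    "question": ["How do I extend my payment deadline?"],
    "answer": ["You can request a payment extension of up to 30 days in the app."],
    "contexts": [["Customers may request a payment extension of up to 30 days ..."]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the evaluated question-answer pairs
```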

What’s next?

Having run multiple tests over multiple sprints now, our biggest learning is that you’ll never think of everything in advance: Customers and Chatbots will always surprise you. That’s why we’ll keep on having regular test retros ("retrospectives"), looking for ways to improve our test process every time.

I’m planning more posts about this, as we continue to learn. If that’s your thing, feel free to subscribe, or check out my related posts on building LLM-based products, making them safe, and why you still need NLP skills in the age of ChatGPT.

[1] RAG, short for "Retrieval Augmented Generation", is an LLM design pattern wherein documents that are similar to a user’s query are passed along with that query to the bot, providing it with extra information which may help it answer the request.

How Big Tech Is Exploiting Content Creators, And (Trying To) Get Away With It


If you’re reading this, you’re part of the content creator ecosystem: either as a fellow writer, casual consumer or a Medium subscriber. You help keep the system running, which means you have a stake in the subject of today’s post, which will be all about copyright concerns and Generative AI.

I’ve been keeping an interested eye on this topic for a while. As both a writer and someone working with "Gen AI" every day, it’s in my own personal and professional best interests to do so. Of course, copyright isn’t the only legal and ethical issue tied to Gen AI, nor is Gen AI the first technology to raise such flags. However, it has captured public attention on an entirely new scale, which is why I’m diving into it today.

Setting the Scene

Let’s start with key stakeholders and critical questions. So far, the most vocal stakeholders in discourse on Generative AI and Copyright law have been:

  • builders of Generative AI models,
  • consumers of such models’ outputs,
  • and content producers, whose IP may wind up in a model’s training data.

Key questions for these stakeholders are:

  • Can copyrighted works be used as training data?
  • Can AI-generated works be copyrighted?
  • … If so, who owns the copyright?

This post will tackle the first question, and the critical concept of "fair use," a legal doctrine which has been central to discussions on the topic. I’ll use a number of current lawsuits against Stability AI, OpenAI and Meta, among others, to illustrate some criteria and considerations which may be used to evaluate whether or not an activity constitutes fair use.

In a later post or two, I’ll cover the latter two questions in the same way. I’m not a lawyer, but I have been following these issues with great interest, and I’ll be sure to include plenty of links so you can fact check my writing. Feel free to drop me a comment with critique or to start a chat on the topic, if that’s your thing. And now, let’s get into the nitty gritty…

Background: The Problem

Generative AI is big business. OpenAI is one of the world’s fastest-growing tech companies and recently surpassed the $2 billion revenue mark, thanks in large part to its release of ChatGPT. Unsurprisingly, all the biggest tech names in the business are racing to catch up.

On the plus side, this has produced an incredible wave of innovation and, to some extent, democratisation of powerful AI technology. On the other hand, it’s created an insatiable thirst for data to train Generative AI models [1]. For example, Meta was so desperate for text data to train its Large Language Models that it considered buying publishing house Simon & Schuster just to access their copyrighted material, according to a recent investigation by the New York Times. The report found that some Gen AI companies "cut corners, ignored corporate policies and debated bending the law," including Meta, where there was allegedly even talk of simply using copyrighted text data and dealing with any lawsuits later. The time it would take to license such content correctly was, apparently, simply too long.

In a similarly disturbing example, the report cites insider accounts from OpenAI in which the company, desperate for additional text data for training GPT 4, developed a speech recognition tool to extract transcriptions from YouTube videos. This would be a clear breach of YouTube’s terms of service: a fact that OpenAI employees allegedly discussed at the time, before proceeding anyway.

And speaking of breaking the rules: AWS is investigating Perplexity AI for allegedly scraping websites without consent, after both Forbes and Wired magazines raised the red flag over Perplexity’s model outputs.

This appetite for training data – be it text, images, video, or other modalities – won’t go away any time soon. That means we’re unlikely to see an end to such stories until we figure out a fair way to handle the issue. If you don’t consider yourself an artist or content creator, you may wonder how this applies to you. But remember that any content you generate online – such as a photo on Instagram, a thread on X, or a post on LinkedIn – could potentially land in a Gen AI model’s training dataset, if these companies continue to push the boundaries of what they are allowed to scrape.

The NYT investigation, for example, details how Google recently broadened its terms of service to allow it to use publicly available Google Docs, restaurant reviews, and other online content to develop its AI models. And the larger and more diverse a model’s training data is, the more capable it can be trained to become. As a result, Gen AI models can be used to produce content at an unmatched pace, including imitating specific creators and their styles. This can be a threat to content producers worldwide, disrupting all sorts of content platforms, including those you consume.

So, with the stakes made clear, let’s examine the first, key question for copyright and Gen AI.

Can copyrighted works be used as training data?

Debates on this question have heavily revolved around how to apply the legal doctrine of "fair use" to the creation of generative AI models. The fair use doctrine allows limited use of copyrighted material without obtaining permission for it, provided that the result is "transformative," meaning that it adds value, gives commentary on the original work, or serves an entirely different purpose. It permits applications like news reporting, teaching, review and critique, and research, which is why many generative AI models are created by universities and non-profit organisations who state their goals as purely academic. For example, text-to-image model Stable Diffusion was developed by the Machine Vision and Learning group at the Ludwig Maximilian University of Munich; they own the technical license for Stable Diffusion, but the compute power to train it was provided by Stability AI.

The challenge arises when such models are commercialised. For instance, while Stable Diffusion was released for free public use, Stability AI built DreamStudio on top of it – a simple interface enabling users to call the model to generate images – and used the fact of their "co-creation" to raise millions in venture capital funding [2, 3]. Stability can also make money from DreamStudio, by having users pay for image credits.

In such cases, courts need to determine whether the models’ outputs are transformative enough to still be allowed. A number of recent, ongoing copyright lawsuits can help us understand just how difficult this can be.

First, we have the suit by three artists against Stability AI, Midjourney and DeviantArt, which claims that these companies violated millions of artists’ copyrights by using the artists’ works to train their Generative AI models without permission. Second, we have multiple lawsuits by various news publications and authors against OpenAI, Microsoft, and Meta [4]. The allegations in these cases include unfair competition, unjust enrichment, vicarious copyright infringement (that is, to know about and benefit from an infringement), and violation of the Digital Millennium Copyright Act by removing copyright management information. In all of these cases, the defendants (that is, the AI companies being sued) have leaned heavily on the defence that their research is "transformative", and thus, "fair use". So, we can start to understand the complexity of the fair use doctrine, by first summarising arguments against these companies, followed by those in defence of them.

How Gen AI companies might be in trouble:

Let’s start with some facts about Stable Diffusion. It was trained on a dataset of image links and their alt-text descriptions, scraped from the internet without necessarily obtaining consent. The dataset could possibly be considered protected under fair use, due to its non-profit, research nature (it was created by German non-profit LAION, short for Large-scale Artificial Intelligence Open Network), and the fact that it does not store the images themselves (rather, it contains web-links and associated scraped information, like captions). However, the plaintiffs (that is, the accusers in the case) argue that Stability AI created unauthorised reproductions of copyrighted works, by downloading the images for training. In other words, the argument against the company relates to its unauthorised use of a possibly-otherwise-permissible source. We can see a parallel issue in the lawsuits against OpenAI: although AI researchers have been using large datasets of publicly crawled text data for years [5], OpenAI are accused of conducting infringement by removing copyright management information, such as authors and titles, from their training data [6].

Another problem for Stability AI is that their model can recreate existing expressions and styles with high accuracy, which could constitute so-called "unauthorised derivative works." This is a huge concern for creators, who fear that such models will be able to out-compete them at their own game. Hence, this case also included a right of publicity claim, alleging that providers of image generation models can profit off being able to reproduce certain artists’ styles (this claim was dismissed due to lack of evidence that the companies were actually doing this). In the case against OpenAI, the company was similarly accused of unfair competition and of harming advertising revenues (for news providers whose work was supposedly being reproduced by ChatGPT, and thus drawing clicks away from the original source). In light of such complaints, it is difficult for any of these model providers to defend themselves by claiming their work exists purely for research purposes, given that they allow commercial applications of their models, including Stability AI’s DreamStudio app, and OpenAI’s ChatGPT.

One final accusation by content creators is that models which replicate their works can also harm their brand. For example, the news outlets versus OpenAI case complained that ChatGPT generated misleading and harmful articles – including proposing smoking for asthma relief and recommending a baby product which was linked to child deaths – and falsely attributed to these newspapers, potentially harming their reputations.

How Gen AI companies could be safe (at least, for now):

Turning now to arguments favouring Stability AI and OpenAI: the former has defended the creation of copies of images for training, saying that this technical requirement is, in principle, no different to humans learning and taking inspiration from existing material. They also argued that their model does not memorise training images, but instead, uses them to learn general features about objects – such as outlines and shapes – and how they relate to one another in the real world.

Stability AI have also claimed that Stable Diffusion does not create derivative works, given that a reasonable person usually cannot tell which images, if any, contributed to a specific output: a condition courts have historically used to determine whether a work is derivative. In the case involving authors against OpenAI, the plaintiffs argued they shouldn’t have to prove derivative use of their works, if they could simply prove their works were in a model’s training data. They argued that if a model is trained on protected data, then all its works are derivative. The judge, however, dismissed that line of argument, insisting that they still needed to demonstrate that some outputs strongly resembled their own works. A final point in Stability AI’s favour is that style itself is not copyrightable – only specific, concrete expressions are.

The cases are only heating up…

The courtroom dramas over Generative AI and copyright don’t end with the cases I’ve discussed above. There are plenty more; I’ll summarise a few here, to reiterate just how messy this topic is… and how much messier it’s going to get [7].

Getty Images vs Stability AI: In May 2023, Getty Images sued Stability AI for allegedly breaching their intellectual property rights by using images scraped from gettyimages and istock to train their models. Getty Images claimed that the two databases had cost them nearly US $1 Billion to create, including around $150 million in licensing fees to acquire images. According to Getty, the resulting high-quality, metadata-enriched images have become an enticing target for companies wanting training data for their AI models, and some of those images had also been scraped for the non-profit LAION dataset mentioned above.

Getty’s complaint was that training Stable Diffusion involved "a chain of sequential copying of Getty Images from assembling of the training dataset through the diffusion process and onto the outputs in response to individual user prompts." They say the resulting model outputs are so similar to Getty’s own images that they sometimes even include the company’s watermark. This, they said, not only proves that Stability AI copied their copyrighted material, but also constitutes a breach of trademark, and so-called "passing off".

Stability AI hit back that Getty had missed the point of Gen AI, which is to create new and novel content, not replicate existing works. Stability AI conceded that Getty images were used during training, but offered two key defences. First, they said the copying was temporary and occurred in the US (perhaps trying to take advantage of the more lenient fair use laws there). Second, they argue that even if a model output is similar to an input training data sample, it is not a breach of copyright. This, they say, is down to multiple reasons: First, during the training phase, Stability says there’s no intention to memorise input data. Second, during usage of a trained model, they say an image starts as random noise, and thus cannot comprise any copyrighted material. Finally, they say that the random start of an output image means the same prompt will never generate the same output, and thus, no particular image can be generated with a specific prompt, which means the model cannot be used to reproduce a copyrighted work. When faced with evidence from Getty of output images that strongly resembled Getty images, Stability argued that this was due to the actions of the user deliberately trying to recreate a training sample, and not due to the model itself.

Music Publishers vs. Anthropic: In October 2023, three major music publishers – Universal Music Publishing Group, Concord Music Group and ABKCO – sued Anthropic for using copyrighted song lyrics to train its "Claude" Language Model, and for allowing it to reproduce copyrighted lyrics almost verbatim. They claimed that even without being prompted to recreate existing works, Claude nevertheless uses extremely similar phrases to well-known lyrics. They also claim that, while other lyric distribution platforms pay to license lyrics (thus providing attribution and compensation to artists where it’s due), Anthropic frequently leaves out critical copyright information.

The music publishers rejected Anthropic’s use of copyrighted material as "innovation", accusing them of downright theft. They acknowledged that Claude sometimes refused to output copyrighted songs, but used this as evidence that guardrails had been applied, just not satisfactorily.

In response, Anthropic called its model "transformative", as it adds "a further purpose or different character" to the original works. They also claimed that song lyrics were such a small part of Claude’s training data that to properly license them – not to mention the rest of Claude’s dataset – would be practically and financially infeasible. Anthropic even accused the plaintiffs of "volitional conduct", which essentially defends Claude as an "innocent" autonomous machine which the music producers "attacked" in order to force it to recreate copyrighted content. (Similarly, OpenAI claimed that the NYT illegally "hacked" ChatGPT to create misleading evidence to support its case).

Continuing to pick apart the case, Anthropic asked the court to reject the music publishers’ allegations of "irreparable harm," because the publishers had not provided evidence (such as a drop in revenues since Claude’s release) to support that claim. In fact, Anthropic argued that the lawsuit’s demands for financial compensation imply that the harm can be quantified, ergo, it cannot be irreparable. Finally, Anthropic insisted that any accidental output of copyrighted material could be fixed by applying guardrails.

Programmers vs. Microsoft, GitHub, and OpenAI: In January 2023, programmer and lawyer Matthew Butterick joined with Joseph Saveri Law Firm – the same firm representing two of the aforementioned cases against OpenAI, Meta and Stability AI – to file a class action against Microsoft, OpenAI and GitHub on behalf of two anonymous software developers. The suit accuses the companies of "software piracy on an unprecedented scale."

The accused companies attempted to have the complaint dismissed, saying that the complaints do not show violation of any recognisable rights, and rely on hypothetical events rather than real evidence of injury. Microsoft and GitHub also said that Copilot, its code assistance tool based on LLMs, does not extract from any existing, publicly available code, but instead, learns from open-source code examples in order to make suggestions. They even accused the plaintiffs of undermining the open-source philosophy by asking for monetary compensation for "software that they willingly share as open source."

The hearing on whether this case can proceed will take place this May, so I’ll be keeping an eye out, and will return and update this post as the case develops.

So, could these defences work?

That is, will these Gen AI model providers be able to convince courts that their endeavours constitute "fair use"? Perhaps. In a 2015 case of the Authors Guild against Google, "Google was permitted to scan, digitize and catalog books in an online database after arguing that it had reproduced only snippets of the works online and had transformed the originals, which made it fair use." Meta used this example to argue that training its LLMs on copyrighted data should be similarly allowed.

Meanwhile, Anthropic’s insistence that their accusers actually need to prove financial harm is an example of another kind of argument – so-called "harm-based defences" – which Gen AI companies may try to apply in addition to claiming fair use. For example, in the cases against Microsoft and OpenAI, both companies essentially argued that The NYT had failed to show real harm or prove that LLM-powered chatbots had dented news traffic or their subscription revenues [8,9]. In fact, in addition to downplaying any harm, these companies have stated that such lawsuits "threaten the growth of the potential multi-trillion-dollar AI industry." They may well be right. And applying both fair-use and harm-based arguments in this way could potentially strengthen their overall case.

If judicial bodies accept these arguments, then they may consider the act of using a copyrighted work in training as sufficiently transformative, since it results in a productive new model. However, new regulations will likely still be required, giving creators ways to have their creations removed from AI training datasets.

Summing up

This post tackled the question: can copyrighted data be used to train AI models? A clear, legally binding answer isn’t likely to arrive any time soon, and many content creators and platforms are choosing not to wait for one. Furthermore, they’re recognising this hunger for training data as an opportunity to establish new revenue streams. Many are partnering up with the biggest Gen AI companies in the game, who themselves are working hard to attract such willing partners. A recently leaked OpenAI pitch deck, for example, revealed that in addition to financial compensation, the company is offering its partners priority placement in search results and greater opportunities for brand expression when their works are surfaced during chats.

As a result of these situational and commercial pressures, many organisations are making deals with the Gen AI giants directly, allowing use of their copyrighted material, for a fee:

  • Newspapers and media companies like The Associated Press, The Financial Times, Vox Media, Time Magazine and Axel Springer, the parent company of Business Insider and Politico, have made such deals with OpenAI.
  • There are rumours (not yet confirmed) that Google has agreed to pay over $5 million per year to News Corp, owners of the Wall Street Journal, to fund News Corp’s development of AI-related content and products.
  • Reddit and Google have made a deal worth approximately $60 million per year, which allows Google to train its AI models using Reddit’s data. It’s not surprising that Google is interested in Reddit’s vast supply of stored text content: Large Language Models need to be robust to dealing with all kinds of text data, and while the news publications will have stores of pristinely edited articles in a specific journalistic tone, Reddit’s data will contain a diverse mix of both high quality texts as well as texts full of slang, typos, hashtags, emojis, HTML codes, and all kinds of other ‘noise’, making it a vital learning source.

Is this a good thing? What are your thoughts? Personally, I’m frustrated and anxious about the immense power these companies have over content creators and publications – not only does it give OpenAI and co. the upper hand in such negotiations, but they’ve also proven they’re willing to simply poach other people’s content if no deal is made at all [10] – but I’m happy that media producers are at least getting something out of the situation.

But of course, this isn’t the end of the discussion on Gen AI and copyright concerns. There are still two crucial questions to tackle:

  • Can AI-generated works be copyrighted?
  • And if so, who owns the copyright?

Look out for a deep dive on those issues in future posts.

Katherine.


1. The Gen AI race has also caused burnout among AI engineers, rushed rollouts without sufficient testing, and a lack of consideration of safe and responsible AI practices, according to this CNBC report.

2. TechCrunch, October 17, 2022: "Stability AI, the startup behind Stable Diffusion, raises $101M"

3. Sifted.eu, April 21, 2023: "Leaked deck raises questions over Stability AI’s Series A pitch to investors"

4. In the parallel cases of Sarah Silverman, Christopher Golden and Richard Kadrey against OpenAI and Meta, the authors claim that ChatGPT can summarize their books when prompted, but leaves out copyright ownership information. They also accuse Meta of using their works to train its LLaMa models, after Meta listed in a research paper a source database which is actually built off an illegal shadow library website.

5. See a list of commonly available NLP datasets, here.

6. Note: the argument that OpenAI deliberately removed copyright information from training data was ultimately not allowed to proceed, due to a lack of evidence.

7. For a comprehensive, regularly updated timeline of Gen AI and copyright lawsuits, see here.

8. NYT, March 4th, 2024: "Microsoft Seeks to Dismiss Parts of Suit Filed by The New York Times"

9. NYT, February 27, 2024: "OpenAI Seeks to Dismiss Parts of The New York Times’s Lawsuit"

10. Some content creators are simply being forced out of the game due to their lack of power in the Gen AI copyright fight: photographers I follow on Instagram, for example, have been closing their accounts to prevent AI being trained on their works. And while some such artists have found a new home in "AI sceptical" portfolio apps, there’s no doubt that these artists’ audiences and reach will have been significantly damaged in the process.

From Probabilistic to Predictive: Methods for Mastering Customer Lifetime Value

The final chapter in a comprehensive, practical guide to real-world applications of CLV analysis & prediction

My iPad and I are back with more scrappy diagrams, in this, the final installment of my guide (for marketers and data scientists alike) to all things Customer Lifetime Value.

Welcome, once again, to my article series, "Customer Lifetime Value: the good, the bad, and everything the other CLV blog posts forgot to tell you." It’s all based on my experience leading CLV research in a data science team in the e-commerce domain, and it’s everything I wish I’d known from the start:

  • Part one discussed how to gain actionable insights from historic CLV analysis
  • Part two covered real-world use cases for CLV prediction.
  • Next we talked about methods for modelling historic CLV, including practical pros and cons for each.

This progression from use case examples to practical application brings us to today’s post on CLV prediction: which methods are available, and what can marketers and data scientists expect from each, when trying to apply them to their own data? We’ll look at probabilistic versus machine learning approaches, some pros and cons of each, and finish up with some thoughts on how to embark on your own CLV journey.

But first, let’s remind ourselves why we’re here…

The "why" of CLV prediction…

Last post focussed on analysing past data to investigate the spending habits of different portions of your customer base (known as cohorts). We wanted to answer questions like "how much is an average customer worth to me after 6 months?" and, "how do the different cohort groups differ in what they buy?" Now, we’re interested in estimating future CLV, and not only on a customer group level, but for individual consumers.

Part two discussed the many reasons you might want to do this. Much of the motivation stems from automated customer management: Reliable, timely CLV predictions can help you understand and better serve your customer base, nudge customers along a "loyalty journey", and even decide which customers to "fire". CLV insights can also help you anticipate revenue, and even make better decisions about which inventory to maintain. Check out that post for more ideas, as well as part one, which is full of questions to help you discover the "why" of CLV for your own organisation.

And now, the how…

All of that sounds great, right? But how can it be achieved? Two groups of techniques can help: probabilistic models, and machine learning algorithms. Let’s examine each in turn.

Probabilistic Models for CLV Prediction

The goal of probabilistic models for CLV prediction is to learn certain characteristics of our customer base’s historic purchasing data, and then use those learned patterns to make predictions about future spending. Specifically, we want to learn probability distributions for customers’ purchase frequencies, purchase value, and churn rate, since all these factors combine to generate any given customer’s likely future CLV.

While there are a number of probabilistic models available, the "Beta-Geometric Negative Binomial Distribution" model (or "BG-NBD" for short), is the best known and most frequently applied. Understanding it will help you understand the probabilistic approach in general, so to help you do that, I’m going to take a deep dive now, but mark the most crucial concepts in bold. Feel free to skim over the bold parts first, and then re-read for the details.

The BG-NBD model uses the Beta, Geometric, and Negative Binomial Distributions to learn about typical purchase frequencies and churn rates among your customers:

  • A Beta distribution models the "die" part of the "buy till you die" idea: the notion that after every purchase, a customer effectively tosses a coin: keep shopping with us… or "die"? Of course, we don’t expect them to literally die. Rather, we assume that at some point a customer will either decide to stop shopping with us, or they’ll simply forget all about us, and on that day they’ll cease to be a customer, whether they even realise it themselves or not. The Beta distribution captures how this churn probability varies across your customer base.
  • A Geometric distribution models how many purchases a customer makes before they "die".
  • A Negative Binomial distribution models the total number of purchases a customer makes over time. It does this by combining the properties and assumptions of multiple distributions, including the Poisson distribution for each customer’s purchase frequency and the Exponential for the variability in time between purchases.

Phew, that’s a lot of talk about distributions. If you’d like to learn more, here’s an excellent article. But it’s also enough if you just understand the point of what we’re trying to do: We want to use these distributions to estimate the likelihood that any given customer is "alive" at any given time, and how many future purchases they’re likely to make. Then we just need to factor in spending, and we’ll have an estimated future CLV. But how?

There are two ways to do this. The simplest is just to take historic average transaction value:
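Roughly speaking, and assuming both the probability of being "alive" and the average spend stay constant over the prediction horizon, this simple estimate looks something like:

$$\text{predicted CLV} \approx p(\text{alive}) \times n \times \overline{\text{transaction value}}$$

where n is the expected number of future purchases and the bar denotes the customer’s historic average spend per purchase.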

This relies on the simple assumption that alive probability and purchase value will stay fairly constant over the next n transactions. But of course, this is unlikely: p usually changes after every purchase, generally getting higher as the customer becomes more loyal. This is clear in the graph below: the blue line is the probability of a customer being ‘alive’, and the red lines show purchases; with each purchase, the blue slope becomes flatter, as higher loyalty means the customer is more likely to stay ‘alive’.

Repeat purchases (red lines) generally increase P-Alive, the probability that a customer is ‘alive’ (blue lines). Source: Author provided based on Lifetimes package (and yes, this one was hard to draw!)

The natural variability between customers is further driven by seasonality, global events, and all manner of other factors. So a better way to factor in purchasing value, and to capture variability in shopping patterns in general, is to include yet another probability distribution, called "Gamma". Here’s how it works:

  • Your customer base will include everything from loyal, high-frequency buyers to infrequent, churn-prone buyers. The Gamma distribution represents how many of each kind of shopper you have, assigning different weights to different buying behaviours.
  • The "Gamma-Gamma model" uses two layers of Gamma distributions. The first assumes that the variation in average transaction size for each individual customer follows a gamma distribution. The second layer assumes that the parameters (i.e. the shape and scale) of this individual gamma distribution themselves follow another gamma distribution, reflecting the variation in spending habits across the entire customer base.

The Gamma-Gamma model is often combined with the BG-NBD model to predict future CLV in monetary terms. Sounds great (if not exactly simple), right? So what are the practical implications of this method?
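Before we get to those, here is roughly how fitting this combination can look in practice, using the open-source Lifetimes package (the same one behind the figures in this post). This is a minimal sketch, assuming a hypothetical transactions DataFrame with customer_id, date and amount columns:

```python
import pandas as pd
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.utils import summary_data_from_transaction_data

# One row per transaction; file and column names are assumptions for this sketch.
transactions = pd.read_csv("transactions.csv", parse_dates=["date"])

summary = summary_data_from_transaction_data(
    transactions, "customer_id", "date", monetary_value_col="amount"
)

# BG-NBD: purchase frequency and churn ("alive") behaviour.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])
summary["p_alive"] = bgf.conditional_probability_alive(
    summary["frequency"], summary["recency"], summary["T"]
)

# Gamma-Gamma: spend per transaction, fit on repeat customers only.
repeat = summary[summary["frequency"] > 0]
ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(repeat["frequency"], repeat["monetary_value"])

# Predicted CLV per customer over the next 6 months, in monetary terms.
summary["predicted_clv"] = ggf.customer_lifetime_value(
    bgf,
    summary["frequency"], summary["recency"], summary["T"],
    summary["monetary_value"],
    time=6,             # months
    discount_rate=0.01,
)
print(summary.sort_values("predicted_clv", ascending=False).head())
```

The predicted_clv column can then be aggregated, segmented, or fed into the kinds of actions discussed in parts one and two.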

Source: Author provided.

Pros and Cons of the Probabilistic Model

On the positive side:

  • It’s tried-and-tested: This is an old, established technique, which has been successfully applied to diverse retail domains.
  • It’s forward-looking: You can start making predictions into the future, and taking actions accordingly, to steer your business towards higher average CLV for all customers.
  • It makes churn explicit: One of the biggest ways to increase average CLV is to decrease churn rates. This technique explicitly models churn, allowing you to react to reduce it.
  • It ‘makes sense’: The model parameters have intuitive interpretations, meaning you can explore the learned distributions to better understand your customer base’s behaviour.

These advantages come with some hurdles, however:

  • It only works for non-contractual, continuous shoppers; that is, shoppers who don’t have a recurring contract, and who can shop at any time. It may not be well suited for non-contractual, discrete buyers, such as those who buy a newspaper every weekend, without actually having a subscription.
  • It can be computationally intensive to fit all these distributions, especially with large datasets.
  • It’s not a time-series model: Time series models are classes of probabilistic and machine learning models designed to learn about seasonalities and trends. The BG-NBD model does not natively include such features, although we try to capture some of their influence through the Gamma-Gamma component. Thus, instead of relying 100% on a BG-NBD model to forecast consumer spending, it might be desirable to do some dedicated time series modelling as well. This, of course, brings additional complexity and effort.
  • It’s not typically profit-focussed: I’ve spoken a lot about the importance of thinking of the ‘V’ in CLV in terms of profit, not just dollar transaction value. For example, a frequent buyer who returns many items, thus having a high transaction value but also causing the company significant shipping costs, should actually be considered a low-CLV customer. Unfortunately, the BG-NBD model isn’t explicitly designed to model transaction profit. You could try to incorporate it by ditching the Gamma-Gamma component and using a simple formula featuring average transaction profit instead, such as multiplying the expected number of future purchases by the average historic transaction profit.
  • Calculating margin isn’t easy, though (as part three made very clear). You may wish to investigate variants of this model which try to handle this, such as the Pareto-NBD model which explicitly learns a relationship between the number of transactions and their average profitability. I’ve found these to be less well supported by coding libraries and best practices, however, so the learning curve for implementation is likely to be steeper.
  • It won’t help with first time buyers: If you have customers with only one purchase, the BG-NBD model won’t know whether they’ve already ‘died,’ or are just going to be infrequent buyers going forward. In fact, customers with only one purchase will be rated definitely ‘alive,’ as shown by the bright yellow bar in the plot below. Of course, this is unrealistic: Maybe their one purchase was such a bad experience that they’ll never be back. Or maybe they bought a Porsche and they won’t need another one any time soon. To help you figure this out, you may wish to combine your probabilistic model insights with a historical analysis into how many one-time customers you have, or how long a typical pause between first and second purchases is.
A typical CLV analysis graph of p(alive) based on Recency and Frequency. Long-term customers (high Recency) who purchase frequently are likely alive, which is reasonable. Yet, customers with only one purchase (represented as zero repeat purchases on the Frequency axis), are rated with p(alive) = 1 (definitely alive), which is unrealistic. Source: Author provided based on Lifetimes package

CLV Prediction with Machine Learning

We saw that probabilistic methods aim to learn distributions of individual features like customer spending rate, and then combine those learned distributions to make estimations. Machine Learning algorithms take a similar approach: here our goal is to learn relational patterns between features in some data, and then use those learned patterns to make predictions.

There are even more Machine Learning algorithms and architectures to choose from than there were with probabilistic approaches. So once again, I’ll try to make the general and crucial concepts clear using one particularly well-known method: the RFM approach, where RFM stands for Recency, Frequency, and Monetary Value.

Let’s start by clarifying the idea of learning patterns between features in data. It’s obvious that an individual who shopped recently (Recency), shops often (Frequency), and spends a lot (Monetary Value), might be a high CLV customer. But how exactly do these three features combine to predict future CLV? Does recency trump frequency, meaning that if a customer used to shop with you often (good Frequency) but hasn’t at all lately (poor Recency), they’ve churned, making their future CLV effectively $0? This seems plausible, but what kind of Recency values typically mark the point of no return? How does this value change, for high versus low frequency shoppers? We need to quantify exactly the strength and direction of influence each of these features has, in order to use them for making predictions.

To do this, we:

  • Take a dataset of customer purchases and divide it into pre- and post-threshold periods: for example, the pre-period could be the first 9 months of the year, and the post-period is the last.
  • Calculate the R, F, M features (and potentially others as well) for each individual customer in the pre-threshold period, and calculate the sum of their spending (Monetary Value) in the post-threshold period.
  • Train a machine learning algorithm on the pre-threshold features, and use it to make predictions about the post-threshold Monetary Value (MV), as if that data were the future. Of course, it’s not really the future; it’s still historic data, which means we have the true values and can compare them to what we predicted. Based on how wrong the predictions were, our algorithm will keep trying again until it learns to get its predictions close to being correct. (A code sketch of this setup follows this list.)
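
To make that recipe concrete, here’s a minimal Python sketch. The toy transactions table, the column names, and the choice of a random forest are all illustrative assumptions, not a prescription:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy purchase history (in practice this is your real transactions table)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-11-02",
                            "2023-02-01", "2023-06-20", "2023-04-15", "2023-12-01"]),
    "amount": [25.0, 40.0, 30.0, 10.0, 15.0, 60.0, 12.0],
})

threshold = pd.Timestamp("2023-10-01")  # first nine months = pre-period, the rest = post-period
pre = transactions[transactions["date"] < threshold]
post = transactions[transactions["date"] >= threshold]

# R, F, M features per customer, calculated on the pre-period only
features = pre.groupby("customer_id").agg(
    recency=("date", lambda d: (threshold - d.max()).days),  # days since last pre-period purchase
    frequency=("date", "count"),
    monetary_value=("amount", "mean"),
)

# The label: each customer's total spend in the post-period (0 if they didn't come back)
labels = post.groupby("customer_id")["amount"].sum()
data = features.join(labels.rename("future_spend")).fillna({"future_spend": 0.0})

model = RandomForestRegressor(random_state=42)
model.fit(data[["recency", "frequency", "monetary_value"]], data["future_spend"])
print(model.predict(data[["recency", "frequency", "monetary_value"]]))  # predicted future spend
```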

Since we have the Monetary Value labels for the post threshold period, we could try to predict those per customer (in machine learning terms, this is known as a regression problem). We could then rank customers, split them into some number of groups (based on collaboration with marketing or customer service experts, of course), and the business could develop tailored marketing or customer service strategies per group. In particular, customers with very low predicted CLV could be identified as churn risks, and given special treatment accordingly.

This might sound like a great plan, but it’s not perfect: First, predicting exact values can be tricky, especially if the training data is small. Moreover, if you’re going to rank and cluster customers based on predicted CLV anyway (and I’ll come back to why you would do this in a moment), then why not try to predict the cluster directly? This would make the task a classification problem, and not only could it be easier to solve, but the outputs would be directly actionable: Customer A is predicted to land in the top tier CLV bucket; customer B is predicted to land in the bottom tier; we immediately know which campaign or strategy to funnel them to.

So how do we turn our regression problem to a classification one? When we calculate the post-threshold spending (MV) per customer, we need to cluster those values, assign them labels (such as low-, medium- and high-CLV), and train our classifier to predict those labels, instead of the underlying values. The only open question is: how to cluster the post-threshold MV values? The answer can be as simple as ranking and splitting into quantiles, such as the top 10%, middle 30%, and remaining 60%. Or, you could use a clustering algorithm: another type of machine learning algorithm which can discover clusters of values within a dataset. Whichever you choose should be based on collaboration with domain experts and those who intend to act upon the results of the project. The marketing team, for example, could help you decide how many quantiles or clusters would make sense in terms of developing targeted advertising campaigns to suit them.
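
Here’s a hedged sketch of that label-bucketing step using pandas quantiles; the toy numbers, the 60-30-10 split and the classifier choice are placeholders for whatever you agree with the business:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative per-customer table: pre-period R, F, M features plus post-period spend
data = pd.DataFrame({
    "recency":        [10, 40, 200, 5, 90, 300, 15, 60],
    "frequency":      [12, 6, 1, 20, 3, 1, 15, 4],
    "monetary_value": [30.0, 25.0, 80.0, 15.0, 50.0, 10.0, 22.0, 35.0],
    "future_spend":   [400, 120, 0, 600, 60, 0, 350, 90],
})

# Turn the continuous label into tiers, e.g. bottom 60% / middle 30% / top 10%.
# pd.qcut splits on quantiles; the exact cut-offs should come from the business.
data["clv_tier"] = pd.qcut(data["future_spend"], q=[0, 0.6, 0.9, 1.0],
                           labels=["low", "medium", "high"])

clf = RandomForestClassifier(random_state=42)
clf.fit(data[["recency", "frequency", "monetary_value"]], data["clv_tier"].astype(str))
print(clf.predict(data[["recency", "frequency", "monetary_value"]]))
```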

Before we get to the pros and cons of ML approaches for predicting CLV, I’d like to clarify a couple of points from what we just saw.

  • First, I mentioned that you would probably rank and cluster customers after predicting their future CLV. You might be wondering, why go to the trouble? The thing is, it would technically be possible to skip this step, and instead create tailored strategies at the individual customer level, based on each customer’s predicted future spending. However, such an approach is only feasible if it’s fully data-driven and automated, and that’s a huge undertaking in and of itself. Most companies just starting with CLV prediction won’t be at that level yet.
  • Secondly, data practitioners should be aware of another approach, which involves clustering the R, F and MV input features calculated in the pre-threshold period, and using the cluster labels as input features, instead of the raw values. This might bring additional benefits like explainability: For example, it could be nice to explain to stakeholders that the trained model has quantified how customers in the best R, F and MV clusters produce the best future MV predictions. But of course, figuring out a good clustering strategy for each feature adds complexity, and will require considerable extra experimentation.
  • Third, and on the subject of input features, don’t feel limited to Recency, Frequency and Monetary Value. Virtually any piece of information about customers could prove useful for understanding their shopping habits and predicting future spending. So think creatively, and ask your marketing and customer service teams for ideas: Demographic information, acquisition channel (did this customer first sign up in-store or online, for instance), rewards program membership tier, emails clicked, number of returns, and many more, could all prove useful to a machine learning model.
Source: Author provided

Pros and Cons of Machine Learning Approaches

On the positive side:

  • It’s forward-looking: Just as with the probabilistic models, machine learning approaches allow us to start making – and acting upon – estimations about the future.
  • It’s versatile: You can potentially gain more accurate results by experimenting with which features you provide the model, enabling it to capture nuanced patterns in the input data which are good predictors of future spending.
  • It can unlock further insights: Machine learning models can detect patterns far too complex for humans to notice. Data scientists can apply explainability techniques to dig into what the model has learned about how each specific feature influences CLV, which can be immensely useful for marketers. For example, if having a high ‘Frequency’ value turns out to be a strong predictor of high future spending, the company could invest extra effort in keeping themselves front of mind for customers, and making the purchase process as enjoyable as possible so customers keep coming back. If Monetary Value turned out to be a more useful feature for the model, the company might instead concentrate on cross- and upsell techniques, or other ways to entice customers to spend bigger. (A sketch of one such explainability check follows this list.)
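
As a taste of what such a check can look like, here’s a minimal sketch reading a tree ensemble’s built-in feature importances; the data is made up, and in practice you might reach for dedicated explainability libraries instead:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative data: pre-period features and post-period spend per customer
X = pd.DataFrame({
    "recency":        [10, 40, 200, 5, 90, 300, 15, 60],
    "frequency":      [12, 6, 1, 20, 3, 1, 15, 4],
    "monetary_value": [30.0, 25.0, 80.0, 15.0, 50.0, 10.0, 22.0, 35.0],
})
y = [400, 120, 0, 600, 60, 0, 350, 90]

model = RandomForestRegressor(random_state=42).fit(X, y)

# A global view of which features the model leaned on most when predicting future spend
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```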

On the negative side:

  • It’s harder to get right: Machine learning projects always add a certain degree of complexity, and using ML for CLV prediction is no different. Given all the different algorithms and training paradigms out there, the different features which could be used, and the different strategies for clustering the output predictions, data scientists have a lot to think about. Plus, if you want CLV predictions on a recurring basis, you’ll need a plan to deploy, monitor, debug, and periodically re-train the model. Personally, I see this as a rewarding challenge: This is what makes our job as data scientists interesting! But it is something to bear in mind, especially when explaining to stakeholders the feasibility and expected timeline of taking an ML approach.
  • It doesn’t explicitly model customer churn: This is one downside compared to the probabilistic model. The good news is, you can model churn yourself, using a dedicated prediction model. The bad news (which is not so bad, if you take my attitude to Data Science challenges), is that it’ll come with all the same extra complexities I just listed above.

Summing up this CLV Series

And now, over 10,000 words later, it’s time for me to wrap up this complete guide to Customer Lifetime Value. The first half of the series focused on how to use CLV information in your business: Part one discussed how to gain actionable insights from historic CLV analysis, while part two covered real-world use cases for CLV prediction. The second half was all about practical data science techniques: Part three talked about methods for modelling historic CLV, including practical pros and cons for each, and today’s post focused on CLV prediction: which statistical and Machine Learning methods are available, and what marketers and data scientists can expect when trying to apply them to their own data.

I structured the posts in this way as a reminder to practitioners not to jump straight to the most complicated machine learning algorithm you’ve got the compute power to run. It may be wiser to start with a historic analysis to understand the story so far, and to form hypotheses about what affects customer spending. Your Marketing team can already take actions from such information, and you may then want to move on to more complex, potentially more accurate techniques for making CLV predictions.

And that’s it from me! Thanks to all of you who’ve been devouring these posts. I’ve seen your follows and highlights, and I’m thrilled that this has been useful for you. I apologise that this last post was so long in coming: I’ve been busy editing and co-writing a handbook of data science and AI, which has been a huge effort, but something I’m very proud of. Feel free to connect on LinkedIn or X if you’d like to stay updated on that. Otherwise, I hope to see you in one of my future posts on data science, marketing, Natural Language Processing, and working in tech.

The post From Probabilistic to Predictive: Methods for Mastering Customer Lifetime Value appeared first on Towards Data Science.

]]>
Yes, you still need old-school NLP skills in “the age of ChatGPT” https://towardsdatascience.com/yes-you-still-need-old-school-nlp-skills-in-the-age-of-chatgpt-a26a47dc23d7/ Mon, 12 Feb 2024 08:44:52 +0000 https://towardsdatascience.com/yes-you-still-need-old-school-nlp-skills-in-the-age-of-chatgpt-a26a47dc23d7/ Large Language Models are amazing, but for many production problems, simpler techniques are faster, cheaper, and just as...

The post Yes, you still need old-school NLP skills in “the age of ChatGPT” appeared first on Towards Data Science.

]]>
Yes, You Still Need NLP Skills in "the Age of ChatGPT"

Large Language Models have their strengths, but for many production problems, simpler NLP techniques are faster, cheaper, and just as effective.

Large Language Models require new skills, but it’s important not to forget the old ones too, like how to prepare the text data the LLM should use. Source: Markus Winkler on Unsplash.

Back when I started a masters of Computational Linguistics, no-one I knew had even the faintest idea what Natural Language Processing (NLP) was. Not even me [1]. Fast forward four years, and now when I say I work in NLP, I only get blank looks about half of the time [2]. Thanks to masses of media hype, most people know that there are things called Large Language Models, and they can do a lot of amazing and very useful stuff with text. It’s become a lot easier for me to explain my job to people (provided I tell them "it’s basically ChatGPT"). But recently, this also gave me pause.

I’m the editor of a Data Science and AI textbook, published waaaay back in 2022 (seriously, in AI years that’s about 20). In preparation for the third edition, coming this year, I needed to update my NLP chapter. And as I sat down to read what I wrote back then about neural networks, sequence to sequence models, and this damn-fangled new technology called "Transformers," I noticed something remarkable: it all felt so old school. All that stuff on statistical Machine Learning approaches? Quaint. And my little code snippets explaining how to do text preprocessing with Python? Like a curious artifact from a bygone time.

The computer it sounds like I learned NLP on. Source: Jason Leung on Unsplash

I started to worry, would anyone find my old passion relevant? Or would I have to scrap it all in favour of our new chapters on Generative AI and foundation models [3]? I’m convinced that fundamental NLP skills are still important, and that anyone who’s actually worked in "traditional" NLP would agree. But what about a novice reader? What about a total programming newbie, or perhaps a statistician or software developer wanting to go where all the cool kids are? Would they want to read about the joys of web scraping, tokenization and writing regex parsers [4]?

In my opinion, they should.

Why you still need to learn "traditional" NLP

Not everyone needs a chatbot

LLMs are great at using their vast world "knowledge" and creativity to generate novel, long-form content, where multiple correct solutions are possible. But many data use cases seek the exact opposite of this. They require extracting specific, concrete information from unstructured data, and usually, there’s only one correct answer. Sure, LLMs can do this too: if I copy a customer inquiry email into a chatbot and ask it to extract the customer details and inquiry topic into a JSON string, it’ll manage. But so could a Named Entity Recognition (NER) model, and it’ll probably have lower latency, be easier to evaluate, and potentially be more interpretable. Its output format will also be guaranteed, whereas with the LLM, it’ll be up to me to validate that the response is, indeed, valid JSON [5].
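
To illustrate that validation step, here’s a minimal sketch assuming Pydantic v2; the schema and the llm_output string are invented for the example:

```python
from pydantic import BaseModel, ValidationError

# The structure we asked the LLM to produce
class CustomerInquiry(BaseModel):
    customer_name: str
    email: str
    topic: str

llm_output = '{"customer_name": "Jane Doe", "email": "jane@example.com", "topic": "billing"}'

try:
    inquiry = CustomerInquiry.model_validate_json(llm_output)  # Pydantic v2 API
    print(inquiry.topic)
except ValidationError as err:
    # Malformed or incomplete JSON: retry the prompt, or fall back to a rule-based parser
    print("LLM response failed validation:", err)
```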

Thus, while LLMs could be useful in prototyping a pipeline which performs entity recognition as one of its stages, the final result may be more practicable with a traditional supervised learning model. Of course, you’d need a labelled training dataset for this, but here’s a saving grace for the LLM: it can generate that data for you [6]!

Not everyone is using ChatGPT (yet?)

Outside of the headlines and press releases by the world’s largest research organizations, who are using LLMs to solve NLP problems end-to-end, many companies aren’t there yet, even if their use case could benefit from an LLM. Some of them are figuring out what this technology can do, others are even building their first LLM-powered solutions, but many are realising the challenges with bringing such a product into production.

Best practices and established design patterns don’t yet exist for developers to turn to. Many new tools designed to help build LLM systems are not yet robust enough to be relied upon. Issues like complexity and latency when making multiple LLM calls, and security when connecting LLMs to external tools, can massively slow the pace of development. Finally, difficulties figuring out how to evaluate an LLM’s outputs make it harder to measure the value of the solution, and thus, harder for some companies to justify the continued R&D effort on solving particular problems with LLMs [7].

You can’t throw out all the old solutions at once

You know the saying, "if it ain’t broke, don’t fix it?" Plenty of companies have NLP systems which are working just fine. Those companies have no incentive to start over with Gen AI, and if they do decide to experiment with LLMs, it’ll likely be to tackle brand new problems first (problems which proved unsolvable with traditional methods, perhaps). Thus, it will take quite some time (if it happens at all) before existing solutions using "traditional" NLP techniques become entirely obsolete. And in the meantime, these companies will need to maintain their existing NLP systems in production. That means they’ll still need employees who know how to debug text preprocessing pipelines, evaluate NLP models, and maybe even extract new features from text data, to continually improve their existing systems.

LLMs need NLP too, ok?

Training or fine-tuning your own LLM will require text data which has been gathered, cleaned, and formatted consistently. You’ll need NLP skills for that. If you want to filter the input data, prompts, or model outputs for toxic content, you’ll need NLP for that, too, as you’ll be implementing something like keyword filters or content classification models. The same goes if you want to apply quality control to an LLM’s responses, pulling a human into the loop in cases where the quality is detected as low: tasks like this are still sometimes done with traditional NLP techniques and supervised models.

How about building a Retrieval Augmented Generation (RAG) application? In this architecture, you provide a language model with a document knowledge base, from which it will draw information to create its answers. For this, you’ll likely need to experiment with embedding methods, document segmentation strategies (known as "chunking"), and how much overlap should be allowed between document chunks, such that all the relevant information is preserved for the LLM, but the context window isn’t immediately filled. NLP skills can help you figure out those issues, too [8].
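
To make the chunking trade-off concrete, here’s a deliberately simple, character-based sketch; real pipelines usually split on sentences or tokens instead, but the overlap logic is the same:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into overlapping, character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

document = "Your knowledge-base article goes here. " * 100
print(len(chunk_text(document)), "chunks")
```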

How NLP and LLMs can work together

I actually dislike the term "age of GPT." To me, it’s indicative of the narrow focus so many people have on Large Language Models, and the blind spots this casts over a robust toolkit of tried and tested NLP methodologies. It seems to imply that, for the foreseeable future, LLMs will be all we need to solve all our problems. But I think this attitude comes from a place of solutions looking for problems, and that’s how you end up with AI-powered coffee machines that don’t actually need any AI, and genuine (as in, I definitely did not make this up) articles asking whether AI will change the way we fridge forever [9]?

In the end, nobody knows for sure how this is all going to play out. But my money’s on LLMs becoming just one useful tool of many for working with text data. So for anyone dreaming of getting into NLP, I’ve got some mixed news for you. Yes, you’ll be able to play with all the new toys, but you’re also going to have to learn regex, too [10].

Some final thoughts, and useful resources

1. Turns out that, by sheer luck, I stumbled into NLP at the perfect time. The entire field was having its own ImageNet moment. Maths with words was cool, neural networks had (very helpfully) developed selective amnesia, and everyone was raving about Sesame Street. It was fascinating, fun, and flattering to see how everyone wanted to learn what I was learning, and every company was hiring for skills like mine. I started teaching, speaking and writing about NLP, and even wound up trying to explain chatbots to Australia’s most famous comedian, Jim Jeffries. But nevertheless, people like my parents still couldn’t wrap their heads around what I was doing. Now they hear about this stuff in the news almost daily.

2. Hey, as long as no-one thinks I mean the other NLP, I’m happy with that.

3. By the way, I’m also responsible for our new chapters on Foundation Models, Generative AI, and LLMs, and they’re going out of date faster than I can finish writing them! (See, for example, this pesky paper introducing a fast and performant new architecture for foundation models, or this one, proposing combining LLMs with SymbolicAI, which will probably be out of date before I even finish reading it!).

4. Of course, by "writing regex parsers," I mean, "rewriting and debugging regex parsers, and generally questioning the life choices which landed you in that situation in the first place."

5. Entity recognition is generally formulated as a sequence labelling task, and the model architecture will be fixed to ensure that output is one label per token in the original input sequence. An LLM, by contrast, returns a string, which must be parsed and validated. Here’s a tutorial from a former coworker demonstrating exactly that, i.e. how to "Enforce and Validate LLM Output with Pydantic."

6. Matthew Honnibal of NLP library spaCy makes a great case for using LLMs to train their own replacements, and even provides step-by-step recommendations for doing so, in "Against LLM maximalism". Highly recommended reading.

7. For a more concrete example of the challenges of bringing an LLM-based system into production, consider the case of function-calling LLMs, or agent systems. These can be incredibly useful, but bring plenty of operational hurdles with them, including managing complex workflows, maintaining satisfactory levels of security, and evaluating the effectiveness of the system. Ben Lorica of Gradient Flow covered all these and more in this piece, "Expanding AI Horizons: The Rise of Function Calling in LLMs."

8. Another great article, "Driving Operational Efficiency at Adyen through Large Language Models," talks about exactly this: using NLP techniques to support LLM-based applications. Check out the section on RAG, for example.

9. I’ve actually seen some really practical use cases for AI in fridges. It’s just this phrasing which makes me cringe.

10. As ChatGPT so eloquently put it, \b(?:[Ss][uc]{2}[kx]s?\sto\s[bB]e\s[yY]ou)\b

The post Yes, you still need old-school NLP skills in “the age of ChatGPT” appeared first on Towards Data Science.

]]>
Methods for Modelling Customer Lifetime Value: The Good Stuff and the Gotchas https://towardsdatascience.com/methods-for-modelling-customer-lifetime-value-the-good-stuff-and-the-gotchas-445f8a6587be/ Fri, 17 Nov 2023 17:26:45 +0000 https://towardsdatascience.com/methods-for-modelling-customer-lifetime-value-the-good-stuff-and-the-gotchas-445f8a6587be/ Part three of a comprehensive, practical guide to CLV techniques and real-world use-cases

The post Methods for Modelling Customer Lifetime Value: The Good Stuff and the Gotchas appeared first on Towards Data Science.

]]>
How often does a customer shop? How much do they spend? And how long are they loyal? Three simple factors to help you model your average consumer’s Customer Lifetime Value. But does that make it an easy task? No. No it does not. Source: Author provided.

Welcome back to my series on Customer Lifetime Value Prediction, which I’m calling, "All the stuff the other tutorials left out." In part one, I covered the oft-under-appreciated stage of historic CLV analysis, and what you can already do with such rearwards-looking information. Next, I presented a tonne of use-cases for CLV prediction, going way further than the typically limited examples I’ve seen in other posts on this topic. Now, it’s time for the practical part, including everything my Data Science team and I learned while working with real-world data and customers.

Once again, there’s just too much juicy information for me to fit into one blog post, without turning it into an Odyssey. So today I’ll focus on modelling historic CLV, which, as part one showed, can already be very useful. I’ll cover the Stupid Simple Formula, Cohort Analysis, and RFM approaches, including the pros and cons I discovered for each. Next time I’ll do the same but for CLV prediction methods. And I’ll finish the whole series with a data scientist’s learned best practices on how to do CLV right.

Sounds good? Then let’s dive into historic CLV analysis methods, and the advantages and "gotchas" you need to be aware of.

Method 1: The Stupid Simple Formula

Perhaps the simplest formula is based on three elements: how much a customer typically buys, how often they shop, and how long they stay loyal:
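Average CLV = average transaction value × average transactions per period × average customer lifetime (in periods)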

For instance, if your average customer spends €25 per transaction, makes two transactions monthly, and stays loyal for 24 months, your CLV = €1200.

We can make this a little more sophisticated by factoring in margin, or profit. There are a couple of ways to do this:

Stupid Simple Formula V1: With Per-Product Margins

Here you calculate an average margin per product over all products in your inventory, and then multiply the stupid simple formula result by this number to produce an average Customer Lifetime Margin:
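Average Customer Lifetime Margin = (average transaction value × average transactions per period × average customer lifetime) × average product margin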

For example, if you take the figures from above and factor in an average product margin of 10%, your average CLV = (€25 × 2 × 24) × 0.1 = €120.

How to calculate the average product margin depends on the cost data you have, which will likely come from a variety of data sources. A simple way to start is just to take the standard catalog price minus Cost of Goods Sold (COGS), since you’ll probably have this information in your inventory table. Of course, this doesn’t consider more complex costs, or the selling price when an item is on sale, or the fact that different transactions include different items, which can have very different margins. Let’s look at an option which does…

Stupid Simple Formula V2: With Per-Transaction Margins

Version two replaces average transaction value in the original formula with average transaction margin:
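Average Customer Lifetime Margin = average transaction margin × average transactions per period × average customer lifetime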

For example, €5 margin per transaction × 2 transactions per month × 24-month lifespan = an average CLV of €240.

This variation requires transaction level margins, based on price minus costs for each item in a transaction. The benefit here is that you can use the actual sales price, rather than catalog price, thus factoring in any sales or discounts applied at the final checkout. Plus, you can include Cost of Delivery Services (CODS), i.e. shipping, and Cost of Payment Services (COPS), i.e. fees to payment system providers, like Visa or PayPal. All this leads to more accurate and actionable insights.

Source: Author provided.

Pros and Cons of the Stupid Simple Formula

On the positive side:

  • The formula is conceptually simple, which can facilitate better collaboration between data scientists and domain experts on how to calculate it and what to do with it.
  • Plus, it can be as easy or complex to implement as you want, depending on how you calculate margin.

There are two major downsides, however. Firstly, the formula isn’t particularly actionable:

  • It produces a single, average value, which is hard to interpret and influence: even if you make some changes, recalculate, and find that the average changes too, you’ll have no idea if it was related to your actions.
  • It averages out sales velocity, so you lose track of whether customers are all spending at the beginning of their lifetime or later.
  • And it doesn’t help understand customer segments and their needs.

Secondly, the formula can be unreliable:

  • Being an average, it’s easily skewed, such as if you have big spenders or a mix of retail and consumer clients.
  • In non-contractual situations, where the customer isn’t bound by a contract to keep paying you, you never really know when that customer ‘dies.’ Thus, it’s hard to estimate a value for the average lifetime component.
  • The formula assumes constant spending and churn per customer. It fails to consider customer journeys and phases where they’ll need more or less of your products.

Method 2: Cohort Analysis

This technique involves applying the average CLV formula to individual customer segments. You can segment customers any way you like, such as per demographic, acquisition channel or, commonly, per the month of their first purchase. The aim is to answer questions like:

  • What’s an average customer worth after 3, 6, 12 months?
  • When do they spend the most during their lifetime? For example, do they spend big at first and then drop off, or is it the inverse?
  • How does acquisition funnel affect CLV? For example, sign-ups due to a promotion could win a lot of short-term, non-loyal customers, while those from a refer-a-friend scheme might result in lifelong fans. Similarly, an in-store acquisition could drive more loyalty than a forced online registration at the checkout.
  • How do demographic groups differ in their average CLV? For example, do shoppers in affluent suburbs spend more? The answer won’t always match expectations, and whenever that happens, there are usually good insights to be found if you dig in deep enough.

Below we see a classic cohort analysis by acquisition month. The horizontal axis shows Cohort Group, indicating the earliest transaction month we have in the data. This probably indicates acquisition month (although some customers may have existed before the start of data collection). The vertical axis shows Cohort Period: the number of months since the earliest transaction in the data.

Source: Finn Qiao, ‘Cakestands & Paper Birdies: E-Commerce Cohort Analysis in Python’, TowardsDataScience.com

How do you read this? The leftmost column shows people who joined in December 2010 (or were already customers at that time). The darker colours reveal that these customers spent a lot in their first month (top left cell) and their 10th-12th months (bottom right), i.e. September to November, 2011. What could this mean? Collaboration between data scientists and marketers could help decode this trend: maybe it’s because the company saved Christmas for these customers in 2010, and they’re making a happy return in 2011. Maybe it’s just because they were already customers before the start of data collection. Meanwhile, customers acquired in July and August tend to be low spenders. Why? And what strategies can be employed to boost the average CLV for customers acquired during other times of the year?

Exactly the same kinds of investigations can and should take place over other types of segmentations, too.
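
If you’d like to build such a table yourself, here’s a minimal pandas sketch; the toy transactions table and its column names are purely illustrative:

```python
import pandas as pd

# Illustrative transactions table: one row per purchase
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 3],
    "date": pd.to_datetime(["2023-01-10", "2023-03-05", "2023-02-14", "2023-02-28",
                            "2023-01-20", "2023-04-02", "2023-05-11"]),
    "amount": [25.0, 40.0, 10.0, 15.0, 60.0, 30.0, 12.0],
})

# Each customer's cohort group is the month of their first purchase in the data
first_purchase = tx.groupby("customer_id")["date"].transform("min")
tx["cohort_group"] = first_purchase.dt.to_period("M")

# Cohort period = number of months between a purchase and that first purchase
tx["cohort_period"] = ((tx["date"].dt.year - first_purchase.dt.year) * 12
                       + (tx["date"].dt.month - first_purchase.dt.month))

# Average spend per cohort group and period: the table behind heatmaps like the one above
cohort_table = tx.pivot_table(index="cohort_period", columns="cohort_group",
                              values="amount", aggfunc="mean")
print(cohort_table)
```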

Source: Author provided.

Method 3: "RFM" Approaches

RFM approaches are based on calculating the following metrics for each customer:

Yes I got a new iPad. No, I’m not an artist. Source: Author-provided.

This enables us to categorise customers based on these metrics and explore distinct customer groups, assigning meaningful business names to them. For instance, those with top Recency, Frequency, and Monetary Value scores earn the title of "Top Prio," or "VIPs." Having figured out who they are, the next thing you’ll want to know is: what’s the size of this group, as a proportion of your overall customer base? Meanwhile, customers with high Frequency and Monetary Value but low Recency have spent significantly but only over a short time. They might be considered "Churn Risks", especially if you add an additional metric – time since last purchase – and it turns out to be high.

The simplest way to discover these groups is to use percentiles: Sort the customers by Recency and split them up – into tiers for the top 20%, the middle 50%, and the bottom 30%, for example. Repeat for the other metrics. Then define all possible combinations of tiers, label the resulting groups, and plot the size of each group as a percentage of your overall customer base. This is demonstrated below. Creating such a graph makes it really clear that only a small percentage of the overall customer base are "VIPs," while a much larger portion are "Going Cold" or even "Churn Risk". Such insights can help you devise strategies to gain more loyal customers and fewer at risk ones.
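
Here’s a rough pandas sketch of that percentile-based tiering; the 20-50-30 split, the tier labels and the toy data are placeholders for whatever you agree with marketing:

```python
import pandas as pd

# Illustrative per-customer RFM table (higher = better for all three metrics here)
rfm = pd.DataFrame({
    "customer_id": range(1, 11),
    "recency": [50, 300, 10, 200, 90, 15, 365, 120, 30, 250],
    "frequency": [3, 40, 1, 22, 5, 2, 55, 9, 4, 30],
    "monetary_value": [20.0, 60.0, 15.0, 45.0, 25.0, 18.0, 80.0, 35.0, 22.0, 50.0],
})

def tier(series: pd.Series) -> pd.Series:
    """Rank a metric and split it into bottom 30% / middle 50% / top 20% tiers."""
    ranked = series.rank(pct=True)
    return pd.cut(ranked, bins=[0, 0.3, 0.8, 1.0], labels=["bottom", "middle", "top"])

for metric in ["recency", "frequency", "monetary_value"]:
    rfm[f"{metric}_tier"] = tier(rfm[metric])

# Size of each combined segment, as a share of the whole customer base
segments = (rfm.groupby(["recency_tier", "frequency_tier", "monetary_value_tier"],
                        observed=True).size() / len(rfm))
print(segments.sort_values(ascending=False))
```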

There are quite a few categories in this graph, resulting from the combinations of three metrics and three tiers each. For more granularity, you could define more tiers per metric. You can also experiment with how you do the segmentation: I mentioned a 20–50–30 split, but you could use other numbers, and even different strategies per metric. For example, since Frequency is a great indicator of customer loyalty, you might want to rank and split into the 5-10-85th percentiles, if you think that’ll help you most accurately pinpoint the best customers.

What if you aren’t sure about how to split your customers, or you want a more data-driven approach? You could try using unsupervised machine learning (ML) algorithms, such as k-means, to discover clusters of customers. This adds the complexity of using ML and of figuring out the number of clusters that truly represents the underlying data distribution (some recommend the elbow method for this, but boy do I have bad news for them). If you have the data science capacity though, going data-driven might produce more accurate results.

Pros and Cons of RFM Approaches

On the positive side:

  • RFM approaches are intuitive, which makes communication and collaboration between data scientists and domain experts easier.
  • Hand-labelled groups are highly meaningful and tailored to business needs. You can work with Marketing to define them, as it’s marketing who’ll no doubt be acting on the results.

Cons:

  • It can be difficult to know how many R, F, & M levels to define: is high-medium-low granular enough? This depends on the business’s needs and how much operational capacity it has to tailor its marketing strategies, customer service, or product lines, to suit different groups.
  • The question of how to combine the R, F and M scores is also tricky. Imagine you’ve ranked customers by Recency and split them into three tiers, where top-tier customers are assigned 3, the middle tier gets 2, and the rest get 1. You repeat this for Frequency and Monetary Value. You now have a few options (compared in the sketch after this list):
How to combine scores from different customer segments? Read on and see. Source: Author provided.
  • With Simple concatenation, a customer who scored R=3, F=3 and M=3 gets a final score of 333, while an all-round bottom tier customer gets 111. Using simple concatenation with three tiers per metric produces up to 27 possible scores, which is a lot (to verify this yourself, count the unique values in the "Concat." columns above). And the more tiers you add, the more combinations you’ll get. You might end up with more groups than you can deal with, and/or create groups that are so small, you don’t know what to do with them, or you can’t rely on any analyses based on them.
  • Summing up will provide you with fewer groups: now your all-round bottom-tier customer scores 1+1+1 = 3, your top tier customer gets 3+3+3 = 9, and every other score will land in this 3-9 range. This might be more manageable, but there’s a new problem. Now the R, F, and M metrics are being treated equally, which may be inappropriate. For example, a bad Recency score is a big warning sign you don’t want to overlook, but using summation, you can no longer see its individual contribution.
  • Adding weighting can tackle these issues: For example, if you found that Frequency was the best indicator of a regular, high CLV shopper, you might multiply F by some positive number, to boost its importance. But this introduces a new challenge, namely, which weighting factors to use? Figuring out some values which result in a fair and useful representation of the data is no easy feat.
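
For concreteness, here’s a small sketch of the three combination options, assuming you already have 1-3 scores per metric; the weights in the last option are purely illustrative:

```python
import pandas as pd

scores = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "R": [3, 1, 2],
    "F": [3, 2, 1],
    "M": [3, 1, 3],
})

# Option 1: simple concatenation, e.g. "333" - up to 27 distinct groups with 3 tiers each
scores["concat"] = (scores["R"].astype(str) + scores["F"].astype(str) + scores["M"].astype(str))

# Option 2: summation - fewer groups (scores from 3 to 9), but R, F and M count equally
scores["summed"] = scores[["R", "F", "M"]].sum(axis=1)

# Option 3: weighted sum - here Frequency is (arbitrarily) treated as twice as important
weights = {"R": 1.0, "F": 2.0, "M": 1.0}
scores["weighted"] = sum(scores[col] * w for col, w in weights.items())

print(scores)
```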

Wrapping Up. Also known as, NOW can we get to the Machine Learning?

Phew. As you can see, modelling historic CLV is no easy feat. Yet I truly believe it’s worth it, and wish more data science projects would focus on truly getting to know the data so far, before they jump into Machine Learning and making predictions.

Nevertheless, I know that’s what some of you are here for. So next time – I promise – I’ll cover the pros and cons of CLV prediction methods. Until then, start exploring that past data! Build up some intuitions for it; it’ll help you in the next step.

Or just hit me up on Substack or X, so I know you’re waiting.

The post Methods for Modelling Customer Lifetime Value: The Good Stuff and the Gotchas appeared first on Towards Data Science.

]]>
Congrats on your Customer Lifetime Value prediction model – now what are you going to do with it? https://towardsdatascience.com/congrats-on-your-customer-lifetime-value-prediction-model-now-what-are-you-going-to-do-with-it-977634b58868-2/ Mon, 07 Aug 2023 13:18:18 +0000 https://towardsdatascience.com/congrats-on-your-customer-lifetime-value-prediction-model-now-what-are-you-going-to-do-with-it-977634b58868-2/ An obsessively detailed guide to Customer Lifetime Value techniques and real-world applications

The post Congrats on your Customer Lifetime Value prediction model – now what are you going to do with it? appeared first on Towards Data Science.

]]>
Congrats on your CLV prediction model – now what are you going to do with it?
I’m a data scientist spilling everything I know about Customer Lifetime Value: how to model it, predict it, and actually make use of it. This is part two of three, and it’s all about use cases for CLV prediction. Your mission, should you choose to accept it, is to take my ideas and run with them. Image source: Author provided.

Call me crazy, but I’ve challenged myself to create the most extensive guide to Customer Lifetime Value (CLV) out there. Codenamed "everything the other tutorials left out", I’m sharing all the ideas and learnings I gained while working on this topic in a real-world data science team, with imperfect data, and complicated client needs.

My last post featured an ever-overlooked topic: use-cases for historic CLV calculation. It went a little viral, so I guess I’m onto something. In this post I’ll discuss:

  • Some essential terminology
  • The goal of CLV prediction
  • CLV prediction uses (going beyond the standard examples, I promise!)

And in upcoming posts, we’ll talk about CLV calculation and prediction methods, their pros and cons, and lessons learned on how to use them correctly.

There’s loads to cover, so let’s get into it!

Laying the Groundwork

Whether you’re a data scientist, analyst or marketer, you need domain knowledge when embarking on data-driven research projects. So if you’ve made it this far without knowing what Customer Lifetime Value is – or how and why businesses should start with calculating historic CLV – then do visit the last post. I designed it to get you asking the right kinds of questions of your own data, which will help make your prediction efforts, and the actions you can take from them, all the more successful. Enjoy it, and see you back here soon.

Contractual vs Non-Contractual Customer Relationships

Speaking of groundwork, let me clarify two terms I’ll use often here, which describe the type of relationship a retailer may have with its customers:

  • A ‘contractual’ relationship, such as a monthly phone or internet contract, is where customers are ‘locked in’. They’ll keep being customers unless their subscription has a planned end date, or they actively cancel.
  • A ‘non-contractual’ situation, which most retail relationships are, has no lock-in. Customers can simply stop shopping with a given retailer at any time, either intentionally or even unknowingly, driven by their own changing needs. Shopping at a specific grocery chain, or on Amazon, are common examples.

The goal of CLV prediction

If you work for a retailer with non-contractual relationships, you never know which purchase will be a customer’s last. Even in contractual situations, customers can cancel at any time (if allowed to), or at least, at their next official renewal point. The uncertainty is scary, huh?

But imagine if you could estimate the likelihood that you’re about to lose a customer, or that you’ve already lost them. What if you could predict how many times they’re going to shop with you again, or renew their policies? What if you even knew, in advance, exactly how much they are going to spend? These are the fundamental tasks of CLV prediction, and they all point to one goal: to estimate the value a customer will generate for a retailer over a given, future period.

CLV Prediction Use Cases

If your mind isn’t already exploding with the possibilities that this can unlock for retailers, I’ve got a few ideas for you. As I said last time, don’t forget to think about how you can combine your predicted CLV information with other business data. This can help you unlock even more insights, and enable even more data-driven actions. For example:

Understand your customer base, and serve them better

Last time, I talked about two possible CLV workflows:

  • Calculate historic CLV → identify different customer segments
  • Identify customer segments → calculate the segments’ historic CLV.

CLV prediction provides the same options. And whichever order you choose, you can then investigate the results, to understand and better meet the specific needs of your customer segments. You especially want to figure out what makes for a high CLV customer, so you can concentrate your efforts on acquiring and serving more customers like them. I listed plenty of ideas last time, based only on using historic data. Now that we’re talking about prediction, using machine learning (ML) algorithms, there are some new possibilities…

  • Use ‘model insights’ to explore characteristics of high CLV customers. Certain ML algorithms come with ‘explainability’ features. These help data scientists understand why the model made the predictions it did. A ‘decision tree’ algorithm, for example, learns rules from combinations of its input features: rules like, ‘customers who are registered members, aged under 30, living in city A, who made their first purchase in-store, will spend an average of $327 in the next three months.’ Data scientists and domain experts, such as your marketing team, can try to interpret what these rules mean in the real world. Why is city A important? Do people of a certain income bracket live there? Or are the stores there just better, e.g. with nicer checkouts or a wider product assortment?
A decision tree learns to progressively split a dataset based on different criteria. It discovers these criteria for itself, based on which splits produce the most homogeneous (internally similar) subsets of the data after splitting. In this case, it will progressively split up customers based on information about them and their transaction history. In the end, there’ll be buckets of customers. The average (mean) CLV per bucket will be the prediction for all customers in that bucket. Source: Author provided.
  • There are also machine learning explainability libraries which can show how different features impacted the model’s predictions. Again, these can be investigated by data scientists and domain experts together. Thinking creatively and critically can help you understand your customer relationships, and find ways to improve them.

Nudge customers along a loyalty journey

  • Turn highly-engaged new customers into happy regulars. Imagine that, among your new customers, some have an unusually high predicted CLV. These customers deserve special attention. Alongside your usual welcome emails, you could offer them extra promotions, refer-a-friend bonuses, or even customer satisfaction surveys. I know, surveys, eek! But if you make the process fun, quick, and easy, and show that you value the customer and their opinion, then you might get lucky. Wow them with your commitment to customer satisfaction, and the relationship is off to a good start.
  • Turn loyal customers into brand ambassadors. Similarly, long-term customers with a higher predicted CLV than their cohort need extra TLC. You can use similar tactics as just stated, but you’ll probably have to be more thoughtful about it. New customers often expect a flurry of welcome emails, but older customers might need something a little more catchy to get them to stop and read.

Decide which customers to save…

  • You might also have some long-term customers whose predicted CLV is starting to drop. If so, you need to figure out why. Have their needs and preferences changed? Have they forgotten about you? Or is it something more? This is where combining CLV data with other data sources is so valuable. Look at the customer’s prior purchase timelines, what they’ve bought, whether they’ve made lots of returns or contacted customer service, and so on. Try to establish whether there’s a poor product-customer fit, or whether the customer just doesn’t see the fit which is there.
  • The goal is to figure out whether there’s still potential value for both you and the customer. You could do this by looking for any obvious changes, like a change of customer address, a shift in product category purchased, or dramatic decreases in purchase quantity, frequency or value. These might indicate a real change in the customer’s needs. For example, if they suddenly start buying baby items, they probably have new preferences, and a changed budget. Use this new knowledge to intuit whether the relationship is worth saving (given the cost of customer acquisition, I’d always assume it’s worth it, unless there’s strong evidence otherwise). Then you can try to fix the situation. Last post I mentioned including sizing recommendations as an example of improving the product fit. Other options could be more targeted, clearer communication, or even providing product recommendations (another data science topic, for another blog).

… and which ones to ‘fire’

  • Some customers are not worth keeping. They return most of what they buy, and the costs of this go deeper than just refunding the original purchase price: think, shipping expenses (Cost of Delivery Services), payment fees from PayPal, Mastercard, etc (Cost of Payment Services), and wages to the staff who packed and unpacked the goods, checked their condition, steam-ironed them, and re-shelved them in the warehouse. At some point, their net revenue may even be negative. And if they also place heavy demands on your customer service team – for example, if refund requests are handled manually in your company – then they’re also detracting from the effort you could spend servicing better quality customers. So, if a customer’s CLV prediction is low, and their historic costs are high, you can simply stop engaging with them. I don’t mean ignore their service requests! You can bet they’ll spread bad reviews about you to other people then. But remove them from email and other marketing lists, and let the relationship fizzle out.

Detect and foster business relationships

  • Business customers have the potential to spend large amounts, very often, and may be less likely to leave you for other providers (owing to complicated internal purchasing procedures). And if they do leave you, they can take significant revenues with them. Ideally your business customers will make themselves known to you, but when they don’t, CLV could help you detect them. If you predict an unusually high number or value of future transactions for a specific customer, it might be a business. Try to verify this using their purchase history. How do the number and kinds of items or services bought compare to those of the entire customer population, or to any other business customers you do know of? You can then reach out to them for confirmation. And once you know they’re a business, you can give them extra focus, offering things like special shipping conditions, personalised service, and relationship management.

… and automate customer relationship management

  • Perhaps the greatest benefit of predicting CLV is that it enables you to execute the above use-cases proactively. Historic analyses take time, and don’t always get priority among busy data science or marketing teams. Unfortunately, that means customers can rack up a lot of costs for the business before anyone notices. Worse still, their purchase frequency might drop off significantly, before you get around to re-engaging them. This can lead to losing that customer altogether.
  • Luckily, your data science and engineering teams should be monitoring your in-production CLV prediction models anyway. This means they can set up automated alerts to fire when there’s a CLV risk. A simple way to start is to send an alert whenever a customer’s predicted CLV drops below a certain monetary threshold. However, this doesn’t take the distribution of typical CLV into account. There are probably a small minority spending big every month, and a large but faithful long-tail, who spend much less. If you try to create a single "CLV risk threshold", you could end up over-reacting to slight spending fluctuations in the long-tail, and missing drop-offs among the big spenders. A more sophisticated approach is to fire an alert whenever an individual’s prediction drops by a certain percentage, compared to their last prediction (see the sketch after this list). Alerts can be routed to marketing or customer service staff, via email or Slack, for example, enabling them to act fast to save the relationship.
  • Similarly, you can fire an alert for a predicted CLV increase, as this customer might be ready to be nudged along their loyalty journey towards ambassadorship. You can even set up alerts when certain input features to the model change dramatically.* Increased cost features could indicate a customer which needs to be ‘fired’, while increased units purchased could reveal a new business customer in need of special attention.
  • *Another thing your data scientists and engineers should do, anyway.
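
Here’s a minimal sketch of that percentage-drop alert; the 50% threshold, the hard-coded predictions and the send_alert placeholder stand in for whatever your monitoring stack actually provides:

```python
import pandas as pd

# Illustrative predictions from the two most recent scoring runs
previous = pd.Series({"cust_1": 250.0, "cust_2": 40.0, "cust_3": 900.0})
current = pd.Series({"cust_1": 245.0, "cust_2": 12.0, "cust_3": 880.0})

DROP_ALERT_THRESHOLD = 0.5  # alert if predicted CLV falls by 50% or more

def send_alert(customer_id: str, drop_pct: float) -> None:
    # Placeholder: in practice, post to Slack, send an email, or open a CRM task
    print(f"ALERT: {customer_id} predicted CLV dropped by {drop_pct:.0%}")

drops = (previous - current) / previous
for customer_id, drop_pct in drops.items():
    if drop_pct >= DROP_ALERT_THRESHOLD:
        send_alert(customer_id, drop_pct)
```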

Anticipate revenue inflows

  • Predicting CLV over a customer’s entire loyalty lifetime means forecasting across too long a time-span to be reasonable or reliable. Instead, we usually try to estimate the value a customer will generate over a specific future period, such as the next three months. You can thus anticipate revenue inflows, using the predictions of your entire customer base. Of course, this will be even more accurate if you have a good understanding of your customer acquisition rate, and some good historical Analytics on when in their lifetime customers tend to spend (another topic from last post). This way, even for new customers, for whom a CLV prediction might not yet be very reliable, you can factor in average spending as identified in your historic CLV analyses.
Customer Lifetime Value prediction can be treated as a ‘supervised machine learning’ problem: for each customer, calculate features like how often they spent, how much, and so on, over the last n time periods. Then predict their spending over the next n time periods, e.g. the next 30 days. Source: Author provided.
  • For contractual situations, anticipating revenue can be even easier. You know how much customers will spend per month; the only thing you have to worry about is whether and when they’ll churn. Spoiler: I’m working on a churn prediction guide, just as over-the-top detailed as this one!

I’m still waiting for the Machine Learning stuff!

You’re right, enough about use cases, it’s time to get technical. So next time I’ll cover methods for CLV analysis, their advantages and disadvantages, and plenty of ‘gotchas’ to be aware of. And in part four, I’ll do the same for probabilistic and Machine Learning methods for CLV prediction. If you want to be updated about that, or all the other Data Science and marketing content I publish, you can follow me here, on Substack or on Twitter (I mean, X). You can also check out my LinkedIn Learning course – Machine Learning in Marketing – it’s in German, but the code worksheets are in English, with plenty of comments to make them easy to follow.

The post Congrats on your Customer Lifetime Value prediction model – now what are you going to do with it? appeared first on Towards Data Science.

]]>
From analytics to actual application: the case of Customer Lifetime Value https://towardsdatascience.com/from-analytics-to-actual-application-the-case-of-customer-lifetime-value-91e482561c21-2/ Sun, 02 Jul 2023 15:06:57 +0000 https://towardsdatascience.com/from-analytics-to-actual-application-the-case-of-customer-lifetime-value-91e482561c21-2/ Part one of a comprehensive, practical guide to CLV techniques and real-world use-cases

The post From analytics to actual application: the case of Customer Lifetime Value appeared first on Towards Data Science.

]]>
_Warning: This might be the most extensive guide to CLV you’ll ever come across, and it’s all informed by my experience working with it. So strap in. (Or watch my 60 second summary, if you don’t mind missing the juicy bits). Source: Smarter Ecommerce._

Whether you’re a data scientist, a marketer or a data leader, chances are that if you’ve Googled "Customer Lifetime Value", you’ve been disappointed. I felt that too, back when I was helping lead a new CLV research project in a data science team in the e-Commerce domain. We went looking for state-of-the-art methods, but Google returned only basic tutorials with unrealistically manicured datasets, and marketing ‘fluff’ posts describing vague and unimaginative uses for CLV. There was nothing about the pros and cons of available methods when applied to real world data, and with real world clients. We learned all that on our own, and now I want to share it.

Presenting: all the stuff the CLV tutorials left out.

In this post, I’ll cover:

  • What is CLV? (I’ll be brief, as this part you probably already know)
  • Do you really need CLV prediction? Or can you start with historic CLV calculation?
  • What can your company already gain from historic CLV information, especially when you combine it with other business data?

In the rest of the series, I’ll present:

And I’ll sprinkle some Data Science best-practices throughout. Sound like a plan? Great, let’s go!

What is Customer Lifetime Value?

Customer Lifetime Value is the value generated by a customer over their ‘lifetime’ with a retailer: that is, between their first and last purchase there. ‘Value’ can be defined as pure revenue: how much the customer spent. But in my e-commerce experience, I found that more mature retailers care less about short-term revenue than they do about long-term profit. Hence, they’re more likely to consider ‘value’ as revenue minus costs. As we’ll see in part three though, knowing which costs to subtract is easier said than done…

Calculation vs Prediction?

Experienced R&D teams know that for new data science projects, it’s best to start simple. For CLV, this can be as ‘easy’ as using historic transactions to calculate lifetime value so far. You can:

  • calculate a simple average over all your customers, or
  • calculate an average based on logical segments, such as per demographic group.

Even this rearward-facing view has many uses for a retailer’s marketing and purchasing (that is, inventory management) teams. In fact, depending on the company’s data literacy level and available resources, this might even be enough (at least to get started). Plus, data scientists can get a feel for the company’s customers’ typical spending habits, and this can be invaluable if the company does later want to predict future CLV on a per-customer basis.
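
To make those two options concrete, here’s a minimal pandas sketch. The transactions table and its column names (customer_id, order_value, age_group) are purely illustrative assumptions, not a prescribed schema:

```python
import pandas as pd

# Illustrative data: one row per order (names and values are made up for the example)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_value": [50.0, 30.0, 120.0, 20.0, 25.0, 40.0],
    "age_group":   ["18-30", "18-30", "31-45", "46-60", "46-60", "46-60"],
})

# Historic CLV so far: total spend per customer
clv_per_customer = transactions.groupby("customer_id")["order_value"].sum()

# Option 1: one simple average over all customers
print(clv_per_customer.mean())

# Option 2: average CLV per logical segment, e.g. demographic group
clv_per_segment = (
    transactions.groupby(["age_group", "customer_id"])["order_value"].sum()  # CLV per customer
    .groupby(level="age_group")                                              # then regroup by segment
    .mean()
)
print(clv_per_segment)
```

In a real project you’d swap pure revenue for revenue minus costs once you’ve decided which costs to include.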

To help you and the company decide whether you need historic CLV insights or future predictions, let’s view some use-cases for each. After all, you want the marketing, management, and data science teams to be aligned from the beginning on how the project’s outputs are going to be used. That’s the best way to avoid building the wrong thing, and having to start again later.

Combining CLV information with other business data

Many tutorials only discuss uses for CLV prediction, on a per-customer basis. They list obvious use-cases, like ‘try to re-engage the predicted low-spenders to get them shopping more.’ But the possibilities go so much further than that.

Whether you get your CLV information via calculation or prediction, you can amplify its business value by combining it with other data. All you need is a CLV value, or some kind of CLV level score (e.g. High, Medium, Low), per customer ID. Then you can join this with other information sources, such as:

  • the products customers are buying
  • the sales channels (in-store, online, etc) they’re using
  • returns information
  • shipping times
  • and so on.

I’ve illustrated this below. Each box shows a data table and its column names. See how each table contains a Customer_ID? That’s what allows them all to be joined. I’ll explain the columns of the CLV_Info table in part three; first, I promised you use-cases.

A possible CLV database. Source: author provided.
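
Here’s a rough sketch of what that join could look like in pandas. The tables loosely mirror the illustration above, but the exact column names (CLV_Level, Sales_Channel, Return_Reason) and the sample rows are my own assumptions for demonstration purposes:

```python
import pandas as pd

# Hypothetical tables, each sharing the Customer_ID key
clv_info = pd.DataFrame({"Customer_ID": [1, 2], "CLV_Level": ["High", "Low"]})
orders = pd.DataFrame({
    "Customer_ID": [1, 1, 2],
    "Product_ID": ["A", "B", "A"],
    "Sales_Channel": ["online", "in-store", "online"],
})
returns = pd.DataFrame({"Customer_ID": [2], "Product_ID": ["A"], "Return_Reason": ["sizing"]})

# Because every table carries Customer_ID, they can be joined step by step
enriched = (
    orders
    .merge(clv_info, on="Customer_ID", how="left")
    .merge(returns, on=["Customer_ID", "Product_ID"], how="left", indicator="was_returned")
)
enriched["was_returned"] = enriched["was_returned"].eq("both")  # True if the order row matched a return

# Example question: do low-CLV customers return a larger share of their purchases?
print(enriched.groupby("CLV_Level")["was_returned"].mean())
```

In practice these tables would come from your data warehouse rather than being defined inline, but the joining logic is the same.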

Uses for historic CLV calculation

Let’s say you’ve ranked all your customers by total spending so far, and segmented them somehow. For example, your marketing team asked you to split the data into the Top 10% of Spenders, the Middle 20%, and the Bottom 70%. Perhaps you’ve even done this multiple times on different subgroups of your customer base, such as per country, if you have online shops around the world. And now, imagine you’ve combined this with other business data, as described above. What can your company do with this information?
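
In case you’re wondering how such a split can be done, here’s one possible sketch using pandas quantile binning. The spend values are invented, and the 10/20/70 thresholds simply follow the example above:

```python
import pandas as pd

# Assumed input: total spend so far (historic CLV), indexed by customer ID
clv_per_customer = pd.Series(
    [900, 420, 310, 250, 180, 120, 90, 60, 40, 15],
    index=range(1, 11),
    name="historic_clv",
)

# Cut the customer base at the 70th and 90th spend percentiles
clv_tier = pd.qcut(
    clv_per_customer,
    q=[0, 0.70, 0.90, 1.0],
    labels=["Bottom 70%", "Middle 20%", "Top 10%"],
).rename("clv_tier")

print(pd.concat([clv_per_customer, clv_tier], axis=1))
```

Run the same logic per country (or any other subgroup) by filtering first, and you have the multiple segmentations described above.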

Honestly, there are so many questions you can ask of your data, and so much you can do with the answers, that I could never cover it all. I also don’t have the domain knowledge you do, and that’s a massively important, massively undervalued thing in data science. But in the next few sections, I’ll give you some ideas to get you thinking like a data-driven marketer. It’s up to you to take them further…

Explore CLV segments and their needs

  • What makes a top-tier customer? Are they extremely regular, modest spenders? Or do they shop less often, but spend more per transaction? Knowing this helps your marketing and inventory teams identify what kind of customers they really want to acquire – and retain! Then they can plan marketing and customer service efforts, and even inventory and product promotions, accordingly.
  • Why are costs high and/or revenue low for your bottom-tier shoppers? Are they only ever purchasing items at extreme discounts? Always returning things? Or buying on credit and not paying on time? Apparently there’s a poor product-customer fit – could you improve it by showing them different products? Or here’s another question: are your bottom-tier customers always buying one product and then never shopping with you again? Maybe it’s a ‘poison product’, which should be removed from your inventory.
  • Are your high CLV customers more satisfied? Why? Imagine you’re a clothing retailer and your customers have an option to save their sizing information to their account. This allows your online store to make sizing recommendations when a logged-in customer is about to add an item to their basket. You also notice that most of your high CLV customers have saved their sizes, and that they have fewer returns. Hence, you suspect that sizing recommendations reduce return rates, which improves customer satisfaction and keeps shoppers loyal.
  • How can you action this information? Here’s just one idea: the website team could add prompts reminding users to add their size information. Ideally this will increase revenue, decrease costs, and improve customer satisfaction, but if you’re truly data-driven then you’ll want to A/B test the change. This way you can measure the impact, controlling for outside effects, and keeping an eye on ‘guardrail’ metrics. These are metrics you would not want to see change during an A/B test, such as the number of account deletions.
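
For the statistically inclined, here’s a rough sketch of how you might read out such an A/B test on both the primary metric (return rate) and a guardrail metric (account deletions). The choice of a two-proportion z-test and every number below are illustrative assumptions, not results from a real experiment:

```python
from math import sqrt

from scipy.stats import norm


def two_proportion_ztest(x_a: int, n_a: int, x_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in proportions between variants A and B."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * (1 - norm.cdf(abs(z)))


# Primary metric: did the sizing prompt reduce returns? (counts are made up)
diff, p = two_proportion_ztest(x_a=300, n_a=5000, x_b=250, n_b=5000)
print(f"Return rate change: {diff:+.2%}, p = {p:.3f}")

# Guardrail metric: account deletions should NOT move significantly
diff_g, p_g = two_proportion_ztest(x_a=40, n_a=5000, x_b=45, n_b=5000)
print(f"Account deletion change: {diff_g:+.2%}, p = {p_g:.3f}")
```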

Explore your demographics

The last section was about CLV tiers; now I’m referring to different customer subgroups, such as those based on age range, gender, or location. There are two ways you could do this.

  1. Perform the above CLV analysis on your whole customer base, and then see how your subgroups are distributed among CLV tiers, like this:
Distributions of low (red), medium (yellow) and high (green) CLV customers for different age segments. Source: Author provided.
  2. Split into subgroups first, and then do a CLV analysis for each.

Or, you can try both approaches! It depends on the business needs and resources available. But again, there are plenty of interesting questions:
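
To illustrate the first approach, a simple cross-tab gives you the tier mix within each subgroup. The customers table here is invented purely for demonstration; in reality you’d have one row per customer, with their CLV tier and demographic attributes joined on as shown earlier:

```python
import pandas as pd

# Hypothetical per-customer table: CLV tier plus a demographic subgroup
customers = pd.DataFrame({
    "age_group": ["18-30", "18-30", "18-30", "31-45", "31-45", "46-60", "46-60", "46-60"],
    "clv_tier":  ["High", "Low", "Low", "Medium", "High", "Low", "Medium", "High"],
})

# Approach 1: tier the whole base first, then look at the tier mix per subgroup
tier_mix = pd.crosstab(customers["age_group"], customers["clv_tier"], normalize="index")
print(tier_mix)  # each row sums to 1: the share of Low/Medium/High customers per age group

# Approach 2 would instead filter to one subgroup at a time
# and repeat the full CLV analysis within it.
```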

  • Which subgroups do you have? Forget the obvious ones I just listed; let’s get creative. For example, you could split customers by their original acquisition channel, or the channel they now use most: online versus in-store, app versus website. You could split by membership level, if you offer it. Using tracking cookies from your webstore, you can even split by preferred shopping device: desktop computer versus tablet versus mobile. Why? Well, maybe your mobile-phone-based shoppers have lower basket values, because people prefer to make big purchases on a desktop. The more domain knowledge you can build up, the better your analysis and – if it comes to it – machine learning efforts will be.
  • How does buying behaviour differ by customer subgroup? When do they shop? How often? For how much? Do they respond well to promotions and cross-sells? How long do they stay loyal? Do they spend often at the beginning of their lifetime and then tail off, or is it some other pattern? This kind of information can help you plan marketing activities and even estimate future revenue, and I shouldn’t need to tell you how useful that is…
  • What’s a ‘typical’ customer journey? Are you acquiring most of your new customers in physical stores? Does that mean your stores are great but your website sucks? Or are your in-store workers better at getting people to sign up for membership than your website is? Either way, you could try to improve the website, or at least, be smarter about which channels you advertise on. And what about new customer offers, newsletter sign-up discounts, or friend referrals: are they attracting solid numbers of high CLV customers? If not, time to reevaluate those campaigns.

Get clever about your offering, and how you market it

  • If you understand your customers better, you can serve them better. For a retailer, that could include stocking up on the types of products their best customers seem to favour. A mobile phone provider could improve the services that its high CLV customers are using, like adding features to their mobile app. Of course, you’ll want to A/B test any changes, to make sure you don’t introduce changes that customers hate. And don’t abandon your low CLV customers – instead, try to find out what’s going wrong, and how you can improve it.
  • Similarly, if you understand your customers, you can speak their language. By showing the right ads, at the right time, on the right channels, you can acquire customers you want, and who want to shop with you.

Know what to spend on customer acquisition

  • Ever wondered why companies start emailing you when you haven’t shopped there for a while? It’s because it’s expensive to acquire a customer, and they don’t want to lose you. That’s also why, when you browse one e-commerce site, those products follow you around the internet. Those are so-called ‘programmatic ads’, and they appear because the company paid for that first click, and they’re not willing to give you up, yet.
  • As a retailer, you don’t just want to throw money at acquiring any old customer. You want to gain and retain the high value ones: those who’ll stay loyal and generate good revenues over a long lifetime. Calculating historic CLV allows you to also calculate your break-even points: how long it took each customer to ‘repay’ their acquisition cost. What’s the average, and which CLV tiers and customer demographic groups pay themselves off fastest? Knowing this will help marketing teams budget their customer acquisition campaigns and improve their new-customer welcome flows (i.e. those emails you get after the first purchase at a new shop), to increase early engagement and thus improve break-even times. I’ve sketched one way to compute those break-even times below.
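
Here’s one possible sketch of that break-even calculation. The monthly revenue figures and acquisition costs are hypothetical; in a real pipeline they’d come from your order history and marketing spend data:

```python
import pandas as pd

# Hypothetical monthly revenue per customer, plus each customer's acquisition cost (CAC)
monthly = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "months_since_acquisition": [1, 2, 3, 1, 2, 3],
    "revenue": [10.0, 15.0, 20.0, 5.0, 5.0, 5.0],
})
cac = pd.Series({1: 30.0, 2: 40.0}, name="acquisition_cost")

# Cumulative revenue per customer over their lifetime so far
monthly = monthly.sort_values(["customer_id", "months_since_acquisition"])
monthly["cumulative_revenue"] = monthly.groupby("customer_id")["revenue"].cumsum()

# Break-even month: the first month where cumulative revenue covers the acquisition cost
monthly = monthly.join(cac, on="customer_id")
breakeven = (
    monthly[monthly["cumulative_revenue"] >= monthly["acquisition_cost"]]
    .groupby("customer_id")["months_since_acquisition"]
    .min()
)
print(breakeven)  # customers missing from this result haven't broken even yet
```

From here, averaging break-even times per CLV tier or demographic group is just another groupby away.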

Track performance over time

  • Re-evaluate to identify trends. Businesses and markets change, beyond the control of any retailer. By periodically re-calculating your historic CLV, you can continuously build your understanding of your customers and their needs, and whether you’re meeting them. How often should you re-run your analysis? That depends on your typical sales and customer acquisition velocity: a supermarket might re-evaluate more often than a furniture dealer, for example. It also depends on how often the business can actually handle getting new CLV information and using it to make data-driven decisions.
  • Re-evaluate to improve. Periodically re-calculating CLV will help you ensure you’re gaining ever-more-valuable customers. And don’t forget to run extra evaluations after introducing a big strategy change, to ensure you’re not sending numbers in the wrong direction.

So what about CLV prediction…?

I know, I know… you want to talk machine learning, and what you can use CLV predictions for. But this post is long enough as it is, so I’m saving that for next time, along with the lessons my team learned on how to model historic CLV and predict future CLV using real-world data. Then in parts three and four, I’ll cover the pros and cons of the available modelling and prediction methods.

If AI and marketing are your thing, you can also check out my two-part series, "AI in Marketing: the Power of Personalisation," right here. And if you’re loving all of this, then don’t forget to subscribe. See you next time!

The post From analytics to actual application: the case of Customer Lifetime Value appeared first on Towards Data Science.

]]>