Marketing Data Science BOM

A ChatGPT-Inspired Bill of Materials for Marketing Data Science Projects

Data Science

What makes a marketing data science project successful? What is required to make this set of tools work for you? Much has been written about techniques for individual models, feature engineering, et cetera, yet discussions of general requirements often omit important details. Here, I discuss some things critical to a general set of applications.


January 15, 2023

I’ve read many articles discussing requirements for individual models within data science projects and ways to determine/improve their generalization performance. Yet the general requirements of data science projects are not often discussed. Here, I discuss things I think are critical for the success of such projects.

While there are a great variety of possible projects, I interrogated ChatGPT about ways one can use “data science” or “machine learning” to help a business achieve objectives. The results were the kind of fluff-based summary you’d get from an undergrad with a procrastination problem. But the fluff contained a good high level survey of general use cases.

Combining the ChatGPT fluff list with personal observations from experience working on several such projects, here I argue for a collection of requirements for a successful data science project.

You Need Good Data

This is the most obvious requirement and the most neglected in practice. We require appropriate data to do data science. Failures at this level are often overlooked and cause data science projects to cost more, be more misleading, or end up failing to produce anything of value.

Discoverable, Meaningful, Interpretable, Reliable, and Timely

One way to look at our data requirements is through the five qualities above. Data should be discoverable: you should be able to find out what data is recorded, where it is recorded, and what it means. To do this the data must be documented. Documentation could be data dictionaries, Confluence pages, a wiki: something that is kept updated as new data is added or definitions are changed. Note that having a person in the company whom you can ask is not a solution. People make mistakes, and when data documentation is exclusively an oral tradition, wrong meanings frequently get embedded all across the organization.

Data should be meaningful and interpretable. The data should have a meaning, and only one meaning. For example, a time of sale should have a timezone either embedded in the data or in the definition of the data. It should also be documented whether it’s the time the transaction was started by the customer, the time the transaction was closed by the payment processor, or some third magical thing. Data should also be interpretable: we should be able to understand the meaning of the data, both when it exists and what its absence means.
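For example, one defensive way to give a time of sale exactly one meaning is to refuse timezone-naive inputs and store the canonical UTC instant alongside the customer's local offset. A minimal sketch; the field names are illustrative, not a real schema:

```python
from datetime import datetime, timezone

# Record both the canonical UTC instant and the customer's local offset,
# so "time of sale" has exactly one meaning. Field names are illustrative.
def record_sale_time(local_dt: datetime) -> dict:
    if local_dt.tzinfo is None:
        raise ValueError("refusing a naive datetime: timezone must be explicit")
    return {
        "sale_time_utc": local_dt.astimezone(timezone.utc).isoformat(),
        "utc_offset_minutes": int(local_dt.utcoffset().total_seconds() // 60),
    }
```

Rejecting naive datetimes at the boundary pushes the "which timezone?" question back to whoever produces the data, where it can actually be answered.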

Data should be reliable and timely. We should be able to know when data becomes available, and it should be updated with enough frequency and consistency that it has value. The data should also be accurate, both in the sense that it exists when some real fact exists and in the sense that it does not contradict other data.

This is a subset (and loose adaptation) of commonly cited data quality dimensions. A fuller list is as follows.

  1. Accuracy - The extent to which data agrees with the real world.
  2. Completeness - The data is not missing. Or, when a value is missing, it is deliberate and informative.
  3. Consistency - Does the data agree with itself?
  4. Uniqueness - Do things accidentally get recorded multiple times? Can we rely on the count of a thing to represent a true fact?
  5. Validity - Are business rules and formulas correctly represented in the data? Are categorical values within their allowed set of values? Is there bad data? Is there testing data in the production data sources?
  6. Timeliness - How long does it take data to get processed? Are there temporal dependencies that can become race conditions later on?
  7. Meaningfulness - Do we understand what a value means? For a datetime, do we know the timezone? For categorical data, do we know the meaning of each level? If nullable, what does null mean?
  8. Auditability and Currency - Can we derive the state of things at a point in time in the past? Does the data represent the current state now, or is it stale and unactionable?
  9. Conformity - The same data should have the same type everywhere it is stored (including timestamps), as well as the same scale, units, and precision. Naming should also be consistent.
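Several of these dimensions can be checked mechanically. Below is a minimal sketch of automated checks for uniqueness, validity, and completeness on plain-Python records; the column names and the allowed channel values are made up:

```python
# Illustrative checks for three of the dimensions above (uniqueness,
# validity, completeness). Column names and allowed values are made up.
VALID_CHANNELS = {"email", "sms", "push"}

def quality_report(rows: list[dict]) -> dict:
    ids = [r["id"] for r in rows]
    return {
        # Uniqueness: how many ids appear more than once?
        "duplicate_ids": len(ids) - len(set(ids)),
        # Validity: categorical values outside their allowed set.
        "invalid_channel": sum(r["channel"] not in VALID_CHANNELS for r in rows),
        # Completeness: missing values that should be present.
        "missing_amount": sum(r.get("amount") is None for r in rows),
    }
```

Running a report like this on every load, and alerting when counts are nonzero, is one simple way to make data quality a process rather than a hope.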

A best practice is that data should have a published ontology shared with everyone in the organization. This ensures that relationships between data are defined and that everyone in the company shares the same definitions. One of the most difficult problems to deal with is when different groups use different definitions for the same term, and reconciliation requires massive reworking of one group’s processes.

An ontology is a controlled vocabulary that the organization can use to discuss categories, properties, and relationships between concepts, data, and entities (I think this definition comes from Wikipedia).

As an example, let’s look at a small piece of the FIBO ontology for time. It gives the following types of “time instant”:

  • Date
    • Calculated date (using a formula)
      • Relative date (relative to another date)
      • Specified date
    • Explicit date
  • Datetime - Date and time; timezone is optional
  • Datetimestamp - Like datetime, but timezone is required
  • Time of day

Then it goes on to time intervals. What you’ll notice is that it’s a very explicit vocabulary.
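FIBO itself is published as a formal ontology, but even a small in-code controlled vocabulary can enforce some of that explicitness. The sketch below (class and member names are my own) mirrors the datetime/datetimestamp distinction above:

```python
from enum import Enum

# A tiny controlled vocabulary in code, loosely mirroring the FIBO
# time-instant distinctions above; the names here are my own invention.
class TimeInstantKind(Enum):
    DATE = "date"
    DATETIME = "datetime"            # timezone optional
    DATETIMESTAMP = "datetimestamp"  # timezone required
    TIME_OF_DAY = "time_of_day"

def requires_timezone(kind: TimeInstantKind) -> bool:
    # Only a datetimestamp makes the timezone mandatory.
    return kind is TimeInstantKind.DATETIMESTAMP
```

An enum like this is obviously far short of a real ontology, but it keeps the vocabulary out of free-form strings and gives every consumer the same finite set of meanings.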

Data Must Be Complete and Support Auditability

When I say complete I mean that we have data to describe what happened when something happened. If you are selling a product to a customer then having complete data for an offer would mean knowing:

  1. What product was offered?
  2. Who was it offered to?
  3. When (both for the customer’s local time and for our standard time) and by what channel was it offered?
  4. What was the amount of the offer, including what discounts, and what incentives were included?
  5. Did the customer take the offer? If so, when did they take the offer?
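One way to make these five questions concrete is a record type with a field for each answer. The sketch below is illustrative, not a production schema; every name and type is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

# A record with a field for each of the five questions above.
# All names and types are illustrative.
@dataclass(frozen=True)
class OfferRecord:
    product_id: str                 # 1. what was offered
    customer_id: str                # 2. who it was offered to
    offered_at_utc: str             # 3. when (our standard time)
    offered_at_local: str           #    ...and the customer's local time
    channel: str                    #    ...and by what channel
    amount: float                   # 4. the amount of the offer
    discounts: tuple[str, ...]      #    ...including discounts
    incentives: tuple[str, ...]     #    ...and incentives
    accepted_at_utc: Optional[str] = None  # 5. None means not (yet) accepted

    @property
    def accepted(self) -> bool:
        return self.accepted_at_utc is not None
```

If a record like this can be filled in for every offer, the data is complete in the sense described above; every field you cannot fill in marks a gap.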

Data must be auditable in the sense that we can reconstruct our picture of the state of things at the time events took place. Operational databases tend to only store the current state of the users, but we may need to know the state of the users last week when we emailed them.
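A minimal way to support this kind of auditability is an append-only history of state changes, from which the state at any past moment can be recovered. A sketch, with illustrative names and ISO-8601 timestamps as plain strings:

```python
from dataclasses import dataclass
from typing import Optional

# Instead of overwriting the user's state, append (as_of, state) pairs
# and look up the state at any past moment. Illustrative, not a schema.
@dataclass(frozen=True)
class StateChange:
    as_of: str   # ISO-8601 timestamp; same format sorts lexicographically
    state: dict

def state_at(history: list[StateChange], when: str) -> Optional[dict]:
    """Latest recorded state at or before `when` (history sorted by as_of)."""
    current = None
    for change in history:
        if change.as_of <= when:
            current = change.state
        else:
            break
    return current
```

With this shape, "what did the user look like when we emailed them last week?" is a lookup rather than a reconstruction effort.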

Data Must Describe the Role of Randomization in Records

If events occurred because of randomization, we need to record that this occurred, and often we need to record the probability of the random event. This is especially important when the probability changes over time: in these settings, simple statistics like the sample mean no longer estimate the corresponding population quantities.
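To see why the logged probabilities matter, here is a minimal sketch of inverse-probability weighting (a Horvitz-Thompson-style estimator): each treated user's outcome is weighted by 1/p, where p is the logged probability that they were treated, so users who were more likely to be sampled don't dominate the estimate. The records below are hypothetical:

```python
# Naive approach: average the outcomes of whoever happened to be treated.
# Biased when treatment probabilities varied across users.
def naive_mean(outcomes: list[float]) -> float:
    return sum(outcomes) / len(outcomes)

# Inverse-probability weighting: weight each treated user's outcome by
# 1/p, where p is the logged treatment probability, then divide by the
# size of the whole population (treated or not).
def ipw_mean(records: list[tuple[float, float]], n_population: int) -> float:
    """records: (outcome, treatment_probability) for treated users only."""
    return sum(y / p for y, p in records) / n_population
```

This only works if p was stored at randomization time; it cannot be recovered afterwards, which is exactly the point of the requirement above.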

Sometimes experiments have a control group that does not receive a touch and a challenger group that does. If we do not store who was in the control group then, since control users didn’t get the message, identifying them later means reconstructing who was eligible for the message at the time of randomization but did not receive one. That puts incredible stress on data quality, since reproducing targeting at a historical point in time is almost sure to fail.

Data Quality Must Be A Priority

Ensuring the above means people and processes. Usually these kinds of things can be rolled up into data governance. But if you can’t identify the people responsible for the above then you might be better off putting off data science projects until data quality can be better addressed. It’s unlikely any model built on bad data will produce good results, and organizations that fail at simple data quality are unlikely to be able to support running machine learning models and distributing their scores to act on any model that (by some miracle) does work.

Historical data is the way it is for a reason. Sometimes the main difference between customers is how the business treats them. Models built using historical data have the problem that the relationships they see might not be something that we can reproduce by changing treatments to customers. Observational models alone cannot establish cause-and-effect relationships that we need to advance business objectives.

There are two ways to address this. The gold standard is to introduce randomization into the system, either directly or through experimentation. Another option is to use causal inference to build a model of the relationships within the data and compensate for this in the analysis.

The main thing I want to highlight here is the following: even if the data exactly depicts what happened, when what happened didn’t vary enough to reveal different customer behavior, models won’t be able to use that data to alter customer behavior.

Good data is a preliminary requirement for successful data science projects. Most initiatives have lots of stages that have different requirements. It’s important to be explicit about the type of project and the stage of development to make such projects successful.

Organizing Initiatives

There’s a number of ways to organize different projects and initiatives, and each demonstrates something important about the needs of the project. Initiatives have different goals, target customers differently, and exist at different stages of their lifecycle.

Population to Personalization

There exists a continuum over how much individualization we associate with our customers. And as we travel from one end of this continuum to the other we are less and less able to gain interpretable insights, but more able to do optimization.

On one end of the continuum we have experiments where we look to see which treatment is best to give all customers. Here, customers are a big group and the size of this grouping allows us to make certain inferences about them. Our experiments can be well powered and can tell us a lot about our population of customers (or potential customers).

Further along this scale is segmentation. Here, we label our customers with one of a small, finite set of segment labels, and we treat different segments differently. Experiments are employed to understand impacts, but the fewer users a segment contains, the more difficult it is to do experimentation and gain insights.

Even further along this scale is personalization, where we have predictive models that are often black boxes taking a lot of data about a customer and choosing a best treatment. Experiments here cannot give us insights about individuals, but are limited to understanding the aggregate behavior of such models. It’s very difficult to learn about our customers in this setting. Yet such methods can give long run optimal performance.
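As a concrete, if toy, illustration of the personalization end of the continuum, here is an epsilon-greedy treatment chooser. The per-treatment estimates stand in for whatever black-box model scores exist per customer; all names here are hypothetical:

```python
import random

# Epsilon-greedy treatment choice: mostly exploit the best-scoring
# treatment, occasionally explore at random. `estimates` stands in for
# per-customer model scores; all names are hypothetical.
def choose_treatment(estimates: dict[str, float], epsilon: float,
                     rng: random.Random) -> str:
    if rng.random() < epsilon:
        return rng.choice(sorted(estimates))   # explore: uniform over treatments
    return max(estimates, key=estimates.get)   # exploit: best current estimate
```

Note how opaque this is from an insight standpoint: the choice depends on model scores, so experiments can only evaluate the policy in aggregate, exactly as described above.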

Descriptive and Predictive Models

Sometimes we want to describe things in a way that helps us understand what’s happening. For example, we may model a customer’s journey with a product and seek to understand where customers are having the most trouble. Other times we may want to predict something, for example, how much are certain customers going to be worth over their lifetimes. Sometimes we do something that does both, for example, running an experiment both seeks to describe the difference between treatments but also to help us predict the impact of choosing a particular treatment going forward.

Causality and Inference

Separate from predictive and descriptive models are questions about decision making, causality, and inference. This could mean rigorously testing hypotheses in a way that controls error rates. It could also give strong evidence for a causal relationship that was merely observed in historical data. Finally, we might wish to estimate the effectiveness of some intervention.

Stage of Project Lifecycle

Some projects have short lifecycles and others have long lifecycles. Experiments, for example, have short lifecycles. Once an experiment is complete it may well indicate a result that answers a question and that’s it. But some experiments and most other projects have longer lifecycles.

Some projects begin with exploratory data analysis (EDA). Early in development we seek to determine what kinds of problems might be amenable to a data science solution given our data. Sometimes part of EDA and sometimes part of later stages, feature engineering describes work to take data in its myriad forms and convert that into a form that data science models can consume. At some point a model can be fit/trained and evaluated using computational methods. Later, it can be deployed and included in an experiment to verify its performance. All models must be monitored over time and often require maintenance and enhancement to maintain the gains originally achieved.

Changing System Requirements

System requirements for each stage of a data science project lifecycle are different, but the requirements can also change based on how successful a project is. When a project works well there’s a tendency to expand use of the current project and to create similar spin-off projects. This can change the backend requirements to support such uses.

Consider personalization. Personalization requires a lot of data about customers. If the number of customers is not too large you might generate results for customers in nightly batches, serving these if and when the customer requires it. But if the number of customers grows large (or if we develop many such models and/or want to run the models more often) then we need a platform to serve data very quickly. Your applications may require a high performance Customer Data Platform (CDP) and move more towards real-time or rolling models instead of a batch processing model.
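To make the batch-versus-real-time distinction concrete, here is a toy sketch assuming a hypothetical score() model: the nightly batch precomputes a score for every known customer to serve from a lookup, whereas a real-time system would instead call the model per request.

```python
# A stand-in for any trained model; the weights here are arbitrary.
def score(features: dict) -> float:
    return 0.3 * features["recency"] + 0.7 * features["spend"]

# Nightly batch: precompute scores for all known customers so that
# serving is a dictionary lookup. A real-time system would call
# score() at request time instead of reading from this table.
def nightly_batch(customers: dict[str, dict]) -> dict[str, float]:
    return {cid: score(feats) for cid, feats in customers.items()}
```

The batch version is cheap to operate while the customer base is small, but the precompute-everything cost is what grows past feasibility and forces the move to a serving platform.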

Targeting new versus existing users has a similar effect on system requirements, with new customers requiring real-time systems and existing customers possibly being amenable to batch offline computation while the number of existing customers is not too big. If the number of customers increases then the requirements will change.

It is instructive to apply this framework when looking at some example projects. To that end I assembled some examples from ChatGPT that would be relevant to almost any business seeking to use data science to advance business objectives.

ChatGPT Examples

The following list of example projects was assembled from a number of different prompts asking for data science, machine learning, or statistical projects that can serve business goals.

Decision Making: Provide insights and inform decision-making by:

  1. Testing hypotheses,
  2. Identifying causality,
  3. Evaluating effectiveness of interventions (effect size estimation).

Customer Segmentation: Identifying different groups of customers based on their characteristics and behavior, and developing targeted marketing strategies for each segment.

Customer Churn Prediction: Develop models to identify customers at risk of churn, so that targeted interventions can be made to retain them. (This might play into segmentation, experimentation, and other models).

Customer Lifetime Value Prediction: Estimating the total value that a customer is likely to generate over their lifetime, so that resources can be prioritized towards acquiring and/or retaining high-value customers. (This might also play into segmentation, experimentation, and other models).

Personalization: Customizing products and marketing materials to individual customers based on their preferences and behavior. (This is difficult for new customers because of the cold start problem - we have little data on new customers. Also if the number of products is low then this is still difficult. Easiest for larger market places.)

Recommendation engines: Personalization, usually based more on history than demographics, though possibly both. Often we think of this as recommending products, but it can also mean recommending help for people seeking support.

Customer Support Optimization: Analyze customer support data to identify trends and patterns, and developing strategies to improve the efficiency and effectiveness of the support.

Marketing Campaign Optimization: Analyze data from past marketing campaigns to identify what tactics were most effective at driving customer acquisition and engagement.

Pricing Optimization: Find the right price for customers.

Product development: Identify trends and opportunities for new product development.

Customer Journey Analysis: Analyze data on how customers interact with a company’s products and services to identify areas for improvement.

Customer Sentiment Analysis: Analyze customer feedback and social media data to understand how customers feel about a company and its products.

Social media analytics: Monitor the broader web for info on your products/services.

Geospatial Data analytics: Analyzing data that has a geographical component to extract insights and inform decision making. (More of a feature engineering thing).

The previous list of projects is a good template for organizations starting to utilize data science. Some projects combine elements from multiple entries. Still, this is just the hand-wavy high-level version that ChatGPT produced. Certainly, it is not exclusive, but it might help you decide on some project ideas to investigate.

Bill of Materials

To sum it up, successful data science projects require:

  1. Good data that supports causal inference
  2. Clear goals, targets, and requirements
  3. Persistent effort across the project’s lifecycles

It’s an open question whether satisfying these requirements can be done for a cost less than that of the project itself. But any attempt that fails the above requirements is very likely to fail. Too often people sell data science projects as magic boxes that take “data” and make value, a modern snake oil. But in reality data science is a tool that has appropriate uses.