Preventing analytical bias

David Hauser, PhD

Chief Science Officer, CPG and Specialty Analytics

May 27, 2016 - Suppose a chocolate manufacturer and a car manufacturer both have identical weekly revenue and share, and identical media, marketing, and other drivers by geography. If we construct the best possible marketing mix models on these identical datasets using industry best practices, would the results of the two analyses differ, or would they be identical?

Alternative views
Since the numbers in the two datasets are identical, some believe the statistically derived relationships must be identical as well. The parameters of a regression model are a function of the data alone, so both analyses would reveal the same relationships. As such, the ROIs, contributions, response curves, interactions, and all other parameters would be the same.

Followers of this school of thought must assume the category (chocolate versus cars) is irrelevant. Why? If differences between the categories led to different models and different results, one would observe different correlations and coefficients. But these statistics are functions of the numbers themselves, which are identical in the two datasets. If this view is correct, then analytics could be done robotically, since only statistics, machine learning, and numerical artificial intelligence would be required to assess the relationships in the data.

Some disagree and believe the results should indeed differ. For instance, consumers of chocolate likely purchase it several times over two or three years of data, so any autocorrelation observed in the data may be due in part to the experience effect. Households, on the other hand, are unlikely to purchase cars several times in two or three years, so any autocorrelation there must be due to other phenomena. The ratio of planned to impulse purchases of chocolate is likely quite different from that of cars, so the lag lengths and lag structures are likely different as well. The moment consumers choose to purchase chocolate, they can usually complete the purchase; the purchase of a car is often contingent on credit and financing approval, which can alter the purchase decision. Even the nature of price sensitivity differs between the two. The more one dives into the purchase funnel, supply chain, product availability, competitive set, and so on, the more ways one finds in which the results could be expected to differ.

However, when one builds a model for chocolate or for cars and chooses a lag structure, an autocorrelation treatment, and so on, one must ask how much one actually knows about cars or chocolate, and why those choices of model structure are the most reasonable. For this school of thought, the mathematics must be done differently for chocolate than for cars. Hence, we need to articulate the kind of mathematics appropriate for each category and assess whether other subject matter experts (SMEs) familiar with the categories and with modeling would use the same mathematical approaches.
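
To make the stakes concrete, the following sketch shows how two analysts given numerically identical series, but assuming different carryover (adstock) decay rates, arrive at different media coefficients, and hence different ROIs. The data, the decay values, and the "true" process are all hypothetical, chosen only to illustrate the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly data shared by both "chocolate" and "car" datasets:
# the same media spend and sales series (illustrative, not real data).
weeks = 104
spend = rng.gamma(shape=2.0, scale=50.0, size=weeks)

def adstock(x, decay):
    """Geometric adstock: media effect carries over at the given decay rate."""
    out = np.zeros_like(x)
    carry = 0.0
    for t, v in enumerate(x):
        carry = v + decay * carry
        out[t] = carry
    return out

# Generate sales from an assumed "true" process (decay 0.5) plus noise.
sales = 1000 + 3.0 * adstock(spend, 0.5) + rng.normal(0, 50, weeks)

def fit_media_coef(decay):
    """OLS slope of sales on adstocked spend, under an assumed decay."""
    X = np.column_stack([np.ones(weeks), adstock(spend, decay)])
    beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
    return beta[1]

# A "chocolate" analyst might assume fast decay (frequent purchases);
# a "car" analyst might assume slow decay (long consideration cycles).
coef_fast = fit_media_coef(0.2)  # chocolate-style assumption
coef_slow = fit_media_coef(0.8)  # car-style assumption
print(coef_fast, coef_slow)      # identical inputs, different answers
```

Identical numbers go in, yet the assumed decay, a judgment about how chocolate or cars are actually bought, changes the estimated media effect and therefore the ROI story.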

Reality versus perception
The reality is that most consumers of analytics are not fully aware of the assumptions analysts make when constructing models; users of analytical results simply assume the analyst made valid decisions throughout the analysis. In practice, those assumptions can be faulty. Perhaps someone more familiar with chocolate or cars would have chosen a different lag structure than the analyst did. The analyst is generally an expert in mathematics, not necessarily in cars or chocolate. Much as a criminal sketch artist requires an eyewitness's account on which to base the drawing, the statistical analyst must team up with category experts and discuss the relevant category nuances so that they can be correctly replicated mathematically. Failure to do so too often leads to incorrect functional forms, flawed mathematical results, and models that are not reasonable representations of the environment being modeled.

Consider the following typical situations:

  • Most analysts and consumers of analytics believe analyses should be customized to each category and client. Why, then, is the industry-standard approach to marketing mix modeling log-log regression, regardless of category or client?
  • Why are pharmaceutical, consumer packaged goods (CPG), and financial services marketing mix models so commonly built with the identical functional form, log-log regression? Using the identical model form suggests all categories behave the same way, yet we have already seen that chocolate and cars behave differently.
  • Why do demand forecasters in nearly all verticals use the same common additive or multiplicative functional forms? Using the same functions assumes all their growth dynamics behave the same way.
  • Why are most marketing mix models created with either weekly or monthly data? That assumes all categories follow the same time-series behavior. Why not use bi-weekly data, or start the week on a different day?
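
As a minimal illustration of what that standard form assumes, here is a sketch of the log-log regression underlying most marketing mix models, fit to synthetic numbers (the series, elasticities, and noise level are all invented for illustration). Its coefficients are elasticities that the specification forces to be constant at every spend level and in every category:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weekly inputs (illustrative values, not a real dataset).
weeks = 104
media = rng.gamma(2.0, 50.0, size=weeks) + 1.0  # +1 avoids log(0)
price = rng.normal(4.0, 0.3, size=weeks)

# A constant-elasticity process: exactly the world log-log regression assumes.
sales = np.exp(6.0 + 0.10 * np.log(media) - 1.5 * np.log(price)
               + rng.normal(0, 0.05, weeks))

# The "industry standard" fit: regress log(sales) on log(drivers).
# The slopes are elasticities, assumed identical at every spend level.
X = np.column_stack([np.ones(weeks), np.log(media), np.log(price)])
beta, *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)
print(beta)  # slopes approximate the elasticities 0.10 and -1.5 above
```

The point is not that log-log is wrong, but that choosing it silently asserts constant elasticity and identical diminishing returns for every category, exactly the kind of assumption the questions above ask analysts to justify.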

The answer to these questions is that most analysts use the tools and approaches with which they are most familiar, without regard to the validity of the underlying assumptions. As a result, the analytics delivered may not be the most realistic, or even the most appropriate. Thus,

  1. The analyst must validate all underlying assumptions required in an analysis with category experts to ensure realism, and
  2. The consumer of the analytical results must ask the modeler about all underlying assumptions, and how they impact the business interpretability and business implications of the results, to assess the analytics' true reliability.