Why all analysts must know Statistics (and know it well)

David Hauser, PhD

Chief Science Officer, CPG and Specialty Analytics

May 9, 2016

Statistics as an imperative
Statistics makes our world go round. Statistics determines how much we pay for insurance, whether it will rain tomorrow, how many sales a company will have, and which emails should be flagged. It determines which products are of likely interest to internet surfers, where businesses should focus their message, what messages should be given, through which channels, and the likely consumer response to those messages. Statistics helps distinguish the probable from the improbable, correlation from causation, and the unexpected from the outlier. It enables one to find patterns that can be leveraged, white-space opportunities that are actionable, and hypotheses that are reasonable.

Statistics is key to competitive advantage, especially in markets in which companies fight aggressively for the next dollar of consumer spend. Statistics is at once an art, a science, and a key to driving insights. It enables analysts to comb through petabytes of data to find information that is significant, as opposed to computer science combing through the same data to find information that is interesting. For those who call themselves analysts, in any field, statistics is not a nice-to-have but a need-to-have.

  • Forensic analysts use statistics to find causal evidence rather than circumstantial evidence.
  • Financial analysts use statistics to balance risk probabilities associated with reward probabilities.
  • Medical research analysts use statistics to determine incremental benefits of new treatments over existing therapies.
  • Taxation analysts use statistics to determine the likelihood of tax filing errors and the probability of recovering outstanding liabilities.
  • Marketing analysts use statistics to determine the perceptions of customers as compared to the realities of the market.
  • Weather analysts, economic analysts, epidemiological analysts, sociological analysts, supply chain analysts, information technology analysts, and countless other types of analysts rely on statistics to do their jobs. Without firsthand knowledge of statistics, these analysts can perform the mechanics of their respective jobs only by relying on statistics built into their tools of the trade; with a mastery of statistics, each can instead be exceptional at their analytics job.

As analysts, we transform data and interpret results. Without a thorough understanding of statistical inference and significance, we risk drawing conclusions from observations that reflect correlation rather than causation.

Case Study: A Genpact analyst created a Marketing Mix Model for a client. The analyst believed the model was significant, relevant, and predictive based on the model statistics and goodness-of-fit graphs. The client asked the analyst to determine how to cut the marketing budget by 25 percent with the least negative impact. The analyst used the model to run simulations and found that reducing TV by 50 percent while increasing internet search activity by 75 percent would achieve this. However, TV drives search activity; decreasing TV decreases search. It was unrealistic to increase search without increasing TV, the driver of search. Despite the model results, the analyst failed to apply statistics rigorously enough to build a statistically significant model with statistically significant causal paths between the drivers. The analyst went through the motions of statistical modeling without applying the rigor of statistics, leading to unrealistic models and erroneous results.
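
A minimal sketch of how this plays out, using a hypothetical two-channel model in Python (the coefficients, budget figures, and the TV-to-search relationship are illustrative, not the client's actual model):

```python
# Hypothetical two-channel marketing mix (coefficients and budget
# figures are illustrative, not from the engagement described above).
BASE, B_TV, B_SEARCH = 100.0, 0.6, 0.8

def search_volume(search_spend, tv_spend):
    # Assumed causal path: TV generates part of search activity, so
    # search volume cannot be moved independently of TV.
    return 0.5 * search_spend + 0.4 * tv_spend

def predicted_sales(tv_spend, search_vol):
    return BASE + B_TV * tv_spend + B_SEARCH * search_vol

tv, search = 100.0, 50.0                      # current spend levels
tv_cut, search_up = tv * 0.5, search * 1.75   # simulated reallocation

baseline = predicted_sales(tv, search_volume(search, tv))

# Naive simulation: treat search volume as an independent lever.
naive = predicted_sales(tv_cut, search_volume(search, tv) * 1.75)

# Causally consistent simulation: the TV cut suppresses search volume.
consistent = predicted_sales(tv_cut, search_volume(search_up, tv_cut))

print(f"baseline sales:              {baseline:.0f}")
print(f"naive simulation:            {naive:.0f}")       # looks like a gain
print(f"causally consistent result:  {consistent:.0f}")  # actually a drop
```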

Machine learning is of great interest in all industries
Machine Learning is a continuously growing field of interest for companies in all verticals. It enables the detection of complex patterns across large datasets with varying degrees of data quality, quantity, and structure. As datasets grow in size, finding patterns and determining how variables interact represent goldmines of opportunity. However, care must be taken to distinguish causation from correlation.

The appeal of Machine Learning to those not versed in statistics is vast, given the promise of computer algorithms that find actionable insights and opportunities. In other circles, Machine Learning is referred to as Computational Statistics. One cannot fully leverage Machine Learning without a solid handle on statistics; attempting to do so risks generating outputs that are not significant, relevant, or reasonable. For instance, decision trees are core to Machine Learning, yet there are many variations of trees, tests for splitting branches, and rules and heuristics for pruning. Knowing whether the tree created with Machine Learning is the most appropriate for a dataset requires a strong understanding of the statistics behind the trees.
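
As a small illustration, the sketch below (Python with scikit-learn; the dataset and parameter values are arbitrary choices) fits the same data with different split criteria and pruning strengths. The resulting trees differ in both structure and accuracy, and only statistical validation against held-out data indicates which to trust:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same data, different split tests and pruning strengths -> different trees.
for criterion in ("gini", "entropy"):
    for ccp_alpha in (0.0, 0.01):
        tree = DecisionTreeClassifier(
            criterion=criterion,   # impurity test used to split branches
            ccp_alpha=ccp_alpha,   # cost-complexity pruning strength
            random_state=0,
        ).fit(X_tr, y_tr)
        print(f"{criterion:8s} alpha={ccp_alpha:4.2f} "
              f"leaves={tree.get_n_leaves():3d} "
              f"held-out accuracy={tree.score(X_te, y_te):.3f}")
```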

Business analysts are in roles in which a solid understanding of statistics is particularly crucial. The analytics created, reports generated, and KPIs calculated are used to aid million-dollar decisions about billion-dollar brands in trillion-dollar economies. Given that business decision-makers rely on analysts for their depth of skills, if analysts lack a solid foundation in statistics, then the reports and analytics they produce may lack verifiable tests of validity, reliable separation of signal from noise, or confident separation of real effects from spurious ones.

Analytic outputs too often rely more on an algorithm in a computer than on a statistical understanding of the appropriateness of that algorithm. Anyone can generate a decision tree once they learn the basic syntax in R. Anyone can generate a non-linear regression model once they learn Proc NLIN in SAS. Anyone can generate a multi-dimensional pivot of data once they learn pivot tables in Excel. Yet there are many types of decision trees with many kinds of tests of splits, many non-linear regression functional forms with many objective functions, and many types of pivots and transformations. Just because one can generate an outcome does not mean that outcome is optimal or even appropriate. Furthermore, being an analyst means that non-analysts will trust the results more, rather than question and dissect them. Hence it is all the more important that analysts gain strength in statistical capability so as to confidently create defensible analytics.
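
A brief sketch of that point for non-linear regression (in Python with SciPy rather than Proc NLIN, and on entirely synthetic data): two functional forms can both be fit to the same observations yet extrapolate to very different conclusions:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic response data: neither functional form below is "the"
# right answer a priori, which is exactly the point.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 5 * np.log(x) + rng.normal(0, 0.3, x.size)

def log_form(x, a, b):        # diminishing-returns functional form
    return a * np.log(x) + b

def power_form(x, a, b):      # a different plausible nonlinear form
    return a * x ** b

for name, f, p0 in (("log", log_form, (1.0, 0.0)),
                    ("power", power_form, (1.0, 0.5))):
    params, _ = curve_fit(f, x, y, p0=p0)
    rmse = np.sqrt(np.mean((f(x, *params) - y) ** 2))
    # Both forms can be fit to the observed range; the choice of form
    # matters most where there is no data to constrain it.
    print(f"{name:5s} in-sample RMSE={rmse:.2f}  "
          f"prediction at x=50: {f(50.0, *params):.1f}")
```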

Analysts across industries often build analytics out of convenience rather than statistical appropriateness. They often model whatever data is provided, without regard to how many observations are actually required given the number of variables and cross-sections. Analysts often use model structures they were taught in school, or that are built into the macros and tools they use, without regard to the theoretical, statistical, and practical appropriateness, or weaknesses, of those models.
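
The observation-count point can be made concrete with a short simulation (synthetic data, pure noise, arbitrary dimensions): with nearly as many variables as observations, a regression looks excellent in-sample while having no predictive power at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars = 30, 28            # barely more observations than variables
X = rng.normal(size=(n_obs, n_vars))
y = rng.normal(size=n_obs)        # pure noise: there is nothing to find

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def r_squared(X_eval, y_eval):
    resid = y_eval - X_eval @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

X_new = rng.normal(size=(1000, n_vars))   # fresh data from the same source
y_new = rng.normal(size=1000)

print(f"in-sample R^2:     {r_squared(X, y):.2f}")          # close to 1
print(f"out-of-sample R^2: {r_squared(X_new, y_new):.2f}")  # ~0 or negative
```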

Case Study: A Genpact analyst once created an Econometric Model for a client using the standard Log-Log model. When questioned about this choice, the analyst reported how well-established that functional form was for many applications of econometrics. Upon completion of the model, the client asked about the optimal marketing spend that would maximize the ROI. The analyst used the model and found that drastically cutting spend on all marketing channels would increase the ROI, and thus concluded the client overspent on marketing. The analyst did not appreciate the underlying statistics of the Log-Log model, which imposes the assumption that the highest return is always at the lowest levels of spend. It is a diminishing-returns-only function, which states that the very first dollar of spending is more effective than the second dollar, which in turn is more effective than the third. Without a foundation in statistics, this analyst, who was well-trained in Log-Log modeling, created a model of limited value by not knowing how to address this imposed limitation.
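
To make the limitation concrete, here is a sketch (with illustrative parameter values, not the client's model) contrasting a log-log response, whose simulated ROI is always maximized at the lowest spend, with an S-shaped response, one of several alternatives that admit an interior optimum:

```python
import numpy as np

spend = np.linspace(1, 200, 400)

# Log-log response: sales = exp(a) * spend**b with 0 < b < 1, so
# ROI = sales/spend is strictly decreasing in spend.
loglog_sales = np.exp(3.0) * spend ** 0.4
loglog_roi = loglog_sales / spend

# S-shaped (Hill-type) response: slow start, ramp, then saturation.
s_sales = 400 * spend ** 3 / (60.0 ** 3 + spend ** 3)
s_roi = s_sales / spend

print(f"log-log ROI is maximized at spend = {spend[np.argmax(loglog_roi)]:.0f}")
print(f"S-curve ROI is maximized at spend = {spend[np.argmax(s_roi)]:.0f}")
```

The log-log form will always report the maximum ROI at the smallest spend on the grid; the S-shaped form finds an interior optimum, which is why the choice of functional form, not the simulation machinery, drives the recommendation.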

Understanding hypothesis-testing
Hypothesis-testing is an important tool within statistics wherein a sample of data is used to determine whether a hypothesis is invalid. Some people misunderstand statistics and use data as a means to prove their hypotheses are valid. Statistics does not enable one to prove something right, only wrong. Failing to find evidence against something is not the same as proving that something is right. Disproving the claim that all X are Y requires simply finding one X that is not Y; failing to find any Xs that are not Ys does not prove that all Xs are indeed Ys. These nuances are important but too often lost on those not versed in statistics.
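
A small simulation (synthetic data; the effect size and sample sizes are arbitrary) makes the asymmetry concrete: with a small sample, a test will often fail to reject a null hypothesis that is in fact false, which is absence of evidence, not evidence of absence:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean = 0.3                   # the null hypothesis (mean = 0) is false

for n in (10, 1000):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    t, p = stats.ttest_1samp(sample, popmean=0.0)
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"n={n:5d}  p={p:.3f}  -> {verdict}")

# The small sample typically fails to reject H0 even though H0 is false;
# the large sample rejects it. Failing to reject never proves the null;
# it may only reflect a lack of statistical power.
```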

An important theorem in statistics, Bayes' Theorem, describes the probability of an event given prior knowledge of related conditions. Without a firm understanding of statistics, this theorem leads many people, including analysts, astray.

Example: Suppose 1 in 1,000 people are carriers of a genetic disease that can only be detected by a medical test, and suppose the test is accurate 95 percent of the time. If the test is given to someone and the result says they are a carrier, what is the probability they are indeed a carrier? Intuitively, one might conclude that since the test is 95 percent accurate, a positive result means a 95 percent probability of being a carrier. With such a high probability, various treatments, protocols, or actions might be implemented. With a solid understanding of statistics, and conditional probabilities in particular, however, one finds the actual probability of being a carrier is 1.9 percent (see the calculation below). Given such a disparity in results, an understanding of statistics cannot be overemphasized; the implications of 95 percent versus 1.9 percent could be enormous.
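
Since the original figure is not reproduced here, the arithmetic behind the 1.9 percent follows directly from Bayes' Theorem, reading "95 percent accurate" as both 95 percent sensitivity and 95 percent specificity (an assumption the example leaves implicit):

```latex
P(\text{carrier} \mid \text{positive})
  = \frac{P(\text{positive} \mid \text{carrier})\,P(\text{carrier})}
         {P(\text{positive} \mid \text{carrier})\,P(\text{carrier})
          + P(\text{positive} \mid \text{non-carrier})\,P(\text{non-carrier})}
  = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.05 \times 0.999}
  \approx 0.019
```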

Driving value to the end user
For analysts to drive the greatest value to users of their results, they need to be fluent in all things statistical. They need to know theory and applications, how to interpret results and bridge that interpretation to the end user, and how to apply statistics to the data available. They need to understand the statistics well enough to explain it to non-quantitative audiences.

Analytics is a driver of value for all clients. Analytics is built on data and on statistical analysis of that data. However, most people spend most of their energy on data quality, quantity, transformations, dashboards, and reporting, and little on the depth of the analytics, the variations of methodologies, and the theoretical and practical underpinnings of the analytics, the latter of which are built firmly on statistics. Mathematical analytics, numeric manipulation, and transformation are indeed important. But full-fledged statistical analytics is part of holistic analytics, and anything less means we are not maximizing the value of the insights embedded within the data. Using every tool except statistics means we are not doing a full analysis. With a solid understanding of statistical methods, analysts can provide far greater value to clients.