What's the haps on the craps: hacking casino data science projects

Introduction

Young analysts and even seasoned analysts from other industries often build elegant models leveraging state-of-the-art algorithms, only to find that the models do not ‘generalize’ well (i.e., predict unseen instances accurately). Casino databases are often subject to this dilemma by nature of the patron-level variability (which is often a function of luck or a heavy skew towards large players). Below are a few hacks for data preparation. This preparation should be fairly universal, but we always recommend the data scientist analyze the trends in the data as a first step in the AI journey.

Hidden Biases in Common Measures All Casino Analysts Should Know

Volatility Impact on Theo: Simply using theoretical win does not entirely negate the impact of volatility. Quick losses constrain turnover and thus theo, while winning players often increase their average bet or play time, thus inflating theo.

Actual Win/Loss: Actual win is unstable. While actual losses are relatively unbiased, wins are skewed in much the same way as theo.

Ratings: Both theoretical and actual win are biased when the patron does not rate. It is natural, especially for table games (TG) players, not to rate until seated.

Recency Weighting: Not only is recency itself a strong predictor, but more recent trips, especially when combined with frequency, deserve more weight than older ones.

Depending on her objectives, the modeler may want to adjust or re-weight these fundamental explanatory variables to accommodate these biases, or create features which capture them. Crafting novel features (feature engineering) not only improves accuracy but also allows the modeler to capture observed trends from the business. Feature engineering is outside the scope of this post.

Transformations

We have yet to find a casino database with the 80/20 split common in other industries, where the top 20% of customers drive 80% of revenues (at least on an annual basis). In many databases we review, the top 1% of players drive 20% to 30% of revenues. This skew inherent in casino databases is the root of many failed predictive modeling projects.
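A quick way to check how skewed a given database is: the sketch below computes the share of revenue that comes from the top slice of players. The function name `top_share` and the sample figures are hypothetical, chosen only to illustrate the calculation.

```python
def top_share(revenues, top_fraction=0.01):
    """Share of total revenue contributed by the top `top_fraction` of players."""
    if not revenues:
        raise ValueError("revenues must not be empty")
    ranked = sorted(revenues, reverse=True)
    # Always include at least one player so tiny samples still return a value.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical database: 99 players worth 100 each and one large player worth 3,300.
player_revenue = [100.0] * 99 + [3300.0]
share = top_share(player_revenue, top_fraction=0.01)
print(f"Top 1% share of revenue: {share:.0%}")
```

Running the same check at several cut-offs (1%, 5%, 20%) gives a fast read on how far a database sits from an 80/20 shape.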

Centering and Scaling: Even with models that are famously scale invariant, we have found accuracy improvements from centering and scaling. In general, we get the best results from ‘robust’ centering and scaling, which subtracts the median from each observed value and divides by the inter-quartile range. Even after robust centering and scaling, for many models we further apply a log transformation. Out of the box, this transformation does not work for actual win, because the log of a negative value is undefined. As such, we typically first take the absolute value, transform it, and then reapply the sign. While this scaling works for continuous predictions, it distorts elasticity estimates. As a solution, we often leverage a Box-Cox transformation.
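The robust scaling and sign-preserving log steps described above might look something like the following. This is a minimal standard-library sketch, not our production code: the helper names, the choice of `log1p` (so that zero maps to zero), and the sample values are all assumptions made for illustration.

```python
import math
import statistics


def robust_scale(values):
    """Center on the median and divide by the inter-quartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("inter-quartile range is zero; cannot scale")
    return [(v - median) / iqr for v in values]


def signed_log1p(x):
    """Log-transform the magnitude of x, then reapply the original sign."""
    return math.copysign(math.log1p(abs(x)), x)


# Hypothetical actual win per patron (negative values are patron wins).
actual_win = [-5000.0, -200.0, -50.0, 0.0, 75.0, 300.0, 12000.0]
scaled = robust_scale(actual_win)
transformed = [signed_log1p(v) for v in scaled]
print([round(v, 2) for v in transformed])
```

Because the signed log is monotonic, the ordering of patrons is preserved while the extreme values on both sides are pulled in toward the rest of the distribution.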

Recency: Recent trips nearly always hold more insightful signals than more distant trips; this is particularly true for reactivated patrons emerging from dormancy. In addition to weighting more recent trips, we recommend feature engineering to capture recent changes in behaviors.
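One simple way to weight recent trips more heavily is exponential decay. The sketch below is illustrative only: the 90-day half-life, the function names, and the trip data are assumptions, and the right decay rate will depend on the property and the trip cadence of its patrons.

```python
def recency_weight(days_ago, half_life_days=90.0):
    """Exponential decay: a trip one half-life old counts half as much as a trip today."""
    return 0.5 ** (days_ago / half_life_days)


def recency_weighted_theo(trips, half_life_days=90.0):
    """Weighted average theo per trip; `trips` is a list of (days_ago, theo) pairs."""
    weights = [recency_weight(days, half_life_days) for days, _ in trips]
    weighted_sum = sum(w * theo for w, (_, theo) in zip(weights, trips))
    return weighted_sum / sum(weights)


# Hypothetical reactivated patron whose recent play is well above older trips.
trips = [(10, 400.0), (45, 350.0), (200, 100.0), (380, 90.0)]
simple_avg = sum(theo for _, theo in trips) / len(trips)
weighted_avg = recency_weighted_theo(trips)
print(f"simple: {simple_avg:.0f}, recency-weighted: {weighted_avg:.0f}")
```

The gap between the simple and the recency-weighted average is itself a candidate feature for capturing a recent change in behavior.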

Sampling: As casino databases are so heavily skewed, the modeler should think clearly about sampling when building train, test, and validation data sets. This is a particularly important issue within deep learning models, where the testing and validation data should represent the problem. Our go-to tool here for fixing class imbalances is SMOTE (Synthetic Minority Over-sampling Technique), which uses synthetic data generation to balance out the target variable classes.
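To make the idea behind SMOTE concrete, here is a deliberately simplified sketch of its core step: new minority-class rows are created by interpolating between an existing minority row and one of its nearest minority neighbors. This is not the implementation from any library; the function name, the parameters, and the sample rows are hypothetical. Note that it is applied to training rows only, so that test and validation data keep their real class balance.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority rows by interpolating toward near neighbors.

    A simplified illustration of the SMOTE idea, not a library implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the chosen row (excluding itself).
        others = [row for row in minority if row is not base]
        neighbors = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new row at a random point on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class training rows: [annual theo, trips per year].
minority_train = [[5000.0, 12.0], [7500.0, 20.0], [6200.0, 15.0], [9000.0, 30.0]]
new_rows = smote_sketch(minority_train, n_synthetic=6)
print(len(new_rows), "synthetic rows created")
```

In practice the features would be scaled before computing distances; here the raw theo values dominate the neighbor search.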

Training

Overfitting is a common issue, particularly when condensed timeframes leave little data to train on. Avoiding overfitting is critical for deploying robust and reliable models.

Cross-validation: Cross-validation helps in detecting overfitting and ensures that the model generalizes well to unseen data. While a simple train/test split may suffice, it is not the most robust method. Our go-to method is k-fold cross-validation. Yet, while it is a great tool, it is not one-size-fits-all. In forecasting, randomly assigned folds inflate accuracy rates, because the model ends up using future data to predict the past. It is important to adapt your validation method to the context of the problem.
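For forecasting problems, one common alternative to shuffled k-fold is an expanding-window split, where every training set ends before its test set begins. The generator below is a minimal sketch of that idea; the function name and parameters are hypothetical. scikit-learn's `TimeSeriesSplit` offers a similar scheme if that library is available.

```python
def expanding_window_splits(n_periods, n_splits, test_size=1):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    first_test_start = n_periods - n_splits * test_size
    if first_test_start < 1:
        raise ValueError("not enough periods for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Twelve hypothetical monthly periods, indexed oldest (0) to newest (11).
splits = list(expanding_window_splits(n_periods=12, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]} -> test {test_idx}")
```

Each fold only ever looks backward in time, so the validation score is a fairer estimate of how the model would have performed had it been deployed at that point.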

Hyperparameter Tuning: Finding the optimal tuning of model hyperparameters allows for performance optimization, model generalization, and adaptation to the specific problem at hand. In the case of tree-based models, we will often structure our grid-search hyperparameters to emphasize shallower trees and lower learning rates. Perhaps more importantly, we find that a simpler layer architecture is key. In cases where the underlying business is complicated, we find accuracy improvements via ensemble models.
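As an illustration, a grid that leans toward shallow trees and low learning rates could be laid out as below. The parameter names follow common gradient-boosting conventions (`max_depth`, `learning_rate`, `n_estimators`), and both the names and the values are assumptions rather than recommended settings.

```python
from itertools import product

# Hypothetical search space weighted toward shallow trees and small learning rates.
param_grid = {
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
}


def expand_grid(grid):
    """Return every combination in the grid as a list of parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
print(len(candidates), "candidate settings; first:", candidates[0])
```

Lower learning rates generally need more trees to converge, which is why the tree counts in this sketch run fairly high.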

Model Output

Much like the rest of the process, the outputs provided by a model should be considered within the context of the problem and the needs of the operator.

Interpretation: The results of the model should always be evaluated with the problem in mind. Let me illustrate this with an example. In a casino marketing project where we were using machine learning models to predict high-value players, we found that our model had a relatively low overall accuracy, but that it did a really good job of capturing those who would become high-value players. Because the operator wanted to capture high-value players early on, we were okay with the low accuracy: misclassifying some other players mattered far less than capturing what was important to the operator. For this operator, that made the model great.
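The trade-off in this example can be made concrete by comparing overall accuracy with recall on the high-value class. The numbers below are invented for illustration and are not the results of the project described; reading "capturing high-value players" as recall is also our assumption here.

```python
def classification_summary(y_true, y_pred, positive=1):
    """Return accuracy, recall, and precision for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = len(pairs) - tp - fn - fp
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical: 100 players, 10 truly high-value (label 1).
# The model flags 9 of those 10, plus 40 players who are not high-value.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 9 + [0] * 1 + [1] * 40 + [0] * 50
summary = classification_summary(y_true, y_pred)
print(summary)
```

Here accuracy is mediocre while recall on high-value players is high, which is the profile an operator focused on early identification may prefer. Whether the false positives are acceptable depends on the cost of marketing to them.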

Explainable AI: New developments have yielded methods to peek inside the black boxes that are machine learning algorithms. These methods, known as explainable AI (XAI), allow us to understand which variables influence the model’s decisions (and how). We leverage XAI techniques such as SHAP and LIME to extract the underlying drivers of model predictions, allowing us to deliver actionable insights to the guys upstairs.

Conclusion

While a lot of these “hacks” can (and probably should) be applied in a wide range of disciplines and industries, the casino industry in particular stands to benefit from a new army of hackers (a.k.a. savvy data scientists). We hope you have a safe and fun AI journey 🙂