In Production ML, data quality is everything. No matter how great your models or algorithms are, if the data you feed them is garbage, you’ll get garbage results. But how can you tell if your data is good or bad? That’s what we’re going to explore in this article.
We’ll start by discussing the importance of validating data and detecting data issues in production. Specifically, we’ll focus on two families of data issues: drift (data drift and concept drift) and skew (schema skew and distribution skew). These issues can be difficult to detect, but they can have a significant impact on the accuracy and reliability of your ML models.
By the end of this article, you’ll have a solid understanding of how to conceptually detect data issues, including drift and skew, and what steps you can take to fix them. So, let’s get started on our journey to mastering data quality control in production!
Table of Contents:
- Data Issues
- Drift
  - 2.1. Data Drift
  - 2.2. Concept Drift
- Skew
  - 3.1. Schema Skew
  - 3.2. Distribution Skew
- Detecting Data Issues
  - 2.1. Detecting Skew
  - 2.2. Detecting Drift
Data is a critical element for many production processes, from manufacturing to healthcare to financial services. However, the quality and relevance of data can change over time, leading to issues that can affect the accuracy and effectiveness of these processes. Two of the most significant data issues in production are drift and skew.
Data drift occurs when the statistical properties of the data used to build a model or system change over time, leading to a degradation of performance. This can happen due to changes in the underlying population, measurement errors, or other factors. Data drift can cause models to become outdated and produce inaccurate results, leading to potential operational and financial losses.
Data skew, on the other hand, occurs when the data used to build a model or system is not representative of the real-world population it is intended to serve. Skewed data can lead to biased models and decisions, which can result in unfair treatment of individuals or groups and negatively impact business outcomes.
Both drift and skew are critical issues that must be addressed in production to ensure accurate and effective decision-making. Techniques such as continuous monitoring, data augmentation, and diversity testing can help detect and mitigate these issues. It is essential to prioritize data quality and accuracy in production to ensure that processes are optimized for success.
In a typical machine learning pipeline, you will have different sources of data that are conceptually the same: they share the same feature vector, but they change over time. As a result, model performance can either drop suddenly, due to things like system failure, or decay gradually due to changes in the data and changes in the world.
We’re going to focus on performance decay over time that arises from differences between training and serving data. There are three main causes: data drift, concept drift, and covariate shift.
- Data drift: A change in the distribution of the input data over time. This can occur when the underlying data-generating process changes, so that the input features are no longer representative of the target population.
- Concept drift: A change over time in the underlying relationship between the input features and the target variable, making the previously trained model less accurate or even useless for making predictions on new data. It can be caused by a variety of factors, including changes in the data-generating process, changes in user behavior or preferences, changes in the environment or market conditions, and other factors that affect how the inputs relate to the target.
- Covariate shift: A situation in which the distribution of the input variables (covariates) in the training data differs from their distribution in the serving or test data. This can cause problems for machine learning models, as they may not generalize well from training to serving. For example, suppose you are building a model to predict housing prices based on features such as location, size, and number of bedrooms. If the training data contains mostly small apartments in rural areas while the serving data contains mostly large houses in urban areas, the model may not perform well in production. A small code sketch of how such a shift can be flagged follows this list.
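As a minimal, illustrative sketch (not from the original article), the snippet below compares the training and serving distributions of a single numeric feature with a two-sample Kolmogorov–Smirnov test. The feature, the sample sizes, and the significance threshold are all made-up assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, serving_values, alpha=0.05):
    """Flag a distribution change in one numeric feature with a two-sample KS test."""
    statistic, p_value = ks_2samp(train_values, serving_values)
    return statistic, p_value, p_value < alpha

# Made-up data mirroring the housing example: apartment sizes at training time
# versus noticeably larger houses at serving time.
rng = np.random.default_rng(42)
train_sizes = rng.normal(loc=70, scale=15, size=5_000)
serving_sizes = rng.normal(loc=120, scale=30, size=5_000)

stat, p, drifted = detect_feature_drift(train_sizes, serving_sizes)
print(f"KS statistic={stat:.3f}, p-value={p:.2e}, drift detected={drifted}")
```

Note that a per-feature test like this only catches shifts in the marginal distribution of the inputs (covariate shift); changes in the relationship between features and labels (concept drift) have to be monitored through model performance or label feedback.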
In the example below, we’re looking at an app that, during training, classified as a spammer any user who was sending 20 or more messages per minute. But after a system update, which you can see labeled on the chart, both spammers and non-spammers start to send more messages. The data and the world have changed, and that causes unwanted misclassification: all of our users are now classified as spammers, which they probably won’t like. The toy simulation below makes this concrete.
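Here is a minimal simulation of that scenario; the message rates and sample sizes are invented purely for illustration.

```python
import numpy as np

THRESHOLD = 20  # messages per minute; the rule learned at training time

rng = np.random.default_rng(0)

# Before the system update: legitimate users average ~5 messages per minute.
non_spam_before = rng.poisson(lam=5, size=1_000)
# After the update everyone sends more messages, e.g. ~25 per minute.
non_spam_after = rng.poisson(lam=25, size=1_000)

def flagged_as_spam(messages_per_minute):
    """Fraction of users the fixed rule labels as spammers."""
    return np.mean(messages_per_minute >= THRESHOLD)

print("Non-spammers misclassified before update:", flagged_as_spam(non_spam_before))
print("Non-spammers misclassified after update: ", flagged_as_spam(non_spam_after))
```

The rule itself never changed; the world did, which is exactly why fixed thresholds and frozen models need continuous monitoring.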
Detecting data issues usually starts by computing baseline statistics over your serving data and instances, and comparing them against your training data, looking for skew and drift.
Significant changes become anomalies and trigger an alert. That alert goes to whoever is monitoring the system, either a human or another system, to analyze the change and decide on the proper course of action: the remediation, that is, how you are going to fix and react to the problem.
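One widely used tool for exactly this training-versus-serving comparison is TensorFlow Data Validation (TFDV). The sketch below is only a rough outline of that workflow, assuming your data is available as pandas DataFrames; the file paths, the `messages_per_minute` feature name, and the threshold value are hypothetical.

```python
import pandas as pd
import tensorflow_data_validation as tfdv

# Hypothetical inputs: training and serving data loaded as DataFrames.
train_df = pd.read_csv("train.csv")
serving_df = pd.read_csv("serving.csv")

# 1. Compute baseline statistics and infer a schema from the training data.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
schema = tfdv.infer_schema(train_stats)

# 2. Set a drift threshold on the feature we want to watch
#    (Jensen-Shannon divergence is used for numeric features).
feature = tfdv.get_feature(schema, "messages_per_minute")
feature.drift_comparator.jensen_shannon_divergence.threshold = 0.05

# 3. Validate serving statistics against the schema and the training baseline;
#    significant differences show up as anomalies that can trigger an alert.
anomalies = tfdv.validate_statistics(
    statistics=serving_stats,
    schema=schema,
    previous_statistics=train_stats,
)
if anomalies.anomaly_info:
    print("Anomalies detected for:", list(anomalies.anomaly_info.keys()))
```

In a production pipeline, the resulting anomalies would be routed to the monitoring system or on-call engineer described above rather than just printed.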
2.1. Detecting Skew
Let’s start with detecting schema skew. There are several ways to detect it:
- Schema Comparison: Compare the schema of the data at different points in time to identify any differences. This can be done manually by inspecting the schema or programmatically by using tools such as schema comparison software; a minimal code sketch of this approach follows this list.
- Data Profiling: Use data profiling techniques to analyze the data and identify any anomalies or inconsistencies that may indicate schema skew. Data profiling tools can help identify changes in the data structure, such as the addition or removal of columns, changes in data types, etc.
- Statistical Analysis: Use statistical analysis techniques to detect changes in data distributions over time. For example, you can compare the means and variances of different subsets of data to identify any changes in the underlying distributions.
- Data Visualization: Use data visualization tools to plot the data over time and identify any changes in the data structure. Visualizing the data can help identify trends and patterns that may indicate schema skew.
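As a bare-bones illustration of the schema comparison point above, the snippet below diffs the column names and dtypes of two pandas DataFrames; the column names and values are made up for the example.

```python
import pandas as pd

def compare_schemas(train_df: pd.DataFrame, serving_df: pd.DataFrame) -> dict:
    """Report columns that were added, removed, or changed dtype between two datasets."""
    train_types = train_df.dtypes.astype(str).to_dict()
    serving_types = serving_df.dtypes.astype(str).to_dict()
    shared = set(train_types) & set(serving_types)
    return {
        "added_columns": sorted(set(serving_types) - set(train_types)),
        "removed_columns": sorted(set(train_types) - set(serving_types)),
        "dtype_changes": {
            col: (train_types[col], serving_types[col])
            for col in shared
            if train_types[col] != serving_types[col]
        },
    }

# Hypothetical example: a column was renamed and a dtype changed in serving data.
train_df = pd.DataFrame({"user_id": [1, 2], "messages_per_minute": [3.0, 8.0]})
serving_df = pd.DataFrame({"user_id": ["a", "b"], "msg_rate": [4.0, 9.0]})
print(compare_schemas(train_df, serving_df))
```

Any non-empty entry in the result is a candidate schema-skew anomaly worth alerting on.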
2.2. Detecting Drift
Drift detection involves continuously evaluating the data coming to your server once your model is trained and deployed. Let’s take a look at more rigorous definitions of the drift and skew we’re talking about.
- Dataset shift: occurs when the joint probability of x (features) and y (labels) is not the same during training and serving. The data has shifted over time.
- Covariate shift: refers to the change in the distribution of the input variables present in training and serving data. In other words, it’s where the marginal distribution of x (features) is not the same during training and serving, but the conditional distribution remains unchanged.
- Concept shift: refers to a change in the relationship between the input and output variables, as opposed to a change in the data distribution or the inputs themselves. In other words, it’s when the conditional distribution of y (labels) given x (features) is not the same during training and serving, but the marginal distribution of x (features) remains unchanged. These three definitions are summarized in notation right after this list.
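In probabilistic notation, with x the features and y the labels, the definitions above can be written as follows (this summary is added for clarity and follows directly from the definitions):

```latex
\begin{aligned}
\text{Dataset shift:}   \quad & P_{\text{train}}(x, y) \neq P_{\text{serving}}(x, y) \\
\text{Covariate shift:} \quad & P_{\text{train}}(x) \neq P_{\text{serving}}(x),
  \qquad P_{\text{train}}(y \mid x) = P_{\text{serving}}(y \mid x) \\
\text{Concept shift:}   \quad & P_{\text{train}}(y \mid x) \neq P_{\text{serving}}(y \mid x),
  \qquad P_{\text{train}}(x) = P_{\text{serving}}(x)
\end{aligned}
```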