In the ever-evolving field of data science, the raw technical skills to wrangle and analyze data are undeniably crucial to any data project. Beyond the technical and soft skill sets, an experienced data scientist may, over the years, develop a set of conceptual tools known as mental models to help navigate the data landscape.
Mental models are not only helpful for data science; James Clear (author of Atomic Habits) has done a great job of exploring how mental models can help us think better, as well as their utility across a wide range of fields (business, science, engineering, etc.), in this article.
Just as a carpenter uses different tools for different tasks, a data scientist employs different mental models depending on the problem at hand. These models provide a structured approach to problem-solving and decision-making. They allow us to simplify complex situations, highlight relevant information, and make educated guesses about the future.
This blog presents twelve mental models that may help 10X your productivity in data science. Specifically, we illustrate how each model can be applied in the context of data science, followed by a short explanation. Whether you’re a seasoned data scientist or a newcomer to the field, understanding these models can be helpful in your practice of data science.
The first step in any data analysis is ensuring that the data you’re using is of high quality, as any conclusions you draw will be based on it. Put another way, even the most sophisticated analysis cannot compensate for poor-quality data. In a nutshell, this is the garbage in, garbage out principle: the quality of the output is determined by the quality of the input. In the context of working with data, wrangling and pre-processing a dataset is consequently what raises its quality.
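As a rough illustration, here is a minimal pandas sketch (the column names and quality issues are hypothetical) of the kind of wrangling that raises input quality:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with common quality issues
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 47, 32, 200],                 # missing value and an implausible outlier
    "income": [50000, 64000, 58000, None, 64000, 72000],
    "city": ["NY", "ny", "Boston", "Boston", "ny", "NY"],
})

clean = (
    raw
    .drop_duplicates()                                     # remove exact duplicate rows
    .assign(city=lambda d: d["city"].str.upper())          # standardize categorical labels
    .query("age <= 120")                                   # drop implausible or missing ages
    .assign(income=lambda d: d["income"].fillna(d["income"].median()))  # impute missing income
)
print(clean)
```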
After ensuring the quality of your data, the next step is often to collect more of it. The Law of Large Numbers explains why having more data generally leads to more accurate models. This principle states that as a sample size grows, its mean gets closer to the mean of the whole population. This is fundamental in data science because it underlies the logic of gathering more data to improve a model’s generalization and accuracy.
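A quick simulation illustrates the idea, assuming a hypothetical binary outcome with a true rate of 0.3:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.3   # assumed population mean of a hypothetical binary outcome

# As the sample size grows, the sample mean drifts toward the population mean
for n in [10, 100, 1_000, 10_000, 100_000]:
    sample = rng.binomial(1, true_rate, size=n)
    print(f"n={n:>7,}: sample mean = {sample.mean():.4f}")
```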
Once you have your data, you have to be careful about how you interpret it. Confirmation Bias is a reminder to avoid looking only for data that supports your hypotheses and to consider all the evidence. Specifically, confirmation bias refers to the tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses. In data science, it’s crucial to be aware of this bias and to seek out disconfirming evidence as well as confirming evidence.
P-hacking is another important concept to keep in mind during the data analysis phase. It refers to the misuse of data analysis to selectively find patterns that can be presented as statistically significant, leading to incorrect conclusions. In practice, rare statistically significant results (found either purposely or by chance) may be selectively reported while the rest are hidden. Being aware of this is important for ensuring robust and honest data analysis.
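The sketch below (a simulated setup, not a real study) shows how running many tests on pure noise still produces a handful of “significant” results; reporting only those hits would be p-hacking:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 "experiments" where the null hypothesis is true by construction:
# both groups come from the same distribution, so any "significant"
# difference is a false positive.
false_positives = 0
for _ in range(100):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' results out of 100 null experiments: {false_positives}")
# Reporting only these hits, while hiding the rest, is p-hacking.
```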
Simpson’s Paradox is a reminder that when you’re looking at data, it’s important to consider how different groups might be affecting your results. It serves as a warning about the dangers of omitting context and ignoring potential confounding variables. The statistical phenomenon occurs when a trend appears in several groups of data but disappears or reverses when those groups are combined. The paradox can be resolved when causal relationships are appropriately addressed.
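A small, made-up example: within each severity group, treatment A has the higher recovery rate, yet B looks better once the groups are pooled, because A was given mostly to severe cases:

```python
import pandas as pd

# Hypothetical trial counts: treatment A is given mostly to severe cases
df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "severity":  ["mild", "severe", "mild", "severe"],
    "patients":  [10, 90, 90, 10],
    "recovered": [9, 45, 72, 4],
})

# Within each severity group, A has the higher recovery rate...
per_group = df.assign(rate=df["recovered"] / df["patients"])
print(per_group[["treatment", "severity", "rate"]])

# ...but aggregating over groups reverses the conclusion (B looks better)
overall = df.groupby("treatment")[["patients", "recovered"]].sum()
overall["rate"] = overall["recovered"] / overall["patients"]
print(overall)
```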
Once the data is understood and the problem is framed, the Pareto Principle can help prioritize which features to focus on in your model, as it suggests that a small number of causes often account for a large proportion of the results.
This principle suggests that for many outcomes, roughly 80% of consequences come from 20% of causes. In data science, this might mean that a large portion of the predictive power of a model comes from a small subset of the features.
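As a rough sketch on synthetic data (generated so that only a few of the 20 features are informative), a feature-importance check often shows a small subset of features carrying most of the predictive weight:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 20 features, but only 3 are actually informative
X, y = make_classification(n_samples=2_000, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = np.sort(model.feature_importances_)[::-1]
top_share = importances[:4].sum()   # share of importance held by the top 20% of features
print(f"Top 4 of 20 features account for {top_share:.0%} of total importance")
```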
Occam’s Razor holds that the simplest explanation is usually the best one. When you start building models, it suggests favoring simpler models when they perform as well as more complex ones. Thus, it’s a reminder not to overcomplicate your models unnecessarily.
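Here is a minimal sketch of that comparison, using scikit-learn’s built-in breast cancer dataset as a stand-in: if the simpler model’s cross-validated score is comparable, Occam’s Razor says prefer it:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
complex_model = GradientBoostingClassifier(random_state=0)

for name, model in [("logistic regression", simple), ("gradient boosting", complex_model)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")

# If the scores are comparable, Occam's Razor favors the simpler model.
```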
This mental model describes the balance that must be struck between bias and variance, the two sources of error in a model. Bias is error introduced by oversimplifying a complex problem, which leads to underfitting. Variance is error resulting from the model’s overemphasis on the specifics of the training data, which leads to overfitting. The right level of model complexity minimizes the total error (a combination of bias and variance), and finding it involves a tradeoff: reducing bias tends to increase variance, and vice versa.
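A classic way to see this is to sweep model complexity (here, polynomial degree on synthetic data) and compare training versus test error; low degrees tend to underfit while very high degrees tend to overfit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 200)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Low degree -> high bias (underfits); high degree -> high variance (overfits)
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:>2}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```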
Overfitting and underfitting tie closely to the Bias-Variance Tradeoff and further guide the tuning of your model’s complexity and its ability to generalize to new data.
Overfitting occurs when a model is excessively complex and learns the training data too well, thereby reducing its effectiveness on new, unseen data. Underfitting happens when a model is too simple to capture the underlying structure of the data, causing poor performance on both training and unseen data.
Thus, a good machine learning model is achieved by finding the balance between overfitting and underfitting, for instance through techniques such as cross-validation, regularization, and pruning.
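For instance, here is a minimal sketch combining two of those techniques: cross-validation is used to pick the strength of ridge regularization on synthetic, overfitting-prone data (the alpha values are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Noisy data with many features relative to samples (prone to overfitting)
X, y = make_regression(n_samples=100, n_features=60, noise=15.0, random_state=0)

# Cross-validate several regularization strengths and keep the best one
for alpha in [0.01, 1.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2").mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")
# Larger alpha shrinks coefficients (combats overfitting); too large underfits.
```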
A long tail can be seen in distributions such as the Pareto distribution or the power law, which combine a high frequency of low-value events with a low frequency of high-value events. Understanding these distributions can be crucial when working with real-world data, as many natural phenomena follow them.
For example, in social media engagement, a small number of posts receive the majority of likes, shares, or comments, but there’s a long tail of posts that get far less engagement. Collectively, this long tail can represent a significant portion of overall social media activity. This draws attention to the significance and potential of less popular or rare events, which might otherwise be overlooked if one only focuses on the “head” of the distribution.
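A quick simulation of hypothetical engagement counts drawn from a Pareto distribution shows both effects: a tiny head captures a large share, yet the long tail still accounts for a substantial portion:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical engagement counts drawn from a heavy-tailed Pareto distribution
engagement = rng.pareto(a=1.5, size=100_000) + 1

head = np.sort(engagement)[::-1][:1_000].sum()   # top 1% of posts
head_share = head / engagement.sum()
print(f"Top 1% of posts: {head_share:.0%} of all engagement")
print(f"The remaining 99% (the long tail) still hold {1 - head_share:.0%}")
```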
Bayesian thinking refers to a dynamic and iterative process of updating our beliefs based on new evidence. Initially, we have a belief, or “prior,” which gets updated with new data to form a revised belief, or “posterior.” This process continues as more evidence is gathered, further refining our beliefs over time. In data science, Bayesian thinking allows for learning from data and making predictions, often with a measure of uncertainty around those predictions. This adaptive belief system, open to new information, can be applied not just in data science but also in our everyday decision-making.
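Here is a minimal sketch of one round of updating, using a conjugate Beta-Binomial model for a hypothetical conversion rate (the prior and the observed counts are made up):

```python
from scipy import stats

# Prior belief about a conversion rate: Beta(2, 2), centered on 0.5 but uncertain
alpha_prior, beta_prior = 2, 2

# New evidence: 30 trials, 9 conversions
successes, trials = 9, 30

# Conjugate update: posterior is Beta(alpha + successes, beta + failures)
alpha_post = alpha_prior + successes
beta_post = beta_prior + (trials - successes)

posterior = stats.beta(alpha_post, beta_post)
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

As more data arrives, the posterior from one round becomes the prior for the next.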
The No Free Lunch theorem asserts that no single machine learning algorithm excels at every problem. As a result, it is important to understand the unique characteristics of each data problem, since there isn’t a universally superior algorithm. Consequently, data scientists experiment with a variety of models and algorithms to find the most effective solution, considering factors such as the complexity of the data, the available computational resources, and the specific task at hand. The theorem can be thought of as a toolbox, where each tool represents a different algorithm, and the expertise lies in selecting the right tool (algorithm) for the right task (problem).
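A small sketch of that experimentation loop: cross-validating a few standard algorithms on two built-in scikit-learn datasets, where the best performer need not be the same on both:

```python
from sklearn.datasets import load_digits, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# The best-performing algorithm can differ from one dataset to another
for data_name, loader in [("wine", load_wine), ("digits", load_digits)]:
    X, y = loader(return_X_y=True)
    print(data_name)
    for model_name, model in models.items():
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"  {model_name}: mean CV accuracy = {score:.3f}")
```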
These models provide a robust framework for each of the steps of a typical data science project, from data collection and preprocessing to model building, refinement, and updating. They help navigate the complex landscape of data-driven decision-making, enabling us to avoid common pitfalls, prioritize effectively and make informed choices.
However, it’s essential to remember that no single mental model holds all the answers. Each model is a tool, and like all tools, they are most effective when used appropriately. In particular, the dynamic and iterative nature of data science means that these models are not simply applied in a linear fashion. As new data becomes available or as our understanding of a problem evolves, we may loop back to earlier steps to apply different models and adjust our strategies accordingly.
In the end, the goal of using these mental models in data science is to extract valuable insights from data, create meaningful models and make better decisions. By doing so, we can unlock the full potential of data science and use it to drive innovation, solve complex problems, and create a positive impact in various fields (e.g. bioinformatics, drug discovery, healthcare, finance, etc.).