An Overview of Predictive Modeling and Analytics — Pt. 1

From University of Colorado Boulder — Coursera

Esther Anagu
5 min read · May 19, 2024

In the ever-evolving world of data science, predictive modeling and analytics stand out as critical tools for deriving insights and making informed decisions. Recently, I had the opportunity to explore these concepts through a Coursera course on Predictive Modeling and Analytics. In this summary, I’ll share the core topics covered and the valuable skills I gained.

Key Learnings

Exploratory Data Analysis

One of the major areas I focused on was applying exploratory data analysis (EDA) to gain insights and prepare data for predictive modeling. EDA involves summarizing and visualizing datasets using appropriate tools.
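To make this concrete, here is a minimal EDA sketch in Python using pandas; the file name sales.csv is just a placeholder for whatever dataset you are working with, since the course itself does not prescribe a specific tool.

```python
# Minimal EDA sketch with pandas; "sales.csv" is a placeholder dataset name.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")

print(df.shape)         # number of rows and columns
print(df.dtypes)        # which variables are numerical vs. categorical
print(df.describe())    # summary statistics for numerical columns
print(df.isna().sum())  # missing values per column

df.hist(figsize=(10, 6))  # distributions of the numerical variables
plt.show()
```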


Understanding Predictive Modeling

Predictive modeling is the process of building statistical or machine learning models to make predictions based on data. What differentiates data scientists from data analysts is the ability to use advanced techniques to extract trends and patterns from historical data to make predictions about the future.

Have you ever performed EDA on a dataset? What tools did you use, and what insights did you discover?

Techniques Used

Predictive modeling primarily uses two major techniques: regression and classification. Before applying either technique, it is essential to clean the data, since real-life data is often messy, and models built on messy data can be misleading.
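To make the distinction concrete, here is a small sketch of my own using scikit-learn (not from the course): regression predicts a continuous value, while classification predicts a discrete label. The data is randomly generated for illustration.

```python
# Regression vs. classification with scikit-learn on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y_price = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)  # continuous target
y_churn = (y_price > y_price.mean()).astype(int)                            # binary target

reg = LinearRegression().fit(X, y_price)    # regression: predict a number
clf = LogisticRegression().fit(X, y_churn)  # classification: predict a class

print("Predicted value:", reg.predict(X[:1])[0])
print("Predicted class:", clf.predict(X[:1])[0])
```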

Think About It:

What challenges have you faced when working with messy data, and how did you overcome them?

Common Data Issues

Data issues that may arise include bad formatting, missing values, duplicates, empty rows, inconsistent abbreviations, differences in scale, inconsistent descriptions, skewed distributions, and outliers. Addressing these issues is crucial to avoid misleading outputs, and exploring the data helps you gain a better understanding of it.
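A quick pandas audit can surface several of these issues at once. This is my own sketch, assuming a DataFrame named df has already been loaded:

```python
# Audit a DataFrame for common data issues; df is assumed to be loaded already.
import pandas as pd

def audit(df: pd.DataFrame) -> None:
    print("Missing values per column:\n", df.isna().sum())
    print("Duplicate rows:", df.duplicated().sum())
    print("Completely empty rows:", df.isna().all(axis=1).sum())
    # Inconsistent abbreviations often show up as near-duplicate categories
    for col in df.select_dtypes(include="object"):
        raw, cleaned = df[col].nunique(), df[col].str.strip().str.lower().nunique()
        if raw != cleaned:
            print(f"Column '{col}' may have inconsistent labels ({raw} raw vs {cleaned} cleaned)")
    # Large absolute skewness suggests a skewed distribution
    print("Skewness:\n", df.select_dtypes(include="number").skew())
```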

Exploring and Cleaning Data

Exploring data involves examining the dataset and checking for categorical and numerical variables. Key questions to ask about numerical data include:

  • Are there outliers?
  • Are there missing values?
  • Is the data distribution skewed or normal?

Two popular open-source tools for data cleanup are OpenRefine and Data Wrangler. These tools assist with many common data exploration tasks.
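If you prefer to stay in Python rather than a dedicated cleanup tool, the three questions above can be answered directly with pandas. In this sketch, df is an already-loaded DataFrame and "revenue" is an assumed column name:

```python
# Checking one numerical column for outliers, missing values, and skew;
# df and the column name "revenue" are assumed examples.
col = df["revenue"]

z = (col - col.mean()) / col.std()
print("Potential outliers (>3 std from mean):", int((z.abs() > 3).sum()))

print("Missing values:", int(col.isna().sum()))

print("Skewness:", col.skew())  # near 0 suggests a roughly symmetric distribution
```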

Data Transformation

Data transformation involves applying a mathematical function to each data value. One of the most common transformations is centering and scaling, which involves subtracting the mean from each data value and then dividing by the standard deviation. This process makes numerical procedures easier to work with and more stable by ensuring all variables are on a common scale.
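For example, the center-and-scale (z-score) transformation can be done by hand or with scikit-learn's StandardScaler; the tiny array below is made up for illustration.

```python
# Centering and scaling by hand and with StandardScaler give the same result.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_manual = (X - X.mean(axis=0)) / X.std(axis=0)  # subtract mean, divide by std
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))  # True
```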

Why do you think centering and scaling are important for certain modeling tools like clustering or PCA?

This is because centering and scaling ensure that multiple variables in a dataset are on a common scale. They are often required or recommended for modeling tools such as clustering, principal component analysis, and neural networks.

The main drawback is that the data becomes harder to interpret: after centering and scaling, each value measures how many standard deviations the original data point lies from the mean, rather than being expressed in its original units.

There are many other data transformations. Some can be expressed using common mathematical functions such as the logarithm, square, square root, and inverse. Except for the logarithm, all of the transformations mentioned here are polynomial transformations, because they involve polynomial terms of the original data value. Different transformations are appropriate for different problem contexts, and sometimes we have to experiment to figure out the right one to use.
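Here is what those transformations look like in NumPy; the values in x are made up, and the logarithm and inverse assume positive, non-zero data.

```python
# Common single-variable transformations; x is an assumed positive-valued column.
import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0, 100.0])

log_x    = np.log(x)   # compresses large values; useful for right-skewed data
square_x = x ** 2      # x^2
sqrt_x   = np.sqrt(x)  # x^(1/2)
inv_x    = 1.0 / x     # x^(-1)
```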

Data Reduction Techniques

Data reduction techniques generate a smaller set of variables that capture most of the information in the original dataset. Principal component analysis (PCA) is a widely used method that finds weighted averages of the variables to capture most of the variance in the data. These weighted averages, known as principal components, are uncorrelated and ideally capture most of the data’s variance with just a few components.

An important note here is that we need to first scale the variables before applying principal component analysis. This is so that the principal components are not dominated by variables that are much larger in scale.
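A scikit-learn sketch of that workflow could look like the following; the data here is random, with deliberately mismatched scales.

```python
# Scale first, then apply PCA, so large-scale variables do not dominate.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * [1, 10, 100, 1_000, 10_000]  # very different scales

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

print("Variance captured by 2 components:", pca.explained_variance_ratio_.sum())
X_reduced = pca.transform(X_scaled)  # the smaller set of variables
```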


Dealing with Missing Values and Outliers

Missing Values

Missing values can be dealt with in several ways:

  1. Removal: The first approach is to simply remove the data. However, this may not always be feasible, as it could result in discarding too much valuable data.
  2. Imputation: Another approach is to impute or guess a value for the missing data. This could involve filling in the missing value with zero, the average sales, or using a smart guess obtained through interpolation. In general, observations from similar data points can be used to intelligently guess the missing value.
  3. Creating a Separate Category: For categorical data, it may be appropriate to treat missing values as their own category.
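These three approaches map directly onto pandas operations. This is a sketch on a made-up DataFrame with assumed column names sales and region:

```python
# Removal, imputation, and a separate "missing" category with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sales":  [100.0, np.nan, 250.0, 300.0],
    "region": ["North", "South", None, "East"],
})

dropped = df.dropna()                                 # 1. remove rows with missing values

df["sales"] = df["sales"].fillna(df["sales"].mean())  # 2. impute with the mean
# df["sales"] = df["sales"].interpolate()             #    or interpolate from neighbors

df["region"] = df["region"].fillna("Unknown")         # 3. treat missing as its own category
```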

Outliers

Outliers are data points that deviate significantly from the rest of the data; in other words, observations that lie far away from the others. They can skew statistical analyses and affect model performance. When dealing with outliers, it's essential to consider their impact on the analysis and whether they represent genuine anomalies or errors in the data collection process.
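One common, simple way to flag them is the 1.5 × IQR rule, sketched below; df and the column name sales are assumed examples, not part of the course material.

```python
# Flag values outside 1.5 * IQR from the quartiles as potential outliers.
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["sales"] < lower) | (df["sales"] > upper)]
print(f"{len(outliers)} potential outliers outside [{lower:.1f}, {upper:.1f}]")
```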

Think About It: How do you identify outliers in your datasets, and what criteria do you use to determine whether to keep or remove them?

Data Visualization Tools

Effective data visualization is essential for conveying insights from data. Here are some best practices and tools for creating impactful visualizations:

  • Avoid Pie Charts: While pie charts are widely used, they can be challenging to interpret and compare, especially when there are many categories. Instead, consider using bar graphs, which make it easier to compare values.
  • Use Appropriate Chart Types: For time series data, avoid using points and instead use bars to emphasize individual values and lines to show trends over time.
  • Avoid 3D and Gimmicks: Minimize the use of 3D effects and other gimmicks in your graphs, as they can distract from the data. Stick to clean and straightforward visuals that effectively communicate your message.
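As a small matplotlib illustration of the first two points (the numbers are made up): a bar chart for comparing categories, and a line for a time series trend.

```python
# Bar chart instead of a pie chart, and a line for a time series trend.
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [40, 25, 20, 15]
months = list(range(1, 13))
monthly_sales = [10 + m for m in months]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(categories, values)        # easier to compare than pie slices
ax1.set_title("Share by category")
ax2.plot(months, monthly_sales)    # a line makes the trend over time clear
ax2.set_title("Monthly sales")
plt.show()
```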

Recommended Resources: For further guidance on effective data visualization, I recommend exploring resources such as “The Visual Display of Quantitative Information” by Edward Tufte and “Show Me the Numbers” by Stephen Few. Additionally, consider exploring specialized data visualization software tools such as Tableau and Spotfire, which offer powerful features for creating insightful visualizations.

If you found this summary helpful or have any questions about the course, please leave a comment below. I’d love to hear your thoughts and experiences!

Also, if you’re interested in getting notified about future articles, you can subscribe to my newsletter! (But no pressure at all.)


Esther Anagu

A Passionate Data Scientist & Analyst. Dedicated to empowering the data community by sharing invaluable resources. Embark on this data-driven adventure with me!