My Data Science Journey with WorldQuant University — Pt. 1

Esther Anagu
4 min read · Apr 26, 2024

--

A screenshot of WorldQuant University's landing page

Getting into data science wasn't a walk in the park, especially for someone like me with an accounting background. Though I knew a bit of math, the math behind data science was a whole different ball game. Nevertheless, my journey with WorldQuant University has been illuminating. It has paved the way for me to navigate this 'complex' field.

In the initial lessons, I got hands-on experience building machine learning models to predict prices. It was pretty exciting! One of the assignments had me diving into datasets to help a client build a model that could predict apartment prices in Mexico City.

In the world of data science, there’s a structured framework that serves as a roadmap for building models:

  1. Data Collection
  2. Data Preparation
  3. Model Building
  4. Results Communication

Data Collection

Data resides in various repositories, often within databases or in formats like Excel sheets or CSV files. Importing this data into the Jupyter Notebook, where modeling takes place, is the initial step.

# First, import the necessary libraries. One of them is pandas, which is used to read the data file(s).
import pandas as pd

# Next, read the data. If it's a CSV file:
data = pd.read_csv("data.csv")

# If it's an Excel file:
data = pd.read_excel("data.xlsx")

Data Preparation

Now, this is where the real work begins! Trust me, this phase is where you’ll spend a good 80% of your time. Why? Because it’s crucial to ensure the accuracy of your results.

First things first, you need to explore your data. You'll use methods like describe() and info(), and the shape attribute, to get a feel for the layout of your dataset. Visualizing your data with histograms, bar charts, and scatter plots is crucial too; it helps you spot outliers, which can hurt the model's performance, as in the quick sketch below.
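
To make that concrete, here is a rough sketch of those exploration calls on a tiny made-up apartment dataset. The column names are placeholders for illustration, not the actual Mexico City data.

import pandas as pd
import matplotlib.pyplot as plt

# A tiny made-up apartment dataset, just to illustrate the calls
df = pd.DataFrame({
    "surface_covered_in_m2": [45, 60, 72, 110, 55],
    "price_aprox_usd": [64000, 85000, 120000, 210000, 78000],
})

print(df.shape)       # (rows, columns)
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric columns

# Visual checks: histograms reveal skew, scatter plots reveal outliers
df["price_aprox_usd"].hist()
plt.xlabel("Price (USD)")
plt.show()

df.plot.scatter(x="surface_covered_in_m2", y="price_aprox_usd")
plt.show()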

Now, here’s the interesting part: the data you’re working with might be a bit messy. That’s where data cleaning comes in, and boy, do we have our work cut out for us. Luckily, my journey with WorldQuant University has armed me with a whole arsenal of techniques to tackle this:

  1. Drop columns that are more than 50% null values.
  2. Drop columns containing low- or high-cardinality categorical values. Some columns might have too few or too many unique values, making them more trouble than they’re worth.
  3. Drop any columns that would constitute leakage for the target variable. Leakage features are features that would give our model information that it won’t have access to when it’s deployed.
  4. Drop any columns that would create issues of multicollinearity. We don’t need our model getting confused by redundant information, so we’ll give it a helping hand by dropping highly correlated columns.
  5. Handle, rather than drop, columns whose missing values are still valuable to the model. We can fill in those missing values using information from the rest of the column, with a transformer called "SimpleImputer."
  6. Encode the useful categorical columns with OneHotEncoder.
import seaborn as sns
from sklearn.impute import SimpleImputer
from category_encoders import OneHotEncoder

# To check for columns with over 50% null values
data.isnull().sum() / len(data)

# To check for low- or high-cardinality categorical columns
data.select_dtypes("object").nunique()

# To check for multicollinearity, look at pairwise correlations
corr = data.select_dtypes("number").corr()
# Visualize the correlation matrix with a heatmap
sns.heatmap(corr)

# To deal with missing values, instantiate a SimpleImputer transformer first
imputer = SimpleImputer()

# To encode the useful categorical columns, instantiate a OneHotEncoder
# (this version, with use_cat_names, comes from the category_encoders package)
encoder = OneHotEncoder(use_cat_names=True)
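
To show what acting on those checks might look like, here is a minimal sketch with a made-up dataset. The column names (floor, price_usd_per_m2, operation) are hypothetical, chosen only to illustrate the null, leakage, and cardinality rules above.

import pandas as pd

# Hypothetical messy dataset; the column names are made up for illustration
data = pd.DataFrame({
    "price_aprox_usd": [85000.0, 120000.0, 64000.0, 210000.0],
    "surface_covered_in_m2": [60, 72, 45, 110],
    "floor": [float("nan"), float("nan"), 3.0, float("nan")],  # mostly null
    "price_usd_per_m2": [1417, 1667, 1422, 1909],   # derived from the target (leakage)
    "operation": ["sell", "sell", "sell", "sell"],  # a single value (low cardinality)
})

# Columns flagged by the checks above
mostly_null = data.columns[data.isnull().sum() / len(data) > 0.5].tolist()
low_cardinality = [c for c in data.select_dtypes("object") if data[c].nunique() <= 1]
leaky = ["price_usd_per_m2"]  # anything computed from the target

data = data.drop(columns=mostly_null + low_cardinality + leaky)
print(data.columns.tolist())  # ['price_aprox_usd', 'surface_covered_in_m2']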

Model Building

Now, let’s talk about building the model itself. First things first, we need to separate the target variable from the features. The target variable is what we want to predict, while the feature variables help us make those predictions.
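
For example, here is a minimal sketch of that split, assuming the cleaned data lives in a DataFrame and the target column is called price_aprox_usd. Both names are placeholders for illustration.

import pandas as pd

# Hypothetical cleaned dataset; in practice this is the DataFrame you
# finished preparing in the previous step
df = pd.DataFrame({
    "surface_covered_in_m2": [45, 60, 72, 110],
    "rooms": [1, 2, 3, 4],
    "price_aprox_usd": [64000, 85000, 120000, 210000],
})

target = "price_aprox_usd"
X_train = df.drop(columns=target)  # features: everything except the target
y_train = df[target]               # target: the value we want to predict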

So, what’s next? Here’s the breakdown:

  • Instantiate the model: Depending on the problem type — whether it’s classification, regression, etc. — there are various algorithms to choose from, like Linear Regression or Ridge.
  • Train the model: This step involves feeding our model with the training data so it can learn and improve its predictive capabilities.
  • Make predictions: Once trained, our model is ready to make predictions based on new or unseen data.
  • Evaluate its performance: We need to assess how well our model is performing. Are the predictions accurate? Are there any areas for improvement?
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Instantiate the model (e.g., Linear Regression)
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Predict with the trained model
y_pred_training = model.predict(X_train)

# Evaluate model performance with mean absolute error
mean_absolute_error(y_train, y_pred_training)

Results Communication

Once the model is built and its performance evaluated, it’s time to share the findings with end-users.
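
One simple way to do that, for example, is to plot the model's coefficients so a non-technical audience can see which features push the predicted price up or down. The tiny dataset and column names below are made up purely for illustration.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# A tiny made-up training set, just so there is a fitted model to report on
X_train = pd.DataFrame({
    "surface_covered_in_m2": [45, 60, 72, 110],
    "rooms": [1, 2, 3, 4],
})
y_train = pd.Series([64000, 85000, 120000, 210000], name="price_aprox_usd")

model = LinearRegression()
model.fit(X_train, y_train)

# A horizontal bar chart of the coefficients shows which features
# contribute most to the predicted price
coefficients = pd.Series(model.coef_, index=X_train.columns).sort_values()
coefficients.plot(kind="barh")
plt.xlabel("Change in predicted price (USD) per unit increase")
plt.title("What drives apartment prices?")
plt.show()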

Stay tuned for more insights into each stage of my data science journey with WorldQuant University.


Esther Anagu

A Passionate Data Scientist & Analyst. Dedicated to empowering the data community by sharing invaluable resources. Embark on this data-driven adventure with me!