Data Science Journey with WorldQuant University — Pt. 3

Jul 25, 2024

In this third part of my data science journey with WorldQuant University, I explored more complex concepts and techniques that are crucial for any aspiring data scientist.

Understanding Data Structures

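Data stored in a NoSQL database is typically called semi-structured: records can carry different fields and nested values. Data in an SQL database is structured: every row in a table follows a fixed schema of columns and types. NoSQL databases therefore offer more flexibility with varied data formats, while SQL databases trade some of that flexibility for consistency.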

A Data Dictionary is an essential tool that provides descriptions of data elements in the database and what they represent, ensuring clarity and consistency in data handling.

Logistic Regression for Binary Classification

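One of the key topics in this part of the course was Logistic Regression, which we used for binary classification problems. Although it predicts classes rather than continuous values, Logistic Regression belongs to the same family of linear models as Linear Regression: it computes a weighted sum of the features and then passes that sum through the sigmoid function to get a probability.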
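
To make the family resemblance concrete, here is a minimal sketch in plain Python. The weights, bias, and feature values are made up for illustration; a real model would learn them from data.

```python
import math

def linear_score(features, weights, bias):
    """Weighted sum of the features: the quantity a linear regression would predict."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the linear score into a probability, then into a 0/1 label."""
    probability = sigmoid(linear_score(features, weights, bias))
    return int(probability >= threshold), probability

# Hypothetical weights and bias, chosen by hand for illustration only.
weights, bias = [0.8, -1.2], 0.1
label, probability = predict_class([2.0, 0.5], weights, bias)
print(label, round(probability, 3))
```

The only difference from Linear Regression in this sketch is the sigmoid step and the threshold that converts the probability into a class.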

Key Concepts:

Feature Selection: When two features are highly correlated with each other, we usually drop the one that is less correlated with the target variable, since it adds less predictive power.
Leakage Variables: These are features that contain information that would not be available at prediction time. They effectively hand the answer to the model and lead to unrealistically good performance metrics.
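
A small sketch of the feature-selection idea, assuming pandas. The column names and numbers are invented for illustration and do not come from the course data.

```python
import pandas as pd

# Toy data, invented for illustration: two of the features measure the same thing.
df = pd.DataFrame({
    "area_sqft": [500, 750, 1000, 1250, 1500, 1750],
    "area_sqm": [46, 70, 93, 116, 139, 163],
    "rooms": [1, 3, 2, 4, 3, 5],
    "price": [100, 160, 190, 260, 290, 350],
})
target = "price"

# Correlation between the features themselves (target excluded).
feature_corr = df.drop(columns=target).corr()
pair_corr = feature_corr.loc["area_sqft", "area_sqm"]

# Absolute correlation of each feature with the target.
target_corr = df.corr()[target].drop(target).abs()

# Of the two highly correlated features, pick the one less tied to the target.
weaker = target_corr[["area_sqft", "area_sqm"]].idxmin()
reduced = df.drop(columns=weaker)
print(f"pair correlation: {pair_corr:.3f}, dropping: {weaker}")
```

A leakage variable would be handled the same way mechanically (dropping the column), but the decision comes from knowing how the data was collected, not from a correlation table.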

Performance Metrics

Different performance metrics are used to evaluate various models:

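Linear Regression: Mean Absolute Error (MAE), the average absolute difference between predictions and actual values. Lower is better.
Logistic Regression: Accuracy Score, the share of predictions that are correct, which ranges from 0 to 1. An accuracy score closer to 1 indicates a better-performing model.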
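
Both metrics are simple enough to write by hand. The sketch below uses plain Python with made-up values; scikit-learn offers equivalents as `mean_absolute_error` and `accuracy_score` in `sklearn.metrics`.

```python
def mae(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up regression targets and predictions.
regression_error = mae([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])

# Made-up class labels and predictions: three of four are correct.
classification_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 0.75
```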

Data Splitting

Data is split into three sets:

Training Set: For training the model
Validation Set: For tuning the model
Test Set: For testing the model

Hyperparameters are tuned against the validation set, so the test set stays untouched until the final evaluation.
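
One way to get the three sets is to call scikit-learn's `train_test_split` twice. The 60/20/20 proportions and the toy data below are my own assumptions, not values from the course.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with one feature each, and alternating labels.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First split: keep 60% for training, hold out 40%.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Second split: divide the held-out 40% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```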

Comparing Linear Regression and Decision Trees

Linear Regression:

Cost Function: Distance between predictions and actual values
Relationship: Linear
Model Type: Parametric (strict)

Decision Tree:

Cost Function: Impurity (splits the data to reduce impurity in the resulting nodes)
Relationship: Non-Linear
Model Type: Non-parametric (flexible)
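
To show what "reducing impurity" means, here is a sketch using Gini impurity, one common impurity measure. Which measure the course used is not stated here, so treat the choice as an assumption.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: each child's Gini weighted by its share of samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
parent_impurity = gini(parent)                        # evenly mixed node
mixed_split = split_impurity([0, 1, 0], [1, 0, 1])    # children still mixed
clean_split = split_impurity([0, 0, 0], [1, 1, 1])    # children are pure
```

A tree-building algorithm would prefer the second split, because it lowers the impurity the most.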

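In Logistic Regression, we did not use OrdinalEncoder because the model treats the encoded integers as magnitudes, so an arbitrary ordering of categories would distort the distances it works with. Decision Trees only split on thresholds, so ordinal encoding can be used there.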
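
The difference between the two encodings is easy to see on a toy column. This sketch assumes scikit-learn's `OrdinalEncoder` and `OneHotEncoder`; the course may have used encoders from a different library, and the color values are invented.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical column with three distinct values (invented for illustration).
colors = [["red"], ["green"], ["blue"], ["green"]]

# Ordinal encoding: one integer per category, assigned in sorted order
# (blue=0, green=1, red=2). The order carries no real meaning.
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot encoding: one 0/1 column per category, so no order is implied.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.ravel())
print(one_hot)
```

With the ordinal version, a linear model would read "red" as twice "green", which is meaningless. A tree only asks whether the value is above or below a threshold, so it is far less sensitive to this.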

Practical Application: Bankruptcy Prediction

I worked on predicting company bankruptcy using data collected by Polish economists. This involved handling JSON files, which required specific techniques:

Decompressing JSON files using the command line and Python's gzip library.
Loading and Saving Files using Python.
Addressing Imbalanced Data using resampling techniques.
Evaluating Models using classification metrics like precision and recall.
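
The last two items can be sketched in plain Python. Random over-sampling is just one resampling technique, and the helper names and toy labels below are my own, not from the course. Resampling is normally applied to the training set only.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes are the same size.

    Assumes a binary problem with at least one minority row.
    """
    rng = random.Random(seed)
    minority = [r for r, l in zip(rows, labels) if l == minority_label]
    majority_count = len(labels) - len(minority)
    extra = rng.choices(minority, k=max(0, majority_count - len(minority)))
    return rows + extra, labels + [minority_label] * len(extra)

def precision_recall(y_true, y_pred, positive=1):
    """Precision: how many flagged cases were right. Recall: how many real cases were found."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy imbalanced data: 8 non-bankrupt (0) and 2 bankrupt (1) companies.
rows = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)

# Made-up predictions: one true positive, one false positive, two missed positives.
precision, recall = precision_recall(
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
)
```

On imbalanced data, accuracy alone can look good even when the model misses most bankruptcies, which is why precision and recall are more informative here.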

Handling JSON Files

JSON uses double quotes for all keys and strings, and its Booleans are lowercase (true and false). JSON objects look similar to Python dictionaries but are not the same thing, so to work with JSON in Python the text must first be parsed into Python objects such as dictionaries and lists.
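
Here is a minimal round trip using only Python's standard `gzip` and `json` modules. The file name and the record's fields are hypothetical and stand in for the real dataset.

```python
import gzip
import json
import os
import tempfile

# Hypothetical record; the field names are invented for illustration.
records = [{"company_id": 1, "bankrupt": False, "note": None}]

# Write to a temporary folder so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")

# Save: serialize to JSON text and write it through gzip in text mode.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(records, f)

# Load: gzip decompresses on the fly and json.load parses into Python objects.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

# Python's False and None become JSON's lowercase false and null.
json_text = json.dumps(records[0])
print(json_text)  # {"company_id": 1, "bankrupt": false, "note": null}
```

On the command line, `gzip -d` followed by the file name is the usual way to decompress the same kind of file.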

Hyperparameter Tuning and Cross-Validation

To optimize model performance, hyperparameter tuning is necessary. Setting aside a separate validation set has a drawback: it deprives the model of valuable training data. One common alternative is k-fold cross-validation, which divides the training data into k folds. The model is trained k times, each time validated on a different fold and trained on the remaining k-1, so every observation is used for both training and validation.

Key Steps:

Create a validation set to tune hyperparameters.
Implement cross-validation to avoid data deprivation and improve model robustness.
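
A short sketch of tuning with 5-fold cross-validation, assuming scikit-learn's `GridSearchCV` and a decision tree. The synthetic dataset and the candidate `max_depth` values are my own choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Candidate hyperparameter values to compare (chosen arbitrarily).
param_grid = {"max_depth": [2, 4, 6]}

# cv=5: each candidate is trained five times, each time validated on a different fold.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print(best_depth, round(search.best_score_, 3))
```

The score reported for each candidate is the mean accuracy across the five validation folds, so no single lucky split decides the result.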

This part of the journey has been enriching and has provided hands-on experience with different data structures, logistic regression, and decision trees. I’ve also learned the importance of proper data handling and evaluation techniques to build robust and reliable models.

If you found this summary helpful or have any questions about the course, please leave a comment below. I’d love to hear your thoughts and experiences! Also, if you’re interested in getting notified about future articles, you can subscribe to my newsletter! (But no pressure at all.)