Data Science Journey with WorldQuant University - Part 2

Esther Anagu
5 min read · Jun 12, 2024

In the first part of this series, we examined the real estate market in Mexico and built a machine learning model that predicts apartment prices in Buenos Aires, Argentina. In this part, I looked at air quality data from Nairobi, Lagos, and Dar es Salaam, and built a time series model to predict PM 2.5 readings throughout the day. This time around, the data collection process took a different turn: we leveraged MongoDB, a NoSQL database, to store our data.

Image source: CodeAcademy

Before embarking on building any model, it’s imperative to import your data into the environment in which you’ll be working. In this instance, our data resided in a MongoDB server, where unstructured data is conventionally stored.

Understanding MongoDB:

MongoDB is categorized as a NoSQL or semi-structured database, commonly employed for web applications, user profile management, click logs, and sensor data.

Connecting to MongoDB:

To establish a connection to the MongoDB server, we relied on two essentials: a MongoClient instance and the port the server listens on.

  • MongoClient: This is the client instance that allows you to interact with the MongoDB server. It enables operations such as querying and insertion.
  • Port: It specifies the port number on which the MongoDB server listens for connections. The default port for MongoDB is 27017, but it’s configurable to different ports.
from pymongo import MongoClient

# Connect to the MongoDB server running on localhost at port 27017
client = MongoClient('mongodb://localhost:27017/')

Navigating MongoDB:

It is important to note that in MongoDB, the terminology differs from traditional relational databases. Here’s a comparison to clarify:

  • Collections: In MongoDB, a collection is analogous to a table in a relational database. It is a group of MongoDB documents.
  • Documents: In MongoDB, a document is analogous to a row in a relational database table. It is a single record in a MongoDB collection, represented in a JSON-like format (BSON).

So, instead of tables and rows, MongoDB uses collections and documents.
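
To make the collection/document distinction concrete, here is a hypothetical PM 2.5 sensor document of the kind such a collection might hold. Apart from metadata.site, which is queried later in this article, the field names and values are illustrative assumptions rather than the exact course schema:

# A hypothetical sensor reading, shown as a Python dict in the JSON-like
# (BSON) shape MongoDB stores; field names and values are illustrative.
sample_document = {
    "metadata": {"site": 29, "measurement": "P2"},  # "metadata.site" is used in a query later
    "P2": 34.4,                                     # assumed PM 2.5 reading in µg/m³
    "timestamp": "2018-09-01T00:00:05Z",            # stored as a BSON datetime in practice
}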

Exploring the collections in the database was a beautiful experience (ironically, though 🤭). It was my first time retrieving data from MongoDB, and the experience was quite different from working with the usual relational databases, which are far more straightforward to query.

During exploration, it’s essential to gain insight into the databases stored on the server to understand the collections available for exploration. To facilitate this, we utilize the list_databases method, which provides a list of all databases on the server. However, before accessing this information, we import PrettyPrinter from the pprint module.

from pprint import PrettyPrinter

Utilizing PrettyPrinter allows us to neatly format the output of the databases. This step is crucial, as it organizes the data for easier readability. Without this formatting, the output would appear as an iterator, making it challenging to interpret.

pp = PrettyPrinter(indent=2)
pp.pprint(list(client.list_databases()))
This is what it looks like. Well-Arranged!

By employing PrettyPrinter, we obtain a clear and structured overview of the databases stored on the MongoDB server, enabling us to proceed with our exploration effectively.

Following this, we accessed our desired database and collection:

# Access the 'air-quality' database
db = client['air-quality']

# Access the 'nairobi' collection within the 'air-quality' database
nairobi = db['nairobi']

Exploring Data:

With access to the collection, we proceeded to explore the data using various methods:

# Count documents in the collection
nairobi.count_documents({})

# Retrieve one document from the collection
result = nairobi.find_one({})
pp.pprint(result)

# Determine distinct sensor sites in the collection
sites = nairobi.distinct("metadata.site")
sites

Next, we refined our data, handled outliers, and visualized our findings to gain deeper insights into the air quality data.
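
The article doesn't reproduce that cleaning code, but a minimal sketch of what it might look like, assuming the readings sit in a "P2" field, the site number is 29, and values above 500 µg/m³ are treated as outliers (all assumptions for illustration), is:

import pandas as pd

# Pull the relevant fields for one assumed site into a DataFrame.
results = nairobi.find(
    {"metadata.site": 29},
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
df = pd.DataFrame(results).set_index("timestamp")

# Drop implausible PM 2.5 spikes, then resample to an hourly mean series,
# forward-filling gaps left by missing readings.
df = df[df["P2"] < 500]
y = df["P2"].resample("1H").mean().ffill()

The resulting hourly series (y here) is the kind of input the visualizations and the models below would work with.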

Screenshot from WorldQuant: TimeSeries

Subsequently, we partitioned our dataset into training and test sets and embarked on building our predictive models, exploring techniques such as Linear Regression, Autoregressive Models, and ARMA Models (a brief fitting sketch follows the descriptions below).

  • Linear regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (predictors) by fitting a linear equation to observed data. It seeks to find the best-fitting line through the data points that minimizes the sum of the squared differences between observed and predicted values. In time series analysis, it can be applied to predict future values based on historical data and potentially external factors.
  • Autoregressive models, often denoted as AR(p), are a class of time series models where the current value of a variable is linearly dependent on its past values, plus random error terms. The “p” in AR(p) denotes the number of lagged observations included in the model. AR models are suitable for capturing temporal dependencies and patterns in time series data.
  • Autoregressive Moving Average Models (ARMA): ARMA models combine autoregressive (AR) and moving average (MA) components to capture both temporal dependencies and random fluctuations in time series data. The ARMA(p, q) model consists of autoregressive terms (p) and moving average terms (q). ARMA models are effective for modeling stationary time series data, where the mean, variance, and autocorrelation structure remain constant over time.
Screenshot from WorldQuant: ACF
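
As a rough illustration of how the autoregressive and ARMA models can be fit in Python, here is a minimal sketch using statsmodels. The chronological train/test split and the lag orders are placeholder assumptions you would normally choose from the ACF/PACF plots, not the course's exact settings:

from statsmodels.tsa.arima.model import ARIMA

# Chronological train/test split on the hourly series `y` from the cleaning sketch.
cutoff = int(len(y) * 0.9)
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]

# AR(p) is ARIMA with order (p, 0, 0); ARMA(p, q) is ARIMA with order (p, 0, q).
ar_model = ARIMA(y_train, order=(8, 0, 0)).fit()
arma_model = ARIMA(y_train, order=(8, 0, 1)).fit()

# Forecast the ARMA model over the test horizon.
y_forecast = arma_model.forecast(steps=len(y_test))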

Key Considerations for Time Series Analysis:

  1. Stationarity: Check if the time series data is stationary, meaning that its statistical properties (such as mean and variance) remain constant over time.
  2. Trend and Seasonality: Identify any trends or seasonal patterns in the data.
  3. Autocorrelation: Examine the autocorrelation structure of the time series.
  4. Model Selection: Choose appropriate models based on the characteristics of the data.
  5. Model Evaluation: Evaluate the performance of time series models using appropriate metrics such as mean absolute error (MAE) and mean squared error (MSE); a small evaluation sketch follows this list.
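
To illustrate points 1 and 5, a stationarity check and error metrics could look like the sketch below, reusing the assumed series and forecast from the earlier sketches. The augmented Dickey-Fuller test stands in for the stationarity check; the course's exact diagnostics may differ:

from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Stationarity: a small ADF p-value (e.g. < 0.05) suggests the series is stationary.
adf_stat, p_value, *_ = adfuller(y_train)
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3f}")

# Evaluation: compare the ARMA forecast against the held-out test set.
mae = mean_absolute_error(y_test, y_forecast)
mse = mean_squared_error(y_test, y_forecast)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}")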

By considering these factors and selecting appropriate modeling techniques, analysts can effectively analyze and forecast time series data, uncovering valuable insights and patterns for decision-making purposes.

If you would like to explore other datasets for practice, you can click here.

If you found this summary helpful or have any questions about the course, please leave a comment below. I’d love to hear your thoughts and experiences!

Also, if you’re interested in getting notified about future articles, you can subscribe to my newsletter! (But no pressure at all.)

Esther Anagu

A Passionate Data Scientist & Analyst. Dedicated to empowering the data community by sharing invaluable resources. Embark on this data-driven adventure with me!