Can We Predict and Explain Air Quality in Cities?

An interactive analysis of 15 cities across Sri Lanka and Asia, covering 3 years of daily air quality data, factor analysis, and a machine learning model that predicts AQI from weather conditions.

by Vineth Samarasinghe  |  github.com/vinethsam
City Rankings
Good (0-50)
Moderate (51-100)
Sensitive Groups (101-150)
Unhealthy (151-200)
Very Unhealthy (201-300)
Average AQI by City (2021-2023)
City Data Summary
Pollution Map

Good (0-50)
Moderate (51-100)
Sensitive Groups (101-150)
Unhealthy (151-200)
Very Unhealthy (201+)
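
The legend above amounts to a lookup from an AQI value to a category. A minimal sketch of that lookup follows, assuming the band edges shown in the map legend; the function name `aqi_band` is hypothetical and not taken from the project code.

```python
from bisect import bisect_left

# Upper edges and labels copied from the map legend; anything above 200
# falls into the open-ended "Very Unhealthy (201+)" band.
_UPPER_EDGES = [50, 100, 150, 200]
_LABELS = ["Good", "Moderate", "Sensitive Groups", "Unhealthy", "Very Unhealthy"]


def aqi_band(aqi: float) -> str:
    """Return the legend category for an AQI value (hypothetical helper)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left keeps a value equal to an upper edge in the lower band,
    # so 50 is "Good" and 51 is "Moderate".
    return _LABELS[bisect_left(_UPPER_EDGES, aqi)]
```

For example, `aqi_band(137)` falls in the 101-150 band and returns "Sensitive Groups".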
What Drives Air Quality?
Monsoon Effect on AQI
Weekday vs Weekend AQI
Average AQI by Season and City
Live AQI Predictor

Adjust the weather conditions below to get an instant AQI prediction. Uses a linear regression model trained on 3 years of historical data (R2 = 0.94).
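
As a rough illustration of what sits behind the sliders, here is a minimal sketch of fitting and querying a linear model, assuming ordinary least squares solved with the normal equations. The feature set (PM2.5, wind speed, humidity), the helper names, and the sample rows are all assumptions for illustration: the rows are generated from a made-up formula, not drawn from the project dataset, and the actual dashboard may well use a library such as scikit-learn instead.

```python
def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving (X^T X) b = X^T y."""
    X = [[1.0, *row] for row in rows]  # prepend an intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, features):
    """Apply fitted coefficients (intercept first) to one feature row."""
    return coefs[0] + sum(c * f for c, f in zip(coefs[1:], features))


# Synthetic illustration only: (pm25, wind_kmh, humidity_pct) rows with
# targets from an invented formula, aqi = 10 + 2.0*pm25 - 0.5*wind + 0.1*humidity.
SAMPLE_ROWS = [(10, 5, 60), (20, 10, 70), (35, 3, 80),
               (50, 8, 55), (80, 2, 65), (15, 12, 90)]
SAMPLE_AQI = [10 + 2.0 * p - 0.5 * w + 0.1 * h for p, w, h in SAMPLE_ROWS]

coefs = fit_ols(SAMPLE_ROWS, SAMPLE_AQI)
predicted = predict(coefs, (40, 6, 75))  # one "slider" setting
```

Because the sample targets are exactly linear in the features, the fit recovers the invented coefficients; real data would leave residual error, which is what the performance figures further down measure.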

City Baseline AQI Reference
Model Performance

Linear Regression

R-squared (R2): 0.936
Mean Absolute Error: 5.97
RMSE: 11.64
CV R2 (5-fold): 0.940

Surprisingly strong for a linear model. The near-linear relationship between particulate concentrations (PM2.5, PM10) and AQI means linear regression captures most of the variance.
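
For reference, the three headline metrics can be computed from a set of predictions as in the sketch below. The function name `regression_metrics` is hypothetical, and the example numbers in the usage note are illustrative rather than taken from the project.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R-squared, mean absolute error, and RMSE for paired values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": sqrt(ss_res / n),
    }


metrics = regression_metrics([50, 100, 150, 200], [60, 90, 150, 220])
```

RMSE squares each error before averaging, so it is never smaller than MAE and grows faster when a few predictions miss badly. An RMSE roughly double the MAE, as reported here, is consistent with a small number of large errors rather than uniformly sized ones.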

Random Forest

R-squared (R2): 0.935
Mean Absolute Error: 6.05
RMSE: 11.78
Estimators: 150

Comparable to linear regression here, which suggests the relationships in the data are largely linear. A random forest would likely pull ahead more clearly on a larger, noisier real-world dataset.

Feature Importance

PM2.5: 78.2%
PM10: 17.2%
NO2: 2.1%
O3: 1.0%
Weather factors: 1.5%

PM2.5 dominates because it is the primary component of AQI by definition. The interesting insight is that weather factors (wind, humidity, temperature) together account for only ~1.5% of feature importance once particulate levels are known.
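
To make the "by definition" point concrete: under the US EPA convention, each pollutant's sub-index is a piecewise-linear interpolation of its concentration, which also helps explain why a linear model fits so well. The sketch below assumes the EPA PM2.5 breakpoints that applied during 2021-2023; the page does not state which AQI standard the dataset follows, so treat both the breakpoints and the function name `pm25_sub_index` as assumptions.

```python
# (conc_low, conc_high, index_low, index_high), 24-hour PM2.5 in ug/m3.
# Assumed: US EPA breakpoints in force before the 2024 revision.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_sub_index(conc: float) -> int:
    """Interpolate a PM2.5 concentration to its AQI sub-index."""
    if conc < 0:
        raise ValueError("concentration cannot be negative")
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        # EPA truncates to one decimal first; this sketch skips that step
        # and assigns values in the tiny gaps to the next band up.
        if conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    raise ValueError("concentration is beyond the top breakpoint")
```

Within any one band the index is a straight line in concentration, so a regression that sees PM2.5 directly has little left to explain with weather variables.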

Feature Importance - Random Forest
Methodology
DATASET

16,106 daily records across 15 cities

Three years of daily AQI, weather, and pollutant data (2021-2023), with realistic seasonal patterns and monsoon cycles built in. Approximately 2% missing values were injected and then cleaned as part of the pipeline.

CLEANING

319 outliers removed, 1,359 nulls imputed

AQI values above 500 (beyond the top of the standard 0-500 scale) were removed. Missing values were forward-filled per city, with median fallback. Data types were standardised and date parsing was validated.
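
The cleaning steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's pipeline: the record layout (`city`, `date`, `aqi` keys), the function name `clean_aqi`, and the choice of a per-city median for the fallback are all hypothetical, since the page does not say which median is used.

```python
from statistics import median


def clean_aqi(records, max_aqi=500):
    """Drop off-scale rows, then forward-fill gaps per city.

    records: list of dicts with 'city', ISO 'date', and 'aqi' (None if missing).
    """
    # 1. Remove rows whose AQI exceeds the scale; keep missing ones for filling.
    kept = [dict(r) for r in records if r["aqi"] is None or r["aqi"] <= max_aqi]
    kept.sort(key=lambda r: (r["city"], r["date"]))  # ISO dates sort correctly

    # 2. Per-city median of observed values (assumed fallback).
    observed = {}
    for r in kept:
        if r["aqi"] is not None:
            observed.setdefault(r["city"], []).append(r["aqi"])
    medians = {city: median(vals) for city, vals in observed.items()}

    # 3. Forward-fill within each city; use the median when nothing precedes.
    last_seen = {}
    for r in kept:
        if r["aqi"] is None:
            r["aqi"] = last_seen.get(r["city"], medians.get(r["city"]))
        if r["aqi"] is not None:
            last_seen[r["city"]] = r["aqi"]
    return kept


# Illustrative rows only, not values from the dataset.
sample = [
    {"city": "Colombo", "date": "2021-01-01", "aqi": None},
    {"city": "Colombo", "date": "2021-01-02", "aqi": 60},
    {"city": "Colombo", "date": "2021-01-03", "aqi": None},
    {"city": "Colombo", "date": "2021-01-04", "aqi": 900},
    {"city": "Colombo", "date": "2021-01-05", "aqi": 80},
]
cleaned = clean_aqi(sample)
```

In the sample, the leading gap has no earlier value to carry forward, so it takes the city median, while the later gap takes the previous day's reading.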

FEATURES

8 engineered features added

The eight features are: is_weekend, season (hemisphere-aware), is_monsoon per city, temperature range bucket, wind category, high pollution flag, 7-day rolling average AQI, and humidity category.
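
Three of these features are simple enough to sketch directly. The version below is a minimal illustration, assuming meteorological seasons (December to February as northern winter) and a trailing rolling window that averages whatever days are available at the start of a series; the function signatures are hypothetical and the project's own definitions may differ.

```python
from datetime import date

_SEASONS = ["winter", "spring", "summer", "autumn"]


def is_weekend(d: date) -> bool:
    """Saturday and Sunday count as weekend (weekday() is 5 or 6)."""
    return d.weekday() >= 5


def season(d: date, southern_hemisphere: bool = False) -> str:
    """Meteorological season, shifted by half a year south of the equator."""
    idx = (d.month % 12) // 3  # Dec-Feb -> 0, Mar-May -> 1, Jun-Aug -> 2, Sep-Nov -> 3
    if southern_hemisphere:
        idx = (idx + 2) % 4
    return _SEASONS[idx]


def rolling_mean(values, window=7):
    """Trailing mean over up to `window` values ending at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

The rolling average should be computed separately for each city, on rows sorted by date, so that one city's readings never leak into another's window.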

NEXT STEPS

What I would add with more time

Activate live OpenWeatherMap API integration, add a 7-day AQI forecast using Facebook Prophet, deploy on Streamlit Cloud with a public URL, and incorporate NASA satellite AOD data as a ground truth comparison.