2 min read, from Machine Learning

Built a Global AQ (PM2.5) Forecaster ML Model [P]

Hey everyone,

I’ve been building an end-to-end Air Quality (PM2.5) forecasting pipeline for 4 countries (US, UK, India, Australia) using 1.6M+ rows of OpenAQ and NASA weather data.

The problem I hit (the variance trap):

My V7 model was a standard stateless Gradient Boosting Regressor. It worked well for low-variance regions (like the US), but in highly volatile environments (like India and the UK) it was failing outright: when I calculated the MASE (Mean Absolute Scaled Error), it was > 1.0. In other words, a naive carryover guess (repeating the last observed value) was outperforming my ML model, because the model couldn't anticipate sudden momentum shifts.
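For anyone who hasn't used MASE, here is a minimal sketch of the check described above. It assumes the common definition (model MAE divided by the in-sample MAE of a carryover forecast); the function name, the `horizon` parameter, and the toy numbers are illustrative and not taken from the repo.

```python
import numpy as np

def mase(y_true, y_pred, y_train, horizon=1):
    """Mean Absolute Scaled Error against a carryover (persistence) forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # In-sample MAE of the naive forecast: predict y[t] with y[t - horizon].
    naive_mae = np.mean(np.abs(y_train[horizon:] - y_train[:-horizon]))
    if naive_mae == 0:
        raise ValueError("naive forecast has zero error; MASE is undefined")
    model_mae = np.mean(np.abs(y_true - y_pred))
    return model_mae / naive_mae

# Toy numbers, made up for illustration only.
train = [10.0, 12.0, 11.0, 15.0, 14.0]
score = mase(y_true=[16.0, 13.0], y_pred=[15.0, 15.0], y_train=train)
print(score)  # below 1.0 means the model beats the carryover guess
```

A score above 1.0 is exactly the situation described here: the carryover guess has the lower average error.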

The fix (horizon-aligned architecture):

Instead of falling into the recursive snowball trap (where day 1 error compounds into day 30), I completely decoupled the horizons.

I engineered strict autoregressive lag vectors aligned specifically to the target horizon (h=1, 7, 14, 30).

Injected a 3-day rolling volatility matrix that ends precisely at the inference boundary to prevent data leakage.
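Roughly, the two feature steps above could look like the sketch below. This is one possible reading, not the repo's actual code: it assumes a daily PM2.5 series in pandas, and `build_horizon_frame`, `n_lags`, and `vol_window` are hypothetical names. Every feature uses data up to and including the inference day t, and the target sits `horizon` days later.

```python
import numpy as np
import pandas as pd

def build_horizon_frame(pm25: pd.Series, horizon: int,
                        n_lags: int = 3, vol_window: int = 3) -> pd.DataFrame:
    """Features known at day t, paired with the target at day t + horizon."""
    frame = pd.DataFrame(index=pm25.index)
    # Lags count back from the inference day t: lag_0 is the latest observation.
    for k in range(n_lags):
        frame[f"lag_{k}"] = pm25.shift(k)
    # Rolling std over the last `vol_window` days, ending at t (inclusive),
    # so no value after the inference boundary leaks into the features.
    frame["volatility"] = pm25.rolling(vol_window).std()
    # The target is `horizon` days after the inference boundary.
    frame["target"] = pm25.shift(-horizon)
    # Drop rows where a lag, the window, or the target is unavailable.
    return frame.dropna()

# Toy series (0, 1, 2, ...) so the alignment is easy to check by eye.
days = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series(np.arange(10, dtype=float), index=days)
frame = build_horizon_frame(series, horizon=2)
print(frame.head())
```

With the toy series, every row's target equals `lag_0 + 2`, which is a quick sanity check that the features stop at day t and the label really is two days ahead.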

Result: MASE dropped strictly below 1.0 globally. Even at a 30-day horizon, the model maintains a 57% predictive accuracy over the chaotic thermodynamic baseline.
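To make the decoupled-horizon idea concrete, here is a rough sketch of the direct strategy: one independent regressor per horizon, so no forecast is ever fed back in as an input. It uses scikit-learn's `GradientBoostingRegressor` on a synthetic series; the helper names, the lag count, and `n_estimators=50` are assumptions for illustration, not the settings used in the repo.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

HORIZONS = (1, 7, 14, 30)  # the horizons named in the post

def make_xy(series, horizon, n_lags=3):
    """Rows of [y[t], y[t-1], ..., y[t-n_lags+1]] paired with y[t + horizon]."""
    X, y = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(series[t - n_lags + 1 : t + 1][::-1])
        y.append(series[t + horizon])
    return np.array(X), np.array(y)

def fit_direct_models(series, horizons=HORIZONS):
    """Train one model per horizon; each sees only observed values."""
    models = {}
    for h in horizons:
        X, y = make_xy(series, h)
        model = GradientBoostingRegressor(n_estimators=50, random_state=0)
        models[h] = model.fit(X, y)
    return models

def forecast(models, series, n_lags=3):
    """Predict every horizon from the same latest observations."""
    latest = np.asarray(series[-n_lags:])[::-1].reshape(1, -1)
    return {h: float(m.predict(latest)[0]) for h, m in models.items()}

# Synthetic stand-in for a PM2.5 series (not real data).
rng = np.random.default_rng(0)
series = 40 + 10 * np.sin(np.arange(200) / 7) + rng.normal(0, 1, 200)
models = fit_direct_models(series)
preds = forecast(models, series)
print(preds)
```

The trade-off is one fit per horizon instead of one model overall, but the day-30 prediction never depends on a day-1 prediction, so errors cannot compound across steps.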

The stack:

Backend pipeline: Python, Pandas (for the memory matrix), scikit-learn, FastAPI.

Frontend: Next.js 16 (App Router), Tailwind v4, Recharts.

Deployment: Vercel with automated GitHub CI/CD sync. (I'm currently pushing updates manually after every test, so the site is effectively static; I'll automate it later.)

I'm currently using scikit-learn GBR, but my immediate next step is to rip it out and rewrite the core engine using XGBoost or LightGBM to handle the sparse temporal features better.

If any MLOps or Data Engineers here have advice on scaling XGBoost for multi-horizon forecasting without exploding the compute, I’d love to hear it. Roast my architecture, the repo is public.

Live URL: https://global-aq-intelligence.vercel.app/

GitHub: https://github.com/divyanshailani/global-aq-intelligence-pipeline

submitted by /u/Divyanshailani

Tagged with

#Air Quality
#PM2.5