Tornado Magnitude Prediction with LightGBM

A Real Dialogue with AI: How MCP Servers Enable Interactive Data Science

Author

Patricio Lobos, Software Engineer & AI Lead at Querex.no

Published

December 3, 2025

1 Executive Summary

Tip: Key Achievement

Our LightGBM model achieved RMSE 0.357 on the held-out test set, a 38.8% improvement over Julia Silge’s XGBoost benchmark (RMSE 0.583). The model explains 83.8% of the variance in tornado magnitude on the test set.

This analysis documents the complete journey of building a tornado magnitude prediction model using LightGBM with domain-enhanced features. More importantly, it demonstrates how Model Context Protocol (MCP) servers enable a genuine, interactive dialogue between a human and an AI assistant (Claude Opus 4.5 in VS Code via GitHub Copilot) to collaboratively solve complex data science problems.

2 Introduction: The Power of AI-Human Collaboration

2.1 About the Author

Note: Patricio Lobos

Software Engineer & AI Lead at Querex.no

About Querex: Querex develops mathematical and statistical tools for Large Language Models via the Model Context Protocol (MCP). The LightGBM and Statistics MCP servers used in this analysis are examples of how Querex enables AI assistants to perform real computations rather than just describe them.

2.2 About This Analysis

This document tells two stories:

  1. A technical story: How we built a machine learning model to predict tornado magnitudes
  2. A methodological story: How MCP servers transform the way humans and AI collaborate on data science

The goal wasn’t simply to “beat” a benchmark—it was to demonstrate a new paradigm of interactive, tool-augmented AI assistance where Claude Opus 4.5 can actually do data science, not just talk about it.

2.3 About Julia Silge and the Benchmark

Note: Who is Julia Silge?

Julia Silge is a data scientist and software engineer at Posit (formerly RStudio). She’s widely known for her work on the tidymodels ecosystem, for co-authoring the tidytext package and *Text Mining with R*, and for her long-running series of screencasts modeling #TidyTuesday datasets.

Her tornado magnitude prediction blog post from May 2023 provides an excellent XGBoost baseline using the tidymodels framework in R.

Julia’s analysis achieved RMSE 0.583 and R² 0.578 using XGBoost with effect encoding for state variables. Our goal was not to “compete” with her work, but to use it as a well-documented reference point for evaluating our approach.

2.4 The Dataset

Important: Data Source

TidyTuesday Tornado Dataset (May 16, 2023)

The dataset contains 68,693 tornado records from 1950-2022 with 27 variables including magnitude, path dimensions, casualties, and location data.

3 Understanding the Technology Stack

3.1 What is MCP (Model Context Protocol)?

Note: MCP Explained

Model Context Protocol (MCP) is an open standard that allows AI assistants like Claude to interact with external tools and data sources through a standardized interface.

Think of MCP as a universal translator between AI models and specialized software. Instead of the AI just describing how to do something, MCP lets the AI actually do it.
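Concretely, each tool invocation travels as a JSON-RPC 2.0 message between the AI client and an MCP server. The sketch below shows the rough shape of a `tools/call` request; the tool name `train_regression` and its arguments are illustrative assumptions, not the actual Querex server API.

```python
import json

# Illustrative JSON-RPC 2.0 request an MCP client might send to a server.
# The tool name "train_regression" and its arguments are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "train_regression",
        "arguments": {"data": "tornado_train_domain.csv", "target": "mag"},
    },
}
print(json.dumps(request, indent=2))
```

The server replies with a result message containing the tool's output (metrics, model IDs, and so on), which the assistant can then reason about in the same conversation.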

Code
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#5a6270', 'primaryColor': '#667eea', 'secondaryColor': '#51cf66', 'tertiaryColor': '#fcc419'}}}%%
flowchart TB
    subgraph env["User Environment (VS Code)"]
        A[Human User] <-->|Natural Language| B[Claude Opus 4.5]
    end
    
    subgraph servers["MCP Servers"]
        C[LightGBM Server<br/>Train & Predict Models]
        D[Statistics Server<br/>Statistical Analysis]
        E[SQL Server<br/>Database Queries]
    end
    
    B <-->|MCP Protocol| C
    B <-->|MCP Protocol| D
    B <-->|MCP Protocol| E
    
    style env fill:#5a6270,color:#4dabf7
    style servers fill:#5a6270,color:#4dabf7
    style A fill:#6c7a89,color:#4dabf7
    style B fill:#667eea,color:#fff,stroke:#667eea,stroke-width:2px
    style C fill:#51cf66,color:#fff,stroke:#51cf66,stroke-width:2px
    style D fill:#fcc419,color:#000,stroke:#fcc419,stroke-width:2px
    style E fill:#ff6b6b,color:#fff,stroke:#ff6b6b,stroke-width:2px

MCP Architecture: How Claude Connects to Tools

3.1.1 MCP Servers Used in This Analysis

MCP Servers Enabling This Analysis

| Server | Purpose | Key Capabilities |
| --- | --- | --- |
| LightGBM MCP | Machine Learning | Train regression/classification models, make predictions, get feature importance |
| Statistics MCP | Statistical Analysis | Correlation, distributions, hypothesis testing, descriptive stats |
| MSSQL MCP | Data Access | Query databases, explore schemas, run SQL |

3.1.2 Why MCP Matters

Without MCP, an AI conversation about machine learning might look like:

Human: “Train a LightGBM model on this data”

AI: “Here’s the Python code you would write: lgb.train(params, train_data)…”

Human: [copies code, runs it, gets an error, asks for help…]

With MCP, the conversation becomes:

Human: “Train a LightGBM model on this data”

AI: [actually trains the model] “Done! The model achieved RMSE 0.357. The top features were latitude and year. Want me to try different hyperparameters?”

This is the difference between talking about data science and doing data science together.

3.2 What is LightGBM?

Note: LightGBM Explained

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. It’s designed for speed and efficiency.

3.2.1 Gradient Boosting: The Core Idea

Gradient boosting builds models sequentially, where each new model corrects the errors of the previous ones:

\[ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \]

Where:

  • \(F_m(x)\) is the model at iteration \(m\)
  • \(h_m(x)\) is a weak learner (decision tree) that predicts the residual errors
  • \(\gamma_m\) is the learning rate
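A minimal from-scratch sketch of this update rule (one-split decision stumps in plain NumPy, not the LightGBM implementation) shows each weak learner fitting the residuals of the ensemble so far:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)  # noisy toy target

def fit_stump(x, residual):
    """Fit a one-split regression stump minimizing squared error on residuals."""
    best = None
    for t in np.linspace(x.min(), x.max(), 50)[1:-1]:
        left, right = residual[x <= t], residual[x > t]
        if left.size == 0 or right.size == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda z: np.where(z <= t, left_val, right_val)

learning_rate = 0.3             # gamma_m, held constant here
F = np.full_like(y, y.mean())   # F_0: predict the mean
for m in range(50):
    h = fit_stump(x, y - F)          # h_m learns the residual errors
    F = F + learning_rate * h(x)     # F_m = F_{m-1} + gamma * h_m

rmse = float(np.sqrt(np.mean((y - F) ** 2)))
```

Each round shrinks the residuals a little; the final prediction is the sum of all the small corrections, exactly as in the diagram below.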
Code
flowchart LR
    A[Tree 1<br/>Initial Prediction] --> B[Residuals]
    B --> C[Tree 2<br/>Corrects Errors]
    C --> D[Residuals]
    D --> E[Tree 3<br/>Corrects Errors]
    E --> F[...]
    F --> G[Final Prediction<br/>Sum of All Trees]
    
    style A fill:#e8f4f8
    style C fill:#d4edda
    style E fill:#fff3cd
    style G fill:#667eea,color:#fff

How Gradient Boosting Builds Predictions

3.2.2 LightGBM vs XGBoost

LightGBM vs XGBoost Comparison

| Aspect | LightGBM | XGBoost |
| --- | --- | --- |
| Tree Growth | Leaf-wise (best-first) | Level-wise (breadth-first) |
| Speed | Generally faster | Slightly slower |
| Memory | More efficient | Higher memory usage |
| Accuracy | Often similar | Often similar |
| Overfitting Risk | Higher with small data | More conservative |
| Categorical Features | Native support | Requires encoding |

Julia Silge used XGBoost in R with the tidymodels framework. We used LightGBM via MCP servers. Both are excellent choices—the key difference in our results comes from feature engineering and hyperparameter tuning, not the algorithm itself.

3.3 Understanding the Metrics

3.3.1 RMSE (Root Mean Square Error)

Note: RMSE Explained

RMSE measures the average magnitude of prediction errors, giving higher weight to large errors.

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \]

Interpretation: An RMSE of 0.357 means our predictions are, on average, about 0.36 magnitude units away from the true value. On a 0-5 scale, this is quite good!

3.3.2 R² (Coefficient of Determination)

NoteR² Explained

R² measures the proportion of variance in the target variable explained by the model.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]

Interpretation: An R² of 0.838 means our model explains 83.8% of the variance in tornado magnitudes. The remaining 16.2% is unexplained variation (noise, missing features, or inherent randomness).
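Both metrics are straightforward to compute directly from the formulas above; here is a small illustration with made-up predictions (the arrays are toy values, not our model's output):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared residual."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Proportion of variance explained: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: true magnitudes vs. hypothetical predictions
y_true = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y_pred = np.array([0.2, 0.9, 2.3, 2.8, 4.1, 4.6], dtype=float)
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```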

3.3.3 Why Both Metrics Matter

Metric Comparison

| Metric | Strengths | Limitations |
| --- | --- | --- |
| RMSE | Same units as target; penalizes large errors | Scale-dependent; harder to interpret across datasets |
| R² | Scale-independent; intuitive percentage | Can be misleading with few observations; doesn’t indicate direction of errors |

3.4 Why Train/Test Splits Matter

Warning: The Cardinal Sin of Machine Learning

Never evaluate your model on the same data you trained it on!

This leads to overfitting: the model memorizes the training data instead of learning generalizable patterns.

Code
flowchart LR
    A[Full Dataset<br/>67,937 records] --> B{Random Split}
    B -->|80%| C[Training Set<br/>54,349 records]
    B -->|20%| D[Test Set<br/>13,588 records]
    C --> E[Model Training<br/>& Validation]
    D --> F[Final Evaluation<br/>Unseen Data]
    E --> G[Tune Hyperparameters]
    G --> E
    E -->|Best Model| F
    
    style C fill:#51cf66,color:#fff
    style D fill:#ff6b6b,color:#fff
    style F fill:#667eea,color:#fff

Train/Test Split Strategy

3.4.1 The “Locked Box” Analogy

Think of your test set as a locked box that you can only open once:

  1. Training data: Use freely for model development
  2. Validation data: Use to tune hyperparameters and compare models
  3. Test data: Open only once for final evaluation

If you peek at the test set during development, you’re “leaking” information and your final metrics will be overly optimistic.

3.4.2 Julia’s Wisdom on Stratification

Tip: Why Stratify?

Julia emphasized stratifying by magnitude when splitting data:

“Stratification when you’re doing resampling almost never hurts you and sometimes it really helps you. I suspect this is a situation when it would help you because the magnitude is distributed… where there’s lots and lots of zero magnitude tornadoes and very few high ones. So if we want to be able to predict those really high ones we need to make sure they’re in our testing sets and evenly split up.”

This ensures that rare high-magnitude tornadoes appear in both training and test sets, rather than randomly ending up mostly in one or the other.
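In tidymodels this is `initial_split(data, strata = mag)`; in Python, scikit-learn's `train_test_split(..., stratify=y)` does the same. The underlying idea can be sketched by hand in a few lines (toy class counts, not the real dataset):

```python
import numpy as np

rng = np.random.default_rng(123)

# Toy magnitudes with the real dataset's imbalance flavor: many EF0s, few EF5s.
mag = np.array([0] * 60 + [1] * 25 + [2] * 10 + [5] * 5)

def stratified_split(labels, test_frac=0.2, rng=rng):
    """Split indices so each class contributes ~test_frac of its rows to test."""
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_test = max(1, int(round(test_frac * len(idx))))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

train_idx, test_idx = stratified_split(mag)
# Every class, including the rare EF5s, appears in both splits.
```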

4 The Prediction Challenge

4.1 Problem Definition

Predicting tornado magnitude on the Enhanced Fujita (EF) scale is inherently difficult:

  • Discrete ordinal target: EF0 through EF5, represented as integers 0-5
  • Class imbalance: Most tornadoes are EF0-EF1; violent EF4-EF5 tornadoes are rare
  • Complex interactions: Geographic, temporal, and meteorological factors combine non-linearly
  • Measurement challenges: Magnitude is assessed post-event based on damage

4.2 Julia Silge’s Modeling Philosophy

Note: From Julia’s Video

“I grew up in Texas, North Texas, just in Tornado Alley, so this dataset has a lot of resonance for the natural extreme weather, natural disasters of my youth. But also I think this is a great dataset where we can really think about the modeling process—how we set it up and how the decisions that we make really impact the results that we get later on down the line.”

Julia explicitly walked through the modeling options, explaining why each has limitations:

4.2.1 Option 1: Multi-class Classification ❌

“I could treat this like a classification problem… The problem with this is that these classes are not really like red, green, blue—it’s more like small, medium, large where this has an order to it. A tornado that is truly a class five getting predicted as a class four is very different than it being predicted as a class one in terms of how wrong it is, and classification metrics don’t take advantage of that.”

4.2.2 Option 2: Ordinal Regression (MASS::polr) ❌

“This is definitely a good fit for our outcome, but this kind of model is linear and when we have a big dataset like this including complex interactions, a linear model often leaves a lot of possible model performance on the table.”

4.2.3 Option 3: Zero-Inflated Poisson ❌

“We could treat it like counts… either with extra zeros or not… Again, these are linear models. I am not aware of any implementation of these kinds of outcome modeling that are not linear.”

4.2.4 Option 4: Treat as Regression ✅

“What if I just treat it like it is a regression problem? That’s the example I’m going to walk through and we’re going to see what it does and does not do well so that you can understand when you have outcomes that are not a perfect fit for the different kinds of models.”

This was Julia’s key insight: sometimes a powerful non-linear model (XGBoost) treating an ordinal outcome as continuous can outperform “theoretically correct” linear approaches. We followed the same philosophy with LightGBM.

4.3 Why XGBoost/LightGBM for This Problem?

Julia explained her algorithm choice clearly:

Tip: Julia’s Reasoning

“I am using XGBoost for just the reason that I said—this is pretty big data with things that I know are correlated with each other. I know injuries and fatalities are correlated with each other. I know length and width are correlated with each other. And it’s a big dataset. So when I see that kind of situation I think: XGBoost is gonna be my friend.”

This same reasoning applies to LightGBM—both are gradient boosting frameworks designed for:

  • Large datasets with many features
  • Correlated predictors
  • Complex non-linear interactions
  • Situations where feature importance matters

4.4 Reference: Julia Silge’s XGBoost Model

Julia’s approach using tidymodels in R:

Benchmark Performance

| Metric | Julia Silge’s XGBoost |
| --- | --- |
| RMSE | 0.583 |
| R² | 0.578 |

5 Data & Feature Engineering

5.1 Dataset Overview

Code
flowchart LR
    A[Raw Tornado Data<br/>67,937 records] --> B[Train/Test Split<br/>80/20]
    B --> C[Training Set<br/>54,349 rows]
    B --> D[Test Set<br/>13,588 rows]
    C --> E[Feature Engineering]
    D --> E
    E --> F[Domain-Enhanced<br/>20 Features]

Data Pipeline Architecture

5.2 Julia’s EDA Insights

Before building her model, Julia explored the data and shared key observations:

Note: State Variable Analysis

“Very few tornadoes in Alaska and they are very mild—very big contrast to say Arkansas which has a lot of tornadoes and also they’re more extreme. We can see here that we have quite dramatic differences in how extreme tornadoes are in these different states.”

Julia noted the state variable has 53 levels (high cardinality), presenting a choice:

  • Make 50+ dummy variables ❌
  • Use likelihood/effect encoding ✅

“What is happening here is that we’re going to make a little mini model as part of our feature engineering that maps the states to an effect on the outcome… instead of having Texas, Arkansas, Alaska we will have numbers from this little mini model that say how much of an effect on the outcome does this have.”
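Julia’s encoding fits a small model per level rather than taking a raw average, but the core idea—replace each state with a number summarizing its effect on the outcome—can be sketched with a naive per-state mean (toy records, not the real data):

```python
from collections import defaultdict

# Toy (state, magnitude) records; the real analysis has all 53 state levels.
records = [("TX", 2), ("TX", 3), ("AK", 0), ("AK", 0), ("AR", 4), ("AR", 3)]

totals = defaultdict(lambda: [0.0, 0])
for state, mag in records:
    totals[state][0] += mag
    totals[state][1] += 1

# The per-state mean outcome stands in for 50+ dummy variables.
effect = {state: s / n for state, (s, n) in totals.items()}
# e.g. effect["TX"] is 2.5 while effect["AK"] is 0.0
```

A production version would add partial pooling (shrinking rare states toward the overall mean) and compute the encoding only on training data to avoid leakage.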

NoteInjuries as a Predictor

“[Injuries and fatalities] I know are of course going to have a strong relationship, right? Strong tornadoes cause more injuries and fatalities, certainly.”

Julia visualized injuries by magnitude and found a power law relationship—dramatic increases in injuries as magnitude increases. This validates using inj and fat as predictors.

5.3 Feature Categories

Our 20 features fall into four categories:

5.3.1 1. Raw Physical Features (6 features)

Physical Features

| Feature | Description | Importance Rank |
| --- | --- | --- |
| lat | Latitude of tornado touchdown | #1 🥇 |
| len | Path length (miles) | #5 |
| wid | Path width (yards) | #6 |
| inj | Injuries caused | #11 |
| fat | Fatalities caused | #13 |
| yr | Year of occurrence | #2 🥈 |

5.3.2 2. Engineered Features (4 features)

Note: Engineering Insight

The engineered ratios wid_len_ratio and area became top-5 predictors, demonstrating that feature engineering significantly enhanced model performance.

Engineered Features

| Feature | Formula | Importance Rank |
| --- | --- | --- |
| wid_len_ratio | width / length | #3 🥉 |
| area | width × length | #4 |
| total_casualties | injuries + fatalities | #10 |
| st_encoded | State label encoding | #8 |
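A sketch of these transformations on a single record (field names follow the table; the guard for zero-length paths is a defensive addition of this sketch, not part of the original pipeline):

```python
def engineer(row):
    """Add the ratio, area, and casualty features described above to one record.

    `row` is a dict with raw fields: len (miles), wid (yards), inj, fat.
    """
    out = dict(row)
    out["area"] = row["wid"] * row["len"]
    # Defensive guard in case a record has a zero-length path.
    out["wid_len_ratio"] = row["wid"] / row["len"] if row["len"] else 0.0
    out["total_casualties"] = row["inj"] + row["fat"]
    return out

example = engineer({"len": 10.0, "wid": 200.0, "inj": 3, "fat": 1})
```

(st_encoded is produced separately by mapping each of the 53 state levels to an integer label.)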

5.3.3 3. Cyclical Temporal Features (2 features)

Traditional month encoding (1-12) creates an artificial discontinuity between December and January. We used cyclical encoding instead:

\[ \text{mo\_sin} = \sin\left(\frac{2\pi \times \text{month}}{12}\right) \]

\[ \text{mo\_cos} = \cos\left(\frac{2\pi \times \text{month}}{12}\right) \]

Cyclical Features

| Feature | Purpose | Importance Rank |
| --- | --- | --- |
| mo_cos | Captures winter/summer cycle | #7 |
| mo_sin | Captures spring/fall cycle | #9 |
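A quick sketch of the encoding, showing that December and January land next to each other on the circle while June sits on the opposite side:

```python
import math

def cyclical_month(month):
    """Map month 1-12 onto a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

def dist(a, b):
    """Euclidean distance between two encoded months."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

dec, jan, jun = cyclical_month(12), cyclical_month(1), cyclical_month(6)
# dist(dec, jan) is small; dist(dec, jun) is large — no Dec/Jan discontinuity.
```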

5.3.4 4. Domain Knowledge Features (8 features)

Based on meteorological research, we created binary indicators:

Domain Knowledge Features

| Feature | Definition | Research Basis |
| --- | --- | --- |
| is_tornado_alley | TX, OK, KS, NE, SD, IA | Classic severe weather corridor |
| is_dixie_alley | MS, AL, TN, AR, LA, GA | Southeast tornado hotspot |
| is_optimal_lat_band | 33°N - 37°N | Peak tornado latitude |
| is_peak_season | March - June | Primary tornado season |
| is_may_peak | May | Historical peak month |
| is_april_violent | April | Highest EF4-EF5 occurrence |
| is_summer_weak | July - August | Predominantly weak tornadoes |
| is_hurricane_season | August - October | Tropical system influence |
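These indicators are simple set and range checks; a sketch of how they could be computed per record, with definitions taken directly from the table above:

```python
TORNADO_ALLEY = {"TX", "OK", "KS", "NE", "SD", "IA"}
DIXIE_ALLEY = {"MS", "AL", "TN", "AR", "LA", "GA"}

def domain_flags(st, lat, mo):
    """Binary domain-knowledge indicators for one tornado record."""
    return {
        "is_tornado_alley": int(st in TORNADO_ALLEY),
        "is_dixie_alley": int(st in DIXIE_ALLEY),
        "is_optimal_lat_band": int(33.0 <= lat <= 37.0),
        "is_peak_season": int(3 <= mo <= 6),
        "is_may_peak": int(mo == 5),
        "is_april_violent": int(mo == 4),
        "is_summer_weak": int(mo in (7, 8)),
        "is_hurricane_season": int(8 <= mo <= 10),
    }

flags = domain_flags("TX", 35.0, 5)  # a May tornado in central Texas
```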

6 Model Development Journey

6.1 Phase 1: Initial DART Attempt (Failed)

Warning: Initial Approach Failed

Starting with DART boosting and the user’s specified parameters resulted in severe overfitting.

Initial Configuration:

boosting: dart
num_leaves: 512
max_depth: 15
learning_rate: 0.015
num_iterations: 2000
feature_fraction: 0.98

Result: RMSE 0.725, R² 0.34 — Worse than baseline!

Diagnosis: DART’s dropout regularization was insufficient. The high num_leaves and low learning_rate with many iterations caused memorization of training data.

6.2 Phase 2: GBDT with Regularization (Breakthrough)

Switching to GBDT with explicit L1/L2 regularization immediately improved results:

Code
flowchart TD
    A[DART Initial<br/>RMSE: 0.725] -->|Switch to GBDT| B[GBDT Base<br/>RMSE: 0.559]
    B -->|Add Regularization| C[GBDT + L1/L2<br/>RMSE: 0.412]
    C -->|Increase Depth| D[Deep GBDT<br/>RMSE: 0.339]
    D -->|Tune Learning Rate| E[Optimized GBDT<br/>RMSE: 0.291]
    E -->|Fine-tune All| F[Final Model<br/>RMSE: 0.277]
    
    style A fill:#ff6b6b,color:#fff
    style F fill:#51cf66,color:#fff

Model Evolution Through Iterations

6.3 Phase 3: “ELON MODE” — Aggressive Optimization

We ran 40+ iterations of hyperparameter tuning, systematically exploring:

  • Learning rates: 0.015 → 0.22
  • Tree depth: 10 → 22
  • Number of leaves: 128 → 520
  • Regularization strength: Various L1/L2 combinations
  • Feature fraction: 0.6 → 0.98
  • Number of estimators: 500 → 2000

6.3.1 Julia’s Tuning Approach: Racing

Julia used a clever technique called racing to efficiently tune XGBoost:

Note: Racing Methods Explained

“The thing about XGBoost is that you don’t really know what these should be but often some of them turn out really bad right away—it’s like ‘oh this one’s clearly terrible’—so we can use one of these racing methods.”

“What will happen is we’ll use those resamples that we made and we will try all of the different hyperparameter combinations and see which ones turn out really bad after the first couple of resamples and then throw those away and not keep going with them. So it can be a really big time saver.”

Racing uses an ANOVA model to statistically determine if a hyperparameter configuration is significantly worse than others, eliminating it early.

Our MCP-based approach was different but similarly iterative—Claude could immediately see results and adjust parameters in real-time conversation, effectively “racing” through configurations with human-guided intuition.

7 Winning Model Configuration

Important: Best Model: 028bd96d5efb4f5fb903e7bc83a3c63f

7.1 Hyperparameters

Best Model Configuration
boosting: gbdt
learning_rate: 0.22
max_depth: 21
num_leaves: 500
n_estimators: 1600
feature_fraction: 0.79
lambda_l1: 0.17
lambda_l2: 0.35
min_data_in_leaf: 1
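For readers reproducing this outside the MCP workflow, the same configuration expressed as a parameter dict for the Python lightgbm package would look roughly like this (`num_iterations` is the lightgbm alias for n_estimators; the dict is shown but no training is run here):

```python
# Parameters mirroring the winning configuration, usable with lightgbm.train().
params = {
    "objective": "regression",
    "boosting": "gbdt",
    "learning_rate": 0.22,
    "max_depth": 21,
    "num_leaves": 500,
    "num_iterations": 1600,   # n_estimators
    "feature_fraction": 0.79,
    "lambda_l1": 0.17,
    "lambda_l2": 0.35,
    "min_data_in_leaf": 1,
    "metric": "rmse",
}
```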

7.2 Parameter Insights

Hyperparameter Rationale

| Parameter | Value | Rationale |
| --- | --- | --- |
| boosting | gbdt | More stable than DART for this dataset |
| learning_rate | 0.22 | Higher than typical; compensated by regularization |
| max_depth | 21 | Deep trees capture complex interactions |
| num_leaves | 500 | < 2^21 to prevent overfitting |
| n_estimators | 1600 | Many iterations with regularization |
| feature_fraction | 0.79 | Column subsampling for diversity |
| lambda_l1 | 0.17 | L1 regularization (Lasso) |
| lambda_l2 | 0.35 | L2 regularization (Ridge) |
| min_data_in_leaf | 1 | Allow fine-grained splits |

8 Results

8.1 Performance Comparison

Model Performance Comparison

| Metric | Validation | Test | Julia Silge | Improvement |
| --- | --- | --- | --- | --- |
| RMSE | 0.277 | 0.357 | 0.583 | 38.8% |
| R² | 0.904 | 0.838 | 0.578 | 45.0% |
| MAE | 0.198 | 0.242 | — | — |
| MAPE | 11.2% | — | — | — |

Tip: Generalization Gap

The validation-to-test performance drop (RMSE 0.277 → 0.357) indicates some overfitting, but test performance still dramatically exceeds the benchmark.

8.2 Julia’s Interpretation Framework

Julia provided excellent guidance for evaluating regression predictions on ordinal outcomes:

Note: What Julia Looked For

When Julia evaluated her XGBoost predictions, she checked two things:

1. Distribution of Predictions: “Look at this distribution of predictions—this is actually not so bad. Notice there’s not a lot of values below zero and the range here is actually just right. We don’t end up predicting tornadoes that have a magnitude of 10 or 20.”

2. Predictions by True Class: “For things that have a real magnitude of zero, one, two, three, four, five—what’s the distribution of predictions? We can see that the median for five is a little low so we’re under-predicting the high ones and we’ve over-predicted the low ones. This is not perfect but this is actually not so bad.”

This framework—checking that predictions stay in reasonable bounds and examining prediction distributions by true class—is how we validated that treating magnitude as continuous was a reasonable choice.
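The second check—prediction distributions by true class—takes only a few lines; a sketch with toy values (not our model's actual predictions):

```python
from collections import defaultdict
from statistics import median

# Toy true magnitudes and continuous predictions.
y_true = [0, 0, 0, 1, 1, 2, 3, 4, 5]
y_pred = [0.3, -0.1, 0.2, 1.4, 0.8, 2.1, 2.7, 3.6, 4.2]

by_class = defaultdict(list)
for t, p in zip(y_true, y_pred):
    by_class[t].append(p)

medians = {t: median(preds) for t, preds in sorted(by_class.items())}
# A median below the true class (4.2 for class 5 here) signals under-prediction
# of the violent tornadoes — exactly the pattern Julia observed.
```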

8.3 Feature Importance Analysis

8.3.1 Julia’s Feature Importance Findings

Julia’s XGBoost model found similar top features:

Note: Julia’s Top Features

“Most important: injuries. Next: length—so how big is it. Length and width are both here. Year is here—there is a change with year because of climate change, like more extreme tornadoes as time goes on. Also notice state is in the top five most important predictors—so what that tells me is that it was worth it to do that feature engineering that we did. Months are also in here—May, April, June—tornado time of year.”

Our LightGBM model found latitude (#1) even more important than injuries, likely because we used raw lat while Julia used effect-encoded state which captures similar geographic information differently.

8.3.2 Top 10 Features by Gain

Code
%%{init: {'theme': 'base', 'themeVariables': { 'xyChart': {'plotColorPalette': '#ff7f50'}}}}%%
xychart-beta
    title "Feature Importance (Gain)"
    x-axis ["lat", "yr", "wid_len", "area", "len", "wid", "mo_cos", "st_enc", "mo_sin", "casual"]
    y-axis "Gain" 0 --> 10000
    bar [9648, 7194, 5540, 5400, 5004, 4627, 3406, 2981, 2792, 1453]

Feature Importance by Information Gain

8.3.3 Feature Importance Table

Top 10 Features by Gain

| Rank | Feature | Gain | Splits | Category |
| --- | --- | --- | --- | --- |
| 1 | lat | 9,648 | 5,765 | Geographic |
| 2 | yr | 7,194 | 4,212 | Temporal |
| 3 | wid_len_ratio | 5,540 | 4,156 | Engineered |
| 4 | area | 5,400 | 4,569 | Engineered |
| 5 | len | 5,004 | 4,520 | Physical |
| 6 | wid | 4,627 | 4,425 | Physical |
| 7 | mo_cos | 3,406 | 3,376 | Cyclical |
| 8 | st_encoded | 2,981 | 4,233 | Geographic |
| 9 | mo_sin | 2,792 | 3,316 | Cyclical |
| 10 | total_casualties | 1,453 | 2,072 | Engineered |

8.3.4 Domain Feature Performance

Domain Feature Importance

| Feature | Gain | Assessment |
| --- | --- | --- |
| is_tornado_alley | 556 | Moderate value |
| is_dixie_alley | 428 | Moderate value |
| is_optimal_lat_band | 380 | Moderate value |
| is_peak_season | 221 | Low value |
| is_may_peak | 145 | Low value |
| is_hurricane_season | 112 | Low value |
| is_april_violent | 55 | Minimal value |
| is_summer_weak | 33 | Minimal value |

9 Key Learnings

9.1 What Worked

Tip: 1. GBDT Over DART

For this dataset, GBDT with explicit L1/L2 regularization outperformed DART’s dropout-based regularization. DART excels when you need to prevent overfitting without tuning regularization parameters, but explicit regularization gave finer control.

Tip: 2. Feature Engineering

Engineered features (wid_len_ratio, area) ranked #3 and #4 in importance. Simple transformations of raw features added significant predictive power.

Tip: 3. Cyclical Encoding

mo_cos and mo_sin outperformed binary month indicators. The continuous representation captured seasonal patterns more effectively.

Tip: 4. Higher Learning Rate + Strong Regularization

Counter-intuitively, learning_rate=0.22 (high for LightGBM) with strong L1/L2 regularization produced better results than low learning rates.

9.2 What Didn’t Work

Warning: 1. DART Boosting

With our initial parameters, DART severely overfit. The dropout mechanism wasn’t sufficient for this feature set.

Warning: 2. Binary Domain Indicators (Limited Value)

Features like is_tornado_alley and is_peak_season added less value than expected. The model learned these patterns implicitly from lat, st_encoded, and cyclical month features.

Warning: 3. Very Low Learning Rates

Initial attempts with learning_rate=0.015 required too many iterations and still underperformed.

9.3 Insights

Note: Geography is Paramount

Latitude alone was the #1 predictor. Tornado magnitude correlates strongly with geographic location—likely reflecting the climatological conditions that produce severe tornadoes.

Note: Temporal Trends Matter

Year (yr) ranked #2, suggesting tornado magnitude patterns have changed over time—potentially due to improved detection, climate patterns, or measurement methodology changes.

Note: Morphology Predicts Severity

Physical dimensions (len, wid) and their ratio directly relate to magnitude. Wider, longer tornadoes tend to be more severe—an intuitive finding that validates our feature engineering approach.

10 Recommendations for Future Work

10.1 Model Improvements

  1. Ensemble Methods: Combine LightGBM with CatBoost or XGBoost for potential gains
  2. Target Encoding: Replace label encoding for states with target encoding
  3. Interaction Features: Explicitly create lat×month or state×season interactions
  4. Ordinal Regression: Treat magnitude as ordinal rather than continuous

10.2 Data Enhancements

  1. Weather Data: Incorporate atmospheric conditions (CAPE, wind shear, humidity)
  2. Radar Data: Use pre-tornado radar signatures if available
  3. Time of Day: Add hour of occurrence (severe tornadoes peak in late afternoon)
  4. Population Density: May correlate with damage assessment accuracy

10.3 Validation Improvements

  1. Cross-Validation: Implement k-fold CV for more robust estimates
  2. Temporal Validation: Train on earlier years, test on recent years
  3. Geographic Validation: Ensure performance across different regions

11 Conclusion

11.1 Technical Achievements

This analysis demonstrates that domain-informed feature engineering combined with aggressive hyperparameter optimization can significantly exceed baseline performance. Our LightGBM model achieved:

  • 38.8% lower RMSE than Julia Silge’s XGBoost benchmark
  • 45% higher R² explaining tornado magnitude variance
  • Interpretable insights about the importance of geography, morphology, and time

11.2 The Bigger Picture: AI-Human Collaboration

Tip: The Real Innovation

The most significant outcome of this project isn’t the model performance—it’s demonstrating a new way of doing data science.

Through MCP servers, Claude Opus 4.5 wasn’t just advising on machine learning—it was actively participating in the iterative process of building, evaluating, and refining models.

11.2.1 What Made This Collaboration Work

Human-AI Division of Labor

| Human Contributions | AI Contributions |
| --- | --- |
| Domain intuition (tornado meteorology) | Rapid iteration through hyperparameters |
| Strategic direction (“try DART”, “go ELON MODE”) | Statistical feature engineering |
| Quality judgment (is this result good enough?) | Tool execution (training, prediction, evaluation) |
| Business context (why does this matter?) | Documentation and explanation |

11.2.2 The Conversational Data Science Process

Our session demonstrated a natural dialogue:

  1. Human: “Let’s use these domain features based on meteorological research”
  2. AI: Validates features against data, creates engineered dataset
  3. Human: “Try DART boosting with these parameters”
  4. AI: Trains model, reports poor results, explains why
  5. Human: “Go full ELON MODE—iterate aggressively”
  6. AI: Runs 40+ experiments, converges on optimal configuration
  7. Human: “Now test on held-out data”
  8. AI: Evaluates, reports final metrics, compares to benchmark

This isn’t prompt engineering or code generation—it’s collaborative problem-solving where both parties contribute their strengths.

11.3 Looking Forward

MCP servers represent a fundamental shift in how AI assists with technical work. Instead of:

  • Generating code snippets that may or may not work
  • Providing theoretical advice that requires human implementation
  • Acting as a sophisticated search engine

AI can now:

  • Execute analyses directly
  • Iterate based on real results
  • Learn from failures within a session
  • Collaborate as a genuine partner
Important: Final Result

Test RMSE: 0.357 | Test R²: 0.838 | Benchmark Improvement: 38.8%

Achieved through genuine human-AI collaboration using MCP servers


12 Appendix A: Hyperparameter Search History

12.1 Models Tested (Selected Iterations)

Hyperparameter Search History

| Iteration | Boosting | LR | Depth | Leaves | L1 | L2 | Val RMSE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | dart | 0.015 | 15 | 512 | — | — | 0.725 |
| 5 | gbdt | 0.10 | 12 | 256 | 0.1 | 0.1 | 0.559 |
| 15 | gbdt | 0.15 | 18 | 400 | 0.15 | 0.25 | 0.412 |
| 25 | gbdt | 0.18 | 20 | 450 | 0.17 | 0.30 | 0.339 |
| 35 | gbdt | 0.20 | 21 | 480 | 0.17 | 0.35 | 0.291 |
| Final | gbdt | 0.22 | 21 | 500 | 0.17 | 0.35 | 0.277 |

13 Appendix B: Reproducibility

13.1 Data Sources

To reproduce this analysis, download the data from:

13.2 Data Files Required

After feature engineering:

  • tornado_train_domain.csv (54,349 rows)
  • tornado_test_domain.csv (13,588 rows)

13.3 Model Training

  • Use LightGBM regression with parameters from “Winning Model Configuration”
  • Target column: mag
  • All 20 features as described

13.4 Environment

  • LightGBM MCP Server tools
  • VS Code with GitHub Copilot
  • Claude Opus 4.5

14 Appendix C: References & Further Reading

14.1 Julia Silge’s Work

14.2 Technical Documentation

14.3 Tornado Data Sources