# Time Series Forecasting
**Goal:** To demystify time series forecasting, moving beyond simple trend lines to build robust, insightful models. We will cover key techniques for feature engineering, model adaptation, and evaluation specific to temporal data. Where applicable, we’ll note how libraries like `pandas` and `skrub` can streamline these processes.
## The tutorial
### 1. Crafting Time-Structured Features
Raw time series data (e.g., daily sales) often requires significant feature engineering to reveal patterns, seasonality, and dependencies that machine learning models can leverage.
- **Temporal Resampling:**
  - Concept: Adjusting the time frequency of your data.
  - Why? Data might be too granular (e.g., per-second readings when hourly forecasts are needed) or not granular enough. It’s also used to align series with different native frequencies.
  - Techniques (typically using `pandas`; see the sketch after this list):
    - Downsampling: aggregating data to a lower frequency (e.g., daily to weekly using `df.resample('W').sum()`). Common aggregations include mean, sum, median, min, max.
    - Upsampling: increasing data frequency (e.g., daily to hourly using `df.resample('H').interpolate()`). This usually involves an interpolation strategy such as forward/backward fill or linear interpolation.
  - Pipeline Integration: These `pandas` operations can be encapsulated within custom transformers or `skrub.deferred` functions to become part of a reproducible machine learning pipeline (e.g., a `skrub` pipeline).
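For illustration, a minimal `pandas` sketch of both directions, using a hypothetical daily `sales` series (all names are illustrative):

```python
import pandas as pd

# Hypothetical daily sales series used throughout these sketches.
rng = pd.date_range("2024-01-01", periods=90, freq="D")
daily = pd.DataFrame({"sales": range(90)}, index=rng)

# Downsampling: daily -> weekly totals.
weekly = daily.resample("W").sum()

# Upsampling: daily -> hourly, filling the new rows by linear interpolation.
hourly = daily.resample("h").interpolate(method="linear")
```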
- **Temporal Merges & Joins (Enriching Time Series with External Data):**
  - Concept: Combining your primary time series with other relevant data sources, which might include other time series or static features. Timestamps and join keys may not align perfectly.
  - Why? External factors often influence time series. For instance, sales data could be enriched with weather information, marketing campaign schedules, or economic indicators.
  - Techniques & Tools (a plain-`pandas` sketch follows this list):
    - Exact Joins (e.g., `pandas.merge`): for when timestamps or other keys align perfectly.
    - Approximate (“Fuzzy”) Joins (e.g., `skrub.fuzzy_join` or `skrub.Joiner`): useful if time-based keys (like week numbers) or descriptive keys (like location names) have minor discrepancies. Example: joining daily sales to a weekly marketing-spend table by matching to the closest week start.
    - Interpolation-Based Joins (e.g., `skrub.InterpolationJoiner`): very powerful for spatio-temporal data. If you have sparse auxiliary data (e.g., weather readings from stations not at your exact location/time), this can infer the value of an auxiliary variable at the precise time and location of your main series’ observations. Example: predicting temperature at an airport for a flight’s departure time, using data from nearby weather stations that report at different times.
    - Aggregate-then-Join (e.g., `skrub.AggJoiner`): if you have high-frequency auxiliary data (e.g., hourly website clicks) and a lower-frequency main series (e.g., daily sales), you can aggregate the auxiliary data (e.g., total daily clicks) before joining it to your main series.
  - Pipeline Integration: Join operations, especially complex ones like those offered by `skrub`’s joiners, benefit from being part of a stateful pipeline to ensure correct application during training and inference.
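As a simple, hedged sketch of an approximate time join in plain `pandas` (without `skrub`), `pandas.merge_asof` matches each row to the nearest earlier key; the table names here are hypothetical:

```python
import pandas as pd

# Hypothetical tables: daily sales and weekly marketing spend.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=28, freq="D"),
    "sales": range(28),
})
marketing = pd.DataFrame({
    "week_start": pd.date_range("2024-01-01", periods=4, freq="7D"),
    "spend": [100, 150, 120, 200],
})

# Approximate join: each day gets the spend of the most recent week start.
enriched = pd.merge_asof(
    sales, marketing,
    left_on="date", right_on="week_start",
    direction="backward",  # match the closest key at or before `date`
)
```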
- **Calendar Features:**
  - Concept: Extracting structured features directly from the date/time index or columns.
  - Why? Time series often exhibit strong patterns based on calendar cycles (e.g., day of week, month, holidays).
  - Features: year, month, day of week, day of year, week of year, hour, season, is_weekend, is_holiday, etc.
  - Tools (see the sketch after this list):
    - `pandas` datetime accessors (e.g., `df.index.month`, `df.index.dayofweek`).
    - `skrub.DatetimeEncoder`: this transformer automatically breaks down datetime columns into a comprehensive set of numerical calendar features. It can also generate periodic encodings (e.g., sine/cosine transformations for cyclical features like month or hour), which are particularly useful for linear models and help tree-based models capture cyclicity effectively. It’s often used within `skrub.TableVectorizer`.
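A minimal sketch using plain `pandas` accessors; the sine/cosine encoding of the hour is one common way to express cyclicity, shown here by hand rather than via `skrub.DatetimeEncoder`:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
df = pd.DataFrame(index=idx)

df["month"] = df.index.month
df["dayofweek"] = df.index.dayofweek
df["hour"] = df.index.hour
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)

# Periodic encoding of the hour so 23:00 and 00:00 end up close together.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```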
- **Lag Features:**
  - Concept: Using past values of the target variable (autoregressive features) or of other exogenous variables as features for predicting the current or future value.
  - Why? The state of a time series at a previous point often influences its current or future state (autocorrelation).
  - Techniques (typically using `pandas.DataFrame.shift()`; see the sketch after this list):
    - `y(t-1)`: value of the target at the previous time step.
    - `y(t-k)`: value of the target `k` steps ago (e.g., `y(t-7)` for daily data with weekly seasonality).
    - Lags of exogenous variables (e.g., marketing spend from the previous month).
  - Pipeline Integration: Once created, these lagged columns become regular numerical features that can be processed by tools like `skrub.TableVectorizer`.
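A short sketch, again on a hypothetical daily sales frame:

```python
import pandas as pd

daily = pd.DataFrame(
    {"sales": range(90)},
    index=pd.date_range("2024-01-01", periods=90, freq="D"),
)

# Lag features via shift(); the first rows are NaN and are usually dropped.
daily["sales_lag_1"] = daily["sales"].shift(1)  # y(t-1)
daily["sales_lag_7"] = daily["sales"].shift(7)  # y(t-7): weekly seasonality
daily = daily.dropna()
```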
- **Aggregate Features over Rolling Windows:**
  - Concept: Calculating summary statistics (e.g., mean, median, standard deviation, min, max, sum) over a moving (“rolling”) window of past data points.
  - Why? Helps capture recent trends, local volatility, or other dynamic patterns in the series.
  - Techniques (typically using `pandas.DataFrame.rolling().agg()`; see the sketch after this list):
    - `rolling_mean_7_days`: average value over the past 7 days.
    - `rolling_std_30_days`: standard deviation over the past 30 days (a measure of recent volatility).
  - Pipeline Integration: Similar to lag features, these rolling aggregates become standard input columns for downstream processing.
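A sketch; note the `shift(1)` so that each window uses only strictly past values, avoiding target leakage:

```python
import pandas as pd

daily = pd.DataFrame(
    {"sales": range(90)},
    index=pd.date_range("2024-01-01", periods=90, freq="D"),
)

# shift(1) excludes the current day from its own window (no leakage).
past = daily["sales"].shift(1)
daily["rolling_mean_7_days"] = past.rolling(window=7).mean()
daily["rolling_std_30_days"] = past.rolling(window=30).std()
```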
### 2. Adapting Tabular Models for Forecasting
Many state-of-the-art machine learning models (like Gradient Boosted Trees, Random Forests, or linear models) are designed for tabular data. With proper feature engineering, they can be powerful forecasters.
- **The Core Idea: Transform to Supervised Learning** (see the sketch after this list):
  - The goal is to predict a target variable `y(t)` at time `t`.
  - The features `X(t)` consist of all the time-structured features engineered in Part 1 (lags, calendar features, rolling aggregates, relevant external data, etc.).
  - The tabular model learns the mapping `X(t) -> y(t)`.
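Putting the pieces together, a compact end-to-end sketch with a gradient-boosted tree from scikit-learn (the data and feature set are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# Hypothetical daily series with weekly seasonality and a mild trend.
idx = pd.date_range("2022-01-01", periods=730, freq="D")
y_vals = np.sin(np.arange(730) * 2 * np.pi / 7) + 0.01 * np.arange(730)
df = pd.DataFrame({"y": y_vals}, index=idx)

# X(t): lags, a rolling statistic, and calendar features.
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)
df["rolling_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month
df = df.dropna()

X, y = df.drop(columns="y"), df["y"]

# The tabular model learns the mapping X(t) -> y(t).
model = HistGradientBoostingRegressor().fit(X, y)
```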
- **Automated Preprocessing for Tabular Models:**
  - Once you’ve created your comprehensive feature set (including temporal features), tools like `skrub.TableVectorizer` can automate the final preprocessing steps.
  - It detects column types (numeric, categorical, including those derived from calendar features like `day_of_week_name`) and applies appropriate transformations (e.g., scaling for numerics, one-hot encoding for categoricals). Its internal `DatetimeEncoder` handles any raw date columns.
  - The output is a numerical matrix ready for any scikit-learn compatible estimator.
  - For quick model building, `skrub.tabular_learner` can create a full pipeline (e.g., `TableVectorizer` + `HistGradientBoostingRegressor`) from your feature-engineered DataFrame; a sketch follows this list.
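A hedged sketch of that shortcut, reusing `X` and `y` from the previous sketch (the exact name and behavior of this helper depend on your `skrub` version):

```python
from skrub import tabular_learner

# Builds a TableVectorizer + HistGradientBoostingRegressor pipeline.
model = tabular_learner("regressor")
model.fit(X, y)  # X may still contain raw dates and categorical columns
predictions = model.predict(X)
```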
- **Managing Complex Data Flows:**
  - For advanced scenarios involving multiple temporal data sources requiring intricate preprocessing and joining before the final feature set is assembled, pipeline frameworks like `skrub` expressions (`skrub.var`, `.skb.apply()`, `skrub.choose_from()`) allow defining complex, stateful, and tunable data-processing graphs.
### 3. Pitfalls of Recursive Forecasting & Alternatives
A common approach is to train a model to predict only one step ahead (`t+1`) and then use its own predictions recursively to forecast further.

- The “t+1” Regressor: You train a model `ŷ(t+1) = model(features_at_t)`.
- Recursive Forecasting: To predict `ŷ(t+2)`, you use `ŷ(t+1)` as if it were an actual observation to construct features for time `t+1`. This process is repeated for `t+3`, `t+4`, …, up to the desired forecast horizon `H`.
- The Pitfall: Error Accumulation. Any error in `ŷ(t+1)` is fed back into the model as input for predicting `ŷ(t+2)`. These errors can compound, leading to increasingly unreliable forecasts further into the horizon.
- Alternatives to Recursive Forecasting (see the sketch after this list):
  - Direct Forecasting (or Multi-Output Forecasting):
    - Train a separate model for each step `h` in the forecast horizon: `model_h` predicts `ŷ(t+h)` directly from `features_at_t`.
    - Alternatively, use a multi-output regressor that simultaneously predicts `[ŷ(t+1), ŷ(t+2), ..., ŷ(t+H)]` from `features_at_t`.
  - DIRMO (Direct Multi-Horizon Forecasting): Train a single model to predict `y(t+h)` where `h` (the forecast step, e.g., 1 to H) is itself a feature: `ŷ(t+h) = model(features_at_t, h)`.
  - Note: All these strategies still rely on a well-prepared feature matrix `features_at_t`.
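A minimal direct-forecasting sketch: one model per horizon, each trained on the same features but a target shifted further into the future (the data and horizon are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

H = 7  # forecast horizon in steps

idx = pd.date_range("2022-01-01", periods=365, freq="D")
df = pd.DataFrame(
    {"y": np.random.default_rng(0).normal(size=365).cumsum()}, index=idx
)
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)

# Direct forecasting: the target for model_h is y shifted h steps ahead.
models = {}
for h in range(1, H + 1):
    train = df.assign(target=df["y"].shift(-h)).dropna()
    models[h] = HistGradientBoostingRegressor().fit(
        train[["lag_1", "lag_7"]], train["target"]
    )

# Forecast all H steps from the features of the latest available row.
latest = df[["lag_1", "lag_7"]].iloc[[-1]]
forecast = [models[h].predict(latest)[0] for h in range(1, H + 1)]
```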
### 4. Sound Prediction Intervals (Beyond Point Estimates)
A single point forecast (e.g., “sales will be $1000”) is often insufficient. Quantifying the uncertainty around this forecast is crucial.
- **What are Prediction Intervals (PIs)?**
  - A range `[lower_bound, upper_bound]` within which the true future value is expected to lie with a certain probability (e.g., 95%).
  - Example: “We are 95% confident that sales will be between $800 and $1200.”
- **Methods to Generate PIs** (using models trained on appropriately prepared data; see the sketch after this list):
  - Quantile Regression: Train models to predict specific quantiles of the future distribution (e.g., the 2.5th and 97.5th percentiles for a 95% PI). Gradient-boosted trees, for instance, can be configured with a quantile loss function.
  - Bootstrapping Residuals: After training a point-forecasting model, collect its residuals (actual minus predicted) on a hold-out set. For a new forecast, repeatedly sample from these residuals, add them to the point forecast to create an empirical distribution, and then take percentiles of this distribution.
  - Conformal Prediction: A model-agnostic framework that can provide PIs with statistical coverage guarantees under certain assumptions.
  - Model-Specific Uncertainty: Some models (e.g., Bayesian models, Gaussian processes) inherently output uncertainty estimates that can be converted into PIs.
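A quantile-regression sketch for a 95% PI with scikit-learn’s gradient-boosted trees, reusing the illustrative `X`, `y` from the earlier supervised-learning sketch (the 0.025/0.975 quantiles are one common choice):

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# One model per quantile: lower bound, median, upper bound.
quantile_models = {
    q: HistGradientBoostingRegressor(loss="quantile", quantile=q).fit(X, y)
    for q in (0.025, 0.5, 0.975)
}

lower = quantile_models[0.025].predict(X)
median = quantile_models[0.5].predict(X)
upper = quantile_models[0.975].predict(X)
# [lower, upper] is an approximate 95% prediction interval per row.
```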
### 5. Methodologically Sound Model Selection & Evaluation
Evaluating forecasting models requires care to avoid data leakage from the future into the past.
- **Time-Structured Cross-Validation** (see the sketch after this list):
  - The Problem with Standard CV: Randomly splitting time series data allows the model to “see the future” during training (e.g., training on data from `t+k` to predict `t`), leading to overly optimistic performance estimates.
  - Solution (Respecting Temporal Order): Use cross-validation schemes that preserve the temporal sequence. `sklearn.model_selection.TimeSeriesSplit` is a standard implementation. It creates folds where the training set always precedes the test set. Variations include expanding windows (the training set grows) and rolling windows (the training set slides).
  - Pipeline Integration: When using pipeline tools (e.g., `skrub` expressions for hyperparameter search with `.skb.get_grid_search()`), ensure you pass a time-aware splitter like `TimeSeriesSplit` as the `cv` argument.
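A sketch of time-aware evaluation with scikit-learn’s expanding-window splitter (again reusing the illustrative `X`, `y`, which are already in temporal order):

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each fold trains strictly on the past and tests on the following block.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    HistGradientBoostingRegressor(), X, y,
    cv=cv, scoring="neg_mean_absolute_error",
)
```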
- **Informative Evaluation Metrics** (see the sketch after this list):
  - For Point Estimates:
    - MAE (Mean Absolute Error), RMSE (Root Mean Squared Error).
    - MAPE (Mean Absolute Percentage Error): use with caution (issues with zero actuals, asymmetry).
    - MASE (Mean Absolute Scaled Error): compares the forecast to a naive (e.g., seasonal random walk) baseline.
  - For Prediction Intervals:
    - Prediction Interval Coverage Probability (PICP): the percentage of actual values falling within the PIs. For 95% PIs, PICP should be close to 95%.
    - Mean Prediction Interval Width (MPIW): the average width of the PIs. Narrower is better, given good coverage.
    - Winkler Score (or Interval Score): a strictly proper scoring rule that simultaneously penalizes poor coverage and overly wide intervals.
  - For Probabilistic (Distributional) Predictions:
    - Continuous Ranked Probability Score (CRPS): generalizes MAE to probabilistic forecasts. A strictly proper scoring rule; lower is better.
    - Log-Likelihood / Log Score: for models outputting full predictive distributions.
  - “Strictly proper scoring rules” are desirable because they are uniquely optimized when the forecast matches the true underlying probability distribution, encouraging honest and well-calibrated probabilistic forecasts.
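PICP and MPIW are simple enough to compute by hand; a NumPy sketch, using the `lower`/`upper` bounds from the quantile-regression sketch above:

```python
import numpy as np

y_true = np.asarray(y)

# PICP: fraction of actuals inside the interval (target: ~0.95 for 95% PIs).
picp = np.mean((y_true >= lower) & (y_true <= upper))

# MPIW: average interval width (narrower is better, given good coverage).
mpiw = np.mean(upper - lower)
```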
### 6. Diagnosing & Mitigating Bias (Especially for Extremes)
Models often struggle with extreme values, potentially systematically under- or over-predicting them.
- **Reliability Diagrams (for Probabilistic Forecasts/Quantiles)** (see the sketch after this list):
  - Concept: Plot the observed frequency of an event (or of actual values falling below a given quantile) against the forecast probability (or predicted quantile).
  - How it works (example for quantiles): If you predict the 10th percentile, ideally 10% of actual future values should fall below this predicted 10th percentile. If, for instance, only 5% of actuals fall below your predicted 10th percentile, your model is systematically overestimating this quantile (i.e., not predicting values low enough).
  - Interpretation: A perfectly calibrated model has its reliability curve along the diagonal. Deviations indicate bias.
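A sketch of the quantile-calibration check behind such a diagram, assuming quantile-loss models as above (in practice, evaluate on a hold-out set rather than the training data):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

nominal = [0.1, 0.25, 0.5, 0.75, 0.9]
observed = []
for q in nominal:
    m = HistGradientBoostingRegressor(loss="quantile", quantile=q).fit(X, y)
    # Fraction of actuals below the predicted q-th quantile; ideally ~= q.
    observed.append(np.mean(np.asarray(y) <= m.predict(X)))

# Plotting `observed` against `nominal` gives the reliability curve;
# a calibrated model lies along the diagonal.
```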
- **Strategies to Mitigate Bias:**
  - Recalibration: Post-process model outputs (e.g., using isotonic regression for quantiles) based on observed biases on a calibration set (see the sketch after this list).
  - Model Adjustments: Consider different loss functions (e.g., penalizing errors on extremes more heavily), engineer features specifically designed to capture drivers of extremes, or transform the target variable (e.g., a log transform for skewed data).
  - Separate Models for Extremes: Sometimes, modeling “normal” behavior and “extreme” behavior with distinct models can be effective.
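A hedged recalibration sketch with isotonic regression (the temporal train/calibration split and all names are illustrative, reusing `X`, `y` from earlier):

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

# Illustrative temporal split: earlier rows train, later rows calibrate.
n = len(X)
X_train, y_train = X.iloc[: n // 2], y.iloc[: n // 2]
X_calib, y_calib = X.iloc[n // 2 :], y.iloc[n // 2 :]

point_model = HistGradientBoostingRegressor().fit(X_train, y_train)

# Learn a monotone correction from raw predictions to actuals on the
# calibration set, then apply it to debias subsequent forecasts.
recalibrator = IsotonicRegression(out_of_bounds="clip")
recalibrator.fit(point_model.predict(X_calib), y_calib)
recalibrated = recalibrator.transform(point_model.predict(X_calib))
```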
## Practical Considerations (Real-World vs. Synthetic Data)
- Real-World Data: Often messy (missing values, outliers, structural breaks like policy changes or pandemics, evolving seasonality).
  - Initial cleaning (e.g., with `skrub.Cleaner` for obvious nulls and date parsing) is often necessary.
  - If categorical features exist alongside the time series (e.g., store IDs, product types) and have inconsistencies, tools like `skrub.deduplicate` can be helpful.
- Synthetic Data: Useful for isolating specific problems and understanding model behavior under controlled conditions (e.g., to demonstrate catastrophic pitfalls of recursive forecasting or issues with specific CV methods).
## Key Takeaways
- Feature Engineering is King: Lags and rolling statistics (often via `pandas`), calendar features (e.g., via `skrub.DatetimeEncoder`), and joined external data (e.g., via `skrub` joiners or `pandas.merge`) are crucial.
- Tabular Models are Versatile Forecasters: With appropriate time-aware features, standard ML models can be very effective. Tools like `skrub.TableVectorizer` can then prepare this data for `scikit-learn`.
- Beware Recursive Forecasting Pitfalls: Understand its limitations and consider direct forecasting strategies.
- Quantify Uncertainty: Prediction intervals provide a more complete picture than point forecasts.
- Evaluate Rigorously: Use time-structured cross-validation (e.g., `TimeSeriesSplit`) and appropriate, ideally proper, scoring rules.
- Diagnose Your Model: Actively look for systematic biases, especially for extreme values.
## Your “Homework”
- Explore `pandas` for time series resampling, generating lag features (`.shift()`), and calculating rolling-window statistics (`.rolling()`).
- Take a time series dataset:
  - Identify potential calendar features and try extracting them using `pandas` datetime accessors or `skrub.DatetimeEncoder`.
  - If you have auxiliary temporal data, consider how it could be merged or joined.
  - Construct a feature set and use `skrub.TableVectorizer` to prepare it for a `scikit-learn` model.
- Familiarize yourself with `sklearn.model_selection.TimeSeriesSplit` and think about how you would use it to evaluate a forecasting model.