# Tutorial on time series forecasting
See also: [[2025 - H3NI - Demos Assessment]]
Here’s a tutorial on using time-series forecasting models in Python to predict resource needs (like request rates or CPU utilization) and drive proactive scaling. We’ll focus on `statsmodels` for classical models (like ARIMA) and `scikit-learn` for simpler regression-based approaches, which are often good starting points.
## References

- Time Series Forecasting
- https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
- https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
- https://medium.com/@mouse3mic3/a-practical-guide-on-scikit-learn-for-time-series-forecasting-bbd15b611a5d
## Tutorial: Time-Series Forecasting for Predictive Scaling
This tutorial guides you through applying time-series forecasting models to predict web server traffic. These predictions can inform a predictive scaling system, allowing it to proactively adjust resources like server replicas. We’ll start with raw Apache access logs, process them into a time series, and then compare two fundamental forecasting approaches: a classical statistical model (SARIMA) and a machine learning regression model.
**Problem at Hand:**
We want to predict future web server request counts based on historical data. This allows us to scale resources up before a load spike hits, or scale down efficiently when load is predicted to decrease, improving both performance and cost-efficiency.
**Libraries We’ll Use:**

- `pandas`: For data manipulation and time-series handling.
- `numpy`: For numerical operations.
- `statsmodels`: For classical statistical time-series models like ARIMA/SARIMA.
- `scikit-learn`: For regression models and evaluation metrics.
- `matplotlib`: For plotting and visualization.
- `pmdarima` (optional but recommended): For automatically finding optimal ARIMA parameters.
**Installation:**

```bash
pip install pandas numpy statsmodels scikit-learn matplotlib pmdarima
```
### Step 1: Data Acquisition and Preparation from Logs
In a real system, you would fetch metrics from a monitoring system like Prometheus. Here, we’ll simulate the process by starting with raw Apache log data, just as in our experiments.
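For reference, here is a minimal sketch of what such a fetch could look like against the Prometheus HTTP API (`/api/v1/query_range`); the server URL and PromQL query in the commented usage are placeholders, not values from our setup:

```python
import requests
import pandas as pd

def fetch_prometheus_range(prom_url, query, start, end, step='1h'):
    """Query a Prometheus range endpoint and return a pandas Series."""
    resp = requests.get(
        f"{prom_url}/api/v1/query_range",
        params={'query': query, 'start': start, 'end': end, 'step': step},
        timeout=30,
    )
    resp.raise_for_status()
    values = resp.json()['data']['result'][0]['values']  # [[unix_ts, value], ...]
    idx = pd.to_datetime([int(float(t)) for t, _ in values], unit='s')
    return pd.Series([float(v) for _, v in values], index=idx, name='requests')

# Hypothetical usage (requires a reachable Prometheus server):
# ts_data = fetch_prometheus_range('http://localhost:9090',
#                                  'sum(rate(http_requests_total[5m]))',
#                                  start=1696118400, end=1696377600)
```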
First, let’s create a sample gzipped log file to work with.
```python
import gzip
import re
from datetime import datetime, timezone

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def create_sample_log_file(filename="sample_access.log.gz"):
    """Creates a gzipped text file with sample Apache logs."""
    log_pattern = '{ip} - - [{ts}] "GET {path} HTTP/1.1" 200 {bytes}\n'
    # Use a timezone-aware start time so strftime('%z') emits an offset;
    # with a naive datetime, %z is empty and the parser below would not match.
    start_time = datetime(2023, 10, 1, 0, 0, 0, tzinfo=timezone.utc)
    with gzip.open(filename, 'wt') as f:
        # 14 days of logs, one minute per iteration, so the weekend branch
        # below is exercised and seasonal models have enough history.
        for i in range(1440 * 14):
            # Create a daily and weekly cycle
            hour_of_day = (i % 1440) // 60
            day_of_week = (i // 1440) % 7
            # Simulate more traffic during business hours on weekdays
            if 9 <= hour_of_day <= 17 and day_of_week < 5:
                num_requests = np.random.randint(10, 20)
            else:
                num_requests = np.random.randint(1, 5)
            for _ in range(num_requests):
                ts = start_time + pd.Timedelta(minutes=i)
                log_entry = log_pattern.format(
                    ip="127.0.0.1",
                    ts=ts.strftime('%d/%b/%Y:%H:%M:%S %z'),
                    path="/",
                    bytes=100,
                )
                f.write(log_entry)
    print(f"Created sample log file: {filename}")

# Create the log file
create_sample_log_file()
```
Now, let’s parse these logs and aggregate them into an hourly time series.
```python
def parse_and_aggregate_logs(log_file, freq='H'):
    """Parses a gzipped log file and aggregates requests by a given frequency."""
    log_re = re.compile(r'.*\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+\-]\d{4})].*')
    timestamps = []
    with gzip.open(log_file, 'rt') as f:
        for line in f:
            match = log_re.match(line)
            if match:
                timestamps.append(pd.to_datetime(match.group(1), format='%d/%b/%Y:%H:%M:%S %z'))
    if not timestamps:
        return pd.Series([], dtype=float)
    # One row per request, then count rows per time bucket
    ts_series = pd.Series(1, index=pd.to_datetime(timestamps))
    ts_agg = ts_series.resample(freq).count()
    ts_agg.name = 'requests'
    return ts_agg

# Parse and aggregate the data
# (note: on pandas >= 2.2 use freq='h'; the uppercase alias is deprecated there)
ts_data = parse_and_aggregate_logs("sample_access.log.gz", freq='H')

# Plot the aggregated data
plt.figure(figsize=(15, 6))
ts_data.plot(title="Aggregated Hourly Web Server Requests")
plt.ylabel("Number of Requests")
plt.grid(True)
plt.show()

print("Sample of the aggregated time series data:")
print(ts_data.head())
```
### Step 2: Understanding Your Time Series
Before modeling, it’s crucial to analyze the time series for:

- **Trend:** A long-term increase or decrease.
- **Seasonality:** Repeating patterns (e.g., daily, weekly). Our hourly data should show a strong 24-hour pattern.
- **Autocorrelation:** How a value correlates with its past values.
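A quick way to see these components separately is a classical decomposition. A minimal sketch using statsmodels’ `seasonal_decompose`, reusing `ts_data` and the imports from above:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into trend, seasonal, and residual components;
# period=24 matches the daily cycle in hourly data.
decomposition = seasonal_decompose(ts_data, model='additive', period=24)
decomposition.plot()
plt.show()
```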
The Augmented Dickey-Fuller (ADF) test helps check for stationarity (constant mean and variance), a key assumption for ARIMA models. ACF and PACF plots help visualize autocorrelation and guide parameter selection for ARIMA.
```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

# Check for stationarity
adf_result = adfuller(ts_data)
print(f'ADF Statistic: {adf_result[0]}')
print(f'p-value: {adf_result[1]}')
# A low p-value (e.g., < 0.05) suggests the data is stationary.

# Plot ACF and PACF to observe patterns
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
plot_acf(ts_data, ax=axes[0], lags=48)   # ACF shows correlation with past values
plot_pacf(ts_data, ax=axes[1], lags=48)  # PACF shows direct correlation, removing indirect effects
plt.show()
```
For hourly data, you’ll likely see significant spikes in the ACF plot at lags 24 and 48, confirming a strong daily seasonality.
### Step 3: Train/Test Split
We’ll use the first 80% of the data to train our models and the final 20% to evaluate their performance.
```python
train_size = int(len(ts_data) * 0.8)
train_data, test_data = ts_data[:train_size], ts_data[train_size:]
print(f"Training data points: {len(train_data)}")
print(f"Test data points: {len(test_data)}")
```
### Step 4: Forecasting Models
We will build and compare three different models, reflecting the core of our experiments.
#### Model A: Simple Baseline (Naive Forecast)
The simplest model assumes the next value will be the same as the last known value. It’s a crucial baseline to beat.
```python
naive_predictions = pd.Series(train_data.iloc[-1], index=test_data.index)
```
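Since the data has a strong daily cycle, a seasonal naive baseline (repeat the last observed day) is usually much harder to beat. A minimal sketch, reusing `train_data` and `test_data` from the split above; it is not part of the comparison in Step 5:

```python
# Seasonal naive: tile the last 24 training hours across the test horizon
season = 24
last_cycle = train_data.iloc[-season:].to_numpy()
seasonal_naive_predictions = pd.Series(
    [last_cycle[i % season] for i in range(len(test_data))],
    index=test_data.index,
)
```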
#### Model B: Regression with Lagged Features (Scikit-learn)
This approach frames forecasting as a regression problem: predict the next value based on previous values (lags). Our experiments showed this is surprisingly effective for stable, high-volume traffic.
```python
from sklearn.linear_model import LinearRegression

def create_lagged_features(series, max_lag=24):
    """Creates a DataFrame with lagged features."""
    df = pd.DataFrame(series)
    for lag in range(1, max_lag + 1):
        df[f'lag_{lag}'] = series.shift(lag)
    df.dropna(inplace=True)
    return df

# Create lagged features from the training data
lagged_train_df = create_lagged_features(train_data, max_lag=24)  # Use 24 hours of history
X_train = lagged_train_df.drop(columns=['requests'])
y_train = lagged_train_df['requests']

# Train the model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make one-step-ahead predictions iteratively
history = list(train_data)
lr_predictions = []
for i in range(len(test_data)):
    # lag_1 must be the most recent value and lag_24 the oldest, so the
    # last 24 observations are reversed to match the training column order.
    input_lags = pd.DataFrame([history[-24:][::-1]], columns=X_train.columns)
    yhat = lr_model.predict(input_lags)[0]
    lr_predictions.append(yhat)
    # Add the actual observed value to history for the next prediction
    history.append(test_data.iloc[i])

lr_predictions_series = pd.Series(lr_predictions, index=test_data.index)
```
#### Model C: SARIMA (Statsmodels)
SARIMA (Seasonal AutoRegressive Integrated Moving Average) is a powerful statistical model designed for time series with seasonality.
**Option 1: Manual Parameter Selection (Robust and Fast)**
Based on the ACF/PACF plots and experience, you can set the model’s order manually. This was our primary approach in the experiments for consistency.
- `(p, d, q)`: The non-seasonal order (AR, differencing, MA).
- `(P, D, Q, m)`: The seasonal order, where `m` is the length of the season (e.g., `m=24` for hourly data with a daily pattern).
```python
from statsmodels.tsa.arima.model import ARIMA

# Define model order based on analysis or experimentation,
# e.g., a simple SARIMA for hourly data
sarima_order = (1, 1, 1)        # (p, d, q)
seasonal_order = (1, 1, 0, 24)  # (P, D, Q, m)

# Train the SARIMA model
sarima_model = ARIMA(train_data, order=sarima_order, seasonal_order=seasonal_order).fit()

# Make predictions for the duration of the test set
sarima_predictions = sarima_model.forecast(steps=len(test_data))
```
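For scaling decisions, the upper edge of a prediction interval is often more useful than the point forecast, since provisioning against it adds headroom against under-scaling. A minimal sketch using statsmodels’ `get_forecast`, reusing `sarima_model` from above:

```python
# Point forecast plus a 95% prediction interval
forecast_res = sarima_model.get_forecast(steps=len(test_data))
conf_int = forecast_res.conf_int(alpha=0.05)  # columns: lower / upper bound
upper_bound = conf_int.iloc[:, 1]  # conservative signal for scale-up decisions
```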
**Option 2: Automated Parameter Tuning (`pmdarima`)**

The `pmdarima` library provides `auto_arima`, which automates the search for the best SARIMA parameters. It’s a great tool but can have dependency issues, as noted in our experiments.
```python
import pmdarima as pm

# Let auto_arima find the best parameters
auto_sarima_model = pm.auto_arima(train_data,
                                  start_p=1, start_q=1,
                                  max_p=3, max_q=3,
                                  m=24,           # seasonality period
                                  seasonal=True,  # enable seasonal search
                                  d=1, D=1,       # fix differencing orders (use None to let auto_arima choose)
                                  trace=True,
                                  error_action='ignore',
                                  suppress_warnings=True,
                                  stepwise=True)  # makes the search faster
print(auto_sarima_model.summary())

# Use the best model found to make predictions; wrap in a Series so the
# forecast aligns with the test index across pmdarima versions
auto_sarima_predictions = pd.Series(
    np.asarray(auto_sarima_model.predict(n_periods=len(test_data))),
    index=test_data.index,
)
```
### Step 5: Evaluate and Compare Models
Now, let’s evaluate all our models against the actual test data using standard metrics.
```python
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

def evaluate_model(true_values, predictions, model_name):
    rmse = np.sqrt(mean_squared_error(true_values, predictions))
    mape = mean_absolute_percentage_error(true_values, predictions) * 100
    accuracy = 100 - mape
    print(f"\n--- {model_name} Evaluation ---")
    print(f"  RMSE: {rmse:.2f} (Root Mean Squared Error)")
    print(f"  MAPE: {mape:.2f}% (Mean Absolute Percentage Error)")
    print(f"  Accuracy: {accuracy:.2f}%")
    return accuracy

# Evaluate all models
evaluate_model(test_data, naive_predictions, "Naive Forecast")
evaluate_model(test_data, lr_predictions_series, "Linear Regression")
evaluate_model(test_data, sarima_predictions, "Manual SARIMA")
evaluate_model(test_data, auto_sarima_predictions, "Auto-SARIMA")

# Plot all forecasts against the actual values
plt.figure(figsize=(15, 7))
plt.plot(test_data, label='Actual Traffic', color='black', linewidth=2)
plt.plot(lr_predictions_series, label='Linear Regression', linestyle='--')
plt.plot(sarima_predictions, label='Manual SARIMA', linestyle='--')
plt.plot(auto_sarima_predictions, label='Auto-SARIMA', linestyle='--')
plt.title('Model Forecast Comparison')
plt.legend()
plt.grid(True)
plt.show()
```
### Step 6: Conclusions and Integration Strategy
Our experiments showed that there is no single “best” model; the choice depends heavily on the traffic profile.
- **Model Selection:**
    - **For Stable, High-Volume Traffic:** A simple `LinearRegression` model often provides an excellent balance of high accuracy (>85%) and computational efficiency.
    - **For Volatile, Event-Driven Traffic:** Forecasting is harder, and more complex models are needed. Our experiments showed that advanced models like `MFLES` (from libraries like `statsforecast`) or a well-tuned `SARIMA` provide the best results, though achieving >70% accuracy can be challenging.
- **Modern Libraries (`statsforecast`):** Modern libraries like Nixtla’s `statsforecast` can test dozens of models in parallel and often find a high-performing model (like `MFLES` or `AutoTheta`) automatically. For a production system, this is a highly recommended approach, as it automates much of the selection and tuning process (see the first sketch below).
- **Integration into a Predictive Scaler:**
    - **Retraining:** The chosen model must be retrained periodically (e.g., every few hours or daily) on fresh data to adapt to changing patterns.
    - **Forecasting:** After retraining, use the model to forecast `N` steps into the future.
    - **Action Logic:** Translate the forecast (e.g., the average or maximum predicted value over the horizon) into a target number of replicas. Always combine this with safety bounds (min/max replicas) and cooldowns to prevent thrashing (see the second sketch below).
This tutorial provides a solid, practical foundation for building a predictive scaling system. By starting with these core models and understanding their trade-offs, you can make an informed decision tailored to the specific workload you need to scale.