Python Time series
Time series analysis is a critical method for understanding sequential data to predict future trends and make informed decisions. For system operators and DevOps engineers, it’s the foundation of predictive autoscaling—the practice of forecasting workload to adjust server resources proactively.
Python, with its robust libraries, has become a favorite tool for this task. In this blog post, we will explore the fundamental concepts and workflow for time series analysis by tackling the real-world problem of predicting web server traffic from logs.
Key Libraries for Time Series Analysis in Python
Python offers multiple libraries for efficient time series analysis. Our work has centered on these key players:
- Pandas: The essential library for data analysis and manipulation. It excels at handling time-indexed data, making it perfect for preparing our log data for analysis.
- NumPy: The foundational library for numerical operations in Python, providing the arrays and mathematical functions that other data science libraries are built upon.
- Matplotlib: The primary plotting library, crucial for visualizing our time series data and the performance of our forecasting models.
- Statsmodels: A powerful library for statistical modeling. It provides robust implementations of classical time series models like ARIMA and SARIMA, which are excellent for capturing trend and seasonality.
- scikit-learn: The most popular machine learning library in Python. While not specialized for time series, it provides simple and efficient regression models that can be adapted for forecasting, often with surprisingly good results.
- Nixtla/StatsForecast: A modern, high-performance library built for forecasting at scale. It offers highly optimized and automated versions of many classical models, making it a powerful tool for finding the best model quickly.
Other notable libraries include Darts for its unified API and pmdarima for auto-ARIMA, though the latter can have dependency issues.
Example: From Raw Logs to a Traffic Forecast
Exploring Time Series Data with Python
Forecasting starts with understanding your data. Our goal is to predict hourly web traffic, so we begin by parsing and aggregating raw Apache logs into a time series.
import gzip
import re
import pandas as pd
# Matches the bracketed Apache timestamp, e.g. [01/Jul/2025:10:05:12 +0000]
LOG_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+\-]\d{4})\]')
# A function to parse Apache logs and aggregate them by hour
# ('h' = hourly bins; the uppercase 'H' alias is deprecated in pandas 2.2+)
def parse_and_aggregate(log_file_path, freq='h'):
    raw_timestamps = []
    with gzip.open(log_file_path, 'rt', errors='ignore') as f:
        for line in f:
            match = LOG_RE.search(line)
            if match:
                raw_timestamps.append(match.group(1))
    # Parse all timestamps in one vectorized call; utc=True keeps the index
    # consistent even if the log mixes UTC offsets
    index = pd.to_datetime(raw_timestamps, format='%d/%b/%Y:%H:%M:%S %z', utc=True)
    ts = pd.Series(1, index=index, name='requests')
    return ts.resample(freq).count()
# In a real scenario:
# ts_data = parse_and_aggregate('access.log.gz')
# ts_data.plot(figsize=(12, 6), title="Hourly Web Server Requests")
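To make the parsing step concrete, here is a minimal sketch that applies the same idea to a few in-memory log lines instead of a gzip file. The sample lines, the `count_per_hour` helper name, and the choice to normalize timestamps to UTC are assumptions for illustration, not part of the pipeline above.

```python
import re

import pandas as pd

# Hypothetical sample lines in Apache log format
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jul/2025:10:05:12 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jul/2025:10:47:03 +0000] "GET /about HTTP/1.1" 200 734',
    '10.0.0.1 - - [01/Jul/2025:12:15:40 +0000] "POST /login HTTP/1.1" 302 0',
    'this line is malformed and will be skipped',
]

TIMESTAMP_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+\-]\d{4})\]')


def count_per_hour(lines):
    """Count requests per hour from an iterable of log lines."""
    raw = [m.group(1) for m in map(TIMESTAMP_RE.search, lines) if m]
    index = pd.to_datetime(raw, format='%d/%b/%Y:%H:%M:%S %z', utc=True)
    # Hours with no requests appear as 0, which keeps the series regular
    return pd.Series(1, index=index).resample('h').count()


hourly = count_per_hour(SAMPLE_LINES)
print(hourly)
```

The empty 11:00 bucket is kept as an explicit zero rather than dropped. That matters later, because decomposition and lag features both assume evenly spaced observations.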
Time Series Analysis with Python
There are multiple ways to approach forecasting. We will focus on two fundamental methods that represent different philosophies: classical statistics and machine learning.
Decomposition: First, we decompose the series into its core components: trend, seasonality, and noise. This helps us understand the patterns our models need to learn.
from statsmodels.tsa.seasonal import seasonal_decompose
# Assuming ts_data is our hourly series
# The period=24 is crucial for hourly data with a daily pattern
decomposed = seasonal_decompose(ts_data, model='additive', period=24)
decomposed.plot()
If the seasonal panel of this plot shows a cycle that repeats every 24 hours, the traffic has a daily rhythm and a seasonal model is justified.
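The decomposition can also be approximated by hand, which makes it clearer what each panel of the plot represents. The sketch below follows the classical moving-average procedure on a synthetic series; the data, the variable names, and the exact averaging steps are assumptions for illustration, and the statsmodels implementation may differ in detail.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (an assumption for illustration):
# a linear trend plus a 24-hour sine-wave cycle, over 14 days.
index = pd.date_range('2025-01-01', periods=24 * 14, freq='h')
t = np.arange(len(index))
true_seasonal = 20 * np.sin(2 * np.pi * t / 24)
series = pd.Series(100 + 0.1 * t + true_seasonal, index=index)

# Trend: centered moving average spanning one full day.
# For an even period, a 24-term mean followed by a 2-term mean,
# shifted back by 12 hours, gives a symmetric window around each point.
trend = series.rolling(24).mean().rolling(2).mean().shift(-12)

# Seasonality: average the detrended values for each hour of the day,
# then center them so they sum to zero over one cycle.
detrended = series - trend
hourly_means = detrended.groupby(detrended.index.hour).mean()
hourly_means -= hourly_means.mean()
seasonal = pd.Series(
    hourly_means.to_numpy()[series.index.hour.to_numpy()], index=series.index
)

# Residual: whatever is left after removing trend and seasonality.
residual = series - trend - seasonal

print(hourly_means.round(2))
```

The first and last 12 hours of `trend` are missing because the centered window does not fit there. The same gaps appear at the edges of the trend panel in a moving-average decomposition plot.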
Approach 1: Statistical Forecasting with SARIMA
SARIMA (Seasonal AutoRegressive Integrated Moving Average) is a powerful model designed specifically for time series with seasonality. It’s a go-to for many forecasting tasks.
from statsmodels.tsa.arima.model import ARIMA
# Assumed split: hold out the last 12 hours so the forecast can be checked later
train_data = ts_data.iloc[:-12]
# Fit a SARIMA model. The ARIMA class fits a seasonal model when `seasonal_order` is set.
# order=(p,d,q), seasonal_order=(P,D,Q,m) where m=24 for our hourly data.
model = ARIMA(train_data, order=(1, 1, 1), seasonal_order=(1, 1, 0, 24))
model_fit = model.fit()
# Forecast the next 12 hours
forecast = model_fit.forecast(steps=12)
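The `d` and `D` terms in `order` and `seasonal_order` are differencing steps that the model applies before fitting its autoregressive and moving-average parts. A rough sketch of what `d=1` and `D=1` with `m=24` do to a series, using a synthetic example (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (an assumption for illustration):
# linear trend plus a repeating 24-hour pattern.
index = pd.date_range('2025-01-01', periods=24 * 7, freq='h')
t = np.arange(len(index))
series = pd.Series(50 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 24), index=index)

# D=1 with m=24: subtract the value from the same hour one day earlier.
# This cancels the repeating daily pattern; the linear trend leaves a constant.
seasonal_diff = series.diff(24)

# d=1: subtract the previous hour's value.
# Applied after the seasonal difference, this removes the remaining constant.
stationary = seasonal_diff.diff(1).dropna()

print(seasonal_diff.dropna().round(6).unique())  # a single constant: 0.5 * 24
print(stationary.abs().max())                    # close to zero
```

On real traffic the differenced series will not be exactly flat, but it should look free of trend and daily cycle. If it does not, the chosen differencing orders are worth revisiting.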
Approach 2: Machine Learning with Regression
We can reframe forecasting as a regression problem: predict the next hour’s traffic using the traffic from the previous 24 hours as features. Our experiments showed this to be highly effective for stable, predictable traffic.
import pandas as pd
from sklearn.linear_model import LinearRegression
# Function to create lagged features
def create_lags(series, n_lags=24):
    # Name the target column explicitly so this also works for an unnamed Series
    df = series.to_frame(name='y')
    for lag in range(1, n_lags + 1):
        df[f'lag_{lag}'] = series.shift(lag)
    return df.dropna()
# Prepare data
lagged_df = create_lags(ts_data)
X = lagged_df.drop(columns=['y'])
y = lagged_df['y']
# Train a simple Linear Regression model
model = LinearRegression()
model.fit(X, y)
# To predict the next hour, you need the last 24 hours of data,
# ordered most recent first so they line up with lag_1 ... lag_24
# last_24_hours = pd.DataFrame([ts_data.tail(24).values[::-1]], columns=X.columns)
# prediction = model.predict(last_24_hours)
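A lag-based model predicts only one hour ahead. One common way to get a longer horizon is to feed each prediction back in as the newest lag. The sketch below shows that loop; `recursive_forecast`, `predict_one`, and the stand-in predictor are hypothetical names introduced for illustration.

```python
import numpy as np


def recursive_forecast(history, predict_one, n_lags=24, steps=12):
    """Forecast `steps` values by feeding each prediction back in as input.

    `predict_one` is a hypothetical callable: it receives the last `n_lags`
    values ordered most recent first (lag_1, lag_2, ...) and returns one number.
    """
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(steps):
        lags = np.array(window[::-1])       # most recent value first
        next_value = float(predict_one(lags))
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward
    return np.array(forecasts)


# Stand-in predictor for demonstration: repeat the value from 24 hours ago.
def same_hour_yesterday(lags):
    return lags[23]


history = np.tile(np.arange(24, dtype=float), 3)  # three identical "days"
print(recursive_forecast(history, same_hour_yesterday, steps=12))
```

With a fitted scikit-learn regressor, `predict_one` could be something like `lambda lags: model.predict(lags.reshape(1, -1))[0]`. Errors compound with each step, so long horizons deserve extra scrutiny.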
The key takeaway is that for workloads with strong, regular patterns, a simple regression model can match the accuracy of a more complex statistical model while being much faster to train.
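Any accuracy comparison of this kind needs a held-out period and a metric. A minimal sketch, assuming mean absolute error (MAE) as the metric and two naive baselines that any real model should beat; the synthetic data and all names here are assumptions for illustration:

```python
import numpy as np
import pandas as pd


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


# Synthetic hourly traffic (an assumption for illustration):
# a daily cycle plus seeded random noise.
rng = np.random.default_rng(0)
index = pd.date_range('2025-01-01', periods=24 * 10, freq='h')
t = np.arange(len(index))
values = 200 + 50 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, len(t))
series = pd.Series(values, index=index)

# Split by time, never at random: the last 24 hours are held out.
train, test = series.iloc[:-24], series.iloc[-24:]

# Baseline 1: seasonal naive, which repeats the last observed day.
seasonal_naive = train.iloc[-24:].to_numpy()

# Baseline 2: predict the training mean for every hour.
flat_mean = np.full(len(test), train.mean())

print('seasonal naive MAE:', round(mean_absolute_error(test, seasonal_naive), 2))
print('flat mean MAE:     ', round(mean_absolute_error(test, flat_mean), 2))
```

Scoring the SARIMA and regression forecasts on the same held-out hours with the same function puts their errors on a common footing.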
More Advanced Topics
Machine Learning for Time Series Analysis
While we focused on foundational models, deep learning offers powerful alternatives for very complex patterns, provided you have enough data.
Long Short-Term Memory (LSTM)
LSTMs are a type of Recurrent Neural Network (RNN) designed to remember long-term dependencies, making them suitable for time series data.
from keras.models import Sequential
from keras.layers import LSTM, Dense
# Data must be shaped into sequences [samples, timesteps, features]
# For example, using 24 past hours to predict the next hour.
n_steps = 24
n_features = 1
# Define LSTM model
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# Fit model (X and y need to be prepared as sequences)
# model.fit(X_train_sequences, y_train_sequences, epochs=100, verbose=0)
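The `[samples, timesteps, features]` shape mentioned in the comments can be built with a sliding window over the hourly counts. A minimal sketch, assuming a single feature and a one-step-ahead target; `make_sequences` and the placeholder data are hypothetical names introduced here:

```python
import numpy as np


def make_sequences(values, n_steps=24):
    """Turn a 1-D series into (X, y) pairs for a recurrent model.

    X has shape [samples, timesteps, features] with one feature;
    y holds the value that immediately follows each window.
    """
    values = np.asarray(values, dtype='float32')
    X = np.stack([values[i:i + n_steps] for i in range(len(values) - n_steps)])
    y = values[n_steps:]
    return X[..., np.newaxis], y


hourly_counts = np.arange(100)  # placeholder data standing in for real traffic
X_train_sequences, y_train_sequences = make_sequences(hourly_counts, n_steps=24)
print(X_train_sequences.shape, y_train_sequences.shape)  # (76, 24, 1) (76,)
```

The series must be longer than `n_steps` for any window to exist. In practice the values are usually scaled before training, a step this sketch leaves out.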
Gated Recurrent Units (GRU)
GRUs are similar to LSTMs but have a simpler architecture, which can lead to faster training with comparable performance. This makes them an attractive alternative.
from keras.models import Sequential
from keras.layers import GRU, Dense
# Define GRU model
model = Sequential()
model.add(GRU(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# Fit model
# model.fit(X_train_sequences, y_train_sequences, epochs=100, verbose=0)
Both LSTM and GRU require careful data preparation and more computational resources but can capture non-linear patterns that simpler models might miss.
#python #machine-learning
Page last modified: 2025-07-01 17:23:21