How Good Can Linear Models Be for Time-Series Forecasting?

tl;dr

Linear models are not weak. Their preprocessing is under-tuned.

Closed-form Ridge regression, with only its preprocessing hyperparameters searched (context length, local-normalization fraction $r$, regularization $\alpha$, augmentation), matches or beats Transformer, MLP, and CNN baselines on six of eight standard benchmarks, at orders-of-magnitude lower training cost.
There is no universal lookback or normalization scheme. The optimal lookback scales as $L^* \propto H^b$, with dataset-specific exponents ranging from $-0.19$ on Exchange/Traffic to $+0.46$ on ETTm2, which contradicts the convention that longer horizons require longer history.
The tuned hyperparameters expose scaling, locality, and per-series structure — patterns larger models hide inside their weights.

The adaptive context story

One forecaster, eight datasets, eight different optima.

Time-series forecasting research has been moving steadily toward larger architectures, on the assumption that capacity is what unlocks accuracy. Most of the accuracy gap attributed to capacity closes with cheaper preprocessing tuning, not bigger models. Ridge regression makes the test concrete: its closed-form solution and interpretable weights let the tuned hyperparameters be read off the search directly. The widget below shows what the search finds. Drag the horizon slider, switch between datasets, and watch the optimal lookback span grow, plateau, or shrink, depending on each dataset's stationarity rather than on $H$.

Pick a dataset above; the colored span shows $L^\star(H)$ for that dataset and horizon, anchored at the prediction-start line. The numbers are the full searched optima per dataset.

Live playground

Try the search yourself.

Pick a dataset, series, and forecast horizon, then drag the lookback $L$ and regularization $\alpha$ sliders to refit closed-form Ridge in your browser. The remaining hyperparameters (scaler scope and method, augmentation, local-norm fraction $r$) stay pinned to the searched optimum for the current horizon; autotune snaps $L$ and $\alpha$ there too. Bundled datasets: ETTh1, ETTh2, and Exchange.

Launch the playground →

Method

Closed-form Ridge over a searched preprocessing space.

The pipeline is small: sliding windows → scaling → optional augmentation → closed-form Ridge. Optuna searches the preprocessing stages while the model class stays fixed. Ridge has a closed form, $\boldsymbol{\beta} = (X^\top X + \alpha I)^{-1} X^\top \mathbf{y}$, so each trial is a few-millisecond Cholesky solve.
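
As a minimal sketch of the pipeline's final stage, the closed form above can be solved with a Cholesky factorization in NumPy. This assumes a univariate series and a direct multi-output formulation (one coefficient column per forecast step). The helper names `make_windows` and `ridge_closed_form` are illustrative, not the authors' code, and the sketch leaves out scaling, augmentation, and an intercept.

```python
import numpy as np

def make_windows(series, L, H):
    """Slice a 1-D series into (lookback, horizon) training pairs."""
    n = len(series) - L - H + 1
    idx = np.arange(n)[:, None]
    X = series[idx + np.arange(L)]        # inputs, shape (n, L)
    Y = series[idx + L + np.arange(H)]    # targets, shape (n, H)
    return X, Y

def ridge_closed_form(X, Y, alpha):
    """Solve (X^T X + alpha I) B = X^T Y via a Cholesky factorization."""
    A = X.T @ X + alpha * np.eye(X.shape[1])  # positive definite for alpha > 0
    C = np.linalg.cholesky(A)                 # A = C C^T, C lower triangular
    # np.linalg.solve does not exploit triangularity; a dedicated
    # triangular solver would be faster but needs an extra dependency.
    Z = np.linalg.solve(C, X.T @ Y)           # forward step: C Z = X^T Y
    return np.linalg.solve(C.T, Z)            # backward step: C^T B = Z

# Toy usage on a synthetic sine wave (hypothetical data, not a benchmark).
t = np.arange(500)
series = np.sin(2 * np.pi * t / 24)
X, Y = make_windows(series, L=48, H=12)
B = ridge_closed_form(X, Y, alpha=1e-3)       # weights, shape (48, 12)
forecast = series[-48:] @ B                   # next 12 steps
```

Because every forecast step shares the same matrix $X^\top X + \alpha I$, a single factorization serves all $H$ output columns, which is part of why each trial stays cheap.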

Hyperparameter	Range	Scale
lookback $L$	32 – 2048	log-int
local-norm fraction $r$	$10^{-3}$ – $1$	log
regularization $\alpha$	$10^{-6}$ – $10^{4}$	log
scaler method	mean / robust	categorical
scaler scope	global / local	categorical
noise type	none / time / freq	categorical
noise intensity $\sigma$	$10^{-3}$ – $0.5$	log
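
To make the "Scale" column concrete, the sketch below draws random configurations from the ranges in the table using only the standard library. It is a stand-in for the Optuna sampler, not a reproduction of it: the function and key names are hypothetical, and skipping $\sigma$ when the noise type is `none` is an assumption about how the conditional parameter is handled.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw a float whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config(rng):
    """Draw one preprocessing configuration from the ranges in the table."""
    noise_type = rng.choice(["none", "time", "freq"])
    # log-int: sample on a log scale, round, and clamp to the stated bounds.
    lookback = min(2048, max(32, round(log_uniform(rng, 32, 2048))))
    return {
        "lookback": lookback,
        "local_norm_fraction": log_uniform(rng, 1e-3, 1.0),
        "alpha": log_uniform(rng, 1e-6, 1e4),
        "scaler_method": rng.choice(["mean", "robust"]),
        "scaler_scope": rng.choice(["global", "local"]),
        "noise_type": noise_type,
        # Assumption: intensity is only sampled when augmentation is on.
        "noise_sigma": log_uniform(rng, 1e-3, 0.5) if noise_type != "none" else None,
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
```

Log-scale sampling spends as many draws between $10^{-6}$ and $10^{-5}$ as between $10^{3}$ and $10^{4}$, which suits parameters such as $\alpha$ whose useful values span many orders of magnitude.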

Two grouping levers regularize the search. The horizon group size $g_h$ makes consecutive horizons share one hyperparameter setting (default 48, at almost no cost in accuracy). The series group size $g_s$ does the same across series; its optimal value is dataset-specific, ranging from fully shared on ETTh1 to fully per-series on Weather.

Three diagnostic patterns

What the search reveals about the data.

1. Lookback scales with horizon, but the exponent flips sign.

Power-law fits $L^* \propto H^b$ over the eight benchmarks. Exponents span $b = -0.19$ (Exchange, Traffic) to $b = +0.46$ (ETTm2). Weather and Electricity are nearly flat, with $|b| < 0.1$. Toggle dataset traces by clicking the legend.

The standard $L=96$ default is therefore wrong in two opposite directions at once: it underserves Weather and Electricity by an order of magnitude, and overserves Exchange and Traffic at long horizons, where the optimum drops below 96.
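
A power-law exponent of this kind can be estimated with a straight-line fit in log-log space, since $\log L^* = \log a + b \log H$. The sketch below is one plausible way to do it, not necessarily the fitting procedure used in the paper. The exponents $-0.19$ and $+0.46$ are the ones quoted above, but the prefactors and the resulting lookback values are made up for illustration.

```python
import numpy as np

def fit_power_law(horizons, best_lookbacks):
    """Fit L* = a * H^b by least squares in log-log space; returns (a, b)."""
    b, log_a = np.polyfit(np.log(horizons), np.log(best_lookbacks), deg=1)
    return float(np.exp(log_a)), float(b)

# Hypothetical optima for illustration only; not values from the paper.
horizons = np.array([96.0, 192.0, 336.0, 720.0])
shrinking = 400.0 * horizons ** -0.19   # lookback shrinks as the horizon grows
growing = 40.0 * horizons ** 0.46       # lookback grows with the horizon

a_shrink, b_shrink = fit_power_law(horizons, shrinking)
a_grow, b_grow = fit_power_law(horizons, growing)
```

A negative $b$ means the best lookback gets shorter as the horizon gets longer, which is the case the fixed $L=96$ default cannot accommodate.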

2. Local normalization wins on a learned trailing fraction $r$.

Per-series heatmaps of $\log_{10} r$. ETTh1 (left) is visibly uniform across its 7 variates; Weather (right) is dispersed. Optimal $\log_{10} r$ clusters in $[-2.5, -0.5]$, i.e., trailing fractions of 0.3–30%.

Prior linear forecasters normalize either globally or over the full input window. The search instead picks a small recent slice — the inputs aren't stationary across a window of length $L$. Robust (median/IQR) statistics lost to mean/std on every dataset.
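
The sketch below illustrates local normalization over a trailing slice, assuming $r$ is a fraction of the lookback window and that the same statistics are used to map the forecast back to the original scale. Both are assumptions made for illustration, and `local_normalize` and `denormalize` are hypothetical helper names.

```python
import numpy as np

def local_normalize(window, r, eps=1e-8):
    """Scale a lookback window by the mean/std of its trailing fraction r."""
    L = window.shape[-1]
    k = max(2, int(np.ceil(r * L)))          # at least two points for a usable std
    tail = window[..., -k:]                  # most recent k observations
    mu = tail.mean(axis=-1, keepdims=True)
    sigma = tail.std(axis=-1, keepdims=True) + eps
    return (window - mu) / sigma, mu, sigma

def denormalize(pred, mu, sigma):
    """Map a normalized forecast back to the original scale."""
    return pred * sigma + mu

# Toy window with a level shift: the last 10% sits near 10, the rest at 0.
window = np.concatenate([np.zeros(90), 10.0 + np.sin(np.arange(10))])
scaled, mu, sigma = local_normalize(window, r=0.1)
restored = denormalize(scaled, mu, sigma)
```

In this toy window a full-window mean would land near 1, far from the current level of about 10, whereas the trailing-slice statistics center the input on where the series is at prediction time.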

3. Series within a dataset disagree.

Per-series heatmaps of $\log_{10} \alpha$. ETTh1's seven channels share a regularization regime; Weather's 21 channels diverge sharply (OT $\log_{10}\alpha \approx 1\text{--}2$ vs. others near $4$).

Channels of a single dataset can live on incompatible scales (Weather: pressure, humidity, wind, rainfall) or be physically related (ETTh1: load and oil-temperature readings from an electricity transformer). The optimal degree of cross-series sharing is therefore a property of the dataset, not a fixed design choice.

Efficiency / accuracy tradeoffs

How cheap can the search get?

Horizon grouping is nearly free.

MSE degradation vs. horizon group size $g_h$, per dataset. Curves are flat ($\leq 0.4\%$), so $g_h = 48$ costs almost nothing and frees the trial budget for $g_s$ search.

Augmentation helps about two-thirds of the time.

Fraction of trials selecting each augmentation type, per dataset. Augmentation is preferred in 60–70% of trials; time- and frequency-domain noise split roughly evenly.
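
The page does not spell out how the time- and frequency-domain noise is generated, so the following is only one plausible reading: Gaussian noise added either to the raw windows or to their real-FFT coefficients, with $\sigma$ taken relative to the overall standard deviation of the inputs. The function name `augment` and that scaling choice are assumptions.

```python
import numpy as np

def augment(X, noise_type, sigma, rng):
    """Return a noisy copy of training windows X with shape (n, L)."""
    if noise_type == "none":
        return X.copy()
    scale = sigma * X.std()   # assumption: intensity relative to signal scale
    if noise_type == "time":
        # Perturb each observation independently.
        return X + rng.normal(0.0, scale, size=X.shape)
    if noise_type == "freq":
        # Perturb the spectrum, then transform back to the time domain.
        spec = np.fft.rfft(X, axis=-1)
        noise = rng.normal(size=spec.shape) + 1j * rng.normal(size=spec.shape)
        return np.fft.irfft(spec + scale * noise, n=X.shape[-1], axis=-1)
    raise ValueError(f"unknown noise_type: {noise_type!r}")

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))          # hypothetical training windows
X_time = augment(X, "time", 0.1, rng)
X_freq = augment(X, "freq", 0.1, rng)
```

Passing `n=X.shape[-1]` to `irfft` keeps the window length unchanged for both even and odd $L$.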

Forecasts

What the tuned model predicts.

ETTh1 (series OT). The tuned local-Ridge prediction (teal) tracks the slow drift in the ground truth (black). The baseline globally-scaled Ridge (red) drifts back toward the dataset mean.

The same locality and seasonality the search recovers in the hyperparameters are visible directly in the trained weights. See Figure 4 in the paper for the full lag-by-horizon weight heatmap across the eight datasets.

Benchmark results

Long-term multivariate forecasting (MSE).

Per-horizon test MSE on the eight standard benchmarks. Within the linear class, tuned Ridge takes the best average on seven of eight datasets and ties FITS on ETTh2. Against nonlinear baselines, it wins the average on six of eight benchmarks. Bold marks the best per row; underline marks the second-best. OLS, FITS, and DLinear are reproduced with instance normalization from Toner et al. (2024); nonlinear baselines come from their original publications.

Limitations

Scope notes.

The eight benchmarks are smooth sensor and energy series; the lookback-scaling and locality patterns we report are best read as findings on this regime rather than universal claims. We follow prior work in reporting point-estimate MSE; very small margins should be treated as ties. Nonlinear baselines are quoted from their original publications, so comparisons reflect their published configurations rather than a jointly retuned protocol. Per-series search scales with channel count; finer-grained search is an open direction. The study fixes the model class to Ridge. Applying the same preprocessing search to deeper models is the obvious next step.

Citation

BibTeX

@article{huang2026linear,
  title   = {How Good Can Linear Models Be for Time-Series Forecasting?},
  author  = {Huang, Lang and Xu, Jinglue and Darlow, Luke},
  journal = {arXiv preprint arXiv:2606.27282},
  year    = {2026}
}