Ridge regression, with its preprocessing tuned, matches the deep models.
¹Sakana AI, Tokyo, Japan; ²National Institute of Informatics, Japan · {langhuang, jingluexu, luke}@sakana.ai
tl;dr
The adaptive context story
Time-series forecasting research has been moving steadily toward larger architectures, on the assumption that capacity is what unlocks accuracy. Most of the accuracy gap between linear and deep models closes with cheap preprocessing tuning rather than with bigger models. Ridge regression makes the test concrete: a closed-form solution and interpretable weights let the tuned hyperparameters be read off the search directly. The widget below shows what the search finds. Drag the horizon slider, switch between datasets, and watch the optimal lookback span grow, plateau, or shrink; the pattern is set by each dataset's stationarity, not by $H$.
Pick a dataset above; the colored span shows $L^\star(H)$ for that dataset and horizon, anchored at the prediction-start line. The numbers are the full searched optima per dataset.
Live playground
Pick a dataset, series, and forecast horizon, then drag the lookback $L$ and regularization $\alpha$ sliders to refit closed-form Ridge in your browser. The remaining hyperparameters (scaler scope and method, augmentation, local-norm fraction $r$) stay pinned to the searched optimum for the current horizon; autotune snaps $L$ and $\alpha$ there too. Bundled datasets: ETTh1, ETTh2, and Exchange.
Method
The pipeline is small: sliding windows → scaling → optional augmentation → closed-form Ridge. Optuna searches the preprocessing stages while the model class stays fixed. Ridge has a closed form, $\boldsymbol{\beta} = (X^\top X + \alpha I)^{-1} X^\top \mathbf{y}$, so each trial is a few-millisecond Cholesky solve.
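The window-then-solve core of the pipeline can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `make_windows` and `ridge_fit`, the synthetic sine series, and the choice to stack all horizons as columns of one target matrix are all made up for the example. Scaling and augmentation are omitted.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (lookback -> horizon) training pairs."""
    n = len(series) - lookback - horizon + 1
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

def ridge_fit(X, Y, alpha):
    """Solve (X^T X + alpha I) B = X^T Y using a Cholesky factor of the Gram matrix."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    chol = np.linalg.cholesky(gram)        # gram = chol @ chol.T, chol lower-triangular
    # np.linalg.solve is a general solver; scipy.linalg.cho_solve would
    # exploit the triangular structure, but NumPy alone keeps this self-contained.
    z = np.linalg.solve(chol, X.T @ Y)     # solves chol @ z = X^T Y
    return np.linalg.solve(chol.T, z)      # solves chol.T @ B = z

# Hypothetical toy data: a noisy daily cycle sampled hourly.
rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

X, Y = make_windows(series, lookback=48, horizon=12)
B = ridge_fit(X, Y, alpha=1.0)             # one weight column per forecast step
forecast = series[-48:] @ B                # 12-step forecast from the last window
```

Each column of `B` is the $\boldsymbol{\beta}$ of the formula above for one forecast step, so the lag-by-horizon weights can be inspected directly.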
| Hyperparameter | Range | Scale |
|---|---|---|
| lookback $L$ | 32 – 2048 | log-int |
| local-norm fraction $r$ | $10^{-3}$ – $1$ | log |
| regularization $\alpha$ | $10^{-6}$ – $10^{4}$ | log |
| scaler method | mean / robust | categorical |
| scaler scope | global / local | categorical |
| noise type | none / time / freq | categorical |
| noise intensity $\sigma$ | $10^{-3}$ – $0.5$ | log |
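To make the table concrete, the sketch below draws one configuration from the same ranges and scales using only the standard library. It is plain random sampling, not Optuna's sampler, and the dictionary keys, the `log_uniform` helper, and the choice to leave $\sigma$ unset when the noise type is `none` are assumptions made for illustration.

```python
import math
import random

def log_uniform(rng, low, high):
    """Sample uniformly in log space, clamped to [low, high] against float error."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(max(value, low), high)

def sample_config(rng):
    """Draw one preprocessing configuration from the ranges in the table above."""
    noise_type = rng.choice(["none", "time", "freq"])
    return {
        "lookback": int(round(log_uniform(rng, 32, 2048))),      # log-int
        "local_norm_fraction": log_uniform(rng, 1e-3, 1.0),      # log
        "alpha": log_uniform(rng, 1e-6, 1e4),                    # log
        "scaler_method": rng.choice(["mean", "robust"]),
        "scaler_scope": rng.choice(["global", "local"]),
        "noise_type": noise_type,
        # Assumption: intensity is only meaningful when noise is enabled.
        "noise_sigma": None if noise_type == "none" else log_uniform(rng, 1e-3, 0.5),
    }

config = sample_config(random.Random(0))
```

Sampling on a log scale spends as many draws between 32 and 64 as between 1024 and 2048, which suits ranges spanning several orders of magnitude.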
Two grouping levers regularize the search: $g_h$ groups consecutive horizons to share one HP setting (default 48; barely costs accuracy), and $g_s$ groups series similarly (the optimal value is dataset-specific, ranging from fully shared on ETTh1 to fully per-series on Weather).
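One way to picture the grouping is as contiguous blocks of indices that share a single hyperparameter setting. The sketch below assumes blocks are contiguous and that the last block may be shorter; the function name `contiguous_groups` and the example sizes are hypothetical.

```python
def contiguous_groups(n_items, group_size):
    """Split indices 0..n_items-1 into consecutive blocks of at most group_size."""
    return [range(start, min(start + group_size, n_items))
            for start in range(0, n_items, group_size)]

# Horizon grouping with the default g_h = 48: a 720-step forecast needs
# 15 hyperparameter settings instead of 720.
horizon_blocks = contiguous_groups(720, 48)

# Series grouping for a hypothetical 7-channel dataset:
fully_shared = contiguous_groups(7, 7)   # one block, all channels share a setting
per_series = contiguous_groups(7, 1)     # seven blocks, one setting per channel
```

Larger groups mean fewer independent searches and more data behind each one, which is the sense in which the two levers regularize the search.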
Three diagnostic patterns
The standard $L=96$ default is wrong in two opposite directions at once: it falls short of the optimal lookback on Weather and Electricity by an order of magnitude, and it exceeds the optimum on Exchange and Traffic at long horizons, where the optimum drops below 96.
Prior linear forecasters normalize either globally or over the full input window. The search instead picks a small recent slice of the window for the normalization statistics, because the inputs aren't stationary across a window of length $L$. Robust (median/IQR) statistics lost to mean/std on every dataset.
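A minimal sketch of local normalization follows, assuming the local-norm fraction $r$ means "use the most recent $\lceil rL \rceil$ points of the lookback window for the mean and standard deviation". That reading, the `local_scale` name, and the zero-variance guard are assumptions for illustration rather than the paper's exact definition.

```python
import math
import numpy as np

def local_scale(window, r):
    """Standardize a lookback window with statistics from its most recent slice."""
    k = max(1, math.ceil(r * len(window)))   # number of recent points used
    recent = window[-k:]
    mu = recent.mean()
    sd = recent.std()
    if sd < 1e-8:                            # guard against a constant slice
        sd = 1.0
    return (window - mu) / sd, mu, sd

# Hypothetical trending input: the full-window mean would lag the current level.
window = np.arange(100, dtype=float)
scaled, mu, sd = local_scale(window, r=0.25)   # statistics from the last 25 points

# A forecast made in scaled space is mapped back with the same statistics.
scaled_forecast = np.zeros(4)
forecast = scaled_forecast * sd + mu
```

With $r = 1$ this reduces to normalizing over the full input window; smaller $r$ anchors the scale to the level the series has reached just before the prediction starts.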
Channels of a single dataset can live on incompatible scales (Weather: pressure, humidity, wind, rainfall) or be physically related (ETTh1 electricity-load transformers). The optimal degree of cross-series sharing is therefore a property of the dataset, not a fixed design choice.
Efficiency / accuracy tradeoffs
Forecasts
The same locality and seasonality the search recovers in the hyperparameters are visible directly in the trained weights. See Figure 4 in the paper for the full lag-by-horizon weight heatmap across the eight datasets.
Benchmark results
Per-horizon test MSE on the eight standard benchmarks. Within the linear class, tuned Ridge takes the best average on seven of eight datasets and ties FITS on ETTh2. Against nonlinear baselines, it wins the average on six of eight benchmarks. Bold marks the best per row; underline marks the second-best. OLS, FITS, and DLinear are reproduced with instance normalization from Toner et al. (2024); nonlinear baselines come from their original publications.
Limitations
The eight benchmarks are smooth sensor and energy series; the lookback-scaling and locality patterns we report are best read as findings on this regime rather than universal claims. We follow prior work in reporting point-estimate MSE; very small margins should be treated as ties. Nonlinear baselines are quoted from their original publications, so comparisons reflect their published configurations rather than a jointly retuned protocol. Per-series search scales with channel count; finer-grained search is an open direction. The study fixes the model class to Ridge. Applying the same preprocessing search to deeper models is the obvious next step.
Citation
@article{huang2026linear,
title = {How Good Can Linear Models Be for Time-Series Forecasting?},
author = {Huang, Lang and Xu, Jinglue and Darlow, Luke},
journal = {arXiv preprint arXiv:2606.27282},
year = {2026}
}