Forecasting with foundation models

Foundation models (FMs) have triggered a fundamental paradigm shift in time series forecasting, moving the field away from fitting a separate model to each dataset and towards generalized representation learning. Driven by the same architectural breakthroughs that power Large Language Models (LLMs), FMs bring zero-shot and in-context learning capabilities to temporal data.

In the context of forecasting, a foundation model is a massively scaled neural network (typically Transformer-based) that has been pre-trained on highly diverse, cross-domain datasets spanning finance, weather, web traffic, retail and more.

Models such as AWS Chronos, Google TimesFM 2.5 and Salesforce Moirai frame temporal forecasting as a sequence modelling problem and often treat time series values as discrete patches or tokens. Having already internalised the structural priors of millions of series during pre-training, they can instantly infer trends, seasonality and complex dynamics in completely unseen data, eliminating the need for any domain-specific weight updates.

Foundation Models vs. Machine Learning Models

Foundation models and traditional machine learning models approach forecasting in fundamentally different ways. Understanding these distinctions is crucial for knowing when and how to deploy each method.

Zero-Shot Prediction

Machine learning models require a training phase. You must fit the model on your historical target data so the algorithm can learn the optimal weights and parameters for your specific time series. Foundation models, however, are capable of zero-shot inference. Because their highly generalized weights are already frozen from the massive pre-training phase, they can generate accurate forecasts on your data immediately, leveraging their pre-existing latent representations rather than learning your dataset from scratch.

The Role of the fit Method

Machine learning models must be trained: calling .fit() optimizes the model's internal parameters to minimize an error metric on your data. Foundation models, by contrast, arrive pre-trained: their weights are frozen and never updated. Calling .fit() is therefore not a training step; it simply stores the recent historical context (last window of observations, frequency, and scaling factors) so that .predict() has everything it needs at inference time. In some implementations, calling .fit() is entirely optional before prediction.
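The role of .fit() in this setting can be illustrated with a deliberately simplified toy class. This is not skforecast's actual implementation; the class name, the naive "repeat last value" predictor, and the attribute names are all illustrative. The only point it demonstrates is that "fitting" stores context rather than learning weights.

```python
# Illustrative toy (NOT skforecast's actual code): a zero-shot wrapper whose
# .fit() merely stores recent context instead of updating any weights.
class ToyZeroShotForecaster:
    def __init__(self, context_length=512):
        self.context_length = context_length
        self.context_ = None  # filled by fit(); no weights are ever learned

    def fit(self, series):
        # "Fitting" = remembering the last `context_length` observations.
        self.context_ = list(series)[-self.context_length:]
        return self

    def predict(self, steps):
        # Stand-in for the pre-trained network: repeat the last observed value.
        return [self.context_[-1]] * steps

forecaster = ToyZeroShotForecaster(context_length=3)
forecaster.fit([10, 20, 30, 40, 50])
print(forecaster.context_)    # [30, 40, 50] -> only context was stored
print(forecaster.predict(2))  # [50, 50]
```

A real foundation model replaces the trivial predict() body with a forward pass of the pre-trained network over the stored context.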

Context Window vs. Engineered Lags

Machine learning models rely on explicitly engineered features; they require creating a tabular dataset where past values are used as columns to predict the target. Foundation models rely on a context window. You pass a raw, sequential chunk of recent historical data (e.g., the last 512 observations) directly into the model at inference time. The attention mechanism inside the model automatically decides which past data points are most relevant.
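The contrast can be sketched with a toy series. The lag column names (lag_1, lag_2, ...) and the series values below are illustrative, not taken from any particular library.

```python
import pandas as pd

y = pd.Series([112, 118, 132, 129, 121, 135], name="y")

# Machine learning approach: engineer an explicit lag matrix (tabular X, y).
X = pd.concat({f"lag_{k}": y.shift(k) for k in (1, 2, 3)}, axis=1).dropna()
target = y.loc[X.index]  # the target aligned with the feature rows
print(X)

# Foundation model approach: no feature engineering; just pass the raw tail
# of the series (the context window) to the model at inference time.
context = y.tail(4)        # e.g. the last 4 observations
print(context.tolist())    # [132, 129, 121, 135]
```

In the first case the modeller decides which lags matter; in the second, the model's attention mechanism weighs the raw context internally.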

In summary, foundation models replace the traditional train -> predict pipeline with a pre-train -> (context + predict) approach. Because major research institutions with access to millions of diverse time series carry out the computationally intensive pre-training phase, end users are freed from model training entirely.

However, there is no such thing as a free lunch in machine learning. Skipping the training phase shifts the burden to the inference phase. Although the weights are fixed, every prediction requires the model to ingest and process the historical context window in real time. Consequently, the main drawback of zero-shot forecasting is that inference is significantly slower and more computationally expensive, and your data pipeline must continuously supply large amounts of historical context at runtime.

✏️ Note

For more details about forecasting with foundation models, visit Forecasting: Principles and Practice, the Pythonic Way.

Foundation Models in skforecast

Skforecast's integration is built on two layers. First, FoundationModel acts as a unified wrapper that adapts each model's native API (Chronos-2, TimesFM 2.5, Moirai-2, TabICL) behind a familiar scikit-learn interface (fit, predict, get_params). Second, ForecasterFoundation wraps that estimator to unlock the full skforecast ecosystem. It exposes the same interface as any other skforecast forecaster, meaning users can use backtesting, prediction intervals, and multi-series support with the exact same code.

Foundation Model Architecture Diagram

Supported Foundation Models

| | Chronos | TimesFM | Moirai | TabICL |
|---|---|---|---|---|
| Provider | Amazon | Google | Salesforce | Soda-Inria |
| GitHub | chronos-forecasting | timesfm | uni2ts | tabicl |
| Documentation | Chronos models | TimesFM models | Moirai-R models | TabICL Docs |
| Available model IDs | amazon/chronos-2, autogluon/chronos-2-small, autogluon/chronos-2-synth | google/timesfm-2.5-200m-pytorch | Salesforce/moirai-2.0-R-small | soda-inria/tabicl |
| Backend | PyTorch | PyTorch | PyTorch | PyTorch |
| Forecasting type | Zero-shot | Zero-shot | Zero-shot | Zero-shot |
| Default context_length | 8192 | 512 | 2048 | 4096 |
| Max context_length | 8192 | 16384 | 2048 | 4096 |
| max_horizon | No hard limit; set via steps at predict time | 512 | No hard limit; set via steps at predict time | No hard limit; set via steps at predict time |
| Point forecast | Median (0.5 quantile) | Mean (dedicated output array) | Median (0.5 quantile) | Mean (default, configurable to median) |
| Covariate support (exog) | Yes | No | No | Yes |
| cross_learning parameter | Yes (multi-series mode only) | No | No | No |
| Install command | pip install chronos-forecasting | pip install git+https://github.com/google-research/timesfm.git | pip install uni2ts | pip install tabicl[forecast] |
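When switching between models programmatically, the facts in the table above can be captured as a small lookup. This helper is hypothetical (not part of skforecast); only the model IDs and maximum context lengths are taken from the table.

```python
# Hypothetical helper (not part of skforecast): a lookup built from the
# table above, so the right model_id can be selected programmatically.
SUPPORTED_MODELS = {
    "chronos": {"model_id": "autogluon/chronos-2-small",       "supports_exog": True,  "max_context": 8192},
    "timesfm": {"model_id": "google/timesfm-2.5-200m-pytorch", "supports_exog": False, "max_context": 16384},
    "moirai":  {"model_id": "Salesforce/moirai-2.0-R-small",   "supports_exog": False, "max_context": 2048},
    "tabicl":  {"model_id": "soda-inria/tabicl",               "supports_exog": True,  "max_context": 4096},
}

def estimator_kwargs(family: str, context_length: int = 512) -> dict:
    """Return FoundationModel kwargs, clipping the context to the model's maximum."""
    spec = SUPPORTED_MODELS[family]
    return {
        "model_id": spec["model_id"],
        "context_length": min(context_length, spec["max_context"]),
    }

print(estimator_kwargs("moirai", context_length=4000))
# {'model_id': 'Salesforce/moirai-2.0-R-small', 'context_length': 2048}
```

The returned dict can then be unpacked into the wrapper, e.g. `FoundationModel(**estimator_kwargs("moirai"))`.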

💡 Tip

All four models can run on CPU, but a CUDA GPU is recommended for faster inference, especially with long context windows. PyTorch also detects the MPS backend automatically, which benefits Apple Silicon users.

Keep in mind that context length significantly impacts inference speed. Larger contexts give the model more information, but they increase processing time. Although these models boast large context capacities, for most use cases shorter contexts achieve similar accuracy much faster.

Input Data Formats

ForecasterFoundation accepts several data formats for both the target series and exogenous variables.

Target Series (series)

The series parameter in the .fit() method supports both single-series and multi-series (global model) configurations.

| Mode | Allowed data type | Description |
|---|---|---|
| Single-Series | pd.Series | A single time series with a named index. |
| Multi-Series (Wide) | pd.DataFrame | Each column represents a separate time series. |
| Multi-Series (Long) | pd.DataFrame | MultiIndex (Level 0: series ID, Level 1: DatetimeIndex). |
| Multi-Series (Dict) | dict[str, pd.Series] | Keys are series identifiers, values are pandas Series. |

💡 Tip

While Long-format DataFrames are supported, they are converted to dictionaries internally. For best performance, pass a dict[str, pd.Series] directly.
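The long-to-dict conversion suggested in the tip can be done with a plain pandas groupby. The frame below is a minimal illustrative example (the index names and values are made up); only the target dict[str, pd.Series] shape matches skforecast's recommended input.

```python
import pandas as pd

# Illustrative long-format frame: MultiIndex (series id, datetime), one value column.
idx = pd.MultiIndex.from_product(
    [["item_1", "item_2"], pd.date_range("2024-01-01", periods=3, freq="D")],
    names=["series_id", "date"],
)
long_df = pd.DataFrame({"value": range(6)}, index=idx)

# Convert to the recommended dict[str, pd.Series] format: one named Series
# per series id, indexed only by its DatetimeIndex.
series_dict = {
    name: group.droplevel("series_id")["value"].rename(name)
    for name, group in long_df.groupby(level="series_id")
}
print(list(series_dict))               # ['item_1', 'item_2']
print(series_dict["item_2"].tolist())  # [3, 4, 5]
```

Doing this once up front avoids the repeated internal conversion on every call.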

Exogenous Variables (exog)

Exogenous variables must be aligned with the target series index. Currently, only Chronos and TabICL support covariates (see the Supported Foundation Models table). TimesFM 2.5 and Moirai-2 do not accept exogenous variables.

| Mode | Allowed data type | Description |
|---|---|---|
| Single-Series | pd.Series or pd.DataFrame | Aligned to the target series index. |
| Multi-Series (Dict) | dict[str, pd.Series \| pd.DataFrame \| None] | One entry per series. |
| Multi-Series (Broadcast) | pd.Series or pd.DataFrame | Automatically applied to all series. |
| Multi-Series (Long) | pd.DataFrame | MultiIndex (Level 0: series ID, Level 1: DatetimeIndex). |
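The dict and broadcast modes for exog can be sketched with a small hypothetical helper (this is not skforecast internals; the function and variable names are illustrative). It shows the semantics: a dict maps each series id to its own exog, while a single Series/DataFrame is shared by all series.

```python
import pandas as pd

# Hypothetical helper (not skforecast internals): normalize `exog` so every
# series id maps to its own exogenous data, broadcasting a single object.
def normalize_exog(exog, series_ids):
    if isinstance(exog, dict):
        # Dict mode: missing series ids get None (no exog for that series).
        return {sid: exog.get(sid) for sid in series_ids}
    # Broadcast mode: the same exog object is applied to all series.
    return {sid: exog for sid in series_ids}

temp = pd.Series([21.2, 20.6, 20.3], name="Temperature")
per_series = normalize_exog(temp, ["item_1", "item_2"])
print(sorted(per_series))            # ['item_1', 'item_2']
print(per_series["item_1"] is temp)  # True -> broadcast shares one object
```

In either mode, each exog entry must still be aligned with its target series index, as stated above.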

Libraries and data

# Libraries
# ==============================================================================
import pandas as pd
import time
import torch
import matplotlib.pyplot as plt
from skforecast.datasets import fetch_dataset
from skforecast.foundation import FoundationModel, ForecasterFoundation
from skforecast.model_selection import (
    TimeSeriesFold,
    backtesting_foundation
)
from skforecast.plot import set_dark_theme

color = '\033[1m\033[38;5;208m' 
print(f"{color}torch version: {torch.__version__}")
print(f"  Cuda available : {torch.cuda.is_available()}")
print(f"  MPS available  : {torch.backends.mps.is_available()}")
torch version: 2.6.0+cu124
  Cuda available : True
  MPS available  : False
# Data download
# ==============================================================================
data = fetch_dataset(name='vic_electricity')

# Aggregating in 1H intervals
# ==============================================================================
# The Date column is dropped so that it does not raise an error during aggregation.
data = data.drop(columns="Date")
data = (
    data
    .resample(rule="h", closed="left", label="right")
    .agg({
        "Demand": "mean",
        "Temperature": "mean",
        "Holiday": "mean",
    })
)
data.head(3)
╭──────────────────────────── vic_electricity ─────────────────────────────╮
│ Description:                                                             │
│ Half-hourly electricity demand for Victoria, Australia                   │
│                                                                          │
│ Source:                                                                  │
│ O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022).tsibbledata: Diverse │
│ Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/,              │
│ https://github.com/tidyverts/tsibbledata/.                               │
│ https://tsibbledata.tidyverts.org/reference/vic_elec.html                │
│                                                                          │
│ URL:                                                                     │
│ https://raw.githubusercontent.com/skforecast/skforecast-                 │
│ datasets/main/data/vic_electricity.csv                                   │
│                                                                          │
│ Shape: 52608 rows x 4 columns                                            │
╰──────────────────────────────────────────────────────────────────────────╯
Demand Temperature Holiday
Time
2011-12-31 14:00:00 4323.095350 21.225 1.0
2011-12-31 15:00:00 3963.264688 20.625 1.0
2011-12-31 16:00:00 3950.913495 20.325 1.0
# Split data into train-test
# ==============================================================================
data = data.loc['2012-01-01 00:00:00':'2014-12-30 23:00:00', :].copy()
end_train = '2014-11-30 23:59:00'
data_train = data.loc[: end_train, :].copy()
data_test  = data.loc[end_train:, :].copy()

print(f"Train dates: {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates: 2012-01-01 00:00:00 --- 2014-11-30 23:00:00  (n=25560)
Test dates : 2014-12-01 00:00:00 --- 2014-12-30 23:00:00  (n=720)

Single series forecasting with Chronos

A ForecasterFoundation is created using Amazon's Chronos-2-small model.

# Create ForecasterFoundation
# ==============================================================================
estimator = FoundationModel(model_id="autogluon/chronos-2-small", context_length=500)
forecaster = ForecasterFoundation(estimator=estimator)

Each adapter accepts additional keyword arguments that control model-specific behavior (e.g., context_length, device_map, torch_dtype). These can be passed directly through the FoundationModel constructor.

For the full list of available parameters, see the API reference: ChronosAdapter, TimesFMAdapter, MoiraiAdapter, TabICLAdapter.

💡 Tip

While .fit() is used here to store the historical context and metadata, it is not strictly required. Foundation models can generate forecasts by passing the context directly to .predict() via the context parameter. However, calling .fit() first simplifies subsequent calls to .predict(), .predict_interval(), and .predict_quantiles().

# Train ForecasterFoundation
# ==============================================================================
forecaster.fit(
    series = data_train["Demand"], 
    exog   = data_train[["Temperature", "Holiday"]]
)
forecaster

ForecasterFoundation

General Information
  • Model ID: autogluon/chronos-2-small
  • Context length: 500
  • Window size: 500
  • Series names: Demand
  • Exogenous included: True
  • Creation date: 2026-04-24 16:15:38
  • Last fit date: 2026-04-24 16:15:38
  • Skforecast version: 0.22.0
  • Python version: 3.12.13
  • Forecaster id: None
Exogenous Variables

Temperature, Holiday

Training Information
  • Context range: 'Demand': ['2012-01-01 00:00:00', '2014-11-30 23:00:00']
  • Training index type: DatetimeIndex
  • Training index frequency:
Model Parameters
  • cross_learning: False
  • context_length: 500
  • device_map: auto
  • torch_dtype: None
  • predict_kwargs: None

📖 API Reference    📝 User Guide

Three methods can be used to predict the next $n$ steps ahead: predict(), predict_interval(), and predict_quantiles(). All these methods allow for passing context and context_exog to override the historical context used by the underlying model to generate predictions.

# Predictions: point forecast
# ==============================================================================
steps = 24
predictions = forecaster.predict(
                  steps = steps,
                  exog  = data_test[["Temperature", "Holiday"]]
              )

predictions.head(3)
level pred
2014-12-01 00:00:00 Demand 5527.678711
2014-12-01 01:00:00 Demand 5511.500977
2014-12-01 02:00:00 Demand 5457.791992
# Predictions: intervals
# ==============================================================================
predictions_intervals = forecaster.predict_interval(
                            steps    = steps,
                            exog     = data_test[["Temperature", "Holiday"]],
                            interval = [10, 90],  # 80% prediction interval
                        )

predictions_intervals.head(3)
level pred lower_bound upper_bound
2014-12-01 00:00:00 Demand 5527.678711 5372.815918 5689.793457
2014-12-01 01:00:00 Demand 5511.500977 5318.045410 5733.492188
2014-12-01 02:00:00 Demand 5457.791992 5241.040527 5717.424805
# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
         steps              = 24,
         initial_train_size = len(data.loc[:end_train]),
         refit              = False
     )

start = time.perf_counter()
metrics_chronos, backtest_predictions = backtesting_foundation(
    forecaster        = forecaster,
    series            = data['Demand'],
    exog              = data[["Temperature", "Holiday"]],
    cv                = cv,
    metric            = 'mean_absolute_error',
    suppress_warnings = True
)
elapsed_time_chronos = time.perf_counter() - start
print(f"Backtesting completed in {elapsed_time_chronos:.4f} seconds.")
print("Backtest metrics")
display(metrics_chronos)
print("")
print("Backtest predictions")
backtest_predictions.head(4)
  0%|          | 0/30 [00:00<?, ?it/s]
Backtesting completed in 1.4381 seconds.
Backtest metrics
mean_absolute_error
0 171.266953
Backtest predictions
level fold pred
2014-12-01 00:00:00 Demand 0 5527.678711
2014-12-01 01:00:00 Demand 0 5511.500977
2014-12-01 02:00:00 Demand 0 5457.791992
2014-12-01 03:00:00 Demand 0 5402.819336
# Plot predictions
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(7, 3))
data_test['Demand'].plot(ax=ax, label='test')
backtest_predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();

Multiple series (global model)

The class ForecasterFoundation allows modeling and forecasting multiple series with a single model.

# Data
# ==============================================================================
data_multiseries = fetch_dataset(name="items_sales")
display(data_multiseries.head(3))
╭─────────────────────── items_sales ───────────────────────╮
│ Description:                                              │
│ Simulated time series for the sales of 3 different items. │
│                                                           │
│ Source:                                                   │
│ Simulated data.                                           │
│                                                           │
│ URL:                                                      │
│ https://raw.githubusercontent.com/skforecast/skforecast-  │
│ datasets/main/data/simulated_items_sales.csv              │
│                                                           │
│ Shape: 1097 rows x 3 columns                              │
╰───────────────────────────────────────────────────────────╯
item_1 item_2 item_3
date
2012-01-01 8.253175 21.047727 19.429739
2012-01-02 22.777826 26.578125 28.009863
2012-01-03 27.549099 31.751042 32.078922
# Split data into train-test
# ==============================================================================
end_train = '2014-07-15 23:59:00'
data_multiseries_train = data_multiseries.loc[:end_train, :]
data_multiseries_test  = data_multiseries.loc[end_train:, :]
# Plot time series
# ==============================================================================
set_dark_theme()
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 5), sharex=True)

for i, col in enumerate(data_multiseries.columns):
    data_multiseries_train[col].plot(ax=axes[i], label='train')
    data_multiseries_test[col].plot(ax=axes[i], label='test')
    axes[i].set_title(col)
    axes[i].set_ylabel('sales')
    axes[i].set_xlabel('')
    axes[i].legend(loc='upper left')

fig.tight_layout()
plt.show();

In this example, instead of calling fit(), the context is passed directly to the predict() method.

# Create and train ForecasterFoundation
# ==============================================================================
estimator = FoundationModel(model_id = "autogluon/chronos-2-small", context_length=500)
forecaster = ForecasterFoundation(estimator = estimator)

# fit() is optional; context is passed directly to predict()
# forecaster.fit(series=data_multiseries_train)

forecaster

ForecasterFoundation

General Information
  • Model ID: autogluon/chronos-2-small
  • Context length: 500
  • Window size: 500
  • Series names: None
  • Exogenous included: False
  • Creation date: 2026-04-24 16:15:59
  • Last fit date: None
  • Skforecast version: 0.22.0
  • Python version: 3.12.13
  • Forecaster id: None
Exogenous Variables

None

Training Information
  • Context range: Not fitted
  • Training index type: Not fitted
  • Training index frequency: Not fitted
Model Parameters
  • cross_learning: False
  • context_length: 500
  • device_map: auto
  • torch_dtype: None
  • predict_kwargs: None

📖 API Reference    📝 User Guide

# Predictions for all series (levels)
# ==============================================================================
steps = len(data_multiseries_test)
predictions_items = forecaster.predict(
                        steps   = steps, 
                        levels  = None,  # All levels are predicted
                        context = data_multiseries_train
                    )
predictions_items.head()
╭────────────────────────────────── InputTypeWarning ──────────────────────────────────╮
 Passing a DataFrame (either wide or long format) as `series` requires additional     
 internal transformations, which can increase computational time. It is recommended   
 to use a dictionary of pandas Series instead. For more details, see:                 
 https://skforecast.org/latest/user_guides/independent-multi-time-series-forecasting. 
 html#input-data                                                                      
                                                                                      
 Category : skforecast.exceptions.InputTypeWarning                                    
 Location :                                                                           
 c:\Users\Joaquin\miniconda3\envs\skforecast_22_py12\Lib\site-packages\skforecast\uti 
 ls\utils.py:2799                                                                     
 Suppress : warnings.simplefilter('ignore', category=InputTypeWarning)                
╰──────────────────────────────────────────────────────────────────────────────────────╯
level pred
2014-07-16 item_1 25.523064
2014-07-16 item_2 10.456666
2014-07-16 item_3 11.862236
2014-07-17 item_1 25.296782
2014-07-17 item_2 10.701235
# Plot predictions
# ==============================================================================
set_dark_theme()
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 5), sharex=True)

for i, col in enumerate(data_multiseries.columns):
    
    data_multiseries_train[col].plot(ax=axes[i], label='train')
    data_multiseries_test[col].plot(ax=axes[i], label='test')
    predictions_items.query(f"level == '{col}'").plot(
        ax=axes[i], label='predictions', color='white'
    )

    axes[i].set_title(col)
    axes[i].set_ylabel('sales')
    axes[i].set_xlabel('')
    axes[i].legend(loc='upper left')

fig.tight_layout()
plt.show();
# Interval predictions for item_1 and item_2
# ==============================================================================
predictions_intervals = forecaster.predict_interval(
                            steps    = 24,
                            levels   = ['item_1', 'item_2'],
                            context  = data_multiseries_train,
                            interval = [10, 90],  # 80% prediction interval
                        )

predictions_intervals.head()
level pred lower_bound upper_bound
2014-07-16 item_1 25.464174 24.582005 26.430853
2014-07-16 item_2 10.649370 8.679634 13.302807
2014-07-17 item_1 25.270247 24.255964 26.327969
2014-07-17 item_2 10.834244 8.717453 13.826941
2014-07-18 item_1 25.175861 24.079086 26.286255

Other foundation models

The examples above use the Amazon Chronos model, but the same code structure applies to any other foundation model supported by skforecast. To use a different model, simply change the model_id parameter when instantiating the FoundationModel wrapper. The rest of the forecasting pipeline remains unchanged, allowing you to easily compare different foundation models on your dataset.

TimesFM 2.5

A ForecasterFoundation is created using Google's TimesFM-2.5-200m model.

# Create ForecasterFoundation
# ==============================================================================
estimator = FoundationModel(model_id="google/timesfm-2.5-200m-pytorch", context_length=500)
forecaster = ForecasterFoundation(estimator = estimator)
# Train ForecasterFoundation
# ==============================================================================
forecaster.fit(series=data_train["Demand"])
forecaster

ForecasterFoundation

General Information
  • Model ID: google/timesfm-2.5-200m-pytorch
  • Context length: 500
  • Window size: 500
  • Series names: Demand
  • Exogenous included: False
  • Creation date: 2026-04-24 16:16:02
  • Last fit date: 2026-04-24 16:16:02
  • Skforecast version: 0.22.0
  • Python version: 3.12.13
  • Forecaster id: None
Exogenous Variables

None

Training Information
  • Context range: 'Demand': ['2012-01-01 00:00:00', '2014-11-30 23:00:00']
  • Training index type: DatetimeIndex
  • Training index frequency:
Model Parameters
  • context_length: 500
  • max_horizon: 512
  • forecast_config_kwargs: None

📖 API Reference    📝 User Guide

# Predictions: point forecast
# ==============================================================================
steps = 24
predictions = forecaster.predict(steps=steps)
predictions.head(3)
level pred
2014-12-01 00:00:00 Demand 5658.415039
2014-12-01 01:00:00 Demand 5671.861816
2014-12-01 02:00:00 Demand 5747.938477
# Predictions: intervals
# ==============================================================================
predictions_intervals = forecaster.predict_interval(
    steps    = steps,
    interval = [10, 90],  # 80% prediction interval
)
predictions_intervals.head(3)
level pred lower_bound upper_bound
2014-12-01 00:00:00 Demand 5658.415039 5541.004883 5790.344238
2014-12-01 01:00:00 Demand 5671.861816 5470.189453 5899.113281
2014-12-01 02:00:00 Demand 5747.938477 5450.534180 6064.879883
# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
         steps              = 24,
         initial_train_size = len(data.loc[:end_train]),
         refit              = False
     )

start = time.perf_counter() 
metrics_timesfm, backtest_predictions = backtesting_foundation(
    forecaster        = forecaster,
    series            = data['Demand'],
    cv                = cv,
    metric            = 'mean_absolute_error',
    suppress_warnings = True
)
elapsed_time_timesfm = time.perf_counter() - start
print(f"Backtesting completed in {elapsed_time_timesfm:.4f} seconds.")
print("Backtest metrics")
display(metrics_timesfm)
print("")
print("Backtest predictions")
backtest_predictions.head(4)
  0%|          | 0/168 [00:00<?, ?it/s]
Backtesting completed in 56.3338 seconds.
Backtest metrics
mean_absolute_error
0 160.357018
Backtest predictions
level fold pred
2014-07-16 00:00:00 Demand 0 6189.843750
2014-07-16 01:00:00 Demand 0 5988.112793
2014-07-16 02:00:00 Demand 0 5830.692383
2014-07-16 03:00:00 Demand 0 5696.288086

Moirai

A ForecasterFoundation is created using Salesforce's Moirai-2.0-R-small model.

# Create ForecasterFoundation
# ==============================================================================
estimator = FoundationModel(model_id="Salesforce/moirai-2.0-R-small", context_length=500)
forecaster = ForecasterFoundation(estimator=estimator)
# Train ForecasterFoundation
# ==============================================================================
forecaster.fit(series=data_train["Demand"])
forecaster

ForecasterFoundation

General Information
  • Model ID: Salesforce/moirai-2.0-R-small
  • Context length: 500
  • Window size: 500
  • Series names: Demand
  • Exogenous included: False
  • Creation date: 2026-04-24 16:17:01
  • Last fit date: 2026-04-24 16:17:01
  • Skforecast version: 0.22.0
  • Python version: 3.12.13
  • Forecaster id: None
Exogenous Variables

None

Training Information
  • Context range: 'Demand': ['2012-01-01 00:00:00', '2014-11-30 23:00:00']
  • Training index type: DatetimeIndex
  • Training index frequency:
Model Parameters
  • context_length: 500
  • device: auto

📖 API Reference    📝 User Guide

# Predictions: point forecast
# ==============================================================================
steps = 24
predictions = forecaster.predict(steps=steps)
predictions.head(3)
level pred
2014-12-01 00:00:00 Demand 5731.725098
2014-12-01 01:00:00 Demand 5870.827148
2014-12-01 02:00:00 Demand 5959.207031
# Predictions: intervals
# ==============================================================================
predictions_intervals = forecaster.predict_interval(
    steps    = steps,
    interval = [10, 90],  # 80% prediction interval
)
predictions_intervals.head(3)
level pred lower_bound upper_bound
2014-12-01 00:00:00 Demand 5731.725098 5517.737793 5940.646484
2014-12-01 01:00:00 Demand 5870.827148 5548.743164 6176.801270
2014-12-01 02:00:00 Demand 5959.207031 5599.376953 6323.206055
# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
         steps              = 24,
         initial_train_size = len(data.loc[:end_train]),
         refit              = False
     )
start = time.perf_counter()
metrics_moirai, backtest_predictions = backtesting_foundation(
    forecaster        = forecaster,
    series            = data['Demand'],
    cv                = cv,
    metric            = 'mean_absolute_error',
    suppress_warnings = True
)
elapsed_time_moirai = time.perf_counter() - start
print(f"Backtesting completed in {elapsed_time_moirai:.4f} seconds.")
print("Backtest metrics")
display(metrics_moirai)
print("")
print("Backtest predictions")
backtest_predictions.head(4)
  0%|          | 0/168 [00:00<?, ?it/s]
Backtesting completed in 9.3907 seconds.
Backtest metrics
mean_absolute_error
0 161.691106
Backtest predictions
level fold pred
2014-07-16 00:00:00 Demand 0 6222.097656
2014-07-16 01:00:00 Demand 0 6114.366699
2014-07-16 02:00:00 Demand 0 5969.839844
2014-07-16 03:00:00 Demand 0 5920.479492

TabICL

A ForecasterFoundation is created using Soda-Inria's TabICL model.

# Create ForecasterFoundation
# ==============================================================================
estimator = FoundationModel(model_id="soda-inria/tabicl", context_length=500)
forecaster = ForecasterFoundation(estimator=estimator)
# Train ForecasterFoundation
# ==============================================================================
forecaster.fit(
    series = data_train["Demand"], 
    exog   = data_train[["Temperature", "Holiday"]]
)
forecaster

ForecasterFoundation

General Information
  • Model ID: soda-inria/tabicl
  • Context length: 500
  • Window size: 500
  • Series names: Demand
  • Exogenous included: True
  • Creation date: 2026-04-24 16:17:14
  • Last fit date: 2026-04-24 16:17:14
  • Skforecast version: 0.22.0
  • Python version: 3.12.13
  • Forecaster id: None
Exogenous Variables

Temperature, Holiday

Training Information
  • Context range: 'Demand': ['2012-01-01 00:00:00', '2014-11-30 23:00:00']
  • Training index type: DatetimeIndex
  • Training index frequency:
Model Parameters
  • context_length: 500
  • point_estimate: mean
  • tabicl_config: None
  • temporal_features: None

📖 API Reference    📝 User Guide

# Predictions: point forecast
# ==============================================================================
steps = 24
predictions = forecaster.predict(
                  steps = steps,
                  exog  = data_test[["Temperature", "Holiday"]]
              )

predictions.head(3)
level pred
2014-12-01 00:00:00 Demand 5590.919922
2014-12-01 01:00:00 Demand 5662.487305
2014-12-01 02:00:00 Demand 5661.673828
# Predictions: intervals
# ==============================================================================
predictions_intervals = forecaster.predict_interval(
                            steps    = steps,
                            exog     = data_test[["Temperature", "Holiday"]],
                            interval = [10, 90],  # 80% prediction interval
                        )

predictions_intervals.head(3)
level pred lower_bound upper_bound
2014-12-01 00:00:00 Demand 5617.177246 5159.889160 5952.955078
2014-12-01 01:00:00 Demand 5678.133301 5215.843750 6059.711914
2014-12-01 02:00:00 Demand 5677.455078 5131.911133 6149.854492
# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
         steps              = 24,
         initial_train_size = len(data.loc[:end_train]),
         refit              = False
     )
start = time.perf_counter()
metrics_tabicl, backtest_predictions = backtesting_foundation(
    forecaster        = forecaster,
    series            = data['Demand'],
    exog              = data[["Temperature", "Holiday"]],
    cv                = cv,
    metric            = 'mean_absolute_error',
    suppress_warnings = True
)
elapsed_time_tabicl = time.perf_counter() - start
print(f"Backtesting completed in {elapsed_time_tabicl:.4f} seconds.")
print("Backtest metrics")
display(metrics_tabicl)
print("")
print("Backtest predictions")
backtest_predictions.head(4)

Model comparison

The following table summarizes the backtesting results (Mean Absolute Error) for the four foundation models on the same dataset.

# Comparison of backtesting metrics
# ==============================================================================
comparison = pd.DataFrame({
    "Model": [
        "Chronos-2 (small)*",
        "TimesFM-2.5 (200m)",
        "Moirai-2.0-R (small)",
        "TabICLv2*"
    ],
    "mean_absolute_error": [
        metrics_chronos["mean_absolute_error"].iloc[0],
        metrics_timesfm["mean_absolute_error"].iloc[0],
        metrics_moirai["mean_absolute_error"].iloc[0],
        metrics_tabicl["mean_absolute_error"].iloc[0]  # Reduced context length
    ],
    "Elapsed time": [
        elapsed_time_chronos,
        elapsed_time_timesfm,
        elapsed_time_moirai,
        elapsed_time_tabicl
    ]
})

display(
    comparison.style.highlight_min(subset="mean_absolute_error", color="green").format(precision=4)
)
print("* Chronos-2 (small) and TabICLv2 are the only models that support exogenous features.")

   Model                 mean_absolute_error  Elapsed time
0  Chronos-2 (small)*               171.2670        1.4381
1  TimesFM-2.5 (200m)               160.3570       56.3338
2  Moirai-2.0-R (small)             161.6911        9.3907
3  TabICLv2*                        170.1016      196.0960

* Chronos-2 (small) and TabICLv2 are the only models that support exogenous features.

⚠️ Warning

This example uses a widely available public dataset for illustrative purposes. It is highly probable that the foundation models (Chronos, TimesFM, Moirai ...) were exposed to these data points during their pre-training phase. As a result, the predictions may be more optimistic than what would be achieved in a real-world production environment with private or novel data.

Impact of the context length

Because foundation forecasting models are highly generalized, they lack intrinsic knowledge of your specific dataset. To compensate, they rely on a "context window", a specific period of recent historical data, to adapt to your scenario in real time. This context acts as the model's short-term memory, allowing it to calculate the current trajectory of your data and identify whether the series is trending upward, accelerating, or flattening out.

The length of this context window is absolutely critical for capturing seasonality and recurring events. To accurately predict a pattern, such as a weekly sales spike or a yearly cycle, the model must actually observe that pattern within the provided history. For instance, if your data has a 365-day seasonality, providing 400 days of context allows the model to recognize and project the cycle, whereas a 30-day window would cause the model to miss the pattern entirely, resulting in a flat or inaccurate forecast.
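The rule above can be expressed as a simple check (the function name is illustrative, not part of any library): a context window can only reveal a seasonal pattern if it spans at least one full cycle.

```python
# Sketch of the rule described above: a seasonal pattern is only visible
# to the model if the context window covers at least one full cycle.
def covers_seasonality(context_length: int, seasonal_period: int) -> bool:
    return context_length >= seasonal_period

print(covers_seasonality(400, 365))  # True: the yearly cycle is observed
print(covers_seasonality(30, 365))   # False: the cycle is never observed
```

In practice, providing two or more full cycles gives the model a chance to confirm that the pattern actually repeats rather than being a one-off event.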

However, increasing the context length to improve accuracy introduces a significant computational trade-off. Because most foundation models are built on Transformer architectures, the computational complexity of their attention mechanism scales quadratically ($O(N^2)$) with the length of the input sequence. Consequently, doubling the context window roughly quadruples the memory and compute required by attention, leading to substantially higher costs and slower inference times.
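The quadratic scaling is easy to see with back-of-the-envelope arithmetic: self-attention computes one score per (query, key) pair, so the score matrix has $N \times N$ entries. The numbers below are illustrative; real models add per-layer and per-head constant factors.

```python
# Illustration of quadratic attention scaling: one score per
# (query, key) pair means an N x N score matrix.
def attention_scores(context_length: int) -> int:
    return context_length ** 2

for n in (500, 1000, 2000):
    print(n, attention_scores(n))  # 500 -> 250_000, 1000 -> 1_000_000, 2000 -> 4_000_000

# Doubling the context quadruples the attention cost:
assert attention_scores(2000) == 4 * attention_scores(1000)
```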

Ultimately, effectively utilizing foundation models requires carefully evaluating this trade-off. The best practice is to analyze the context size and select the shortest possible window that still achieves high predictive performance. Striking this balance ensures accurate, pattern-aware forecasts while preventing the unnecessary waste of computational resources and data transfer bandwidth.

# Influence of the context length in the forecasting accuracy and speed
# ==============================================================================
context_lengths = [100, 500, 1000, 5000]
results_metrics = {
    'autogluon/chronos-2-small': [],
    'google/timesfm-2.5-200m-pytorch': [],
    'Salesforce/moirai-2.0-R-small': [],
    'soda-inria/tabicl': [],
}
results_elapsed_time = {
    'autogluon/chronos-2-small': [],
    'google/timesfm-2.5-200m-pytorch': [],
    'Salesforce/moirai-2.0-R-small': [],
    'soda-inria/tabicl': [],
}

for model_id in results_metrics.keys():

    if model_id in {'autogluon/chronos-2-small', 'soda-inria/tabicl'}:
        exog = data[["Temperature", "Holiday"]]
    else:
        exog = None
    
    for context_length in context_lengths:

        estimator = FoundationModel(model_id=model_id, context_length=context_length)
        forecaster = ForecasterFoundation(estimator=estimator)
        cv = TimeSeriesFold(
                steps              = 24,
                initial_train_size = len(data.loc[:end_train]),
                refit              = False
            )
        start = time.perf_counter()
        metrics, backtest_predictions = backtesting_foundation(
            forecaster        = forecaster,
            series            = data['Demand'],
            exog              = exog,
            cv                = cv,
            metric            = 'mean_absolute_error',
            suppress_warnings = True,
            show_progress     = False
        )
        elapsed_time = time.perf_counter() - start
        results_metrics[model_id].append(metrics.at[0, 'mean_absolute_error'])
        results_elapsed_time[model_id].append(elapsed_time)

results_metrics = pd.DataFrame(results_metrics, index=context_lengths)
results_elapsed_time = pd.DataFrame(results_elapsed_time, index=context_lengths)
# Plot results
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
results_metrics.plot(ax = ax)
ax.set_title("Prediction error vs context length")
ax.set_xlabel("context length")
ax.set_ylabel("mean absolute error")

fig, ax = plt.subplots(figsize=(7, 3))
results_elapsed_time.plot(ax = ax)
ax.set_title("Elapsed time vs context length")
ax.set_xlabel("context length")
ax.set_ylabel("Elapsed time");

For all four models, there is a massive drop in Mean Absolute Error (MAE) when increasing the context length from 100 to 500. However, pushing the context length further to 1000 or 5000 yields only marginal improvements. This indicates a "sweet spot" around 500-1000 where the models have enough historical data to capture the pattern, and feeding them more history doesn't significantly help.
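One way to operationalize the "shortest window that still performs well" advice is to accept any context length whose MAE is within a small tolerance of the best observed value, then pick the shortest. The MAE values below are illustrative, not taken from this notebook.

```python
import pandas as pd

# Illustrative MAE per context length (mirrors the shape of the results
# above: large drop from 100 to 500, marginal gains afterwards).
results = pd.Series(
    {100: 310.0, 500: 172.0, 1000: 168.0, 5000: 166.0},
    name="mean_absolute_error",
)
tolerance = 0.05  # accept up to 5% worse than the best MAE
best = results.min()
acceptable = results[results <= best * (1 + tolerance)]
shortest = acceptable.index.min()
print(shortest)  # -> 500 with these illustrative numbers
```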

autogluon/chronos-2-small and Salesforce/moirai-2.0-R-small exhibit excellent computational scaling: their execution time remains virtually flat and very low across all context lengths, making them highly efficient even with 5000 historical data points. google/timesfm-2.5-200m-pytorch shows a roughly linear increase in processing time as the context length grows, while soda-inria/tabicl scales worse than the other three.

Session information

import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
pandas              2.3.3
session_info        v1.0.1
skforecast          0.22.0
torch               2.6.0+cu124
-----
IPython             9.12.0
jupyter_client      8.8.0
jupyter_core        5.9.1
-----
Python 3.12.13 | packaged by conda-forge | (main, Mar  5 2026, 16:36:12) [MSC v.1944 64 bit (AMD64)]
Windows-11-10.0.26200-SP0
-----
Session information updated at 2026-04-24 17:03

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Forecasting with foundation models by Joaquín Amat Rodrigo and Javier Escobar Ortiz available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py79-forecasting-with-foundation-models.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.22.0). Zenodo. https://doi.org/10.5281/zenodo.8382787

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.22.0) [Computer software]. https://doi.org/10.5281/zenodo.8382787

BibTeX:

@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.22.0},
  month   = {04},
  year    = {2026},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}


Did you like the article? Your support is important

Your contribution will help me to continue generating free educational content. Many thanks! 😊

Become a GitHub Sponsor

Creative Commons Licence

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.