[Paper Review] Chronos: Learning the Language of Time Series

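The starting point of the research is as follows.

Language models (decoder-only models) use next-token prediction to predict the next word from the preceding (historical) words. Time-series forecasting likewise predicts future values from historical data. If time-series data were turned into the kind of tokens an LM predicts and fed into an LM, could time-series forecasting be carried out just as readily?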

1. Scaling of the time series used as context (mean scaling)

2. Context tokens (tokenization)

3. Language Model (input : context token)

4. Probabilistic Forecast

Scaling

Normalization in the Chronos model: maps the time series into a range suitable for quantization.

A common normalization technique applies an affine transformation to the time series:

$$\widetilde{x}_i = \frac{x_i - m}{s}$$

The paper applies mean scaling.

Each entry of the time series is divided by the mean of the absolute values in the historical context:

$$ m=0, \ \ \ s=\frac{1}{C}\sum_{i=1}^{C} |x_i|$$
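As a rough illustration of mean scaling, here is a minimal NumPy sketch. The function name `mean_scale` and the fallback to $s=1$ for an all-zero context are my own assumptions, not something taken from the paper's code.

```python
import numpy as np

def mean_scale(context, eps=1e-8):
    """Divide a series by the mean absolute value of its historical context (m = 0)."""
    context = np.asarray(context, dtype=float)
    s = np.mean(np.abs(context))
    if s < eps:
        s = 1.0  # assumption: avoid dividing by zero for an all-zero context
    return context / s, s

context = [10.0, -20.0, 30.0, 40.0]
scaled, s = mean_scale(context)
print(s)       # scale computed from the context
print(scaled)  # scaled values that are later quantized
```

The scale $s$ has to be kept around, since forecasts in the scaled space must be multiplied by $s$ to return to the original units.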

Quantization

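Scaled time series: $\widetilde{x}_{1:C+H} = [\widetilde{x}_1,...,\widetilde{x}_C,...,\widetilde{x}_{C+H}]$

However, the values are still real-valued and cannot be fed to an LM directly, so quantization has to be applied.

Select $B$ bin centers $c_1<...<c_B$ on the real line

and $B-1$ edges $b_i$ separating them, with $c_i<b_i<c_{i+1}$ for $i \in \{1,...,B-1\}$.

The quantization function $q$ maps a real value to a bin index, and the dequantization function $d$ maps an index back to its bin center:

$$q(x) = \begin{cases} 1 & \text{if } -\infty \le x < b_1 \\ 2 & \text{if } b_1 \le x < b_2 \\ \vdots \\ B & \text{if } b_{B-1} \le x < \infty \end{cases} \qquad d(j) = c_j$$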

The bin centers and edges can be either data-dependent or uniform.

Quantile binning is data-dependent: it exploits the cumulative distribution function (CDF) of the training data.

Uniform binning selects bin centers uniformly within some interval $[l,r]$.

Because the value distribution of unseen downstream tasks can differ from the training distribution, the paper uses uniform binning.

Potential limitation: the prediction range is restricted to $[c_1, c_B]$.
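Uniform binning with quantization and dequantization can be sketched as follows. This is only an illustration: the helper names, the tiny interval $[-2, 2]$ with $B=5$, and placing each edge at the midpoint between adjacent centers are assumptions chosen to keep the example readable.

```python
import numpy as np

def uniform_bins(low, high, num_bins):
    """Bin centers spread uniformly over [low, high]; edges at the midpoints (assumption)."""
    centers = np.linspace(low, high, num_bins)
    edges = (centers[:-1] + centers[1:]) / 2.0
    return centers, edges

def quantize(x, edges):
    """Map real values to token ids 1..B: id j means b_{j-1} <= x < b_j."""
    return np.searchsorted(edges, x, side="right") + 1

def dequantize(tokens, centers):
    """Map token ids 1..B back to their bin centers."""
    return centers[np.asarray(tokens) - 1]

centers, edges = uniform_bins(-2.0, 2.0, 5)
x = np.array([0.4, -0.8, 1.2, 7.0])  # 7.0 lies far outside [c_1, c_B]
tokens = quantize(x, edges)
recon = dequantize(tokens, centers)
print(tokens)
print(recon)
```

The last value shows the limitation noted above: anything beyond the outermost center is reconstructed as $c_B$ (or $c_1$ on the other side).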

On top of the time series tokens $\{1,2,...,B\}$, the vocabulary $\nu_{ts}$ includes two special tokens: PAD and EOS.

Chronos ignores time and frequency information and treats the "time series" simply as a sequence.

Objective Function

Chronos is trained to minimize the cross entropy between the distribution of the quantized ground truth label and the predicted distribution:

$$\ell(\theta) = -\sum_{h=1}^{H+1} \sum_{i=1}^{|\nu_{ts}|} \mathbf{1}_{(z_{C+h+1}=i)} \log p_\theta(z_{C+h+1} = i \mid z_{1:C+h})$$

where $p_\theta(z_{C+h+1} = i \mid z_{1:C+h})$ denotes the categorical distribution predicted by the model parameterized by $\theta$. In practice, the loss is averaged over a batch of time series during training.

Note that $\ell(\theta)$ is not a distance-aware objective function: it does not explicitly recognize that bin $i$ is closer to bin $i+1$ than to bin $i+2$.

Instead, the model is expected to associate nearby bins together, based on the distribution of bin indices in the training dataset. In other words, Chronos performs regression via classification.
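A toy example makes the "not distance-aware" point concrete. The two predicted distributions below are made up for illustration: both put probability 0.5 on the true bin, but one places the remaining mass mostly on a neighboring bin and the other on a distant bin.

```python
import numpy as np

def cross_entropy(probs, target):
    """Cross entropy for one token; `target` is a 0-based bin index."""
    return -np.log(probs[target])

def expected_bin_distance(probs, target):
    """Average |bin - target| under the predicted distribution."""
    bins = np.arange(len(probs))
    return float(np.sum(probs * np.abs(bins - target)))

target = 2
near = np.array([0.05, 0.05, 0.50, 0.35, 0.05])  # leftover mass mostly on the adjacent bin
far = np.array([0.35, 0.05, 0.50, 0.05, 0.05])   # same leftover mass moved two bins away

loss_near = cross_entropy(near, target)
loss_far = cross_entropy(far, target)
dist_near = expected_bin_distance(near, target)
dist_far = expected_bin_distance(far, target)
print(loss_near, loss_far)  # identical cross-entropy values
print(dist_near, dist_far)  # but different expected distance to the true bin
```

The loss only sees the probability assigned to the correct bin, so any notion of closeness between bins has to be learned from the data.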

Opting for a categorical output distribution offers two key advantages.

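1. No changes to the LM architecture or training objective are needed.

2. No restrictions are imposed on the structure of the output distribution, so the model can learn arbitrary distributions.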

Data Augmentation

TSMixup : Time Series Mixup

Mixup: a data augmentation scheme proposed for image classification. It generates convex combinations (linear combinations whose coefficients sum to 1) of random pairs of images, which alleviates memorization and overfitting issues.

TSMixup brings this idea to the time series domain.

It generalizes the Mixup idea to more than two data points.

The combination weights $[\lambda_1, \lambda_2 ,..., \lambda_k ]$ are sampled from a symmetric Dirichlet distribution.
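The mixing step can be sketched roughly as below. The function name `ts_mixup`, the values `max_k=3` and `alpha=1.5`, the equal-length toy series, and the mean scaling of each series before mixing are assumptions made for this illustration rather than a reproduction of the authors' implementation.

```python
import numpy as np

def ts_mixup(pool, rng, max_k=3, alpha=1.5):
    """Convex combination of k randomly chosen, equal-length series."""
    k = int(rng.integers(1, min(max_k, len(pool)) + 1))  # k drawn uniformly from 1..max_k
    idx = rng.choice(len(pool), size=k, replace=False)
    # Symmetric Dirichlet weights: non-negative and summing to 1.
    weights = np.ones(1) if k == 1 else rng.dirichlet(alpha * np.ones(k))
    mixed = np.zeros(len(pool[idx[0]]))
    for w, i in zip(weights, idx):
        x = np.asarray(pool[i], dtype=float)
        scale = np.mean(np.abs(x))
        if scale == 0.0:
            scale = 1.0
        mixed += w * x / scale  # scale first so no single series dominates
    return mixed, weights

rng = np.random.default_rng(0)
t = np.arange(50)
pool = [np.sin(t / 5.0), 0.1 * t + 1.0, np.cos(t / 3.0) + 2.0]
mixed, weights = ts_mixup(pool, rng)
print(weights, mixed[:5])
```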

KernelSynth: Synthetic Data Generation using Gaussian Processes

TSMixup is good for pattern diversity, but it does not help much with generalization.

KernelSynth is a method to generate synthetic time series using Gaussian processes (GPs).

It is inspired by the Automatic Statistician, where a space of GP kernels is used to explain the structure of a time series.

The paper applies the idea in reverse: GP kernels are randomly composed to generate new time series.

Gaussian Process

A distribution defined by a mean function $m(t)$ and a positive definite kernel $k(t, t')$, where $t \in \mathbb{R}$ is the domain.

Kernel: a covariance function that defines the joint variability of the function values at an arbitrary pair of points $(t,t')$ in the input domain.

Different patterns can be generated through the choice of kernel.

A kernel bank $\mathcal{K}$ defines the basic time series patterns:

a linear kernel for trend
an RBF kernel for smooth local variation
periodic kernels for the seasonalities found at typical time series frequencies

The final kernel, $\widetilde{k}(t,t')$, is constructed by sampling $j \sim \mathcal{U}\{1, J\}$ kernels from $\mathcal{K}$ with replacement and combining these kernels via random binary operations, $+$ or $\times$.

A synthetic time series is generated by drawing a sample of length $l_{syn}$ from the GP prior $\mathcal{GP}(m(t)=0,\widetilde{k}(t,t'))$.
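The generation procedure can be sketched in plain NumPy as follows. The particular kernels in the bank, their length scales and periods, the input grid on $[0, 1]$, the small jitter on the diagonal, and the values of `max_kernels` and `length` are all assumptions chosen for a compact example.

```python
import numpy as np

def linear_kernel(t, s):
    return 1.0 + np.outer(t, s)  # trend

def rbf_kernel(lengthscale):
    def k(t, s):
        d = t[:, None] - s[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)  # smooth local variation
    return k

def periodic_kernel(period):
    def k(t, s):
        d = t[:, None] - s[None, :]
        return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2)  # seasonality
    return k

def kernel_synth(length, rng, max_kernels=5):
    """Draw one synthetic series from a zero-mean GP with a randomly composed kernel."""
    bank = [linear_kernel, rbf_kernel(0.1), rbf_kernel(1.0),
            periodic_kernel(0.25), periodic_kernel(0.5)]
    t = np.linspace(0.0, 1.0, length)
    j = int(rng.integers(1, max_kernels + 1))         # number of kernels to combine
    picks = rng.choice(len(bank), size=j, replace=True)  # sampling with replacement
    cov = bank[picks[0]](t, t)
    for p in picks[1:]:
        other = bank[p](t, t)
        # Sums and elementwise products of valid kernels are again valid kernels.
        cov = cov + other if rng.random() < 0.5 else cov * other
    cov = cov + 1e-6 * np.eye(length)                 # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(length), cov)

rng = np.random.default_rng(0)
series = kernel_synth(length=64, rng=rng)
print(series[:5])
```

Calling `kernel_synth` repeatedly with different random states yields series with different mixes of trend, smoothness, and periodicity.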

Experiment

Datasets : 55 datasets from multiple sources

(13 datasets) exclusively used for training
(15 datasets) Benchmark1 : used for both training and evaluation (in-domain)
(27 datasets) Benchmark2 : used solely for evaluation (zero-shot evaluation)

Result on Benchmark1 : in-domain Results

WQL : Weighted Quantile Loss

Measures the compatibility between the predictive distribution and the ground-truth observation at a uniformly spaced grid of quantile levels.

MASE : Mean Absolute Scaled Error

The absolute error of the forecast, scaled by the historical seasonal error of the time series.
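Both metrics can be written down in a few lines. The sketch below follows the usual textbook definitions (quantile loss normalized by the sum of absolute targets, and mean absolute error divided by the in-sample seasonal naive error). The function names, the three quantile levels, and the toy numbers are assumptions for illustration, so the exact evaluation setup of the paper may differ.

```python
import numpy as np

def quantile_loss(q_pred, x, alpha):
    """Pinball loss of a predicted alpha-quantile against the observations."""
    diff = x - q_pred
    return np.where(diff > 0, alpha * diff, (alpha - 1.0) * diff)

def weighted_quantile_loss(quantile_preds, x, levels):
    """Average over quantile levels of 2 * sum(pinball loss) / sum(|x|)."""
    x = np.asarray(x, dtype=float)
    denom = np.sum(np.abs(x))
    per_level = [2.0 * np.sum(quantile_loss(np.asarray(q, dtype=float), x, a)) / denom
                 for q, a in zip(quantile_preds, levels)]
    return float(np.mean(per_level))

def mase(forecast, x, context, season=1):
    """Mean absolute forecast error scaled by the in-sample seasonal naive error."""
    context = np.asarray(context, dtype=float)
    seasonal_error = np.mean(np.abs(context[season:] - context[:-season]))
    abs_error = np.mean(np.abs(np.asarray(x, dtype=float) - np.asarray(forecast, dtype=float)))
    return float(abs_error / seasonal_error)

context = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [6.0, 7.0]                       # ground truth over the horizon
levels = [0.1, 0.5, 0.9]
quantile_preds = [[5.0, 6.0], [6.0, 7.0], [7.0, 8.0]]
wql = weighted_quantile_loss(quantile_preds, x, levels)
m = mase(forecast=[5.0, 7.0], x=x, context=context)
print(wql, m)
```

Lower is better for both; a MASE below 1 means the forecast beats the seasonal naive baseline measured on the history.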

Even though Moirai-1.0-R (Large) was trained on a much larger corpus, it performs worse than Chronos-T5 (Mini).

Models trained on multiple datasets benefit more than task-specific models!

Result on Benchmark2 : zero-shot Results

Chronos performs on par with the best task-specific deep learning models.

Analysis of Hyperparameters

Initialization

Figure 8 shows the training loss curves for models initialized with language model weights and with random weights. The randomly initialized models converged better.

With LM weights the loss decreases faster at first, but the randomly initialized model ends up with the better result.

Personal speculation) The whole idea was to use time series data as language model tokens, so why do the LM weights perform worse? Perhaps the time series tokens fall short of playing the same role as language tokens, or the augmented and synthesized data could not provide genuinely historical information, which left the LM weights at a disadvantage.

LM weights are not remarkable for time-series forecasting and bring no improvement.

TSMixup Augmentations

For in-domain performance, whether or not TSMixup is used does not seem to matter much.

Zero-shot performance improves meaningfully: TSMixup increases data diversity, which improves performance on unseen data.

Synthetic Data Proportion

The improvement is largest when synthetic data makes up about 10% of the training data.

Performance drops at higher proportions because data generated by Gaussian processes is not representative of real-world time series.

the others

Qualitative Analysis and Limitations


Chronos predicts linear trends accurately but struggles with exponential trends (Fig. 12b).

Potential resolution: apply logarithmic scaling before feeding the time series to the Chronos model.
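Why a log transform helps can be shown with a stand-in forecaster. The sketch below does not call Chronos at all: `linear_extrapolate` is a hypothetical placeholder that only extends straight lines, used to show that an exponential trend becomes linear in log space. The wrapper assumes strictly positive values.

```python
import numpy as np

def linear_extrapolate(context, horizon):
    """Placeholder forecaster (not Chronos): fit a line and extend it."""
    t = np.arange(len(context))
    slope, intercept = np.polyfit(t, context, 1)
    future_t = np.arange(len(context), len(context) + horizon)
    return slope * future_t + intercept

def forecast_in_log_space(context, horizon, forecaster):
    """Forecast log-values, then map the forecast back with exp (positive data only)."""
    log_context = np.log(np.asarray(context, dtype=float))
    return np.exp(forecaster(log_context, horizon))

context = 2.0 * np.exp(0.1 * np.arange(30))   # exponential trend
truth = 2.0 * np.exp(0.1 * np.arange(30, 35))
direct = linear_extrapolate(context, 5)
logged = forecast_in_log_space(context, 5, linear_extrapolate)
print(np.abs(direct - truth).mean(), np.abs(logged - truth).mean())
```

The same wrapper pattern would apply to any forecaster that handles linear trends well, which is the behavior reported for Chronos here.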

When the context length is short, forecasting does not work well.

Chronos-T5 stands out most when the complexity increases, as with AR(3) and AR(4) processes.

→ The Chronos model can recognize fundamental patterns present in time-series data.

Review

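It was interesting that, rather than treating time series data as time series per se, the data is processed through scaling and quantization into a form that can be fed to an LM.

Unlike other time series work, no time and frequency information such as day-of-week or week-of-year is added; the time series itself is treated as a sequence.

Framing forecasting as next-token prediction felt fresh in itself, and I think it opened the door to applying LLMs directly to the time series field.

For data augmentation, instead of simply flipping the history, the paper uses sampling from GPs for generalization and TSMixup for pattern diversity. This was the first time I had seen either, and I think they can help overcome data scarcity, a classic limitation of time series.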

RFS : River From Scratch