Summary: Efficiently Modeling Long Sequences with Structured State Spaces
1. What is the paper about?
Proposes S4 (Structured State Space Sequence Model), which makes State Space Models (SSMs) practical for very long sequences.
Unifies three views of SSMs (continuous-time, recurrent/RNN-like, and convolutional) and shows how to compute them efficiently and stably.
Targets long-range dependencies (LRDs) using HiPPO-based state matrices while achieving near-linear time and memory.
2. What is new about this specific paper, compared to prior work?
NPLR parameterization decomposes the HiPPO matrix as Normal Plus Low-Rank ($A = V\Lambda V^* - P Q^*$), enabling well-conditioned diagonalization (see the first sketch after this list).
Frequency-domain kernel: the SSM convolution kernel is computed by evaluating a truncated generating function at the roots of unity (Lemma C.3); see the second sketch after this list.
The Woodbury identity (Algorithm 1, step 3) and a reduction to Cauchy kernels (Algorithm 1, step 2) handle the low-rank correction, bringing the overall cost down to near-linear.
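To make the NPLR idea concrete, here is a minimal NumPy sketch (not the authors' code; `make_hippo_legs` and `make_nplr` are hypothetical helper names) that builds the HiPPO-LegS matrix and recovers its Normal Plus Low-Rank form; for LegS the correction is rank 1 with $Q = P$:

```python
import numpy as np

def make_hippo_legs(N):
    # HiPPO-LegS matrix: A[n,k] = -sqrt(2n+1)*sqrt(2k+1) for n > k,
    # A[n,n] = -(n+1), and 0 above the diagonal.
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A

def make_nplr(N):
    # Adding the rank-1 term P P^T to A yields a normal matrix
    # (skew-symmetric minus I/2), so the eigendecomposition
    # A + P P^T = V diag(Lam) V^{-1} is well conditioned.
    A = make_hippo_legs(N)
    P = np.sqrt(np.arange(N) + 0.5)   # rank-1 correction vector
    S = A + np.outer(P, P)            # normal part of A
    Lam, V = np.linalg.eig(S)
    return Lam, V, P
```

A quick round-trip check: with `Lam, V, P = make_nplr(8)`, the reconstruction `V @ np.diag(Lam) @ np.linalg.inv(V) - np.outer(P, P)` should match `make_hippo_legs(8)` up to numerical error.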
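Likewise, a naive sketch of the frequency-domain kernel computation, assuming already-discretized `Abar`, `Bbar`, `C` (the hypothetical `ssm_kernel_naive` below keeps the expensive linear solves; the paper's Algorithm 1 replaces them with Woodbury + Cauchy to reach near-linear cost):

```python
import numpy as np

def ssm_kernel_naive(Abar, Bbar, C, L):
    # Abar: (N, N); Bbar, C: (N,) vectors (single-input single-output).
    # Computes K[l] = C @ Abar^l @ Bbar for l = 0..L-1 by evaluating the
    # truncated generating function at the L-th roots of unity, where
    #   Khat(z) = C (I - Abar^L) (I - Abar z)^{-1} Bbar,
    # then inverse-FFTing the evaluations back to the kernel taps.
    N = Abar.shape[0]
    Ctil = C @ (np.eye(N) - np.linalg.matrix_power(Abar, L))
    z = np.exp(-2j * np.pi * np.arange(L) / L)   # L-th roots of unity
    Khat = np.array([Ctil @ np.linalg.solve(np.eye(N) - Abar * zk, Bbar)
                     for zk in z])
    return np.fft.ifft(Khat).real
```

Sanity check: for small `L`, the output should match the direct unrolling `[C @ np.linalg.matrix_power(Abar, l) @ Bbar for l in range(L)]`.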
3. What experiments were run to support the arguments in this paper?
S4 trains up to ~30× faster and uses ~400× less memory than its predecessor LSSL, with speed and memory usage comparable to efficient Transformers such as Performer and Linear Transformer.
S4 attains SOTA on all 6 Long Range Arena (LRA) tasks, including solving Path-X (sequence length 16,384), where prior models failed.
98.3% accuracy on raw speech classification (SC10), operating directly on 16,000-sample waveforms.
20.95 perplexity on WikiText-103, with ~60× faster autoregressive generation than a comparable Transformer.
Ablations on CIFAR-10 (≤100K parameters) show that a random NPLR matrix alone is not enough; HiPPO initialization combined with NPLR (full S4) performs best.
4. What are the shortcomings/limitations of this paper?
Requires specialized kernels (Cauchy multiplications, FFTs, NPLR machinery), making it harder to implement and optimize than standard convolutions or attention.
Although strong, S4 does not surpass top Transformers on large-scale language modeling.
Many vision results treat images as 1-D sequences; the lack of a native 2-D inductive bias can be suboptimal for some vision tasks.
The choices of state size $N$, step size $\Delta$, and HiPPO variant remain hyperparameter-heavy, with limited guidance on automatic selection.
5. What is a reasonable next step to build upon this paper?
Combine S4 with local/global attention or convolutions to build hybrid models.
Pretrain larger S4 backbones on language/audio with modern recipes to test competitiveness at scale.
Design 2-D/ND SSM kernels (avoiding flattening) for vision, video, and spatiotemporal forecasting.
Develop optimized GPU/TPU kernels for the Cauchy multiplication, plus memory-efficient recurrence for long-context decoding.
Appendix
- SSM (State Space Model): A linear dynamical system that maps an input $u_t$ to an output $y_t$ via a hidden state $x_t$ using matrices $A, B, C$ (and optionally $D$); see the equations after this list
- HiPPO: A family of specially structured $A$ matrices that provably compress and track recent history, giving SSMs strong long-range memory
- Path-X: The hardest LRA task requiring reasoning over a flattened image (length 16,384) to decide if two markers are connected
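For reference, the continuous-time SSM from the definition above, together with the bilinear discretization the paper uses to obtain the recurrent view (step size $\Delta$):

$$
x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)
$$

$$
\bar{A} = \left(I - \tfrac{\Delta}{2}A\right)^{-1}\left(I + \tfrac{\Delta}{2}A\right), \qquad \bar{B} = \left(I - \tfrac{\Delta}{2}A\right)^{-1}\Delta B, \qquad \bar{C} = C
$$

$$
x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = \bar{C}\,x_k
$$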