A deep dive into the STARFlow architecture, which combines the expressiveness of autoregressive transformers with the efficiency of normalizing flows.
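To make the "autoregressive plus flow" combination concrete, here is a minimal sketch of a single affine autoregressive flow step in NumPy. It is not STARFlow's actual code. The conditioner here is a pair of strictly lower-triangular linear maps standing in for a causal transformer, and the names `make_conditioner`, `forward`, and `inverse` are hypothetical. It shows the general pattern: the data-to-latent direction runs over all positions in parallel with an exact log-determinant, while the latent-to-data direction (sampling) fills in positions one at a time.

```python
import numpy as np

def make_conditioner(dim, seed=0):
    """Toy stand-in for a causal transformer (an assumption for illustration).

    Strictly lower-triangular weights guarantee that output i depends
    only on inputs with index < i, which is the autoregressive property.
    """
    rng = np.random.default_rng(seed)
    mask = np.tril(np.ones((dim, dim)), k=-1)
    w_mu = rng.normal(scale=0.5, size=(dim, dim)) * mask
    w_alpha = rng.normal(scale=0.1, size=(dim, dim)) * mask
    return w_mu, w_alpha

def forward(x, w_mu, w_alpha):
    """Data -> latent for all positions at once.

    Returns z and log|det dz/dx|. The Jacobian is triangular with
    diagonal exp(-alpha_i), so the log-determinant is -sum(alpha).
    """
    mu = x @ w_mu.T        # mu[:, i] depends only on x[:, :i]
    alpha = x @ w_alpha.T  # log-scale, same dependency structure
    z = (x - mu) * np.exp(-alpha)
    log_det = -alpha.sum(axis=-1)
    return z, log_det

def inverse(z, w_mu, w_alpha):
    """Latent -> data (sampling). Sequential: x_i needs x_<i first."""
    x = np.zeros_like(z)
    for i in range(z.shape[-1]):
        mu_i = x @ w_mu[i]        # only already-filled positions contribute
        alpha_i = x @ w_alpha[i]
        x[..., i] = z[..., i] * np.exp(alpha_i) + mu_i
    return x

if __name__ == "__main__":
    w_mu, w_alpha = make_conditioner(dim=6)
    x = np.random.default_rng(1).normal(size=(4, 6))
    z, log_det = forward(x, w_mu, w_alpha)
    x_rec = inverse(z, w_mu, w_alpha)
    print("max reconstruction error:", np.abs(x - x_rec).max())
```

The round trip recovers the input up to floating-point error, which is what makes exact likelihood training possible: the density of `x` is the base density of `z` plus `log_det`.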
STARFlow-V extends this to video. Video is harder because of temporal consistency: each frame must follow coherently from the frames before it.
The transformer attends to both spatial patches (within a frame) and temporal patches (across frames).
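One way to picture attention over both axes is as a single boolean mask over a flattened sequence of (frame, patch) tokens. The sketch below is an illustration only. The frame-major token ordering, the choice to make attention causal across frames, and the helper name `spatiotemporal_mask` are all assumptions, not details confirmed for STARFlow-V.

```python
import numpy as np

def spatiotemporal_mask(num_frames, patches_per_frame, causal_in_time=True):
    """Boolean mask where mask[q, k] is True if query token q may attend to key k.

    Tokens are assumed to be laid out frame by frame (frame-major order).
    Every token sees all patches in its own frame (spatial attention).
    Across frames (temporal attention) it sees either only earlier frames
    (causal_in_time=True) or all other frames (causal_in_time=False).
    """
    frame_id = np.repeat(np.arange(num_frames), patches_per_frame)
    same_frame = frame_id[:, None] == frame_id[None, :]
    if causal_in_time:
        other_frames = frame_id[None, :] < frame_id[:, None]
    else:
        other_frames = ~same_frame
    return same_frame | other_frames

if __name__ == "__main__":
    mask = spatiotemporal_mask(num_frames=3, patches_per_frame=2)
    print(mask.astype(int))
```

With three frames of two patches each, the printed mask is block lower-triangular: full 2x2 blocks on the diagonal for within-frame attention, and full blocks below the diagonal for attention to earlier frames.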
Because a flow samples by inverting a learned transformation rather than by running a long iterative denoising loop, STARFlow-V needs fewer sampling steps than comparable diffusion models, which can translate into faster video generation.