ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

¹Tel Aviv University, ²Lightricks, ³Adobe Research

TL;DR

We introduce ParetoSlider, a multi-objective RL framework that trains a single diffusion model to approximate the entire Pareto front, enabling users to continuously navigate competing reward trade-offs at inference time — such as photorealism vs. style, or prompt adherence vs. source preservation — without retraining or maintaining multiple checkpoints.

'A paper origami crane on a wooden table'
[Figure: ParetoSlider teaser. Slider: Photorealistic ↔ Sketch]

Abstract

Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals — such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By conditioning the model on continuously varying preference weights during RL alignment, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of separately tuned baselines while uniquely providing fine-grained, real-time control over competing generative goals.

Method

🎯 For each prompt and sampled \(\omega\), the policy generates \(K\) images, conditioning the denoising process on both the input and the user-specified preference vector.

⚖️ Each image is scored by \(M\) reward models and normalized into per-reward advantages, decoupling reward scales so the optimization faithfully respects \(\omega\).

🔧 A DiffusionNFT loss is computed per reward and aggregated with \(\omega\) before the gradient update, steering the model toward the desired Pareto-optimal trade-off.
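The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, \(\omega\) is assumed to lie on the probability simplex and to be drawn from a uniform Dirichlet prior, normalization is assumed to be a per-reward z-score over the \(K\) samples of one prompt, and the per-reward DiffusionNFT losses are replaced by stand-in numbers because the loss itself is not reproduced here.

```python
import numpy as np


def sample_preference(rng, num_rewards):
    """Draw a preference vector omega on the simplex (assumed uniform Dirichlet prior)."""
    return rng.dirichlet(np.ones(num_rewards))


def per_reward_advantages(rewards, eps=1e-8):
    """Z-score a (K, M) reward matrix column by column.

    Each column holds one reward model's scores for the K samples of a prompt.
    Normalizing per column removes scale differences between reward models,
    so the weights in omega are not distorted by raw reward magnitudes.
    """
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)


def aggregate_losses(per_reward_losses, omega):
    """Combine M per-reward scalar losses into one scalar, weighted by omega."""
    return float(np.dot(omega, per_reward_losses))


rng = np.random.default_rng(0)
K, M = 8, 2  # K samples per prompt, M reward models

# Hypothetical scores: reward 0 lives in [0, 1], reward 1 in [0, 100].
rewards = np.column_stack([rng.uniform(0, 1, K), rng.uniform(0, 100, K)])
advantages = per_reward_advantages(rewards)  # shape (K, M), comparable scales

omega = sample_preference(rng, M)

# Stand-in values; in training these would come from the per-reward
# DiffusionNFT loss evaluated with the matching advantage column.
per_reward_losses = np.array([0.42, 0.17])
total_loss = aggregate_losses(per_reward_losses, omega)
```

Aggregating after the per-reward loss, rather than summing raw rewards first, is what distinguishes this from early scalarization: a reward with a large numeric range cannot drown out the others before \(\omega\) is applied.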

[Figure: ParetoSlider method architecture]

Text-to-Image: Style Control

Drag the slider to continuously navigate between photorealism and different artistic styles.

'a lily flower in a glass vase on a table'
[Slider: Photorealistic ↔ Flat Vector]
'a lone traveler standing on a high cliff above the sea'
[Slider: Photorealistic ↔ Animation]
'a tiger walking through dense jungle leaves'
[Slider: Photorealistic ↔ Watercolor]

Text-to-Video: Style Control

Drag the slider to navigate from photorealism to an animation or sketch style.

'A cat walking in a kitchen'
[Slider: Photorealistic ↔ Animation]
'A close-up of an eye blinking'
[Slider: Photorealistic ↔ Animation]
'A turtle swimming'
[Slider: Photorealistic ↔ Sketch]
'A butterfly flying and landing on a flower'
[Slider: Photorealistic ↔ Sketch]

Image Editing: Preservation vs. Adherence

Drag the slider to navigate from full source preservation to full prompt adherence.

'Convert this portrait into an anime character'
[Input image; slider: Preserve ↔ Edit]
'Turn this woman into a warrior'
[Input image; slider: Preserve ↔ Edit]
'Turn this into a 3D-rendered Disney Pixar scene'
[Input image; slider: Preserve ↔ Edit]
'Change the style of this image to a Ghibli scene'
[Input image; slider: Preserve ↔ Edit]

Comparisons

[Figure: Pareto front comparison. Panels: ParetoSlider and FixedWeight, each spanning Realistic ← \(\omega = (\omega_{\mathrm{real}},\, \omega_{\mathrm{sketch}})\) → Sketch; FlowMulti (Ep. 100, Ep. 200, Ep. 300); Prompting (Realistic, Mix, Sketch)]

Pareto front and qualitative T2I comparison (SD3.5, Photorealism vs. Sketch) on the prompt "A chocolate cake with frosting on a stand." ParetoSlider traces a continuous trade-off curve by varying \(\omega\), dominating the FixedWeight, FlowMulti, and Prompting (prompt rewriting) baselines. FixedWeight requires a separate training run per point; FlowMulti produces a single static output; Prompting yields only three coarse points. None of the baselines supports continuous inference-time control.
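For readers unfamiliar with the term, "dominating" in the caption refers to Pareto dominance: one operating point dominates another if it is at least as good on every reward and strictly better on at least one. The sketch below shows that check and a simple non-dominated filter. The function names and the (realism, sketch) score pairs are hypothetical and chosen only for illustration; they are not values from the comparison above.

```python
import numpy as np


def dominates(a, b):
    """True if a is >= b on every reward and strictly > on at least one."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a >= b) and np.any(a > b))


def pareto_front(points):
    """Return the indices of points that no other point dominates."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            dominates(q, p) for j, q in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep


# Hypothetical (realism reward, sketch reward) pairs; higher is better.
points = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.5, 0.5), (0.8, 0.05)]
front = pareto_front(points)  # indices of the non-dominated points
```

In this toy set, (0.5, 0.5) is dominated by (0.6, 0.6) and (0.8, 0.05) is dominated by (0.9, 0.1), so only the first three points remain on the front. The quadratic loop is fine for a handful of evaluation points; larger sets would call for a sort-based sweep.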
