DTO-KD: Dynamic Trade-off Optimization for Effective Knowledge Distillation
Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi, Richard Hartley
Abstract
Knowledge Distillation (KD) is a widely adopted framework for compressing large models into compact student models by transferring knowledge from a high-capacity teacher. Despite its success, KD presents two persistent challenges: (1) the trade-off between optimizing for the primary task loss and mimicking the teacher's outputs, and (2) the gradient disparity arising from architectural and representational mismatches between teacher and student models. In this work, we propose Dynamic Trade-off Optimization for Knowledge Distillation (DTO-KD), a principled multi-objective optimization formulation of KD that dynamically balances task and distillation losses at the gradient level. Specifically, DTO-KD resolves two critical issues in gradient-based KD optimization: (i) gradient conflict, where task and distillation gradients are directionally misaligned, and (ii) gradient dominance, where one objective suppresses learning progress on the other. Our method adapts per-iteration trade-offs by leveraging gradient projection techniques to ensure balanced and constructive updates. We evaluate DTO-KD on large-scale benchmarks including ImageNet-1K for classification and COCO for object detection. Across both tasks, DTO-KD consistently outperforms prior KD methods, yielding state-of-the-art accuracy and improved convergence behavior. Furthermore, student models trained with DTO-KD exceed the performance of their non-distilled counterparts, demonstrating the efficacy of our multi-objective formulation for KD.
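To make the gradient-level mechanics concrete, here is a minimal PyTorch sketch of resolving gradient conflict by projection, in the spirit of PCGrad (Yu et al., 2020). The function name `resolve_conflict` and the exact projection rule are our assumptions for illustration, not DTO-KD's published algorithm.

```python
import torch

def resolve_conflict(g_task: torch.Tensor, g_kd: torch.Tensor) -> torch.Tensor:
    """Project away the conflicting component of the distillation gradient.

    If the flattened task and distillation gradients are directionally
    misaligned (negative inner product), subtract from g_kd its component
    along g_task so the combined update no longer opposes the task
    objective. This is standard PCGrad-style projection, used here as an
    illustrative stand-in for DTO-KD's projection rule.
    """
    dot = torch.dot(g_task, g_kd)
    if dot < 0:  # gradient conflict: the two objectives pull in opposing directions
        g_kd = g_kd - (dot / g_task.dot(g_task).clamp_min(1e-12)) * g_task
    return g_kd
```

After projection, g_task and the modified g_kd have a non-negative inner product, so summing them can no longer cancel progress on the task loss.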
DTO-KD uses multi-objective optimization to dynamically balance the task and distillation losses at the gradient level for more effective knowledge distillation.
- Proposes a principled multi-objective optimization formulation of knowledge distillation that dynamically balances the two losses at the gradient level
- Resolves gradient conflict, where the task and distillation gradients are directionally misaligned
- Addresses gradient dominance, where one objective suppresses learning progress on the other, using gradient projection (see the sketch after this list)
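The per-iteration trade-off itself can also be sketched. A standard baseline for balancing two objectives without either dominating is the two-task closed form of MGDA (Sener & Koltun, 2018), which picks the minimum-norm point in the convex hull of the two gradients. The sketch below assumes DTO-KD's dynamic weight resembles such a convex combination; the paper's actual rule may differ.

```python
import torch

def min_norm_tradeoff(g_task: torch.Tensor, g_kd: torch.Tensor) -> float:
    """Closed-form min-norm weighting for two objectives (two-task MGDA).

    Returns alpha in [0, 1] such that the combined update
        g = alpha * g_task + (1 - alpha) * g_kd
    has minimal norm over the convex hull of the two gradients. When the
    min-norm point is nonzero, -g is (to first order) a descent direction
    for both losses, so neither objective dominates the update -- the
    "gradient dominance" issue the paper targets. This is an illustrative
    baseline, not DTO-KD itself.
    """
    diff = g_kd - g_task
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = diff.dot(g_kd) / denom  # unconstrained minimizer of ||alpha*g1 + (1-alpha)*g2||^2
    return float(alpha.clamp(0.0, 1.0))

# Hypothetical per-iteration use inside a training loop:
# g_task = torch.cat([p.grad.flatten() for p in model.parameters()])  # after task-loss backward
# g_kd   = ...                                                        # after distillation-loss backward
# alpha  = min_norm_tradeoff(g_task, g_kd)
# g      = alpha * g_task + (1 - alpha) * g_kd
```

Clamping alpha to [0, 1] handles the boundary cases where one gradient alone is already the minimum-norm choice.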
- Multi-objective optimization
- Gradient projection
- Knowledge distillation
- ImageNet-1K
- COCO
Data availability is a bottleneck; extending to data-free settings remains an open challenge, particularly for distilling from large pre-trained models (from the paper).
The authors did not state explicit future directions.
Author keywords
- Knowledge Distillation
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and depth-ray targets.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.