DTO-KD: Dynamic Trade-off Optimization for Effective Knowledge Distillation
Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi, Richard Hartley
Abstract
Knowledge Distillation (KD) is a widely adopted framework for compressing large models into compact student models by transferring knowledge from a high-capacity teacher. Despite its success, KD presents two persistent challenges: (1) the trade-off between optimizing for the primary task loss and mimicking the teacher's outputs, and (2) the gradient disparity arising from architectural and representational mismatches between teacher and student models. In this work, we propose Dynamic Trade-off Optimization for Knowledge Distillation (DTO-KD), a principled multi-objective optimization formulation of KD that dynamically balances task and distillation losses at the gradient level. Specifically, DTO-KD resolves two critical issues in gradient-based KD optimization: (i) gradient conflict, where task and distillation gradients are directionally misaligned, and (ii) gradient dominance, where one objective suppresses learning progress on the other. Our method adapts per-iteration trade-offs by leveraging gradient projection techniques to ensure balanced and constructive updates. We evaluate DTO-KD on large-scale benchmarks including ImageNet-1K for classification and COCO for object detection. Across both tasks, DTO-KD consistently outperforms prior KD methods, yielding state-of-the-art accuracy and improved convergence behavior. Furthermore, student models trained with DTO-KD exceed the performance of their non-distilled counterparts, demonstrating the efficacy of our multi-objective formulation for KD.
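To make the gradient-level mechanics concrete, here is a minimal PyTorch sketch of resolving gradient conflict by projection, in the spirit of PCGrad (Yu et al., 2020). The function name `resolve_conflict` and the exact projection rule are our assumptions for illustration, not DTO-KD's published algorithm.

```python
import torch

def resolve_conflict(g_task: torch.Tensor, g_kd: torch.Tensor) -> torch.Tensor:
    """Project away the conflicting component of the distillation gradient.

    If the flattened task and distillation gradients are directionally
    misaligned (negative inner product), subtract from g_kd its component
    along g_task so the combined update no longer opposes the task
    objective. This is standard PCGrad-style projection, used here as an
    illustrative stand-in for DTO-KD's projection rule.
    """
    dot = torch.dot(g_task, g_kd)
    if dot < 0:  # gradient conflict: the two objectives pull in opposing directions
        g_kd = g_kd - (dot / g_task.dot(g_task).clamp_min(1e-12)) * g_task
    return g_kd
```

After projection, g_task and the modified g_kd have a non-negative inner product, so summing them can no longer cancel progress on the task loss.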
DTO-KD uses multi-objective optimization to dynamically balance the task and distillation losses at the gradient level for more effective knowledge distillation.
- Proposes a principled multi-objective optimization formulation of knowledge distillation that dynamically balances the two losses at the gradient level
- Resolves gradient conflict, where the task and distillation gradients are directionally misaligned
- Addresses gradient dominance, where one objective suppresses learning progress on the other, using gradient projection (see the sketch after this list)
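The per-iteration trade-off itself can also be sketched. A standard baseline for balancing two objectives without either dominating is the two-task closed form of MGDA (Sener & Koltun, 2018), which picks the minimum-norm point in the convex hull of the two gradients. The sketch below assumes DTO-KD's dynamic weight resembles such a convex combination; the paper's actual rule may differ.

```python
import torch

def min_norm_tradeoff(g_task: torch.Tensor, g_kd: torch.Tensor) -> float:
    """Closed-form min-norm weighting for two objectives (two-task MGDA).

    Returns alpha in [0, 1] such that the combined update
        g = alpha * g_task + (1 - alpha) * g_kd
    has minimal norm over the convex hull of the two gradients. When the
    min-norm point is nonzero, -g is (to first order) a descent direction
    for both losses, so neither objective dominates the update -- the
    "gradient dominance" issue the paper targets. This is an illustrative
    baseline, not DTO-KD itself.
    """
    diff = g_kd - g_task
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = diff.dot(g_kd) / denom  # unconstrained minimizer of ||alpha*g1 + (1-alpha)*g2||^2
    return float(alpha.clamp(0.0, 1.0))

# Hypothetical per-iteration use inside a training loop:
# g_task = torch.cat([p.grad.flatten() for p in model.parameters()])  # after task-loss backward
# g_kd   = ...                                                        # after distillation-loss backward
# alpha  = min_norm_tradeoff(g_task, g_kd)
# g      = alpha * g_task + (1 - alpha) * g_kd
```

Clamping alpha to [0, 1] handles the boundary cases where one gradient alone is already the minimum-norm choice.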
- Multi-objective optimization
- Gradient projection
- Knowledge distillation
- ImageNet-1K
- COCO
Data availability is a bottleneck; extending to data-free settings remains an open challenge, particularly for distilling from large pre-trained models (from the paper).
The authors did not state explicit future directions.
Author keywords
- Knowledge Distillation
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and depth-ray targets.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.