Revisiting Model Stitching In the Foundation Model Era
arXiv:2603.12433v1 Announce Type: cross Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
Executive Summary
This article revisits model stitching in the context of Vision Foundation Models (VFMs) and presents a systematic protocol for evaluating the stitchability of heterogeneous VFMs. The authors find that conventional stitch-layer training (matching features at the stitch point, or optimizing the task loss end-to-end) struggles to retain accuracy, especially at shallow stitch points, but show that a simple feature-matching loss at the target model's penultimate layer makes heterogeneous VFMs reliably stitchable across vision tasks; at deep stitch points, the stitched model can even surpass either constituent model at a small inference overhead. They also introduce the VFM Stitch Tree (VST), which shares early layers across VFMs to yield a controllable accuracy-latency trade-off for multimodal LLMs that leverage multiple VFMs. Overall, the study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
Key Points
- ▸ Conventional approaches to stitch-layer training struggle to retain accuracy, especially at shallow stitch points.
- ▸ A simple feature-matching loss at the target model's penultimate layer enables reliable stitchability across vision tasks.
- ▸ At deep stitch points, the stitched model can surpass either constituent model with only a small inference overhead.
- ▸ The VFM Stitch Tree (VST) provides a controllable accuracy-latency trade-off for multimodal LLMs.
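The feature-matching recipe in the second point can be sketched in a toy linear setting. Everything below is an illustrative assumption, not the paper's implementation: the target's "later layers" are stood in for by a single frozen orthogonal map `W_tgt`, the stitch layer is one linear map `S`, and the teacher signal `tgt_penultimate` is fabricated from a hidden alignment `A_true`. The key idea the sketch preserves is that the loss is taken at the target's penultimate features, not at the stitch point itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical, for illustration only):
# source early layers produce d_s-dim features; the frozen target "later
# layers" are a single orthogonal map W_tgt over d_t-dim features.
d_s, d_t, n = 8, 6, 256
src_feats = rng.normal(size=(n, d_s))                 # source features at the stitch point
W_tgt, _ = np.linalg.qr(rng.normal(size=(d_t, d_t)))  # frozen target "later layers"

# Fabricate a ground-truth alignment A_true to define the teacher signal:
# the penultimate features the target would produce for these inputs.
A_true = rng.normal(size=(d_s, d_t)) / np.sqrt(d_s)
tgt_penultimate = (src_feats @ A_true) @ W_tgt

# Stitch layer: one linear map, trained by gradient descent to match the
# target's penultimate features (not the stitch-point features).
S = np.zeros((d_s, d_t))
lr = 0.05
for _ in range(500):
    pred = (src_feats @ S) @ W_tgt             # stitched forward pass
    err = pred - tgt_penultimate               # feature-matching residual
    grad = src_feats.T @ (err @ W_tgt.T) / n   # dL/dS for L = 0.5 * mean ||err||^2
    S -= lr * grad

final_loss = 0.5 * np.mean((src_feats @ S @ W_tgt - tgt_penultimate) ** 2)
print(f"final feature-matching loss: {final_loss:.2e}")
```

In this linear toy the loss drives `S` toward the hidden alignment `A_true`; in the paper's setting the target's later layers are a frozen nonlinear network, but the placement of the loss at the penultimate layer is the same.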
Merits
Contributions to the field
The study provides a systematic protocol for evaluating stitchability and proposes a practical recipe for integrating complementary VFM strengths.
Methodological advancements
The authors introduce a simple feature-matching loss that enables reliable stitchability across vision tasks.
Practical applications
The VFM Stitch Tree (VST) offers a controllable accuracy-latency trade-off for multimodal LLMs.
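As a toy illustration of why sharing early layers saves compute (hypothetical linear "models", not the paper's architecture): in a VST-style layout, the shared trunk is evaluated once per input, and its output feeds every retained branch, so the per-model trunk cost is paid only once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy: two "VFMs" whose trunk and later layers are linear maps.
d, n = 16, 4
x = rng.normal(size=(n, d))                           # a small batch of inputs
shared_trunk = rng.normal(size=(d, d)) / np.sqrt(d)   # shared early layers
branch_a = rng.normal(size=(d, d)) / np.sqrt(d)       # VFM A's later layers
branch_b = rng.normal(size=(d, d)) / np.sqrt(d)       # VFM B's later layers

# VST-style forward pass: one trunk evaluation feeds both branches.
h = x @ shared_trunk                                  # computed once
feats_a, feats_b = h @ branch_a, h @ branch_b         # per-branch later layers

# Rough FLOP accounting for the matmuls (2*n*d*d per linear layer):
layer_flops = 2 * n * d * d
flops_separate = 2 * layer_flops + 2 * layer_flops    # 2 trunks + 2 branches
flops_shared = 1 * layer_flops + 2 * layer_flops      # 1 trunk + 2 branches
print(f"separate: {flops_separate} FLOPs, shared: {flops_shared} FLOPs")
```

Moving the stitch (share) point deeper shrinks the per-branch cost further at a potential accuracy cost, which is the accuracy-latency knob the VST exposes.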
Demerits
Limited scope
The study focuses on vision tasks, and its findings may not generalize to other domains.
Assumptions about VFM architectures
The study evaluates only pre-trained VFMs and does not systematically examine how VFM architecture itself affects stitchability.
Expert Commentary
This article makes significant contributions to model stitching and multimodal learning. The authors' systematic protocol for evaluating stitchability and their proposed VFM Stitch Tree (VST) offer a practical recipe for integrating complementary VFM strengths. The study's focus on representation alignment and divergence also sheds light on where different VFMs represent the same inputs similarly and where they diverge. However, its limited scope and its assumptions about VFM architectures are notable limitations. Future research should test whether the findings generalize beyond vision tasks and examine how VFM architecture affects stitchability. Overall, the study is a valuable step toward more robust and generalizable AI systems.
Recommendations
- ✓ Future research should focus on generalizing the study's findings to other domains and exploring the impact of VFM architecture on stitchability.
- ✓ The VFM Stitch Tree (VST) should be further developed and evaluated to validate its practical utility and accuracy-latency trade-off in deployed multimodal LLMs.