Scaling Up Multi-domain Semantic Segmentation with Sentence Embeddings
State-of-the-art semantic segmentation methods have achieved impressive performance on predefined closed-set individual datasets, but their generalization to zero-shot domains and unseen categories is limited. Because labeling a large-scale dataset is challenging and expensive, training a robust semantic segmentation model on multiple domains has drawn much attention. However, inconsistent taxonomies hinder the naive merging of current publicly available annotations. To address this, we propose a simple solution to scale up the multi-domain semantic segmentation dataset with less human effort. We replace each class label with a sentence embedding, which is a vector-valued embedding of a sentence describing the class. This approach enables the merging of multiple datasets from different domains, each with varying class labels and semantics. We merged publicly available noisy and weak annotations with the most finely annotated data, over 2 million images, which enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets, despite not using any images therefrom. Instead of manually tuning a consistent label space, we utilized a vector-valued embedding of short paragraphs to describe the classes. By fine-tuning the model on standard semantic segmentation datasets, we also achieve a significant improvement over state-of-the-art supervised segmentation on NYUD-V2 (Silberman et al., in: European conference on computer vision, Springer, pp 746–760, 2012) and PASCAL-context (Everingham et al., in: Int J Comput Vis 111(1):98–136, 2015) at 60% and 65% mIoU, respectively. Our method can segment unseen labels based on the closeness of language embeddings, showing strong generalization to unseen image domains and labels. Additionally, it enables impressive performance improvements in some adaptation applications, such as depth estimation and instance segmentation. Code is available at https://github.com/YvanYin/SSIW.
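The idea of labeling by closeness of language embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-dimensional vectors stand in for real sentence embeddings, and the names `cosine`, `assign_labels`, and `CLASS_EMBEDDINGS` are hypothetical. Each pixel embedding is assigned the class whose description embedding has the highest cosine similarity, so a new class can be added at inference time just by supplying its embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_labels(pixel_embeddings, class_embeddings):
    """Give each pixel the class whose embedding is most similar to it."""
    return [
        max(class_embeddings, key=lambda name: cosine(p, class_embeddings[name]))
        for p in pixel_embeddings
    ]

# Toy stand-ins for sentence embeddings of class descriptions (assumed values).
CLASS_EMBEDDINGS = {
    "road": [1.0, 0.0, 0.0],
    "sky": [0.0, 1.0, 0.0],
    "person": [0.0, 0.0, 1.0],
}

# Toy per-pixel embeddings, as a segmentation network might predict them.
pixels = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.2], [0.2, 0.1, 0.95]]
labels = assign_labels(pixels, CLASS_EMBEDDINGS)

# Adding a label that was not in the original set needs only its embedding.
extended = dict(CLASS_EMBEDDINGS, tree=[0.0, 0.7, 0.7])
labels_extended = assign_labels([[0.05, 0.6, 0.65]], extended)
```

Because classes live in a shared embedding space rather than a fixed index list, datasets with different label names can contribute to the same training target without a hand-built taxonomy mapping.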
Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label
Peking University, Beijing, China
Learning Generalized Medical Image Segmentation from Decoupled Feature Queries
Jarvis Research Center, Wuhan University, Guangxi Medical University
Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation
Zhejiang Lab, Xidian University, Zhejiang University, University of Manchester
Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation
University of Chinese Academy of Sciences, Chinese Academy of Sciences, Alibaba Group
Scribble-Supervised Semantic Segmentation with Prototype-based Feature Augmentation
Hohai University, Nanjing, China
Cross-Domain Few-Shot Semantic Segmentation via Doubly Matching Transformation
Nanjing University of Aeronautics and Astronautics, State Key Laboratory of Integrated Services Networks, Xidian University
Prompt-and-Transfer: Dynamic Class-Aware Enhancement for Few-Shot Segmentation
Prompting Multi-Modal Image Segmentation with Semantic Grouping
Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such a paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which introduces explicit semantic grouping to learn modal-related prompts, adapting the frozen pre-trained foundation model to various downstream multi-modal segmentation tasks. Specifically, a class-aware uni-modal prompter is designed to balance intra- and inter-modal semantic propagation by grouping modality-specific class tokens, thereby improving the adaptability of spatial information. Furthermore, an alignment-induced cross-modal prompter is introduced to aggregate class-aware representations and share prompt parameters among different modalities to assist in modeling common statistics. Extensive experiments show the superiority of our GoPT, which achieves SOTA performance on various downstream multi-modal image segmentation tasks by training fewer than 1% of the model parameters.