Abstract: Semantic segmentation, which aims at dense pixel-level classification, is a core problem in computer vision. Given sufficient and accurate pixel-level annotations during training, semantic segmentation has witnessed great progress with recent advances in deep neural networks. However, such pixel-level annotation is time-consuming and relies heavily on human effort, and segmentation performance drops dramatically on unseen classes or when the annotated data are insufficient.
To overcome these drawbacks, many researchers focus on learning semantic segmentation with weak and few-shot supervision, i.e., weakly supervised semantic segmentation and few-shot segmentation. Specifically, weakly supervised semantic segmentation aims to make pixel-level predictions with weak annotations (e.g., bounding-box, scribble, and image-level labels) as supervision, while few-shot segmentation attempts to segment unseen object classes with only a few annotated samples. In this thesis, we mainly focus on image-label supervised semantic segmentation, bounding-box supervised semantic segmentation, scribble supervised semantic segmentation, and few-shot segmentation.
For weakly supervised semantic segmentation with image-level annotation, current approaches mainly adopt a two-step solution, which first generates pseudo pixel-level masks that are then fed into a separate semantic segmentation network. However, these two-step solutions usually employ many bells and whistles to produce high-quality pseudo masks, making such methods complicated and inelegant. We harness the image-level labels to produce reliable pixel-level annotations and design a fully end-to-end network that learns to predict segmentation maps. Concretely, we first leverage an image classification branch to generate class activation maps for the annotated categories, which are further pruned into tiny but reliable object/background regions. These reliable regions then serve directly as ground-truth labels for the segmentation branch, where both a global information sub-branch and a local information sub-branch are used to generate accurate pixel-level predictions. Furthermore, a new joint loss is proposed that considers both shallow and high-level features.
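As a rough illustration of the pruning step, the sketch below (not the thesis implementation; the thresholds `fg_thresh` and `bg_thresh` are assumed values) keeps only highly activated pixels as foreground seeds and weakly activated pixels as background, marking everything else as ignored so the segmentation branch is supervised only by reliable regions.

```python
# Minimal sketch: prune class activation maps (CAMs) into small but reliable
# foreground/background regions usable as pseudo ground truth.
import torch

def reliable_regions_from_cam(cam, fg_thresh=0.7, bg_thresh=0.1, ignore_index=255):
    """cam: (C, H, W) class activation maps for the annotated categories."""
    # Normalize each class map to [0, 1].
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)

    max_act, cls = cam.max(dim=0)                # strongest class response per pixel
    pseudo = torch.full_like(cls, ignore_index)  # start with everything "ignored"
    pseudo[max_act > fg_thresh] = cls[max_act > fg_thresh]  # confident foreground
    pseudo[max_act < bg_thresh] = 0                          # confident background
    return pseudo  # (H, W) pseudo labels: reliable pixels only, rest ignored

# Example: 2 annotated categories on a 4x4 feature map.
print(reliable_regions_from_cam(torch.rand(2, 4, 4)))
```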
For weakly supervised semantic segmentation with bounding-box annotation, most existing approaches rely on a deep convolutional neural network (CNN) to generate pseudo labels by propagating initial seeds. However, CNN-based approaches only aggregate local features and ignore long-distance information. We propose a graph neural network (GNN)-based architecture that takes full advantage of both local and long-distance information. We first transfer the weak supervision into initial labels, which are then formed into semantic graphs based on our newly proposed affinity CNN. The built graphs are then fed into our GNN, in which an affinity attention layer is designed to acquire short- and long-distance information from soft graph edges and accurately propagate semantic labels from the confident seeds to the unlabeled pixels. However, to guarantee the precision of the seeds, we only adopt a limited number of confident pixel seed labels, which may lead to insufficient supervision during training. To alleviate this issue, we further introduce a new loss function and a consistency-checking mechanism to leverage the bounding-box constraint, so that more reliable guidance can be included in the model optimization. More importantly, our approach can be readily applied to bounding-box supervised instance segmentation tasks or other weakly supervised semantic segmentation tasks, showing great potential to become a unified framework for weakly supervised semantic segmentation.
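The core propagation idea can be sketched as follows (an illustrative assumption, not the thesis code): soft edges are derived from pairwise feature similarity so every node pair is connected, and confident seed labels diffuse over both nearby and distant nodes while the seeds themselves stay fixed.

```python
# Minimal sketch of affinity-attention-style label propagation over soft edges.
import torch
import torch.nn.functional as F

def affinity_propagate(features, seed_probs, seed_mask, temperature=0.1, steps=3):
    """features: (N, D) node features; seed_probs: (N, C) one-hot seed labels;
    seed_mask: (N,) bool, True where a confident seed exists."""
    f = F.normalize(features, dim=1)
    affinity = f @ f.t() / temperature        # (N, N) soft edges over all node pairs
    attn = affinity.softmax(dim=1)            # affinity-attention weights

    probs = seed_probs.clone()
    for _ in range(steps):
        probs = attn @ probs                      # aggregate labels from all nodes
        probs[seed_mask] = seed_probs[seed_mask]  # keep confident seeds fixed
    return probs.argmax(dim=1)                    # propagated label per node

nodes = torch.randn(6, 8)
seeds = F.one_hot(torch.tensor([0, 1, 0, 0, 0, 0]), 2).float()
mask = torch.tensor([True, True, False, False, False, False])
print(affinity_propagate(nodes, seeds, mask))
```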
For weakly supervised semantic segmentation with scribble annotation, the regularized loss has been proven to be an effective solution. However, most existing regularized losses only leverage static shallow features (color, spatial information) to compute the regularized kernel, which limits their final performance since such static shallow features fail to describe pair-wise pixel relationships in complicated cases. We propose a new regularized loss that utilizes both shallow and deep features that are dynamically updated, so as to aggregate sufficient information to represent the relationships between different pixels. Moreover, to provide accurate deep features, we adopt a vision transformer as the backbone and design a feature consistency head to train the pair-wise feature relationship. Unlike most approaches that adopt a multi-stage training strategy with many bells and whistles, our approach can be trained directly in an end-to-end manner, in which the feature consistency head and our regularized loss benefit from each other.
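A rough sketch of such a pair-wise regularized loss is given below (illustrative only; the Gaussian kernels and bandwidths are assumptions, not the exact formulation in the thesis): the kernel mixes static shallow cues (color, position) with deep features that change as the network trains, and penalizes prediction disagreement between pixels the kernel deems similar.

```python
# Minimal sketch: regularized loss whose kernel combines shallow and deep features.
import torch

def regularized_loss(probs, color, coords, deep_feats,
                     sigma_c=0.1, sigma_p=0.2, sigma_d=0.5):
    """probs: (N, C) softmax predictions for N sampled pixels;
    color: (N, 3), coords: (N, 2), deep_feats: (N, D)."""
    def gaussian_kernel(x, sigma):
        d2 = torch.cdist(x, x).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))

    # Shallow kernel from color + position, deep kernel from learned features.
    kernel = gaussian_kernel(color, sigma_c) * gaussian_kernel(coords, sigma_p) \
             * gaussian_kernel(deep_feats, sigma_d)
    # Penalize label disagreement between pixels the kernel deems similar.
    disagreement = torch.cdist(probs, probs).pow(2)
    return (kernel * disagreement).mean()

loss = regularized_loss(torch.rand(16, 21).softmax(-1), torch.rand(16, 3),
                        torch.rand(16, 2), torch.rand(16, 32))
print(loss)
```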
For few-shot segmentation, most existing approaches use masked Global Average Pooling (GAP) to encode an annotated support image into a feature vector that facilitates query image segmentation. However, this pipeline unavoidably loses some discriminative information due to the averaging operation. We propose a simple but effective self-guided learning approach that mines this lost critical information. Specifically, by making an initial prediction for the annotated support image, the covered and uncovered foreground regions are encoded into primary and auxiliary support vectors, respectively, using masked GAP. By aggregating both primary and auxiliary support vectors, better segmentation performance is obtained on query images. Inspired by our self-guided module for 1-shot segmentation, we propose a cross-guided module for multi-shot segmentation, where the final mask is fused from the predictions of multiple annotated samples, with high-quality support vectors contributing more and vice versa. This module improves the final prediction at inference time without re-training.