🧠 Mask2Former Universal Segmentation Evaluation
Evaluating the Performance on KITTI and SYNTHIA Datasets
By Andrew Albano, Saket Pradhan, Utkrisht Sahai
GitHub: Mask2FormerImageSegementation
📄 Overview
This project investigates the generalization capabilities of Mask2Former, a universal image segmentation architecture, by evaluating its performance on KITTI and SYNTHIA datasets. While Mask2Former has proven competitive on standard datasets like COCO, Cityscapes, and ADE20K, this study expands the evaluation to new domains—specifically real-world urban and synthetic scenes.
🧰 Architecture
The evaluation centers on Mask2Former, which builds upon MaskFormer by incorporating:
Masked attention in the Transformer decoder
Multi-scale high-resolution features for better small-object segmentation
Removal of dropout and use of learnable query features
These improvements target segmentation quality and computational efficiency without altering training procedures or loss functions.
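To make the masked-attention idea concrete, here is a minimal NumPy sketch: cross-attention scores are computed as usual, but each query may only attend within its predicted foreground mask, with a fallback to full attention when a predicted mask is empty. The function name, shapes, and the `-1e9` stand-in for negative infinity are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    # queries: (Q, d), keys: (N, d), values: (N, dv)
    # mask: (Q, N) boolean, True where a query may attend
    # (i.e. inside that query's predicted foreground region).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Queries whose predicted mask is empty fall back to full attention,
    # so the softmax below never divides by zero.
    empty = ~mask.any(axis=1, keepdims=True)
    mask = mask | empty
    # Large negative value stands in for -inf in the additive mask.
    scores = np.where(mask, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

Restricting attention this way focuses each query on its own predicted region, which is the mechanism Mask2Former credits for faster convergence over standard cross-attention.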
📊 Datasets Used
| Dataset | Type | Key Features |
|---|---|---|
| KITTI | Real-world | Urban driving scenes, semantic & instance labels |
| SYNTHIA | Synthetic | Diverse environmental/weather conditions |
| COCO | Benchmark | General-purpose object segmentation |
| Cityscapes | Benchmark | Urban street scenes |
| ADE20K | Benchmark | Diverse visual environments |
🧪 Methodology
Used pre-trained Mask2Former model
Fine-tuned on KITTI and SYNTHIA without altering architecture or training procedure
Evaluated with mIoU (Mean Intersection over Union) and qualitative visualization
Focused on small object segmentation performance
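The mIoU metric used above can be sketched as follows: per-class intersection over union, averaged over the classes present, with an ignore label excluded. This is a minimal reference implementation for illustration; the helper name and the `ignore_index=255` convention are assumptions (255 is the common void label in Cityscapes-style annotations), not this project's exact evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    # pred, target: integer label arrays of the same shape.
    valid = target != ignore_index  # drop void/unlabeled pixels
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        t = (target == c) & valid
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip, don't count as 0
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Because the score averages over classes rather than pixels, rare small-object classes weigh as much as large ones, which is why weak small-object segmentation shows up clearly in the mIoU numbers below.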
📈 Results
| Dataset | mIoU Score | Notes |
|---|---|---|
| Cityscapes | 62 | ~5,000 finely annotated images (2,975 train / 500 val / 1,525 test; only train and val have labels) |
| COCO | 61 | ~164,000 images (~118k train / ~5k val / ~41k test); supports panoptic, semantic, and instance segmentation |
| ADE20K | 58 | ~25,000 images (20k train / 2k val / 3k test); wide variety of scenes |
| KITTI | 55 | ~200 labeled images for segmentation; limited semantic labels, as the dataset is used mostly for depth/object detection |
| SYNTHIA | 49 | ~9,400 images (SYNTHIA-RAND-CITYSCAPES subset); synthetic, with diverse weather/time conditions |
Generalization: Strong performance on KITTI/SYNTHIA despite no architectural tuning
Challenge: Small object segmentation remains a weak point
🎯 Key Insights
Robust across domains: Performs well on both real and synthetic data
Limitations: Misses small-scale details in crowded scenes
Future Work: Improve multi-scale feature extraction, explore domain adaptation
📷 Visual Results
Visual comparisons with ground truth for KITTI images are provided to highlight:
Effective large object segmentation
Missed small-scale details
Color mismatches due to post-processing (not misclassification)
📚 References
Key citations include works on Mask2Former, Mask R-CNN, semantic/instance/panoptic segmentation, and transformer-based vision models (see full paper for full reference list).
📬 Contact
For questions or collaborations, reach out via the GitHub repo or email any of the authors:
Andrew Albano – aalbano@umich.edu
Saket Pradhan – saketp@umich.edu
Utkrisht Sahai – usahai@umich.edu