Semantic segmentation arises as the backbone of many vision systems, spanning from self-driving cars and robot navigation to augmented reality and teleconferencing. Frequently operating under stringent latency constraints within a limited resource envelope, optimising for efficient execution becomes important. At the same time, the heterogeneous capabilities of the target platforms and diverse constraints of different applications require the design and training of multiple target-specific segmentation models, leading to excessive maintenance costs. To this end, we propose a framework for converting state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS) networks: specially trained models that employ parametrised early exits along their depth to i) dynamically save computation during inference on easier samples and ii) save training and maintenance cost by offering a post-training customisable speed-accuracy trade-off. Designing and training such networks naively can hurt performance. Thus, we propose novel two-staged training scheme for multi-exit networks. Furthermore, the parametrisation of MESS enables co-optimising the number, placement and architecture of the attached segmentation heads along with the exit policy, upon deployment via exhaustive search in <1GPUh. This allows MESS to rapidly adapt to the device capabilities and application requirements for each target use-case, offering a train-once-deploy-everywhere solution. MESS variants achieve latency gains of up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same computational budget, compared to the original backbone network. Lastly, MESS delivers orders of magnitude faster architecture selection, compared to state-of-the-art techniques.
Authors: A. Kouris, S. I. Venieris*, S. Laskaridis*, N. D. Lane
Published at:European Conference in Computer Vision (ECCV’22) (full paper) and International Conference on Mobile Systems, Applications and Services (MobiSys’22) (poster)