NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu1*, Cheng Wang1*, Dingkang Liang1, Zongchuang Zhao1, Xingyu Jiang1, Peng Zhang2, Xiang Bai1†
1Huazhong University of Science and Technology, 2National University of Defense Technology
* Equal contribution. † Corresponding author.

Abstract

Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, and other areas. We study underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perception at multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders progress in this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into the renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on NautData and public underwater datasets demonstrate the effectiveness of the VFE module, which consistently improves the performance of both baselines on the majority of supported tasks and supports the advantage of NAUTILUS in underwater scene understanding.
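For context, the block below sketches a simplified image formation model that is widely used in the underwater imaging literature. Whether the VFE module builds on exactly this form is an assumption, since this page does not state the model; the symbols ($I_c$, $J_c$, $t_c$, $B_c$, $\beta_c$, $d$) are introduced here only for illustration.

```latex
% Simplified underwater image formation model (illustrative; not taken from this page).
% I_c: observed (degraded) intensity in color channel c
% J_c: clear scene radiance to be restored
% t_c: transmission, decaying with scene depth d(x)
% B_c: background (veiling) light
\begin{align}
  I_c(x) &= J_c(x)\, t_c(x) + B_c \bigl(1 - t_c(x)\bigr), \qquad c \in \{R, G, B\}, \\
  t_c(x) &= \exp\bigl(-\beta_c\, d(x)\bigr), \\
  J_c(x) &= \frac{I_c(x) - B_c \bigl(1 - t_c(x)\bigr)}{t_c(x)} .
\end{align}
```

The last line simply inverts the first: given estimates of the transmission and background light, the clear signal can be recovered from the degraded observation, which is the general sense in which a physical prior can guide restoration.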

NautData: Underwater Multimodal Instruction-Following Data

NautData is a large-scale underwater multimodal instruction-following dataset. It contains 1.45 million image-text pairs annotated for eight underwater scene understanding tasks.


Dataset distribution of NautData.

Image distribution of NautData.

Illustration of the data construction framework. Eight tasks are involved, and the data generation process is tailored to each task. Rule-based generation utilizes predefined templates to generate question-answer pairs. Integration generation integrates question-answer pairs using both templates and outputs from LMMs. Free-form generation enables LMMs to construct questions and answers based on the content they focus on.

Data engine of NautData.
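The rule-based generation step described above can be sketched as follows. This is a minimal illustration, not the authors' data engine: the template strings, the task names `classification` and `counting`, the annotation fields, and the function `rule_based_pairs` are all hypothetical, since this page does not list the actual templates or annotation format.

```python
import random

# Hypothetical templates; the actual NautData templates are not given on this page.
TEMPLATES = {
    "classification": [
        "What is the main object in this image?",
        "Which category does the object in this image belong to?",
    ],
    "counting": [
        "How many {label} are there in this image?",
        "Count the {label} visible in this image.",
    ],
}


def rule_based_pairs(annotation, rng):
    """Turn one annotation record into question-answer pairs using fixed templates."""
    image = annotation["image"]
    label = annotation["label"]
    count = annotation["count"]

    pairs = []

    # Classification: the answer is taken directly from the existing label.
    question = rng.choice(TEMPLATES["classification"])
    pairs.append({"image": image, "question": question, "answer": label})

    # Counting: the label fills the template slot, the count becomes the answer.
    question = rng.choice(TEMPLATES["counting"]).format(label=label)
    pairs.append({"image": image, "question": question, "answer": str(count)})

    return pairs


# Hypothetical annotation record for illustration.
example = {"image": "img_0001.jpg", "label": "sea urchin", "count": 3}
pairs = rule_based_pairs(example, random.Random(0))
for pair in pairs:
    print(pair)
```

Because the answers come straight from existing annotations, this kind of generation needs no model in the loop; the integration and free-form modes would additionally call an LMM, which is outside the scope of this sketch.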

Enhancing Underwater Understanding via the VFE Module

Framework of NAUTILUS.

Detailed structure of the proposed VFE module.

Performance on our NautData test set

Main results: comparison of NAUTILUS with renowned LMMs on the NautData test set. The best results are highlighted in bold, and the second-best results are underlined.

Visualization of NAUTILUS.

BibTeX

@inproceedings{xu2025nautilus,
  title={NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding},
  author={Xu, Wei and Wang, Cheng and Liang, Dingkang and Zhao, Zongchuang and Jiang, Xingyu and Zhang, Peng and Bai, Xiang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}