📄️ Multimodal Interactive Assistant
Combining ASR (Automatic Speech Recognition) with a VLM (Vision-Language Model) enables "voice + vision" multimodal interaction: the system understands the spoken input, interprets it in the context of the current scene, and decides how to respond. Typical applications include robotics, smart cockpits, smart terminals, and exhibition demos.
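
The flow described above can be sketched as a single interaction turn: speech is converted to text, then the text and the current frame are passed together to the vision-language model. This is a minimal sketch, not a reference implementation. The names `run_turn`, `Turn`, `fake_asr`, and `fake_vlm` are hypothetical, and the two stand-in functions only mimic the shape of real ASR and VLM calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    transcript: str  # text recognized from the audio
    reply: str       # response produced from transcript + scene


def run_turn(
    audio: bytes,
    frame: bytes,
    transcribe: Callable[[bytes], str],
    answer: Callable[[str, bytes], str],
) -> Turn:
    """One turn: speech -> text, then text + image -> reply."""
    transcript = transcribe(audio).strip()
    if not transcript:
        # Nothing was said, so skip the VLM call entirely.
        return Turn(transcript="", reply="")
    return Turn(transcript=transcript, reply=answer(transcript, frame))


# Stand-ins for real models (assumptions, for demonstration only).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_vlm(prompt: str, frame: bytes) -> str:
    return f"Q: {prompt} | scene bytes: {len(frame)}"


turn = run_turn(b"what is on the table?", b"\x00" * 4, fake_asr, fake_vlm)
print(turn.reply)
```

Passing the models in as callables keeps the turn logic independent of any particular ASR or VLM backend, so either side can be swapped without touching the other.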