📄️ Object Detection
Object detection is a core computer vision task that identifies and locates target objects (such as people, vehicles, and animals) in images or video, typically marking each object's position with a bounding box. It is widely used in security surveillance, autonomous driving, industrial inspection, and similar scenarios. Representative model families include YOLO, R-CNN, and DETR.
📄️ Voice-to-Text (ASR)
ASR (Automatic Speech Recognition), also called speech-to-text, converts human speech into text, often in real time. It is a foundation for voice assistants, meeting transcription, intelligent customer service, and voice input applications.
📄️ Large Language Model (LLM)
An LLM (Large Language Model) is an AI model trained on massive text corpora that can understand and generate natural language, supporting tasks such as Q&A, translation, summarization, and code generation. Representative model families include GPT-4 and Qwen; ChatGPT is a well-known application built on such models.
📄️ Vision-Language Model (VLM)
A VLM (Vision-Language Model) is a multimodal model that processes images and text together, so it can interpret image content and respond to it in natural language. Typical applications include image Q&A, image captioning, scene understanding, and image-text retrieval.