Computer Vision

This module focuses on two main areas, Human Analysis and Object Detection, both working on images from a Stereolabs ZED2 camera. This year we also explored Visual-Language Models (VLMs) to improve our understanding of visual content. For the current architecture, see the Vision Architecture document.

Human Analysis

This subarea focuses on analyzing human features and behaviors using computer vision techniques. Some of the main tasks include:

Recognizing faces
Tracking people and re-identifying them across frames
Detecting poses and gestures
Identifying combinations of clothing and colors
Describing a person
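
As an illustration of the re-identification task listed above, one common approach is to compare an appearance embedding of each detected person against a gallery of known identities. The sketch below is a minimal example under stated assumptions, not the team's implementation: the embeddings are assumed to come from some upstream feature extractor, and the `match_identity` helper, the gallery layout, and the 0.7 threshold are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_identity(
    embedding: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # hypothetical value; tune on real data
) -> Optional[str]:
    """Return the gallery ID most similar to `embedding`, or None if no score reaches the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for person_id, reference in gallery.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id


# Toy 3-dimensional embeddings; real ones would be much larger.
gallery = {"person_a": [1.0, 0.0, 0.0], "person_b": [0.0, 1.0, 0.0]}
print(match_identity([0.9, 0.1, 0.0], gallery))
print(match_identity([0.0, 0.0, 1.0], gallery))
```

With the toy gallery, the first query matches `person_a` and the second matches nobody, which would typically be treated as a new identity.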

Object Detection

The main objective of this subarea is the dataset generation pipeline used to train a YOLO model. However, it also includes the integration of zero-shot models or other alternatives as a fallback for object detection. Additionally, this year an alternative approach to shelf-level detection was explored, based mainly on OpenCV.
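
A dataset generation pipeline for YOLO ultimately has to write one label line per object in the standard YOLO text format: `class x_center y_center width height`, with coordinates normalized to the image size. The sketch below shows that conversion from a pixel bounding box. The `to_yolo_label` helper and its argument layout are hypothetical and are not taken from the team's pipeline.

```python
from typing import Tuple


def to_yolo_label(
    class_id: int,
    box: Tuple[float, float, float, float],  # (x_min, y_min, x_max, y_max) in pixels
    image_width: int,
    image_height: int,
) -> str:
    """Convert a pixel-space bounding box to a normalized YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x200 px box with top-left corner (100, 50) in a 640x480 image.
print(to_yolo_label(0, (100, 50, 300, 250), 640, 480))
# prints: 0 0.312500 0.312500 0.312500 0.416667
```

Each image gets a `.txt` file with the same base name containing one such line per annotated object.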

VLM

This subarea explores VLMs to combine visual understanding with language processing. Currently, the team uses the moondream model for image prompting.

Running vision

For details on how to run the vision module, see the home repo: Run Vision