
Publication

WatchThis: A Wearable Point-and-Ask Interface powered by Vision-Language Models for Contextual Queries

Copyright

Cathy Mengying Fang

Fang, C. M., Chwalek, P., Kuang, Q., & Maes, P. (2024, October). WatchThis: A Wearable Point-and-Ask Interface powered by Vision-Language Models for Contextual Queries. In Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (pp. 1-4).

Abstract

This paper introduces WatchThis, a novel wearable device that enables natural language interactions with real-world objects and environments through pointing gestures. Building upon previous work in gesture-based computing interfaces, WatchThis leverages recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) to create a hands-free, contextual querying system. The prototype consists of a wearable watch with a rotating, flip-up camera that captures the area of interest when pointing, allowing users to ask questions about their surroundings in natural language. This design addresses limitations of existing systems that require specific commands or occupy the hands, while also maintaining a non-discreet form factor for social awareness. The paper explores various applications of this point-and-ask interaction, including object identification, translation, and instruction queries.
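The paper itself does not include code, but the point-and-ask flow the abstract describes (capture a frame when the user points, pair it with a spoken question, and submit both to a VLM) can be sketched as below. The function name, model name, and request shape are illustrative assumptions modeled on a common OpenAI-style vision chat API; the request is only constructed here, not sent, and is not the authors' implementation.

```python
import base64
import json


def build_point_and_ask_query(image_bytes: bytes, question: str) -> dict:
    """Package a pointing-gesture camera capture and a natural-language
    question into a VLM chat request body (assumed OpenAI-style schema;
    the request is constructed but not sent)."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # placeholder model name, an assumption
        "messages": [
            {
                "role": "user",
                "content": [
                    # The user's spoken question, transcribed to text.
                    {"type": "text", "text": question},
                    # The frame captured by the flip-up watch camera,
                    # inlined as a base64 data URL.
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
    }


# Stand-in bytes for a JPEG frame captured while the user points.
fake_frame = b"\xff\xd8\xff\xe0fake-jpeg-bytes"
payload = build_point_and_ask_query(fake_frame, "What is this object?")
print(json.dumps(payload)[:40])
```

In a real deployment this payload would be POSTed to the VLM endpoint over the watch's network link, and the model's text answer read back to the user.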
