Grounding Spatial Language for Video Retrieval and Robotic Direction Following

Understanding spatial language is a challenging problem that requires mapping between language and real-world situations. We are building a spatial language understanding system that bridges this representational gap by computationally modeling the semantics of spatial prepositions. Our model enables a system to retrieve video clips matching natural-language queries such as "Show me people going across the kitchen." We are also applying it to build robots that can follow natural-language directions such as "Go through the door near the elevators." Because the model is trained with corpus-based machine learning techniques, it is robust to real-world noise and linguistic variation. Exploring the connection between language and the real world in concrete domains enables us to make progress toward computers that understand language in human-like ways.
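The core idea of modeling a spatial preposition as a learned scoring function over geometric features can be sketched as follows. This is a minimal illustration, not the paper's actual model: the feature set, weights, and the rectangular-landmark assumption are all hypothetical stand-ins for what a corpus-trained system would learn.

```python
import math

def features(traj, box):
    """Geometric features of a 2D trajectory relative to a landmark box.

    traj: list of (x, y) points; box: (xmin, ymin, xmax, ymax).
    These two features are illustrative choices, not the paper's.
    """
    xmin, ymin, xmax, ymax = box
    inside = [(x, y) for x, y in traj
              if xmin <= x <= xmax and ymin <= y <= ymax]
    frac_inside = len(inside) / len(traj)
    # Net displacement, normalized by the landmark's diagonal length,
    # so "across" rewards paths that traverse most of the region.
    dx = traj[-1][0] - traj[0][0]
    dy = traj[-1][1] - traj[0][1]
    norm_disp = math.hypot(dx, dy) / math.hypot(xmax - xmin, ymax - ymin)
    return {"frac_inside": frac_inside, "norm_disp": norm_disp}

# Hand-set illustrative weights; in a corpus-based system these would
# be learned from labeled examples of each spatial preposition.
WEIGHTS = {"frac_inside": 2.0, "norm_disp": 3.0}
BIAS = -2.5

def score_across(traj, box):
    """Logistic score of how well `traj` goes 'across' `box`."""
    f = features(traj, box)
    z = BIAS + sum(WEIGHTS[k] * f[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

# A path spanning the whole region scores higher than a short hop,
# so ranking clips by this score retrieves "across" examples first.
crossing = [(0.0, 5.0), (2.5, 5.0), (5.0, 5.0), (7.5, 5.0), (10.0, 5.0)]
short_hop = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)]
box = (0.0, 0.0, 10.0, 10.0)
assert score_across(crossing, box) > score_across(short_hop, box)
```

Video retrieval then amounts to extracting person trajectories from each clip, scoring them against the query's preposition, and returning the highest-scoring clips.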