People express and communicate their mental states–such as emotions, thoughts, and desires–through facial expressions, vocal nuances, gestures, and other non-verbal channels. We have developed a computational model that enables real-time analysis, tagging, and inference of cognitive-affective mental states from facial video. This framework combines bottom-up, vision-based processing of the face (e.g., a head nod or smile) with top-down predictions of mental-state models (e.g., interest and confusion) to interpret the meaning underlying head and facial signals over time. Our system tags facial expressions, head gestures, and affective-cognitive states at multiple spatial and temporal granularities in real time and offline, in both natural human-human and human-computer interaction contexts. A version of this system is being made available commercially by Media Lab spin-off Affectiva, indexing emotion from faces. Applications range from measuring people's experiences to a training tool for autism spectrum disorders and people who are nonverbal learning disabled.