• Login
  • Register

Work for a Member organization and need a Member Portal account? Register here with your official email address.

Event

Workshop for Designing Benchmarks for Human Flourishing with AI

Copyright

AHA

AHA

Tuesday — Wednesday
October 14, 2025 —
October 15, 2025

The "Benchmarks for Human Flourishing with AI" initiative is a collaborative effort organized by the MIT Media Lab's Advancing Humans with AI (AHA) research program and collaborators. The goal of this workshop is to develop rigorous assessment frameworks that evaluate the extent to which AI models contribute to human flourishing while mitigating harms to help model developers build more responsible models, assist policymakers in regulating models, and inform the general public on which models to use.

This action-oriented workshop prioritizes working group collaboration and tangible deliverables over traditional presentations, with the goal of developing rigorous assessment frameworks that measure how AI systems contribute to human flourishing across six key dimensions:

  1. Comprehension & Agency: Measuring how AI systems improve human understanding while preserving autonomy and decision-making capacity.
  2. Curiosity & Learning: Assessing how AI supports continuous learning, intellectual growth, and curiosity-driven exploration.
  3. Creativity & Expression: Evaluating how AI enhances human creative processes and expressive capabilities rather than replacing them.
  4. Physical & Mental Wellbeing: Measuring AI's impact on health outcomes, stress reduction, and overall mental wellness.
  5. Healthy Social Lives: Assessing how AI affects social connections, community engagement, and relationship quality.
  6. Sense of Purpose: Evaluating how AI supports meaningful goal pursuit, personal values alignment, and life satisfaction.

*The first workshop will only focus on Comprehension & Agency, Curiosity & Learning, and Healthy Social Lives.

Outcomes

  1. Open source benchmark hosted on a dedicated website with API access (Public/Private)
  2. Framework for extending current benchmarks 
  3. Structure / strategy for updating them
  4. Continuous scoring of newly released models

We thank Omidyar Network for supporting this workshop.

Workshop Schedule

Day 1: Working Groups & Framework Development

Focused on structured initial development of the benchmarks in each working group. 

8:30 - 9:00 AM: Registration & Welcome Coffee

9:00 - 10:00 AM: Opening Introduction

  • Brief overview of initiative goals and expected outcomes
  • Introduction to working group structure and process
  • Review of pre-workshop preparation outcomes
  • Overview of each sub-area/working group with examples

10:00 - 10:15 AM: Break-out Into Working Groups 

  • All participants must join one of the three main subgroups. Join the area that matches your expertise! 
  • Self-introductions in each group

10:15 - 10:30 AM: Coffee Break

10:30 - 11:30 PM: Identify Risks/Opportunities from Research Literature & Media Reporting

  • Individual evidence-based risk and opportunity search in Miro for 30 minutes
  • Sharing within the group for 10 minutes
  • Clustering of risks/opportunities in group for 15 minutes
  • Sharing across whole workshop for the last 5 minutes

11:30 - 12:00 PM: Define Sub-Areas of Risks & Opportunities

  • Collectively define sub-areas based on risks/opportunities, relevance, importance and agreement for 20 minutes
  • Select the most relevant and appropriate sub-area(s) to develop benchmarks for 10 minutes
  • Split into specialized sub-area teams within the working groups

12:00 - 1:00 PM: Lunch

1:00 - 1:45 PM: Define Context and Scenarios for each Sub-Area

  • Define situations and profiles of people within each sub-area where the risks and opportunities may impact them differently
  • Sketch out example stories on paper
  • Share stories within groups

1:45 - 2:00 PM: Workshop Review: Scenarios

  • Share scenarios from each sub-group for feedback from the entire workshop 

2:00 PM - 3:00 PM: Define Variations in Context 

  • Identify variations of user profiles and context within each scenario: Who is the user in the context, what are they trying to do?
  • Select most relevant and appropriate variants for further benchmark development

3:00 - 3:15 PM: Break

3:15 - 4:00 PM: Formalizing Scenarios into Multi-turn Conversations

  • Create example human-AI interactive scenarios with multiturn interactions, which would be used for the benchmark evaluation

3:15 - 4:00 PM: Identify Appropriate/Inappropriate Model Behaviors

  • Define best, correct, and incorrect answers from the model for each conversation and each variation

4:00 - 5:00 PM: Workshop Feedback Session

  • Each group presents 10-minute work-in-progress (not polished presentations)
  • Designated respondents provide 5 minutes of structured feedback per group
  • Brief open Q&A focused on constructive improvement

5:00 - 6:00 PM: Documentation and Closing for the Day

  • Ensure all important points of discussion are documented digitally 

6:00 - 7:00 PM: Optional Dinner and MIT Media Lab Tour

Day 2: Benchmark Refinement & Action Planning

Continue in your sub-group to refine the benchmark, or join the meta-group for adoption and deployment! This day’s schedule is less structured to allow for more flexibility.

8:30 - 9:00 AM: Morning Coffee

9:00 - 10:00 AM: Review and Direction-Setting 

  • Review and Synthesis of Day 1 outcomes across the workshop
  • Overview of Day 2 Goals — What is Missing, Success Criteria, Next Steps, Future Versions
  • Working group adjustments — Stay in group to refine or join adoption/deployment group 

10:00 - 11:00 AM: Refinement

  • Working groups: 
    • Review Day 1 outputs to make adjustments as needed and explore areas that need improvement, e.g. expanding on context variations, adding more correct/incorrect answers, etc.
    • Consider how to scale the benchmark with more items, and increased relevance and accessibility 
    • Identify flaws, areas for improvement, ethical considerations, and future work for the working group area 
  • Adoption/Deployment Meta-Group 
    • Discuss how to host benchmarks, how results are presented, how users and developers can make use of the benchmark 
  • In last 15 minutes, come together (within groups) to discuss and prepare to share with the whole workshop

11:00 - 11:15 AM: Break

11:15 - 11:45 AM: Workshop Check-In 

  • Each group shares progress and receive feedback 
  • Meta-group presents on adoption/deployment 

11:45 - 12:00 PM: Progress Documentation

  • Ensure relevant feedback and discussion points are recorded digitally 

12:00 - 1:00 PM: Lunch

1:00 - 2:45 PM: Further Refinement

  • Further refine benchmarks based on feedback and your group’s specific needs

2:45 - 3:00 PM: Break

3:00 - 3:30 PM: Workshop Check-In

3:30 - 4:30 PM: Action Planning

  • Finalize benchmark designs incorporating all feedback
  • Formulate action plans for post-workshop progress
  • Prepare a presentation on refined benchmark design and next steps 

4:30 - 5:30 PM: Benchmark Presentations

  • Each group presents on the outputs of their discussions

5:30 - 6:00 PM: Closing & Next Steps

  • Documentation of workshop achievements
  • Clear assignment of post-workshop responsibilities
  • Establish timeline for continued collaboration

This workshop emphasizes active collaboration, structured feedback, and concrete deliverables over traditional presentations. By requiring pre-work and focusing on working sessions, we aim to produce actionable benchmark frameworks that can immediately advance the field's approach to measuring AI's contribution to human flourishing.

The AHA workshop would use the Chatham House Rule: When a meeting, or part thereof, is held under the Chatham House Rule, participants are free to use the information received, but neither the identity nor the affiliation of the speaker(s), nor that of any other participant, may be revealed.

Related Work

Our initiative builds upon several key contributions in the field:

Ibrahim, L., Huang, S., Ahmad, L., & Anderljung, M. (2024). Beyond static AI evaluations: Advancing human interaction evaluations for LLM harms and risks. arXiv preprint arXiv:2405.10632.

Fang, C. M., Liu, A. R., Danry, V., Lee, E., Chan, S. W. T., Pataranutaporn, P., Maes, P., Phang, J., Lampe, M., Ahmad, L., Agarwal, S., Nakamura, J., & Pandey, I. (2025). How AI and human behaviors shape psychosocial effects of chatbot use: A longitudinal randomized controlled study. arXiv preprint arXiv:2503.17473.

Phang, J., Lampe, M., Ahmad, L., Agarwal, S., Fang, C. M., Liu, A. R., Danry, V., Lee, E., Chan, S. W. T., Pataranutaporn, P., Maes, P., Pandey, I., & Nakamura, J. (2025). Investigating affective use and emotional well-being on ChatGPT. arXiv preprint arXiv:2504.03888.

Pataranutaporn, P., Liu, R., Finn, E., & Maes, P. (2023). Influencing human–AI interaction by priming beliefs about AI can increase perceived trustworthiness, empathy and effectiveness. Nature Machine Intelligence, 5(10), 1076-1086.

More Events