Collaborative research on morality, civic reasoning, and AI alignment at the University of Southern California
Case Study: Building & Analyzing AI/NLP Datasets for Social Science Research at USC
Project Overview
This case study highlights my integral role in multiple AI and Natural Language Processing (NLP) research projects within Dr. Morteza Dehghani’s lab at the University of Southern California (USC). My work focused on the critical steps of data annotation, curation, and quality assurance to build large-scale, high-quality datasets used for developing and validating AI models in complex social and behavioral contexts, including morality, politics, and crime prediction.
My Role
AI Data Annotation Specialist & NLP Researcher
- Duration: 2018 – 2020
- Key Responsibilities: Large-scale Data Annotation, Data Curation, Data Quality Assurance, Inter-rater Reliability Analysis, Bias Identification & Mitigation, Collaborative Research, Data Interpretation, Contributing to Publications.
The Challenge
The challenge in these projects lay at the intersection of human understanding and artificial intelligence:
- Complex Data Interpretation: Annotating vast quantities of unstructured text data (social media posts, local news articles, and Electronically Activated Recorder (EAR) audio transcripts) for nuanced concepts like morality and politics, which are inherently subjective.
- Ensuring Data Quality for AI: The reliability and accuracy of AI models are directly dependent on the quality of their training data. Ensuring consistency and minimizing bias across thousands of human-annotated data points was paramount.
- Translating Human Nuance to AI Input: Bridging the gap between complex human language and structured data formats digestible by NLP models.
- Collaborative Consistency: Working in a team of annotators and researchers required rigorous calibration and discussion to maintain inter-rater agreement and address discrepancies.
My Approach & Process
My approach focused on meticulous data annotation, collaborative quality assurance, and iterative refinement:
- Large-Scale Data Annotation & Structuring:
- Systematically annotated 35,000 records from diverse, real-world data sources (social media posts, local news articles, and Electronically Activated Recorder (EAR) transcripts). This involved reading and categorizing text against predefined schemas relevant to morality, politics, and other specific research questions.
- These annotations formed a comprehensive corpus of 35,000 labeled data points, serving as a foundational dataset for various NLP and AI modeling efforts.
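To make "categorizing text against predefined schemas" concrete, here is a minimal sketch of what a single annotated record and its schema check might look like. The label set and field names are hypothetical illustrations, not the lab's actual rubric:

```python
from dataclasses import dataclass, field

# Hypothetical label schema for illustration only; the lab's real
# annotation rubric was more detailed and iteratively refined.
MORAL_FOUNDATIONS = {"care", "fairness", "loyalty", "authority", "purity", "non-moral"}

@dataclass
class AnnotatedRecord:
    text: str
    annotator_id: str
    labels: set = field(default_factory=set)

    def validate(self) -> bool:
        """Reject any label that falls outside the predefined schema."""
        unknown = self.labels - MORAL_FOUNDATIONS
        if unknown:
            raise ValueError(f"Labels outside schema: {unknown}")
        return True

record = AnnotatedRecord(text="Everyone deserves equal treatment.",
                         annotator_id="rater_01",
                         labels={"fairness"})
record.validate()  # True: every label falls within the schema
```

Constraining annotators to a fixed schema like this is what makes thousands of human judgments machine-readable and comparable across raters.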
- Data Quality & Bias Mitigation (Human-in-the-Loop QA):
- Actively participated in team discussions after each series of annotations to review and discuss individual ratings.
- Challenged potential rater biases and identified areas of ambiguity in annotation guidelines, contributing to the refinement of rubrics and improving inter-rater reliability. This “human-in-the-loop” quality assurance was crucial for the integrity of the datasets.
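Inter-rater reliability, mentioned above, is typically quantified with a chance-corrected agreement statistic. As a sketch (using Cohen's kappa for the two-rater nominal case; the lab may have used other statistics such as Krippendorff's alpha for multiple raters):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the agreement expected by chance from each
    rater's label frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example: two raters agree on 4 of 5 items.
a = ["moral", "moral", "non-moral", "moral", "non-moral"]
b = ["moral", "non-moral", "non-moral", "moral", "non-moral"]
print(round(cohens_kappa(a, b), 2))  # → 0.62
```

Tracking a statistic like this after each annotation round is what turns "team discussions about ratings" into a measurable quality-assurance loop: guideline revisions should push the agreement score upward.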
- Collaborative Research & Problem Solving:
- Worked closely with the research team to understand the objectives of each study, ensuring annotations directly supported the development of predictive models (e.g., crime rate prediction).
- Engaged in iterative feedback loops to refine annotation strategies based on preliminary model results or emerging data patterns.
- Contribution to AI Model Development:
- Provided the foundational annotated data that enabled the development and validation of NLP models for understanding social phenomena and making predictions.
Key Contributions & Solutions
- Developed Foundational AI Datasets: Successfully annotated and curated a large-scale corpus of 35,000 data points, critical for training and validating complex NLP models.
- Ensured Data Quality: Actively participated in quality assurance processes, identifying and mitigating rater biases, which directly improved the reliability and robustness of the datasets.
- Supported Predictive AI Research: Provided essential data for projects leveraging AI to predict real-world outcomes, such as local crime rates from publicly available text.
- Collaborative Research Output: Co-authored multiple published AI/NLP studies, demonstrating contributions to academic research and a strong understanding of research methodologies.
Impact & Results
- High-Quality Datasets: The meticulously annotated datasets enabled the lab to train more accurate and robust NLP models.
- Published Research: Direct contributions led to the successful publication of multiple peer-reviewed AI/NLP studies.
- Enhanced Predictive Capabilities: The curated data supported research aimed at building AI models with improved predictive power for real-world social issues.
- Standardized Annotation Practices: Collaborative review processes fostered a more consistent and reliable approach to complex data labeling within the team.
Learnings & Takeaways for Tech Roles
This experience provided critical skills directly applicable to AI, Machine Learning, and Data Science roles in the tech industry:
- Data Annotation & Curation: Practical expertise in preparing vast, complex datasets for machine learning applications, a foundational skill for AI/ML development.
- Data Quality Assurance: Deep understanding of the importance of data quality, bias detection, and human-in-the-loop validation for AI model performance.
- Natural Language Processing (NLP) Fundamentals: Hands-on experience with textual data, understanding its nuances and preparing it for computational analysis.
- AI Ethics & Bias: Direct involvement in discussions around rater bias translates to an awareness of ethical considerations in AI data and model development.
- Collaboration in Technical Environments: Effective teamwork in a research setting, including contributing to scientific publications and collaborating on complex data challenges.
- Problem-Solving with Data: Leveraging data to tackle complex, real-world problems, from understanding human morality to predicting crime.
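The "preparing text for computational analysis" skill above usually begins with normalization of noisy social-media text. A minimal sketch of the kind of cleanup involved (the exact rules are illustrative, not the lab's pipeline):

```python
import re

def normalize(text: str) -> str:
    """Minimal cleanup typical of preparing social-media text for NLP:
    strip URLs and @mentions, lowercase, collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop @mentions
    return re.sub(r"\s+", " ", text.lower()).strip()

normalize("Check this @user http://t.co/abc  NOW!")  # → "check this now!"
```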
This project honed my analytical capabilities and provided direct experience with the critical, often overlooked, stages of AI development, making me a valuable asset for teams building and deploying AI solutions.
Publications
- Hoover, J., Portillo-Wightman, G., Yeh, L., Havaldar, S., Mostafazadeh Davani, A., Lin, Y., … Dehghani, M. (2020). Moral Foundations Twitter Corpus: A Collection of 35k Tweets Annotated for Moral Sentiment. Social Psychological and Personality Science. https://doi.org/10.1177/1948550619876629
- Mostafazadeh Davani, A., Yeh, L., Atari, M., Kennedy, B. J., Portillo Wightman, G., Acosta González, E., Delong, N., Bhatia, R., Mirinjian, A., Ren, X., & Dehghani, M. (2019). Reporting the Unreported: Event Extraction for Analyzing the Local Representation of Hate Crimes. Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.18653/v1/d19-1580
- Atari, M., Mehl, M. R., Graham, J., Doris, J. M., Schwarz, N., Mostafazadeh Davani, A., Omrani, A., Kennedy, B., Gonzalez, E., Jafarzadeh, N., Hussain, A., Mirinjian, A., Madden, A., Bhatia, R., Burch, A., Harlan, A., Sbarra, D. A., Raison, C. L., Moseley, S. A., … Dehghani, M. (2023). The paucity of morality in everyday talk. Scientific Reports, 13(1). https://doi.org/10.1038/s41598-023-32711-4

