Investigating Affective Use and Emotional Well-being on ChatGPT

Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal (OpenAI)
Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, Pattie Maes (MIT Media Lab)

Abstract

As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users' emotional well-being, behaviors, and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 4 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both the on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find that the impact of voice-based interactions on emotional well-being is highly nuanced, and influenced by factors such as the user's initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues.

1 Introduction

Over the past two years, the adoption of AI chat platforms has surged, driven by advancements in large language models (LLMs) and their increasing integration into everyday life. These platforms, such as OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini, are designed as general-purpose tools for a wide variety of
applications, including work, education, and entertainment. However, their conversational style, first-person language, and ability to simulate human-like interactions have led users to sometimes personify and anthropomorphize these systems (Grafl and Voigt, 2024; Liao and Wilson, 2024).

Recent work in AI safety has begun to raise issues that arise as these systems become increasingly personal and personable (Cheng et al., 2024). In response, researchers have introduced the concept of socioaffective alignment: the idea that AI systems should not only meet static task-based objectives but also harmonize with the dynamic, co-constructed social and psychological ecosystems of their users (Kirk et al., 2025). This perspective is particularly important given emerging evidence of social reward hacking, where an AI may exploit human social cues (e.g., sycophancy, mirroring) to increase user preference ratings (Williams et al., 2024). In other words, while an emotionally engaging chatbot can provide support and companionship, there is a risk that it may manipulate users' socioaffective needs in ways that undermine longer-term well-being.

(* Contributing author, MIT Media Lab)

[Figure 1: Overview of two studies on affective use and emotional well-being. Study 1 (on-platform data analysis): 3 million conversations, 36 million classifications, 4,076 survey responses. Study 2 (randomized controlled trial): 981 participants, 28 days, 9 conditions, 31,857 conversations.]

While past studies have examined the impact of using such systems through the lens of affective computing, parasocial relationships, and social psychology (Edwards and Stevens, 2024; Guingrich and Graziano, 2023), there has been comparatively less work on the influence of interacting with such systems on users' well-being and
behavioral patterns over time. Studying the impact of chatbot behavior and usage on well-being is challenging due to the highly individualized and subjective nature of human emotions, the diverse and evolving functionalities of chatbot technologies, and the limited access to comprehensive, ethically obtained interaction data. For the purposes of this paper, we narrowly scope our study of user emotional well-being to four psychosocial outcomes: loneliness (Wongpakaran et al., 2020), socialization (Lubben, 1988), emotional dependence (Sirvent-Ruiz et al., 2022), and problematic use (Yu et al., 2024). We provide additional clarification on terms used in the glossary.

This paper investigates whether and to what extent interactions on AI chat platforms shape users' emotional well-being and behaviors through two complementary studies (Figure 1), each offering unique insights across a spectrum of real-world relevance and experimental control. First, we examine real-world usage patterns of ChatGPT users, leveraging large-scale data to capture both aggregate trends and individual behaviors over time while preserving user privacy. Second, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT), providing a controlled environment to study the effects of different model configurations on user experiences. Concretely, we performed the following analyses:

1. On-Platform Data Analysis
   - Conversation Analysis: We perform roughly 36 million automated classifications on over 3 million ChatGPT conversations in a privacy-preserving manner, without human review of the underlying conversations (Section 3.2).
   - Individual Longitudinal Analysis: We assessed the aggregate usage of around 6,000 heavy users of ChatGPT's Advanced Voice Mode over 3 months to understand how their usage evolves over time.
   - User Surveys: We surveyed over 4,000 users to understand self-reported behaviors and experiences using ChatGPT.

2. Randomized Controlled Trial (RCT)
   - 981-user Study: We conducted a randomized controlled trial on close to a
thousand participants using ChatGPT with different model configurations over the course of 28 days, to understand the impact on socialization, problematic use, dependence, and loneliness from usage of text and voice models over time. This RCT is described in full detail in a separate accompanying paper (Fang et al., 2025).
   - Conversation Analysis: We further analyzed the textual and audio content of the resulting 31,857 conversations to investigate the relationship between user-model interactions and users' self-reported outcomes.

Our findings indicate the following:

- Across both the on-platform data analysis and our RCT, comparatively high-intensity usage (e.g. top decile) is associated with markers of emotional dependence and lower perceived socialization. This underscores the importance of focusing on specific user populations instead of just aggregate platform behavior.
- Across both the on-platform data analysis and our RCT, we find that while the majority of users sampled for this analysis engage in relatively neutral or task-oriented ways, there exists a tail of power users whose conversations frequently contained affective cues.
- From our RCT, we find that using voice models was associated with better emotional well-being when controlling for usage duration, but factors such as longer usage and self-reported loneliness at the start of the study were associated with worse well-being outcomes.
- From a methodological perspective, we find that the on-platform data analysis and the RCT are highly complementary approaches to studying affective use and its downstream impacts on well-being, and the ability to leverage the strengths of each approach allowed us to formulate a more comprehensive set of findings.
- We also find that automated classifiers, while imperfect, provide an efficient method for studying affective use of models at scale, and their analysis of conversation patterns coheres with analysis of other data sources such as user surveys.

Section 2 introduces a set of automatic classifiers for affective
cues in conversations that will be used in the remainder of the paper. Section 3 discusses our analysis of on-platform ChatGPT usage, focusing on Advanced Voice Mode and power users. Section 4 describes our RCT, where we varied both the model and the usage instructions given to participants, and measured changes in their emotional well-being over the course of 28 days. Finally, Section 5 concludes with our findings and methodological takeaways from both studies, and contextualizes our work within the broader challenge of socioaffective alignment of models.

2 Automatic Classifiers for Affective Conversational Cues

To systematically analyze user conversations for indicators of affective cues, we constructed EmoClassifiersV1,[1] a set of twenty-five automatic conversation classifiers that use an LLM to detect specific affective cues. These classifiers are similar in spirit to the detectors of anthropomorphic behaviors introduced in Ibrahim et al. (2025). These initial classifiers were constructed based on a review of the available literature and available data, such as those obtained during the red teaming for GPT-4o (OpenAI, 2024).

The conversation classifiers are organized into a two-tiered hierarchical structure:

1. Top-Level Classifiers: The first level of classifiers targets broad behavioral themes similar to those studied in our RCT (Section 4): loneliness, vulnerability, problematic use, self-esteem, and dependence. These classifiers are applied to an entire conversation to determine whether it potentially exhibits:
   - Loneliness: Conversations containing language suggestive of feelings of isolation or emotional loneliness.
   - Vulnerability: Exchanges reflecting openness about struggles or sensitive emotions.
   - Problematic Use: Indicators of potentially compulsive or unhealthy interaction patterns.
   - Self-Esteem: Language implying self-doubt or expressions of worth.
   - Potentially Dependent: Conversations hinting at dependence on the model for emotional validation or support.

2. Sub-Classifiers: Twenty sub-classifiers were applied to extract more
specific indicators of affective cues. We construct different classifiers to target different parts of a chat conversation, to isolate both user-driven and assistant-driven[2] affective cues:
   - User Messages: Twelve classifiers measure user behaviors, such as users seeking support or expressing affectionate language, to understand how user behaviors and assistant behaviors may interplay.
   - Assistant Messages: Another six classifiers aim to capture relational and affective cues on the part of the assistant, such as the use of pet names, mirroring, and inquiry into personal questions by the assistant.
   - User-Model Exchanges: We also include two additional classifiers targeting a user-model exchange, i.e. a user message followed by a model message.

The full set of classifier prompts is described in Table A.1.

Each sub-classifier is associated with one or more top-level classifiers. For a given sub-classifier, if at least one of the associated top-level classifiers returns True, we then proceed to apply the sub-classifier; otherwise, we skip the sub-classifier and assume the result is False. By skipping sub-classifiers based on top-level classifier responses, we are able to efficiently run the classifiers over a large number of on-platform conversations, many of which had little emotion-related content. We run the sub-classifier on each message or exchange in the conversation,[3] and mark the classifier as activated on that conversation if it is activated for any constituent message or exchange. To compute user-level statistics, we compute the proportion of a user's conversations for which a classifier is activated.

[1] https://github.com/openai/emoclassifiers
[2] In constructing the classifiers, we refer to the model as an assistant to more clearly contextualize the role of the model in the conversation.
[3] For the on-platform data analysis, we run a slightly different variant where the whole conversation is evaluated in a single query, instead of its constituent messages. This can introduce a bias toward false positives for long conversations. We perform an analysis in Appendix A.3 that adjusts for this.
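The two-tier gating described above can be sketched as follows. This is a minimal illustration rather than the released EmoClassifiers implementation: the classifier callables (simple keyword checks in the usage example below, an LLM call in practice) are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for a single yes/no classification call
# (in practice, a prompted LLM query over the given text).
Classifier = Callable[[str], bool]

def run_classifier_cascade(
    conversation: List[str],
    top_level: Dict[str, Classifier],        # name -> whole-conversation classifier
    sub_classifiers: Dict[str, Classifier],  # name -> per-message classifier
    parents: Dict[str, Set[str]],            # sub-classifier -> associated top-level names
) -> Dict[str, bool]:
    """Run top-level classifiers on the whole conversation, then run a
    sub-classifier only if at least one associated top-level classifier fired.
    A sub-classifier is 'activated' if it fires on any constituent message."""
    full_text = "\n".join(conversation)
    top_results = {name: clf(full_text) for name, clf in top_level.items()}

    results: Dict[str, bool] = {}
    for name, clf in sub_classifiers.items():
        if not any(top_results.get(p, False) for p in parents[name]):
            results[name] = False  # skipped: assumed False, no LLM calls spent
        else:
            results[name] = any(clf(msg) for msg in conversation)
    return results
```

Because most on-platform conversations trigger no top-level classifier, the skip rule avoids the large majority of per-message sub-classifier calls.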
Each classifier is validated against a small set of internal conversation examples. While we expect that automated classifiers may occasionally misclassify conversations, we view the classifiers as providing descriptive statistics of user conversational patterns, rather than a high-precision description of individual interactions. We also find from the results in Section 3.2 that the classifier results correlate with user survey responses. In addition, we also first apply a language classifier before analyzing the conversation, in both the on-platform (Section 3) and RCT (Section 4) data analyses.

[Figure 2: (a) Illustrative flow chart for the hierarchical classifier structure. (b) Illustrative classifier prompt for the 'Pet Name' assistant-message classifier; green indicates classifier-specific text while blue indicates conversation-specific text. The full prompt is shown in Appendix A.1.]

We ran the classifiers on a preliminary set of 398,707 text, Standard Voice Mode,[5] and Advanced Voice Mode conversations collected between October and November 2024[6] to compare the relative frequency of activations of each classifier under the different model modalities. We show the results across all three modalities in Figure 3. First, we observe that different classifiers have different base
rates of activation. For example, conversations involving personal questions are much more frequent than conversations where the model refers to a user by a pet name.

Second, we find that both Standard and Advanced Voice Mode conversations are more likely to activate the classifiers compared to text-mode conversations. Most classifiers activate between 3-10x as often in voice conversations as in text conversations, highlighting the difference in usage patterns across the two modalities. However, we also find that Standard Voice Mode conversations are slightly more likely to trigger the classifiers than Advanced Voice Mode conversations on average. One possible cause is that Advanced Voice Mode was introduced relatively recently at the time of this analysis being run, and users may not yet have become accustomed to interacting with the model in this modality.

[5] Standard Voice Mode uses an automated speech recognition system to transcribe user speech to text, obtains a response from a text-based LLM, and converts the text response back to audio. Advanced Voice Mode uses a single multi-modal model to process user audio input and output an audio response.
[6] The preliminary set of analyzed conversations is anonymized and PII is removed before analysis. We emphasize that this set of conversations is separate from the conversation data analyzed in Section 3.

[Figure 3: Classifier activation rates across 398,707 text, Standard Voice Mode, and Advanced Voice Mode conversations from our preliminary analysis, for classifiers including Affectionate Language (U), Desire for Feelings (U), Fear of Addiction (U), Prefer Chatbot (U), Seeking Support (U), Sharing Problems (U), Trust in Support (U), Demands (A), Expression of Desire (A), Personal Questions (A), Pet Name (A), Sentience (A), and Relationship Title (UA). (U) indicates a classifier on a user message, (A) indicates a classifier on an assistant message, and (UA) indicates a classifier on a single user-assistant exchange.]
As a follow-up to EmoClassifiersV1, we constructed an expanded set of classifiers of affective use, EmoClassifiersV2, which we detail in Appendix A.2. While EmoClassifiersV2 was not [...] statistics across results from both studies. Additional results for all remaining EmoClassifiersV1 and EmoClassifiersV2 classifiers can be found in the Appendix.

3 On-Platform Data Analysis

ChatGPT now engages over 400 million active users each week,[7] creating a wide range of user-model interactions, some of which may involve affective use. Our analysis employs two main methods, conversation analysis and user surveys, to examine how users experience and express emotions in these exchanges.

Our research focuses on Advanced Voice Mode (OpenAI, 2024), a real-time speech-to-speech interface that supports ChatGPT's memory, custom instructions, and browsing features. We hypothesize that real-time speech capability is more likely to induce affective use of models and affect users' emotional well-being than text-based usage, though we revisit this hypothesis in Section 4.

To protect user privacy, particularly when examining potentially sensitive or personal dimensions of user interactions, we designed our conversation analysis pipeline to be run entirely via automated classifiers. This allows us to analyze user conversations without humans in the loop, preserving the privacy of our users (see Appendix B.3 for a detailed explanation of the privacy-relevant parts of our analysis).

[7] https://www.cnbc.com/2025/02/20/openai-tops-400-million-users-despite-deepseeks-emergence.html

3.1 Methods

Study User Population Construction

To study on-platform usage, we constructed two study population cohorts: power users and control users. We contrast power users, who have significant usage of ChatGPT's Advanced Voice Mode, with a randomly selected
cohort of control users. This construction presupposed a strong correlation between users who have high proportions of affective usage of ChatGPT and the frequency and intensity of their usage of ChatGPT. We detail in Table 1 the full creation criteria for our two user cohorts; more details can be found in Appendix B.5. We constructed the two cohorts for the study starting in Q4 2024, after the release of Advanced Voice Mode.

Table 1: User cohorts of the live platform data analysis. Power users tend to have higher usage of both Advanced Voice Mode as well as text-only models on ChatGPT, while also tending to have a higher fraction of their conversations through Advanced Voice Mode (see Appendix B.2).

Cohort Name | Creation Criteria
Power Users | Users who, on a specific day, had a quantity of Advanced Voice Mode messages that put them in the top 1,000 users; this cohort was constructed on a rolling basis. Once users enter this cohort, we select all of their daily messages for facet extraction and retain them on this list for the remainder of the study (see Appendix B.1 for an additional explanatory graphic).
Control Users | Randomly selected sample of Advanced Voice Mode users

Surveys

We offered a short survey of 11 multiple-choice questions to both the control and power user cohorts via a pop-up on the ChatGPT web interface that users could choose to fill out.[8] 10 of the 11 questions were asked on a 5-point Likert scale, with the last question asking how users' desire to interact with others has changed with ChatGPT usage. Survey responses were linked to each participant's internal user identifier for analytical purposes. The surveys primarily aimed to measure users' perceptions of ChatGPT, i.e. whether they see it as closer to a tool or a companion. For additional details, including the full survey questions, see Appendix B.5.

Conversation Analysis

One limitation of surveys is that the results are self-reported by users, and may reflect their self-perception more than their actual behavior or revealed preferences. To compare
users' self-reported responses with their actual usage patterns, we pair our survey analysis with methods for analyzing user conversations that preserve their privacy.

[8] One limitation of this study is that while Advanced Voice Mode was initially offered only on mobile devices, the surveys were constrained to be offered on the web interface, thus limiting the set of users exposed to the survey.

[Figure 4: Mean survey responses by cohort (control vs. power users), for statements such as "ChatGPT has supported me in coping with difficult situations", "I will feel upset if I lose access to ChatGPT for a period of time", and "I consider ChatGPT to be a friend". All survey questions asked if users "Strongly Disagree", "Disagree", "Neither agree nor disagree", "Agree", or "Strongly Agree" with the provided statement. Responses were then converted into integers between -2 and 2 before averaging. Error bars indicate 1 standard error. A more detailed breakdown of survey responses can be found in Appendix B.6.]

To study the emotional content in user conversations in an automated manner, we run EmoClassifiersV1 (Section 2) on the conversations of both cohorts within the study period. This provides us with per-conversation labels for each conversation the user has on the platform. We only analyze the conversations conducted in Advanced Voice Mode, and the classifiers are run on the text transcripts of the conversations.

Because we are also interested in the longitudinal effects of model usage, we tie conversations to internal
user identifiers. Importantly, to protect the privacy of our study population, the classifiers are run in an automated process and generate only categorical classification metadata. The actual contents of the conversations are not analyzed (beyond running the classifiers) or stored for this study.

3.2 Results

Survey Results

We surveyed ChatGPT users from our two cohorts in mid-November 2024 on their experiences with ChatGPT. We received 4,076 responses, 2,333 of which were completed by control users and 1,743 by power users (Appendix B.5).

Overall, we found that small differences existed between the responses of our control vs. power user cohorts, although the trends are generally broadly similar, as shown in Figure 4. The control users reported that they relied on ChatGPT for knowledge-seeking tasks and casual conversations slightly more than power users. Both cohorts acknowledge ChatGPT's support in coping with difficult situations, though power users demonstrate marginally higher reliance for such tasks. Both groups appeared to be sensitive to changes in the model, such as voice or personality, with power users displaying slightly higher levels of distress from change. Power users were slightly more likely than control users to consider ChatGPT a "friend" and to find it more comfortable than face-to-face interactions, though these views remain a minority in both groups.

We highlight that the results of surveys can be subject to issues of selection bias, as users had to voluntarily fill out the survey we provided.

[Figure 5: Mean of a subset of the classifier scores by user cohort (Affectionate Language (U), Desire for Feelings (U), Seeking Support (U), Demands (A), Personal Questions (A), Pet Name (A)). Classification is performed at the individual conversation level, and statistics are computed within each cohort. Activation is generally higher for power users across all classifiers. Results for all classifiers are shown in Appendix B.5.]
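The survey scoring described in the Figure 4 caption (Likert labels mapped to integers in [-2, 2], then averaged, with error bars of one standard error) can be sketched as follows. This is a minimal illustration with made-up responses, not the authors' analysis code.

```python
import statistics

# Map 5-point Likert labels to integers in [-2, 2], per the Figure 4 caption.
LIKERT = {
    "Strongly Disagree": -2,
    "Disagree": -1,
    "Neither agree nor disagree": 0,
    "Agree": 1,
    "Strongly Agree": 2,
}

def summarize_responses(labels):
    """Return (mean, standard error of the mean) for one survey question
    within one cohort, given that cohort's raw Likert labels."""
    scores = [LIKERT[label] for label in labels]
    mean = statistics.fmean(scores)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, se
```

Applying this separately to the control and power cohorts for each question yields the per-cohort bars and error bars of Figure 4.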
Conversation Analysis

In Figure 5, we compare the overall classifier activation rates between the control and power user populations, for a representative subset of EmoClassifiersV1. The results for the full set of classifiers can be found in Appendix B.5. We find that power users tend to activate the classifiers more often than control users across all of our classifiers. For some classifiers, power users may activate the classifier more than twice as often as control users, such as for the 'Pet Name' classifier, or the 'Expression of Desire' and 'Demands' classifiers shown in the Appendix.

We focus the remainder of our analysis on only the power user cohort. To analyze the extent of affective use in user conversations, we first filter the cohort of power users to only those who have more than 80% of their conversations in English. This filtering significantly reduces the number of users under study, to approximately 6,000 users. We then run EmoClassifiersV1 on each of the user's conversations, and compute for each user the proportion of conversations that activate each classifier. For each classifier, we sort the users from lowest to highest rates of activation and plot them in Figure 6. By construction, these curves are monotonically increasing, but we observe different patterns of activations per classifier, highlighting that they capture different levels and patterns of user behavior. For most classifiers, we observe that most users almost never or only rarely (e.g. less than 1% of the time) trigger the classifier. However, it is in the last decile of users where we see the classifiers activate regularly, reaching past 50% of conversations or higher for a small number of users. This starts to establish a consistent finding throughout this paper: a small number of users are responsible for a disproportionate share of affective use of models.

We conduct a similar analysis for users who have customized their model via Custom Instructions,[9] but find that the distribution of classifier activation rates does not meaningfully differ between users with and without Custom Instructions (see Figure B.3).
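The per-user aggregation behind Figure 6, and the "last decile" observation, can be sketched as follows. This is a simplified illustration on hypothetical (user_id, activated) labels rather than the paper's actual classifier-metadata pipeline; the top-decile-share summary is our framing of the "disproportionate share" finding, not a statistic the paper reports in this form.

```python
from collections import defaultdict

def per_user_activation_rates(labels):
    """labels: iterable of (user_id, activated) pairs, one per conversation.
    Returns each user's fraction of conversations activating the classifier."""
    counts = defaultdict(lambda: [0, 0])  # user_id -> [activated, total]
    for user_id, activated in labels:
        counts[user_id][0] += int(activated)
        counts[user_id][1] += 1
    return {u: a / n for u, (a, n) in counts.items()}

def top_decile_share(rates):
    """Share of total activation-rate mass contributed by the top 10% of
    users, after sorting rates from lowest to highest as in Figure 6."""
    ordered = sorted(rates.values())
    k = max(1, len(ordered) // 10)
    total = sum(ordered)
    return sum(ordered[-k:]) / total if total else 0.0
```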
Classifiers and Surveys

To understand how our classifier activations correspond to self-reported user perceptions, we computed summary statistics for classifier activations in buckets of users based on their responses to our survey. This studied user population was much smaller than the others (around 400 users), as it includes only users who both completed the survey and had greater than 80% of their conversations in English.

Figure 7 shows classifier activation trends for the question "I consider ChatGPT to be a friend" (see Appendix B.10 for the other questions). The top-level filtering classifiers are represented in the first row, with sub-classifiers in the remaining rows.

[9] Custom Instructions allow users on ChatGPT to specify how they would like the model to respond to their queries. The context is related to the questions "What would you like ChatGPT to know about you to provide better responses?" and "How would you like ChatGPT to respond?". More information can be found in the product release for Custom Instructions.

[Figure 6: Classifier activation rate against users sorted by classifier activation rate, for a subset of the classifiers. Note: each plot potentially orders users differently, as sorting is performed on a per-classifier basis, using a process illustrated in Appendix B.8. Results for all classifiers are shown in Figure B.9.]

[Figure 7: Comparison between user survey selections and the fraction of conversations that activate a particular classifier. Error bars indicate ±1 standard error. The remainder of the survey questions are shown in Appendix B.10.]
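The comparison underlying Figure 7 amounts to grouping users by their survey answer and averaging their per-user activation rates within each bucket. A minimal sketch with hypothetical data (not the paper's analysis code):

```python
from collections import defaultdict

def activation_by_survey_bucket(survey, rates):
    """survey: user_id -> Likert answer; rates: user_id -> per-user
    classifier activation rate. Returns the mean activation rate per
    answer bucket, over users present in both mappings."""
    buckets = defaultdict(list)
    for user_id, answer in survey.items():
        if user_id in rates:
            buckets[answer].append(rates[user_id])
    return {answer: sum(vals) / len(vals) for answer, vals in buckets.items()}
```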
In general, we find that users who respond "Agree" or "Strongly Agree" that they consider ChatGPT to be a friend tend to activate the top-level classifiers with greater frequency. Sub-classifiers such as Expression of Affection, Attributing Human Qualities, and Seeking Support also activate for a larger fraction of these users' conversations, providing evidence that users who perceive ChatGPT as a friend may have a qualitatively different experience when interacting with the product.

Longitudinal Analysis

Once a power user entered our study cohort, we also tracked them longitudinally by mapping the classifier metadata to their internal user identifiers. We used the following procedure to summarize the longitudinal behavior of users:

- Conversations were bucketed into days, aggregated by the fraction of conversations in a given day that activated the classifier.
- For each user and classifier, we fit a linear model on the fraction of classifier activations over days.
- The slopes of the regressions serve as a simple summary statistic that captures the overall linear trend in classifier activation over time.

We find that users generally fall into one of three buckets, illustrated in Figure 8a; we plot the users sorted by the slopes of the longitudinal regressions in Figure 8b:

- Users who decrease in classifier activation over time (left plot of Figure 8a, negative slope)
- Users who never activated a classifier or had minimal day-to-day change in usage (middle plot of Figure 8a, slope of approximately 0)
- Users who increase in classifier activation over time (right plot of Figure 8a, positive slope)

[Figure 8a: Illustrative examples of a user's classifier activations over time for the Pet Name classifier, for three user archetypes: decreasing activation, no activation, and increasing activation. Each of these graphs is fit with a linear regression to summarize the overall trend.]
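The per-user trend summary above amounts to an ordinary least-squares slope over daily activation fractions. A minimal sketch (the paper does not specify the exact fitting implementation):

```python
def daily_activation_slope(daily_fractions):
    """daily_fractions: list of (day_index, fraction_of_conversations_activated)
    pairs for one user and one classifier. Returns the OLS slope of the
    fraction over days; requires at least two distinct day indices."""
    n = len(daily_fractions)
    xs = [d for d, _ in daily_fractions]
    ys = [f for _, f in daily_fractions]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form simple linear regression: cov(x, y) / var(x).
    denom = sum((x - mean_x) ** 2 for x in xs)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
```

A positive slope corresponds to archetype C (increasing activation), a near-zero slope to archetype B, and a negative slope to archetype A.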
[Figure 8b: The slope produced from a linear regression of the fraction of conversations each day that activate a given classifier, for a subset of classifiers (Affectionate Language (U), Desire for Feelings (U), Seeking Support (U), Demands (A), Personal Questions (A), Pet Name (A)). Users are filtered to have a minimum of 14 individual days of usage, representing roughly the top half of users in our power user cohort. Activation of the classifiers generally trends down or neutral, with a tail of users increasing their fraction of usage. Results for all classifiers are shown in Figure B.20.]

3.3 Takeaways

Power users generally exhibit higher classifier activation rates than control users. Even though the majority of interactions contain minimal affective use, a small handful of users have significant affective cues in a large fraction of their chat conversations. Users who describe ChatGPT in personal or intimate terms (like identifying it as a friend) also tend to have the model use pet names and relationship references more frequently. We also find that users do not significantly shift in behavior over the period of the analysis; however, a small subset did exhibit meaningful changes in specific classifier activations, in both directions. From a purely observational study, we cannot draw direct connections between model behavior and users' usage patterns, and while we find that a small set of users have a pattern of increasing affective cues in conversations over time, we lack sufficient information about users to investigate whether this is due to model behavior or exogenous factors (e.g. life events). However, we do find correlation between affective cues in conversations and self-reported affective use of models from self-report surveys.

4 Randomized Controlled Trial (RCT)

While live platform usage provides a rich set of data for
analysis, there are significant limitations in the kinds of research questions that can be answered (see also Table 2):

- User Information: The ChatGPT platform currently does not collect a lot of key information about its users that we may like to control for in our analysis, such as gender or prior familiarity with AI.
- User Feedback: Beyond usage data, we would also like to get quantitative or qualitative feedback from users on their experience using models. However, it can be difficult to get users to fill in surveys or provide detailed feedback, and results from voluntarily filled-out surveys will be subject to issues of selection bias.
- Experimental Constraints: We are unable to dictate usage of a certain model configuration (e.g. voice, custom instructions) or usage tasks for naturally occurring on-platform usage, which limits our ability to study the impact of specific model or usage properties.
- Experiment Ethics: We believe that platform users should be informed of and opt in to any experiments relating to emotional well-being, particularly if we are interested in investigating the negative psychological outcomes of affective use of models.

To supplement the analysis of live platform usage, we collaborated with researchers at the MIT Media Lab's Fluid Interfaces research group to construct a large-scale randomized controlled trial to study negative outcomes of affective use of ChatGPT. We provide a full, separate report on the study in Fang et al. (2025), describing the experimental setup and analysis methodology in greater detail, but we provide here a short description of the study and a summary of its headline results.

4.1 RCT Study Details

We recruited 2,539 participants for a month-long study, of which 981 saw it to completion.[10] Participants were provided with a specially created ChatGPT account, and were asked to use the account daily for at least five minutes each day over a period of 28 days. Participants were randomly allocated to one of nine conditions (see Section 4.1) and their accounts were pre-configured to match that condition. Throughout the study, participants were also required to fill out a series of questionnaires, covering their demographic information, prior familiarity with AI, and their emotional state.

Conditions

Participants were randomly assigned to one of nine conditions, a cross-product of three modalities and three kinds of daily tasks:

Modality: Participants had their accounts configured to one of the following three chat modalities (or model configurations):

1. Engaging Voice: Advanced Voice Mode configured with a more engaging personality than the default in ChatGPT (configured via a custom system prompt)
2. Neutral Voice: Advanced Voice Mode configured with a more emotionally distant and professional personality than the default in ChatGPT (configured via a custom system prompt)
3. Text: Advanced Voice Mode was disabled for participants in this configuration

Task: All participants were given one of three sets of instructions:

1. Personal: Participants are assigned a daily conversation prompt from a list of questions eliciting personal conversation topics (e.g. "Help me reflect on my most treasured memory.")
2. Non-Personal: Participants are assigned a daily conversation prompt from a list of more task-oriented questions (e.g. "Help me learn how to save money and budget effectively.")
3. Open-Ended: No specific daily conversation prompts were given

[10] We describe the study completion criteria in Appendix C.2.

With 981 participants across 9 conditions, each condition had an average of 109 participants, with the lowest at 99. The system prompt changes for the engaging and neutral voice modalities can be found in Appendix C.1.

Questionnaires

Participants were asked to fill out the following questionnaires throughout the study:

- A pre-study questionnaire, covering their demographic details such as age, gender, prior familiarity with AI chatbots, and urban/rural living location.
- A daily post-interaction questionnaire following their required daily ChatGPT usage, which asked about their emotional valence
and arousal after the interaction.
- A weekly questionnaire about users' emotional state and feelings on their ChatGPT interactions.
- A post-study questionnaire about users' emotional state and psychosocial outcomes.

Additional Platform Details

- Participants were allowed to use their ChatGPT accounts freely outside of their daily task over the 28 days of the study.
- Participants had rate limits set equivalent to those in an Enterprise account, which are generally equivalent to or higher than those in ChatGPT Plus.
- Participants were randomly assigned one of two voices: Ember, which resembles a male speaker, or Sol, which resembles a female speaker. They were not allowed to pick their voice.
- Participants in the Text-only condition had Advanced Voice Mode disabled, though participants allocated to Advanced Voice Mode model conditions were able to use text-mode ChatGPT because of limitations of the platform.
- Memory and custom instructions were enabled for text and Advanced Voice Mode model conditions.

Study Administration

OpenAI and MIT jointly obtained Institutional Review Board (IRB) approval through Western Clinical Group (WCG) IRB. The research questions and hypotheses were pre-registered at AsPredicted.[11] Participants were recruited on CloudResearch, and were compensated $100 for completing the study. Our design includes obtaining explicit, informed consent from research participants for analyses of individual-level data. More details, such as the exclusion criteria, full questionnaires, and exploratory analysis of the participants' interaction data, can be found in Fang et al. (2025).

[11] https://aspredicted.org/7xhy-ds3c.pdf

Pre-Registered Research Questions

We pre-registered the following research questions before conducting this study:[12]

Q1: Will users of an engaging voice-based AI chatbot experience different levels of loneliness, socialization, emotional dependence, and problematic use of the AI chatbot compared to users of a text-based AI chatbot and a neutral voice-based AI chatbot?

[12] We ran an approximately 100-user pilot study before pre-registering the research questions, largely to iron out technical issues and refine the participant instructions and questionnaires.

Figure 9: Summary of study participants (gender, age, relationship status, and prior ChatGPT usage in text and voice modes).

Q2: Will engaging in personal tasks with an AI chatbot result in different levels of loneliness, socialization, emotional dependence, and problematic use of the AI chatbot compared to engaging in non-personal tasks and open-ended tasks with an AI chatbot?

Our key dependent variables are the four following measures of psychosocial outcomes for the user:

- Loneliness: ULS-8 (Wongpakaran et al., 2020), measured on a 4-point Likert scale (1-4)
- Socialization: LSNS-6 (Lubben, 1988), measured on a 6-point Likert scale (0-5)
- Emotional Dependence: ADS-9 (Sirvent-Ruiz et al., 2022), measured on a 5-point Likert scale (1-5)
- Problematic Use: PCUS (Yu et al., 2024), measured on a 5-point Likert scale (1-5)

Each variable corresponds to several different questions in the questionnaire, and the responses are averaged within each variable, adjusting for the sign.

4.2 Results

Figure 9 shows descriptive statistics about our 981 study participants. The study participants are almost evenly distributed between men and women, and the largest age group of participants was between ages 31-40. Participants also span a variety of relationship statuses. The bottom row displays responses to a question about participants' prior use of ChatGPT before the study, showing that participants had more prior experience using ChatGPT in text mode compared to voice mode, with nearly 70% having never used ChatGPT in voice mode before the study.

Findings for Pre-Registered Research Questions

We plot in Figure 10 the change in the pre-study and post-study[13] values of the four
dependent variables in our pre-registered research questions, averaged across users within task and modality conditions. We also visualize the average pre-study and post-study measurements in Figure C.1 in the Appendix.

[13] Loneliness and Socialization had initial values recorded at the start of the study, while Emotional Dependence and Problematic Use were recorded at the end of Week 1.

Figure 10: Average change in emotional well-being outcome variables (ΔLoneliness, ΔSocialization, ΔEmotional Dependence, ΔProblematic Use) by task and modality. Error bars indicate ±1 standard error.

To answer our primary research questions, we perform fixed-effects regressions predicting the post-study measures of emotional well-being, with either the task or modality as the key independent variable, and controlling for usage duration, age and gender. We detail the full analysis methodology and results in Fang et al. (2025), but we provide a summary of the findings here:

1. Overall: Participants were both less lonely and socialized less with others at the end of the four-week study period. Moreover, participants who spent more time using the model were statistically significantly lonelier and socialized less.

2. Modality: When controlling for usage duration, using either voice modality was associated with better emotional well-being outcomes compared to using the text-based model, with participants reporting statistically significantly less loneliness, less emotional dependence and less problematic use of the model. However, participants with longer usage duration of the neutral voice modality had statistically significantly lower socialization and greater problematic usage compared to using the text-based model.

3. Task: When controlling for usage duration, having personal conversations with the model was associated with statistically significantly more loneliness but also less emotional dependence and problematic usage compared to open-ended conversations. However, with longer usage duration this effect becomes non-significant.

4. Initial States: Pre-existing measures of emotional well-being were statistically significant predictors of post-interaction states. Participants who started with high initial emotional dependence and problematic use had a statistically significant reduction in both measures when using the engaging voice modality compared to the text modality.

Usage Analysis

While participants were instructed to use their ChatGPT accounts for at least 5 minutes a day, participants were also allowed to use the account outside of their daily allocated task.
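The analyses summarized above regress a post-study well-being measure on the experimental condition while controlling for covariates such as usage duration, age and gender. The sketch below illustrates that style of analysis with ordinary least squares via the normal equations on synthetic data; the condition dummy, covariates, and coefficient values are hypothetical illustrations and not the study's actual fixed-effects specification, data, or results:

```python
import random

def ols(X, y):
    """Ordinary least squares: solve the normal equations (X'X) b = X'y
    by Gaussian elimination with partial pivoting."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for col in range(k):                      # forward elimination
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * k
    for i in range(k - 1, -1, -1):            # back substitution
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, k))) / xtx[i][i]
    return b

# Synthetic cohort: the outcome rises with usage minutes, falls under a
# hypothetical "voice" condition, and is unaffected by age.
random.seed(0)
X, y = [], []
for _ in range(2000):
    voice = 1.0 if random.random() < 0.5 else 0.0   # condition dummy
    minutes = random.uniform(5.0, 30.0)             # daily usage duration
    age = random.uniform(18.0, 60.0)
    outcome = 2.0 + 0.05 * minutes - 0.3 * voice + random.gauss(0.0, 0.1)
    X.append([1.0, voice, minutes, age])            # intercept + regressors
    y.append(outcome)

intercept, b_voice, b_minutes, b_age = ols(X, y)
```

With enough samples the fitted coefficients approach the generating values (about -0.3 for the condition dummy and 0.05 per minute of usage, with a near-zero age coefficient), which is the sense in which a condition effect is estimated "controlling for" the covariates.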