User representations are routinely used by platform developers in recommendation systems, by marketers in targeted advertising, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly: how does one extract a stable user representation, effective for many downstream tasks, from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g., does it improve classifier performance, or yield more accurate recommendations in a recommendation system?), but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them on three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into existing classifiers, and improve them, within the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use the distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.