Over the last decade, Twitter has emerged as one of the most influential forums for social, political, and health discourse. In this paper, we introduce a massive dataset of more than 45 million geo-located tweets posted between 2015 and 2021 from the US and Canada (TUSC), specially curated for natural language analysis. We also introduce Tweet Emotion Dynamics (TED) -- metrics that capture patterns of emotions associated with tweets over time. We use TED and TUSC to explore the use of emotion-associated words across the US and Canada; across 2019 (pre-pandemic), 2020 (the year the pandemic hit), and 2021 (the second year of the pandemic); and across individual tweeters. We show that Canadian tweets tend to have higher valence, lower arousal, and higher dominance than US tweets. Further, we show that the COVID-19 pandemic had a marked impact on the emotional signature of tweets posted in 2020, compared with the adjoining years. Finally, we compute TED metrics for 170,000 tweeters to benchmark the characteristics of these metrics at an aggregate level. TUSC and the TED metrics will enable a wide variety of research on how we use language to express ourselves, persuade, communicate, and influence, with particularly promising applications in public health, affective science, social science, and psychology.