Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Maria Milkova, Maksim Rudnev

arXiv:2603.18822v1

Abstract: This study presents a multi-stage classification framework for detecting human values in noisy Russian-language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer-based models capable of predicting the probability of each of the ten basic values. The best performing model, XLM-RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.
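The soft-label step described in the abstract can be illustrated with a minimal sketch. It assumes that each LLM annotation run returns a list of value names for a post and that the soft label is simply the share of runs that assigned each value. The abstract does not specify the aggregation rule, so the function `soft_labels`, the label strings in `VALUES`, and the example runs are all hypothetical.

```python
from typing import Dict, List, Sequence

# Schwartz's ten basic values. The exact label strings are an assumption;
# the source does not list the names its models use.
VALUES: List[str] = [
    "self-direction", "stimulation", "hedonism", "achievement", "power",
    "security", "conformity", "tradition", "benevolence", "universalism",
]


def soft_labels(judgments: Sequence[Sequence[str]]) -> Dict[str, float]:
    """For one post, return the share of annotation runs that assigned each value."""
    if not judgments:
        raise ValueError("need at least one annotation run")
    n_runs = len(judgments)
    return {
        value: sum(1 for run in judgments if value in run) / n_runs
        for value in VALUES
    }


# Four hypothetical LLM annotation runs for the same post.
runs = [["security", "tradition"], ["security"], ["security", "power"], []]
labels = soft_labels(runs)
print(labels["security"], labels["tradition"], labels["hedonism"])
```

A value chosen by every run gets a soft label of 1.0, one chosen by no run gets 0.0, and partial agreement falls in between, which is what lets a classifier be trained on probabilities rather than hard 0/1 targets.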

Executive Summary

This study presents a multi-stage classification framework for detecting Schwartz's ten basic human values in noisy Russian-language social media posts. The pipeline filters spam and nonpersonal content, selects value-relevant and politically relevant posts, annotates them with an LLM, and trains multi-label classifiers on soft labels aggregated from multiple LLM judgments. The framework is validated on a random sample of 7.5 million public text posts, and the best model, XLM-RoBERTa large, reaches a macro F1 of 0.83 on held-out test data. Expert labels, GPT annotations, and model predictions are treated as coherent but not identical readings of the same texts, with expert annotations serving as an interpretative benchmark rather than ground truth. The model generally aligns with human judgments but systematically overestimates the Openness to Change value domain.

Key Points

  • The study proposes a multi-stage classification framework for detecting Schwartz's ten basic human values in noisy Russian-language social media text.
  • Value detection is treated as a multi-perspective interpretive task: expert labels, LLM-generated annotations, and model predictions are compared as related but not identical readings of the same texts.
  • The best model, XLM-RoBERTa large, achieves a macro F1 of 0.83 on held-out test data; the framework as a whole is validated on a random sample of 7.5 million public text posts.
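For readers unfamiliar with the metric, macro F1 in a multi-label setting is the unweighted mean of the per-label F1 scores. The sketch below is a plain-Python illustration, not the authors' evaluation code. The 0.5 threshold in `binarize` and the toy data are assumptions, since the abstract does not say how probabilities are converted to labels.

```python
from typing import List, Sequence


def binarize(probs: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn predicted probabilities into 0/1 labels (the threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over all label columns."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # A label with no true and no predicted positives scores 0.0 here.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / n_labels


# Toy example with three posts and two labels.
y_true = [[1, 0], [1, 1], [0, 1]]
probs = [[0.9, 0.1], [0.4, 0.8], [0.2, 0.7]]
score = macro_f1(y_true, binarize(probs))
print(round(score, 4))
```

Because every label contributes equally, a rare value that the model handles poorly lowers macro F1 as much as a frequent one would.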

Merits

Strength in Methodology

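The study checks both LLM annotations and model predictions against human experts, and it models annotation subjectivity explicitly by aggregating multiple LLM judgments into soft labels instead of forcing a single hard label per post.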

Significance of Findings

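The study reports distinct patterns of value expression and co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.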

Demerits

Overestimation of Certain Value Domains

The authors report that the model systematically overestimates the Openness to Change value domain relative to human judgments, so results for that domain call for caution until the bias is corrected.
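One simple way to quantify this kind of overestimation is the mean signed gap between predicted probabilities and reference labels for a single domain. The sketch below is an illustration under stated assumptions only: the source does not say how the overestimation was measured, and the function `mean_signed_gap` and the numbers are hypothetical.

```python
from typing import Sequence


def mean_signed_gap(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of (prediction - reference); a positive result indicates overestimation."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and equally long")
    return sum(p - r for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical model probabilities and expert labels for one value domain.
model_probs = [0.8, 0.6, 0.3, 0.7]
expert_labels = [1.0, 0.0, 0.0, 1.0]
gap = mean_signed_gap(model_probs, expert_labels)
print(round(gap, 3))
```

A gap near zero means the model and the experts assign the domain about equally often on average, while a consistently positive gap across samples would match the pattern the study describes.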

Limited Generalizability

The study's focus on Russian social media data may limit its generalizability to other languages and cultural contexts.

Expert Commentary

This study contributes to research on value-based interpretation in digital environments. Its main strength is the multi-perspective design: rather than treating any single annotation source as ground truth, it compares expert labels, GPT annotations, and model predictions as related readings of the same texts. The reported patterns of value expression and co-occurrence are relevant to work on cultural variation and communicative framing. The systematic overestimation of Openness to Change shows that further refinement is needed before the predictions for that domain can be relied on. The framework and released models may also interest social media companies, digital platforms, governments, and regulatory bodies concerned with culturally sensitive communication.

Recommendations

  • Future studies should refine the framework to reduce the systematic overestimation of the Openness to Change domain and check the other value domains for similar biases.
  • The findings should be replicated in other languages and cultural contexts to test how well the framework generalizes.
