Predicting Mental Health Risk Through Social Media

By Jannic Alexander Cutura, DSTI School of Engineering

The IEEE Big Data Conference and the Big Data Cup Challenge

The IEEE International Conference on Big Data is one of the world’s premier venues for research at the intersection of large-scale data processing, machine learning, and real-world applications. Each year, alongside the main conference proceedings, IEEE hosts the Big Data Cup Challenge—a competition that invites researchers and practitioners to tackle pressing societal problems using cutting-edge analytical methods. The 2025 edition, sponsored by Hong Kong Polytechnic University, focused on a particularly sensitive and impactful domain: predicting suicide risk from social media posts. This challenge provided participants with anonymized Reddit data and tasked them with developing models capable of forecasting a user’s future mental health risk level based on their posting history.

I’m honored to share that the approach described in this article was awarded first place in the competition, along with a $1,000 prize. What started as an exploration of how large language models could be combined with temporal modeling techniques ultimately proved to be the winning solution—a result that speaks to the potential of interpretable, lightweight methods even in an era of increasingly complex AI systems.

The Problem: Predicting Future Risk from Past Posts

Mental health conditions, including suicidal ideation, often manifest through changes in language, emotion, and behavior that can be observed in online discourse. Platforms like Reddit, where users frequently share personal struggles anonymously, offer a unique window into these patterns. The challenge at hand required predicting the suicide risk level of a user’s next post—one that has not yet been written—based solely on their five most recent posts and their timestamps.

This is fundamentally different from simply classifying an existing piece of text. The task requires understanding not just what someone has written, but how their mental state might evolve over time. It demands models that can capture temporal trajectories and extrapolate into the future.

The dataset used in this research contains over 7,000 post sequences from 395 unique Reddit users. Each post is labeled with an ordinal suicide risk rating on a four-point scale: indicator (general warning signs), ideation (explicit suicidal thoughts), behavior (intent to act), and attempt (reference to suicidal actions). The distribution of these labels reflects real-world patterns—nearly half of all posts fall into the ideation category, while attempt-level posts are rare (under 5%), making this a challenging imbalanced classification problem.

A Two-Stage Approach: LLM Classification Meets Temporal Modeling

The research explores a two-stage architecture that combines the semantic understanding of large language models with lightweight temporal aggregation methods. The core insight is that while LLMs excel at understanding the meaning and emotional content of individual posts, predicting future risk requires reasoning about how risk levels change over time.

In the first stage, each individual post is classified using a prompt-based approach with OpenAI’s GPT models (GPT-5, GPT-4o, and GPT-5-mini). Rather than fine-tuning these models—which would require substantial computational resources—the research leverages carefully crafted prompts that have been validated in prior mental health NLP research. This zero-shot approach proves remarkably effective, with the models achieving strong classification accuracy on individual posts.
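As a minimal sketch of this first stage (the prompt wording, function names, and client call below are illustrative assumptions, not the competition's exact prompt; only the four-label ordinal scale comes from the task definition), the zero-shot classifier can be split into a prompt builder and a parser that maps the model's answer onto an ordinal level:

```python
# Sketch of the zero-shot post-classification stage. The prompt text and
# helper names are assumptions for illustration; the four-point ordinal
# scale (indicator < ideation < behavior < attempt) is from the dataset.

RISK_LABELS = ["indicator", "ideation", "behavior", "attempt"]

def build_prompt(post_text: str) -> str:
    """Assemble a zero-shot classification prompt for one Reddit post."""
    return (
        "Classify the suicide risk level of the following post as exactly "
        "one of: indicator, ideation, behavior, attempt.\n\n"
        f"Post: {post_text}\n\n"
        "Answer with a single word."
    )

def parse_label(model_answer: str) -> int:
    """Map the model's one-word answer to an ordinal level 0..3."""
    answer = model_answer.strip().lower()
    for level, label in enumerate(RISK_LABELS):
        if label in answer:
            return level
    raise ValueError(f"Unrecognized label: {model_answer!r}")

# With the OpenAI SDK this would wrap a chat-completions call, roughly:
#   resp = client.chat.completions.create(model="gpt-4o", messages=[...])
#   level = parse_label(resp.choices[0].message.content)
```

Because the predictions are deterministic functions of the post text, they can be cached once and reused across all downstream aggregation experiments.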

The second stage tackles the harder problem: what happens when we need to predict the risk level of a post that doesn’t exist yet? For these “final observation” cases, the research tests several temporal aggregation strategies that combine the predicted risk levels of previous posts:

  • Simple unweighted averaging as a baseline
  • Weighted averages that emphasize more recent posts
  • Exponential decay functions that model the fading relevance of older content
  • Time-distance weighting that accounts for irregular posting patterns
  • ARIMA forecasting that treats risk levels as a time series


A key finding is that the choice of aggregation method matters far less than the quality of the underlying post classifications. All five aggregation strategies perform within 0.4% of each other, suggesting that once you have accurate post-level predictions, even simple averaging produces reliable forecasts.
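The simpler strategies reduce to a few lines each. The sketch below assumes risk levels encoded 0–3 with the most recent post last; the function names, half-life, and time constant are illustrative choices, not the paper's exact parameters:

```python
import math

def linear_weighted(levels):
    """Weighted average emphasizing recent posts: weights 1, 2, ..., n."""
    weights = range(1, len(levels) + 1)
    return sum(w * x for w, x in zip(weights, levels)) / sum(weights)

def exponential_decay(levels, half_life=2.0):
    """Older posts fade with an exponential half-life (in post positions)."""
    n = len(levels)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return sum(w * x for w, x in zip(weights, levels)) / sum(weights)

def time_distance_weighted(levels, ages_days, tau=7.0):
    """Down-weight posts by how long ago they were written (irregular gaps)."""
    weights = [math.exp(-a / tau) for a in ages_days]
    return sum(w * x for w, x in zip(weights, levels)) / sum(weights)

# Example: a rising trajectory over five posts (0=indicator .. 3=attempt)
history = [0, 1, 1, 2, 2]
forecast = round(linear_weighted(history))  # round to the nearest ordinal level
```

All three are O(n) over the five-post window, which is what makes the aggregation step's computational overhead negligible at prediction time.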

Comparing LLMs with Neural Embedding Methods

The research doesn’t stop at LLM-based approaches. It also benchmarks three neural embedding methods that learn to predict risk directly from post sequences without relying on external API calls:

The first method uses MiniLM, a compact sentence embedding model, combined with time-weighted pooling and an ordinal regression head. The second employs a Gated Recurrent Unit (GRU) that processes posts sequentially, learning how linguistic cues interact with posting rhythm over time. The third fine-tunes DistilBERT using Low-Rank Adaptation (LoRA), a parameter-efficient technique that enables transformer adaptation while keeping most of the model frozen.
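To make the first method's pooling step concrete, here is a rough sketch under stated assumptions: embeddings are plain lists of floats (MiniLM produces 384-dimensional vectors), recency weighting uses an exponential decay with an illustrative time constant, and the learned ordinal regression head that follows is omitted:

```python
import math

def time_weighted_pool(embeddings, ages_days, tau=7.0):
    """Pool per-post sentence embeddings into one sequence vector.

    Each post is weighted by exp(-age / tau), so recent posts dominate
    the pooled representation that the ordinal head would classify.

    embeddings: list of n per-post vectors (each a list of floats)
    ages_days:  list of n ages, i.e. time since each post, in days
    """
    weights = [math.exp(-a / tau) for a in ages_days]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]

# An ordinal regression head would then map this pooled vector to three
# cumulative logits P(level > k), rather than four independent classes.
```

The design choice here is that pooling bakes the posting rhythm into a fixed-size vector, so the classifier head never has to see timestamps directly.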

These neural methods achieve overall performance nearly identical to the LLM approaches—the GRU model comes within 0.02% of GPT-5’s accuracy. However, a critical difference emerges when examining performance on final-observation sequences, where the target post is never seen. Here, LLM-based methods substantially outperform neural approaches. GPT-5 achieves an F1 score of 0.46 on these challenging cases, while the best neural method (GRU) reaches only 0.38. The smaller neural models struggle even more, with MiniLM dropping to just 0.25.

This pattern suggests that neural embeddings trained on limited data have difficulty generalizing to truly predictive tasks that require temporal extrapolation, whereas LLMs—with their vast pretraining—maintain acceptable performance even when predicting the unknown.

Key Results and Practical Implications

The best-performing configuration—GPT-5 with linear weighted averaging—achieves an overall F1-weighted score of 0.72 and a mean absolute error of just 0.30 on the ordinal scale. To put this in perspective, when the model makes an error, it typically confuses adjacent risk categories (like indicator vs. ideation) rather than making extreme misclassifications (like confusing indicator with attempt).
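To see what an MAE of 0.30 means on the 0–3 ordinal scale, consider the toy numbers below (made up for illustration, not the competition data): if most predictions are exact and the rest are off by one adjacent category, each adjacent miss contributes exactly 1 to the error sum:

```python
def ordinal_mae(y_true, y_pred):
    """Mean absolute error on the ordinal scale 0=indicator .. 3=attempt."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical: 7 of 10 predictions exact, 3 off by one adjacent category
y_true = [1, 1, 0, 2, 1, 3, 1, 0, 2, 1]
y_pred = [1, 1, 1, 2, 1, 2, 1, 0, 2, 0]
mae = ordinal_mae(y_true, y_pred)  # 3 adjacent misses / 10 posts = 0.30
```

A model that confused indicator with attempt would instead contribute 3 per miss, so a low MAE directly certifies that errors cluster around the true category.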

From a practical deployment standpoint, the LLM approach offers compelling advantages. The one-time cost of classifying all posts in the training set with GPT-5 was approximately $25, after which predictions can be cached and reused indefinitely. Temporal aggregation adds negligible computational overhead, enabling sub-millisecond risk assessments suitable for real-time applications. The robustness across aggregation methods also simplifies deployment by eliminating the need for extensive hyperparameter tuning.

Neural methods, while slightly less accurate on the hardest cases, offer the advantage of local deployment without external API dependencies—an important consideration for applications involving sensitive health data.

Ethical Considerations and the Path Forward

Research on suicide risk prediction carries significant ethical responsibilities. The paper emphasizes several critical points that any deployment of such technology must address:

Model outputs are statistical estimates, not clinical diagnoses. They require human oversight and cannot replace professional mental health assessment. Any real-world deployment would require strict privacy safeguards, secure data handling, and compliance with platform policies and data protection regulations.

Predictive systems also carry inherent risks. False positives might cause unnecessary alarm or user distress, while false negatives could fail to identify individuals who would benefit from support. The research advocates for uncertainty estimates, bias monitoring, and human-in-the-loop review processes in any operational system.

Perhaps most importantly, automated mental health tools should complement—not substitute—access to qualified professionals. Technology can help identify patterns and flag potential concerns at scale, but the human connection in mental health care remains irreplaceable.

Conclusion

This research demonstrates that combining the semantic understanding of large language models with simple, interpretable temporal modeling can effectively predict future mental health risk from social media posts. The approach achieves strong accuracy while remaining practical for deployment, with LLM-based methods proving particularly robust for the challenging task of predicting unobserved future states.

As natural language processing continues to advance, applications in mental health represent both an opportunity and a responsibility. The techniques developed here could eventually support early intervention systems, helping to identify individuals at risk before crises occur. But realizing this potential will require continued research, careful ethical consideration, and close collaboration between technologists, mental health professionals, and the communities they serve.

The code and replication materials for this research are available at

About the Author

Jannic Alexander Cutura is a lecturer at DSTI School of Engineering in Paris, France, and works as a staff data engineer at the Directorate General Information Systems of the European Central Bank in Frankfurt, Germany. His research interests include natural language processing, machine learning, and applications of AI in social good domains.

Note: The views presented in this work are solely those of the author and do not represent the views of the European Central Bank or the Eurosystem.
