Building a 15-Class Media Frame Detector with Longformer
1. Overview
Media framing is the process by which communicators select specific aspects of a perceived reality and make them more salient within a text (Card et al. 2015). The concept was operationalized in the 2015 Media Frames Corpus, which provides a 15-category coding scheme for describing communication strategies in news articles, and has since been used extensively for measuring argumentation, bias, and polarization. Accurate, lightweight frame classifiers have applications in internet media analytics and enable richer social network analysis by tracking the communication tactics of specific users.
Card et al. (2015) operationalize framing as ‘thematic sets’ or ‘packages’ of aligned ideas and assumptions:
| Frame | Description (Card et al., 2015 / Boydstun et al., 2014) |
|---|---|
| Economic | The costs, benefits, or other financial implications of the issue. |
| Capacity and Resources | The availability of physical, human, or financial resources, and the capacity of current systems. |
| Morality | Any perspective or policy objective/action compelled by religious doctrine, ethics, or social responsibility. |
| Fairness and Equality | The balance or distribution of rights, responsibilities, and resources; equality or inequality in the application of laws. |
| Legality, Constitutionality and Jurisprudence | The rights, freedoms, and authority of individuals, corporations, and government (focuses on the constraints/freedoms granted via the Constitution or laws). |
| Policy Prescription and Evaluation | Discussion of specific policies aimed at addressing problems, and the evaluation of whether certain policies will work or are working. |
| Crime and Punishment | The effectiveness and implications of laws and their enforcement (includes breaking laws, loopholes, sentencing, and punishment). |
| Security and Defense | Threats to the welfare of the individual, community, or nation (includes protection from not-yet-manifested threats). |
| Health and Safety | Health care access/effectiveness, sanitation, disease, mental health, and public safety (e.g., infrastructure safety). |
| Quality of Life | Threats and opportunities for the individual’s wealth, happiness, and well-being (effects on mobility, ease of routine, community life). |
| Cultural Identity | The traditions, customs, or values of a social group in relation to a specific policy issue. |
| Public Opinion | Attitudes and opinions of the general public, including polling and demographics. |
| Political | Considerations related to politics and politicians, including lobbying, elections, and attempts to sway voters (explicit mentions of partisan maneuvering). |
| External Regulation and Reputation | The international reputation or foreign policy of the U.S. (relations with other nations, trade agreements). |
| Other | Any coherent group of frames not covered by the above categories. |
Existing methods to automate frame-detection suffer from several limitations:
1) GenAI shortcomings. Several papers use GenAI prompt engineering, coercing generative models into frame classification tasks. This risks hallucination of non-existent frames, lowers the interpretability of the final models (raising the black-box barrier), and relies on computationally expensive solutions that are difficult to scale to large-N data. Models built for next-token generation are simply not well suited to full-context classification tasks.
2) Context length. Past solutions are often designed for sentence-level classification, whereas frames develop through complex interplay across hundreds of words of text and reference signals from earlier in the same input. A longer context with appropriate attention handling is essential to pick up these nuanced interactions.
3) Training data. Reliance on machine-generated classification data, rather than gold-standard human annotations, can bake in biases. The lack of abundant human-annotated data for frame detection remains a serious challenge for work in this area, and encoding human-backed reasoning into model weights requires careful correction and deliberate fine-tuning. This is an area I addressed specifically throughout development.
My overall aim was to develop a lightweight transformer-based model that addresses these challenges, is pragmatic for applied NLP research, and can run on a local mid-range GPU. The final model uses multi-label classification to detect the presence of the 15 argumentation frames in articles up to ~1600 words. It could in principle be applied to social media data, but that use should be validated separately, since the model is trained on news articles. My model outperforms past GenAI approaches in classification accuracy on the Media Frames Corpus while being 47 times smaller in model size (compared to mm-framing's Mistral-7B from Arora et al. 2025).
I developed the project in its associated repo and uploaded the final model to HuggingFace. The current version (as of writing) may be iterated upon as I optimize further on the available data and test new architectures, which will be reflected in future model updates.
2. Training Data
2.1 Gold Data: Human Annotations
The two best datasets for this task are the 2015 Media Frames Corpus (MFC) and the 2023 SemEval Task 3 (sub-task 2). Accessibility for the MFC has been greatly reduced (LexisNexis has paywalled document access), leaving about 21% of the original 14,481 articles available to me (via Lexis Uni). The SemEval data is available on request from the task organizers and provides 516 additional articles. Together, these yield a gold-standard dataset of 2,740 articles, combining labels from 19 trained annotators.
Media Frames Corpus (MFC)
The MFC covers articles on the topics of immigration, same-sex marriage, and smoking, spanning 1990-2012. Of the 12 US newspapers originally encoded, I train on the New York Times articles that are presently accessible through the Lexis Uni portal.
| Topic | Original MFC | NYT Available | Retrieved | Coverage |
|---|---|---|---|---|
| Immigration | 5,500 | 1,443 | 1,370 | 24.9% |
| Smoking | 5,074 | 822 | 821 | 16.2% |
| Same-Sex Marriage | 8,407 | 1,764 | 1,740 | 20.7% |
| Total | 18,981 | 4,029 | 3,931 | 20.7% |
Annotation Aggregation Strategy:
Each MFC article has 2+ annotators with span-level annotations. I tested three aggregation methods and opted to use the union of annotations (recording a frame if any annotator records it) to maximize the available signal; a sketch of this aggregation follows the table below.
| Strategy | Mean Frames/Article |
|---|---|
| Per annotator | 3.18 |
| Union | 4.28 |
| Intersection | 2.14 |
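For concreteness, here is a minimal sketch of the union aggregation, assuming each article's annotations are available as one set of frame labels per annotator (variable and function names are illustrative, not taken from the MFC release):

```python
# Minimal sketch of the union aggregation strategy: a frame is recorded for an
# article if ANY of its annotators marked that frame anywhere in the text.
FRAMES = [
    "Economic", "Capacity and Resources", "Morality", "Fairness and Equality",
    "Legality, Constitutionality and Jurisprudence",
    "Policy Prescription and Evaluation", "Crime and Punishment",
    "Security and Defense", "Health and Safety", "Quality of Life",
    "Cultural Identity", "Public Opinion", "Political",
    "External Regulation and Reputation", "Other",
]

def aggregate_union(annotator_frames: list[set[str]]) -> list[int]:
    """Collapse per-annotator frame sets into one multi-hot label vector."""
    union = set().union(*annotator_frames) if annotator_frames else set()
    return [int(frame in union) for frame in FRAMES]

# Example: two annotators whose span-level labels only partially overlap
labels = aggregate_union([{"Economic", "Political"}, {"Political", "Morality"}])
# -> 1s at the Economic, Morality, and Political positions, 0s elsewhere
```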
SemEval 2023 Task 3 Subtask 2
Another gold-standard dataset comes from the SemEval 2023 competition. These examples were originally designed for research on multilingual framing detection, but have seen little use since the competition ended. The data can be coerced into article-level annotations with similar properties to the MFC and covers articles collected between 2020-2022 on topics including the Ukraine-Russia war, COVID-19, migration, abortion, and climate change.
Merging SemEval and MFC gives the final gold set of training examples, and I use a 90/10 train-test split to provide a suitable 'exam' on unseen articles and robustly test model performance.
Combined Gold Dataset:
| Source | Articles | Train | Test |
|---|---|---|---|
| MFC | 2,224 | 1,993 | 231 |
| SemEval | 516 | 473 | 43 |
| Total | 2,740 | 2,466 | 274 |
Gold Data Label Distribution:
| Frame | Count | % of Articles |
|---|---|---|
| Legality | 1,600 | 58.4% |
| Political | 1,373 | 50.1% |
| Policy Prescription | 1,127 | 41.1% |
| Crime & Punishment | 879 | 32.1% |
| Economic | 870 | 31.8% |
| Quality of Life | 825 | 30.1% |
| Cultural Identity | 737 | 26.9% |
| Public Opinion | 688 | 25.1% |
| Health & Safety | 664 | 24.2% |
| Fairness & Equality | 599 | 21.9% |
| Morality | 590 | 21.5% |
| Security & Defense | 451 | 16.5% |
| External Regulation | 364 | 13.3% |
| Other | 304 | 11.1% |
| Capacity & Resources | 276 | 10.1% |
Average labels per article: 4.14
2.2 Silver Data: LLM Annotations
Unfortunately, the number of gold training examples is insufficient for a truly generalizable and accurate classifier. A model trained on them alone may perform well on the covered topics but would struggle without exposure to a much wider range of articles from across the internet. I therefore rely on secondary machine-labelled data to build out the training run, taking care to avoid overfitting on this data and baking in biases from less reliable GenAI annotations.
Source: copenlu/mm-framing
This dataset from Arora et al. 2025 contains ~478,000 news articles with frame labels generated by Mistral-7B-Instruct using prompt engineering. I cleaned it to remove short articles (<100 words) and non-standard frames, retaining 378,000 examples spanning May 2023 to April 2024 and covering 28 US-based news agencies across the political spectrum. Article bodies were hydrated from the source URLs with the Python library news-please, which took 3 days; a simplified sketch of the hydration step is below.
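A minimal sketch of hydrating a single article with news-please's `NewsPlease.from_url` helper; the real run was batched over ~3 days, error handling here is minimal, and the returned field names simply mirror the dataset variables listed further below:

```python
# Simplified sketch of re-hydrating an article body from its URL with news-please.
from typing import Optional
from newsplease import NewsPlease

def hydrate(url: str) -> Optional[dict]:
    try:
        article = NewsPlease.from_url(url)
    except Exception:
        return None  # dead link, paywall, parse failure, etc.
    if article is None or not article.maintext:
        return None
    if len(article.maintext.split()) < 100:
        return None  # mirror the <100-word cleaning step
    return {"url": url, "title": article.title, "maintext": article.maintext}
```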
Label Generation Method:
- Model: Mistral-7B-Instruct-v0.3
- Inference: vLLM with temperature=0.2, max_tokens=4000
- Process: Prompted to read text and classify into the 15-frame taxonomy
The Mistral model performed at a mediocre level on the gold MFC test data, with a Micro F1-score of 0.50. This motivated my cautious handling of the data: I sought to build out the reasoning patterns taught by this model, then correct them on instructive examples from reliable human annotators.
Mistral's lackluster performance against the gold standard (Media Frames Corpus):
| Label | Precision | Recall | F1-score |
|---|---|---|---|
| Capacity & Resources | 0.39 | 0.34 | 0.36 |
| Crime | 0.50 | 0.87 | 0.63 |
| Culture | 0.38 | 0.37 | 0.37 |
| Economic | 0.43 | 0.69 | 0.53 |
| Fairness | 0.17 | 0.74 | 0.28 |
| Health | 0.48 | 0.48 | 0.48 |
| Legality | 0.53 | 0.87 | 0.66 |
| Morality | 0.30 | 0.63 | 0.41 |
| Policy | 0.40 | 0.73 | 0.51 |
| Political | 0.68 | 0.53 | 0.60 |
| Public Opinion | 0.32 | 0.55 | 0.40 |
| Quality of Life | 0.28 | 0.36 | 0.31 |
| Regulation | 0.26 | 0.48 | 0.34 |
| Security | 0.30 | 0.45 | 0.36 |
| Micro Avg | 0.42 | 0.62 | 0.50 |
| Macro Avg | 0.39 | 0.58 | 0.45 |
Dataset Variables Used:
- `title` - Article headline
- `gpt_topic` - Consolidated topic (19 categories)
- `text_generic_frame` - List of frame labels
- `maintext` - Full article body (from the joined `newsarticles` table)
I consolidated additional metadata provided by the Mistral model, transforming an unstructured topic classification field into 19 topic categories based on empirical similarity. This allowed the 350,000+ examples to be encoded into discrete topics such as politics, sport, and environment, which is key to improving model performance via domain-level training.
3. Model Architecture
3.1 Topic Classifier: RoBERTa with Truncation
I found that encoder transformer models (BERT and Longformer) perform better at frame classification when provided with relevant metadata. Without any model-level changes, adding the article topic to the beginning of the input bakes in a sizable improvement in test accuracy. Topic injection appears to prime the model with domain context, generating a +2.7% gain in Micro F1, and represents an easy feature-level improvement. Certain frames saw greater gains than others, largely reflecting training data availability, but all saw increased performance.
To leverage this finding, I trained a lightweight RoBERTa model to classify unseen text into one of the 19 empirical topics derived from the silver training data. Compared to alternatives (BERT-base and DistilBERT), RoBERTa achieved the highest validation accuracy at 76% on 64,000 examples, which was sufficient for the task of assisting the downstream framing classifier.
Head+Tail truncation strategy (sketched in code after the list):
- Head: First 320 tokens (captures title and introduction)
- Tail: Last 190 tokens (captures conclusion)
- Total: 510 tokens
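A minimal sketch of this truncation, assuming a standard RoBERTa tokenizer from HuggingFace transformers (the helper name is illustrative):

```python
# Minimal sketch of head+tail truncation: keep the first 320 and last 190
# tokens (510 total), leaving room for RoBERTa's two special tokens in its
# 512-token window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def head_tail_truncate(text: str, head: int = 320, tail: int = 190):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) > head + tail:
        ids = ids[:head] + ids[-tail:]  # keep the introduction and the conclusion
    # prepare_for_model re-adds <s>...</s>, pads to 512, and returns tensors
    return tokenizer.prepare_for_model(
        ids, padding="max_length", max_length=512, return_tensors="pt"
    )
```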
Topic Injection Implementation:
TOPIC:{topic}
{title}
{article_text}
Topic injection implements a 'soft' mixture-of-experts approach, giving the model prior context without the overhead of running multiple 'domain expert' models.
3.2 Longformer for Full Document Classification
To produce the framing classifier, I evaluated a range of models, combining feature-level and architectural findings (see the appendix). I settled on the Longformer, a RoBERTa variant designed for long-context inputs (up to 4096 tokens). It implements sparse attention, which enables longer inputs while keeping computation comparatively low, and can be run locally on a mid-range GPU (~16 GB VRAM). In addition, the Longformer's ability to assign global attention to specific tokens supports the 'soft' mixture-of-experts design of the structured topic injections, allowing the topic token to attend to (and be attended to by) the rest of the input during training and evaluation. This balances the primary advantage of BERT-style models, full attention within the context window, with the significantly larger input size of the Longformer, enabling analysis of complete articles.
Key Configuration (a code sketch follows the list):
- Global attention on [CLS] token (position 0)
- Global attention on first topic token (position 3)
- Limit the input to 2048 tokens for increased efficiency without loss for most use cases (handling up to ~1500 words)
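A minimal sketch of this input construction and global attention setup with HuggingFace transformers, starting from the base checkpoint (the fine-tuned weights, label order, and decision thresholds are not shown here):

```python
# Minimal sketch of the topic-injected Longformer input with global attention
# on the <s>/[CLS] token (position 0) and the first topic token (position 3).
import torch
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=15,
    problem_type="multi_label_classification",
)

def encode(topic: str, title: str, article_text: str):
    text = f"TOPIC:{topic}\n{title}\n{article_text}"  # structured topic injection
    enc = tokenizer(text, truncation=True, max_length=2048,
                    padding="max_length", return_tensors="pt")
    global_attention_mask = torch.zeros_like(enc["input_ids"])
    global_attention_mask[:, 0] = 1  # <s> / [CLS] token
    global_attention_mask[:, 3] = 1  # first topic token, per the configuration above
    enc["global_attention_mask"] = global_attention_mask
    return enc

# Example forward pass over one article
inputs = encode("immigration", "Example headline", "Full article body ...")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 15), one logit per frame
```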
4. Silver Training
I trained allenai/longformer-base-4096 on the silver dataset for 4 epochs over 72 hours using an NVIDIA A40 (48GB VRAM), holding out 10% for validation and model selection. To address class imbalance, I used binary cross-entropy loss weighted by the inverse frequency of frames in the Mistral-generated labels. This encouraged the model to attend equally to all frames rather than over-predicting dominant categories. I further mitigated imbalance through post-training threshold optimization on the classification layer.
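For illustration, one standard way to express this kind of inverse-frequency weighting in PyTorch is via the pos_weight argument of BCEWithLogitsLoss; the sketch below shows the idea with stand-in data, not necessarily the exact weighting formula used:

```python
# Illustrative inverse-frequency weighting for multi-label BCE: rarer frames
# receive a larger positive weight.
import torch

# Stand-in for the (N, 15) multi-hot matrix of silver labels
train_labels = torch.randint(0, 2, (1000, 15)).float()

pos = train_labels.sum(dim=0)              # positive count per frame
neg = train_labels.shape[0] - pos          # negative count per frame
pos_weight = neg / pos.clamp(min=1)        # inverse-frequency style weight
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# inside the training loop: loss = loss_fn(logits, targets.float())
```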
I stopped training at 4 epochs, even though validation scores were still improving marginally, to avoid overfitting on the less reliable machine-generated labels. Each epoch took 11-12 hours.
| Parameter | Value |
|---|---|
| Batch Size | 16 |
| Gradient Accumulation | 2 (effective 32 batch size) |
| Learning Rate | 2e-5 |
| Weight Decay | 0.01 |
| Epochs | 4 |
5. Gold Training
I then fine-tuned the silver model on the gold training set from MFC and SemEval. This stage aimed to shift the model's reasoning towards the annotators' decision-making processes. I relied on focal loss, a variant of cross-entropy loss that focuses training on more difficult examples, which I found more instructive for maximizing learning gain on frame discernment. Its down-weighting factor helps the model ignore examples it already handles well and concentrate on the tougher ones; I searched over four candidate down-weighting settings and settled on a gamma value of 2. In addition, I used a learning rate scheduler to make the best use of the smaller gold training set: the LR starts high, waits for a validation plateau, and then decreases incrementally to converge on a minimum.
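A minimal sketch of a BCE-based focal loss for multi-label targets with gamma = 2, as described above (illustrative rather than the exact implementation):

```python
# BCE-based focal loss: easy examples (high probability on the correct side)
# are down-weighted by (1 - p_t) ** gamma, so gradients concentrate on frames
# the model still struggles with.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1 - probs) * (1 - targets)  # prob. assigned to the true label
    return ((1.0 - p_t) ** gamma * bce).mean()
```

The plateau-based LR schedule described above corresponds to something like `torch.optim.lr_scheduler.ReduceLROnPlateau` on a validation metric, with the decay factor and patience chosen empirically.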
| Parameter | Value |
|---|---|
| Starting Checkpoint | Phase 1 best model (epoch 4) |
| Loss Function | Focal Loss (gamma=2.0) |
| Learning Rate | 2e-5 |
| Epochs | 10 |
| Batch Size | 2 |
| Gradient Accumulation | 8 (effective 16) |
| Validation | 90/10 train/test split |
Evaluation Metrics
Performance after per-class threshold optimization (a sketch of the threshold search follows the table):
| Metric | Score |
|---|---|
| Weighted F1 | 0.6863 |
| Micro F1 | 0.6846 |
| Macro F1 | 0.645 |
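A minimal sketch of the per-class threshold optimization on validation data (the grid and defaults are illustrative):

```python
# For each frame, pick the probability cutoff that maximizes F1 on validation
# data. `probs` and `labels` are (N, 15) arrays of sigmoid outputs and gold
# multi-hot labels, respectively.
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    grid = np.arange(0.05, 0.95, 0.05)
    thresholds = np.full(probs.shape[1], 0.5)
    for k in range(probs.shape[1]):
        scores = [f1_score(labels[:, k], probs[:, k] >= t, zero_division=0) for t in grid]
        thresholds[k] = grid[int(np.argmax(scores))]
    return thresholds

# At inference time: predictions = (probs >= thresholds).astype(int)
```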
Per-Class Report (excluding “Other”):
| Frame | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Economic | 0.78 | 0.70 | 0.74 | 87 |
| Capacity & Resources | 0.44 | 0.48 | 0.46 | 25 |
| Morality | 0.53 | 0.64 | 0.58 | 61 |
| Fairness & Equality | 0.60 | 0.52 | 0.56 | 63 |
| Legality | 0.83 | 0.79 | 0.81 | 164 |
| Policy Prescription | 0.58 | 0.85 | 0.69 | 110 |
| Crime & Punishment | 0.63 | 0.86 | 0.73 | 87 |
| Security & Defense | 0.49 | 0.72 | 0.58 | 29 |
| Health & Safety | 0.61 | 0.77 | 0.68 | 69 |
| Quality of Life | 0.57 | 0.74 | 0.64 | 96 |
| Cultural Identity | 0.46 | 0.74 | 0.57 | 72 |
| Public Opinion | 0.51 | 0.77 | 0.61 | 81 |
| Political | 0.70 | 0.94 | 0.81 | 134 |
| External Regulation | 0.92 | 0.41 | 0.57 | 29 |
Some Takeaways
Building reliable models for specific NLP tasks, especially hard-to-verify rhetorical analysis tasks, is becoming increasingly possible with the abundance of data labelled by generative models. This data enables the learning of approximate identification patterns, but it has so far proven insufficient for production-level modelling, as biases and relics of the generative models' own training data skew their labels in ways that are hard to measure. Gold fine-tuning on curated human-annotated data remains essential to deploy frame classifiers at scale, and balancing data availability with quality is an ongoing area of research.
To improve this model, I strongly encourage the open release of all articles in the Media Frames Corpus (which LexisNexis has made very difficult to access), and the development of further human annotation studies covering diverse topic areas under the same coding scheme.
Thanks for reading!
Appendix
Future Directions
- Improved Lexis-Nexis access to increase scope of gold training data
- Calibration analysis of predicted probabilities, using classification layer to generate continuous measure of frames (e.g. 60% economic vs. 20% policy prescription) - COMING SOON
- Semi-supervised refinement using gold model to clean silver data
- Ensemble of Longformer with domain experts to eke out greater domain performance but at higher computation cost
- Empirical case study deploying the model at scale
- Knowledge distillation from the Longformer into lighter-weight models for even faster performance, with input size/accuracy trade-offs
- Masked language modelling training on silver data, possibly improving performance by 3-4% based on past papers
Tips and Tricks
Results from experiments on smaller models usually scaled very well to larger models of the same architecture family. This saved computation and made it easier to predict improvements without running on the full training set (and renting GPUs to do so).
I used RunPod's A40 GPU for the full silver training run, which took approximately 72 hours. I finished with the gold run on my local RTX 4070Ti with 16GB VRAM, using gradient accumulation and a smaller batch size, completing the focal loss grid search and final training in about 12 hours.
Using Weights & Biases (the wandb package) to track training runs proved immensely helpful and greatly eased comparison between experiments. Their online interface is very useful, saves countless hours just for plotting loss curves, and spares an organizational headache.
Experiments
As a glimpse into the underlying behaviour: in certain edge cases the topic classification model performed poorly, such as classifying stories about wildlife hunting as 'crime'. On aggregate, however, these inaccuracies did not significantly disrupt downstream model performance. Alternative topic classifiers could be used to make this stage more robust, but might not affect end performance strongly.
For the Longformer classifier, a full-attention long-context model would ideally be used, as the sliding window of local attention in the sparse attention architecture is not ideal. However, such models are very expensive to run, since full self-attention grows quadratically with input length. This remains a core challenge for encoder (masked language modelling) architectures on long documents.
Another option I considered was stringing together multiple BERT models in a hierarchical chain. I judged this excessively complex, and it forfeits the document-wide attention the Longformer provides, but it may prove useful if paired with a smart chunking method that breaks a long input into relevant sub-inputs for each BERT.
Finally, following a key finding in the winning SemEval 2023 Task 3 sub-task 2 paper, I attempted to run masked language modelling pre-training on the base Longformer. This would have taken >48 hours, so I opted not to continue, but it has been shown to improve performance on related tasks by 3-4%.