Imagine you are watching a movie. You see a character walking down the street.
If you just look at the script (the words), you see: "The character walked."
But if you look at the movie (the actual meaning), you need to know more:
- Did they walk for a second and stop? (A quick achievement)
- Are they still walking right now? (An ongoing activity)
- Do they walk every day at 5 PM? (A habit)
- Did they just stand there? (A state of being)
This paper is about teaching computers to understand that extra layer of meaning, which linguists call "Aspect."
Here is a simple breakdown of what the researchers did, using some everyday analogies.
1. The Problem: The "Flat" Map
Think of current computer language tools (like the ones that power Siri or Google Translate) as having a flat map of the world. They know where cities are (the main words) and how roads connect them (the grammar).
But this map is missing the terrain. It doesn't know if a road is a steep mountain climb (a difficult task), a flat sidewalk (an easy state), or a winding path that never ends (an ongoing activity).
In the world of "Meaning Representation" (how computers store the meaning of sentences), this missing terrain is called Aspect. Without it, computers struggle to understand the difference between:
- "I am eating" (I'm in the middle of it right now).
- "I ate" (I finished it).
- "I eat" (I do this every day).
2. The Solution: Building a 3D Model
The researchers at the University of Colorado decided to build a 3D model of these sentences. They created a new dataset where they manually labeled every "event" in a sentence with its specific "terrain type."
They used a system called UMR (Uniform Meaning Representation), which is like a universal blueprint for language. They added a special "Aspect Layer" to this blueprint.
The "Terrain Types" (The Labels):
To make this concrete, imagine you are a traffic controller for a sentence. You have to assign a status to every action:
- State: The car is parked. (Nothing is changing).
- Activity: The car is driving down the highway. (It's moving, but no specific destination is reached yet).
- Performance: The car crossed the finish line. (It started, it finished, and it reached a goal).
- Endeavor: The car tried to cross the finish line but ran out of gas. (It was a process that stopped before the goal).
- Habitual: The car drives to work every morning. (It happens repeatedly).
- Process: "The driving." (We don't know if it started or stopped; it's just a vague concept).
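In programming terms, this label set is just a small closed vocabulary: every event gets exactly one of six tags. A minimal sketch in Python (the example annotation at the bottom is invented for illustration, not taken from the dataset):

```python
from enum import Enum

class Aspect(Enum):
    """The six 'terrain type' labels described above."""
    STATE = "state"              # The car is parked.
    ACTIVITY = "activity"        # The car is driving down the highway.
    PERFORMANCE = "performance"  # The car crossed the finish line.
    ENDEAVOR = "endeavor"        # It tried to cross, but ran out of gas.
    HABITUAL = "habitual"        # The car drives to work every morning.
    PROCESS = "process"          # "The driving." (start/end unknown)

# An annotator assigns one label to each event in a sentence:
annotation = {"event": "walk", "aspect": Aspect.ACTIVITY}
```

Using an enum (rather than free-text strings) makes the "traffic controller" job explicit: there are exactly six statuses, and nothing outside the list is allowed.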
3. The Hard Work: The "Human GPS"
You might think, "Just ask a computer to guess this!" But the researchers found that computers are terrible at this. It's like asking a GPS to guess if a driver is enjoying a drive or racing to a destination just by looking at the speedometer.
So, they did the hard, human work:
- Training: They taught a team of 8 people exactly how to spot these differences. It was like training a group of detectives to spot subtle clues in a story.
- The "Tie-Breaker" System: Two people labeled the same sentence. If they disagreed (e.g., one said "Activity," the other said "Performance"), a third expert stepped in to make the final call.
- The Result: They created a "Gold Standard" dataset of 1,473 sentences. Think of this as the answer key for a very difficult test.
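The tie-breaker workflow above is simple enough to sketch as a tiny function (the names here are hypothetical, not from the paper):

```python
def resolve_label(label_a, label_b, adjudicate):
    """Two annotators label the same event; a third expert
    (the adjudicator) is called in only when they disagree."""
    if label_a == label_b:
        return label_a                    # agreement: no tie-breaker needed
    return adjudicate(label_a, label_b)   # disagreement: expert decides

# Example: the annotators split, and the expert sides with "performance".
final = resolve_label("activity", "performance",
                      adjudicate=lambda a, b: "performance")
```

The design point is that the expensive expert is only consulted on the disagreements, which is what makes double annotation with adjudication affordable at the scale of 1,473 sentences.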
4. The Test: Can Computers Learn?
Once they had the answer key, they tested three different types of "students" (AI models) to see if they could learn to predict the terrain:
- The Rule-Follower: A computer that follows a strict list of rules (like a recipe). Result: It got about 39% right.
- The Pattern Spotter: A standard AI that looks for word patterns. Result: It got about 45% right.
- The Big Brain (LLM): A massive, modern AI (like the ones behind this chat) that was just asked to guess without being retrained. Result: It got about 56% right.
The Big Takeaway: Even the "Big Brain" only got a little over half right. Meanwhile, the human annotators got about 84% right.
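The scoring behind those percentages is plain accuracy: compare each model's predicted label to the gold-standard label and count the matches. A toy sketch (the labels and predictions are invented for illustration):

```python
def accuracy(predicted, gold):
    """Fraction of events where the predicted label matches the gold label."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold  = ["activity", "performance", "state", "habitual"]
guess = ["activity", "endeavor",    "state", "state"]
print(accuracy(guess, gold))  # → 0.5
```

Run over the full answer key, this is how each "student" earns its score: the rule-follower lands near 0.39, the pattern spotter near 0.45, and the untrained LLM near 0.56.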
Why Does This Matter?
This paper is a wake-up call. It shows that while AI is getting good at reading words, it is still clumsy at understanding the flow of time and action inside those words.
By creating this new dataset, the researchers have built a training gym for future AI. Now, instead of guessing in the dark, future computers can study this "3D map" to learn how to distinguish between a habit, a one-time event, and an ongoing process.
In short: They built a dictionary of "how things happen" so that computers can stop just reading words and start truly understanding the story.