Skeleton-based Coherence Modeling in Narratives

Imagine you are trying to tell a story to a friend. You want the story to flow naturally, where one sentence leads logically to the next, like beads on a string. If you suddenly jump from talking about baking a cake to discussing the stock market without a transition, the story feels "broken" or incoherent.

This paper is about teaching a computer to spot those broken stories. The authors, two students from Stanford, wanted to know whether they could measure how well a story hangs together by looking at just the main ideas of each sentence (the skeleton), or whether they needed the whole sentences.

Here is the breakdown of their journey, explained with some everyday analogies.

1. The Big Idea: The "Skeleton" vs. The "Body"

The researchers started with a cool idea from a previous study. Imagine a human body. You have the skin, muscles, and organs (the full sentence), but underneath, there is a skeleton (the core structure).

The Hypothesis: The authors thought, "If we strip a sentence down to just its most important words (the skeleton), maybe it's easier for a computer to see if two sentences fit together. It's like checking if two puzzle pieces have the right shape, ignoring the colorful picture on them."
The Goal: They wanted to build a system that could look at two sentences, check their "skeletons," and say, "Yes, these go together," or "No, these don't belong in the same story."

2. The Experiment: Building the "Similarity Detector"

To test this, they built a special computer brain called a Sentence/Skeleton Similarity Network (SSN). Think of this as a very strict librarian who checks if two books belong on the same shelf.

They trained this librarian in two ways:

The Full-Body Check: The librarian reads the entire sentence (all the words, grammar, and flow).
The Skeleton Check: The librarian only reads the "skeleton" (just the key nouns and verbs, stripped of grammar and extra words).

They then tested the librarian with two types of challenges:

The "Sentence Order" Test: "Here are two sentences. Did they come one after the other in a real story, or did I just grab two random sentences from different books?"
The "Story Order" Test: "Here is a whole story. Is it in the right order, or did I shuffle the pages like a deck of cards?"

3. The Results: The Surprise Twist

The authors expected the Skeleton method to win. They thought that by removing the "noise" (extra words), the computer could focus purely on the logic.

But the results were the opposite!

The Winner: The Full-Sentence method.
The Loser: The Skeleton method.

Why did the Skeleton fail?
The authors point to two reasons why the "skeleton" approach didn't work as well as they hoped:

The "Bad X-Ray" Problem: To get the skeleton, the computer first has to strip the sentence down. If the computer makes a mistake while stripping it (like removing a word that was actually important), the skeleton is broken from the start. It's like trying to identify a person by looking at a blurry X-ray; if the X-ray is bad, you can't tell who they are.
The "Jumbled Puzzle" Problem: A full sentence has a specific order and rhythm. A skeleton is often just a list of words with no order. It's like trying to guess a story by looking at a bag of Lego bricks versus looking at the built structure. The full sentence gives the computer more clues (context) to work with.

4. The "Self-Attention" Mechanism

They also tried adding a feature called "Self-Attention." Imagine a spotlight that shines on the most important words in a sentence, telling the computer, "Pay extra attention to this word!"

They hoped this spotlight would make the computer even smarter. However, it didn't make a noticeable difference in this specific experiment. One likely reason is that the models with the spotlight were shallower (two layers instead of three) and the authors were limited by computing power.

5. The Final Conclusion

The big takeaway from this paper is a bit of a reality check for AI researchers:

Sometimes, less is not more.

While it sounds smart to strip a text down to its bare bones, the computer actually needs the full context (the whole sentence) to understand if a story makes sense. The "skeleton" idea is great for writing stories (generating them), but it's not the best tool for checking if a story makes sense (evaluating it).

In a nutshell: If you want a computer to tell if a story is coherent, let it read the whole story, not just the outline. The details matter!

1. Problem Statement

The paper addresses the challenge of textual coherence modeling in Natural Language Processing (NLP). Coherence refers to the logical flow and thematic consistency of a text, which is difficult to quantify automatically.

Context: Recent work (specifically Jingjing Xu et al., EMNLP 2018) showed that extracting "skeletons" (key concepts like entities, relations, and events) from a sentence and using them to generate the next sentence produces coherent narratives. It remains unclear, however, whether skeleton consistency is a valid metric for evaluating existing text or detecting incoherence in it.
Hypothesis: The authors hypothesize that if a model can generate coherent stories using skeletons, then measuring the similarity between the skeletons of consecutive sentences should be a strong indicator of textual coherence. They aim to transform the generative skeleton approach into a discriminative model to detect incoherent sentences.

2. Methodology

The authors propose and evaluate a Sentence/Skeleton Similarity Network (SSN), a Siamese network architecture designed to measure the similarity between pairs of inputs (either raw sentences or extracted skeletons).

A. Architecture

Input: Pairs of sentences or pairs of skeletons.
Embedding:
- Word Embeddings: Instead of learning embeddings from scratch (which is difficult for skeletons due to lack of contiguous word order), the model uses pre-trained FastText embeddings.
- Sequence Encoding: Inputs are passed through a Stacked LSTM (2-3 layers) to generate dense vector representations.
- Attention Mechanism: Some models incorporate a Self-Attention layer (inspired by Luong et al.) on top of the LSTM outputs to weigh the importance of specific words/concepts before generating the final sentence embedding.
Similarity Calculation: The network computes the normalized L2 distance (or cosine similarity) between the two resulting embeddings ($e_1$ and $e_2$).
Loss Function: The model is trained using Contrastive Loss:
- Positive Pairs (Coherent): Penalized based on $(1 - E_w)^2$, where $E_w$ is the similarity energy.
- Negative Pairs (Incoherent): Penalized if the similarity exceeds a margin $m$.
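
The loss described above can be illustrated with a minimal sketch. It assumes that $E_w$ is the cosine similarity of the two embeddings and that the negative-pair penalty is a squared hinge, $\max(0, E_w - m)^2$; the summary only says negatives are penalized when similarity exceeds the margin, so the exact form and the default margin of 0.5 are assumptions, not the authors' confirmed implementation.

```python
import math


def cosine_similarity(e1, e2):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    return dot / (norm1 * norm2)


def contrastive_loss(e1, e2, label, margin=0.5):
    """Loss for one pair: label 1 = coherent, label 0 = incoherent."""
    energy = cosine_similarity(e1, e2)  # E_w, assumed to be cosine similarity
    if label == 1:
        # Coherent pairs are pulled toward similarity 1.
        return (1.0 - energy) ** 2
    # Assumed squared-hinge form: no penalty until similarity exceeds the margin.
    return max(0.0, energy - margin) ** 2
```

Under these assumptions an identical pair labeled coherent costs nothing, while the same pair labeled incoherent is penalized by how far its similarity overshoots the margin.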

B. Dataset and Preprocessing

Data: The "Storytelling" dataset (40k+ stories, max 6 sentences each).
Skeleton Extraction: The authors utilized the pre-trained skeleton extraction module from Xu et al. [3] to generate skeletons for the dataset.
Data Construction:
1. Sentence Pairs: Consecutive sentences from a story (Label 1) vs. a sentence paired with a random sentence from another story (Label 0).
2. Story Pairs: Original ordered stories vs. jumbled (randomized) stories.
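
The two construction steps above might look roughly like the following sketch. The function names are hypothetical, and two details are assumptions rather than facts from the paper: one negative pair is drawn per positive pair, and a story is reshuffled until its order actually differs from the original. It expects at least two stories.

```python
import random


def make_sentence_pairs(stories, rng):
    """Return (sentence_a, sentence_b, label) triples from a list of stories."""
    pairs = []
    for i, story in enumerate(stories):
        others = [j for j in range(len(stories)) if j != i]
        for first, second in zip(story, story[1:]):
            # Label 1: the two sentences are consecutive in the same story.
            pairs.append((first, second, 1))
            # Label 0: pair the sentence with a random one from another story.
            other_story = stories[rng.choice(others)]
            pairs.append((first, rng.choice(other_story), 0))
    return pairs


def make_story_pairs(stories, rng):
    """Return (story, label) examples: 1 = original order, 0 = jumbled."""
    examples = []
    for story in stories:
        original = list(story)
        examples.append((original, 1))
        if len(set(original)) < 2:
            continue  # nothing to jumble
        jumbled = list(original)
        while jumbled == original:
            rng.shuffle(jumbled)
        examples.append((jumbled, 0))
    return examples
```

Passing a seeded `random.Random` instance keeps the sampled negatives and shuffles reproducible across runs.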

C. Baselines

The SSN was compared against:

Non-parametric baselines: Cosine similarity and Euclidean distance applied to averaged BERT embeddings of sentences/skeletons.
Sentence-based SSN: The same architecture trained on raw sentences instead of skeletons.
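
As a rough illustration of the non-parametric baselines, the sketch below averages per-token vectors into one embedding per sentence (or skeleton) and compares the two with cosine similarity or Euclidean distance. The token vectors here are toy stand-ins for BERT outputs, and the decision threshold is a hypothetical parameter: the summary does not say how the scores were turned into coherent/incoherent labels.

```python
import math


def average_embedding(token_vectors):
    """Mean of a non-empty list of equal-length token vectors."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_coherent(tokens_a, tokens_b, threshold=0.5):
    """Label a pair coherent when averaged embeddings are similar enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    score = cosine_similarity(average_embedding(tokens_a), average_embedding(tokens_b))
    return score >= threshold
```

Because nothing here is learned, the only tunable part is the threshold, which is one way to read the gap between these baselines and the trained SSN.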

3. Key Contributions

Discriminative Skeleton Modeling: The first attempt to repurpose the skeleton-based generative framework into a discriminative model for coherence detection.
Sentence/Skeleton Similarity Network (SSN): A novel Siamese network architecture utilizing FastText, Stacked LSTMs, and Self-Attention to evaluate coherence.
Empirical Comparison: A rigorous comparison showing that while skeletons are useful for generation, raw sentences are superior for coherence evaluation.

4. Results and Analysis

The authors evaluated the models on three tasks: Sentence Order Detection, Story Order Detection, and Pair Classification.

Technique	Sentence Order Accuracy	Story Order Accuracy	Pair Classification
SSN on Sentences	92.9%	69.6%	82.2%
SSN on Skeletons	84.2%	62.9%	73.8%
BERT + Cosine (Sentences)	71.9%	N/A	N/A
BERT + Cosine (Skeletons)	61.6%	N/A	N/A

Key Findings:

Neural vs. Non-Parametric: Neural approaches (SSN) significantly outperformed non-parametric baselines (Cosine/Euclidean) even when using strong BERT embeddings.
Sentences vs. Skeletons: Contrary to the initial hypothesis, sentence-based models outperformed skeleton-based models across all metrics.
- Reasoning: Skeletons are short, lack word order, and their quality is dependent on the upstream extraction model. Raw sentences provide a complete set of contextual words, making similarity detection more robust.
Sentence vs. Story Level: Models performed significantly better on sentence-level coherence (detecting adjacent pairs) than story-level coherence (detecting full story order).
- Reasoning: The dataset stories are short (max 6 sentences). When jumbled, it is statistically likely that some original adjacent pairs remain together, making the "jumbled" story harder to distinguish from the original than a thoroughly scrambled sequence would be. The authors suggest datasets of longer texts (e.g., essays, reports) are needed for robust story-level evaluation.
Self-Attention: The addition of self-attention did not yield significant performance gains in this specific setup, likely due to the limited depth of the model (2 layers with attention vs. 3 layers without) and computational constraints.
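
The story-level reasoning above, that shuffling a short story often leaves some originally adjacent sentences together, can be checked by brute force. This sketch assumes a story of exactly six sentences, a uniformly random shuffle, and that a shuffle reproducing the original order is discarded; none of these details are stated in the summary.

```python
from itertools import permutations


def keeps_adjacent_pair(order):
    """True if some sentence i is still immediately followed by sentence i + 1."""
    return any(b == a + 1 for a, b in zip(order, order[1:]))


def count_keeping_pair(num_sentences):
    """Return (shuffles that keep at least one original pair, total shuffles)."""
    original = tuple(range(num_sentences))
    # Exclude the unshuffled order, which is not a valid "jumbled" story.
    shuffles = [p for p in permutations(original) if p != original]
    kept = sum(1 for p in shuffles if keeps_adjacent_pair(p))
    return kept, len(shuffles)


kept, total = count_keeping_pair(6)
print(f"{kept} of {total} shuffles keep an original adjacent pair")
```

For six sentences this counts 410 of 719 reorderings, about 57%, so more than half of the jumbled stories still contain at least one sentence pair in its original order.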

5. Significance and Conclusion

Validation of Current Trends: The results suggest that current state-of-the-art coherence modeling techniques are on the right track by focusing on full sentences rather than sub-parts (skeletons).
Limitations of Skeletons: While skeletons are effective for generating narrative flow (as they force the model to focus on key events), they are poor candidates for evaluating coherence because the extraction process introduces noise, and the loss of syntactic structure reduces the signal available for similarity matching.
Future Directions: The authors propose extending this work to datasets of longer texts (e.g., news reports, essays) to better evaluate story-level coherence, and exploring more advanced attention mechanisms (e.g., Transformers).

In summary, the paper demonstrates that while the concept of "skeletons" is powerful for narrative generation, direct sentence-level modeling remains the superior approach for detecting and measuring textual coherence.