GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

The GATech team's approach to the AbjadGenEval shared task used a fine-tuned multilingual E5-large encoder with simple mean pooling, reaching an F1 score of 0.75 for detecting AI-generated Arabic text. This stable baseline outperformed more complex pooling strategies, likely because of limited training data and a marked length difference between human-written and machine-generated texts.

Ahmed Khaled Khamis

Published Thu, 12 Ma

Imagine you are a detective trying to solve a mystery: Who wrote this story? Was it a human with a cup of coffee and a lot of imagination, or was it a robot spitting out words at lightning speed?

This paper is the report from a team of detectives (researchers at Georgia Tech) who entered a competition called AbjadGenEval. Their mission? To build a computer program that can spot AI-generated text specifically in Arabic.

Here is the story of how they did it, explained simply.

1. The Challenge: The "Arabic" Problem

Detecting fake news or AI text in English is like having a well-stocked toolbox; there are many tools already built for it. But in Arabic, the toolbox was almost empty. Arabic is a complex language with many dialects and tricky grammar rules, making it harder for computers to tell the difference between a human writer and a robot.

The competition gave the team a pile of 5,300 stories—half written by humans, half by AI. Their job was to sort them into two bins: Human and Machine.

2. The Tool: A Super-Reader

The team didn't build a robot from scratch. Instead, they used a pre-trained "super-reader" called E5-large. Think of this model as a very smart librarian who has read millions of books in many languages. It understands the meaning of words but doesn't know how to spot a fake story yet.

The team's job was to "teach" this librarian how to be a detective by adding a special "detective hat" (a classification head) on top of it.
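In code, the "detective hat" is typically nothing more than a small linear layer stacked on the encoder's summary vector. Here is a minimal NumPy sketch of that idea, assuming E5-large's embedding size of 1024; the random vector stands in for the fine-tuned encoder's output, and the weights here are untrained placeholders, not the team's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 1024   # embedding size of E5-large (assumed)
CLASSES = 2     # the two bins: Human vs Machine

# Stand-in for the encoder's summary vector for one story.
# In the real system this comes from the fine-tuned E5-large model.
embedding = rng.standard_normal(HIDDEN)

# The "detective hat": a single linear layer followed by softmax.
W = rng.standard_normal((CLASSES, HIDDEN)) * 0.01
b = np.zeros(CLASSES)

logits = W @ embedding + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

label = ["Human", "Machine"][int(np.argmax(probs))]
print(label, probs)
```

Training then consists of nudging `W` and `b` (and the encoder itself) so that the correct bin gets the higher probability.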

3. The Big Experiment: How to Summarize a Book?

Here is where the story gets interesting. When the librarian reads a long story, it breaks it down into thousands of tiny pieces (tokens). To decide if the story is fake, the librarian needs to summarize the whole thing into one single "feeling" or "score."

The team tried four different ways to do this summarizing:

  • The "Complex Chef," part one (Weighted Layer Pooling): "Look at the bottom layers of the book for basic facts, and at the top layers for deep meaning!" The computer learns how much to trust each layer of the librarian's reading.
  • The "Complex Chef," part two (Attention Pooling): "Pay extra attention to the exciting parts!" A learned mechanism weighs every single word differently.
  • The "Gated Fusion" (The Mixer): They tried mixing these complex methods together, letting the computer decide dynamically which method to trust for each sentence.
  • The "Simple Average" (Mean Pooling): This was the boring option. It just took every word, gave them all the same importance, and calculated the average. Like taking a class test and just averaging everyone's score without worrying about who studied the hardest.

4. The Surprise Twist

You would think the "Complex Chef" would win, right? After all, it's smarter and more detailed.

But it didn't.

The Simple Average (Mean Pooling) came out on top with an F1 score of 0.75 (a very good grade). The fancy, complex methods actually did worse.

Why?
The researchers realized they were trying to teach a complex recipe to a student who only had a small cookbook (limited data).

  • The Analogy: Imagine trying to teach a toddler to bake a soufflé using a 50-step recipe. They will get confused and burn the kitchen. But if you just tell them, "Mix the eggs and milk," they can do that perfectly.
  • Because the team only had about 5,300 examples to learn from, the complex methods got confused and started "memorizing" the training data instead of learning the real rules. The simple average was stable, reliable, and didn't get confused.
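One way to see why is to count what each strategy has to learn on top of the encoder. The figures below are rough, illustrative assumptions (hidden size 1024 and 24 layers, the usual E5-large configuration; the gated-fusion count is a toy estimate), not numbers from the paper:

```python
HIDDEN, LAYERS = 1024, 24   # assumed E5-large dimensions

# Extra parameters each pooling strategy must fit from the training set:
mean_pooling = 0                    # nothing to learn
weighted_layers = LAYERS            # one scalar weight per encoder layer
attention = HIDDEN                  # one learned scoring vector
gated_fusion = 3 * HIDDEN + 3       # a gate over three pooled vectors (toy estimate)

for name, n in [("mean", mean_pooling), ("layers", weighted_layers),
                ("attention", attention), ("gated", gated_fusion)]:
    print(f"{name:>10}: {n} extra parameters, fit from ~5,300 examples")
```

Every extra parameter is another knob the small dataset has to pin down, so the zero-knob average is the hardest one to overfit.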

5. A Clue They Found: The Length Trick

While analyzing the data, the team noticed a funny pattern.

  • Human stories were like long, winding novels (averaging 632 words).
  • AI stories were like short, punchy summaries (averaging 303 words).

It turns out, AI tends to be lazy and short, while humans tend to ramble and elaborate. The computer learned to use this "length" as a clue, though the team noted that relying only on length isn't a perfect strategy because a human could write a short note, and an AI could be forced to write a long one.
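To see how crude that clue is on its own, here is a toy length-only classifier built from the averages reported above (632 vs. 303 words). The averages come from the post; the midpoint threshold and the function name are our own illustrative choices:

```python
# Average lengths reported in the team's analysis (in words).
HUMAN_AVG, AI_AVG = 632, 303
THRESHOLD = (HUMAN_AVG + AI_AVG) / 2   # 467.5, a naive midpoint

def guess_author(text: str) -> str:
    """Classify purely by word count; a deliberately crude baseline."""
    return "Human" if len(text.split()) > THRESHOLD else "Machine"

print(guess_author("word " * 600))   # 600 words -> "Human"
print(guess_author("word " * 300))   # 300 words -> "Machine"
```

As the team cautions, a heuristic like this is easy to fool, which is why the real detector learns from the text itself and only picks up length as one signal among many.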

6. The Takeaway

The main lesson from this paper is a classic case of "Less is More."

When you don't have a massive amount of data to train on, you don't need the most complicated tools. Sometimes, the simplest method—just taking an average—works better because it doesn't get overwhelmed.

In short: The team built a detector for fake Arabic text. They tried to make it super-smart and complex, but it turned out that a simple, steady approach was the secret to winning the game.