Predictive Authoring for Brazilian Portuguese Augmentative and Alternative Communication

This paper proposes using the BERTimbau model to predict pictograms in Brazilian Portuguese Augmentative and Alternative Communication (AAC) systems, demonstrating that representing pictograms via captions yields the highest accuracy while also exploring the potential of using images for prediction.

Jayr Pereira, Rodrigo Nogueira, Cleber Zanchettin, Robson Fidalgo

Published 2026-03-04

Here is a plain-language explanation of the paper, with some creative analogies to help visualize the concepts.

The Big Picture: Helping People Speak Without Words

Imagine a person who cannot speak out loud due to a disability. To communicate, they use a special "talking board" (an AAC system). This board is filled with thousands of pictures (pictograms) representing words like "eat," "ball," "happy," or "mom."

To say a sentence like "I want an apple," the user has to hunt through a giant grid of pictures, find the "I," then the "want," then the "apple," and tap them one by one. It's like trying to write a letter by digging through a massive box of Scrabble tiles to find the right letters, one by one. It takes a long time and can be frustrating.

The Goal of this Paper:
The researchers wanted to build a "smart assistant" for this talking board. Just like your phone predicts the next word you might type, they wanted a system that could look at the pictures the user has already tapped and say, "Hey, you probably want to tap the apple picture next!"

The Challenge: The "Dictionary" Problem

The tricky part is that these talking boards use pictures, not text. Computers are great at reading words, but they struggle to understand that a picture of a red fruit means the word "apple."

Furthermore, the researchers were working with Brazilian Portuguese. While there are huge databases of English text to teach computers, there wasn't a big library of Portuguese sentences specifically written by people using these picture boards. The computer needed to learn the "language of the pictures."

The Solution: Teaching a Robot to Read Pictures

The team used a powerful AI brain called BERTimbau (a version of the famous BERT model trained on Portuguese). Here is how they taught it to predict the next picture:

1. Building the Training Library (The Corpus)

You can't teach a chef to cook without ingredients. The researchers needed a "recipe book" of sentences.

  • Step A: They asked real experts (speech therapists and parents) to write down common sentences these users say.
  • Step B: They used a super-smart AI (GPT-3) to read those sentences and write thousands of new ones that sounded just like them.
  • Step C: They converted these text sentences back into picture sequences.
  • Analogy: Imagine teaching a dog to fetch. First, you show it a real ball (human sentences). Then, you ask a robot to draw thousands of pictures of balls (synthetic sentences) so the dog gets lots of practice.
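Step C above boils down to a lookup: each word in a generated sentence is mapped back to the pictogram whose caption matches it. A minimal sketch of that idea, using a tiny invented vocabulary (the real system would use the AAC board's full Portuguese pictogram set, not these four entries):

```python
# Toy sketch of Step C: turning generated text back into pictogram sequences.
# The vocabulary below is hypothetical, for illustration only.
PICTOGRAM_VOCAB = {
    "eu": 101,      # "I"
    "quero": 102,   # "want"
    "comer": 103,   # "eat"
    "maçã": 104,    # "apple"
}

def sentence_to_pictograms(sentence: str) -> list[int]:
    """Map each word to its pictogram ID, skipping words with no pictogram."""
    words = sentence.lower().split()
    return [PICTOGRAM_VOCAB[w] for w in words if w in PICTOGRAM_VOCAB]

print(sentence_to_pictograms("Eu quero comer maçã"))  # → [101, 102, 103, 104]
```

In practice the mapping is harder than this (inflected words, multi-word pictograms), but the principle is the same: the training data is text that can be read as a sequence of picture taps.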

2. The "Translation" Test

Now they had to teach the AI how to "see" a picture. They tested four different ways to describe a picture to the computer:

  • Method A (The Caption): Just use the word written under the picture (e.g., "Cat").
  • Method B (The Synonyms): Use a list of similar words (e.g., "Cat," "Feline," "Kitty").
  • Method C (The Definition): Use a dictionary definition (e.g., "A small domesticated carnivorous mammal").
  • Method D (The Image): Show the computer the actual picture file.
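The first three strategies are really just three ways of serializing the same pictogram into text the language model can read. A hedged sketch (the pictogram record here is invented for illustration, not taken from the paper's dataset):

```python
# One pictogram, three textual representations. Method D (raw image) is the
# odd one out: it needs a separate vision encoder rather than a text string.
pictogram = {
    "caption": "gato",
    "synonyms": ["gato", "felino", "bichano"],
    "definition": "pequeno mamífero carnívoro domesticado",
    "image_path": "pictograms/gato.png",
}

def represent(p: dict, method: str) -> str:
    if method == "caption":     # Method A: the word under the picture
        return p["caption"]
    if method == "synonyms":    # Method B: caption plus similar words
        return " ".join(p["synonyms"])
    if method == "definition":  # Method C: dictionary gloss
        return p["definition"]
    raise ValueError(f"unknown method: {method}")

print(represent(pictogram, "synonyms"))  # → gato felino bichano
```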

The Results: What Worked Best?

The researchers ran the AI through a test to see which method made the best predictions.

  • The Winner: Captions (The Words) and Synonyms.
    • The Analogy: It turns out the computer learns best when you tell it the name of the picture. If you say "Cat," it knows what to do. If you give it a list of synonyms ("Cat," "Feline"), it gets even better at guessing the context (lower "perplexity," which is a fancy way of saying the computer is less confused).
  • The Loser: Definitions and Images.
    • The Analogy: Trying to teach the computer by showing it the actual picture or a long dictionary definition was like trying to teach someone to drive by reading a manual on engine mechanics. It was too complicated and the computer didn't learn as fast. The "image" method was particularly bad because the computer's "brain" for reading text and its "brain" for seeing pictures speak different languages.
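"Perplexity" has a precise meaning behind the analogy: it is the exponential of the average negative log-probability the model assigns to the correct next pictogram. Lower perplexity means the model is, on average, less surprised by what comes next:

```python
import math

def perplexity(probs: list[float]) -> float:
    """exp of the mean negative log-probability assigned to the true items."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A model that always gives the correct pictogram probability 0.5 has
# perplexity 2: it behaves as if choosing between two options each step.
print(round(perplexity([0.5, 0.5, 0.5]), 2))  # → 2.0
# A more confused model (lower probabilities on the right answers) scores higher.
print(round(perplexity([0.1, 0.2, 0.3]), 2))
```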

The Takeaway

The paper concludes that the best way to build a smart picture-predictor is to treat the pictures like words.

  • If you have a dictionary of synonyms: Use that! It helps the computer understand the meaning better.
  • If you don't: Just use the simple word written under the picture. It works almost as well and is much easier to set up.
  • Don't bother with the actual images for this specific task; it's too heavy and doesn't help the computer guess the next word.

Why This Matters

This research is like giving a new set of glasses to people who rely on picture boards. Instead of scrolling through hundreds of pictures to find the one they need, the system can now suggest the top 5 or 10 most likely pictures right at the top of the screen. This saves time, reduces frustration, and helps people with complex communication needs share their thoughts, feelings, and needs much faster.
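Surfacing "the top 5 or 10 most likely pictures" comes down to ranking the model's scores over the pictogram vocabulary and keeping the k best. A sketch with made-up scores standing in for BERTimbau's masked-language-model probabilities:

```python
def top_k_pictograms(scores: dict[str, float], k: int = 5) -> list[str]:
    """Return the k pictogram captions with the highest model scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical scores for the slot after "eu quero comer ..."; in the real
# system these would come from the fine-tuned BERTimbau prediction head.
scores = {
    "maçã": 0.41, "banana": 0.22, "pão": 0.18,
    "sopa": 0.09, "bola": 0.05, "gato": 0.05,
}
print(top_k_pictograms(scores, k=3))  # → ['maçã', 'banana', 'pão']
```

The AAC interface would then render these top-k captions as their pictograms at the top of the screen, so the user taps once instead of scrolling.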

In short: They taught a computer to speak "Picture Language" by translating pictures into words, and they found that the simplest translation (just the word under the picture) is often the most effective.