Building Korean linguistic resource for NLU data… — Plain-Language Explanation

Original authors: Jeongwoo Yoon, On-yu Park, Changhoe Hwang, Gwanghoon Yoo, Eric Laporte, Jeesun Nam

Published 2026-05-12. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a robot how to talk to people who are angry or confused about their bank accounts. To do this, the robot needs a "textbook" full of examples of what people actually say. But here's the problem: real people are messy. They use slang, they get angry, they use different levels of politeness, and they say the same thing in a thousand different ways. Collecting enough real examples by hand is like trying to catch every single drop of rain in a storm with a bucket: it takes forever and is incredibly expensive.

This paper introduces a solution called FIAD (Financial Annotated Dataset). Think of FIAD not as a bucket of rain, but as a high-tech "sentence factory."

Here is how the factory works, broken down into simple steps:

1. The Blueprint (Data Analysis)

First, the researchers didn't just guess what people say. They went to the "source": they looked at over 126,000 reviews of banking apps. They focused on the unhappy reviews (low scores) because that's where people are most likely to say, "Fix this!" or "I can't do that!" They used a computer tool to chop these reviews down into their smallest building blocks (words and grammar bits) to see what patterns emerged.
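To make the blueprint step concrete, here is a minimal sketch of the idea: keep only the low-score reviews, chop them into pieces, and count which pieces come up most. Everything in it is an assumption for illustration. The sample reviews, the two-star cutoff, and the function names are hypothetical, and a plain whitespace split stands in for the Korean analysis tool the researchers used.

```python
from collections import Counter

# Hypothetical sample reviews as (star rating, text); not data from the paper.
reviews = [
    (1, "cannot create account"),
    (5, "great app"),
    (2, "transfer fails and cannot create account"),
]


def low_score_reviews(items, max_stars=2):
    """Keep only the text of reviews rated at or below max_stars (assumed cutoff)."""
    return [text for stars, text in items if stars <= max_stars]


def tokenize(text):
    """Stand-in tokenizer: a whitespace split, not a real Korean morphological analyzer."""
    return text.lower().split()


def token_counts(texts):
    """Count how often each token occurs across the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


complaints = low_score_reviews(reviews)
counts = token_counts(complaints)
print(counts.most_common(3))
```

The frequent tokens in the unhappy reviews are the raw material for the three belts described next: the nouns people complain about and the actions they ask for.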

2. The Three Conveyor Belts (Resource Construction)

Instead of writing sentences one by one, they built a machine with three main conveyor belts. Each belt adds a specific part to the sentence:

Belt A: The "What" (TOPIC)
This belt holds the nouns. It has two bins:
- Entities: Specific names like "Kakao Bank" or "Toss App."
- Features: General banking words like "loan," "account," or "speed."

Analogy: This is like a box of Lego bricks. You can pick a red brick (Kakao Bank) or a blue brick (Toss App), but they are all the same shape (a noun).

Belt B: The "Action" (EVENT)
This belt holds the verbs and the logic. It decides what action is happening, like "create," "send," or "buy."
- The Smart Filter: This belt is smart. It knows that you can "create" an account, but you can't "create" a speed. It checks the rules to make sure the action matches the noun. If you try to put "create" next to "speed," the machine rejects it.

Belt C: The "Tone" (DISCOURSE MARKER)
This is the most distinctive part. In Korean, how you end a sentence changes its meaning and politeness level. This belt adds the "flavor."
- It can add a polite ending ("Could you please...?"), a direct command ("Do it!"), or a question ("Can you...?").
- It also handles honorifics (respect levels). Just as you might speak differently to your boss versus your best friend, this belt can generate sentences that are formal, polite, or casual.
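The three belts can be pictured as three small lookup tables plus one rule check. The sketch below is a toy version under stated assumptions, not the paper's actual resource: the table names, the "improve" action, the compatibility rules, and the English templates standing in for Korean sentence endings are all hypothetical.

```python
# Belt A (TOPIC): nouns sorted into the two bins. Illustrative entries only.
TOPICS = {
    "Kakao Bank": "entity",
    "Toss App": "entity",
    "account": "feature",
    "loan": "feature",
    "speed": "feature",
}

# Belt B (EVENT): each action lists the nouns it may combine with.
# These rules are assumed here to mimic the "smart filter".
EVENTS = {
    "create": {"account"},
    "improve": {"speed"},
}

# Belt C (DISCOURSE MARKER): English templates standing in for Korean endings.
MARKERS = {
    "polite_request": "Could you please {event} the {topic}?",
    "command": "{event} the {topic}!",
    "question": "Can you {event} the {topic}?",
}


def is_compatible(event, topic):
    """Smart filter: accept the pair only if the action allows this noun."""
    return topic in EVENTS.get(event, set())


def build_sentence(topic, event, marker):
    """Combine one item from each belt, or return None if the filter rejects the pair."""
    if not is_compatible(event, topic):
        return None
    sentence = MARKERS[marker].format(event=event, topic=topic)
    # Capitalize the first letter so templates that start with the verb read naturally.
    return sentence[0].upper() + sentence[1:]


print(build_sentence("account", "create", "question"))
print(build_sentence("speed", "create", "command"))  # rejected by the filter
```

Keeping the rules on the action side means a new noun only needs to be added to the actions that accept it, and every other combination is rejected by default.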

3. The Assembly Line (Data Generation)

Now, the magic happens. The machine connects these three belts.

It picks a noun from Belt A.
It picks a matching action from Belt B.
It wraps it all in a specific tone from Belt C.

Because the machine can mix and match these parts in millions of ways, it can generate 60 trillion possible sentences! However, the researchers don't use all of them. They use a formula to pick the most natural-sounding, shorter sentences first (because people usually try to be brief).
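The reason the count explodes is simple multiplication: every extra item on one belt multiplies the number of possible sentences. The sketch below shows that on a tiny scale and then picks the shortest sentences first. It is only an illustration under assumptions: the word lists and allowed pairs are hypothetical, and sorting by character length is a stand-in for the paper's selection formula, which is not reproduced here.

```python
from itertools import product

# Hypothetical miniature belts; not the paper's vocabulary.
topics = ["account", "loan"]
events = ["create", "cancel"]
markers = [
    "Please {event} the {topic}.",
    "Can you {event} the {topic}?",
    "I would like to ask you to {event} the {topic}.",
]

# Assumed compatibility rules as (event, topic) pairs.
allowed = {("create", "account"), ("cancel", "account"), ("cancel", "loan")}


def generate(topics, events, markers, allowed):
    """Try every belt combination and keep only those that pass the filter."""
    sentences = []
    for topic, event, marker in product(topics, events, markers):
        if (event, topic) in allowed:
            sentences.append(marker.format(event=event, topic=topic))
    return sentences


def shortest_first(sentences, limit):
    """Stand-in ranking: prefer shorter sentences, measured in characters."""
    return sorted(sentences, key=len)[:limit]


upper_bound = len(topics) * len(events) * len(markers)  # combinations before filtering
all_sentences = generate(topics, events, markers, allowed)
print(upper_bound, len(all_sentences))
print(shortest_first(all_sentences, 3))
```

With two nouns, two actions, and three tones there are already 12 combinations before filtering and 9 after. Scale each belt up to thousands of entries and the totals quickly become astronomically large, which is why some way of choosing a subset is needed.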

4. The Test Drive (Experiments)

The researchers took the sentences generated by this factory and used them to train an AI model (a digital brain) to understand banking requests.

The Result: The AI learned very well. It could correctly guess what the user wanted (the "Intent") about 95% of the time and could correctly identify the specific details (the "Entity," like which bank or which product) about 86% of the time.
The Comparison: They tested different "brains" (pre-trained models) to see which one worked best with this new data. The model using a specific Korean language brain (KorBERT) performed the best.
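"Correct about 95% of the time" is an accuracy-style score: the number of right guesses divided by the number of guesses. Here is a minimal sketch of that arithmetic on made-up labels. The intent names and predictions are hypothetical toy values, not results from the paper, and the paper's exact scoring method for entities may differ.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels for illustration only.
gold_intents = ["open_account", "transfer", "loan_inquiry", "transfer"]
predicted_intents = ["open_account", "transfer", "transfer", "transfer"]

score = accuracy(gold_intents, predicted_intents)
print(f"intent accuracy: {score:.2f}")
```

In this toy case the model gets three of four requests right, so the score is 0.75.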

The Bottom Line

The paper claims that instead of hiring hundreds of people to write thousands of sentences by hand, you can build a linguistic recipe book (FIAD). This book contains the rules of grammar, the vocabulary of banking, and the rules of politeness. By following these rules, you can automatically bake a massive, high-quality "cake" of training data. This allows you to teach a banking chatbot to understand Korean customers quickly, cheaply, and accurately, without needing to wait for real humans to type out every single variation of a request.

Building Korean linguistic resource for NLU data generation of banking app CS dialog system
