Ayn: A Tiny yet Competitive Indian Legal Language Model Pretrained from Scratch

Imagine you're trying to solve a very specific, complicated puzzle: Indian Law.

Right now, the tech world is obsessed with building "Giant Brains" (Large Language Models or LLMs). These are massive AI models, like Llama-3 or OLMo, that have been fed a large share of the public internet, millions of books, and countless websites. They are incredibly capable, but they are also huge, expensive to feed, and slow to run. Think of them as a 747 jumbo jet: it can fly anywhere in the world, but it costs a fortune in fuel and requires a massive runway to take off.

The authors of this paper asked a simple question: "Do we really need a jumbo jet to deliver a pizza to the neighborhood?"

They decided to build a Tiny Language Model (TLM) called AYN. It's small (only 88 million parameters, compared to the giants' billions), cheap to train, and designed specifically for one job: Indian Supreme Court cases.

Here is the story of how they built it and why it's a big deal, explained through some everyday analogies.

1. The Problem: The "Western" Bias and the "One-Size-Fits-All" Trap

Imagine you go to a doctor who has only ever studied medical textbooks from the US and UK. If you go to them with a rare tropical disease specific to India, they might miss the diagnosis because their training data doesn't cover it.

Similarly, big AI models are trained mostly on Western legal data. They don't understand the unique flavor of Indian law, which involves:

Old-fashioned words: Like "hereinafter" or "notwithstanding."
Complex citations: Like "Section 3(1)(b) of the Act."
Code-switching: Mixing English with local languages and specific jargon.

When you feed a giant Western-trained AI an Indian legal document, it often gets confused, like a tourist trying to read a menu in a language they only half-know.

2. The Solution: Building a "Specialist Chef" (AYN)

Instead of trying to force the giant AI to learn Indian law, the authors built a specialist chef from scratch.

The Ingredients (Data): They didn't just grab random internet text. They curated a specific "recipe book" containing 142 million words of Indian Supreme Court cases, the Constitution of India, and the Penal Code.
The Knife (Tokenizer): This is a crucial part. Imagine trying to cut a very intricate cake. A standard knife (a generic AI tokenizer) might chop it into messy, useless crumbs. The authors built a custom laser-cutter (a custom tokenizer) specifically designed to slice Indian legal terms perfectly. For example, instead of chopping "statutory" into "stat," "ut," "ory," their tool keeps it as one whole, meaningful word.
The Training: They trained this small model on a single graphics card (an A100) for about a week (185 hours). It cost them less than $500 and produced a small carbon footprint: about 0.02 tonnes of CO2-equivalent, which is roughly what a typical petrol car emits on an 80 km drive.

3. The Showdown: The Underdog vs. The Giants

The authors put their tiny model, AYN, in a ring against the giants (models 10 to 80 times larger).

Round 1: Predicting the Verdict (The "Judge" Test)

The Task: Read a case and predict if the appeal will be accepted or rejected.
The Result: The tiny AYN model beat every one of the giants, narrowly in some cases (about 1 point) and by a wide margin in others (about 15 points).
- Analogy: Imagine a local expert who has read every single case in the last 50 years vs. a generalist who has read everything in the world but only skimmed the legal section. The local expert wins every time. AYN was more accurate than models 30 to 80 times its size.

Round 2: Summarizing the Case (The "Lawyer's Brief" Test)

The Task: Read a 28,000-word legal document and write a summary of about 5,000 tokens (the word pieces the model reads and writes).
The Result: AYN performed as well as models 30 times larger.
- Analogy: It's like a small, focused team of lawyers summarizing a case just as well as a massive law firm with hundreds of partners.

Round 3: General Knowledge (The "Trivia" Test)

The Task: Answer general questions about logic, science, and language (not just law).
The Result: AYN held its own. It didn't beat the biggest giants, but it beat several other large models and performed surprisingly well considering it was only trained on legal texts.
- Analogy: Even though AYN is a "lawyer," it's so smart that it can still pass a general trivia quiz better than some other "generalist" models.

4. Why This Matters: The "Tiny but Mighty" Revolution

The paper proves that you don't always need a sledgehammer to crack a nut.

Cost: Training the giant models costs millions of dollars and produces a lot of carbon emissions. Training AYN cost less than $500 and was eco-friendly.
Accessibility: Because AYN is so small, it can run on a single computer or even a powerful laptop. You don't need a supercomputer to use it. This means lawyers in India (and other developing nations) can use powerful AI tools without needing a massive budget.
Fairness: It shows that we can build AI that respects local cultures and languages, rather than just copying Western models.

The Catch (Limitations)

The authors are honest about the flaws. AYN is a specialist.

It only knows Indian Supreme Court cases. It doesn't know about District Courts or High Courts yet.
It only speaks English. It doesn't speak Hindi, Tamil, or other Indian languages (though the authors say they can add this later).
It's a generative AI, so it can sometimes "hallucinate" (make things up). You can't trust it blindly in a real court without a human lawyer checking its work.

The Bottom Line

This paper is a victory for efficiency and specialization. It tells us that in the age of AI, sometimes the best approach isn't to build a bigger, more expensive brain, but to build a smaller, smarter, and more focused brain that knows its specific job inside out.

AYN is the "local expert" that proves you don't need to be a giant to be a genius.

1. Problem Statement

The paper addresses the high computational cost and environmental impact of training and deploying Large Language Models (LLMs) for domain-specific tasks, particularly in the legal field. While decoder-only LLMs (1B–8B+ parameters) are effective, they require massive datasets and resources.

Domain Specificity: Legal language involves specialized, archaic vocabulary, complex citation structures, and nested sentences. General-purpose tokenizers often fragment these terms inefficiently.
Regional Bias: Existing legal LLMs are predominantly trained on Western legal data, leading to biases and reduced effectiveness in jurisdictions like India, which features multilingualism, distinct legal traditions, and code-switching.
Data Scarcity: Annotated legal data is scarce and expensive to produce. The authors question whether a Tiny Language Model (TLM) (<100M parameters), pretrained from scratch on a limited, curated domain corpus, can outperform or rival much larger general-purpose LLMs.

2. Methodology

A. Data Curation and Preprocessing

Corpus Construction: The authors created a specialized corpus totaling 142.6 million words by extending the existing Indian Legal Documents Corpus (ILDC).
- Components: 34,816 Supreme Court of India (SCI) cases (1947–2020), 3,046 new SCI cases (May 2020–Dec 2023), the Constitution of India, and the Indian Penal Code.
- Cleaning: Used regular expressions to remove metadata (case numbers, dates, judge names) while preserving the decision sections. PDF text extraction was performed using Tesseract for the Constitution and Penal Code.
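
The cleaning step above can be pictured with a short sketch. The paper summary does not list the actual expressions, so the three patterns below (case number, judgment date, bench) and the sample header are hypothetical and only illustrate line-level metadata stripping with Python's re module.

```python
import re

# Hypothetical patterns: the source does not give the real expressions.
# Each one matches a whole header line that carries metadata only.
METADATA_PATTERNS = [
    re.compile(r"^\s*(CIVIL|CRIMINAL) APPEAL NO\.?\s*\d+.*$", re.IGNORECASE),
    re.compile(r"^\s*DATE OF JUDGMENT\s*:.*$", re.IGNORECASE),
    re.compile(r"^\s*BENCH\s*:.*$", re.IGNORECASE),
]


def strip_metadata(text: str) -> str:
    """Drop lines that match a metadata pattern and keep everything else."""
    kept = []
    for line in text.splitlines():
        if any(pattern.match(line) for pattern in METADATA_PATTERNS):
            continue
        kept.append(line)
    # Collapse runs of blank lines left behind by the removed headers.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


# Invented sample text, not taken from the corpus.
sample = """CIVIL APPEAL NO. 1234 OF 2019
DATE OF JUDGMENT: 12/03/2020
BENCH: A. B. Example, J.

The appeal is dismissed for the reasons that follow."""

cleaned = strip_metadata(sample)
print(cleaned)
```

Matching whole lines rather than substrings keeps the decision text untouched even when it mentions a date or a case number in passing.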

B. Tokenization

Custom Legal BPE Tokenizer: Instead of using off-the-shelf tokenizers (e.g., LLaMA-2), the authors trained a Byte-Pair Encoding (BPE) tokenizer from scratch using SentencePiece.
Optimization: The tokenizer was optimized for legal terminology (e.g., "hereinafter," "statutory," "jurisdiction") and legal citations.
- Vocabulary Size: 3,500 tokens.
- Result: Significantly reduced fragmentation of legal jargon compared to general-purpose tokenizers, leading to more semantically coherent representations.
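
To make the fragmentation point concrete, here is a toy character-level BPE in pure Python. It is not SentencePiece and not the authors' tokenizer; the two miniature word-frequency tables are invented for illustration. It only shows the mechanism: merges learned from legal text keep a legal term whole, while merges learned from other text split it.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): f for s, f in vocab.items()}
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Invented miniature corpora (word -> frequency).
legal_words = {"statutory": 50, "jurisdiction": 40, "hereinafter": 30}
general_words = {"station": 50, "history": 40, "nation": 30, "story": 20}

legal_merges = train_bpe(legal_words, num_merges=40)
general_merges = train_bpe(general_words, num_merges=40)

print(encode("statutory", legal_merges))    # one piece
print(encode("statutory", general_merges))  # several pieces
```

A real tokenizer would be trained on the full corpus with a fixed vocabulary budget (3,500 tokens in this paper), so only the most frequent legal terms would survive as single tokens.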

C. Model Architecture (AYN)

Type: Decoder-only Transformer.
Parameters: 88 Million.
Configuration:
- Hidden dimension: 768.
- Layers: 12.
- Feed-forward dimension: 2048.
- Components: RMSNorm ($\epsilon = 10^{-5}$), SwiGLU activation, RoPE (Rotary Positional Embeddings) with interpolation for long sequences.
- Context Length: Trained with a context size of 8,192 tokens using a "shrinking factor" interpolation method to fit on a single A100 GPU.
- Weight Tying: Embedding and softmax layers share weights.
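
As a sanity check on the 88M figure, the configuration above can be turned into a rough parameter count. This is a back-of-the-envelope sketch with assumptions the summary does not confirm: standard multi-head attention with four square projections, no bias terms, two RMSNorm gain vectors per layer plus a final one, and the 3,500-token vocabulary from the tokenizer section. RoPE adds no learned parameters.

```python
def count_parameters(vocab_size=3500, d_model=768, n_layers=12, d_ff=2048):
    """Approximate parameter count for a decoder-only Transformer."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections (no biases assumed)
    swiglu = 3 * d_model * d_ff        # gate, up and down projections
    norms = 2 * d_model                # two RMSNorm gain vectors per layer
    per_layer = attention + swiglu + norms
    embedding = vocab_size * d_model   # counted once: tied with the softmax layer
    final_norm = d_model
    return n_layers * per_layer + embedding + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints "87,641,856 parameters (~87.6M)"
```

Under these assumptions the total rounds to the reported 88M, and the embedding table accounts for only about 2.7M of it, which is what a small vocabulary with tied weights buys.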

D. Training Procedure

Hardware: Single NVIDIA A100 (40GB) GPU.
Duration: 185 hours.
Hyperparameters:
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$).
- Learning Rate: 0.003 (Cosine schedule with 1000-step warmup).
- Batch Size: 8 (with 8 gradient accumulation steps).
- Precision: BF16 mixed precision.
Efficiency: Achieved a Model FLOPs Utilization (MFU) of 41.3% and a carbon footprint of only 0.0196 tCO2eq.
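
The learning-rate schedule and batch settings above can be sketched as follows. The peak rate, warmup length, batch size, accumulation steps, and context length come from the summary; the total step count and the zero floor are hypothetical, linear warmup is an assumption, and the batch size of 8 is read here as the per-step micro-batch.

```python
import math

PEAK_LR = 0.003
WARMUP_STEPS = 1000
TOTAL_STEPS = 20_000  # hypothetical: the summary does not state the step count
MIN_LR = 0.0          # hypothetical floor for the cosine decay


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


MICRO_BATCH = 8   # sequences per forward/backward pass (assumed reading)
GRAD_ACCUM = 8    # passes accumulated before each optimizer update
CONTEXT = 8192    # tokens per sequence

sequences_per_update = MICRO_BATCH * GRAD_ACCUM
# Upper bound: assumes every sequence fills the whole context window.
tokens_per_update = sequences_per_update * CONTEXT

print(sequences_per_update, tokens_per_update)
for step in (0, 500, 999, 10_000, TOTAL_STEPS):
    print(step, learning_rate(step))
```

Gradient accumulation is what lets a 40GB card behave as if it held 64 sequences at once: memory is sized for 8, and the optimizer only steps after 8 such passes.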

3. Key Contributions

Resource Creation:
- A new, expanded corpus of Indian Supreme Court cases and legal statutes (142.6M words).
- A domain-specific BPE tokenizer tailored for Indian legal text.
- The AYN model (88M parameters), pretrained from scratch.
Empirical Study: A comprehensive comparison of a <100M parameter TLM against LLMs ranging from 1B to 8B parameters on both legal and general NLP tasks.
Cost-Efficiency Demonstration: Showing that a model trained on a single GPU for 185 hours, on a budget under $500, can compete with models trained on massive clusters.

4. Results

A. Legal Task Performance

Legal Case Judgment Prediction (Classification):
- Zero-Shot: AYN (88M) achieved 52.00% accuracy and 0.5037 Macro-F1, outperforming all larger LLMs (1B–8B) and a domain-adapted LLaMA-2 7B (CPTLlama-2) by margins of 1.14% to 15.37%.
- Fine-Tuning: When coupled with a simple discriminative classifier head, AYN reached 69.69% accuracy, significantly outperforming larger models (which plateaued around 60-63%).
Abstractive Summarization:
- AYN rivalled LLMs up to 30 times larger (e.g., LLaMA-3.2 3B) in generating 5,000-token summaries.
- It achieved superior ROUGE-1, BLEU, and METEOR scores compared to LLaMA-3.2 (1B/3B) and OLMo-2 7B, though it trailed behind the 8B models.
- Length Analysis: Optimal performance was observed at 2,048 tokens (highest BERTScore of 0.5287). Performance degraded slightly for extremely long summaries (6k–8k tokens) due to coherence loss.
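
For readers unfamiliar with the metrics in this subsection, the sketch below shows how Macro-F1 (judgment prediction) and a simplified ROUGE-1 F-score (summarization) are computed. It is an illustration only: the labels and sentences are invented, and real ROUGE implementations add tokenization and stemming rules that this whitespace-split version omits.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so both classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-score between a reference and a candidate summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented toy data: 1 = appeal accepted, 0 = appeal rejected.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(macro_f1(y_true, y_pred))

print(rouge1_f("the appeal is dismissed", "the appeal is allowed"))
```

Macro-F1 is a useful companion to accuracy on a two-class task because a model that always predicts one outcome can still score about 50% accuracy on balanced data while its Macro-F1 collapses.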

B. General NLP Benchmarks (Zero-Shot)

Evaluated on MMLU, WIC, QNLI, and LogiQA.
Performance: AYN outperformed six larger LLMs (including Pythia 6.9B, Falcon 7B, and LLaMA 7B) on average, despite being trained exclusively on legal data.
Comparison: It performed on par with OLMo 1B but underperformed the largest models (LLaMA-3 8B, OLMo-2 7B) by 7–10%, which is expected given the narrow training domain.

C. Instruction Tuning (Appendix)

Fine-tuned on 10,763 legal instructions.
Evaluated by GPT-3.5-Turbo: Achieved scores of 8.0+ for relevance and legal reasoning on Supreme Court tasks, demonstrating strong generalization from limited instruction data.

5. Significance and Conclusion

Efficiency vs. Scale: The paper demonstrates that for specific, data-scarce domains like Indian law, a domain-specialized TLM pretrained from scratch is more effective than a general-purpose LLM or a continually pretrained large model.
Accessibility: AYN offers a low-cost, low-carbon alternative (budget < $500, 0.0196 tCO2eq) suitable for resource-constrained environments in the Global South.
Bias Mitigation: By focusing on Indian legal data, the model addresses the Western-centric bias prevalent in current legal LLMs.
Limitations: The model is currently limited to English and the Supreme Court of India (not High/District courts). It lacks human expert evaluation for hallucination and factuality, and no guardrails are currently implemented.

Conclusion: The study validates the hypothesis that "small is beautiful" for domain-specific applications. AYN proves that a tiny, domain-optimized model can outperform models 30–80 times its size on specific legal tasks, offering a sustainable path for legal AI in underrepresented jurisdictions.