⏱️ Estimated reading time: 8 minutes

Introduction

The Tiny Reasoning Language Model (TRLM-135M) is a 135M-parameter research prototype designed to study how small models can learn step-by-step reasoning. Built on top of SmolLM2-135M-Instruct, it is fine-tuned through a 3-stage pipeline that layers explicit reasoning on top of general conversational ability.

Core Features of TRLM-135M

Model Architecture

  • Base Model: SmolLM2-135M-Instruct (Llama 3 based)
  • Parameters: ~135M (see the quick check after this list)
  • Precision: Mixed precision (bfloat16) training
  • Architecture: Decoder-only transformer
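
As a quick sanity check, the parameter count and architecture family can be read straight from the released checkpoint (a minimal sketch; the exact count may differ slightly from the rounded ~135M figure):

from transformers import AutoModelForCausalLM

# Load the checkpoint and report its size and declared architecture family
model = AutoModelForCausalLM.from_pretrained("Shekswess/trlm-135m")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
print(model.config.model_type)  # architecture family declared in the model config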

3-Stage Training Pipeline

Stage 1: General Instruction Tuning (SFT)

  • Data: ~58,000 samples
  • Content: Everyday conversations and instruction following
  • Purpose: Establish basic conversational abilities

Stage 2: Reasoning Trace Learning (SFT)

  • Data: ~78,000 samples
  • Feature: Reasoning traces wrapped in <think> tags (see the illustrative samples below)
  • Purpose: Learn step-by-step thinking processes

Stage 3: Preference Alignment (DPO)

  • Data: ~50,000 preference pairs
  • Content: Chosen vs rejected reasoning traces
  • Purpose: Align reasoning style preferences
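
To make Stages 2 and 3 concrete, here is a sketch of what individual training records might look like. The field names and contents are illustrative assumptions rather than the published dataset schema; only the <think> wrapping (Stage 2) and the chosen-vs-rejected pairing (Stage 3) come from the description above.

# Hypothetical Stage 2 record: the assistant turn wraps its reasoning
# trace in <think> tags before giving the final answer.
stage2_sample = {
    "messages": [
        {"role": "user", "content": "What is 17 + 26?"},
        {
            "role": "assistant",
            "content": "<think>17 + 20 = 37, and 37 + 6 = 43.</think> 17 + 26 = 43.",
        },
    ]
}

# Hypothetical Stage 3 record: a preference pair with a chosen and a
# rejected reasoning trace for the same prompt.
stage3_pair = {
    "prompt": "What is 17 + 26?",
    "chosen": "<think>17 + 20 = 37, and 37 + 6 = 43.</think> The answer is 43.",
    "rejected": "<think>17 + 26 is roughly 40.</think> The answer is 40.",
}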

Performance Evaluation Results

TRLM-135M outperforms the baseline SmolLM2-135M-Instruct on most benchmarks, with a small regression on PIQA:

Benchmark       TRLM-135M   SmolLM2-135M-Instruct   Improvement
ARC Challenge   40.61       37.3                    +3.31
BBH             36.80       28.2                    +8.60
BoolQ           62.17       N/A                     N/A
GSM8K           2.59        1.4                     +1.19
IFEval          35.49       29.9                    +5.59
MMLU            34.95       29.3                    +5.65
PIQA            64.91       66.3                    -1.39

Usage Guide

Installation and Setup

pip install -U transformers accelerate

Basic Usage Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shekswess/trlm-135m"
device = "cuda"  # or "cpu"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Example prompt
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "user", "content": prompt}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For reasoning-heavy tasks, use the following sampling parameters (applied in the snippet after this list):

  • temperature=0.6
  • top_p=0.95
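
Building on the basic example above, a generation call with these settings might look like this (max_new_tokens=512 is an arbitrary choice; do_sample must be enabled for temperature and top_p to take effect):

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # sampling must be on for temperature/top_p to apply
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))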

Technical Innovations

1. Step-by-Step Reasoning Learning

TRLM-135M uses <think> tags so that the model learns to produce an explicit reasoning trace before its final answer. The goal is to elicit genuine step-by-step reasoning rather than surface-level pattern matching.
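
When presenting results to an end user, it is often useful to separate the reasoning trace from the final answer. A minimal sketch, assuming the model emits a single <think>...</think> block as in its Stage 2 training data:

import re

def split_reasoning(generated_text: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer.

    Assumes at most one <think> block, matching the Stage 2 format above.
    """
    match = re.search(r"<think>(.*?)</think>", generated_text, flags=re.DOTALL)
    if match is None:
        return "", generated_text.strip()
    reasoning = match.group(1).strip()
    answer = generated_text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2 + 2 = 4</think> The answer is 4.")
print(reasoning)  # "2 + 2 = 4"
print(answer)     # "The answer is 4."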

2. Reasoning Quality Improvement through DPO

Direct Preference Optimization (DPO) trains the model to prefer better reasoning processes, improving accuracy and consistency in reasoning tasks.
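
For intuition, the core DPO objective fits in a few lines of PyTorch. This is a minimal sketch of the loss behind Stage 3, not the project's actual training code: the inputs are per-response log-probabilities under the trained policy and a frozen reference model, and beta=0.1 is a typical but assumed value.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on per-response log-probabilities."""
    # How much more the policy prefers the chosen response than the reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    # ... and likewise for the rejected response
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected via a logistic loss
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities for two preference pairs
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -9.0]),
    ref_chosen_logps=torch.tensor([-13.0, -10.0]),
    ref_rejected_logps=torch.tensor([-13.5, -9.8]),
)
print(loss.item())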

3. Efficiency of Small Models

With only 135M parameters, TRLM-135M shows that measurable reasoning gains are achievable under tight compute and memory constraints.
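
Some back-of-the-envelope arithmetic illustrates why this matters for constrained hardware: the weights alone occupy only a few hundred megabytes, before activations and KV cache.

# Rough weight-memory footprint of a 135M-parameter model at common precisions
# (weights only; activations and KV cache are not included).
n_params = 135_000_000
for dtype_name, bytes_per_param in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{dtype_name}: ~{n_params * bytes_per_param / 1e6:.0f} MB")
# float32: ~540 MB, bfloat16: ~270 MB, int8: ~135 MB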

Research Significance

Expanding Small Model Capabilities

TRLM-135M shows that, with appropriate training methods, small models can be taught structured reasoning. This opens new possibilities for AI applications on edge devices and in mobile environments.

Reasoning Learning Methodology

The 3-stage pipeline offers a concrete recipe for building reasoning capabilities into small models and a useful reference point for future small-model development.

Limitations and Considerations

Not Production-Ready

  • Hallucinations: Frequent logical errors and incorrect information generation
  • Small Size: Limited general knowledge and reasoning depth
  • English Only: Multilingual capabilities not explored

Usage Considerations

  • Recommended for research and experimental purposes only
  • Should not be used for critical decision-making
  • Requires additional verification and review

Future Development Directions

1. Multilingual Support

Expanding the currently English-only model to support multiple languages would increase global usability.

2. Domain Specialization

Reasoning models specialized for specific domains such as healthcare, law, or science could be developed on the same recipe.

3. Efficiency Improvements

Research is needed to achieve the same performance with even fewer parameters.

Conclusion

TRLM-135M is a meaningful step in small-model reasoning research. With only 135M parameters, it achieves measurable reasoning gains through its 3-stage training pipeline, expanding what can reasonably be expected of small models.

As edge computing and mobile AI become increasingly important, research into small reasoning models like TRLM-135M is highly valuable. We can expect to see even more advanced small reasoning models in the future.

💡 Tip: When using TRLM-135M, apply temperature=0.6 and top_p=0.95 settings for reasoning tasks. This configuration helps achieve more consistent and logical reasoning results.