reasoning-gym/training/qwen-math
Zafir Stojanovski 0cda6b1205
qwen math training code (#435)
* qwen math training code

* pre-commit
2025-05-16 13:19:19 +02:00

Tina: Tiny Reasoning Models via LoRA


GitHub · Website · Hugging Face Collection · Weights & Biases

Overview

This repository contains the code for the Tina project, accompanying the paper Tina: Tiny Reasoning Models via LoRA. In this project, we try to answer the question: "How cost-effectively can one perform reinforcement learning to efficiently instill reasoning abilities in language models?" Specifically, we explore enhancing the reasoning capabilities of tiny language models by applying low-rank adaptation (LoRA) during reinforcement learning.

Overall Comparison

We show that our Tina models achieve performance competitive with, and in some cases even superior to, SOTA baseline models built on the same base model with full-parameter training. In particular, the best Tina model achieves a >20% performance increase and 43.33% Pass@1 accuracy on AIME24. Notably, reproducing the best Tina checkpoint costs only $9, and reproducing all our experiments from scratch costs $526.

Cost Breakdown

Quick Start

File Setup

  • ./scripts/set/set_vars.sh: contains the main env vars we use. Change the paths (e.g. PROJECT_PREFIX, SCRATCH_PREFIX) to match your own setup, and make sure to add WANDB_API_KEY and HF_TOKEN to your ~/.bashrc file.
  • ./recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/: contains the recipes for each experiment in this project; change the HF Hub id to match your own setup.
  • ./tina/config.py: contains the main configurations for this project; set default values here.
  • ./tina/utils/constant.py: lists the main datasets for each experiment in this project.
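
As a sketch, the variables in set_vars.sh might look like the following. All paths shown here are placeholders for illustration; only PROJECT_PREFIX, SCRATCH_PREFIX, WANDB_API_KEY, HF_TOKEN, and CKPT_DIR are names taken from this README, so check the actual script for the full list.

```shell
# Hypothetical sketch of ./scripts/set/set_vars.sh -- adjust paths to your machine.
export PROJECT_PREFIX="$HOME/projects/tina"   # repository root
export SCRATCH_PREFIX="$HOME/scratch/tina"    # large artifacts (datasets, checkpoints)
export CKPT_DIR="$SCRATCH_PREFIX/ckpts"       # where prepare.sh downloads models

# Secrets are expected in ~/.bashrc rather than here, e.g.:
# export WANDB_API_KEY=...   # from your Weights & Biases account settings
# export HF_TOKEN=...        # from your Hugging Face account settings
```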

Env Setup

First, install Miniconda:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all

Then, run the following commands to install the dependencies.

conda update -n base -c defaults conda -y
conda install -n base -c conda-forge mamba -y

mamba create -n tina python=3.10 -y && mamba activate tina
./scripts/set/set_env.sh && mamba deactivate

mamba create -n tina_eval python=3.11 -y && mamba activate tina_eval
./scripts/set/set_env_eval.sh && mamba deactivate

# download the pre-trained models to the `CKPT_DIR` directory.
./scripts/set/prepare.sh

Important

For Reasoning Gym, you need to install lighteval from source on a particular branch, due to a known issue with evaluating on low-sample datasets such as AIME24. Once the branch is merged into main, we will update the instructions accordingly.

cd /path/to/installation/folder # e.g. /root/projects
git clone git@github.com:huggingface/lighteval.git
cd lighteval
git checkout remotes/origin/tune-pass-at-k
pip install -e .
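
To see why low-sample datasets are sensitive here: AIME24 has only 30 problems, so the per-problem pass@k estimate matters a lot. As an illustration (this is the standard unbiased estimator, not lighteval's actual code), pass@k from n samples with c correct per problem can be computed as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of correct samples among them
    k: budget we are estimating for (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Averaging over problems gives the benchmark score, e.g. Pass@1 on AIME24.
# Three hypothetical problems, 4 samples each, with 4 / 2 / 0 correct:
scores = [pass_at_k(n=4, c=c, k=1) for c in (4, 2, 0)]
print(sum(scores) / len(scores))  # -> 0.5
```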

Training & Evaluation

  • LoRA-based RL with GRPO: ./scripts/training/post_train_grpo.sh
Ablation
  • Re-evaluate baseline models: ./scripts/training/post_train_eval_baselines.sh
Baseline Re-evaluation
  • Evaluate post-trained models: ./scripts/training/post_train_eval_local.sh
Tina Evaluation

After post-training, we have the following file structure in the CKPT_DIR directory.

CKPT_DIR/
│
├── models/
│   ├── DeepSeek-R1-Distill-Qwen-1.5B/
│   │   ├── base/ # pre-trained models
│   │   ├── grpo_PT_DATASET_I/ # post-trained models via GRPO using PT_DATASET_I
│   │   │   ├── checkpoint-i/ # checkpoints are kept stepwise during post-training
│   │   │   └── ...
│   │   ├── grpo_PT_DATASET_II/ # post-trained models via GRPO using PT_DATASET_II
│   │   │   ├── checkpoint-i/
│   │   │   └── ...
│   │   └── ...

Acknowledgements

We thank Hugging Face for open-sourcing the amazing open-r1 project, which is the starting codebase for our Tina project. We also appreciate all the researchers who released their open-source reasoning datasets, including open-r1/OpenR1-Math-220k, bethgelab/CuratedThoughts, agentica-org/DeepScaleR-Preview-Dataset, RUC-AIBOX/STILL-3-Preview-RL-Data, knoveleng/open-rs, knoveleng/open-s1, and GAIR/LIMR, which we use for training.

Tina's avatar was generated by GPT-4o, based on KYNE's girls and the following prompt.

Hi, I'm Tina — an INTJ who's all about getting to the essence of things. I study reasoning models because I'm fascinated by how structured thinking and logic can emerge from data. Outside of that, I recharge with movies, music, and the occasional gaming session. I believe in strategic effort: minimal input, maximum impact — whether it's in research or everyday life, I'm always looking for the most efficient path to meaningful results.

Citation

@misc{wang2025tinatinyreasoningmodels,
      title={Tina: Tiny Reasoning Models via LoRA},
      author={Shangshang Wang and Julian Asilis and Ömer Faruk Akgül and Enes Burak Bilgin and Ollie Liu and Willie Neiswanger},
      year={2025},
      eprint={2504.15777},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15777},
}