diff --git a/eval/README.md b/eval/README.md
index d8bbe3a0..d8c178c0 100644
--- a/eval/README.md
+++ b/eval/README.md
@@ -18,17 +18,22 @@ This framework provides tools to evaluate language models on the reasoning_gym d
 
 ## Setup
 
-1. Install the required dependencies:
+1. Install reasoning-gym in development mode:
 ```bash
-pip install -r requirements.txt
+pip install -e ..
 ```
 
-2. Set your OpenRouter API key as an environment variable:
+2. Install the additional dependencies required for evaluation:
+```bash
+pip install -r requirements-eval.txt
+```
+
+3. Set your OpenRouter API key as an environment variable:
 ```bash
 export OPENROUTER_API_KEY=your-api-key
 ```
 
-3. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):
+4. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):
 ```json
 [
   {
@@ -47,9 +52,11 @@ You can run evaluations in two ways:
 
 1. Using the provided bash script:
 ```bash
-./run_eval.sh
+./eval.sh
 ```
 
+Before running, you may want to edit the `eval.sh` script to configure which models to evaluate by modifying the `MODELS` array.
+
 2. Running the Python script directly:
 ```bash
 python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
diff --git a/eval/requirements.txt b/eval/requirements-eval.txt
similarity index 100%
rename from eval/requirements.txt
rename to eval/requirements-eval.txt