mirror of
https://github.com/thinking-machines-lab/tinker.git
synced 2026-04-30 17:40:38 +00:00
Training jobs using on-policy methods like GKD would fail permanently when encountering transient server errors during KL penalty computation. The RetryHandler wasn't configured to retry RequestFailedError exceptions, causing the entire training run to abort on the first occurrence.

Changes:
- Add retry logic for RequestFailedError when category is Unknown
- Skip retry for User/Server categories (these require code changes)
- Add max_retries parameter (default 5) to prevent infinite loops
- Improve logging to show error category for debugging

The Unknown category indicates transient server-side issues that often resolve on retry, similar to 5xx HTTP errors. User errors are not retried since they indicate invalid input that won't succeed without changes.

Fixes #158

| File |
|---|
| docs |
| scripts |
| src/tinker |
| tests |
| .gitignore |
| .python-version |
| .ruff.toml |
| .stats.yml |
| .sync_state |
| LICENSE |
| mypy.ini |
| pydoc-markdown.yml |
| pyproject.toml |
| README.md |
| requirements-dev.lock |
| test_fixes.py |
| uv.lock |
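
The retry policy described in the latest commit message above can be illustrated with a small, self-contained sketch. It is not the SDK's actual `RetryHandler`: `ErrorCategory`, the local `RequestFailedError` class, `call_with_retries`, and the `base_delay` backoff parameter are hypothetical stand-ins defined here for illustration. The sketch also assumes that `max_retries=5` means up to five retries after the first attempt, which the commit message does not specify.

```python
import enum
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger("retry_sketch")
T = TypeVar("T")


class ErrorCategory(enum.Enum):
    # Hypothetical stand-in for the error categories named in the commit message.
    UNKNOWN = "Unknown"
    SERVER = "Server"
    USER = "User"


class RequestFailedError(Exception):
    # Local stand-in defined for this sketch, not the SDK's own exception class.
    def __init__(self, message: str, category: ErrorCategory) -> None:
        super().__init__(message)
        self.category = category


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,  # assumption: exponential backoff between attempts
) -> T:
    """Call fn, retrying only failures whose category is UNKNOWN."""
    retries = 0
    while True:
        try:
            return fn()
        except RequestFailedError as exc:
            # Log the category so that aborted runs are easier to debug.
            logger.warning(
                "request failed (category=%s, attempt=%d): %s",
                exc.category.value, retries + 1, exc,
            )
            # User/Server failures are re-raised immediately; so is an
            # UNKNOWN failure once the retry budget is exhausted.
            if exc.category is not ErrorCategory.UNKNOWN or retries >= max_retries:
                raise
            retries += 1
            time.sleep(base_delay * 2 ** (retries - 1))
```

Bounding the loop with `max_retries` keeps a persistently failing request from stalling a training run forever, while re-raising non-`UNKNOWN` categories surfaces errors that a retry would not fix.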
Tinker Python SDK
Documentation: tinker-docs.thinkingmachines.ai