mirror of github.com/thinking-machines-lab/tinker
Blake Ledden 55b60f5c5c fix: Retry RequestFailedError with Unknown category
Training jobs using on-policy methods like GKD would fail permanently
when encountering transient server errors during KL penalty computation.
The RetryHandler wasn't configured to retry RequestFailedError exceptions,
causing the entire training run to abort on the first occurrence.

Changes:
- Add retry logic for RequestFailedError when category is Unknown
- Skip retry for User/Server categories (these require code changes)
- Add max_retries parameter (default 5) to prevent infinite loops
- Improve logging to show error category for debugging

The Unknown category indicates transient server-side issues that often
resolve on retry, similar to 5xx HTTP errors. User errors are not retried
since they indicate invalid input that won't succeed without changes.

Fixes #158
2025-12-20 22:55:50 -08:00
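
The retry policy described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual `RetryHandler`: the `ErrorCategory` enum, the `RequestFailedError` constructor, the `call_with_retry` helper, and the exponential backoff are all assumptions made for the example. Only the behavior is taken from the commit message: retry on the Unknown category, no retry for User/Server, a `max_retries` default of 5, and the category included in the log output.

```python
import enum
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


class ErrorCategory(enum.Enum):
    """Hypothetical stand-in for the SDK's error categories."""

    UNKNOWN = "Unknown"
    USER = "User"
    SERVER = "Server"


class RequestFailedError(Exception):
    """Hypothetical stand-in; the real SDK exception may look different."""

    def __init__(self, message: str, category: ErrorCategory) -> None:
        super().__init__(message)
        self.category = category


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only RequestFailedError whose category is Unknown."""
    attempt = 0
    while True:
        try:
            return fn()
        except RequestFailedError as err:
            # Log the category so failures are easier to diagnose.
            logger.warning(
                "request failed (category=%s, attempt=%d): %s",
                err.category.value,
                attempt + 1,
                err,
            )
            # User/Server errors are not retried; Unknown is retried
            # until max_retries retries have been used up.
            if err.category is not ErrorCategory.UNKNOWN or attempt >= max_retries:
                raise
            # Exponential backoff is an assumption, not stated in the commit.
            sleep(base_delay * 2**attempt)
            attempt += 1
```

With the default `max_retries=5`, a call that keeps failing with the Unknown category is attempted six times in total (one initial call plus five retries) before the error propagates, while a User or Server error propagates after the first attempt. The `sleep` parameter is injectable so the backoff can be skipped in tests.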

| Name | Last commit | Date |
| --- | --- | --- |
| docs | Sync contents | 2025-12-15 01:07:10 +00:00 |
| scripts | Sync contents | 2025-11-25 03:53:31 +00:00 |
| src/tinker | fix: Retry RequestFailedError with Unknown category | 2025-12-20 22:55:50 -08:00 |
| tests | Sync contents | 2025-12-15 01:00:20 +00:00 |
| .gitignore | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| .python-version | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| .ruff.toml | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| .stats.yml | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| .sync_state | Sync contents | 2025-12-15 01:07:10 +00:00 |
| LICENSE | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| mypy.ini | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| pydoc-markdown.yml | Sync contents | 2025-11-25 06:15:14 +00:00 |
| pyproject.toml | Sync contents | 2025-12-08 00:05:28 +00:00 |
| README.md | Sync contents | 2025-10-01 18:17:16 +00:00 |
| requirements-dev.lock | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| test_fixes.py | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |
| uv.lock | Publish Python SDK | 2025-10-01 10:33:59 -07:00 |

Tinker Python SDK