cleanup 3

Jai Suphavadeeprasit 2026-02-13 12:39:37 -05:00
parent fe5b13a5da
commit e2e8268f2a
4 changed files with 4 additions and 93 deletions

@@ -330,7 +330,6 @@ The trainer supports multiple optimizer options to trade off between speed, memo
| Optimizer | Memory | Speed | State precision | Dependency |
|---|---|---|---|---|
| `adamw` | ~32GB (for 8B model) | Fastest | Full FP32 | None |
| `adamw_8bit` (default) | ~8GB | Fast | 8-bit quantized | `bitsandbytes` |
| `adafactor` | ~8GB | Fast | Full (no momentum) | `transformers` |
| `adamw_cpu` | ~0GB (on CPU) | ~2x slower | Full FP32 | None |

**Usage:**
```bash
@@ -342,21 +341,16 @@ The trainer supports multiple optimizer options to trade off between speed, memo
# Adafactor - no momentum states, good for large models
--optimizer adafactor
# CPU offload - experimental, use when nothing else fits
--optimizer adamw_cpu
```

**Recommendations:**
- **8B models on 80GB:** Use `adamw` (fastest)
- **14B+ models on 80GB:** Use `adamw_8bit` or `adafactor`
- **24B models:** Use `adafactor` with a reduced batch size (see the sizing helper sketched after this list)
- **`adamw_cpu`:** Experimental and not well tested; ~2x slower due to CPU↔GPU transfers

**Potential Risks:**
- `adamw_8bit`: Quantized optimizer states may slightly affect convergence in edge cases; generally safe
- `adafactor`: The lack of momentum can make training slightly less stable; prefer larger batch sizes
- `adamw_cpu`: Significantly slower; use only when nothing else fits (a sketch of the offload pattern follows this list)
---