Train Qwen3-0.6B-Base (596M params) on 36K folksy proverb pairs
using full SFT with HuggingFace TRL. 3 epochs, 11 min on RTX 4090.
Results: train_loss=0.954, eval_loss=1.032, test_loss=1.031
Model checkpoint at folksy-model/final/ (not committed — 1.2 GB)
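For reference, a minimal sketch of the training call with TRL's SFTTrainer;
the data file names and batch size are placeholders rather than the run's
actual config, and the pairs are assumed to be prompt/completion JSONL
(a dataset format SFTTrainer accepts):

```python
# Minimal SFT sketch; file names and batch size are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

data = load_dataset(
    "json",
    data_files={"train": "pairs_train.jsonl", "eval": "pairs_eval.jsonl"},
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B-Base",        # 596M-param base model
    args=SFTConfig(
        output_dir="folksy-model",
        num_train_epochs=3,
        per_device_train_batch_size=16,  # assumed; tune to fit a 4090
        eval_strategy="epoch",
    ),
    train_dataset=data["train"],
    eval_dataset=data["eval"],
)
trainer.train()
trainer.save_model("folksy-model/final")
```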
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline step: naturalize_corpus.py runs Prompt A ("dialect coach")
over both the polished and the previously-discarded sayings, recovering
material that the first polish pass cut or rewrote too aggressively
(resume-safe loop sketched after the results below).
Results:
- 9,468 usable from naturalization (vs 5,499 from initial polish)
- After dedup: 9,025 unique sayings (was 2,312)
- 36,079 training pairs (was 9,257)
- 100% vocab coverage; avg length 10.1 words, down from 13.1 (punchier)
- Relaxed quality filter: rejects artifacts/nonsense only, no longer
  requires noun presence
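Resume-safety here amounts to append-only JSONL output plus a skip-set of
already-written ids, so an interrupted run picks up where it left off. A
sketch under assumed field names ("id", "text"); naturalize() is a stand-in
for the actual Prompt A call:

```python
import json
from pathlib import Path

def naturalize(text: str) -> str:
    """Stand-in for the real Prompt A ("dialect coach") LLM call."""
    return text  # no-op placeholder; the real pass rewrites the saying

IN_PATH, OUT_PATH = Path("corpus.jsonl"), Path("naturalized.jsonl")

done = set()
if OUT_PATH.exists():  # resume: collect ids written by earlier runs
    with OUT_PATH.open() as f:
        done = {json.loads(line)["id"] for line in f}

with IN_PATH.open() as src, OUT_PATH.open("a") as out:
    for line in src:
        rec = json.loads(line)
        if rec["id"] in done:
            continue  # already naturalized on a previous run
        rec["naturalized"] = naturalize(rec["text"])
        out.write(json.dumps(rec) + "\n")
        out.flush()  # persist each record so a crash loses at most one
```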
New scripts:
- naturalize_corpus.py: gentle LLM naturalization pass, resume-safe
- rebuild_training_pairs.py: combined filter + dedup + training-pair
  generation from the naturalized corpus; replaces the previous separate
  steps (sketch below)
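A sketch of the combined step; the prompt templates and junk heuristics
below are illustrative stand-ins for the script's real prompts and filter
rules, but the shape (relaxed filter, normalized dedup, then fanning each
saying out over several prompts) lines up with the numbers above:
~9,025 sayings × 4 templates ≈ 36K pairs.

```python
import json
import re

# Hypothetical prompt templates: each unique saying yields several pairs.
TEMPLATES = [
    "Give me a folksy proverb.",
    "Share a bit of folk wisdom.",
    "What's an old saying about life?",
    "Tell me a proverb.",
]

def is_junk(s: str) -> bool:
    """Relaxed filter: reject obvious LLM artifacts/nonsense only;
    noun presence is no longer required. Heuristics are illustrative."""
    low = s.lower()
    return len(s.split()) < 3 or low.startswith(("here is", "sure,")) \
        or "as an ai" in low

seen, pairs = set(), []
with open("naturalized.jsonl") as f:
    for line in f:
        saying = json.loads(line)["naturalized"].strip()
        key = re.sub(r"\W+", " ", saying.lower()).strip()  # dedup key
        if is_junk(saying) or key in seen:
            continue
        seen.add(key)
        pairs.extend({"prompt": t, "completion": saying} for t in TEMPLATES)

with open("pairs.jsonl", "w") as out:
    out.writelines(json.dumps(p) + "\n" for p in pairs)
```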
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>