Commit graph

5 commits

02daa7bb97 Add SFT training script and run Qwen3-0.6B-Base fine-tune
Train Qwen3-0.6B-Base (596M params) on 36K folksy proverb pairs
using full SFT with HuggingFace TRL. 3 epochs, 11 min on RTX 4090.

Results: train_loss=0.954, eval_loss=1.032, test_loss=1.031
Model checkpoint at folksy-model/final/ (not committed — 1.2 GB)
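
A minimal sketch of the training setup, assuming TRL's SFTTrainer over a
prompt/completion JSONL dataset; the file name, dataset format, and any
settings not named above are assumptions, not the committed script:

    # Minimal SFT sketch with HuggingFace TRL. Only the model id, epoch
    # count, and output dir come from this commit; the rest is illustrative.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    pairs = load_dataset("json", data_files="training_pairs.jsonl", split="train")

    trainer = SFTTrainer(
        model="Qwen/Qwen3-0.6B-Base",
        train_dataset=pairs,
        args=SFTConfig(output_dir="folksy-model", num_train_epochs=3),
    )
    trainer.train()
    trainer.save_model("folksy-model/final")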

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:07:23 -04:00
9298c425bc Add naturalization pass — 9,025 sayings, 36K training pairs
New pipeline step: naturalize_corpus.py runs Prompt A ("dialect coach")
over both polished and previously-discarded sayings, recovering material
that the first polish pass cut too aggressively.

Results:
- 9,468 usable from naturalization (vs 5,499 from initial polish)
- After dedup: 9,025 unique sayings (was 2,312)
- 36,079 training pairs (was 9,257)
- 100% vocab coverage; avg 10.1 words per saying, punchier than the
  previous 13.1
- Relaxed quality filter: rejects only artifacts/nonsense, no longer
  requires noun presence (see the sketch below)
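
A sketch of the relaxed filter and the dedup step as described above; the
normalization key and the artifact patterns are assumptions, not the
script's actual rules:

    # Sketch of the relaxed quality filter + dedup. Patterns and the
    # normalization key are assumptions.
    import re

    ARTIFACT = re.compile(r"\x00|</?\w+>|\{.*\}")  # null bytes, tags, template residue

    def keep(saying):
        """Relaxed filter: reject artifacts/nonsense only, not missing nouns."""
        return bool(saying.strip()) and not ARTIFACT.search(saying)

    def dedup(sayings):
        """First occurrence wins; compare case- and punctuation-insensitively."""
        seen, unique = set(), []
        for s in sayings:
            key = re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
            if key and key not in seen:
                seen.add(key)
                unique.append(s)
        return unique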

New scripts:
- naturalize_corpus.py: gentle LLM naturalization pass, resume-safe
  (sketched after this list)
- rebuild_training_pairs.py: combined filter + dedup + training pair
  generation from naturalized corpus, replaces separate steps
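
What resume-safe means here, as a sketch: skip ids already present in the
output file and flush after every entry, so an interrupted run picks up
where it left off. Field and file names are hypothetical:

    # Resume-safe pass sketch. Field names ("id", "saying") and the output
    # path are assumptions, not naturalize_corpus.py's actual schema.
    import json, os

    def naturalize_all(entries, naturalize, out_path="naturalized.jsonl"):
        done = set()
        if os.path.exists(out_path):
            with open(out_path) as f:
                done = {json.loads(line)["id"] for line in f}
        with open(out_path, "a") as out:
            for entry in entries:
                if entry["id"] in done:
                    continue
                entry["naturalized"] = naturalize(entry["saying"])
                out.write(json.dumps(entry) + "\n")
                out.flush()  # each completed entry survives a crash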

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:24:37 -04:00
651ec3ffc6 Fix generator quality issues and run initial corpus pipeline
Pre-corpus fixes (from EVALUATION.md):
- Clean 2,264 contaminated rows from augmented relations (bridge
  artifacts, full-sentence HasProperty values, null bytes, empty words)
- Fix article logic: dynamic a/an across Deconstruction, FalseEquivalence,
  DenialOfConsequences, TautologicalWisdom templates
- Tighten _short_concepts() default from max_words=3 to 2
- Fix FutilePreparation gerunding: filter vocab nouns and noun-suffix
  words from UsedFor targets; fix CVC doubling for 'y'-ending words
- Add _looks_like_verb() heuristic; improve _a() for vowel-sound edge
  cases (sketched below with the gerund fix)
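
A sketch of the two wording fixes; the exception lists are illustrative
and the gerund() helper name is hypothetical (only _a() appears above):

    # a/an and gerunding heuristics. Exception lists are illustrative;
    # the generator's actual tables may differ.
    VOWELS = "aeiou"

    def _a(word):
        """Pick 'a' or 'an' by approximate vowel *sound*, not first letter."""
        w = word.lower()
        if w.startswith(("hour", "honest", "heir")):   # silent h -> "an hour"
            return "an"
        if w.startswith(("uni", "use", "one", "eu")):  # glide sound -> "a unicorn"
            return "a"
        return "an" if w[:1] in VOWELS else "a"

    def gerund(verb):
        """-ing form with CVC doubling, but never double w/x/y finals."""
        if verb.endswith("e") and not verb.endswith(("ee", "ye", "oe")):
            return verb[:-1] + "ing"               # bake -> baking
        if (len(verb) >= 3 and verb[-1] not in VOWELS + "wxy"
                and verb[-2] in VOWELS and verb[-3] not in VOWELS):
            return verb + verb[-1] + "ing"         # run -> running
        return verb + "ing"                        # play -> playing, not playying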

Pipeline hardening:
- polish_corpus.py: context-size fallback (truncate chain, then minimal
  prompt), classified error types, consecutive-error circuit breaker,
  10-entry flush granularity, ETA tracking, KeyboardInterrupt handling
  (sketched below)
- generate_raw_batch.sh: fix python -> python3
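
A sketch of the hardened polish loop, covering the circuit breaker, flush
cadence, and interrupt handling (the context-size fallback is omitted);
thresholds and logging are assumptions:

    # Hardened loop sketch: classified errors, consecutive-error circuit
    # breaker, 10-entry flushes. Constants are assumptions.
    MAX_CONSECUTIVE = 5
    FLUSH_EVERY = 10

    def polish_all(entries, polish, flush):
        consecutive, batch = 0, []
        try:
            for i, entry in enumerate(entries, 1):
                try:
                    batch.append(polish(entry))
                    consecutive = 0
                except Exception as exc:
                    consecutive += 1
                    print(f"[{type(exc).__name__}] entry {i}: {exc}")
                    if consecutive >= MAX_CONSECUTIVE:
                        raise RuntimeError("circuit breaker tripped") from exc
                if i % FLUSH_EVERY == 0:
                    flush(batch)
                    batch.clear()  # lose at most 10 entries on a crash
        except KeyboardInterrupt:
            print("interrupted; flushing partial results")
        finally:
            flush(batch)           # Ctrl-C and the breaker still flush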

Corpus generation run (9,835 raw -> 5,499 polished -> 2,312 filtered):
- 44.1% discard rate, 0 errors, 82 minutes on RTX 4090
- 9,257 training pairs across 5 input framing types
- 97.6% vocab coverage (609/624 words)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:33:56 -04:00
356b62c6ea Corpus generation (work from mid-February) 2026-03-09 19:52:09 -04:00
8c8a058301 Initial 'folksy idiom' generator 2026-02-15 14:04:25 -05:00