Add naturalization pass — 9,025 sayings, 36K training pairs · 9298c425bc - john/folksy_idioms - Fight Fire with Fire Robotics

Add naturalization pass — 9,025 sayings, 36K training pairs

New pipeline step: naturalize_corpus.py runs Prompt A ("dialect coach")
over both polished and previously-discarded sayings, recovering material
the first polish pass was too aggressive with.

Results:
- 9,468 usable from naturalization (vs 5,499 from initial polish)
- After dedup: 9,025 unique sayings (was 2,312)
- 36,079 training pairs (was 9,257)
- 100% vocab coverage, avg 10.1 words per saying (punchier than the previous 13.1)
- Relaxed quality filter: drops only artifacts/nonsense, no longer filters on noun presence
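
A minimal sketch of how a relaxed filter plus dedup step like the one
summarized above could look. This is not the code from this commit: the
names normalize, is_usable, filter_and_dedup, the artifact patterns, and
the word-count bounds are all assumptions made for illustration.

```python
import re
import string

# Hypothetical artifact patterns: leftover LLM chatter, markup, placeholders.
ARTIFACT_RE = re.compile(r"(as an ai|here is|here's|\{.*\}|<.*>|```)", re.IGNORECASE)


def normalize(saying: str) -> str:
    """Build a dedup key: lowercase, strip ASCII punctuation, collapse whitespace."""
    text = saying.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def is_usable(saying: str, min_words: int = 3, max_words: int = 25) -> bool:
    """Relaxed filter: reject only obvious artifacts and out-of-range lengths.

    There is deliberately no part-of-speech (noun presence) check here.
    """
    words = saying.split()
    if not (min_words <= len(words) <= max_words):
        return False
    return ARTIFACT_RE.search(saying) is None


def filter_and_dedup(sayings):
    """Keep the first occurrence of each usable saying, in input order."""
    seen = set()
    kept = []
    for raw in sayings:
        saying = raw.strip()
        if not is_usable(saying):
            continue
        key = normalize(saying)
        if key in seen:
            continue
        seen.add(key)
        kept.append(saying)
    return kept
```

Deduplicating on a normalized key rather than the raw string means
variants that differ only in case or punctuation collapse to one entry.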

New scripts:
- naturalize_corpus.py: gentle LLM naturalization pass, resume-safe
- rebuild_training_pairs.py: combined filter + dedup + training pair
  generation from naturalized corpus, replaces separate steps
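
For readers unfamiliar with the term, "resume-safe" can be sketched as
follows, assuming JSONL input and output with an "id" field per record.
The record layout, the function names load_done_ids and run_pass, and the
naturalize callable (standing in for the LLM call) are hypothetical and
not taken from naturalize_corpus.py.

```python
import json
from pathlib import Path


def load_done_ids(out_path: Path) -> set:
    """Collect ids already written to the output file by earlier runs."""
    done = set()
    if out_path.exists():
        with out_path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError):
                    continue  # ignore a truncated line from an interrupted run
    return done


def run_pass(records, out_path: Path, naturalize) -> int:
    """Process only records not yet in out_path; return how many were processed."""
    done = load_done_ids(out_path)
    # If a previous run died mid-write, start the next record on a fresh line.
    needs_newline = (
        out_path.exists()
        and out_path.stat().st_size > 0
        and not out_path.read_bytes().endswith(b"\n")
    )
    processed = 0
    with out_path.open("a", encoding="utf-8") as out:
        if needs_newline:
            out.write("\n")
        for rec in records:
            if rec["id"] in done:
                continue
            result = {"id": rec["id"], "text": naturalize(rec["text"])}
            out.write(json.dumps(result, ensure_ascii=False) + "\n")
            out.flush()  # persist each record so an interruption loses little work
            processed += 1
    return processed
```

Appending one record per line and flushing after each write means a
restarted run only redoes the records that never reached the file.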

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

John McCardle

2026-03-10 07:24:37 -04:00

parent 651ec3ffc6

commit 9298c425bc

6 changed files with 65131 additions and 11532 deletions

45090 corpus/training_pairs.jsonl

File diff suppressed because it is too large
