# Corpus Quality Review
Review date: 2026-03-27. All data sampled directly from corpus files on disk.
## 1. Corpus Stats

| File | Entries | Size | Format |
|---|---|---|---|
| `corpus_raw.jsonl` | 9,835 | 4.5 MB | JSONL — raw template output with debug metadata |
| `corpus_polished.jsonl` | 9,835 | 5.2 MB | JSONL — all entries after GLM4-32B polish (includes discards) |
| `corpus_naturalized.jsonl` | 19,540 | 13 MB | JSONL — naturalization pass (polished + recovered discards, 2 variants each) |
| `corpus_filtered.jsonl` | 9,025 | 6.0 MB | JSONL — deduplicated final sayings |
| `training_pairs.jsonl` | 36,079 | 7.2 MB | JSONL — `{input, output, meta_template, source_words}` |
Token estimates (words × 1.3 subword factor):
- Sayings only: 91,428 words → ~119K tokens
- Training pairs (input + output): 582,598 words → ~757K tokens
Average saying length: 10.1 words
Vocab coverage: 624/624 (100%)
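A minimal sketch of the heuristic behind these token estimates — the `input`/`output` field names come from the `training_pairs.jsonl` schema above; the saying field name in `corpus_filtered.jsonl` is an assumption:

```python
import json

SUBWORD_FACTOR = 1.3  # rough words -> tokens multiplier used throughout this review

def estimate_tokens(path, fields):
    """Whitespace word count over the given JSONL fields, times the subword factor."""
    words = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            words += sum(len(str(entry[k]).split()) for k in fields if k in entry)
    return round(words * SUBWORD_FACTOR)

# ~757K tokens for the pairs (input + output), per the numbers above.
print(estimate_tokens("training_pairs.jsonl", ["input", "output"]))
# "text" is an assumed field name for the saying in corpus_filtered.jsonl.
print(estimate_tokens("corpus_filtered.jsonl", ["text"]))
```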
### Distribution by meta-template
| Template | Count | % | Status |
|---|---|---|---|
| false_equivalence | 1,897 | 21.0% | OK |
| futile_preparation | 1,735 | 19.2% | OK |
| ironic_deficiency | 1,563 | 17.3% | OK |
| deconstruction | 1,544 | 17.1% | OK |
| hypocritical_complaint | 811 | 9.0% | ⚠ Below 10% |
| denial_of_consequences | 750 | 8.3% | ⚠ Below 10% |
| tautological_wisdom | 725 | 8.0% | ⚠ Below 10% |
Three families are below the 10% balance threshold from the spec. The model will see 2.6× more false_equivalence examples than tautological_wisdom.
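The distribution and threshold check above reduce to a counter over the `meta_template` field — a sketch, assuming `corpus_filtered.jsonl` carries `meta_template` per entry, as the training pairs do:

```python
import json
from collections import Counter

counts = Counter()
with open("corpus_filtered.jsonl", encoding="utf-8") as f:
    for line in f:
        counts[json.loads(line).get("meta_template", "?")] += 1

total = sum(counts.values())
for template, n in counts.most_common():
    flag = "" if n / total >= 0.10 else "  <- below 10% spec threshold"
    print(f"{template:25s} {n:5,d}  {n / total:5.1%}{flag}")
```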
### Training pair framing types
| Framing | Count |
|---|---|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |
| Total | 36,079 |
No fictional entity pairs are present in the current corpus.
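For reference, a hypothetical `training_pairs.jsonl` record in the schema above — the `output` is a real sample from section 2, but the other field values are illustrative, not drawn from the corpus:

```json
{
  "input": "Write a folksy saying that uses the word 'anchor'.",
  "output": "An anchor's just an iron that got too big for its britches.",
  "meta_template": "false_equivalence",
  "source_words": ["anchor", "iron"]
}
```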
## 2. Random Samples

Pulled via `shuf` from `corpus_filtered.jsonl` (unseeded draws, hitting different parts of the file; a reproducible equivalent is sketched after the samples):
- [ironic_deficiency] The coffee-maker's always short on fabric.
- [ironic_deficiency] The man who builds the nest hasn't got a single feather.
- [denial_of_consequences] A feller who builds the shelf can't gripe about the tape.
- [false_equivalence] An anchor's just an iron that got too big for its britches.
- [futile_preparation] Fill a seed-bin with rubbish, won't get you a ship.
- [futile_preparation] You can sweep all you want, but it won't get you measuring angles.
- [false_equivalence] Water's just juice without the color, I reckon.
- [futile_preparation] Putting the cart before the horse, hoping for the best.
- [futile_preparation] Skipping breakfast's like praying for leftovers.
- [false_equivalence] A van-sized thing'll make you a canoe.
- [false_equivalence] A gazelle's just a lightweight with long legs.
- [futile_preparation] Hoofing the claw and hoping to play baseball.
- [false_equivalence] A bull's just a steer with folks who tolerate his ways.
- [futile_preparation] Grandma always said, "Drinkin' the coffee won't keep you workin'."
- [hypocritical_complaint] A fella picks pewter scrap and says copper ain't worth a thing.
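As referenced above, a reproducible stand-in for the `shuf` draw — the seed and the `text` field name are assumptions; the original pull was unseeded:

```python
import json
import random

random.seed(0)  # assumed seed for reproducibility; the original draw was unseeded

with open("corpus_filtered.jsonl", encoding="utf-8") as f:
    lines = f.readlines()

for line in random.sample(lines, 15):
    entry = json.loads(line)
    # "text" is an assumed field name for the saying
    print(f"[{entry.get('meta_template', '?')}] {entry.get('text', '').strip()}")
```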
## 3. Quality Spectrum

### Best (most natural, most folksy)
These sound like something you'd actually hear on a porch:
- [deconstruction] Plastic ain't nothing without its fuel. Just carbon thinkin' it's better than the rest.
- [futile_preparation] Your grandma was right — ain't no flute gonna bring a whole band.
- [false_equivalence] An eagle's just a crow that sees its lunch from far off.
Runner-up gems from the random sample:
- A bull's just a steer with folks who tolerate his ways.
- An anchor's just an iron that got too big for its britches.
### Worst (most stilted, most obviously generated)
These read like broken Mad Libs:
- [ironic_deficiency] Dolphin kin always do without the wave fin.
- [ironic_deficiency] A thrush's kin goes without the fly.
- [false_equivalence] A van-sized thing'll make you a canoe.
These share a pattern: the naturalization pass couldn't salvage the underlying nonsense relationship. "Dolphin kin" and "wave fin" are ConceptNet artifacts that survived the filter.
### Borderline
Grammatically fine but flat — could go either way:
- [deconstruction] Without the beef, it's just plain old bread.
- [deconstruction] A bouquet ain't much without the star flower.
- [tautological_wisdom] An ostrich ain't no good without its grub.
These are competent filler. They won't teach the model bad habits, but they won't teach it flair either.
## 4. Dropped Noun Check

### Background
The original pipeline's quality filter required ≥2 slot-fill nouns present in the polished text (the lost_key_nouns check in filter_corpus.py). The later rebuild_training_pairs.py relaxed this requirement because the naturalization pass often rephrases concepts rather than repeating slot words verbatim.
### Current state
Strict check (all slot words including property descriptions): 2,197 of 9,025 entries (24.3%) have >50% of slot words missing. However, this is misleading — slot values like "essential to life", "less dense than water", and "flicking sound" are property descriptions, not nouns the saying should contain.
Core noun check (A/B slots only): 853 of 9,025 entries (9.5%) are missing at least one of the two primary concept nouns.
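A sketch of the core-noun check — the `slots` and `text` field names are assumptions about the entry schema; the real logic lives in `rebuild_training_pairs.py`:

```python
import json

def missing_core_nouns(entry):
    """Return A/B slot nouns that never appear in the saying text.

    Matching is naive lowercase substring, so synonyms and generalizations
    ("hamburger" -> "burger", "sailboat" -> "boat") are still flagged as
    missing — which is exactly why the 9.5% figure overstates real drift.
    """
    text = entry["text"].lower()                        # assumed field name
    core = [entry["slots"].get(k) for k in ("A", "B")]  # assumed slot layout
    return [n for n in core if n and n.lower() not in text]

with open("corpus_filtered.jsonl", encoding="utf-8") as f:
    flagged = sum(1 for line in f if missing_core_nouns(json.loads(line)))
print(flagged)  # 853 of 9,025 (9.5%) per the check above
```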
### Examples of dropped core nouns
| # | Saying | Missing noun | What happened |
|---|---|---|---|
| 1 | "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest." | hamburger (slot A) | Shortened to "burger" — acceptable synonym |
| 2 | "A boat without wood's just wet water, you know." | sailboat (slot A) | Generalized to "boat" — acceptable |
| 3 | "A gilded rifle's just wood that got fancy." | weapon (slot A) | Replaced by the more specific "rifle" (slot B) — fine |
| 4 | "Soda without the fizz is just water." | glass (slot B) | "Glass" (container material) was irrelevant to meaning — LLM correctly dropped it |
| 5 | "A falcon without its little buddy's got big ideas and nowhere sensible to fly 'em." | oxpecker, wing, flapping (slots B/C/D) | Heavy rewrite; meaning drifted |
### Verdict
Most "dropped nouns" are acceptable: the LLM used synonyms (hamburger → burger), generalizations (sailboat → boat), or correctly dropped irrelevant slot fills. True meaning-drift cases (example 5) exist but are uncommon. The relaxed filter was the right call — the strict lost_key_nouns filter in discard_analysis.csv already caught and discarded 689 entries during the polish phase.
The dropped noun issue from the prior session appears resolved. The naturalization pass and relaxed rebuild filter handle it appropriately.
## 5. Processing Pipeline Status

### Pipeline stages (in execution order)

| Stage | Script | Status | Entries In → Out | Errors |
|---|---|---|---|---|
| 1. Raw generation | `generate_raw_batch.sh` | ✅ Complete | → 9,835 | 0 |
| 2. LLM polish | `polish_corpus.py` | ✅ Complete (81.5 min) | 9,835 → 5,499 polished + 4,336 discards | 0 |
| 3. Naturalization | `naturalize_corpus.py` | ✅ Complete (147.8 min) | 9,835 → 9,468 usable | 0 |
| 4. Rebuild (filter + dedup + format) | `rebuild_training_pairs.py` | ✅ Complete | 19,031 → 9,025 filtered → 36,079 pairs | 0 |
All stages completed with zero errors across all runs.
### Discard breakdown
| Reason | Count |
|---|---|
| LLM polish → DISCARD | 4,336 (44.1% of raw) |
| Near-duplicate removal | 2,495 |
| Lost key nouns (strict filter) | 689 |
| Too long (>25 words) | 3 |
| Naturalization filtered | 73 |
| Naturalization skipped | 436 |
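The exact near-duplicate criterion lives in `rebuild_training_pairs.py` and isn't reproduced here; a plausible minimal version, assuming normalized-text similarity (illustrative, not the script's actual logic):

```python
import json
import re
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s.lower())).strip()

def near_dup(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() >= threshold

kept, seen, total = [], [], 0
with open("corpus_naturalized.jsonl", encoding="utf-8") as f:
    for line in f:
        total += 1
        text = normalize(json.loads(line)["text"])  # "text" field is an assumption
        # O(n^2) pairwise scan: fine for an illustration, slow at ~19K entries;
        # a production pass would bucket by length or shingle-hash first.
        if not any(near_dup(text, prior) for prior in seen):
            kept.append(line)
            seen.append(text)

print(f"kept {len(kept)} of {total}")
```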
### Logs

- `corpus/polish_log.txt` — clean run, 0 errors, steady 1.6–2.0 req/s throughput
- `corpus/naturalize_log.txt` — clean run, 0 errors, steady 1.0–1.1 req/s throughput
## 6. Training Readiness Assessment

### What's ready
- Volume: 9,025 unique sayings and 36,079 training pairs is a solid corpus for a 0.5B fine-tune. ~757K tokens of training data.
- Pipeline integrity: All stages completed with zero errors. Clean logs, full checkpointing.
- Vocab coverage: 100% of the 624-word vocabulary appears in the corpus.
- Format: Training pairs are clean `{input, output, meta_template, source_words}` JSONL — plug directly into HF Trainer or axolotl (see the loading sketch after this list).
- No AI-isms: Zero instances of common LLM crutch phrases ("it is important", "in conclusion", etc.).
- Grammar: Zero truncated or grammatically broken sentences detected (no trailing articles, no double articles, no unfilled slots).
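As noted in the Format bullet, a minimal loading sketch for the HF path — the `prompt`/`completion` renaming targets TRL's prompt-completion dataset format (supported in recent TRL versions; verify against yours), and the split settings are assumptions, not the run's actual config:

```python
from datasets import load_dataset

# Load the 36,079 pairs straight from disk.
pairs = load_dataset("json", data_files="training_pairs.jsonl", split="train")

# TRL's SFTTrainer understands prompt/completion columns directly.
pairs = pairs.rename_columns({"input": "prompt", "output": "completion"})
pairs = pairs.remove_columns(["meta_template", "source_words"])

# Held-out eval split (size and seed are illustrative).
splits = pairs.train_test_split(test_size=0.05, seed=42)
print(splits["train"].num_rows, splits["test"].num_rows)
```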
### Quality risks (ranked)
1. Opening phrase repetition (MEDIUM). The corpus has noticeable repetition in sentence openings:
- "The man who..." appears 151 times (1.7%)
- "A man who..." appears 110 times (1.2%)
- "Funny how a/the..." appears 184 times combined (2.0%)
This could cause the fine-tuned model to over-rely on these openings. Not a blocker, but worth monitoring in generation quality after training.
2. Template imbalance (LOW-MEDIUM)
Three template families are below 10%: tautological_wisdom (8.0%), denial_of_consequences (8.3%), hypocritical_complaint (9.0%). The spec says to go back and generate more if below 10%. The gap is small — the model will still see 725+ examples of each — but it's a known deviation from spec.
3. Semantic misfires (LOW). A small percentage of entries have nonsensical relationships that survived filtering (e.g., "Dolphin kin always do without the wave fin", "A van-sized thing'll make you a canoe"). These are rare enough (<1% by estimate) that they'll be noise in training, not a pattern the model learns.
4. Missing fictional entity pairs (LOW). The spec calls for ~200–300 fictional entity training pairs. None are present. This means the model won't learn the "describe an entity → generate a saying about it" pattern out of the box. This can be added post-training or in a follow-up fine-tune pass.
### Recommendation

The corpus is ready for the RunPod training run. The biggest risk is the opening phrase repetition, but at <2% per pattern across 36K pairs, it's unlikely to dominate the model's behavior. The template imbalance is a minor spec deviation (8.0–9.0% vs the 10% target) and can be corrected in a second training round if the model shows weakness on those families.
Start training. Evaluate the model's output diversity after 1 epoch — if it's over-producing "The man who..." or "Funny how..." openings, consider deduplicating by opening trigram before the next training round (a sketch follows).
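A sketch of that opening-trigram pass — count openings on the `output` field, then cap surplus repeats; the cap value is an assumption to tune after inspecting the counts:

```python
import json
from collections import Counter

MAX_PER_OPENING = 50  # assumed cap per opening trigram; tune after inspecting counts

def opening_trigram(text):
    return " ".join(text.lower().split()[:3])

counts, kept = Counter(), []
with open("training_pairs.jsonl", encoding="utf-8") as f:
    for line in f:
        tri = opening_trigram(json.loads(line)["output"])
        counts[tri] += 1
        if counts[tri] <= MAX_PER_OPENING:
            kept.append(line)

# Inspect the heaviest openings ("the man who", "funny how a", ...).
for tri, n in counts.most_common(10):
    print(f"{n:5d}  {tri}")

with open("training_pairs.dedup.jsonl", "w", encoding="utf-8") as f:
    f.writelines(kept)
```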