folksy_idioms/CORPUS_QUALITY_REVIEW.md
john 02daa7bb97 Add SFT training script and run Qwen3-0.6B-Base fine-tune
Train Qwen3-0.6B-Base (596M params) on 36K folksy proverb pairs
using full SFT with HuggingFace TRL. 3 epochs, 11 min on RTX 4090.

Results: train_loss=0.954, eval_loss=1.032, test_loss=1.031
Model checkpoint at folksy-model/final/ (not committed — 1.2 GB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:07:23 -04:00


Corpus Quality Review

Review date: 2026-03-27. All data sampled directly from corpus files on disk.


1. Corpus Stats

| File | Entries | Size | Format |
|---|---|---|---|
| corpus_raw.jsonl | 9,835 | 4.5 MB | JSONL — raw template output with debug metadata |
| corpus_polished.jsonl | 9,835 | 5.2 MB | JSONL — all entries after GLM4-32B polish (includes discards) |
| corpus_naturalized.jsonl | 19,540 | 13 MB | JSONL — naturalization pass (polished + recovered discards, 2 variants each) |
| corpus_filtered.jsonl | 9,025 | 6.0 MB | JSONL — deduplicated final sayings |
| training_pairs.jsonl | 36,079 | 7.2 MB | JSONL — {input, output, meta_template, source_words} |

Token estimates (words × 1.3 subword factor):

  • Sayings only: 91,428 words → ~119K tokens
  • Training pairs (input + output): 582,598 words → ~757K tokens
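The estimates above are a simple heuristic: whitespace word count times a 1.3 subword factor. A minimal sketch of that arithmetic, reusing one of the section 2 sample sayings:

```python
# Reproduces the review's token estimate: word count x 1.3 subword
# factor. The 1.3 factor is the review's own heuristic, not a tokenizer.
def estimate_tokens(texts, subword_factor=1.3):
    words = sum(len(t.split()) for t in texts)
    return words, int(words * subword_factor)

words, tokens = estimate_tokens(
    ["An eagle's just a crow that sees its lunch from far off."]
)
# 12 words -> ~15 tokens
```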

Average saying length: 10.1 words

Vocab coverage: 624/624 (100%)

Distribution by meta-template

| Template | Count | % | Status |
|---|---|---|---|
| false_equivalence | 1,897 | 21.0% | OK |
| futile_preparation | 1,735 | 19.2% | OK |
| ironic_deficiency | 1,563 | 17.3% | OK |
| deconstruction | 1,544 | 17.1% | OK |
| hypocritical_complaint | 811 | 9.0% | ⚠ Below 10% |
| denial_of_consequences | 750 | 8.3% | ⚠ Below 10% |
| tautological_wisdom | 725 | 8.0% | ⚠ Below 10% |

Three families are below the 10% balance threshold from the spec. The model will see 2.6× more false_equivalence examples than tautological_wisdom.

Training pair framing types

| Framing | Count |
|---|---|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |
| Total | 36,079 |

No fictional entity pairs are present in the current corpus.


2. Random Samples

Pulled via shuf from corpus_filtered.jsonl (unseeded, so each run samples from different parts of the file):

  1. [ironic_deficiency] The coffee-maker's always short on fabric.
  2. [ironic_deficiency] The man who builds the nest hasn't got a single feather.
  3. [denial_of_consequences] A feller who builds the shelf can't gripe about the tape.
  4. [false_equivalence] An anchor's just an iron that got too big for its britches.
  5. [futile_preparation] Fill a seed-bin with rubbish, won't get you a ship.
  6. [futile_preparation] You can sweep all you want, but it won't get you measuring angles.
  7. [false_equivalence] Water's just juice without the color, I reckon.
  8. [futile_preparation] Putting the cart before the horse, hoping for the best.
  9. [futile_preparation] Skipping breakfast's like praying for leftovers.
  10. [false_equivalence] A van-sized thing'll make you a canoe.
  11. [false_equivalence] A gazelle's just a lightweight with long legs.
  12. [futile_preparation] Hoofing the claw and hoping to play baseball.
  13. [false_equivalence] A bull's just a steer with folks who tolerate his ways.
  14. [futile_preparation] Grandma always said, "Drinkin' the coffee won't keep you workin'."
  15. [hypocritical_complaint] A fella picks pewter scrap and says copper ain't worth a thing.

3. Quality Spectrum

Best (most natural, most folksy)

These sound like something you'd actually hear on a porch:

  1. [deconstruction] Plastic ain't nothing without its fuel. Just carbon thinkin' it's better than the rest.
  2. [futile_preparation] Your grandma was right — ain't no flute gonna bring a whole band.
  3. [false_equivalence] An eagle's just a crow that sees its lunch from far off.

Runner-up gems from the random sample:

  • A bull's just a steer with folks who tolerate his ways.
  • An anchor's just an iron that got too big for its britches.

Worst (most stilted, most obviously generated)

These read like broken Mad Libs:

  1. [ironic_deficiency] Dolphin kin always do without the wave fin.
  2. [ironic_deficiency] A thrush's kin goes without the fly.
  3. [false_equivalence] A van-sized thing'll make you a canoe.

These share a pattern: the naturalization pass couldn't salvage the underlying nonsense relationship. "Dolphin kin" and "wave fin" are ConceptNet artifacts that survived the filter.

Borderline

Grammatically fine but flat — could go either way:

  1. [deconstruction] Without the beef, it's just plain old bread.
  2. [deconstruction] A bouquet ain't much without the star flower.
  3. [tautological_wisdom] An ostrich ain't no good without its grub.

These are competent filler. They won't teach the model bad habits, but they won't teach it flair either.


4. Dropped Noun Check

Background

The original pipeline's quality filter required ≥2 slot-fill nouns present in the polished text (the lost_key_nouns check in filter_corpus.py). The later rebuild_training_pairs.py relaxed this requirement because the naturalization pass often rephrases concepts rather than repeating slot words verbatim.

Current state

Strict check (all slot words including property descriptions): 2,197 of 9,025 entries (24.3%) have >50% of slot words missing. However, this is misleading — slot values like "essential to life", "less dense than water", and "flicking sound" are property descriptions, not nouns the saying should contain.

Core noun check (A/B slots only): 853 of 9,025 entries (9.5%) are missing at least one of the two primary concept nouns.
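The core noun check reduces to a substring test over the two primary slots. A sketch, with assumed field names (text, slot_a, slot_b) rather than the actual schema; note that a plain substring test counts acceptable synonyms like "boat" for "sailboat" as missing, which is exactly why the strict number overstates the problem:

```python
def missing_core_noun(entry):
    """True if either primary concept noun is absent from the saying."""
    text = entry["text"].lower()
    return any(noun.lower() not in text
               for noun in (entry["slot_a"], entry["slot_b"]))

entries = [
    # "sailboat" was generalized to "boat", so the strict check flags it
    {"text": "A boat without wood's just wet water, you know.",
     "slot_a": "sailboat", "slot_b": "wood"},
    {"text": "An eagle's just a crow that sees its lunch from far off.",
     "slot_a": "eagle", "slot_b": "crow"},
]
flagged = sum(missing_core_noun(e) for e in entries)  # 1 of 2 flagged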

Examples of dropped core nouns

| # | Saying | Missing noun | What happened |
|---|---|---|---|
| 1 | "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest." | hamburger (slot A) | Shortened to "burger" — acceptable synonym |
| 2 | "A boat without wood's just wet water, you know." | sailboat (slot A) | Generalized to "boat" — acceptable |
| 3 | "A gilded rifle's just wood that got fancy." | weapon (slot A) | Replaced by the more specific "rifle" (slot B) — fine |
| 4 | "Soda without the fizz is just water." | glass (slot B) | "Glass" (container material) was irrelevant to meaning — LLM correctly dropped it |
| 5 | "A falcon without its little buddy's got big ideas and nowhere sensible to fly 'em." | oxpecker, wing, flapping (slots B/C/D) | Heavy rewrite; meaning drifted |

Verdict

Most "dropped nouns" are acceptable: the LLM used synonyms (hamburger → burger), generalizations (sailboat → boat), or correctly dropped irrelevant slot fills. True meaning-drift cases (example 5) exist but are uncommon. The relaxed filter was the right call — the strict lost_key_nouns filter (logged in discard_analysis.csv) already caught and discarded 689 entries during the polish phase.

The dropped noun issue from the prior session appears resolved. The naturalization pass and relaxed rebuild filter handle it appropriately.


5. Processing Pipeline Status

Pipeline stages (in execution order)

| Stage | Script | Status | Entries In → Out | Errors |
|---|---|---|---|---|
| 1. Raw generation | generate_raw_batch.sh | Complete | → 9,835 | 0 |
| 2. LLM polish | polish_corpus.py | Complete (81.5 min) | 9,835 → 5,499 polished + 4,336 discards | 0 |
| 3. Naturalization | naturalize_corpus.py | Complete (147.8 min) | 9,835 → 9,468 usable | 0 |
| 4. Rebuild (filter + dedup + format) | rebuild_training_pairs.py | Complete | 19,031 → 9,025 filtered → 36,079 pairs | 0 |

All stages completed with zero errors across all runs.

Discard breakdown

| Reason | Count |
|---|---|
| LLM polish → DISCARD | 4,336 (44.1% of raw) |
| Near-duplicate removal | 2,495 |
| Lost key nouns (strict filter) | 689 |
| Too long (>25 words) | 3 |
| Naturalization filtered | 73 |
| Naturalization skipped | 436 |

Logs

  • corpus/polish_log.txt — clean run, 0 errors, steady 1.6–2.0 req/s throughput
  • corpus/naturalize_log.txt — clean run, 0 errors, steady 1.0–1.1 req/s throughput

6. Training Readiness Assessment

What's ready

  • Volume: 9,025 unique sayings and 36,079 training pairs form a solid corpus for a 0.6B fine-tune (~757K tokens of training data).
  • Pipeline integrity: All stages completed with zero errors. Clean logs, full checkpointing.
  • Vocab coverage: 100% of the 624-word vocabulary appears in the corpus.
  • Format: Training pairs are clean {input, output, meta_template, source_words} JSONL — plug directly into HF Trainer or axolotl.
  • No AI-isms: Zero instances of common LLM crutch phrases ("it is important", "in conclusion", etc.).
  • Grammar: Zero truncated or grammatically broken sentences detected (no trailing articles, no double articles, no unfilled slots).

Quality risks (ranked)

1. Opening phrase repetition (MEDIUM) The corpus has noticeable repetition in sentence openings:

  • "The man who..." appears 151 times (1.7%)
  • "A man who..." appears 110 times (1.2%)
  • "Funny how a/the..." appears 184 times combined (2.0%)

This could cause the fine-tuned model to over-rely on these openings. Not a blocker, but worth monitoring in generation quality after training.
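The opening counts above amount to a first-three-words tally; a sketch of how such a check might look (the second saying is an invented variant for illustration):

```python
from collections import Counter

def opening_trigrams(sayings):
    """Tally each saying's first three words, lowercased."""
    return Counter(" ".join(s.split()[:3]).lower() for s in sayings)

counts = opening_trigrams([
    "The man who builds the nest hasn't got a single feather.",
    "The man who waits eats cold soup.",  # invented variant
    "An eagle's just a crow that sees its lunch from far off.",
])
# counts.most_common(1) -> [('the man who', 2)]
```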

2. Template imbalance (LOW-MEDIUM) Three template families are below 10%: tautological_wisdom (8.0%), denial_of_consequences (8.3%), hypocritical_complaint (9.0%). The spec says to go back and generate more if below 10%. The gap is small — the model will still see 725+ examples of each — but it's a known deviation from spec.
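If the deviation needs correcting without regenerating data, the under-represented families could be upsampled toward the 10% floor. A sketch, not part of the actual pipeline, with the caveat that duplicated pairs add no new diversity:

```python
import random
from collections import defaultdict

def upsample_to_floor(pairs, floor=0.10, seed=0):
    """Duplicate pairs from under-represented templates up to the floor.

    Targets are computed against the pre-upsampling total, so final
    shares land slightly under the floor once the set grows.
    """
    by_template = defaultdict(list)
    for p in pairs:
        by_template[p["meta_template"]].append(p)
    target = int(floor * len(pairs))
    rng = random.Random(seed)
    out = list(pairs)
    for group in by_template.values():
        if len(group) < target:
            out.extend(rng.choices(group, k=target - len(group)))
    return out
```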

3. Semantic misfires (LOW) A small percentage of entries have nonsensical relationships that survived filtering (e.g., "Dolphin kin always do without the wave fin", "A van-sized thing'll make you a canoe"). These are rare enough (<1% by estimate) that they'll be noise in training, not a pattern the model learns.

4. Missing fictional entity pairs (LOW) The spec calls for ~200-300 fictional entity training pairs. None are present. This means the model won't learn the "describe an entity → generate a saying about it" pattern out of the box. This can be added post-training or in a follow-up fine-tune pass.

Recommendation

The corpus is ready for the RunPod training run. The biggest risk is the opening phrase repetition, but at <2% per pattern across 36K pairs, it's unlikely to dominate the model's behavior. The template imbalance is a minor spec deviation (8.0-9.0% vs the 10% target) and can be corrected in a second training round if the model shows weakness on those families.

Start training. Evaluate the model's output diversity after 1 epoch — if it's over-producing "The man who..." or "Funny how..." openings, consider deduplicating by opening trigram before the next training round.
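The suggested opening-trigram dedup might look like the following sketch (the cap value is arbitrary and would need tuning against the observed counts):

```python
from collections import defaultdict

def cap_openings(sayings, max_per_opening=25):
    """Keep at most max_per_opening sayings per three-word opening."""
    seen = defaultdict(int)
    kept = []
    for s in sayings:
        key = " ".join(s.split()[:3]).lower()
        seen[key] += 1
        if seen[key] <= max_per_opening:
            kept.append(s)
    return kept
```

Order-preserving, so the first occurrences of each opening survive and later repeats are dropped.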