208 lines
10 KiB
Markdown
208 lines
10 KiB
Markdown
|
|
# Corpus Quality Review
|
|||
|
|
|
|||
|
|
Review date: 2026-03-27. All data sampled directly from corpus files on disk.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 1. Corpus Stats
|
|||
|
|
|
|||
|
|
| File | Entries | Size | Format |
|
|||
|
|
|------|---------|------|--------|
|
|||
|
|
| `corpus_raw.jsonl` | 9,835 | 4.5 MB | JSONL — raw template output with debug metadata |
|
|||
|
|
| `corpus_polished.jsonl` | 9,835 | 5.2 MB | JSONL — all entries after GLM4-32B polish (includes discards) |
|
|||
|
|
| `corpus_naturalized.jsonl` | 19,540 | 13 MB | JSONL — naturalization pass (polished + recovered discards, 2 variants each) |
|
|||
|
|
| `corpus_filtered.jsonl` | 9,025 | 6.0 MB | JSONL — deduplicated final sayings |
|
|||
|
|
| `training_pairs.jsonl` | 36,079 | 7.2 MB | JSONL — `{input, output, meta_template, source_words}` |
|
|||
|
|
|
|||
|
|
**Token estimates** (words × 1.3 subword factor):
|
|||
|
|
|
|||
|
|
- Sayings only: 91,428 words → ~119K tokens
|
|||
|
|
- Training pairs (input + output): 582,598 words → ~757K tokens
|
|||
|
|
|
|||
|
|
**Average saying length:** 10.1 words
|
|||
|
|
|
|||
|
|
**Vocab coverage:** 624/624 (100%)
|
|||
|
|
|
|||
|
|
### Distribution by meta-template
|
|||
|
|
|
|||
|
|
| Template | Count | % | Status |
|
|||
|
|
|----------|-------|---|--------|
|
|||
|
|
| false_equivalence | 1,897 | 21.0% | OK |
|
|||
|
|
| futile_preparation | 1,735 | 19.2% | OK |
|
|||
|
|
| ironic_deficiency | 1,563 | 17.3% | OK |
|
|||
|
|
| deconstruction | 1,544 | 17.1% | OK |
|
|||
|
|
| hypocritical_complaint | 811 | 9.0% | ⚠ Below 10% |
|
|||
|
|
| denial_of_consequences | 750 | 8.3% | ⚠ Below 10% |
|
|||
|
|
| tautological_wisdom | 725 | 8.0% | ⚠ Below 10% |
|
|||
|
|
|
|||
|
|
Three families are below the 10% balance threshold from the spec. The model will see 2.6× more `false_equivalence` examples than `tautological_wisdom`.
|
|||
|
|
|
|||
|
|
### Training pair framing types
|
|||
|
|
|
|||
|
|
| Framing | Count |
|
|||
|
|
|---------|-------|
|
|||
|
|
| word_seeded | 9,025 |
|
|||
|
|
| category_seeded | 9,025 |
|
|||
|
|
| persona_seeded | 9,025 |
|
|||
|
|
| template_seeded | 6,858 |
|
|||
|
|
| open_ended | 2,146 |
|
|||
|
|
| **Total** | **36,079** |
|
|||
|
|
|
|||
|
|
No fictional entity pairs are present in the current corpus.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 2. Random Samples
|
|||
|
|
|
|||
|
|
Pulled via `shuf` from `corpus_filtered.jsonl` (seed-independent, different parts of the file):
|
|||
|
|
|
|||
|
|
1. **[ironic_deficiency]** The coffee-maker's always short on fabric.
|
|||
|
|
2. **[ironic_deficiency]** The man who builds the nest hasn't got a single feather.
|
|||
|
|
3. **[denial_of_consequences]** A feller who builds the shelf can't gripe about the tape.
|
|||
|
|
4. **[false_equivalence]** An anchor's just an iron that got too big for its britches.
|
|||
|
|
5. **[futile_preparation]** Fill a seed-bin with rubbish, won't get you a ship.
|
|||
|
|
6. **[futile_preparation]** You can sweep all you want, but it won't get you measuring angles.
|
|||
|
|
7. **[false_equivalence]** Water's just juice without the color, I reckon.
|
|||
|
|
8. **[futile_preparation]** Putting the cart before the horse, hoping for the best.
|
|||
|
|
9. **[futile_preparation]** Skipping breakfast's like praying for leftovers.
|
|||
|
|
10. **[false_equivalence]** A van-sized thing'll make you a canoe.
|
|||
|
|
11. **[false_equivalence]** A gazelle's just a lightweight with long legs.
|
|||
|
|
12. **[futile_preparation]** Hoofing the claw and hoping to play baseball.
|
|||
|
|
13. **[false_equivalence]** A bull's just a steer with folks who tolerate his ways.
|
|||
|
|
14. **[futile_preparation]** Grandma always said, "Drinkin' the coffee won't keep you workin'."
|
|||
|
|
15. **[hypocritical_complaint]** A fella picks pewter scrap and says copper ain't worth a thing.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 3. Quality Spectrum
|
|||
|
|
|
|||
|
|
### Best (most natural, most folksy)
|
|||
|
|
|
|||
|
|
These sound like something you'd actually hear on a porch:
|
|||
|
|
|
|||
|
|
1. **[deconstruction]** Plastic ain't nothing without its fuel. Just carbon thinkin' it's better than the rest.
|
|||
|
|
2. **[futile_preparation]** Your grandma was right — ain't no flute gonna bring a whole band.
|
|||
|
|
3. **[false_equivalence]** An eagle's just a crow that sees its lunch from far off.
|
|||
|
|
|
|||
|
|
Runner-up gems from the random sample:
|
|||
|
|
- *A bull's just a steer with folks who tolerate his ways.*
|
|||
|
|
- *An anchor's just an iron that got too big for its britches.*
|
|||
|
|
|
|||
|
|
### Worst (most stilted, most obviously generated)
|
|||
|
|
|
|||
|
|
These read like broken Mad Libs:
|
|||
|
|
|
|||
|
|
1. **[ironic_deficiency]** Dolphin kin always do without the wave fin.
|
|||
|
|
2. **[ironic_deficiency]** A thrush's kin goes without the fly.
|
|||
|
|
3. **[false_equivalence]** A van-sized thing'll make you a canoe.
|
|||
|
|
|
|||
|
|
These share a pattern: the naturalization pass couldn't salvage the underlying nonsense relationship. "Dolphin kin" and "wave fin" are ConceptNet artifacts that survived the filter.
|
|||
|
|
|
|||
|
|
### Borderline
|
|||
|
|
|
|||
|
|
Grammatically fine but flat — could go either way:
|
|||
|
|
|
|||
|
|
1. **[deconstruction]** Without the beef, it's just plain old bread.
|
|||
|
|
2. **[deconstruction]** A bouquet ain't much without the star flower.
|
|||
|
|
3. **[tautological_wisdom]** An ostrich ain't no good without its grub.
|
|||
|
|
|
|||
|
|
These are competent filler. They won't teach the model bad habits, but they won't teach it flair either.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 4. Dropped Noun Check
|
|||
|
|
|
|||
|
|
### Background
|
|||
|
|
|
|||
|
|
The original pipeline's quality filter required ≥2 slot-fill nouns present in the polished text (the `lost_key_nouns` check in `filter_corpus.py`). The later `rebuild_training_pairs.py` relaxed this requirement because the naturalization pass often rephrases concepts rather than repeating slot words verbatim.
|
|||
|
|
|
|||
|
|
### Current state
|
|||
|
|
|
|||
|
|
**Strict check (all slot words including property descriptions):** 2,197 of 9,025 entries (24.3%) have >50% of slot words missing. However, this is misleading — slot values like `"essential to life"`, `"less dense than water"`, and `"flicking sound"` are property descriptions, not nouns the saying should contain.
|
|||
|
|
|
|||
|
|
**Core noun check (A/B slots only):** 853 of 9,025 entries (9.5%) are missing at least one of the two primary concept nouns.
|
|||
|
|
|
|||
|
|
### Examples of dropped core nouns
|
|||
|
|
|
|||
|
|
| # | Saying | Missing noun | What happened |
|
|||
|
|
|---|--------|-------------|---------------|
|
|||
|
|
| 1 | "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest." | hamburger (slot A) | Shortened to "burger" — acceptable synonym |
|
|||
|
|
| 2 | "A boat without wood's just wet water, you know." | sailboat (slot A) | Generalized to "boat" — acceptable |
|
|||
|
|
| 3 | "A gilded rifle's just wood that got fancy." | weapon (slot A) | Replaced by the more specific "rifle" (slot B) — fine |
|
|||
|
|
| 4 | "Soda without the fizz is just water." | glass (slot B) | "Glass" (container material) was irrelevant to meaning — LLM correctly dropped it |
|
|||
|
|
| 5 | "A falcon without its little buddy's got big ideas and nowhere sensible to fly 'em." | oxpecker, wing, flapping (slots B/C/D) | Heavy rewrite; meaning drifted |
|
|||
|
|
|
|||
|
|
### Verdict
|
|||
|
|
|
|||
|
|
Most "dropped nouns" are acceptable: the LLM used synonyms (`hamburger` → `burger`), generalizations (`sailboat` → `boat`), or correctly dropped irrelevant slot fills. True meaning-drift cases (example 5) exist but are uncommon. The relaxed filter was the right call — the strict `lost_key_nouns` filter in `discard_analysis.csv` already caught and discarded 689 entries during the polish phase.
|
|||
|
|
|
|||
|
|
**The dropped noun issue from the prior session appears resolved.** The naturalization pass and relaxed rebuild filter handle it appropriately.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 5. Processing Pipeline Status
|
|||
|
|
|
|||
|
|
### Pipeline stages (in execution order)
|
|||
|
|
|
|||
|
|
| Stage | Script | Status | Entries In → Out | Errors |
|
|||
|
|
|-------|--------|--------|-----------------|--------|
|
|||
|
|
| 1. Raw generation | `generate_raw_batch.sh` | ✅ Complete | → 9,835 | 0 |
|
|||
|
|
| 2. LLM polish | `polish_corpus.py` | ✅ Complete (81.5 min) | 9,835 → 5,499 polished + 4,336 discards | 0 |
|
|||
|
|
| 3. Naturalization | `naturalize_corpus.py` | ✅ Complete (147.8 min) | 9,835 → 9,468 usable | 0 |
|
|||
|
|
| 4. Rebuild (filter + dedup + format) | `rebuild_training_pairs.py` | ✅ Complete | 19,031 → 9,025 filtered → 36,079 pairs | 0 |
|
|||
|
|
|
|||
|
|
All stages completed with **zero errors** across all runs.
|
|||
|
|
|
|||
|
|
### Discard breakdown
|
|||
|
|
|
|||
|
|
| Reason | Count |
|
|||
|
|
|--------|-------|
|
|||
|
|
| LLM polish → DISCARD | 4,336 (44.1% of raw) |
|
|||
|
|
| Near-duplicate removal | 2,495 |
|
|||
|
|
| Lost key nouns (strict filter) | 689 |
|
|||
|
|
| Too long (>25 words) | 3 |
|
|||
|
|
| Naturalization filtered | 73 |
|
|||
|
|
| Naturalization skipped | 436 |
|
|||
|
|
|
|||
|
|
### Logs
|
|||
|
|
|
|||
|
|
- `corpus/polish_log.txt` — clean run, 0 errors, steady 1.6–2.0 req/s throughput
|
|||
|
|
- `corpus/naturalize_log.txt` — clean run, 0 errors, steady 1.0–1.1 req/s throughput
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 6. Training Readiness Assessment
|
|||
|
|
|
|||
|
|
### What's ready
|
|||
|
|
|
|||
|
|
- **Volume:** 9,025 unique sayings and 36,079 training pairs is a solid corpus for a 0.5B fine-tune. ~757K tokens of training data.
|
|||
|
|
- **Pipeline integrity:** All stages completed with zero errors. Clean logs, full checkpointing.
|
|||
|
|
- **Vocab coverage:** 100% of the 624-word vocabulary appears in the corpus.
|
|||
|
|
- **Format:** Training pairs are clean `{input, output, meta_template, source_words}` JSONL — plug directly into HF Trainer or axolotl.
|
|||
|
|
- **No AI-isms:** Zero instances of common LLM crutch phrases ("it is important", "in conclusion", etc.).
|
|||
|
|
- **Grammar:** Zero truncated or grammatically broken sentences detected (no trailing articles, no double articles, no unfilled slots).
|
|||
|
|
|
|||
|
|
### Quality risks (ranked)
|
|||
|
|
|
|||
|
|
**1. Opening phrase repetition (MEDIUM)**
|
|||
|
|
The corpus has noticeable repetition in sentence openings:
|
|||
|
|
- "The man who..." appears 151 times (1.7%)
|
|||
|
|
- "A man who..." appears 110 times (1.2%)
|
|||
|
|
- "Funny how a/the..." appears 184 times combined (2.0%)
|
|||
|
|
|
|||
|
|
This could cause the fine-tuned model to over-rely on these openings. Not a blocker, but worth monitoring in generation quality after training.
|
|||
|
|
|
|||
|
|
**2. Template imbalance (LOW-MEDIUM)**
|
|||
|
|
Three template families are below 10%: `tautological_wisdom` (8.0%), `denial_of_consequences` (8.3%), `hypocritical_complaint` (9.0%). The spec says to go back and generate more if below 10%. The gap is small — the model will still see 725+ examples of each — but it's a known deviation from spec.
|
|||
|
|
|
|||
|
|
**3. Semantic misfires (LOW)**
|
|||
|
|
A small percentage of entries have nonsensical relationships that survived filtering (e.g., "Dolphin kin always do without the wave fin", "A van-sized thing'll make you a canoe"). These are rare enough (<1% by estimate) that they'll be noise in training, not a pattern the model learns.
|
|||
|
|
|
|||
|
|
**4. Missing fictional entity pairs (LOW)**
|
|||
|
|
The spec calls for ~200-300 fictional entity training pairs. None are present. This means the model won't learn the "describe an entity → generate a saying about it" pattern out of the box. This can be added post-training or in a follow-up fine-tune pass.
|
|||
|
|
|
|||
|
|
### Recommendation
|
|||
|
|
|
|||
|
|
**The corpus is ready for the RunPod training run.** The biggest risk is the opening phrase repetition, but at <2% per pattern across 36K pairs, it's unlikely to dominate the model's behavior. The template imbalance is a minor spec deviation (8.0-9.0% vs the 10% target) and can be corrected in a second training round if the model shows weakness on those families.
|
|||
|
|
|
|||
|
|
Start training. Evaluate the model's output diversity after 1 epoch — if it's over-producing "The man who..." or "Funny how..." openings, consider deduplicating by opening trigram before the next training round.
|