corpus generation (work from mid february)

John McCardle 2026-03-09 19:52:09 -04:00
commit 356b62c6ea
16 changed files with 25872 additions and 38 deletions

EVALUATION.md (new file, 303 lines)
# Folksy Generator — Evaluation Report
**Date:** 2026-02-17
**Evaluator:** Claude (automated)
**Scope:** Post-integration health check after three LLM augmentation phases
---
## 1. Project Structure Overview
```
folksy-generator/
├── folksy_generator.py # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md # Corpus generation spec (next phase)
├── data/
│ ├── folksy_vocab.csv # Curated vocabulary (624 words, expanded from 534)
│ ├── folksy_vocab.csv.bak.* # Pre-expansion backup (534 words)
│ ├── folksy_relations.csv # Original ConceptNet edges (11,096 edges)
│ ├── folksy_relations_augmented.csv # LLM-generated edges (11,220 edges)
│ ├── classified_proverbs.csv # Labeled real proverbs for reference
│ ├── candidate_additions.csv # OOV words suggested by LLM (3,678 unique)
│ └── enhancement_log.csv # Processing log for all 3 phases
├── scripts/
│ ├── extract_from_conceptnet.py # One-time ConceptNet extraction (requires psql)
│ ├── extract_relations.py # Relation extraction helper
│ ├── classify_proverbs.py # Proverb classification
│ ├── expand_vocab.py # Phase: vocab expansion (+90 words)
│ ├── enhance_graph.py # Phase: LLM edge augmentation
│ ├── generate_raw_batch.sh # Bulk generation script
│ ├── polish_corpus.py # LLM polish pipeline
│ ├── filter_corpus.py # Quality filtering
│ ├── format_training_pairs.py # Training pair generation
│ └── compute_corpus_stats.py # Corpus statistics
├── examples/
│ ├── my_world.json # Fictional entity examples (5 entities)
│ └── sample_output.txt # Pre-integration sample output
├── schemas/
│ └── fictional_entities.schema.json
└── corpus/ # Empty — not yet populated
```
**Entry point:** `python3 folksy_generator.py` — no virtual environment, no dependencies beyond Python 3.11 stdlib.
---
## 2. What the Three LLM Integration Phases Produced
Git history shows a single initial commit (`8c8a058 Initial 'folksy idiom' generator`). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in data files.
### Phase 1: Per-Word Relationship Expansion
- **624 words** processed through GLM4-32B
- 10,726 edges generated, **1,155 accepted** (10.8% acceptance rate)
- 9,510 edges rejected as OOV (target words not in folksy vocab)
- 61 duplicates filtered
- Filled gaps in `AtLocation`, `UsedFor`, `HasA`, `MadeOf`, `PartOf`, `CapableOf`, `HasPrerequisite`, `Causes`, `HasProperty`
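The acceptance filter described above can be sketched in a few lines (a minimal sketch only — the field names `start_word`/`relation`/`end_word` and the `vocab` set are assumptions, not the project's actual implementation):

```python
def filter_edges(candidate_rows, vocab):
    """Accept only edges whose target word is in the folksy vocab,
    and drop exact (start, relation, end) duplicates."""
    accepted, seen = [], set()
    stats = {"accepted": 0, "oov": 0, "duplicate": 0}
    for row in candidate_rows:
        # Assumed CSV fields: start_word, relation, end_word
        key = (row["start_word"], row["relation"], row["end_word"])
        if row["end_word"] not in vocab:
            stats["oov"] += 1        # target not in folksy vocab
        elif key in seen:
            stats["duplicate"] += 1  # exact repeat of an earlier edge
        else:
            seen.add(key)
            accepted.append(row)
            stats["accepted"] += 1
    return accepted, stats
```

Under this shape, Phase 1's low 10.8% acceptance rate is dominated by the OOV branch, matching the 9,510 rejected edges reported above.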
### Phase 2: Cross-Word Relationship Discovery (Bridge Words)
- **148 low-connectivity words** targeted
- 6,272 bridge edges accepted
- This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
### Phase 3: Property Enrichment
- **624 words** processed for distinctive HasProperty edges
- 3,849 edges generated, **3,788 accepted** (98.4% acceptance rate)
- 61 duplicates filtered
- Targeted at improving `false_equivalence` template output
### Vocab Expansion (via `expand_vocab.py`)
- Original vocabulary: **534 words**
- Current vocabulary: **624 words** (+90 words added)
- Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories
### Combined Data Summary
| Dataset | Count |
|---------|-------|
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| **Total edges (combined)** | **22,316** |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |
---
## 3. Term Database Statistics
### Vocabulary by Category (36 categories)
| Category | Words | | Category | Words |
|----------|-------|-|----------|-------|
| bird | 97 | | fish | 16 |
| animal | 65 | | spice | 16 |
| tool | 56 | | fruit | 15 |
| plant | 43 | | mineral | 14 |
| food | 38 | | insect | 14 |
| material | 36 | | structure | 13 |
| container | 34 | | beverage | 9 |
| instrument | 28 | | fabric | 9 |
| landscape | 27 | | tree | 8 |
| vegetable | 24 | | wood | 7 |
| building | 21 | | herb | 7 |
| metal | 19 | | rock | 6 |
| flower | 19 | | water | 6 |
| vehicle | 18 | | furniture | 5 |
| stone | 17 | | clothing | 5 |
| weapon | 17 | | shelter | 5 |
| — | — | | crop, seed, organism, grain | 3-4 each |
### Edge Distribution — Original ConceptNet
| Relation | Edges |
|----------|-------|
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |
### Edge Distribution — LLM Augmented
| Relation | Edges |
|----------|-------|
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |
The augmented edges deliberately fill the gaps in the original ConceptNet data. `HasProperty` went from 422 to 4,407 total — critical for the `false_equivalence` template.
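Distributions like the two tables above can be recomputed from the relations CSVs with stdlib Python alone (a sketch; the `relation` column name is an assumption about the CSV schema):

```python
import csv
from collections import Counter

def relation_counts(csv_path):
    """Tally edges per relation type in a relations CSV."""
    with open(csv_path, newline="") as f:
        return Counter(row["relation"] for row in csv.DictReader(f))

# e.g. relation_counts("data/folksy_relations_augmented.csv")
```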
---
## 4. Sample Generated Output (30 Sayings)
Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:
1. An scarf ain't nothing but cotton that met some wool.
2. The only difference between a hummingbird and a dodo is metabolism.
3. An salt ain't nothing but ore that met some crystals.
4. Funny how the earthworm never has enough food for itself.
5. What's a coop but a kitchen with sound?
6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
7. Don't take the wheel and then gripe about the hull.
8. A bamboo don't come without its water, now does it?
9. Nobody's got less salsa than the man who makes the mango.
10. That's like eating the sea and complaining the savanna tastes off.
11. My daddy always said, can't have waking up in morning without coffee.
12. Take the bison out of meat and all you've got left is salty taste flesh.
13. Like baiting the flock and hoping for keep as pet.
14. The ice's family always goes without cool body.
15. There's a fella who takes the wax and says the sugar's no good.
16. That's just holding the drawer and praying for store blanket.
17. You know what they say, a mica with no schist is just a rough surface rock.
18. An silver ain't nothing but hairbrushes that met some alloy.
19. A kite is just a pelican that's got catch wind.
20. Like making the denim and hoping for material.
21. The nut feeds everyone's fit bolt but its own.
22. The pitcher's family always goes without throw fast ball.
23. A nail is just a weapon that's got smooth length.
24. You want lid? Well, first you're gonna need container.
25. Don't build the micrometer and say you ain't got workshop.
26. Ain't no sleeping at night ever came from nothing — you need bed.
27. What's a cicada but a lacebug with nocturnal behavior?
28. Don't drink the dish and then gripe about the gnocchi.
29. You can't put out a herring and then wonder where all the herringbone came from.
30. That's just lorikeeting the fruit and praying for breaking wind.
---
## 5. Quality Assessment
### Rating Summary
I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):
| Rating | Count | % | Description |
|--------|-------|---|-------------|
| **Good** | 8 | 27% | Sounds natural, humorous, structurally solid |
| **Okay** | 9 | 30% | Semantically coherent but grammatically rough |
| **Bad** | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |
### Good Examples (natural-sounding, humorous)
- "Nobody's got less salsa than the man who makes the mango."
- "There's a fella who takes the wax and says the sugar's no good."
- "A bamboo don't come without its water, now does it?"
- "Don't take the wheel and then gripe about the hull."
- "Ain't no sleeping at night ever came from nothing — you need bed."
- "My daddy always said, can't have waking up in morning without coffee."
- "What's a cicada but a lacebug with nocturnal behavior?"
- "You can't put out a herring and then wonder where all the herringbone came from."
### Common Issues Identified
#### 1. Article / Grammar Errors (frequent)
- "An scarf ain't nothing but..." — should be "A scarf"
- "An silver ain't nothing but..." — should be "Silver"
- "An salt ain't nothing but..." — should be "Salt"
- "A have children don't come without..." — broken slot fill leaking action phrase as noun
#### 2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)
- "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
- "waking up in morning", "sleeping at night", "salty taste"
- "breaking wind", "rough surface"
- These are raw ConceptNet concept IDs that should have been filtered or reformatted
#### 3. Nonsensical Verb Conjugation in Futile Preparation (severe)
- "lorikeeting the fruit" — `lorikeet` treated as a verb
- "fooding the earthworm" — `food` treated as a verb
- "jeansing the denim" — `jeans` treated as a verb
- "safariing the lion" — `safari` treated as a verb
- The `_gerund()` function applies gerunding to ANY UsedFor target, including nouns
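The failure mode is easy to reproduce: a naive gerund rule just appends *-ing* after minor spelling adjustments, so any noun fed into it becomes a fake verb. A sketch of the presumed behavior (not the actual `_gerund()` source):

```python
def naive_gerund(word):
    """Blindly -ing any word, noun or not (spelling rules only)."""
    if word.endswith("e") and not word.endswith("ee"):
        return word[:-1] + "ing"   # bake -> baking
    return word + "ing"            # lorikeet -> lorikeeting

print(naive_gerund("lorikeet"))  # lorikeeting -- noun treated as a verb
print(naive_gerund("bait"))      # baiting -- fine, because bait is a real verb
```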
#### 4. LLM Enhancement Artifacts Leaking (moderate)
- "bridge word: plate" appearing in output text
- "bridge 2: **food**" appearing in output text
- "*bridge word: absorption*" appearing in output text
- These are raw LLM response fragments that weren't properly cleaned during Phase 2
#### 5. Semantic Mismatches (occasional)
- "A lynx is just a earthworm that's got feline." — wrong category siblings
- "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
- "A emu is just a ferret that's got walk backwards." — cross-class comparison
### Per-Template Quality Assessment
| Template | Typical Quality | Key Issue |
|----------|----------------|-----------|
| **deconstruction** | Okay | Multi-word properties leak; article errors with "An" |
| **denial_of_consequences** | Good | Best template; LLM artifacts occasionally leak through |
| **ironic_deficiency** | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| **futile_preparation** | Bad | Nouns gerunded as verbs; worst template overall |
| **hypocritical_complaint** | Okay | Some odd part-of relationships; generally coherent structure |
| **tautological_wisdom** | Good | Simple structure avoids most issues; multi-word phrases still leak |
| **false_equivalence** | Good | Benefited most from Phase 3 property enrichment |
---
## 6. Errors, Warnings, and Issues
### No Errors at Runtime
- Generator runs without crashes on all template types
- All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
- JSON output mode produces valid JSONL with complete metadata
- Fictional entity generation works
### Issues Found
| Severity | Issue | Impact |
|----------|-------|--------|
| **High** | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| **High** | Nouns gerunded as verbs in `futile_preparation` | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
| **Medium** | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| **Medium** | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
| **Low** | No test suite exists | No automated validation of output quality |
| **Low** | No virtual environment or requirements.txt | Only stdlib needed currently, but will need deps for corpus generation phase |
| **Info** | Corpus directory is empty | Expected — corpus generation is the next phase |
---
## 7. Readiness Assessment for Corpus Generation
### Ready
- Template engine is functional and produces output across all 7 meta-template families
- Augmented graph significantly improves vocabulary coverage (22,316 total edges)
- Vocab expansion added 90 words to cover previously sparse categories
- JSON output mode with full debug metadata is working — ready for bulk generation
- Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
- Fictional entity support is implemented and functional
- All corpus pipeline scripts exist (`generate_raw_batch.sh`, `polish_corpus.py`, `filter_corpus.py`, `format_training_pairs.py`, `compute_corpus_stats.py`)
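The deduplication behavior noted above can be summarized roughly as follows (a sketch under assumed names — the generator's real data structures may differ):

```python
class DedupTracker:
    """Reject repeats by surface text and by slot combination,
    and cap how often any seed word is reused (30, per the evaluation)."""

    def __init__(self, seed_cap=30):
        self.seen_text = set()
        self.seen_slots = set()
        self.seed_usage = {}
        self.seed_cap = seed_cap

    def admit(self, text, slots, seed):
        # slots must be hashable, e.g. a tuple of slot fills
        if text in self.seen_text or slots in self.seen_slots:
            return False
        if self.seed_usage.get(seed, 0) >= self.seed_cap:
            return False
        self.seen_text.add(text)
        self.seen_slots.add(slots)
        self.seed_usage[seed] = self.seed_usage.get(seed, 0) + 1
        return True
```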
### Should Fix Before Corpus Generation
1. **Clean Phase 2 artifacts from `folksy_relations_augmented.csv`** — grep for "bridge word" and "bridge 2" in surface_text/end_word fields and remove or repair those edges
2. **Fix `futile_preparation` gerunding** — the `_gerund()` function needs a check that the UsedFor target is actually a verb before conjugating it; alternatively, filter UsedFor targets to verb-like words only
3. **Filter multi-word ConceptNet phrases** — the `_short_concepts()` helper caps at 3 words but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
4. **Fix article logic** — the `_a()` function at lines 680-684 relies on a faulty first-character check and still emits "An" before consonant-initial words; "An salt" is wrong because "salt" starts with "s"
### Nice to Have
- Add a basic test suite (even just smoke tests that confirm each template generates output)
- Create `requirements.txt` (currently stdlib-only, but corpus phase will need `requests` at minimum)
- Review the 3,678 candidate OOV words — none reached the frequency threshold of 3 occurrences required for auto-addition, but manual review could surface useful additions
### Overall Verdict
**The template generator works but produces rough output.** This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be **40-50%** due to the issues above.
Fixing the four "Should Fix" items before corpus generation would:
- Reduce the discard rate (saving LLM compute time)
- Improve the quality floor of raw output (giving the polish LLM better material to work with)
- Eliminate artifact contamination that could propagate into training data
The generator is **functional but not polished** — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.