corpus generation (work from mid february)

John McCardle 2026-03-09 19:52:09 -04:00
commit 356b62c6ea
16 changed files with 25872 additions and 38 deletions

EVALUATION.md (new file, 303 lines)
# Folksy Generator — Evaluation Report
**Date:** 2026-02-17
**Evaluator:** Claude (automated)
**Scope:** Post-integration health check after three LLM augmentation phases
---
## 1. Project Structure Overview
```
folksy-generator/
├── folksy_generator.py # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md # Corpus generation spec (next phase)
├── data/
│ ├── folksy_vocab.csv # Curated vocabulary (624 words, expanded from 534)
│ ├── folksy_vocab.csv.bak.* # Pre-expansion backup (534 words)
│ ├── folksy_relations.csv # Original ConceptNet edges (11,096 edges)
│ ├── folksy_relations_augmented.csv # LLM-generated edges (11,220 edges)
│ ├── classified_proverbs.csv # Labeled real proverbs for reference
│ ├── candidate_additions.csv # OOV words suggested by LLM (3,678 unique)
│ └── enhancement_log.csv # Processing log for all 3 phases
├── scripts/
│ ├── extract_from_conceptnet.py # One-time ConceptNet extraction (requires psql)
│ ├── extract_relations.py # Relation extraction helper
│ ├── classify_proverbs.py # Proverb classification
│ ├── expand_vocab.py # Phase: vocab expansion (+90 words)
│ ├── enhance_graph.py # Phase: LLM edge augmentation
│ ├── generate_raw_batch.sh # Bulk generation script
│ ├── polish_corpus.py # LLM polish pipeline
│ ├── filter_corpus.py # Quality filtering
│ ├── format_training_pairs.py # Training pair generation
│ └── compute_corpus_stats.py # Corpus statistics
├── examples/
│ ├── my_world.json # Fictional entity examples (5 entities)
│ └── sample_output.txt # Pre-integration sample output
├── schemas/
│ └── fictional_entities.schema.json
└── corpus/ # Empty — not yet populated
```
**Entry point:** `python3 folksy_generator.py` — no virtual environment, no dependencies beyond Python 3.11 stdlib.
---
## 2. What the Three LLM Integration Phases Produced
Git history shows a single initial commit (`8c8a058 Initial 'folksy idiom' generator`). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in data files.
### Phase 1: Per-Word Relationship Expansion
- **624 words** processed through GLM4-32B
- 10,726 edges generated, **1,155 accepted** (10.8% acceptance rate)
- 9,510 edges rejected as OOV (target words not in folksy vocab)
- 61 duplicates filtered
- Filled gaps in `AtLocation`, `UsedFor`, `HasA`, `MadeOf`, `PartOf`, `CapableOf`, `HasPrerequisite`, `Causes`, `HasProperty`
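The acceptance filter described above can be sketched in a few lines (a minimal sketch only — the field names `start_word`/`relation`/`end_word` and the `vocab` set are assumptions, not the project's actual implementation):

```python
def filter_edges(candidate_rows, vocab):
    """Accept only edges whose target word is in the folksy vocab,
    and drop exact (start, relation, end) duplicates."""
    accepted, seen = [], set()
    stats = {"accepted": 0, "oov": 0, "duplicate": 0}
    for row in candidate_rows:
        # Assumed CSV fields: start_word, relation, end_word
        key = (row["start_word"], row["relation"], row["end_word"])
        if row["end_word"] not in vocab:
            stats["oov"] += 1        # target not in folksy vocab
        elif key in seen:
            stats["duplicate"] += 1  # exact repeat of an earlier edge
        else:
            seen.add(key)
            accepted.append(row)
            stats["accepted"] += 1
    return accepted, stats
```

Under this shape, Phase 1's low 10.8% acceptance rate is dominated by the OOV branch, matching the 9,510 rejected edges reported above.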
### Phase 2: Cross-Word Relationship Discovery (Bridge Words)
- **148 low-connectivity words** targeted
- 6,272 bridge edges accepted
- This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
### Phase 3: Property Enrichment
- **624 words** processed for distinctive HasProperty edges
- 3,849 edges generated, **3,788 accepted** (98.4% acceptance rate)
- 61 duplicates filtered
- Targeted at improving `false_equivalence` template output
### Vocab Expansion (via `expand_vocab.py`)
- Original vocabulary: **534 words**
- Current vocabulary: **624 words** (+90 words added)
- Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories
### Combined Data Summary
| Dataset | Count |
|---------|-------|
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| **Total edges (combined)** | **22,316** |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |
---
## 3. Term Database Statistics
### Vocabulary by Category (36 categories)
| Category | Words | | Category | Words |
|----------|-------|-|----------|-------|
| bird | 97 | | fish | 16 |
| animal | 65 | | spice | 16 |
| tool | 56 | | fruit | 15 |
| plant | 43 | | mineral | 14 |
| food | 38 | | insect | 14 |
| material | 36 | | structure | 13 |
| container | 34 | | beverage | 9 |
| instrument | 28 | | fabric | 9 |
| landscape | 27 | | tree | 8 |
| vegetable | 24 | | wood | 7 |
| building | 21 | | herb | 7 |
| metal | 19 | | rock | 6 |
| flower | 19 | | water | 6 |
| vehicle | 18 | | furniture | 5 |
| stone | 17 | | clothing | 5 |
| weapon | 17 | | shelter | 5 |
| — | — | | crop, seed, organism, grain | 3-4 each |
### Edge Distribution — Original ConceptNet
| Relation | Edges |
|----------|-------|
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |
### Edge Distribution — LLM Augmented
| Relation | Edges |
|----------|-------|
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |
The augmented edges deliberately fill the gaps in the original ConceptNet data. `HasProperty` went from 422 to 4,407 total — critical for the `false_equivalence` template.
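Distributions like the two tables above can be recomputed from the relations CSVs with stdlib Python alone (a sketch; the `relation` column name is an assumption about the CSV schema):

```python
import csv
from collections import Counter

def relation_counts(csv_path):
    """Tally edges per relation type in a relations CSV."""
    with open(csv_path, newline="") as f:
        return Counter(row["relation"] for row in csv.DictReader(f))

# e.g. relation_counts("data/folksy_relations_augmented.csv")
```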
---
## 4. Sample Generated Output (30 Sayings)
Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:
1. An scarf ain't nothing but cotton that met some wool.
2. The only difference between a hummingbird and a dodo is metabolism.
3. An salt ain't nothing but ore that met some crystals.
4. Funny how the earthworm never has enough food for itself.
5. What's a coop but a kitchen with sound?
6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
7. Don't take the wheel and then gripe about the hull.
8. A bamboo don't come without its water, now does it?
9. Nobody's got less salsa than the man who makes the mango.
10. That's like eating the sea and complaining the savanna tastes off.
11. My daddy always said, can't have waking up in morning without coffee.
12. Take the bison out of meat and all you've got left is salty taste flesh.
13. Like baiting the flock and hoping for keep as pet.
14. The ice's family always goes without cool body.
15. There's a fella who takes the wax and says the sugar's no good.
16. That's just holding the drawer and praying for store blanket.
17. You know what they say, a mica with no schist is just a rough surface rock.
18. An silver ain't nothing but hairbrushes that met some alloy.
19. A kite is just a pelican that's got catch wind.
20. Like making the denim and hoping for material.
21. The nut feeds everyone's fit bolt but its own.
22. The pitcher's family always goes without throw fast ball.
23. A nail is just a weapon that's got smooth length.
24. You want lid? Well, first you're gonna need container.
25. Don't build the micrometer and say you ain't got workshop.
26. Ain't no sleeping at night ever came from nothing — you need bed.
27. What's a cicada but a lacebug with nocturnal behavior?
28. Don't drink the dish and then gripe about the gnocchi.
29. You can't put out a herring and then wonder where all the herringbone came from.
30. That's just lorikeeting the fruit and praying for breaking wind.
---
## 5. Quality Assessment
### Rating Summary
I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):
| Rating | Count | % | Description |
|--------|-------|---|-------------|
| **Good** | 8 | 27% | Sounds natural, humorous, structurally solid |
| **Okay** | 9 | 30% | Semantically coherent but grammatically rough |
| **Bad** | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |
### Good Examples (natural-sounding, humorous)
- "Nobody's got less salsa than the man who makes the mango."
- "There's a fella who takes the wax and says the sugar's no good."
- "A bamboo don't come without its water, now does it?"
- "Don't take the wheel and then gripe about the hull."
- "Ain't no sleeping at night ever came from nothing — you need bed."
- "My daddy always said, can't have waking up in morning without coffee."
- "What's a cicada but a lacebug with nocturnal behavior?"
- "You can't put out a herring and then wonder where all the herringbone came from."
### Common Issues Identified
#### 1. Article / Grammar Errors (frequent)
- "An scarf ain't nothing but..." — should be "A scarf"
- "An silver ain't nothing but..." — should be "Silver"
- "An salt ain't nothing but..." — should be "Salt"
- "A have children don't come without..." — broken slot fill leaking action phrase as noun
#### 2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)
- "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
- "waking up in morning", "sleeping at night", "salty taste"
- "breaking wind", "rough surface"
- These are raw ConceptNet concept IDs that should have been filtered or reformatted
#### 3. Nonsensical Verb Conjugation in Futile Preparation (severe)
- "lorikeeting the fruit" — `lorikeet` treated as a verb
- "fooding the earthworm" — `food` treated as a verb
- "jeansing the denim" — `jeans` treated as a verb
- "safariing the lion" — `safari` treated as a verb
- The `_gerund()` function applies gerunding to ANY UsedFor target, including nouns
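The failure mode is easy to reproduce: a naive gerund rule just appends *-ing* after minor spelling adjustments, so any noun fed into it becomes a fake verb. A sketch of the presumed behavior (not the actual `_gerund()` source):

```python
def naive_gerund(word):
    """Blindly -ing any word, noun or not (spelling rules only)."""
    if word.endswith("e") and not word.endswith("ee"):
        return word[:-1] + "ing"   # bake -> baking
    return word + "ing"            # lorikeet -> lorikeeting

print(naive_gerund("lorikeet"))  # lorikeeting -- noun treated as a verb
print(naive_gerund("bait"))      # baiting -- fine, because bait is a real verb
```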
#### 4. LLM Enhancement Artifacts Leaking (moderate)
- "bridge word: plate" appearing in output text
- "bridge 2: **food**" appearing in output text
- "*bridge word: absorption*" appearing in output text
- These are raw LLM response fragments that weren't properly cleaned during Phase 2
#### 5. Semantic Mismatches (occasional)
- "A lynx is just a earthworm that's got feline." — wrong category siblings
- "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
- "A emu is just a ferret that's got walk backwards." — cross-class comparison
### Per-Template Quality Assessment
| Template | Typical Quality | Key Issue |
|----------|----------------|-----------|
| **deconstruction** | Okay | Multi-word properties leak; article errors with "An" |
| **denial_of_consequences** | Good | Best template; LLM artifacts occasionally leak through |
| **ironic_deficiency** | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| **futile_preparation** | Bad | Nouns gerunded as verbs; worst template overall |
| **hypocritical_complaint** | Okay | Some odd part-of relationships; generally coherent structure |
| **tautological_wisdom** | Good | Simple structure avoids most issues; multi-word phrases still leak |
| **false_equivalence** | Good | Benefited most from Phase 3 property enrichment |
---
## 6. Errors, Warnings, and Issues
### No Errors at Runtime
- Generator runs without crashes on all template types
- All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
- JSON output mode produces valid JSONL with complete metadata
- Fictional entity generation works
### Issues Found
| Severity | Issue | Impact |
|----------|-------|--------|
| **High** | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| **High** | Nouns gerunded as verbs in `futile_preparation` | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
| **Medium** | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| **Medium** | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
| **Low** | No test suite exists | No automated validation of output quality |
| **Low** | No virtual environment or requirements.txt | Only stdlib needed currently, but will need deps for corpus generation phase |
| **Info** | Corpus directory is empty | Expected — corpus generation is the next phase |
---
## 7. Readiness Assessment for Corpus Generation
### Ready
- Template engine is functional and produces output across all 7 meta-template families
- Augmented graph significantly improves vocabulary coverage (22,316 total edges)
- Vocab expansion added 90 words to cover previously sparse categories
- JSON output mode with full debug metadata is working — ready for bulk generation
- Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
- Fictional entity support is implemented and functional
- All corpus pipeline scripts exist (`generate_raw_batch.sh`, `polish_corpus.py`, `filter_corpus.py`, `format_training_pairs.py`, `compute_corpus_stats.py`)
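The deduplication behavior noted above can be summarized roughly as follows (a sketch under assumed names — the generator's real data structures may differ):

```python
class DedupTracker:
    """Reject repeats by surface text and by slot combination,
    and cap how often any seed word is reused (30, per the evaluation)."""

    def __init__(self, seed_cap=30):
        self.seen_text = set()
        self.seen_slots = set()
        self.seed_usage = {}
        self.seed_cap = seed_cap

    def admit(self, text, slots, seed):
        # slots must be hashable, e.g. a tuple of slot fills
        if text in self.seen_text or slots in self.seen_slots:
            return False
        if self.seed_usage.get(seed, 0) >= self.seed_cap:
            return False
        self.seen_text.add(text)
        self.seen_slots.add(slots)
        self.seed_usage[seed] = self.seed_usage.get(seed, 0) + 1
        return True
```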
### Should Fix Before Corpus Generation
1. **Clean Phase 2 artifacts from `folksy_relations_augmented.csv`** — grep for "bridge word" and "bridge 2" in surface_text/end_word fields and remove or repair those edges
2. **Fix `futile_preparation` gerunding** — the `_gerund()` function needs a check that the UsedFor target is actually a verb before conjugating it; alternatively, filter UsedFor targets to verb-like words only
3. **Filter multi-word ConceptNet phrases** — the `_short_concepts()` helper caps at 3 words but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
4. **Fix article logic** — the `_a()` function at lines 680-684 relies on a faulty first-character check and still emits "An" before consonant-initial words; "An salt" is wrong because "salt" starts with "s"
### Nice to Have
- Add a basic test suite (even just smoke tests that confirm each template generates output)
- Create `requirements.txt` (currently stdlib-only, but corpus phase will need `requests` at minimum)
- Review the 3,678 candidate OOV words — none reached the frequency threshold of 3 occurrences required for auto-addition, but manual review could surface useful additions
### Overall Verdict
**The template generator works but produces rough output.** This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be **40-50%** due to the issues above.
Fixing the four "Should Fix" items before corpus generation would:
- Reduce the discard rate (saving LLM compute time)
- Improve the quality floor of raw output (giving the polish LLM better material to work with)
- Eliminate artifact contamination that could propagate into training data
The generator is **functional but not polished** — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.