# Folksy Generator — Evaluation Report

**Date:** 2026-02-17
**Evaluator:** Claude (automated)
**Scope:** Post-integration health check after three LLM augmentation phases

---

## 1. Project Structure Overview

```
folksy-generator/
├── folksy_generator.py              # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md         # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md        # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md        # Corpus generation spec (next phase)
├── data/
│   ├── folksy_vocab.csv             # Curated vocabulary (624 words, expanded from 534)
│   ├── folksy_vocab.csv.bak.*       # Pre-expansion backup (534 words)
│   ├── folksy_relations.csv         # Original ConceptNet edges (11,096 edges)
│   ├── folksy_relations_augmented.csv  # LLM-generated edges (11,220 edges)
│   ├── classified_proverbs.csv      # Labeled real proverbs for reference
│   ├── candidate_additions.csv      # OOV words suggested by LLM (3,678 unique)
│   └── enhancement_log.csv          # Processing log for all 3 phases
├── scripts/
│   ├── extract_from_conceptnet.py   # One-time ConceptNet extraction (requires psql)
│   ├── extract_relations.py         # Relation extraction helper
│   ├── classify_proverbs.py         # Proverb classification
│   ├── expand_vocab.py              # Phase: vocab expansion (+90 words)
│   ├── enhance_graph.py             # Phase: LLM edge augmentation
│   ├── generate_raw_batch.sh        # Bulk generation script
│   ├── polish_corpus.py             # LLM polish pipeline
│   ├── filter_corpus.py             # Quality filtering
│   ├── format_training_pairs.py     # Training pair generation
│   └── compute_corpus_stats.py      # Corpus statistics
├── examples/
│   ├── my_world.json                # Fictional entity examples (5 entities)
│   └── sample_output.txt            # Pre-integration sample output
├── schemas/
│   └── fictional_entities.schema.json
└── corpus/                          # Empty — not yet populated
```

**Entry point:** `python3 folksy_generator.py` — no virtual environment, no dependencies beyond the Python 3.11 stdlib.

---

## 2. What the Three LLM Integration Phases Produced

Git history shows a single initial commit (`8c8a058 Initial 'folksy idiom' generator`). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in data files.

### Phase 1: Per-Word Relationship Expansion
- **624 words** processed through GLM4-32B
- 10,726 edges generated, **1,155 accepted** (10.8% acceptance rate)
- 9,510 edges rejected as OOV (target words not in folksy vocab)
- 61 duplicates filtered
- Filled gaps in `AtLocation`, `UsedFor`, `HasA`, `MadeOf`, `PartOf`, `CapableOf`, `HasPrerequisite`, `Causes`, `HasProperty`

### Phase 2: Cross-Word Relationship Discovery (Bridge Words)
- **148 low-connectivity words** targeted
- 6,272 bridge edges accepted
- This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts

### Phase 3: Property Enrichment
- **624 words** processed for distinctive `HasProperty` edges
- 3,849 edges generated, **3,788 accepted** (98.4% acceptance rate)
- 61 duplicates filtered
- Targeted at improving `false_equivalence` template output

### Vocab Expansion (via `expand_vocab.py`)
- Original vocabulary: **534 words**
- Current vocabulary: **624 words** (+90 words added)
- Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories

### Combined Data Summary

| Dataset | Count |
|---------|-------|
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| **Total edges (combined)** | **22,316** |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |

---

## 3. Term Database Statistics

### Vocabulary by Category (36 categories)

| Category | Words | | Category | Words |
|----------|-------|-|----------|-------|
| bird | 97 | | fish | 16 |
| animal | 65 | | spice | 16 |
| tool | 56 | | fruit | 15 |
| plant | 43 | | mineral | 14 |
| food | 38 | | insect | 14 |
| material | 36 | | structure | 13 |
| container | 34 | | beverage | 9 |
| instrument | 28 | | fabric | 9 |
| landscape | 27 | | tree | 8 |
| vegetable | 24 | | wood | 7 |
| building | 21 | | herb | 7 |
| metal | 19 | | rock | 6 |
| flower | 19 | | water | 6 |
| vehicle | 18 | | furniture | 5 |
| stone | 17 | | clothing | 5 |
| weapon | 17 | | shelter | 5 |
| — | — | | crop, seed, organism, grain | 3-4 each |

### Edge Distribution — Original ConceptNet

| Relation | Edges |
|----------|-------|
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |

### Edge Distribution — LLM Augmented

| Relation | Edges |
|----------|-------|
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |

The augmented edges deliberately fill the gaps in the original ConceptNet data. `HasProperty` went from 422 to 4,407 total — critical for the `false_equivalence` template.

---

## 4. Sample Generated Output (30 Sayings)

Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:

1. An scarf ain't nothing but cotton that met some wool.
2. The only difference between a hummingbird and a dodo is metabolism.
3. An salt ain't nothing but ore that met some crystals.
4. Funny how the earthworm never has enough food for itself.
5. What's a coop but a kitchen with sound?
6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
7. Don't take the wheel and then gripe about the hull.
8. A bamboo don't come without its water, now does it?
9. Nobody's got less salsa than the man who makes the mango.
10. That's like eating the sea and complaining the savanna tastes off.
11. My daddy always said, can't have waking up in morning without coffee.
12. Take the bison out of meat and all you've got left is salty taste flesh.
13. Like baiting the flock and hoping for keep as pet.
14. The ice's family always goes without cool body.
15. There's a fella who takes the wax and says the sugar's no good.
16. That's just holding the drawer and praying for store blanket.
17. You know what they say, a mica with no schist is just a rough surface rock.
18. An silver ain't nothing but hairbrushes that met some alloy.
19. A kite is just a pelican that's got catch wind.
20. Like making the denim and hoping for material.
21. The nut feeds everyone's fit bolt but its own.
22. The pitcher's family always goes without throw fast ball.
23. A nail is just a weapon that's got smooth length.
24. You want lid? Well, first you're gonna need container.
25. Don't build the micrometer and say you ain't got workshop.
26. Ain't no sleeping at night ever came from nothing — you need bed.
27. What's a cicada but a lacebug with nocturnal behavior?
28. Don't drink the dish and then gripe about the gnocchi.
29. You can't put out a herring and then wonder where all the herringbone came from.
30. That's just lorikeeting the fruit and praying for breaking wind.

---

## 5. Quality Assessment

### Rating Summary

I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):

| Rating | Count | % | Description |
|--------|-------|---|-------------|
| **Good** | 8 | 27% | Sounds natural, humorous, structurally solid |
| **Okay** | 9 | 30% | Semantically coherent but grammatically rough |
| **Bad** | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |

### Good Examples (natural-sounding, humorous)
- "Nobody's got less salsa than the man who makes the mango."
- "There's a fella who takes the wax and says the sugar's no good."
- "A bamboo don't come without its water, now does it?"
- "Don't take the wheel and then gripe about the hull."
- "Ain't no sleeping at night ever came from nothing — you need bed."
- "My daddy always said, can't have waking up in morning without coffee."
- "What's a cicada but a lacebug with nocturnal behavior?"
- "You can't put out a herring and then wonder where all the herringbone came from."

### Common Issues Identified

#### 1. Article / Grammar Errors (frequent)
- "An scarf ain't nothing but..." — should be "A scarf"
- "An silver ain't nothing but..." — should be "Silver"
- "An salt ain't nothing but..." — should be "Salt"
- "A have children don't come without..." — broken slot fill leaking an action phrase as a noun
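
A minimal repair sketch for the article logic (not the project's actual `_a()` implementation; the exception lists are illustrative) picks the article from the first letter of the head word, with a small list for words whose spelling and sound disagree:

```python
VOWEL_LETTERS = "aeiou"
# Hypothetical exception lists -- words whose first letter and first
# sound disagree. Extend as real failures are observed.
AN_EXCEPTIONS = {"hour", "honest", "heir"}   # consonant letter, vowel sound
A_EXCEPTIONS = {"unicorn", "used", "one"}    # vowel letter, consonant sound

def choose_article(phrase: str) -> str:
    """Return 'a' or 'an' for a noun phrase, judged by its first word."""
    head = phrase.split()[0].lower()
    if head in AN_EXCEPTIONS:
        return "an"
    if head in A_EXCEPTIONS:
        return "a"
    return "an" if head[0] in VOWEL_LETTERS else "a"
```

Note that mass nouns like "silver" and "salt" would still need a separate no-article check, since the correct fix there is dropping the article entirely.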

#### 2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)
- "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
- "waking up in morning", "sleeping at night", "salty taste"
- "breaking wind", "rough surface"
- These are raw ConceptNet concept IDs that should have been filtered or reformatted
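
One hedged approach to screening surface forms before they reach a slot (the token cap and function-word blocklist here are illustrative values, not tuned against the actual data):

```python
# Function words that make a ConceptNet surface form read badly as a
# bare slot fill ("keep as pet", "waking up in morning").
FUNCTION_WORDS = {"a", "an", "the", "as", "at", "in", "of", "for", "to", "up"}

def usable_slot_fill(phrase: str, max_tokens: int = 2) -> bool:
    """Reject multi-word ConceptNet phrases that would break sentence flow."""
    tokens = phrase.lower().split()
    if len(tokens) > max_tokens:
        return False
    return not any(t in FUNCTION_WORDS for t in tokens)
```

A cap of 2 still admits awkward fills like "salty taste", so a POS check on the remaining tokens would be the stronger fix.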

#### 3. Nonsensical Verb Conjugation in Futile Preparation (severe)
- "lorikeeting the fruit" — `lorikeet` treated as a verb
- "fooding the earthworm" — `food` treated as a verb
- "jeansing the denim" — `jeans` treated as a verb
- "safariing the lion" — `safari` treated as a verb
- The `_gerund()` function applies -ing conjugation to ANY `UsedFor` target, including nouns
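
A sketch of the guard proposed later in this report: only conjugate when the target is a known verb, and signal failure otherwise so the caller can draw another edge. The verb set here is a stand-in; a real fix might consult a POS-tagged wordlist instead.

```python
from typing import Optional

# Illustrative verb lemmas -- a real implementation might load these
# from a POS-tagged wordlist rather than hard-coding them.
KNOWN_VERBS = {"bait", "make", "hold", "eat", "drink", "build", "catch"}

def safe_gerund(word: str) -> Optional[str]:
    """Return the -ing form if `word` is a known verb, else None."""
    if word not in KNOWN_VERBS:
        return None  # e.g. 'lorikeet' -- caller should pick another edge
    if word.endswith("e") and not word.endswith("ee"):
        return word[:-1] + "ing"  # make -> making
    return word + "ing"
```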

#### 4. LLM Enhancement Artifacts Leaking (moderate)
- "bridge word: plate" appearing in output text
- "bridge 2: **food**" appearing in output text
- "*bridge word: absorption*" appearing in output text
- These are raw LLM response fragments that weren't properly cleaned during Phase 2
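
A possible cleaning pass over the augmented edges (the column names are assumptions about the CSV layout, and the pattern is derived only from the fragments observed above):

```python
import csv
import re

# Matches Phase 2 scaffolding like "bridge word: plate" or "bridge 2: food".
ARTIFACT = re.compile(r"bridge\s*(?:word|\d+)\s*:", re.IGNORECASE)

def drop_artifact_rows(rows):
    """Filter out rows (dicts from csv.DictReader) carrying LLM scaffolding."""
    return [row for row in rows
            if not any(ARTIFACT.search(value or "") for value in row.values())]

def clean_csv(src: str, dst: str) -> None:
    """Rewrite `src` to `dst` with artifact rows removed."""
    with open(src, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames, rows = reader.fieldnames, drop_artifact_rows(reader)
    with open(dst, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Scanning every field rather than one named column avoids guessing which field the scaffolding landed in.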

#### 5. Semantic Mismatches (occasional)
- "A lynx is just a earthworm that's got feline." — wrong category siblings
- "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
- "A emu is just a ferret that's got walk backwards." — cross-class comparison

### Per-Template Quality Assessment

| Template | Typical Quality | Key Issue |
|----------|----------------|-----------|
| **deconstruction** | Okay | Multi-word properties leak; article errors with "An" |
| **denial_of_consequences** | Good | Best template; LLM artifacts occasionally leak through |
| **ironic_deficiency** | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| **futile_preparation** | Bad | Nouns gerunded as verbs; worst template overall |
| **hypocritical_complaint** | Okay | Some odd part-of relationships; generally coherent structure |
| **tautological_wisdom** | Good | Simple structure avoids most issues; multi-word phrases still leak |
| **false_equivalence** | Good | Benefited most from Phase 3 property enrichment |

---

## 6. Errors, Warnings, and Issues

### No Errors at Runtime
- Generator runs without crashes on all template types
- All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
- JSON output mode produces valid JSONL with complete metadata
- Fictional entity generation works

### Issues Found

| Severity | Issue | Impact |
|----------|-------|--------|
| **High** | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| **High** | Nouns gerunded as verbs in `futile_preparation` | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb `UsedFor` targets |
| **Medium** | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| **Medium** | Article logic doesn't handle "a" vs "an" properly in all cases | "An scarf", "An silver", "An salt" |
| **Low** | No test suite exists | No automated validation of output quality |
| **Low** | No virtual environment or requirements.txt | Only stdlib needed currently, but the corpus generation phase will need dependencies |
| **Info** | Corpus directory is empty | Expected — corpus generation is the next phase |

---

## 7. Readiness Assessment for Corpus Generation

### Ready
- Template engine is functional and produces output across all 7 meta-template families
- Augmented graph significantly improves vocabulary coverage (22,316 total edges)
- Vocab expansion added 90 words to cover previously sparse categories
- JSON output mode with full debug metadata is working — ready for bulk generation
- Deduplication logic works (`seen_text`, `seen_slots`, `seed_usage` capped at 30)
- Fictional entity support is implemented and functional
- All corpus pipeline scripts exist (`generate_raw_batch.sh`, `polish_corpus.py`, `filter_corpus.py`, `format_training_pairs.py`, `compute_corpus_stats.py`)
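
The deduplication bookkeeping named above can be sketched as follows. The structure names (`seen_text`, `seen_slots`, `seed_usage`) and the cap of 30 come from this report; the exact acceptance logic is an assumption, not the generator's real code.

```python
# Sketch of the dedup bookkeeping; names from the report, logic assumed.
seen_text: set = set()
seen_slots: set = set()
seed_usage: dict = {}

def accept(text: str, slots: tuple, seed: str, cap: int = 30) -> bool:
    """Accept a saying only if its text and slot tuple are new and the
    seed word has not reached its usage cap."""
    if text in seen_text or slots in seen_slots:
        return False
    if seed_usage.get(seed, 0) >= cap:
        return False
    seen_text.add(text)
    seen_slots.add(slots)
    seed_usage[seed] = seed_usage.get(seed, 0) + 1
    return True
```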

### Should Fix Before Corpus Generation

1. **Clean Phase 2 artifacts from `folksy_relations_augmented.csv`** — grep for "bridge word" and "bridge 2" in the surface_text/end_word fields and remove or repair those edges
2. **Fix `futile_preparation` gerunding** — the `_gerund()` function needs a check that the `UsedFor` target is actually a verb before conjugating it; alternatively, filter `UsedFor` targets to verb-like words only
3. **Filter multi-word ConceptNet phrases** — the `_short_concepts()` helper caps at 3 words, but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
4. **Fix article logic** — the `_a()` function (lines 680-684) only checks the first character; "An salt" is wrong because "salt" starts with "s"

### Nice to Have
- Add a basic test suite (even just smoke tests confirming each template generates output)
- Create `requirements.txt` (currently stdlib-only, but the corpus phase will need `requests` at minimum)
- Review the 3,678 candidate OOV words — none reached the frequency threshold of 3+ for auto-addition, but manual review could surface useful additions
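
Such a review pass might rank candidates by suggestion frequency (a sketch; the 3+ threshold comes from this report, everything else is assumed):

```python
from collections import Counter

def rank_candidates(suggested_words, min_count: int = 2):
    """Rank OOV candidates by how often the LLM proposed them.

    min_count=2 deliberately sits below the 3+ auto-add threshold so a
    reviewer sees the near-misses; survivors still need manual approval.
    """
    counts = Counter(w.lower() for w in suggested_words)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]
```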

### Overall Verdict

**The template generator works but produces rough output.** This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation), where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be **40-50%** due to the issues above.

Fixing the four "Should Fix" items before corpus generation would:
- Reduce the discard rate (saving LLM compute time)
- Improve the quality floor of the raw output (giving the polish LLM better material to work with)
- Eliminate artifact contamination that could propagate into training data

The generator is **functional but not polished** — appropriate for its role as a raw-material source in a pipeline that includes LLM correction downstream.