
# Folksy Generator — Evaluation Report

Date: 2026-02-17
Evaluator: Claude (automated)
Scope: Post-integration health check after three LLM augmentation phases


## 1. Project Structure Overview

```
folksy-generator/
├── folksy_generator.py              # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md         # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md        # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md        # Corpus generation spec (next phase)
├── data/
│   ├── folksy_vocab.csv             # Curated vocabulary (624 words, expanded from 534)
│   ├── folksy_vocab.csv.bak.*       # Pre-expansion backup (534 words)
│   ├── folksy_relations.csv         # Original ConceptNet edges (11,096 edges)
│   ├── folksy_relations_augmented.csv  # LLM-generated edges (11,220 edges)
│   ├── classified_proverbs.csv      # Labeled real proverbs for reference
│   ├── candidate_additions.csv      # OOV words suggested by LLM (3,678 unique)
│   └── enhancement_log.csv          # Processing log for all 3 phases
├── scripts/
│   ├── extract_from_conceptnet.py   # One-time ConceptNet extraction (requires psql)
│   ├── extract_relations.py         # Relation extraction helper
│   ├── classify_proverbs.py         # Proverb classification
│   ├── expand_vocab.py              # Phase: vocab expansion (+90 words)
│   ├── enhance_graph.py             # Phase: LLM edge augmentation
│   ├── generate_raw_batch.sh        # Bulk generation script
│   ├── polish_corpus.py             # LLM polish pipeline
│   ├── filter_corpus.py             # Quality filtering
│   ├── format_training_pairs.py     # Training pair generation
│   └── compute_corpus_stats.py      # Corpus statistics
├── examples/
│   ├── my_world.json                # Fictional entity examples (5 entities)
│   └── sample_output.txt            # Pre-integration sample output
├── schemas/
│   └── fictional_entities.schema.json
└── corpus/                          # Empty — not yet populated
```

Entry point: `python3 folksy_generator.py` — no virtual environment, no dependencies beyond the Python 3.11 stdlib.


## 2. What the Three LLM Integration Phases Produced

Git history shows a single initial commit (`8c8a058`, "Initial 'folksy idiom' generator"). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in the data files.

### Phase 1: Per-Word Relationship Expansion

  • 624 words processed through GLM4-32B
  • 10,726 edges generated, 1,155 accepted (10.8% acceptance rate)
  • 9,510 edges rejected as OOV (target words not in folksy vocab)
  • 61 duplicates filtered
  • Filled gaps in AtLocation, UsedFor, HasA, MadeOf, PartOf, CapableOf, HasPrerequisite, Causes, HasProperty
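
The Phase 1 numbers above come down to a strict vocabulary-membership check plus deduplication. A minimal sketch of that accept/reject pass, assuming hypothetical file and column names (the actual logic in enhance_graph.py may differ):

```python
# Sketch of the Phase 1 acceptance filter. Column names (word,
# start_word, relation, end_word) and the candidate-edge filename are
# assumptions, not confirmed details of enhance_graph.py.
import csv

with open("data/folksy_vocab.csv", newline="") as f:
    vocab = {row["word"] for row in csv.DictReader(f)}  # assumes a "word" column

accepted, seen = [], set()
with open("llm_candidate_edges.csv", newline="") as f:  # hypothetical input file
    for row in csv.DictReader(f):
        key = (row["start_word"], row["relation"], row["end_word"])
        if row["end_word"] not in vocab:
            continue                 # rejected as OOV
        if key in seen:
            continue                 # duplicate filtered
        seen.add(key)
        accepted.append(row)
```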

### Phase 2: Cross-Word Relationship Discovery (Bridge Words)

  • 148 low-connectivity words targeted
  • 6,272 bridge edges accepted
  • This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
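
One plausible way to select low-connectivity targets like these is a simple degree count over the existing edge list; a sketch, with the threshold and column names as assumptions rather than the actual Phase 2 logic:

```python
# Sketch of low-connectivity target selection via degree counting.
# The cutoff of 5 and the CSV column names are illustrative assumptions.
import csv
from collections import Counter

with open("data/folksy_vocab.csv", newline="") as f:
    vocab = {row["word"] for row in csv.DictReader(f)}  # assumes a "word" column

degree = Counter()
with open("data/folksy_relations.csv", newline="") as f:
    for row in csv.DictReader(f):
        for w in (row["start_word"], row["end_word"]):
            if w in vocab:
                degree[w] += 1       # count both endpoints

low_connectivity = sorted(w for w in vocab if degree[w] < 5)  # hypothetical cutoff
print(len(low_connectivity), "targets")
```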

### Phase 3: Property Enrichment

  • 624 words processed for distinctive HasProperty edges
  • 3,849 edges generated, 3,788 accepted (98.4% acceptance rate)
  • 61 duplicates filtered
  • Targeted at improving false_equivalence template output

### Vocab Expansion (via `expand_vocab.py`)

  • Original vocabulary: 534 words
  • Current vocabulary: 624 words (+90 words added)
  • Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories

### Combined Data Summary

| Dataset | Count |
| --- | --- |
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| Total edges (combined) | 22,316 |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |

## 3. Term Database Statistics

### Vocabulary by Category (36 categories)

| Category | Words | Category | Words |
| --- | --- | --- | --- |
| bird | 97 | fish | 16 |
| animal | 65 | spice | 16 |
| tool | 56 | fruit | 15 |
| plant | 43 | mineral | 14 |
| food | 38 | insect | 14 |
| material | 36 | structure | 13 |
| container | 34 | beverage | 9 |
| instrument | 28 | fabric | 9 |
| landscape | 27 | tree | 8 |
| vegetable | 24 | wood | 7 |
| building | 21 | herb | 7 |
| metal | 19 | rock | 6 |
| flower | 19 | water | 6 |
| vehicle | 18 | furniture | 5 |
| stone | 17 | clothing | 5 |
| weapon | 17 | shelter | 5 |

The remaining four categories (crop, seed, organism, grain) have 3-4 words each.

### Edge Distribution — Original ConceptNet

| Relation | Edges |
| --- | --- |
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |

### Edge Distribution — LLM Augmented

| Relation | Edges |
| --- | --- |
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |

The augmented edges deliberately fill the gaps in the original ConceptNet data. HasProperty went from 422 to 4,407 total — critical for the false_equivalence template.
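
These distribution tables are cheap to reproduce with a one-pass count; a sketch, assuming each relations CSV carries a "relation" column:

```python
# Reproduce the edge-distribution tables above. The "relation" column
# name is an assumption about the CSV layout.
import csv
from collections import Counter

def relation_counts(path: str) -> Counter:
    with open(path, newline="") as f:
        return Counter(row["relation"] for row in csv.DictReader(f))

original = relation_counts("data/folksy_relations.csv")
augmented = relation_counts("data/folksy_relations_augmented.csv")
print((original + augmented)["HasProperty"])  # expected: 4,407 per the text
```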


## 4. Sample Generated Output (30 Sayings)

Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:

  1. An scarf ain't nothing but cotton that met some wool.
  2. The only difference between a hummingbird and a dodo is metabolism.
  3. An salt ain't nothing but ore that met some crystals.
  4. Funny how the earthworm never has enough food for itself.
  5. What's a coop but a kitchen with sound?
  6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
  7. Don't take the wheel and then gripe about the hull.
  8. A bamboo don't come without its water, now does it?
  9. Nobody's got less salsa than the man who makes the mango.
  10. That's like eating the sea and complaining the savanna tastes off.
  11. My daddy always said, can't have waking up in morning without coffee.
  12. Take the bison out of meat and all you've got left is salty taste flesh.
  13. Like baiting the flock and hoping for keep as pet.
  14. The ice's family always goes without cool body.
  15. There's a fella who takes the wax and says the sugar's no good.
  16. That's just holding the drawer and praying for store blanket.
  17. You know what they say, a mica with no schist is just a rough surface rock.
  18. An silver ain't nothing but hairbrushes that met some alloy.
  19. A kite is just a pelican that's got catch wind.
  20. Like making the denim and hoping for material.
  21. The nut feeds everyone's fit bolt but its own.
  22. The pitcher's family always goes without throw fast ball.
  23. A nail is just a weapon that's got smooth length.
  24. You want lid? Well, first you're gonna need container.
  25. Don't build the micrometer and say you ain't got workshop.
  26. Ain't no sleeping at night ever came from nothing — you need bed.
  27. What's a cicada but a lacebug with nocturnal behavior?
  28. Don't drink the dish and then gripe about the gnocchi.
  29. You can't put out a herring and then wonder where all the herringbone came from.
  30. That's just lorikeeting the fruit and praying for breaking wind.

## 5. Quality Assessment

### Rating Summary

I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):

| Rating | Count | % | Description |
| --- | --- | --- | --- |
| Good | 8 | 27% | Sounds natural, humorous, structurally solid |
| Okay | 9 | 30% | Semantically coherent but grammatically rough |
| Bad | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |

### Good Examples (natural-sounding, humorous)

  • "Nobody's got less salsa than the man who makes the mango."
  • "There's a fella who takes the wax and says the sugar's no good."
  • "A bamboo don't come without its water, now does it?"
  • "Don't take the wheel and then gripe about the hull."
  • "Ain't no sleeping at night ever came from nothing — you need bed."
  • "My daddy always said, can't have waking up in morning without coffee."
  • "What's a cicada but a lacebug with nocturnal behavior?"
  • "You can't put out a herring and then wonder where all the herringbone came from."

### Common Issues Identified

1. Article / Grammar Errors (frequent)

  • "An scarf ain't nothing but..." — should be "A scarf"
  • "An silver ain't nothing but..." — should be "Silver"
  • "An salt ain't nothing but..." — should be "Salt"
  • "A have children don't come without..." — broken slot fill leaking action phrase as noun

2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)

  • "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
  • "waking up in morning", "sleeping at night", "salty taste"
  • "breaking wind", "store blanket", "rough surface"
  • These are raw ConceptNet concept IDs that should have been filtered or reformatted

3. Nonsensical Verb Conjugation in Futile Preparation (severe)

  • "lorikeeting the fruit" — lorikeet treated as a verb
  • "fooding the earthworm" — food treated as a verb
  • "jeansing the denim" — jeans treated as a verb
  • "safariing the lion" — safari treated as a verb
  • The _gerund() function applies gerunding to ANY UsedFor target, including nouns
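
A guard before conjugation would catch these. A minimal sketch, assuming a hypothetical curated verb list and a standalone helper (the actual `_gerund()` in folksy_generator.py may be shaped differently):

```python
# Minimal sketch of a verb guard for gerunding. KNOWN_VERBS is a
# hypothetical curated list, not data from the project.
KNOWN_VERBS = {"bait", "eat", "drink", "hold", "make", "spoon", "catch"}

def safe_gerund(phrase: str) -> str | None:
    """Return the -ing form of the head token, or None if it is not a
    known verb (signalling the template to skip this slot fill)."""
    head = phrase.split()[0].lower()
    if head not in KNOWN_VERBS:
        return None                  # "lorikeet", "food", "jeans" fall through
    if head.endswith("e") and not head.endswith("ee"):
        head = head[:-1]             # make -> making
    return head + "ing"
```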

4. LLM Enhancement Artifacts Leaking (moderate)

  • "bridge word: plate" appearing in output text
  • "bridge 2: food" appearing in output text
  • "bridge word: absorption" appearing in output text
  • These are raw LLM response fragments that weren't properly cleaned during Phase 2
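
Stripping these artifacts is a straightforward field scan. A sketch, assuming dropping the affected rows (rather than repairing them) is acceptable; the CSV column layout is an assumption:

```python
# Sketch: drop augmented edges whose fields contain Phase 2 prompt
# artifacts. Output filename and column layout are assumptions.
import csv

ARTIFACTS = ("bridge word", "bridge 2")

with open("data/folksy_relations_augmented.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    clean = [row for row in reader
             if not any(a in (v or "").lower()
                        for v in row.values() for a in ARTIFACTS)]

with open("data/folksy_relations_augmented.clean.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(clean)
```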

5. Semantic Mismatches (occasional)

  • "A lynx is just a earthworm that's got feline." — wrong category siblings
  • "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
  • "A emu is just a ferret that's got walk backwards." — cross-class comparison

### Per-Template Quality Assessment

| Template | Typical Quality | Key Issue |
| --- | --- | --- |
| deconstruction | Okay | Multi-word properties leak; article errors with "An" |
| denial_of_consequences | Good | Best template; LLM artifacts occasionally leak through |
| ironic_deficiency | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| futile_preparation | Bad | Nouns gerunded as verbs; worst template overall |
| hypocritical_complaint | Okay | Some odd part-of relationships; generally coherent structure |
| tautological_wisdom | Good | Simple structure avoids most issues; multi-word phrases still leak |
| false_equivalence | Good | Benefited most from Phase 3 property enrichment |

## 6. Errors, Warnings, and Issues

### No Errors at Runtime

  • Generator runs without crashes on all template types
  • All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
  • JSON output mode produces valid JSONL with complete metadata
  • Fictional entity generation works

### Issues Found

| Severity | Issue | Impact |
| --- | --- | --- |
| High | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| High | Nouns gerunded as verbs in futile_preparation | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
| Medium | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| Medium | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
| Low | No test suite exists | No automated validation of output quality |
| Low | No virtual environment or `requirements.txt` | Only stdlib needed currently, but the corpus generation phase will need dependencies |
| Info | Corpus directory is empty | Expected — corpus generation is the next phase |

## 7. Readiness Assessment for Corpus Generation

### Ready

  • Template engine is functional and produces output across all 7 meta-template families
  • Augmented graph significantly improves vocabulary coverage (22,316 total edges)
  • Vocab expansion added 90 words to cover previously sparse categories
  • JSON output mode with full debug metadata is working — ready for bulk generation
  • Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
  • Fictional entity support is implemented and functional
  • All corpus pipeline scripts exist (generate_raw_batch.sh, polish_corpus.py, filter_corpus.py, format_training_pairs.py, compute_corpus_stats.py)

### Should Fix Before Corpus Generation

  1. Clean Phase 2 artifacts from `folksy_relations_augmented.csv` — grep for "bridge word" and "bridge 2" in the surface_text/end_word fields and remove or repair those edges
  2. Fix futile_preparation gerunding — the `_gerund()` function needs a check that the UsedFor target is actually a verb before conjugating it; alternatively, filter UsedFor targets to verb-like words only
  3. Filter multi-word ConceptNet phrases — the `_short_concepts()` helper caps at 3 words, but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
  4. Fix article logic — the `_a()` function at lines 680-684 emits "An" before consonant-initial words; every observed error ("An scarf", "An silver", "An salt") starts with "s", and mass nouns like "salt" and "silver" should additionally take no article at all (see the sketch below)
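
For fix #4, the helper needs both a sound vowel check and a mass-noun exclusion. A minimal sketch, where MASS_NOUNS is an illustrative hypothetical set and the real `_a()` signature may differ:

```python
# Sketch of corrected article logic for fix #4. MASS_NOUNS is an
# illustrative subset, not the project's actual data.
MASS_NOUNS = {"salt", "silver", "water", "coffee", "wool", "meat"}

def article(word: str) -> str:
    head = word.split()[0].lower()   # multi-word fills use the head token
    if head in MASS_NOUNS:
        return ""                    # "Salt ain't nothing but..."
    return "an " if head[0] in "aeiou" else "a "
```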

### Nice to Have

  • Add a basic test suite — even just smoke tests that confirm each template generates output (see the sketch after this list)
  • Create `requirements.txt` (currently stdlib-only, but the corpus phase will need `requests` at minimum)
  • Review the 3,678 candidate OOV words — none exceeded the frequency threshold of 3+ for auto-addition, but manual review could surface useful additions
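
For the smoke tests suggested above, a stdlib-only sketch, assuming `--template` accepts the seven family names listed in the per-template table in section 5:

```python
# Smoke-test sketch (unittest, stdlib-only). Assumes --template takes
# the seven family names from section 5; adjust if the CLI differs.
import subprocess
import unittest

TEMPLATES = ["deconstruction", "denial_of_consequences", "ironic_deficiency",
             "futile_preparation", "hypocritical_complaint",
             "tautological_wisdom", "false_equivalence"]

class SmokeTest(unittest.TestCase):
    def test_each_template_generates(self):
        for t in TEMPLATES:
            out = subprocess.run(
                ["python3", "folksy_generator.py", "--template", t, "--count", "1"],
                capture_output=True, text=True, check=True)
            self.assertTrue(out.stdout.strip(), f"{t} produced no output")

if __name__ == "__main__":
    unittest.main()
```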

### Overall Verdict

The template generator works but produces rough output. This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be 40-50% due to the issues above.

Fixing the four "Should Fix" items before corpus generation would:

  • Reduce the discard rate (saving LLM compute time)
  • Improve the quality floor of raw output (giving the polish LLM better material to work with)
  • Eliminate artifact contamination that could propagate into training data

The generator is functional but not polished — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.