
# Folksy Generator — Evaluation Report

Date: 2026-02-17
Evaluator: Claude (automated)
Scope: Post-integration health check after three LLM augmentation phases


## 1. Project Structure Overview

```
folksy-generator/
├── folksy_generator.py              # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md         # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md        # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md        # Corpus generation spec (next phase)
├── data/
│   ├── folksy_vocab.csv             # Curated vocabulary (624 words, expanded from 534)
│   ├── folksy_vocab.csv.bak.*       # Pre-expansion backup (534 words)
│   ├── folksy_relations.csv         # Original ConceptNet edges (11,096 edges)
│   ├── folksy_relations_augmented.csv  # LLM-generated edges (11,220 edges)
│   ├── classified_proverbs.csv      # Labeled real proverbs for reference
│   ├── candidate_additions.csv      # OOV words suggested by LLM (3,678 unique)
│   └── enhancement_log.csv          # Processing log for all 3 phases
├── scripts/
│   ├── extract_from_conceptnet.py   # One-time ConceptNet extraction (requires psql)
│   ├── extract_relations.py         # Relation extraction helper
│   ├── classify_proverbs.py         # Proverb classification
│   ├── expand_vocab.py              # Phase: vocab expansion (+90 words)
│   ├── enhance_graph.py             # Phase: LLM edge augmentation
│   ├── generate_raw_batch.sh        # Bulk generation script
│   ├── polish_corpus.py             # LLM polish pipeline
│   ├── filter_corpus.py             # Quality filtering
│   ├── format_training_pairs.py     # Training pair generation
│   └── compute_corpus_stats.py      # Corpus statistics
├── examples/
│   ├── my_world.json                # Fictional entity examples (5 entities)
│   └── sample_output.txt            # Pre-integration sample output
├── schemas/
│   └── fictional_entities.schema.json
└── corpus/                          # Empty — not yet populated
```

Entry point: `python3 folksy_generator.py` — no virtual environment, no dependencies beyond the Python 3.11 stdlib.


## 2. What the Three LLM Integration Phases Produced

Git history shows a single initial commit (`8c8a058`, "Initial 'folksy idiom' generator"). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in the data files.

### Phase 1: Per-Word Relationship Expansion

  • 624 words processed through GLM4-32B
  • 10,726 edges generated, 1,155 accepted (10.8% acceptance rate)
  • 9,510 edges rejected as OOV (target words not in folksy vocab)
  • 61 duplicates filtered
  • Filled gaps in AtLocation, UsedFor, HasA, MadeOf, PartOf, CapableOf, HasPrerequisite, Causes, HasProperty
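
The Phase 1 numbers above come down to a strict vocabulary-membership check plus deduplication. A minimal sketch of that accept/reject pass, assuming hypothetical file and column names (the actual logic in enhance_graph.py may differ):

```python
# Sketch of the Phase 1 acceptance filter. Column names (word,
# start_word, relation, end_word) and the candidate-edge filename are
# assumptions, not confirmed details of enhance_graph.py.
import csv

with open("data/folksy_vocab.csv", newline="") as f:
    vocab = {row["word"] for row in csv.DictReader(f)}  # assumes a "word" column

accepted, seen = [], set()
with open("llm_candidate_edges.csv", newline="") as f:  # hypothetical input file
    for row in csv.DictReader(f):
        key = (row["start_word"], row["relation"], row["end_word"])
        if row["end_word"] not in vocab:
            continue                 # rejected as OOV
        if key in seen:
            continue                 # duplicate filtered
        seen.add(key)
        accepted.append(row)
```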

### Phase 2: Cross-Word Relationship Discovery (Bridge Words)

  • 148 low-connectivity words targeted
  • 6,272 bridge edges accepted
  • This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
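
One plausible way to select low-connectivity targets like these is a simple degree count over the existing edge list; a sketch, with the threshold and column names as assumptions rather than the actual Phase 2 logic:

```python
# Sketch of low-connectivity target selection via degree counting.
# The cutoff of 5 and the CSV column names are illustrative assumptions.
import csv
from collections import Counter

with open("data/folksy_vocab.csv", newline="") as f:
    vocab = {row["word"] for row in csv.DictReader(f)}  # assumes a "word" column

degree = Counter()
with open("data/folksy_relations.csv", newline="") as f:
    for row in csv.DictReader(f):
        for w in (row["start_word"], row["end_word"]):
            if w in vocab:
                degree[w] += 1       # count both endpoints

low_connectivity = sorted(w for w in vocab if degree[w] < 5)  # hypothetical cutoff
print(len(low_connectivity), "targets")
```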

### Phase 3: Property Enrichment

  • 624 words processed for distinctive HasProperty edges
  • 3,849 edges generated, 3,788 accepted (98.4% acceptance rate)
  • 61 duplicates filtered
  • Targeted at improving false_equivalence template output

### Vocab Expansion (via `expand_vocab.py`)

  • Original vocabulary: 534 words
  • Current vocabulary: 624 words (+90 words added)
  • Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories

### Combined Data Summary

| Dataset | Count |
| --- | --- |
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| Total edges (combined) | 22,316 |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |

## 3. Term Database Statistics

### Vocabulary by Category (36 categories)

| Category | Words | Category | Words |
| --- | --- | --- | --- |
| bird | 97 | fish | 16 |
| animal | 65 | spice | 16 |
| tool | 56 | fruit | 15 |
| plant | 43 | mineral | 14 |
| food | 38 | insect | 14 |
| material | 36 | structure | 13 |
| container | 34 | beverage | 9 |
| instrument | 28 | fabric | 9 |
| landscape | 27 | tree | 8 |
| vegetable | 24 | wood | 7 |
| building | 21 | herb | 7 |
| metal | 19 | rock | 6 |
| flower | 19 | water | 6 |
| vehicle | 18 | furniture | 5 |
| stone | 17 | clothing | 5 |
| weapon | 17 | shelter | 5 |

The remaining four categories (crop, seed, organism, grain) have 3-4 words each.

### Edge Distribution — Original ConceptNet

| Relation | Edges |
| --- | --- |
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |

### Edge Distribution — LLM Augmented

| Relation | Edges |
| --- | --- |
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |

The augmented edges deliberately fill the gaps in the original ConceptNet data. HasProperty went from 422 to 4,407 total — critical for the false_equivalence template.
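
These distribution tables are cheap to reproduce with a one-pass count; a sketch, assuming each relations CSV carries a "relation" column:

```python
# Reproduce the edge-distribution tables above. The "relation" column
# name is an assumption about the CSV layout.
import csv
from collections import Counter

def relation_counts(path: str) -> Counter:
    with open(path, newline="") as f:
        return Counter(row["relation"] for row in csv.DictReader(f))

original = relation_counts("data/folksy_relations.csv")
augmented = relation_counts("data/folksy_relations_augmented.csv")
print((original + augmented)["HasProperty"])  # expected: 4,407 per the text
```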


## 4. Sample Generated Output (30 Sayings)

Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:

  1. An scarf ain't nothing but cotton that met some wool.
  2. The only difference between a hummingbird and a dodo is metabolism.
  3. An salt ain't nothing but ore that met some crystals.
  4. Funny how the earthworm never has enough food for itself.
  5. What's a coop but a kitchen with sound?
  6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
  7. Don't take the wheel and then gripe about the hull.
  8. A bamboo don't come without its water, now does it?
  9. Nobody's got less salsa than the man who makes the mango.
  10. That's like eating the sea and complaining the savanna tastes off.
  11. My daddy always said, can't have waking up in morning without coffee.
  12. Take the bison out of meat and all you've got left is salty taste flesh.
  13. Like baiting the flock and hoping for keep as pet.
  14. The ice's family always goes without cool body.
  15. There's a fella who takes the wax and says the sugar's no good.
  16. That's just holding the drawer and praying for store blanket.
  17. You know what they say, a mica with no schist is just a rough surface rock.
  18. An silver ain't nothing but hairbrushes that met some alloy.
  19. A kite is just a pelican that's got catch wind.
  20. Like making the denim and hoping for material.
  21. The nut feeds everyone's fit bolt but its own.
  22. The pitcher's family always goes without throw fast ball.
  23. A nail is just a weapon that's got smooth length.
  24. You want lid? Well, first you're gonna need container.
  25. Don't build the micrometer and say you ain't got workshop.
  26. Ain't no sleeping at night ever came from nothing — you need bed.
  27. What's a cicada but a lacebug with nocturnal behavior?
  28. Don't drink the dish and then gripe about the gnocchi.
  29. You can't put out a herring and then wonder where all the herringbone came from.
  30. That's just lorikeeting the fruit and praying for breaking wind.

## 5. Quality Assessment

### Rating Summary

I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):

| Rating | Count | % | Description |
| --- | --- | --- | --- |
| Good | 8 | 27% | Sounds natural, humorous, structurally solid |
| Okay | 9 | 30% | Semantically coherent but grammatically rough |
| Bad | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |

### Good Examples (natural-sounding, humorous)

  • "Nobody's got less salsa than the man who makes the mango."
  • "There's a fella who takes the wax and says the sugar's no good."
  • "A bamboo don't come without its water, now does it?"
  • "Don't take the wheel and then gripe about the hull."
  • "Ain't no sleeping at night ever came from nothing — you need bed."
  • "My daddy always said, can't have waking up in morning without coffee."
  • "What's a cicada but a lacebug with nocturnal behavior?"
  • "You can't put out a herring and then wonder where all the herringbone came from."

### Common Issues Identified

1. Article / Grammar Errors (frequent)

  • "An scarf ain't nothing but..." — should be "A scarf"
  • "An silver ain't nothing but..." — should be "Silver"
  • "An salt ain't nothing but..." — should be "Salt"
  • "A have children don't come without..." — broken slot fill leaking action phrase as noun

2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)

  • "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
  • "waking up in morning", "sleeping at night", "salty taste"
  • "breaking wind", "store blanket", "rough surface"
  • These are raw ConceptNet concept IDs that should have been filtered or reformatted

3. Nonsensical Verb Conjugation in Futile Preparation (severe)

  • "lorikeeting the fruit" — lorikeet treated as a verb
  • "fooding the earthworm" — food treated as a verb
  • "jeansing the denim" — jeans treated as a verb
  • "safariing the lion" — safari treated as a verb
  • The _gerund() function applies gerunding to ANY UsedFor target, including nouns
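
A guard before conjugation would catch these. A minimal sketch, assuming a hypothetical curated verb list and a standalone helper (the actual `_gerund()` in folksy_generator.py may be shaped differently):

```python
# Minimal sketch of a verb guard for gerunding. KNOWN_VERBS is a
# hypothetical curated list, not data from the project.
KNOWN_VERBS = {"bait", "eat", "drink", "hold", "make", "spoon", "catch"}

def safe_gerund(phrase: str) -> str | None:
    """Return the -ing form of the head token, or None if it is not a
    known verb (signalling the template to skip this slot fill)."""
    head = phrase.split()[0].lower()
    if head not in KNOWN_VERBS:
        return None                  # "lorikeet", "food", "jeans" fall through
    if head.endswith("e") and not head.endswith("ee"):
        head = head[:-1]             # make -> making
    return head + "ing"
```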

4. LLM Enhancement Artifacts Leaking (moderate)

  • "bridge word: plate" appearing in output text
  • "bridge 2: food" appearing in output text
  • "bridge word: absorption" appearing in output text
  • These are raw LLM response fragments that weren't properly cleaned during Phase 2
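
Stripping these artifacts is a straightforward field scan. A sketch, assuming dropping the affected rows (rather than repairing them) is acceptable; the CSV column layout is an assumption:

```python
# Sketch: drop augmented edges whose fields contain Phase 2 prompt
# artifacts. Output filename and column layout are assumptions.
import csv

ARTIFACTS = ("bridge word", "bridge 2")

with open("data/folksy_relations_augmented.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    clean = [row for row in reader
             if not any(a in (v or "").lower()
                        for v in row.values() for a in ARTIFACTS)]

with open("data/folksy_relations_augmented.clean.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(clean)
```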

5. Semantic Mismatches (occasional)

  • "A lynx is just a earthworm that's got feline." — wrong category siblings
  • "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
  • "A emu is just a ferret that's got walk backwards." — cross-class comparison

### Per-Template Quality Assessment

| Template | Typical Quality | Key Issue |
| --- | --- | --- |
| deconstruction | Okay | Multi-word properties leak; article errors with "An" |
| denial_of_consequences | Good | Best template; LLM artifacts occasionally leak through |
| ironic_deficiency | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| futile_preparation | Bad | Nouns gerunded as verbs; worst template overall |
| hypocritical_complaint | Okay | Some odd part-of relationships; generally coherent structure |
| tautological_wisdom | Good | Simple structure avoids most issues; multi-word phrases still leak |
| false_equivalence | Good | Benefited most from Phase 3 property enrichment |

## 6. Errors, Warnings, and Issues

### No Errors at Runtime

  • Generator runs without crashes on all template types
  • All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
  • JSON output mode produces valid JSONL with complete metadata
  • Fictional entity generation works

### Issues Found

| Severity | Issue | Impact |
| --- | --- | --- |
| High | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| High | Nouns gerunded as verbs in futile_preparation | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
| Medium | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| Medium | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
| Low | No test suite exists | No automated validation of output quality |
| Low | No virtual environment or `requirements.txt` | Only stdlib needed currently, but the corpus generation phase will need dependencies |
| Info | Corpus directory is empty | Expected — corpus generation is the next phase |

## 7. Readiness Assessment for Corpus Generation

### Ready

  • Template engine is functional and produces output across all 7 meta-template families
  • Augmented graph significantly improves vocabulary coverage (22,316 total edges)
  • Vocab expansion added 90 words to cover previously sparse categories
  • JSON output mode with full debug metadata is working — ready for bulk generation
  • Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
  • Fictional entity support is implemented and functional
  • All corpus pipeline scripts exist (generate_raw_batch.sh, polish_corpus.py, filter_corpus.py, format_training_pairs.py, compute_corpus_stats.py)

### Should Fix Before Corpus Generation

  1. Clean Phase 2 artifacts from `folksy_relations_augmented.csv` — grep for "bridge word" and "bridge 2" in the surface_text/end_word fields and remove or repair those edges
  2. Fix futile_preparation gerunding — the `_gerund()` function needs a check that the UsedFor target is actually a verb before conjugating it; alternatively, filter UsedFor targets to verb-like words only
  3. Filter multi-word ConceptNet phrases — the `_short_concepts()` helper caps at 3 words, but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
  4. Fix article logic — the `_a()` function at lines 680-684 emits "An" before consonant-initial words; every observed error ("An scarf", "An silver", "An salt") starts with "s", and mass nouns like "salt" and "silver" should additionally take no article at all (see the sketch below)
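
For fix #4, the helper needs both a sound vowel check and a mass-noun exclusion. A minimal sketch, where MASS_NOUNS is an illustrative hypothetical set and the real `_a()` signature may differ:

```python
# Sketch of corrected article logic for fix #4. MASS_NOUNS is an
# illustrative subset, not the project's actual data.
MASS_NOUNS = {"salt", "silver", "water", "coffee", "wool", "meat"}

def article(word: str) -> str:
    head = word.split()[0].lower()   # multi-word fills use the head token
    if head in MASS_NOUNS:
        return ""                    # "Salt ain't nothing but..."
    return "an " if head[0] in "aeiou" else "a "
```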

### Nice to Have

  • Add a basic test suite — even just smoke tests that confirm each template generates output (see the sketch after this list)
  • Create `requirements.txt` (currently stdlib-only, but the corpus phase will need `requests` at minimum)
  • Review the 3,678 candidate OOV words — none exceeded the frequency threshold of 3+ for auto-addition, but manual review could surface useful additions
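
For the smoke tests suggested above, a stdlib-only sketch, assuming `--template` accepts the seven family names listed in the per-template table in section 5:

```python
# Smoke-test sketch (unittest, stdlib-only). Assumes --template takes
# the seven family names from section 5; adjust if the CLI differs.
import subprocess
import unittest

TEMPLATES = ["deconstruction", "denial_of_consequences", "ironic_deficiency",
             "futile_preparation", "hypocritical_complaint",
             "tautological_wisdom", "false_equivalence"]

class SmokeTest(unittest.TestCase):
    def test_each_template_generates(self):
        for t in TEMPLATES:
            out = subprocess.run(
                ["python3", "folksy_generator.py", "--template", t, "--count", "1"],
                capture_output=True, text=True, check=True)
            self.assertTrue(out.stdout.strip(), f"{t} produced no output")

if __name__ == "__main__":
    unittest.main()
```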

### Overall Verdict

The template generator works but produces rough output. This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be 40-50% due to the issues above.

Fixing the four "Should Fix" items before corpus generation would:

  • Reduce the discard rate (saving LLM compute time)
  • Improve the quality floor of raw output (giving the polish LLM better material to work with)
  • Eliminate artifact contamination that could propagate into training data

The generator is functional but not polished — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.