corpus generation (work from mid february)

2026-03-09 19:52:09 -04:00 · 2026-03-09 19:52:09 -04:00 · 356b62c6ea
commit 356b62c6ea
parent 8c8a058301
16 changed files with 25872 additions and 38 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+*__pycache__
--- a/CORPUS_GENERATION_SPEC.md
+++ b/CORPUS_GENERATION_SPEC.md
@@ -0,0 +1,431 @@
+# Corpus Generation Spec — LLM-Polished Training Data
+
+## Overview
+
+The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B parameter) task-specific model.
+
+The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.
+
+## Infrastructure
+
+```python
+import requests
+
+def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
+    """Chat completion endpoint of local LLM"""
+    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
+        'model': model,
+        'messages': messages
+    }).json()
+```
+
+Same local endpoint as the graph enhancement phase. No cloud APIs.
+
+## Phase 1: Bulk Raw Generation
+
+### Goal
+Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.
+
+### Generation Strategy
+
+Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:
+
+```bash
+# Even coverage across all 7 meta-template families
+for template in deconstruction denial_of_consequences ironic_deficiency \
+               futile_preparation hypocritical_complaint tautological_wisdom \
+               false_equivalence; do
+    python folksy_generator.py --template $template --count 1500 --debug \
+        --output raw_${template}.jsonl
+done
+```
+
+### Output Format
+
+The `--debug` flag is critical. Raw output should be JSONL with the relationship chain preserved:
+
+```json
+{
+  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
+  "meta_template": "deconstruction",
+  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
+  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
+  "chain": [
+    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
+    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
+    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
+  ]
+}
+```
+
+This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions. The final training data needs the meta-template label.
+
+### Deduplication at Generation Time
+
+Before writing each generated saying, check:
+- Exact duplicate raw_text → skip
+- Same (meta_template, slots) tuple → skip (same slot fills, different surface template is fine)
+- Same seed word appeared more than 30 times across the batch → skip (prevents dog/bark saturation)
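
The three skip rules above can be sketched as a small stateful filter. This is a minimal illustration, not the generator's actual code: the class name `GenerationDeduper` is hypothetical, and the caller is assumed to know which word seeded each saying and to pass it in as `seed_word`.

```python
from collections import Counter


class GenerationDeduper:
    """Applies the three generation-time skip rules (hypothetical helper)."""

    def __init__(self, max_seed_uses=30):
        self.max_seed_uses = max_seed_uses
        self.seen_text = set()       # exact raw_text values already written
        self.seen_slots = set()      # (meta_template, slots) tuples already written
        self.seed_usage = Counter()  # accepted sayings per seed word

    def accept(self, entry, seed_word):
        """Return True and record the entry if it passes all three checks."""
        # Sort slot items so the key does not depend on dict ordering.
        slot_key = (entry["meta_template"], tuple(sorted(entry["slots"].items())))
        if entry["raw_text"] in self.seen_text:
            return False
        if slot_key in self.seen_slots:
            return False
        # Allow at most max_seed_uses sayings per seed word.
        if self.seed_usage[seed_word] >= self.max_seed_uses:
            return False
        self.seen_text.add(entry["raw_text"])
        self.seen_slots.add(slot_key)
        self.seed_usage[seed_word] += 1
        return True
```

Because the slot key includes only the meta-template and the slot fills, a second surface template with different fills still passes, as the rule intends.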
+
+## Phase 2: LLM Polish
+
+### Goal
+Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.
+
+### System Prompt
+
+```
+You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.
+
+Your job:
+1. Fix grammar, articles, and pluralization
+2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
+3. Preserve the core nouns and the relationship between them — do not swap out the key words
+4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
+5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
+6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD
+
+Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.
+
+Examples of good polish:
+
+Raw: "Don't build the coffee and act surprised when the water show up."
+Chain: coffee MadeOf water
+Polished: Don't brew the coffee and act surprised when the water's all gone.
+
+Raw: "The chest's children always goes without hold books."
+Chain: chest UsedFor hold_books
+Polished: The bookshelf-maker's kids always end up reading off the floor.
+
+Raw: "A pineapple is just a nectarine that's got an attitude."
+Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
+Polished: A pineapple is just a peach that grew itself some armor.
+
+Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
+Chain: steel MadeOf iron, steel HasProperty hard
+Polished: You know what they say — steel without the iron is just a dream of being hard.
+
+Raw: "Funny how the bamboo never has enough grow very quickly for itself."
+Chain: bamboo CapableOf grow_quickly
+Polished: DISCARD
+
+Raw: "That's just funning the canoe and praying for boiling food."
+Chain: canoe UsedFor transport, fire UsedFor boiling_food
+Polished: DISCARD
+```
+
+### User Prompt Template
+
+```
+Meta-template: {meta_template}
+Relationship chain: {chain_formatted}
+Slot fills: {slots_formatted}
+Raw saying: {raw_text}
+```
+
+### Chain Formatting
+
+Format the chain as a readable string:
+
+```
+bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
+```
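
The chain string and the user prompt template above could be produced along these lines. This is a sketch under assumptions: `format_chain` and `format_slots` are hypothetical helper names, and the `A=bread, B=yeast` layout for `{slots_formatted}` is a guess, since the spec does not define it. `format_polish_prompt` is the name the batch code calls but does not define.

```python
def format_chain(chain):
    """Render edges as 'start --Relation--> end (w:weight)', comma-separated."""
    return ", ".join(
        f"{e['start']} --{e['relation']}--> {e['end']} (w:{e['weight']})"
        for e in chain
    )


def format_slots(slots):
    # Assumed layout ("A=bread, B=yeast"); the spec leaves slots_formatted open.
    return ", ".join(f"{name}={value}" for name, value in slots.items())


def format_polish_prompt(entry):
    """Fill the user prompt template from one raw JSONL entry."""
    return (
        f"Meta-template: {entry['meta_template']}\n"
        f"Relationship chain: {format_chain(entry['chain'])}\n"
        f"Slot fills: {format_slots(entry['slots'])}\n"
        f"Raw saying: {entry['raw_text']}"
    )
```

Weights are printed as Python renders them, so a JSON value of `2.0` comes out as `w:2.0`, while an integer `2` would come out as `w:2`.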
+
+### Batch Processing
+
+```python
+import json
+import time
+
+def polish_batch(input_path, output_path):
+    system_prompt = load_system_prompt()  # The prompt above
+    
+    with open(input_path) as f:
+        raw_entries = [json.loads(line) for line in f]
+    
+    results = []
+    discards = 0
+    
+    for i, entry in enumerate(raw_entries):
+        user_prompt = format_polish_prompt(entry)
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_prompt}
+        ]
+        
+        response = llm_chat_completion(messages)
+        polished = response['choices'][0]['message']['content'].strip()
+        
+        if polished == "DISCARD":
+            discards += 1
+            entry['status'] = 'discarded'
+        else:
+            entry['polished_text'] = polished
+            entry['status'] = 'polished'
+        
+        results.append(entry)
+        
+        if (i + 1) % 100 == 0:
+            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
+            # Write checkpoint
+            save_checkpoint(results, output_path)
+        
+        time.sleep(0.1)  # gentle rate limiting
+    
+    save_final(results, output_path)
+    print(f"Done: {len(results) - discards} polished, {discards} discarded")
+```
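
The batch loop calls `save_checkpoint` without defining it and never reads a checkpoint back. One possible shape for the checkpoint helpers is sketched below, assuming the checkpoint is simply the output JSONL rewritten in full. `load_checkpoint` is a hypothetical addition, not something the spec names.

```python
import json
import os


def save_checkpoint(results, output_path):
    """Write all results so far as JSONL, replacing the file atomically."""
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for entry in results:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    # os.replace avoids leaving a half-written file if the run is interrupted.
    os.replace(tmp_path, output_path)


def load_checkpoint(output_path):
    """Return previously saved entries, or an empty list on a fresh run."""
    if not os.path.exists(output_path):
        return []
    with open(output_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

To resume, a run could start with `results = load_checkpoint(output_path)` and iterate over `raw_entries[len(results):]`, which assumes the input file and its order have not changed between runs.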
+
+### Expected Discard Rate
+
+Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw to end up with 7,000-8,000 polished. If the discard rate after graph enhancement is lower (it should be — better edges = fewer nonsense combos), that's a bonus.
+
+## Phase 3: Deduplication and Quality Filtering
+
+After LLM polish, run automated quality checks before including in the training corpus.
+
+### Automated Filters
+
+```python
+def quality_filter(entry):
+    text = entry['polished_text']
+    
+    # Length check: real proverbs are short
+    if len(text.split()) > 25:
+        return False, "too_long"
+    if len(text.split()) < 5:
+        return False, "too_short"
+    
+    # Must contain at least 2 of the original slot-fill nouns
+    slot_words = set(entry['slots'].values())
+    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
+    if words_present < 2:
+        return False, "lost_key_nouns"
+    
+    # No raw ConceptNet artifacts (multi-word underscore phrases)
+    if '_' in text:
+        return False, "conceptnet_artifact"
+    
+    # No broken templates (unfilled slots)
+    if '{' in text or '}' in text:
+        return False, "unfilled_slot"
+    
+    return True, "pass"
+```
+
+### Near-Duplicate Detection
+
+Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect and keep only one:
+
+```python
+from difflib import SequenceMatcher
+
+def is_near_duplicate(text_a, text_b, threshold=0.75):
+    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
+```
+
+Run pairwise within each meta-template family, not across families: similar nouns in different structures are fine.
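
A greedy per-family pass is one way to apply the pairwise check. The sketch below keeps the first saying seen and drops later near-duplicates within the same family. `drop_near_duplicates` is a hypothetical name, and the comparison is quadratic per family, which should be tolerable at roughly a thousand sayings per family.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold


def drop_near_duplicates(entries, threshold=0.75):
    """Keep the first of each near-duplicate group, compared within a family only."""
    kept_by_family = defaultdict(list)
    kept = []
    for entry in entries:
        family_kept = kept_by_family[entry["meta_template"]]
        if any(
            is_near_duplicate(entry["polished_text"], other["polished_text"], threshold)
            for other in family_kept
        ):
            continue  # too close to a saying already kept in this family
        family_kept.append(entry)
        kept.append(entry)
    return kept
```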
+
+## Phase 4: Training Corpus Formatting
+
+### Goal
+Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.
+
+### Training Pair Schema
+
+Each polished saying generates multiple training pairs with different input framings:
+
+```json
+[
+  {
+    "input": "Tell me something about bread",
+    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
+    "meta_template": "deconstruction",
+    "source_words": ["bread", "yeast", "flour"]
+  },
+  {
+    "input": "Tell me a saying about baking",
+    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
+    "meta_template": "deconstruction",
+    "source_words": ["bread", "yeast", "flour"]
+  },
+  {
+    "input": "What would a farmer say about flour?",
+    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
+    "meta_template": "deconstruction",
+    "source_words": ["bread", "yeast", "flour"]
+  },
+  {
+    "input": "Give me a deconstruction proverb",
+    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
+    "meta_template": "deconstruction",
+    "source_words": ["bread", "yeast", "flour"]
+  }
+]
+```
+
+### Input Framing Types
+
+For each polished saying, generate training pairs with these input patterns:
+
+1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
+2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
+3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
+4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
+5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`
+
+Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.
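
The five framing types could be generated roughly as follows. This is a sketch with several assumptions: the category is supplied by the caller, all slot fills are treated as source words (the schema example lists only the nouns), underscores in the meta-template name are replaced with spaces, and a naive first-letter a/an helper is added so personas such as "innkeeper" read correctly. The `framing` key is an extra field for the Phase 5 statistics and is not part of the schema above.

```python
import random

PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]
OPEN_ENDED = ["Tell me some folk wisdom", "What do they say?", "Give me a proverb"]


def with_article(noun_phrase):
    # Naive a/an choice by first letter; sufficient for the lists used here.
    return ("an " if noun_phrase[0].lower() in "aeiou" else "a ") + noun_phrase


def build_training_pairs(entry, category, rng, n_framings=4):
    """Return n_framings (3 to 5) differently framed pairs for one saying."""
    words = list(entry["slots"].values())
    template_name = entry["meta_template"].replace("_", " ")
    persona = with_article(rng.choice(PERSONAS))
    candidates = [
        ("word_seeded", f"Tell me something about {rng.choice(words)}"),
        ("category_seeded", f"Tell me a saying about {category}"),
        ("persona_seeded", f"What would {persona} say about {rng.choice(words)}?"),
        ("template_seeded", f"Give me {with_article(template_name)} proverb"),
        ("open_ended", rng.choice(OPEN_ENDED)),
    ]
    return [
        {
            "input": text,
            "output": entry["polished_text"],
            "meta_template": entry["meta_template"],
            "source_words": words,
            "framing": kind,  # extra field, not in the spec schema
        }
        for kind, text in rng.sample(candidates, n_framings)
    ]
```

Passing a seeded `random.Random` instance as `rng` keeps the corpus reproducible across re-runs.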
+
+### Fictional Entity Training Pairs
+
+Additionally, generate training pairs that demonstrate fictional entity handling:
+
+```json
+{
+  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
+  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
+}
+```
+
+For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with fictional entities loaded, then polish with GLM4-32B. Target: ~200-300 fictional entity training pairs to teach the pattern without overwhelming the real-word training signal.
+
+### Format for Fiction Entity Input
+
+Standardize how entity descriptions appear in training inputs:
+
+```
+A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
+```
+
+Example:
+```
+A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
+```
+
+This format matches what a game developer or worldbuilder would naturally provide at inference time.
+
+## Phase 5: Corpus Statistics and Validation
+
+### Required Metrics
+
+Before declaring the corpus ready for fine-tuning, compute and report:
+
+```
+Total polished sayings: X
+Discarded during polish: X (Y%)
+Discarded during quality filter: X (Y%)
+Final training pairs: X
+
+Distribution by meta-template:
+  deconstruction:          X (Y%)
+  denial_of_consequences:  X (Y%)
+  ironic_deficiency:       X (Y%)
+  futile_preparation:      X (Y%)
+  hypocritical_complaint:  X (Y%)
+  tautological_wisdom:     X (Y%)
+  false_equivalence:       X (Y%)
+
+Distribution by input framing type:
+  word_seeded:     X
+  category_seeded: X
+  persona_seeded:  X
+  template_seeded: X
+  open_ended:      X
+  fictional:       X
+
+Unique slot words used: X (out of 534 vocab)
+Words never used in any saying: [list]
+Average saying length: X words
+```
+
+### Balance Check
+
+If any meta-template family has less than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.
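
The balance check can be sketched as a share computation over the final pairs. The function name `underrepresented_families` is hypothetical, and the sketch assumes the denominator is all training pairs, including fictional-entity pairs that carry no `meta_template`.

```python
from collections import Counter

FAMILIES = [
    "deconstruction", "denial_of_consequences", "ironic_deficiency",
    "futile_preparation", "hypocritical_complaint", "tautological_wisdom",
    "false_equivalence",
]


def underrepresented_families(pairs, min_share=0.10):
    """Map each family below min_share of all pairs to its actual share."""
    if not pairs:
        return {family: 0.0 for family in FAMILIES}
    counts = Counter(p.get("meta_template") for p in pairs)
    total = len(pairs)
    return {
        family: counts[family] / total
        for family in FAMILIES
        if counts[family] / total < min_share
    }
```

Any family returned here is a candidate for another targeted run of the generator with `--template`.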
+
+### Human Spot-Check
+
+Randomly sample 50 polished sayings (spread across all families) and manually rate each as:
+- **Good:** Sounds natural, funny, could fool someone into thinking it's real
+- **Okay:** Grammatically correct but flat or too literal
+- **Bad:** Awkward, nonsensical, or lost the relationship
+
+Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten quality filters.
+
+## Output Files
+
+### `corpus_raw.jsonl`
+All raw generated sayings with debug metadata. One JSON object per line.
+
+### `corpus_polished.jsonl`
+All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.
+
+### `corpus_filtered.jsonl`
+Only sayings that passed quality filtering. One JSON object per line.
+
+### `training_pairs.jsonl`
+Final training corpus. One JSON object per line:
+```json
+{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
+```
+
+### `corpus_stats.json`
+The metrics from Phase 5.
+
+### `discard_analysis.csv`
+Every discarded saying with its discard reason:
+```
+raw_text, meta_template, discard_stage, discard_reason
+"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
+"The fire's...", ironic_deficiency, quality_filter, "too_short"
+```
+
+This is valuable for debugging the template engine — if a specific template surface variant has a >50% discard rate, the template itself needs fixing.
+
+## File Organization
+
+```
+folksy-generator/
+├── corpus/
+│   ├── corpus_raw.jsonl
+│   ├── corpus_polished.jsonl
+│   ├── corpus_filtered.jsonl
+│   ├── training_pairs.jsonl
+│   ├── corpus_stats.json
+│   └── discard_analysis.csv
+├── scripts/
+│   ├── generate_raw_batch.sh       # Runs generator across all templates
+│   ├── polish_corpus.py            # LLM polish pipeline
+│   ├── filter_corpus.py            # Quality filtering
+│   ├── format_training_pairs.py    # Training pair generation
+│   └── compute_corpus_stats.py     # Metrics and validation
+```
+
+## Execution Timeline
+
+Assuming ~1 second per LLM call on the local 4090:
+
+| Step | Items | Est. Time |
+|------|-------|-----------|
+| Raw generation (template engine only) | 10,500 | ~2 minutes |
+| LLM polish | 10,500 | ~3 hours |
+| Quality filtering | ~7,500 | ~1 minute |
+| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
+| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |
+
+Total: ~3.5 hours of mostly-unattended LLM grinding. The polish step is the bottleneck and fully resumable via checkpointing.
+
+## Integration Notes
+
+### Feeding into Fine-Tuning
+
+The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (HuggingFace Trainer, axolotl, etc.). The 0.5B model training is out of scope for this spec but the corpus format is designed for it.
+
+### Iterative Improvement
+
+This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix is:
+1. Generate more raw sayings targeting the weak area
+2. Polish and filter
+3. Append to training corpus
+4. Re-train
+
+The JSONL format and checkpoint system support this append workflow natively.
--- a/EVALUATION.md
+++ b/EVALUATION.md
@@ -0,0 +1,303 @@
+# Folksy Generator — Evaluation Report
+
+**Date:** 2026-02-17
+**Evaluator:** Claude (automated)
+**Scope:** Post-integration health check after three LLM augmentation phases
+
+---
+
+## 1. Project Structure Overview
+
+```
+folksy-generator/
+├── folksy_generator.py              # Main CLI generator (910 lines)
+├── FOLKSY_GENERATOR_SPEC.md         # Original project spec
+├── GRAPH_ENHANCEMENT_SPEC.md        # LLM graph augmentation spec (Phases 1-3)
+├── CORPUS_GENERATION_SPEC.md        # Corpus generation spec (next phase)
+├── data/
+│   ├── folksy_vocab.csv             # Curated vocabulary (624 words, expanded from 534)
+│   ├── folksy_vocab.csv.bak.*       # Pre-expansion backup (534 words)
+│   ├── folksy_relations.csv         # Original ConceptNet edges (11,096 edges)
+│   ├── folksy_relations_augmented.csv  # LLM-generated edges (11,220 edges)
+│   ├── classified_proverbs.csv      # Labeled real proverbs for reference
+│   ├── candidate_additions.csv      # OOV words suggested by LLM (3,678 unique)
+│   └── enhancement_log.csv          # Processing log for all 3 phases
+├── scripts/
+│   ├── extract_from_conceptnet.py   # One-time ConceptNet extraction (requires psql)
+│   ├── extract_relations.py         # Relation extraction helper
+│   ├── classify_proverbs.py         # Proverb classification
+│   ├── expand_vocab.py              # Phase: vocab expansion (+90 words)
+│   ├── enhance_graph.py             # Phase: LLM edge augmentation
+│   ├── generate_raw_batch.sh        # Bulk generation script
+│   ├── polish_corpus.py             # LLM polish pipeline
+│   ├── filter_corpus.py             # Quality filtering
+│   ├── format_training_pairs.py     # Training pair generation
+│   └── compute_corpus_stats.py      # Corpus statistics
+├── examples/
+│   ├── my_world.json                # Fictional entity examples (5 entities)
+│   └── sample_output.txt            # Pre-integration sample output
+├── schemas/
+│   └── fictional_entities.schema.json
+└── corpus/                          # Empty — not yet populated
+```
+
+**Entry point:** `python3 folksy_generator.py` — no virtual environment, no dependencies beyond Python 3.11 stdlib.
+
+---
+
+## 2. What the Three LLM Integration Phases Produced
+
+Git history shows a single initial commit (`8c8a058 Initial 'folksy idiom' generator`). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in data files.
+
+### Phase 1: Per-Word Relationship Expansion
+- **624 words** processed through GLM4-32B
+- 10,726 edges generated, **1,155 accepted** (10.8% acceptance rate)
+- 9,510 edges rejected as OOV (target words not in folksy vocab)
+- 61 duplicates filtered
+- Filled gaps in `AtLocation`, `UsedFor`, `HasA`, `MadeOf`, `PartOf`, `CapableOf`, `HasPrerequisite`, `Causes`, `HasProperty`
+
+### Phase 2: Cross-Word Relationship Discovery (Bridge Words)
+- **148 low-connectivity words** targeted
+- 6,272 bridge edges accepted
+- This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
+
+### Phase 3: Property Enrichment
+- **624 words** processed for distinctive HasProperty edges
+- 3,849 edges generated, **3,788 accepted** (98.4% acceptance rate)
+- 61 duplicates filtered
+- Targeted at improving `false_equivalence` template output
+
+### Vocab Expansion (via `expand_vocab.py`)
+- Original vocabulary: **534 words**
+- Current vocabulary: **624 words** (+90 words added)
+- Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories
+
+### Combined Data Summary
+
+| Dataset | Count |
+|---------|-------|
+| Original ConceptNet edges | 11,096 |
+| LLM-augmented edges | 11,220 |
+| **Total edges (combined)** | **22,316** |
+| Original vocabulary | 534 |
+| Expanded vocabulary | 624 |
+| Candidate OOV words (not added) | 3,678 |
+
+---
+
+## 3. Term Database Statistics
+
+### Vocabulary by Category (36 categories)
+
+| Category | Words | | Category | Words |
+|----------|-------|-|----------|-------|
+| bird | 97 | | fish | 16 |
+| animal | 65 | | spice | 16 |
+| tool | 56 | | fruit | 15 |
+| plant | 43 | | mineral | 14 |
+| food | 38 | | insect | 14 |
+| material | 36 | | structure | 13 |
+| container | 34 | | beverage | 9 |
+| instrument | 28 | | fabric | 9 |
+| landscape | 27 | | tree | 8 |
+| vegetable | 24 | | wood | 7 |
+| building | 21 | | herb | 7 |
+| metal | 19 | | rock | 6 |
+| flower | 19 | | water | 6 |
+| vehicle | 18 | | furniture | 5 |
+| stone | 17 | | clothing | 5 |
+| weapon | 17 | | shelter | 5 |
+| — | — | | crop, seed, organism, grain | 3-4 each |
+
+### Edge Distribution — Original ConceptNet
+
+| Relation | Edges |
+|----------|-------|
+| AtLocation | 5,294 |
+| UsedFor | 2,481 |
+| CapableOf | 1,138 |
+| ReceivesAction | 485 |
+| HasProperty | 422 |
+| HasA | 307 |
+| HasPrerequisite | 261 |
+| MadeOf | 181 |
+| PartOf | 170 |
+| Others (6 types) | 257 |
+
+### Edge Distribution — LLM Augmented
+
+| Relation | Edges |
+|----------|-------|
+| HasProperty | 3,985 |
+| HasA | 1,719 |
+| PartOf | 1,247 |
+| UsedFor | 1,230 |
+| MadeOf | 1,217 |
+| AtLocation | 1,008 |
+| CapableOf | 288 |
+| HasPrerequisite | 250 |
+| Others (4 types) | 276 |
+
+The augmented edges deliberately fill the gaps in the original ConceptNet data. `HasProperty` went from 422 to 4,407 total — critical for the `false_equivalence` template.
+
+---
+
+## 4. Sample Generated Output (30 Sayings)
+
+Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:
+
+1. An scarf ain't nothing but cotton that met some wool.
+2. The only difference between a hummingbird and a dodo is metabolism.
+3. An salt ain't nothing but ore that met some crystals.
+4. Funny how the earthworm never has enough food for itself.
+5. What's a coop but a kitchen with sound?
+6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
+7. Don't take the wheel and then gripe about the hull.
+8. A bamboo don't come without its water, now does it?
+9. Nobody's got less salsa than the man who makes the mango.
+10. That's like eating the sea and complaining the savanna tastes off.
+11. My daddy always said, can't have waking up in morning without coffee.
+12. Take the bison out of meat and all you've got left is salty taste flesh.
+13. Like baiting the flock and hoping for keep as pet.
+14. The ice's family always goes without cool body.
+15. There's a fella who takes the wax and says the sugar's no good.
+16. That's just holding the drawer and praying for store blanket.
+17. You know what they say, a mica with no schist is just a rough surface rock.
+18. An silver ain't nothing but hairbrushes that met some alloy.
+19. A kite is just a pelican that's got catch wind.
+20. Like making the denim and hoping for material.
+21. The nut feeds everyone's fit bolt but its own.
+22. The pitcher's family always goes without throw fast ball.
+23. A nail is just a weapon that's got smooth length.
+24. You want lid? Well, first you're gonna need container.
+25. Don't build the micrometer and say you ain't got workshop.
+26. Ain't no sleeping at night ever came from nothing — you need bed.
+27. What's a cicada but a lacebug with nocturnal behavior?
+28. Don't drink the dish and then gripe about the gnocchi.
+29. You can't put out a herring and then wonder where all the herringbone came from.
+30. That's just lorikeeting the fruit and praying for breaking wind.
+
+---
+
+## 5. Quality Assessment
+
+### Rating Summary
+
+I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):
+
+| Rating | Count | % | Description |
+|--------|-------|---|-------------|
+| **Good** | 8 | 27% | Sounds natural, humorous, structurally solid |
+| **Okay** | 9 | 30% | Semantically coherent but grammatically rough |
+| **Bad** | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |
+
+### Good Examples (natural-sounding, humorous)
+- "Nobody's got less salsa than the man who makes the mango."
+- "There's a fella who takes the wax and says the sugar's no good."
+- "A bamboo don't come without its water, now does it?"
+- "Don't take the wheel and then gripe about the hull."
+- "Ain't no sleeping at night ever came from nothing — you need bed."
+- "My daddy always said, can't have waking up in morning without coffee."
+- "What's a cicada but a lacebug with nocturnal behavior?"
+- "You can't put out a herring and then wonder where all the herringbone came from."
+
+### Common Issues Identified
+
+#### 1. Article / Grammar Errors (frequent)
+- "An scarf ain't nothing but..." — should be "A scarf"
+- "An silver ain't nothing but..." — should be "Silver"
+- "An salt ain't nothing but..." — should be "Salt"
+- "A have children don't come without..." — broken slot fill leaking action phrase as noun
+
+#### 2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)
+- "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
+- "waking up in morning", "sleeping at night", "salty taste"
+- "breaking wind", "rough surface"
+- These are raw ConceptNet concept IDs that should have been filtered or reformatted
+
+#### 3. Nonsensical Verb Conjugation in Futile Preparation (severe)
+- "lorikeeting the fruit" — `lorikeet` treated as a verb
+- "fooding the earthworm" — `food` treated as a verb
+- "jeansing the denim" — `jeans` treated as a verb
+- "safariing the lion" — `safari` treated as a verb
+- The `_gerund()` function applies gerunding to ANY UsedFor target, including nouns
+
+#### 4. LLM Enhancement Artifacts Leaking (moderate)
+- "bridge word: plate" appearing in output text
+- "bridge 2: **food**" appearing in output text
+- "*bridge word: absorption*" appearing in output text
+- These are raw LLM response fragments that weren't properly cleaned during Phase 2
+
+#### 5. Semantic Mismatches (occasional)
+- "A lynx is just a earthworm that's got feline." — wrong category siblings
+- "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
+- "A emu is just a ferret that's got walk backwards." — cross-class comparison
+
+### Per-Template Quality Assessment
+
+| Template | Typical Quality | Key Issue |
+|----------|----------------|-----------|
+| **deconstruction** | Okay | Multi-word properties leak; article errors with "An" |
+| **denial_of_consequences** | Good | Best template; LLM artifacts occasionally leak through |
+| **ironic_deficiency** | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
+| **futile_preparation** | Bad | Nouns gerunded as verbs; worst template overall |
+| **hypocritical_complaint** | Okay | Some odd part-of relationships; generally coherent structure |
+| **tautological_wisdom** | Good | Simple structure avoids most issues; multi-word phrases still leak |
+| **false_equivalence** | Good | Benefited most from Phase 3 property enrichment |
+
+---
+
+## 6. Errors, Warnings, and Issues
+
+### No Errors at Runtime
+- Generator runs without crashes on all template types
+- All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
+- JSON output mode produces valid JSONL with complete metadata
+- Fictional entity generation works
+
+### Issues Found
+
+| Severity | Issue | Impact |
+|----------|-------|--------|
+| **High** | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
+| **High** | Nouns gerunded as verbs in `futile_preparation` | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
+| **Medium** | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
+| **Medium** | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
+| **Low** | No test suite exists | No automated validation of output quality |
+| **Low** | No virtual environment or requirements.txt | Only stdlib needed currently, but will need deps for corpus generation phase |
+| **Info** | Corpus directory is empty | Expected — corpus generation is the next phase |
+
+---
+
+## 7. Readiness Assessment for Corpus Generation
+
+### Ready
+- Template engine is functional and produces output across all 7 meta-template families
+- Augmented graph significantly improves vocabulary coverage (22,316 total edges)
+- Vocab expansion added 90 words to cover previously sparse categories
+- JSON output mode with full debug metadata is working — ready for bulk generation
+- Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
+- Fictional entity support is implemented and functional
+- All corpus pipeline scripts exist (`generate_raw_batch.sh`, `polish_corpus.py`, `filter_corpus.py`, `format_training_pairs.py`, `compute_corpus_stats.py`)
+
+### Should Fix Before Corpus Generation
+1. **Clean Phase 2 artifacts from `folksy_relations_augmented.csv`** — grep for "bridge word" and "bridge 2" in surface_text/end_word fields and remove or repair those edges
+2. **Fix `futile_preparation` gerunding** — the `_gerund()` function needs a check that the UsedFor target is actually a verb before conjugating it; alternatively, filter UsedFor targets to verb-like words only
+3. **Filter multi-word ConceptNet phrases** — the `_short_concepts()` helper caps at 3 words but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
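+4. **Fix article logic** — the `_a()` function at lines 680-684 only checks the first character, yet output such as "An salt" still appears even though "salt" starts with a consonant; mass nouns like "salt" and "silver" should take no article at all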
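
For the first item, a filter along these lines could separate contaminated edges from clean ones. It is a sketch only: the regular expression is derived from the three artifact examples quoted in section 5, the field names `surface_text` and `end_word` are taken from the item above, and treating any `*` as Markdown residue is an assumption that may be too strict for some data.

```python
import re

# Matches "bridge word:" and "bridge 2:" style fragments from Phase 2 responses.
ARTIFACT_RE = re.compile(r"bridge\s+(?:word|\d+)\s*:", re.IGNORECASE)


def is_artifact(value):
    """True if the text looks like a leaked LLM response fragment."""
    return bool(ARTIFACT_RE.search(value)) or "*" in value


def split_edges(rows, fields=("surface_text", "end_word")):
    """Split edge rows (dicts) into (clean, dirty) by scanning the given fields."""
    clean, dirty = [], []
    for row in rows:
        bad = any(is_artifact(row.get(field) or "") for field in fields)
        (dirty if bad else clean).append(row)
    return clean, dirty
```

The rows could come from `csv.DictReader` over `folksy_relations_augmented.csv`; keeping the dirty rows in a separate list makes it possible to repair them by hand instead of discarding them.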
+
+### Nice to Have
+- Add a basic test suite (even just smoke tests that confirm each template generates output)
+- Create `requirements.txt` (currently stdlib-only, but corpus phase will need `requests` at minimum)
+- Review the 3,678 candidate OOV words — none met the frequency threshold of 3+ for auto-addition, but manual review could find useful additions
+
+### Overall Verdict
+
+**The template generator works but produces rough output.** This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be **40-50%** due to the issues above.
+
+Fixing the four "Should Fix" items before corpus generation would:
+- Reduce the discard rate (saving LLM compute time)
+- Improve the quality floor of raw output (giving the polish LLM better material to work with)
+- Eliminate artifact contamination that could propagate into training data
+
+The generator is **functional but not polished** — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.
--- a/GRAPH_ENHANCEMENT_SPEC.md
+++ b/GRAPH_ENHANCEMENT_SPEC.md
@@ -0,0 +1,318 @@
+# Graph Enhancement Spec — LLM-Augmented Folksy Subgraph
+
+## Overview
+
+The folksy subgraph extracted from ConceptNet (534 words, 11,096 edges) has coverage gaps. Many common folksy words have sparse or heavily skewed edge distributions — "dog" maps almost exclusively to "bark," "horse" collapses to "ride," etc. This produces repetitive output when the generator seeds on these words.
+
+This phase uses the local GLM4-32B model to generate supplementary relationship edges for every word in the folksy vocabulary, expanding the graph's density and diversity while maintaining the typed-edge structure the template engine requires.
+
+## Infrastructure
+
+```python
+import requests
+
+def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
+    """Chat completion endpoint of local LLM"""
+    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
+        'model': model,
+        'messages': messages
+    }).json()
+```
+
+All LLM calls go through this endpoint. No cloud APIs. The model runs locally on the RTX 4090.
+
+## Strategy
+
+For each word in `folksy_vocab.csv`, ask the LLM to generate relationships that ConceptNet is missing or underrepresenting. The LLM output gets parsed into the same edge format as `folksy_relations.csv` and merged into the generator's working dataset.
+
+This is NOT free-form generation. The LLM is constrained to output structured relationship tuples that conform to the existing relation type taxonomy. Think of it as using the LLM as a commonsense knowledge base that supplements ConceptNet, not replaces it.
+
+## Phase 1: Per-Word Relationship Expansion
+
+### Input
+Every word in `folksy_vocab.csv`, plus its existing edges from `folksy_relations.csv`.
+
+### Process
+
+For each word, send a prompt that:
+1. Provides the word and its categories
+2. Lists its EXISTING relationships (so the LLM doesn't duplicate them)
+3. Asks for ADDITIONAL relationships across specific relation types
+4. Constrains output to a parseable structured format
+
+### System Prompt
+
+```
+You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
+
+Rules:
+- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
+- Every relationship must be something a typical adult would agree is true
+- Do not repeat any relationship already listed as "known"
+- Target words should be common English words (top 3000 frequency preferred)
+- Output ONLY the structured format shown below, one relationship per line
+- If you cannot think of good relationships for a given type, output NONE for that type
+- Aim for 3-5 relationships per type where possible
+
+Output format (one per line):
+RELATION_TYPE: target_word | short natural phrasing
+
+Example output:
+AtLocation: barn | you find a horse in a barn
+UsedFor: riding | a horse is used for riding
+HasA: mane | a horse has a mane
+CapableOf: gallop | a horse can gallop
+MadeOf: NONE
+PartOf: herd | a horse is part of a herd
+```
+
+### User Prompt Template
+
+```
+Word: {word}
+Categories: {categories}
+
+Known relationships:
+{existing_edges_formatted}
+
+Generate additional relationships for these types:
+- AtLocation (where is it found?)
+- UsedFor (what is it used for?)
+- HasA (what does it have / contain?)
+- PartOf (what is it part of?)
+- CapableOf (what can it do?)
+- MadeOf (what is it made of?)
+- HasPrerequisite (what do you need before you can have/use it?)
+- Causes (what does it cause or lead to?)
+- HasProperty (what adjectives describe it? — limit to physical/sensory properties)
+```
+
+### Formatting Existing Edges
+
+For the "Known relationships" section, format existing edges as:
+
+```
+AtLocation: pond (weight 1.0), lake (weight 4.47)
+CapableOf: swim (weight 2.0), fly (weight 1.0)
+UsedFor: (none in database)
+```
+
+This shows the LLM what's already covered AND highlights which relation types are empty and most need filling.
+
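A minimal sketch of how the "Known relationships" block could be rendered. `format_existing_edges` and `RELATION_TYPES` are hypothetical names, and the sketch assumes each edge is a dict with the `relation`, `end_word`, and `weight` keys used elsewhere in this spec.

```python
from collections import defaultdict

# Relation types requested in the user prompt template, in prompt order.
RELATION_TYPES = [
    "AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
    "MadeOf", "HasPrerequisite", "Causes", "HasProperty",
]


def format_existing_edges(edges):
    """Render known edges as one line per relation type.

    `edges` is assumed to be a list of dicts with 'relation',
    'end_word' and 'weight' keys (a hypothetical in-memory shape).
    """
    by_relation = defaultdict(list)
    for edge in edges:
        by_relation[edge["relation"]].append(
            (edge["end_word"], float(edge["weight"]))
        )

    lines = []
    for relation in RELATION_TYPES:
        targets = by_relation.get(relation)
        if targets:
            rendered = ", ".join(
                f"{word} (weight {weight})" for word, weight in targets
            )
        else:
            # Empty types are listed explicitly so the LLM sees the gap.
            rendered = "(none in database)"
        lines.append(f"{relation}: {rendered}")
    return "\n".join(lines)
```

Listing every relation type, including the empty ones, keeps the prompt layout identical across words, which may make the model's output easier to parse.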
+### Parsing LLM Output
+
+```python
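+import re
+
+# Relation types requested in the user prompt template above
+VALID_RELATIONS = {
+    "AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
+    "MadeOf", "HasPrerequisite", "Causes", "HasProperty",
+}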
+
+def parse_llm_relations(response_text, source_word):
+    """Parse structured LLM output into edge tuples."""
+    edges = []
+    for line in response_text.strip().split('\n'):
+        line = line.strip()
+        if not line or 'NONE' in line:
+            continue
+        match = re.match(r'^(\w+):\s*(\w+)\s*\|\s*(.+)$', line)
+        if match:
+            relation, target, surface = match.groups()
+            # Validate relation type
+            if relation in VALID_RELATIONS:
+                edges.append({
+                    'start_word': source_word,
+                    'end_word': target.strip().lower(),
+                    'relation': relation,
+                    'weight': 0.8,  # LLM-generated edges get a default weight below ConceptNet minimum
+                    'surface_text': surface.strip(),
+                    'source': 'llm_augmented'
+                })
+    return edges
+```
+
+### Weight Assignment
+
+LLM-generated edges get a default weight of **0.8** — deliberately below the ConceptNet minimum threshold of 1.0. This means:
+- They fill gaps and add diversity
+- They lose ties to ConceptNet edges (real data preferred when both exist)
+- They can be filtered out easily if needed (`weight >= 1.0` restores pure ConceptNet)
+- The generator can optionally boost or penalize LLM edges via a CLI flag
+
+### Deduplication
+
+Before merging, check each LLM-generated edge against existing edges:
+- If (start_word, end_word, relation) already exists → skip
+- If end_word is not in folksy_vocab → add to a `candidate_additions.csv` for review, but do NOT auto-add to vocab (avoids graph bloat)
+- If end_word IS in folksy_vocab → add edge to `folksy_relations_augmented.csv`
+
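The three deduplication rules could be sketched as follows. `split_new_edges` is a hypothetical helper, and it assumes existing edges are available as a set of `(start_word, end_word, relation)` triples and the vocab as a set of words.

```python
def split_new_edges(new_edges, existing_triples, vocab_words):
    """Route LLM edges into accepted edges and out-of-vocab candidates.

    new_edges:        list of edge dicts as produced by the parser
    existing_triples: set of (start_word, end_word, relation) already known
    vocab_words:      set of words currently in the folksy vocab
    """
    accepted, candidates = [], []
    seen = set(existing_triples)
    for edge in new_edges:
        triple = (edge["start_word"], edge["end_word"], edge["relation"])
        if triple in seen:
            # Already in the graph, or repeated within this LLM response.
            continue
        seen.add(triple)
        if edge["end_word"] in vocab_words:
            accepted.append(edge)    # goes to folksy_relations_augmented.csv
        else:
            candidates.append(edge)  # goes to candidate_additions.csv
    return accepted, candidates
```

Tracking `seen` across the loop also drops repeats inside a single response, which the rules above do not mention explicitly but which seems consistent with their intent.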
+## Phase 2: Cross-Word Relationship Discovery
+
+After per-word expansion, run a second pass that specifically targets 2-hop paths. The goal is to find bridge words that connect otherwise-isolated clusters.
+
+### Process
+
+1. Identify word pairs that are in the same category but have no path of length ≤ 2 between them
+2. For a sample of these pairs, ask the LLM what connects them
+
+### Prompt for Bridge Discovery
+
+System prompt:
+```
+You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
+
+Rules:
+- The bridge word must be a common, concrete noun
+- State the relationship type for each connection
+- Output format: BRIDGE: bridge_word | TYPE from first word: short phrasing | TYPE for second word: short phrasing | explanation
+
+Example:
+Words: "cow" and "butter"
+BRIDGE: milk | CapableOf from cow: a cow produces milk | MadeOf for butter: butter is made of milk | milk connects production to product
+```
+
+User prompt:
+```
+Words: "{word_a}" and "{word_b}"
+Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
+Find 1-3 bridge words that connect them.
+```
+
+### Candidate Selection
+
+Don't run this for all pairs — that's O(n²) on 534 words. Instead:
+
+1. Build the current 2-hop reachability matrix
+2. Identify words with LOW 2-hop reachability (few or no 2-hop paths to other folksy words)
+3. For each low-connectivity word, pick 5-10 random same-category words it can't reach
+4. Run bridge discovery on those pairs
+5. Target: ensure every word in the vocab has at least 3 distinct 2-hop paths to other vocab words
+
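A rough sketch of steps 1 and 2, under two simplifying assumptions that are not in the spec: edges are treated as undirected, and connectivity is measured as the number of distinct vocab words reachable through one bridge word rather than the number of distinct paths. The function names are hypothetical.

```python
from collections import defaultdict


def two_hop_counts(edges, vocab_words):
    """Count distinct vocab words reachable via one intermediate word.

    edges:       iterable of (start_word, end_word) pairs
    vocab_words: iterable of vocab words to score
    """
    vocab = set(vocab_words)
    neighbors = defaultdict(set)
    for start, end in edges:
        # Assumption: direction is ignored for reachability purposes.
        neighbors[start].add(end)
        neighbors[end].add(start)

    counts = {}
    for word in vocab:
        reachable = set()
        for bridge in neighbors.get(word, ()):
            reachable.update(neighbors[bridge])
        reachable.discard(word)  # a word does not count as its own target
        counts[word] = len(reachable & vocab)
    return counts


def low_connectivity_words(counts, minimum=3):
    """Words whose two-hop reach falls below the target minimum."""
    return sorted(word for word, n in counts.items() if n < minimum)
```

For 534 words this adjacency-set approach avoids materializing a full matrix; a dense matrix would also be small enough to work if that is more convenient.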
+## Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates
+
+The `false_equivalence` meta-template needs HasProperty edges, which are sparse in ConceptNet for concrete nouns. Run a targeted property-extraction pass.
+
+### Prompt
+
+System prompt:
+```
+You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
+
+Rules:
+- Only physical/sensory/behavioral properties, not abstract qualities
+- Properties should DISTINGUISH this thing from similar things in its category
+- Output one property per line as: PROPERTY | brief explanation
+- Aim for 5-8 properties
+```
+
+User prompt:
+```
+Word: {word}
+Category: {categories}
+Other words in same category: {same_category_sample}
+
+What properties distinguish {word} from the others listed?
+```
+
+Including same-category peers in the prompt encourages the LLM to generate *differentiating* properties rather than generic ones. "Has legs" is useless for a horse because every animal has legs. "Has a mane" differentiates it.
+
+### Output Format
+
+```
+fast | horses are known for running fast
+tall | horses are tall compared to most farm animals
+mane | horses have a distinctive mane
+shod | horses wear horseshoes
+```
+
+These get stored as HasProperty edges in the augmented relations file.
+
+## Output Files
+
+### `folksy_relations_augmented.csv`
+Same schema as `folksy_relations.csv` with additional columns:
+
+```
+start_word, end_word, relation, weight, surface_text, source
+corn, chicken, UsedFor, 1.0, "Corn is used for feeding chickens", conceptnet
+dog, porch, AtLocation, 0.8, "you find a dog on a porch", llm_augmented
+horse, mane, HasA, 0.8, "a horse has a mane", llm_augmented
+```
+
+The `source` column allows filtering: `source=conceptnet` for pure ConceptNet, `source=llm_augmented` for LLM additions, or both for the full enhanced graph.
+
+### `candidate_additions.csv`
+Words that appeared in LLM output but aren't in the current folksy vocab:
+
+```
+word, suggested_by, relation_context, frequency
+mane, horse, "HasA: a horse has a mane", 2
+bridle, horse, "HasA: a horse has a bridle", 1
+```
+
+The `frequency` column counts how many different source words suggested this target. High-frequency words are the strongest candidates for addition to the folksy vocab. Review them manually or apply a threshold (e.g., suggested by 3+ different words → auto-add).
+
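One way the `frequency` column might be computed, assuming the out-of-vocab edges are kept as dicts with `start_word` and `end_word` keys. Both function names and the default threshold argument are illustrative.

```python
from collections import defaultdict


def count_candidate_frequency(oov_edges):
    """Map each out-of-vocab target to the number of distinct source words."""
    suggested_by = defaultdict(set)
    for edge in oov_edges:
        suggested_by[edge["end_word"]].add(edge["start_word"])
    return {word: len(sources) for word, sources in suggested_by.items()}


def auto_add_candidates(frequencies, threshold=3):
    """Targets suggested by at least `threshold` different source words."""
    return sorted(word for word, n in frequencies.items() if n >= threshold)
```

Counting distinct source words, rather than raw edge counts, keeps one talkative source word from inflating a candidate's score.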
+### `enhancement_log.csv`
+Track what was processed and what the LLM produced:
+
+```
+source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov
+dog, 2025-02-15T10:30:00, 24, 18, 3, 3
+horse, 2025-02-15T10:30:45, 31, 22, 5, 4
+```
+
+## Execution Plan
+
+### Batch Processing
+
+534 words × ~1 second per LLM call = ~9 minutes for Phase 1. Very manageable.
+
+```python
+import csv
+import time
+
+def process_all_words(vocab_path, relations_path, output_path):
+    vocab = load_vocab(vocab_path)
+    relations = load_relations(relations_path)
+    all_new_edges = []
+    
+    for i, word_entry in enumerate(vocab):
+        word = word_entry['word']
+        categories = word_entry['categories']
+        existing = get_edges_for_word(relations, word)
+        
+        messages = build_expansion_prompt(word, categories, existing)
+        response = llm_chat_completion(messages)
+        response_text = response['choices'][0]['message']['content']
+        
+        new_edges = parse_llm_relations(response_text, word)
+        new_edges = deduplicate(new_edges, existing)
+        all_new_edges.extend(new_edges)
+        
+        if (i + 1) % 50 == 0:
+            print(f"Processed {i+1}/{len(vocab)} words, {len(all_new_edges)} new edges so far")
+        
+        time.sleep(0.1)  # gentle rate limiting
+    
+    save_augmented_relations(all_new_edges, output_path)
+```
+
+### Resumability
+
+Write a checkpoint file after each word so the process can resume if interrupted. The enhancement_log.csv serves this purpose — skip any word that already has an entry.
+
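A minimal sketch of log-based resumption, assuming the log is written with the column names shown for `enhancement_log.csv` and no spaces after the commas. `load_processed_words`, `append_log_row`, and `LOG_FIELDS` are hypothetical names.

```python
import csv
from pathlib import Path

LOG_FIELDS = [
    "source_word", "timestamp", "edges_generated",
    "edges_accepted", "edges_duplicate", "edges_oov",
]


def load_processed_words(log_path):
    """Return the source words that already have a row in the log."""
    path = Path(log_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        return {row["source_word"] for row in csv.DictReader(f)}


def append_log_row(log_path, row):
    """Append one row, writing the header first if the file is new or empty."""
    path = Path(log_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

In the batch loop, the set returned by `load_processed_words` would be checked before each LLM call, and `append_log_row` called after that word's edges have been written, so an interrupted run loses at most one word of work.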
+### Validation Pass
+
+After all LLM edges are generated, run a quick validation:
+1. No self-loops (start_word == end_word)
+2. All relation types are in the valid set
+3. No duplicate (start, end, relation) triples
+4. Distribution check: flag any word that got 0 new edges (the LLM output may have failed to parse)
+5. Spot-check 20 random LLM edges manually for sanity
+
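The checks above could be bundled into a single report, sketched here under the assumption that edges are dicts with `start_word`, `end_word`, and `relation` keys. `validate_edges` is a hypothetical helper, and the manual spot-check is reduced to drawing a random sample for a human to read.

```python
import random
from collections import Counter


def validate_edges(edges, valid_relations, vocab_words, sample_size=20):
    """Collect problems found in the LLM-generated edges."""
    self_loops = [e for e in edges if e["start_word"] == e["end_word"]]
    bad_relations = [e for e in edges if e["relation"] not in valid_relations]

    triple_counts = Counter(
        (e["start_word"], e["end_word"], e["relation"]) for e in edges
    )
    duplicates = sorted(t for t, n in triple_counts.items() if n > 1)

    # Words with no new edges may indicate a response that failed to parse.
    words_with_edges = {e["start_word"] for e in edges}
    no_new_edges = sorted(set(vocab_words) - words_with_edges)

    # Random sample for manual review; it is not checked automatically.
    spot_check = random.sample(edges, min(sample_size, len(edges)))

    return {
        "self_loops": self_loops,
        "bad_relations": bad_relations,
        "duplicates": duplicates,
        "no_new_edges": no_new_edges,
        "spot_check": spot_check,
    }
```

Returning a report instead of raising lets the caller decide which findings block the merge and which only warrant a warning.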
+## Integration with Generator
+
+The generator's data loading should be updated to:
+
+1. Load `folksy_relations.csv` (original ConceptNet edges)
+2. If `folksy_relations_augmented.csv` exists, load and merge it
+3. CLI flag: `--pure-conceptnet` to disable LLM-augmented edges
+4. CLI flag: `--llm-weight-boost 0.2` to adjust LLM edge weights at runtime (default 0, meaning they keep their 0.8 weight)
+
+This keeps the original ConceptNet data pristine and the augmentation fully reversible.
--- a/data/candidate_additions.csv
+++ b/data/candidate_additions.csv
--- a/data/enhancement_log.csv
+++ b/data/enhancement_log.csv
--- a/data/folksy_relations_augmented.csv
+++ b/data/folksy_relations_augmented.csv
--- a/data/folksy_vocab.csv
+++ b/data/folksy_vocab.csv
@ -533,3 +533,93 @@ oxpecker,bird,0.0,4,0
 bowerbird,bird,0.0,3,0
 condor,bird,0.0,3,0
 gladiola,flower,0.0,3,0
+metal,metal,0.80,0,0
+soil,mineral,0.80,0,0
+beak,animal,0.80,0,0
+feather,"bird,material",0.80,0,0
+plant,plant,0.80,0,0
+forest,"landscape,tree",0.80,0,0
+food,food,0.80,0,0
+wing,bird,0.80,0,0
+seed,"seed,plant",0.80,0,0
+kitchen,"building,structure",0.80,0,0
+handle,tool,0.80,0,0
+tail,animal,0.80,0,0
+leaf,plant,0.80,0,0
+bone,"animal,material",0.80,0,0
+flesh,"animal,food",0.80,0,0
+flock,animal,0.80,0,0
+field,"landscape,crop",0.80,0,0
+fur,"animal,material",0.80,0,0
+workshop,"building,structure",0.80,0,0
+meat,"animal,food",0.80,0,0
+fiber,"plant,material",0.80,0,0
+farm,"structure,landscape",0.80,0,0
+skin,"animal,material",0.80,0,0
+leg,"animal,tool",0.80,0,0
+flower,"flower,plant",0.80,0,0
+ground,landscape,0.80,0,0
+petal,"flower,plant",0.80,0,0
+muscle,"organism,animal",0.80,0,0
+shade,"landscape,plant",0.80,0,0
+ocean,"water,landscape",0.80,0,0
+medicine,"herb,plant",0.80,0,0
+rubber,"material,fabric",0.80,0,0
+mineral,"mineral,stone",0.80,0,0
+toolbox,"tool,container",0.80,0,0
+land,landscape,0.80,0,0
+bird,"bird,animal",0.80,0,0
+lid,"container,tool",0.80,0,0
+bouquet,"flower,plant",0.80,0,0
+ceramic,"material,container",0.80,0,0
+lake,"water,landscape",0.80,0,0
+fat,"animal,food",0.80,0,0
+body,"organism,animal",0.80,0,0
+house,"shelter,building",0.80,0,0
+furniture,"furniture,structure",0.80,0,0
+concrete,"material,stone",0.80,0,0
+jewelry,material,0.80,0,0
+fruit,fruit,0.80,0,0
+fin,"animal,fish",0.80,0,0
+container,container,0.80,0,0
+branch,"plant,wood",0.80,0,0
+earth,"landscape,mineral",0.80,0,0
+fuel,material,0.80,0,0
+ore,"mineral,metal",0.80,0,0
+fireplace,"structure,tool",0.80,0,0
+dust,material,0.80,0,0
+door,"furniture,structure",0.80,0,0
+window,structure,0.80,0,0
+mouth,"animal,insect",0.80,0,0
+string,material,0.80,0,0
+fabric,fabric,0.80,0,0
+sugar,"food,spice",0.80,0,0
+trigger,"tool,weapon",0.80,0,0
+key,tool,0.80,0,0
+brick,"container,material,stone",0.80,0,0
+stone,"rock,stone",0.80,0,0
+mountain,"landscape,rock",0.80,0,0
+juice,"beverage,food",0.80,0,0
+cage,"structure,tool",0.80,0,0
+head,"animal,insect",0.80,0,0
+grain,grain,0.80,0,0
+home,"building,shelter",0.80,0,0
+crystal,"mineral,rock",0.80,0,0
+engine,"tool,vehicle",0.80,0,0
+hammer,"tool,weapon",0.80,0,0
+aquarium,container,0.80,0,0
+tooth,animal,0.80,0,0
+river,"water,landscape",0.80,0,0
+grassland,"landscape,plant",0.80,0,0
+sea,"water,landscape",0.80,0,0
+dessert,food,0.80,0,0
+wheel,"tool,vehicle",0.80,0,0
+needle,tool,0.80,0,0
+jungle,"landscape,plant",0.80,0,0
+blood,organism,0.80,0,0
+oil,"beverage,mineral",0.80,0,0
+mouthpiece,tool,0.80,0,0
+claw,animal,0.80,0,0
+spout,tool,0.80,0,0
+savanna,"landscape,plant",0.80,0,0
+desert,landscape,0.80,0,0
--- a/folksy_generator.py
+++ b/folksy_generator.py
@ -212,26 +212,45 @@ class Deconstruction(MetaTemplate):

        # Find what A is made of / requires
        ingredients = []
+        ingredient_rels = []  # track which relation found each ingredient
        for rel in ("MadeOf", "HasPrerequisite", "HasA"):
-            ingredients.extend(_short_concepts(self.graph.neighbors(a, rel, min_weight=0.5)))
+            found = _short_concepts(self.graph.neighbors(a, rel, min_weight=0.5))
+            for item in found:
+                ingredients.append(item)
+                ingredient_rels.append(rel)

        if len(ingredients) < 2:
            for rel in ("MadeOf", "HasPrerequisite"):
                for (start, w, s) in self.graph.reverse.get((a, rel), []):
                    if len(start.split("_")) <= 2:
                        ingredients.append((start, w, s))
+                        ingredient_rels.append(rel)

        if len(ingredients) < 2:
            return None, None

-        random.shuffle(ingredients)
-        b_word = _readable(ingredients[0][0])
-        d_word = _readable(ingredients[1][0])
+        # Shuffle together
+        combined = list(zip(ingredients, ingredient_rels))
+        random.shuffle(combined)
+        ingredients, ingredient_rels = zip(*combined)
+
+        b_edge = ingredients[0]
+        b_word = _readable(b_edge[0])
+        b_rel = ingredient_rels[0]
+        d_edge = ingredients[1]
+        d_word = _readable(d_edge[0])
+        d_rel = ingredient_rels[1]

        # Find a property for D
+        chain_edges = [
+            {"start": a, "relation": b_rel, "end": b_edge[0], "weight": b_edge[1], "surface_text": b_edge[2]},
+            {"start": a, "relation": d_rel, "end": d_edge[0], "weight": d_edge[1], "surface_text": d_edge[2]},
+        ]
        props = self.graph.neighbors(ingredients[1][0], "HasProperty")
        if props:
-            c_word = _readable(random.choice(props)[0])
+            c_prop = random.choice(props)
+            c_word = _readable(c_prop[0])
+            chain_edges.append({"start": d_edge[0], "relation": "HasProperty", "end": c_prop[0], "weight": c_prop[1], "surface_text": c_prop[2]})
        else:
            c_word = random.choice(["plain", "sorry", "old", "humble", "dry", "wet", "cold"])

@ -242,6 +261,7 @@ class Deconstruction(MetaTemplate):
            "template_family": self.id,
            "template": template,
            "chain": f"{a} MadeOf/Has [{b_word}, {d_word}]; {d_word} HasProperty {c_word}",
+            "chain_edges": chain_edges,
            "slots": {"A": a, "B": b_word, "C": c_word, "D": d_word},
        }
        return saying, debug
@ -265,23 +285,31 @@ class DenialOfConsequences(MetaTemplate):
            return None, None

        # What is found at A? (reverse: B AtLocation A)
-        attracted = []
+        attracted = []  # (word, weight, surface_text, relation)
        for (b, w, s) in self.graph.reverse.get((a, "AtLocation"), []):
-            attracted.append((b, w))
+            attracted.append((b, w, s, "AtLocation"))

        # Also: what does A attract/cause?
        for rel in ("Causes", "CausesDesire"):
            for (b, w, s) in self.graph.edges.get((a, rel), []):
-                attracted.append((b, w))
+                attracted.append((b, w, s, rel))

        if not attracted:
            for (bridge, target, w1, w2) in self.graph.two_hop(a, "UsedFor", "AtLocation"):
-                attracted.append((target, w1 + w2))
+                attracted.append((target, w1 + w2, "", "AtLocation"))

        if not attracted:
            return None, None

-        b_word = _readable(random.choice(attracted)[0])
+        b_choice = random.choice(attracted)
+        b_word = _readable(b_choice[0])
+
+        chain_edges = [
+            {"start": b_choice[0] if b_choice[3] == "AtLocation" else a,
+             "relation": b_choice[3],
+             "end": a if b_choice[3] == "AtLocation" else b_choice[0],
+             "weight": b_choice[1], "surface_text": b_choice[2]},
+        ]

        create_verbs = {
            "pond": "dig", "birdhouse": "hang", "fence": "build", "trap": "set",
@ -301,6 +329,7 @@ class DenialOfConsequences(MetaTemplate):
            "template_family": self.id,
            "template": template,
            "chain": f"{b_word} AtLocation {a}; {a} created by {c_word}",
+            "chain_edges": chain_edges,
            "slots": {"A": a, "B": b_word, "C": c_word},
        }
        return saying, debug
@ -324,14 +353,21 @@ class IronicDeficiency(MetaTemplate):
            return None, None

        products = []
+        product_rels = []
        for rel in ("UsedFor", "CapableOf", "Causes"):
-            products.extend(self.graph.neighbors(a, rel, min_weight=0.5))
+            found = self.graph.neighbors(a, rel, min_weight=0.5)
+            for item in found:
+                products.append(item)
+                product_rels.append(rel)

-        products = _short_concepts(products)
-        if not products:
+        # Filter to short concepts while keeping rel tracking
+        filtered = [(p, r) for p, r in zip(products, product_rels) if len(p[0].split("_")) <= 3]
+        if not filtered:
            return None, None

-        x_word = _readable(random.choice(products)[0])
+        choice_idx = random.randrange(len(filtered))
+        x_edge, x_rel = filtered[choice_idx]
+        x_word = _readable(x_edge[0])

        family_members = ["wife", "children", "household", "family", "own kind"]
        f_word = random.choice(family_members)
@ -339,10 +375,15 @@ class IronicDeficiency(MetaTemplate):
        template = self._pick_template()
        saying = template.format(A=a, X=x_word, F=f_word)

+        chain_edges = [
+            {"start": a, "relation": x_rel, "end": x_edge[0], "weight": x_edge[1], "surface_text": x_edge[2]},
+        ]
+
        debug = {
            "template_family": self.id,
            "template": template,
            "chain": f"{a} UsedFor/Produces {x_word}; irony: {a} lacks {x_word}",
+            "chain_edges": chain_edges,
            "slots": {"A": a, "X": x_word, "F": f_word},
        }
        return saying, debug
@ -371,7 +412,12 @@ class FutilePreparation(MetaTemplate):
        if not uses:
            return None, None

-        action_word = random.choice(uses)[0]
+        action_edge = random.choice(uses)
+        action_word = action_edge[0]
+
+        chain_edges = [
+            {"start": seed, "relation": "UsedFor", "end": action_edge[0], "weight": action_edge[1], "surface_text": action_edge[2]},
+        ]

        # Find a different outcome in a related domain via 2-hop
        outcomes = []
@ -392,7 +438,8 @@ class FutilePreparation(MetaTemplate):
        if not outcomes:
            return None, None

-        y_word = random.choice(outcomes)[0]
+        y_choice = random.choice(outcomes)
+        y_word = y_choice[0]

        gerund = _gerund(action_word)
        verb = _readable(action_word)
@ -405,6 +452,7 @@ class FutilePreparation(MetaTemplate):
            "template_family": self.id,
            "template": template,
            "chain": f"{seed} UsedFor {action_word}; different domain: {y_word}",
+            "chain_edges": chain_edges,
            "slots": {"seed": seed, "action": action_word, "Y": y_word},
        }
        return saying, debug
@ -430,21 +478,37 @@ class HypocriticalComplaint(MetaTemplate):

        # Find parts of Z
        parts = []
+        part_rels = []
        for rel in ("HasA", "PartOf", "MadeOf"):
-            parts.extend(_short_concepts(self.graph.neighbors(z, rel, min_weight=0.5)))
+            found = _short_concepts(self.graph.neighbors(z, rel, min_weight=0.5))
+            for item in found:
+                parts.append(item)
+                part_rels.append(rel)
            for (start, w, s) in self.graph.reverse.get((z, "PartOf"), []):
                if len(start.split("_")) <= 2:
                    parts.append((start, w, s))
+                    part_rels.append("PartOf")
            for (start, w, s) in self.graph.reverse.get((z, "HasA"), []):
                if len(start.split("_")) <= 2:
                    parts.append((start, w, s))
+                    part_rels.append("HasA")

        if len(parts) < 2:
            return None, None

-        random.shuffle(parts)
-        x_word = _readable(parts[0][0])
-        y_word = _readable(parts[1][0])
+        combined = list(zip(parts, part_rels))
+        random.shuffle(combined)
+        parts, part_rels = zip(*combined)
+
+        x_edge = parts[0]
+        x_word = _readable(x_edge[0])
+        y_edge = parts[1]
+        y_word = _readable(y_edge[0])
+
+        chain_edges = [
+            {"start": z, "relation": part_rels[0], "end": x_edge[0], "weight": x_edge[1], "surface_text": x_edge[2]},
+            {"start": z, "relation": part_rels[1], "end": y_edge[0], "weight": y_edge[1], "surface_text": y_edge[2]},
+        ]

        consume_verbs = ["eat", "drink", "take", "pick", "use up", "grab"]
        verb = random.choice(consume_verbs)
@ -456,6 +520,7 @@ class HypocriticalComplaint(MetaTemplate):
            "template_family": self.id,
            "template": template,
            "chain": f"{x_word} PartOf/HasA {z}; {y_word} PartOf/HasA {z}",
+            "chain_edges": chain_edges,
            "slots": {"Z": z, "X": x_word, "Y": y_word, "verb": verb},
        }
        return saying, debug
@ -480,19 +545,25 @@ class TautologicalWisdom(MetaTemplate):
            return None, None

        # seed HasPrerequisite/Causes something
+        # Store (x_word, y_word, weight, edge_info) where edge_info captures the raw edge
        chains = []
        for (target, w, s) in self.graph.edges.get((seed, "HasPrerequisite"), []):
-            chains.append((_readable(target), seed, w))  # X=prereq, Y=seed
+            chains.append((_readable(target), seed, w,
+                           {"start": seed, "relation": "HasPrerequisite", "end": target, "weight": w, "surface_text": s}))
        for (target, w, s) in self.graph.edges.get((seed, "Causes"), []):
-            chains.append((seed, _readable(target), w))   # X=seed, Y=effect
+            chains.append((seed, _readable(target), w,
+                           {"start": seed, "relation": "Causes", "end": target, "weight": w, "surface_text": s}))
        # Also: what does seed require?
        for (source, w, s) in self.graph.reverse.get((seed, "HasPrerequisite"), []):
-            chains.append((seed, _readable(source), w))
+            chains.append((seed, _readable(source), w,
+                           {"start": source, "relation": "HasPrerequisite", "end": seed, "weight": w, "surface_text": s}))

        if not chains:
            return None, None

-        x_word, y_word, _ = random.choice(chains)
+        choice = random.choice(chains)
+        x_word, y_word = choice[0], choice[1]
+        chain_edge = choice[3]

        template = self._pick_template()
        saying = template.format(X=x_word, Y=y_word)
@ -501,6 +572,7 @@ class TautologicalWisdom(MetaTemplate):
            "template_family": self.id,
            "template": template,
            "chain": f"{x_word} -> {y_word} (prerequisite/cause)",
+            "chain_edges": [chain_edge],
            "slots": {"X": x_word, "Y": y_word},
        }
        return saying, debug
@ -543,15 +615,22 @@ class FalseEquivalence(MetaTemplate):
        a_props = _short_concepts(self.graph.neighbors(a, "HasProperty"), max_words=2)
        b_props = set(p[0] for p in self.graph.neighbors(b_word, "HasProperty"))

+        chain_edges = []
        differentiators = [p for p in a_props if p[0] not in b_props]
        if differentiators:
-            p_word = _readable(random.choice(differentiators)[0])
+            p_edge = random.choice(differentiators)
+            p_word = _readable(p_edge[0])
+            chain_edges.append({"start": a, "relation": "HasProperty", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
        elif a_props:
-            p_word = _readable(random.choice(a_props)[0])
+            p_edge = random.choice(a_props)
+            p_word = _readable(p_edge[0])
+            chain_edges.append({"start": a, "relation": "HasProperty", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
        else:
            a_caps = self.graph.neighbors(a, "CapableOf")
            if a_caps:
-                p_word = _readable(random.choice(a_caps)[0])
+                p_edge = random.choice(a_caps)
+                p_word = _readable(p_edge[0])
+                chain_edges.append({"start": a, "relation": "CapableOf", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
            else:
                p_word = random.choice(["ambition", "an attitude", "a plan", "patience"])

@ -562,6 +641,7 @@ class FalseEquivalence(MetaTemplate):
            "template_family": self.id,
            "template": template,
            "chain": f"{a} IsA same category as {b_word}; {a} HasProperty {p_word}",
+            "chain_edges": chain_edges,
            "slots": {"A": a, "B": b_word, "P": p_word},
        }
        return saying, debug
@ -621,7 +701,10 @@ TEMPLATE_REGISTRY = {

 def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
                 debug=False, max_retries=20):
-    """Generate a single folksy saying."""
+    """Generate a single folksy saying.
+
+    When debug=True, always returns (saying, debug_dict) with chain_edges included.
+    """
    for _ in range(max_retries):
        if template_id:
            tid = template_id
@ -631,7 +714,7 @@ def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
        cls = TEMPLATE_REGISTRY.get(tid)
        if not cls:
            print(f"Unknown template: {tid}", file=sys.stderr)
-            return None
+            return None, None

        tmpl = cls(graph)
        saying, dbg = tmpl.generate(seed_word=seed_word, seed_category=seed_category)
@ -643,6 +726,16 @@ def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
    return None, None


+def _get_seed_word(dbg):
+    """Extract the primary seed word from debug slots for dedup tracking."""
+    slots = dbg.get("slots", {})
+    # Templates use different slot names for the seed
+    for key in ("A", "Z", "seed", "X"):
+        if key in slots:
+            return slots[key]
+    return None
+
+
 def main():
    parser = argparse.ArgumentParser(
        description="Generate folksy fake-proverbs using ConceptNet relationships."
@ -655,8 +748,13 @@ def main():
    parser.add_argument("--count", "-n", type=int, default=1, help="Number of sayings to generate")
    parser.add_argument("--output", "-o", help="Output file (default: stdout)")
    parser.add_argument("--debug", "-d", action="store_true", help="Show relationship chain debug info")
+    parser.add_argument("--json", action="store_true", help="Output JSONL format with full metadata")
    parser.add_argument("--vocab", help="Path to folksy_vocab.csv")
    parser.add_argument("--relations", help="Path to folksy_relations.csv")
+    parser.add_argument("--pure-conceptnet", action="store_true",
+                        help="Skip loading augmented relations file")
+    parser.add_argument("--llm-weight-boost", type=float, default=0.0,
+                        help="Boost weight of LLM-augmented edges with weight < 1.0 (default: 0.0)")
    parser.add_argument("--list-templates", action="store_true", help="List available templates")
    parser.add_argument("--list-categories", action="store_true", help="List available categories")

@ -679,6 +777,30 @@ def main():
        print("Run scripts/extract_from_conceptnet.py first to generate data files.", file=sys.stderr)
        sys.exit(1)

+    # Load augmented relations if available
+    if not args.pure_conceptnet:
+        augmented_path = DATA_DIR / "folksy_relations_augmented.csv"
+        if augmented_path.exists():
+            boost = args.llm_weight_boost
+            with open(augmented_path, newline="", encoding="utf-8") as f:
+                reader = csv.DictReader(f)
+                count = 0
+                for row in reader:
+                    sw = row["start_word"]
+                    ew = row["end_word"]
+                    rel = row["relation"]
+                    w = float(row["weight"])
+                    if w < 1.0 and boost:
+                        w = min(w + boost, 1.0)
+                    surf = row.get("surface_text", "")
+                    graph.edges[(sw, rel)].append((ew, w, surf))
+                    graph.reverse[(ew, rel)].append((sw, w, surf))
+                    graph.all_edges[sw].append((ew, rel, w))
+                    graph.all_edges[ew].append((sw, rel, w))
+                    count += 1
+            if count:
+                print(f"Loaded {count} augmented edges.", file=sys.stderr)
+
    if args.list_categories:
        for cat in sorted(graph.by_category.keys()):
            print(f"  {cat:20s} ({len(graph.by_category[cat])} words)")
@ -688,26 +810,96 @@ def main():
    if args.entities:
        graph.merge_fictional(args.entities)

+    # JSON mode implies debug internally
+    use_debug = args.debug or args.json
+
    # Generate
    out = open(args.output, "w", encoding="utf-8") if args.output else sys.stdout
    try:
-        for i in range(args.count):
+        if args.count > 1:
+            # Deduplication tracking for batch mode
+            seen_text = set()
+            seen_slots = set()
+            seed_usage = defaultdict(int)
+            generated = 0
+            max_outer_attempts = args.count * 10  # generous outer limit
+            attempts = 0
+
+            while generated < args.count and attempts < max_outer_attempts:
+                attempts += 1
+                saying, dbg = generate_one(
+                    graph,
+                    template_id=args.template,
+                    seed_word=args.seed,
+                    seed_category=args.category,
+                    debug=use_debug,
+                )
+                if not saying:
+                    continue
+
+                # Dedup checks (rejected duplicates still count toward max_outer_attempts)
+                if saying in seen_text:
+                    continue
+
+                if dbg:
+                    slots_key = (dbg["template_family"], frozenset(dbg["slots"].items()))
+                    if slots_key in seen_slots:
+                        continue
+
+                    seed_w = _get_seed_word(dbg)
+                    if seed_w and seed_usage[seed_w] >= 30:  # cap sayings per seed word
+                        continue
+                    if seed_w:
+                        seed_usage[seed_w] += 1
+                    seen_slots.add(slots_key)
+
+                seen_text.add(saying)
+                generated += 1
+
+                if args.json and dbg:
+                    record = {
+                        "raw_text": saying,
+                        "meta_template": dbg["template_family"],
+                        "surface_template": dbg["template"],
+                        "slots": dbg["slots"],
+                        "chain": dbg.get("chain_edges", []),
+                    }
+                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
+                else:
+                    out.write(saying + "\n")
+                    if args.debug and dbg:
+                        out.write(f"  [DEBUG] family={dbg['template_family']}\n")
+                        out.write(f"  [DEBUG] chain: {dbg['chain']}\n")
+                        out.write(f"  [DEBUG] slots: {dbg['slots']}\n")
+                        out.write("\n")
+        else:
+            # Single generation (no dedup needed)
            saying, dbg = generate_one(
                graph,
                template_id=args.template,
                seed_word=args.seed,
                seed_category=args.category,
-                debug=args.debug,
+                debug=use_debug,
            )
            if saying:
-                out.write(saying + "\n")
-                if args.debug and dbg:
-                    out.write(f"  [DEBUG] family={dbg['template_family']}\n")
-                    out.write(f"  [DEBUG] chain: {dbg['chain']}\n")
-                    out.write(f"  [DEBUG] slots: {dbg['slots']}\n")
-                    out.write("\n")
+                if args.json and dbg:
+                    record = {
+                        "raw_text": saying,
+                        "meta_template": dbg["template_family"],
+                        "surface_template": dbg["template"],
+                        "slots": dbg["slots"],
+                        "chain": dbg.get("chain_edges", []),
+                    }
+                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
+                else:
+                    out.write(saying + "\n")
+                    if args.debug and dbg:
+                        out.write(f"  [DEBUG] family={dbg['template_family']}\n")
+                        out.write(f"  [DEBUG] chain: {dbg['chain']}\n")
+                        out.write(f"  [DEBUG] slots: {dbg['slots']}\n")
+                        out.write("\n")
            else:
-                out.write(f"(failed to generate saying #{i+1} after retries)\n")
+                out.write("(failed to generate saying after retries)\n")
    finally:
        if args.output:
            out.close()
--- a/scripts/compute_corpus_stats.py
+++ b/scripts/compute_corpus_stats.py
@ -0,0 +1,213 @@
+#!/usr/bin/env python3
+"""Compute corpus statistics and validation metrics.
+
+Reads corpus files and computes counts, distributions, coverage, and balance warnings.
+
+Usage:
+  python scripts/compute_corpus_stats.py
+  python scripts/compute_corpus_stats.py --corpus-dir corpus/
+"""
+
+import argparse
+import csv
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_DIR = SCRIPT_DIR.parent
+DATA_DIR = PROJECT_DIR / "data"
+
+
+def load_jsonl(path):
+    """Load a JSONL file."""
+    entries = []
+    if not path.exists():
+        return entries
+    with open(path, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                entries.append(json.loads(line))
+    return entries
+
+
+def classify_input_type(inp):
+    """Classify the input framing type of a training pair."""
+    if inp.startswith("Tell me something about"):
+        return "word_seeded"
+    elif inp.startswith("Tell me a saying about"):
+        return "category_seeded"
+    elif inp.startswith("What would a"):
+        return "persona_seeded"
+    # Open-ended prefixes go first; "Give me a proverb" would otherwise be
+    # caught by the broader template_seeded check below.
+    elif inp.startswith(("Tell me some folk", "What do they", "Give me a proverb",
+                         "Share some", "What's a good")):
+        return "open_ended"
+    elif inp.startswith("Give me a") and "proverb" in inp:
+        return "template_seeded"
+    else:
+        return "fictional"
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Compute corpus statistics.")
+    parser.add_argument("--corpus-dir", default=str(PROJECT_DIR / "corpus"),
+                        help="Corpus directory")
+    parser.add_argument("--output", default=None,
+                        help="Output JSON file (default: corpus_dir/corpus_stats.json)")
+    args = parser.parse_args()
+
+    corpus_dir = Path(args.corpus_dir)
+    output_path = Path(args.output) if args.output else corpus_dir / "corpus_stats.json"
+
+    # Load all corpus files
+    raw = load_jsonl(corpus_dir / "corpus_raw.jsonl")
+    polished = load_jsonl(corpus_dir / "corpus_polished.jsonl")
+    filtered = load_jsonl(corpus_dir / "corpus_filtered.jsonl")
+    training = load_jsonl(corpus_dir / "training_pairs.jsonl")
+
+    # Load vocab for coverage analysis
+    vocab_words = set()
+    vocab_path = DATA_DIR / "folksy_vocab.csv"
+    if vocab_path.exists():
+        with open(vocab_path, newline="", encoding="utf-8") as f:
+            for row in csv.DictReader(f):
+                vocab_words.add(row["word"])
+
+    stats = {}
+
+    # --- Raw corpus stats ---
+    stats["raw_count"] = len(raw)
+    raw_by_template = Counter(e.get("meta_template", "unknown") for e in raw)
+    stats["raw_by_template"] = dict(sorted(raw_by_template.items()))
+
+    # --- Polish stats ---
+    polished_entries = [e for e in polished if e.get("status") == "polished"]
+    discarded_entries = [e for e in polished if e.get("status") == "discarded"]
+    error_entries = [e for e in polished if e.get("status") == "error"]
+
+    stats["polished_count"] = len(polished_entries)
+    stats["discarded_during_polish"] = len(discarded_entries)
+    stats["errors_during_polish"] = len(error_entries)
+    if polished_entries or discarded_entries:
+        total_processed = len(polished_entries) + len(discarded_entries)
+        stats["polish_discard_rate"] = f"{len(discarded_entries)/total_processed*100:.1f}%"
+
+    polish_by_template = Counter(e.get("meta_template", "unknown") for e in polished_entries)
+    stats["polished_by_template"] = dict(sorted(polish_by_template.items()))
+
+    discard_by_template = Counter(e.get("meta_template", "unknown") for e in discarded_entries)
+    stats["discarded_by_template"] = dict(sorted(discard_by_template.items()))
+
+    # --- Filter stats ---
+    stats["filtered_count"] = len(filtered)
+
+    filter_by_template = Counter(e.get("meta_template", "unknown") for e in filtered)
+    stats["filtered_by_template"] = dict(sorted(filter_by_template.items()))
+
+    # Filter discard count
+    stats["discarded_during_filter"] = len(polished_entries) - len(filtered)
+
+    # --- Training pairs stats ---
+    stats["training_pair_count"] = len(training)
+
+    training_by_template = Counter(e.get("meta_template", "unknown") for e in training)
+    stats["training_by_template"] = dict(sorted(training_by_template.items()))
+
+    input_type_counts = Counter(classify_input_type(e.get("input", "")) for e in training)
+    stats["training_by_input_type"] = dict(sorted(input_type_counts.items()))
+
+    # --- Coverage analysis ---
+    used_words = set()
+    for entry in filtered:
+        slots = entry.get("slots", {})
+        for v in slots.values():
+            word = v.lower().replace(" ", "_")
+            if word in vocab_words:
+                used_words.add(word)
+
+    stats["unique_slot_words_used"] = len(used_words)
+    stats["total_vocab_words"] = len(vocab_words)
+    stats["vocab_coverage"] = f"{len(used_words)/len(vocab_words)*100:.1f}%" if vocab_words else "N/A"
+
+    never_used = sorted(vocab_words - used_words)
+    stats["words_never_used"] = never_used
+    stats["words_never_used_count"] = len(never_used)
+
+    # --- Saying length stats ---
+    lengths = []
+    for entry in filtered:
+        text = entry.get("polished_text", "")
+        if text:
+            lengths.append(len(text.split()))
+
+    if lengths:
+        stats["avg_saying_length_words"] = round(sum(lengths) / len(lengths), 1)
+        stats["min_saying_length_words"] = min(lengths)
+        stats["max_saying_length_words"] = max(lengths)
+
+    # --- Balance warnings ---
+    warnings = []
+    if filtered:
+        total_filtered = len(filtered)
+        for template, count in filter_by_template.items():
+            pct = count / total_filtered * 100
+            if pct < 10:
+                warnings.append(
+                    f"WARNING: {template} has only {count} entries ({pct:.1f}%) — "
+                    f"below 10% threshold. Generate more raw sayings for this family."
+                )
+
+    if training:
+        total_training = len(training)
+        for template, count in training_by_template.items():
+            pct = count / total_training * 100
+            if pct < 5:
+                warnings.append(
+                    f"WARNING: {template} has only {count} training pairs ({pct:.1f}%) — very underrepresented."
+                )
+
+    stats["balance_warnings"] = warnings
+
+    # --- Write output ---
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w", encoding="utf-8") as f:
+        json.dump(stats, f, indent=2, ensure_ascii=False)
+
+    # --- Print summary ---
+    print("=" * 60)
+    print("CORPUS STATISTICS")
+    print("=" * 60)
+
+    print(f"\nRaw sayings:           {stats['raw_count']}")
+    print(f"Polished sayings:      {stats['polished_count']}")
+    print(f"Discarded (polish):    {stats.get('discarded_during_polish', 0)} ({stats.get('polish_discard_rate', 'N/A')})")
+    print(f"Discarded (filter):    {stats.get('discarded_during_filter', 0)}")
+    print(f"Final filtered:        {stats['filtered_count']}")
+    print(f"Training pairs:        {stats['training_pair_count']}")
+
+    print(f"\nDistribution by meta-template (filtered):")
+    for t, c in sorted(filter_by_template.items()):
+        pct = c / len(filtered) * 100 if filtered else 0
+        print(f"  {t:30s} {c:5d} ({pct:5.1f}%)")
+
+    print(f"\nDistribution by input framing type:")
+    for t, c in sorted(input_type_counts.items()):
+        print(f"  {t:20s} {c:5d}")
+
+    print(f"\nVocab coverage: {stats['vocab_coverage']} ({stats['unique_slot_words_used']}/{stats['total_vocab_words']})")
+    print(f"Average saying length: {stats.get('avg_saying_length_words', 'N/A')} words")
+
+    if warnings:
+        print(f"\nBalance warnings:")
+        for w in warnings:
+            print(f"  {w}")
+
+    print(f"\nFull stats: {output_path}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/enhance_graph.py
+++ b/scripts/enhance_graph.py
@ -0,0 +1,787 @@
+#!/usr/bin/env python3
+"""LLM-augmented graph enhancement for the folksy subgraph.
+
+Three phases:
+  Phase 1: Per-word relationship expansion
+  Phase 2: Cross-word bridge discovery
+  Phase 3: Property enrichment for false_equivalence templates
+
+Usage:
+  python scripts/enhance_graph.py --phase 1        # Run phase 1 only
+  python scripts/enhance_graph.py --phase 2        # Run phase 2 only
+  python scripts/enhance_graph.py --phase 3        # Run phase 3 only
+  python scripts/enhance_graph.py --all             # Run all phases
+  python scripts/enhance_graph.py --phase 1 --dry-run  # Print prompts without calling LLM
+"""
+
+import argparse
+import csv
+import os
+import random
+import re
+import sys
+import time
+from collections import defaultdict
+from datetime import datetime
+from pathlib import Path
+
+# Paths
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_DIR = SCRIPT_DIR.parent
+DATA_DIR = PROJECT_DIR / "data"
+
+LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
+LLM_MODEL = "THUDM-GLM4-32B"
+
+VALID_RELATIONS = {
+    "AtLocation", "MadeOf", "PartOf", "UsedFor", "HasA", "HasProperty",
+    "Causes", "HasPrerequisite", "CapableOf", "ReceivesAction", "Desires",
+    "CausesDesire", "LocatedNear", "CreatedBy", "MotivatedByGoal", "HasSubevent",
+}
+
+AUGMENTED_CSV = DATA_DIR / "folksy_relations_augmented.csv"
+CANDIDATE_CSV = DATA_DIR / "candidate_additions.csv"
+LOG_CSV = DATA_DIR / "enhancement_log.csv"
+
+# ---------------------------------------------------------------------------
+# Infrastructure
+# ---------------------------------------------------------------------------
+
+def llm_chat_completion(messages, max_retries=3):
+    """Return the reply text from the local LLM, retrying with exponential backoff; None if all attempts fail."""
+    import requests
+
+    for attempt in range(max_retries):
+        try:
+            resp = requests.post(LLM_ENDPOINT, json={
+                "model": LLM_MODEL,
+                "messages": messages,
+            }, timeout=120)
+            resp.raise_for_status()
+            data = resp.json()
+            return data["choices"][0]["message"]["content"]
+        except Exception as e:
+            wait = (2 ** attempt)
+            print(f"  LLM call failed (attempt {attempt+1}/{max_retries}): {e}", file=sys.stderr)
+            if attempt < max_retries - 1:
+                print(f"  Retrying in {wait}s...", file=sys.stderr)
+                time.sleep(wait)
+            else:
+                print("  Giving up on this request.", file=sys.stderr)
+                return None
+
+
+def load_vocab():
+    """Load folksy vocabulary."""
+    vocab = {}
+    with open(DATA_DIR / "folksy_vocab.csv", newline="", encoding="utf-8") as f:
+        for row in csv.DictReader(f):
+            word = row["word"]
+            cats = [c.strip() for c in row["categories"].split(",") if c.strip()]
+            vocab[word] = {
+                "categories": cats,
+                "tangibility": float(row.get("tangibility_score", 0)),
+                "edge_count": int(row.get("conceptnet_edge_count", 0)),
+            }
+    return vocab
+
+
+def load_relations():
+    """Load existing relations (ConceptNet + any existing augmented)."""
+    edges = defaultdict(list)  # (start, relation) -> [(end, weight, surface)]
+    existing_triples = set()   # (start, end, relation) for dedup
+
+    for path in [DATA_DIR / "folksy_relations.csv", AUGMENTED_CSV]:
+        if not path.exists():
+            continue
+        with open(path, newline="", encoding="utf-8") as f:
+            for row in csv.DictReader(f):
+                sw = row["start_word"]
+                ew = row["end_word"]
+                rel = row["relation"]
+                if not row["weight"]: continue  # skip rows with an empty weight field
+                w = float(row["weight"])
+                surf = row.get("surface_text", "")
+                edges[(sw, rel)].append((ew, w, surf))
+                existing_triples.add((sw, ew, rel))
+
+    return edges, existing_triples
+
+
+def load_checkpoint():
+    """Load enhancement log to determine what's already been processed."""
+    processed = set()  # (word, phase)
+    if LOG_CSV.exists():
+        with open(LOG_CSV, newline="", encoding="utf-8") as f:
+            for row in csv.DictReader(f):
+                processed.add((row["source_word"], row["phase"]))
+    return processed
+
+
+def append_log(word, phase, edges_generated, edges_accepted, edges_duplicate, edges_oov):
+    """Append a row to the enhancement log."""
+    write_header = not LOG_CSV.exists()
+    with open(LOG_CSV, "a", newline="", encoding="utf-8") as f:
+        writer = csv.writer(f)
+        if write_header:
+            writer.writerow(["source_word", "phase", "timestamp",
+                             "edges_generated", "edges_accepted", "edges_duplicate", "edges_oov"])
+        writer.writerow([word, phase, datetime.now().isoformat(),
+                         edges_generated, edges_accepted, edges_duplicate, edges_oov])
+
+
+def append_augmented_edges(edges):
+    """Append edges to the augmented relations CSV."""
+    write_header = not AUGMENTED_CSV.exists()
+    with open(AUGMENTED_CSV, "a", newline="", encoding="utf-8") as f:
+        writer = csv.writer(f)
+        if write_header:
+            writer.writerow(["start_word", "end_word", "relation", "weight", "surface_text", "source"])
+        for e in edges:
+            writer.writerow([e["start_word"], e["end_word"], e["relation"],
+                             e["weight"], e["surface_text"], e["source"]])
+
+
+def append_candidates(candidates):
+    """Append candidate words to the candidate additions CSV."""
+    write_header = not CANDIDATE_CSV.exists()
+    with open(CANDIDATE_CSV, "a", newline="", encoding="utf-8") as f:
+        writer = csv.writer(f)
+        if write_header:
+            writer.writerow(["word", "suggested_by", "relation_context", "frequency"])
+        for c in candidates:
+            writer.writerow([c["word"], c["suggested_by"], c["relation_context"], c["frequency"]])
+
+
+# ---------------------------------------------------------------------------
+# Parsing
+# ---------------------------------------------------------------------------
+
+def parse_llm_relations(response_text, source_word):
+    """Parse structured LLM output into edge dicts.
+
+    Handles bullets, numbering, extra whitespace, multi-word targets.
+    """
+    edges = []
+    if not response_text:
+        return edges
+
+    for line in response_text.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+
+        # Strip leading bullets/numbers: "- ", "1. ", "* ", etc.
+        line = re.sub(r"^[\d]+[.)]\s*", "", line)
+        line = re.sub(r"^[-*•]\s*", "", line)
+        line = line.strip()
+
+        if not line or re.search(r":\s*NONE\b", line, re.IGNORECASE):
+            continue
+
+        # Match: RELATION_TYPE: target_word(s) | surface text
+        match = re.match(r"^(\w+):\s*(.+?)\s*\|\s*(.+)$", line)
+        if not match:
+            continue
+
+        relation, target_raw, surface = match.groups()
+        relation = relation.strip()
+
+        if relation not in VALID_RELATIONS:
+            continue
+
+        # Normalize target: lowercase, replace spaces with underscores for multi-word
+        target = target_raw.strip().lower()
+        target = re.sub(r"\s+", "_", target)
+
+        # Skip self-loops
+        if target == source_word:
+            continue
+
+        edges.append({
+            "start_word": source_word,
+            "end_word": target,
+            "relation": relation,
+            "weight": 0.8,
+            "surface_text": surface.strip(),
+            "source": "llm_augmented",
+        })
+
+    return edges
+
+
+def parse_bridge_response(response_text, word_a, word_b):
+    """Parse bridge discovery LLM output."""
+    edges = []
+    if not response_text:
+        return edges
+
+    for line in response_text.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+
+        # Strip common prefixes
+        line = re.sub(r"^[\d]+[.)]\s*", "", line)
+        line = re.sub(r"^[-*•]\s*", "", line)
+        line = re.sub(r"^BRIDGE:\s*", "", line, flags=re.IGNORECASE)
+        line = line.strip()
+
+        if not line:
+            continue
+
+        # BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
+        parts = [p.strip() for p in line.split("|")]
+        if len(parts) < 3:
+            continue
+
+        bridge_word = parts[0].strip().lower().replace(" ", "_")
+
+        # Parse relation_to_first
+        rel1_match = re.search(r"(?:relation_to_first|first):\s*(\w+)", parts[1], re.IGNORECASE)
+        rel2_match = re.search(r"(?:relation_to_second|second):\s*(\w+)", parts[2], re.IGNORECASE)
+
+        if not rel1_match or not rel2_match:
+            # Try simpler format: just the relation type
+            rel1_match = re.match(r"(\w+)", parts[1].split(":")[-1].strip())
+            rel2_match = re.match(r"(\w+)", parts[2].split(":")[-1].strip())
+
+        if not rel1_match or not rel2_match:
+            continue
+
+        rel1 = rel1_match.group(1)
+        rel2 = rel2_match.group(1)
+
+        if rel1 not in VALID_RELATIONS or rel2 not in VALID_RELATIONS:
+            continue
+
+        explanation = parts[3].strip() if len(parts) > 3 else ""
+
+        # Create edges: word_a -> bridge and bridge -> word_b
+        edges.append({
+            "start_word": word_a,
+            "end_word": bridge_word,
+            "relation": rel1,
+            "weight": 0.8,
+            "surface_text": explanation,
+            "source": "llm_bridge",
+        })
+        edges.append({
+            "start_word": bridge_word,
+            "end_word": word_b,
+            "relation": rel2,
+            "weight": 0.8,
+            "surface_text": explanation,
+            "source": "llm_bridge",
+        })
+
+    return edges
+
+
+def parse_property_response(response_text, word):
+    """Parse property enrichment LLM output."""
+    edges = []
+    if not response_text:
+        return edges
+
+    for line in response_text.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+
+        line = re.sub(r"^[\d]+[.)]\s*", "", line)
+        line = re.sub(r"^[-*•]\s*", "", line)
+        line = line.strip()
+
+        if not line:
+            continue
+
+        # PROPERTY | explanation
+        parts = [p.strip() for p in line.split("|")]
+        if len(parts) < 1:
+            continue
+
+        prop = parts[0].strip().lower().replace(" ", "_")
+        explanation = parts[1].strip() if len(parts) > 1 else f"{word} is {prop}"
+
+        if not prop or prop == word:
+            continue
+
+        edges.append({
+            "start_word": word,
+            "end_word": prop,
+            "relation": "HasProperty",
+            "weight": 0.8,
+            "surface_text": explanation,
+            "source": "llm_property",
+        })
+
+    return edges
+
+
+# ---------------------------------------------------------------------------
+# Phase 1: Per-Word Expansion
+# ---------------------------------------------------------------------------
+
+PHASE1_SYSTEM = """You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
+
+Rules:
+- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
+- Every relationship must be something a typical adult would agree is true
+- Do not repeat any relationship already listed as "known"
+- Target words should be common English words (top 3000 frequency preferred)
+- Output ONLY the structured format shown below, one relationship per line
+- If you cannot think of good relationships for a given type, output NONE for that type
+- Aim for 3-5 relationships per type where possible
+
+Output format (one per line):
+RELATION_TYPE: target_word | short natural phrasing
+
+Example output:
+AtLocation: barn | you find a horse in a barn
+UsedFor: riding | a horse is used for riding
+HasA: mane | a horse has a mane
+CapableOf: gallop | a horse can gallop
+MadeOf: NONE
+PartOf: herd | a horse is part of a herd"""
+
+
+PHASE1_USER = """Word: {word}
+Categories: {categories}
+
+Known relationships:
+{existing_edges}
+
+Generate additional relationships for these types:
+- AtLocation (where is it found?)
+- UsedFor (what is it used for?)
+- HasA (what does it have / contain?)
+- PartOf (what is it part of?)
+- CapableOf (what can it do?)
+- MadeOf (what is it made of?)
+- HasPrerequisite (what do you need before you can have/use it?)
+- Causes (what does it cause or lead to?)
+- HasProperty (what adjectives describe it? — limit to physical/sensory properties)"""
+
+
+def format_existing_edges(edges_dict, word):
+    """Format existing edges for a word grouped by relation type."""
+    relation_types = ["AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
+                      "MadeOf", "HasPrerequisite", "Causes", "HasProperty"]
+
+    lines = []
+    for rel in relation_types:
+        targets = edges_dict.get((word, rel), [])
+        if targets:
+            formatted = ", ".join(f"{t[0]} (weight {t[1]:.1f})" for t in targets[:10])
+            lines.append(f"{rel}: {formatted}")
+        else:
+            lines.append(f"{rel}: (none in database)")
+    return "\n".join(lines)
+
+
+def run_phase1(vocab, edges, existing_triples, checkpoint, dry_run=False):
+    """Phase 1: Per-word relationship expansion."""
+    words = sorted(vocab.keys())
+    total = len(words)
+    total_accepted = 0
+    total_skipped = 0
+
+    print(f"Phase 1: Processing {total} words...")
+
+    for i, word in enumerate(words):
+        if (word, "1") in checkpoint:
+            total_skipped += 1
+            continue
+
+        categories = ", ".join(vocab[word]["categories"])
+        existing = format_existing_edges(edges, word)
+
+        user_prompt = PHASE1_USER.format(
+            word=word, categories=categories, existing_edges=existing
+        )
+
+        messages = [
+            {"role": "system", "content": PHASE1_SYSTEM},
+            {"role": "user", "content": user_prompt},
+        ]
+
+        if dry_run:
+            if i < 3:  # Show first 3 prompts
+                print(f"\n--- Prompt for '{word}' ---")
+                print(f"System: {PHASE1_SYSTEM[:200]}...")
+                print(f"User:\n{user_prompt}")
+            elif i == 3:
+                print(f"\n... ({total - 3} more words) ...")
+            continue
+
+        response = llm_chat_completion(messages)
+        parsed = parse_llm_relations(response, word) if response else []
+
+        # Classify edges
+        accepted = []
+        candidates = []
+        duplicates = 0
+
+        for edge in parsed:
+            triple = (edge["start_word"], edge["end_word"], edge["relation"])
+            if triple in existing_triples:
+                duplicates += 1
+                continue
+
+            existing_triples.add(triple)
+
+            if edge["end_word"] in vocab:
+                accepted.append(edge)
+            else:
+                candidates.append({
+                    "word": edge["end_word"],
+                    "suggested_by": word,
+                    "relation_context": f"{edge['relation']}: {edge['surface_text']}",
+                    "frequency": 1,
+                })
+
+        if accepted:
+            append_augmented_edges(accepted)
+            # Also update in-memory edges for subsequent words
+            for e in accepted:
+                edges[(e["start_word"], e["relation"])].append(
+                    (e["end_word"], e["weight"], e["surface_text"]))
+
+        if candidates:
+            append_candidates(candidates)
+
+        total_accepted += len(accepted)
+
+        append_log(word, "1", len(parsed), len(accepted), duplicates, len(candidates))
+
+        if (i + 1) % 50 == 0:
+            print(f"  [{i+1}/{total}] {total_accepted} edges accepted so far")
+
+        time.sleep(0.1)
+
+    if dry_run:
+        print(f"\nDry run complete. Would process {total - total_skipped} words.")
+    else:
+        print(f"\nPhase 1 complete: {total_accepted} new edges accepted.")
+
+
+# ---------------------------------------------------------------------------
+# Phase 2: Cross-Word Bridge Discovery
+# ---------------------------------------------------------------------------
+
+PHASE2_SYSTEM = """You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
+
+Rules:
+- The bridge word must be a common, concrete noun
+- State the relationship type for each connection
+- Valid relationship types: AtLocation, UsedFor, HasA, PartOf, CapableOf, MadeOf, HasPrerequisite, Causes, HasProperty, ReceivesAction, Desires, CausesDesire, LocatedNear, CreatedBy
+- Output format: BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
+
+Example:
+Words: "cow" and "butter"
+milk | relation_to_first: CapableOf | relation_to_second: MadeOf | milk connects production to product"""
+
+
+PHASE2_USER = """Words: "{word_a}" and "{word_b}"
+Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
+Find 1-3 bridge words that connect them."""
+
+
+def build_reachability(vocab, edges):
+    """Build 2-hop reachability from vocab words to other vocab words."""
+    vocab_set = set(vocab.keys())
+
+    # Index in-vocab neighbors by start word once, instead of rescanning
+    # every (start, relation) key for each word and each hop.
+    neighbors = defaultdict(set)
+    for (sw, _rel), targets in edges.items():
+        neighbors[sw].update(ew for (ew, _w, _s) in targets if ew in vocab_set)
+
+    reachable = defaultdict(set)  # word -> set of reachable vocab words
+    for word in vocab:
+        for ew in neighbors.get(word, ()):
+            if ew == word:
+                continue
+            reachable[word].add(ew)
+            # 2-hop from this neighbor
+            reachable[word].update(ew2 for ew2 in neighbors.get(ew, ()) if ew2 != word)
+
+    return reachable
+
+
+def run_phase2(vocab, edges, existing_triples, checkpoint, dry_run=False):
+    """Phase 2: Cross-word bridge discovery."""
+    print("Phase 2: Building reachability matrix...")
+    reachable = build_reachability(vocab, edges)
+
+    # Find low-connectivity words
+    vocab_set = set(vocab.keys())
+    low_connectivity = []
+    for word in vocab:
+        reach_count = len(reachable.get(word, set()))
+        if reach_count < 10:
+            low_connectivity.append((word, reach_count))
+
+    low_connectivity.sort(key=lambda x: x[1])
+    print(f"  {len(low_connectivity)} words with <10 reachable vocab words")
+
+    # Build category index
+    by_category = defaultdict(list)
+    for word, info in vocab.items():
+        for cat in info["categories"]:
+            by_category[cat].append(word)
+
+    total_accepted = 0
+    pairs_processed = 0
+    total_skipped = 0
+
+    for word, reach_count in low_connectivity:
+        if (word, "2") in checkpoint:
+            total_skipped += 1
+            continue
+
+        word_cats = vocab[word]["categories"]
+        word_reachable = reachable.get(word, set())
+
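+        # Find same-category words that are unreachable. A set avoids listing
+        # a peer twice when it shares more than one category with `word`.
+        unreachable = set()
+        for cat in word_cats:
+            for peer in by_category.get(cat, []):
+                if peer != word and peer not in word_reachable:
+                    unreachable.add(peer)
+
+        if not unreachable:
+            # Skip logging on a dry run so it leaves no trace in the log
+            if not dry_run:
+                append_log(word, "2", 0, 0, 0, 0)
+            continue
+
+        # Sample up to 10 unreachable peers (sorted first, since
+        # random.sample needs a sequence rather than a set)
+        unreachable = sorted(unreachable)
+        sample = random.sample(unreachable, min(10, len(unreachable)))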
+
+        accepted_for_word = 0
+        parsed_for_word = 0
+        duplicates_for_word = 0
+        oov_for_word = 0
+
+        for peer in sample:
+            pair_key = f"{word}:{peer}"
+            if (pair_key, "2") in checkpoint:
+                continue
+
+            categories_a = ", ".join(vocab[word]["categories"])
+            categories_b = ", ".join(vocab[peer]["categories"])
+
+            user_prompt = PHASE2_USER.format(
+                word_a=word, word_b=peer,
+                categories_a=categories_a, categories_b=categories_b,
+            )
+
+            messages = [
+                {"role": "system", "content": PHASE2_SYSTEM},
+                {"role": "user", "content": user_prompt},
+            ]
+
+            if dry_run:
+                if pairs_processed < 3:
+                    print(f"\n--- Bridge prompt: '{word}' <-> '{peer}' ---")
+                    print(f"User:\n{user_prompt}")
+                elif pairs_processed == 3:
+                    print("\n... (more pairs) ...")
+                pairs_processed += 1
+                continue
+
+            response = llm_chat_completion(messages)
+            parsed = parse_bridge_response(response, word, peer) if response else []
+            parsed_for_word += len(parsed)
+
+            accepted = []
+
+            for edge in parsed:
+                triple = (edge["start_word"], edge["end_word"], edge["relation"])
+                if triple in existing_triples:
+                    duplicates_for_word += 1
+                    continue
+                existing_triples.add(triple)
+
+                # A bridge edge is kept when at least one endpoint is in vocab;
+                # edges with both endpoints outside the vocabulary are dropped.
+                if edge["start_word"] in vocab_set or edge["end_word"] in vocab_set:
+                    accepted.append(edge)
+                else:
+                    oov_for_word += 1
+
+            if accepted:
+                append_augmented_edges(accepted)
+                for e in accepted:
+                    edges[(e["start_word"], e["relation"])].append(
+                        (e["end_word"], e["weight"], e["surface_text"]))
+                accepted_for_word += len(accepted)
+
+            pairs_processed += 1
+            # Report per pair so a multiple of 20 is never stepped over
+            if pairs_processed % 20 == 0:
+                print(f"  {pairs_processed} pairs processed, {total_accepted + accepted_for_word} edges accepted")
+            time.sleep(0.1)
+
+        total_accepted += accepted_for_word
+        # Skip logging on a dry run so it cannot mark words as processed.
+        # Argument order mirrors the Phase 3 call: parsed, accepted,
+        # duplicates, then (assumed) the out-of-vocab count.
+        if not dry_run:
+            append_log(word, "2", parsed_for_word, accepted_for_word,
+                       duplicates_for_word, oov_for_word)
+
+    if dry_run:
+        print(f"\nDry run complete. Would process {pairs_processed} word pairs.")
+    else:
+        print(f"\nPhase 2 complete: {total_accepted} bridge edges accepted from {pairs_processed} pairs.")
+
+
+# ---------------------------------------------------------------------------
+# Phase 3: Property Enrichment
+# ---------------------------------------------------------------------------
+
+PHASE3_SYSTEM = """You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
+
+Rules:
+- Only physical/sensory/behavioral properties, not abstract qualities
+- Properties should DISTINGUISH this thing from similar things in its category
+- Output one property per line as: PROPERTY | brief explanation
+- Aim for 5-8 properties"""
+
+
+PHASE3_USER = """Word: {word}
+Category: {categories}
+Other words in same category: {peers}
+
+What properties distinguish {word} from the others listed?"""
+
+
+def run_phase3(vocab, edges, existing_triples, checkpoint, dry_run=False):
+    """Phase 3: Property enrichment for false_equivalence templates."""
+    by_category = defaultdict(list)
+    for word, info in vocab.items():
+        for cat in info["categories"]:
+            by_category[cat].append(word)
+
+    words = sorted(vocab.keys())
+    total = len(words)
+    total_accepted = 0
+    total_skipped = 0
+
+    print(f"Phase 3: Property enrichment for {total} words...")
+
+    for i, word in enumerate(words):
+        if (word, "3") in checkpoint:
+            total_skipped += 1
+            continue
+
+        word_cats = vocab[word]["categories"]
+        categories = ", ".join(word_cats)
+
+        # Gather same-category peers (sample of 10)
+        peers = set()
+        for cat in word_cats:
+            for peer in by_category.get(cat, []):
+                if peer != word:
+                    peers.add(peer)
+        peer_sample = random.sample(list(peers), min(10, len(peers))) if peers else []
+
+        if not peer_sample:
+            # Skip logging on a dry run so it leaves no trace in the log
+            if not dry_run:
+                append_log(word, "3", 0, 0, 0, 0)
+            continue
+
+        user_prompt = PHASE3_USER.format(
+            word=word, categories=categories,
+            peers=", ".join(peer_sample),
+        )
+
+        messages = [
+            {"role": "system", "content": PHASE3_SYSTEM},
+            {"role": "user", "content": user_prompt},
+        ]
+
+        if dry_run:
+            if i < 3:
+                print(f"\n--- Property prompt for '{word}' ---")
+                print(f"User:\n{user_prompt}")
+            elif i == 3:
+                print(f"\n... ({total - 3} more words) ...")
+            continue
+
+        response = llm_chat_completion(messages)
+        parsed = parse_property_response(response, word) if response else []
+
+        accepted = []
+        duplicates = 0
+
+        for edge in parsed:
+            triple = (edge["start_word"], edge["end_word"], edge["relation"])
+            if triple in existing_triples:
+                duplicates += 1
+                continue
+            existing_triples.add(triple)
+            accepted.append(edge)
+
+        if accepted:
+            append_augmented_edges(accepted)
+            for e in accepted:
+                edges[(e["start_word"], e["relation"])].append(
+                    (e["end_word"], e["weight"], e["surface_text"]))
+
+        total_accepted += len(accepted)
+        append_log(word, "3", len(parsed), len(accepted), duplicates, 0)
+
+        if (i + 1) % 50 == 0:
+            print(f"  [{i+1}/{total}] {total_accepted} properties accepted so far")
+
+        time.sleep(0.1)
+
+    if dry_run:
+        print(f"\nDry run complete. Would process {total - total_skipped} words.")
+    else:
+        print(f"\nPhase 3 complete: {total_accepted} new HasProperty edges accepted.")
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="LLM-augmented graph enhancement for folksy subgraph."
+    )
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument("--phase", type=int, choices=[1, 2, 3],
+                       help="Run a specific phase (1, 2, or 3)")
+    group.add_argument("--all", action="store_true",
+                       help="Run all three phases in sequence")
+    parser.add_argument("--dry-run", action="store_true",
+                        help="Print prompts without calling LLM")
+
+    args = parser.parse_args()
+
+    vocab = load_vocab()
+    edges, existing_triples = load_relations()
+    checkpoint = load_checkpoint()
+
+    print(f"Loaded {len(vocab)} vocab words, {len(existing_triples)} existing edge triples.")
+    print(f"Checkpoint: {len(checkpoint)} (word, phase) pairs already processed.")
+
+    phases = [args.phase] if args.phase else [1, 2, 3]
+
+    for phase in phases:
+        print(f"\n{'='*60}")
+        print(f"Running Phase {phase}")
+        print(f"{'='*60}")
+
+        if phase == 1:
+            run_phase1(vocab, edges, existing_triples, checkpoint, args.dry_run)
+        elif phase == 2:
+            run_phase2(vocab, edges, existing_triples, checkpoint, args.dry_run)
+        elif phase == 3:
+            run_phase3(vocab, edges, existing_triples, checkpoint, args.dry_run)
+
+        # Reload checkpoint after each phase for resumability
+        checkpoint = load_checkpoint()
+
+    print("\nDone.")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/expand_vocab.py
+++ b/scripts/expand_vocab.py
@ -0,0 +1,512 @@
+#!/usr/bin/env python3
+"""Expand folksy vocabulary with high-quality candidates from LLM suggestions.
+
+Reads candidate_additions.csv (words suggested by the LLM during phase 1 that
+weren't in the vocab), filters for quality, uses the LLM to assign categories,
+and appends the survivors to folksy_vocab.csv.
+
+After running this, re-run `enhance_graph.py --phase 1` to generate edges
+for the new words (the checkpoint will skip already-processed words).
+
+Usage:
+  python scripts/expand_vocab.py                  # Full run
+  python scripts/expand_vocab.py --dry-run         # Show what would be added
+  python scripts/expand_vocab.py --min-citations 8 # Stricter threshold
+"""
+
+import argparse
+import csv
+import json
+import re
+import shutil
+import sys
+import time
+from collections import Counter, defaultdict
+from datetime import datetime
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_DIR = SCRIPT_DIR.parent
+DATA_DIR = PROJECT_DIR / "data"
+
+LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
+LLM_MODEL = "THUDM-GLM4-32B"
+
+VOCAB_CSV = DATA_DIR / "folksy_vocab.csv"
+CANDIDATE_CSV = DATA_DIR / "candidate_additions.csv"
+
+# Valid categories from the existing vocabulary
+VALID_CATEGORIES = {
+    "animal", "beverage", "bird", "building", "clothing", "container", "crop",
+    "fabric", "fish", "flower", "food", "fruit", "furniture", "grain", "herb",
+    "insect", "instrument", "landscape", "material", "metal", "mineral",
+    "organism", "plant", "rock", "seed", "shelter", "spice", "stone",
+    "structure", "tool", "tree", "vegetable", "vehicle", "water", "weapon", "wood",
+}
+
+# ---------------------------------------------------------------------------
+# Exclusion lists
+# ---------------------------------------------------------------------------
+
+# Abstract concepts, emotions, processes — not concrete enough for folksy vocab
+EXCLUDE_ABSTRACT = {
+    "ecosystem", "satisfaction", "fullness", "warmth", "fear", "relaxation",
+    "growth", "interest", "nature", "protection", "digestion", "injury",
+    "decoration", "construction", "landscape", "noise", "sound", "energy",
+    "nourishment", "nutrition", "pollination", "sustainability", "tradition",
+    "biodiversity", "symbolism", "elegance", "resilience", "patience",
+    "beauty", "abundance", "fertility", "creativity", "harmony", "comfort",
+    "curiosity", "companionship", "loyalty", "aggression", "alertness",
+    "camouflage", "predation", "migration", "hibernation", "decomposition",
+    "erosion", "combustion", "fermentation", "oxidation", "corrosion",
+    "photosynthesis", "respiration", "evaporation", "precipitation",
+    "transpiration", "germination", "excitement", "enjoyment", "satiety",
+    "stability", "organization", "fragrance", "moisture", "wildlife",
+    "preservation", "conversation", "inspiration", "storage", "observation",
+    "hydration", "destruction", "entertainment", "education", "knowledge",
+    "safety", "practice", "research", "skill", "space", "license",
+    "collection", "habitat", "pollution", "health", "vibration", "wonder",
+    "awe", "refreshment", "irritation", "happiness", "joy", "damage",
+    "death", "pain", "thirst", "fear", "alarm", "contents", "ingredients",
+    "electricity", "oxygen", "navigation", "recreation", "meditation",
+    "nutrition", "celebration", "communication", "imagination", "devotion",
+    "ambition", "endurance", "independence", "discipline", "cooperation",
+    "sweetness", "fullness", "aroma", "flavor", "fragrance", "texture",
+    "smell", "color", "contents", "surface", "bottom", "edge",
+    "nutrients", "study", "outfit", "upholstery",
+}
+
+# Scientific/technical terms: not folksy enough for folk wisdom
+EXCLUDE_TECHNICAL = {
+    "cellulose", "exoskeleton", "protein", "tissue", "cells", "alloy",
+    "enzyme", "chlorophyll", "genome", "photon",
+    "organism", "molecule", "compound", "polymer", "isotope",
+    "ecosystem", "metabolism", "catalyst", "membrane", "chromosome",
+    "cell", "nutrient", "ingredient", "material", "content",
+}
+
+# Collective/institutional nouns — not concrete individual things
+EXCLUDE_INSTITUTIONAL = {
+    "orchestra", "fleet", "arsenal", "toolkit", "collection",
+    "restaurant", "museum", "university", "corporation", "organization",
+    "musician", "breakfast", "dinner", "meal", "dish", "sandwich",
+    "seafood", "refrigerator", "garage", "basement", "park",
+}
+
+# Adjectives and properties — useful as HasProperty targets but not as vocab words
+EXCLUDE_ADJECTIVES = {
+    "small", "large", "heavy", "colorful", "green", "brown", "hard",
+    "white", "round", "sharp", "sturdy", "long", "soft", "flat",
+    "sweet", "bitter", "smooth", "rough", "bright", "dark", "dry",
+    "wet", "thick", "thin", "warm", "cold", "hot", "tall", "short",
+    "red", "blue", "yellow", "black", "grey", "gray", "pink",
+    "fragrant", "loud", "spicy", "sour", "tough", "delicate", "strong",
+    "weak", "light", "dense", "portable", "lightweight", "transparent",
+    "opaque", "flexible", "rigid", "brittle", "elastic", "porous",
+    "compact", "edible", "toxic", "aromatic", "nocturnal", "aquatic",
+    "durable", "cylindrical", "wooden", "shiny", "solid", "narrow",
+    "metallic", "pungent", "juicy", "fast", "powerful", "woody",
+    "fibrous", "savory", "liquid", "enclosed", "rectangular", "wild",
+    "feathered", "leafy", "crunchy", "dangerous", "fuzzy", "slimy",
+    "natural", "waterproof", "electronic",
+}
+
+# Words that are clearly verbs or gerunds
+EXCLUDE_VERBS = {
+    "eating", "cooking", "growing", "fishing", "hunting", "flying",
+    "mining", "flavoring", "singing", "blooming", "holding", "baking",
+    "ripening", "opening", "cutting", "protecting", "seasoning",
+    "storing", "building", "swimming", "brewing", "weaving", "carving",
+    "climbing", "digging", "plowing", "sewing", "spinning", "tanning",
+    "swim", "run", "grow", "eat", "hunt", "peck", "bite", "dive",
+    "crawl", "cut", "shine", "sparkle",
+}
+
+
+def singularize(word):
+    """Best-effort singularization. Returns (singular, was_plural)."""
+    # Irregular plurals, including the -ves words that really do map to -f/-fe
+    irregulars = {
+        "teeth": "tooth", "feet": "foot", "geese": "goose", "mice": "mouse",
+        "lice": "louse", "dice": "die", "oxen": "ox", "children": "child",
+        "leaves": "leaf", "loaves": "loaf", "halves": "half", "knives": "knife",
+        "lives": "life", "wives": "wife", "wolves": "wolf", "shelves": "shelf",
+        "calves": "calf", "hooves": "hoof", "scarves": "scarf", "sheaves": "sheaf",
+    }
+    if word in irregulars:
+        return irregulars[word], True
+
+    # Other -ves words (olives, gloves, stoves, doves) are ordinary -e + s
+    # plurals, so they fall through to the plain -s rule below rather than
+    # being rewritten to -f.
+
+    # -ies -> -y
+    if word.endswith("ies") and len(word) > 4:
+        return word[:-3] + "y", True
+
+    # -sses, -xes, -ches, -shes -> drop -es. Plain -ses and -zes words
+    # (horses, roses, mazes) usually end in a silent e, so they take the -s rule.
+    if word.endswith(("sses", "xes", "ches", "shes")):
+        return word[:-2], True
+
+    # -s (but not -ss, -us, -is)
+    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
+        return word[:-1], True
+
+    return word, False
+
+
+def is_plural_of_existing(word, existing_vocab):
+    """Check if word is likely a plural form of an existing vocab word."""
+    # word + s
+    if word.endswith("s") and word[:-1] in existing_vocab:
+        return True
+    # word + es
+    if word.endswith("es") and word[:-2] in existing_vocab:
+        return True
+    # word ending ies -> y
+    if word.endswith("ies") and word[:-3] + "y" in existing_vocab:
+        return True
+    # word ending ves -> f/fe
+    if word.endswith("ves"):
+        if word[:-3] + "f" in existing_vocab:
+            return True
+        if word[:-3] + "fe" in existing_vocab:
+            return True
+    return False
+
+
+def is_plural_of_candidate(word, accepted_words):
+    """Check if word is a plural of another candidate, or vice versa."""
+    # Is this word a plural of something accepted?
+    if word.endswith("s") and word[:-1] in accepted_words:
+        return True
+    if word.endswith("es") and word[:-2] in accepted_words:
+        return True
+    if word.endswith("ies") and word[:-3] + "y" in accepted_words:
+        return True
+    # Is something accepted a plural of this word?
+    if word + "s" in accepted_words:
+        return True
+    if word + "es" in accepted_words:
+        return True
+    if word.endswith("f") and word[:-1] + "ves" in accepted_words:
+        return True
+    if word.endswith("fe") and word[:-2] + "ves" in accepted_words:
+        return True
+    return False
+
+
+# ---------------------------------------------------------------------------
+# LLM categorization
+# ---------------------------------------------------------------------------
+
+CATEGORIZE_SYSTEM = """You are a vocabulary categorizer. Given a list of concrete nouns, assign each one to one or more categories from this fixed list:
+
+animal, beverage, bird, building, clothing, container, crop, fabric, fish, flower, food, fruit, furniture, grain, herb, insect, instrument, landscape, material, metal, mineral, organism, plant, rock, seed, shelter, spice, stone, structure, tool, tree, vegetable, vehicle, water, weapon, wood
+
+Rules:
+- Use ONLY categories from the list above
+- A word can have multiple categories (e.g., "brick" -> material, stone)
+- If a word fits none of the categories well, output SKIP
+- Output format: word: category1, category2
+- One word per line"""
+
+CATEGORIZE_USER = """Categorize these words:
+{word_list}"""
+
+
+def llm_chat_completion(messages, max_retries=3):
+    """Chat completion with retry logic."""
+    import requests
+
+    for attempt in range(max_retries):
+        try:
+            resp = requests.post(LLM_ENDPOINT, json={
+                "model": LLM_MODEL,
+                "messages": messages,
+            }, timeout=120)
+            resp.raise_for_status()
+            data = resp.json()
+            return data["choices"][0]["message"]["content"]
+        except Exception as e:
+            wait = (2 ** attempt)
+            print(f"  LLM call failed (attempt {attempt+1}/{max_retries}): {e}",
+                  file=sys.stderr)
+            if attempt < max_retries - 1:
+                print(f"  Retrying in {wait}s...", file=sys.stderr)
+                time.sleep(wait)
+            else:
+                print(f"  Giving up on this batch.", file=sys.stderr)
+                return None
+
+
+def parse_categories(response_text, valid_words):
+    """Parse LLM categorization response."""
+    result = {}
+    if not response_text:
+        return result
+
+    for line in response_text.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+
+        # Strip bullets/numbers
+        line = re.sub(r"^[\d]+[.)]\s*", "", line)
+        line = re.sub(r"^[-*•]\s*", "", line)
+        line = line.strip()
+
+        # Match: word: cat1, cat2
+        match = re.match(r"^(\w+)\s*:\s*(.+)$", line)
+        if not match:
+            continue
+
+        word = match.group(1).strip().lower()
+        cats_raw = match.group(2).strip()
+
+        if "SKIP" in cats_raw.upper():
+            continue
+
+        cats = []
+        for c in cats_raw.split(","):
+            c = c.strip().lower()
+            if c in VALID_CATEGORIES:
+                cats.append(c)
+
+        if word in valid_words and cats:
+            result[word] = cats
+
+    return result
+
+
+def categorize_words(words, batch_size=25):
+    """Categorize words using the LLM in batches."""
+    all_categories = {}
+    word_set = set(words)
+
+    for i in range(0, len(words), batch_size):
+        batch = words[i:i + batch_size]
+        word_list = "\n".join(f"- {w}" for w in batch)
+
+        messages = [
+            {"role": "system", "content": CATEGORIZE_SYSTEM},
+            {"role": "user", "content": CATEGORIZE_USER.format(word_list=word_list)},
+        ]
+
+        response = llm_chat_completion(messages)
+        parsed = parse_categories(response, word_set)
+        all_categories.update(parsed)
+
+        categorized = len(parsed)
+        print(f"  Batch {i // batch_size + 1}: {categorized}/{len(batch)} categorized")
+        time.sleep(0.1)
+
+    return all_categories
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Expand folksy vocabulary with LLM-suggested candidates."
+    )
+    parser.add_argument("--min-citations", type=int, default=5,
+                        help="Minimum number of vocab words that suggested this candidate (default: 5)")
+    parser.add_argument("--dry-run", action="store_true",
+                        help="Show what would be added without modifying files")
+    parser.add_argument("--no-llm", action="store_true",
+                        help="Skip LLM categorization (use placeholder categories)")
+
+    args = parser.parse_args()
+
+    # Load existing vocab
+    existing_vocab = {}
+    with open(VOCAB_CSV, newline="", encoding="utf-8") as f:
+        for row in csv.DictReader(f):
+            existing_vocab[row["word"]] = row
+    existing_words = set(existing_vocab.keys())
+    print(f"Existing vocabulary: {len(existing_words)} words")
+
+    # Load candidates
+    candidates = []
+    with open(CANDIDATE_CSV, newline="", encoding="utf-8") as f:
+        for row in csv.DictReader(f):
+            candidates.append(row)
+
+    # Aggregate: count unique sources per candidate word
+    word_sources = defaultdict(set)
+    for c in candidates:
+        word_sources[c["word"]].add(c["suggested_by"])
+
+    print(f"Total candidate rows: {len(candidates)}")
+    print(f"Unique candidate words: {len(word_sources)}")
+
+    # Normalize plurals: merge citation counts into singular forms
+    normalized_sources = defaultdict(set)
+    for word, sources in word_sources.items():
+        singular, was_plural = singularize(word)
+        # Merge into the singular form
+        normalized_sources[singular].update(sources)
+    # Replace word_sources with normalized version
+    word_sources = {w: srcs for w, srcs in normalized_sources.items()}
+    print(f"After singularization: {len(word_sources)} unique candidates")
+
+    # Filter
+    accepted = []
+    reject_reasons = Counter()
+
+    # Sort by citation count descending for consistent ordering
+    sorted_candidates = sorted(word_sources.items(), key=lambda x: len(x[1]), reverse=True)
+    accepted_set = set()
+
+    for word, sources in sorted_candidates:
+        citation_count = len(sources)
+
+        # Minimum citation threshold
+        if citation_count < args.min_citations:
+            reject_reasons["below_threshold"] += 1
+            continue
+
+        # No multi-word (underscore) candidates
+        if "_" in word:
+            reject_reasons["multi_word"] += 1
+            continue
+
+        # Already in vocab
+        if word in existing_words:
+            reject_reasons["already_in_vocab"] += 1
+            continue
+
+        # Exclude abstracts
+        if word in EXCLUDE_ABSTRACT:
+            reject_reasons["abstract"] += 1
+            continue
+
+        # Exclude adjectives
+        if word in EXCLUDE_ADJECTIVES:
+            reject_reasons["adjective"] += 1
+            continue
+
+        # Exclude verbs/gerunds
+        if word in EXCLUDE_VERBS:
+            reject_reasons["verb_gerund"] += 1
+            continue
+
+        # Exclude technical/scientific
+        if word in EXCLUDE_TECHNICAL:
+            reject_reasons["technical"] += 1
+            continue
+
+        # Exclude institutional/collective
+        if word in EXCLUDE_INSTITUTIONAL:
+            reject_reasons["institutional"] += 1
+            continue
+
+        # Gerund pattern catch-all (but allow exceptions)
+        if word.endswith("ing") and word not in {"ring", "spring", "string", "wing", "ceiling"}:
+            reject_reasons["gerund_pattern"] += 1
+            continue
+
+        # Exclude plurals of existing vocab
+        if is_plural_of_existing(word, existing_words):
+            reject_reasons["plural_of_existing"] += 1
+            continue
+
+        # Exclude plurals of already-accepted candidates
+        if is_plural_of_candidate(word, accepted_set):
+            reject_reasons["plural_of_candidate"] += 1
+            continue
+
+        # Single character
+        if len(word) < 2:
+            reject_reasons["too_short"] += 1
+            continue
+
+        accepted.append((word, citation_count))
+        accepted_set.add(word)
+
+    print(f"\nFiltering results:")
+    print(f"  Accepted: {len(accepted)}")
+    for reason, count in reject_reasons.most_common():
+        print(f"  Rejected ({reason}): {count}")
+
+    if not accepted:
+        print("\nNo candidates passed filtering.")
+        return
+
+    # Show accepted words
+    print(f"\nAccepted candidates ({len(accepted)}):")
+    for word, count in accepted:
+        print(f"  {word:25s} cited by {count:3d} vocab words")
+
+    if args.dry_run:
+        print(f"\nDry run complete. Would add {len(accepted)} words to vocabulary.")
+        return
+
+    # Categorize with LLM
+    words_to_categorize = [w for w, _ in accepted]
+
+    if args.no_llm:
+        print("\nSkipping LLM categorization (--no-llm). Using 'material' as placeholder.")
+        categories = {w: ["material"] for w in words_to_categorize}
+    else:
+        print(f"\nCategorizing {len(words_to_categorize)} words with LLM...")
+        categories = categorize_words(words_to_categorize)
+
+    # Words the LLM couldn't categorize get skipped
+    uncategorized = [w for w in words_to_categorize if w not in categories]
+    if uncategorized:
+        print(f"\n  {len(uncategorized)} words could not be categorized (skipped):")
+        for w in uncategorized:
+            print(f"    {w}")
+
+    # Build new vocab entries
+    new_entries = []
+    for word, citation_count in accepted:
+        if word not in categories:
+            continue
+        cats = categories[word]
+        new_entries.append({
+            "word": word,
+            "categories": ",".join(cats),
+            "tangibility_score": "0.80",
+            "conceptnet_edge_count": "0",
+            "frequency_rank": "0",
+        })
+
+    if not new_entries:
+        print("\nNo entries to add after categorization.")
+        return
+
+    # Backup existing vocab
+    backup_path = VOCAB_CSV.with_suffix(f".csv.bak.{datetime.now().strftime('%Y%m%d_%H%M%S')}")
+    shutil.copy2(VOCAB_CSV, backup_path)
+    print(f"\nBacked up vocabulary to {backup_path.name}")
+
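+    # Append to vocab CSV. If the file does not already end with a newline,
+    # add one first so the first new row is not joined to the last existing row.
+    existing_bytes = VOCAB_CSV.read_bytes()
+    needs_newline = bool(existing_bytes) and not existing_bytes.endswith((b"\n", b"\r"))
+    with open(VOCAB_CSV, "a", newline="", encoding="utf-8") as f:
+        if needs_newline:
+            f.write("\n")
+        writer = csv.DictWriter(f, fieldnames=["word", "categories", "tangibility_score",
+                                                "conceptnet_edge_count", "frequency_rank"])
+        for entry in new_entries:
+            writer.writerow(entry)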
+
+    print(f"\nAdded {len(new_entries)} words to {VOCAB_CSV.name}")
+    print(f"New vocabulary size: {len(existing_words) + len(new_entries)}")
+
+    # Summary by category
+    cat_counts = Counter()
+    for entry in new_entries:
+        for c in entry["categories"].split(","):
+            cat_counts[c.strip()] += 1
+    print(f"\nNew words by category:")
+    for cat, count in cat_counts.most_common():
+        print(f"  {cat:20s} {count:3d}")
+
+    print(f"\nNext step: run 'python scripts/enhance_graph.py --phase 1' to generate edges for new words.")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/filter_corpus.py
+++ b/scripts/filter_corpus.py
@ -0,0 +1,177 @@
+#!/usr/bin/env python3
+"""Quality filtering for polished folksy sayings.
+
+Reads corpus_polished.jsonl, applies quality filters, outputs filtered corpus
+and discard analysis.
+
+Usage:
+  python scripts/filter_corpus.py
+  python scripts/filter_corpus.py --input corpus/corpus_polished.jsonl --output corpus/corpus_filtered.jsonl
+"""
+
+import argparse
+import csv
+import json
+import sys
+from difflib import SequenceMatcher
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_DIR = SCRIPT_DIR.parent
+CORPUS_DIR = PROJECT_DIR / "corpus"
+
+
+def quality_filter(entry):
+    """Apply quality filters to a polished entry.
+
+    Returns (passed, reason) tuple.
+    """
+    text = entry.get("polished_text", "")
+    if not text:
+        return False, "no_polished_text"
+
+    words = text.split()
+
+    # Length check
+    if len(words) > 25:
+        return False, "too_long"
+    if len(words) < 5:
+        return False, "too_short"
+
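+    # Must contain at least 2 of the original slot-fill nouns. Underscores are
+    # treated as spaces so a multi-word slot such as "fishing_rod" (hypothetical
+    # example) can still match its polished form.
+    text_lower = text.lower()
+    slot_words = {str(w).lower().replace("_", " ")
+                  for w in entry.get("slots", {}).values()}
+    words_present = sum(1 for w in slot_words if w and w in text_lower)
+    if words_present < 2:
+        return False, "lost_key_nouns"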
+
+    # No raw ConceptNet artifacts (multi-word underscore phrases)
+    if "_" in text:
+        return False, "conceptnet_artifact"
+
+    # No broken templates (unfilled slots)
+    if "{" in text or "}" in text:
+        return False, "unfilled_slot"
+
+    return True, "pass"
+
+
+def is_near_duplicate(text_a, text_b, threshold=0.75):
+    """Check if two texts are near-duplicates."""
+    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
+
+
+def deduplicate_within_family(entries):
+    """Remove near-duplicates within each meta-template family.
+
+    Returns (kept, removed) lists.
+    """
+    by_family = {}
+    for entry in entries:
+        family = entry.get("meta_template", "unknown")
+        by_family.setdefault(family, []).append(entry)
+
+    kept = []
+    removed = []
+
+    for family, family_entries in by_family.items():
+        family_kept = []
+        for entry in family_entries:
+            text = entry.get("polished_text", "")
+            is_dup = False
+            for existing in family_kept:
+                if is_near_duplicate(text, existing.get("polished_text", "")):
+                    is_dup = True
+                    break
+            if is_dup:
+                removed.append((entry, "near_duplicate"))
+            else:
+                family_kept.append(entry)
+        kept.extend(family_kept)
+
+    return kept, removed
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Quality filtering for polished folksy sayings.")
+    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_polished.jsonl"),
+                        help="Input polished JSONL file")
+    parser.add_argument("--output", default=str(CORPUS_DIR / "corpus_filtered.jsonl"),
+                        help="Output filtered JSONL file")
+    parser.add_argument("--discard-analysis", default=str(CORPUS_DIR / "discard_analysis.csv"),
+                        help="Discard analysis CSV file")
+    args = parser.parse_args()
+
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    discard_path = Path(args.discard_analysis)
+
+    if not input_path.exists():
+        print(f"Error: {input_path} not found.", file=sys.stderr)
+        sys.exit(1)
+
+    # Load polished entries (only those with status=polished)
+    all_entries = []
+    already_discarded = 0
+    with open(input_path, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            entry = json.loads(line)
+            if entry.get("status") == "polished":
+                all_entries.append(entry)
+            elif entry.get("status") == "discarded":
+                already_discarded += 1
+
+    print(f"Loaded {len(all_entries)} polished entries ({already_discarded} already discarded by LLM)")
+
+    # Apply quality filters
+    passed = []
+    discards = []  # (entry, reason)
+
+    for entry in all_entries:
+        ok, reason = quality_filter(entry)
+        if ok:
+            passed.append(entry)
+        else:
+            discards.append((entry, reason))
+
+    print(f"Quality filter: {len(passed)} passed, {len(discards)} discarded")
+
+    # Show discard breakdown
+    from collections import Counter
+    reason_counts = Counter(r for _, r in discards)
+    for reason, count in reason_counts.most_common():
+        print(f"  {reason}: {count}")
+
+    # Near-duplicate detection within template families
+    kept, dup_removed = deduplicate_within_family(passed)
+    discards.extend(dup_removed)
+
+    print(f"Near-duplicate removal: {len(dup_removed)} removed, {len(kept)} remaining")
+
+    # Write filtered output
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w", encoding="utf-8") as f:
+        for entry in kept:
+            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+
+    print(f"\nFiltered corpus: {len(kept)} entries -> {output_path}")
+
+    # Write discard analysis
+    with open(discard_path, "w", newline="", encoding="utf-8") as f:
+        writer = csv.writer(f)
+        writer.writerow(["raw_text", "meta_template", "discard_stage", "discard_reason"])
+        for entry, reason in discards:
+            writer.writerow([
+                entry.get("raw_text", ""),
+                entry.get("meta_template", ""),
+                "llm_polish" if reason == "no_polished_text" else "dedup" if reason == "near_duplicate" else "quality_filter",
+                reason,
+            ])
+
+    print(f"Discard analysis: {len(discards)} entries -> {discard_path}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/format_training_pairs.py
+++ b/scripts/format_training_pairs.py
@ -0,0 +1,385 @@
+#!/usr/bin/env python3
+"""Format filtered sayings into training pairs for fine-tuning.
+
+Each polished saying generates up to 5 training pairs (typically 3-5) with different input framings.
+Also generates fictional entity training pairs.
+
+Usage:
+  python scripts/format_training_pairs.py
+  python scripts/format_training_pairs.py --input corpus/corpus_filtered.jsonl --output corpus/training_pairs.jsonl
+"""
+
+import argparse
+import csv
+import json
+import random
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_DIR = SCRIPT_DIR.parent
+CORPUS_DIR = PROJECT_DIR / "corpus"
+DATA_DIR = PROJECT_DIR / "data"
+EXAMPLES_DIR = PROJECT_DIR / "examples"
+
+# Template name mappings for human-readable prompts
+TEMPLATE_NAMES = {
+    "deconstruction": "deconstruction",
+    "denial_of_consequences": "denial of consequences",
+    "ironic_deficiency": "ironic deficiency",
+    "futile_preparation": "futile preparation",
+    "hypocritical_complaint": "hypocritical complaint",
+    "tautological_wisdom": "tautological wisdom",
+    "false_equivalence": "false equivalence",
+}
+
+PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]
+
+OPEN_ENDED_PROMPTS = [
+    "Tell me some folk wisdom.",
+    "What do they say?",
+    "Give me a proverb.",
+    "Share some old-time wisdom.",
+    "What's a good saying?",
+]
+
+# Auto-generated fictional entities for additional training pairs
+AUTO_ENTITIES = [
+    {
+        "name": "Stoneclaw",
+        "categories": ["animal", "predator"],
+        "properties": ["fierce", "rocky", "nocturnal"],
+        "relations": {"AtLocation": ["cave", "mountain"], "HasA": ["claws", "scales"], "CapableOf": ["hunting", "climbing"]},
+    },
+    {
+        "name": "Duskmelon",
+        "categories": ["fruit", "food"],
+        "properties": ["purple", "sweet", "fragrant"],
+        "relations": {"AtLocation": ["garden", "market"], "UsedFor": ["eating", "jam"], "MadeOf": ["seed", "juice"]},
+    },
+    {
+        "name": "Windloom",
+        "categories": ["tool", "craft"],
+        "properties": ["wooden", "portable", "intricate"],
+        "relations": {"UsedFor": ["weaving", "thread"], "MadeOf": ["wood", "string"], "AtLocation": ["workshop", "cottage"]},
+    },
+    {
+        "name": "Briarvine",
+        "categories": ["plant", "herb"],
+        "properties": ["thorny", "green", "medicinal"],
+        "relations": {"AtLocation": ["forest", "hedge"], "UsedFor": ["healing", "tea"], "HasA": ["thorn", "leaf"]},
+    },
+    {
+        "name": "Mudhog",
+        "categories": ["animal", "livestock"],
+        "properties": ["muddy", "stubborn", "heavy"],
+        "relations": {"AtLocation": ["farm", "swamp"], "Desires": ["food", "mud"], "CapableOf": ["digging", "rooting"]},
+    },
+    {
+        "name": "Frostberry",
+        "categories": ["fruit", "food"],
+        "properties": ["cold", "blue", "tiny"],
+        "relations": {"AtLocation": ["mountain", "tundra"], "UsedFor": ["eating", "preserves"], "HasProperty": ["cold", "tart"]},
+    },
+    {
+        "name": "Lanternmoss",
+        "categories": ["plant", "fungus"],
+        "properties": ["glowing", "damp", "soft"],
+        "relations": {"AtLocation": ["cave", "swamp"], "UsedFor": ["light", "decoration"], "HasProperty": ["luminous", "fragile"]},
+    },
+    {
+        "name": "Cinderhawk",
+        "categories": ["bird", "animal"],
+        "properties": ["fiery", "fast", "red"],
+        "relations": {"AtLocation": ["mountain", "volcano"], "CapableOf": ["flying", "hunting"], "HasA": ["talons", "feathers"]},
+    },
+    {
+        "name": "Rootstone",
+        "categories": ["stone", "material"],
+        "properties": ["veined", "hard", "ancient"],
+        "relations": {"AtLocation": ["quarry", "riverbed"], "UsedFor": ["building", "carving"], "MadeOf": ["mineral", "root"]},
+    },
+    {
+        "name": "Silkwort",
+        "categories": ["plant", "fiber"],
+        "properties": ["silky", "white", "tall"],
+        "relations": {"AtLocation": ["field", "meadow"], "UsedFor": ["weaving", "cloth"], "HasA": ["stem", "fiber"]},
+    },
+    {
+        "name": "Kettlefrog",
+        "categories": ["animal", "amphibian"],
+        "properties": ["loud", "round", "green"],
+        "relations": {"AtLocation": ["pond", "marsh"], "CapableOf": ["jumping", "croaking"], "Desires": ["flies", "water"]},
+    },
+    {
+        "name": "Dustwheat",
+        "categories": ["crop", "grain"],
+        "properties": ["dry", "golden", "hardy"],
+        "relations": {"AtLocation": ["field", "barn"], "UsedFor": ["bread", "flour"], "HasPrerequisite": ["rain", "soil"]},
+    },
+]
+
+
+def format_entity_description(entity):
+    """Format entity into a natural description string."""
+    name = entity["name"]
+    cats = entity.get("categories", [])
+    props = entity.get("properties", [])
+    rels = entity.get("relations", {})
+
+    parts = []
+
+    # Category description
+    if props and cats:
+        prop_str = ", ".join(props[:3])
+        cat_str = " and ".join(cats[:2])
+        parts.append(f"A {name} is a {prop_str} {cat_str}.")
+    elif cats:
+        parts.append(f"A {name} is a {' and '.join(cats[:2])}.")
+
+    # Location
+    if "AtLocation" in rels:
+        locs = rels["AtLocation"]
+        parts.append(f"It is found near {' and '.join(locs[:2])}.")
+
+    # Parts/properties
+    if "HasA" in rels:
+        has = rels["HasA"]
+        parts.append(f"It has {', '.join(has[:3])}.")
+
+    # Capabilities (AUTO_ENTITIES lists these as gerunds, e.g. "hunting")
+    if "CapableOf" in rels:
+        caps = rels["CapableOf"]
+        parts.append(f"It is capable of {' and '.join(caps[:2])}.")
+
+    # Uses
+    if "UsedFor" in rels:
+        uses = rels["UsedFor"]
+        parts.append(f"It is used for {' and '.join(uses[:2])}.")
+
+    return " ".join(parts)
+
+
+def load_vocab_categories():
+    """Load vocab to get word -> categories mapping."""
+    word_cats = {}
+    vocab_path = DATA_DIR / "folksy_vocab.csv"
+    if vocab_path.exists():
+        with open(vocab_path, newline="", encoding="utf-8") as f:
+            for row in csv.DictReader(f):
+                word = row["word"]
+                cats = [c.strip() for c in row["categories"].split(",") if c.strip()]
+                word_cats[word] = cats
+    return word_cats
+
+
+def generate_training_pairs(entry, word_cats):
+    """Generate up to 5 training pairs (typically 3-5) for a single polished saying."""
+    polished = entry.get("polished_text", "")
+    slots = entry.get("slots", {})
+    meta_template = entry.get("meta_template", "")
+
+    # Collect source words (concrete nouns from slots)
+    source_words = [v for v in slots.values()
+                    if v and not v.startswith("a ") and not v.startswith("an ") and len(v) > 1]
+
+    # Determine categories of slot words
+    slot_categories = set()
+    for word in source_words:
+        word_lower = word.lower().replace(" ", "_")
+        if word_lower in word_cats:
+            slot_categories.update(word_cats[word_lower])
+
+    pairs = []
+    base = {
+        "output": polished,
+        "meta_template": meta_template,
+        "source_words": source_words,
+    }
+
+    # 1. Word-seeded (included whenever the entry has source words)
+    if source_words:
+        word = random.choice(source_words)
+        pairs.append({**base, "input": f"Tell me something about {word}."})
+
+    # 2. Category-seeded (always include if we have categories)
+    if slot_categories:
+        cat = random.choice(list(slot_categories))
+        pairs.append({**base, "input": f"Tell me a saying about {cat}."})
+
+    # 3. Persona-seeded (included whenever the entry has source words)
+    persona = random.choice(PERSONAS)
+    if source_words:
+        word = random.choice(source_words)
+        pairs.append({**base, "input": f"What would a {persona} say about {word}?"})
+
+    # 4. Template-seeded (include ~70% of the time)
+    if random.random() < 0.7:
+        template_name = TEMPLATE_NAMES.get(meta_template, meta_template)
+        pairs.append({**base, "input": f"Give me a {template_name} proverb."})
+
+    # 5. Open-ended (include ~30% of the time)
+    if random.random() < 0.3:
+        prompt = random.choice(OPEN_ENDED_PROMPTS)
+        pairs.append({**base, "input": prompt})
+
+    return pairs
+
+
+def generate_fictional_pairs(entities):
+    """Generate training pairs for fictional entities.
+
+    These pairs include the entity description in the input.
+    """
+    pairs = []
+
+    # Generate 15-25 pairs per entity
+    for entity in entities:
+        name = entity["name"]
+        desc = format_entity_description(entity)
+        props = entity.get("properties", [])
+        rels = entity.get("relations", {})
+
+        # Collect words related to this entity
+        related_words = []
+        for targets in rels.values():
+            related_words.extend(targets)
+
+        n_pairs = random.randint(15, 25)
+
+        for _ in range(n_pairs):
+            framing = random.choice(["persona", "word", "category", "open"])
+
+            if framing == "persona":
+                persona = random.choice(PERSONAS)
+                input_text = f"{desc} What would a {persona} say about a {name}?"
+            elif framing == "word" and related_words:
+                word = random.choice(related_words)
+                input_text = f"{desc} Tell me a saying about {name} and {word}."
+            elif framing == "category":
+                cats = entity.get("categories", ["thing"])
+                cat = random.choice(cats)
+                input_text = f"{desc} Give me folk wisdom about this {cat}."
+            else:
+                input_text = f"{desc} Tell me some folk wisdom about {name}."
+
+            # Placeholder output — these would ideally be generated through the
+            # template engine with fictional entities loaded, then polished.
+            # For now, generate a structural placeholder that indicates the
+            # entity relationships.
+            pairs.append({
+                "input": input_text,
+                "output": "",  # Will be filled by actual generation
+                "meta_template": "fictional",
+                "source_words": [name] + related_words[:3],
+                "_needs_generation": True,
+                "_entity": entity,
+            })
+
+    return pairs
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Format training pairs for fine-tuning.")
+    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_filtered.jsonl"),
+                        help="Input filtered JSONL file")
+    parser.add_argument("--output", default=str(CORPUS_DIR / "training_pairs.jsonl"),
+                        help="Output training pairs JSONL file")
+    parser.add_argument("--entities", default=str(EXAMPLES_DIR / "my_world.json"),
+                        help="Fictional entities JSON file")
+    args = parser.parse_args()
+
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    entities_path = Path(args.entities)
+
+    if not input_path.exists():
+        print(f"Error: {input_path} not found.", file=sys.stderr)
+        sys.exit(1)
+
+    # Load vocab categories
+    word_cats = load_vocab_categories()
+
+    # Load filtered entries
+    entries = []
+    with open(input_path, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                entries.append(json.loads(line))
+
+    print(f"Loaded {len(entries)} filtered entries")
+
+    # Generate training pairs for each entry
+    all_pairs = []
+    for entry in entries:
+        pairs = generate_training_pairs(entry, word_cats)
+        all_pairs.extend(pairs)
+
+    print(f"Generated {len(all_pairs)} training pairs from polished sayings")
+
+    # Generate fictional entity pairs
+    fictional_entities = []
+    if entities_path.exists():
+        with open(entities_path, encoding="utf-8") as f:
+            data = json.load(f)
+            fictional_entities = data.get("entities", [])
+        print(f"Loaded {len(fictional_entities)} fictional entities from {entities_path}")
+
+    # Add auto-generated entities
+    fictional_entities.extend(AUTO_ENTITIES)
+    print(f"Total fictional entities (file + auto-generated): {len(fictional_entities)}")
+
+    fictional_pairs = generate_fictional_pairs(fictional_entities)
+
+    # Filter out placeholder pairs (those that still need generation)
+    # In a full pipeline, these would be generated through the template engine.
+    # For now, skip any with empty output.
+    real_fictional = [p for p in fictional_pairs if p.get("output")]
+    placeholder_fictional = [p for p in fictional_pairs if not p.get("output")]
+
+    if placeholder_fictional:
+        print(f"  {len(placeholder_fictional)} fictional pairs need generation via template engine")
+        print(f"  (Run folksy_generator.py with --entities to generate these, then re-run this script)")
+
+    all_pairs.extend(real_fictional)
+
+    # Clean up internal fields before writing
+    for pair in all_pairs:
+        pair.pop("_needs_generation", None)
+        pair.pop("_entity", None)
+
+    # Write output
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w", encoding="utf-8") as f:
+        for pair in all_pairs:
+            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
+
+    # Stats
+    from collections import Counter
+    input_types = Counter()
+    for pair in all_pairs:
+        inp = pair["input"]
+        if inp in OPEN_ENDED_PROMPTS:  # checked first: "Give me a proverb." also matches the template pattern
+            input_types["open_ended"] += 1
+        elif inp.startswith("Tell me something about"):
+            input_types["word_seeded"] += 1
+        elif inp.startswith("Tell me a saying about"):
+            input_types["category_seeded"] += 1
+        elif inp.startswith("What would a"):
+            input_types["persona_seeded"] += 1
+        elif inp.startswith("Give me a") and "proverb" in inp:
+            input_types["template_seeded"] += 1
+        else:
+            input_types["fictional"] += 1
+
+    print(f"\nTotal training pairs: {len(all_pairs)}")
+    print("Distribution by input type:")
+    for itype, count in sorted(input_types.items()):
+        print(f"  {itype:20s} {count:5d}")
+
+    print(f"\nOutput: {output_path}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/generate_raw_batch.sh
+++ b/scripts/generate_raw_batch.sh
@ -0,0 +1,61 @@
+#!/usr/bin/env bash
+# Generate raw folksy sayings across all 7 templates.
+# Output: corpus/corpus_raw.jsonl (~10,500 entries)
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+CORPUS_DIR="$PROJECT_DIR/corpus"
+GENERATOR="$PROJECT_DIR/folksy_generator.py"
+
+COUNT_PER_TEMPLATE=${1:-1500}
+
+mkdir -p "$CORPUS_DIR"
+
+OUTPUT="$CORPUS_DIR/corpus_raw.jsonl"
+# Clear existing file
+> "$OUTPUT"
+
+TEMPLATES=(
+    deconstruction
+    denial_of_consequences
+    ironic_deficiency
+    futile_preparation
+    hypocritical_complaint
+    tautological_wisdom
+    false_equivalence
+)
+
+echo "Generating $COUNT_PER_TEMPLATE sayings per template (${#TEMPLATES[@]} templates)..."
+echo "Output: $OUTPUT"
+
+total=0
+for template in "${TEMPLATES[@]}"; do
+    echo -n "  $template ($COUNT_PER_TEMPLATE)... "
+    before=$(wc -l < "$OUTPUT")
+    python "$GENERATOR" --template "$template" --count "$COUNT_PER_TEMPLATE" --json >> "$OUTPUT" 2>/dev/null
+    after=$(wc -l < "$OUTPUT")
+    generated=$((after - before))
+    total=$((total + generated))
+    echo "$generated generated"
+done
+
+echo ""
+echo "Total: $total raw sayings in $OUTPUT"
+echo ""
+
+# Check template distribution
+echo "Template distribution:"
+python -c "
+import json, sys
+from collections import Counter
+counts = Counter()
+with open(sys.argv[1], encoding='utf-8') as f:
+    for line in f:
+        entry = json.loads(line)
+        counts[entry['meta_template']] += 1
+for template, count in sorted(counts.items()):
+    print(f'  {template:30s} {count:5d}')
+print(f\"  {'TOTAL':30s} {sum(counts.values()):5d}\")
+" "$OUTPUT"
--- a/scripts/polish_corpus.py
+++ b/scripts/polish_corpus.py
@ -0,0 +1,215 @@
+#!/usr/bin/env python3
+"""LLM polish pipeline for raw folksy sayings.
+
+Reads corpus_raw.jsonl, sends each to GLM4-32B for polish.
+Output file is the checkpoint — append mode with resume detection.
+
+Usage:
+  python scripts/polish_corpus.py
+  python scripts/polish_corpus.py --input corpus/corpus_raw.jsonl --output corpus/corpus_polished.jsonl
+"""
+
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_DIR = SCRIPT_DIR.parent
+CORPUS_DIR = PROJECT_DIR / "corpus"
+
+LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
+LLM_MODEL = "THUDM-GLM4-32B"
+
+
+SYSTEM_PROMPT = """You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.
+
+Your job:
+1. Fix grammar, articles, and pluralization
+2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
+3. Preserve the core nouns and the relationship between them — do not swap out the key words
+4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
+5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
+6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD
+
+Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.
+
+Examples of good polish:
+
+Raw: "Don't build the coffee and act surprised when the water show up."
+Chain: coffee MadeOf water
+Polished: Don't brew the coffee and act surprised when the water's all gone.
+
+Raw: "The chest's children always goes without hold books."
+Chain: chest UsedFor hold_books
+Polished: The bookshelf-maker's kids always end up reading off the floor.
+
+Raw: "A pineapple is just a nectarine that's got an attitude."
+Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
+Polished: A pineapple is just a peach that grew itself some armor.
+
+Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
+Chain: steel MadeOf iron, steel HasProperty hard
+Polished: You know what they say — steel without the iron is just a dream of being hard.
+
+Raw: "Funny how the bamboo never has enough grow very quickly for itself."
+Chain: bamboo CapableOf grow_quickly
+Polished: DISCARD
+
+Raw: "That's just funning the canoe and praying for boiling food."
+Chain: canoe UsedFor transport, fire UsedFor boiling_food
+Polished: DISCARD"""
+
+
+def llm_chat_completion(messages, max_retries=3):
+    """Chat completion with retry logic."""
+    import requests
+
+    for attempt in range(max_retries):
+        try:
+            resp = requests.post(LLM_ENDPOINT, json={
+                "model": LLM_MODEL,
+                "messages": messages,
+            }, timeout=120)
+            resp.raise_for_status()
+            data = resp.json()
+            return data["choices"][0]["message"]["content"].strip()
+        except Exception as e:
+            wait = (2 ** attempt)
+            print(f"  LLM error (attempt {attempt+1}/{max_retries}): {e}", file=sys.stderr)
+            if attempt < max_retries - 1:
+                time.sleep(wait)
+            else:
+                return None
+
+
+def format_chain(chain_edges):
+    """Format chain_edges list into readable string for LLM context."""
+    if not chain_edges:
+        return "(no chain data)"
+    parts = []
+    for edge in chain_edges:
+        start = edge.get("start", "?")
+        rel = edge.get("relation", "?")
+        end = edge.get("end", "?")
+        weight = edge.get("weight", 0)
+        parts.append(f"{start} --{rel}--> {end} (w:{weight:.1f})")
+    return ", ".join(parts)
+
+
+def format_slots(slots):
+    """Format slots dict for LLM context."""
+    return ", ".join(f"{k}={v}" for k, v in slots.items())
+
+
+def load_already_processed(output_path):
+    """Load set of raw_text strings already processed (for resume)."""
+    processed = set()
+    if output_path.exists():
+        with open(output_path, encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if not line:
+                    continue
+                try:
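+                    entry = json.loads(line)
+                    # Entries that hit an LLM error are not counted, so they are retried on resume
+                    if entry.get("status") != "error":
+                        processed.add(entry.get("raw_text", ""))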
+                except json.JSONDecodeError:
+                    continue
+    return processed
+
+
+def main():
+    parser = argparse.ArgumentParser(description="LLM polish pipeline for folksy sayings.")
+    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_raw.jsonl"),
+                        help="Input JSONL file")
+    parser.add_argument("--output", default=str(CORPUS_DIR / "corpus_polished.jsonl"),
+                        help="Output JSONL file (also serves as checkpoint)")
+    args = parser.parse_args()
+
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+
+    if not input_path.exists():
+        print(f"Error: {input_path} not found.", file=sys.stderr)
+        sys.exit(1)
+
+    # Load raw entries
+    raw_entries = []
+    with open(input_path, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                raw_entries.append(json.loads(line))
+
+    print(f"Loaded {len(raw_entries)} raw entries from {input_path}")
+
+    # Check what's already been processed
+    already_processed = load_already_processed(output_path)
+    remaining = [e for e in raw_entries if e.get("raw_text", "") not in already_processed]
+
+    print(f"Already processed: {len(already_processed)}")
+    print(f"Remaining: {len(remaining)}")
+
+    if not remaining:
+        print("Nothing to process.")
+        return
+
+    discards = 0
+    polished = 0
+    errors = 0
+
+    with open(output_path, "a", encoding="utf-8") as out:
+        for i, entry in enumerate(remaining):
+            raw_text = entry.get("raw_text", "")
+            meta_template = entry.get("meta_template", "")
+            chain = format_chain(entry.get("chain", []))
+            slots = format_slots(entry.get("slots", {}))
+
+            user_prompt = (
+                f"Meta-template: {meta_template}\n"
+                f"Relationship chain: {chain}\n"
+                f"Slot fills: {slots}\n"
+                f"Raw saying: {raw_text}"
+            )
+
+            messages = [
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_prompt},
+            ]
+
+            response = llm_chat_completion(messages)
+
+            if response is None:
+                entry["status"] = "error"
+                errors += 1
+            elif response.strip().upper() == "DISCARD":
+                entry["status"] = "discarded"
+                discards += 1
+            else:
+                entry["polished_text"] = response.strip()
+                entry["status"] = "polished"
+                polished += 1
+
+            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
+            out.flush()  # the output doubles as the checkpoint, so persist every entry
+
+            if (i + 1) % 100 == 0:
+                total_done = len(already_processed) + i + 1
+                print(f"  [{total_done}/{len(raw_entries)}] "
+                      f"polished={polished}, discarded={discards}, errors={errors}")
+
+            time.sleep(0.1)
+
+    total_done = len(already_processed) + len(remaining)
+    print(f"\nDone: {total_done} total entries processed.")
+    print(f"  Polished: {polished}")
+    print(f"  Discarded: {discards}")
+    print(f"  Errors: {errors}")
+    print(f"  Discard rate: {discards/(polished+discards)*100:.1f}%" if (polished+discards) else "  Discard rate: N/A")
+    print(f"Output: {output_path}")
+
+
+if __name__ == "__main__":
+    main()