corpus generation (work from mid february)

parent 8c8a058301 · commit 356b62c6ea · 16 changed files with 25872 additions and 38 deletions

CORPUS_GENERATION_SPEC.md (new file, 431 lines)

# Corpus Generation Spec — LLM-Polished Training Data

## Overview

The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B parameter) task-specific model.

The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.

## Infrastructure

```python
import requests

def llm_chat_completion(messages: list, model: str = "THUDM-GLM4-32B") -> dict:
    """Chat completion endpoint of the local LLM (OpenAI-compatible)."""
    response = requests.post(
        "http://192.168.1.100:8853/v1/chat/completions",
        json={"model": model, "messages": messages},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()
```

Same local endpoint as the graph enhancement phase. No cloud APIs.

## Phase 1: Bulk Raw Generation

### Goal

Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.

### Generation Strategy

Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:

```bash
# Even coverage across all 7 meta-template families
for template in deconstruction denial_of_consequences ironic_deficiency \
                futile_preparation hypocritical_complaint tautological_wisdom \
                false_equivalence; do
    python folksy_generator.py --template "$template" --count 1500 --debug \
        --output "raw_${template}.jsonl"
done
```

### Output Format

The `--debug` flag is critical. Raw output should be JSONL with the relationship chain preserved:

```json
{
  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
  "meta_template": "deconstruction",
  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
  "chain": [
    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
  ]
}
```

This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions, and the final training data needs the meta-template label.

### Deduplication at Generation Time

Before writing each generated saying, check:

- Exact duplicate `raw_text` → skip
- Same `(meta_template, slots)` tuple → skip (the same slot fills under a different meta-template are fine)
- Same seed word already used more than 30 times in the batch → skip (prevents dog/bark saturation)
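The three checks above can be sketched as a stateful predicate. This is a sketch, not code from the repo: `make_dedup_filter`, `MAX_SEED_USES`, and the assumption that slot `A` holds the seed word are all illustrative.

```python
from collections import Counter

MAX_SEED_USES = 30  # the per-batch cap from the rule above

def make_dedup_filter():
    """Return a predicate applying the three skip rules in order."""
    seen_texts = set()
    seen_slot_keys = set()
    seed_counts = Counter()

    def keep(entry: dict) -> bool:
        text = entry["raw_text"]
        if text in seen_texts:
            return False  # exact duplicate raw_text
        slot_key = (entry["meta_template"], tuple(sorted(entry["slots"].items())))
        if slot_key in seen_slot_keys:
            return False  # same (meta_template, slots) tuple
        seed = entry["slots"]["A"]  # assumption: slot A is the seed word
        if seed_counts[seed] >= MAX_SEED_USES:
            return False  # seed word saturated
        seen_texts.add(text)
        seen_slot_keys.add(slot_key)
        seed_counts[seed] += 1
        return True

    return keep
```

The filter is constructed once per batch so its seen-sets span the whole run.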
## Phase 2: LLM Polish

### Goal

Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.

### System Prompt

```
You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.

Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them — do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD

Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.

Examples of good polish:

Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.

Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The bookshelf-maker's kids always end up reading off the floor.

Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a peach that grew itself some armor.

Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say — steel without the iron is just a dream of being hard.

Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD

Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD
```

### User Prompt Template

```
Meta-template: {meta_template}
Relationship chain: {chain_formatted}
Slot fills: {slots_formatted}
Raw saying: {raw_text}
```

### Chain Formatting

Format the chain as a readable string:

```
bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
```
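The batch-processing code in this spec calls a `format_polish_prompt` helper that is never defined. One minimal sketch, assuming the entry schema from the Output Format section (the helper names are illustrative):

```python
def format_chain(chain: list) -> str:
    """Render a relationship chain as 'start --Relation--> end (w:X)' segments."""
    return ", ".join(
        f"{edge['start']} --{edge['relation']}--> {edge['end']} (w:{edge['weight']})"
        for edge in chain
    )

def format_polish_prompt(entry: dict) -> str:
    """Fill the user-prompt template from one raw corpus entry."""
    slots = ", ".join(f"{k}={v}" for k, v in sorted(entry["slots"].items()))
    return (
        f"Meta-template: {entry['meta_template']}\n"
        f"Relationship chain: {format_chain(entry['chain'])}\n"
        f"Slot fills: {slots}\n"
        f"Raw saying: {entry['raw_text']}"
    )
```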

### Batch Processing

```python
import json
import time

def polish_batch(input_path, output_path):
    system_prompt = load_system_prompt()  # The prompt above

    with open(input_path) as f:
        raw_entries = [json.loads(line) for line in f]

    results = []
    discards = 0

    for i, entry in enumerate(raw_entries):
        user_prompt = format_polish_prompt(entry)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]

        response = llm_chat_completion(messages)
        polished = response['choices'][0]['message']['content'].strip()

        if polished == "DISCARD":
            discards += 1
            entry['status'] = 'discarded'
        else:
            entry['polished_text'] = polished
            entry['status'] = 'polished'

        results.append(entry)

        if (i + 1) % 100 == 0:
            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
            # Write checkpoint
            save_checkpoint(results, output_path)

        time.sleep(0.1)  # gentle rate limiting

    save_final(results, output_path)
    print(f"Done: {len(results) - discards} polished, {discards} discarded")
```

### Expected Discard Rate

Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw to end up with 7,000-8,000 polished. If the discard rate after graph enhancement is lower (it should be — better edges = fewer nonsense combos), that's a bonus.

## Phase 3: Deduplication and Quality Filtering

After LLM polish, run automated quality checks before including sayings in the training corpus.

### Automated Filters

```python
def quality_filter(entry):
    text = entry['polished_text']
    words = text.split()

    # Length check: real proverbs are short
    if len(words) > 25:
        return False, "too_long"
    if len(words) < 5:
        return False, "too_short"

    # Must contain at least 2 of the original slot-fill words
    slot_words = set(entry['slots'].values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"

    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if '_' in text:
        return False, "conceptnet_artifact"

    # No broken templates (unfilled slots)
    if '{' in text or '}' in text:
        return False, "unfilled_slot"

    return True, "pass"
```

### Near-Duplicate Detection

Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect and keep only one:

```python
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
```

Run pairwise within each meta-template family (not across families — similar nouns in different structures are fine).
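A minimal sketch of that family-scoped pass (the function name and the greedy keep-first policy are assumptions):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def dedup_within_families(entries: list, threshold: float = 0.75) -> list:
    """Greedy pass: keep the first saying seen, drop later near-duplicates
    within the same meta-template family."""
    by_family = defaultdict(list)  # family -> lowercased texts kept so far
    kept = []
    for entry in entries:
        text = entry["polished_text"].lower()
        family = entry["meta_template"]
        if any(SequenceMatcher(None, text, seen).ratio() > threshold
               for seen in by_family[family]):
            continue  # near-duplicate of an earlier saying in this family
        by_family[family].append(text)
        kept.append(entry)
    return kept
```

The comparison is O(n²) per family, which is acceptable at roughly a thousand sayings per family.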

## Phase 4: Training Corpus Formatting

### Goal

Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.

### Training Pair Schema

Each polished saying generates multiple training pairs with different input framings:

```json
[
  {
    "input": "Tell me something about bread",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Tell me a saying about baking",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "What would a farmer say about flour?",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Give me a deconstruction proverb",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  }
]
```
```

### Input Framing Types

For each polished saying, generate training pairs with these input patterns:

1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`

Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.
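The expansion from one saying into 3-5 framed pairs can be sketched as follows. The persona and open-ended lists come from the patterns above; `word_categories` and the sampling policy are assumptions:

```python
import random

PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]
OPEN_ENDED = ["Tell me some folk wisdom", "What do they say?", "Give me a proverb"]

def make_training_pairs(entry: dict, word_categories: dict, k: int = 4) -> list:
    """Expand one polished saying into k differently framed training pairs."""
    words = list(entry["slots"].values())
    word = random.choice(words)
    candidates = [
        f"Tell me something about {word}",                       # word-seeded
        f"Give me a {entry['meta_template']} proverb",           # template-seeded
        f"What would a {random.choice(PERSONAS)} say about {word}?",  # persona-seeded
        random.choice(OPEN_ENDED),                               # open-ended
    ]
    if word in word_categories:  # category framing only when the category is known
        candidates.append(f"Tell me a saying about {word_categories[word]}")
    framings = random.sample(candidates, k=min(k, len(candidates)))
    return [
        {
            "input": prompt,
            "output": entry["polished_text"],
            "meta_template": entry["meta_template"],
            "source_words": words,
        }
        for prompt in framings
    ]
```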

### Fictional Entity Training Pairs

Additionally, generate training pairs that demonstrate fictional entity handling:

```json
{
  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
}
```

For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with the fictional entities loaded, then polish with GLM4-32B. Target: ~200-300 fictional entity training pairs to teach the pattern without overwhelming the real-word training signal.

### Format for Fictional Entity Input

Standardize how entity descriptions appear in training inputs:

```
A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
```

Example:
```
A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
```

This format matches what a game developer or worldbuilder would naturally provide at inference time.
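Rendering that standardized description from an entity record might look like the sketch below. The entity-dict shape (`name`, `categories`, `properties`, `relationships`) is an assumption based on the fields named in the template, not the actual `my_world.json` schema:

```python
def format_entity_description(entity: dict) -> str:
    """Render a fictional entity as the standardized training-input preamble."""
    name = entity["name"]
    categories = ", ".join(entity["categories"])
    # Each property/relationship becomes its own short "It ..." sentence.
    properties = " ".join(f"It {p}." for p in entity["properties"])
    relationships = " ".join(f"It {r}." for r in entity["relationships"])
    return f"A {name} is a {categories}. {properties} {relationships}".strip()
```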

## Phase 5: Corpus Statistics and Validation

### Required Metrics

Before declaring the corpus ready for fine-tuning, compute and report:

```
Total polished sayings: X
Discarded during polish: X (Y%)
Discarded during quality filter: X (Y%)
Final training pairs: X

Distribution by meta-template:
  deconstruction: X (Y%)
  denial_of_consequences: X (Y%)
  ironic_deficiency: X (Y%)
  futile_preparation: X (Y%)
  hypocritical_complaint: X (Y%)
  tautological_wisdom: X (Y%)
  false_equivalence: X (Y%)

Distribution by input framing type:
  word_seeded: X
  category_seeded: X
  persona_seeded: X
  template_seeded: X
  open_ended: X
  fictional: X

Unique slot words used: X (out of 534 vocab)
Words never used in any saying: [list]
Average saying length: X words
```

### Balance Check

If any meta-template family has less than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.

### Human Spot-Check

Randomly sample 50 polished sayings (spread across all families) and manually rate each as:

- **Good:** Sounds natural, funny, could fool someone into thinking it's real
- **Okay:** Grammatically correct but flat or too literal
- **Bad:** Awkward, nonsensical, or lost the relationship

Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten quality filters.

## Output Files

### `corpus_raw.jsonl`

All raw generated sayings with debug metadata. One JSON object per line.

### `corpus_polished.jsonl`

All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.

### `corpus_filtered.jsonl`

Only sayings that passed quality filtering. One JSON object per line.

### `training_pairs.jsonl`

Final training corpus. One JSON object per line:

```json
{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
```

### `corpus_stats.json`

The metrics from Phase 5.

### `discard_analysis.csv`

Every discarded saying with its discard reason:

```
raw_text, meta_template, discard_stage, discard_reason
"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
"The fire's...", ironic_deficiency, quality_filter, "too_short"
```

This is valuable for debugging the template engine — if a specific surface-template variant has a >50% discard rate, the template itself needs fixing.

## File Organization

```
folksy-generator/
├── corpus/
│   ├── corpus_raw.jsonl
│   ├── corpus_polished.jsonl
│   ├── corpus_filtered.jsonl
│   ├── training_pairs.jsonl
│   ├── corpus_stats.json
│   └── discard_analysis.csv
├── scripts/
│   ├── generate_raw_batch.sh      # Runs generator across all templates
│   ├── polish_corpus.py           # LLM polish pipeline
│   ├── filter_corpus.py           # Quality filtering
│   ├── format_training_pairs.py   # Training pair generation
│   └── compute_corpus_stats.py    # Metrics and validation
```

## Execution Timeline

Assuming ~1 second per LLM call on the local 4090:

| Step | Items | Est. Time |
|------|-------|-----------|
| Raw generation (template engine only) | 10,500 | ~2 minutes |
| LLM polish | 10,500 | ~3 hours |
| Quality filtering | ~7,500 | ~1 minute |
| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |

Total: ~3.5 hours of mostly-unattended LLM grinding. The polish step is the bottleneck and fully resumable via checkpointing.

## Integration Notes

### Feeding into Fine-Tuning

The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (HuggingFace Trainer, axolotl, etc.). The 0.5B model training is out of scope for this spec, but the corpus format is designed for it.

### Iterative Improvement

This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix is:

1. Generate more raw sayings targeting the weak area
2. Polish and filter
3. Append to the training corpus
4. Re-train

The JSONL format and checkpoint system support this append workflow natively.