# Corpus Generation Spec — LLM-Polished Training Data

## Overview

The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B parameter) task-specific model.

The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.

## Infrastructure

```python
import requests

def llm_chat_completion(messages: list, model: str = "THUDM-GLM4-32B") -> dict:
    """Call the chat completion endpoint of the local LLM."""
    return requests.post(
        "http://192.168.1.100:8853/v1/chat/completions",
        json={"model": model, "messages": messages},
    ).json()
```

Same local endpoint as the graph enhancement phase. No cloud APIs.

## Phase 1: Bulk Raw Generation

### Goal

Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.

### Generation Strategy

Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:

```bash
# Even coverage across all 7 meta-template families
for template in deconstruction denial_of_consequences ironic_deficiency \
                futile_preparation hypocritical_complaint tautological_wisdom \
                false_equivalence; do
  python folksy_generator.py --template $template --count 1500 --debug \
    --output raw_${template}.jsonl
done
```

### Output Format

The `--debug` flag is critical.
Raw output should be JSONL with the relationship chain preserved:

```json
{
  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
  "meta_template": "deconstruction",
  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
  "chain": [
    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
  ]
}
```

This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions, and the final training data needs the meta-template label.

### Deduplication at Generation Time

Before writing each generated saying, check:

- Exact duplicate `raw_text` → skip
- Same (meta_template, slots) tuple → skip (same slot fills with a different surface template is fine)
- Same seed word appeared more than 30 times across the batch → skip (prevents dog/bark saturation)

## Phase 2: LLM Polish

### Goal

Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.

### System Prompt

```
You are an editor specializing in folk sayings and rural proverbs. You will receive a rough
draft of a fake folksy saying along with the relationship chain it encodes.

Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them — do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD

Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.

Examples of good polish:

Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.

Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The bookshelf-maker's kids always end up reading off the floor.

Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a peach that grew itself some armor.

Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say — steel without the iron is just a dream of being hard.

Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD

Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD
```

### User Prompt Template

```
Meta-template: {meta_template}
Relationship chain: {chain_formatted}
Slot fills: {slots_formatted}
Raw saying: {raw_text}
```

### Chain Formatting

Format the chain as a readable string:

```
bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
```

### Batch Processing

```python
import json
import time

def polish_batch(input_path, output_path):
    system_prompt = load_system_prompt()  # the system prompt above

    with open(input_path) as f:
        raw_entries = [json.loads(line) for line in f]

    results = []
    discards = 0
    for i, entry in enumerate(raw_entries):
        user_prompt = format_polish_prompt(entry)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        response = llm_chat_completion(messages)
        polished = response['choices'][0]['message']['content'].strip()

        if polished == "DISCARD":
            discards += 1
            entry['status'] = 'discarded'
        else:
            entry['polished_text'] = polished
            entry['status'] = 'polished'
        results.append(entry)

        if (i + 1) % 100 == 0:
            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
            save_checkpoint(results, output_path)  # write checkpoint

        time.sleep(0.1)  # gentle rate limiting

    save_final(results, output_path)
    print(f"Done: {len(results) - discards} polished, {discards} discarded")
```

### Expected Discard Rate

Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw sayings to end up with 7,000-8,000 polished ones. If the discard rate after graph enhancement is lower (it should be — better edges = fewer nonsense combos), that's a bonus.

## Phase 3: Deduplication and Quality Filtering

After LLM polish, run automated quality checks before including anything in the training corpus.
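As an aside before filtering: the Phase 2 batch loop above leaves `format_polish_prompt` undefined. A minimal sketch, assuming the Phase 1 JSONL schema and the chain-formatting convention described earlier (both helper names are hypothetical):

```python
def format_chain(chain):
    # Render each chain entry as "start --Relation--> end (w:X)", comma-joined
    return ", ".join(
        f"{e['start']} --{e['relation']}--> {e['end']} (w:{e['weight']})"
        for e in chain
    )

def format_polish_prompt(entry):
    # Fill the user prompt template from one raw-corpus entry
    slots = ", ".join(f"{k}={v}" for k, v in entry["slots"].items())
    return (
        f"Meta-template: {entry['meta_template']}\n"
        f"Relationship chain: {format_chain(entry['chain'])}\n"
        f"Slot fills: {slots}\n"
        f"Raw saying: {entry['raw_text']}"
    )
```

`load_system_prompt`, `save_checkpoint`, and `save_final` remain simple file I/O helpers left to the implementation.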
### Automated Filters

```python
def quality_filter(entry):
    text = entry['polished_text']
    words = text.split()

    # Length check: real proverbs are short
    if len(words) > 25:
        return False, "too_long"
    if len(words) < 5:
        return False, "too_short"

    # Must contain at least 2 of the original slot-fill nouns
    slot_words = set(entry['slots'].values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"

    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if '_' in text:
        return False, "conceptnet_artifact"

    # No broken templates (unfilled slots)
    if '{' in text or '}' in text:
        return False, "unfilled_slot"

    return True, "pass"
```

### Near-Duplicate Detection

Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect these and keep only one:

```python
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
```

Run this pairwise within each meta-template family (not across families — similar nouns in different structures are fine).

## Phase 4: Training Corpus Formatting

### Goal

Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.
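A sketch of this packaging step, expanding one polished saying into several pairs (function name hypothetical; the prompt strings follow the framing patterns listed below, with the category-seeded framing omitted since it needs a word-to-category map):

```python
import random

OPEN_ENDED = ["Tell me some folk wisdom", "What do they say?", "Give me a proverb"]
PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]

def make_training_pairs(entry, n_framings=4, rng=random):
    # Pick one slot word to seed the word- and persona-framed prompts
    word = rng.choice(list(entry["slots"].values()))
    candidates = [
        f"Tell me something about {word}",                         # word-seeded
        f"What would a {rng.choice(PERSONAS)} say about {word}?",  # persona-seeded
        f"Give me a {entry['meta_template']} proverb",             # template-seeded
        rng.choice(OPEN_ENDED),                                    # open-ended
    ]
    return [
        {
            "input": prompt,
            "output": entry["polished_text"],
            "meta_template": entry["meta_template"],
            "source_words": sorted(entry["slots"].values()),
        }
        for prompt in candidates[:n_framings]
    ]
```

Each call yields one pair per framing, all sharing the same output text, matching the schema shown next.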
### Training Pair Schema

Each polished saying generates multiple training pairs with different input framings:

```json
[
  {
    "input": "Tell me something about bread",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Tell me a saying about baking",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "What would a farmer say about flour?",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Give me a deconstruction proverb",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  }
]
```

### Input Framing Types

For each polished saying, generate training pairs with these input patterns:

1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`

Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.

### Fictional Entity Training Pairs

Additionally, generate training pairs that demonstrate fictional entity handling:

```json
{
  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
}
```

For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with the fictional entities loaded, then polish with GLM4-32B.

Target: ~200-300 fictional entity training pairs — enough to teach the pattern without overwhelming the real-word training signal.

### Format for Fictional Entity Input

Standardize how entity descriptions appear in training inputs:

```
A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
```

Example:

```
A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
```

This format matches what a game developer or worldbuilder would naturally provide at inference time.

## Phase 5: Corpus Statistics and Validation

### Required Metrics

Before declaring the corpus ready for fine-tuning, compute and report:

```
Total polished sayings: X
Discarded during polish: X (Y%)
Discarded during quality filter: X (Y%)
Final training pairs: X

Distribution by meta-template:
  deconstruction: X (Y%)
  denial_of_consequences: X (Y%)
  ironic_deficiency: X (Y%)
  futile_preparation: X (Y%)
  hypocritical_complaint: X (Y%)
  tautological_wisdom: X (Y%)
  false_equivalence: X (Y%)

Distribution by input framing type:
  word_seeded: X
  category_seeded: X
  persona_seeded: X
  template_seeded: X
  open_ended: X
  fictional: X

Unique slot words used: X (out of 534 vocab)
Words never used in any saying: [list]
Average saying length: X words
```

### Balance Check

If any meta-template family has fewer than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.
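The balance check can be sketched as a small helper over the training-pair list (function name and threshold parameter are illustrative):

```python
from collections import Counter

def underrepresented_families(pairs, threshold=0.10):
    # Flag meta-template families holding less than `threshold` of all pairs
    counts = Counter(p["meta_template"] for p in pairs)
    total = sum(counts.values())
    return [fam for fam, n in counts.items() if n / total < threshold]
```

Any family this returns is a candidate for another targeted generation run.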
### Human Spot-Check

Randomly sample 50 polished sayings (spread across all families) and manually rate each as:

- **Good:** Sounds natural, funny, could fool someone into thinking it's real
- **Okay:** Grammatically correct but flat or too literal
- **Bad:** Awkward, nonsensical, or lost the relationship

Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten the quality filters.

## Output Files

### `corpus_raw.jsonl`

All raw generated sayings with debug metadata. One JSON object per line.

### `corpus_polished.jsonl`

All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.

### `corpus_filtered.jsonl`

Only sayings that passed quality filtering. One JSON object per line.

### `training_pairs.jsonl`

Final training corpus. One JSON object per line:

```json
{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
```

### `corpus_stats.json`

The metrics from Phase 5.

### `discard_analysis.csv`

Every discarded saying with its discard reason:

```
raw_text, meta_template, discard_stage, discard_reason
"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
"The fire's...", ironic_deficiency, quality_filter, "too_short"
```

This is valuable for debugging the template engine — if a specific surface template variant has a >50% discard rate, the template itself needs fixing.

## File Organization

```
folksy-generator/
├── corpus/
│   ├── corpus_raw.jsonl
│   ├── corpus_polished.jsonl
│   ├── corpus_filtered.jsonl
│   ├── training_pairs.jsonl
│   ├── corpus_stats.json
│   └── discard_analysis.csv
├── scripts/
│   ├── generate_raw_batch.sh      # Runs generator across all templates
│   ├── polish_corpus.py           # LLM polish pipeline
│   ├── filter_corpus.py           # Quality filtering
│   ├── format_training_pairs.py   # Training pair generation
│   └── compute_corpus_stats.py    # Metrics and validation
```

## Execution Timeline

Assuming ~1 second per LLM call on the local 4090:

| Step | Items | Est. Time |
|------|-------|-----------|
| Raw generation (template engine only) | 10,500 | ~2 minutes |
| LLM polish | 10,500 | ~3 hours |
| Quality filtering | ~7,500 | ~1 minute |
| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |

Total: ~3.5 hours of mostly-unattended LLM grinding. The polish step is the bottleneck, and it is fully resumable via checkpointing.

## Integration Notes

### Feeding into Fine-Tuning

The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (HuggingFace Trainer, axolotl, etc.). The 0.5B model training is out of scope for this spec, but the corpus format is designed for it.

### Iterative Improvement

This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix:

1. Generate more raw sayings targeting the weak area
2. Polish and filter
3. Append to the training corpus
4. Re-train

The JSONL format and checkpoint system support this append workflow natively.
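The append step can be sketched with exact-duplicate protection, so re-runs never write the same pair twice (function name hypothetical; keyed on the (input, output) pair):

```python
import json

def append_new_pairs(corpus_path, new_pairs):
    """Append new pairs to training_pairs.jsonl, skipping exact duplicates."""
    with open(corpus_path) as f:
        seen = {(p["input"], p["output"]) for p in map(json.loads, f)}
    added = 0
    with open(corpus_path, "a") as out:
        for pair in new_pairs:
            key = (pair["input"], pair["output"])
            if key not in seen:
                seen.add(key)
                out.write(json.dumps(pair) + "\n")
                added += 1
    return added
```

Near-duplicate pruning from Phase 3 can be layered on top of this if polished re-runs start converging on existing sayings.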