# Corpus Generation Spec — LLM-Polished Training Data

## Overview

The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B-parameter) task-specific model.

The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.

## Infrastructure

```python
import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Call the chat completion endpoint of the local LLM."""
    return requests.post("http://192.168.1.100:8853/v1/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()
```

Same local endpoint as the graph enhancement phase. No cloud APIs.

## Phase 1: Bulk Raw Generation

### Goal

Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.

### Generation Strategy

Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:

```bash
# Even coverage across all 7 meta-template families
for template in deconstruction denial_of_consequences ironic_deficiency \
                futile_preparation hypocritical_complaint tautological_wisdom \
                false_equivalence; do
    python folksy_generator.py --template $template --count 1500 --debug \
        --output raw_${template}.jsonl
done
```

### Output Format

The `--debug` flag is critical. Raw output should be JSONL with the relationship chain preserved:

```json
{
  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
  "meta_template": "deconstruction",
  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
  "chain": [
    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
  ]
}
```

This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions. The final training data needs the meta-template label.

### Deduplication at Generation Time

Before writing each generated saying, check:

- Exact duplicate `raw_text` → skip
- Same (meta_template, surface_template, slots) combination → skip (the same slot fills reused under a different surface template are fine)
- Seed word already used more than 30 times in the batch → skip (prevents dog/bark saturation)
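The three checks above can be sketched as a small stateful filter kept in memory during generation. One possible implementation (field names follow the debug JSONL format; treating slot `A` as the seed word is an assumption):

```python
from collections import Counter

class GenerationDeduper:
    """Skip exact duplicates, repeated slot-fill combinations, and
    over-used seed words during bulk generation (sketch)."""

    def __init__(self, max_seed_uses=30):
        self.seen_text = set()
        self.seen_fills = set()
        self.seed_counts = Counter()
        self.max_seed_uses = max_seed_uses

    def should_keep(self, entry):
        if entry["raw_text"] in self.seen_text:
            return False
        # Key includes the surface template: the same slot fills under a
        # different surface template are allowed through.
        fill_key = (entry["meta_template"],
                    entry["surface_template"],
                    tuple(sorted(entry["slots"].items())))
        if fill_key in self.seen_fills:
            return False
        seed = entry["slots"]["A"]  # assumption: slot A holds the seed word
        if self.seed_counts[seed] >= self.max_seed_uses:
            return False
        self.seen_text.add(entry["raw_text"])
        self.seen_fills.add(fill_key)
        self.seed_counts[seed] += 1
        return True
```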

## Phase 2: LLM Polish

### Goal

Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.

### System Prompt

```
You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.

Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them — do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD

Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.

Examples of good polish:

Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.

Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The chest-maker's kids always end up keeping their books on the floor.

Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a nectarine that grew itself some armor.

Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say — steel without the iron is just a dream of being hard.

Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD

Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD
```

### User Prompt Template

```
Meta-template: {meta_template}
Relationship chain: {chain_formatted}
Slot fills: {slots_formatted}
Raw saying: {raw_text}
```

### Chain Formatting

Format the chain as a readable string:

```
bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
```
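A minimal sketch of the two formatting helpers the batch-processing loop below relies on (`format_polish_prompt` is named in that code; the chain-formatter name and the `k=v` slot rendering are assumptions):

```python
def format_chain(chain):
    """Render a relationship chain as the readable arrow string above."""
    return ", ".join(
        f"{e['start']} --{e['relation']}--> {e['end']} (w:{e['weight']})"
        for e in chain
    )

def format_polish_prompt(entry):
    """Fill the user prompt template from a raw JSONL entry."""
    slots = ", ".join(f"{k}={v}" for k, v in sorted(entry["slots"].items()))
    return (
        f"Meta-template: {entry['meta_template']}\n"
        f"Relationship chain: {format_chain(entry['chain'])}\n"
        f"Slot fills: {slots}\n"
        f"Raw saying: {entry['raw_text']}"
    )
```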

### Batch Processing

```python
import json
import time

def polish_batch(input_path, output_path):
    system_prompt = load_system_prompt()  # the system prompt above

    with open(input_path) as f:
        raw_entries = [json.loads(line) for line in f]

    results = []
    discards = 0

    for i, entry in enumerate(raw_entries):
        user_prompt = format_polish_prompt(entry)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        response = llm_chat_completion(messages)
        polished = response['choices'][0]['message']['content'].strip()

        if polished == "DISCARD":
            discards += 1
            entry['status'] = 'discarded'
        else:
            entry['polished_text'] = polished
            entry['status'] = 'polished'

        results.append(entry)

        if (i + 1) % 100 == 0:
            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
            save_checkpoint(results, output_path)  # write checkpoint

        time.sleep(0.1)  # gentle rate limiting

    save_final(results, output_path)
    print(f"Done: {len(results) - discards} polished, {discards} discarded")
```
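`save_checkpoint` and `save_final` are left undefined in the loop above. One possible shape, assuming atomic JSONL writes so that an interrupted run never leaves a corrupt checkpoint (the `.tmp` suffix and the `load_checkpoint` resume helper are assumptions):

```python
import json
import os

def save_checkpoint(results, output_path):
    """Write all results so far to a temp file, then atomically replace
    the output; a crash mid-write never corrupts the checkpoint."""
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        for entry in results:
            f.write(json.dumps(entry) + "\n")
    os.replace(tmp_path, output_path)

def save_final(results, output_path):
    save_checkpoint(results, output_path)

def load_checkpoint(output_path):
    """Resume support: return already-processed entries, if any."""
    if not os.path.exists(output_path):
        return []
    with open(output_path) as f:
        return [json.loads(line) for line in f]
```

On restart, `polish_batch` can skip `len(load_checkpoint(output_path))` entries instead of re-polishing them.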

### Expected Discard Rate

Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw sayings to end up with 7,000-8,000 polished. If the discard rate after graph enhancement is lower (it should be — better edges mean fewer nonsense combos), that's a bonus.

## Phase 3: Deduplication and Quality Filtering

After LLM polish, run automated quality checks before including sayings in the training corpus.

### Automated Filters

```python
def quality_filter(entry):
    text = entry['polished_text']

    # Length check: real proverbs are short
    words = text.split()
    if len(words) > 25:
        return False, "too_long"
    if len(words) < 5:
        return False, "too_short"

    # Must contain at least 2 of the original slot-fill nouns
    slot_words = set(entry['slots'].values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"

    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if '_' in text:
        return False, "conceptnet_artifact"

    # No broken templates (unfilled slots)
    if '{' in text or '}' in text:
        return False, "unfilled_slot"

    return True, "pass"
```

### Near-Duplicate Detection

Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect these and keep only one:

```python
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
```

Run the check pairwise within each meta-template family, not across families — similar nouns in different structures are fine.
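The within-family pass can be sketched as a greedy prune: the first saying seen survives, later near-duplicates of it are dropped. This is O(n²) per family, which is fine at a few thousand sayings each (`is_near_duplicate` is repeated here for self-containment):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold

def prune_near_duplicates(entries):
    """Greedy within-family pruning: keep the first of each near-duplicate pair."""
    by_family = defaultdict(list)
    for e in entries:
        by_family[e["meta_template"]].append(e)

    kept = []
    for family_entries in by_family.values():
        survivors = []
        for e in family_entries:
            # Only compare against sayings already kept in the same family
            if not any(is_near_duplicate(e["polished_text"], s["polished_text"])
                       for s in survivors):
                survivors.append(e)
        kept.extend(survivors)
    return kept
```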

## Phase 4: Training Corpus Formatting

### Goal

Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.

### Training Pair Schema

Each polished saying generates multiple training pairs with different input framings:

```json
[
  {
    "input": "Tell me something about bread",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Tell me a saying about baking",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "What would a farmer say about flour?",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Give me a deconstruction proverb",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  }
]
```

### Input Framing Types

For each polished saying, generate training pairs with these input patterns:

1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`

Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.
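The expansion of one polished saying into several framed pairs can be sketched as below. The `word_categories` mapping is a hypothetical word-to-category lookup, and the exact mix of framings per saying is a design choice:

```python
import random

PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]
OPEN_ENDED = ["Tell me some folk wisdom", "What do they say?", "Give me a proverb"]

def make_training_pairs(entry, word_categories, rng=random):
    """Expand one polished saying into several input framings (sketch)."""
    words = list(entry["slots"].values())
    word = rng.choice(words)
    inputs = [
        f"Tell me something about {word}",                          # word-seeded
        f"What would a {rng.choice(PERSONAS)} say about {word}?",   # persona-seeded
        f"Give me a {entry['meta_template']} proverb",              # template-seeded
        rng.choice(OPEN_ENDED),                                     # open-ended
    ]
    category = word_categories.get(word)
    if category:                                                    # category-seeded
        inputs.append(f"Tell me a saying about {category}")
    return [{
        "input": text,
        "output": entry["polished_text"],
        "meta_template": entry["meta_template"],
        "source_words": words,
    } for text in inputs]
```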

### Fictional Entity Training Pairs

Additionally, generate training pairs that demonstrate fictional entity handling:

```json
{
  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
}
```

For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with the fictional entities loaded, then polish with GLM4-32B. Target: ~200-300 fictional entity training pairs — enough to teach the pattern without overwhelming the real-word training signal.

### Format for Fictional Entity Input

Standardize how entity descriptions appear in training inputs:

```
A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
```

Example:

```
A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
```
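Assuming entities carry name, categories, properties, and capabilities (the field names here are assumptions — adapt to the actual `my_world.json` schema), the standardized description can be built like this:

```python
def format_entity_description(entity):
    """Render a fictional entity as the standardized input sentence block.

    Field names (name/categories/properties/capabilities) are assumed,
    not taken from the real my_world.json schema.
    """
    parts = [f"A {entity['name']} is a {', '.join(entity['categories'])}."]
    for prop in entity.get("properties", []):
        parts.append(f"It {prop}.")
    for cap in entity.get("capabilities", []):
        parts.append(f"It can {cap}.")
    return " ".join(parts)
```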

This format matches what a game developer or worldbuilder would naturally provide at inference time.

## Phase 5: Corpus Statistics and Validation

### Required Metrics

Before declaring the corpus ready for fine-tuning, compute and report:

```
Total polished sayings: X
Discarded during polish: X (Y%)
Discarded during quality filter: X (Y%)
Final training pairs: X

Distribution by meta-template:
  deconstruction: X (Y%)
  denial_of_consequences: X (Y%)
  ironic_deficiency: X (Y%)
  futile_preparation: X (Y%)
  hypocritical_complaint: X (Y%)
  tautological_wisdom: X (Y%)
  false_equivalence: X (Y%)

Distribution by input framing type:
  word_seeded: X
  category_seeded: X
  persona_seeded: X
  template_seeded: X
  open_ended: X
  fictional: X

Unique slot words used: X (out of 534 vocab)
Words never used in any saying: [list]
Average saying length: X words
```

### Balance Check

If any meta-template family has less than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.

### Human Spot-Check

Randomly sample 50 polished sayings (spread across all families) and manually rate each as:

- **Good:** Sounds natural, funny, could fool someone into thinking it's real
- **Okay:** Grammatically correct but flat or too literal
- **Bad:** Awkward, nonsensical, or lost the relationship

Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten the quality filters.
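The balance check reduces to a distribution count over the final pairs. A sketch (the 10% floor is from the spec; the function name is illustrative):

```python
from collections import Counter

def check_balance(pairs, min_fraction=0.10):
    """Return the meta-template families under the minimum share of pairs,
    mapped to their actual fraction. Empty dict means the corpus is balanced."""
    counts = Counter(p["meta_template"] for p in pairs)
    total = sum(counts.values())
    return {family: n / total
            for family, n in counts.items()
            if n / total < min_fraction}
```

Any family returned here needs another round of raw generation before fine-tuning.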

## Output Files

### `corpus_raw.jsonl`
All raw generated sayings with debug metadata. One JSON object per line.

### `corpus_polished.jsonl`
All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.

### `corpus_filtered.jsonl`
Only sayings that passed quality filtering. One JSON object per line.

### `training_pairs.jsonl`
Final training corpus. One JSON object per line:

```json
{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
```

### `corpus_stats.json`
The metrics from Phase 5.

### `discard_analysis.csv`
Every discarded saying with its discard reason:

```
raw_text, meta_template, discard_stage, discard_reason
"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
"The fire's...", ironic_deficiency, quality_filter, "too_short"
```

This is valuable for debugging the template engine — if a specific surface-template variant has a >50% discard rate, the template itself needs fixing.

## File Organization

```
folksy-generator/
├── corpus/
│   ├── corpus_raw.jsonl
│   ├── corpus_polished.jsonl
│   ├── corpus_filtered.jsonl
│   ├── training_pairs.jsonl
│   ├── corpus_stats.json
│   └── discard_analysis.csv
├── scripts/
│   ├── generate_raw_batch.sh        # Runs generator across all templates
│   ├── polish_corpus.py             # LLM polish pipeline
│   ├── filter_corpus.py             # Quality filtering
│   ├── format_training_pairs.py     # Training pair generation
│   └── compute_corpus_stats.py      # Metrics and validation
```

## Execution Timeline

Assuming ~1 second per LLM call on the local 4090:

| Step | Items | Est. Time |
|------|-------|-----------|
| Raw generation (template engine only) | 10,500 | ~2 minutes |
| LLM polish | 10,500 | ~3 hours |
| Quality filtering | ~7,500 | ~1 minute |
| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |

Total: ~3.5 hours of mostly unattended LLM grinding. The polish step is the bottleneck and is fully resumable via checkpointing.

## Integration Notes

### Feeding into Fine-Tuning

The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (HuggingFace Trainer, axolotl, etc.). Training the 0.5B model is out of scope for this spec, but the corpus format is designed for it.

### Iterative Improvement

This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix is:

1. Generate more raw sayings targeting the weak area
2. Polish and filter
3. Append to the training corpus
4. Re-train

The JSONL format and checkpoint system support this append workflow natively.