corpus generation (work from mid february)

parent 8c8a058301 · commit 356b62c6ea · 16 changed files with 25872 additions and 38 deletions

CORPUS_GENERATION_SPEC.md (new file, 431 lines)

# Corpus Generation Spec — LLM-Polished Training Data

## Overview

The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B parameter) task-specific model.

The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.

## Infrastructure

```python
import requests

def llm_chat_completion(messages: list, model: str = "THUDM-GLM4-32B") -> dict:
    """Chat completion endpoint of the local LLM (OpenAI-compatible)."""
    response = requests.post(
        "http://192.168.1.100:8853/v1/chat/completions",
        json={"model": model, "messages": messages},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()
```

Same local endpoint as the graph enhancement phase. No cloud APIs.

## Phase 1: Bulk Raw Generation

### Goal

Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.

### Generation Strategy

Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:

```bash
# Even coverage across all 7 meta-template families
for template in deconstruction denial_of_consequences ironic_deficiency \
                futile_preparation hypocritical_complaint tautological_wisdom \
                false_equivalence; do
    python folksy_generator.py --template "$template" --count 1500 --debug \
        --output "raw_${template}.jsonl"
done
```

### Output Format

The `--debug` flag is critical. Raw output should be JSONL with the relationship chain preserved:

```json
{
  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
  "meta_template": "deconstruction",
  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
  "chain": [
    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
  ]
}
```

This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions, and the final training data needs the meta-template label.

### Deduplication at Generation Time

Before writing each generated saying, check:

- Exact duplicate `raw_text` → skip
- Same `(meta_template, slots)` tuple → skip (the same slot fills under a different meta-template are fine)
- Same seed word already used more than 30 times in the batch → skip (prevents dog/bark saturation)
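The three checks above can be sketched as a stateful predicate. This is a sketch, not code from the repo: `make_dedup_filter`, `MAX_SEED_USES`, and the assumption that slot `A` holds the seed word are all illustrative.

```python
from collections import Counter

MAX_SEED_USES = 30  # the per-batch cap from the rule above

def make_dedup_filter():
    """Return a predicate applying the three skip rules in order."""
    seen_texts = set()
    seen_slot_keys = set()
    seed_counts = Counter()

    def keep(entry: dict) -> bool:
        text = entry["raw_text"]
        if text in seen_texts:
            return False  # exact duplicate raw_text
        slot_key = (entry["meta_template"], tuple(sorted(entry["slots"].items())))
        if slot_key in seen_slot_keys:
            return False  # same (meta_template, slots) tuple
        seed = entry["slots"]["A"]  # assumption: slot A is the seed word
        if seed_counts[seed] >= MAX_SEED_USES:
            return False  # seed word saturated
        seen_texts.add(text)
        seen_slot_keys.add(slot_key)
        seed_counts[seed] += 1
        return True

    return keep
```

The filter is constructed once per batch so its seen-sets span the whole run.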
## Phase 2: LLM Polish

### Goal

Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.

### System Prompt

```
You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.

Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them — do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD

Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.

Examples of good polish:

Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.

Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The bookshelf-maker's kids always end up reading off the floor.

Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a peach that grew itself some armor.

Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say — steel without the iron is just a dream of being hard.

Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD

Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD
```

### User Prompt Template

```
Meta-template: {meta_template}
Relationship chain: {chain_formatted}
Slot fills: {slots_formatted}
Raw saying: {raw_text}
```

### Chain Formatting

Format the chain as a readable string:

```
bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
```
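The batch-processing code in this spec calls a `format_polish_prompt` helper that is never defined. One minimal sketch, assuming the entry schema from the Output Format section (the helper names are illustrative):

```python
def format_chain(chain: list) -> str:
    """Render a relationship chain as 'start --Relation--> end (w:X)' segments."""
    return ", ".join(
        f"{edge['start']} --{edge['relation']}--> {edge['end']} (w:{edge['weight']})"
        for edge in chain
    )

def format_polish_prompt(entry: dict) -> str:
    """Fill the user-prompt template from one raw corpus entry."""
    slots = ", ".join(f"{k}={v}" for k, v in sorted(entry["slots"].items()))
    return (
        f"Meta-template: {entry['meta_template']}\n"
        f"Relationship chain: {format_chain(entry['chain'])}\n"
        f"Slot fills: {slots}\n"
        f"Raw saying: {entry['raw_text']}"
    )
```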

### Batch Processing

```python
import json
import time

def polish_batch(input_path, output_path):
    system_prompt = load_system_prompt()  # The prompt above

    with open(input_path) as f:
        raw_entries = [json.loads(line) for line in f]

    results = []
    discards = 0

    for i, entry in enumerate(raw_entries):
        user_prompt = format_polish_prompt(entry)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]

        response = llm_chat_completion(messages)
        polished = response['choices'][0]['message']['content'].strip()

        if polished == "DISCARD":
            discards += 1
            entry['status'] = 'discarded'
        else:
            entry['polished_text'] = polished
            entry['status'] = 'polished'

        results.append(entry)

        if (i + 1) % 100 == 0:
            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
            # Write checkpoint
            save_checkpoint(results, output_path)

        time.sleep(0.1)  # gentle rate limiting

    save_final(results, output_path)
    print(f"Done: {len(results) - discards} polished, {discards} discarded")
```

### Expected Discard Rate

Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw to end up with 7,000-8,000 polished. If the discard rate after graph enhancement is lower (it should be — better edges = fewer nonsense combos), that's a bonus.

## Phase 3: Deduplication and Quality Filtering

After LLM polish, run automated quality checks before including sayings in the training corpus.

### Automated Filters

```python
def quality_filter(entry):
    text = entry['polished_text']
    words = text.split()

    # Length check: real proverbs are short
    if len(words) > 25:
        return False, "too_long"
    if len(words) < 5:
        return False, "too_short"

    # Must contain at least 2 of the original slot-fill words
    slot_words = set(entry['slots'].values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"

    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if '_' in text:
        return False, "conceptnet_artifact"

    # No broken templates (unfilled slots)
    if '{' in text or '}' in text:
        return False, "unfilled_slot"

    return True, "pass"
```

### Near-Duplicate Detection

Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect and keep only one:

```python
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
```

Run pairwise within each meta-template family (not across families — similar nouns in different structures are fine).
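A minimal sketch of that family-scoped pass (the function name and the greedy keep-first policy are assumptions):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def dedup_within_families(entries: list, threshold: float = 0.75) -> list:
    """Greedy pass: keep the first saying seen, drop later near-duplicates
    within the same meta-template family."""
    by_family = defaultdict(list)  # family -> lowercased texts kept so far
    kept = []
    for entry in entries:
        text = entry["polished_text"].lower()
        family = entry["meta_template"]
        if any(SequenceMatcher(None, text, seen).ratio() > threshold
               for seen in by_family[family]):
            continue  # near-duplicate of an earlier saying in this family
        by_family[family].append(text)
        kept.append(entry)
    return kept
```

The comparison is O(n²) per family, which is acceptable at roughly a thousand sayings per family.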

## Phase 4: Training Corpus Formatting

### Goal

Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.

### Training Pair Schema

Each polished saying generates multiple training pairs with different input framings:

```json
[
  {
    "input": "Tell me something about bread",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Tell me a saying about baking",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "What would a farmer say about flour?",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Give me a deconstruction proverb",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  }
]
```
```

### Input Framing Types

For each polished saying, generate training pairs with these input patterns:

1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`

Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.
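The expansion from one saying into 3-5 framed pairs can be sketched as follows. The persona and open-ended lists come from the patterns above; `word_categories` and the sampling policy are assumptions:

```python
import random

PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]
OPEN_ENDED = ["Tell me some folk wisdom", "What do they say?", "Give me a proverb"]

def make_training_pairs(entry: dict, word_categories: dict, k: int = 4) -> list:
    """Expand one polished saying into k differently framed training pairs."""
    words = list(entry["slots"].values())
    word = random.choice(words)
    candidates = [
        f"Tell me something about {word}",                       # word-seeded
        f"Give me a {entry['meta_template']} proverb",           # template-seeded
        f"What would a {random.choice(PERSONAS)} say about {word}?",  # persona-seeded
        random.choice(OPEN_ENDED),                               # open-ended
    ]
    if word in word_categories:  # category framing only when the category is known
        candidates.append(f"Tell me a saying about {word_categories[word]}")
    framings = random.sample(candidates, k=min(k, len(candidates)))
    return [
        {
            "input": prompt,
            "output": entry["polished_text"],
            "meta_template": entry["meta_template"],
            "source_words": words,
        }
        for prompt in framings
    ]
```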

### Fictional Entity Training Pairs

Additionally, generate training pairs that demonstrate fictional entity handling:

```json
{
  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
}
```

For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with the fictional entities loaded, then polish with GLM4-32B. Target: ~200-300 fictional entity training pairs to teach the pattern without overwhelming the real-word training signal.

### Format for Fictional Entity Input

Standardize how entity descriptions appear in training inputs:

```
A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
```

Example:
```
A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
```

This format matches what a game developer or worldbuilder would naturally provide at inference time.
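Rendering that standardized description from an entity record might look like the sketch below. The entity-dict shape (`name`, `categories`, `properties`, `relationships`) is an assumption based on the fields named in the template, not the actual `my_world.json` schema:

```python
def format_entity_description(entity: dict) -> str:
    """Render a fictional entity as the standardized training-input preamble."""
    name = entity["name"]
    categories = ", ".join(entity["categories"])
    # Each property/relationship becomes its own short "It ..." sentence.
    properties = " ".join(f"It {p}." for p in entity["properties"])
    relationships = " ".join(f"It {r}." for r in entity["relationships"])
    return f"A {name} is a {categories}. {properties} {relationships}".strip()
```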

## Phase 5: Corpus Statistics and Validation

### Required Metrics

Before declaring the corpus ready for fine-tuning, compute and report:

```
Total polished sayings: X
Discarded during polish: X (Y%)
Discarded during quality filter: X (Y%)
Final training pairs: X

Distribution by meta-template:
  deconstruction: X (Y%)
  denial_of_consequences: X (Y%)
  ironic_deficiency: X (Y%)
  futile_preparation: X (Y%)
  hypocritical_complaint: X (Y%)
  tautological_wisdom: X (Y%)
  false_equivalence: X (Y%)

Distribution by input framing type:
  word_seeded: X
  category_seeded: X
  persona_seeded: X
  template_seeded: X
  open_ended: X
  fictional: X

Unique slot words used: X (out of 534 vocab)
Words never used in any saying: [list]
Average saying length: X words
```

### Balance Check

If any meta-template family has less than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.

### Human Spot-Check

Randomly sample 50 polished sayings (spread across all families) and manually rate each as:

- **Good:** Sounds natural, funny, could fool someone into thinking it's real
- **Okay:** Grammatically correct but flat or too literal
- **Bad:** Awkward, nonsensical, or lost the relationship

Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten quality filters.

## Output Files

### `corpus_raw.jsonl`

All raw generated sayings with debug metadata. One JSON object per line.

### `corpus_polished.jsonl`

All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.

### `corpus_filtered.jsonl`

Only sayings that passed quality filtering. One JSON object per line.

### `training_pairs.jsonl`

Final training corpus. One JSON object per line:

```json
{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
```

### `corpus_stats.json`

The metrics from Phase 5.

### `discard_analysis.csv`

Every discarded saying with its discard reason:

```
raw_text, meta_template, discard_stage, discard_reason
"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
"The fire's...", ironic_deficiency, quality_filter, "too_short"
```

This is valuable for debugging the template engine — if a specific surface-template variant has a >50% discard rate, the template itself needs fixing.

## File Organization

```
folksy-generator/
├── corpus/
│   ├── corpus_raw.jsonl
│   ├── corpus_polished.jsonl
│   ├── corpus_filtered.jsonl
│   ├── training_pairs.jsonl
│   ├── corpus_stats.json
│   └── discard_analysis.csv
├── scripts/
│   ├── generate_raw_batch.sh      # Runs generator across all templates
│   ├── polish_corpus.py           # LLM polish pipeline
│   ├── filter_corpus.py           # Quality filtering
│   ├── format_training_pairs.py   # Training pair generation
│   └── compute_corpus_stats.py    # Metrics and validation
```

## Execution Timeline

Assuming ~1 second per LLM call on the local 4090:

| Step | Items | Est. Time |
|------|-------|-----------|
| Raw generation (template engine only) | 10,500 | ~2 minutes |
| LLM polish | 10,500 | ~3 hours |
| Quality filtering | ~7,500 | ~1 minute |
| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |

Total: ~3.5 hours of mostly-unattended LLM grinding. The polish step is the bottleneck and fully resumable via checkpointing.

## Integration Notes

### Feeding into Fine-Tuning

The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (HuggingFace Trainer, axolotl, etc.). The 0.5B model training is out of scope for this spec, but the corpus format is designed for it.

### Iterative Improvement

This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix is:

1. Generate more raw sayings targeting the weak area
2. Polish and filter
3. Append to the training corpus
4. Re-train

The JSONL format and checkpoint system support this append workflow natively.