corpus generation (work from mid february)

John McCardle 2026-03-09 19:52:09 -04:00
commit 356b62c6ea
16 changed files with 25872 additions and 38 deletions

1
.gitignore vendored Normal file

@@ -0,0 +1 @@
*__pycache__

431
CORPUS_GENERATION_SPEC.md Normal file

@@ -0,0 +1,431 @@
# Corpus Generation Spec — LLM-Polished Training Data
## Overview
The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B parameter) task-specific model.
The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.
## Infrastructure
```python
import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Chat completion endpoint of local LLM"""
    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()
```
Same local endpoint as the graph enhancement phase. No cloud APIs.
## Phase 1: Bulk Raw Generation
### Goal
Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.
### Generation Strategy
Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:
```bash
# Even coverage across all 7 meta-template families
for template in deconstruction denial_of_consequences ironic_deficiency \
                futile_preparation hypocritical_complaint tautological_wisdom \
                false_equivalence; do
    python folksy_generator.py --template "$template" --count 1500 --debug \
        --output "raw_${template}.jsonl"
done
```
### Output Format
The `--debug` flag is critical. Raw output should be JSONL with the relationship chain preserved:
```json
{
  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
  "meta_template": "deconstruction",
  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
  "chain": [
    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
  ]
}
```
This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions. The final training data needs the meta-template label.
### Deduplication at Generation Time
Before writing each generated saying, check:
- Exact duplicate raw_text → skip
- Same (meta_template, slots) tuple → skip (same slot fills, different surface template is fine)
- Seed word has already been used 30 times across the batch → skip (prevents dog/bark saturation)
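The three checks above can be sketched as a small stateful helper. This is a minimal illustration, not the generator's actual code: the class name, the `accept()` method, and the convention that the caller passes in the seed word are all assumptions made for the example.

```python
from collections import Counter

class GenerationDeduper:
    """Tracks what has already been accepted during one generation batch (hypothetical helper)."""

    def __init__(self, max_seed_uses=30):
        self.max_seed_uses = max_seed_uses
        self.seen_text = set()       # exact raw_text strings
        self.seen_slots = set()      # (meta_template, slot fills) tuples
        self.seed_usage = Counter()  # accepted sayings per seed word

    def accept(self, entry, seed_word):
        """Return True and record the entry if it passes all three checks."""
        slot_key = (entry["meta_template"], tuple(sorted(entry["slots"].items())))
        if entry["raw_text"] in self.seen_text:
            return False  # exact duplicate
        if slot_key in self.seen_slots:
            return False  # same slot fills within the same meta-template
        if self.seed_usage[seed_word] >= self.max_seed_uses:
            return False  # seed word is saturated
        self.seen_text.add(entry["raw_text"])
        self.seen_slots.add(slot_key)
        self.seed_usage[seed_word] += 1
        return True
```

Because the slot key ignores the surface template, this sketch never accepts the same fills twice within a family; if a second surface template for identical fills should be allowed through, the surface template would need to be part of the key.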
## Phase 2: LLM Polish
### Goal
Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.
### System Prompt
```
You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.
Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them — do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD
Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.
Examples of good polish:
Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.
Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The bookshelf-maker's kids always end up reading off the floor.
Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a peach that grew itself some armor.
Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say — steel without the iron is just a dream of being hard.
Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD
Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD
```
### User Prompt Template
```
Meta-template: {meta_template}
Relationship chain: {chain_formatted}
Slot fills: {slots_formatted}
Raw saying: {raw_text}
```
### Chain Formatting
Format the chain as a readable string:
```
bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
```
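One way to produce that string, and to fill the user prompt template, is sketched below. The function bodies are assumptions: the spec names `format_polish_prompt` but does not define it, and the `key=value` rendering of slot fills is a guess at what `{slots_formatted}` should look like.

```python
def format_chain(chain):
    """Render edge dicts as 'start --Relation--> end (w:weight)', comma-separated."""
    return ", ".join(
        f"{e['start']} --{e['relation']}--> {e['end']} (w:{e['weight']})"
        for e in chain
    )

def format_polish_prompt(entry):
    """Fill the user prompt template from one raw JSONL entry."""
    # Assumed slot rendering: A=bread, B=yeast, ...
    slots = ", ".join(f"{k}={v}" for k, v in entry["slots"].items())
    return (
        f"Meta-template: {entry['meta_template']}\n"
        f"Relationship chain: {format_chain(entry['chain'])}\n"
        f"Slot fills: {slots}\n"
        f"Raw saying: {entry['raw_text']}"
    )
```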
### Batch Processing
```python
import json
import time

def polish_batch(input_path, output_path):
    system_prompt = load_system_prompt()  # The prompt above
    with open(input_path) as f:
        raw_entries = [json.loads(line) for line in f]
    results = []
    discards = 0
    for i, entry in enumerate(raw_entries):
        user_prompt = format_polish_prompt(entry)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
        response = llm_chat_completion(messages)
        polished = response['choices'][0]['message']['content'].strip()
        if polished == "DISCARD":
            discards += 1
            entry['status'] = 'discarded'
        else:
            entry['polished_text'] = polished
            entry['status'] = 'polished'
        results.append(entry)
        if (i + 1) % 100 == 0:
            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
            # Write checkpoint
            save_checkpoint(results, output_path)
        time.sleep(0.1)  # gentle rate limiting
    save_final(results, output_path)
    print(f"Done: {len(results) - discards} polished, {discards} discarded")
```
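The loop calls `save_checkpoint` but the spec does not define it. A minimal sketch of one possible implementation follows, together with a hypothetical `load_checkpoint` counterpart for resuming; the temp-file-then-rename approach is an assumption chosen so that an interrupted write cannot leave a half-written JSONL file.

```python
import json
import os

def save_checkpoint(results, output_path):
    """Write all results so far to a temp file, then swap it into place."""
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for entry in results:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    os.replace(tmp_path, output_path)  # atomic rename on the same filesystem

def load_checkpoint(output_path):
    """Return previously processed entries, or an empty list if none exist."""
    if not os.path.exists(output_path):
        return []
    with open(output_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

To resume, `polish_batch` could start with `results = load_checkpoint(output_path)` and iterate over `raw_entries[len(results):]`. Since checkpoints are written every 100 entries, at most 99 LLM calls would be repeated after a crash.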
### Expected Discard Rate
Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw to end up with 7,000-8,000 polished. If the discard rate after graph enhancement is lower (it should be — better edges = fewer nonsense combos), that's a bonus.
## Phase 3: Deduplication and Quality Filtering
After LLM polish, run automated quality checks before including in the training corpus.
### Automated Filters
```python
def quality_filter(entry):
    text = entry['polished_text']
    # Length check: real proverbs are short
    if len(text.split()) > 25:
        return False, "too_long"
    if len(text.split()) < 5:
        return False, "too_short"
    # Must contain at least 2 of the original slot-fill nouns
    slot_words = set(entry['slots'].values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"
    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if '_' in text:
        return False, "conceptnet_artifact"
    # No broken templates (unfilled slots)
    if '{' in text or '}' in text:
        return False, "unfilled_slot"
    return True, "pass"
```
### Near-Duplicate Detection
Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect and keep only one:
```python
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
```
Run pairwise within each meta-template family (not across families; similar nouns in different structures are fine).
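A sketch of the per-family pass, assuming entries carry the `meta_template` and `polished_text` fields shown earlier. The function name `drop_near_duplicates` and the keep-the-first-seen policy are assumptions; the spec only says to keep one of each near-duplicate group.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold

def drop_near_duplicates(entries, threshold=0.75):
    """Keep the first saying of each near-duplicate group, comparing only within a family."""
    kept_by_family = defaultdict(list)
    kept = []
    for entry in entries:
        family = kept_by_family[entry["meta_template"]]
        text = entry["polished_text"]
        if any(is_near_duplicate(text, k["polished_text"], threshold) for k in family):
            continue  # too similar to something already kept in this family
        family.append(entry)
        kept.append(entry)
    return kept
```

The comparison is quadratic within each family. At roughly a thousand sayings per family that is on the order of half a million `SequenceMatcher` calls, which should be tolerable for a one-off batch job but is worth measuring.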
## Phase 4: Training Corpus Formatting
### Goal
Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.
### Training Pair Schema
Each polished saying generates multiple training pairs with different input framings:
```json
[
  {
    "input": "Tell me something about bread",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Tell me a saying about baking",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "What would a farmer say about flour?",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Give me a deconstruction proverb",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  }
]
```
### Input Framing Types
For each polished saying, generate training pairs with these input patterns:
1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`
Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.
### Fictional Entity Training Pairs
Additionally, generate training pairs that demonstrate fictional entity handling:
```json
{
  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
}
```
For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with fictional entities loaded, then polish with GLM4-32B. Target: ~200-300 fictional entity training pairs to teach the pattern without overwhelming the real-word training signal.
### Format for Fictional Entity Input
Standardize how entity descriptions appear in training inputs:
```
A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
```
Example:
```
A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
```
This format matches what a game developer or worldbuilder would naturally provide at inference time.
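A small rendering helper for this format could look like the sketch below. The function name and its argument shapes are hypothetical: the entity schema in `schemas/fictional_entities.schema.json` is not shown here, so the helper takes already-extracted strings rather than a schema object.

```python
def _article(word):
    """Crude a/an choice by first letter (assumption; ignores sound-based exceptions)."""
    return "an" if word[0].lower() in "aeiou" else "a"

def describe_entity(name, categories_joined, property_sentences, relationship_sentences):
    """Render an entity description in the standardized input format."""
    opening = f"{_article(name).capitalize()} {name} is {_article(categories_joined)} {categories_joined}."
    # Normalize each sentence to end with exactly one period.
    rest = [s.strip().rstrip(".") + "." for s in property_sentences + relationship_sentences]
    return " ".join([opening] + rest)
```

Normalizing the trailing period avoids the doubled full stops that a literal fill of `{property_sentences}.` would produce when the sentences already end with one.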
## Phase 5: Corpus Statistics and Validation
### Required Metrics
Before declaring the corpus ready for fine-tuning, compute and report:
```
Total polished sayings: X
Discarded during polish: X (Y%)
Discarded during quality filter: X (Y%)
Final training pairs: X

Distribution by meta-template:
  deconstruction: X (Y%)
  denial_of_consequences: X (Y%)
  ironic_deficiency: X (Y%)
  futile_preparation: X (Y%)
  hypocritical_complaint: X (Y%)
  tautological_wisdom: X (Y%)
  false_equivalence: X (Y%)

Distribution by input framing type:
  word_seeded: X
  category_seeded: X
  persona_seeded: X
  template_seeded: X
  open_ended: X
  fictional: X

Unique slot words used: X (out of 534 vocab)
Words never used in any saying: [list]
Average saying length: X words
```
### Balance Check
If any meta-template family has less than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.
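The balance check is simple to automate. The sketch below assumes each training pair carries a `meta_template` field and measures each family's share against all pairs, including any (such as fictional-entity pairs) that lack the field; the function name is hypothetical.

```python
from collections import Counter

FAMILIES = [
    "deconstruction", "denial_of_consequences", "ironic_deficiency",
    "futile_preparation", "hypocritical_complaint", "tautological_wisdom",
    "false_equivalence",
]

def underrepresented_families(pairs, families=FAMILIES, min_share=0.10):
    """Return {family: share} for every family below min_share of all pairs."""
    total = len(pairs)
    if total == 0:
        return {fam: 0.0 for fam in families}
    counts = Counter(p.get("meta_template") for p in pairs)
    return {
        fam: counts[fam] / total
        for fam in families
        if counts[fam] / total < min_share
    }
```

Iterating over the fixed family list rather than the observed labels means a family with zero pairs is reported too, instead of silently dropping out of the result.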
### Human Spot-Check
Randomly sample 50 polished sayings (spread across all families) and manually rate each as:
- **Good:** Sounds natural, funny, could fool someone into thinking it's real
- **Okay:** Grammatically correct but flat or too literal
- **Bad:** Awkward, nonsensical, or lost the relationship
Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten quality filters.
## Output Files
### `corpus_raw.jsonl`
All raw generated sayings with debug metadata. One JSON object per line.
### `corpus_polished.jsonl`
All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.
### `corpus_filtered.jsonl`
Only sayings that passed quality filtering. One JSON object per line.
### `training_pairs.jsonl`
Final training corpus. One JSON object per line:
```json
{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
```
### `corpus_stats.json`
The metrics from Phase 5.
### `discard_analysis.csv`
Every discarded saying with its discard reason:
```
raw_text, meta_template, discard_stage, discard_reason
"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
"The fire's...", ironic_deficiency, quality_filter, "too_short"
```
This is valuable for debugging the template engine — if a specific template surface variant has a >50% discard rate, the template itself needs fixing.
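The per-variant discard rate can be computed from the polished entries, since each carries its `surface_template` and a `status`. This is a sketch under assumptions: the function name is hypothetical, the `min_count` cutoff is added here to avoid flagging variants with only a handful of samples, and only LLM-stage discards are counted because the spec does not say how quality-filter rejections are marked on the entry.

```python
from collections import defaultdict

def discard_rate_by_surface_template(entries, min_count=20):
    """Return {surface_template: discard rate} for variants with enough samples."""
    totals = defaultdict(int)
    discarded = defaultdict(int)
    for entry in entries:
        key = entry["surface_template"]
        totals[key] += 1
        if entry.get("status") == "discarded":
            discarded[key] += 1
    return {
        key: discarded[key] / totals[key]
        for key in totals
        if totals[key] >= min_count
    }
```

Variants needing attention are then `[t for t, rate in rates.items() if rate > 0.5]`.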
## File Organization
```
folksy-generator/
├── corpus/
│   ├── corpus_raw.jsonl
│   ├── corpus_polished.jsonl
│   ├── corpus_filtered.jsonl
│   ├── training_pairs.jsonl
│   ├── corpus_stats.json
│   └── discard_analysis.csv
├── scripts/
│   ├── generate_raw_batch.sh # Runs generator across all templates
│   ├── polish_corpus.py # LLM polish pipeline
│   ├── filter_corpus.py # Quality filtering
│   ├── format_training_pairs.py # Training pair generation
│   └── compute_corpus_stats.py # Metrics and validation
```
## Execution Timeline
Assuming ~1 second per LLM call on the local 4090:
| Step | Items | Est. Time |
|------|-------|-----------|
| Raw generation (template engine only) | 10,500 | ~2 minutes |
| LLM polish | 10,500 | ~3 hours |
| Quality filtering | ~7,500 | ~1 minute |
| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |
Total: ~3.5 hours of mostly-unattended LLM grinding. The polish step is the bottleneck and fully resumable via checkpointing.
## Integration Notes
### Feeding into Fine-Tuning
The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (HuggingFace Trainer, axolotl, etc.). The 0.5B model training is out of scope for this spec but the corpus format is designed for it.
### Iterative Improvement
This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix is:
1. Generate more raw sayings targeting the weak area
2. Polish and filter
3. Append to training corpus
4. Re-train
The JSONL format and checkpoint system support this append workflow natively.

303
EVALUATION.md Normal file

@@ -0,0 +1,303 @@
# Folksy Generator — Evaluation Report
**Date:** 2026-02-17
**Evaluator:** Claude (automated)
**Scope:** Post-integration health check after three LLM augmentation phases
---
## 1. Project Structure Overview
```
folksy-generator/
├── folksy_generator.py # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md # Corpus generation spec (next phase)
├── data/
│   ├── folksy_vocab.csv # Curated vocabulary (624 words, expanded from 534)
│   ├── folksy_vocab.csv.bak.* # Pre-expansion backup (534 words)
│   ├── folksy_relations.csv # Original ConceptNet edges (11,096 edges)
│   ├── folksy_relations_augmented.csv # LLM-generated edges (11,220 edges)
│   ├── classified_proverbs.csv # Labeled real proverbs for reference
│   ├── candidate_additions.csv # OOV words suggested by LLM (3,678 unique)
│   └── enhancement_log.csv # Processing log for all 3 phases
├── scripts/
│   ├── extract_from_conceptnet.py # One-time ConceptNet extraction (requires psql)
│   ├── extract_relations.py # Relation extraction helper
│   ├── classify_proverbs.py # Proverb classification
│   ├── expand_vocab.py # Phase: vocab expansion (+90 words)
│   ├── enhance_graph.py # Phase: LLM edge augmentation
│   ├── generate_raw_batch.sh # Bulk generation script
│   ├── polish_corpus.py # LLM polish pipeline
│   ├── filter_corpus.py # Quality filtering
│   ├── format_training_pairs.py # Training pair generation
│   └── compute_corpus_stats.py # Corpus statistics
├── examples/
│   ├── my_world.json # Fictional entity examples (5 entities)
│   └── sample_output.txt # Pre-integration sample output
├── schemas/
│   └── fictional_entities.schema.json
└── corpus/ # Empty — not yet populated
```
**Entry point:** `python3 folksy_generator.py` — no virtual environment, no dependencies beyond Python 3.11 stdlib.
---
## 2. What the Three LLM Integration Phases Produced
Git history shows a single initial commit (`8c8a058 Initial 'folksy idiom' generator`). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in data files.
### Phase 1: Per-Word Relationship Expansion
- **624 words** processed through GLM4-32B
- 10,726 edges generated, **1,155 accepted** (10.8% acceptance rate)
- 9,510 edges rejected as OOV (target words not in folksy vocab)
- 61 duplicates filtered
- Filled gaps in `AtLocation`, `UsedFor`, `HasA`, `MadeOf`, `PartOf`, `CapableOf`, `HasPrerequisite`, `Causes`, `HasProperty`
### Phase 2: Cross-Word Relationship Discovery (Bridge Words)
- **148 low-connectivity words** targeted
- 6,272 bridge edges accepted
- This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
### Phase 3: Property Enrichment
- **624 words** processed for distinctive HasProperty edges
- 3,849 edges generated, **3,788 accepted** (98.4% acceptance rate)
- 61 duplicates filtered
- Targeted at improving `false_equivalence` template output
### Vocab Expansion (via `expand_vocab.py`)
- Original vocabulary: **534 words**
- Current vocabulary: **624 words** (+90 words added)
- Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories
### Combined Data Summary
| Dataset | Count |
|---------|-------|
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| **Total edges (combined)** | **22,316** |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |
---
## 3. Term Database Statistics
### Vocabulary by Category (36 categories)
| Category | Words | | Category | Words |
|----------|-------|-|----------|-------|
| bird | 97 | | fish | 16 |
| animal | 65 | | spice | 16 |
| tool | 56 | | fruit | 15 |
| plant | 43 | | mineral | 14 |
| food | 38 | | insect | 14 |
| material | 36 | | structure | 13 |
| container | 34 | | beverage | 9 |
| instrument | 28 | | fabric | 9 |
| landscape | 27 | | tree | 8 |
| vegetable | 24 | | wood | 7 |
| building | 21 | | herb | 7 |
| metal | 19 | | rock | 6 |
| flower | 19 | | water | 6 |
| vehicle | 18 | | furniture | 5 |
| stone | 17 | | clothing | 5 |
| weapon | 17 | | shelter | 5 |
| — | — | | crop, seed, organism, grain | 3-4 each |
### Edge Distribution — Original ConceptNet
| Relation | Edges |
|----------|-------|
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |
### Edge Distribution — LLM Augmented
| Relation | Edges |
|----------|-------|
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |
The augmented edges deliberately fill the gaps in the original ConceptNet data. `HasProperty` went from 422 to 4,407 total — critical for the `false_equivalence` template.
---
## 4. Sample Generated Output (30 Sayings)
Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:
1. An scarf ain't nothing but cotton that met some wool.
2. The only difference between a hummingbird and a dodo is metabolism.
3. An salt ain't nothing but ore that met some crystals.
4. Funny how the earthworm never has enough food for itself.
5. What's a coop but a kitchen with sound?
6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
7. Don't take the wheel and then gripe about the hull.
8. A bamboo don't come without its water, now does it?
9. Nobody's got less salsa than the man who makes the mango.
10. That's like eating the sea and complaining the savanna tastes off.
11. My daddy always said, can't have waking up in morning without coffee.
12. Take the bison out of meat and all you've got left is salty taste flesh.
13. Like baiting the flock and hoping for keep as pet.
14. The ice's family always goes without cool body.
15. There's a fella who takes the wax and says the sugar's no good.
16. That's just holding the drawer and praying for store blanket.
17. You know what they say, a mica with no schist is just a rough surface rock.
18. An silver ain't nothing but hairbrushes that met some alloy.
19. A kite is just a pelican that's got catch wind.
20. Like making the denim and hoping for material.
21. The nut feeds everyone's fit bolt but its own.
22. The pitcher's family always goes without throw fast ball.
23. A nail is just a weapon that's got smooth length.
24. You want lid? Well, first you're gonna need container.
25. Don't build the micrometer and say you ain't got workshop.
26. Ain't no sleeping at night ever came from nothing — you need bed.
27. What's a cicada but a lacebug with nocturnal behavior?
28. Don't drink the dish and then gripe about the gnocchi.
29. You can't put out a herring and then wonder where all the herringbone came from.
30. That's just lorikeeting the fruit and praying for breaking wind.
---
## 5. Quality Assessment
### Rating Summary
I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):
| Rating | Count | % | Description |
|--------|-------|---|-------------|
| **Good** | 8 | 27% | Sounds natural, humorous, structurally solid |
| **Okay** | 9 | 30% | Semantically coherent but grammatically rough |
| **Bad** | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |
### Good Examples (natural-sounding, humorous)
- "Nobody's got less salsa than the man who makes the mango."
- "There's a fella who takes the wax and says the sugar's no good."
- "A bamboo don't come without its water, now does it?"
- "Don't take the wheel and then gripe about the hull."
- "Ain't no sleeping at night ever came from nothing — you need bed."
- "My daddy always said, can't have waking up in morning without coffee."
- "What's a cicada but a lacebug with nocturnal behavior?"
- "You can't put out a herring and then wonder where all the herringbone came from."
### Common Issues Identified
#### 1. Article / Grammar Errors (frequent)
- "An scarf ain't nothing but..." — should be "A scarf"
- "An silver ain't nothing but..." — should be "Silver"
- "An salt ain't nothing but..." — should be "Salt"
- "A have children don't come without..." — broken slot fill leaking action phrase as noun
#### 2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)
- "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
- "waking up in morning", "sleeping at night", "salty taste"
- "breaking wind", "rough surface"
- These are raw ConceptNet concept IDs that should have been filtered or reformatted
#### 3. Nonsensical Verb Conjugation in Futile Preparation (severe)
- "lorikeeting the fruit" — `lorikeet` treated as a verb
- "fooding the earthworm" — `food` treated as a verb
- "jeansing the denim" — `jeans` treated as a verb
- "safariing the lion" — `safari` treated as a verb
- The `_gerund()` function applies gerunding to ANY UsedFor target, including nouns
#### 4. LLM Enhancement Artifacts Leaking (moderate)
- "bridge word: plate" appearing in output text
- "bridge 2: **food**" appearing in output text
- "*bridge word: absorption*" appearing in output text
- These are raw LLM response fragments that weren't properly cleaned during Phase 2
#### 5. Semantic Mismatches (occasional)
- "A lynx is just a earthworm that's got feline." — wrong category siblings
- "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
- "A emu is just a ferret that's got walk backwards." — cross-class comparison
### Per-Template Quality Assessment
| Template | Typical Quality | Key Issue |
|----------|----------------|-----------|
| **deconstruction** | Okay | Multi-word properties leak; article errors with "An" |
| **denial_of_consequences** | Good | Best template; LLM artifacts occasionally leak through |
| **ironic_deficiency** | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| **futile_preparation** | Bad | Nouns gerunded as verbs; worst template overall |
| **hypocritical_complaint** | Okay | Some odd part-of relationships; generally coherent structure |
| **tautological_wisdom** | Good | Simple structure avoids most issues; multi-word phrases still leak |
| **false_equivalence** | Good | Benefited most from Phase 3 property enrichment |
---
## 6. Errors, Warnings, and Issues
### No Errors at Runtime
- Generator runs without crashes on all template types
- All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
- JSON output mode produces valid JSONL with complete metadata
- Fictional entity generation works
### Issues Found
| Severity | Issue | Impact |
|----------|-------|--------|
| **High** | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| **High** | Nouns gerunded as verbs in `futile_preparation` | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
| **Medium** | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| **Medium** | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
| **Low** | No test suite exists | No automated validation of output quality |
| **Low** | No virtual environment or requirements.txt | Only stdlib needed currently, but will need deps for corpus generation phase |
| **Info** | Corpus directory is empty | Expected — corpus generation is the next phase |
---
## 7. Readiness Assessment for Corpus Generation
### Ready
- Template engine is functional and produces output across all 7 meta-template families
- Augmented graph significantly improves vocabulary coverage (22,316 total edges)
- Vocab expansion added 90 words to cover previously sparse categories
- JSON output mode with full debug metadata is working — ready for bulk generation
- Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
- Fictional entity support is implemented and functional
- All corpus pipeline scripts exist (`generate_raw_batch.sh`, `polish_corpus.py`, `filter_corpus.py`, `format_training_pairs.py`, `compute_corpus_stats.py`)
### Should Fix Before Corpus Generation
1. **Clean Phase 2 artifacts from `folksy_relations_augmented.csv`** — grep for "bridge word" and "bridge 2" in surface_text/end_word fields and remove or repair those edges
2. **Fix `futile_preparation` gerunding** — the `_gerund()` function needs a check that the UsedFor target is actually a verb before conjugating it; alternatively, filter UsedFor targets to verb-like words only
3. **Filter multi-word ConceptNet phrases** — the `_short_concepts()` helper caps at 3 words but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
4. **Fix article logic** — the `_a()` function at line 680-684 only checks the first character; "An salt" is wrong because "salt" starts with "s"
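For the first item, a detector along the following lines could flag contaminated edge fields before they reach the generator. It is a heuristic sketch based only on the three artifact strings quoted in this report; the function name is hypothetical, and treating any asterisk as suspicious is an assumption that leftover markdown emphasis never belongs in a clean concept name.

```python
import re

# Matches "bridge word:" and "bridge 2:" style fragments, case-insensitively.
_BRIDGE_RE = re.compile(r"bridge\s+(?:word|\d+)\s*:", re.IGNORECASE)

def looks_like_llm_artifact(text):
    """Return True if a field appears to contain a raw Phase 2 response fragment."""
    return bool(_BRIDGE_RE.search(text)) or "*" in text
```

Running this over the relevant fields of each row in `folksy_relations_augmented.csv` yields a list of edges to drop or repair by hand.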
### Nice to Have
- Add a basic test suite (even just smoke tests that confirm each template generates output)
- Create `requirements.txt` (currently stdlib-only, but corpus phase will need `requests` at minimum)
- Review the 3,678 candidate OOV words — none exceeded frequency threshold of 3+ for auto-addition, but manual review could find useful additions
### Overall Verdict
**The template generator works but produces rough output.** This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be **40-50%** due to the issues above.
Fixing the four "Should Fix" items before corpus generation would:
- Reduce the discard rate (saving LLM compute time)
- Improve the quality floor of raw output (giving the polish LLM better material to work with)
- Eliminate artifact contamination that could propagate into training data
The generator is **functional but not polished** — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.

318
GRAPH_ENHANCEMENT_SPEC.md Normal file

@@ -0,0 +1,318 @@
# Graph Enhancement Spec — LLM-Augmented Folksy Subgraph
## Overview
The folksy subgraph extracted from ConceptNet (534 words, 11,096 edges) has coverage gaps. Many common folksy words have sparse or heavily skewed edge distributions — "dog" maps almost exclusively to "bark," "horse" collapses to "ride," etc. This produces repetitive output when the generator seeds on these words.
This phase uses the local GLM4-32B model to generate supplementary relationship edges for every word in the folksy vocabulary, expanding the graph's density and diversity while maintaining the typed-edge structure the template engine requires.
## Infrastructure
```python
import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Chat completion endpoint of local LLM"""
    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()
```
All LLM calls go through this endpoint. No cloud APIs. The model runs locally on the RTX 4090.
## Strategy
For each word in `folksy_vocab.csv`, ask the LLM to generate relationships that ConceptNet is missing or underrepresenting. The LLM output gets parsed into the same edge format as `folksy_relations.csv` and merged into the generator's working dataset.
This is NOT free-form generation. The LLM is constrained to output structured relationship tuples that conform to the existing relation type taxonomy. Think of it as using the LLM as a commonsense knowledge base that supplements ConceptNet rather than replacing it.
## Phase 1: Per-Word Relationship Expansion
### Input
Every word in `folksy_vocab.csv`, plus its existing edges from `folksy_relations.csv`.
### Process
For each word, send a prompt that:
1. Provides the word and its categories
2. Lists its EXISTING relationships (so the LLM doesn't duplicate them)
3. Asks for ADDITIONAL relationships across specific relation types
4. Constrains output to a parseable structured format
### System Prompt
```
You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
Rules:
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
- Every relationship must be something a typical adult would agree is true
- Do not repeat any relationship already listed as "known"
- Target words should be common English words (top 3000 frequency preferred)
- Output ONLY the structured format shown below, one relationship per line
- If you cannot think of good relationships for a given type, output NONE for that type
- Aim for 3-5 relationships per type where possible
Output format (one per line):
RELATION_TYPE: target_word | short natural phrasing
Example output:
AtLocation: barn | you find a horse in a barn
UsedFor: riding | a horse is used for riding
HasA: mane | a horse has a mane
CapableOf: gallop | a horse can gallop
MadeOf: NONE
PartOf: herd | a horse is part of a herd
```
### User Prompt Template
```
Word: {word}
Categories: {categories}
Known relationships:
{existing_edges_formatted}
Generate additional relationships for these types:
- AtLocation (where is it found?)
- UsedFor (what is it used for?)
- HasA (what does it have / contain?)
- PartOf (what is it part of?)
- CapableOf (what can it do?)
- MadeOf (what is it made of?)
- HasPrerequisite (what do you need before you can have/use it?)
- Causes (what does it cause or lead to?)
- HasProperty (what adjectives describe it? — limit to physical/sensory properties)
```
### Formatting Existing Edges
For the "Known relationships" section, format existing edges as:
```
AtLocation: pond (weight 1.0), lake (weight 4.47)
CapableOf: swim (weight 2.0), fly (weight 1.0)
UsedFor: (none in database)
```
This shows the LLM what's already covered AND highlights which relation types are empty and most need filling.
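The formatting described above could be produced by a small helper. The following is a minimal sketch rather than the project's actual code: `format_existing_edges` and `RELATION_TYPES` are hypothetical names, and existing edges are assumed to be dicts with `relation`, `end_word`, and `weight` keys.

```python
# Hypothetical helper: the names and the edge-dict layout are assumptions.
RELATION_TYPES = [
    "AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
    "MadeOf", "HasPrerequisite", "Causes", "HasProperty",
]


def format_existing_edges(edges, relation_types=RELATION_TYPES):
    """Render known edges as one line per relation type for the prompt."""
    by_relation = {rel: [] for rel in relation_types}
    for edge in edges:
        if edge["relation"] in by_relation:
            by_relation[edge["relation"]].append(edge)

    lines = []
    for rel in relation_types:
        targets = by_relation[rel]
        if targets:
            rendered = ", ".join(
                f"{e['end_word']} (weight {e['weight']})" for e in targets
            )
        else:
            # Empty types are listed explicitly so the LLM sees the gap.
            rendered = "(none in database)"
        lines.append(f"{rel}: {rendered}")
    return "\n".join(lines)
```

Listing every relation type, including the empty ones, keeps the prompt layout identical from word to word.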
### Parsing LLM Output
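```python
import re

# Relation types accepted from the LLM; assumed to match the types
# listed in the user prompt template above.
VALID_RELATIONS = {
    'AtLocation', 'UsedFor', 'HasA', 'PartOf', 'CapableOf',
    'MadeOf', 'HasPrerequisite', 'Causes', 'HasProperty',
}


def parse_llm_relations(response_text, source_word):
    """Parse structured LLM output into edge tuples."""
    edges = []
    for line in response_text.strip().split('\n'):
        line = line.strip()
        # Skip blank lines and "TYPE: NONE" placeholders
        if not line or 'NONE' in line:
            continue
        # Expected shape: RELATION_TYPE: target_word | short natural phrasing
        match = re.match(r'^(\w+):\s*(\w+)\s*\|\s*(.+)$', line)
        if match:
            relation, target, surface = match.groups()
            # Validate relation type
            if relation in VALID_RELATIONS:
                edges.append({
                    'start_word': source_word,
                    'end_word': target.strip().lower(),
                    'relation': relation,
                    'weight': 0.8,  # LLM-generated edges get a default weight below ConceptNet minimum
                    'surface_text': surface.strip(),
                    'source': 'llm_augmented'
                })
    return edges
```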
### Weight Assignment
LLM-generated edges get a default weight of **0.8** — deliberately below the ConceptNet minimum threshold of 1.0. This means:
- They fill gaps and add diversity
- They lose ties to ConceptNet edges (real data preferred when both exist)
- They can be filtered out easily if needed (`weight >= 1.0` restores pure ConceptNet)
- The generator can optionally boost or penalize LLM edges via a CLI flag
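The consequences listed above follow directly from the 0.8 default. A minimal sketch, assuming edges are dicts with a `weight` key (the function names here are hypothetical and not part of the generator):

```python
CONCEPTNET_MIN_WEIGHT = 1.0  # minimum ConceptNet weight stated in this spec


def pure_conceptnet(edges):
    """Drop LLM-generated edges using the weight threshold alone."""
    return [e for e in edges if e["weight"] >= CONCEPTNET_MIN_WEIGHT]


def strongest_edge(edges):
    """Pick the highest-weight edge.

    Because LLM edges sit at 0.8 and ConceptNet edges at 1.0 or above,
    a ConceptNet edge outranks an LLM edge whenever both are present.
    """
    return max(edges, key=lambda e: e["weight"])
```

This only holds while LLM weights stay below 1.0, so a runtime boost that lifts them to 1.0 would make the weight filter ambiguous; the `source` column remains the safer filter in that case.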
### Deduplication
Before merging, check each LLM-generated edge against existing edges:
- If (start_word, end_word, relation) already exists → skip
- If end_word is not in folksy_vocab → add to a `candidate_additions.csv` for review, but do NOT auto-add to vocab (avoids graph bloat)
- If end_word IS in folksy_vocab → add edge to `folksy_relations_augmented.csv`
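The three rules above can be sketched as a single routing function. This is an illustration under assumptions, not the project's `deduplicate` implementation: `route_edges` is a hypothetical name, `existing_triples` is assumed to be a set of `(start_word, end_word, relation)` tuples, and `vocab` a set of words.

```python
def route_edges(new_edges, existing_triples, vocab):
    """Split parsed LLM edges into accepted, duplicate, and out-of-vocab lists."""
    accepted, duplicates, oov = [], [], []
    seen = set(existing_triples)
    for edge in new_edges:
        triple = (edge["start_word"], edge["end_word"], edge["relation"])
        if triple in seen:
            duplicates.append(edge)   # already known: skip
        elif edge["end_word"] not in vocab:
            oov.append(edge)          # candidate_additions.csv, manual review
        else:
            accepted.append(edge)     # folksy_relations_augmented.csv
            seen.add(triple)          # also blocks repeats within this batch
    return accepted, duplicates, oov
```

The lengths of the three lists line up with the `edges_accepted`, `edges_duplicate`, and `edges_oov` columns of the enhancement log.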
## Phase 2: Cross-Word Relationship Discovery
After per-word expansion, run a second pass that specifically targets 2-hop paths. The goal is to find bridge words that connect otherwise-isolated clusters.
### Process
1. Identify word pairs that are in the same category but have no path of length ≤ 2 between them
2. For a sample of these pairs, ask the LLM what connects them
### Prompt for Bridge Discovery
System prompt:
```
You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
Rules:
- The bridge word must be a common, concrete noun
- State the relationship type for each connection
- Output format: BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
Example:
Words: "cow" and "butter"
BRIDGE: milk | CapableOf from cow: a cow produces milk | MadeOf for butter: butter is made of milk | milk connects production to product
```
User prompt:
```
Words: "{word_a}" and "{word_b}"
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
Find 1-3 bridge words that connect them.
```
### Candidate Selection
Don't run this for all pairs — that's O(n²) on 534 words. Instead:
1. Build the current 2-hop reachability matrix
2. Identify words with LOW 2-hop reachability (few or no 2-hop paths to other folksy words)
3. For each low-connectivity word, pick 5-10 random same-category words it can't reach
4. Run bridge discovery on those pairs
5. Target: ensure every word in the vocab has at least 3 distinct 2-hop paths to other vocab words
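Steps 1 and 2 can be approximated without a full matrix by counting 2-hop paths per word. The sketch below is illustrative only: the function names are hypothetical, edges are assumed to be `(start, end)` pairs, and treating them as undirected for reachability is an assumption the spec does not state.

```python
from collections import defaultdict


def two_hop_counts(edges, vocab):
    """Count distinct 2-hop paths from each vocab word to another vocab word."""
    neighbors = defaultdict(set)
    for start, end in edges:
        # Assumption: direction is ignored when measuring reachability.
        neighbors[start].add(end)
        neighbors[end].add(start)

    counts = {}
    for word in vocab:
        paths = set()
        for bridge in neighbors[word]:
            for target in neighbors[bridge]:
                if target != word and target in vocab:
                    paths.add((bridge, target))
        counts[word] = len(paths)
    return counts


def low_connectivity(counts, min_paths=3):
    """Words that fall short of the target number of 2-hop paths."""
    return sorted(w for w, n in counts.items() if n < min_paths)
```

Words returned by `low_connectivity` are the ones worth pairing with same-category peers for bridge discovery.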
## Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates
The `false_equivalence` meta-template needs HasProperty edges, which are sparse in ConceptNet for concrete nouns. Run a targeted property-extraction pass.
### Prompt
System prompt:
```
You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
Rules:
- Only physical/sensory/behavioral properties, not abstract qualities
- Properties should DISTINGUISH this thing from similar things in its category
- Output one property per line as: PROPERTY | brief explanation
- Aim for 5-8 properties
```
User prompt:
```
Word: {word}
Category: {categories}
Other words in same category: {same_category_sample}
What properties distinguish {word} from the others listed?
```
Including same-category peers in the prompt encourages the LLM to generate *differentiating* properties rather than generic ones. "Has legs" is useless for a horse because every animal has legs. "Has a mane" differentiates it.
### Output Format
```
fast | horses are known for running fast
tall | horses are tall compared to most farm animals
mane | horses have a distinctive mane
shod | horses wear horseshoes
```
These get stored as HasProperty edges in the augmented relations file.
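Turning the `PROPERTY | brief explanation` lines into edges might look like the following. This is a sketch under assumptions: `parse_property_lines` is a hypothetical name, and the edge fields and 0.8 weight are borrowed from the per-word expansion parser described earlier in this spec.

```python
def parse_property_lines(response_text, source_word, weight=0.8):
    """Convert 'PROPERTY | explanation' lines into HasProperty edge dicts."""
    edges = []
    for line in response_text.strip().splitlines():
        if "|" not in line:
            continue  # ignore anything that is not in the expected format
        prop, explanation = (part.strip() for part in line.split("|", 1))
        if not prop:
            continue
        edges.append({
            "start_word": source_word,
            "end_word": prop.lower(),
            "relation": "HasProperty",
            "weight": weight,
            "surface_text": explanation,
            "source": "llm_augmented",
        })
    return edges
```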
## Output Files
### `folksy_relations_augmented.csv`
Same schema as `folksy_relations.csv` with additional columns:
```
start_word, end_word, relation, weight, surface_text, source
corn, chicken, UsedFor, 1.0, "Corn is used for feeding chickens", conceptnet
dog, porch, AtLocation, 0.8, "you find a dog on a porch", llm_augmented
horse, mane, HasA, 0.8, "a horse has a mane", llm_augmented
```
The `source` column allows filtering: `source=conceptnet` for pure ConceptNet, `source=llm_augmented` for LLM additions, or both for the full enhanced graph.
### `candidate_additions.csv`
Words that appeared in LLM output but aren't in the current folksy vocab:
```
word, suggested_by, relation_context, frequency
mane, horse, "HasA: a horse has a mane", 2
bridle, horse, "HasA: a horse has a bridle", 1
```
The `frequency` column counts how many different source words suggested this target. High-frequency candidates are strong additions to the folksy vocab. Review manually or with a threshold (e.g., suggested by 3+ different words → auto-add).
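The frequency threshold could be computed as below. A minimal sketch, assuming the out-of-vocab edges are dicts with `start_word` and `end_word` keys; both function names are hypothetical.

```python
from collections import defaultdict


def count_candidates(oov_edges):
    """Map each out-of-vocab target to the distinct source words suggesting it."""
    suggested_by = defaultdict(set)
    for edge in oov_edges:
        suggested_by[edge["end_word"]].add(edge["start_word"])
    return suggested_by


def auto_add_candidates(suggested_by, threshold=3):
    """Targets suggested by at least `threshold` different source words."""
    return sorted(
        word for word, sources in suggested_by.items()
        if len(sources) >= threshold
    )
```

Using a set per target means a word suggested several times by the same source still counts once, which matches the "different source words" definition of `frequency`.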
### `enhancement_log.csv`
Track what was processed and what the LLM produced:
```
source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov
dog, 2025-02-15T10:30:00, 24, 18, 3, 3
horse, 2025-02-15T10:30:45, 31, 22, 5, 4
```
## Execution Plan
### Batch Processing
534 words × ~1 second per LLM call = ~9 minutes for Phase 1. Very manageable.
```python
import csv
import time


def process_all_words(vocab_path, relations_path, output_path):
    vocab = load_vocab(vocab_path)
    relations = load_relations(relations_path)
    all_new_edges = []
    for i, word_entry in enumerate(vocab):
        word = word_entry['word']
        categories = word_entry['categories']
        existing = get_edges_for_word(relations, word)
        messages = build_expansion_prompt(word, categories, existing)
        response = llm_chat_completion(messages)
        response_text = response['choices'][0]['message']['content']
        new_edges = parse_llm_relations(response_text, word)
        new_edges = deduplicate(new_edges, existing)
        all_new_edges.extend(new_edges)
        if (i + 1) % 50 == 0:
            print(f"Processed {i+1}/{len(vocab)} words, {len(all_new_edges)} new edges so far")
        time.sleep(0.1)  # gentle rate limiting
    save_augmented_relations(all_new_edges, output_path)
```
### Resumability
Write a checkpoint file after each word so the process can resume if interrupted. The enhancement_log.csv serves this purpose — skip any word that already has an entry.
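Using the log as a checkpoint might look like this. It is a sketch only: the function names are hypothetical, and the log is assumed to have a `source_word` header column as in the example above.

```python
import csv
from pathlib import Path


def load_processed_words(log_path):
    """Return the set of words that already have a row in the enhancement log."""
    path = Path(log_path)
    if not path.exists():
        return set()  # first run: nothing processed yet
    with open(path, newline="", encoding="utf-8") as f:
        # skipinitialspace tolerates "col, col" style headers and rows
        reader = csv.DictReader(f, skipinitialspace=True)
        return {row["source_word"] for row in reader}


def pending_words(vocab_words, log_path):
    """Words still to be sent to the LLM, in their original order."""
    done = load_processed_words(log_path)
    return [w for w in vocab_words if w not in done]
```

For this to be safe against interruption, the log row for a word should be written only after that word's edges have been saved.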
### Validation Pass
After all LLM edges are generated, run a quick validation:
1. No self-loops (start_word == end_word)
2. All relation types are in the valid set
3. No duplicate (start, end, relation) triples
4. Distribution check: flag any word that got 0 new edges (the LLM response may have failed to parse)
5. Spot-check 20 random LLM edges manually for sanity
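The checks above could be bundled into one report. The following is a sketch under assumptions, not an existing script: `validate_edges` is a hypothetical name, edges are assumed to be dicts with `start_word`, `end_word`, and `relation` keys, and `processed_words` is the list of words that were sent to the LLM.

```python
import random
from collections import Counter


def validate_edges(edges, valid_relations, processed_words, sample_size=20):
    """Run the post-generation checks and return the findings as a dict."""
    triples = [(e["start_word"], e["end_word"], e["relation"]) for e in edges]
    triple_counts = Counter(triples)
    words_with_edges = {e["start_word"] for e in edges}
    return {
        "self_loops": [t for t in triples if t[0] == t[1]],
        "invalid_relations": [t for t in triples if t[2] not in valid_relations],
        "duplicates": sorted(t for t, n in triple_counts.items() if n > 1),
        "zero_edge_words": sorted(set(processed_words) - words_with_edges),
        # Random sample for the manual sanity check
        "spot_check": random.sample(edges, min(sample_size, len(edges))),
    }
```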
## Integration with Generator
The generator's data loading should be updated to:
1. Load `folksy_relations.csv` (original ConceptNet edges)
2. If `folksy_relations_augmented.csv` exists, load and merge it
3. CLI flag: `--pure-conceptnet` to disable LLM-augmented edges
4. CLI flag: `--llm-weight-boost 0.2` to adjust LLM edge weights at runtime (default 0, meaning they keep their 0.8 weight)
This keeps the original ConceptNet data pristine and the augmentation fully reversible.
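The effect of `--llm-weight-boost` on a single edge weight can be illustrated in isolation. The cap at 1.0 follows the generator change included in this commit; the standalone function name `boosted_weight` is hypothetical and exists only for this example.

```python
def boosted_weight(weight, boost=0.0):
    """Boost only sub-1.0 (LLM-augmented) weights, capping the result at 1.0."""
    if weight < 1.0 and boost:
        return min(weight + boost, 1.0)
    return weight
```

With the default boost of 0, an LLM edge keeps its 0.8 weight; with a boost of 0.2 it reaches the 1.0 cap, and ConceptNet weights such as 4.47 are never altered.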

9511
data/candidate_additions.csv Normal file

File diff suppressed because it is too large Load diff

1397
data/enhancement_log.csv Normal file

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff


@ -533,3 +533,93 @@ oxpecker,bird,0.0,4,0
bowerbird,bird,0.0,3,0
condor,bird,0.0,3,0
gladiola,flower,0.0,3,0
metal,metal,0.80,0,0
soil,mineral,0.80,0,0
beak,animal,0.80,0,0
feather,"bird,material",0.80,0,0
plant,plant,0.80,0,0
forest,"landscape,tree",0.80,0,0
food,food,0.80,0,0
wing,bird,0.80,0,0
seed,"seed,plant",0.80,0,0
kitchen,"building,structure",0.80,0,0
handle,tool,0.80,0,0
tail,animal,0.80,0,0
leaf,plant,0.80,0,0
bone,"animal,material",0.80,0,0
flesh,"animal,food",0.80,0,0
flock,animal,0.80,0,0
field,"landscape,crop",0.80,0,0
fur,"animal,material",0.80,0,0
workshop,"building,structure",0.80,0,0
meat,"animal,food",0.80,0,0
fiber,"plant,material",0.80,0,0
farm,"structure,landscape",0.80,0,0
skin,"animal,material",0.80,0,0
leg,"animal,tool",0.80,0,0
flower,"flower,plant",0.80,0,0
ground,landscape,0.80,0,0
petal,"flower,plant",0.80,0,0
muscle,"organism,animal",0.80,0,0
shade,"landscape,plant",0.80,0,0
ocean,"water,landscape",0.80,0,0
medicine,"herb,plant",0.80,0,0
rubber,"material,fabric",0.80,0,0
mineral,"mineral,stone",0.80,0,0
toolbox,"tool,container",0.80,0,0
land,landscape,0.80,0,0
bird,"bird,animal",0.80,0,0
lid,"container,tool",0.80,0,0
bouquet,"flower,plant",0.80,0,0
ceramic,"material,container",0.80,0,0
lake,"water,landscape",0.80,0,0
fat,"animal,food",0.80,0,0
body,"organism,animal",0.80,0,0
house,"shelter,building",0.80,0,0
furniture,"furniture,structure",0.80,0,0
concrete,"material,stone",0.80,0,0
jewelry,material,0.80,0,0
fruit,fruit,0.80,0,0
fin,"animal,fish",0.80,0,0
container,container,0.80,0,0
branch,"plant,wood",0.80,0,0
earth,"landscape,mineral",0.80,0,0
fuel,material,0.80,0,0
ore,"mineral,metal",0.80,0,0
fireplace,"structure,tool",0.80,0,0
dust,material,0.80,0,0
door,"furniture,structure",0.80,0,0
window,structure,0.80,0,0
mouth,"animal,insect",0.80,0,0
string,material,0.80,0,0
fabric,fabric,0.80,0,0
sugar,"food,spice",0.80,0,0
trigger,"tool,weapon",0.80,0,0
key,tool,0.80,0,0
brick,"container,material,stone",0.80,0,0
stone,"rock,stone",0.80,0,0
mountain,"landscape,rock",0.80,0,0
juice,"beverage,food",0.80,0,0
cage,"structure,tool",0.80,0,0
head,"animal,insect",0.80,0,0
grain,grain,0.80,0,0
home,"building,shelter",0.80,0,0
crystal,"mineral,rock",0.80,0,0
engine,"tool,vehicle",0.80,0,0
hammer,"tool,weapon",0.80,0,0
aquarium,container,0.80,0,0
tooth,animal,0.80,0,0
river,"water,landscape",0.80,0,0
grassland,"landscape,plant",0.80,0,0
sea,"water,landscape",0.80,0,0
dessert,food,0.80,0,0
wheel,"tool,vehicle",0.80,0,0
needle,tool,0.80,0,0
jungle,"landscape,plant",0.80,0,0
blood,organism,0.80,0,0
oil,"beverage,mineral",0.80,0,0
mouthpiece,tool,0.80,0,0
claw,animal,0.80,0,0
spout,tool,0.80,0,0
savanna,"landscape,plant",0.80,0,0
desert,landscape,0.80,0,0

Columns (from line 1 of the file): word, categories, tangibility_score, conceptnet_edge_count, frequency_rank


@ -212,26 +212,45 @@ class Deconstruction(MetaTemplate):
# Find what A is made of / requires
ingredients = []
ingredient_rels = [] # track which relation found each ingredient
for rel in ("MadeOf", "HasPrerequisite", "HasA"):
ingredients.extend(_short_concepts(self.graph.neighbors(a, rel, min_weight=0.5)))
found = _short_concepts(self.graph.neighbors(a, rel, min_weight=0.5))
for item in found:
ingredients.append(item)
ingredient_rels.append(rel)
if len(ingredients) < 2:
for rel in ("MadeOf", "HasPrerequisite"):
for (start, w, s) in self.graph.reverse.get((a, rel), []):
if len(start.split("_")) <= 2:
ingredients.append((start, w, s))
ingredient_rels.append(rel)
if len(ingredients) < 2:
return None, None
random.shuffle(ingredients)
b_word = _readable(ingredients[0][0])
d_word = _readable(ingredients[1][0])
# Shuffle together
combined = list(zip(ingredients, ingredient_rels))
random.shuffle(combined)
ingredients, ingredient_rels = zip(*combined)
b_edge = ingredients[0]
b_word = _readable(b_edge[0])
b_rel = ingredient_rels[0]
d_edge = ingredients[1]
d_word = _readable(d_edge[0])
d_rel = ingredient_rels[1]
# Find a property for D
chain_edges = [
{"start": a, "relation": b_rel, "end": b_edge[0], "weight": b_edge[1], "surface_text": b_edge[2]},
{"start": a, "relation": d_rel, "end": d_edge[0], "weight": d_edge[1], "surface_text": d_edge[2]},
]
props = self.graph.neighbors(ingredients[1][0], "HasProperty")
if props:
c_word = _readable(random.choice(props)[0])
c_prop = random.choice(props)
c_word = _readable(c_prop[0])
chain_edges.append({"start": d_edge[0], "relation": "HasProperty", "end": c_prop[0], "weight": c_prop[1], "surface_text": c_prop[2]})
else:
c_word = random.choice(["plain", "sorry", "old", "humble", "dry", "wet", "cold"])
@ -242,6 +261,7 @@ class Deconstruction(MetaTemplate):
"template_family": self.id,
"template": template,
"chain": f"{a} MadeOf/Has [{b_word}, {d_word}]; {d_word} HasProperty {c_word}",
"chain_edges": chain_edges,
"slots": {"A": a, "B": b_word, "C": c_word, "D": d_word},
}
return saying, debug
@ -265,23 +285,31 @@ class DenialOfConsequences(MetaTemplate):
return None, None
# What is found at A? (reverse: B AtLocation A)
attracted = []
attracted = [] # (word, weight, surface_text, relation)
for (b, w, s) in self.graph.reverse.get((a, "AtLocation"), []):
attracted.append((b, w))
attracted.append((b, w, s, "AtLocation"))
# Also: what does A attract/cause?
for rel in ("Causes", "CausesDesire"):
for (b, w, s) in self.graph.edges.get((a, rel), []):
attracted.append((b, w))
attracted.append((b, w, s, rel))
if not attracted:
for (bridge, target, w1, w2) in self.graph.two_hop(a, "UsedFor", "AtLocation"):
attracted.append((target, w1 + w2))
attracted.append((target, w1 + w2, "", "AtLocation"))
if not attracted:
return None, None
b_word = _readable(random.choice(attracted)[0])
b_choice = random.choice(attracted)
b_word = _readable(b_choice[0])
chain_edges = [
{"start": b_choice[0] if b_choice[3] == "AtLocation" else a,
"relation": b_choice[3],
"end": a if b_choice[3] == "AtLocation" else b_choice[0],
"weight": b_choice[1], "surface_text": b_choice[2]},
]
create_verbs = {
"pond": "dig", "birdhouse": "hang", "fence": "build", "trap": "set",
@ -301,6 +329,7 @@ class DenialOfConsequences(MetaTemplate):
"template_family": self.id,
"template": template,
"chain": f"{b_word} AtLocation {a}; {a} created by {c_word}",
"chain_edges": chain_edges,
"slots": {"A": a, "B": b_word, "C": c_word},
}
return saying, debug
@ -324,14 +353,21 @@ class IronicDeficiency(MetaTemplate):
return None, None
products = []
product_rels = []
for rel in ("UsedFor", "CapableOf", "Causes"):
products.extend(self.graph.neighbors(a, rel, min_weight=0.5))
found = self.graph.neighbors(a, rel, min_weight=0.5)
for item in found:
products.append(item)
product_rels.append(rel)
products = _short_concepts(products)
if not products:
# Filter to short concepts while keeping rel tracking
filtered = [(p, r) for p, r in zip(products, product_rels) if len(p[0].split("_")) <= 3]
if not filtered:
return None, None
x_word = _readable(random.choice(products)[0])
choice_idx = random.randrange(len(filtered))
x_edge, x_rel = filtered[choice_idx]
x_word = _readable(x_edge[0])
family_members = ["wife", "children", "household", "family", "own kind"]
f_word = random.choice(family_members)
@ -339,10 +375,15 @@ class IronicDeficiency(MetaTemplate):
template = self._pick_template()
saying = template.format(A=a, X=x_word, F=f_word)
chain_edges = [
{"start": a, "relation": x_rel, "end": x_edge[0], "weight": x_edge[1], "surface_text": x_edge[2]},
]
debug = {
"template_family": self.id,
"template": template,
"chain": f"{a} UsedFor/Produces {x_word}; irony: {a} lacks {x_word}",
"chain_edges": chain_edges,
"slots": {"A": a, "X": x_word, "F": f_word},
}
return saying, debug
@ -371,7 +412,12 @@ class FutilePreparation(MetaTemplate):
if not uses:
return None, None
action_word = random.choice(uses)[0]
action_edge = random.choice(uses)
action_word = action_edge[0]
chain_edges = [
{"start": seed, "relation": "UsedFor", "end": action_edge[0], "weight": action_edge[1], "surface_text": action_edge[2]},
]
# Find a different outcome in a related domain via 2-hop
outcomes = []
@ -392,7 +438,8 @@ class FutilePreparation(MetaTemplate):
if not outcomes:
return None, None
y_word = random.choice(outcomes)[0]
y_choice = random.choice(outcomes)
y_word = y_choice[0]
gerund = _gerund(action_word)
verb = _readable(action_word)
@ -405,6 +452,7 @@ class FutilePreparation(MetaTemplate):
"template_family": self.id,
"template": template,
"chain": f"{seed} UsedFor {action_word}; different domain: {y_word}",
"chain_edges": chain_edges,
"slots": {"seed": seed, "action": action_word, "Y": y_word},
}
return saying, debug
@ -430,21 +478,37 @@ class HypocriticalComplaint(MetaTemplate):
# Find parts of Z
parts = []
part_rels = []
for rel in ("HasA", "PartOf", "MadeOf"):
parts.extend(_short_concepts(self.graph.neighbors(z, rel, min_weight=0.5)))
found = _short_concepts(self.graph.neighbors(z, rel, min_weight=0.5))
for item in found:
parts.append(item)
part_rels.append(rel)
for (start, w, s) in self.graph.reverse.get((z, "PartOf"), []):
if len(start.split("_")) <= 2:
parts.append((start, w, s))
part_rels.append("PartOf")
for (start, w, s) in self.graph.reverse.get((z, "HasA"), []):
if len(start.split("_")) <= 2:
parts.append((start, w, s))
part_rels.append("HasA")
if len(parts) < 2:
return None, None
random.shuffle(parts)
x_word = _readable(parts[0][0])
y_word = _readable(parts[1][0])
combined = list(zip(parts, part_rels))
random.shuffle(combined)
parts, part_rels = zip(*combined)
x_edge = parts[0]
x_word = _readable(x_edge[0])
y_edge = parts[1]
y_word = _readable(y_edge[0])
chain_edges = [
{"start": z, "relation": part_rels[0], "end": x_edge[0], "weight": x_edge[1], "surface_text": x_edge[2]},
{"start": z, "relation": part_rels[1], "end": y_edge[0], "weight": y_edge[1], "surface_text": y_edge[2]},
]
consume_verbs = ["eat", "drink", "take", "pick", "use up", "grab"]
verb = random.choice(consume_verbs)
@ -456,6 +520,7 @@ class HypocriticalComplaint(MetaTemplate):
"template_family": self.id,
"template": template,
"chain": f"{x_word} PartOf/HasA {z}; {y_word} PartOf/HasA {z}",
"chain_edges": chain_edges,
"slots": {"Z": z, "X": x_word, "Y": y_word, "verb": verb},
}
return saying, debug
@ -480,19 +545,25 @@ class TautologicalWisdom(MetaTemplate):
return None, None
# seed HasPrerequisite/Causes something
# Store (x_word, y_word, weight, edge_info) where edge_info captures the raw edge
chains = []
for (target, w, s) in self.graph.edges.get((seed, "HasPrerequisite"), []):
chains.append((_readable(target), seed, w)) # X=prereq, Y=seed
chains.append((_readable(target), seed, w,
{"start": seed, "relation": "HasPrerequisite", "end": target, "weight": w, "surface_text": s}))
for (target, w, s) in self.graph.edges.get((seed, "Causes"), []):
chains.append((seed, _readable(target), w)) # X=seed, Y=effect
chains.append((seed, _readable(target), w,
{"start": seed, "relation": "Causes", "end": target, "weight": w, "surface_text": s}))
# Also: what does seed require?
for (source, w, s) in self.graph.reverse.get((seed, "HasPrerequisite"), []):
chains.append((seed, _readable(source), w))
chains.append((seed, _readable(source), w,
{"start": source, "relation": "HasPrerequisite", "end": seed, "weight": w, "surface_text": s}))
if not chains:
return None, None
x_word, y_word, _ = random.choice(chains)
choice = random.choice(chains)
x_word, y_word = choice[0], choice[1]
chain_edge = choice[3]
template = self._pick_template()
saying = template.format(X=x_word, Y=y_word)
@ -501,6 +572,7 @@ class TautologicalWisdom(MetaTemplate):
"template_family": self.id,
"template": template,
"chain": f"{x_word} -> {y_word} (prerequisite/cause)",
"chain_edges": [chain_edge],
"slots": {"X": x_word, "Y": y_word},
}
return saying, debug
@ -543,15 +615,22 @@ class FalseEquivalence(MetaTemplate):
a_props = _short_concepts(self.graph.neighbors(a, "HasProperty"), max_words=2)
b_props = set(p[0] for p in self.graph.neighbors(b_word, "HasProperty"))
chain_edges = []
differentiators = [p for p in a_props if p[0] not in b_props]
if differentiators:
p_word = _readable(random.choice(differentiators)[0])
p_edge = random.choice(differentiators)
p_word = _readable(p_edge[0])
chain_edges.append({"start": a, "relation": "HasProperty", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
elif a_props:
p_word = _readable(random.choice(a_props)[0])
p_edge = random.choice(a_props)
p_word = _readable(p_edge[0])
chain_edges.append({"start": a, "relation": "HasProperty", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
else:
a_caps = self.graph.neighbors(a, "CapableOf")
if a_caps:
p_word = _readable(random.choice(a_caps)[0])
p_edge = random.choice(a_caps)
p_word = _readable(p_edge[0])
chain_edges.append({"start": a, "relation": "CapableOf", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
else:
p_word = random.choice(["ambition", "an attitude", "a plan", "patience"])
@ -562,6 +641,7 @@ class FalseEquivalence(MetaTemplate):
"template_family": self.id,
"template": template,
"chain": f"{a} IsA same category as {b_word}; {a} HasProperty {p_word}",
"chain_edges": chain_edges,
"slots": {"A": a, "B": b_word, "P": p_word},
}
return saying, debug
@ -621,7 +701,10 @@ TEMPLATE_REGISTRY = {
def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
debug=False, max_retries=20):
"""Generate a single folksy saying."""
"""Generate a single folksy saying.
When debug=True, always returns (saying, debug_dict) with chain_edges included.
"""
for _ in range(max_retries):
if template_id:
tid = template_id
@ -631,7 +714,7 @@ def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
cls = TEMPLATE_REGISTRY.get(tid)
if not cls:
print(f"Unknown template: {tid}", file=sys.stderr)
return None
return None, None
tmpl = cls(graph)
saying, dbg = tmpl.generate(seed_word=seed_word, seed_category=seed_category)
@ -643,6 +726,16 @@ def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
return None, None
def _get_seed_word(dbg):
"""Extract the primary seed word from debug slots for dedup tracking."""
slots = dbg.get("slots", {})
# Templates use different slot names for the seed
for key in ("A", "Z", "seed", "X"):
if key in slots:
return slots[key]
return None
def main():
parser = argparse.ArgumentParser(
description="Generate folksy fake-proverbs using ConceptNet relationships."
@ -655,8 +748,13 @@ def main():
parser.add_argument("--count", "-n", type=int, default=1, help="Number of sayings to generate")
parser.add_argument("--output", "-o", help="Output file (default: stdout)")
parser.add_argument("--debug", "-d", action="store_true", help="Show relationship chain debug info")
parser.add_argument("--json", action="store_true", help="Output JSONL format with full metadata")
parser.add_argument("--vocab", help="Path to folksy_vocab.csv")
parser.add_argument("--relations", help="Path to folksy_relations.csv")
parser.add_argument("--pure-conceptnet", action="store_true",
help="Skip loading augmented relations file")
parser.add_argument("--llm-weight-boost", type=float, default=0.0,
help="Boost weight of LLM-augmented edges with weight < 1.0 (default: 0.0)")
parser.add_argument("--list-templates", action="store_true", help="List available templates")
parser.add_argument("--list-categories", action="store_true", help="List available categories")
@ -679,6 +777,30 @@ def main():
print("Run scripts/extract_from_conceptnet.py first to generate data files.", file=sys.stderr)
sys.exit(1)
# Load augmented relations if available
if not args.pure_conceptnet:
augmented_path = DATA_DIR / "folksy_relations_augmented.csv"
if augmented_path.exists():
boost = args.llm_weight_boost
with open(augmented_path, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
count = 0
for row in reader:
sw = row["start_word"]
ew = row["end_word"]
rel = row["relation"]
w = float(row["weight"])
if w < 1.0 and boost:
w = min(w + boost, 1.0)
surf = row.get("surface_text", "")
graph.edges[(sw, rel)].append((ew, w, surf))
graph.reverse[(ew, rel)].append((sw, w, surf))
graph.all_edges[sw].append((ew, rel, w))
graph.all_edges[ew].append((sw, rel, w))
count += 1
if count:
print(f"Loaded {count} augmented edges.", file=sys.stderr)
if args.list_categories:
for cat in sorted(graph.by_category.keys()):
print(f" {cat:20s} ({len(graph.by_category[cat])} words)")
@ -688,26 +810,96 @@ def main():
if args.entities:
graph.merge_fictional(args.entities)
# JSON mode implies debug internally
use_debug = args.debug or args.json
# Generate
out = open(args.output, "w", encoding="utf-8") if args.output else sys.stdout
try:
for i in range(args.count):
if args.count > 1:
# Deduplication tracking for batch mode
seen_text = set()
seen_slots = set()
seed_usage = defaultdict(int)
generated = 0
max_outer_attempts = args.count * 10 # generous outer limit
attempts = 0
while generated < args.count and attempts < max_outer_attempts:
attempts += 1
saying, dbg = generate_one(
graph,
template_id=args.template,
seed_word=args.seed,
seed_category=args.category,
debug=use_debug,
)
if not saying:
continue
# Dedup checks (failures don't count against retry limit)
if saying in seen_text:
continue
if dbg:
slots_key = (dbg["template_family"], frozenset(dbg["slots"].items()))
if slots_key in seen_slots:
continue
seed_w = _get_seed_word(dbg)
if seed_w and seed_usage[seed_w] >= 30:
continue
if seed_w:
seed_usage[seed_w] += 1
seen_slots.add(slots_key)
seen_text.add(saying)
generated += 1
if args.json and dbg:
record = {
"raw_text": saying,
"meta_template": dbg["template_family"],
"surface_template": dbg["template"],
"slots": dbg["slots"],
"chain": dbg.get("chain_edges", []),
}
out.write(json.dumps(record, ensure_ascii=False) + "\n")
else:
out.write(saying + "\n")
if args.debug and dbg:
out.write(f" [DEBUG] family={dbg['template_family']}\n")
out.write(f" [DEBUG] chain: {dbg['chain']}\n")
out.write(f" [DEBUG] slots: {dbg['slots']}\n")
out.write("\n")
else:
# Single generation (no dedup needed)
saying, dbg = generate_one(
graph,
template_id=args.template,
seed_word=args.seed,
seed_category=args.category,
debug=args.debug,
debug=use_debug,
)
if saying:
out.write(saying + "\n")
if args.debug and dbg:
out.write(f" [DEBUG] family={dbg['template_family']}\n")
out.write(f" [DEBUG] chain: {dbg['chain']}\n")
out.write(f" [DEBUG] slots: {dbg['slots']}\n")
out.write("\n")
if args.json and dbg:
record = {
"raw_text": saying,
"meta_template": dbg["template_family"],
"surface_template": dbg["template"],
"slots": dbg["slots"],
"chain": dbg.get("chain_edges", []),
}
out.write(json.dumps(record, ensure_ascii=False) + "\n")
else:
out.write(saying + "\n")
if args.debug and dbg:
out.write(f" [DEBUG] family={dbg['template_family']}\n")
out.write(f" [DEBUG] chain: {dbg['chain']}\n")
out.write(f" [DEBUG] slots: {dbg['slots']}\n")
out.write("\n")
else:
out.write(f"(failed to generate saying #{i+1} after retries)\n")
out.write("(failed to generate saying after retries)\n")
finally:
if args.output:
out.close()


@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""Compute corpus statistics and validation metrics.
Reads corpus files and computes counts, distributions, coverage, and balance warnings.
Usage:
python scripts/compute_corpus_stats.py
python scripts/compute_corpus_stats.py --corpus-dir corpus/
"""
import argparse
import csv
import json
import sys
from collections import Counter
from pathlib import Path
SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = PROJECT_DIR / "data"
def load_jsonl(path):
"""Load a JSONL file."""
entries = []
if not path.exists():
return entries
with open(path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
entries.append(json.loads(line))
return entries
def classify_input_type(inp):
"""Classify the input framing type of a training pair."""
if inp.startswith("Tell me something about"):
return "word_seeded"
elif inp.startswith("Tell me a saying about"):
return "category_seeded"
elif inp.startswith("What would a"):
return "persona_seeded"
elif inp.startswith("Give me a") and "proverb" in inp:
return "template_seeded"
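# NOTE: the "Give me a proverb" prefix never matches here; the template_seeded branch
# above already claims every input that starts with "Give me a" and contains "proverb".
elif any(inp.startswith(p) for p in [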
"Tell me some folk", "What do they", "Give me a proverb",
"Share some", "What's a good"
]):
return "open_ended"
else:
return "fictional"
def main():
parser = argparse.ArgumentParser(description="Compute corpus statistics.")
parser.add_argument("--corpus-dir", default=str(PROJECT_DIR / "corpus"),
help="Corpus directory")
parser.add_argument("--output", default=None,
help="Output JSON file (default: corpus_dir/corpus_stats.json)")
args = parser.parse_args()
corpus_dir = Path(args.corpus_dir)
output_path = Path(args.output) if args.output else corpus_dir / "corpus_stats.json"
# Load all corpus files
raw = load_jsonl(corpus_dir / "corpus_raw.jsonl")
polished = load_jsonl(corpus_dir / "corpus_polished.jsonl")
filtered = load_jsonl(corpus_dir / "corpus_filtered.jsonl")
training = load_jsonl(corpus_dir / "training_pairs.jsonl")
# Load vocab for coverage analysis
vocab_words = set()
vocab_path = DATA_DIR / "folksy_vocab.csv"
if vocab_path.exists():
with open(vocab_path, newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
vocab_words.add(row["word"])
stats = {}
# --- Raw corpus stats ---
stats["raw_count"] = len(raw)
raw_by_template = Counter(e.get("meta_template", "unknown") for e in raw)
stats["raw_by_template"] = dict(sorted(raw_by_template.items()))
# --- Polish stats ---
polished_entries = [e for e in polished if e.get("status") == "polished"]
discarded_entries = [e for e in polished if e.get("status") == "discarded"]
error_entries = [e for e in polished if e.get("status") == "error"]
stats["polished_count"] = len(polished_entries)
stats["discarded_during_polish"] = len(discarded_entries)
stats["errors_during_polish"] = len(error_entries)
if polished_entries or discarded_entries:
total_processed = len(polished_entries) + len(discarded_entries)
stats["polish_discard_rate"] = f"{len(discarded_entries)/total_processed*100:.1f}%"
polish_by_template = Counter(e.get("meta_template", "unknown") for e in polished_entries)
stats["polished_by_template"] = dict(sorted(polish_by_template.items()))
discard_by_template = Counter(e.get("meta_template", "unknown") for e in discarded_entries)
stats["discarded_by_template"] = dict(sorted(discard_by_template.items()))
# --- Filter stats ---
stats["filtered_count"] = len(filtered)
filter_by_template = Counter(e.get("meta_template", "unknown") for e in filtered)
stats["filtered_by_template"] = dict(sorted(filter_by_template.items()))
# Filter discard count
stats["discarded_during_filter"] = len(polished_entries) - len(filtered)
# --- Training pairs stats ---
stats["training_pair_count"] = len(training)
training_by_template = Counter(e.get("meta_template", "unknown") for e in training)
stats["training_by_template"] = dict(sorted(training_by_template.items()))
input_type_counts = Counter(classify_input_type(e.get("input", "")) for e in training)
stats["training_by_input_type"] = dict(sorted(input_type_counts.items()))
# --- Coverage analysis ---
used_words = set()
for entry in filtered:
slots = entry.get("slots", {})
for v in slots.values():
word = v.lower().replace(" ", "_")
if word in vocab_words:
used_words.add(word)
stats["unique_slot_words_used"] = len(used_words)
stats["total_vocab_words"] = len(vocab_words)
stats["vocab_coverage"] = f"{len(used_words)/len(vocab_words)*100:.1f}%" if vocab_words else "N/A"
never_used = sorted(vocab_words - used_words)
stats["words_never_used"] = never_used
stats["words_never_used_count"] = len(never_used)
# --- Saying length stats ---
lengths = []
for entry in filtered:
text = entry.get("polished_text", "")
if text:
lengths.append(len(text.split()))
if lengths:
stats["avg_saying_length_words"] = round(sum(lengths) / len(lengths), 1)
stats["min_saying_length_words"] = min(lengths)
stats["max_saying_length_words"] = max(lengths)
# --- Balance warnings ---
warnings = []
if filtered:
total_filtered = len(filtered)
for template, count in filter_by_template.items():
pct = count / total_filtered * 100
if pct < 10:
warnings.append(
f"WARNING: {template} has only {count} entries ({pct:.1f}%) — "
f"below 10% threshold. Generate more raw sayings for this family."
)
if training:
total_training = len(training)
for template, count in training_by_template.items():
pct = count / total_training * 100
if pct < 5:
warnings.append(
f"WARNING: {template} has only {count} training pairs ({pct:.1f}%) — very underrepresented."
)
stats["balance_warnings"] = warnings
# --- Write output ---
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
json.dump(stats, f, indent=2, ensure_ascii=False)
# --- Print summary ---
print("=" * 60)
print("CORPUS STATISTICS")
print("=" * 60)
print(f"\nRaw sayings: {stats['raw_count']}")
print(f"Polished sayings: {stats['polished_count']}")
print(f"Discarded (polish): {stats.get('discarded_during_polish', 0)} ({stats.get('polish_discard_rate', 'N/A')})")
print(f"Discarded (filter): {stats.get('discarded_during_filter', 0)}")
print(f"Final filtered: {stats['filtered_count']}")
print(f"Training pairs: {stats['training_pair_count']}")
print(f"\nDistribution by meta-template (filtered):")
for t, c in sorted(filter_by_template.items()):
pct = c / len(filtered) * 100 if filtered else 0
print(f" {t:30s} {c:5d} ({pct:5.1f}%)")
print(f"\nDistribution by input framing type:")
for t, c in sorted(input_type_counts.items()):
print(f" {t:20s} {c:5d}")
print(f"\nVocab coverage: {stats['vocab_coverage']} ({stats['unique_slot_words_used']}/{stats['total_vocab_words']})")
print(f"Average saying length: {stats.get('avg_saying_length_words', 'N/A')} words")
if warnings:
print(f"\nBalance warnings:")
for w in warnings:
print(f" {w}")
print(f"\nFull stats: {output_path}")
if __name__ == "__main__":
main()
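
The coverage step in `main()` normalizes each slot filler (lowercase, spaces to underscores) and counts how many vocabulary words appear at least once. A self-contained sketch of that calculation, assuming slot values are strings or can be safely stringified; `slot_coverage` is a hypothetical name and the sample entries are invented:

```python
def slot_coverage(entries, vocab_words):
    """Return (used_words, coverage_fraction) for slot fillers found in the vocab."""
    used = set()
    for entry in entries:
        for value in entry.get("slots", {}).values():
            # Same normalization as the script: lowercase, spaces become underscores
            word = str(value).lower().replace(" ", "_")
            if word in vocab_words:
                used.add(word)
    # Guard against an empty vocabulary instead of dividing by zero
    fraction = len(used) / len(vocab_words) if vocab_words else 0.0
    return used, fraction


entries = [
    {"slots": {"A": "Yeast", "B": "bread"}},
    {"slots": {"A": "fishing rod"}},
    {},  # entries without slots are tolerated
]
vocab = {"yeast", "bread", "fishing_rod", "horse"}
used, fraction = slot_coverage(entries, vocab)
```

Here `vocab - used` gives the never-used words, which is the list the script stores under `words_never_used`.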

787
scripts/enhance_graph.py Normal file

@ -0,0 +1,787 @@
#!/usr/bin/env python3
"""LLM-augmented graph enhancement for the folksy subgraph.
Three phases:
Phase 1: Per-word relationship expansion
Phase 2: Cross-word bridge discovery
Phase 3: Property enrichment for false_equivalence templates
Usage:
python scripts/enhance_graph.py --phase 1 # Run phase 1 only
python scripts/enhance_graph.py --phase 2 # Run phase 2 only
python scripts/enhance_graph.py --phase 3 # Run phase 3 only
python scripts/enhance_graph.py --all # Run all phases
python scripts/enhance_graph.py --phase 1 --dry-run # Print prompts without calling LLM
"""
import argparse
import csv
import os
import random
import re
import sys
import time
from collections import defaultdict
from datetime import datetime
from pathlib import Path
# Paths
SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = PROJECT_DIR / "data"
LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
LLM_MODEL = "THUDM-GLM4-32B"
VALID_RELATIONS = {
"AtLocation", "MadeOf", "PartOf", "UsedFor", "HasA", "HasProperty",
"Causes", "HasPrerequisite", "CapableOf", "ReceivesAction", "Desires",
"CausesDesire", "LocatedNear", "CreatedBy", "MotivatedByGoal", "HasSubevent",
}
AUGMENTED_CSV = DATA_DIR / "folksy_relations_augmented.csv"
CANDIDATE_CSV = DATA_DIR / "candidate_additions.csv"
LOG_CSV = DATA_DIR / "enhancement_log.csv"
# ---------------------------------------------------------------------------
# Infrastructure
# ---------------------------------------------------------------------------
def llm_chat_completion(messages, max_retries=3):
"""Chat completion with retry logic."""
import requests
for attempt in range(max_retries):
try:
resp = requests.post(LLM_ENDPOINT, json={
"model": LLM_MODEL,
"messages": messages,
}, timeout=120)
resp.raise_for_status()
data = resp.json()
return data["choices"][0]["message"]["content"]
except Exception as e:
wait = (2 ** attempt)
print(f" LLM call failed (attempt {attempt+1}/{max_retries}): {e}", file=sys.stderr)
if attempt < max_retries - 1:
print(f" Retrying in {wait}s...", file=sys.stderr)
time.sleep(wait)
else:
print(f" Giving up on this word.", file=sys.stderr)
return None
def load_vocab():
"""Load folksy vocabulary."""
vocab = {}
with open(DATA_DIR / "folksy_vocab.csv", newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
word = row["word"]
cats = [c.strip() for c in row["categories"].split(",") if c.strip()]
vocab[word] = {
"categories": cats,
# `or 0` also covers columns that are present but empty, which the row.get default does not
"tangibility": float(row.get("tangibility_score") or 0),
"edge_count": int(row.get("conceptnet_edge_count") or 0),
}
return vocab
def load_relations():
"""Load existing relations (ConceptNet + any existing augmented)."""
edges = defaultdict(list) # (start, relation) -> [(end, weight, surface)]
existing_triples = set() # (start, end, relation) for dedup
for path in [DATA_DIR / "folksy_relations.csv", AUGMENTED_CSV]:
if not path.exists():
continue
with open(path, newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
sw = row["start_word"]
ew = row["end_word"]
rel = row["relation"]
# Skip rows with an empty weight (likely truncated or corrupted)
if not row["weight"]: continue
w = float(row["weight"])
surf = row.get("surface_text", "")
edges[(sw, rel)].append((ew, w, surf))
existing_triples.add((sw, ew, rel))
return edges, existing_triples
def load_checkpoint():
"""Load enhancement log to determine what's already been processed."""
processed = set() # (word, phase)
if LOG_CSV.exists():
with open(LOG_CSV, newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
processed.add((row["source_word"], row["phase"]))
return processed
def append_log(word, phase, edges_generated, edges_accepted, edges_duplicate, edges_oov):
"""Append a row to the enhancement log."""
write_header = not LOG_CSV.exists()
with open(LOG_CSV, "a", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
if write_header:
writer.writerow(["source_word", "phase", "timestamp",
"edges_generated", "edges_accepted", "edges_duplicate", "edges_oov"])
writer.writerow([word, phase, datetime.now().isoformat(),
edges_generated, edges_accepted, edges_duplicate, edges_oov])
def append_augmented_edges(edges):
"""Append edges to the augmented relations CSV."""
write_header = not AUGMENTED_CSV.exists()
with open(AUGMENTED_CSV, "a", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
if write_header:
writer.writerow(["start_word", "end_word", "relation", "weight", "surface_text", "source"])
for e in edges:
writer.writerow([e["start_word"], e["end_word"], e["relation"],
e["weight"], e["surface_text"], e["source"]])
def append_candidates(candidates):
"""Append candidate words to the candidate additions CSV."""
write_header = not CANDIDATE_CSV.exists()
with open(CANDIDATE_CSV, "a", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
if write_header:
writer.writerow(["word", "suggested_by", "relation_context", "frequency"])
for c in candidates:
writer.writerow([c["word"], c["suggested_by"], c["relation_context"], c["frequency"]])
# ---------------------------------------------------------------------------
# Parsing
# ---------------------------------------------------------------------------
def parse_llm_relations(response_text, source_word):
"""Parse structured LLM output into edge dicts.
Handles bullets, numbering, extra whitespace, multi-word targets.
"""
edges = []
if not response_text:
return edges
for line in response_text.strip().split("\n"):
line = line.strip()
if not line:
continue
# Strip leading bullets/numbers: "- ", "1. ", "* ", etc.
line = re.sub(r"^[\d]+[.)]\s*", "", line)
line = re.sub(r"^[-*•]\s*", "", line)
line = line.strip()
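# Skip explicit NONE answers; match the token itself so that targets which
# merely contain those letters are not dropped
if not line or re.match(r"(\w+:\s*)?NONE\b", line, flags=re.IGNORECASE):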
continue
# Match: RELATION_TYPE: target_word(s) | surface text
match = re.match(r"^(\w+):\s*(.+?)\s*\|\s*(.+)$", line)
if not match:
continue
relation, target_raw, surface = match.groups()
relation = relation.strip()
if relation not in VALID_RELATIONS:
continue
# Normalize target: lowercase, replace spaces with underscores for multi-word
target = target_raw.strip().lower()
target = re.sub(r"\s+", "_", target)
# Skip self-loops
if target == source_word:
continue
edges.append({
"start_word": source_word,
"end_word": target,
"relation": relation,
"weight": 0.8,
"surface_text": surface.strip(),
"source": "llm_augmented",
})
return edges
def parse_bridge_response(response_text, word_a, word_b):
"""Parse bridge discovery LLM output."""
edges = []
if not response_text:
return edges
for line in response_text.strip().split("\n"):
line = line.strip()
if not line:
continue
# Strip common prefixes
line = re.sub(r"^[\d]+[.)]\s*", "", line)
line = re.sub(r"^[-*•]\s*", "", line)
line = re.sub(r"^BRIDGE:\s*", "", line, flags=re.IGNORECASE)
line = line.strip()
if not line:
continue
# BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
parts = [p.strip() for p in line.split("|")]
if len(parts) < 3:
continue
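bridge_word = parts[0].strip().lower().replace(" ", "_")
# Skip empty bridges and self-loops back to either input word
if not bridge_word or bridge_word in (word_a, word_b): continue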
# Parse relation_to_first
rel1_match = re.search(r"(?:relation_to_first|first):\s*(\w+)", parts[1], re.IGNORECASE)
rel2_match = re.search(r"(?:relation_to_second|second):\s*(\w+)", parts[2], re.IGNORECASE)
if not rel1_match or not rel2_match:
# Try simpler format: just the relation type
rel1_match = re.match(r"(\w+)", parts[1].split(":")[-1].strip())
rel2_match = re.match(r"(\w+)", parts[2].split(":")[-1].strip())
if not rel1_match or not rel2_match:
continue
rel1 = rel1_match.group(1)
rel2 = rel2_match.group(1)
if rel1 not in VALID_RELATIONS or rel2 not in VALID_RELATIONS:
continue
explanation = parts[3].strip() if len(parts) > 3 else ""
# Create edges: word_a -> bridge and bridge -> word_b
edges.append({
"start_word": word_a,
"end_word": bridge_word,
"relation": rel1,
"weight": 0.8,
"surface_text": explanation,
"source": "llm_bridge",
})
edges.append({
"start_word": bridge_word,
"end_word": word_b,
"relation": rel2,
"weight": 0.8,
"surface_text": explanation,
"source": "llm_bridge",
})
return edges
def parse_property_response(response_text, word):
"""Parse property enrichment LLM output."""
edges = []
if not response_text:
return edges
for line in response_text.strip().split("\n"):
line = line.strip()
if not line:
continue
line = re.sub(r"^[\d]+[.)]\s*", "", line)
line = re.sub(r"^[-*•]\s*", "", line)
line = line.strip()
if not line:
continue
# PROPERTY | explanation
parts = [p.strip() for p in line.split("|")]
if len(parts) < 1:
continue
prop = parts[0].strip().lower().replace(" ", "_")
explanation = parts[1].strip() if len(parts) > 1 else f"{word} is {prop}"
if not prop or prop == word:
continue
edges.append({
"start_word": word,
"end_word": prop,
"relation": "HasProperty",
"weight": 0.8,
"surface_text": explanation,
"source": "llm_property",
})
return edges
# ---------------------------------------------------------------------------
# Phase 1: Per-Word Expansion
# ---------------------------------------------------------------------------
PHASE1_SYSTEM = """You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
Rules:
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
- Every relationship must be something a typical adult would agree is true
- Do not repeat any relationship already listed as "known"
- Target words should be common English words (top 3000 frequency preferred)
- Output ONLY the structured format shown below, one relationship per line
- If you cannot think of good relationships for a given type, output NONE for that type
- Aim for 3-5 relationships per type where possible
Output format (one per line):
RELATION_TYPE: target_word | short natural phrasing
Example output:
AtLocation: barn | you find a horse in a barn
UsedFor: riding | a horse is used for riding
HasA: mane | a horse has a mane
CapableOf: gallop | a horse can gallop
MadeOf: NONE
PartOf: herd | a horse is part of a herd"""
PHASE1_USER = """Word: {word}
Categories: {categories}
Known relationships:
{existing_edges}
Generate additional relationships for these types:
- AtLocation (where is it found?)
- UsedFor (what is it used for?)
- HasA (what does it have / contain?)
- PartOf (what is it part of?)
- CapableOf (what can it do?)
- MadeOf (what is it made of?)
- HasPrerequisite (what do you need before you can have/use it?)
- Causes (what does it cause or lead to?)
- HasProperty (what adjectives describe it? limit to physical/sensory properties)"""
def format_existing_edges(edges_dict, word):
"""Format existing edges for a word grouped by relation type."""
relation_types = ["AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
"MadeOf", "HasPrerequisite", "Causes", "HasProperty"]
lines = []
for rel in relation_types:
targets = edges_dict.get((word, rel), [])
if targets:
formatted = ", ".join(f"{t[0]} (weight {t[1]:.1f})" for t in targets[:10])
lines.append(f"{rel}: {formatted}")
else:
lines.append(f"{rel}: (none in database)")
return "\n".join(lines)
def run_phase1(vocab, edges, existing_triples, checkpoint, dry_run=False):
"""Phase 1: Per-word relationship expansion."""
words = sorted(vocab.keys())
total = len(words)
total_accepted = 0
total_skipped = 0
print(f"Phase 1: Processing {total} words...")
for i, word in enumerate(words):
if (word, "1") in checkpoint:
total_skipped += 1
continue
categories = ", ".join(vocab[word]["categories"])
existing = format_existing_edges(edges, word)
user_prompt = PHASE1_USER.format(
word=word, categories=categories, existing_edges=existing
)
messages = [
{"role": "system", "content": PHASE1_SYSTEM},
{"role": "user", "content": user_prompt},
]
if dry_run:
if i < 3: # Show first 3 prompts
print(f"\n--- Prompt for '{word}' ---")
print(f"System: {PHASE1_SYSTEM[:200]}...")
print(f"User:\n{user_prompt}")
elif i == 3:
print(f"\n... ({total - 3} more words) ...")
continue
response = llm_chat_completion(messages)
parsed = parse_llm_relations(response, word) if response else []
# Classify edges
accepted = []
candidates = []
duplicates = 0
for edge in parsed:
triple = (edge["start_word"], edge["end_word"], edge["relation"])
if triple in existing_triples:
duplicates += 1
continue
existing_triples.add(triple)
if edge["end_word"] in vocab:
accepted.append(edge)
else:
candidates.append({
"word": edge["end_word"],
"suggested_by": word,
"relation_context": f"{edge['relation']}: {edge['surface_text']}",
"frequency": 1,
})
if accepted:
append_augmented_edges(accepted)
# Also update in-memory edges for subsequent words
for e in accepted:
edges[(e["start_word"], e["relation"])].append(
(e["end_word"], e["weight"], e["surface_text"]))
if candidates:
append_candidates(candidates)
total_accepted += len(accepted)
append_log(word, "1", len(parsed), len(accepted), duplicates, len(candidates))
if (i + 1) % 50 == 0:
print(f" [{i+1}/{total}] {total_accepted} edges accepted so far")
time.sleep(0.1)
if dry_run:
print(f"\nDry run complete. Would process {total - total_skipped} words.")
else:
print(f"\nPhase 1 complete: {total_accepted} new edges accepted.")
# ---------------------------------------------------------------------------
# Phase 2: Cross-Word Bridge Discovery
# ---------------------------------------------------------------------------
PHASE2_SYSTEM = """You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
Rules:
- The bridge word must be a common, concrete noun
- State the relationship type for each connection
- Valid relationship types: AtLocation, UsedFor, HasA, PartOf, CapableOf, MadeOf, HasPrerequisite, Causes, HasProperty, ReceivesAction, Desires, CausesDesire, LocatedNear, CreatedBy
- Output format: BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
Example:
Words: "cow" and "butter"
milk | relation_to_first: CapableOf | relation_to_second: MadeOf | milk connects production to product"""
PHASE2_USER = """Words: "{word_a}" and "{word_b}"
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
Find 1-3 bridge words that connect them."""
def build_reachability(vocab, edges):
"""Build 2-hop reachability from vocab words to other vocab words."""
vocab_set = set(vocab.keys())
reachable = defaultdict(set) # word -> set of reachable vocab words
for word in vocab:
# Direct (1-hop) neighbors in vocab
for (sw, rel), targets in edges.items():
if sw == word:
for (ew, w, s) in targets:
if ew in vocab_set and ew != word:
reachable[word].add(ew)
# 2-hop from this neighbor
for (sw2, rel2), targets2 in edges.items():
if sw2 == ew:
for (ew2, w2, s2) in targets2:
if ew2 in vocab_set and ew2 != word:
reachable[word].add(ew2)
return reachable
def run_phase2(vocab, edges, existing_triples, checkpoint, dry_run=False):
"""Phase 2: Cross-word bridge discovery."""
print("Phase 2: Building reachability matrix...")
reachable = build_reachability(vocab, edges)
# Find low-connectivity words
vocab_set = set(vocab.keys())
low_connectivity = []
for word in vocab:
reach_count = len(reachable.get(word, set()))
if reach_count < 10:
low_connectivity.append((word, reach_count))
low_connectivity.sort(key=lambda x: x[1])
print(f" {len(low_connectivity)} words with <10 reachable vocab words")
# Build category index
by_category = defaultdict(list)
for word, info in vocab.items():
for cat in info["categories"]:
by_category[cat].append(word)
total_accepted = 0
pairs_processed = 0
total_skipped = 0
for word, reach_count in low_connectivity:
if (word, "2") in checkpoint:
total_skipped += 1
continue
word_cats = vocab[word]["categories"]
word_reachable = reachable.get(word, set())
# Find same-category words that are unreachable
unreachable = []
for cat in word_cats:
for peer in by_category.get(cat, []):
if peer != word and peer not in word_reachable:
unreachable.append(peer)
if not unreachable:
if not dry_run: append_log(word, "2", 0, 0, 0, 0)  # a dry run must not write the log
continue
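# Sample up to 10 distinct unreachable peers (a peer is listed once per shared category)
unique_unreachable = sorted(set(unreachable))
sample = random.sample(unique_unreachable, min(10, len(unique_unreachable)))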
accepted_for_word = 0
for peer in sample:
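pair_key = f"{word}:{peer}"
# NOTE: append_log in this script is only called with the bare word, so no pair-level
# key reaches the checkpoint and this test never fires; resumption is per word.
if (pair_key, "2") in checkpoint:
continue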
categories_a = ", ".join(vocab[word]["categories"])
categories_b = ", ".join(vocab[peer]["categories"])
user_prompt = PHASE2_USER.format(
word_a=word, word_b=peer,
categories_a=categories_a, categories_b=categories_b,
)
messages = [
{"role": "system", "content": PHASE2_SYSTEM},
{"role": "user", "content": user_prompt},
]
if dry_run:
if pairs_processed < 3:
print(f"\n--- Bridge prompt: '{word}' <-> '{peer}' ---")
print(f"User:\n{user_prompt}")
elif pairs_processed == 3:
print(f"\n... (more pairs) ...")
pairs_processed += 1
continue
response = llm_chat_completion(messages)
parsed = parse_bridge_response(response, word, peer) if response else []
accepted = []
duplicates = 0
oov = 0
for edge in parsed:
triple = (edge["start_word"], edge["end_word"], edge["relation"])
if triple in existing_triples:
duplicates += 1
continue
existing_triples.add(triple)
# For bridge edges, both endpoints should ideally be in vocab
if edge["start_word"] in vocab_set and edge["end_word"] in vocab_set:
accepted.append(edge)
elif edge["start_word"] in vocab_set or edge["end_word"] in vocab_set:
# At least one end in vocab — still useful
accepted.append(edge)
else:
oov += 1
if accepted:
append_augmented_edges(accepted)
for e in accepted:
edges[(e["start_word"], e["relation"])].append(
(e["end_word"], e["weight"], e["surface_text"]))
accepted_for_word += len(accepted)
pairs_processed += 1
time.sleep(0.1)
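total_accepted += accepted_for_word
# A dry run must not write checkpoint rows, or the next real run would skip this word
if not dry_run: append_log(word, "2", 0, accepted_for_word, 0, 0)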
if (pairs_processed) % 20 == 0:
print(f" {pairs_processed} pairs processed, {total_accepted} edges accepted")
if dry_run:
print(f"\nDry run complete. Would process {pairs_processed} word pairs.")
else:
print(f"\nPhase 2 complete: {total_accepted} bridge edges accepted from {pairs_processed} pairs.")
# ---------------------------------------------------------------------------
# Phase 3: Property Enrichment
# ---------------------------------------------------------------------------
PHASE3_SYSTEM = """You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
Rules:
- Only physical/sensory/behavioral properties, not abstract qualities
- Properties should DISTINGUISH this thing from similar things in its category
- Output one property per line as: PROPERTY | brief explanation
- Aim for 5-8 properties"""
PHASE3_USER = """Word: {word}
Category: {categories}
Other words in same category: {peers}
What properties distinguish {word} from the others listed?"""
def run_phase3(vocab, edges, existing_triples, checkpoint, dry_run=False):
"""Phase 3: Property enrichment for false_equivalence templates."""
by_category = defaultdict(list)
for word, info in vocab.items():
for cat in info["categories"]:
by_category[cat].append(word)
words = sorted(vocab.keys())
total = len(words)
total_accepted = 0
total_skipped = 0
print(f"Phase 3: Property enrichment for {total} words...")
for i, word in enumerate(words):
if (word, "3") in checkpoint:
total_skipped += 1
continue
word_cats = vocab[word]["categories"]
categories = ", ".join(word_cats)
# Gather same-category peers (sample of 10)
peers = set()
for cat in word_cats:
for peer in by_category.get(cat, []):
if peer != word:
peers.add(peer)
peer_sample = random.sample(list(peers), min(10, len(peers))) if peers else []
if not peer_sample:
if not dry_run: append_log(word, "3", 0, 0, 0, 0)  # a dry run must not write the log
continue
user_prompt = PHASE3_USER.format(
word=word, categories=categories,
peers=", ".join(peer_sample),
)
messages = [
{"role": "system", "content": PHASE3_SYSTEM},
{"role": "user", "content": user_prompt},
]
if dry_run:
if i < 3:
print(f"\n--- Property prompt for '{word}' ---")
print(f"User:\n{user_prompt}")
elif i == 3:
print(f"\n... ({total - 3} more words) ...")
continue
response = llm_chat_completion(messages)
parsed = parse_property_response(response, word) if response else []
accepted = []
duplicates = 0
for edge in parsed:
triple = (edge["start_word"], edge["end_word"], edge["relation"])
if triple in existing_triples:
duplicates += 1
continue
existing_triples.add(triple)
accepted.append(edge)
if accepted:
append_augmented_edges(accepted)
for e in accepted:
edges[(e["start_word"], e["relation"])].append(
(e["end_word"], e["weight"], e["surface_text"]))
total_accepted += len(accepted)
append_log(word, "3", len(parsed), len(accepted), duplicates, 0)
if (i + 1) % 50 == 0:
print(f" [{i+1}/{total}] {total_accepted} properties accepted so far")
time.sleep(0.1)
if dry_run:
print(f"\nDry run complete. Would process {total - total_skipped} words.")
else:
print(f"\nPhase 3 complete: {total_accepted} new HasProperty edges accepted.")
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="LLM-augmented graph enhancement for folksy subgraph."
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--phase", type=int, choices=[1, 2, 3],
help="Run a specific phase (1, 2, or 3)")
group.add_argument("--all", action="store_true",
help="Run all three phases in sequence")
parser.add_argument("--dry-run", action="store_true",
help="Print prompts without calling LLM")
args = parser.parse_args()
vocab = load_vocab()
edges, existing_triples = load_relations()
checkpoint = load_checkpoint()
print(f"Loaded {len(vocab)} vocab words, {len(existing_triples)} existing edge triples.")
print(f"Checkpoint: {len(checkpoint)} (word, phase) pairs already processed.")
phases = [args.phase] if args.phase else [1, 2, 3]
for phase in phases:
print(f"\n{'='*60}")
print(f"Running Phase {phase}")
print(f"{'='*60}")
if phase == 1:
run_phase1(vocab, edges, existing_triples, checkpoint, args.dry_run)
elif phase == 2:
run_phase2(vocab, edges, existing_triples, checkpoint, args.dry_run)
elif phase == 3:
run_phase3(vocab, edges, existing_triples, checkpoint, args.dry_run)
# Reload checkpoint after each phase for resumability
checkpoint = load_checkpoint()
print("\nDone.")
if __name__ == "__main__":
main()
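
`build_reachability` above rescans the whole edge map for every word and again for every neighbor, which grows quickly with the size of the graph. One possible alternative is to index in-vocab neighbors once and then walk two hops over that index. The sketch below is an assumption about how that could look, not the code in this commit: `build_reachability_indexed` is a hypothetical name, it reads the second hop as applying only to in-vocab first-hop neighbors, and it returns a plain dict with an entry for every vocab word (the original returns a `defaultdict` with entries only for words that reach something).

```python
from collections import defaultdict


def build_reachability_indexed(vocab, edges):
    """2-hop reachability between vocab words using a one-pass neighbor index."""
    vocab_set = set(vocab)
    # One pass over the edge map: start word -> set of in-vocab end words
    neighbors = defaultdict(set)
    for (start, _relation), targets in edges.items():
        for end, _weight, _surface in targets:
            if end in vocab_set:
                neighbors[start].add(end)

    reachable = {}
    for word in vocab_set:
        seen = set()
        for first in neighbors.get(word, ()):
            if first == word:
                continue
            seen.add(first)                        # 1-hop neighbor
            seen.update(neighbors.get(first, ()))  # 2-hop via that neighbor
        seen.discard(word)  # a word never counts as reaching itself
        reachable[word] = seen
    return reachable


# Toy graph in the same shape as load_relations() output (values invented)
vocab = {"cow": {}, "milk": {}, "butter": {}, "horse": {}}
edges = {
    ("cow", "CapableOf"): [("milk", 1.0, "")],
    ("milk", "UsedFor"): [("butter", 1.0, ""), ("cow", 1.0, "")],
    ("butter", "MadeOf"): [("cream", 1.0, "")],  # "cream" is out of vocab
}
reachable = build_reachability_indexed(vocab, edges)
```

Callers that use `reachable.get(word, set())`, as `run_phase2` does, would work with either return type.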

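The Phase 1 parser expects one `RELATION_TYPE: target | surface text` triple per line and tolerates leading bullets or numbering. A reduced sketch of that single-line step, using the same regular expressions as `parse_llm_relations`; `parse_relation_line` is a hypothetical name, and the relation whitelist, NONE handling, and self-loop check are left out for brevity:

```python
import re

# Same shape as the pattern in parse_llm_relations: RELATION: target | surface
LINE_RE = re.compile(r"^(\w+):\s*(.+?)\s*\|\s*(.+)$")


def parse_relation_line(line):
    """Return (relation, target, surface), or None when the line does not match."""
    # Strip leading numbering ("1. ", "2) ") and bullets ("- ", "* ", "• ")
    line = re.sub(r"^\d+[.)]\s*", "", line.strip())
    line = re.sub(r"^[-*•]\s*", "", line).strip()
    match = LINE_RE.match(line)
    if not match:
        return None
    relation, target_raw, surface = match.groups()
    # Multi-word targets are lowercased and joined with underscores
    target = re.sub(r"\s+", "_", target_raw.strip().lower())
    return relation, target, surface.strip()


numbered = parse_relation_line("1. AtLocation: Hay Barn | you find a horse in a barn")
bulleted = parse_relation_line("- UsedFor: riding | a horse is used for riding")
no_pipe = parse_relation_line("MadeOf: NONE")
```

Lines without a `|` separator, such as a bare `MadeOf: NONE`, fall through as `None`, which is why the full parser can drop them without special handling.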
512
scripts/expand_vocab.py Normal file

@ -0,0 +1,512 @@
#!/usr/bin/env python3
"""Expand folksy vocabulary with high-quality candidates from LLM suggestions.
Reads candidate_additions.csv (words suggested by the LLM during phase 1 that
weren't in the vocab), filters for quality, uses the LLM to assign categories,
and appends the survivors to folksy_vocab.csv.
After running this, re-run `enhance_graph.py --phase 1` to generate edges
for the new words (the checkpoint will skip already-processed words).
Usage:
python scripts/expand_vocab.py # Full run
python scripts/expand_vocab.py --dry-run # Show what would be added
python scripts/expand_vocab.py --min-citations 8 # Stricter threshold
"""
import argparse
import csv
import json
import re
import shutil
import sys
import time
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = PROJECT_DIR / "data"
LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
LLM_MODEL = "THUDM-GLM4-32B"
VOCAB_CSV = DATA_DIR / "folksy_vocab.csv"
CANDIDATE_CSV = DATA_DIR / "candidate_additions.csv"
# Valid categories from the existing vocabulary
VALID_CATEGORIES = {
"animal", "beverage", "bird", "building", "clothing", "container", "crop",
"fabric", "fish", "flower", "food", "fruit", "furniture", "grain", "herb",
"insect", "instrument", "landscape", "material", "metal", "mineral",
"organism", "plant", "rock", "seed", "shelter", "spice", "stone",
"structure", "tool", "tree", "vegetable", "vehicle", "water", "weapon", "wood",
}
# ---------------------------------------------------------------------------
# Exclusion lists
# ---------------------------------------------------------------------------
# Abstract concepts, emotions, processes — not concrete enough for folksy vocab
EXCLUDE_ABSTRACT = {
    "ecosystem", "satisfaction", "fullness", "warmth", "fear", "relaxation",
    "growth", "interest", "nature", "protection", "digestion", "injury",
    "decoration", "construction", "landscape", "noise", "sound", "energy",
    "nourishment", "nutrition", "pollination", "sustainability", "tradition",
    "biodiversity", "symbolism", "elegance", "resilience", "patience",
    "beauty", "abundance", "fertility", "creativity", "harmony", "comfort",
    "curiosity", "companionship", "loyalty", "aggression", "alertness",
    "camouflage", "predation", "migration", "hibernation", "decomposition",
    "erosion", "combustion", "fermentation", "oxidation", "corrosion",
    "photosynthesis", "respiration", "evaporation", "precipitation",
    "transpiration", "germination", "excitement", "enjoyment", "satiety",
    "stability", "organization", "fragrance", "moisture", "wildlife",
    "preservation", "conversation", "inspiration", "storage", "observation",
    "hydration", "destruction", "entertainment", "education", "knowledge",
    "safety", "practice", "research", "skill", "space", "license",
    "collection", "habitat", "pollution", "health", "vibration", "wonder",
    "awe", "refreshment", "irritation", "happiness", "joy", "damage",
    "death", "pain", "thirst", "alarm", "contents", "ingredients",
    "electricity", "oxygen", "navigation", "recreation", "meditation",
    "celebration", "communication", "imagination", "devotion",
    "ambition", "endurance", "independence", "discipline", "cooperation",
    "sweetness", "aroma", "flavor", "texture",
    "smell", "color", "surface", "bottom", "edge",
    "nutrients", "study", "outfit", "upholstery",
}
# Scientific/technical — not folksy enough for folk wisdom
EXCLUDE_TECHNICAL = {
    "cellulose", "exoskeleton", "protein", "tissue", "cells", "alloy",
    "enzyme", "chlorophyll", "genome", "photon",
    "organism", "molecule", "compound", "polymer", "isotope",
    "ecosystem", "metabolism", "catalyst", "membrane", "chromosome",
    "cell", "nutrient", "ingredient", "material", "content",
}
# Collective/institutional nouns — not concrete individual things
EXCLUDE_INSTITUTIONAL = {
"orchestra", "fleet", "arsenal", "toolkit", "collection",
"restaurant", "museum", "university", "corporation", "organization",
"musician", "breakfast", "dinner", "meal", "dish", "sandwich",
"seafood", "refrigerator", "garage", "basement", "park",
}
# Adjectives and properties — useful as HasProperty targets but not as vocab words
EXCLUDE_ADJECTIVES = {
"small", "large", "heavy", "colorful", "green", "brown", "hard",
"white", "round", "sharp", "sturdy", "long", "soft", "flat",
"sweet", "bitter", "smooth", "rough", "bright", "dark", "dry",
"wet", "thick", "thin", "warm", "cold", "hot", "tall", "short",
"red", "blue", "yellow", "black", "grey", "gray", "pink",
"fragrant", "loud", "spicy", "sour", "tough", "delicate", "strong",
"weak", "light", "dense", "portable", "lightweight", "transparent",
"opaque", "flexible", "rigid", "brittle", "elastic", "porous",
"compact", "edible", "toxic", "aromatic", "nocturnal", "aquatic",
"durable", "cylindrical", "wooden", "shiny", "solid", "narrow",
"metallic", "pungent", "juicy", "fast", "powerful", "woody",
"fibrous", "savory", "liquid", "enclosed", "rectangular", "wild",
"feathered", "leafy", "crunchy", "dangerous", "fuzzy", "slimy",
"natural", "waterproof", "electronic",
}
# Words that are clearly verbs or gerunds
EXCLUDE_VERBS = {
"eating", "cooking", "growing", "fishing", "hunting", "flying",
"mining", "flavoring", "singing", "blooming", "holding", "baking",
"ripening", "opening", "cutting", "protecting", "seasoning",
"storing", "building", "swimming", "brewing", "weaving", "carving",
"climbing", "digging", "plowing", "sewing", "spinning", "tanning",
"swim", "run", "grow", "eat", "hunt", "peck", "bite", "dive",
"crawl", "cut", "shine", "sparkle",
}
def singularize(word):
    """Best-effort singularization. Returns (singular, was_plural)."""
    # Irregular plurals, including the -ves nouns that really end in -f/-fe
    # and a few -s stems that the suffix rules below would mangle.
    irregulars = {
        "teeth": "tooth", "feet": "foot", "geese": "goose", "mice": "mouse",
        "lice": "louse", "dice": "die", "oxen": "ox", "children": "child",
        "leaves": "leaf", "loaves": "loaf", "halves": "half", "knives": "knife",
        "lives": "life", "wives": "wife", "wolves": "wolf", "shelves": "shelf",
        "calves": "calf", "hooves": "hoof", "scarves": "scarf",
        "sheaves": "sheaf", "thieves": "thief",
        "buses": "bus", "gases": "gas", "lenses": "lens",
    }
    if word in irregulars:
        return irregulars[word], True
    # Other -ves words ("olives", "gloves", "caves") keep their "e" and fall
    # through to the plain -s rule; a blanket -ves -> -f rewrite would yield
    # "olif", "glof", "caf".
    # -ies -> -y
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y", True
    # -sses, -xes, -ches, -shes -> drop -es. Plain -ses/-zes words such as
    # "horses" or "mazes" only lose the final -s below.
    if word.endswith(("sses", "xes", "ches", "shes")):
        return word[:-2], True
    # -s (but not -ss, -us, -is)
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1], True
    return word, False
def is_plural_of_existing(word, existing_vocab):
"""Check if word is likely a plural form of an existing vocab word."""
# word + s
if word.endswith("s") and word[:-1] in existing_vocab:
return True
# word + es
if word.endswith("es") and word[:-2] in existing_vocab:
return True
# word ending ies -> y
if word.endswith("ies") and word[:-3] + "y" in existing_vocab:
return True
# word ending ves -> f/fe
if word.endswith("ves"):
if word[:-3] + "f" in existing_vocab:
return True
if word[:-3] + "fe" in existing_vocab:
return True
return False
def is_plural_of_candidate(word, accepted_words):
"""Check if word is a plural of another candidate, or vice versa."""
# Is this word a plural of something accepted?
if word.endswith("s") and word[:-1] in accepted_words:
return True
if word.endswith("es") and word[:-2] in accepted_words:
return True
if word.endswith("ies") and word[:-3] + "y" in accepted_words:
return True
# Is something accepted a plural of this word?
if word + "s" in accepted_words:
return True
if word + "es" in accepted_words:
return True
if word.endswith("f") and word[:-1] + "ves" in accepted_words:
return True
if word.endswith("fe") and word[:-2] + "ves" in accepted_words:
return True
return False
# ---------------------------------------------------------------------------
# LLM categorization
# ---------------------------------------------------------------------------
CATEGORIZE_SYSTEM = """You are a vocabulary categorizer. Given a list of concrete nouns, assign each one to one or more categories from this fixed list:
animal, beverage, bird, building, clothing, container, crop, fabric, fish, flower, food, fruit, furniture, grain, herb, insect, instrument, landscape, material, metal, mineral, organism, plant, rock, seed, shelter, spice, stone, structure, tool, tree, vegetable, vehicle, water, weapon, wood
Rules:
- Use ONLY categories from the list above
- A word can have multiple categories (e.g., "brick" -> material, stone)
- If a word fits none of the categories well, output SKIP
- Output format: word: category1, category2
- One word per line"""
CATEGORIZE_USER = """Categorize these words:
{word_list}"""
def llm_chat_completion(messages, max_retries=3):
"""Chat completion with retry logic."""
import requests
for attempt in range(max_retries):
try:
resp = requests.post(LLM_ENDPOINT, json={
"model": LLM_MODEL,
"messages": messages,
}, timeout=120)
resp.raise_for_status()
data = resp.json()
return data["choices"][0]["message"]["content"]
except Exception as e:
wait = (2 ** attempt)
print(f" LLM call failed (attempt {attempt+1}/{max_retries}): {e}",
file=sys.stderr)
if attempt < max_retries - 1:
print(f" Retrying in {wait}s...", file=sys.stderr)
time.sleep(wait)
else:
print(f" Giving up on this batch.", file=sys.stderr)
return None
def parse_categories(response_text, valid_words):
"""Parse LLM categorization response."""
result = {}
if not response_text:
return result
for line in response_text.strip().split("\n"):
line = line.strip()
if not line:
continue
# Strip bullets/numbers
line = re.sub(r"^[\d]+[.)]\s*", "", line)
line = re.sub(r"^[-*•]\s*", "", line)
line = line.strip()
# Match: word: cat1, cat2
match = re.match(r"^(\w+)\s*:\s*(.+)$", line)
if not match:
continue
word = match.group(1).strip().lower()
cats_raw = match.group(2).strip()
if "SKIP" in cats_raw.upper():
continue
cats = []
for c in cats_raw.split(","):
c = c.strip().lower()
if c in VALID_CATEGORIES:
cats.append(c)
if word in valid_words and cats:
result[word] = cats
return result
def categorize_words(words, batch_size=25):
"""Categorize words using the LLM in batches."""
all_categories = {}
word_set = set(words)
for i in range(0, len(words), batch_size):
batch = words[i:i + batch_size]
word_list = "\n".join(f"- {w}" for w in batch)
messages = [
{"role": "system", "content": CATEGORIZE_SYSTEM},
{"role": "user", "content": CATEGORIZE_USER.format(word_list=word_list)},
]
response = llm_chat_completion(messages)
parsed = parse_categories(response, word_set)
all_categories.update(parsed)
categorized = len(parsed)
print(f" Batch {i // batch_size + 1}: {categorized}/{len(batch)} categorized")
time.sleep(0.1)
return all_categories
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Expand folksy vocabulary with LLM-suggested candidates."
)
parser.add_argument("--min-citations", type=int, default=5,
help="Minimum number of vocab words that suggested this candidate (default: 5)")
parser.add_argument("--dry-run", action="store_true",
help="Show what would be added without modifying files")
parser.add_argument("--no-llm", action="store_true",
help="Skip LLM categorization (use placeholder categories)")
args = parser.parse_args()
# Load existing vocab
existing_vocab = {}
with open(VOCAB_CSV, newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
existing_vocab[row["word"]] = row
existing_words = set(existing_vocab.keys())
print(f"Existing vocabulary: {len(existing_words)} words")
# Load candidates
candidates = []
with open(CANDIDATE_CSV, newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
candidates.append(row)
# Aggregate: count unique sources per candidate word
word_sources = defaultdict(set)
for c in candidates:
word_sources[c["word"]].add(c["suggested_by"])
print(f"Total candidate rows: {len(candidates)}")
print(f"Unique candidate words: {len(word_sources)}")
# Normalize plurals: merge citation counts into singular forms
normalized_sources = defaultdict(set)
for word, sources in word_sources.items():
singular, was_plural = singularize(word)
# Merge into the singular form
normalized_sources[singular].update(sources)
# Replace word_sources with normalized version
word_sources = {w: srcs for w, srcs in normalized_sources.items()}
print(f"After singularization: {len(word_sources)} unique candidates")
# Filter
accepted = []
reject_reasons = Counter()
# Sort by citation count descending for consistent ordering
sorted_candidates = sorted(word_sources.items(), key=lambda x: len(x[1]), reverse=True)
accepted_set = set()
for word, sources in sorted_candidates:
citation_count = len(sources)
# Minimum citation threshold
if citation_count < args.min_citations:
reject_reasons["below_threshold"] += 1
continue
# No multi-word (underscore) candidates
if "_" in word:
reject_reasons["multi_word"] += 1
continue
# Already in vocab
if word in existing_words:
reject_reasons["already_in_vocab"] += 1
continue
# Exclude abstracts
if word in EXCLUDE_ABSTRACT:
reject_reasons["abstract"] += 1
continue
# Exclude adjectives
if word in EXCLUDE_ADJECTIVES:
reject_reasons["adjective"] += 1
continue
# Exclude verbs/gerunds
if word in EXCLUDE_VERBS:
reject_reasons["verb_gerund"] += 1
continue
# Exclude technical/scientific
if word in EXCLUDE_TECHNICAL:
reject_reasons["technical"] += 1
continue
# Exclude institutional/collective
if word in EXCLUDE_INSTITUTIONAL:
reject_reasons["institutional"] += 1
continue
# Gerund pattern catch-all (but allow exceptions)
if word.endswith("ing") and word not in {"ring", "spring", "string", "wing", "ceiling"}:
reject_reasons["gerund_pattern"] += 1
continue
# Exclude plurals of existing vocab
if is_plural_of_existing(word, existing_words):
reject_reasons["plural_of_existing"] += 1
continue
# Exclude plurals of already-accepted candidates
if is_plural_of_candidate(word, accepted_set):
reject_reasons["plural_of_candidate"] += 1
continue
# Single character
if len(word) < 2:
reject_reasons["too_short"] += 1
continue
accepted.append((word, citation_count))
accepted_set.add(word)
print(f"\nFiltering results:")
print(f" Accepted: {len(accepted)}")
for reason, count in reject_reasons.most_common():
print(f" Rejected ({reason}): {count}")
if not accepted:
print("\nNo candidates passed filtering.")
return
# Show accepted words
print(f"\nAccepted candidates ({len(accepted)}):")
for word, count in accepted:
print(f" {word:25s} cited by {count:3d} vocab words")
if args.dry_run:
print(f"\nDry run complete. Would add {len(accepted)} words to vocabulary.")
return
# Categorize with LLM
words_to_categorize = [w for w, _ in accepted]
if args.no_llm:
print("\nSkipping LLM categorization (--no-llm). Using 'material' as placeholder.")
categories = {w: ["material"] for w in words_to_categorize}
else:
print(f"\nCategorizing {len(words_to_categorize)} words with LLM...")
categories = categorize_words(words_to_categorize)
# Words the LLM couldn't categorize get skipped
uncategorized = [w for w in words_to_categorize if w not in categories]
if uncategorized:
print(f"\n {len(uncategorized)} words could not be categorized (skipped):")
for w in uncategorized:
print(f" {w}")
# Build new vocab entries
new_entries = []
for word, citation_count in accepted:
if word not in categories:
continue
cats = categories[word]
new_entries.append({
"word": word,
"categories": ",".join(cats),
"tangibility_score": "0.80",
"conceptnet_edge_count": "0",
"frequency_rank": "0",
})
if not new_entries:
print("\nNo entries to add after categorization.")
return
# Backup existing vocab
backup_path = VOCAB_CSV.with_suffix(f".csv.bak.{datetime.now().strftime('%Y%m%d_%H%M%S')}")
shutil.copy2(VOCAB_CSV, backup_path)
print(f"\nBacked up vocabulary to {backup_path.name}")
# Append to vocab CSV
with open(VOCAB_CSV, "a", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["word", "categories", "tangibility_score",
"conceptnet_edge_count", "frequency_rank"])
for entry in new_entries:
writer.writerow(entry)
print(f"\nAdded {len(new_entries)} words to {VOCAB_CSV.name}")
print(f"New vocabulary size: {len(existing_words) + len(new_entries)}")
# Summary by category
cat_counts = Counter()
for entry in new_entries:
for c in entry["categories"].split(","):
cat_counts[c.strip()] += 1
print(f"\nNew words by category:")
for cat, count in cat_counts.most_common():
print(f" {cat:20s} {count:3d}")
print(f"\nNext step: run 'python scripts/enhance_graph.py --phase 1' to generate edges for new words.")
if __name__ == "__main__":
main()
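
The plural-normalization and citation-threshold steps in `main()` above can be tried in isolation. The following is a minimal sketch, not the script's own code: `naive_singular` is a hypothetical stand-in for `singularize`, `merge_citations` is an illustrative helper name, and the candidate rows are invented for the example.

```python
from collections import defaultdict


def naive_singular(word):
    """Hypothetical stand-in for singularize(): strips one trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def merge_citations(rows, min_citations):
    """Merge 'suggested_by' sources of plural and singular forms, then keep
    only candidates cited by at least min_citations distinct vocab words."""
    sources = defaultdict(set)
    for row in rows:
        sources[naive_singular(row["word"])].add(row["suggested_by"])
    return {w: len(s) for w, s in sources.items() if len(s) >= min_citations}


# Invented candidate rows in the shape of candidate_additions.csv
rows = [
    {"word": "barrel", "suggested_by": "wine"},
    {"word": "barrels", "suggested_by": "cider"},
    {"word": "barrels", "suggested_by": "wine"},  # same source, counted once
    {"word": "moss", "suggested_by": "stone"},
]
result = merge_citations(rows, min_citations=2)
print(result)
```

Because sources are stored as sets, a vocab word that suggests both "barrel" and "barrels" counts as a single citation after merging, so the threshold measures distinct supporting words rather than raw row counts.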

177
scripts/filter_corpus.py Normal file

@ -0,0 +1,177 @@
#!/usr/bin/env python3
"""Quality filtering for polished folksy sayings.
Reads corpus_polished.jsonl, applies quality filters, outputs filtered corpus
and discard analysis.
Usage:
python scripts/filter_corpus.py
python scripts/filter_corpus.py --input corpus/corpus_polished.jsonl --output corpus/corpus_filtered.jsonl
"""
import argparse
import csv
import json
import sys
from difflib import SequenceMatcher
from pathlib import Path
SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
CORPUS_DIR = PROJECT_DIR / "corpus"
def quality_filter(entry):
    """Apply quality filters to a polished entry.
    Returns (passed, reason) tuple.
    """
    text = entry.get("polished_text", "")
    if not text:
        return False, "no_polished_text"
    words = text.split()
    # Length check
    if len(words) > 25:
        return False, "too_long"
    if len(words) < 5:
        return False, "too_short"
    # Must contain at least 2 of the original slot-fill nouns.
    # Empty or missing slot values are skipped instead of being lower-cased.
    text_lower = text.lower()
    slot_words = {str(w).lower() for w in entry.get("slots", {}).values() if w}
    words_present = sum(1 for w in slot_words if w in text_lower)
    if words_present < 2:
        return False, "lost_key_nouns"
    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if "_" in text:
        return False, "conceptnet_artifact"
    # No broken templates (unfilled slots)
    if "{" in text or "}" in text:
        return False, "unfilled_slot"
    return True, "pass"
def is_near_duplicate(text_a, text_b, threshold=0.75):
"""Check if two texts are near-duplicates."""
return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
def deduplicate_within_family(entries):
"""Remove near-duplicates within each meta-template family.
Returns (kept, removed) lists.
"""
by_family = {}
for entry in entries:
family = entry.get("meta_template", "unknown")
by_family.setdefault(family, []).append(entry)
kept = []
removed = []
for family, family_entries in by_family.items():
family_kept = []
for entry in family_entries:
text = entry.get("polished_text", "")
is_dup = False
for existing in family_kept:
if is_near_duplicate(text, existing.get("polished_text", "")):
is_dup = True
break
if is_dup:
removed.append((entry, "near_duplicate"))
else:
family_kept.append(entry)
kept.extend(family_kept)
return kept, removed
def main():
parser = argparse.ArgumentParser(description="Quality filtering for polished folksy sayings.")
parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_polished.jsonl"),
help="Input polished JSONL file")
parser.add_argument("--output", default=str(CORPUS_DIR / "corpus_filtered.jsonl"),
help="Output filtered JSONL file")
parser.add_argument("--discard-analysis", default=str(CORPUS_DIR / "discard_analysis.csv"),
help="Discard analysis CSV file")
args = parser.parse_args()
input_path = Path(args.input)
output_path = Path(args.output)
discard_path = Path(args.discard_analysis)
if not input_path.exists():
print(f"Error: {input_path} not found.", file=sys.stderr)
sys.exit(1)
# Load polished entries (only those with status=polished)
all_entries = []
already_discarded = 0
with open(input_path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if not line:
continue
entry = json.loads(line)
if entry.get("status") == "polished":
all_entries.append(entry)
elif entry.get("status") == "discarded":
already_discarded += 1
print(f"Loaded {len(all_entries)} polished entries ({already_discarded} already discarded by LLM)")
# Apply quality filters
passed = []
discards = [] # (entry, reason)
for entry in all_entries:
ok, reason = quality_filter(entry)
if ok:
passed.append(entry)
else:
discards.append((entry, reason))
print(f"Quality filter: {len(passed)} passed, {len(discards)} discarded")
# Show discard breakdown
from collections import Counter
reason_counts = Counter(r for _, r in discards)
for reason, count in reason_counts.most_common():
print(f" {reason}: {count}")
# Near-duplicate detection within template families
kept, dup_removed = deduplicate_within_family(passed)
discards.extend(dup_removed)
print(f"Near-duplicate removal: {len(dup_removed)} removed, {len(kept)} remaining")
# Write filtered output
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
for entry in kept:
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
print(f"\nFiltered corpus: {len(kept)} entries -> {output_path}")
# Write discard analysis
with open(discard_path, "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(["raw_text", "meta_template", "discard_stage", "discard_reason"])
for entry, reason in discards:
writer.writerow([
entry.get("raw_text", ""),
entry.get("meta_template", ""),
"llm_polish" if reason == "no_polished_text" else "quality_filter",
reason,
])
print(f"Discard analysis: {len(discards)} entries -> {discard_path}")
if __name__ == "__main__":
main()
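
The near-duplicate pass in `deduplicate_within_family` is a greedy, first-seen-wins scan built on `difflib.SequenceMatcher`. The sketch below shows the same idea on a flat list; `similarity` and `greedy_dedup` are illustrative names rather than functions from the script, and the sample sayings are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_dedup(texts, threshold=0.75):
    """Keep a text only if it is not too similar to any text kept so far."""
    kept = []
    for text in texts:
        if all(similarity(text, k) <= threshold for k in kept):
            kept.append(text)
    return kept


sayings = [
    "A pig in mud is still a pig.",
    "a pig in mud is still a pig.",  # differs only by case
    "Never trust a rooster before dawn.",
]
kept = greedy_dedup(sayings)
print(kept)
```

Two properties of this approach are worth keeping in mind: the result depends on input order, since the earliest member of a cluster is the one that survives, and the cost grows quadratically with the number of kept entries in a family.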


@ -0,0 +1,385 @@
#!/usr/bin/env python3
"""Format filtered sayings into training pairs for fine-tuning.
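Each polished saying generates up to 5 training pairs (typically 3-5) with different input framings.
Also builds fictional-entity prompts; these are placeholders without outputs and are skipped until
they have been generated through the template engine.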
Usage:
python scripts/format_training_pairs.py
python scripts/format_training_pairs.py --input corpus/corpus_filtered.jsonl --output corpus/training_pairs.jsonl
"""
import argparse
import csv
import json
import random
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
CORPUS_DIR = PROJECT_DIR / "corpus"
DATA_DIR = PROJECT_DIR / "data"
EXAMPLES_DIR = PROJECT_DIR / "examples"
# Template name mappings for human-readable prompts
TEMPLATE_NAMES = {
"deconstruction": "deconstruction",
"denial_of_consequences": "denial of consequences",
"ironic_deficiency": "ironic deficiency",
"futile_preparation": "futile preparation",
"hypocritical_complaint": "hypocritical complaint",
"tautological_wisdom": "tautological wisdom",
"false_equivalence": "false equivalence",
}
PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]
OPEN_ENDED_PROMPTS = [
"Tell me some folk wisdom.",
"What do they say?",
"Give me a proverb.",
"Share some old-time wisdom.",
"What's a good saying?",
]
# Auto-generated fictional entities for additional training pairs
AUTO_ENTITIES = [
{
"name": "Stoneclaw",
"categories": ["animal", "predator"],
"properties": ["fierce", "rocky", "nocturnal"],
"relations": {"AtLocation": ["cave", "mountain"], "HasA": ["claws", "scales"], "CapableOf": ["hunting", "climbing"]},
},
{
"name": "Duskmelon",
"categories": ["fruit", "food"],
"properties": ["purple", "sweet", "fragrant"],
"relations": {"AtLocation": ["garden", "market"], "UsedFor": ["eating", "jam"], "MadeOf": ["seed", "juice"]},
},
{
"name": "Windloom",
"categories": ["tool", "craft"],
"properties": ["wooden", "portable", "intricate"],
"relations": {"UsedFor": ["weaving", "thread"], "MadeOf": ["wood", "string"], "AtLocation": ["workshop", "cottage"]},
},
{
"name": "Briarvine",
"categories": ["plant", "herb"],
"properties": ["thorny", "green", "medicinal"],
"relations": {"AtLocation": ["forest", "hedge"], "UsedFor": ["healing", "tea"], "HasA": ["thorn", "leaf"]},
},
{
"name": "Mudhog",
"categories": ["animal", "livestock"],
"properties": ["muddy", "stubborn", "heavy"],
"relations": {"AtLocation": ["farm", "swamp"], "Desires": ["food", "mud"], "CapableOf": ["digging", "rooting"]},
},
{
"name": "Frostberry",
"categories": ["fruit", "food"],
"properties": ["cold", "blue", "tiny"],
"relations": {"AtLocation": ["mountain", "tundra"], "UsedFor": ["eating", "preserves"], "HasProperty": ["cold", "tart"]},
},
{
"name": "Lanternmoss",
"categories": ["plant", "fungus"],
"properties": ["glowing", "damp", "soft"],
"relations": {"AtLocation": ["cave", "swamp"], "UsedFor": ["light", "decoration"], "HasProperty": ["luminous", "fragile"]},
},
{
"name": "Cinderhawk",
"categories": ["bird", "animal"],
"properties": ["fiery", "fast", "red"],
"relations": {"AtLocation": ["mountain", "volcano"], "CapableOf": ["flying", "hunting"], "HasA": ["talons", "feathers"]},
},
{
"name": "Rootstone",
"categories": ["stone", "material"],
"properties": ["veined", "hard", "ancient"],
"relations": {"AtLocation": ["quarry", "riverbed"], "UsedFor": ["building", "carving"], "MadeOf": ["mineral", "root"]},
},
{
"name": "Silkwort",
"categories": ["plant", "fiber"],
"properties": ["silky", "white", "tall"],
"relations": {"AtLocation": ["field", "meadow"], "UsedFor": ["weaving", "cloth"], "HasA": ["stem", "fiber"]},
},
{
"name": "Kettlefrog",
"categories": ["animal", "amphibian"],
"properties": ["loud", "round", "green"],
"relations": {"AtLocation": ["pond", "marsh"], "CapableOf": ["jumping", "croaking"], "Desires": ["flies", "water"]},
},
{
"name": "Dustwheat",
"categories": ["crop", "grain"],
"properties": ["dry", "golden", "hardy"],
"relations": {"AtLocation": ["field", "barn"], "UsedFor": ["bread", "flour"], "HasPrerequisite": ["rain", "soil"]},
},
]
def format_entity_description(entity):
"""Format entity into a natural description string."""
name = entity["name"]
cats = entity.get("categories", [])
props = entity.get("properties", [])
rels = entity.get("relations", {})
parts = []
# Category description
if props and cats:
prop_str = ", ".join(props[:3])
cat_str = " and ".join(cats[:2])
parts.append(f"A {name} is a {prop_str} {cat_str}.")
elif cats:
parts.append(f"A {name} is a {' and '.join(cats[:2])}.")
# Location
if "AtLocation" in rels:
locs = rels["AtLocation"]
parts.append(f"It is found near {' and '.join(locs[:2])}.")
# Parts/properties
if "HasA" in rels:
has = rels["HasA"]
parts.append(f"It has {', '.join(has[:3])}.")
# Capabilities
if "CapableOf" in rels:
caps = rels["CapableOf"]
parts.append(f"It is capable of {' and '.join(caps[:2])}.")
# Uses
if "UsedFor" in rels:
uses = rels["UsedFor"]
parts.append(f"It is used for {' and '.join(uses[:2])}.")
return " ".join(parts)
def load_vocab_categories():
"""Load vocab to get word -> categories mapping."""
word_cats = {}
vocab_path = DATA_DIR / "folksy_vocab.csv"
if vocab_path.exists():
with open(vocab_path, newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
word = row["word"]
cats = [c.strip() for c in row["categories"].split(",") if c.strip()]
word_cats[word] = cats
return word_cats
def generate_training_pairs(entry, word_cats):
"""Generate up to 5 training pairs (typically 3-5) for a single polished saying."""
polished = entry.get("polished_text", "")
slots = entry.get("slots", {})
meta_template = entry.get("meta_template", "")
# Collect source words (concrete nouns from slots)
source_words = [v for v in slots.values()
if v and not v.startswith("a ") and not v.startswith("an ") and len(v) > 1]
# Determine categories of slot words
slot_categories = set()
for word in source_words:
word_lower = word.lower().replace(" ", "_")
if word_lower in word_cats:
slot_categories.update(word_cats[word_lower])
pairs = []
base = {
"output": polished,
"meta_template": meta_template,
"source_words": source_words,
}
# 1. Word-seeded (included whenever slot words are available)
if source_words:
word = random.choice(source_words)
pairs.append({**base, "input": f"Tell me something about {word}."})
# 2. Category-seeded (always include if we have categories)
if slot_categories:
cat = random.choice(list(slot_categories))
pairs.append({**base, "input": f"Tell me a saying about {cat}."})
# 3. Persona-seeded (included whenever slot words are available)
persona = random.choice(PERSONAS)
if source_words:
word = random.choice(source_words)
pairs.append({**base, "input": f"What would a {persona} say about {word}?"})
# 4. Template-seeded (include ~70% of the time)
if random.random() < 0.7:
template_name = TEMPLATE_NAMES.get(meta_template, meta_template)
pairs.append({**base, "input": f"Give me a {template_name} proverb."})
# 5. Open-ended (include ~30% of the time)
if random.random() < 0.3:
prompt = random.choice(OPEN_ENDED_PROMPTS)
pairs.append({**base, "input": prompt})
return pairs
def generate_fictional_pairs(entities):
"""Generate training pairs for fictional entities.
These pairs include the entity description in the input.
"""
pairs = []
# Generate 15-25 pairs per entity
for entity in entities:
name = entity["name"]
desc = format_entity_description(entity)
props = entity.get("properties", [])
rels = entity.get("relations", {})
# Collect words related to this entity
related_words = []
for targets in rels.values():
related_words.extend(targets)
n_pairs = random.randint(15, 25)
for _ in range(n_pairs):
framing = random.choice(["persona", "word", "category", "open"])
if framing == "persona":
persona = random.choice(PERSONAS)
input_text = f"{desc} What would a {persona} say about a {name}?"
elif framing == "word" and related_words:
word = random.choice(related_words)
input_text = f"{desc} Tell me a saying about {name} and {word}."
elif framing == "category":
cats = entity.get("categories", ["thing"])
cat = random.choice(cats)
input_text = f"{desc} Give me folk wisdom about this {cat}."
else:
input_text = f"{desc} Tell me some folk wisdom about {name}."
# Placeholder output — these would ideally be generated through the
# template engine with fictional entities loaded, then polished.
# For now, generate a structural placeholder that indicates the
# entity relationships.
pairs.append({
"input": input_text,
"output": "", # Will be filled by actual generation
"meta_template": "fictional",
"source_words": [name] + related_words[:3],
"_needs_generation": True,
"_entity": entity,
})
return pairs
def main():
parser = argparse.ArgumentParser(description="Format training pairs for fine-tuning.")
parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_filtered.jsonl"),
help="Input filtered JSONL file")
parser.add_argument("--output", default=str(CORPUS_DIR / "training_pairs.jsonl"),
help="Output training pairs JSONL file")
parser.add_argument("--entities", default=str(EXAMPLES_DIR / "my_world.json"),
help="Fictional entities JSON file")
args = parser.parse_args()
input_path = Path(args.input)
output_path = Path(args.output)
entities_path = Path(args.entities)
if not input_path.exists():
print(f"Error: {input_path} not found.", file=sys.stderr)
sys.exit(1)
# Load vocab categories
word_cats = load_vocab_categories()
# Load filtered entries
entries = []
with open(input_path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
entries.append(json.loads(line))
print(f"Loaded {len(entries)} filtered entries")
# Generate training pairs for each entry
all_pairs = []
for entry in entries:
pairs = generate_training_pairs(entry, word_cats)
all_pairs.extend(pairs)
print(f"Generated {len(all_pairs)} training pairs from polished sayings")
# Generate fictional entity pairs
fictional_entities = []
if entities_path.exists():
with open(entities_path, encoding="utf-8") as f:
data = json.load(f)
fictional_entities = data.get("entities", [])
print(f"Loaded {len(fictional_entities)} fictional entities from {entities_path}")
# Add auto-generated entities
fictional_entities.extend(AUTO_ENTITIES)
print(f"Total fictional entities (file + auto-generated): {len(fictional_entities)}")
fictional_pairs = generate_fictional_pairs(fictional_entities)
# Filter out placeholder pairs (those that still need generation)
# In a full pipeline, these would be generated through the template engine.
# For now, skip any with empty output.
real_fictional = [p for p in fictional_pairs if p.get("output")]
placeholder_fictional = [p for p in fictional_pairs if not p.get("output")]
if placeholder_fictional:
print(f" {len(placeholder_fictional)} fictional pairs need generation via template engine")
print(f" (Run folksy_generator.py with --entities to generate these, then re-run this script)")
all_pairs.extend(real_fictional)
# Clean up internal fields before writing
for pair in all_pairs:
pair.pop("_needs_generation", None)
pair.pop("_entity", None)
# Write output
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
for pair in all_pairs:
f.write(json.dumps(pair, ensure_ascii=False) + "\n")
# Stats
from collections import Counter
input_types = Counter()
for pair in all_pairs:
inp = pair["input"]
# Check open-ended prompts first: "Give me a proverb." would otherwise
# match the template-seeded pattern below.
if inp in OPEN_ENDED_PROMPTS:
input_types["open_ended"] += 1
elif inp.startswith("Tell me something about"):
input_types["word_seeded"] += 1
elif inp.startswith("Tell me a saying about"):
input_types["category_seeded"] += 1
elif inp.startswith("What would a"):
input_types["persona_seeded"] += 1
elif inp.startswith("Give me a") and inp.endswith("proverb."):
input_types["template_seeded"] += 1
else:
input_types["fictional"] += 1
print(f"\nTotal training pairs: {len(all_pairs)}")
print("Distribution by input type:")
for itype, count in sorted(input_types.items()):
print(f" {itype:20s} {count:5d}")
print(f"\nOutput: {output_path}")
if __name__ == "__main__":
main()
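
The five input framings used by `generate_training_pairs` can be seen side by side in a deterministic form. This is a hedged sketch: `build_framings` is a hypothetical helper that takes every choice as an explicit argument instead of sampling it, and the word, category, persona, and template name passed in are illustrative picks, not values read from the corpus.

```python
def build_framings(saying, word, category, persona, template_name):
    """Return one training pair per input framing for a single saying."""
    prompts = [
        f"Tell me something about {word}.",           # word-seeded
        f"Tell me a saying about {category}.",        # category-seeded
        f"What would a {persona} say about {word}?",  # persona-seeded
        f"Give me a {template_name} proverb.",        # template-seeded
        "Tell me some folk wisdom.",                  # open-ended
    ]
    return [{"input": p, "output": saying} for p in prompts]


pairs = build_framings(
    saying="Don't brew the coffee and act surprised when the water's all gone.",
    word="coffee",
    category="beverage",
    persona="farmer",
    template_name="denial of consequences",
)
for pair in pairs:
    print(pair["input"])
```

Every pair shares the same output, so the fine-tuned model sees one saying reachable from several prompt styles. The real script samples the template-seeded and open-ended framings only part of the time, which keeps those two from dominating the corpus.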

61
scripts/generate_raw_batch.sh Executable file

@ -0,0 +1,61 @@
#!/usr/bin/env bash
# Generate raw folksy sayings across all 7 templates.
# Output: corpus/corpus_raw.jsonl (~10,500 entries)
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
CORPUS_DIR="$PROJECT_DIR/corpus"
GENERATOR="$PROJECT_DIR/folksy_generator.py"
COUNT_PER_TEMPLATE=${1:-1500}
mkdir -p "$CORPUS_DIR"
OUTPUT="$CORPUS_DIR/corpus_raw.jsonl"
# Clear existing file
> "$OUTPUT"
TEMPLATES=(
deconstruction
denial_of_consequences
ironic_deficiency
futile_preparation
hypocritical_complaint
tautological_wisdom
false_equivalence
)
echo "Generating $COUNT_PER_TEMPLATE sayings per template (${#TEMPLATES[@]} templates)..."
echo "Output: $OUTPUT"
total=0
for template in "${TEMPLATES[@]}"; do
echo -n " $template ($COUNT_PER_TEMPLATE)... "
before=$(wc -l < "$OUTPUT")
python "$GENERATOR" --template "$template" --count "$COUNT_PER_TEMPLATE" --json >> "$OUTPUT" 2>/dev/null
after=$(wc -l < "$OUTPUT")
generated=$((after - before))
total=$((total + generated))
echo "$generated generated"
done
echo ""
echo "Total: $total raw sayings in $OUTPUT"
echo ""
# Check template distribution
echo "Template distribution:"
python -c "
import json, sys
from collections import Counter

# The output path is passed as an argument rather than interpolated into the code.
counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        if not line.strip():
            continue
        entry = json.loads(line)
        counts[entry['meta_template']] += 1
for template, count in sorted(counts.items()):
    print(f' {template:30s} {count:5d}')
print(f\" {'TOTAL':30s} {sum(counts.values()):5d}\")
" "$OUTPUT"

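The distribution check at the end of the script can also be sketched as a standalone Python helper, which makes it easier to flag templates that came up short. This is a minimal sketch, not part of the commit: the function names `template_distribution` and `underfilled` and the `"(missing)"` label are hypothetical, and the only assumption about the data is that each JSONL line carries a `meta_template` key, as the inline check above expects.

```python
import json
from collections import Counter


def template_distribution(jsonl_path):
    """Count entries per meta_template in a JSONL file, skipping blank lines."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            # "meta_template" is the key the shell script's check reads;
            # "(missing)" is a hypothetical label for entries without it.
            counts[entry.get("meta_template", "(missing)")] += 1
    return counts


def underfilled(counts, expected_per_template):
    """Return the templates that produced fewer entries than requested."""
    return {t: n for t, n in counts.items() if n < expected_per_template}
```

Calling `underfilled(template_distribution("corpus/corpus_raw.jsonl"), 1500)` would list any template that yielded fewer than the requested 1500 sayings. Because the generator's stderr is discarded in the loop, a shortfall like this may be the only visible sign that a template failed part-way.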
215
scripts/polish_corpus.py Normal file

@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""LLM polish pipeline for raw folksy sayings.
Reads corpus_raw.jsonl, sends each to GLM4-32B for polish.
Output file is the checkpoint: append mode with resume detection.

Usage:
    python scripts/polish_corpus.py
    python scripts/polish_corpus.py --input corpus/corpus_raw.jsonl --output corpus/corpus_polished.jsonl
"""
import argparse
import json
import sys
import time
from pathlib import Path
SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
CORPUS_DIR = PROJECT_DIR / "corpus"
LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
LLM_MODEL = "THUDM-GLM4-32B"
SYSTEM_PROMPT = """You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.
Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural, like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them; do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise; real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD
Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.
Examples of good polish:
Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.
Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The bookshelf-maker's kids always end up reading off the floor.
Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a peach that grew itself some armor.
Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say: steel without the iron is just a dream of being hard.
Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD
Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD"""
def llm_chat_completion(messages, max_retries=3):
    """Chat completion with retry logic.

    Returns the stripped response text, or None once all retries fail.
    """
    import requests

    for attempt in range(max_retries):
        try:
            resp = requests.post(LLM_ENDPOINT, json={
                "model": LLM_MODEL,
                "messages": messages,
            }, timeout=120)
            resp.raise_for_status()
            data = resp.json()
            return data["choices"][0]["message"]["content"].strip()
        except Exception as e:
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, ...
            print(f" LLM error (attempt {attempt+1}/{max_retries}): {e}", file=sys.stderr)
            if attempt < max_retries - 1:
                time.sleep(wait)
            else:
                return None
def format_chain(chain_edges):
    """Format chain_edges list into readable string for LLM context."""
    if not chain_edges:
        return "(no chain data)"
    parts = []
    for edge in chain_edges:
        start = edge.get("start", "?")
        rel = edge.get("relation", "?")
        end = edge.get("end", "?")
        weight = edge.get("weight", 0)
        parts.append(f"{start} --{rel}--> {end} (w:{weight:.1f})")
    return ", ".join(parts)


def format_slots(slots):
    """Format slots dict for LLM context."""
    return ", ".join(f"{k}={v}" for k, v in slots.items())
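def load_already_processed(output_path):
    """Load set of raw_text strings already processed (for resume).

    Every parseable line counts as processed, including entries whose
    status is "error", so failed entries are not retried on resume.
    """
    processed = set()
    if output_path.exists():
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    entry = json.loads(line)
                    processed.add(entry.get("raw_text", ""))
                except json.JSONDecodeError:
                    # Tolerate a partially written line from an interrupted run
                    continue
    return processed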
def main():
    parser = argparse.ArgumentParser(description="LLM polish pipeline for folksy sayings.")
    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_raw.jsonl"),
                        help="Input JSONL file")
    parser.add_argument("--output", default=str(CORPUS_DIR / "corpus_polished.jsonl"),
                        help="Output JSONL file (also serves as checkpoint)")
    args = parser.parse_args()

    input_path = Path(args.input)
    output_path = Path(args.output)

    if not input_path.exists():
        print(f"Error: {input_path} not found.", file=sys.stderr)
        sys.exit(1)

    # Load raw entries
    raw_entries = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                raw_entries.append(json.loads(line))
    print(f"Loaded {len(raw_entries)} raw entries from {input_path}")

    # Check what's already been processed
    already_processed = load_already_processed(output_path)
    remaining = [e for e in raw_entries if e.get("raw_text", "") not in already_processed]
    print(f"Already processed: {len(already_processed)}")
    print(f"Remaining: {len(remaining)}")

    if not remaining:
        print("Nothing to process.")
        return

    discards = 0
    polished = 0
    errors = 0

    with open(output_path, "a", encoding="utf-8") as out:
        for i, entry in enumerate(remaining):
            raw_text = entry.get("raw_text", "")
            meta_template = entry.get("meta_template", "")
            chain = format_chain(entry.get("chain", []))
            slots = format_slots(entry.get("slots", {}))

            user_prompt = (
                f"Meta-template: {meta_template}\n"
                f"Relationship chain: {chain}\n"
                f"Slot fills: {slots}\n"
                f"Raw saying: {raw_text}"
            )
            messages = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ]

            response = llm_chat_completion(messages)
            if response is None:
                entry["status"] = "error"
                errors += 1
            elif response.strip().upper() == "DISCARD":
                entry["status"] = "discarded"
                discards += 1
            else:
                entry["polished_text"] = response.strip()
                entry["status"] = "polished"
                polished += 1

            out.write(json.dumps(entry, ensure_ascii=False) + "\n")

            if (i + 1) % 100 == 0:
                out.flush()
                total_done = len(already_processed) + i + 1
                print(f" [{total_done}/{len(raw_entries)}] "
                      f"polished={polished}, discarded={discards}, errors={errors}")

            time.sleep(0.1)

    total_done = len(already_processed) + len(remaining)
    print(f"\nDone: {total_done} total entries processed.")
    print(f" Polished: {polished}")
    print(f" Discarded: {discards}")
    print(f" Errors: {errors}")
    print(f" Discard rate: {discards/(polished+discards)*100:.1f}%" if (polished+discards) else " N/A")
    print(f"Output: {output_path}")
if __name__ == "__main__":
    main()
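
The resume logic in `polish_corpus.py` treats every line already in the output file as processed, including entries written with status `"error"`. If retrying those failures on a later run is wanted, one possible approach is to track the last recorded status per `raw_text` instead of a plain set. The sketch below is an assumption about how that could look, not the commit's behavior; `load_final_statuses`, `select_remaining`, and the `retry_errors` parameter are hypothetical names.

```python
import json
from pathlib import Path


def load_final_statuses(output_path):
    """Map raw_text -> last recorded status in an append-mode checkpoint file."""
    statuses = {}
    path = Path(output_path)
    if not path.exists():
        return statuses
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written line
            # Later lines overwrite earlier ones, so a successful retry
            # replaces the earlier "error" record for the same raw_text.
            statuses[entry.get("raw_text", "")] = entry.get("status", "")
    return statuses


def select_remaining(raw_entries, statuses, retry_errors=True):
    """Return the entries that still need an LLM call."""
    remaining = []
    for entry in raw_entries:
        status = statuses.get(entry.get("raw_text", ""))
        if status is None or (retry_errors and status == "error"):
            remaining.append(entry)
    return remaining
```

With this approach the output file can hold more than one record for the same `raw_text` (an earlier error followed by a later result), so any downstream filter would need to keep only the last record per saying.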