corpus generation (work from mid february)
parent
8c8a058301
commit
356b62c6ea
16 changed files with 25872 additions and 38 deletions
1
.gitignore
vendored
Normal file
|
|
@ -0,0 +1 @@
|
|||
*__pycache__
|
||||
431
CORPUS_GENERATION_SPEC.md
Normal file
|
|
@ -0,0 +1,431 @@
|
|||
# Corpus Generation Spec — LLM-Polished Training Data
|
||||
|
||||
## Overview
|
||||
|
||||
The folksy generator produces structurally correct but grammatically rough idioms from templates. This phase uses GLM4-32B to transform raw template output into natural-sounding folk sayings, then packages the results as a training corpus for a small (0.5B parameter) task-specific model.
|
||||
|
||||
The pipeline is: **bulk generate → LLM polish → filter → format as training pairs → fine-tune small model**.
|
||||
|
||||
## Infrastructure
|
||||
|
||||
```python
import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Chat completion endpoint of local LLM"""
    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()
```
|
||||
|
||||
Same local endpoint as the graph enhancement phase. No cloud APIs.
|
||||
|
||||
## Phase 1: Bulk Raw Generation
|
||||
|
||||
### Goal
|
||||
Generate 10,000+ raw idioms from the template engine, covering all meta-template families with diverse seed words.
|
||||
|
||||
### Generation Strategy
|
||||
|
||||
Don't just run `--count 10000`. That will skew toward templates and categories with the most edges. Instead, generate systematically:
|
||||
|
||||
```bash
# Even coverage across all 7 meta-template families
for template in deconstruction denial_of_consequences ironic_deficiency \
                futile_preparation hypocritical_complaint tautological_wisdom \
                false_equivalence; do
    python folksy_generator.py --template "$template" --count 1500 --debug \
        --output "raw_${template}.jsonl"
done
```
|
||||
|
||||
### Output Format
|
||||
|
||||
The `--debug` flag is critical. Raw output should be JSONL with the relationship chain preserved:
|
||||
|
||||
```json
{
  "raw_text": "Take the yeast out of bread and you've got yourself a wet flour.",
  "meta_template": "deconstruction",
  "surface_template": "Take the {B} out of {A} and you've got yourself a {C} {D}.",
  "slots": {"A": "bread", "B": "yeast", "C": "wet", "D": "flour"},
  "chain": [
    {"start": "bread", "relation": "MadeOf", "end": "yeast", "weight": 2.0},
    {"start": "bread", "relation": "MadeOf", "end": "flour", "weight": 1.5},
    {"start": "flour", "relation": "HasProperty", "end": "dry", "weight": 1.0}
  ]
}
```
|
||||
|
||||
This metadata travels with the saying through the entire pipeline. The LLM needs the chain to make intelligent polish decisions. The final training data needs the meta-template label.
|
||||
|
||||
### Deduplication at Generation Time
|
||||
|
||||
Before writing each generated saying, check:
|
||||
- Exact duplicate raw_text → skip
|
||||
- Same (meta_template, slots) tuple → skip (same slot fills, different surface template is fine)
|
||||
- Same seed word appeared more than 30 times across the batch → skip (prevents dog/bark saturation)
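
The three checks above can be sketched as a small stateful helper. This is a minimal illustration, not the generator's actual code: the class name `GenerationDeduper`, its `accept` method, and the `seed_word` argument are hypothetical, and the slot key is assumed to be the `(meta_template, slots)` pair exactly as the second rule states it.

```python
from collections import Counter


class GenerationDeduper:
    """Hypothetical helper that applies the three skip rules in order."""

    def __init__(self, max_seed_uses=30):
        self.max_seed_uses = max_seed_uses
        self.seen_text = set()      # exact raw_text strings already written
        self.seen_slots = set()     # (meta_template, sorted slot items) keys
        self.seed_usage = Counter() # how often each seed word has been used

    def accept(self, entry, seed_word):
        """Return True and record the entry if it passes all three rules."""
        slot_key = (entry["meta_template"], tuple(sorted(entry["slots"].items())))
        if entry["raw_text"] in self.seen_text:
            return False
        if slot_key in self.seen_slots:
            return False
        if self.seed_usage[seed_word] >= self.max_seed_uses:
            return False
        self.seen_text.add(entry["raw_text"])
        self.seen_slots.add(slot_key)
        self.seed_usage[seed_word] += 1
        return True
```

Rejected entries are not recorded, so a saying that is skipped for one reason does not count against the seed-word cap.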
|
||||
|
||||
## Phase 2: LLM Polish
|
||||
|
||||
### Goal
|
||||
Transform each raw saying into natural-sounding folk wisdom. The LLM fixes grammar, adjusts articles and pluralization, smooths phrasing, and adds the kind of colorful variation that makes each saying feel hand-crafted rather than slot-filled.
|
||||
|
||||
### System Prompt
|
||||
|
||||
```
You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.

Your job:
1. Fix grammar, articles, and pluralization
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
3. Preserve the core nouns and the relationship between them — do not swap out the key words
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD

Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.

Examples of good polish:

Raw: "Don't build the coffee and act surprised when the water show up."
Chain: coffee MadeOf water
Polished: Don't brew the coffee and act surprised when the water's all gone.

Raw: "The chest's children always goes without hold books."
Chain: chest UsedFor hold_books
Polished: The bookshelf-maker's kids always end up reading off the floor.

Raw: "A pineapple is just a nectarine that's got an attitude."
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
Polished: A pineapple is just a peach that grew itself some armor.

Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
Chain: steel MadeOf iron, steel HasProperty hard
Polished: You know what they say — steel without the iron is just a dream of being hard.

Raw: "Funny how the bamboo never has enough grow very quickly for itself."
Chain: bamboo CapableOf grow_quickly
Polished: DISCARD

Raw: "That's just funning the canoe and praying for boiling food."
Chain: canoe UsedFor transport, fire UsedFor boiling_food
Polished: DISCARD
```
|
||||
|
||||
### User Prompt Template
|
||||
|
||||
```
Meta-template: {meta_template}
Relationship chain: {chain_formatted}
Slot fills: {slots_formatted}
Raw saying: {raw_text}
```
|
||||
|
||||
### Chain Formatting
|
||||
|
||||
Format the chain as a readable string:
|
||||
|
||||
```
|
||||
bread --MadeOf--> yeast (w:2.0), bread --MadeOf--> flour (w:1.5), flour --HasProperty--> dry (w:1.0)
|
||||
```
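
A minimal sketch of that formatting step, assuming the chain is the list of edge dicts from the `--debug` output shown earlier. The function name `format_chain` is hypothetical.

```python
def format_chain(chain):
    """Render a list of edge dicts as 'start --Relation--> end (w:weight)' items."""
    return ", ".join(
        f"{edge['start']} --{edge['relation']}--> {edge['end']} (w:{edge['weight']})"
        for edge in chain
    )
```

Weights are printed as Python renders them, so a value parsed from JSON as `2.0` appears as `w:2.0`.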
|
||||
|
||||
### Batch Processing
|
||||
|
||||
```python
import json
import time

def polish_batch(input_path, output_path):
    system_prompt = load_system_prompt()  # The prompt above

    with open(input_path) as f:
        raw_entries = [json.loads(line) for line in f]

    results = []
    discards = 0

    for i, entry in enumerate(raw_entries):
        user_prompt = format_polish_prompt(entry)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        response = llm_chat_completion(messages)
        polished = response['choices'][0]['message']['content'].strip()

        if polished == "DISCARD":
            discards += 1
            entry['status'] = 'discarded'
        else:
            entry['polished_text'] = polished
            entry['status'] = 'polished'

        results.append(entry)

        if (i + 1) % 100 == 0:
            print(f"Processed {i+1}/{len(raw_entries)}, {discards} discarded so far")
            # Write checkpoint
            save_checkpoint(results, output_path)

        time.sleep(0.1)  # gentle rate limiting

    save_final(results, output_path)
    print(f"Done: {len(results) - discards} polished, {discards} discarded")
```
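
One way to make a run resumable is to skip entries that a previous run already wrote. The sketch below assumes the checkpoint at `output_path` is JSONL with one entry per line and that `raw_text` is unique per entry, which the generation-time deduplication is meant to guarantee. The names `load_completed` and `pending_entries` are hypothetical and are not part of the pipeline shown above.

```python
import json
import os


def load_completed(output_path):
    """Return the raw_text of every entry already present in the checkpoint."""
    done = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["raw_text"])
            except (json.JSONDecodeError, KeyError):
                # An interrupted write can leave a truncated last line; skip it
                # so that the entry is simply polished again.
                continue
    return done


def pending_entries(raw_entries, output_path):
    """Filter raw_entries down to those not yet recorded in the checkpoint."""
    done = load_completed(output_path)
    return [entry for entry in raw_entries if entry["raw_text"] not in done]
```

A resumed run would then loop over `pending_entries(raw_entries, output_path)` instead of `raw_entries`, and append to the existing file rather than overwrite it.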
|
||||
|
||||
### Expected Discard Rate
|
||||
|
||||
Based on the 50-sample output, roughly 20-30% of raw sayings are unsalvageable. Budget for this: generate 10,000 raw to end up with 7,000-8,000 polished. If the discard rate after graph enhancement is lower (it should be — better edges = fewer nonsense combos), that's a bonus.
|
||||
|
||||
## Phase 3: Deduplication and Quality Filtering
|
||||
|
||||
After LLM polish, run automated quality checks before including in the training corpus.
|
||||
|
||||
### Automated Filters
|
||||
|
||||
```python
def quality_filter(entry):
    text = entry['polished_text']

    # Length check: real proverbs are short
    if len(text.split()) > 25:
        return False, "too_long"
    if len(text.split()) < 5:
        return False, "too_short"

    # Must contain at least 2 of the original slot-fill nouns
    slot_words = set(entry['slots'].values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"

    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if '_' in text:
        return False, "conceptnet_artifact"

    # No broken templates (unfilled slots)
    if '{' in text or '}' in text:
        return False, "unfilled_slot"

    return True, "pass"
```
|
||||
|
||||
### Near-Duplicate Detection
|
||||
|
||||
Two sayings that use the same slot fills but different surface templates may polish into nearly identical text. Detect and keep only one:
|
||||
|
||||
```python
from difflib import SequenceMatcher

def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold
```
|
||||
|
||||
Run the comparison pairwise within each meta-template family, not across families: similar nouns in different structures are fine.
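
A minimal sketch of the per-family pass, assuming each entry carries `meta_template` and `polished_text` and that the first saying seen is the one to keep. The function name `drop_near_duplicates` is hypothetical; `is_near_duplicate` is repeated here only so the block runs on its own.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def is_near_duplicate(text_a, text_b, threshold=0.75):
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold


def drop_near_duplicates(entries, threshold=0.75):
    """Keep the first saying of each near-duplicate group, per meta-template family."""
    kept_texts = defaultdict(list)  # family -> texts already kept
    kept = []
    for entry in entries:
        family_texts = kept_texts[entry["meta_template"]]
        text = entry["polished_text"]
        if any(is_near_duplicate(text, other, threshold) for other in family_texts):
            continue
        family_texts.append(text)
        kept.append(entry)
    return kept
```

The comparison is quadratic within a family, which should be tolerable at roughly a thousand sayings per family but would need a cheaper pre-filter for much larger batches.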
|
||||
|
||||
## Phase 4: Training Corpus Formatting
|
||||
|
||||
### Goal
|
||||
Package the polished sayings as input/output training pairs for a 0.5B model fine-tune.
|
||||
|
||||
### Training Pair Schema
|
||||
|
||||
Each polished saying generates multiple training pairs with different input framings:
|
||||
|
||||
```json
[
  {
    "input": "Tell me something about bread",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Tell me a saying about baking",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "What would a farmer say about flour?",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  },
  {
    "input": "Give me a deconstruction proverb",
    "output": "Take the yeast out of bread and all you've got is wet flour with ambition.",
    "meta_template": "deconstruction",
    "source_words": ["bread", "yeast", "flour"]
  }
]
```
|
||||
|
||||
### Input Framing Types
|
||||
|
||||
For each polished saying, generate training pairs with these input patterns:
|
||||
|
||||
1. **Word-seeded:** `"Tell me something about {random_slot_word}"`
|
||||
2. **Category-seeded:** `"Tell me a saying about {category_of_slot_word}"` (e.g., "animals", "tools", "food")
|
||||
3. **Persona-seeded:** `"What would a {persona} say about {word}?"` where persona ∈ [farmer, grandmother, old sailor, blacksmith, innkeeper, shepherd]
|
||||
4. **Template-seeded:** `"Give me a {meta_template_name} proverb"`
|
||||
5. **Open-ended:** `"Tell me some folk wisdom"` / `"What do they say?"` / `"Give me a proverb"`
|
||||
|
||||
Each polished saying should appear with 3-5 different input framings. This teaches the small model to respond to varied prompts while producing the same style of output.
|
||||
|
||||
### Fictional Entity Training Pairs
|
||||
|
||||
Additionally, generate training pairs that demonstrate fictional entity handling:
|
||||
|
||||
```json
{
  "input": "A Xorhir is a large, stubborn mount found in stables and plains. It eats Grushum leaves. What would a farmer say about a Xorhir?",
  "output": "Don't plant the Grushum and act surprised when the Xorhir comes nosing at your fence."
}
```
|
||||
|
||||
For these, use the existing fictional entity examples from `my_world.json` plus 10-15 additional invented entities. Generate the sayings using the template engine with fictional entities loaded, then polish with GLM4-32B. Target: ~200-300 fictional entity training pairs to teach the pattern without overwhelming the real-word training signal.
|
||||
|
||||
### Format for Fictional Entity Input
|
||||
|
||||
Standardize how entity descriptions appear in training inputs:
|
||||
|
||||
```
|
||||
A {name} is a {categories_joined}. {property_sentences}. {relationship_sentences}.
|
||||
```
|
||||
|
||||
Example:
|
||||
```
|
||||
A turtleduck is a shy, armored bird. It is found near ponds and riverbanks. It has a shell and webbed feet. It can swim and lay eggs.
|
||||
```
|
||||
|
||||
This format matches what a game developer or worldbuilder would naturally provide at inference time.
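
A rough sketch of rendering that description from structured data. The entity fields used here (`name`, `description`, `locations`, `parts`, `abilities`) are hypothetical and are not taken from `my_world.json` or its schema; the sentence openers are copied from the turtleduck example.

```python
def join_and(items):
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def describe_entity(entity):
    """Build the standardized description, skipping sentences with no data."""
    sentences = [f"A {entity['name']} is a {entity['description']}."]
    if entity.get("locations"):
        sentences.append(f"It is found near {join_and(entity['locations'])}.")
    if entity.get("parts"):
        sentences.append(f"It has {join_and(entity['parts'])}.")
    if entity.get("abilities"):
        sentences.append(f"It can {join_and(entity['abilities'])}.")
    return " ".join(sentences)
```

The leading article is hard-coded as "A", so names that start with a vowel sound would need the same article handling as ordinary slot words.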
|
||||
|
||||
## Phase 5: Corpus Statistics and Validation
|
||||
|
||||
### Required Metrics
|
||||
|
||||
Before declaring the corpus ready for fine-tuning, compute and report:
|
||||
|
||||
```
Total polished sayings: X
Discarded during polish: X (Y%)
Discarded during quality filter: X (Y%)
Final training pairs: X

Distribution by meta-template:
  deconstruction: X (Y%)
  denial_of_consequences: X (Y%)
  ironic_deficiency: X (Y%)
  futile_preparation: X (Y%)
  hypocritical_complaint: X (Y%)
  tautological_wisdom: X (Y%)
  false_equivalence: X (Y%)

Distribution by input framing type:
  word_seeded: X
  category_seeded: X
  persona_seeded: X
  template_seeded: X
  open_ended: X
  fictional: X

Unique slot words used: X (out of 534 vocab)
Words never used in any saying: [list]
Average saying length: X words
```
|
||||
|
||||
### Balance Check
|
||||
|
||||
If any meta-template family has less than 10% of total pairs, go back and generate more raw sayings for that family specifically. The small model needs balanced exposure to all pattern types.
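
The balance check is simple enough to sketch directly. The function name `underrepresented_families` is hypothetical, and pairs without a `meta_template` field (such as the fictional entity pairs shown earlier) are assumed to be excluded from the denominator.

```python
from collections import Counter


def underrepresented_families(pairs, families, min_share=0.10):
    """Return the families whose share of labeled pairs is below min_share."""
    counts = Counter(pair.get("meta_template") for pair in pairs)
    total = sum(counts[family] for family in families)
    if total == 0:
        return list(families)
    return [family for family in families if counts[family] / total < min_share]
```

Any family returned here is a candidate for another targeted `--template` generation run.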
|
||||
|
||||
### Human Spot-Check
|
||||
|
||||
Randomly sample 50 polished sayings (spread across all families) and manually rate each as:
|
||||
- **Good:** Sounds natural, funny, could fool someone into thinking it's real
|
||||
- **Okay:** Grammatically correct but flat or too literal
|
||||
- **Bad:** Awkward, nonsensical, or lost the relationship
|
||||
|
||||
Target: >60% Good, <10% Bad. If Bad exceeds 10%, revisit the polish prompt or tighten quality filters.
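
Drawing the sample and applying the thresholds could look like the sketch below. The function names are hypothetical, ratings are assumed to be recorded as the lowercase strings `"good"`, `"okay"`, and `"bad"`, and the round-robin draw is just one way to spread the sample across families.

```python
import random
from collections import defaultdict


def stratified_sample(entries, n=50, seed=0):
    """Draw n entries round-robin across meta-template families."""
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for entry in entries:
        by_family[entry["meta_template"]].append(entry)
    families = sorted(by_family)
    for family in families:
        rng.shuffle(by_family[family])
    sample = []
    i = 0
    while len(sample) < n and any(by_family[f] for f in families):
        family = families[i % len(families)]
        if by_family[family]:
            sample.append(by_family[family].pop())
        i += 1
    return sample


def passes_spot_check(ratings, min_good=0.60, max_bad=0.10):
    """True when more than 60% are good and fewer than 10% are bad."""
    if not ratings:
        return False
    good = ratings.count("good") / len(ratings)
    bad = ratings.count("bad") / len(ratings)
    return good > min_good and bad < max_bad
```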
|
||||
|
||||
## Output Files
|
||||
|
||||
### `corpus_raw.jsonl`
|
||||
All raw generated sayings with debug metadata. One JSON object per line.
|
||||
|
||||
### `corpus_polished.jsonl`
|
||||
All sayings after LLM polish, including discards (marked with `status: discarded`). One JSON object per line.
|
||||
|
||||
### `corpus_filtered.jsonl`
|
||||
Only sayings that passed quality filtering. One JSON object per line.
|
||||
|
||||
### `training_pairs.jsonl`
|
||||
Final training corpus. One JSON object per line:
|
||||
```json
|
||||
{"input": "...", "output": "...", "meta_template": "...", "source_words": [...]}
|
||||
```
|
||||
|
||||
### `corpus_stats.json`
|
||||
The metrics from Phase 5.
|
||||
|
||||
### `discard_analysis.csv`
|
||||
Every discarded saying with its discard reason:
|
||||
```
raw_text, meta_template, discard_stage, discard_reason
"Funny how the bamboo...", ironic_deficiency, llm_polish, "DISCARD by LLM"
"The fire's...", ironic_deficiency, quality_filter, "too_short"
```
|
||||
|
||||
This is valuable for debugging the template engine — if a specific template surface variant has a >50% discard rate, the template itself needs fixing.
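
A sketch of that per-variant check, computed from the polished entries rather than from the CSV, because the CSV columns above do not include the surface template. The function name and the `min_count` cutoff are assumptions, and only LLM-stage discards (`status == "discarded"`) are counted here.

```python
from collections import Counter


def high_discard_templates(entries, max_rate=0.5, min_count=20):
    """Map each surface template with a discard rate above max_rate to that rate."""
    totals = Counter()
    discards = Counter()
    for entry in entries:
        template = entry["surface_template"]
        totals[template] += 1
        if entry.get("status") == "discarded":
            discards[template] += 1
    return {
        template: discards[template] / count
        for template, count in totals.items()
        if count >= min_count and discards[template] / count > max_rate
    }
```

The `min_count` guard keeps a variant with only a handful of samples from being flagged on noise.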
|
||||
|
||||
## File Organization
|
||||
|
||||
```
folksy-generator/
├── corpus/
│   ├── corpus_raw.jsonl
│   ├── corpus_polished.jsonl
│   ├── corpus_filtered.jsonl
│   ├── training_pairs.jsonl
│   ├── corpus_stats.json
│   └── discard_analysis.csv
├── scripts/
│   ├── generate_raw_batch.sh       # Runs generator across all templates
│   ├── polish_corpus.py            # LLM polish pipeline
│   ├── filter_corpus.py            # Quality filtering
│   ├── format_training_pairs.py    # Training pair generation
│   └── compute_corpus_stats.py     # Metrics and validation
```
|
||||
|
||||
## Execution Timeline
|
||||
|
||||
Assuming ~1 second per LLM call on the local 4090:

| Step | Items | Est. Time |
|------|-------|-----------|
| Raw generation (template engine only) | 10,500 | ~2 minutes |
| LLM polish | 10,500 | ~3 hours |
| Quality filtering | ~7,500 | ~1 minute |
| Training pair formatting | ~6,000 sayings × 4 framings | ~1 minute |
| Fictional entity pairs | ~300 | ~5 minutes (includes generation + polish) |

Total: ~3.5 hours of mostly unattended LLM grinding. The polish step is the bottleneck and is fully resumable via checkpointing.
|
||||
|
||||
## Integration Notes
|
||||
|
||||
### Feeding into Fine-Tuning
|
||||
|
||||
The `training_pairs.jsonl` file is ready to feed directly into standard fine-tuning pipelines (Hugging Face Trainer, axolotl, etc.). Training the 0.5B model is out of scope for this spec, but the corpus format is designed for it.
|
||||
|
||||
### Iterative Improvement
|
||||
|
||||
This pipeline is designed to be re-run. After fine-tuning and evaluating the small model, weaknesses will appear (certain templates it struggles with, certain word categories it handles poorly). The fix is:
|
||||
1. Generate more raw sayings targeting the weak area
|
||||
2. Polish and filter
|
||||
3. Append to training corpus
|
||||
4. Re-train
|
||||
|
||||
The JSONL format and checkpoint system support this append workflow natively.
|
||||
303
EVALUATION.md
Normal file
|
|
@ -0,0 +1,303 @@
|
|||
# Folksy Generator — Evaluation Report
|
||||
|
||||
**Date:** 2026-02-17
|
||||
**Evaluator:** Claude (automated)
|
||||
**Scope:** Post-integration health check after three LLM augmentation phases
|
||||
|
||||
---
|
||||
|
||||
## 1. Project Structure Overview
|
||||
|
||||
```
folksy-generator/
├── folksy_generator.py                 # Main CLI generator (910 lines)
├── FOLKSY_GENERATOR_SPEC.md            # Original project spec
├── GRAPH_ENHANCEMENT_SPEC.md           # LLM graph augmentation spec (Phases 1-3)
├── CORPUS_GENERATION_SPEC.md           # Corpus generation spec (next phase)
├── data/
│   ├── folksy_vocab.csv                # Curated vocabulary (624 words, expanded from 534)
│   ├── folksy_vocab.csv.bak.*          # Pre-expansion backup (534 words)
│   ├── folksy_relations.csv            # Original ConceptNet edges (11,096 edges)
│   ├── folksy_relations_augmented.csv  # LLM-generated edges (11,220 edges)
│   ├── classified_proverbs.csv         # Labeled real proverbs for reference
│   ├── candidate_additions.csv         # OOV words suggested by LLM (3,678 unique)
│   └── enhancement_log.csv             # Processing log for all 3 phases
├── scripts/
│   ├── extract_from_conceptnet.py      # One-time ConceptNet extraction (requires psql)
│   ├── extract_relations.py            # Relation extraction helper
│   ├── classify_proverbs.py            # Proverb classification
│   ├── expand_vocab.py                 # Phase: vocab expansion (+90 words)
│   ├── enhance_graph.py                # Phase: LLM edge augmentation
│   ├── generate_raw_batch.sh           # Bulk generation script
│   ├── polish_corpus.py                # LLM polish pipeline
│   ├── filter_corpus.py                # Quality filtering
│   ├── format_training_pairs.py        # Training pair generation
│   └── compute_corpus_stats.py         # Corpus statistics
├── examples/
│   ├── my_world.json                   # Fictional entity examples (5 entities)
│   └── sample_output.txt               # Pre-integration sample output
├── schemas/
│   └── fictional_entities.schema.json
└── corpus/                             # Empty — not yet populated
```
|
||||
|
||||
**Entry point:** `python3 folksy_generator.py` — no virtual environment, no dependencies beyond Python 3.11 stdlib.
|
||||
|
||||
---
|
||||
|
||||
## 2. What the Three LLM Integration Phases Produced
|
||||
|
||||
Git history shows a single initial commit (`8c8a058 Initial 'folksy idiom' generator`). All three LLM augmentation phases were executed as data-pipeline operations rather than code commits — the results live in data files.
|
||||
|
||||
### Phase 1: Per-Word Relationship Expansion
|
||||
- **624 words** processed through GLM4-32B
|
||||
- 10,726 edges generated, **1,155 accepted** (10.8% acceptance rate)
|
||||
- 9,510 edges rejected as OOV (target words not in folksy vocab)
|
||||
- 61 duplicates filtered
|
||||
- Filled gaps in `AtLocation`, `UsedFor`, `HasA`, `MadeOf`, `PartOf`, `CapableOf`, `HasPrerequisite`, `Causes`, `HasProperty`
|
||||
|
||||
### Phase 2: Cross-Word Relationship Discovery (Bridge Words)
|
||||
- **148 low-connectivity words** targeted
|
||||
- 6,272 bridge edges accepted
|
||||
- This phase focused on connecting isolated vocabulary clusters via shared intermediate concepts
|
||||
|
||||
### Phase 3: Property Enrichment
|
||||
- **624 words** processed for distinctive HasProperty edges
|
||||
- 3,849 edges generated, **3,788 accepted** (98.4% acceptance rate)
|
||||
- 61 duplicates filtered
|
||||
- Targeted at improving `false_equivalence` template output
|
||||
|
||||
### Vocab Expansion (via `expand_vocab.py`)
|
||||
- Original vocabulary: **534 words**
|
||||
- Current vocabulary: **624 words** (+90 words added)
|
||||
- Added words span all major categories: animal (18), landscape (16), tool (14), material (13), plant (13), structure (8), food (7), and 25 other categories
|
||||
|
||||
### Combined Data Summary

| Dataset | Count |
|---------|-------|
| Original ConceptNet edges | 11,096 |
| LLM-augmented edges | 11,220 |
| **Total edges (combined)** | **22,316** |
| Original vocabulary | 534 |
| Expanded vocabulary | 624 |
| Candidate OOV words (not added) | 3,678 |

---
|
||||
|
||||
## 3. Term Database Statistics
|
||||
|
||||
### Vocabulary by Category (36 categories)

| Category | Words | | Category | Words |
|----------|-------|-|----------|-------|
| bird | 97 | | fish | 16 |
| animal | 65 | | spice | 16 |
| tool | 56 | | fruit | 15 |
| plant | 43 | | mineral | 14 |
| food | 38 | | insect | 14 |
| material | 36 | | structure | 13 |
| container | 34 | | beverage | 9 |
| instrument | 28 | | fabric | 9 |
| landscape | 27 | | tree | 8 |
| vegetable | 24 | | wood | 7 |
| building | 21 | | herb | 7 |
| metal | 19 | | rock | 6 |
| flower | 19 | | water | 6 |
| vehicle | 18 | | furniture | 5 |
| stone | 17 | | clothing | 5 |
| weapon | 17 | | shelter | 5 |
| — | — | | crop, seed, organism, grain | 3-4 each |

### Edge Distribution — Original ConceptNet

| Relation | Edges |
|----------|-------|
| AtLocation | 5,294 |
| UsedFor | 2,481 |
| CapableOf | 1,138 |
| ReceivesAction | 485 |
| HasProperty | 422 |
| HasA | 307 |
| HasPrerequisite | 261 |
| MadeOf | 181 |
| PartOf | 170 |
| Others (6 types) | 257 |

### Edge Distribution — LLM Augmented

| Relation | Edges |
|----------|-------|
| HasProperty | 3,985 |
| HasA | 1,719 |
| PartOf | 1,247 |
| UsedFor | 1,230 |
| MadeOf | 1,217 |
| AtLocation | 1,008 |
| CapableOf | 288 |
| HasPrerequisite | 250 |
| Others (4 types) | 276 |

The augmented edges deliberately fill the gaps in the original ConceptNet data. `HasProperty` went from 422 to 4,407 total — critical for the `false_equivalence` template.
|
||||
|
||||
---
|
||||
|
||||
## 4. Sample Generated Output (30 Sayings)
|
||||
|
||||
Generated with `python3 folksy_generator.py --count 30` using the full augmented graph:
|
||||
|
||||
1. An scarf ain't nothing but cotton that met some wool.
|
||||
2. The only difference between a hummingbird and a dodo is metabolism.
|
||||
3. An salt ain't nothing but ore that met some crystals.
|
||||
4. Funny how the earthworm never has enough food for itself.
|
||||
5. What's a coop but a kitchen with sound?
|
||||
6. My grandmother used to say, 'spooning the dessert won't bring you eating.'
|
||||
7. Don't take the wheel and then gripe about the hull.
|
||||
8. A bamboo don't come without its water, now does it?
|
||||
9. Nobody's got less salsa than the man who makes the mango.
|
||||
10. That's like eating the sea and complaining the savanna tastes off.
|
||||
11. My daddy always said, can't have waking up in morning without coffee.
|
||||
12. Take the bison out of meat and all you've got left is salty taste flesh.
|
||||
13. Like baiting the flock and hoping for keep as pet.
|
||||
14. The ice's family always goes without cool body.
|
||||
15. There's a fella who takes the wax and says the sugar's no good.
|
||||
16. That's just holding the drawer and praying for store blanket.
|
||||
17. You know what they say, a mica with no schist is just a rough surface rock.
|
||||
18. An silver ain't nothing but hairbrushes that met some alloy.
|
||||
19. A kite is just a pelican that's got catch wind.
|
||||
20. Like making the denim and hoping for material.
|
||||
21. The nut feeds everyone's fit bolt but its own.
|
||||
22. The pitcher's family always goes without throw fast ball.
|
||||
23. A nail is just a weapon that's got smooth length.
|
||||
24. You want lid? Well, first you're gonna need container.
|
||||
25. Don't build the micrometer and say you ain't got workshop.
|
||||
26. Ain't no sleeping at night ever came from nothing — you need bed.
|
||||
27. What's a cicada but a lacebug with nocturnal behavior?
|
||||
28. Don't drink the dish and then gripe about the gnocchi.
|
||||
29. You can't put out a herring and then wonder where all the herringbone came from.
|
||||
30. That's just lorikeeting the fruit and praying for breaking wind.
|
||||
|
||||
---
|
||||
|
||||
## 5. Quality Assessment
|
||||
|
||||
### Rating Summary
|
||||
|
||||
I rated each of the 30 sayings on a 3-tier scale (Good / Okay / Bad):

| Rating | Count | % | Description |
|--------|-------|---|-------------|
| **Good** | 8 | 27% | Sounds natural, humorous, structurally solid |
| **Okay** | 9 | 30% | Semantically coherent but grammatically rough |
| **Bad** | 13 | 43% | Broken grammar, nonsensical, or artifact leakage |

### Good Examples (natural-sounding, humorous)
|
||||
- "Nobody's got less salsa than the man who makes the mango."
|
||||
- "There's a fella who takes the wax and says the sugar's no good."
|
||||
- "A bamboo don't come without its water, now does it?"
|
||||
- "Don't take the wheel and then gripe about the hull."
|
||||
- "Ain't no sleeping at night ever came from nothing — you need bed."
|
||||
- "My daddy always said, can't have waking up in morning without coffee."
|
||||
- "What's a cicada but a lacebug with nocturnal behavior?"
|
||||
- "You can't put out a herring and then wonder where all the herringbone came from."
|
||||
|
||||
### Common Issues Identified
|
||||
|
||||
#### 1. Article / Grammar Errors (frequent)
|
||||
- "An scarf ain't nothing but..." — should be "A scarf"
|
||||
- "An silver ain't nothing but..." — should be "Silver"
|
||||
- "An salt ain't nothing but..." — should be "Salt"
|
||||
- "A have children don't come without..." — broken slot fill leaking action phrase as noun
|
||||
|
||||
#### 2. Multi-Word ConceptNet Phrases Leaking Into Templates (frequent)
|
||||
- "throw fast ball", "fit bolt", "cool body", "keep as pet", "store blanket"
|
||||
- "waking up in morning", "sleeping at night", "salty taste"
|
||||
- "breaking wind", "rough surface"
|
||||
- These are raw ConceptNet concept IDs that should have been filtered or reformatted
|
||||
|
||||
#### 3. Nonsensical Verb Conjugation in Futile Preparation (severe)
|
||||
- "lorikeeting the fruit" — `lorikeet` treated as a verb
|
||||
- "fooding the earthworm" — `food` treated as a verb
|
||||
- "jeansing the denim" — `jeans` treated as a verb
|
||||
- "safariing the lion" — `safari` treated as a verb
|
||||
- The `_gerund()` function applies gerunding to ANY UsedFor target, including nouns
|
||||
|
||||
#### 4. LLM Enhancement Artifacts Leaking (moderate)
|
||||
- "bridge word: plate" appearing in output text
|
||||
- "bridge 2: **food**" appearing in output text
|
||||
- "*bridge word: absorption*" appearing in output text
|
||||
- These are raw LLM response fragments that weren't properly cleaned during Phase 2
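
A cleanup pass for these fragments could look roughly like the sketch below. The pattern only covers the three forms quoted above ("bridge word:", "bridge N:", and stray Markdown asterisks), the function names are hypothetical, and whether an affected edge should be repaired or dropped is a separate decision.

```python
import re

# Matches an optional leading '*' or '**', then 'bridge word:' or 'bridge <number>:'.
ARTIFACT_RE = re.compile(r"\*{0,2}\s*bridge(?:\s+word|\s+\d+)\s*:\s*", re.IGNORECASE)


def has_artifact(value):
    """True if the field still contains a Phase 2 response fragment."""
    return bool(ARTIFACT_RE.search(value)) or "*" in value


def clean_concept(value):
    """Strip the bridge prefix and any leftover Markdown emphasis markers."""
    cleaned = ARTIFACT_RE.sub("", value)
    return cleaned.replace("*", "").strip()
```

Running `has_artifact` over the word fields of `folksy_relations_augmented.csv` first would show how many edges are affected before anything is rewritten.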
|
||||
|
||||
#### 5. Semantic Mismatches (occasional)
|
||||
- "A lynx is just a earthworm that's got feline." — wrong category siblings
|
||||
- "That's like eating the sea and complaining the savanna tastes off." — sea and savanna are not parts of a river
|
||||
- "A emu is just a ferret that's got walk backwards." — cross-class comparison
|
||||
|
||||
### Per-Template Quality Assessment

| Template | Typical Quality | Key Issue |
|----------|----------------|-----------|
| **deconstruction** | Okay | Multi-word properties leak; article errors with "An" |
| **denial_of_consequences** | Good | Best template; LLM artifacts occasionally leak through |
| **ironic_deficiency** | Okay-Bad | Multi-word action phrases used as nouns ("throw fast ball") |
| **futile_preparation** | Bad | Nouns gerunded as verbs; worst template overall |
| **hypocritical_complaint** | Okay | Some odd part-of relationships; generally coherent structure |
| **tautological_wisdom** | Good | Simple structure avoids most issues; multi-word phrases still leak |
| **false_equivalence** | Good | Benefited most from Phase 3 property enrichment |

---
|
||||
|
||||
## 6. Errors, Warnings, and Issues
|
||||
|
||||
### No Errors at Runtime
|
||||
- Generator runs without crashes on all template types
|
||||
- All CLI flags work (`--template`, `--count`, `--seed`, `--category`, `--debug`, `--json`, `--entities`, `--pure-conceptnet`, `--llm-weight-boost`)
|
||||
- JSON output mode produces valid JSONL with complete metadata
|
||||
- Fictional entity generation works
|
||||
|
||||
### Issues Found

| Severity | Issue | Impact |
|----------|-------|--------|
| **High** | LLM Phase 2 artifacts in augmented data ("bridge word:", "bridge 2:") | Raw LLM response fragments leak into generated sayings |
| **High** | Nouns gerunded as verbs in `futile_preparation` | "lorikeeting", "fooding", "jeansing" — template fundamentally broken for non-verb UsedFor targets |
| **Medium** | Multi-word ConceptNet phrases not filtered | "throw fast ball", "keep as pet" break sentence flow |
| **Medium** | Article logic doesn't handle "a" vs "an" properly for all cases | "An scarf", "An silver", "An salt" |
| **Low** | No test suite exists | No automated validation of output quality |
| **Low** | No virtual environment or requirements.txt | Only stdlib needed currently, but will need deps for corpus generation phase |
| **Info** | Corpus directory is empty | Expected — corpus generation is the next phase |

---
|
||||
|
||||
## 7. Readiness Assessment for Corpus Generation
|
||||
|
||||
### Ready
|
||||
- Template engine is functional and produces output across all 7 meta-template families
|
||||
- Augmented graph significantly improves vocabulary coverage (22,316 total edges)
|
||||
- Vocab expansion added 90 words to cover previously sparse categories
|
||||
- JSON output mode with full debug metadata is working — ready for bulk generation
|
||||
- Deduplication logic works (seen_text, seen_slots, seed_usage caps at 30)
|
||||
- Fictional entity support is implemented and functional
|
||||
- All corpus pipeline scripts exist (`generate_raw_batch.sh`, `polish_corpus.py`, `filter_corpus.py`, `format_training_pairs.py`, `compute_corpus_stats.py`)
|
||||
|
||||
### Should Fix Before Corpus Generation
|
||||
1. **Clean Phase 2 artifacts from `folksy_relations_augmented.csv`**: grep for "bridge word" and "bridge 2" in the surface_text/end_word fields and remove or repair those edges
2. **Fix `futile_preparation` gerunding**: `_gerund()` needs to check that the UsedFor target is actually a verb before inflecting it; alternatively, filter UsedFor targets to verb-like words only
3. **Filter multi-word ConceptNet phrases**: the `_short_concepts()` helper caps at 3 words, but many 2-3 word phrases are still awkward as slot fills ("salty taste", "cool body"); consider capping at 2 or adding a verb/noun POS check
4. **Fix article logic**: the `_a()` function at lines 680-684 only checks the first character, yet output such as "An salt" still appears even though "salt" starts with the consonant "s"
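
For the article fix, one possible shape is sketched below. It is not the project's `_a()` function; `indefinite_article`, `with_article`, and both exception sets are hypothetical and would need to be extended for the real vocabulary. It assumes a non-empty phrase.

```python
VOWELS = set("aeiou")

# Hypothetical exception lists for words whose spelling and sound disagree.
AN_EXCEPTIONS = {"hour", "honest", "heir", "honor"}
A_EXCEPTIONS = {"unicorn", "university", "uniform", "one", "ewe"}


def indefinite_article(phrase):
    """Return 'a' or 'an' based on the first word of a non-empty phrase."""
    first = phrase.strip().split()[0].lower()
    if first in AN_EXCEPTIONS:
        return "an"
    if first in A_EXCEPTIONS:
        return "a"
    return "an" if first[0] in VOWELS else "a"


def with_article(phrase, capitalize=False):
    """Prefix the phrase with its article, optionally capitalized for sentence starts."""
    article = indefinite_article(phrase)
    if capitalize:
        article = article.capitalize()
    return f"{article} {phrase}"


for word in ("salt", "scarf", "apple", "hour", "ewe"):
    print(with_article(word, capitalize=True))
```

Deciding on the lowercased first word before capitalizing keeps the sentence-initial and mid-sentence cases on the same code path.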
|
||||
|
||||
### Nice to Have
|
||||
- Add a basic test suite (even just smoke tests that confirm each template generates output)
|
||||
- Create `requirements.txt` (currently stdlib-only, but corpus phase will need `requests` at minimum)
|
||||
- Review the 3,678 candidate OOV words: none reached the frequency threshold of 3+ for auto-addition, but manual review could still find useful additions
|
||||
|
||||
### Overall Verdict
|
||||
|
||||
**The template generator works but produces rough output.** This is expected and acceptable because the CORPUS_GENERATION_SPEC explicitly accounts for it — the raw output goes through LLM polishing (Phase 2 of corpus generation) where GLM4-32B fixes grammar and discards unsalvageable sayings. The spec estimates a 20-30% discard rate; based on this evaluation, the actual discard rate will likely be **40-50%** due to the issues above.
|
||||
|
||||
Fixing the four "Should Fix" items before corpus generation would:
|
||||
- Reduce the discard rate (saving LLM compute time)
|
||||
- Improve the quality floor of raw output (giving the polish LLM better material to work with)
|
||||
- Eliminate artifact contamination that could propagate into training data
|
||||
|
||||
The generator is **functional but not polished** — appropriate for its role as a raw material source in a pipeline that includes LLM correction downstream.
|
||||
318
GRAPH_ENHANCEMENT_SPEC.md
Normal file
|
|
@ -0,0 +1,318 @@
|
|||
# Graph Enhancement Spec — LLM-Augmented Folksy Subgraph
|
||||
|
||||
## Overview
|
||||
|
||||
The folksy subgraph extracted from ConceptNet (534 words, 11,096 edges) has coverage gaps. Many common folksy words have sparse or heavily skewed edge distributions — "dog" maps almost exclusively to "bark," "horse" collapses to "ride," etc. This produces repetitive output when the generator seeds on these words.
|
||||
|
||||
This phase uses the local GLM4-32B model to generate supplementary relationship edges for every word in the folksy vocabulary, expanding the graph's density and diversity while maintaining the typed-edge structure the template engine requires.
|
||||
|
||||
## Infrastructure
|
||||
|
||||
```python
import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Chat completion endpoint of local LLM"""
    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()
```
|
||||
|
||||
All LLM calls go through this endpoint. No cloud APIs. The model runs locally on the RTX 4090.
|
||||
|
||||
## Strategy
|
||||
|
||||
For each word in `folksy_vocab.csv`, ask the LLM to generate relationships that ConceptNet is missing or underrepresenting. The LLM output gets parsed into the same edge format as `folksy_relations.csv` and merged into the generator's working dataset.
|
||||
|
||||
This is NOT free-form generation. The LLM is constrained to output structured relationship tuples that conform to the existing relation type taxonomy. Think of it as using the LLM as a commonsense knowledge base that supplements ConceptNet, not replaces it.
|
||||
|
||||
## Phase 1: Per-Word Relationship Expansion
|
||||
|
||||
### Input
|
||||
Every word in `folksy_vocab.csv`, plus its existing edges from `folksy_relations.csv`.
|
||||
|
||||
### Process
|
||||
|
||||
For each word, send a prompt that:
|
||||
1. Provides the word and its categories
|
||||
2. Lists its EXISTING relationships (so the LLM doesn't duplicate them)
|
||||
3. Asks for ADDITIONAL relationships across specific relation types
|
||||
4. Constrains output to a parseable structured format
|
||||
|
||||
### System Prompt
|
||||
|
||||
```
|
||||
You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
|
||||
|
||||
Rules:
|
||||
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
|
||||
- Every relationship must be something a typical adult would agree is true
|
||||
- Do not repeat any relationship already listed as "known"
|
||||
- Target words should be common English words (top 3000 frequency preferred)
|
||||
- Output ONLY the structured format shown below, one relationship per line
|
||||
- If you cannot think of good relationships for a given type, output NONE for that type
|
||||
- Aim for 3-5 relationships per type where possible
|
||||
|
||||
Output format (one per line):
|
||||
RELATION_TYPE: target_word | short natural phrasing
|
||||
|
||||
Example output:
|
||||
AtLocation: barn | you find a horse in a barn
|
||||
UsedFor: riding | a horse is used for riding
|
||||
HasA: mane | a horse has a mane
|
||||
CapableOf: gallop | a horse can gallop
|
||||
MadeOf: NONE
|
||||
PartOf: herd | a horse is part of a herd
|
||||
```
|
||||
|
||||
### User Prompt Template
|
||||
|
||||
```
|
||||
Word: {word}
|
||||
Categories: {categories}
|
||||
|
||||
Known relationships:
|
||||
{existing_edges_formatted}
|
||||
|
||||
Generate additional relationships for these types:
|
||||
- AtLocation (where is it found?)
|
||||
- UsedFor (what is it used for?)
|
||||
- HasA (what does it have / contain?)
|
||||
- PartOf (what is it part of?)
|
||||
- CapableOf (what can it do?)
|
||||
- MadeOf (what is it made of?)
|
||||
- HasPrerequisite (what do you need before you can have/use it?)
|
||||
- Causes (what does it cause or lead to?)
|
||||
- HasProperty (what adjectives describe it? — limit to physical/sensory properties)
|
||||
```
|
||||
|
||||
### Formatting Existing Edges
|
||||
|
||||
For the "Known relationships" section, format existing edges as:
|
||||
|
||||
```
|
||||
AtLocation: pond (weight 1.0), lake (weight 4.47)
|
||||
CapableOf: swim (weight 2.0), fly (weight 1.0)
|
||||
UsedFor: (none in database)
|
||||
```
|
||||
|
||||
This shows the LLM what's already covered AND highlights which relation types are empty and most need filling.
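
The formatting above could be produced by a helper along these lines. This is a sketch only: `format_existing_edges` and `RELATION_TYPES` are hypothetical names, and edges are assumed to be dicts with `relation`, `end_word`, and `weight` keys as in the parser further down.

```python
# Relation types listed in the user prompt template.
RELATION_TYPES = [
    "AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
    "MadeOf", "HasPrerequisite", "Causes", "HasProperty",
]


def format_existing_edges(edges, relation_types=RELATION_TYPES):
    """Render known edges as one line per relation type, marking empty types."""
    by_relation = {rel: [] for rel in relation_types}
    for edge in edges:
        if edge["relation"] in by_relation:
            by_relation[edge["relation"]].append(edge)

    lines = []
    for rel in relation_types:
        if by_relation[rel]:
            targets = ", ".join(
                f"{e['end_word']} (weight {e['weight']})" for e in by_relation[rel]
            )
            lines.append(f"{rel}: {targets}")
        else:
            # Empty types are shown explicitly so the LLM sees the gap.
            lines.append(f"{rel}: (none in database)")
    return "\n".join(lines)


known = [
    {"relation": "AtLocation", "end_word": "pond", "weight": 1.0},
    {"relation": "AtLocation", "end_word": "lake", "weight": 4.47},
]
print(format_existing_edges(known, ["AtLocation", "UsedFor"]))
```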
|
||||
|
||||
### Parsing LLM Output
|
||||
|
||||
```python
import re

# Relation types requested in the user prompt template above
VALID_RELATIONS = {
    "AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
    "MadeOf", "HasPrerequisite", "Causes", "HasProperty",
}

def parse_llm_relations(response_text, source_word):
    """Parse structured LLM output into edge dicts."""
    edges = []
    for line in response_text.strip().split('\n'):
        line = line.strip()
        # Skip blank lines and "TYPE: NONE" placeholders
        if not line or 'NONE' in line:
            continue
        # Expected shape: RELATION_TYPE: target_word | short natural phrasing
        match = re.match(r'^(\w+):\s*(\w+)\s*\|\s*(.+)$', line)
        if match:
            relation, target, surface = match.groups()
            # Validate relation type
            if relation in VALID_RELATIONS:
                edges.append({
                    'start_word': source_word,
                    'end_word': target.strip().lower(),
                    'relation': relation,
                    'weight': 0.8,  # LLM-generated edges get a default weight below ConceptNet minimum
                    'surface_text': surface.strip(),
                    'source': 'llm_augmented'
                })
    return edges
```
|
||||
|
||||
### Weight Assignment
|
||||
|
||||
LLM-generated edges get a default weight of **0.8** — deliberately below the ConceptNet minimum threshold of 1.0. This means:
|
||||
- They fill gaps and add diversity
|
||||
- They lose ties to ConceptNet edges (real data preferred when both exist)
|
||||
- They can be filtered out easily if needed (`weight >= 1.0` restores pure ConceptNet)
|
||||
- The generator can optionally boost or penalize LLM edges via a CLI flag
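
The effect of the 0.8 default can be illustrated with a small sketch. The helper names below are hypothetical, and how the generator actually ranks edges is not specified here; the example only assumes that higher weight wins.

```python
CONCEPTNET_MIN_WEIGHT = 1.0  # threshold stated in this section


def pure_conceptnet(edges):
    """Keep only edges at or above the ConceptNet minimum weight."""
    return [e for e in edges if e["weight"] >= CONCEPTNET_MIN_WEIGHT]


def rank_edges(edges):
    """Order edges so that heavier (ConceptNet) edges come first."""
    return sorted(edges, key=lambda e: e["weight"], reverse=True)


edges = [
    {"start_word": "dog", "end_word": "porch", "weight": 0.8, "source": "llm_augmented"},
    {"start_word": "dog", "end_word": "bark", "weight": 1.0, "source": "conceptnet"},
]
print([e["end_word"] for e in rank_edges(edges)])
print([e["end_word"] for e in pure_conceptnet(edges)])
```

If LLM weights are ever boosted to 1.0 or above, the weight filter no longer isolates ConceptNet edges, so filtering on the `source` field would be the safer choice in that case.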
|
||||
|
||||
### Deduplication
|
||||
|
||||
Before merging, check each LLM-generated edge against existing edges:
|
||||
- If (start_word, end_word, relation) already exists → skip
|
||||
- If end_word is not in folksy_vocab → add to a `candidate_additions.csv` for review, but do NOT auto-add to vocab (avoids graph bloat)
|
||||
- If end_word IS in folksy_vocab → add edge to `folksy_relations_augmented.csv`
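
The three rules above amount to a routing step. A minimal sketch follows, with a hypothetical `route_new_edges` helper (its signature differs from the `deduplicate` call used in the batch loop, which is not defined in this spec).

```python
def route_new_edges(new_edges, existing_triples, vocab):
    """Split LLM edges into accepted, out-of-vocab candidates, and duplicates."""
    accepted, candidates, duplicates = [], [], []
    seen = set(existing_triples)
    for edge in new_edges:
        triple = (edge["start_word"], edge["end_word"], edge["relation"])
        if triple in seen:
            duplicates.append(edge)          # already in the graph: skip
        elif edge["end_word"] not in vocab:
            candidates.append(edge)          # goes to candidate_additions.csv
        else:
            accepted.append(edge)            # goes to folksy_relations_augmented.csv
            seen.add(triple)                 # also catches repeats within the batch
    return accepted, candidates, duplicates


vocab = {"dog", "porch", "barn"}
existing = {("dog", "barn", "AtLocation")}
new_edges = [
    {"start_word": "dog", "end_word": "barn", "relation": "AtLocation"},
    {"start_word": "dog", "end_word": "porch", "relation": "AtLocation"},
    {"start_word": "dog", "end_word": "leash", "relation": "HasA"},
]
accepted, candidates, duplicates = route_new_edges(new_edges, existing, vocab)
print(len(accepted), len(candidates), len(duplicates))
```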
|
||||
|
||||
## Phase 2: Cross-Word Relationship Discovery
|
||||
|
||||
After per-word expansion, run a second pass that specifically targets 2-hop paths. The goal is to find bridge words that connect otherwise-isolated clusters.
|
||||
|
||||
### Process
|
||||
|
||||
1. Identify word pairs that are in the same category but have no path of length ≤ 2 between them
|
||||
2. For a sample of these pairs, ask the LLM what connects them
|
||||
|
||||
### Prompt for Bridge Discovery
|
||||
|
||||
System prompt:
|
||||
```
|
||||
You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
|
||||
|
||||
Rules:
|
||||
- The bridge word must be a common, concrete noun
|
||||
- State the relationship type for each connection
|
||||
- Output format: BRIDGE: bridge_word | TYPE from first word: phrasing | TYPE for second word: phrasing | explanation
|
||||
|
||||
Example:
|
||||
Words: "cow" and "butter"
|
||||
BRIDGE: milk | CapableOf from cow: a cow produces milk | MadeOf for butter: butter is made of milk | milk connects production to product
|
||||
```
|
||||
|
||||
User prompt:
|
||||
```
|
||||
Words: "{word_a}" and "{word_b}"
|
||||
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
|
||||
Find 1-3 bridge words that connect them.
|
||||
```
|
||||
|
||||
### Candidate Selection
|
||||
|
||||
Don't run this for all pairs — that's O(n²) on 534 words. Instead:
|
||||
|
||||
1. Build the current 2-hop reachability matrix
|
||||
2. Identify words with LOW 2-hop reachability (few or no 2-hop paths to other folksy words)
|
||||
3. For each low-connectivity word, pick 5-10 random same-category words it can't reach
|
||||
4. Run bridge discovery on those pairs
|
||||
5. Target: ensure every word in the vocab has at least 3 distinct 2-hop paths to other vocab words
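
The selection steps above can be sketched with plain sets instead of a full matrix. Everything here is an assumption made for illustration: edges are treated as undirected, the number of distinct 2-hop targets stands in for the number of 2-hop paths, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def build_adjacency(edge_pairs):
    """Map each word to its direct neighbors (edges treated as undirected)."""
    adj = defaultdict(set)
    for start, end in edge_pairs:
        adj[start].add(end)
        adj[end].add(start)
    return adj


def two_hop_reachable(adj, word):
    """Words reachable from `word` through exactly one bridge word."""
    reached = set()
    for bridge in adj.get(word, ()):
        reached |= adj.get(bridge, set())
    reached.discard(word)
    return reached


def low_connectivity_words(adj, vocab, min_targets=3):
    """Vocab words with fewer than `min_targets` distinct 2-hop targets."""
    return sorted(w for w in vocab if len(two_hop_reachable(adj, w)) < min_targets)


def pick_unreachable_peers(adj, word, same_category, k=5, rng=random):
    """Sample up to k same-category words with no path of length <= 2 to `word`."""
    near = two_hop_reachable(adj, word) | adj.get(word, set()) | {word}
    pool = sorted(set(same_category) - near)
    return rng.sample(pool, min(k, len(pool)))


adj = build_adjacency([("cow", "milk"), ("milk", "butter"), ("milk", "cheese")])
print(two_hop_reachable(adj, "cow"))
print(low_connectivity_words(adj, ["cow", "hammer"], min_targets=1))
```

Working per word keeps the cost proportional to the neighborhoods actually inspected, which avoids materializing all pairs.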
|
||||
|
||||
## Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates
|
||||
|
||||
The `false_equivalence` meta-template needs HasProperty edges, which are sparse in ConceptNet for concrete nouns. Run a targeted property-extraction pass.
|
||||
|
||||
### Prompt
|
||||
|
||||
System prompt:
|
||||
```
|
||||
You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
|
||||
|
||||
Rules:
|
||||
- Only physical/sensory/behavioral properties, not abstract qualities
|
||||
- Properties should DISTINGUISH this thing from similar things in its category
|
||||
- Output one property per line as: PROPERTY | brief explanation
|
||||
- Aim for 5-8 properties
|
||||
```
|
||||
|
||||
User prompt:
|
||||
```
|
||||
Word: {word}
|
||||
Category: {categories}
|
||||
Other words in same category: {same_category_sample}
|
||||
|
||||
What properties distinguish {word} from the others listed?
|
||||
```
|
||||
|
||||
Including same-category peers in the prompt encourages the LLM to generate *differentiating* properties rather than generic ones. "Has legs" is useless for a horse because nearly every farm animal has legs. "Has a mane" differentiates it.
|
||||
|
||||
### Output Format
|
||||
|
||||
```
|
||||
fast | horses are known for running fast
|
||||
tall | horses are tall compared to most farm animals
|
||||
mane | horses have a distinctive mane
|
||||
shod | horses wear horseshoes
|
||||
```
|
||||
|
||||
These get stored as HasProperty edges in the augmented relations file.
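
Converting the `PROPERTY | brief explanation` lines into edges might look like the sketch below. `parse_property_lines` is a hypothetical helper; the 0.8 weight and `llm_augmented` source mirror the Phase 1 parser and are assumed to apply here as well.

```python
def parse_property_lines(response_text, source_word, weight=0.8):
    """Turn 'PROPERTY | explanation' lines into HasProperty edge dicts."""
    edges = []
    for line in response_text.splitlines():
        if "|" not in line:
            continue  # ignore chatter that does not follow the format
        prop, explanation = (part.strip() for part in line.split("|", 1))
        if not prop:
            continue
        edges.append({
            "start_word": source_word,
            "end_word": prop.lower(),
            "relation": "HasProperty",
            "weight": weight,
            "surface_text": explanation,
            "source": "llm_augmented",
        })
    return edges


response = (
    "fast | horses are known for running fast\n"
    "Here are some more:\n"
    "shod | horses wear horseshoes"
)
for edge in parse_property_lines(response, "horse"):
    print(edge["start_word"], edge["relation"], edge["end_word"])
```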
|
||||
|
||||
## Output Files
|
||||
|
||||
### `folksy_relations_augmented.csv`
|
||||
Same schema as `folksy_relations.csv` with additional columns:
|
||||
|
||||
```
|
||||
start_word, end_word, relation, weight, surface_text, source
|
||||
corn, chicken, UsedFor, 1.0, "Corn is used for feeding chickens", conceptnet
|
||||
dog, porch, AtLocation, 0.8, "you find a dog on a porch", llm_augmented
|
||||
horse, mane, HasA, 0.8, "a horse has a mane", llm_augmented
|
||||
```
|
||||
|
||||
The `source` column allows filtering: `source=conceptnet` for pure ConceptNet, `source=llm_augmented` for LLM additions, or both for the full enhanced graph.
|
||||
|
||||
### `candidate_additions.csv`
|
||||
Words that appeared in LLM output but aren't in the current folksy vocab:
|
||||
|
||||
```
|
||||
word, suggested_by, relation_context, frequency
|
||||
mane, horse, "HasA: a horse has a mane", 2
|
||||
bridle, horse, "HasA: a horse has a bridle", 1
|
||||
```
|
||||
|
||||
The `frequency` column counts how many different source words suggested this target. High-frequency candidates are strong additions to the folksy vocab. Review manually or with a threshold (e.g., suggested by 3+ different words → auto-add).
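
The frequency count and the threshold rule can be sketched as follows; both function names are hypothetical, and the input is assumed to be the out-of-vocab edges collected during deduplication.

```python
from collections import defaultdict


def candidate_frequencies(oov_edges):
    """Count how many distinct source words suggested each out-of-vocab target."""
    sources = defaultdict(set)
    for edge in oov_edges:
        sources[edge["end_word"]].add(edge["start_word"])
    return {word: len(suggesters) for word, suggesters in sources.items()}


def auto_add_candidates(frequencies, threshold=3):
    """Targets suggested by at least `threshold` different source words."""
    return sorted(word for word, count in frequencies.items() if count >= threshold)


oov_edges = [
    {"start_word": "horse", "end_word": "bridle"},
    {"start_word": "horse", "end_word": "saddle"},
    {"start_word": "mule", "end_word": "saddle"},
    {"start_word": "donkey", "end_word": "saddle"},
    {"start_word": "horse", "end_word": "saddle"},  # same source twice counts once
]
freqs = candidate_frequencies(oov_edges)
print(freqs)
print(auto_add_candidates(freqs))
```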
|
||||
|
||||
### `enhancement_log.csv`
|
||||
Track what was processed and what the LLM produced:
|
||||
|
||||
```
|
||||
source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov
|
||||
dog, 2025-02-15T10:30:00, 24, 18, 3, 3
|
||||
horse, 2025-02-15T10:30:45, 31, 22, 5, 4
|
||||
```
|
||||
|
||||
## Execution Plan
|
||||
|
||||
### Batch Processing
|
||||
|
||||
534 words × ~1 second per LLM call = ~9 minutes for Phase 1. Very manageable.
|
||||
|
||||
```python
import time

# load_vocab, load_relations, get_edges_for_word, build_expansion_prompt,
# deduplicate, and save_augmented_relations are helpers not shown in this spec.

def process_all_words(vocab_path, relations_path, output_path):
    vocab = load_vocab(vocab_path)
    relations = load_relations(relations_path)
    all_new_edges = []

    for i, word_entry in enumerate(vocab):
        word = word_entry['word']
        categories = word_entry['categories']
        existing = get_edges_for_word(relations, word)

        messages = build_expansion_prompt(word, categories, existing)
        response = llm_chat_completion(messages)
        response_text = response['choices'][0]['message']['content']

        new_edges = parse_llm_relations(response_text, word)
        new_edges = deduplicate(new_edges, existing)
        all_new_edges.extend(new_edges)

        if (i + 1) % 50 == 0:
            print(f"Processed {i+1}/{len(vocab)} words, {len(all_new_edges)} new edges so far")

        time.sleep(0.1)  # gentle rate limiting

    save_augmented_relations(all_new_edges, output_path)
```
|
||||
|
||||
### Resumability
|
||||
|
||||
Write a checkpoint file after each word so the process can resume if interrupted. The enhancement_log.csv serves this purpose — skip any word that already has an entry.
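
A resume check against the log could be sketched like this. The helper names are hypothetical, and the log is assumed to carry the header shown in the `enhancement_log.csv` example, including the spaces after commas.

```python
import csv
import io


def processed_words(log_file):
    """Return the set of source words that already have a log entry."""
    reader = csv.DictReader(log_file, skipinitialspace=True)
    return {row["source_word"] for row in reader}


def remaining_words(vocab_words, done):
    """Vocab words still to be sent to the LLM, in original order."""
    return [word for word in vocab_words if word not in done]


# In-memory stand-in for enhancement_log.csv.
log = io.StringIO(
    "source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov\n"
    "dog, 2025-02-15T10:30:00, 24, 18, 3, 3\n"
    "horse, 2025-02-15T10:30:45, 31, 22, 5, 4\n"
)
done = processed_words(log)
print(remaining_words(["dog", "cat", "horse"], done))
```

For this to work as a checkpoint, the log row for a word should be appended only after that word's edges have been written out.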
|
||||
|
||||
### Validation Pass
|
||||
|
||||
After all LLM edges are generated, run a quick validation:
|
||||
1. No self-loops (start_word == end_word)
|
||||
2. All relation types are in the valid set
|
||||
3. No duplicate (start, end, relation) triples
|
||||
4. Distribution check: flag any word that got 0 new edges (the LLM output may have failed to parse)
|
||||
5. Spot-check 20 random LLM edges manually for sanity
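
Checks 1 to 4 are mechanical and could be collected into one report, as in the sketch below. `validate_edges` is a hypothetical helper; the manual spot-check in step 5 is left out.

```python
from collections import Counter


def validate_edges(edges, vocab_words, valid_relations):
    """Collect self-loops, unknown relations, duplicate triples, and words with no edges."""
    report = {"self_loops": [], "bad_relations": [], "duplicates": []}
    seen = set()
    per_word = Counter()
    for edge in edges:
        triple = (edge["start_word"], edge["end_word"], edge["relation"])
        if edge["start_word"] == edge["end_word"]:
            report["self_loops"].append(triple)
        if edge["relation"] not in valid_relations:
            report["bad_relations"].append(triple)
        if triple in seen:
            report["duplicates"].append(triple)
        seen.add(triple)
        per_word[edge["start_word"]] += 1
    report["words_without_edges"] = [w for w in vocab_words if per_word[w] == 0]
    return report


edges = [
    {"start_word": "dog", "end_word": "dog", "relation": "HasA"},
    {"start_word": "dog", "end_word": "porch", "relation": "AtLocation"},
    {"start_word": "dog", "end_word": "porch", "relation": "AtLocation"},
    {"start_word": "cat", "end_word": "mat", "relation": "Likes"},
]
report = validate_edges(edges, ["dog", "cat", "horse"], {"HasA", "AtLocation"})
print(report)
```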
|
||||
|
||||
## Integration with Generator
|
||||
|
||||
The generator's data loading should be updated to:
|
||||
|
||||
1. Load `folksy_relations.csv` (original ConceptNet edges)
|
||||
2. If `folksy_relations_augmented.csv` exists, load and merge it
|
||||
3. CLI flag: `--pure-conceptnet` to disable LLM-augmented edges
|
||||
4. CLI flag: `--llm-weight-boost 0.2` to adjust LLM edge weights at runtime (default 0, meaning they keep their 0.8 weight)
|
||||
|
||||
This keeps the original ConceptNet data pristine and the augmentation fully reversible.
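
A possible merge step is sketched below. It assumes edges are already loaded as dicts and that the boost is additive; the spec only says the flag adjusts LLM weights, so the arithmetic and the `merge_relations` name are assumptions.

```python
def merge_relations(conceptnet_edges, augmented_edges,
                    pure_conceptnet=False, llm_weight_boost=0.0):
    """Combine original and augmented edges without mutating either input."""
    merged = [dict(e, source=e.get("source", "conceptnet")) for e in conceptnet_edges]
    if pure_conceptnet:
        return merged  # --pure-conceptnet: ignore the augmented file entirely
    for edge in augmented_edges:
        boosted = dict(edge)
        if boosted.get("source") == "llm_augmented":
            # Assumed additive boost applied at load time only.
            boosted["weight"] = round(float(boosted["weight"]) + llm_weight_boost, 4)
        merged.append(boosted)
    return merged


original = [{"start_word": "corn", "end_word": "chicken", "relation": "UsedFor", "weight": 1.0}]
augmented = [{"start_word": "dog", "end_word": "porch", "relation": "AtLocation",
              "weight": 0.8, "source": "llm_augmented"}]

print(len(merge_relations(original, augmented, pure_conceptnet=True)))
print(merge_relations(original, augmented, llm_weight_boost=0.2)[1]["weight"])
```

Copying each edge before adjusting it leaves the loaded data untouched, so the same inputs can be merged again with different flags.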
|
||||
9511
data/candidate_additions.csv
Normal file
File diff suppressed because it is too large
1397
data/enhancement_log.csv
Normal file
File diff suppressed because it is too large
11241
data/folksy_relations_augmented.csv
Normal file
File diff suppressed because it is too large
|
|
@ -533,3 +533,93 @@ oxpecker,bird,0.0,4,0
|
|||
bowerbird,bird,0.0,3,0
|
||||
condor,bird,0.0,3,0
|
||||
gladiola,flower,0.0,3,0
|
||||
metal,metal,0.80,0,0
|
||||
soil,mineral,0.80,0,0
|
||||
beak,animal,0.80,0,0
|
||||
feather,"bird,material",0.80,0,0
|
||||
plant,plant,0.80,0,0
|
||||
forest,"landscape,tree",0.80,0,0
|
||||
food,food,0.80,0,0
|
||||
wing,bird,0.80,0,0
|
||||
seed,"seed,plant",0.80,0,0
|
||||
kitchen,"building,structure",0.80,0,0
|
||||
handle,tool,0.80,0,0
|
||||
tail,animal,0.80,0,0
|
||||
leaf,plant,0.80,0,0
|
||||
bone,"animal,material",0.80,0,0
|
||||
flesh,"animal,food",0.80,0,0
|
||||
flock,animal,0.80,0,0
|
||||
field,"landscape,crop",0.80,0,0
|
||||
fur,"animal,material",0.80,0,0
|
||||
workshop,"building,structure",0.80,0,0
|
||||
meat,"animal,food",0.80,0,0
|
||||
fiber,"plant,material",0.80,0,0
|
||||
farm,"structure,landscape",0.80,0,0
|
||||
skin,"animal,material",0.80,0,0
|
||||
leg,"animal,tool",0.80,0,0
|
||||
flower,"flower,plant",0.80,0,0
|
||||
ground,landscape,0.80,0,0
|
||||
petal,"flower,plant",0.80,0,0
|
||||
muscle,"organism,animal",0.80,0,0
|
||||
shade,"landscape,plant",0.80,0,0
|
||||
ocean,"water,landscape",0.80,0,0
|
||||
medicine,"herb,plant",0.80,0,0
|
||||
rubber,"material,fabric",0.80,0,0
|
||||
mineral,"mineral,stone",0.80,0,0
|
||||
toolbox,"tool,container",0.80,0,0
|
||||
land,landscape,0.80,0,0
|
||||
bird,"bird,animal",0.80,0,0
|
||||
lid,"container,tool",0.80,0,0
|
||||
bouquet,"flower,plant",0.80,0,0
|
||||
ceramic,"material,container",0.80,0,0
|
||||
lake,"water,landscape",0.80,0,0
|
||||
fat,"animal,food",0.80,0,0
|
||||
body,"organism,animal",0.80,0,0
|
||||
house,"shelter,building",0.80,0,0
|
||||
furniture,"furniture,structure",0.80,0,0
|
||||
concrete,"material,stone",0.80,0,0
|
||||
jewelry,material,0.80,0,0
|
||||
fruit,fruit,0.80,0,0
|
||||
fin,"animal,fish",0.80,0,0
|
||||
container,container,0.80,0,0
|
||||
branch,"plant,wood",0.80,0,0
|
||||
earth,"landscape,mineral",0.80,0,0
|
||||
fuel,material,0.80,0,0
|
||||
ore,"mineral,metal",0.80,0,0
|
||||
fireplace,"structure,tool",0.80,0,0
|
||||
dust,material,0.80,0,0
|
||||
door,"furniture,structure",0.80,0,0
|
||||
window,structure,0.80,0,0
|
||||
mouth,"animal,insect",0.80,0,0
|
||||
string,material,0.80,0,0
|
||||
fabric,fabric,0.80,0,0
|
||||
sugar,"food,spice",0.80,0,0
|
||||
trigger,"tool,weapon",0.80,0,0
|
||||
key,tool,0.80,0,0
|
||||
brick,"container,material,stone",0.80,0,0
|
||||
stone,"rock,stone",0.80,0,0
|
||||
mountain,"landscape,rock",0.80,0,0
|
||||
juice,"beverage,food",0.80,0,0
|
||||
cage,"structure,tool",0.80,0,0
|
||||
head,"animal,insect",0.80,0,0
|
||||
grain,grain,0.80,0,0
|
||||
home,"building,shelter",0.80,0,0
|
||||
crystal,"mineral,rock",0.80,0,0
|
||||
engine,"tool,vehicle",0.80,0,0
|
||||
hammer,"tool,weapon",0.80,0,0
|
||||
aquarium,container,0.80,0,0
|
||||
tooth,animal,0.80,0,0
|
||||
river,"water,landscape",0.80,0,0
|
||||
grassland,"landscape,plant",0.80,0,0
|
||||
sea,"water,landscape",0.80,0,0
|
||||
dessert,food,0.80,0,0
|
||||
wheel,"tool,vehicle",0.80,0,0
|
||||
needle,tool,0.80,0,0
|
||||
jungle,"landscape,plant",0.80,0,0
|
||||
blood,organism,0.80,0,0
|
||||
oil,"beverage,mineral",0.80,0,0
|
||||
mouthpiece,tool,0.80,0,0
|
||||
claw,animal,0.80,0,0
|
||||
spout,tool,0.80,0,0
|
||||
savanna,"landscape,plant",0.80,0,0
|
||||
desert,landscape,0.80,0,0
|
||||
|
|
|
|||
|
|
|
@ -212,26 +212,45 @@ class Deconstruction(MetaTemplate):
|
|||
|
||||
# Find what A is made of / requires
|
||||
ingredients = []
|
||||
ingredient_rels = [] # track which relation found each ingredient
|
||||
for rel in ("MadeOf", "HasPrerequisite", "HasA"):
|
||||
ingredients.extend(_short_concepts(self.graph.neighbors(a, rel, min_weight=0.5)))
|
||||
found = _short_concepts(self.graph.neighbors(a, rel, min_weight=0.5))
|
||||
for item in found:
|
||||
ingredients.append(item)
|
||||
ingredient_rels.append(rel)
|
||||
|
||||
if len(ingredients) < 2:
|
||||
for rel in ("MadeOf", "HasPrerequisite"):
|
||||
for (start, w, s) in self.graph.reverse.get((a, rel), []):
|
||||
if len(start.split("_")) <= 2:
|
||||
ingredients.append((start, w, s))
|
||||
ingredient_rels.append(rel)
|
||||
|
||||
if len(ingredients) < 2:
|
||||
return None, None
|
||||
|
||||
random.shuffle(ingredients)
|
||||
b_word = _readable(ingredients[0][0])
|
||||
d_word = _readable(ingredients[1][0])
|
||||
# Shuffle together
|
||||
combined = list(zip(ingredients, ingredient_rels))
|
||||
random.shuffle(combined)
|
||||
ingredients, ingredient_rels = zip(*combined)
|
||||
|
||||
b_edge = ingredients[0]
|
||||
b_word = _readable(b_edge[0])
|
||||
b_rel = ingredient_rels[0]
|
||||
d_edge = ingredients[1]
|
||||
d_word = _readable(d_edge[0])
|
||||
d_rel = ingredient_rels[1]
|
||||
|
||||
# Find a property for D
|
||||
chain_edges = [
|
||||
{"start": a, "relation": b_rel, "end": b_edge[0], "weight": b_edge[1], "surface_text": b_edge[2]},
|
||||
{"start": a, "relation": d_rel, "end": d_edge[0], "weight": d_edge[1], "surface_text": d_edge[2]},
|
||||
]
|
||||
props = self.graph.neighbors(ingredients[1][0], "HasProperty")
|
||||
if props:
|
||||
c_word = _readable(random.choice(props)[0])
|
||||
c_prop = random.choice(props)
|
||||
c_word = _readable(c_prop[0])
|
||||
chain_edges.append({"start": d_edge[0], "relation": "HasProperty", "end": c_prop[0], "weight": c_prop[1], "surface_text": c_prop[2]})
|
||||
else:
|
||||
c_word = random.choice(["plain", "sorry", "old", "humble", "dry", "wet", "cold"])
|
||||
|
||||
|
|
@ -242,6 +261,7 @@ class Deconstruction(MetaTemplate):
|
|||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{a} MadeOf/Has [{b_word}, {d_word}]; {d_word} HasProperty {c_word}",
|
||||
"chain_edges": chain_edges,
|
||||
"slots": {"A": a, "B": b_word, "C": c_word, "D": d_word},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -265,23 +285,31 @@ class DenialOfConsequences(MetaTemplate):
|
|||
return None, None
|
||||
|
||||
# What is found at A? (reverse: B AtLocation A)
|
||||
attracted = []
|
||||
attracted = [] # (word, weight, surface_text, relation)
|
||||
for (b, w, s) in self.graph.reverse.get((a, "AtLocation"), []):
|
||||
attracted.append((b, w))
|
||||
attracted.append((b, w, s, "AtLocation"))
|
||||
|
||||
# Also: what does A attract/cause?
|
||||
for rel in ("Causes", "CausesDesire"):
|
||||
for (b, w, s) in self.graph.edges.get((a, rel), []):
|
||||
attracted.append((b, w))
|
||||
attracted.append((b, w, s, rel))
|
||||
|
||||
if not attracted:
|
||||
for (bridge, target, w1, w2) in self.graph.two_hop(a, "UsedFor", "AtLocation"):
|
||||
attracted.append((target, w1 + w2))
|
||||
attracted.append((target, w1 + w2, "", "AtLocation"))
|
||||
|
||||
if not attracted:
|
||||
return None, None
|
||||
|
||||
b_word = _readable(random.choice(attracted)[0])
|
||||
b_choice = random.choice(attracted)
|
||||
b_word = _readable(b_choice[0])
|
||||
|
||||
chain_edges = [
|
||||
{"start": b_choice[0] if b_choice[3] == "AtLocation" else a,
|
||||
"relation": b_choice[3],
|
||||
"end": a if b_choice[3] == "AtLocation" else b_choice[0],
|
||||
"weight": b_choice[1], "surface_text": b_choice[2]},
|
||||
]
|
||||
|
||||
create_verbs = {
|
||||
"pond": "dig", "birdhouse": "hang", "fence": "build", "trap": "set",
|
||||
|
|
@ -301,6 +329,7 @@ class DenialOfConsequences(MetaTemplate):
|
|||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{b_word} AtLocation {a}; {a} created by {c_word}",
|
||||
"chain_edges": chain_edges,
|
||||
"slots": {"A": a, "B": b_word, "C": c_word},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -324,14 +353,21 @@ class IronicDeficiency(MetaTemplate):
|
|||
return None, None
|
||||
|
||||
products = []
|
||||
product_rels = []
|
||||
for rel in ("UsedFor", "CapableOf", "Causes"):
|
||||
products.extend(self.graph.neighbors(a, rel, min_weight=0.5))
|
||||
found = self.graph.neighbors(a, rel, min_weight=0.5)
|
||||
for item in found:
|
||||
products.append(item)
|
||||
product_rels.append(rel)
|
||||
|
||||
products = _short_concepts(products)
|
||||
if not products:
|
||||
# Filter to short concepts while keeping rel tracking
|
||||
filtered = [(p, r) for p, r in zip(products, product_rels) if len(p[0].split("_")) <= 3]
|
||||
if not filtered:
|
||||
return None, None
|
||||
|
||||
x_word = _readable(random.choice(products)[0])
|
||||
choice_idx = random.randrange(len(filtered))
|
||||
x_edge, x_rel = filtered[choice_idx]
|
||||
x_word = _readable(x_edge[0])
|
||||
|
||||
family_members = ["wife", "children", "household", "family", "own kind"]
|
||||
f_word = random.choice(family_members)
|
||||
|
|
@ -339,10 +375,15 @@ class IronicDeficiency(MetaTemplate):
|
|||
template = self._pick_template()
|
||||
saying = template.format(A=a, X=x_word, F=f_word)
|
||||
|
||||
chain_edges = [
|
||||
{"start": a, "relation": x_rel, "end": x_edge[0], "weight": x_edge[1], "surface_text": x_edge[2]},
|
||||
]
|
||||
|
||||
debug = {
|
||||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{a} UsedFor/Produces {x_word}; irony: {a} lacks {x_word}",
|
||||
"chain_edges": chain_edges,
|
||||
"slots": {"A": a, "X": x_word, "F": f_word},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -371,7 +412,12 @@ class FutilePreparation(MetaTemplate):
|
|||
if not uses:
|
||||
return None, None
|
||||
|
||||
action_word = random.choice(uses)[0]
|
||||
action_edge = random.choice(uses)
|
||||
action_word = action_edge[0]
|
||||
|
||||
chain_edges = [
|
||||
{"start": seed, "relation": "UsedFor", "end": action_edge[0], "weight": action_edge[1], "surface_text": action_edge[2]},
|
||||
]
|
||||
|
||||
# Find a different outcome in a related domain via 2-hop
|
||||
outcomes = []
|
||||
|
|
@ -392,7 +438,8 @@ class FutilePreparation(MetaTemplate):
|
|||
if not outcomes:
|
||||
return None, None
|
||||
|
||||
y_word = random.choice(outcomes)[0]
|
||||
y_choice = random.choice(outcomes)
|
||||
y_word = y_choice[0]
|
||||
|
||||
gerund = _gerund(action_word)
|
||||
verb = _readable(action_word)
|
||||
|
|
@ -405,6 +452,7 @@ class FutilePreparation(MetaTemplate):
|
|||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{seed} UsedFor {action_word}; different domain: {y_word}",
|
||||
"chain_edges": chain_edges,
|
||||
"slots": {"seed": seed, "action": action_word, "Y": y_word},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -430,21 +478,37 @@ class HypocriticalComplaint(MetaTemplate):
|
|||
|
||||
# Find parts of Z
|
||||
parts = []
|
||||
part_rels = []
|
||||
for rel in ("HasA", "PartOf", "MadeOf"):
|
||||
parts.extend(_short_concepts(self.graph.neighbors(z, rel, min_weight=0.5)))
|
||||
found = _short_concepts(self.graph.neighbors(z, rel, min_weight=0.5))
|
||||
for item in found:
|
||||
parts.append(item)
|
||||
part_rels.append(rel)
|
||||
for (start, w, s) in self.graph.reverse.get((z, "PartOf"), []):
|
||||
if len(start.split("_")) <= 2:
|
||||
parts.append((start, w, s))
|
||||
part_rels.append("PartOf")
|
||||
for (start, w, s) in self.graph.reverse.get((z, "HasA"), []):
|
||||
if len(start.split("_")) <= 2:
|
||||
parts.append((start, w, s))
|
||||
part_rels.append("HasA")
|
||||
|
||||
if len(parts) < 2:
|
||||
return None, None
|
||||
|
||||
random.shuffle(parts)
|
||||
x_word = _readable(parts[0][0])
|
||||
y_word = _readable(parts[1][0])
|
||||
combined = list(zip(parts, part_rels))
|
||||
random.shuffle(combined)
|
||||
parts, part_rels = zip(*combined)
|
||||
|
||||
x_edge = parts[0]
|
||||
x_word = _readable(x_edge[0])
|
||||
y_edge = parts[1]
|
||||
y_word = _readable(y_edge[0])
|
||||
|
||||
chain_edges = [
|
||||
{"start": z, "relation": part_rels[0], "end": x_edge[0], "weight": x_edge[1], "surface_text": x_edge[2]},
|
||||
{"start": z, "relation": part_rels[1], "end": y_edge[0], "weight": y_edge[1], "surface_text": y_edge[2]},
|
||||
]
|
||||
|
||||
consume_verbs = ["eat", "drink", "take", "pick", "use up", "grab"]
|
||||
verb = random.choice(consume_verbs)
|
||||
|
|
@ -456,6 +520,7 @@ class HypocriticalComplaint(MetaTemplate):
|
|||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{x_word} PartOf/HasA {z}; {y_word} PartOf/HasA {z}",
|
||||
"chain_edges": chain_edges,
|
||||
"slots": {"Z": z, "X": x_word, "Y": y_word, "verb": verb},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -480,19 +545,25 @@ class TautologicalWisdom(MetaTemplate):
|
|||
return None, None
|
||||
|
||||
# seed HasPrerequisite/Causes something
|
||||
# Store (x_word, y_word, weight, edge_info) where edge_info captures the raw edge
|
||||
chains = []
|
||||
for (target, w, s) in self.graph.edges.get((seed, "HasPrerequisite"), []):
|
||||
chains.append((_readable(target), seed, w)) # X=prereq, Y=seed
|
||||
chains.append((_readable(target), seed, w,
|
||||
{"start": seed, "relation": "HasPrerequisite", "end": target, "weight": w, "surface_text": s}))
|
||||
for (target, w, s) in self.graph.edges.get((seed, "Causes"), []):
|
||||
chains.append((seed, _readable(target), w)) # X=seed, Y=effect
|
||||
chains.append((seed, _readable(target), w,
|
||||
{"start": seed, "relation": "Causes", "end": target, "weight": w, "surface_text": s}))
|
||||
# Also: what does seed require?
|
||||
for (source, w, s) in self.graph.reverse.get((seed, "HasPrerequisite"), []):
|
||||
chains.append((seed, _readable(source), w))
|
||||
chains.append((seed, _readable(source), w,
|
||||
{"start": source, "relation": "HasPrerequisite", "end": seed, "weight": w, "surface_text": s}))
|
||||
|
||||
if not chains:
|
||||
return None, None
|
||||
|
||||
x_word, y_word, _ = random.choice(chains)
|
||||
choice = random.choice(chains)
|
||||
x_word, y_word = choice[0], choice[1]
|
||||
chain_edge = choice[3]
|
||||
|
||||
template = self._pick_template()
|
||||
saying = template.format(X=x_word, Y=y_word)
|
||||
|
|
@ -501,6 +572,7 @@ class TautologicalWisdom(MetaTemplate):
|
|||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{x_word} -> {y_word} (prerequisite/cause)",
|
||||
"chain_edges": [chain_edge],
|
||||
"slots": {"X": x_word, "Y": y_word},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -543,15 +615,22 @@ class FalseEquivalence(MetaTemplate):
|
|||
a_props = _short_concepts(self.graph.neighbors(a, "HasProperty"), max_words=2)
|
||||
b_props = set(p[0] for p in self.graph.neighbors(b_word, "HasProperty"))
|
||||
|
||||
chain_edges = []
|
||||
differentiators = [p for p in a_props if p[0] not in b_props]
|
||||
if differentiators:
|
||||
p_word = _readable(random.choice(differentiators)[0])
|
||||
p_edge = random.choice(differentiators)
|
||||
p_word = _readable(p_edge[0])
|
||||
chain_edges.append({"start": a, "relation": "HasProperty", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
|
||||
elif a_props:
|
||||
p_word = _readable(random.choice(a_props)[0])
|
||||
p_edge = random.choice(a_props)
|
||||
p_word = _readable(p_edge[0])
|
||||
chain_edges.append({"start": a, "relation": "HasProperty", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
|
||||
else:
|
||||
a_caps = self.graph.neighbors(a, "CapableOf")
|
||||
if a_caps:
|
||||
p_word = _readable(random.choice(a_caps)[0])
|
||||
p_edge = random.choice(a_caps)
|
||||
p_word = _readable(p_edge[0])
|
||||
chain_edges.append({"start": a, "relation": "CapableOf", "end": p_edge[0], "weight": p_edge[1], "surface_text": p_edge[2]})
|
||||
else:
|
||||
p_word = random.choice(["ambition", "an attitude", "a plan", "patience"])
|
||||
|
||||
|
|
@ -562,6 +641,7 @@ class FalseEquivalence(MetaTemplate):
|
|||
"template_family": self.id,
|
||||
"template": template,
|
||||
"chain": f"{a} IsA same category as {b_word}; {a} HasProperty {p_word}",
|
||||
"chain_edges": chain_edges,
|
||||
"slots": {"A": a, "B": b_word, "P": p_word},
|
||||
}
|
||||
return saying, debug
|
||||
|
|
@ -621,7 +701,10 @@ TEMPLATE_REGISTRY = {
|
|||
|
||||
def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
|
||||
debug=False, max_retries=20):
|
||||
"""Generate a single folksy saying."""
|
||||
"""Generate a single folksy saying.
|
||||
|
||||
When debug=True, always returns (saying, debug_dict) with chain_edges included.
|
||||
"""
|
||||
for _ in range(max_retries):
|
||||
if template_id:
|
||||
tid = template_id
|
||||
|
|
@ -631,7 +714,7 @@ def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
|
|||
cls = TEMPLATE_REGISTRY.get(tid)
|
||||
if not cls:
|
||||
print(f"Unknown template: {tid}", file=sys.stderr)
|
||||
return None
|
||||
return None, None
|
||||
|
||||
tmpl = cls(graph)
|
||||
saying, dbg = tmpl.generate(seed_word=seed_word, seed_category=seed_category)
|
||||
|
|
@ -643,6 +726,16 @@ def generate_one(graph, template_id=None, seed_word=None, seed_category=None,
|
|||
return None, None
|
||||
|
||||
|
||||
def _get_seed_word(dbg):
|
||||
"""Extract the primary seed word from debug slots for dedup tracking."""
|
||||
slots = dbg.get("slots", {})
|
||||
# Templates use different slot names for the seed
|
||||
for key in ("A", "Z", "seed", "X"):
|
||||
if key in slots:
|
||||
return slots[key]
|
||||
return None
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Generate folksy fake-proverbs using ConceptNet relationships."
|
||||
|
|
@ -655,8 +748,13 @@ def main():
|
|||
parser.add_argument("--count", "-n", type=int, default=1, help="Number of sayings to generate")
|
||||
parser.add_argument("--output", "-o", help="Output file (default: stdout)")
|
||||
parser.add_argument("--debug", "-d", action="store_true", help="Show relationship chain debug info")
|
||||
parser.add_argument("--json", action="store_true", help="Output JSONL format with full metadata")
|
||||
parser.add_argument("--vocab", help="Path to folksy_vocab.csv")
|
||||
parser.add_argument("--relations", help="Path to folksy_relations.csv")
|
||||
parser.add_argument("--pure-conceptnet", action="store_true",
|
||||
help="Skip loading augmented relations file")
|
||||
parser.add_argument("--llm-weight-boost", type=float, default=0.0,
|
||||
help="Amount added to the weight of LLM-augmented edges whose weight is below 1.0, capped at 1.0 (default: 0.0)")
|
||||
parser.add_argument("--list-templates", action="store_true", help="List available templates")
|
||||
parser.add_argument("--list-categories", action="store_true", help="List available categories")
|
||||
|
||||
|
|
@ -679,6 +777,30 @@ def main():
|
|||
print("Run scripts/extract_from_conceptnet.py first to generate data files.", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Load augmented relations if available
|
||||
if not args.pure_conceptnet:
|
||||
augmented_path = DATA_DIR / "folksy_relations_augmented.csv"
|
||||
if augmented_path.exists():
|
||||
boost = args.llm_weight_boost
|
||||
with open(augmented_path, newline="", encoding="utf-8") as f:
|
||||
reader = csv.DictReader(f)
|
||||
count = 0
|
||||
for row in reader:
|
||||
sw = row["start_word"]
|
||||
ew = row["end_word"]
|
||||
rel = row["relation"]
|
||||
w = float(row["weight"])
|
||||
if w < 1.0 and boost:
|
||||
w = min(w + boost, 1.0)
|
||||
surf = row.get("surface_text", "")
|
||||
graph.edges[(sw, rel)].append((ew, w, surf))
|
||||
graph.reverse[(ew, rel)].append((sw, w, surf))
|
||||
graph.all_edges[sw].append((ew, rel, w))
|
||||
graph.all_edges[ew].append((sw, rel, w))
|
||||
count += 1
|
||||
if count:
|
||||
print(f"Loaded {count} augmented edges.", file=sys.stderr)
|
||||
|
||||
if args.list_categories:
|
||||
for cat in sorted(graph.by_category.keys()):
|
||||
print(f" {cat:20s} ({len(graph.by_category[cat])} words)")
|
||||
|
|
@ -688,26 +810,96 @@ def main():
|
|||
if args.entities:
|
||||
graph.merge_fictional(args.entities)
|
||||
|
||||
# JSON mode implies debug internally
|
||||
use_debug = args.debug or args.json
|
||||
|
||||
# Generate
|
||||
out = open(args.output, "w", encoding="utf-8") if args.output else sys.stdout
|
||||
try:
|
||||
for i in range(args.count):
|
||||
if args.count > 1:
|
||||
# Deduplication tracking for batch mode
|
||||
seen_text = set()
|
||||
seen_slots = set()
|
||||
seed_usage = defaultdict(int)
|
||||
generated = 0
|
||||
max_outer_attempts = args.count * 10 # generous outer limit
|
||||
attempts = 0
|
||||
|
||||
while generated < args.count and attempts < max_outer_attempts:
|
||||
attempts += 1
|
||||
saying, dbg = generate_one(
|
||||
graph,
|
||||
template_id=args.template,
|
||||
seed_word=args.seed,
|
||||
seed_category=args.category,
|
||||
debug=use_debug,
|
||||
)
|
||||
if not saying:
|
||||
continue
|
||||
|
||||
# Dedup checks (a rejection uses up one outer attempt, but not generate_one's internal retries)
|
||||
if saying in seen_text:
|
||||
continue
|
||||
|
||||
if dbg:
|
||||
slots_key = (dbg["template_family"], frozenset(dbg["slots"].items()))
|
||||
if slots_key in seen_slots:
|
||||
continue
|
||||
|
||||
seed_w = _get_seed_word(dbg)
|
||||
if seed_w and seed_usage[seed_w] >= 30:
|
||||
continue
|
||||
if seed_w:
|
||||
seed_usage[seed_w] += 1
|
||||
seen_slots.add(slots_key)
|
||||
|
||||
seen_text.add(saying)
|
||||
generated += 1
|
||||
|
||||
if args.json and dbg:
|
||||
record = {
|
||||
"raw_text": saying,
|
||||
"meta_template": dbg["template_family"],
|
||||
"surface_template": dbg["template"],
|
||||
"slots": dbg["slots"],
|
||||
"chain": dbg.get("chain_edges", []),
|
||||
}
|
||||
out.write(json.dumps(record, ensure_ascii=False) + "\n")
|
||||
else:
|
||||
out.write(saying + "\n")
|
||||
if args.debug and dbg:
|
||||
out.write(f" [DEBUG] family={dbg['template_family']}\n")
|
||||
out.write(f" [DEBUG] chain: {dbg['chain']}\n")
|
||||
out.write(f" [DEBUG] slots: {dbg['slots']}\n")
|
||||
out.write("\n")
|
||||
else:
|
||||
# Single generation (no dedup needed)
|
||||
saying, dbg = generate_one(
|
||||
graph,
|
||||
template_id=args.template,
|
||||
seed_word=args.seed,
|
||||
seed_category=args.category,
|
||||
debug=args.debug,
|
||||
debug=use_debug,
|
||||
)
|
||||
if saying:
|
||||
out.write(saying + "\n")
|
||||
if args.debug and dbg:
|
||||
out.write(f" [DEBUG] family={dbg['template_family']}\n")
|
||||
out.write(f" [DEBUG] chain: {dbg['chain']}\n")
|
||||
out.write(f" [DEBUG] slots: {dbg['slots']}\n")
|
||||
out.write("\n")
|
||||
if args.json and dbg:
|
||||
record = {
|
||||
"raw_text": saying,
|
||||
"meta_template": dbg["template_family"],
|
||||
"surface_template": dbg["template"],
|
||||
"slots": dbg["slots"],
|
||||
"chain": dbg.get("chain_edges", []),
|
||||
}
|
||||
out.write(json.dumps(record, ensure_ascii=False) + "\n")
|
||||
else:
|
||||
out.write(saying + "\n")
|
||||
if args.debug and dbg:
|
||||
out.write(f" [DEBUG] family={dbg['template_family']}\n")
|
||||
out.write(f" [DEBUG] chain: {dbg['chain']}\n")
|
||||
out.write(f" [DEBUG] slots: {dbg['slots']}\n")
|
||||
out.write("\n")
|
||||
else:
|
||||
out.write(f"(failed to generate saying #{i+1} after retries)\n")
|
||||
out.write("(failed to generate saying after retries)\n")
|
||||
finally:
|
||||
if args.output:
|
||||
out.close()
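The batch-mode deduplication in this hunk can be read as three independent filters: exact text, template family plus slot assignment, and a per-seed usage cap. The following is a minimal sketch of that idea, not the committed code. `dedup_batch`, its `generate` callable, and the sample sayings are hypothetical names and data introduced only for illustration; the slot keys `("A", "Z", "seed", "X")` and the cap of 30 are taken from the diff above.

```python
from collections import defaultdict
from itertools import cycle


def dedup_batch(generate, count, max_per_seed=30, attempt_factor=10):
    """Collect up to `count` unique sayings from `generate()`.

    `generate` is a hypothetical zero-argument callable that returns
    (saying, debug_dict) or (None, None), mirroring generate_one().
    """
    seen_text, seen_slots = set(), set()
    seed_usage = defaultdict(int)
    results = []
    attempts = 0
    # The outer limit stops the loop when the generator runs out of new material.
    while len(results) < count and attempts < count * attempt_factor:
        attempts += 1
        saying, dbg = generate()
        if not saying or saying in seen_text:
            continue  # failed generation or exact text duplicate
        if dbg:
            key = (dbg["template_family"], frozenset(dbg["slots"].items()))
            if key in seen_slots:
                continue  # same family and same slot fill, different wording
            seed = next((dbg["slots"][k] for k in ("A", "Z", "seed", "X")
                         if k in dbg["slots"]), None)
            if seed and seed_usage[seed] >= max_per_seed:
                continue  # this seed word is already over-represented
            if seed:
                seed_usage[seed] += 1
            seen_slots.add(key)
        seen_text.add(saying)
        results.append(saying)
    return results


# Made-up sample output standing in for the template engine.
_horse = {"template_family": "false_equivalence", "slots": {"A": "cow", "B": "horse"}}
_goat = {"template_family": "false_equivalence", "slots": {"A": "cow", "B": "goat"}}
_samples = [
    ("A cow ain't a horse.", _horse),
    ("A cow ain't a horse.", _horse),   # exact duplicate
    ("A cow is no horse.", _horse),     # new text, same slots
    ("A cow ain't a goat.", _goat),
    (None, None),                       # generation failure
]

source = cycle(_samples)
batch = dedup_batch(lambda: next(source), count=5, max_per_seed=2)

capped_source = cycle(_samples)
capped = dedup_batch(lambda: next(capped_source), count=5, max_per_seed=1)
```

Because every rejection still increments `attempts`, a small vocabulary or a narrow `--seed` can end the loop with fewer sayings than requested, which is what happens to both `batch` and `capped` here.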
|
||||
|
|
|
|||
213
scripts/compute_corpus_stats.py
Normal file
|
|
@ -0,0 +1,213 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Compute corpus statistics and validation metrics.
|
||||
|
||||
Reads corpus files and computes counts, distributions, coverage, and balance warnings.
|
||||
|
||||
Usage:
|
||||
python scripts/compute_corpus_stats.py
|
||||
python scripts/compute_corpus_stats.py --corpus-dir corpus/
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import sys
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_DIR = Path(__file__).parent
|
||||
PROJECT_DIR = SCRIPT_DIR.parent
|
||||
DATA_DIR = PROJECT_DIR / "data"
|
||||
|
||||
|
||||
def load_jsonl(path):
|
||||
"""Load a JSONL file."""
|
||||
entries = []
|
||||
if not path.exists():
|
||||
return entries
|
||||
with open(path, encoding="utf-8") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line:
|
||||
entries.append(json.loads(line))
|
||||
return entries
|
||||
|
||||
|
||||
def classify_input_type(inp):
|
||||
"""Classify the input framing type of a training pair."""
|
||||
if inp.startswith("Tell me something about"):
|
||||
return "word_seeded"
|
||||
elif inp.startswith("Tell me a saying about"):
|
||||
return "category_seeded"
|
||||
elif inp.startswith("What would a"):
|
||||
return "persona_seeded"
|
||||
elif inp.startswith("Give me a") and "proverb" in inp and not inp.startswith("Give me a proverb"):
|
||||
return "template_seeded"
|
||||
elif any(inp.startswith(p) for p in [
|
||||
"Tell me some folk", "What do they", "Give me a proverb",
|
||||
"Share some", "What's a good"
|
||||
]):
|
||||
return "open_ended"
|
||||
else:
|
||||
return "fictional"
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Compute corpus statistics.")
|
||||
parser.add_argument("--corpus-dir", default=str(PROJECT_DIR / "corpus"),
|
||||
help="Corpus directory")
|
||||
parser.add_argument("--output", default=None,
|
||||
help="Output JSON file (default: corpus_dir/corpus_stats.json)")
|
||||
args = parser.parse_args()
|
||||
|
||||
corpus_dir = Path(args.corpus_dir)
|
||||
output_path = Path(args.output) if args.output else corpus_dir / "corpus_stats.json"
|
||||
|
||||
# Load all corpus files
|
||||
raw = load_jsonl(corpus_dir / "corpus_raw.jsonl")
|
||||
polished = load_jsonl(corpus_dir / "corpus_polished.jsonl")
|
||||
filtered = load_jsonl(corpus_dir / "corpus_filtered.jsonl")
|
||||
training = load_jsonl(corpus_dir / "training_pairs.jsonl")
|
||||
|
||||
# Load vocab for coverage analysis
|
||||
vocab_words = set()
|
||||
vocab_path = DATA_DIR / "folksy_vocab.csv"
|
||||
if vocab_path.exists():
|
||||
with open(vocab_path, newline="", encoding="utf-8") as f:
|
||||
for row in csv.DictReader(f):
|
||||
vocab_words.add(row["word"])
|
||||
|
||||
stats = {}
|
||||
|
||||
# --- Raw corpus stats ---
|
||||
stats["raw_count"] = len(raw)
|
||||
raw_by_template = Counter(e.get("meta_template", "unknown") for e in raw)
|
||||
stats["raw_by_template"] = dict(sorted(raw_by_template.items()))
|
||||
|
||||
# --- Polish stats ---
|
||||
polished_entries = [e for e in polished if e.get("status") == "polished"]
|
||||
discarded_entries = [e for e in polished if e.get("status") == "discarded"]
|
||||
error_entries = [e for e in polished if e.get("status") == "error"]
|
||||
|
||||
stats["polished_count"] = len(polished_entries)
|
||||
stats["discarded_during_polish"] = len(discarded_entries)
|
||||
stats["errors_during_polish"] = len(error_entries)
|
||||
if polished_entries or discarded_entries:
|
||||
total_processed = len(polished_entries) + len(discarded_entries)
|
||||
stats["polish_discard_rate"] = f"{len(discarded_entries)/total_processed*100:.1f}%"
|
||||
|
||||
polish_by_template = Counter(e.get("meta_template", "unknown") for e in polished_entries)
|
||||
stats["polished_by_template"] = dict(sorted(polish_by_template.items()))
|
||||
|
||||
discard_by_template = Counter(e.get("meta_template", "unknown") for e in discarded_entries)
|
||||
stats["discarded_by_template"] = dict(sorted(discard_by_template.items()))
|
||||
|
||||
# --- Filter stats ---
|
||||
stats["filtered_count"] = len(filtered)
|
||||
|
||||
filter_by_template = Counter(e.get("meta_template", "unknown") for e in filtered)
|
||||
stats["filtered_by_template"] = dict(sorted(filter_by_template.items()))
|
||||
|
||||
# Filter discard count
|
||||
stats["discarded_during_filter"] = len(polished_entries) - len(filtered)
|
||||
|
||||
# --- Training pairs stats ---
|
||||
stats["training_pair_count"] = len(training)
|
||||
|
||||
training_by_template = Counter(e.get("meta_template", "unknown") for e in training)
|
||||
stats["training_by_template"] = dict(sorted(training_by_template.items()))
|
||||
|
||||
input_type_counts = Counter(classify_input_type(e.get("input", "")) for e in training)
|
||||
stats["training_by_input_type"] = dict(sorted(input_type_counts.items()))
|
||||
|
||||
# --- Coverage analysis ---
|
||||
used_words = set()
|
||||
for entry in filtered:
|
||||
slots = entry.get("slots", {})
|
||||
for v in slots.values():
|
||||
word = v.lower().replace(" ", "_")
|
||||
if word in vocab_words:
|
||||
used_words.add(word)
|
||||
|
||||
stats["unique_slot_words_used"] = len(used_words)
|
||||
stats["total_vocab_words"] = len(vocab_words)
|
||||
stats["vocab_coverage"] = f"{len(used_words)/len(vocab_words)*100:.1f}%" if vocab_words else "N/A"
|
||||
|
||||
never_used = sorted(vocab_words - used_words)
|
||||
stats["words_never_used"] = never_used
|
||||
stats["words_never_used_count"] = len(never_used)
|
||||
|
||||
# --- Saying length stats ---
|
||||
lengths = []
|
||||
for entry in filtered:
|
||||
text = entry.get("polished_text", "")
|
||||
if text:
|
||||
lengths.append(len(text.split()))
|
||||
|
||||
if lengths:
|
||||
stats["avg_saying_length_words"] = round(sum(lengths) / len(lengths), 1)
|
||||
stats["min_saying_length_words"] = min(lengths)
|
||||
stats["max_saying_length_words"] = max(lengths)
|
||||
|
||||
# --- Balance warnings ---
|
||||
warnings = []
|
||||
if filtered:
|
||||
total_filtered = len(filtered)
|
||||
for template, count in filter_by_template.items():
|
||||
pct = count / total_filtered * 100
|
||||
if pct < 10:
|
||||
warnings.append(
|
||||
f"WARNING: {template} has only {count} entries ({pct:.1f}%) — "
|
||||
f"below 10% threshold. Generate more raw sayings for this family."
|
||||
)
|
||||
|
||||
if training:
|
||||
total_training = len(training)
|
||||
for template, count in training_by_template.items():
|
||||
pct = count / total_training * 100
|
||||
if pct < 5:
|
||||
warnings.append(
|
||||
f"WARNING: {template} has only {count} training pairs ({pct:.1f}%) — very underrepresented."
|
||||
)
|
||||
|
||||
stats["balance_warnings"] = warnings
|
||||
|
||||
# --- Write output ---
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, "w", encoding="utf-8") as f:
|
||||
json.dump(stats, f, indent=2, ensure_ascii=False)
|
||||
|
||||
# --- Print summary ---
|
||||
print("=" * 60)
|
||||
print("CORPUS STATISTICS")
|
||||
print("=" * 60)
|
||||
|
||||
print(f"\nRaw sayings: {stats['raw_count']}")
|
||||
print(f"Polished sayings: {stats['polished_count']}")
|
||||
print(f"Discarded (polish): {stats.get('discarded_during_polish', 0)} ({stats.get('polish_discard_rate', 'N/A')})")
|
||||
print(f"Discarded (filter): {stats.get('discarded_during_filter', 0)}")
|
||||
print(f"Final filtered: {stats['filtered_count']}")
|
||||
print(f"Training pairs: {stats['training_pair_count']}")
|
||||
|
||||
print(f"\nDistribution by meta-template (filtered):")
|
||||
for t, c in sorted(filter_by_template.items()):
|
||||
pct = c / len(filtered) * 100 if filtered else 0
|
||||
print(f" {t:30s} {c:5d} ({pct:5.1f}%)")
|
||||
|
||||
print(f"\nDistribution by input framing type:")
|
||||
for t, c in sorted(input_type_counts.items()):
|
||||
print(f" {t:20s} {c:5d}")
|
||||
|
||||
print(f"\nVocab coverage: {stats['vocab_coverage']} ({stats['unique_slot_words_used']}/{stats['total_vocab_words']})")
|
||||
print(f"Average saying length: {stats.get('avg_saying_length_words', 'N/A')} words")
|
||||
|
||||
if warnings:
|
||||
print(f"\nBalance warnings:")
|
||||
for w in warnings:
|
||||
print(f" {w}")
|
||||
|
||||
print(f"\nFull stats: {output_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
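The balance check in this script flags any meta-template family whose share of the corpus falls below a fixed percentage (10% for filtered sayings, 5% for training pairs). A minimal sketch of that calculation follows. `balance_warnings` is a hypothetical helper, and the counts are made up for illustration rather than taken from a real run.

```python
from collections import Counter


def balance_warnings(counts, min_share_pct):
    """Return one warning string per family whose share is below min_share_pct."""
    total = sum(counts.values())
    warnings = []
    for family, n in sorted(counts.items()):
        pct = n / total * 100 if total else 0.0
        if pct < min_share_pct:
            warnings.append(
                f"{family}: {n} entries ({pct:.1f}%) below {min_share_pct}%"
            )
    return warnings


# Hypothetical counts, not real corpus numbers.
demo_counts = Counter({
    "deconstruction": 480,
    "false_equivalence": 450,
    "tautological_wisdom": 70,
})
demo_warnings = balance_warnings(demo_counts, min_share_pct=10)
print(demo_warnings)
```

One consequence of counting from the entries themselves: a family with zero surviving entries never appears in the `Counter`, so it produces no warning. Comparing the counter's keys against the known list of families would catch that case.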
|
||||
787
scripts/enhance_graph.py
Normal file
|
|
@ -0,0 +1,787 @@
|
|||
#!/usr/bin/env python3
|
||||
"""LLM-augmented graph enhancement for the folksy subgraph.
|
||||
|
||||
Three phases:
|
||||
Phase 1: Per-word relationship expansion
|
||||
Phase 2: Cross-word bridge discovery
|
||||
Phase 3: Property enrichment for false_equivalence templates
|
||||
|
||||
Usage:
|
||||
python scripts/enhance_graph.py --phase 1 # Run phase 1 only
|
||||
python scripts/enhance_graph.py --phase 2 # Run phase 2 only
|
||||
python scripts/enhance_graph.py --phase 3 # Run phase 3 only
|
||||
python scripts/enhance_graph.py --all # Run all phases
|
||||
python scripts/enhance_graph.py --phase 1 --dry-run # Print prompts without calling LLM
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from collections import defaultdict
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
# Paths
|
||||
SCRIPT_DIR = Path(__file__).parent
|
||||
PROJECT_DIR = SCRIPT_DIR.parent
|
||||
DATA_DIR = PROJECT_DIR / "data"
|
||||
|
||||
LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
|
||||
LLM_MODEL = "THUDM-GLM4-32B"
|
||||
|
||||
VALID_RELATIONS = {
|
||||
"AtLocation", "MadeOf", "PartOf", "UsedFor", "HasA", "HasProperty",
|
||||
"Causes", "HasPrerequisite", "CapableOf", "ReceivesAction", "Desires",
|
||||
"CausesDesire", "LocatedNear", "CreatedBy", "MotivatedByGoal", "HasSubevent",
|
||||
}
|
||||
|
||||
AUGMENTED_CSV = DATA_DIR / "folksy_relations_augmented.csv"
|
||||
CANDIDATE_CSV = DATA_DIR / "candidate_additions.csv"
|
||||
LOG_CSV = DATA_DIR / "enhancement_log.csv"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Infrastructure
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def llm_chat_completion(messages, max_retries=3):
|
||||
"""Chat completion with retry logic."""
|
||||
import requests
|
||||
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
resp = requests.post(LLM_ENDPOINT, json={
|
||||
"model": LLM_MODEL,
|
||||
"messages": messages,
|
||||
}, timeout=120)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
return data["choices"][0]["message"]["content"]
|
||||
except Exception as e:
|
||||
wait = (2 ** attempt)
|
||||
print(f" LLM call failed (attempt {attempt+1}/{max_retries}): {e}", file=sys.stderr)
|
||||
if attempt < max_retries - 1:
|
||||
print(f" Retrying in {wait}s...", file=sys.stderr)
|
||||
time.sleep(wait)
|
||||
else:
|
||||
print(f" Giving up on this word.", file=sys.stderr)
|
||||
return None
|
||||
|
||||
|
||||
def load_vocab():
|
||||
"""Load folksy vocabulary."""
|
||||
vocab = {}
|
||||
with open(DATA_DIR / "folksy_vocab.csv", newline="", encoding="utf-8") as f:
|
||||
for row in csv.DictReader(f):
|
||||
word = row["word"]
|
||||
cats = [c.strip() for c in row["categories"].split(",") if c.strip()]
|
||||
vocab[word] = {
|
||||
"categories": cats,
|
||||
"tangibility": float(row.get("tangibility_score", 0)),
|
||||
"edge_count": int(row.get("conceptnet_edge_count", 0)),
|
||||
}
|
||||
return vocab
|
||||
|
||||
|
||||
def load_relations():
|
||||
"""Load existing relations (ConceptNet + any existing augmented)."""
|
||||
edges = defaultdict(list) # (start, relation) -> [(end, weight, surface)]
|
||||
existing_triples = set() # (start, end, relation) for dedup
|
||||
|
||||
for path in [DATA_DIR / "folksy_relations.csv", AUGMENTED_CSV]:
|
||||
if not path.exists():
|
||||
continue
|
||||
with open(path, newline="", encoding="utf-8") as f:
|
||||
for row in csv.DictReader(f):
|
||||
sw = row["start_word"]
|
||||
ew = row["end_word"]
|
||||
rel = row["relation"]
|
||||
if not row["weight"]: continue  # skip rows with an empty weight field (likely a truncated or corrupt line)
|
||||
w = float(row["weight"])
|
||||
surf = row.get("surface_text", "")
|
||||
edges[(sw, rel)].append((ew, w, surf))
|
||||
existing_triples.add((sw, ew, rel))
|
||||
|
||||
return edges, existing_triples
|
||||
|
||||
|
||||
def load_checkpoint():
|
||||
"""Load enhancement log to determine what's already been processed."""
|
||||
processed = set() # (word, phase)
|
||||
if LOG_CSV.exists():
|
||||
with open(LOG_CSV, newline="", encoding="utf-8") as f:
|
||||
for row in csv.DictReader(f):
|
||||
processed.add((row["source_word"], row["phase"]))
|
||||
return processed
|
||||
|
||||
|
||||
def append_log(word, phase, edges_generated, edges_accepted, edges_duplicate, edges_oov):
|
||||
"""Append a row to the enhancement log."""
|
||||
write_header = not LOG_CSV.exists()
|
||||
with open(LOG_CSV, "a", newline="", encoding="utf-8") as f:
|
||||
writer = csv.writer(f)
|
||||
if write_header:
|
||||
writer.writerow(["source_word", "phase", "timestamp",
|
||||
"edges_generated", "edges_accepted", "edges_duplicate", "edges_oov"])
|
||||
writer.writerow([word, phase, datetime.now().isoformat(),
|
||||
edges_generated, edges_accepted, edges_duplicate, edges_oov])
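The checkpoint mechanism is an append-only CSV log: each processed word is written with its phase, and a restart rebuilds the set of `(word, phase)` keys from the file. The sketch below reproduces that round trip in a temporary directory. `load_done` and `mark_done` are hypothetical stand-ins for `load_checkpoint` and `append_log`, reduced to the two columns that matter for resuming.

```python
import csv
import tempfile
from pathlib import Path


def load_done(log_path):
    """Rebuild the set of (word, phase) keys from the log, if it exists."""
    done = set()
    if log_path.exists():
        with open(log_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                done.add((row["source_word"], row["phase"]))
    return done


def mark_done(log_path, word, phase):
    """Append one row, writing the header only when the file is new."""
    write_header = not log_path.exists()
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["source_word", "phase"])
        writer.writerow([word, phase])


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "enhancement_log.csv"
    mark_done(log, "horse", 1)           # phase passed as an int
    mark_done(log, "barn", "1")
    done = load_done(log)
    string_key_found = ("horse", "1") in done
    int_key_found = ("horse", 1) in done
```

CSV stores everything as text, so the phase comes back as the string `"1"` even when it was written as an integer. That is consistent with the phase code checking `(word, "1")` rather than `(word, 1)`.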
|
||||
|
||||
|
||||
def append_augmented_edges(edges):
|
||||
"""Append edges to the augmented relations CSV."""
|
||||
write_header = not AUGMENTED_CSV.exists()
|
||||
with open(AUGMENTED_CSV, "a", newline="", encoding="utf-8") as f:
|
||||
writer = csv.writer(f)
|
||||
if write_header:
|
||||
writer.writerow(["start_word", "end_word", "relation", "weight", "surface_text", "source"])
|
||||
for e in edges:
|
||||
writer.writerow([e["start_word"], e["end_word"], e["relation"],
|
||||
e["weight"], e["surface_text"], e["source"]])
|
||||
|
||||
|
||||
def append_candidates(candidates):
|
||||
"""Append candidate words to the candidate additions CSV."""
|
||||
write_header = not CANDIDATE_CSV.exists()
|
||||
with open(CANDIDATE_CSV, "a", newline="", encoding="utf-8") as f:
|
||||
writer = csv.writer(f)
|
||||
if write_header:
|
||||
writer.writerow(["word", "suggested_by", "relation_context", "frequency"])
|
||||
for c in candidates:
|
||||
writer.writerow([c["word"], c["suggested_by"], c["relation_context"], c["frequency"]])
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Parsing
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def parse_llm_relations(response_text, source_word):
|
||||
"""Parse structured LLM output into edge dicts.
|
||||
|
||||
Handles bullets, numbering, extra whitespace, multi-word targets.
|
||||
"""
|
||||
edges = []
|
||||
if not response_text:
|
||||
return edges
|
||||
|
||||
for line in response_text.strip().split("\n"):
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
|
||||
# Strip leading bullets/numbers: "- ", "1. ", "* ", etc.
|
||||
line = re.sub(r"^[\d]+[.)]\s*", "", line)
|
||||
line = re.sub(r"^[-*•]\s*", "", line)
|
||||
line = line.strip()
|
||||
|
||||
if not line or "NONE" in line.upper():
|
||||
continue
|
||||
|
||||
# Match: RELATION_TYPE: target_word(s) | surface text
|
||||
match = re.match(r"^(\w+):\s*(.+?)\s*\|\s*(.+)$", line)
|
||||
if not match:
|
||||
continue
|
||||
|
||||
relation, target_raw, surface = match.groups()
|
||||
relation = relation.strip()
|
||||
|
||||
if relation not in VALID_RELATIONS:
|
||||
continue
|
||||
|
||||
# Normalize target: lowercase, replace spaces with underscores for multi-word
|
||||
target = target_raw.strip().lower()
|
||||
target = re.sub(r"\s+", "_", target)
|
||||
|
||||
# Skip self-loops
|
||||
if target == source_word:
|
||||
continue
|
||||
|
||||
edges.append({
|
||||
"start_word": source_word,
|
||||
"end_word": target,
|
||||
"relation": relation,
|
||||
"weight": 0.8,
|
||||
"surface_text": surface.strip(),
|
||||
"source": "llm_augmented",
|
||||
})
|
||||
|
||||
return edges
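The line format expected from the model is `RELATION_TYPE: target | phrasing`, with optional bullets or numbering in front. The sketch below isolates the regular expressions used above to show which lines survive and how targets are normalized. `parse_line` is a hypothetical helper; it deliberately leaves out the relation whitelist, the NONE check, and the self-loop check that the full function applies.

```python
import re

# Same pattern as in parse_llm_relations: relation, lazy target, then surface text.
LINE_RE = re.compile(r"^(\w+):\s*(.+?)\s*\|\s*(.+)$")


def parse_line(line):
    """Return (relation, target, surface), or None if the line does not match."""
    line = re.sub(r"^[\d]+[.)]\s*", "", line.strip())   # strip "1. " or "2) "
    line = re.sub(r"^[-*•]\s*", "", line).strip()       # strip "- ", "* ", "• "
    match = LINE_RE.match(line)
    if not match:
        return None
    relation, target_raw, surface = match.groups()
    # Multi-word targets are lowercased and joined with underscores.
    target = re.sub(r"\s+", "_", target_raw.strip().lower())
    return relation, target, surface.strip()


numbered = parse_line("1. AtLocation: barn | you find a horse in a barn")
multiword = parse_line("- UsedFor: Pulling Carts | a horse is used for pulling carts")
no_pipe = parse_line("MadeOf: NONE")
```

A line such as `MadeOf: NONE` has no `|` separator, so the pattern alone already rejects it. Lines that omit the phrasing are dropped for the same reason, which means the model has to follow the two-field format for an edge to be recorded.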
|
||||
|
||||
|
||||
def parse_bridge_response(response_text, word_a, word_b):
|
||||
"""Parse bridge discovery LLM output."""
|
||||
edges = []
|
||||
if not response_text:
|
||||
return edges
|
||||
|
||||
for line in response_text.strip().split("\n"):
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
|
||||
# Strip common prefixes
|
||||
line = re.sub(r"^[\d]+[.)]\s*", "", line)
|
||||
line = re.sub(r"^[-*•]\s*", "", line)
|
||||
line = re.sub(r"^BRIDGE:\s*", "", line, flags=re.IGNORECASE)
|
||||
line = line.strip()
|
||||
|
||||
if not line:
|
||||
continue
|
||||
|
||||
# BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
|
||||
parts = [p.strip() for p in line.split("|")]
|
||||
if len(parts) < 3:
|
||||
continue
|
||||
|
||||
bridge_word = parts[0].strip().lower().replace(" ", "_")
|
||||
|
||||
# Parse relation_to_first
|
||||
rel1_match = re.search(r"(?:relation_to_first|first):\s*(\w+)", parts[1], re.IGNORECASE)
|
||||
rel2_match = re.search(r"(?:relation_to_second|second):\s*(\w+)", parts[2], re.IGNORECASE)
|
||||
|
||||
if not rel1_match or not rel2_match:
|
||||
# Try simpler format: just the relation type
|
||||
rel1_match = re.match(r"(\w+)", parts[1].split(":")[-1].strip())
|
||||
rel2_match = re.match(r"(\w+)", parts[2].split(":")[-1].strip())
|
||||
|
||||
if not rel1_match or not rel2_match:
|
||||
continue
|
||||
|
||||
rel1 = rel1_match.group(1)
|
||||
rel2 = rel2_match.group(1)
|
||||
|
||||
if rel1 not in VALID_RELATIONS or rel2 not in VALID_RELATIONS:
|
||||
continue
|
||||
|
||||
explanation = parts[3].strip() if len(parts) > 3 else ""
|
||||
|
||||
# Create edges: word_a -> bridge and bridge -> word_b
|
||||
edges.append({
|
||||
"start_word": word_a,
|
||||
"end_word": bridge_word,
|
||||
"relation": rel1,
|
||||
"weight": 0.8,
|
||||
"surface_text": explanation,
|
||||
"source": "llm_bridge",
|
||||
})
|
||||
edges.append({
|
||||
"start_word": bridge_word,
|
||||
"end_word": word_b,
|
||||
"relation": rel2,
|
||||
"weight": 0.8,
|
||||
"surface_text": explanation,
|
||||
"source": "llm_bridge",
|
||||
})
|
||||
|
||||
return edges
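Each accepted bridge line expands into two directed edges: one from the first word to the bridge, and one from the bridge to the second word. The sketch below shows only that expansion, assuming the well-formed `relation_to_first:` / `relation_to_second:` layout. `bridge_edges` and the sample line are hypothetical and skip the relation validation and fallback parsing done above.

```python
def bridge_edges(word_a, word_b, line):
    """Turn 'bridge | relation_to_first: R1 | relation_to_second: R2 | why' into two edges."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 3:
        return []
    bridge = parts[0].lower().replace(" ", "_")
    rel1 = parts[1].split(":")[-1].strip()
    rel2 = parts[2].split(":")[-1].strip()
    # Edge direction follows the parser above: word_a -> bridge, then bridge -> word_b.
    return [(word_a, rel1, bridge), (bridge, rel2, word_b)]


pair = bridge_edges(
    "horse", "leather",
    "Saddle | relation_to_first: HasA | relation_to_second: MadeOf | a horse wears a leather saddle",
)
too_short = bridge_edges("horse", "leather", "saddle | HasA")
```

Since the second edge always starts at the bridge word, the relation the model names for the second word is read as "bridge RELATION word_b". A prompt that lets the model state the relation in the opposite direction would store a reversed edge.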
|
||||
|
||||
|
||||
def parse_property_response(response_text, word):
|
||||
"""Parse property enrichment LLM output."""
|
||||
edges = []
|
||||
if not response_text:
|
||||
return edges
|
||||
|
||||
for line in response_text.strip().split("\n"):
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
|
||||
line = re.sub(r"^[\d]+[.)]\s*", "", line)
|
||||
line = re.sub(r"^[-*•]\s*", "", line)
|
||||
line = line.strip()
|
||||
|
||||
if not line:
|
||||
continue
|
||||
|
||||
# PROPERTY | explanation
|
||||
parts = [p.strip() for p in line.split("|")]
|
||||
if len(parts) < 1:
|
||||
continue
|
||||
|
||||
prop = parts[0].strip().lower().replace(" ", "_")
|
||||
explanation = parts[1].strip() if len(parts) > 1 else f"{word} is {prop}"
|
||||
|
||||
if not prop or prop == word:
|
||||
continue
|
||||
|
||||
edges.append({
|
||||
"start_word": word,
|
||||
"end_word": prop,
|
||||
"relation": "HasProperty",
|
||||
"weight": 0.8,
|
||||
"surface_text": explanation,
|
||||
"source": "llm_property",
|
||||
})
|
||||
|
||||
return edges
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 1: Per-Word Expansion
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
PHASE1_SYSTEM = """You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
|
||||
|
||||
Rules:
|
||||
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
|
||||
- Every relationship must be something a typical adult would agree is true
|
||||
- Do not repeat any relationship already listed as "known"
|
||||
- Target words should be common English words (top 3000 frequency preferred)
|
||||
- Output ONLY the structured format shown below, one relationship per line
|
||||
- If you cannot think of good relationships for a given type, output NONE for that type
|
||||
- Aim for 3-5 relationships per type where possible
|
||||
|
||||
Output format (one per line):
|
||||
RELATION_TYPE: target_word | short natural phrasing
|
||||
|
||||
Example output:
|
||||
AtLocation: barn | you find a horse in a barn
|
||||
UsedFor: riding | a horse is used for riding
|
||||
HasA: mane | a horse has a mane
|
||||
CapableOf: gallop | a horse can gallop
|
||||
MadeOf: NONE
|
||||
PartOf: herd | a horse is part of a herd"""
|
||||
|
||||
|
||||
PHASE1_USER = """Word: {word}
|
||||
Categories: {categories}
|
||||
|
||||
Known relationships:
|
||||
{existing_edges}
|
||||
|
||||
Generate additional relationships for these types:
|
||||
- AtLocation (where is it found?)
|
||||
- UsedFor (what is it used for?)
|
||||
- HasA (what does it have / contain?)
|
||||
- PartOf (what is it part of?)
|
||||
- CapableOf (what can it do?)
|
||||
- MadeOf (what is it made of?)
|
||||
- HasPrerequisite (what do you need before you can have/use it?)
|
||||
- Causes (what does it cause or lead to?)
|
||||
- HasProperty (what adjectives describe it? — limit to physical/sensory properties)"""
|
||||
|
||||
|
||||
def format_existing_edges(edges_dict, word):
|
||||
"""Format existing edges for a word grouped by relation type."""
|
||||
relation_types = ["AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
|
||||
"MadeOf", "HasPrerequisite", "Causes", "HasProperty"]
|
||||
|
||||
lines = []
|
||||
for rel in relation_types:
|
||||
targets = edges_dict.get((word, rel), [])
|
||||
if targets:
|
||||
formatted = ", ".join(f"{t[0]} (weight {t[1]:.1f})" for t in targets[:10])
|
||||
lines.append(f"{rel}: {formatted}")
|
||||
else:
|
||||
lines.append(f"{rel}: (none in database)")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def run_phase1(vocab, edges, existing_triples, checkpoint, dry_run=False):
|
||||
"""Phase 1: Per-word relationship expansion."""
|
||||
words = sorted(vocab.keys())
|
||||
total = len(words)
|
||||
total_accepted = 0
|
||||
total_skipped = 0
|
||||
|
||||
print(f"Phase 1: Processing {total} words...")
|
||||
|
||||
for i, word in enumerate(words):
|
||||
if (word, "1") in checkpoint:
|
||||
total_skipped += 1
|
||||
continue
|
||||
|
||||
categories = ", ".join(vocab[word]["categories"])
|
||||
existing = format_existing_edges(edges, word)
|
||||
|
||||
user_prompt = PHASE1_USER.format(
|
||||
word=word, categories=categories, existing_edges=existing
|
||||
)
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": PHASE1_SYSTEM},
|
||||
{"role": "user", "content": user_prompt},
|
||||
]
|
||||
|
||||
if dry_run:
|
||||
if i < 3: # Show first 3 prompts
|
||||
print(f"\n--- Prompt for '{word}' ---")
|
||||
print(f"System: {PHASE1_SYSTEM[:200]}...")
|
||||
print(f"User:\n{user_prompt}")
|
||||
elif i == 3:
|
||||
print(f"\n... ({total - 3} more words) ...")
|
||||
continue
|
||||
|
||||
response = llm_chat_completion(messages)
|
||||
parsed = parse_llm_relations(response, word) if response else []
|
||||
|
||||
# Classify edges
|
||||
accepted = []
|
||||
candidates = []
|
||||
duplicates = 0
|
||||
|
||||
for edge in parsed:
|
||||
triple = (edge["start_word"], edge["end_word"], edge["relation"])
|
||||
if triple in existing_triples:
|
||||
duplicates += 1
|
||||
continue
|
||||
|
||||
existing_triples.add(triple)
|
||||
|
||||
if edge["end_word"] in vocab:
|
||||
accepted.append(edge)
|
||||
else:
|
||||
candidates.append({
|
||||
"word": edge["end_word"],
|
||||
"suggested_by": word,
|
||||
"relation_context": f"{edge['relation']}: {edge['surface_text']}",
|
||||
"frequency": 1,
|
||||
})
|
||||
|
||||
if accepted:
|
||||
append_augmented_edges(accepted)
|
||||
# Also update in-memory edges for subsequent words
|
||||
for e in accepted:
|
||||
edges[(e["start_word"], e["relation"])].append(
|
||||
(e["end_word"], e["weight"], e["surface_text"]))
|
||||
|
||||
if candidates:
|
||||
append_candidates(candidates)
|
||||
|
||||
total_accepted += len(accepted)
|
||||
|
||||
append_log(word, "1", len(parsed), len(accepted), duplicates, len(candidates))
|
||||
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f" [{i+1}/{total}] {total_accepted} edges accepted so far")
|
||||
|
||||
time.sleep(0.1)
|
||||
|
||||
if dry_run:
|
||||
print(f"\nDry run complete. Would process {total - total_skipped} words.")
|
||||
else:
|
||||
print(f"\nPhase 1 complete: {total_accepted} new edges accepted.")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 2: Cross-Word Bridge Discovery
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
PHASE2_SYSTEM = """You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
|
||||
|
||||
Rules:
|
||||
- The bridge word must be a common, concrete noun
|
||||
- State the relationship type for each connection
|
||||
- Valid relationship types: AtLocation, UsedFor, HasA, PartOf, CapableOf, MadeOf, HasPrerequisite, Causes, HasProperty, ReceivesAction, Desires, CausesDesire, LocatedNear, CreatedBy
|
||||
- Output format: BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
|
||||
|
||||
Example:
|
||||
Words: "cow" and "butter"
|
||||
milk | relation_to_first: CapableOf | relation_to_second: MadeOf | milk connects production to product"""
|
||||
|
||||
|
||||
PHASE2_USER = """Words: "{word_a}" and "{word_b}"
|
||||
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
|
||||
Find 1-3 bridge words that connect them."""
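The Phase 2 system prompt fixes a pipe-delimited output format, so each response line can be validated before it becomes an edge. The sketch below is a hypothetical line parser, not the script's own `parse_bridge_response` (which is not shown in this part of the diff); the name `parse_bridge_line`, the regex, and the returned dict keys are assumptions made for illustration.

```python
import re

# Relation types copied from the PHASE2_SYSTEM prompt.
VALID_RELATIONS = {
    "AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf", "MadeOf",
    "HasPrerequisite", "Causes", "HasProperty", "ReceivesAction", "Desires",
    "CausesDesire", "LocatedNear", "CreatedBy",
}

# Hypothetical pattern for:
# BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
LINE_RE = re.compile(
    r"^\s*(?P<bridge>[^|]+?)\s*\|\s*relation_to_first:\s*(?P<first>\w+)\s*"
    r"\|\s*relation_to_second:\s*(?P<second>\w+)\s*(?:\|\s*(?P<why>.*))?$"
)


def parse_bridge_line(line):
    """Return a dict for one well-formed bridge line, or None otherwise."""
    match = LINE_RE.match(line)
    if not match:
        return None  # chatter, headings, or malformed output
    first, second = match.group("first"), match.group("second")
    if first not in VALID_RELATIONS or second not in VALID_RELATIONS:
        return None  # the model invented a relation type
    return {
        "bridge": match.group("bridge").strip().lower(),
        "relation_to_first": first,
        "relation_to_second": second,
        "explanation": (match.group("why") or "").strip(),
    }
```

Rejecting lines with unknown relation types at parse time keeps invented relations out of the edge store, at the cost of discarding some otherwise usable suggestions.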
|
||||
|
||||
|
||||
def build_reachability(vocab, edges):
|
||||
"""Build 2-hop reachability from vocab words to other vocab words."""
|
||||
vocab_set = set(vocab.keys())
|
||||
reachable = defaultdict(set) # word -> set of reachable vocab words
|
||||
|
||||
for word in vocab:
|
||||
# Direct (1-hop) neighbors in vocab
|
||||
for (sw, rel), targets in edges.items():
|
||||
if sw == word:
|
||||
for (ew, w, s) in targets:
|
||||
if ew in vocab_set and ew != word:
|
||||
reachable[word].add(ew)
|
||||
# 2-hop from this neighbor
|
||||
for (sw2, rel2), targets2 in edges.items():
|
||||
if sw2 == ew:
|
||||
for (ew2, w2, s2) in targets2:
|
||||
if ew2 in vocab_set and ew2 != word:
|
||||
reachable[word].add(ew2)
|
||||
|
||||
return reachable
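`build_reachability` rescans every edge list for each word and again for each neighbor, which grows quickly with the size of the graph. A minimal sketch of the same idea using a prebuilt adjacency index follows. It assumes `edges` maps `(start_word, relation)` to a list of `(end_word, weight, surface_text)` tuples, as the loops above suggest, and it assumes only vocab words act as intermediate hops (the original indentation was lost in this view, so that nesting is a guess). The name `build_reachability_indexed` is hypothetical.

```python
from collections import defaultdict


def build_reachability_indexed(vocab, edges):
    """Two-hop reachability between vocab words using an adjacency index."""
    vocab_set = set(vocab)

    # Collapse (start, relation) -> targets into start -> set of end words.
    neighbors = defaultdict(set)
    for (start, _relation), targets in edges.items():
        for end, _weight, _surface in targets:
            neighbors[start].add(end)

    reachable = defaultdict(set)  # word -> set of reachable vocab words
    for word in vocab_set:
        for first_hop in neighbors.get(word, ()):
            # Assumption: only vocab words serve as intermediate hops.
            if first_hop not in vocab_set or first_hop == word:
                continue
            reachable[word].add(first_hop)
            for second_hop in neighbors.get(first_hop, ()):
                if second_hop in vocab_set and second_hop != word:
                    reachable[word].add(second_hop)
    return reachable
```

Building the index is one pass over the edges; each word then only touches its own neighbors instead of the whole edge dictionary.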
|
||||
|
||||
|
||||
def run_phase2(vocab, edges, existing_triples, checkpoint, dry_run=False):
|
||||
"""Phase 2: Cross-word bridge discovery."""
|
||||
print("Phase 2: Building reachability matrix...")
|
||||
reachable = build_reachability(vocab, edges)
|
||||
|
||||
# Find low-connectivity words
|
||||
vocab_set = set(vocab.keys())
|
||||
low_connectivity = []
|
||||
for word in vocab:
|
||||
reach_count = len(reachable.get(word, set()))
|
||||
if reach_count < 10:
|
||||
low_connectivity.append((word, reach_count))
|
||||
|
||||
low_connectivity.sort(key=lambda x: x[1])
|
||||
print(f" {len(low_connectivity)} words with <10 reachable vocab words")
|
||||
|
||||
# Build category index
|
||||
by_category = defaultdict(list)
|
||||
for word, info in vocab.items():
|
||||
for cat in info["categories"]:
|
||||
by_category[cat].append(word)
|
||||
|
||||
total_accepted = 0
|
||||
pairs_processed = 0
|
||||
total_skipped = 0
|
||||
|
||||
for word, reach_count in low_connectivity:
|
||||
if (word, "2") in checkpoint:
|
||||
total_skipped += 1
|
||||
continue
|
||||
|
||||
word_cats = vocab[word]["categories"]
|
||||
word_reachable = reachable.get(word, set())
|
||||
|
||||
# Find same-category words that are unreachable
|
||||
unreachable = []
|
||||
for cat in word_cats:
|
||||
for peer in by_category.get(cat, []):
|
||||
if peer != word and peer not in word_reachable:
|
||||
unreachable.append(peer)
|
||||
|
||||
if not unreachable:
|
||||
append_log(word, "2", 0, 0, 0, 0)
|
||||
continue
|
||||
|
||||
# Sample up to 10 unreachable peers
|
||||
sample = random.sample(unreachable, min(10, len(unreachable)))
|
||||
|
||||
accepted_for_word = 0
|
||||
|
||||
for peer in sample:
|
||||
pair_key = f"{word}:{peer}"
|
||||
if (pair_key, "2") in checkpoint:
|
||||
continue
|
||||
|
||||
categories_a = ", ".join(vocab[word]["categories"])
|
||||
categories_b = ", ".join(vocab[peer]["categories"])
|
||||
|
||||
user_prompt = PHASE2_USER.format(
|
||||
word_a=word, word_b=peer,
|
||||
categories_a=categories_a, categories_b=categories_b,
|
||||
)
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": PHASE2_SYSTEM},
|
||||
{"role": "user", "content": user_prompt},
|
||||
]
|
||||
|
||||
if dry_run:
|
||||
if pairs_processed < 3:
|
||||
print(f"\n--- Bridge prompt: '{word}' <-> '{peer}' ---")
|
||||
print(f"User:\n{user_prompt}")
|
||||
elif pairs_processed == 3:
|
||||
print(f"\n... (more pairs) ...")
|
||||
pairs_processed += 1
|
||||
continue
|
||||
|
||||
response = llm_chat_completion(messages)
|
||||
parsed = parse_bridge_response(response, word, peer) if response else []
|
||||
|
||||
accepted = []
|
||||
duplicates = 0
|
||||
oov = 0
|
||||
|
||||
for edge in parsed:
|
||||
triple = (edge["start_word"], edge["end_word"], edge["relation"])
|
||||
if triple in existing_triples:
|
||||
duplicates += 1
|
||||
continue
|
||||
existing_triples.add(triple)
|
||||
|
||||
# For bridge edges, both endpoints should ideally be in vocab
|
||||
if edge["start_word"] in vocab_set and edge["end_word"] in vocab_set:
|
||||
accepted.append(edge)
|
||||
elif edge["start_word"] in vocab_set or edge["end_word"] in vocab_set:
|
||||
# At least one end in vocab — still useful
|
||||
accepted.append(edge)
|
||||
else:
|
||||
oov += 1
|
||||
|
||||
if accepted:
|
||||
append_augmented_edges(accepted)
|
||||
for e in accepted:
|
||||
edges[(e["start_word"], e["relation"])].append(
|
||||
(e["end_word"], e["weight"], e["surface_text"]))
|
||||
accepted_for_word += len(accepted)
|
||||
|
||||
pairs_processed += 1
|
||||
time.sleep(0.1)
|
||||
|
||||
total_accepted += accepted_for_word
|
||||
append_log(word, "2", 0, accepted_for_word, 0, 0)
|
||||
|
||||
if pairs_processed and pairs_processed % 20 == 0:
|
||||
print(f" {pairs_processed} pairs processed, {total_accepted} edges accepted")
|
||||
|
||||
if dry_run:
|
||||
print(f"\nDry run complete. Would process {pairs_processed} word pairs.")
|
||||
else:
|
||||
print(f"\nPhase 2 complete: {total_accepted} bridge edges accepted from {pairs_processed} pairs.")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 3: Property Enrichment
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
PHASE3_SYSTEM = """You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
|
||||
|
||||
Rules:
|
||||
- Only physical/sensory/behavioral properties, not abstract qualities
|
||||
- Properties should DISTINGUISH this thing from similar things in its category
|
||||
- Output one property per line as: PROPERTY | brief explanation
|
||||
- Aim for 5-8 properties"""
|
||||
|
||||
|
||||
PHASE3_USER = """Word: {word}
|
||||
Category: {categories}
|
||||
Other words in same category: {peers}
|
||||
|
||||
What properties distinguish {word} from the others listed?"""
|
||||
|
||||
|
||||
def run_phase3(vocab, edges, existing_triples, checkpoint, dry_run=False):
|
||||
"""Phase 3: Property enrichment for false_equivalence templates."""
|
||||
by_category = defaultdict(list)
|
||||
for word, info in vocab.items():
|
||||
for cat in info["categories"]:
|
||||
by_category[cat].append(word)
|
||||
|
||||
words = sorted(vocab.keys())
|
||||
total = len(words)
|
||||
total_accepted = 0
|
||||
total_skipped = 0
|
||||
|
||||
print(f"Phase 3: Property enrichment for {total} words...")
|
||||
|
||||
for i, word in enumerate(words):
|
||||
if (word, "3") in checkpoint:
|
||||
total_skipped += 1
|
||||
continue
|
||||
|
||||
word_cats = vocab[word]["categories"]
|
||||
categories = ", ".join(word_cats)
|
||||
|
||||
# Gather same-category peers (sample of 10)
|
||||
peers = set()
|
||||
for cat in word_cats:
|
||||
for peer in by_category.get(cat, []):
|
||||
if peer != word:
|
||||
peers.add(peer)
|
||||
peer_sample = random.sample(list(peers), min(10, len(peers))) if peers else []
|
||||
|
||||
if not peer_sample:
|
||||
append_log(word, "3", 0, 0, 0, 0)
|
||||
continue
|
||||
|
||||
user_prompt = PHASE3_USER.format(
|
||||
word=word, categories=categories,
|
||||
peers=", ".join(peer_sample),
|
||||
)
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": PHASE3_SYSTEM},
|
||||
{"role": "user", "content": user_prompt},
|
||||
]
|
||||
|
||||
if dry_run:
|
||||
if i < 3:
|
||||
print(f"\n--- Property prompt for '{word}' ---")
|
||||
print(f"User:\n{user_prompt}")
|
||||
elif i == 3:
|
||||
print(f"\n... ({total - 3} more words) ...")
|
||||
continue
|
||||
|
||||
response = llm_chat_completion(messages)
|
||||
parsed = parse_property_response(response, word) if response else []
|
||||
|
||||
accepted = []
|
||||
duplicates = 0
|
||||
|
||||
for edge in parsed:
|
||||
triple = (edge["start_word"], edge["end_word"], edge["relation"])
|
||||
if triple in existing_triples:
|
||||
duplicates += 1
|
||||
continue
|
||||
existing_triples.add(triple)
|
||||
accepted.append(edge)
|
||||
|
||||
if accepted:
|
||||
append_augmented_edges(accepted)
|
||||
for e in accepted:
|
||||
edges[(e["start_word"], e["relation"])].append(
|
||||
(e["end_word"], e["weight"], e["surface_text"]))
|
||||
|
||||
total_accepted += len(accepted)
|
||||
append_log(word, "3", len(parsed), len(accepted), duplicates, 0)
|
||||
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f" [{i+1}/{total}] {total_accepted} properties accepted so far")
|
||||
|
||||
time.sleep(0.1)
|
||||
|
||||
if dry_run:
|
||||
print(f"\nDry run complete. Would process {total - total_skipped} words.")
|
||||
else:
|
||||
print(f"\nPhase 3 complete: {total_accepted} new HasProperty edges accepted.")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="LLM-augmented graph enhancement for folksy subgraph."
|
||||
)
|
||||
group = parser.add_mutually_exclusive_group(required=True)
|
||||
group.add_argument("--phase", type=int, choices=[1, 2, 3],
|
||||
help="Run a specific phase (1, 2, or 3)")
|
||||
group.add_argument("--all", action="store_true",
|
||||
help="Run all three phases in sequence")
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Print prompts without calling LLM")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
vocab = load_vocab()
|
||||
edges, existing_triples = load_relations()
|
||||
checkpoint = load_checkpoint()
|
||||
|
||||
print(f"Loaded {len(vocab)} vocab words, {len(existing_triples)} existing edge triples.")
|
||||
print(f"Checkpoint: {len(checkpoint)} (word, phase) pairs already processed.")
|
||||
|
||||
phases = [args.phase] if args.phase else [1, 2, 3]
|
||||
|
||||
for phase in phases:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"Running Phase {phase}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
if phase == 1:
|
||||
run_phase1(vocab, edges, existing_triples, checkpoint, args.dry_run)
|
||||
elif phase == 2:
|
||||
run_phase2(vocab, edges, existing_triples, checkpoint, args.dry_run)
|
||||
elif phase == 3:
|
||||
run_phase3(vocab, edges, existing_triples, checkpoint, args.dry_run)
|
||||
|
||||
# Reload checkpoint after each phase for resumability
|
||||
checkpoint = load_checkpoint()
|
||||
|
||||
print("\nDone.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
512
scripts/expand_vocab.py
Normal file
|
|
@ -0,0 +1,512 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Expand folksy vocabulary with high-quality candidates from LLM suggestions.
|
||||
|
||||
Reads candidate_additions.csv (words suggested by the LLM during phase 1 that
|
||||
weren't in the vocab), filters for quality, uses the LLM to assign categories,
|
||||
and appends the survivors to folksy_vocab.csv.
|
||||
|
||||
After running this, re-run `enhance_graph.py --phase 1` to generate edges
|
||||
for the new words (the checkpoint will skip already-processed words).
|
||||
|
||||
Usage:
|
||||
python scripts/expand_vocab.py # Full run
|
||||
python scripts/expand_vocab.py --dry-run # Show what would be added
|
||||
python scripts/expand_vocab.py --min-citations 8 # Stricter threshold
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
import time
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_DIR = Path(__file__).parent
|
||||
PROJECT_DIR = SCRIPT_DIR.parent
|
||||
DATA_DIR = PROJECT_DIR / "data"
|
||||
|
||||
LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
|
||||
LLM_MODEL = "THUDM-GLM4-32B"
|
||||
|
||||
VOCAB_CSV = DATA_DIR / "folksy_vocab.csv"
|
||||
CANDIDATE_CSV = DATA_DIR / "candidate_additions.csv"
|
||||
|
||||
# Valid categories from the existing vocabulary
|
||||
VALID_CATEGORIES = {
|
||||
"animal", "beverage", "bird", "building", "clothing", "container", "crop",
|
||||
"fabric", "fish", "flower", "food", "fruit", "furniture", "grain", "herb",
|
||||
"insect", "instrument", "landscape", "material", "metal", "mineral",
|
||||
"organism", "plant", "rock", "seed", "shelter", "spice", "stone",
|
||||
"structure", "tool", "tree", "vegetable", "vehicle", "water", "weapon", "wood",
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Exclusion lists
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Abstract concepts, emotions, processes — not concrete enough for folksy vocab
|
||||
EXCLUDE_ABSTRACT = {
|
||||
"ecosystem", "satisfaction", "fullness", "warmth", "fear", "relaxation",
|
||||
"growth", "interest", "nature", "protection", "digestion", "injury",
|
||||
"decoration", "construction", "landscape", "noise", "sound", "energy",
|
||||
"nourishment", "nutrition", "pollination", "sustainability", "tradition",
|
||||
"biodiversity", "symbolism", "elegance", "resilience", "patience",
|
||||
"beauty", "abundance", "fertility", "creativity", "harmony", "comfort",
|
||||
"curiosity", "companionship", "loyalty", "aggression", "alertness",
|
||||
"camouflage", "predation", "migration", "hibernation", "decomposition",
|
||||
"erosion", "combustion", "fermentation", "oxidation", "corrosion",
|
||||
"photosynthesis", "respiration", "evaporation", "precipitation",
|
||||
"transpiration", "germination", "excitement", "enjoyment", "satiety",
|
||||
"stability", "organization", "fragrance", "moisture", "wildlife",
|
||||
"preservation", "conversation", "inspiration", "storage", "observation",
|
||||
"hydration", "destruction", "entertainment", "education", "knowledge",
|
||||
"safety", "practice", "research", "skill", "space", "license",
|
||||
"collection", "habitat", "pollution", "health", "vibration", "wonder",
|
||||
"awe", "refreshment", "irritation", "happiness", "joy", "damage",
|
||||
"death", "pain", "thirst", "fear", "alarm", "contents", "ingredients",
|
||||
"electricity", "oxygen", "navigation", "recreation", "meditation",
|
||||
"nutrition", "celebration", "communication", "imagination", "devotion",
|
||||
"ambition", "endurance", "independence", "discipline", "cooperation",
|
||||
"sweetness", "fullness", "aroma", "flavor", "fragrance", "texture",
|
||||
"smell", "color", "contents", "surface", "bottom", "edge",
|
||||
"nutrients", "study", "outfit", "upholstery",
|
||||
}
|
||||
|
||||
# Scientific/technical — not folksy enough for folk wisdom
|
||||
EXCLUDE_TECHNICAL = {
|
||||
"cellulose", "exoskeleton", "protein", "tissue", "cells", "alloy",
|
||||
"cellulose", "enzyme", "chlorophyll", "genome", "photon",
|
||||
"organism", "molecule", "compound", "polymer", "isotope",
|
||||
"ecosystem", "metabolism", "catalyst", "membrane", "chromosome",
|
||||
"cell", "nutrient", "ingredient", "material", "content",
|
||||
}
|
||||
|
||||
# Collective/institutional nouns — not concrete individual things
|
||||
EXCLUDE_INSTITUTIONAL = {
|
||||
"orchestra", "fleet", "arsenal", "toolkit", "collection",
|
||||
"restaurant", "museum", "university", "corporation", "organization",
|
||||
"musician", "breakfast", "dinner", "meal", "dish", "sandwich",
|
||||
"seafood", "refrigerator", "garage", "basement", "park",
|
||||
}
|
||||
|
||||
# Adjectives and properties — useful as HasProperty targets but not as vocab words
|
||||
EXCLUDE_ADJECTIVES = {
|
||||
"small", "large", "heavy", "colorful", "green", "brown", "hard",
|
||||
"white", "round", "sharp", "sturdy", "long", "soft", "flat",
|
||||
"sweet", "bitter", "smooth", "rough", "bright", "dark", "dry",
|
||||
"wet", "thick", "thin", "warm", "cold", "hot", "tall", "short",
|
||||
"red", "blue", "yellow", "black", "grey", "gray", "pink",
|
||||
"fragrant", "loud", "spicy", "sour", "tough", "delicate", "strong",
|
||||
"weak", "light", "dense", "portable", "lightweight", "transparent",
|
||||
"opaque", "flexible", "rigid", "brittle", "elastic", "porous",
|
||||
"compact", "edible", "toxic", "aromatic", "nocturnal", "aquatic",
|
||||
"durable", "cylindrical", "wooden", "shiny", "solid", "narrow",
|
||||
"metallic", "pungent", "juicy", "fast", "powerful", "woody",
|
||||
"fibrous", "savory", "liquid", "enclosed", "rectangular", "wild",
|
||||
"feathered", "leafy", "crunchy", "dangerous", "fuzzy", "slimy",
|
||||
"natural", "waterproof", "electronic",
|
||||
}
|
||||
|
||||
# Words that are clearly verbs or gerunds
|
||||
EXCLUDE_VERBS = {
|
||||
"eating", "cooking", "growing", "fishing", "hunting", "flying",
|
||||
"mining", "flavoring", "singing", "blooming", "holding", "baking",
|
||||
"ripening", "opening", "cutting", "protecting", "seasoning",
|
||||
"storing", "building", "swimming", "brewing", "weaving", "carving",
|
||||
"climbing", "digging", "plowing", "sewing", "spinning", "tanning",
|
||||
"swim", "run", "grow", "eat", "hunt", "peck", "bite", "dive",
|
||||
"crawl", "cut", "shine", "sparkle",
|
||||
}
|
||||
|
||||
|
||||
def singularize(word):
    """Best-effort singularization. Returns (singular, was_plural)."""
    # Irregular plurals, including the -ves forms that really map to -f/-fe
    irregulars = {
        "teeth": "tooth", "feet": "foot", "geese": "goose", "mice": "mouse",
        "lice": "louse", "dice": "die", "oxen": "ox", "children": "child",
        "leaves": "leaf", "loaves": "loaf", "halves": "half", "knives": "knife",
        "lives": "life", "wives": "wife", "wolves": "wolf", "shelves": "shelf",
        "calves": "calf", "hooves": "hoof", "scarves": "scarf", "sheaves": "sheaf",
        "thieves": "thief", "elves": "elf", "buses": "bus",
    }
    if word in irregulars:
        return irregulars[word], True

    # No blanket -ves -> -f rule: it would turn "olives" into "olif" and
    # "gloves" into "glof". Other -ves words fall through to the plain -s rule.

    # -ies -> -y
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y", True

    # -sses, -xes, -ches, -shes -> drop -es
    # (plain -ses/-zes are left to the -s rule so "horses" -> "horse", not "hors")
    if word.endswith(("sses", "xes", "ches", "shes")):
        return word[:-2], True

    # -s (but not -ss, -us, -is)
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1], True

    return word, False
|
||||
|
||||
|
||||
def is_plural_of_existing(word, existing_vocab):
|
||||
"""Check if word is likely a plural form of an existing vocab word."""
|
||||
# word + s
|
||||
if word.endswith("s") and word[:-1] in existing_vocab:
|
||||
return True
|
||||
# word + es
|
||||
if word.endswith("es") and word[:-2] in existing_vocab:
|
||||
return True
|
||||
# word ending ies -> y
|
||||
if word.endswith("ies") and word[:-3] + "y" in existing_vocab:
|
||||
return True
|
||||
# word ending ves -> f/fe
|
||||
if word.endswith("ves"):
|
||||
if word[:-3] + "f" in existing_vocab:
|
||||
return True
|
||||
if word[:-3] + "fe" in existing_vocab:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def is_plural_of_candidate(word, accepted_words):
|
||||
"""Check if word is a plural of another candidate, or vice versa."""
|
||||
# Is this word a plural of something accepted?
|
||||
if word.endswith("s") and word[:-1] in accepted_words:
|
||||
return True
|
||||
if word.endswith("es") and word[:-2] in accepted_words:
|
||||
return True
|
||||
if word.endswith("ies") and word[:-3] + "y" in accepted_words:
|
||||
return True
|
||||
# Is something accepted a plural of this word?
|
||||
if word + "s" in accepted_words:
|
||||
return True
|
||||
if word + "es" in accepted_words:
|
||||
return True
|
||||
if word.endswith("f") and word[:-1] + "ves" in accepted_words:
|
||||
return True
|
||||
if word.endswith("fe") and word[:-2] + "ves" in accepted_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# LLM categorization
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
CATEGORIZE_SYSTEM = """You are a vocabulary categorizer. Given a list of concrete nouns, assign each one to one or more categories from this fixed list:
|
||||
|
||||
animal, beverage, bird, building, clothing, container, crop, fabric, fish, flower, food, fruit, furniture, grain, herb, insect, instrument, landscape, material, metal, mineral, organism, plant, rock, seed, shelter, spice, stone, structure, tool, tree, vegetable, vehicle, water, weapon, wood
|
||||
|
||||
Rules:
|
||||
- Use ONLY categories from the list above
|
||||
- A word can have multiple categories (e.g., "brick" -> material, stone)
|
||||
- If a word fits none of the categories well, output SKIP
|
||||
- Output format: word: category1, category2
|
||||
- One word per line"""
|
||||
|
||||
CATEGORIZE_USER = """Categorize these words:
|
||||
{word_list}"""
|
||||
|
||||
|
||||
def llm_chat_completion(messages, max_retries=3):
    """Chat completion with retry logic."""
    import requests

    for attempt in range(max_retries):
        try:
            resp = requests.post(LLM_ENDPOINT, json={
                "model": LLM_MODEL,
                "messages": messages,
            }, timeout=120)
            resp.raise_for_status()
            data = resp.json()
            return data["choices"][0]["message"]["content"]
        except Exception as e:
            # Exponential backoff: 1s, 2s, 4s, ...
            wait = 2 ** attempt
            print(f" LLM call failed (attempt {attempt+1}/{max_retries}): {e}",
                  file=sys.stderr)
            if attempt < max_retries - 1:
                print(f" Retrying in {wait}s...", file=sys.stderr)
                time.sleep(wait)
            else:
                print(" Giving up on this batch.", file=sys.stderr)
    return None
|
||||
|
||||
|
||||
def parse_categories(response_text, valid_words):
    """Parse LLM categorization response."""
    result = {}
    if not response_text:
        return result

    for line in response_text.strip().split("\n"):
        line = line.strip()
        if not line:
            continue

        # Strip bullets/numbers
        line = re.sub(r"^[\d]+[.)]\s*", "", line)
        line = re.sub(r"^[-*•]\s*", "", line)
        line = line.strip()

        # Match: word: cat1, cat2
        match = re.match(r"^(\w+)\s*:\s*(.+)$", line)
        if not match:
            continue

        word = match.group(1).strip().lower()
        cats_raw = match.group(2).strip()

        if "SKIP" in cats_raw.upper():
            continue

        # Keep only categories from the fixed list
        cats = []
        for c in cats_raw.split(","):
            c = c.strip().lower()
            if c in VALID_CATEGORIES:
                cats.append(c)

        if word in valid_words and cats:
            result[word] = cats

    return result
|
||||
|
||||
|
||||
def categorize_words(words, batch_size=25):
|
||||
"""Categorize words using the LLM in batches."""
|
||||
all_categories = {}
|
||||
word_set = set(words)
|
||||
|
||||
for i in range(0, len(words), batch_size):
|
||||
batch = words[i:i + batch_size]
|
||||
word_list = "\n".join(f"- {w}" for w in batch)
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": CATEGORIZE_SYSTEM},
|
||||
{"role": "user", "content": CATEGORIZE_USER.format(word_list=word_list)},
|
||||
]
|
||||
|
||||
response = llm_chat_completion(messages)
|
||||
parsed = parse_categories(response, word_set)
|
||||
all_categories.update(parsed)
|
||||
|
||||
categorized = len(parsed)
|
||||
print(f" Batch {i // batch_size + 1}: {categorized}/{len(batch)} categorized")
|
||||
time.sleep(0.1)
|
||||
|
||||
return all_categories
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Expand folksy vocabulary with LLM-suggested candidates."
|
||||
)
|
||||
parser.add_argument("--min-citations", type=int, default=5,
|
||||
help="Minimum number of vocab words that suggested this candidate (default: 5)")
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Show what would be added without modifying files")
|
||||
parser.add_argument("--no-llm", action="store_true",
|
||||
help="Skip LLM categorization (use placeholder categories)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load existing vocab
|
||||
existing_vocab = {}
|
||||
with open(VOCAB_CSV, newline="", encoding="utf-8") as f:
|
||||
for row in csv.DictReader(f):
|
||||
existing_vocab[row["word"]] = row
|
||||
existing_words = set(existing_vocab.keys())
|
||||
print(f"Existing vocabulary: {len(existing_words)} words")
|
||||
|
||||
# Load candidates
|
||||
candidates = []
|
||||
with open(CANDIDATE_CSV, newline="", encoding="utf-8") as f:
|
||||
for row in csv.DictReader(f):
|
||||
candidates.append(row)
|
||||
|
||||
# Aggregate: count unique sources per candidate word
|
||||
word_sources = defaultdict(set)
|
||||
for c in candidates:
|
||||
word_sources[c["word"]].add(c["suggested_by"])
|
||||
|
||||
print(f"Total candidate rows: {len(candidates)}")
|
||||
print(f"Unique candidate words: {len(word_sources)}")
|
||||
|
||||
# Normalize plurals: merge citation counts into singular forms
|
||||
normalized_sources = defaultdict(set)
|
||||
for word, sources in word_sources.items():
|
||||
singular, was_plural = singularize(word)
|
||||
# Merge into the singular form
|
||||
normalized_sources[singular].update(sources)
|
||||
# Replace word_sources with normalized version
|
||||
word_sources = {w: srcs for w, srcs in normalized_sources.items()}
|
||||
print(f"After singularization: {len(word_sources)} unique candidates")
|
||||
|
||||
# Filter
|
||||
accepted = []
|
||||
reject_reasons = Counter()
|
||||
|
||||
# Sort by citation count descending for consistent ordering
|
||||
sorted_candidates = sorted(word_sources.items(), key=lambda x: len(x[1]), reverse=True)
|
||||
accepted_set = set()
|
||||
|
||||
for word, sources in sorted_candidates:
|
||||
citation_count = len(sources)
|
||||
|
||||
# Minimum citation threshold
|
||||
if citation_count < args.min_citations:
|
||||
reject_reasons["below_threshold"] += 1
|
||||
continue
|
||||
|
||||
# No multi-word (underscore) candidates
|
||||
if "_" in word:
|
||||
reject_reasons["multi_word"] += 1
|
||||
continue
|
||||
|
||||
# Already in vocab
|
||||
if word in existing_words:
|
||||
reject_reasons["already_in_vocab"] += 1
|
||||
continue
|
||||
|
||||
# Exclude abstracts
|
||||
if word in EXCLUDE_ABSTRACT:
|
||||
reject_reasons["abstract"] += 1
|
||||
continue
|
||||
|
||||
# Exclude adjectives
|
||||
if word in EXCLUDE_ADJECTIVES:
|
||||
reject_reasons["adjective"] += 1
|
||||
continue
|
||||
|
||||
# Exclude verbs/gerunds
|
||||
if word in EXCLUDE_VERBS:
|
||||
reject_reasons["verb_gerund"] += 1
|
||||
continue
|
||||
|
||||
# Exclude technical/scientific
|
||||
if word in EXCLUDE_TECHNICAL:
|
||||
reject_reasons["technical"] += 1
|
||||
continue
|
||||
|
||||
# Exclude institutional/collective
|
||||
if word in EXCLUDE_INSTITUTIONAL:
|
||||
reject_reasons["institutional"] += 1
|
||||
continue
|
||||
|
||||
# Gerund pattern catch-all (but allow exceptions)
|
||||
if word.endswith("ing") and word not in {"ring", "spring", "string", "wing", "ceiling"}:
|
||||
reject_reasons["gerund_pattern"] += 1
|
||||
continue
|
||||
|
||||
# Exclude plurals of existing vocab
|
||||
if is_plural_of_existing(word, existing_words):
|
||||
reject_reasons["plural_of_existing"] += 1
|
||||
continue
|
||||
|
||||
# Exclude plurals of already-accepted candidates
|
||||
if is_plural_of_candidate(word, accepted_set):
|
||||
reject_reasons["plural_of_candidate"] += 1
|
||||
continue
|
||||
|
||||
# Single character
|
||||
if len(word) < 2:
|
||||
reject_reasons["too_short"] += 1
|
||||
continue
|
||||
|
||||
accepted.append((word, citation_count))
|
||||
accepted_set.add(word)
|
||||
|
||||
print(f"\nFiltering results:")
|
||||
print(f" Accepted: {len(accepted)}")
|
||||
for reason, count in reject_reasons.most_common():
|
||||
print(f" Rejected ({reason}): {count}")
|
||||
|
||||
if not accepted:
|
||||
print("\nNo candidates passed filtering.")
|
||||
return
|
||||
|
||||
# Show accepted words
|
||||
print(f"\nAccepted candidates ({len(accepted)}):")
|
||||
for word, count in accepted:
|
||||
print(f" {word:25s} cited by {count:3d} vocab words")
|
||||
|
||||
if args.dry_run:
|
||||
print(f"\nDry run complete. Would add {len(accepted)} words to vocabulary.")
|
||||
return
|
||||
|
||||
    # Categorize with LLM
    words_to_categorize = [w for w, _ in accepted]

    if args.no_llm:
        print("\nSkipping LLM categorization (--no-llm). Using 'material' as placeholder.")
        categories = {w: ["material"] for w in words_to_categorize}
    else:
        print(f"\nCategorizing {len(words_to_categorize)} words with LLM...")
        categories = categorize_words(words_to_categorize)

    # Words the LLM couldn't categorize get skipped
    uncategorized = [w for w in words_to_categorize if w not in categories]
    if uncategorized:
        print(f"\n {len(uncategorized)} words could not be categorized (skipped):")
        for w in uncategorized:
            print(f" {w}")

    # Build new vocab entries
    new_entries = []
    for word, citation_count in accepted:
        if word not in categories:
            continue
        cats = categories[word]
        new_entries.append({
            "word": word,
            "categories": ",".join(cats),
            "tangibility_score": "0.80",
            "conceptnet_edge_count": "0",
            "frequency_rank": "0",
        })

    if not new_entries:
        print("\nNo entries to add after categorization.")
        return
|
||||
|
||||
    # Back up the existing vocab before appending to it
    backup_path = VOCAB_CSV.with_suffix(f".csv.bak.{datetime.now().strftime('%Y%m%d_%H%M%S')}")
    shutil.copy2(VOCAB_CSV, backup_path)
    print(f"\nBacked up vocabulary to {backup_path.name}")

    # Append to vocab CSV (no header row: the file already has one)
    with open(VOCAB_CSV, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["word", "categories", "tangibility_score",
                                               "conceptnet_edge_count", "frequency_rank"])
        for entry in new_entries:
            writer.writerow(entry)

    print(f"\nAdded {len(new_entries)} words to {VOCAB_CSV.name}")
    print(f"New vocabulary size: {len(existing_words) + len(new_entries)}")

    # Summary by category
    cat_counts = Counter()
    for entry in new_entries:
        for c in entry["categories"].split(","):
            cat_counts[c.strip()] += 1
    print("\nNew words by category:")
    for cat, count in cat_counts.most_common():
        print(f" {cat:20s} {count:3d}")

    print("\nNext step: run 'python scripts/enhance_graph.py --phase 1' to generate edges for new words.")


if __name__ == "__main__":
    main()
|
||||
177
scripts/filter_corpus.py
Normal file
|
|
@ -0,0 +1,177 @@
|
|||
#!/usr/bin/env python3
"""Quality filtering for polished folksy sayings.

Reads corpus_polished.jsonl, applies quality filters, and writes the filtered
corpus plus a discard-analysis CSV.

Usage:
    python scripts/filter_corpus.py
    python scripts/filter_corpus.py --input corpus/corpus_polished.jsonl --output corpus/corpus_filtered.jsonl
"""

import argparse
import csv
import json
import sys
from difflib import SequenceMatcher
from pathlib import Path

SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
CORPUS_DIR = PROJECT_DIR / "corpus"
|
||||
|
||||
|
||||
def quality_filter(entry):
    """Apply quality filters to a polished entry.

    Returns a (passed, reason) tuple.
    """
    text = entry.get("polished_text", "")
    if not text:
        return False, "no_polished_text"

    words = text.split()

    # Length check: keep sayings between 5 and 25 words
    if len(words) > 25:
        return False, "too_long"
    if len(words) < 5:
        return False, "too_short"

    # Must contain at least 2 of the original slot-fill nouns
    # (case-insensitive substring match against the polished text)
    slot_words = set(entry.get("slots", {}).values())
    words_present = sum(1 for w in slot_words if w.lower() in text.lower())
    if words_present < 2:
        return False, "lost_key_nouns"

    # No raw ConceptNet artifacts (multi-word underscore phrases)
    if "_" in text:
        return False, "conceptnet_artifact"

    # No broken templates (unfilled slots)
    if "{" in text or "}" in text:
        return False, "unfilled_slot"

    return True, "pass"
|
||||
|
||||
|
||||
def is_near_duplicate(text_a, text_b, threshold=0.75):
    """Check if two texts are near-duplicates (case-insensitive similarity ratio)."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio() > threshold


def deduplicate_within_family(entries):
    """Remove near-duplicates within each meta-template family.

    The first occurrence in each family is kept; later near-duplicates are removed.
    Returns (kept, removed) lists, where removed holds (entry, reason) tuples.
    """
    by_family = {}
    for entry in entries:
        family = entry.get("meta_template", "unknown")
        by_family.setdefault(family, []).append(entry)

    kept = []
    removed = []

    for family, family_entries in by_family.items():
        family_kept = []
        for entry in family_entries:
            text = entry.get("polished_text", "")
            is_dup = False
            for existing in family_kept:
                if is_near_duplicate(text, existing.get("polished_text", "")):
                    is_dup = True
                    break
            if is_dup:
                removed.append((entry, "near_duplicate"))
            else:
                family_kept.append(entry)
        kept.extend(family_kept)

    return kept, removed
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(description="Quality filtering for polished folksy sayings.")
    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_polished.jsonl"),
                        help="Input polished JSONL file")
    parser.add_argument("--output", default=str(CORPUS_DIR / "corpus_filtered.jsonl"),
                        help="Output filtered JSONL file")
    parser.add_argument("--discard-analysis", default=str(CORPUS_DIR / "discard_analysis.csv"),
                        help="Discard analysis CSV file")
    args = parser.parse_args()

    input_path = Path(args.input)
    output_path = Path(args.output)
    discard_path = Path(args.discard_analysis)

    if not input_path.exists():
        print(f"Error: {input_path} not found.", file=sys.stderr)
        sys.exit(1)

    # Load polished entries (only those with status=polished)
    all_entries = []
    already_discarded = 0
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("status") == "polished":
                all_entries.append(entry)
            elif entry.get("status") == "discarded":
                already_discarded += 1

    print(f"Loaded {len(all_entries)} polished entries ({already_discarded} already discarded by LLM)")

    # Apply quality filters
    passed = []
    discards = []  # (entry, reason)

    for entry in all_entries:
        ok, reason = quality_filter(entry)
        if ok:
            passed.append(entry)
        else:
            discards.append((entry, reason))

    print(f"Quality filter: {len(passed)} passed, {len(discards)} discarded")

    # Show discard breakdown
    from collections import Counter
    reason_counts = Counter(r for _, r in discards)
    for reason, count in reason_counts.most_common():
        print(f" {reason}: {count}")

    # Near-duplicate detection within template families
    kept, dup_removed = deduplicate_within_family(passed)
    discards.extend(dup_removed)

    print(f"Near-duplicate removal: {len(dup_removed)} removed, {len(kept)} remaining")

    # Write filtered output
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for entry in kept:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

    print(f"\nFiltered corpus: {len(kept)} entries -> {output_path}")

    # Write discard analysis
    with open(discard_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["raw_text", "meta_template", "discard_stage", "discard_reason"])
        for entry, reason in discards:
            # Near-duplicates are removed in their own stage, so label them
            # separately instead of lumping them in with quality_filter.
            if reason == "no_polished_text":
                stage = "llm_polish"
            elif reason == "near_duplicate":
                stage = "dedup"
            else:
                stage = "quality_filter"
            writer.writerow([
                entry.get("raw_text", ""),
                entry.get("meta_template", ""),
                stage,
                reason,
            ])

    print(f"Discard analysis: {len(discards)} entries -> {discard_path}")


if __name__ == "__main__":
    main()
|
||||
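The near-duplicate check in filter_corpus.py relies on `difflib.SequenceMatcher.ratio()` with a 0.75 cutoff. As a rough illustration of what that cutoff separates, here is a minimal sketch; the helper name `similarity` and the sample sayings are illustrative assumptions, not part of the commit:

```python
from difflib import SequenceMatcher

THRESHOLD = 0.75  # same default as is_near_duplicate in filter_corpus.py


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (hypothetical helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two rewordings of one saying, plus an unrelated saying.
a = "Don't brew the coffee and act surprised when the water's all gone."
b = "Don't brew the coffee and act surprised when the water is all gone."
c = "A pineapple is just a peach that grew itself some armor."

for label, other in (("reworded", b), ("unrelated", c)):
    score = similarity(a, other)
    verdict = "duplicate" if score > THRESHOLD else "distinct"
    print(f"{label:10s} {score:.2f} {verdict}")
```

Since every entry is compared against everything already kept in its family, the cost grows roughly quadratically with family size, which may be worth keeping in mind at 1,500 sayings per template.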
385
scripts/format_training_pairs.py
Normal file
|
|
@ -0,0 +1,385 @@
|
|||
#!/usr/bin/env python3
"""Format filtered sayings into training pairs for fine-tuning.

Each polished saying generates 3-5 training pairs with different input framings.
Also generates fictional entity training pairs.

Usage:
    python scripts/format_training_pairs.py
    python scripts/format_training_pairs.py --input corpus/corpus_filtered.jsonl --output corpus/training_pairs.jsonl
"""

import argparse
import csv
import json
import random
import sys
from pathlib import Path

SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
CORPUS_DIR = PROJECT_DIR / "corpus"
DATA_DIR = PROJECT_DIR / "data"
EXAMPLES_DIR = PROJECT_DIR / "examples"

# Template name mappings for human-readable prompts
TEMPLATE_NAMES = {
    "deconstruction": "deconstruction",
    "denial_of_consequences": "denial of consequences",
    "ironic_deficiency": "ironic deficiency",
    "futile_preparation": "futile preparation",
    "hypocritical_complaint": "hypocritical complaint",
    "tautological_wisdom": "tautological wisdom",
    "false_equivalence": "false equivalence",
}

PERSONAS = ["farmer", "grandmother", "old sailor", "blacksmith", "innkeeper", "shepherd"]

OPEN_ENDED_PROMPTS = [
    "Tell me some folk wisdom.",
    "What do they say?",
    "Give me a proverb.",
    "Share some old-time wisdom.",
    "What's a good saying?",
]
|
||||
|
||||
# Auto-generated fictional entities for additional training pairs
AUTO_ENTITIES = [
    {
        "name": "Stoneclaw",
        "categories": ["animal", "predator"],
        "properties": ["fierce", "rocky", "nocturnal"],
        "relations": {"AtLocation": ["cave", "mountain"], "HasA": ["claws", "scales"], "CapableOf": ["hunting", "climbing"]},
    },
    {
        "name": "Duskmelon",
        "categories": ["fruit", "food"],
        "properties": ["purple", "sweet", "fragrant"],
        "relations": {"AtLocation": ["garden", "market"], "UsedFor": ["eating", "jam"], "MadeOf": ["seed", "juice"]},
    },
    {
        "name": "Windloom",
        "categories": ["tool", "craft"],
        "properties": ["wooden", "portable", "intricate"],
        "relations": {"UsedFor": ["weaving", "thread"], "MadeOf": ["wood", "string"], "AtLocation": ["workshop", "cottage"]},
    },
    {
        "name": "Briarvine",
        "categories": ["plant", "herb"],
        "properties": ["thorny", "green", "medicinal"],
        "relations": {"AtLocation": ["forest", "hedge"], "UsedFor": ["healing", "tea"], "HasA": ["thorn", "leaf"]},
    },
    {
        "name": "Mudhog",
        "categories": ["animal", "livestock"],
        "properties": ["muddy", "stubborn", "heavy"],
        "relations": {"AtLocation": ["farm", "swamp"], "Desires": ["food", "mud"], "CapableOf": ["digging", "rooting"]},
    },
    {
        "name": "Frostberry",
        "categories": ["fruit", "food"],
        "properties": ["cold", "blue", "tiny"],
        "relations": {"AtLocation": ["mountain", "tundra"], "UsedFor": ["eating", "preserves"], "HasProperty": ["cold", "tart"]},
    },
    {
        "name": "Lanternmoss",
        "categories": ["plant", "fungus"],
        "properties": ["glowing", "damp", "soft"],
        "relations": {"AtLocation": ["cave", "swamp"], "UsedFor": ["light", "decoration"], "HasProperty": ["luminous", "fragile"]},
    },
    {
        "name": "Cinderhawk",
        "categories": ["bird", "animal"],
        "properties": ["fiery", "fast", "red"],
        "relations": {"AtLocation": ["mountain", "volcano"], "CapableOf": ["flying", "hunting"], "HasA": ["talons", "feathers"]},
    },
    {
        "name": "Rootstone",
        "categories": ["stone", "material"],
        "properties": ["veined", "hard", "ancient"],
        "relations": {"AtLocation": ["quarry", "riverbed"], "UsedFor": ["building", "carving"], "MadeOf": ["mineral", "root"]},
    },
    {
        "name": "Silkwort",
        "categories": ["plant", "fiber"],
        "properties": ["silky", "white", "tall"],
        "relations": {"AtLocation": ["field", "meadow"], "UsedFor": ["weaving", "cloth"], "HasA": ["stem", "fiber"]},
    },
    {
        "name": "Kettlefrog",
        "categories": ["animal", "amphibian"],
        "properties": ["loud", "round", "green"],
        "relations": {"AtLocation": ["pond", "marsh"], "CapableOf": ["jumping", "croaking"], "Desires": ["flies", "water"]},
    },
    {
        "name": "Dustwheat",
        "categories": ["crop", "grain"],
        "properties": ["dry", "golden", "hardy"],
        "relations": {"AtLocation": ["field", "barn"], "UsedFor": ["bread", "flour"], "HasPrerequisite": ["rain", "soil"]},
    },
]
|
||||
|
||||
|
||||
def format_entity_description(entity):
    """Format entity into a natural description string."""
    name = entity["name"]
    cats = entity.get("categories", [])
    props = entity.get("properties", [])
    rels = entity.get("relations", {})

    parts = []

    # Category description
    if props and cats:
        prop_str = ", ".join(props[:3])
        cat_str = " and ".join(cats[:2])
        parts.append(f"A {name} is a {prop_str} {cat_str}.")
    elif cats:
        parts.append(f"A {name} is a {' and '.join(cats[:2])}.")

    # Location
    if "AtLocation" in rels:
        locs = rels["AtLocation"]
        parts.append(f"It is found near {' and '.join(locs[:2])}.")

    # Parts/properties
    if "HasA" in rels:
        has = rels["HasA"]
        parts.append(f"It has {', '.join(has[:3])}.")

    # Capabilities
    if "CapableOf" in rels:
        caps = rels["CapableOf"]
        parts.append(f"It can {' and '.join(caps[:2])}.")

    # Uses
    if "UsedFor" in rels:
        uses = rels["UsedFor"]
        parts.append(f"It is used for {' and '.join(uses[:2])}.")

    return " ".join(parts)
|
||||
|
||||
|
||||
def load_vocab_categories():
    """Load vocab to get word -> categories mapping.

    Returns an empty dict if the vocab CSV does not exist.
    """
    word_cats = {}
    vocab_path = DATA_DIR / "folksy_vocab.csv"
    if vocab_path.exists():
        with open(vocab_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                word = row["word"]
                cats = [c.strip() for c in row["categories"].split(",") if c.strip()]
                word_cats[word] = cats
    return word_cats
|
||||
|
||||
|
||||
def generate_training_pairs(entry, word_cats):
    """Generate up to 5 training pairs for a single polished saying.

    Word- and persona-seeded pairs need at least one usable slot word, and the
    category-seeded pair needs a slot word found in the vocab, so an entry can
    yield fewer than 3 pairs.
    """
    polished = entry.get("polished_text", "")
    slots = entry.get("slots", {})
    meta_template = entry.get("meta_template", "")

    # Collect source words (concrete nouns from slots), skipping
    # article-prefixed fills and single-character values
    source_words = [v for v in slots.values()
                    if v and not v.startswith("a ") and not v.startswith("an ") and len(v) > 1]

    # Determine categories of slot words (vocab keys use underscores, not spaces)
    slot_categories = set()
    for word in source_words:
        word_lower = word.lower().replace(" ", "_")
        if word_lower in word_cats:
            slot_categories.update(word_cats[word_lower])

    pairs = []
    base = {
        "output": polished,
        "meta_template": meta_template,
        "source_words": source_words,
    }

    # 1. Word-seeded (included whenever there are source words)
    if source_words:
        word = random.choice(source_words)
        pairs.append({**base, "input": f"Tell me something about {word}."})

    # 2. Category-seeded (included whenever we have categories)
    if slot_categories:
        cat = random.choice(list(slot_categories))
        pairs.append({**base, "input": f"Tell me a saying about {cat}."})

    # 3. Persona-seeded (included whenever there are source words)
    persona = random.choice(PERSONAS)
    if source_words:
        word = random.choice(source_words)
        pairs.append({**base, "input": f"What would a {persona} say about {word}?"})

    # 4. Template-seeded (include ~70% of the time)
    if random.random() < 0.7:
        template_name = TEMPLATE_NAMES.get(meta_template, meta_template)
        pairs.append({**base, "input": f"Give me a {template_name} proverb."})

    # 5. Open-ended (include ~30% of the time)
    if random.random() < 0.3:
        prompt = random.choice(OPEN_ENDED_PROMPTS)
        pairs.append({**base, "input": prompt})

    return pairs
|
||||
|
||||
|
||||
def generate_fictional_pairs(entities):
    """Generate training pairs for fictional entities.

    These pairs include the entity description in the input.
    """
    pairs = []

    # Generate 15-25 pairs per entity
    for entity in entities:
        name = entity["name"]
        desc = format_entity_description(entity)
        props = entity.get("properties", [])
        rels = entity.get("relations", {})

        # Collect words related to this entity
        related_words = []
        for targets in rels.values():
            related_words.extend(targets)

        n_pairs = random.randint(15, 25)

        for _ in range(n_pairs):
            framing = random.choice(["persona", "word", "category", "open"])

            if framing == "persona":
                persona = random.choice(PERSONAS)
                input_text = f"{desc} What would a {persona} say about a {name}?"
            elif framing == "word" and related_words:
                word = random.choice(related_words)
                input_text = f"{desc} Tell me a saying about {name} and {word}."
            elif framing == "category":
                cats = entity.get("categories", ["thing"])
                cat = random.choice(cats)
                input_text = f"{desc} Give me folk wisdom about this {cat}."
            else:
                input_text = f"{desc} Tell me some folk wisdom about {name}."

            # Placeholder output: these would ideally be generated through the
            # template engine with fictional entities loaded, then polished.
            # For now, generate a structural placeholder that indicates the
            # entity relationships.
            pairs.append({
                "input": input_text,
                "output": "",  # Will be filled by actual generation
                "meta_template": "fictional",
                "source_words": [name] + related_words[:3],
                "_needs_generation": True,
                "_entity": entity,
            })

    return pairs
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(description="Format training pairs for fine-tuning.")
    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_filtered.jsonl"),
                        help="Input filtered JSONL file")
    parser.add_argument("--output", default=str(CORPUS_DIR / "training_pairs.jsonl"),
                        help="Output training pairs JSONL file")
    parser.add_argument("--entities", default=str(EXAMPLES_DIR / "my_world.json"),
                        help="Fictional entities JSON file")
    args = parser.parse_args()

    input_path = Path(args.input)
    output_path = Path(args.output)
    entities_path = Path(args.entities)

    if not input_path.exists():
        print(f"Error: {input_path} not found.", file=sys.stderr)
        sys.exit(1)

    # Load vocab categories
    word_cats = load_vocab_categories()

    # Load filtered entries
    entries = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))

    print(f"Loaded {len(entries)} filtered entries")

    # Generate training pairs for each entry
    all_pairs = []
    for entry in entries:
        pairs = generate_training_pairs(entry, word_cats)
        all_pairs.extend(pairs)

    print(f"Generated {len(all_pairs)} training pairs from polished sayings")

    # Generate fictional entity pairs
    fictional_entities = []
    if entities_path.exists():
        with open(entities_path, encoding="utf-8") as f:
            data = json.load(f)
        fictional_entities = data.get("entities", [])
        print(f"Loaded {len(fictional_entities)} fictional entities from {entities_path}")

    # Add auto-generated entities
    fictional_entities.extend(AUTO_ENTITIES)
    print(f"Total fictional entities (file + auto-generated): {len(fictional_entities)}")

    fictional_pairs = generate_fictional_pairs(fictional_entities)

    # Filter out placeholder pairs (those that still need generation).
    # In a full pipeline, these would be generated through the template engine.
    # For now, skip any with empty output.
    real_fictional = [p for p in fictional_pairs if p.get("output")]
    placeholder_fictional = [p for p in fictional_pairs if not p.get("output")]

    if placeholder_fictional:
        print(f" {len(placeholder_fictional)} fictional pairs need generation via template engine")
        print(" (Run folksy_generator.py with --entities to generate these, then re-run this script)")

    all_pairs.extend(real_fictional)

    # Clean up internal fields before writing
    for pair in all_pairs:
        pair.pop("_needs_generation", None)
        pair.pop("_entity", None)

    # Write output
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for pair in all_pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

    # Stats
    from collections import Counter
    input_types = Counter()
    for pair in all_pairs:
        inp = pair["input"]
        # Check open-ended prompts first: "Give me a proverb." would otherwise
        # match the template-seeded test below and be miscounted.
        if inp in OPEN_ENDED_PROMPTS:
            input_types["open_ended"] += 1
        elif inp.startswith("Tell me something about"):
            input_types["word_seeded"] += 1
        elif inp.startswith("Tell me a saying about"):
            input_types["category_seeded"] += 1
        elif inp.startswith("What would a"):
            input_types["persona_seeded"] += 1
        elif inp.startswith("Give me a") and "proverb" in inp:
            input_types["template_seeded"] += 1
        else:
            input_types["fictional"] += 1

    print(f"\nTotal training pairs: {len(all_pairs)}")
    print("Distribution by input type:")
    for itype, count in sorted(input_types.items()):
        print(f" {itype:20s} {count:5d}")

    print(f"\nOutput: {output_path}")


if __name__ == "__main__":
    main()
|
||||
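The stats step in format_training_pairs.py buckets pairs by the prefix of their input text. One detail worth watching is that the open-ended prompt "Give me a proverb." also starts with "Give me a" and contains "proverb", so the order of the checks decides which bucket it lands in. A minimal sketch of an ordering that keeps the two apart follows; `classify_input` is a hypothetical helper written for illustration, not a function from the commit:

```python
# Same open-ended prompts as OPEN_ENDED_PROMPTS in format_training_pairs.py.
OPEN_ENDED_PROMPTS = [
    "Tell me some folk wisdom.",
    "What do they say?",
    "Give me a proverb.",
    "Share some old-time wisdom.",
    "What's a good saying?",
]


def classify_input(inp: str) -> str:
    """Bucket a training-pair input by framing (hypothetical helper)."""
    # Exact open-ended matches go first so they cannot fall into a prefix rule.
    if inp in OPEN_ENDED_PROMPTS:
        return "open_ended"
    if inp.startswith("Tell me something about"):
        return "word_seeded"
    if inp.startswith("Tell me a saying about"):
        return "category_seeded"
    if inp.startswith("What would a"):
        return "persona_seeded"
    if inp.startswith("Give me a") and inp.endswith("proverb."):
        return "template_seeded"
    # Fictional inputs start with the entity description, so nothing above matches.
    return "fictional"


for text in ("Give me a proverb.", "Give me a false equivalence proverb."):
    print(f"{text!r} -> {classify_input(text)}")
```

Storing the framing on each pair when it is created would sidestep prefix matching altogether, at the cost of one more field to strip before writing.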
61
scripts/generate_raw_batch.sh
Executable file
|
|
@ -0,0 +1,61 @@
|
|||
#!/usr/bin/env bash
# Generate raw folksy sayings across all 7 templates.
# Output: corpus/corpus_raw.jsonl (~10,500 entries)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
CORPUS_DIR="$PROJECT_DIR/corpus"
GENERATOR="$PROJECT_DIR/folksy_generator.py"

# Optional first argument overrides the per-template count
COUNT_PER_TEMPLATE=${1:-1500}

mkdir -p "$CORPUS_DIR"

OUTPUT="$CORPUS_DIR/corpus_raw.jsonl"
# Clear existing file
> "$OUTPUT"

TEMPLATES=(
    deconstruction
    denial_of_consequences
    ironic_deficiency
    futile_preparation
    hypocritical_complaint
    tautological_wisdom
    false_equivalence
)

echo "Generating $COUNT_PER_TEMPLATE sayings per template (${#TEMPLATES[@]} templates)..."
echo "Output: $OUTPUT"

total=0
for template in "${TEMPLATES[@]}"; do
    echo -n " $template ($COUNT_PER_TEMPLATE)... "
    before=$(wc -l < "$OUTPUT")
    # stderr is discarded, so a generator failure aborts the script (set -e) without a message
    python "$GENERATOR" --template "$template" --count "$COUNT_PER_TEMPLATE" --json >> "$OUTPUT" 2>/dev/null
    after=$(wc -l < "$OUTPUT")
    generated=$((after - before))
    total=$((total + generated))
    echo "$generated generated"
done

echo ""
echo "Total: $total raw sayings in $OUTPUT"
echo ""

# Check template distribution
echo "Template distribution:"
python -c "
import json, sys
from collections import Counter
counts = Counter()
with open('$OUTPUT') as f:
    for line in f:
        entry = json.loads(line)
        counts[entry['meta_template']] += 1
for template, count in sorted(counts.items()):
    print(f' {template:30s} {count:5d}')
print(f\" {'TOTAL':30s} {sum(counts.values()):5d}\")
"
|
||||
215
scripts/polish_corpus.py
Normal file
|
|
@ -0,0 +1,215 @@
|
|||
#!/usr/bin/env python3
"""LLM polish pipeline for raw folksy sayings.

Reads corpus_raw.jsonl and sends each entry to GLM4-32B for polish.
The output file doubles as the checkpoint: it is opened in append mode and
already-processed entries are skipped on resume.

Usage:
    python scripts/polish_corpus.py
    python scripts/polish_corpus.py --input corpus/corpus_raw.jsonl --output corpus/corpus_polished.jsonl
"""

import argparse
import json
import sys
import time
from pathlib import Path

SCRIPT_DIR = Path(__file__).parent
PROJECT_DIR = SCRIPT_DIR.parent
CORPUS_DIR = PROJECT_DIR / "corpus"

LLM_ENDPOINT = "http://192.168.1.100:8853/v1d/chat/completions"
LLM_MODEL = "THUDM-GLM4-32B"
|
||||
|
||||
|
||||
SYSTEM_PROMPT = """You are an editor specializing in folk sayings and rural proverbs. You will receive a rough draft of a fake folksy saying along with the relationship chain it encodes.
|
||||
|
||||
Your job:
|
||||
1. Fix grammar, articles, and pluralization
|
||||
2. Make it sound natural — like something a weathered farmer would say while leaning on a fence post
|
||||
3. Preserve the core nouns and the relationship between them — do not swap out the key words
|
||||
4. You MAY add small colorful details (adjectives, folksy verb choices, regional flavor) but keep it concise — real proverbs are short
|
||||
5. You MAY lightly restructure the sentence for better rhythm, but keep the same meaning pattern
|
||||
6. If the saying is unsalvageable nonsense (the nouns don't relate in any meaningful way, or the combination is unintentionally offensive), respond with exactly: DISCARD
|
||||
|
||||
Output ONLY the polished saying on a single line. No quotes, no explanation, no preamble.
|
||||
|
||||
Examples of good polish:
|
||||
|
||||
Raw: "Don't build the coffee and act surprised when the water show up."
|
||||
Chain: coffee MadeOf water
|
||||
Polished: Don't brew the coffee and act surprised when the water's all gone.
|
||||
|
||||
Raw: "The chest's children always goes without hold books."
|
||||
Chain: chest UsedFor hold_books
|
||||
Polished: The bookshelf-maker's kids always end up reading off the floor.
|
||||
|
||||
Raw: "A pineapple is just a nectarine that's got an attitude."
|
||||
Chain: pineapple IsA fruit, nectarine IsA fruit, pineapple HasProperty prickly
|
||||
Polished: A pineapple is just a peach that grew itself some armor.
|
||||
|
||||
Raw: "You know what they say, a steel with no iron is just a harder than gold iron."
|
||||
Chain: steel MadeOf iron, steel HasProperty hard
|
||||
Polished: You know what they say — steel without the iron is just a dream of being hard.
|
||||
|
||||
Raw: "Funny how the bamboo never has enough grow very quickly for itself."
|
||||
Chain: bamboo CapableOf grow_quickly
|
||||
Polished: DISCARD
|
||||
|
||||
Raw: "That's just funning the canoe and praying for boiling food."
|
||||
Chain: canoe UsedFor transport, fire UsedFor boiling_food
|
||||
Polished: DISCARD"""
|
||||
|
||||
|
||||
def llm_chat_completion(messages, max_retries=3):
    """Chat completion with retry logic.

    Returns the stripped response text, or None if every attempt fails.
    """
    import requests

    for attempt in range(max_retries):
        try:
            resp = requests.post(LLM_ENDPOINT, json={
                "model": LLM_MODEL,
                "messages": messages,
            }, timeout=120)
            resp.raise_for_status()
            data = resp.json()
            return data["choices"][0]["message"]["content"].strip()
        except Exception as e:
            # Exponential backoff: 1s, 2s, 4s, ...
            wait = (2 ** attempt)
            print(f" LLM error (attempt {attempt+1}/{max_retries}): {e}", file=sys.stderr)
            if attempt < max_retries - 1:
                time.sleep(wait)
            else:
                return None


def format_chain(chain_edges):
    """Format chain_edges list into readable string for LLM context."""
    if not chain_edges:
        return "(no chain data)"
    parts = []
    for edge in chain_edges:
        start = edge.get("start", "?")
        rel = edge.get("relation", "?")
        end = edge.get("end", "?")
        weight = edge.get("weight", 0)
        parts.append(f"{start} --{rel}--> {end} (w:{weight:.1f})")
    return ", ".join(parts)


def format_slots(slots):
    """Format slots dict for LLM context."""
    return ", ".join(f"{k}={v}" for k, v in slots.items())
|
||||
|
||||
|
||||
def load_already_processed(output_path):
    """Load set of raw_text strings already processed (for resume).

    Entries recorded with status "error" are not counted as processed, so a
    resumed run retries them instead of skipping them permanently.
    """
    processed = set()
    if output_path.exists():
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    # Skip a partially written line (e.g. after an interrupted run)
                    continue
                if entry.get("status") == "error":
                    continue
                processed.add(entry.get("raw_text", ""))
    return processed
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(description="LLM polish pipeline for folksy sayings.")
    parser.add_argument("--input", default=str(CORPUS_DIR / "corpus_raw.jsonl"),
                        help="Input JSONL file")
    parser.add_argument("--output", default=str(CORPUS_DIR / "corpus_polished.jsonl"),
                        help="Output JSONL file (also serves as checkpoint)")
    args = parser.parse_args()

    input_path = Path(args.input)
    output_path = Path(args.output)

    if not input_path.exists():
        print(f"Error: {input_path} not found.", file=sys.stderr)
        sys.exit(1)

    # Load raw entries
    raw_entries = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                raw_entries.append(json.loads(line))

    print(f"Loaded {len(raw_entries)} raw entries from {input_path}")

    # Check what's already been processed
    already_processed = load_already_processed(output_path)
    remaining = [e for e in raw_entries if e.get("raw_text", "") not in already_processed]

    print(f"Already processed: {len(already_processed)}")
    print(f"Remaining: {len(remaining)}")

    if not remaining:
        print("Nothing to process.")
        return

    discards = 0
    polished = 0
    errors = 0

    with open(output_path, "a", encoding="utf-8") as out:
        for i, entry in enumerate(remaining):
            raw_text = entry.get("raw_text", "")
            meta_template = entry.get("meta_template", "")
            chain = format_chain(entry.get("chain", []))
            slots = format_slots(entry.get("slots", {}))

            user_prompt = (
                f"Meta-template: {meta_template}\n"
                f"Relationship chain: {chain}\n"
                f"Slot fills: {slots}\n"
                f"Raw saying: {raw_text}"
            )

            messages = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ]

            response = llm_chat_completion(messages)

            if response is None:
                entry["status"] = "error"
                errors += 1
            elif response.strip().upper() == "DISCARD":
                entry["status"] = "discarded"
                discards += 1
            else:
                entry["polished_text"] = response.strip()
                entry["status"] = "polished"
                polished += 1

            out.write(json.dumps(entry, ensure_ascii=False) + "\n")

            # Flush and report progress every 100 entries
            if (i + 1) % 100 == 0:
                out.flush()
                total_done = len(already_processed) + i + 1
                print(f" [{total_done}/{len(raw_entries)}] "
                      f"polished={polished}, discarded={discards}, errors={errors}")

            # Brief pause between requests
            time.sleep(0.1)

    total_done = len(already_processed) + len(remaining)
    print(f"\nDone: {total_done} total entries processed.")
    print(f" Polished: {polished}")
    print(f" Discarded: {discards}")
    print(f" Errors: {errors}")
    if polished + discards:
        print(f" Discard rate: {discards / (polished + discards) * 100:.1f}%")
    else:
        print(" Discard rate: N/A")
    print(f"Output: {output_path}")


if __name__ == "__main__":
    main()
|
||||
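polish_corpus.py treats its output JSONL as a checkpoint: on restart it collects the `raw_text` values already written and only sends the rest to the LLM. The sketch below walks through that resume check on a temporary file, including a truncated last line of the kind an interrupted run might leave behind. The helper names `remaining_entries` and `demo` and the sample entries are assumptions made for illustration:

```python
import json
import tempfile
from pathlib import Path


def remaining_entries(raw_entries, checkpoint_path):
    """Return the raw entries whose raw_text is not yet in the checkpoint."""
    done = set()
    if checkpoint_path.exists():
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    done.add(json.loads(line).get("raw_text", ""))
                except json.JSONDecodeError:
                    # A crash can leave a half-written last line; ignore it.
                    continue
    return [e for e in raw_entries if e.get("raw_text", "") not in done]


def demo():
    raw = [{"raw_text": "saying one"}, {"raw_text": "saying two"}, {"raw_text": "saying three"}]
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "corpus_polished.jsonl"
        with open(checkpoint, "w", encoding="utf-8") as f:
            f.write(json.dumps({"raw_text": "saying one", "status": "polished"}) + "\n")
            f.write('{"raw_text": "saying tw')  # simulated truncated write
        return [e["raw_text"] for e in remaining_entries(raw, checkpoint)]


print(demo())
```

Because the lookup key is the raw text itself, two raw entries with identical text count as one for resume purposes, which seems acceptable for a corpus that is deduplicated later anyway.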