folksy_idioms/GRAPH_ENHANCEMENT_SPEC.md

318 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Graph Enhancement Spec — LLM-Augmented Folksy Subgraph
## Overview
The folksy subgraph extracted from ConceptNet (534 words, 11,096 edges) has coverage gaps. Many common folksy words have sparse or heavily skewed edge distributions — "dog" maps almost exclusively to "bark," "horse" collapses to "ride," etc. This produces repetitive output when the generator seeds on these words.
This phase uses the local GLM4-32B model to generate supplementary relationship edges for every word in the folksy vocabulary, expanding the graph's density and diversity while maintaining the typed-edge structure the template engine requires.
## Infrastructure
```python
import requests
def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
"""Chat completion endpoint of local LLM"""
return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
'model': model,
'messages': messages
}).json()
```
All LLM calls go through this endpoint. No cloud APIs. The model runs locally on the RTX 4090.
## Strategy
For each word in `folksy_vocab.csv`, ask the LLM to generate relationships that ConceptNet is missing or underrepresenting. The LLM output gets parsed into the same edge format as `folksy_relations.csv` and merged into the generator's working dataset.
This is NOT free-form generation. The LLM is constrained to output structured relationship tuples that conform to the existing relation type taxonomy. Think of it as using the LLM as a commonsense knowledge base that supplements ConceptNet, not replaces it.
## Phase 1: Per-Word Relationship Expansion
### Input
Every word in `folksy_vocab.csv`, plus its existing edges from `folksy_relations.csv`.
### Process
For each word, send a prompt that:
1. Provides the word and its categories
2. Lists its EXISTING relationships (so the LLM doesn't duplicate them)
3. Asks for ADDITIONAL relationships across specific relation types
4. Constrains output to a parseable structured format
### System Prompt
```
You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.
Rules:
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
- Every relationship must be something a typical adult would agree is true
- Do not repeat any relationship already listed as "known"
- Target words should be common English words (top 3000 frequency preferred)
- Output ONLY the structured format shown below, one relationship per line
- If you cannot think of good relationships for a given type, output NONE for that type
- Aim for 3-5 relationships per type where possible
Output format (one per line):
RELATION_TYPE: target_word | short natural phrasing
Example output:
AtLocation: barn | you find a horse in a barn
UsedFor: riding | a horse is used for riding
HasA: mane | a horse has a mane
CapableOf: gallop | a horse can gallop
MadeOf: NONE
PartOf: herd | a horse is part of a herd
```
### User Prompt Template
```
Word: {word}
Categories: {categories}
Known relationships:
{existing_edges_formatted}
Generate additional relationships for these types:
- AtLocation (where is it found?)
- UsedFor (what is it used for?)
- HasA (what does it have / contain?)
- PartOf (what is it part of?)
- CapableOf (what can it do?)
- MadeOf (what is it made of?)
- HasPrerequisite (what do you need before you can have/use it?)
- Causes (what does it cause or lead to?)
- HasProperty (what adjectives describe it? — limit to physical/sensory properties)
```
### Formatting Existing Edges
For the "Known relationships" section, format existing edges as:
```
AtLocation: pond (weight 1.0), lake (weight 4.47)
CapableOf: swim (weight 2.0), fly (weight 1.0)
UsedFor: (none in database)
```
This shows the LLM what's already covered AND highlights which relation types are empty and most need filling.
### Parsing LLM Output
```python
import re
def parse_llm_relations(response_text, source_word):
"""Parse structured LLM output into edge tuples."""
edges = []
for line in response_text.strip().split('\n'):
line = line.strip()
if not line or 'NONE' in line:
continue
match = re.match(r'^(\w+):\s*(\w+)\s*\|\s*(.+)$', line)
if match:
relation, target, surface = match.groups()
# Validate relation type
if relation in VALID_RELATIONS:
edges.append({
'start_word': source_word,
'end_word': target.strip().lower(),
'relation': relation,
'weight': 0.8, # LLM-generated edges get a default weight below ConceptNet minimum
'surface_text': surface.strip(),
'source': 'llm_augmented'
})
return edges
```
### Weight Assignment
LLM-generated edges get a default weight of **0.8** — deliberately below the ConceptNet minimum threshold of 1.0. This means:
- They fill gaps and add diversity
- They lose ties to ConceptNet edges (real data preferred when both exist)
- They can be filtered out easily if needed (`weight >= 1.0` restores pure ConceptNet)
- The generator can optionally boost or penalize LLM edges via a CLI flag
### Deduplication
Before merging, check each LLM-generated edge against existing edges:
- If (start_word, end_word, relation) already exists → skip
- If end_word is not in folksy_vocab → add to a `candidate_additions.csv` for review, but do NOT auto-add to vocab (avoids graph bloat)
- If end_word IS in folksy_vocab → add edge to `folksy_relations_augmented.csv`
## Phase 2: Cross-Word Relationship Discovery
After per-word expansion, run a second pass that specifically targets 2-hop paths. The goal is to find bridge words that connect otherwise-isolated clusters.
### Process
1. Identify word pairs that are in the same category but have no path of length ≤ 2 between them
2. For a sample of these pairs, ask the LLM what connects them
### Prompt for Bridge Discovery
System prompt:
```
You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.
Rules:
- The bridge word must be a common, concrete noun
- State the relationship type for each connection
- Output format: BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation
Example:
Words: "cow" and "butter"
BRIDGE: milk | CapableOf from cow: a cow produces milk | MadeOf for butter: butter is made of milk | milk connects production to product
```
User prompt:
```
Words: "{word_a}" and "{word_b}"
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
Find 1-3 bridge words that connect them.
```
### Candidate Selection
Don't run this for all pairs — that's O(n²) on 534 words. Instead:
1. Build the current 2-hop reachability matrix
2. Identify words with LOW 2-hop reachability (few or no 2-hop paths to other folksy words)
3. For each low-connectivity word, pick 5-10 random same-category words it can't reach
4. Run bridge discovery on those pairs
5. Target: ensure every word in the vocab has at least 3 distinct 2-hop paths to other vocab words
## Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates
The `false_equivalence` meta-template needs HasProperty edges, which are sparse in ConceptNet for concrete nouns. Run a targeted property-extraction pass.
### Prompt
System prompt:
```
You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.
Rules:
- Only physical/sensory/behavioral properties, not abstract qualities
- Properties should DISTINGUISH this thing from similar things in its category
- Output one property per line as: PROPERTY | brief explanation
- Aim for 5-8 properties
```
User prompt:
```
Word: {word}
Category: {categories}
Other words in same category: {same_category_sample}
What properties distinguish {word} from the others listed?
```
Including same-category peers in the prompt encourages the LLM to generate *differentiating* properties rather than generic ones. "Has legs" is useless for a horse because every animal has legs. "Has a mane" differentiates it.
### Output Format
```
fast | horses are known for running fast
tall | horses are tall compared to most farm animals
mane | horses have a distinctive mane
shod | horses wear horseshoes
```
These get stored as HasProperty edges in the augmented relations file.
## Output Files
### `folksy_relations_augmented.csv`
Same schema as `folksy_relations.csv` with additional columns:
```
start_word, end_word, relation, weight, surface_text, source
corn, chicken, UsedFor, 1.0, "Corn is used for feeding chickens", conceptnet
dog, porch, AtLocation, 0.8, "you find a dog on a porch", llm_augmented
horse, mane, HasA, 0.8, "a horse has a mane", llm_augmented
```
The `source` column allows filtering: `source=conceptnet` for pure ConceptNet, `source=llm_augmented` for LLM additions, or both for the full enhanced graph.
### `candidate_additions.csv`
Words that appeared in LLM output but aren't in the current folksy vocab:
```
word, suggested_by, relation_context, frequency
mane, horse, "HasA: a horse has a mane", 2
bridle, horse, "HasA: a horse has a bridle", 1
```
The `frequency` column counts how many different source words suggested this target. High-frequency candidates are strong additions to the folksy vocab. Review manually or with a threshold (e.g., suggested by 3+ different words → auto-add).
### `enhancement_log.csv`
Track what was processed and what the LLM produced:
```
source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov
dog, 2025-02-15T10:30:00, 24, 18, 3, 3
horse, 2025-02-15T10:30:45, 31, 22, 5, 4
```
## Execution Plan
### Batch Processing
534 words × ~1 second per LLM call = ~9 minutes for Phase 1. Very manageable.
```python
import csv
import time
def process_all_words(vocab_path, relations_path, output_path):
vocab = load_vocab(vocab_path)
relations = load_relations(relations_path)
all_new_edges = []
for i, word_entry in enumerate(vocab):
word = word_entry['word']
categories = word_entry['categories']
existing = get_edges_for_word(relations, word)
messages = build_expansion_prompt(word, categories, existing)
response = llm_chat_completion(messages)
response_text = response['choices'][0]['message']['content']
new_edges = parse_llm_relations(response_text, word)
new_edges = deduplicate(new_edges, existing)
all_new_edges.extend(new_edges)
if (i + 1) % 50 == 0:
print(f"Processed {i+1}/{len(vocab)} words, {len(all_new_edges)} new edges so far")
time.sleep(0.1) # gentle rate limiting
save_augmented_relations(all_new_edges, output_path)
```
### Resumability
Write a checkpoint file after each word so the process can resume if interrupted. The enhancement_log.csv serves this purpose — skip any word that already has an entry.
### Validation Pass
After all LLM edges are generated, run a quick validation:
1. No self-loops (start_word == end_word)
2. All relation types are in the valid set
3. No duplicate (start, end, relation) triples
4. Distribution check: flag any word that got 0 new edges (LLM may have failed to parse)
5. Spot-check 20 random LLM edges manually for sanity
## Integration with Generator
The generator's data loading should be updated to:
1. Load `folksy_relations.csv` (original ConceptNet edges)
2. If `folksy_relations_augmented.csv` exists, load and merge it
3. CLI flag: `--pure-conceptnet` to disable LLM-augmented edges
4. CLI flag: `--llm-weight-boost 0.2` to adjust LLM edge weights at runtime (default 0, meaning they keep their 0.8 weight)
This keeps the original ConceptNet data pristine and the augmentation fully reversible.