corpus generation (work from mid february)
parent 8c8a058301
commit 356b62c6ea
16 changed files with 25872 additions and 38 deletions
GRAPH_ENHANCEMENT_SPEC.md (318 lines, normal file)
@@ -0,0 +1,318 @@
# Graph Enhancement Spec — LLM-Augmented Folksy Subgraph

## Overview

The folksy subgraph extracted from ConceptNet (534 words, 11,096 edges) has coverage gaps. Many common folksy words have sparse or heavily skewed edge distributions: "dog" maps almost exclusively to "bark," "horse" collapses to "ride," etc. This produces repetitive output when the generator seeds on these words.

This phase uses the local GLM4-32B model to generate supplementary relationship edges for every word in the folksy vocabulary, expanding the graph's density and diversity while maintaining the typed-edge structure the template engine requires.

## Infrastructure

```python
import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Chat completion endpoint of local LLM"""
    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()
```

All LLM calls go through this endpoint. No cloud APIs. The model runs locally on the RTX 4090.

## Strategy

For each word in `folksy_vocab.csv`, ask the LLM to generate relationships that ConceptNet is missing or underrepresenting. The LLM output gets parsed into the same edge format as `folksy_relations.csv` and merged into the generator's working dataset.

This is NOT free-form generation. The LLM is constrained to output structured relationship tuples that conform to the existing relation type taxonomy. Think of it as using the LLM as a commonsense knowledge base that supplements ConceptNet, not replaces it.

## Phase 1: Per-Word Relationship Expansion

### Input
Every word in `folksy_vocab.csv`, plus its existing edges from `folksy_relations.csv`.

### Process

For each word, send a prompt that:
1. Provides the word and its categories
2. Lists its EXISTING relationships (so the LLM doesn't duplicate them)
3. Asks for ADDITIONAL relationships across specific relation types
4. Constrains output to a parseable structured format

### System Prompt

```
You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.

Rules:
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
- Every relationship must be something a typical adult would agree is true
- Do not repeat any relationship already listed as "known"
- Target words should be common English words (top 3000 frequency preferred)
- Output ONLY the structured format shown below, one relationship per line
- If you cannot think of good relationships for a given type, output NONE for that type
- Aim for 3-5 relationships per type where possible

Output format (one per line):
RELATION_TYPE: target_word | short natural phrasing

Example output:
AtLocation: barn | you find a horse in a barn
UsedFor: riding | a horse is used for riding
HasA: mane | a horse has a mane
CapableOf: gallop | a horse can gallop
MadeOf: NONE
PartOf: herd | a horse is part of a herd
```

### User Prompt Template

```
Word: {word}
Categories: {categories}

Known relationships:
{existing_edges_formatted}

Generate additional relationships for these types:
- AtLocation (where is it found?)
- UsedFor (what is it used for?)
- HasA (what does it have / contain?)
- PartOf (what is it part of?)
- CapableOf (what can it do?)
- MadeOf (what is it made of?)
- HasPrerequisite (what do you need before you can have/use it?)
- Causes (what does it cause or lead to?)
- HasProperty (what adjectives describe it? Limit to physical/sensory properties)
```

### Formatting Existing Edges

For the "Known relationships" section, format existing edges as:

```
AtLocation: pond (weight 1.0), lake (weight 4.47)
CapableOf: swim (weight 2.0), fly (weight 1.0)
UsedFor: (none in database)
```

This shows the LLM what's already covered AND highlights which relation types are empty and most need filling.
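
The formatting step could be sketched roughly as follows. This is an illustration, not the project's actual helper: the function name `format_existing_edges` and the `RELATION_TYPES` list are assumptions, and edges are assumed to be dicts keyed by the `folksy_relations.csv` column names (`relation`, `end_word`, `weight`).

```python
from collections import defaultdict

# Assumed order: the relation types as listed in the user prompt template.
RELATION_TYPES = ["AtLocation", "UsedFor", "HasA", "PartOf", "CapableOf",
                  "MadeOf", "HasPrerequisite", "Causes", "HasProperty"]

def format_existing_edges(edges, relation_types=RELATION_TYPES):
    """Render one line per relation type, marking empty types explicitly."""
    by_relation = defaultdict(list)
    for edge in edges:
        by_relation[edge["relation"]].append(edge)

    lines = []
    for relation in relation_types:
        targets = by_relation.get(relation, [])
        if targets:
            listed = ", ".join(
                f"{e['end_word']} (weight {e['weight']})" for e in targets)
            lines.append(f"{relation}: {listed}")
        else:
            # Empty types are shown so the LLM knows where the gaps are.
            lines.append(f"{relation}: (none in database)")
    return "\n".join(lines)
```

Listing every requested relation type, including the empty ones, is what produces the `(none in database)` lines shown in the example.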

### Parsing LLM Output

```python
import re

# The relation types requested in the user prompt template
VALID_RELATIONS = {
    'AtLocation', 'UsedFor', 'HasA', 'PartOf', 'CapableOf',
    'MadeOf', 'HasPrerequisite', 'Causes', 'HasProperty',
}

def parse_llm_relations(response_text, source_word):
    """Parse structured LLM output into edge tuples."""
    edges = []
    for line in response_text.strip().split('\n'):
        line = line.strip()
        # Skip blank lines and explicit NONE placeholders
        if not line or 'NONE' in line:
            continue
        # RELATION_TYPE: target_word | phrasing (single-word targets only)
        match = re.match(r'^(\w+):\s*(\w+)\s*\|\s*(.+)$', line)
        if match:
            relation, target, surface = match.groups()
            # Validate relation type
            if relation in VALID_RELATIONS:
                edges.append({
                    'start_word': source_word,
                    'end_word': target.strip().lower(),
                    'relation': relation,
                    'weight': 0.8,  # LLM-generated edges get a default weight below ConceptNet minimum
                    'surface_text': surface.strip(),
                    'source': 'llm_augmented'
                })
    return edges
```

### Weight Assignment

LLM-generated edges get a default weight of **0.8**, deliberately below the ConceptNet minimum threshold of 1.0. This means:
- They fill gaps and add diversity
- They lose ties to ConceptNet edges (real data preferred when both exist)
- They can be filtered out easily if needed (`weight >= 1.0` restores pure ConceptNet)
- The generator can optionally boost or penalize LLM edges via a CLI flag
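
As a rough illustration of how the 0.8 default interacts with filtering and boosting, the two helpers below are hypothetical (the names `pure_conceptnet` and `effective_weight` do not come from the spec) and assume edges are dicts with `weight` and `source` keys.

```python
def pure_conceptnet(edges, threshold=1.0):
    """Drop every edge below the ConceptNet minimum weight."""
    return [e for e in edges if e["weight"] >= threshold]

def effective_weight(edge, llm_weight_boost=0.0):
    """Weight used at generation time; only LLM edges receive the boost."""
    if edge["source"] == "llm_augmented":
        return edge["weight"] + llm_weight_boost
    return edge["weight"]

edges = [
    {"end_word": "bark", "weight": 1.0, "source": "conceptnet"},
    {"end_word": "porch", "weight": 0.8, "source": "llm_augmented"},
]

# With no boost, the ConceptNet edge sorts first, so it wins the tie.
ranked = sorted(edges, key=effective_weight, reverse=True)
```

A negative `llm_weight_boost` would penalize LLM edges further; a boost of 0.2 puts them level with the weakest ConceptNet edges.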

### Deduplication

Before merging, check each LLM-generated edge against existing edges:
- If (start_word, end_word, relation) already exists → skip
- If end_word is not in folksy_vocab → add to a `candidate_additions.csv` for review, but do NOT auto-add to vocab (avoids graph bloat)
- If end_word IS in folksy_vocab → add edge to `folksy_relations_augmented.csv`
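
The three rules above might be implemented along these lines. This is a sketch under assumptions: `route_new_edges` is a hypothetical name, and the batch script's `deduplicate` helper may be shaped differently.

```python
def route_new_edges(new_edges, existing_edges, vocab_words):
    """Split LLM edges into accepted, out-of-vocab candidates, and duplicates."""
    existing_keys = {
        (e["start_word"], e["end_word"], e["relation"]) for e in existing_edges
    }
    accepted, candidates, duplicates = [], [], []
    for edge in new_edges:
        key = (edge["start_word"], edge["end_word"], edge["relation"])
        if key in existing_keys:
            duplicates.append(edge)          # already in the graph: skip
        elif edge["end_word"] not in vocab_words:
            candidates.append(edge)          # goes to candidate_additions.csv
        else:
            accepted.append(edge)            # goes to folksy_relations_augmented.csv
            existing_keys.add(key)           # also blocks repeats within one response
    return accepted, candidates, duplicates
```

Returning all three lists makes it easy to fill the `edges_accepted`, `edges_duplicate`, and `edges_oov` counts in the enhancement log.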

## Phase 2: Cross-Word Relationship Discovery

After per-word expansion, run a second pass that specifically targets 2-hop paths. The goal is to find bridge words that connect otherwise-isolated clusters.

### Process

1. Identify word pairs that are in the same category but have no path of length ≤ 2 between them
2. For a sample of these pairs, ask the LLM what connects them

### Prompt for Bridge Discovery

System prompt:
```
You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them, something that relates to both.

Rules:
- The bridge word must be a common, concrete noun
- State the relationship type for each connection
- Output format: BRIDGE: bridge_word | RELATION_TYPE from first_word: phrasing | RELATION_TYPE for second_word: phrasing | explanation

Example:
Words: "cow" and "butter"
BRIDGE: milk | CapableOf from cow: a cow produces milk | MadeOf for butter: butter is made of milk | milk connects production to product
```

User prompt:
```
Words: "{word_a}" and "{word_b}"
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
Find 1-3 bridge words that connect them.
```
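
The spec does not define a parser for bridge responses. One possible sketch, assuming the model follows the shape of the example `BRIDGE:` line in the system prompt, is shown below; `BRIDGE_RE` and `parse_bridge_line` are hypothetical names.

```python
import re

# Matches: BRIDGE: word | REL from/for word_a: text | REL from/for word_b: text | explanation
BRIDGE_RE = re.compile(
    r"^BRIDGE:\s*(?P<bridge>\w+)\s*\|"
    r"\s*(?P<rel_a>\w+)\s+\w+\s+(?P<word_a>\w+):\s*(?P<text_a>[^|]+)\|"
    r"\s*(?P<rel_b>\w+)\s+\w+\s+(?P<word_b>\w+):\s*(?P<text_b>[^|]+)\|"
    r"\s*(?P<explanation>.+)$"
)

def parse_bridge_line(line):
    """Return the parsed fields of one BRIDGE line, or None if it does not match."""
    match = BRIDGE_RE.match(line.strip())
    if match is None:
        return None
    fields = {name: value.strip() for name, value in match.groupdict().items()}
    fields["bridge"] = fields["bridge"].lower()
    return fields
```

Lines that fail to match return `None`, so a caller can count unparseable responses instead of silently dropping them.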

### Candidate Selection

Don't run this for all pairs: that's O(n²) on 534 words (142,311 unordered pairs). Instead:

1. Build the current 2-hop reachability matrix
2. Identify words with LOW 2-hop reachability (few or no 2-hop paths to other folksy words)
3. For each low-connectivity word, pick 5-10 random same-category words it can't reach
4. Run bridge discovery on those pairs
5. Target: ensure every word in the vocab has at least 3 distinct 2-hop paths to other vocab words
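
Steps 1 and 2 could be sketched as below, using adjacency sets rather than a dense matrix. Two assumptions are baked in and may not match the generator: edges are treated as undirected, and a "2-hop path" is counted as a distinct (bridge, target) pair. All function names are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Map each word to its direct neighbours (assumption: undirected edges)."""
    adj = defaultdict(set)
    for e in edges:
        adj[e["start_word"]].add(e["end_word"])
        adj[e["end_word"]].add(e["start_word"])
    return adj

def two_hop_paths(adj, word):
    """Return {target: set of bridge words} for targets two hops from `word`."""
    paths = defaultdict(set)
    for bridge in adj.get(word, ()):
        for target in adj.get(bridge, ()):
            if target != word:
                paths[target].add(bridge)
    return paths

def low_connectivity_words(adj, vocab_words, min_paths=3):
    """Words with fewer than `min_paths` distinct 2-hop paths to vocab words."""
    low = []
    for word in vocab_words:
        paths = two_hop_paths(adj, word)
        count = sum(len(bridges) for target, bridges in paths.items()
                    if target in vocab_words)
        if count < min_paths:
            low.append(word)
    return low
```

The default `min_paths=3` mirrors the target in step 5, so rerunning `low_connectivity_words` after bridge discovery shows which words still fall short.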

## Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates

The `false_equivalence` meta-template needs HasProperty edges, which are sparse in ConceptNet for concrete nouns. Run a targeted property-extraction pass.

### Prompt

System prompt:
```
You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties: things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.

Rules:
- Only physical/sensory/behavioral properties, not abstract qualities
- Properties should DISTINGUISH this thing from similar things in its category
- Output one property per line as: PROPERTY | brief explanation
- Aim for 5-8 properties
```

User prompt:
```
Word: {word}
Category: {categories}
Other words in same category: {same_category_sample}

What properties distinguish {word} from the others listed?
```

Including same-category peers in the prompt encourages the LLM to generate *differentiating* properties rather than generic ones. "Has legs" is useless for a horse because every animal has legs. "Has a mane" differentiates it.

### Output Format

```
fast | horses are known for running fast
tall | horses are tall compared to most farm animals
mane | horses have a distinctive mane
shod | horses wear horseshoes
```

These get stored as HasProperty edges in the augmented relations file.
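
Converting those lines into HasProperty edges might look like the sketch below. `parse_property_lines` is a hypothetical name, and reusing the 0.8 default weight and `llm_augmented` source tag from Phase 1 is an assumption; the spec does not say which weight property edges receive.

```python
def parse_property_lines(response_text, source_word, weight=0.8):
    """Turn 'PROPERTY | explanation' lines into HasProperty edge dicts."""
    edges = []
    for line in response_text.splitlines():
        prop, separator, explanation = line.partition("|")
        prop = prop.strip().lower()
        if not separator or not prop:
            continue  # skip blank or malformed lines
        edges.append({
            "start_word": source_word,
            "end_word": prop,
            "relation": "HasProperty",
            "weight": weight,  # assumed: same default as Phase 1 edges
            "surface_text": explanation.strip(),
            "source": "llm_augmented",
        })
    return edges
```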

## Output Files

### `folksy_relations_augmented.csv`
Same schema as `folksy_relations.csv` with additional columns:

```
start_word, end_word, relation, weight, surface_text, source
corn, chicken, UsedFor, 1.0, "Corn is used for feeding chickens", conceptnet
dog, porch, AtLocation, 0.8, "you find a dog on a porch", llm_augmented
horse, mane, HasA, 0.8, "a horse has a mane", llm_augmented
```

The `source` column allows filtering: `source=conceptnet` for pure ConceptNet, `source=llm_augmented` for LLM additions, or both for the full enhanced graph.

### `candidate_additions.csv`
Words that appeared in LLM output but aren't in the current folksy vocab:

```
word, suggested_by, relation_context, frequency
mane, horse, "HasA: a horse has a mane", 2
bridle, horse, "HasA: a horse has a bridle", 1
```

The `frequency` column counts how many different source words suggested this target. High-frequency candidates are strong additions to the folksy vocab. Review manually or with a threshold (e.g., suggested by 3+ different words → auto-add).
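
The frequency count and threshold could be computed roughly as follows. This sketch assumes the out-of-vocab edges are dicts in the Phase 1 edge format, and that `suggested_by` and `relation_context` record the first source word seen; both function names are hypothetical.

```python
from collections import defaultdict

def summarize_candidates(oov_edges):
    """Build candidate rows with a count of distinct suggesting source words."""
    sources = defaultdict(set)  # target word -> distinct source words
    first_seen = {}             # target word -> (source word, context)
    for edge in oov_edges:
        target = edge["end_word"]
        sources[target].add(edge["start_word"])
        first_seen.setdefault(
            target,
            (edge["start_word"], f"{edge['relation']}: {edge['surface_text']}"))
    rows = [{
        "word": target,
        "suggested_by": first_seen[target][0],
        "relation_context": first_seen[target][1],
        "frequency": len(sources[target]),
    } for target in sources]
    rows.sort(key=lambda row: (-row["frequency"], row["word"]))
    return rows

def auto_add_candidates(rows, min_sources=3):
    """Words suggested by at least `min_sources` different source words."""
    return [row["word"] for row in rows if row["frequency"] >= min_sources]
```

Counting distinct source words, rather than raw edge count, keeps one chatty word from promoting its own targets.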

### `enhancement_log.csv`
Track what was processed and what the LLM produced:

```
source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov
dog, 2025-02-15T10:30:00, 24, 18, 3, 3
horse, 2025-02-15T10:30:45, 31, 22, 5, 4
```

## Execution Plan

### Batch Processing

534 words × ~1 second per LLM call = ~9 minutes for Phase 1. Very manageable.

```python
import csv
import time

# load_vocab, load_relations, get_edges_for_word, build_expansion_prompt,
# deduplicate, and save_augmented_relations are helpers defined elsewhere.
def process_all_words(vocab_path, relations_path, output_path):
    vocab = load_vocab(vocab_path)
    relations = load_relations(relations_path)
    all_new_edges = []

    for i, word_entry in enumerate(vocab):
        word = word_entry['word']
        categories = word_entry['categories']
        existing = get_edges_for_word(relations, word)

        messages = build_expansion_prompt(word, categories, existing)
        response = llm_chat_completion(messages)
        response_text = response['choices'][0]['message']['content']

        new_edges = parse_llm_relations(response_text, word)
        new_edges = deduplicate(new_edges, existing)
        all_new_edges.extend(new_edges)

        if (i + 1) % 50 == 0:
            print(f"Processed {i+1}/{len(vocab)} words, {len(all_new_edges)} new edges so far")

        time.sleep(0.1)  # gentle rate limiting

    save_augmented_relations(all_new_edges, output_path)
```

### Resumability

Write a checkpoint file after each word so the process can resume if interrupted. The enhancement_log.csv serves this purpose: skip any word that already has an entry.
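
A minimal sketch of log-based resumption, assuming the log uses the column names shown for `enhancement_log.csv`. The helper names `load_processed_words` and `append_log_row` are hypothetical.

```python
import csv
import os

LOG_FIELDS = ["source_word", "timestamp", "edges_generated",
              "edges_accepted", "edges_duplicate", "edges_oov"]

def load_processed_words(log_path):
    """Return the set of words that already have a log entry."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, newline="", encoding="utf-8") as f:
        return {row["source_word"] for row in csv.DictReader(f)}

def append_log_row(log_path, row):
    """Append one row, writing the header first if the file is new or empty."""
    needs_header = (not os.path.exists(log_path)
                    or os.path.getsize(log_path) == 0)
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)
```

In the batch loop, a word would be skipped when it appears in `load_processed_words(log_path)`, and `append_log_row` would run only after that word's edges have been saved, so an interruption never marks unfinished work as done.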

### Validation Pass

After all LLM edges are generated, run a quick validation:
1. No self-loops (start_word == end_word)
2. All relation types are in the valid set
3. No duplicate (start, end, relation) triples
4. Distribution check: flag any word that got 0 new edges (the LLM output may have failed to parse)
5. Spot-check 20 random LLM edges manually for sanity
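
The automated checks (1 to 4) and the sampling for check 5 might be sketched as below. The function names and the `(kind, detail)` problem tuples are illustrative choices, not part of the spec.

```python
import random
from collections import Counter

def validate_edges(edges, vocab_words, valid_relations):
    """Return a list of (problem_kind, detail) tuples; empty means all checks pass."""
    problems = []
    seen = set()
    for e in edges:
        key = (e["start_word"], e["end_word"], e["relation"])
        if e["start_word"] == e["end_word"]:
            problems.append(("self_loop", key))
        if e["relation"] not in valid_relations:
            problems.append(("bad_relation", key))
        if key in seen:
            problems.append(("duplicate", key))
        seen.add(key)
    # Distribution check: vocab words that received no new edges at all.
    per_word = Counter(e["start_word"] for e in edges)
    for word in sorted(vocab_words):
        if per_word[word] == 0:
            problems.append(("no_new_edges", word))
    return problems

def spot_check_sample(edges, k=20, seed=None):
    """Pick up to k random edges for manual review."""
    rng = random.Random(seed)
    return rng.sample(edges, min(k, len(edges)))
```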

## Integration with Generator

The generator's data loading should be updated to:

1. Load `folksy_relations.csv` (original ConceptNet edges)
2. If `folksy_relations_augmented.csv` exists, load and merge it
3. CLI flag: `--pure-conceptnet` to disable LLM-augmented edges
4. CLI flag: `--llm-weight-boost 0.2` to adjust LLM edge weights at runtime (default 0, meaning they keep their 0.8 weight)

This keeps the original ConceptNet data pristine and the augmentation fully reversible.
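
One way the two flags and the merge step could fit together is sketched below. It works on already-loaded edge lists and skips augmented rows whose (start, end, relation) triple is already present; `build_arg_parser` and `merge_edges` are hypothetical names, and the generator's real loader may differ.

```python
import argparse

def build_arg_parser():
    parser = argparse.ArgumentParser(description="folksy generator (sketch)")
    parser.add_argument("--pure-conceptnet", action="store_true",
                        help="ignore LLM-augmented edges")
    parser.add_argument("--llm-weight-boost", type=float, default=0.0,
                        help="added to the weight of LLM-augmented edges")
    return parser

def merge_edges(base_edges, augmented_edges,
                pure_conceptnet=False, llm_weight_boost=0.0):
    """Merge augmented edges into the base graph without mutating either input."""
    merged = list(base_edges)
    if pure_conceptnet:
        return merged
    seen = {(e["start_word"], e["end_word"], e["relation"]) for e in merged}
    for edge in augmented_edges:
        key = (edge["start_word"], edge["end_word"], edge["relation"])
        if key in seen:
            continue  # the original ConceptNet row wins
        seen.add(key)
        if edge.get("source") == "llm_augmented":
            edge = {**edge, "weight": edge["weight"] + llm_weight_boost}
        merged.append(edge)
    return merged
```

Because the boost is applied to a copy at load time, the stored 0.8 weights in the augmented file stay untouched.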