john 356b62c6ea corpus generation (work from mid february)

2026-03-09 19:52:09 -04:00

12 KiB

Raw Blame History

Graph Enhancement Spec — LLM-Augmented Folksy Subgraph

Overview

The folksy subgraph extracted from ConceptNet (534 words, 11,096 edges) has coverage gaps. Many common folksy words have sparse or heavily skewed edge distributions — "dog" maps almost exclusively to "bark," "horse" collapses to "ride," etc. This produces repetitive output when the generator seeds on these words.

This phase uses the local GLM4-32B model to generate supplementary relationship edges for every word in the folksy vocabulary, expanding the graph's density and diversity while maintaining the typed-edge structure the template engine requires.

Infrastructure

import requests

def llm_chat_completion(messages: list, model="THUDM-GLM4-32B"):
    """Chat completion endpoint of local LLM"""
    return requests.post("http://192.168.1.100:8853/v1d/chat/completions", json={
        'model': model,
        'messages': messages
    }).json()

All LLM calls go through this endpoint. No cloud APIs. The model runs locally on the RTX 4090.

Strategy

For each word in folksy_vocab.csv, ask the LLM to generate relationships that ConceptNet is missing or underrepresenting. The LLM output gets parsed into the same edge format as folksy_relations.csv and merged into the generator's working dataset.

This is NOT free-form generation. The LLM is constrained to output structured relationship tuples that conform to the existing relation type taxonomy. Think of it as using the LLM as a commonsense knowledge base that supplements ConceptNet, not replaces it.

Phase 1: Per-Word Relationship Expansion

Input

Every word in folksy_vocab.csv, plus its existing edges from folksy_relations.csv.

Process

For each word, send a prompt that:

Provides the word and its categories
Lists its EXISTING relationships (so the LLM doesn't duplicate them)
Asks for ADDITIONAL relationships across specific relation types
Constrains output to a parseable structured format

System Prompt

You are a commonsense knowledge annotator. You will be given a concrete noun and its known relationships. Your job is to generate ADDITIONAL commonsense relationships that are missing.

Rules:
- Only generate relationships involving concrete, tangible things (animals, foods, tools, plants, buildings, weather, landscape, household objects)
- Every relationship must be something a typical adult would agree is true
- Do not repeat any relationship already listed as "known"
- Target words should be common English words (top 3000 frequency preferred)
- Output ONLY the structured format shown below, one relationship per line
- If you cannot think of good relationships for a given type, output NONE for that type
- Aim for 3-5 relationships per type where possible

Output format (one per line):
RELATION_TYPE: target_word | short natural phrasing

Example output:
AtLocation: barn | you find a horse in a barn
UsedFor: riding | a horse is used for riding
HasA: mane | a horse has a mane
CapableOf: gallop | a horse can gallop
MadeOf: NONE
PartOf: herd | a horse is part of a herd

User Prompt Template

Word: {word}
Categories: {categories}

Known relationships:
{existing_edges_formatted}

Generate additional relationships for these types:
- AtLocation (where is it found?)
- UsedFor (what is it used for?)
- HasA (what does it have / contain?)
- PartOf (what is it part of?)
- CapableOf (what can it do?)
- MadeOf (what is it made of?)
- HasPrerequisite (what do you need before you can have/use it?)
- Causes (what does it cause or lead to?)
- HasProperty (what adjectives describe it? — limit to physical/sensory properties)

Formatting Existing Edges

For the "Known relationships" section, format existing edges as:

AtLocation: pond (weight 1.0), lake (weight 4.47)
CapableOf: swim (weight 2.0), fly (weight 1.0)
UsedFor: (none in database)

This shows the LLM what's already covered AND highlights which relation types are empty and most need filling.

Parsing LLM Output

import re

def parse_llm_relations(response_text, source_word):
    """Parse structured LLM output into edge tuples."""
    edges = []
    for line in response_text.strip().split('\n'):
        line = line.strip()
        if not line or 'NONE' in line:
            continue
        match = re.match(r'^(\w+):\s*(\w+)\s*\|\s*(.+)$', line)
        if match:
            relation, target, surface = match.groups()
            # Validate relation type
            if relation in VALID_RELATIONS:
                edges.append({
                    'start_word': source_word,
                    'end_word': target.strip().lower(),
                    'relation': relation,
                    'weight': 0.8,  # LLM-generated edges get a default weight below ConceptNet minimum
                    'surface_text': surface.strip(),
                    'source': 'llm_augmented'
                })
    return edges

Weight Assignment

LLM-generated edges get a default weight of 0.8 — deliberately below the ConceptNet minimum threshold of 1.0. This means:

They fill gaps and add diversity
They lose ties to ConceptNet edges (real data preferred when both exist)
They can be filtered out easily if needed (weight >= 1.0 restores pure ConceptNet)
The generator can optionally boost or penalize LLM edges via a CLI flag

Deduplication

Before merging, check each LLM-generated edge against existing edges:

If (start_word, end_word, relation) already exists → skip
If end_word is not in folksy_vocab → add to a candidate_additions.csv for review, but do NOT auto-add to vocab (avoids graph bloat)
If end_word IS in folksy_vocab → add edge to folksy_relations_augmented.csv

Phase 2: Cross-Word Relationship Discovery

After per-word expansion, run a second pass that specifically targets 2-hop paths. The goal is to find bridge words that connect otherwise-isolated clusters.

Process

Identify word pairs that are in the same category but have no path of length ≤ 2 between them
For a sample of these pairs, ask the LLM what connects them

Prompt for Bridge Discovery

System prompt:

You are a commonsense knowledge annotator. You will be given two concrete nouns. Your job is to identify a BRIDGE word that connects them — something that relates to both.

Rules:
- The bridge word must be a common, concrete noun
- State the relationship type for each connection
- Output format: BRIDGE_WORD | relation_to_first: TYPE | relation_to_second: TYPE | explanation

Example:
Words: "cow" and "butter"
BRIDGE: milk | CapableOf from cow: a cow produces milk | MadeOf for butter: butter is made of milk | milk connects production to product

User prompt:

Words: "{word_a}" and "{word_b}"
Categories: {word_a} is {categories_a}, {word_b} is {categories_b}
Find 1-3 bridge words that connect them.

Candidate Selection

Don't run this for all pairs — that's O(n²) on 534 words. Instead:

Build the current 2-hop reachability matrix
Identify words with LOW 2-hop reachability (few or no 2-hop paths to other folksy words)
For each low-connectivity word, pick 5-10 random same-category words it can't reach
Run bridge discovery on those pairs
Target: ensure every word in the vocab has at least 3 distinct 2-hop paths to other vocab words

Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates

The false_equivalence meta-template needs HasProperty edges, which are sparse in ConceptNet for concrete nouns. Run a targeted property-extraction pass.

Prompt

System prompt:

You are a commonsense knowledge annotator. Given a concrete noun, list its most distinctive physical or sensory properties — things you could see, touch, hear, smell, or taste. Also list behavioral properties for animals.

Rules:
- Only physical/sensory/behavioral properties, not abstract qualities
- Properties should DISTINGUISH this thing from similar things in its category
- Output one property per line as: PROPERTY | brief explanation
- Aim for 5-8 properties

User prompt:

Word: {word}
Category: {categories}
Other words in same category: {same_category_sample}

What properties distinguish {word} from the others listed?

Including same-category peers in the prompt encourages the LLM to generate differentiating properties rather than generic ones. "Has legs" is useless for a horse because every animal has legs. "Has a mane" differentiates it.

Output Format

fast | horses are known for running fast
tall | horses are tall compared to most farm animals
mane | horses have a distinctive mane
shod | horses wear horseshoes

These get stored as HasProperty edges in the augmented relations file.

Output Files

`folksy_relations_augmented.csv`

Same schema as folksy_relations.csv with additional columns:

start_word, end_word, relation, weight, surface_text, source
corn, chicken, UsedFor, 1.0, "Corn is used for feeding chickens", conceptnet
dog, porch, AtLocation, 0.8, "you find a dog on a porch", llm_augmented
horse, mane, HasA, 0.8, "a horse has a mane", llm_augmented

The source column allows filtering: source=conceptnet for pure ConceptNet, source=llm_augmented for LLM additions, or both for the full enhanced graph.

`candidate_additions.csv`

Words that appeared in LLM output but aren't in the current folksy vocab:

word, suggested_by, relation_context, frequency
mane, horse, "HasA: a horse has a mane", 2
bridle, horse, "HasA: a horse has a bridle", 1

The frequency column counts how many different source words suggested this target. High-frequency candidates are strong additions to the folksy vocab. Review manually or with a threshold (e.g., suggested by 3+ different words → auto-add).

`enhancement_log.csv`

Track what was processed and what the LLM produced:

source_word, timestamp, edges_generated, edges_accepted, edges_duplicate, edges_oov
dog, 2025-02-15T10:30:00, 24, 18, 3, 3
horse, 2025-02-15T10:30:45, 31, 22, 5, 4

Execution Plan

Batch Processing

534 words × ~1 second per LLM call = ~9 minutes for Phase 1. Very manageable.

import csv
import time

def process_all_words(vocab_path, relations_path, output_path):
    vocab = load_vocab(vocab_path)
    relations = load_relations(relations_path)
    all_new_edges = []
    
    for i, word_entry in enumerate(vocab):
        word = word_entry['word']
        categories = word_entry['categories']
        existing = get_edges_for_word(relations, word)
        
        messages = build_expansion_prompt(word, categories, existing)
        response = llm_chat_completion(messages)
        response_text = response['choices'][0]['message']['content']
        
        new_edges = parse_llm_relations(response_text, word)
        new_edges = deduplicate(new_edges, existing)
        all_new_edges.extend(new_edges)
        
        if (i + 1) % 50 == 0:
            print(f"Processed {i+1}/{len(vocab)} words, {len(all_new_edges)} new edges so far")
        
        time.sleep(0.1)  # gentle rate limiting
    
    save_augmented_relations(all_new_edges, output_path)

Resumability

Write a checkpoint file after each word so the process can resume if interrupted. The enhancement_log.csv serves this purpose — skip any word that already has an entry.

Validation Pass

After all LLM edges are generated, run a quick validation:

No self-loops (start_word == end_word)
All relation types are in the valid set
No duplicate (start, end, relation) triples
Distribution check: flag any word that got 0 new edges (LLM may have failed to parse)
Spot-check 20 random LLM edges manually for sanity

Integration with Generator

The generator's data loading should be updated to:

Load folksy_relations.csv (original ConceptNet edges)
If folksy_relations_augmented.csv exists, load and merge it
CLI flag: --pure-conceptnet to disable LLM-augmented edges
CLI flag: --llm-weight-boost 0.2 to adjust LLM edge weights at runtime (default 0, meaning they keep their 0.8 weight)

This keeps the original ConceptNet data pristine and the augmentation fully reversible.

12 KiB Raw Blame History Unescape Escape

Graph Enhancement Spec — LLM-Augmented Folksy Subgraph

Overview

Infrastructure

Strategy

Phase 1: Per-Word Relationship Expansion

Input

Process

System Prompt

User Prompt Template

Formatting Existing Edges

Parsing LLM Output

Weight Assignment

Deduplication

Phase 2: Cross-Word Relationship Discovery

Process

Prompt for Bridge Discovery

Candidate Selection

Phase 3: Property Enrichment for FALSE_EQUIVALENCE Templates

Prompt

Output Format

Output Files

folksy_relations_augmented.csv

candidate_additions.csv

enhancement_log.csv

Execution Plan

Batch Processing

Resumability

Validation Pass

Integration with Generator

12 KiB

Raw Blame History

`folksy_relations_augmented.csv`

`candidate_additions.csv`

`enhancement_log.csv`