GPU Training Requirements — Folksy Proverb Model

Date: 2026-03-13
Status: Planning
Prerequisite: Corpus generation complete (9,025 sayings, 36,079 training pairs)


1. Objective

Fine-tune a 0.5B parameter language model to generate folksy proverbs on demand. The model should respond to varied prompt styles (word-seeded, persona-seeded, template-seeded, open-ended) with natural-sounding fake folk wisdom in the style of the generated corpus.


2. Base Model Selection

Recommendation: Qwen2.5-0.5B-Instruct

| Criterion | Qwen2.5-0.5B-Instruct |
|---|---|
| Parameters | 494M |
| Architecture | Transformer decoder, GQA, RoPE |
| Context window | 32,768 tokens |
| License | Apache 2.0 |
| Source | Qwen/Qwen2.5-0.5B-Instruct on HuggingFace |
| Tokenizer | Byte-level BPE (151,646 vocab) |

Why Qwen2.5-0.5B-Instruct

  • Exact size target. The corpus generation spec calls for a 0.5B model; this is precisely that.
  • Already instruction-tuned. The training pairs use an input/output (instruction/response) format. Starting from an instruct-tuned base means the model already understands the turn structure — we're teaching it what to say, not how to follow instructions.
  • Modern architecture. Grouped-query attention and RoPE positional encoding. Trains efficiently and runs fast at inference.
  • Apache 2.0. No usage restrictions for any deployment scenario.
  • Strong small-model baseline. Qwen2.5-0.5B benchmarks well against peers (SmolLM2-360M, TinyLlama-1.1B, GPT-2 Medium). It punches above its weight on language tasks.

Alternatives Considered

| Model | Size | Why Not |
|---|---|---|
| SmolLM2-360M | 360M | Slightly undersized; weaker language generation quality |
| TinyLlama-1.1B | 1.1B | 2x the target size; more VRAM, slower inference, marginal quality gain for this task |
| GPT-2 Medium | 355M | Outdated architecture (absolute positional encoding, no GQA); poor instruction-following baseline |
| Phi-3-mini | 3.8B | 7.6x over budget; overkill for a narrow-domain generative task |

3. Training Data

Corpus Summary

| Metric | Value |
|---|---|
| Training pairs | 36,079 |
| Unique sayings | 9,025 |
| File | corpus/training_pairs.jsonl |
| Size on disk | 7.5 MB |
| Average output length | 10.1 words (~15-20 tokens) |
| Average input length | ~6-10 words (~8-15 tokens) |
| Vocab coverage | 624/624 (100%) |

Format

Each line is a JSON object:

{"input": "Tell me something about color.", "output": "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest.", "meta_template": "deconstruction", "source_words": ["hamburger", "beef", "color", "tomato"]}

Distribution by Template Family

| Template | Sayings | % |
|---|---|---|
| false_equivalence | 1,897 | 21.0% |
| futile_preparation | 1,735 | 19.2% |
| ironic_deficiency | 1,563 | 17.3% |
| deconstruction | 1,544 | 17.1% |
| hypocritical_complaint | 811 | 9.0% |
| denial_of_consequences | 750 | 8.3% |
| tautological_wisdom | 725 | 8.0% |

Three families are below the 10% balance threshold (denial_of_consequences, hypocritical_complaint, tautological_wisdom). This is a known issue from corpus generation. The training should still work — the model will slightly underperform on these templates but the imbalance is not severe.

Distribution by Input Type

| Input Type | Pairs |
|---|---|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |

Data Preparation for Training

The JSONL needs to be converted to chat-template format for the Qwen2.5 tokenizer:

# Each training pair becomes:
messages = [
    {"role": "user", "content": entry["input"]},
    {"role": "assistant", "content": entry["output"]}
]
# Tokenized using Qwen2.5's chat template via tokenizer.apply_chat_template()

Split: 90/5/5 (train/validation/test) → ~32,471 train / ~1,804 val / ~1,804 test. Stratify by meta_template to preserve template distribution in each split.
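
A minimal sketch of the stratified split with the datasets library. stratify_by_column only accepts ClassLabel features, so the string column is cast first:

from datasets import ClassLabel, load_dataset

ds = load_dataset("json", data_files="corpus/training_pairs.jsonl")["train"]

# Cast the string column to ClassLabel so stratification works
labels = sorted(set(ds["meta_template"]))
ds = ds.cast_column("meta_template", ClassLabel(names=labels))

# 90/10 first, then halve the holdout -> 90/5/5 train/val/test
split = ds.train_test_split(test_size=0.10, seed=42, stratify_by_column="meta_template")
holdout = split["test"].train_test_split(test_size=0.50, seed=42, stratify_by_column="meta_template")
train, val, test = split["train"], holdout["train"], holdout["test"]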


4. Hardware

Local GPU: RTX 4090

| Resource | Available |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| VRAM | 24 GB GDDR6X |
| System RAM | 128 GB |
| CPU cores | 12 |
| CUDA driver | 555.42.02 |

VRAM Budget (Full Fine-Tune, bf16)

| Component | Estimate |
|---|---|
| Model weights (bf16) | ~1.0 GB |
| Gradients (bf16) | ~1.0 GB |
| Optimizer states (AdamW, two moments) | ~2-4 GB (bf16 vs. fp32 moments) |
| Activations (batch 32, seq 128) | ~2-4 GB |
| CUDA overhead + buffers | ~1-2 GB |
| Total | ~7-12 GB |

A 0.5B model full fine-tune fits comfortably in 24 GB VRAM. No need for LoRA, QLoRA, or gradient checkpointing. Full fine-tune is the simplest and most effective approach for this model size.
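
The table's figures are straightforward bytes-per-parameter arithmetic; a back-of-envelope check:

params = 494e6
gib = 1024**3

weights = params * 2 / gib          # bf16: 2 bytes/param        -> ~0.9 GiB
grads = params * 2 / gib            # bf16 gradients             -> ~0.9 GiB
optim_bf16 = params * 2 * 2 / gib   # two AdamW moments in bf16  -> ~1.8 GiB
optim_fp32 = params * 2 * 4 / gib   # two AdamW moments in fp32  -> ~3.7 GiB
print(f"fixed cost: {weights + grads + optim_bf16:.1f} to {weights + grads + optim_fp32:.1f} GiB")
# Activations (~2-4 GB at batch 32 / seq 128) and CUDA buffers come on top.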

Note on Concurrent GPU Use

The local 4090 currently serves GLM4-32B for inference at 192.168.1.100:8853. Training and LLM serving cannot run simultaneously — the training job needs the full 24 GB. Shut down the vLLM/inference server before starting training. This means LLM-as-judge evaluation must happen either before or after training, not during.
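
A minimal pre-flight check before launching training, to catch a still-running inference server:

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current device
print(f"free: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
if free / total < 0.9:
    raise RuntimeError("GPU is not free -- is the GLM4-32B server still running?")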


5. Training Approach

Method: Full Fine-Tune (SFT)

No LoRA/QLoRA. At 0.5B parameters with 24 GB VRAM, full fine-tune is straightforward and produces the best results. LoRA's parameter-efficiency advantage is irrelevant when the full model fits in memory with room to spare.

Framework: HuggingFace Transformers + TRL

transformers >= 4.45.0
trl >= 0.12.0
datasets >= 3.0.0
torch >= 2.4.0
accelerate >= 1.0.0
peft  # not needed for full fine-tune, but useful if experimenting with LoRA later
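
After installing, a quick sanity check that the version floors are met and bf16 is usable on the GPU:

import datasets, torch, transformers, trl

print("torch", torch.__version__, "| transformers", transformers.__version__,
      "| trl", trl.__version__, "| datasets", datasets.__version__)
print("bf16 supported:", torch.cuda.is_bf16_supported())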

Hyperparameters

| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-5 | Standard for SFT on instruct models |
| LR scheduler | Cosine with warmup | |
| Warmup ratio | 0.05 | ~150 steps |
| Epochs | 3 | Small dataset; overfitting likely beyond 3-5 epochs |
| Per-device batch size | 32 | Fits easily; increase if VRAM allows |
| Gradient accumulation | 1 | Effective batch = 32 |
| Max sequence length | 128 | Inputs ~15 tokens + outputs ~20 tokens; 128 is generous |
| Weight decay | 0.01 | |
| Precision | bf16 | 4090 supports bf16 natively |
| Optimizer | AdamW (torch fused) | |
| Eval strategy | steps (every 100) | |
| Save strategy | steps (every 500) | |
| Logging | TensorBoard or W&B | |

Estimated Training Time

| Metric | Value |
|---|---|
| Training examples | ~32,471 |
| Steps per epoch (batch 32) | ~1,015 |
| Total steps (3 epochs) | ~3,045 |
| Throughput estimate (4090, 0.5B, bf16) | ~80-120 steps/min |
| Estimated wall time | 25-40 minutes |

This is a very fast training job. Even 5 epochs would finish in under an hour.
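
The wall-time estimate is simple arithmetic over the throughput range:

import math

examples, batch, epochs = 32_471, 32, 3
steps = math.ceil(examples / batch) * epochs   # ~3,045 steps
for steps_per_min in (80, 120):
    print(f"{steps_per_min:3d} steps/min -> {steps / steps_per_min:.0f} min")
# 80 steps/min -> ~38 min; 120 steps/min -> ~25 min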

Training Script Skeleton

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")

dataset = load_dataset("json", data_files="corpus/training_pairs.jsonl")

def format_chat(example):
    return {
        "messages": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

dataset = dataset.map(format_chat)
# 90/10 split, then halve the holdout -> 90/5/5 train/val/test
# (the data-prep script should also stratify by meta_template; omitted here for brevity)
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

training_args = SFTConfig(
    output_dir="./folksy-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    max_seq_length=128,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=500,
    logging_steps=10,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=holdout["train"],  # 5% validation; holdout["test"] is reserved for final testing
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./folksy-model/final")
tokenizer.save_pretrained("./folksy-model/final")

6. Evaluation

Automated Metrics

  1. Validation loss / perplexity — tracked during training via eval steps. Watch for overfitting (val loss increasing while train loss decreases).
  2. BLEU/ROUGE on test set — sanity check, but not the primary metric for creative generation.
  3. Template coverage — generate 1,000 sayings with varied prompts, verify all 7 template families appear in output.
  4. Lexical diversity — distinct-1, distinct-2 (unique unigrams/bigrams in generated output). Low diversity = mode collapse.
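
Distinct-n is easy to compute directly; a minimal sketch over whitespace tokens:

def distinct_n(texts: list[str], n: int) -> float:
    """Unique n-grams divided by total n-grams across all generations."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Values near 0 on a large sample indicate mode collapse
sayings = ["Don't trade a fiddle for a song.", "Don't trade a mule for a tune."]
print(distinct_n(sayings, 1), distinct_n(sayings, 2))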

LLM-as-Judge Evaluation (Self-Hosted)

Use GLM4-32B (already available at 192.168.1.100:8853) as an automated evaluator. No external API needed.

Procedure:

  1. Stop the training job (free the GPU)
  2. Restart the GLM4-32B inference server
  3. Generate 200 sayings from the fine-tuned model across all prompt types
  4. Send each to GLM4-32B with a judge prompt

Judge prompt:

Rate this folk saying on a 1-5 scale:
- 5: Sounds like a real proverb — natural, witty, memorable
- 4: Good folksy saying — natural language, clear meaning
- 3: Acceptable — grammatically correct but flat or formulaic
- 2: Awkward — grammatical issues or forced phrasing
- 1: Broken — nonsensical, incomplete, or garbled

Saying: "{generated_saying}"

Respond with only the number and a one-sentence justification.

Target: Mean score >= 3.5, with <10% scoring 1 or 2.
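
A minimal judge loop, assuming the GLM4-32B server exposes an OpenAI-compatible chat endpoint (vLLM's default) and is registered under the model name shown; both are deployment details to verify against /v1/models:

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://192.168.1.100:8853/v1", api_key="unused")

def judge(saying: str, rubric: str) -> str:
    # rubric is the judge prompt above, with a {generated_saying} placeholder
    resp = client.chat.completions.create(
        model="GLM4-32B",  # assumed served-model name
        messages=[{"role": "user", "content": rubric.format(generated_saying=saying)}],
        temperature=0.0,
        max_tokens=64,
    )
    return resp.choices[0].message.content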

Human Spot-Check

Sample 50 generated sayings, rate as Good/Okay/Bad (same criteria as EVALUATION.md). Target: >60% Good, <10% Bad.

A/B Comparison

Generate 50 sayings from the fine-tuned model and 50 from the raw template engine. Present pairs to GLM4-32B (or manually) and ask which sounds more natural. The fine-tuned model should win >80% of comparisons.


7. Output Artifacts

| Artifact | Path | Description |
|---|---|---|
| Model weights | folksy-model/final/ | Full bf16 model (~1 GB) |
| Tokenizer | folksy-model/final/ | Qwen2.5 tokenizer config + vocab |
| Training logs | folksy-model/runs/ | TensorBoard event files |
| Checkpoints | folksy-model/checkpoint-*/ | Intermediate saves every 500 steps |
| Eval results | folksy-model/eval_results.json | Automated metrics on test set |
| Judge results | folksy-model/judge_results.jsonl | GLM4-32B evaluation scores |

Model Distribution (Optional)

The final model can be:

  • Quantized to GGUF (via llama.cpp) for CPU inference — a 0.5B model runs on any machine
  • Pushed to HuggingFace Hub if sharing publicly
  • Served locally via vLLM, llama.cpp, or Ollama for integration testing

8. RunPod Feasibility

Is RunPod needed?

No. The local RTX 4090 is more than sufficient. A 0.5B full fine-tune on 36K examples will finish in under an hour. RunPod would be useful only if:

  • The local GPU is occupied with inference work that can't be interrupted
  • You want to run multiple training experiments in parallel (hyperparameter sweeps)
  • You scale up to a larger base model (e.g., 3B+)

If RunPod is Used Anyway

| Instance | GPU | VRAM | Hourly Cost (approx) | Notes |
|---|---|---|---|---|
| RTX 4090 | 1x 4090 | 24 GB | ~$0.40/hr | Identical to local hardware |
| A40 | 1x A40 | 48 GB | ~$0.50/hr | More VRAM headroom; good if experimenting with larger batch sizes |
| RTX A6000 | 1x A6000 | 48 GB | ~$0.60/hr | Same tier as A40 |

At ~30-40 minutes of training time, the total cost would be under $0.50 for a single run. Use the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 template.


9. Self-Hosted LLM Jobs (Replacing External API Dependencies)

All tasks that might otherwise require a paid API key are reformulated as self-hosted jobs using the existing local infrastructure.

| Task | External API Alternative | Self-Hosted Solution |
|---|---|---|
| LLM-as-Judge evaluation | GPT-4 / Claude API | GLM4-32B (32B, local 4090), already running at 192.168.1.100:8853 |
| Data augmentation (if more training pairs needed) | GPT-4 for paraphrase generation | GLM4-32B, same endpoint, same prompts used for corpus polishing |
| Synthetic evaluation prompts | API-generated diverse test prompts | GLM4-32B, generate varied evaluation prompts locally |
| Model comparison judging | Claude API for A/B preference judging | GLM4-32B, structured judge prompt with forced-choice output |
| Embedding-based dedup (if scaling corpus) | OpenAI embeddings API | Sentence-transformers (e.g., all-MiniLM-L6-v2, runs on CPU, 80MB) |
| Classification of failure modes | API-based analysis | GLM4-32B, classify generated sayings by quality/failure type |
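
As a sketch of the embedding-based dedup row above: flag near-duplicate sayings with sentence-transformers on CPU (the 0.9 cutoff is an assumed threshold to tune):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
sayings = [
    "A burger without beef is just a fancy tomato.",
    "A burger with no beef is only a fancy tomato.",
    "Never lend a ladder to a man in a hole.",
]
emb = model.encode(sayings, normalize_embeddings=True)
sims = cos_sim(emb, emb)
for i in range(len(sayings)):
    for j in range(i + 1, len(sayings)):
        score = float(sims[i][j])
        if score > 0.9:
            print(f"near-duplicate pair ({score:.2f}): {sayings[i]!r} / {sayings[j]!r}")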

Key constraint: GLM4-32B and the training job cannot share the 4090 simultaneously. Sequence the workflow as:

1. [GPU: GLM4-32B] Pre-training evaluation — generate baseline judge scores
2. [GPU: idle]      Shut down inference server
3. [GPU: training]  Fine-tune Qwen2.5-0.5B (~30-40 min)
4. [GPU: idle]      Training complete, model saved
5. [GPU: GLM4-32B]  Restart inference server
6. [GPU: GLM4-32B]  Post-training evaluation — judge fine-tuned model output

If a larger judge model is desired (e.g., 70B for higher-quality evaluation), options for the 4090:

  • Qwen2.5-72B-Instruct-AWQ (4-bit, ~40GB) — does not fit in 24 GB single-GPU
  • Qwen2.5-32B-Instruct — similar quality tier to GLM4-32B, interchangeable
  • Llama-3.1-70B-Instruct (GGUF Q4_K_M) — ~40GB, does not fit single 4090
  • Conclusion: GLM4-32B is the practical ceiling for single-4090 evaluation. For 70B+ judge models, use RunPod with 2x A6000 or 1x A100 80GB (~$1-2/hr, would need <1 hour for 200 evaluations).

10. Blockers and Dependencies

No Blockers

The training pipeline has no external dependencies that aren't already met.

| Dependency | Status |
|---|---|
| Training corpus (training_pairs.jsonl) | Complete (36,079 pairs) |
| GPU hardware (RTX 4090) | Available |
| Base model (Qwen2.5-0.5B-Instruct) | Public on HuggingFace, Apache 2.0 |
| Python packages (transformers, trl, torch) | Install needed: pip install transformers trl datasets accelerate torch |
| Evaluation LLM (GLM4-32B) | Running locally |
| CUDA toolkit | Installed (driver 555.42.02) |

Pre-Training Checklist

  • Install training dependencies: pip install transformers trl datasets accelerate tensorboard
  • Download base model: huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
  • Write training script (skeleton provided in Section 5)
  • Write data preparation script (JSONL → chat-template format with train/val/test split)
  • Shut down GLM4-32B inference server to free GPU memory
  • Run training (~30-40 min)
  • Restart GLM4-32B for evaluation
  • Run LLM-as-judge evaluation
  • Human spot-check of 50 generated sayings

Risks

| Risk | Likelihood | Mitigation |
|---|---|---|
| Overfitting on 36K examples | Medium | Monitor val loss; use early stopping; try 2-3 epochs instead of 5 |
| Mode collapse (model produces same few sayings) | Low-Medium | Check distinct-n metrics; if occurring, reduce learning rate or add dropout |
| Template imbalance causes weak coverage | Low | 3 templates are at 8-9% (below the 10% threshold); generate more pairs for these families if needed |
| GPU memory conflict with inference server | None if sequenced | Follow the GPU sequencing workflow in Section 9 |

11. Stretch Goals (Post-Initial Training)

These are not blockers but could improve results in subsequent iterations:

  1. Fictional entity fine-tuning pairs — The CORPUS_GENERATION_SPEC describes ~200-300 fictional entity training pairs. These were not included in the current 36K corpus. Adding them would teach the model to generalize to novel concepts provided at inference time.

  2. DPO/ORPO alignment — After SFT, use GLM4-32B to generate preference pairs (chosen vs rejected sayings) and run a DPO pass (see the sketch after this list). This could sharpen quality without needing more training data.

  3. Hyperparameter sweep — Run 3-5 configs varying learning rate (1e-5 to 5e-5) and epochs (2-5). With 30-minute training runs, a full sweep takes ~2.5 hours.

  4. Larger base model experiment — Try Qwen2.5-1.5B-Instruct (1.5B params) as a comparison. Still fits on the 4090 for full fine-tune. Compare quality vs the 0.5B model to see if the extra parameters matter for this narrow task.

  5. GGUF export for deployment — Convert the final model to GGUF Q8_0 format for CPU-only deployment via llama.cpp or Ollama. A 0.5B Q8 model is ~500MB and runs at interactive speed on any modern CPU.
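
For stretch goal 2, a minimal DPO pass with TRL. The preference file path is hypothetical; it would hold one JSON object per line with "prompt", "chosen", and "rejected" fields built from GLM4-32B judgments:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "./folksy-model/final"  # the SFT checkpoint from Section 5
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference dataset built from GLM4-32B judgments
prefs = load_dataset("json", data_files="corpus/preference_pairs.jsonl")["train"]

args = DPOConfig(output_dir="./folksy-model-dpo", beta=0.1, bf16=True, num_train_epochs=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs, processing_class=tokenizer)
trainer.train()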