GPU Training Requirements — Folksy Proverb Model

Date: 2026-03-13
Status: Planning
Prerequisite: Corpus generation complete (9,025 sayings, 36,079 training pairs)


1. Objective

Fine-tune a 0.5B parameter language model to generate folksy proverbs on demand. The model should respond to varied prompt styles (word-seeded, persona-seeded, template-seeded, open-ended) with natural-sounding fake folk wisdom in the style of the generated corpus.


2. Base Model Selection

Recommendation: Qwen2.5-0.5B-Instruct

| Criterion | Qwen2.5-0.5B-Instruct |
|---|---|
| Parameters | 494M |
| Architecture | Transformer decoder, GQA, RoPE |
| Context window | 32,768 tokens |
| License | Apache 2.0 |
| Source | Qwen/Qwen2.5-0.5B-Instruct on HuggingFace |
| Tokenizer | Byte-level BPE (151,646 vocab) |

Why Qwen2.5-0.5B-Instruct

  • Exact size target. The corpus generation spec calls for a 0.5B model; this is precisely that.
  • Already instruction-tuned. The training pairs use an input/output (instruction/response) format. Starting from an instruct-tuned base means the model already understands the turn structure — we're teaching it what to say, not how to follow instructions.
  • Modern architecture. Grouped-query attention and RoPE positional encoding. Trains efficiently and runs fast at inference.
  • Apache 2.0. No usage restrictions for any deployment scenario.
  • Strong small-model baseline. Qwen2.5-0.5B benchmarks well against peers (SmolLM2-360M, TinyLlama-1.1B, GPT-2 Medium). It punches above its weight on language tasks.

Alternatives Considered

| Model | Size | Why Not |
|---|---|---|
| SmolLM2-360M | 360M | Slightly undersized; weaker language generation quality |
| TinyLlama-1.1B | 1.1B | 2x the target size; more VRAM, slower inference, marginal quality gain for this task |
| GPT-2 Medium | 355M | Outdated architecture (absolute positional encoding, no GQA); poor instruction-following baseline |
| Phi-3-mini | 3.8B | 7.6x over budget; overkill for a narrow-domain generative task |

3. Training Data

Corpus Summary

| Metric | Value |
|---|---|
| Training pairs | 36,079 |
| Unique sayings | 9,025 |
| File | corpus/training_pairs.jsonl |
| Size on disk | 7.5 MB |
| Average output length | 10.1 words (~15-20 tokens) |
| Average input length | ~6-10 words (~8-15 tokens) |
| Vocab coverage | 624/624 (100%) |

Format

Each line is a JSON object:

{"input": "Tell me something about color.", "output": "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest.", "meta_template": "deconstruction", "source_words": ["hamburger", "beef", "color", "tomato"]}

Distribution by Template Family

| Template | Sayings | % |
|---|---|---|
| false_equivalence | 1,897 | 21.0% |
| futile_preparation | 1,735 | 19.2% |
| ironic_deficiency | 1,563 | 17.3% |
| deconstruction | 1,544 | 17.1% |
| hypocritical_complaint | 811 | 9.0% |
| denial_of_consequences | 750 | 8.3% |
| tautological_wisdom | 725 | 8.0% |

Three families are below the 10% balance threshold (denial_of_consequences, hypocritical_complaint, tautological_wisdom). This is a known issue from corpus generation. The training should still work — the model will slightly underperform on these templates but the imbalance is not severe.

Distribution by Input Type

| Input Type | Pairs |
|---|---|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |

Data Preparation for Training

The JSONL needs to be converted to chat-template format for the Qwen2.5 tokenizer:

# Each training pair becomes:
messages = [
    {"role": "user", "content": entry["input"]},
    {"role": "assistant", "content": entry["output"]}
]
# Tokenized using Qwen2.5's chat template via tokenizer.apply_chat_template()

Split: 90/5/5 (train/validation/test) → ~32,471 train / ~1,804 val / ~1,804 test. Stratify by meta_template to preserve template distribution in each split.
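
A minimal sketch of the stratified split with the datasets library. stratify_by_column only accepts ClassLabel features, so the string column is cast first:

from datasets import ClassLabel, load_dataset

ds = load_dataset("json", data_files="corpus/training_pairs.jsonl")["train"]

# Cast the string column to ClassLabel so stratification works
labels = sorted(set(ds["meta_template"]))
ds = ds.cast_column("meta_template", ClassLabel(names=labels))

# 90/10 first, then halve the holdout -> 90/5/5 train/val/test
split = ds.train_test_split(test_size=0.10, seed=42, stratify_by_column="meta_template")
holdout = split["test"].train_test_split(test_size=0.50, seed=42, stratify_by_column="meta_template")
train, val, test = split["train"], holdout["train"], holdout["test"]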


4. Hardware

Local GPU: RTX 4090

| Resource | Available |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| VRAM | 24 GB GDDR6X |
| System RAM | 128 GB |
| CPU cores | 12 |
| CUDA driver | 555.42.02 |

VRAM Budget (Full Fine-Tune, bf16)

| Component | Estimate |
|---|---|
| Model weights (bf16) | ~1.0 GB |
| Gradients (bf16) | ~1.0 GB |
| Optimizer states (AdamW, two moments) | ~2-4 GB (bf16 vs. fp32 moments) |
| Activations (batch 32, seq 128) | ~2-4 GB |
| CUDA overhead + buffers | ~1-2 GB |
| Total | ~7-12 GB |

A 0.5B model full fine-tune fits comfortably in 24 GB VRAM. No need for LoRA, QLoRA, or gradient checkpointing. Full fine-tune is the simplest and most effective approach for this model size.
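
The table's figures are straightforward bytes-per-parameter arithmetic; a back-of-envelope check:

params = 494e6
gib = 1024**3

weights = params * 2 / gib          # bf16: 2 bytes/param        -> ~0.9 GiB
grads = params * 2 / gib            # bf16 gradients             -> ~0.9 GiB
optim_bf16 = params * 2 * 2 / gib   # two AdamW moments in bf16  -> ~1.8 GiB
optim_fp32 = params * 2 * 4 / gib   # two AdamW moments in fp32  -> ~3.7 GiB
print(f"fixed cost: {weights + grads + optim_bf16:.1f} to {weights + grads + optim_fp32:.1f} GiB")
# Activations (~2-4 GB at batch 32 / seq 128) and CUDA buffers come on top.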

Note on Concurrent GPU Use

The local 4090 currently serves GLM4-32B for inference at 192.168.1.100:8853. Training and LLM serving cannot run simultaneously — the training job needs the full 24 GB. Shut down the vLLM/inference server before starting training. This means LLM-as-judge evaluation must happen either before or after training, not during.
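
A minimal pre-flight check before launching training, to catch a still-running inference server:

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current device
print(f"free: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
if free / total < 0.9:
    raise RuntimeError("GPU is not free -- is the GLM4-32B server still running?")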


5. Training Approach

Method: Full Fine-Tune (SFT)

No LoRA/QLoRA. At 0.5B parameters with 24 GB VRAM, full fine-tune is straightforward and produces the best results. LoRA's parameter-efficiency advantage is irrelevant when the full model fits in memory with room to spare.

Framework: HuggingFace Transformers + TRL

transformers >= 4.45.0
trl >= 0.12.0
datasets >= 3.0.0
torch >= 2.4.0
accelerate >= 1.0.0
peft  # not needed for full fine-tune, but useful if experimenting with LoRA later
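
After installing, a quick sanity check that the version floors are met and bf16 is usable on the GPU:

import datasets, torch, transformers, trl

print("torch", torch.__version__, "| transformers", transformers.__version__,
      "| trl", trl.__version__, "| datasets", datasets.__version__)
print("bf16 supported:", torch.cuda.is_bf16_supported())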

Hyperparameters

| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-5 | Standard for SFT on instruct models |
| LR scheduler | Cosine with warmup | |
| Warmup ratio | 0.05 | ~150 steps |
| Epochs | 3 | Small dataset; overfitting likely beyond 3-5 epochs |
| Per-device batch size | 32 | Fits easily; increase if VRAM allows |
| Gradient accumulation | 1 | Effective batch = 32 |
| Max sequence length | 128 | Inputs ~15 tokens + outputs ~20 tokens; 128 is generous |
| Weight decay | 0.01 | |
| Precision | bf16 | 4090 supports bf16 natively |
| Optimizer | AdamW (torch fused) | |
| Eval strategy | steps (every 100) | |
| Save strategy | steps (every 500) | |
| Logging | TensorBoard or W&B | |

Estimated Training Time

| Metric | Value |
|---|---|
| Training examples | ~32,471 |
| Steps per epoch (batch 32) | ~1,015 |
| Total steps (3 epochs) | ~3,045 |
| Throughput estimate (4090, 0.5B, bf16) | ~80-120 steps/min |
| Estimated wall time | 25-40 minutes |

This is a very fast training job. Even 5 epochs would finish in under an hour.
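
The wall-time estimate is simple arithmetic over the throughput range:

import math

examples, batch, epochs = 32_471, 32, 3
steps = math.ceil(examples / batch) * epochs   # ~3,045 steps
for steps_per_min in (80, 120):
    print(f"{steps_per_min:3d} steps/min -> {steps / steps_per_min:.0f} min")
# 80 steps/min -> ~38 min; 120 steps/min -> ~25 min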

Training Script Skeleton

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")

dataset = load_dataset("json", data_files="corpus/training_pairs.jsonl")

def format_chat(example):
    return {
        "messages": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

dataset = dataset.map(format_chat)
# 90/10 split, then halve the holdout -> 90/5/5 train/val/test
# (the data-prep script should also stratify by meta_template; omitted here for brevity)
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

training_args = SFTConfig(
    output_dir="./folksy-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    max_seq_length=128,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=500,
    logging_steps=10,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=holdout["train"],  # 5% validation; holdout["test"] is reserved for final testing
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./folksy-model/final")
tokenizer.save_pretrained("./folksy-model/final")

6. Evaluation

Automated Metrics

  1. Validation loss / perplexity — tracked during training via eval steps. Watch for overfitting (val loss increasing while train loss decreases).
  2. BLEU/ROUGE on test set — sanity check, but not the primary metric for creative generation.
  3. Template coverage — generate 1,000 sayings with varied prompts, verify all 7 template families appear in output.
  4. Lexical diversity — distinct-1, distinct-2 (unique unigrams/bigrams in generated output). Low diversity = mode collapse.
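
Distinct-n is easy to compute directly; a minimal sketch over whitespace tokens:

def distinct_n(texts: list[str], n: int) -> float:
    """Unique n-grams divided by total n-grams across all generations."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Values near 0 on a large sample indicate mode collapse
sayings = ["Don't trade a fiddle for a song.", "Don't trade a mule for a tune."]
print(distinct_n(sayings, 1), distinct_n(sayings, 2))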

LLM-as-Judge Evaluation (Self-Hosted)

Use GLM4-32B (already available at 192.168.1.100:8853) as an automated evaluator. No external API needed.

Procedure:

  1. Stop the training job (free the GPU)
  2. Restart the GLM4-32B inference server
  3. Generate 200 sayings from the fine-tuned model across all prompt types
  4. Send each to GLM4-32B with a judge prompt

Judge prompt:

Rate this folk saying on a 1-5 scale:
- 5: Sounds like a real proverb — natural, witty, memorable
- 4: Good folksy saying — natural language, clear meaning
- 3: Acceptable — grammatically correct but flat or formulaic
- 2: Awkward — grammatical issues or forced phrasing
- 1: Broken — nonsensical, incomplete, or garbled

Saying: "{generated_saying}"

Respond with only the number and a one-sentence justification.

Target: Mean score >= 3.5, with <10% scoring 1 or 2.
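
A minimal judge loop, assuming the GLM4-32B server exposes an OpenAI-compatible chat endpoint (vLLM's default) and is registered under the model name shown; both are deployment details to verify against /v1/models:

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://192.168.1.100:8853/v1", api_key="unused")

def judge(saying: str, rubric: str) -> str:
    # rubric is the judge prompt above, with a {generated_saying} placeholder
    resp = client.chat.completions.create(
        model="GLM4-32B",  # assumed served-model name
        messages=[{"role": "user", "content": rubric.format(generated_saying=saying)}],
        temperature=0.0,
        max_tokens=64,
    )
    return resp.choices[0].message.content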

Human Spot-Check

Sample 50 generated sayings, rate as Good/Okay/Bad (same criteria as EVALUATION.md). Target: >60% Good, <10% Bad.

A/B Comparison

Generate 50 sayings from the fine-tuned model and 50 from the raw template engine. Present pairs to GLM4-32B (or manually) and ask which sounds more natural. The fine-tuned model should win >80% of comparisons.


7. Output Artifacts

| Artifact | Path | Description |
|---|---|---|
| Model weights | folksy-model/final/ | Full bf16 model (~1 GB) |
| Tokenizer | folksy-model/final/ | Qwen2.5 tokenizer config + vocab |
| Training logs | folksy-model/runs/ | TensorBoard event files |
| Checkpoints | folksy-model/checkpoint-*/ | Intermediate saves every 500 steps |
| Eval results | folksy-model/eval_results.json | Automated metrics on test set |
| Judge results | folksy-model/judge_results.jsonl | GLM4-32B evaluation scores |

Model Distribution (Optional)

The final model can be:

  • Quantized to GGUF (via llama.cpp) for CPU inference — a 0.5B model runs on any machine
  • Pushed to HuggingFace Hub if sharing publicly
  • Served locally via vLLM, llama.cpp, or Ollama for integration testing

8. RunPod Feasibility

Is RunPod needed?

No. The local RTX 4090 is more than sufficient. A 0.5B full fine-tune on 36K examples will finish in under an hour. RunPod would be useful only if:

  • The local GPU is occupied with inference work that can't be interrupted
  • You want to run multiple training experiments in parallel (hyperparameter sweeps)
  • You scale up to a larger base model (e.g., 3B+)

If RunPod is Used Anyway

| Instance | GPU | VRAM | Hourly Cost (approx) | Notes |
|---|---|---|---|---|
| RTX 4090 | 1x 4090 | 24 GB | ~$0.40/hr | Identical to local hardware |
| A40 | 1x A40 | 48 GB | ~$0.50/hr | More VRAM headroom; good if experimenting with larger batch sizes |
| RTX A6000 | 1x A6000 | 48 GB | ~$0.60/hr | Same tier as A40 |

At ~30-40 minutes of training time, the total cost would be under $0.50 for a single run. Use the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 template.


9. Self-Hosted LLM Jobs (Replacing External API Dependencies)

All tasks that might otherwise require a paid API key are reformulated as self-hosted jobs using the existing local infrastructure.

| Task | External API Alternative | Self-Hosted Solution |
|---|---|---|
| LLM-as-Judge evaluation | GPT-4 / Claude API | GLM4-32B (32B, local 4090), already running at 192.168.1.100:8853 |
| Data augmentation (if more training pairs needed) | GPT-4 for paraphrase generation | GLM4-32B, same endpoint, same prompts used for corpus polishing |
| Synthetic evaluation prompts | API-generated diverse test prompts | GLM4-32B, generate varied evaluation prompts locally |
| Model comparison judging | Claude API for A/B preference judging | GLM4-32B, structured judge prompt with forced-choice output |
| Embedding-based dedup (if scaling corpus) | OpenAI embeddings API | Sentence-transformers (e.g., all-MiniLM-L6-v2, runs on CPU, 80MB) |
| Classification of failure modes | API-based analysis | GLM4-32B, classify generated sayings by quality/failure type |
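
As a sketch of the embedding-based dedup row above: flag near-duplicate sayings with sentence-transformers on CPU (the 0.9 cutoff is an assumed threshold to tune):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
sayings = [
    "A burger without beef is just a fancy tomato.",
    "A burger with no beef is only a fancy tomato.",
    "Never lend a ladder to a man in a hole.",
]
emb = model.encode(sayings, normalize_embeddings=True)
sims = cos_sim(emb, emb)
for i in range(len(sayings)):
    for j in range(i + 1, len(sayings)):
        score = float(sims[i][j])
        if score > 0.9:
            print(f"near-duplicate pair ({score:.2f}): {sayings[i]!r} / {sayings[j]!r}")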

Key constraint: GLM4-32B and the training job cannot share the 4090 simultaneously. Sequence the workflow as:

1. [GPU: GLM4-32B] Pre-training evaluation — generate baseline judge scores
2. [GPU: idle]      Shut down inference server
3. [GPU: training]  Fine-tune Qwen2.5-0.5B (~30-40 min)
4. [GPU: idle]      Training complete, model saved
5. [GPU: GLM4-32B]  Restart inference server
6. [GPU: GLM4-32B]  Post-training evaluation — judge fine-tuned model output

If a larger judge model is desired (e.g., 70B for higher-quality evaluation), options for the 4090:

  • Qwen2.5-72B-Instruct-AWQ (4-bit, ~40GB) — does not fit in 24 GB single-GPU
  • Qwen2.5-32B-Instruct — similar quality tier to GLM4-32B, interchangeable
  • Llama-3.1-70B-Instruct (GGUF Q4_K_M) — ~40GB, does not fit single 4090
  • Conclusion: GLM4-32B is the practical ceiling for single-4090 evaluation. For 70B+ judge models, use RunPod with 2x A6000 or 1x A100 80GB (~$1-2/hr, would need <1 hour for 200 evaluations).

10. Blockers and Dependencies

No Blockers

The training pipeline has no external dependencies that aren't already met.

| Dependency | Status |
|---|---|
| Training corpus (training_pairs.jsonl) | Complete (36,079 pairs) |
| GPU hardware (RTX 4090) | Available |
| Base model (Qwen2.5-0.5B-Instruct) | Public on HuggingFace, Apache 2.0 |
| Python packages (transformers, trl, torch) | Install needed: pip install transformers trl datasets accelerate torch |
| Evaluation LLM (GLM4-32B) | Running locally |
| CUDA toolkit | Installed (driver 555.42.02) |

Pre-Training Checklist

  • Install training dependencies: pip install transformers trl datasets accelerate tensorboard
  • Download base model: huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
  • Write training script (skeleton provided in Section 5)
  • Write data preparation script (JSONL → chat-template format with train/val/test split)
  • Shut down GLM4-32B inference server to free GPU memory
  • Run training (~30-40 min)
  • Restart GLM4-32B for evaluation
  • Run LLM-as-judge evaluation
  • Human spot-check of 50 generated sayings

Risks

| Risk | Likelihood | Mitigation |
|---|---|---|
| Overfitting on 36K examples | Medium | Monitor val loss; use early stopping; try 2-3 epochs instead of 5 |
| Mode collapse (model produces same few sayings) | Low-Medium | Check distinct-n metrics; if occurring, reduce learning rate or add dropout |
| Template imbalance causes weak coverage | Low | 3 templates are at 8-9% (below the 10% threshold); generate more pairs for these families if needed |
| GPU memory conflict with inference server | None if sequenced | Follow the GPU sequencing workflow in Section 9 |

11. Stretch Goals (Post-Initial Training)

These are not blockers but could improve results in subsequent iterations:

  1. Fictional entity fine-tuning pairs — The CORPUS_GENERATION_SPEC describes ~200-300 fictional entity training pairs. These were not included in the current 36K corpus. Adding them would teach the model to generalize to novel concepts provided at inference time.

  2. DPO/ORPO alignment — After SFT, use GLM4-32B to generate preference pairs (chosen vs rejected sayings) and run a DPO pass (see the sketch after this list). This could sharpen quality without needing more training data.

  3. Hyperparameter sweep — Run 3-5 configs varying learning rate (1e-5 to 5e-5) and epochs (2-5). With 30-minute training runs, a full sweep takes ~2.5 hours.

  4. Larger base model experiment — Try Qwen2.5-1.5B-Instruct (1.5B params) as a comparison. Still fits on the 4090 for full fine-tune. Compare quality vs the 0.5B model to see if the extra parameters matter for this narrow task.

  5. GGUF export for deployment — Convert the final model to GGUF Q8_0 format for CPU-only deployment via llama.cpp or Ollama. A 0.5B Q8 model is ~500MB and runs at interactive speed on any modern CPU.
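
For stretch goal 2, a minimal DPO pass with TRL. The preference file path is hypothetical; it would hold one JSON object per line with "prompt", "chosen", and "rejected" fields built from GLM4-32B judgments:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "./folksy-model/final"  # the SFT checkpoint from Section 5
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference dataset built from GLM4-32B judgments
prefs = load_dataset("json", data_files="corpus/preference_pairs.jsonl")["train"]

args = DPOConfig(output_dir="./folksy-model-dpo", beta=0.1, bf16=True, num_train_epochs=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs, processing_class=tokenizer)
trainer.train()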