GPU Training Requirements — Folksy Proverb Model
Date: 2026-03-13
Status: Planning
Prerequisite: Corpus generation complete (9,025 sayings, 36,079 training pairs)
1. Objective
Fine-tune a 0.5B parameter language model to generate folksy proverbs on demand. The model should respond to varied prompt styles (word-seeded, persona-seeded, template-seeded, open-ended) with natural-sounding fake folk wisdom in the style of the generated corpus.
2. Base Model Selection
Recommendation: Qwen2.5-0.5B-Instruct
| Criterion | Qwen2.5-0.5B-Instruct |
|---|---|
| Parameters | 494M |
| Architecture | Transformer decoder, GQA, RoPE |
| Context window | 32,768 tokens |
| License | Apache 2.0 |
| Source | Qwen/Qwen2.5-0.5B-Instruct on HuggingFace |
| Tokenizer | Byte-level BPE (151,646 vocab) |
Why Qwen2.5-0.5B-Instruct
- Exact size target. The corpus generation spec calls for a 0.5B model; this is precisely that.
- Already instruction-tuned. The training pairs use an input/output (instruction/response) format. Starting from an instruct-tuned base means the model already understands the turn structure — we're teaching it what to say, not how to follow instructions.
- Modern architecture. Grouped-query attention and RoPE positional embeddings. Trains efficiently and runs fast at inference.
- Apache 2.0. No usage restrictions for any deployment scenario.
- Strong small-model baseline. Qwen2.5-0.5B benchmarks well against peers (SmolLM2-360M, TinyLlama-1.1B, GPT-2 Medium). It punches above its weight on language tasks.
Alternatives Considered
| Model | Size | Why Not |
|---|---|---|
| SmolLM2-360M | 360M | Slightly undersized; weaker language generation quality |
| TinyLlama-1.1B | 1.1B | 2x the target size; more VRAM, slower inference, marginal quality gain for this task |
| GPT-2 Medium | 355M | Outdated architecture (absolute positional encoding, no GQA); poor instruction-following baseline |
| Phi-3-mini | 3.8B | 7.6x over budget; overkill for a narrow-domain generative task |
3. Training Data
Corpus Summary
| Metric | Value |
|---|---|
| Training pairs | 36,079 |
| Unique sayings | 9,025 |
| File | corpus/training_pairs.jsonl |
| Size on disk | 7.5 MB |
| Average output length | 10.1 words (~15-20 tokens) |
| Average input length | ~6-10 words (~8-15 tokens) |
| Vocab coverage | 624/624 (100%) |
Format
Each line is a JSON object:
{"input": "Tell me something about color.", "output": "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest.", "meta_template": "deconstruction", "source_words": ["hamburger", "beef", "color", "tomato"]}
Distribution by Template Family
| Template | Sayings | % |
|---|---|---|
| false_equivalence | 1,897 | 21.0% |
| futile_preparation | 1,735 | 19.2% |
| ironic_deficiency | 1,563 | 17.3% |
| deconstruction | 1,544 | 17.1% |
| hypocritical_complaint | 811 | 9.0% |
| denial_of_consequences | 750 | 8.3% |
| tautological_wisdom | 725 | 8.0% |
Three families are below the 10% balance threshold (denial_of_consequences, hypocritical_complaint, tautological_wisdom). This is a known issue from corpus generation. The training should still work — the model will slightly underperform on these templates but the imbalance is not severe.
Distribution by Input Type
| Input Type | Pairs |
|---|---|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |
Data Preparation for Training
The JSONL needs to be converted to chat-template format for the Qwen2.5 tokenizer:
# Each training pair becomes:
messages = [
{"role": "user", "content": entry["input"]},
{"role": "assistant", "content": entry["output"]}
]
# Tokenized using Qwen2.5's chat template via tokenizer.apply_chat_template()
Split: 90/5/5 (train/validation/test) → ~32,471 train / ~1,804 val / ~1,804 test. Stratify by meta_template to preserve template distribution in each split.
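A minimal sketch of that split using the datasets library; note that train_test_split can only stratify on a ClassLabel column, so meta_template must be cast first:

```python
from datasets import ClassLabel, load_dataset

ds = load_dataset("json", data_files="corpus/training_pairs.jsonl")["train"]

# Stratification requires a ClassLabel column, so cast meta_template first.
names = sorted(set(ds["meta_template"]))
ds = ds.cast_column("meta_template", ClassLabel(names=names))

# 90/10, then halve the holdout -> 90/5/5 with per-template balance preserved.
step1 = ds.train_test_split(test_size=0.10, seed=42, stratify_by_column="meta_template")
step2 = step1["test"].train_test_split(test_size=0.50, seed=42, stratify_by_column="meta_template")
train, val, test = step1["train"], step2["train"], step2["test"]
```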
4. Hardware
Local GPU: RTX 4090
| Resource | Available |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| VRAM | 24 GB GDDR6X |
| System RAM | 128 GB |
| CPU cores | 12 |
| CUDA driver | 555.42.02 |
VRAM Budget (Full Fine-Tune, bf16)
| Component | Estimate |
|---|---|
| Model weights (bf16) | ~1.0 GB |
| Gradients (bf16) | ~1.0 GB |
| Optimizer states (AdamW, bf16, 2 moments) | ~2.0 GB |
| Activations (batch 32, seq 128) | ~2-4 GB |
| CUDA overhead + buffers | ~1-2 GB |
| Total | ~7-10 GB |
A 0.5B model full fine-tune fits comfortably in 24 GB VRAM. No need for LoRA, QLoRA, or gradient checkpointing. Full fine-tune is the simplest and most effective approach for this model size.
Note on Concurrent GPU Use
The local 4090 currently serves GLM4-32B for inference at 192.168.1.100:8853. Training and LLM serving cannot run simultaneously: the served 32B model already occupies nearly all of the 24 GB. Shut down the vLLM/inference server before starting training. This means LLM-as-judge evaluation must happen either before or after training, not during.
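A quick way to confirm the GPU is actually free before launching training (a sketch assuming torch with CUDA available):

```python
import torch

# If the inference server is still up, "free" will be far below 24 GiB.
free, total = torch.cuda.mem_get_info(0)
name = torch.cuda.get_device_properties(0).name
print(f"{name}: {free / 2**30:.1f} / {total / 2**30:.1f} GiB free")
```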
5. Training Approach
Method: Full Fine-Tune (SFT)
No LoRA/QLoRA. At 0.5B parameters with 24 GB VRAM, full fine-tune is straightforward and produces the best results. LoRA's parameter-efficiency advantage is irrelevant when the full model fits in memory with room to spare.
Framework: HuggingFace Transformers + TRL
transformers >= 4.45.0
trl >= 0.12.0
datasets >= 3.0.0
torch >= 2.4.0
accelerate >= 1.0.0
peft # not needed for full fine-tune, but useful if experimenting with LoRA later
Hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-5 | Standard for SFT on instruct models |
| LR scheduler | Cosine with warmup | |
| Warmup ratio | 0.05 | ~150 steps (5% of ~3,045 total) |
| Epochs | 3 | Small dataset; 3-5 epochs before overfitting |
| Per-device batch size | 32 | Fits easily; increase if VRAM allows |
| Gradient accumulation | 1 | Effective batch = 32 |
| Max sequence length | 128 | Inputs ~15 tokens + outputs ~20 tokens; 128 is generous |
| Weight decay | 0.01 | |
| Precision | bf16 | 4090 supports bf16 natively |
| Optimizer | AdamW (torch fused) | |
| Eval strategy | steps (every 100) | |
| Save strategy | steps (every 500) | |
| Logging | TensorBoard or W&B | |
Estimated Training Time
| Metric | Value |
|---|---|
| Training examples | ~32,471 |
| Steps per epoch (batch 32) | ~1,015 |
| Total steps (3 epochs) | ~3,045 |
| Throughput estimate (4090, 0.5B, bf16) | ~80-120 steps/min |
| Estimated wall time | 25-40 minutes |
This is a very fast training job. Even 5 epochs would finish in under an hour.
Training Script Skeleton
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
dataset = load_dataset("json", data_files="corpus/training_pairs.jsonl")
def format_chat(example):
return {
"messages": [
{"role": "user", "content": example["input"]},
{"role": "assistant", "content": example["output"]},
]
}
dataset = dataset.map(format_chat, remove_columns=dataset["train"].column_names)
# Plain 90/10 split, then halve the holdout -> 90/5/5 train/val/test.
# (The stratified variant is sketched in Section 3.)
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
training_args = SFTConfig(
output_dir="./folksy-model",
num_train_epochs=3,
per_device_train_batch_size=32,
learning_rate=2e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.05,
weight_decay=0.01,
bf16=True,
max_seq_length=128,
eval_strategy="steps",
eval_steps=100,
save_strategy="steps",
save_steps=500,
logging_steps=10,
report_to="tensorboard",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=split["train"],
eval_dataset=split["test"],
processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./folksy-model/final")
tokenizer.save_pretrained("./folksy-model/final")
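After training, a quick smoke test loads the saved artifacts and generates one saying. The prompt and sampling settings below are illustrative, not tuned:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the artifacts written by the script above.
tok = AutoTokenizer.from_pretrained("./folksy-model/final")
model = AutoModelForCausalLM.from_pretrained(
    "./folksy-model/final", torch_dtype=torch.bfloat16, device_map="cuda"
)

# One word-seeded prompt; temperature/top_p are illustrative defaults.
messages = [{"role": "user", "content": "Tell me something about patience."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=48, do_sample=True, temperature=0.8, top_p=0.95)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```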
6. Evaluation
Automated Metrics
- Validation loss / perplexity — tracked during training via eval steps. Watch for overfitting (val loss increasing while train loss decreases).
- BLEU/ROUGE on test set — sanity check, but not the primary metric for creative generation.
- Template coverage — generate 1,000 sayings with varied prompts, verify all 7 template families appear in output.
- Lexical diversity — distinct-1, distinct-2 (unique unigrams/bigrams in generated output). Low diversity = mode collapse.
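Distinct-n needs no library; a minimal sketch over a list of generated sayings:

```python
def distinct_n(texts: list[str], n: int) -> float:
    """Unique n-grams divided by total n-grams across all generated texts."""
    ngrams = []
    for t in texts:
        tokens = t.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# Usage: distinct_n(sayings, 1) and distinct_n(sayings, 2).
# Values near 1.0 indicate healthy diversity; values near 0 suggest mode collapse.
```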
LLM-as-Judge Evaluation (Self-Hosted)
Use GLM4-32B (already available at 192.168.1.100:8853) as an automated evaluator. No external API needed.
Procedure:
- Stop the training job (free the GPU)
- Restart the GLM4-32B inference server
- Generate 200 sayings from the fine-tuned model across all prompt types
- Send each to GLM4-32B with a judge prompt
Judge prompt:
Rate this folk saying on a 1-5 scale:
- 5: Sounds like a real proverb — natural, witty, memorable
- 4: Good folksy saying — natural language, clear meaning
- 3: Acceptable — grammatically correct but flat or formulaic
- 2: Awkward — grammatical issues or forced phrasing
- 1: Broken — nonsensical, incomplete, or garbled
Saying: "{generated_saying}"
Respond with only the number and a one-sentence justification.
Target: Mean score >= 3.5, with <10% scoring 1 or 2.
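A minimal judge-call sketch, assuming the server exposes an OpenAI-compatible /v1 endpoint (as vLLM does); the model name "glm4-32b" is a placeholder for whatever id the local server registers:

```python
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.100:8853/v1", api_key="unused")

JUDGE_TEMPLATE = """Rate this folk saying on a 1-5 scale:
- 5: Sounds like a real proverb — natural, witty, memorable
- 4: Good folksy saying — natural language, clear meaning
- 3: Acceptable — grammatically correct but flat or formulaic
- 2: Awkward — grammatical issues or forced phrasing
- 1: Broken — nonsensical, incomplete, or garbled
Saying: "{saying}"
Respond with only the number and a one-sentence justification."""

def judge(saying: str) -> str:
    resp = client.chat.completions.create(
        model="glm4-32b",  # placeholder; use the id the local server actually serves
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(saying=saying)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()
```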
Human Spot-Check
Sample 50 generated sayings, rate as Good/Okay/Bad (same criteria as EVALUATION.md). Target: >60% Good, <10% Bad.
A/B Comparison
Generate 50 sayings from the fine-tuned model and 50 from the raw template engine. Present pairs to GLM4-32B (or manually) and ask which sounds more natural. The fine-tuned model should win >80% of comparisons.
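The forced-choice comparison can reuse the client from the judge sketch above; the model id is again a placeholder:

```python
AB_TEMPLATE = """Which saying sounds more like a natural folk proverb?
A: "{a}"
B: "{b}"
Respond with only the letter A or B."""

def ab_judge(a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="glm4-32b",  # same placeholder as above
        messages=[{"role": "user", "content": AB_TEMPLATE.format(a=a, b=b)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()
```

Randomize which model's output is labeled A on each pair to avoid position bias in the judge.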
7. Output Artifacts
| Artifact | Path | Description |
|---|---|---|
| Model weights | folksy-model/final/ | Full bf16 model (~1 GB) |
| Tokenizer | folksy-model/final/ | Qwen2.5 tokenizer config + vocab |
| Training logs | folksy-model/runs/ | TensorBoard event files |
| Checkpoints | folksy-model/checkpoint-*/ | Intermediate saves every 500 steps |
| Eval results | folksy-model/eval_results.json | Automated metrics on test set |
| Judge results | folksy-model/judge_results.jsonl | GLM4-32B evaluation scores |
Model Distribution (Optional)
The final model can be:
- Quantized to GGUF (via llama.cpp) for CPU inference — a 0.5B model runs on any machine
- Pushed to HuggingFace Hub if sharing publicly
- Served locally via vLLM, llama.cpp, or Ollama for integration testing
8. RunPod Feasibility
Is RunPod needed?
No. The local RTX 4090 is more than sufficient. A 0.5B full fine-tune on 36K examples will finish in under an hour. RunPod would be useful only if:
- The local GPU is occupied with inference work that can't be interrupted
- You want to run multiple training experiments in parallel (hyperparameter sweeps)
- You scale up to a larger base model (e.g., 3B+)
If RunPod is Used Anyway
| Instance | GPU | VRAM | Hourly Cost (approx) | Notes |
|---|---|---|---|---|
| RTX 4090 | 1x 4090 | 24 GB | ~$0.40/hr | Identical to local hardware |
| A40 | 1x A40 | 48 GB | ~$0.50/hr | More VRAM headroom; good if experimenting with larger batch sizes |
| RTX A6000 | 1x A6000 | 48 GB | ~$0.60/hr | Same tier as A40 |
At ~30-40 minutes of training time, the total cost would be under $0.50 for a single run. Use the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 template.
9. Self-Hosted LLM Jobs (Replacing External API Dependencies)
All tasks that might otherwise require a paid API key are reformulated as self-hosted jobs using the existing local infrastructure.
| Task | External API Alternative | Self-Hosted Solution |
|---|---|---|
| LLM-as-Judge evaluation | GPT-4 / Claude API | GLM4-32B (32B, local 4090) — already running at 192.168.1.100:8853 |
| Data augmentation (if more training pairs needed) | GPT-4 for paraphrase generation | GLM4-32B — same endpoint, same prompts used for corpus polishing |
| Synthetic evaluation prompts | API-generated diverse test prompts | GLM4-32B — generate varied evaluation prompts locally |
| Model comparison judging | Claude API for A/B preference judging | GLM4-32B — structured judge prompt with forced-choice output |
| Embedding-based dedup (if scaling corpus) | OpenAI embeddings API | Sentence-transformers (e.g., all-MiniLM-L6-v2, runs on CPU, 80MB) |
| Classification of failure modes | API-based analysis | GLM4-32B — classify generated sayings by quality/failure type |
Key constraint: GLM4-32B and the training job cannot share the 4090 simultaneously. Sequence the workflow as:
1. [GPU: GLM4-32B] Pre-training evaluation — generate baseline judge scores
2. [GPU: idle] Shut down inference server
3. [GPU: training] Fine-tune Qwen2.5-0.5B (~30-40 min)
4. [GPU: idle] Training complete, model saved
5. [GPU: GLM4-32B] Restart inference server
6. [GPU: GLM4-32B] Post-training evaluation — judge fine-tuned model output
If a larger judge model is desired (e.g., a 70B model for higher-quality evaluation), the candidates and their feasibility on a single 4090:
- Qwen2.5-72B-Instruct-AWQ (4-bit, ~40GB) — does not fit in 24 GB single-GPU
- Qwen2.5-32B-Instruct — similar quality tier to GLM4-32B, interchangeable
- Llama-3.1-70B-Instruct (GGUF Q4_K_M) — ~40GB, does not fit single 4090
- Conclusion: GLM4-32B is the practical ceiling for single-4090 evaluation. For 70B+ judge models, use RunPod with 2x A6000 or 1x A100 80GB (~$1-2/hr, would need <1 hour for 200 evaluations).
10. Blockers and Dependencies
No Blockers
The training pipeline has no external dependencies that aren't already met.
| Dependency | Status |
|---|---|
| Training corpus (training_pairs.jsonl) | Complete — 36,079 pairs |
| GPU hardware (RTX 4090) | Available |
| Base model (Qwen2.5-0.5B-Instruct) | Public on HuggingFace, Apache 2.0 |
| Python packages (transformers, trl, torch) | Install needed — pip install transformers trl datasets accelerate torch |
| Evaluation LLM (GLM4-32B) | Running locally |
| CUDA toolkit | Installed (driver 555.42.02) |
Pre-Training Checklist
- Install training dependencies: pip install transformers trl datasets accelerate tensorboard
- Download base model: huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
- Write training script (skeleton provided in Section 5)
- Write data preparation script (JSONL → chat-template format with train/val/test split)
- Shut down GLM4-32B inference server to free GPU memory
- Run training (~30-40 min)
- Restart GLM4-32B for evaluation
- Run LLM-as-judge evaluation
- Human spot-check of 50 generated sayings
Risks
| Risk | Likelihood | Mitigation |
|---|---|---|
| Overfitting on 36K examples | Medium | Monitor val loss; use early stopping (sketch after this table); cap at 2-3 epochs rather than pushing to 5 |
| Mode collapse (model produces same few sayings) | Low-Medium | Check distinct-n metrics; if occurring, reduce learning rate or add dropout |
| Template imbalance causes weak coverage | Low | 3 templates are at 8-9% (below 10% threshold); generate more pairs for these families if needed |
| GPU memory conflict with inference server | None if sequenced | Follow the GPU sequencing workflow in Section 9 |
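For the early-stopping mitigation, transformers ships a ready-made callback; a sketch extending the Section 5 skeleton:

```python
from transformers import EarlyStoppingCallback

# In the SFTConfig from Section 5, additionally set:
#   load_best_model_at_end=True,
#   metric_for_best_model="eval_loss",
#   greater_is_better=False,
# then stop after 3 evaluations (300 steps) without improvement:
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
```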
11. Stretch Goals (Post-Initial Training)
These are not blockers but could improve results in subsequent iterations:
- Fictional entity fine-tuning pairs — The CORPUS_GENERATION_SPEC describes ~200-300 fictional entity training pairs. These were not included in the current 36K corpus. Adding them would teach the model to generalize to novel concepts provided at inference time.
- DPO/ORPO alignment — After SFT, use GLM4-32B to generate preference pairs (chosen vs rejected sayings) and run a DPO pass. This could sharpen quality without needing more training data.
- Hyperparameter sweep — Run 3-5 configs varying learning rate (1e-5 to 5e-5) and epochs (2-5). With 30-minute training runs, a full sweep takes ~2.5 hours.
- Larger base model experiment — Try Qwen2.5-1.5B-Instruct (1.5B params) as a comparison. Still fits on the 4090 for full fine-tune. Compare quality vs the 0.5B model to see if the extra parameters matter for this narrow task.
- GGUF export for deployment — Convert the final model to GGUF Q8_0 format for CPU-only deployment via llama.cpp or Ollama. A 0.5B Q8 model is ~500MB and runs at interactive speed on any modern CPU.