# GPU Training Requirements: Folksy Proverb Model

**Date:** 2026-03-13
**Status:** Planning
**Prerequisite:** Corpus generation complete (9,025 sayings, 36,079 training pairs)

---

## 1. Objective

Fine-tune a 0.5B parameter language model to generate folksy proverbs on demand. The model should respond to varied prompt styles (word-seeded, persona-seeded, template-seeded, open-ended) with natural-sounding fake folk wisdom in the style of the generated corpus.

---

## 2. Base Model Selection

### Recommendation: **Qwen2.5-0.5B-Instruct**

| Criterion | Qwen2.5-0.5B-Instruct |
|-----------|----------------------|
| Parameters | 494M |
| Architecture | Transformer decoder, GQA, RoPE |
| Context window | 32,768 tokens |
| License | Apache 2.0 |
| Source | `Qwen/Qwen2.5-0.5B-Instruct` on HuggingFace |
| Tokenizer | Byte-level BPE (151,646 vocab) |

### Why Qwen2.5-0.5B-Instruct

- **Exact size target.** The corpus generation spec calls for a 0.5B model; this is precisely that.
- **Already instruction-tuned.** The training pairs use an input/output (instruction/response) format. Starting from an instruct-tuned base means the model already understands the turn structure, so we're teaching it *what* to say, not *how* to follow instructions.
- **Modern architecture.** Grouped-query attention and RoPE positional encoding. Trains efficiently and runs inference quickly.
- **Apache 2.0.** A permissive license that allows both commercial and non-commercial deployment.
- **Strong small-model baseline.** Qwen2.5-0.5B benchmarks well against peers (SmolLM2-360M, TinyLlama-1.1B, GPT-2 Medium). It punches above its weight on language tasks.
### Alternatives Considered

| Model | Size | Why Not |
|-------|------|---------|
| SmolLM2-360M | 360M | Slightly undersized; weaker language generation quality |
| TinyLlama-1.1B | 1.1B | 2x the target size; more VRAM, slower inference, marginal quality gain for this task |
| GPT-2 Medium | 355M | Outdated architecture (absolute positional encoding, no GQA); poor instruction-following baseline |
| Phi-3-mini | 3.8B | 7.6x over budget; overkill for a narrow-domain generative task |

---

## 3. Training Data

### Corpus Summary

| Metric | Value |
|--------|-------|
| Training pairs | 36,079 |
| Unique sayings | 9,025 |
| File | `corpus/training_pairs.jsonl` |
| Size on disk | 7.5 MB |
| Average output length | 10.1 words (~15-20 tokens) |
| Average input length | ~6-10 words (~8-15 tokens) |
| Vocab coverage | 624/624 (100%) |

### Format

Each line is a JSON object:

```json
{"input": "Tell me something about color.", "output": "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest.", "meta_template": "deconstruction", "source_words": ["hamburger", "beef", "color", "tomato"]}
```

### Distribution by Template Family

| Template | Sayings | % |
|----------|---------|---|
| false_equivalence | 1,897 | 21.0% |
| futile_preparation | 1,735 | 19.2% |
| ironic_deficiency | 1,563 | 17.3% |
| deconstruction | 1,544 | 17.1% |
| hypocritical_complaint | 811 | 9.0% |
| denial_of_consequences | 750 | 8.3% |
| tautological_wisdom | 725 | 8.0% |

Three families are below the 10% balance threshold (denial_of_consequences, hypocritical_complaint, tautological_wisdom). This is a known issue from corpus generation. The training should still work: the model will slightly underperform on these templates, but the imbalance is not severe.
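The balance check above is easy to reproduce. The following is a minimal sketch that recomputes each family's share from the counts in the table and flags anything under the 10% threshold; the function name `families_below_threshold` is illustrative and not part of the corpus tooling.

```python
# Saying counts per template family, copied from the table above.
TEMPLATE_COUNTS = {
    "false_equivalence": 1897,
    "futile_preparation": 1735,
    "ironic_deficiency": 1563,
    "deconstruction": 1544,
    "hypocritical_complaint": 811,
    "denial_of_consequences": 750,
    "tautological_wisdom": 725,
}


def families_below_threshold(counts, threshold=0.10):
    """Return {family: share} for families whose share of the total is below threshold."""
    total = sum(counts.values())
    return {
        name: count / total
        for name, count in counts.items()
        if count / total < threshold
    }


if __name__ == "__main__":
    # Print the under-represented families with their share of all sayings.
    for name, share in sorted(families_below_threshold(TEMPLATE_COUNTS).items()):
        print(f"{name}: {share:.1%}")
```

The same function can be rerun on the counts of each split after stratification to confirm the proportions carried over.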
### Distribution by Input Type

| Input Type | Pairs |
|------------|-------|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |

### Data Preparation for Training

The JSONL needs to be converted to chat-template format for the Qwen2.5 tokenizer:

```python
# `entry` is one parsed line of corpus/training_pairs.jsonl.
# Each training pair becomes:
messages = [
    {"role": "user", "content": entry["input"]},
    {"role": "assistant", "content": entry["output"]}
]
# Tokenized using Qwen2.5's chat template via tokenizer.apply_chat_template()
```

Split: **90/5/5** (train/validation/test) → ~32,471 train / ~1,804 val / ~1,804 test. Stratify by `meta_template` to preserve template distribution in each split.

---

## 4. Hardware

### Local GPU: RTX 4090

| Resource | Available |
|----------|-----------|
| GPU | NVIDIA GeForce RTX 4090 |
| VRAM | 24 GB GDDR6X |
| System RAM | 128 GB |
| CPU cores | 12 |
| CUDA driver | 555.42.02 |

### VRAM Budget (Full Fine-Tune, bf16)

| Component | Estimate |
|-----------|----------|
| Model weights (bf16) | ~1.0 GB |
| Gradients (bf16) | ~1.0 GB |
| Optimizer states (AdamW, two moments in bf16) | ~2.0 GB (~4.0 GB if kept in fp32) |
| Activations (batch 32, seq 128) | ~2-4 GB |
| CUDA overhead + buffers | ~1-2 GB |
| **Total** | **~7-10 GB** |

A 0.5B model full fine-tune fits comfortably in 24 GB VRAM. No need for LoRA, QLoRA, or gradient checkpointing. Full fine-tune is the simplest and most effective approach for this model size.

### Note on Concurrent GPU Use

The local 4090 currently serves GLM4-32B for inference at `192.168.1.100:8853`. **Training and LLM serving cannot run simultaneously**: a 32B inference server and the training job will not both fit in 24 GB. Shut down the vLLM/inference server before starting training. This means LLM-as-judge evaluation must happen either before or after training, not during.

---

## 5. Training Approach

### Method: Full Fine-Tune (SFT)

No LoRA/QLoRA. At 0.5B parameters with 24 GB VRAM, full fine-tune is straightforward and produces the best results.
LoRA's parameter-efficiency advantage is irrelevant when the full model fits in memory with room to spare.

### Framework: HuggingFace Transformers + TRL

```
transformers >= 4.46.0
trl >= 0.12.0
datasets >= 3.0.0
torch >= 2.4.0
accelerate >= 1.0.0
peft  # not needed for full fine-tune, but useful if experimenting with LoRA later
```

### Hyperparameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Learning rate | 2e-5 | Standard for SFT on instruct models |
| LR scheduler | Cosine with warmup | |
| Warmup ratio | 0.05 | ~150 steps (5% of ~3,045) |
| Epochs | 3 | Small dataset; expect overfitting somewhere past 3-5 epochs |
| Per-device batch size | 32 | Fits easily; increase if VRAM allows |
| Gradient accumulation | 1 | Effective batch = 32 |
| Max sequence length | 128 | Inputs ~15 tokens + outputs ~20 tokens; 128 is generous |
| Weight decay | 0.01 | |
| Precision | bf16 | 4090 supports bf16 natively |
| Optimizer | AdamW (torch fused) | |
| Eval strategy | steps (every 100) | |
| Save strategy | steps (every 500) | |
| Logging | TensorBoard or W&B | |

### Estimated Training Time

| Metric | Value |
|--------|-------|
| Training examples | ~32,471 |
| Steps per epoch (batch 32) | ~1,015 |
| Total steps (3 epochs) | ~3,045 |
| Throughput estimate (4090, 0.5B, bf16) | ~80-120 steps/min |
| **Estimated wall time** | **25-40 minutes** |

This is a very fast training job. At the same throughput, 5 epochs (~5,075 steps) would finish in roughly 40-65 minutes.
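The step counts and wall-time range above follow from simple arithmetic on the planning numbers. A rough sketch, assuming the 80-120 steps/min throughput estimate holds (it is an estimate from this document, not a measurement); the function `training_schedule` and its parameter names are illustrative:

```python
import math


def training_schedule(n_train, batch_size, epochs, warmup_ratio, steps_per_min):
    """Derive step counts and a wall-time range from the planning numbers.

    steps_per_min is a (low, high) throughput estimate, not a measurement.
    """
    # The last partial batch still counts as an optimizer step.
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    slow, fast = steps_per_min
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
        "minutes_best_case": total_steps / fast,
        "minutes_worst_case": total_steps / slow,
    }


# Planned run: ~32,471 training examples, batch 32, 3 epochs, 5% warmup.
plan = training_schedule(32_471, 32, 3, 0.05, (80, 120))

if __name__ == "__main__":
    for key, value in plan.items():
        print(f"{key}: {value:.1f}" if isinstance(value, float) else f"{key}: {value}")
```

Changing `epochs` or `batch_size` in the call gives a quick estimate for alternative configurations before committing GPU time.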
### Training Script Skeleton

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")

# load_dataset exposes a single JSONL file under the "train" key.
dataset = load_dataset("json", data_files="corpus/training_pairs.jsonl")["train"]

# Stratified splitting requires a ClassLabel column.
dataset = dataset.class_encode_column("meta_template")

# 90/5/5 split, stratified by template family (see Section 3).
holdout = dataset.train_test_split(
    test_size=0.1, seed=42, stratify_by_column="meta_template"
)
val_test = holdout["test"].train_test_split(
    test_size=0.5, seed=42, stratify_by_column="meta_template"
)
train_raw, val_raw, test_raw = holdout["train"], val_test["train"], val_test["test"]


def format_chat(example):
    # Convert one input/output pair into the chat "messages" format.
    return {
        "messages": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }


# Keep only "messages"; SFTTrainer applies the tokenizer's chat template to it.
train_ds = train_raw.map(format_chat, remove_columns=train_raw.column_names)
val_ds = val_raw.map(format_chat, remove_columns=val_raw.column_names)
# test_raw stays untouched for the Section 6 metrics.

training_args = SFTConfig(
    output_dir="./folksy-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    optim="adamw_torch_fused",
    bf16=True,
    max_seq_length=128,  # newer TRL releases rename this to max_length; check your version
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=500,
    logging_steps=10,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./folksy-model/final")
tokenizer.save_pretrained("./folksy-model/final")
```

---

## 6. Evaluation

### Automated Metrics

1. **Validation loss / perplexity**: tracked during training via eval steps. Watch for overfitting (val loss increasing while train loss decreases).
2. **BLEU/ROUGE on test set**: a sanity check, but not the primary metric for creative generation.
3. **Template coverage**: generate 1,000 sayings with varied prompts and verify that all 7 template families appear in the output.
4. **Lexical diversity**: distinct-1 and distinct-2 (unique unigrams/bigrams in generated output). Low diversity indicates mode collapse.

### LLM-as-Judge Evaluation (Self-Hosted)

Use **GLM4-32B** (already available at `192.168.1.100:8853`) as an automated evaluator. No external API needed.

**Procedure:**

1. Stop the training job (free the GPU)
2. Restart the GLM4-32B inference server
3. Generate 200 sayings from the fine-tuned model across all prompt types
4. Send each to GLM4-32B with a judge prompt

**Judge prompt:**

```
Rate this folk saying on a 1-5 scale:
- 5: Sounds like a real proverb (natural, witty, memorable)
- 4: Good folksy saying (natural language, clear meaning)
- 3: Acceptable (grammatically correct but flat or formulaic)
- 2: Awkward (grammatical issues or forced phrasing)
- 1: Broken (nonsensical, incomplete, or garbled)

Saying: "{generated_saying}"

Respond with only the number and a one-sentence justification.
```

**Target:** Mean score >= 3.5, with <10% scoring 1 or 2.

### Human Spot-Check

Sample 50 generated sayings, rate as Good/Okay/Bad (same criteria as EVALUATION.md). Target: >60% Good, <10% Bad.

### A/B Comparison

Generate 50 sayings from the fine-tuned model and 50 from the raw template engine. Present pairs to GLM4-32B (or manually) and ask which sounds more natural. The fine-tuned model should win >80% of comparisons.

---

## 7. Output Artifacts

| Artifact | Path | Description |
|----------|------|-------------|
| Model weights | `folksy-model/final/` | Full bf16 model (~1 GB) |
| Tokenizer | `folksy-model/final/` | Qwen2.5 tokenizer config + vocab |
| Training logs | `folksy-model/runs/` | TensorBoard event files |
| Checkpoints | `folksy-model/checkpoint-*/` | Intermediate saves every 500 steps |
| Eval results | `folksy-model/eval_results.json` | Automated metrics on test set |
| Judge results | `folksy-model/judge_results.jsonl` | GLM4-32B evaluation scores |

### Model Distribution (Optional)

The final model can be:

- **Quantized to GGUF** (via `llama.cpp`) for CPU inference; a 0.5B model runs on any machine
- **Pushed to HuggingFace Hub** if sharing publicly
- **Served locally** via vLLM, llama.cpp, or Ollama for integration testing

---

## 8. RunPod Feasibility

### Is RunPod needed?

**No.** The local RTX 4090 is more than sufficient.
A 0.5B full fine-tune on 36K examples will finish in under an hour. RunPod would be useful only if:

- The local GPU is occupied with inference work that can't be interrupted
- You want to run multiple training experiments in parallel (hyperparameter sweeps)
- You scale up to a larger base model (e.g., 3B+)

### If RunPod is Used Anyway

| Instance | GPU | VRAM | Hourly Cost (approx) | Notes |
|----------|-----|------|----------------------|-------|
| RTX 4090 | 1x 4090 | 24 GB | ~$0.40/hr | Identical to local hardware |
| A40 | 1x A40 | 48 GB | ~$0.50/hr | More VRAM headroom; good if experimenting with larger batch sizes |
| RTX A6000 | 1x A6000 | 48 GB | ~$0.60/hr | Same tier as A40 |

At ~30-40 minutes of training time, the total cost would be under $0.50 for a single run. Use the `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04` template.

---

## 9. Self-Hosted LLM Jobs (Replacing External API Dependencies)

All tasks that might otherwise require a paid API key are reformulated as self-hosted jobs using the existing local infrastructure.

| Task | External API Alternative | Self-Hosted Solution |
|------|--------------------------|----------------------|
| **LLM-as-Judge evaluation** | GPT-4 / Claude API | **GLM4-32B** (32B, local 4090), already running at `192.168.1.100:8853` |
| **Data augmentation** (if more training pairs needed) | GPT-4 for paraphrase generation | **GLM4-32B**: same endpoint, same prompts used for corpus polishing |
| **Synthetic evaluation prompts** | API-generated diverse test prompts | **GLM4-32B**: generate varied evaluation prompts locally |
| **Model comparison judging** | Claude API for A/B preference judging | **GLM4-32B**: structured judge prompt with forced-choice output |
| **Embedding-based dedup** (if scaling corpus) | OpenAI embeddings API | **Sentence-transformers** (e.g., `all-MiniLM-L6-v2`, runs on CPU, 80MB) |
| **Classification of failure modes** | API-based analysis | **GLM4-32B**: classify generated sayings by quality/failure type |

**Key constraint:** GLM4-32B and the training job cannot share the 4090 simultaneously. Sequence the workflow as:

```
1. [GPU: GLM4-32B]  Pre-training evaluation: generate baseline judge scores
2. [GPU: idle]      Shut down inference server
3. [GPU: training]  Fine-tune Qwen2.5-0.5B (~30-40 min)
4. [GPU: idle]      Training complete, model saved
5. [GPU: GLM4-32B]  Restart inference server
6. [GPU: GLM4-32B]  Post-training evaluation: judge fine-tuned model output
```

If a larger judge model is desired (e.g., 70B for higher-quality evaluation), options for the 4090:

- **Qwen2.5-72B-Instruct-AWQ** (4-bit, ~40GB): does not fit in 24 GB single-GPU
- **Qwen2.5-32B-Instruct**: similar quality tier to GLM4-32B, interchangeable
- **Llama-3.1-70B-Instruct** (GGUF Q4_K_M): ~40GB, does not fit single 4090
- **Conclusion:** GLM4-32B is the practical ceiling for single-4090 evaluation. For 70B+ judge models, use RunPod with 2x A6000 or 1x A100 80GB (~$1-2/hr, would need <1 hour for 200 evaluations).

---

## 10. Blockers and Dependencies

### No Blockers

The training pipeline has no external dependencies that aren't already met.

| Dependency | Status |
|------------|--------|
| Training corpus (`training_pairs.jsonl`) | Complete (36,079 pairs) |
| GPU hardware (RTX 4090) | Available |
| Base model (Qwen2.5-0.5B-Instruct) | Public on HuggingFace, Apache 2.0 |
| Python packages (transformers, trl, torch) | Install needed: `pip install torch transformers trl datasets accelerate tensorboard` |
| Evaluation LLM (GLM4-32B) | Running locally |
| CUDA driver | Installed (555.42.02) |

### Pre-Training Checklist

- [ ] Install training dependencies: `pip install torch transformers trl datasets accelerate tensorboard`
- [ ] Download base model: `huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct`
- [ ] Write training script (skeleton provided in Section 5)
- [ ] Write data preparation script (JSONL → chat-template format with train/val/test split)
- [ ] Shut down GLM4-32B inference server to free GPU memory
- [ ] Run training (~30-40 min)
- [ ] Restart GLM4-32B for evaluation
- [ ] Run LLM-as-judge evaluation
- [ ] Human spot-check of 50 generated sayings

### Risks

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Overfitting on 36K examples | Medium | Monitor val loss; use early stopping; stay at 2-3 epochs rather than going up to 5 |
| Mode collapse (model produces same few sayings) | Low-Medium | Check distinct-n metrics; if occurring, reduce learning rate or add dropout |
| Template imbalance causes weak coverage | Low | 3 templates are at 8-9% (below 10% threshold); generate more pairs for these families if needed |
| GPU memory conflict with inference server | None if sequenced | Follow the GPU sequencing workflow in Section 9 |

---

## 11. Stretch Goals (Post-Initial Training)

These are not blockers but could improve results in subsequent iterations:

1. **Fictional entity fine-tuning pairs.** The CORPUS_GENERATION_SPEC describes ~200-300 fictional entity training pairs.
These were not included in the current 36K corpus. Adding them would teach the model to generalize to novel concepts provided at inference time.
2. **DPO/ORPO alignment.** After SFT, use GLM4-32B to generate preference pairs (chosen vs rejected sayings) and run a DPO pass. This could sharpen quality without needing more training data.
3. **Hyperparameter sweep.** Run 3-5 configs varying learning rate (1e-5 to 5e-5) and epochs (2-5). With 30-minute training runs, a full sweep takes ~2.5 hours.
4. **Larger base model experiment.** Try Qwen2.5-1.5B-Instruct (1.5B params) as a comparison. Still fits on the 4090 for full fine-tune. Compare quality vs the 0.5B model to see if the extra parameters matter for this narrow task.
5. **GGUF export for deployment.** Convert the final model to GGUF Q8_0 format for CPU-only deployment via llama.cpp or Ollama. A 0.5B Q8 model is ~500MB and runs at interactive speed on any modern CPU.
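For the hyperparameter sweep in stretch goal 3, one way to stay within 3-5 configs is a one-factor-at-a-time design: keep the Section 5 baseline and change a single hyperparameter per run. The sketch below is only an illustration of that idea. The helper names and the choice of range endpoints as the varied values are assumptions, and the 10 minutes per epoch figure is extrapolated from the ~30-minute, 3-epoch estimate rather than measured.

```python
# Baseline from the Section 5 hyperparameter table.
BASELINE = {"learning_rate": 2e-5, "num_train_epochs": 3}

# Assumption: vary each hyperparameter to the endpoints of the stated ranges.
VARIATIONS = {
    "learning_rate": [1e-5, 5e-5],
    "num_train_epochs": [2, 5],
}

# Assumption: ~30 minutes for 3 epochs, so roughly 10 minutes per epoch.
MINUTES_PER_EPOCH = 10


def one_at_a_time_sweep(baseline, variations):
    """Return the baseline plus configs that change exactly one hyperparameter."""
    configs = [dict(baseline)]
    for name, values in variations.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs


def estimated_minutes(configs, minutes_per_epoch=MINUTES_PER_EPOCH):
    """Estimate total wall time when the runs execute back to back on one GPU."""
    return sum(cfg["num_train_epochs"] * minutes_per_epoch for cfg in configs)


configs = one_at_a_time_sweep(BASELINE, VARIATIONS)

if __name__ == "__main__":
    for cfg in configs:
        print(cfg)
    print(f"Estimated sweep time: {estimated_minutes(configs) / 60:.1f} hours")
```

Each resulting dict can be unpacked into `SFTConfig(...)` alongside the fixed arguments, with a distinct `output_dir` per run so checkpoints do not overwrite each other.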