Add SFT training script and run Qwen3-0.6B-Base fine-tune
Train Qwen3-0.6B-Base (596M params) on 36K folksy proverb pairs using full SFT with HuggingFace TRL. 3 epochs, 11 min on RTX 4090.

Results: train_loss=0.954, eval_loss=1.032, test_loss=1.031
Model checkpoint at folksy-model/final/ (not committed, 1.2 GB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 9298c425bc
commit 02daa7bb97
4 changed files with 919 additions and 0 deletions
.gitignore (vendored, 2 additions)
@@ -1 +1,3 @@
*__pycache__
.venv/
folksy-model/
CORPUS_QUALITY_REVIEW.md (new file, 208 lines)
@@ -0,0 +1,208 @@

# Corpus Quality Review

Review date: 2026-03-27. All data sampled directly from corpus files on disk.

---

## 1. Corpus Stats

| File | Entries | Size | Format |
|------|---------|------|--------|
| `corpus_raw.jsonl` | 9,835 | 4.5 MB | JSONL — raw template output with debug metadata |
| `corpus_polished.jsonl` | 9,835 | 5.2 MB | JSONL — all entries after GLM4-32B polish (includes discards) |
| `corpus_naturalized.jsonl` | 19,540 | 13 MB | JSONL — naturalization pass (polished + recovered discards, 2 variants each) |
| `corpus_filtered.jsonl` | 9,025 | 6.0 MB | JSONL — deduplicated final sayings |
| `training_pairs.jsonl` | 36,079 | 7.2 MB | JSONL — `{input, output, meta_template, source_words}` |

**Token estimates** (words × 1.3 subword factor):

- Sayings only: 91,428 words → ~119K tokens
- Training pairs (input + output): 582,598 words → ~757K tokens

**Average saying length:** 10.1 words

**Vocab coverage:** 624/624 (100%)
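The headline numbers above are easy to re-derive. A minimal sketch, assuming the `{input, output, meta_template, source_words}` schema shown in the table and the same rough 1.3 words-to-tokens factor:

```python
import json
from pathlib import Path

# Recompute the corpus stats from the final training pairs file.
pairs = [json.loads(line) for line in Path("corpus/training_pairs.jsonl").open()]

sayings = {p["output"] for p in pairs}  # unique sayings (outputs repeat across framings)
saying_words = sum(len(s.split()) for s in sayings)
pair_words = sum(len(p["input"].split()) + len(p["output"].split()) for p in pairs)

print(f"training pairs:  {len(pairs):,}")
print(f"unique sayings:  {len(sayings):,}")
print(f"avg saying len:  {saying_words / len(sayings):.1f} words")
print(f"saying tokens:   ~{saying_words * 1.3 / 1000:.0f}K")
print(f"pair tokens:     ~{pair_words * 1.3 / 1000:.0f}K")
```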
### Distribution by meta-template

| Template | Count | % | Status |
|----------|-------|---|--------|
| false_equivalence | 1,897 | 21.0% | OK |
| futile_preparation | 1,735 | 19.2% | OK |
| ironic_deficiency | 1,563 | 17.3% | OK |
| deconstruction | 1,544 | 17.1% | OK |
| hypocritical_complaint | 811 | 9.0% | ⚠ Below 10% |
| denial_of_consequences | 750 | 8.3% | ⚠ Below 10% |
| tautological_wisdom | 725 | 8.0% | ⚠ Below 10% |

Three families are below the 10% balance threshold from the spec. The model will see 2.6× more `false_equivalence` examples than `tautological_wisdom`.

### Training pair framing types

| Framing | Count |
|---------|-------|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |
| **Total** | **36,079** |

No fictional entity pairs are present in the current corpus.

---
## 2. Random Samples

Pulled via `shuf` from `corpus_filtered.jsonl` (no fixed seed, drawn from different parts of the file):

1. **[ironic_deficiency]** The coffee-maker's always short on fabric.
2. **[ironic_deficiency]** The man who builds the nest hasn't got a single feather.
3. **[denial_of_consequences]** A feller who builds the shelf can't gripe about the tape.
4. **[false_equivalence]** An anchor's just an iron that got too big for its britches.
5. **[futile_preparation]** Fill a seed-bin with rubbish, won't get you a ship.
6. **[futile_preparation]** You can sweep all you want, but it won't get you measuring angles.
7. **[false_equivalence]** Water's just juice without the color, I reckon.
8. **[futile_preparation]** Putting the cart before the horse, hoping for the best.
9. **[futile_preparation]** Skipping breakfast's like praying for leftovers.
10. **[false_equivalence]** A van-sized thing'll make you a canoe.
11. **[false_equivalence]** A gazelle's just a lightweight with long legs.
12. **[futile_preparation]** Hoofing the claw and hoping to play baseball.
13. **[false_equivalence]** A bull's just a steer with folks who tolerate his ways.
14. **[futile_preparation]** Grandma always said, "Drinkin' the coffee won't keep you workin'."
15. **[hypocritical_complaint]** A fella picks pewter scrap and says copper ain't worth a thing.

---
## 3. Quality Spectrum

### Best (most natural, most folksy)

These sound like something you'd actually hear on a porch:

1. **[deconstruction]** Plastic ain't nothing without its fuel. Just carbon thinkin' it's better than the rest.
2. **[futile_preparation]** Your grandma was right — ain't no flute gonna bring a whole band.
3. **[false_equivalence]** An eagle's just a crow that sees its lunch from far off.

Runner-up gems from the random sample:

- *A bull's just a steer with folks who tolerate his ways.*
- *An anchor's just an iron that got too big for its britches.*

### Worst (most stilted, most obviously generated)

These read like broken Mad Libs:

1. **[ironic_deficiency]** Dolphin kin always do without the wave fin.
2. **[ironic_deficiency]** A thrush's kin goes without the fly.
3. **[false_equivalence]** A van-sized thing'll make you a canoe.

These share a pattern: the naturalization pass couldn't salvage the underlying nonsense relationship. "Dolphin kin" and "wave fin" are ConceptNet artifacts that survived the filter.

### Borderline

Grammatically fine but flat — could go either way:

1. **[deconstruction]** Without the beef, it's just plain old bread.
2. **[deconstruction]** A bouquet ain't much without the star flower.
3. **[tautological_wisdom]** An ostrich ain't no good without its grub.

These are competent filler. They won't teach the model bad habits, but they won't teach it flair either.

---
## 4. Dropped Noun Check

### Background

The original pipeline's quality filter required ≥2 slot-fill nouns present in the polished text (the `lost_key_nouns` check in `filter_corpus.py`). The later `rebuild_training_pairs.py` relaxed this requirement because the naturalization pass often rephrases concepts rather than repeating slot words verbatim.

### Current state

**Strict check (all slot words including property descriptions):** 2,197 of 9,025 entries (24.3%) have >50% of slot words missing. However, this is misleading — slot values like `"essential to life"`, `"less dense than water"`, and `"flicking sound"` are property descriptions, not nouns the saying should contain.

**Core noun check (A/B slots only):** 853 of 9,025 entries (9.5%) are missing at least one of the two primary concept nouns.
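A minimal sketch of that core-noun check. The field names (`saying`, `source_words`) and the convention that the first two source words are the A/B slot nouns are assumptions for illustration, not the exact schema of `corpus_filtered.jsonl`:

```python
import json
from pathlib import Path

def missing_core_nouns(entry: dict) -> list[str]:
    """Return the assumed A/B slot nouns that never appear verbatim in the saying."""
    text = entry["saying"].lower()           # assumed field name
    core = entry["source_words"][:2]         # assumed: first two entries are the A/B slots
    return [w for w in core if w.lower() not in text]

entries = [json.loads(line) for line in Path("corpus/corpus_filtered.jsonl").open()]
flagged = [e for e in entries if missing_core_nouns(e)]
print(f"{len(flagged)} of {len(entries)} entries missing at least one core noun")
```

Because this is a plain substring test, synonym swaps (`hamburger` → `burger`) are flagged too, so the count is an upper bound on real meaning drift, as the examples below show.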
### Examples of dropped core nouns

| # | Saying | Missing noun | What happened |
|---|--------|-------------|---------------|
| 1 | "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest." | hamburger (slot A) | Shortened to "burger" — acceptable synonym |
| 2 | "A boat without wood's just wet water, you know." | sailboat (slot A) | Generalized to "boat" — acceptable |
| 3 | "A gilded rifle's just wood that got fancy." | weapon (slot A) | Replaced by the more specific "rifle" (slot B) — fine |
| 4 | "Soda without the fizz is just water." | glass (slot B) | "Glass" (container material) was irrelevant to meaning — LLM correctly dropped it |
| 5 | "A falcon without its little buddy's got big ideas and nowhere sensible to fly 'em." | oxpecker, wing, flapping (slots B/C/D) | Heavy rewrite; meaning drifted |

### Verdict

Most "dropped nouns" are acceptable: the LLM used synonyms (`hamburger` → `burger`), generalizations (`sailboat` → `boat`), or correctly dropped irrelevant slot fills. True meaning-drift cases (example 5) exist but are uncommon. The relaxed filter was the right call — the strict `lost_key_nouns` filter in `discard_analysis.csv` already caught and discarded 689 entries during the polish phase.

**The dropped noun issue from the prior session appears resolved.** The naturalization pass and relaxed rebuild filter handle it appropriately.

---
## 5. Processing Pipeline Status

### Pipeline stages (in execution order)

| Stage | Script | Status | Entries In → Out | Errors |
|-------|--------|--------|-----------------|--------|
| 1. Raw generation | `generate_raw_batch.sh` | ✅ Complete | → 9,835 | 0 |
| 2. LLM polish | `polish_corpus.py` | ✅ Complete (81.5 min) | 9,835 → 5,499 polished + 4,336 discards | 0 |
| 3. Naturalization | `naturalize_corpus.py` | ✅ Complete (147.8 min) | 9,835 → 9,468 usable | 0 |
| 4. Rebuild (filter + dedup + format) | `rebuild_training_pairs.py` | ✅ Complete | 19,031 → 9,025 filtered → 36,079 pairs | 0 |

All stages completed with **zero errors** across all runs.

### Discard breakdown

| Reason | Count |
|--------|-------|
| LLM polish → DISCARD | 4,336 (44.1% of raw) |
| Near-duplicate removal | 2,495 |
| Lost key nouns (strict filter) | 689 |
| Too long (>25 words) | 3 |
| Naturalization filtered | 73 |
| Naturalization skipped | 436 |

### Logs

- `corpus/polish_log.txt` — clean run, 0 errors, steady 1.6–2.0 req/s throughput
- `corpus/naturalize_log.txt` — clean run, 0 errors, steady 1.0–1.1 req/s throughput

---
## 6. Training Readiness Assessment

### What's ready

- **Volume:** 9,025 unique sayings and 36,079 training pairs make a solid corpus for a 0.5B fine-tune. ~757K tokens of training data.
- **Pipeline integrity:** All stages completed with zero errors. Clean logs, full checkpointing.
- **Vocab coverage:** 100% of the 624-word vocabulary appears in the corpus.
- **Format:** Training pairs are clean `{input, output, meta_template, source_words}` JSONL — plug directly into HF Trainer or axolotl.
- **No AI-isms:** Zero instances of common LLM crutch phrases ("it is important", "in conclusion", etc.).
- **Grammar:** Zero truncated or grammatically broken sentences detected (no trailing articles, no double articles, no unfilled slots).

### Quality risks (ranked)

**1. Opening phrase repetition (MEDIUM)**

The corpus has noticeable repetition in sentence openings:

- "The man who..." appears 151 times (1.7%)
- "A man who..." appears 110 times (1.2%)
- "Funny how a/the..." appears 184 times combined (2.0%)

This could cause the fine-tuned model to over-rely on these openings. Not a blocker, but worth monitoring in generation quality after training.

**2. Template imbalance (LOW-MEDIUM)**

Three template families are below 10%: `tautological_wisdom` (8.0%), `denial_of_consequences` (8.3%), `hypocritical_complaint` (9.0%). The spec says to go back and generate more if below 10%. The gap is small — the model will still see 725+ examples of each — but it's a known deviation from spec.

**3. Semantic misfires (LOW)**

A small percentage of entries have nonsensical relationships that survived filtering (e.g., "Dolphin kin always do without the wave fin", "A van-sized thing'll make you a canoe"). These are rare enough (<1% by estimate) that they'll be noise in training, not a pattern the model learns.

**4. Missing fictional entity pairs (LOW)**

The spec calls for ~200-300 fictional entity training pairs. None are present. This means the model won't learn the "describe an entity → generate a saying about it" pattern out of the box. This can be added post-training or in a follow-up fine-tune pass.

### Recommendation

**The corpus is ready for the RunPod training run.** The biggest risk is the opening phrase repetition, but at roughly 2% per pattern across 36K pairs, it's unlikely to dominate the model's behavior. The template imbalance is a minor spec deviation (8.0-9.0% vs the 10% target) and can be corrected in a second training round if the model shows weakness on those families.

Start training. Evaluate the model's output diversity after 1 epoch — if it's over-producing "The man who..." or "Funny how..." openings, consider deduplicating by opening trigram before the next training round; a sketch of that dedup follows.
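A minimal sketch of that opening-trigram dedup, assuming the training-pair schema from Section 1 (the output file name is illustrative). It keeps the first saying seen for each distinct three-word opening and drops every pair whose saying did not survive — blunt, but a cheap way to cap opening repetition:

```python
import json
from pathlib import Path

pairs = [json.loads(line) for line in Path("corpus/training_pairs.jsonl").open()]

# Keep the first saying seen for each opening trigram (first three lowercased words).
first_by_opening: dict[tuple[str, ...], str] = {}
for saying in dict.fromkeys(p["output"] for p in pairs):  # unique sayings, original order
    opening = tuple(saying.lower().split()[:3])
    first_by_opening.setdefault(opening, saying)

keep = set(first_by_opening.values())
deduped = [p for p in pairs if p["output"] in keep]

with Path("corpus/training_pairs_dedup.jsonl").open("w") as f:
    for p in deduped:
        f.write(json.dumps(p) + "\n")

print(f"{len(deduped)} of {len(pairs)} pairs kept")
```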
GPU_TRAINING_REQUIREMENTS.md (new file, 413 lines)
@@ -0,0 +1,413 @@

# GPU Training Requirements — Folksy Proverb Model

**Date:** 2026-03-13
**Status:** Planning
**Prerequisite:** Corpus generation complete (9,025 sayings, 36,079 training pairs)

---

## 1. Objective

Fine-tune a 0.5B parameter language model to generate folksy proverbs on demand. The model should respond to varied prompt styles (word-seeded, persona-seeded, template-seeded, open-ended) with natural-sounding fake folk wisdom in the style of the generated corpus.

---

## 2. Base Model Selection

### Recommendation: **Qwen2.5-0.5B-Instruct**

| Criterion | Qwen2.5-0.5B-Instruct |
|-----------|----------------------|
| Parameters | 494M |
| Architecture | Transformer decoder, GQA, RoPE |
| Context window | 32,768 tokens |
| License | Apache 2.0 |
| Source | `Qwen/Qwen2.5-0.5B-Instruct` on HuggingFace |
| Tokenizer | Byte-level BPE (151,646 vocab) |

### Why Qwen2.5-0.5B-Instruct

- **Exact size target.** The corpus generation spec calls for a 0.5B model; this is precisely that.
- **Already instruction-tuned.** The training pairs use an input/output (instruction/response) format. Starting from an instruct-tuned base means the model already understands the turn structure — we're teaching it *what* to say, not *how* to follow instructions.
- **Modern architecture.** Grouped-query attention and RoPE positional encoding. Trains efficiently and runs inference fast.
- **Apache 2.0.** No usage restrictions for any deployment scenario.
- **Strong small-model baseline.** Qwen2.5-0.5B benchmarks well against peers (SmolLM2-360M, TinyLlama-1.1B, GPT-2 Medium). It punches above its weight on language tasks.

### Alternatives Considered

| Model | Size | Why Not |
|-------|------|---------|
| SmolLM2-360M | 360M | Slightly undersized; weaker language generation quality |
| TinyLlama-1.1B | 1.1B | 2x the target size; more VRAM, slower inference, marginal quality gain for this task |
| GPT-2 Medium | 355M | Outdated architecture (absolute positional encoding, no GQA); poor instruction-following baseline |
| Phi-3-mini | 3.8B | 7.6x over budget; overkill for a narrow-domain generative task |

---
## 3. Training Data

### Corpus Summary

| Metric | Value |
|--------|-------|
| Training pairs | 36,079 |
| Unique sayings | 9,025 |
| File | `corpus/training_pairs.jsonl` |
| Size on disk | 7.5 MB |
| Average output length | 10.1 words (~15-20 tokens) |
| Average input length | ~6-10 words (~8-15 tokens) |
| Vocab coverage | 624/624 (100%) |

### Format

Each line is a JSON object:

```json
{"input": "Tell me something about color.", "output": "A burger without beef? That's just a fancy tomato thinkin' it's better than the rest.", "meta_template": "deconstruction", "source_words": ["hamburger", "beef", "color", "tomato"]}
```

### Distribution by Template Family

| Template | Sayings | % |
|----------|---------|---|
| false_equivalence | 1,897 | 21.0% |
| futile_preparation | 1,735 | 19.2% |
| ironic_deficiency | 1,563 | 17.3% |
| deconstruction | 1,544 | 17.1% |
| hypocritical_complaint | 811 | 9.0% |
| denial_of_consequences | 750 | 8.3% |
| tautological_wisdom | 725 | 8.0% |

Three families are below the 10% balance threshold (denial_of_consequences, hypocritical_complaint, tautological_wisdom). This is a known issue from corpus generation. The training should still work — the model will slightly underperform on these templates but the imbalance is not severe.

### Distribution by Input Type

| Input Type | Pairs |
|------------|-------|
| word_seeded | 9,025 |
| category_seeded | 9,025 |
| persona_seeded | 9,025 |
| template_seeded | 6,858 |
| open_ended | 2,146 |

### Data Preparation for Training

The JSONL needs to be converted to chat-template format for the Qwen2.5 tokenizer:

```python
# Each training pair becomes:
messages = [
    {"role": "user", "content": entry["input"]},
    {"role": "assistant", "content": entry["output"]},
]
# Tokenized using Qwen2.5's chat template via tokenizer.apply_chat_template()
```

Split: **90/5/5** (train/validation/test) → ~32,471 train / ~1,804 val / ~1,804 test. Stratify by `meta_template` to preserve template distribution in each split.
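One way to implement that stratified split, as a rough sketch (the actual data-prep script may differ; this just groups by `meta_template`, shuffles each group, and carves off 5% + 5%):

```python
import json
import random
from collections import defaultdict
from pathlib import Path

random.seed(42)
pairs = [json.loads(line) for line in Path("corpus/training_pairs.jsonl").open()]

by_template = defaultdict(list)
for pair in pairs:
    by_template[pair["meta_template"]].append(pair)

train, val, test = [], [], []
for template, group in sorted(by_template.items()):
    random.shuffle(group)
    n_val = max(1, round(0.05 * len(group)))
    n_test = max(1, round(0.05 * len(group)))
    val.extend(group[:n_val])
    test.extend(group[n_val:n_val + n_test])
    train.extend(group[n_val + n_test:])

print(len(train), len(val), len(test))  # roughly 32,471 / 1,804 / 1,804
```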
---

## 4. Hardware

### Local GPU: RTX 4090

| Resource | Available |
|----------|-----------|
| GPU | NVIDIA GeForce RTX 4090 |
| VRAM | 24 GB GDDR6X |
| System RAM | 128 GB |
| CPU cores | 12 |
| CUDA driver | 555.42.02 |

### VRAM Budget (Full Fine-Tune, bf16)

| Component | Estimate |
|-----------|----------|
| Model weights (bf16) | ~1.0 GB |
| Gradients (bf16) | ~1.0 GB |
| Optimizer states (AdamW, fp32) | ~2.0 GB |
| Activations (batch 32, seq 128) | ~2-4 GB |
| CUDA overhead + buffers | ~1-2 GB |
| **Total** | **~7-10 GB** |

A 0.5B model full fine-tune fits comfortably in 24 GB VRAM. No need for LoRA, QLoRA, or gradient checkpointing. Full fine-tune is the simplest and most effective approach for this model size.
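The budget above is back-of-the-envelope arithmetic. A sketch of the same estimate; the activation and overhead terms are rough assumptions, not measurements:

```python
# Rough VRAM estimate for a full bf16 fine-tune of a ~0.5B-parameter model.
params = 494e6          # Qwen2.5-0.5B-Instruct
bytes_per_param = 2     # bf16

weights   = params * bytes_per_param
gradients = params * bytes_per_param
optimizer = params * 2 * bytes_per_param   # AdamW: two moment buffers, sized to match the ~2 GB line above
activations_low, activations_high = 2e9, 4e9   # assumed range for batch 32, seq 128
overhead_low, overhead_high = 1e9, 2e9         # assumed CUDA context + buffers

low  = (weights + gradients + optimizer + activations_low + overhead_low) / 1e9
high = (weights + gradients + optimizer + activations_high + overhead_high) / 1e9
print(f"~{low:.0f}-{high:.0f} GB")   # roughly the 7-10 GB quoted above
```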
### Note on Concurrent GPU Use

The local 4090 currently serves GLM4-32B for inference at `192.168.1.100:8853`. **Training and LLM serving cannot run simultaneously** — the training job needs the full 24 GB. Shut down the vLLM/inference server before starting training. This means LLM-as-judge evaluation must happen either before or after training, not during.

---

## 5. Training Approach

### Method: Full Fine-Tune (SFT)

No LoRA/QLoRA. At 0.5B parameters with 24 GB VRAM, full fine-tune is straightforward and produces the best results. LoRA's parameter-efficiency advantage is irrelevant when the full model fits in memory with room to spare.

### Framework: HuggingFace Transformers + TRL

```
transformers >= 4.45.0
trl >= 0.12.0
datasets >= 3.0.0
torch >= 2.4.0
accelerate >= 1.0.0
peft  # not needed for full fine-tune, but useful if experimenting with LoRA later
```

### Hyperparameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Learning rate | 2e-5 | Standard for SFT on instruct models |
| LR scheduler | Cosine with warmup | |
| Warmup ratio | 0.05 | ~150 steps |
| Epochs | 3 | Small dataset; 3-5 epochs before overfitting |
| Per-device batch size | 32 | Fits easily; increase if VRAM allows |
| Gradient accumulation | 1 | Effective batch = 32 |
| Max sequence length | 128 | Inputs ~15 tokens + outputs ~20 tokens; 128 is generous |
| Weight decay | 0.01 | |
| Precision | bf16 | 4090 supports bf16 natively |
| Optimizer | AdamW (torch fused) | |
| Eval strategy | steps (every 100) | |
| Save strategy | steps (every 500) | |
| Logging | TensorBoard or W&B | |

### Estimated Training Time

| Metric | Value |
|--------|-------|
| Training examples | ~32,471 |
| Steps per epoch (batch 32) | ~1,015 |
| Total steps (3 epochs) | ~3,045 |
| Throughput estimate (4090, 0.5B, bf16) | ~80-120 steps/min |
| **Estimated wall time** | **25-40 minutes** |

This is a very fast training job. Even 5 epochs would finish in about an hour.
### Training Script Skeleton

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")

dataset = load_dataset("json", data_files="corpus/training_pairs.jsonl")

def format_chat(example):
    return {
        "messages": [
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

dataset = dataset.map(format_chat)
split = dataset["train"].train_test_split(test_size=0.1, seed=42)

training_args = SFTConfig(
    output_dir="./folksy-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    max_seq_length=128,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=500,
    logging_steps=10,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./folksy-model/final")
tokenizer.save_pretrained("./folksy-model/final")
```

---

## 6. Evaluation

### Automated Metrics

1. **Validation loss / perplexity** — tracked during training via eval steps. Watch for overfitting (val loss increasing while train loss decreases).
2. **BLEU/ROUGE on test set** — sanity check, but not the primary metric for creative generation.
3. **Template coverage** — generate 1,000 sayings with varied prompts, verify all 7 template families appear in output.
4. **Lexical diversity** — distinct-1, distinct-2 (unique unigrams/bigrams in generated output). Low diversity = mode collapse.
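A minimal sketch of the distinct-n computation for metric 4 (the fraction of n-grams that are unique across a batch of generated sayings):

```python
def distinct_n(texts: list[str], n: int) -> float:
    """Fraction of n-grams that are unique across all generated texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(1, len(ngrams))

# Example usage on a handful of generations:
samples = [
    "The man who builds the nest hasn't got a single feather.",
    "An anchor's just an iron that got too big for its britches.",
]
print(distinct_n(samples, 1), distinct_n(samples, 2))
```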
### LLM-as-Judge Evaluation (Self-Hosted)

Use **GLM4-32B** (already available at `192.168.1.100:8853`) as an automated evaluator. No external API needed.

**Procedure:**

1. Stop the training job (free the GPU)
2. Restart the GLM4-32B inference server
3. Generate 200 sayings from the fine-tuned model across all prompt types
4. Send each to GLM4-32B with a judge prompt

**Judge prompt:**

```
Rate this folk saying on a 1-5 scale:
- 5: Sounds like a real proverb — natural, witty, memorable
- 4: Good folksy saying — natural language, clear meaning
- 3: Acceptable — grammatically correct but flat or formulaic
- 2: Awkward — grammatical issues or forced phrasing
- 1: Broken — nonsensical, incomplete, or garbled

Saying: "{generated_saying}"

Respond with only the number and a one-sentence justification.
```

**Target:** Mean score >= 3.5, with <10% scoring 1 or 2.
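Step 4 could look roughly like the sketch below, assuming the local server exposes an OpenAI-compatible `/v1/chat/completions` endpoint (vLLM does) and ignoring retries. The model name string is whatever the server registers for GLM4-32B:

```python
import json
import urllib.request

JUDGE_URL = "http://192.168.1.100:8853/v1/chat/completions"
# The full rubric shown above, with a {generated_saying} placeholder.
JUDGE_PROMPT = (
    "Rate this folk saying on a 1-5 scale:\n"
    "... (rubric as above) ...\n"
    'Saying: "{generated_saying}"\n\n'
    "Respond with only the number and a one-sentence justification."
)

def judge_saying(saying: str, model: str = "glm4-32b") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": JUDGE_PROMPT.format(generated_saying=saying)}],
        "temperature": 0.0,
        "max_tokens": 64,
    }
    req = urllib.request.Request(
        JUDGE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

print(judge_saying("An anchor's just an iron that got too big for its britches."))
```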
### Human Spot-Check

Sample 50 generated sayings, rate as Good/Okay/Bad (same criteria as EVALUATION.md). Target: >60% Good, <10% Bad.

### A/B Comparison

Generate 50 sayings from the fine-tuned model and 50 from the raw template engine. Present pairs to GLM4-32B (or manually) and ask which sounds more natural. The fine-tuned model should win >80% of comparisons.

---

## 7. Output Artifacts

| Artifact | Path | Description |
|----------|------|-------------|
| Model weights | `folksy-model/final/` | Full bf16 model (~1 GB) |
| Tokenizer | `folksy-model/final/` | Qwen2.5 tokenizer config + vocab |
| Training logs | `folksy-model/runs/` | TensorBoard event files |
| Checkpoints | `folksy-model/checkpoint-*/` | Intermediate saves every 500 steps |
| Eval results | `folksy-model/eval_results.json` | Automated metrics on test set |
| Judge results | `folksy-model/judge_results.jsonl` | GLM4-32B evaluation scores |

### Model Distribution (Optional)

The final model can be:

- **Quantized to GGUF** (via `llama.cpp`) for CPU inference — a 0.5B model runs on any machine
- **Pushed to HuggingFace Hub** if sharing publicly
- **Served locally** via vLLM, llama.cpp, or Ollama for integration testing
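Before any of that, the checkpoint can be smoke-tested directly with transformers. A minimal sketch, assuming the `folksy-model/final/` layout from the table above; the prompt and sampling parameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "folksy-model/final"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Tell me something about patience."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=40, do_sample=True, temperature=0.9, top_p=0.95)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```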
---

## 8. RunPod Feasibility

### Is RunPod needed?

**No.** The local RTX 4090 is more than sufficient. A 0.5B full fine-tune on 36K examples will finish in under an hour. RunPod would be useful only if:

- The local GPU is occupied with inference work that can't be interrupted
- You want to run multiple training experiments in parallel (hyperparameter sweeps)
- You scale up to a larger base model (e.g., 3B+)

### If RunPod is Used Anyway

| Instance | GPU | VRAM | Hourly Cost (approx) | Notes |
|----------|-----|------|----------------------|-------|
| RTX 4090 | 1x 4090 | 24 GB | ~$0.40/hr | Identical to local hardware |
| A40 | 1x A40 | 48 GB | ~$0.50/hr | More VRAM headroom; good if experimenting with larger batch sizes |
| RTX A6000 | 1x A6000 | 48 GB | ~$0.60/hr | Same tier as A40 |

At ~30-40 minutes of training time, the total cost would be under $0.50 for a single run. Use the `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04` template.

---

## 9. Self-Hosted LLM Jobs (Replacing External API Dependencies)

All tasks that might otherwise require a paid API key are reformulated as self-hosted jobs using the existing local infrastructure.

| Task | External API Alternative | Self-Hosted Solution |
|------|--------------------------|----------------------|
| **LLM-as-Judge evaluation** | GPT-4 / Claude API | **GLM4-32B** (32B, local 4090) — already running at `192.168.1.100:8853` |
| **Data augmentation** (if more training pairs needed) | GPT-4 for paraphrase generation | **GLM4-32B** — same endpoint, same prompts used for corpus polishing |
| **Synthetic evaluation prompts** | API-generated diverse test prompts | **GLM4-32B** — generate varied evaluation prompts locally |
| **Model comparison judging** | Claude API for A/B preference judging | **GLM4-32B** — structured judge prompt with forced-choice output |
| **Embedding-based dedup** (if scaling corpus) | OpenAI embeddings API | **Sentence-transformers** (e.g., `all-MiniLM-L6-v2`, runs on CPU, 80MB) |
| **Classification of failure modes** | API-based analysis | **GLM4-32B** — classify generated sayings by quality/failure type |
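The embedding-based dedup row could look roughly like this, assuming `sentence-transformers` is installed; the 0.9 cosine cutoff and greedy strategy are illustrative choices, not a tuned setting:

```python
from sentence_transformers import SentenceTransformer, util

def dedup_by_embedding(sayings: list[str], threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate removal: keep a saying only if it is not too similar
    to anything already kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sayings, convert_to_tensor=True, normalize_embeddings=True)
    kept_idx: list[int] = []
    for i in range(len(sayings)):
        if not kept_idx:
            kept_idx.append(i)
            continue
        sims = util.cos_sim(embeddings[i], embeddings[kept_idx])
        if float(sims.max()) < threshold:
            kept_idx.append(i)
    return [sayings[i] for i in kept_idx]
```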
**Key constraint:** GLM4-32B and the training job cannot share the 4090 simultaneously. Sequence the workflow as:

```
1. [GPU: GLM4-32B]  Pre-training evaluation — generate baseline judge scores
2. [GPU: idle]      Shut down inference server
3. [GPU: training]  Fine-tune Qwen2.5-0.5B (~30-40 min)
4. [GPU: idle]      Training complete, model saved
5. [GPU: GLM4-32B]  Restart inference server
6. [GPU: GLM4-32B]  Post-training evaluation — judge fine-tuned model output
```

If a larger judge model is desired (e.g., 70B for higher-quality evaluation), options for the 4090:

- **Qwen2.5-72B-Instruct-AWQ** (4-bit, ~40GB) — does not fit in 24 GB single-GPU
- **Qwen2.5-32B-Instruct** — similar quality tier to GLM4-32B, interchangeable
- **Llama-3.1-70B-Instruct** (GGUF Q4_K_M) — ~40GB, does not fit on a single 4090
- **Conclusion:** GLM4-32B is the practical ceiling for single-4090 evaluation. For 70B+ judge models, use RunPod with 2x A6000 or 1x A100 80GB (~$1-2/hr, would need <1 hour for 200 evaluations).

---

## 10. Blockers and Dependencies

### No Blockers

The training pipeline has no external dependencies that aren't already met.

| Dependency | Status |
|------------|--------|
| Training corpus (`training_pairs.jsonl`) | Complete — 36,079 pairs |
| GPU hardware (RTX 4090) | Available |
| Base model (Qwen2.5-0.5B-Instruct) | Public on HuggingFace, Apache 2.0 |
| Python packages (transformers, trl, torch) | Install needed — `pip install transformers trl datasets accelerate torch` |
| Evaluation LLM (GLM4-32B) | Running locally |
| CUDA toolkit | Installed (driver 555.42.02) |

### Pre-Training Checklist

- [ ] Install training dependencies: `pip install transformers trl datasets accelerate tensorboard`
- [ ] Download base model: `huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct`
- [ ] Write training script (skeleton provided in Section 5)
- [ ] Write data preparation script (JSONL → chat-template format with train/val/test split)
- [ ] Shut down GLM4-32B inference server to free GPU memory
- [ ] Run training (~30-40 min)
- [ ] Restart GLM4-32B for evaluation
- [ ] Run LLM-as-judge evaluation
- [ ] Human spot-check of 50 generated sayings

### Risks

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Overfitting on 36K examples | Medium | Monitor val loss; use early stopping; try 2-3 epochs instead of 5 |
| Mode collapse (model produces same few sayings) | Low-Medium | Check distinct-n metrics; if occurring, reduce learning rate or add dropout |
| Template imbalance causes weak coverage | Low | 3 templates are at 8-9% (below 10% threshold); generate more pairs for these families if needed |
| GPU memory conflict with inference server | None if sequenced | Follow the GPU sequencing workflow in Section 9 |

---

## 11. Stretch Goals (Post-Initial Training)

These are not blockers but could improve results in subsequent iterations:

1. **Fictional entity fine-tuning pairs** — The CORPUS_GENERATION_SPEC describes ~200-300 fictional entity training pairs. These were not included in the current 36K corpus. Adding them would teach the model to generalize to novel concepts provided at inference time.

2. **DPO/ORPO alignment** — After SFT, use GLM4-32B to generate preference pairs (chosen vs rejected sayings) and run a DPO pass. This could sharpen quality without needing more training data.

3. **Hyperparameter sweep** — Run 3-5 configs varying learning rate (1e-5 to 5e-5) and epochs (2-5). With 30-minute training runs, a full sweep takes ~2.5 hours.

4. **Larger base model experiment** — Try Qwen2.5-1.5B-Instruct (1.5B params) as a comparison. Still fits on the 4090 for full fine-tune. Compare quality vs the 0.5B model to see if the extra parameters matter for this narrow task.

5. **GGUF export for deployment** — Convert the final model to GGUF Q8_0 format for CPU-only deployment via llama.cpp or Ollama. A 0.5B Q8 model is ~500MB and runs at interactive speed on any modern CPU.
scripts/train_sft.py (new file, 296 lines)
@@ -0,0 +1,296 @@

#!/usr/bin/env python3
"""SFT fine-tune Qwen3-0.6B-Base on folksy proverb training pairs.

Usage:
    python scripts/train_sft.py

Expects corpus/training_pairs.jsonl in the project root.
Outputs model checkpoints and training logs to folksy-model/.
"""

import json
import os
import random
import sys
import time
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path

# Prevent CUDA fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainerCallback
from trl import SFTConfig, SFTTrainer

# === Configuration ===
MODEL_ID = "Qwen/Qwen3-0.6B-Base"
PROJECT_ROOT = Path(__file__).resolve().parent.parent
DATA_FILE = PROJECT_ROOT / "corpus" / "training_pairs.jsonl"
OUTPUT_DIR = PROJECT_ROOT / "folksy-model"
CHECKPOINT_DIR = OUTPUT_DIR / "checkpoints"
FINAL_MODEL_DIR = OUTPUT_DIR / "final"
LOG_FILE = OUTPUT_DIR / "training_log.jsonl"

# ChatML template without Qwen3 thinking tags — clean input/output format
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n"
    "{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

TRAINING_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "max_seq_length": 128,
    "eval_steps": 100,
    "save_steps": 500,
    "logging_steps": 10,
    "seed": 42,
}


def log_event(event: str, **kwargs):
    """Append a structured log event."""
    entry = {"timestamp": datetime.now().isoformat(), "event": event, **kwargs}
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(entry) + "\n")
    detail = {k: v for k, v in kwargs.items() if k != "timestamp"}
    print(f"[{entry['timestamp'][:19]}] {event}", detail if detail else "")


def load_and_split_data(data_file: Path, val_ratio=0.05, test_ratio=0.05):
    """Load JSONL training pairs and create stratified splits by meta_template."""
    with open(data_file) as f:
        records = [json.loads(line) for line in f]

    log_event("data_loaded", total_records=len(records))

    # Convert to chat messages format
    for r in records:
        r["messages"] = [
            {"role": "user", "content": r["input"]},
            {"role": "assistant", "content": r["output"]},
        ]

    # Stratified split by meta_template
    random.seed(42)
    groups = defaultdict(list)
    for i, r in enumerate(records):
        groups[r["meta_template"]].append(i)

    train_idx, val_idx, test_idx = [], [], []
    for template, indices in sorted(groups.items()):
        random.shuffle(indices)
        n = len(indices)
        n_test = max(1, round(n * test_ratio))
        n_val = max(1, round(n * val_ratio))
        test_idx.extend(indices[:n_test])
        val_idx.extend(indices[n_test : n_test + n_val])
        train_idx.extend(indices[n_test + n_val :])

    def make_dataset(indices):
        return Dataset.from_list(
            [
                {
                    "messages": records[i]["messages"],
                    "meta_template": records[i]["meta_template"],
                }
                for i in indices
            ]
        )

    train_ds = make_dataset(train_idx)
    val_ds = make_dataset(val_idx)
    test_ds = make_dataset(test_idx)

    log_event(
        "data_split", train=len(train_ds), val=len(val_ds), test=len(test_ds)
    )

    # Print distribution
    for name, ds in [("train", train_ds), ("val", val_ds), ("test", test_ds)]:
        dist = Counter(ds["meta_template"])
        print(f"\n{name} ({len(ds)} examples):")
        for t, c in sorted(dist.items()):
            print(f"  {t}: {c} ({c / len(ds) * 100:.1f}%)")

    return train_ds, val_ds, test_ds


class MetricsLogger(TrainerCallback):
    """Log training metrics to the run log file."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            log_event("train_metrics", step=state.global_step, **logs)

    def on_epoch_end(self, args, state, control, **kwargs):
        epoch_num = int(state.epoch)
        train_loss = None
        for entry in reversed(state.log_history):
            if "loss" in entry:
                train_loss = entry["loss"]
                break
        log_event("epoch_complete", epoch=epoch_num, train_loss=train_loss)


def main():
    start_time = time.time()

    # Verify GPU
    if not torch.cuda.is_available():
        print("ERROR: No CUDA GPU available")
        sys.exit(1)

    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1024**3
    log_event("session_start", gpu=gpu_name, vram_gb=round(gpu_mem, 1))

    # Verify data
    if not DATA_FILE.exists():
        print(f"ERROR: Training data not found at {DATA_FILE}")
        sys.exit(1)

    # Create output directories
    for d in [OUTPUT_DIR, CHECKPOINT_DIR]:
        d.mkdir(parents=True, exist_ok=True)

    # Load data
    train_ds, val_ds, test_ds = load_and_split_data(DATA_FILE)

    # Save val/test splits for later evaluation
    val_ds.to_json(OUTPUT_DIR / "val_split.jsonl")
    test_ds.to_json(OUTPUT_DIR / "test_split.jsonl")

    # Load model and tokenizer
    log_event("model_loading", model=MODEL_ID)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, dtype=torch.bfloat16
    )

    # Override chat template to remove Qwen3 thinking tags
    tokenizer.chat_template = CHAT_TEMPLATE
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    param_count = sum(p.numel() for p in model.parameters())
    log_event("model_loaded", parameters=f"{param_count / 1e6:.0f}M")

    # Verify tokenization on a sample
    sample_messages = train_ds[0]["messages"]
    sample_encoded = tokenizer.apply_chat_template(
        sample_messages, tokenize=True, return_dict=False
    )
    sample_text = tokenizer.apply_chat_template(
        sample_messages, tokenize=False
    )
    print(f"\nSample tokenization ({len(sample_encoded)} tokens):")
    print(sample_text)

    # Configure training
    training_args = SFTConfig(
        output_dir=str(CHECKPOINT_DIR),
        num_train_epochs=TRAINING_CONFIG["num_train_epochs"],
        per_device_train_batch_size=TRAINING_CONFIG["per_device_train_batch_size"],
        gradient_accumulation_steps=1,
        learning_rate=TRAINING_CONFIG["learning_rate"],
        lr_scheduler_type=TRAINING_CONFIG["lr_scheduler_type"],
        warmup_ratio=TRAINING_CONFIG["warmup_ratio"],
        weight_decay=TRAINING_CONFIG["weight_decay"],
        bf16=True,
        max_length=TRAINING_CONFIG["max_seq_length"],
        eval_strategy="steps",
        eval_steps=TRAINING_CONFIG["eval_steps"],
        save_strategy="steps",
        save_steps=TRAINING_CONFIG["save_steps"],
        logging_steps=TRAINING_CONFIG["logging_steps"],
        report_to="tensorboard",
        logging_dir=str(OUTPUT_DIR / "runs"),
        seed=TRAINING_CONFIG["seed"],
        dataloader_num_workers=2,
        optim="adamw_torch_fused",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        processing_class=tokenizer,
        callbacks=[MetricsLogger()],
    )

    steps_per_epoch = len(train_ds) // TRAINING_CONFIG["per_device_train_batch_size"]
    total_steps = steps_per_epoch * TRAINING_CONFIG["num_train_epochs"]
    log_event(
        "training_start",
        steps_per_epoch=steps_per_epoch,
        total_steps_approx=total_steps,
        config=TRAINING_CONFIG,
    )

    # Train
    train_result = trainer.train()

    training_time = time.time() - start_time
    log_event(
        "training_complete",
        wall_time_seconds=round(training_time, 1),
        wall_time_minutes=round(training_time / 60, 1),
        train_loss=train_result.training_loss,
        train_runtime=train_result.metrics.get("train_runtime"),
        train_samples_per_second=train_result.metrics.get(
            "train_samples_per_second"
        ),
    )

    # Save final model
    FINAL_MODEL_DIR.mkdir(parents=True, exist_ok=True)
    trainer.save_model(str(FINAL_MODEL_DIR))
    tokenizer.save_pretrained(str(FINAL_MODEL_DIR))

    # Save full training log history
    with open(FINAL_MODEL_DIR / "trainer_state.json", "w") as f:
        json.dump(trainer.state.log_history, f, indent=2)

    log_event("final_model_saved", path=str(FINAL_MODEL_DIR))

    # Run final eval on test set (swap eval dataset temporarily)
    original_eval_ds = trainer.eval_dataset
    trainer.eval_dataset = test_ds
    test_metrics = trainer.evaluate(metric_key_prefix="test")
    trainer.eval_dataset = original_eval_ds
    log_event("test_eval", **test_metrics)

    # Print summary
    print("\n" + "=" * 60)
    print("TRAINING COMPLETE")
    print("=" * 60)
    print(f"Model: {MODEL_ID}")
    print(f"Training pairs: {len(train_ds)}")
    print(f"Val pairs: {len(val_ds)}")
    print(f"Test pairs: {len(test_ds)}")
    print(f"Final train loss: {train_result.training_loss:.4f}")
    print(f"Test loss: {test_metrics.get('test_loss', 'N/A')}")
    print(f"Wall time: {training_time / 60:.1f} minutes")
    print(f"Checkpoint: {FINAL_MODEL_DIR}")
    print(f"Training log: {LOG_FILE}")
    print(f"TensorBoard: {OUTPUT_DIR / 'runs'}")
    print("=" * 60)


if __name__ == "__main__":
    main()