Fine-Tuning Qwen3.5 on X and LinkedIn Data: End-to-End Guide with Unsloth and llama.cpp
If you want an LLM to sound more like your writing voice, the fastest path is a narrow, text-only fine-tune with clean chat-format data and a strict evaluation loop.
This post covers exactly how I fine-tuned Qwen3.5-4B with Unsloth on my own X and LinkedIn posts, plus how I evaluated baseline vs fine-tuned behavior using llama.cpp on my local machine.
TL;DR Results
| Metric | Value |
|--------|-------|
| Model | Qwen3.5-4B (Unsloth, 4-bit QLoRA) |
| Train / Eval split | 171 / 33 rows |
| Epochs / Steps | 3 / 33 |
| Train loss | 2.970 |
| Eval loss | 3.014 |
| Baseline avg tokens | 52.33 |
| Fine-tuned avg tokens | 76.08 |
| Fine-tuned longer on | 10/12 prompt-seed pairs |
Why Qwen3.5-4B
I chose Qwen3.5-4B because it hits a practical sweet spot for personal voice adaptation:
- Small enough to fine-tune on a single RTX 4090 (24GB VRAM) with 4-bit quantization
- Large enough to produce coherent, stylistically varied outputs across different prompt types
- Qwen3.5 architecture supports text-only SFT cleanly without needing to disable vision/audio transforms
I initially tried 9B but ran into VRAM limits during training with the batch sizes I wanted. 4B gave me the room to iterate faster without sacrificing output quality on short-form social content.
Why Unsloth
Unsloth makes practical QLoRA runs easier to execute:
- Efficient 4-bit loading with bitsandbytes integration, keeping VRAM usage under 12GB during training
- Drop-in SFTTrainer compatibility — no custom training loop needed
- Clean LoRA merge and export to GGUF for local inference
For small-to-medium adaptation runs, this gives a fast iteration loop without overcomplicating infrastructure. I went from raw data to a working GGUF in under 2 hours.
Building the Dataset
The raw data came from two sources:
- X (Twitter): Exported via the archive download feature. I pulled my original tweets (not replies or retweets) and filtered for posts over 50 characters.
- LinkedIn: Manually copied from my post history. LinkedIn has no bulk export for post content, so this was the slow part.
After deduplication and quality filtering, I had ~200 usable posts. I split these 85/15 into train (171 rows) and eval (33 rows), stratified by platform so both sets had a mix of X and LinkedIn content.
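My actual split script isn't in this post, but the stratified split can be sketched in a few lines. This is a minimal illustration, assuming each post dict carries a `platform` tag; the field names are placeholders for whatever your export pipeline produces:

```python
import random

def stratified_split(posts, eval_frac=0.15, seed=42):
    """Split posts into train/eval, keeping each platform's share intact.

    `posts` is a list of dicts with at least a "platform" key
    ("x" or "linkedin"). Shuffling is seeded for reproducibility.
    """
    rng = random.Random(seed)
    by_platform = {}
    for post in posts:
        by_platform.setdefault(post["platform"], []).append(post)

    train, evalset = [], []
    for group in by_platform.values():
        rng.shuffle(group)
        n_eval = max(1, round(len(group) * eval_frac))
        evalset.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evalset
```

Stratifying per platform before slicing is what guarantees both sets see X and LinkedIn content, even with a dataset this small.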
Each post became the assistant turn in a chat-format training example. The system message and user prompt were generated to match the kind of prompts I would actually use at inference time.
Data Format
I followed the Unsloth Qwen3.5 fine-tuning guide for data preparation.
The format is OpenAI-style chat messages in JSONL, where each line contains a full conversation with system, user, and assistant turns:
{
"messages": [
{
"role": "system",
"content": "You are a social media writer who mirrors the voice and style of Thomas Mann. Write in first person. Be direct, specific, and opinionated. Avoid generic advice."
},
{
"role": "user",
"content": "Write a LinkedIn post about why most AI demos fail in production."
},
{
"role": "assistant",
"content": "I shipped a broken demo to 200 people last week. The model worked perfectly in my notebook. Then it hit real user inputs and fell apart in ways I never tested for..."
}
]
}
The system message sets the voice persona and guardrails. The user message is the prompt — I wrote these to cover the range of topics I actually post about (AI, building products, lessons learned, hot takes). The assistant message is the real post I wrote.
Each row was then rendered into a single training string using the model's chat template:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the rendered string, not token IDs
    add_generation_prompt=False, # the assistant turn is already in the data
)
This converts the structured messages into the exact token format Qwen3.5 expects, including special tokens like <|im_start|> and <|im_end|>. Training on this format means the model sees the same structure at inference time, which keeps behavior consistent.
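To make the rendered structure concrete, here is a minimal illustrative renderer for the ChatML-style layout Qwen models use. `tokenizer.apply_chat_template` is authoritative and may differ in details (default system turn, generation-prompt handling); this sketch only shows the shape:

```python
def render_chatml(messages):
    """Illustrative ChatML rendering: one <|im_start|>role ... <|im_end|>
    block per turn, which approximates what the Qwen chat template emits."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)
```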
Training Configuration
from trl import SFTTrainer, SFTConfig
sft_config = SFTConfig(
output_dir="outputs/qwen35_4b_brand_voice",
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # effective batch size = 8
num_train_epochs=3,
learning_rate=2e-4,
lr_scheduler_type="linear",
warmup_steps=5,
weight_decay=0.01,
fp16=True,
logging_steps=1,
eval_strategy="epoch",
save_strategy="epoch",
max_seq_length=2048,
dataset_text_field="text",
)
Key decisions:
- Learning rate 2e-4 with linear decay — standard for QLoRA, aggressive enough for small datasets without blowing past the optimum
- Effective batch size 8 — small enough to fit in VRAM, large enough for stable gradients on 171 examples
- 3 epochs — with only 171 rows, more epochs risks overfitting. I watched eval loss plateau after epoch 2 and it held steady through epoch 3
- Max sequence length 2048 — most social posts are well under 500 tokens, but I left headroom for longer LinkedIn posts
The LoRA config targeted the attention layers:
from peft import LoraConfig
lora_config = LoraConfig(
r=16,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.0,
bias="none",
task_type="CAUSAL_LM",
)
Rank 16 with alpha 16 (ratio = 1) is conservative. For voice adaptation on short-form text, you don't need high rank — the model isn't learning new knowledge, just adjusting its stylistic priors.
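Wiring these pieces together is mostly glue. A sketch of how the run assembles under Unsloth, with the caveat that the model repo ID and the `train_dataset`/`eval_dataset` variables are assumptions standing in for my actual pipeline, and the exact kwargs depend on your Unsloth/trl versions:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer

# 4-bit base model load; the repo ID below is a placeholder -- use whatever
# Unsloth publishes for Qwen3.5-4B in your environment.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-4B-bnb-4bit",  # hypothetical repo ID
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach the LoRA adapters with the config values shown above.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # rows pre-rendered into a "text" field
    eval_dataset=eval_dataset,
    args=sft_config,              # the SFTConfig shown earlier
)
trainer.train()
```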
Training Results
Training completed in about 8 minutes on a Runpod RTX 4090 instance.
| Epoch | Train Loss | Eval Loss |
|-------|-----------|-----------|
| 1 | 3.241 | 3.089 |
| 2 | 2.912 | 3.021 |
| 3 | 2.970 | 3.014 |
Eval loss decreased through all three epochs without diverging from train loss, which suggests the model is not overfitting yet; the small uptick in train loss at epoch 3 is likely just noise at this dataset size. The train-eval gap is small (~0.04), which is what you want on a dataset this size: it indicates the model is generalizing from the training examples rather than memorizing them.
Exporting to GGUF
After training, I merged the LoRA weights back into the base model and quantized to Q4_K_M for local inference:
model.save_pretrained_gguf(
"artifacts/gguf/qwen35_4b_brand_voice",
tokenizer,
quantization_method="q4_k_m"
)
This produces a single .gguf file around 2.5GB that runs in llama.cpp without any Python dependencies. The Q4_K_M quantization preserves most of the fine-tuned behavior while keeping inference fast on CPU or Apple Silicon.
Evaluating Baseline vs Fine-Tuned in llama.cpp
I used the same llama.cpp runtime path for both variants to eliminate runtime confounds:
- Baseline run: Standard Qwen3.5-4B GGUF (Q4_K_M), no adapter
- Fine-tuned run: Merged fine-tuned GGUF (Q4_K_M)
- Same decode settings: temperature 0.7, top_p 0.9, max tokens 256
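Under the hood the suite shells out to llama.cpp for every generation. A sketch of how the decode command gets built with those settings pinned for both variants; the flag names match llama.cpp's `llama-cli`, but verify them against your build:

```python
def build_llamacpp_cmd(gguf_path, prompt, seed, n_predict=256):
    """Build a llama-cli invocation with the fixed decode settings
    used for both baseline and fine-tuned runs."""
    return [
        "llama-cli",
        "-m", gguf_path,
        "-p", prompt,
        "--seed", str(seed),
        "--temp", "0.7",
        "--top-p", "0.9",
        "-n", str(n_predict),
    ]
```

Pinning the seed per generation is what makes the 3-seeds-per-prompt comparison meaningful: the two models see identical sampling conditions.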
The evaluation suite uses 20 prompts, each run with 3 different random seeds, for 60 total generations per model. For the quick slice I used 4 prompts x 3 seeds = 12 outputs.
Prompts were designed to cover the range of content I actually write:
- LinkedIn thought leadership posts
- X hot takes about AI/tech
- Building-in-public updates
- Contrarian opinions on industry trends
What Changed in Outputs
On the quick evaluation slice (12 generations: 4 prompts × 3 seeds):
- Fine-tuned outputs were generally longer (76.08 avg tokens vs 52.33) — the model learned that my posts tend to develop an argument rather than giving one-liner answers
- The baseline produced more generic, assistant-like responses ("Here's a post about..."). The fine-tuned model jumped straight into first-person writing
- Style coherence improved noticeably — more direct statements, fewer hedging phrases, more specific examples
- The full 60-generation suite (20 prompts × 3 seeds) is still needed before making strong statistical claims
Example baseline output:
"Here's a LinkedIn post about building in public: Building in public is about transparency and sharing your journey with others..."
Example fine-tuned output:
"Most founders talk about building in public like it's a content strategy. It's not. It's an accountability mechanism that forces you to ship when you'd rather polish..."
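The average-token comparison above was computed from the output JSONL files. A minimal version of that tally; the `"tokens"` field name is an assumption standing in for whatever your eval script records per generation:

```python
def avg_output_tokens(records):
    """Average generated-token count across eval records.

    Each record is one generation; the "tokens" field name is an
    assumption about the output schema, not a fixed format.
    """
    counts = [r["tokens"] for r in records]
    return round(sum(counts) / len(counts), 2)
```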
Practical Lessons
- Validate dataset schema before training. I lost a 40-minute run to a missing field in row 83. Write a schema check script and run it first.
- Keep the training path text-only if your source is text-only. Qwen3.5 supports multimodal inputs but enabling vision transforms on text-only data adds overhead and can introduce subtle bugs.
- Use the same inference runtime for baseline and fine-tuned comparisons. Different runtimes (e.g., vLLM vs llama.cpp) have different sampling implementations that can confound results.
- Start with a small eval slice for iteration speed, then run the full suite. I ran the 12-generation slice after each training experiment to get a quick signal before committing to the full 60-generation run.
- Your system prompt matters more than you think. I iterated through 4 versions of the system message before landing on one that consistently produced outputs in the right voice register.
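The schema check from the first lesson doesn't need to be fancy. A minimal validator for the JSONL chat format above, returning problems per row rather than crashing on the first bad one:

```python
import json

REQUIRED_ROLES = ("system", "user", "assistant")

def validate_row(line, lineno):
    """Return a list of problems with one JSONL row (empty list = valid)."""
    problems = []
    try:
        row = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"row {lineno}: invalid JSON ({e})"]
    messages = row.get("messages")
    if not isinstance(messages, list):
        return [f"row {lineno}: missing 'messages' list"]
    roles = [m.get("role") for m in messages]
    if roles != list(REQUIRED_ROLES):
        problems.append(f"row {lineno}: expected roles {REQUIRED_ROLES}, got {roles}")
    for i, m in enumerate(messages):
        if not (m.get("content") or "").strip():
            problems.append(f"row {lineno}: empty content in message {i}")
    return problems
```

Running this over the file before every training launch would have caught the missing field in row 83 in seconds instead of 40 minutes.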
Reproducible Workflow Commands
Install llama.cpp:
brew install llama.cpp
Download baseline model:
./scripts/download_llama_models.sh qwen35-4b
Run baseline-only eval:
python3 scripts/run_llamacpp_suite.py \
--prompts data/processed/eval_suite_run1_social_20x3.jsonl \
--baseline-gguf artifacts/gguf/qwen35_4b_baseline_q4_k_m.gguf \
--baseline-output artifacts/eval/run1/baseline_qwen35_4b_outputs.jsonl
Run baseline vs fine-tuned comparison:
python3 scripts/run_llamacpp_suite.py \
--prompts data/processed/eval_suite_run1_social_20x3.jsonl \
--baseline-gguf artifacts/gguf/qwen35_4b_baseline_q4_k_m.gguf \
--finetuned-gguf artifacts/gguf/qwen35_4b_brand_voice_q4_k_m.gguf \
--baseline-output artifacts/eval/run1/baseline_qwen35_4b_outputs.jsonl \
--finetuned-output artifacts/eval/run1/finetuned_qwen35_4b_outputs.jsonl
What's Next
- Run the full evaluation suite (20 prompts × 3 seeds = 60 generations) and compute BLEU/ROUGE scores against my real posts
- Try rank 32 LoRA to see if higher capacity improves style matching on longer-form LinkedIn content
- Experiment with DPO on preference pairs (good post vs generic output) as a second-stage alignment step
- Test Qwen3.5-8B now that I've optimized the training pipeline for lower memory usage
Final Takeaway
If your goal is voice adaptation, a focused QLoRA run on clean chat-format examples can move quality quickly. 171 training examples was enough to shift the model from generic assistant outputs to something that reads like my actual writing. The important part is not just training — it's disciplined evaluation under identical runtime conditions so you can trust the signal.
The full code for this project is on GitHub.