
Fine-Tuning Qwen3.5 on X and LinkedIn Data: End-to-End Guide with Unsloth and llama.cpp

If you want an LLM to sound more like your writing voice, the fastest path is a narrow, text-only fine-tune with clean chat-format data and a strict evaluation loop.

This post covers exactly how I fine-tuned Qwen3.5-4B with Unsloth on my own X and LinkedIn posts, plus how I evaluated baseline vs fine-tuned behavior using llama.cpp on my local machine.

TL;DR Results

| Metric | Value |
|--------|-------|
| Model | Qwen3.5-4B (Unsloth, 4-bit QLoRA) |
| Train / Eval split | 171 / 33 rows |
| Epochs / Steps | 3 / 33 |
| Train loss | 2.970 |
| Eval loss | 3.014 |
| Baseline avg tokens | 52.33 |
| Fine-tuned avg tokens | 76.08 |
| Fine-tuned longer on | 10/12 prompt-seed pairs |

Why Qwen3.5-4B

I chose Qwen3.5-4B because it hits a practical sweet spot for personal voice adaptation:

  • Small enough to fine-tune on a single RTX 4090 (24GB VRAM) with 4-bit quantization
  • Large enough to produce coherent, stylistically varied outputs across different prompt types
  • Qwen3.5 architecture supports text-only SFT cleanly without needing to disable vision/audio transforms

I initially tried 9B but ran into VRAM limits during training with the batch sizes I wanted. 4B gave me the room to iterate faster without sacrificing output quality on short-form social content.

Why Unsloth

Unsloth makes practical QLoRA runs easier to execute:

  • Efficient 4-bit loading with bitsandbytes integration, keeping VRAM usage under 12GB during training
  • Drop-in SFTTrainer compatibility — no custom training loop needed
  • Clean LoRA merge and export to GGUF for local inference

For small-to-medium adaptation runs, this gives a fast iteration loop without overcomplicating infrastructure. I went from raw data to a working GGUF in under 2 hours.

Building the Dataset

The raw data came from two sources:

  • X (Twitter): Exported via the archive download feature. I pulled my original tweets (not replies or retweets) and filtered for posts over 50 characters.
  • LinkedIn: Manually copied from my post history. LinkedIn has no bulk export for post content, so this was the slow part.

After deduplication and quality filtering, I had ~200 usable posts. I split these 85/15 into train (171 rows) and eval (33 rows), stratified by platform so both sets had a mix of X and LinkedIn content.
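
The platform-stratified split can be sketched with a small stdlib-only helper. This is a minimal sketch, assuming each row carries a `platform` field; the counts below use synthetic data, not my actual 171/33 split:

```python
import random

def stratified_split(rows, eval_frac=0.15, seed=42):
    """Split rows into train/eval, keeping each platform's share in both sets."""
    rng = random.Random(seed)
    by_platform = {}
    for row in rows:
        by_platform.setdefault(row["platform"], []).append(row)
    train, eval_rows = [], []
    for group in by_platform.values():
        rng.shuffle(group)
        n_eval = max(1, round(len(group) * eval_frac))
        eval_rows.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, eval_rows

# synthetic stand-ins for the real posts
rows = ([{"platform": "x", "text": f"tweet {i}"} for i in range(140)]
        + [{"platform": "linkedin", "text": f"post {i}"} for i in range(60)])
train, eval_rows = stratified_split(rows)
```

Shuffling within each platform group before slicing keeps the split random while guaranteeing both sets see both platforms.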

Each post became the assistant turn in a chat-format training example. The system message and user prompt were generated to match the kind of prompts I would actually use at inference time.

Data Format

I followed the Unsloth Qwen3.5 fine-tuning guide for data preparation.

The format is OpenAI-style chat messages in JSONL, where each line contains a full conversation with system, user, and assistant turns:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a social media writer who mirrors the voice and style of Thomas Mann. Write in first person. Be direct, specific, and opinionated. Avoid generic advice."
    },
    {
      "role": "user",
      "content": "Write a LinkedIn post about why most AI demos fail in production."
    },
    {
      "role": "assistant",
      "content": "I shipped a broken demo to 200 people last week. The model worked perfectly in my notebook. Then it hit real user inputs and fell apart in ways I never tested for..."
    }
  ]
}

The system message sets the voice persona and guardrails. The user message is the prompt — I wrote these to cover the range of topics I actually post about (AI, building products, lessons learned, hot takes). The assistant message is the real post I wrote.

Each row was then rendered into a single training string using the model's chat template:

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

This converts the structured messages into the exact token format Qwen3.5 expects, including special tokens like <|im_start|> and <|im_end|>. Training on this format means the model sees the same structure at inference time, which keeps behavior consistent.
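
Rendered by hand, that structure looks roughly like this. It's a simplified sketch for illustration only; in practice, always use `apply_chat_template` so the special tokens match the model's template exactly:

```python
def render_chatml(messages):
    """Wrap each turn in im_start/im_end markers, one turn per block."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

example = [
    {"role": "system", "content": "You are a social media writer."},
    {"role": "user", "content": "Write a post about shipping."},
    {"role": "assistant", "content": "I shipped a broken demo last week."},
]
text = render_chatml(example)
```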

Training Configuration

from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="outputs/qwen35_4b_brand_voice",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,    # effective batch size = 8
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=5,
    weight_decay=0.01,
    fp16=True,
    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    max_seq_length=2048,
    dataset_text_field="text",
)

Key decisions:

  • Learning rate 2e-4 with linear decay — standard for QLoRA; high enough to adapt quickly on a small dataset, low enough to avoid destabilizing training
  • Effective batch size 8 — small enough to fit in VRAM, large enough for stable gradients on 171 examples
  • 3 epochs — with only 171 rows, more epochs risk overfitting. Eval loss plateaued after epoch 2 and held steady through epoch 3
  • Max sequence length 2048 — most social posts are well under 500 tokens, but I left headroom for longer LinkedIn posts

The LoRA config targeted the attention layers:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

Rank 16 with alpha 16 (ratio = 1) is conservative. For voice adaptation on short-form text, you don't need high rank — the model isn't learning new knowledge, just adjusting its stylistic priors.

Training Results

Training completed in about 8 minutes on a Runpod RTX 4090 instance.

| Epoch | Train Loss | Eval Loss |
|-------|-----------|-----------|
| 1 | 3.241 | 3.089 |
| 2 | 2.912 | 3.021 |
| 3 | 2.970 | 3.014 |

Eval loss decreased through all 3 epochs without diverging from train loss, which means we're not overfitting yet. The gap between train and eval loss is small (~0.04), which is what you want on a dataset this size — it means the model is generalizing from the training examples rather than memorizing them.

Exporting to GGUF

After training, I merged the LoRA weights back into the base model and quantized to Q4_K_M for local inference:

model.save_pretrained_gguf(
    "artifacts/gguf/qwen35_4b_brand_voice",
    tokenizer,
    quantization_method="q4_k_m"
)

This produces a single .gguf file around 2.5GB that runs in llama.cpp without any Python dependencies. The Q4_K_M quantization preserves most of the fine-tuned behavior while keeping inference fast on CPU or Apple Silicon.
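
The file size checks out on a back-of-envelope calculation. Q4_K_M stores most tensors at 4 bits with some at 6; ~4.85 bits per weight is a commonly cited effective rate (treat that figure as an approximation, not an exact spec):

```python
# Estimate the GGUF file size for a 4B-parameter model at Q4_K_M.
params = 4e9
bits_per_weight = 4.85            # assumed effective rate for Q4_K_M
size_gb = params * bits_per_weight / 8 / 1e9   # roughly 2.4 GB
```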

Evaluating Baseline vs Fine-Tuned in llama.cpp

I used the same llama.cpp runtime path for both variants to eliminate runtime confounds:

  1. Baseline run: Standard Qwen3.5-4B GGUF (Q4_K_M), no adapter
  2. Fine-tuned run: Merged fine-tuned GGUF (Q4_K_M)
  3. Same decode settings: temperature 0.7, top_p 0.9, max tokens 256

The evaluation suite uses 20 prompts, each run with 3 different random seeds, for 60 total generations per model. For the quick slice I used 4 prompts x 3 seeds = 12 outputs.
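
The prompt × seed grid can be generated in a few lines. This sketch uses placeholder prompts and assumed field names; the real layout of `eval_suite_run1_social_20x3.jsonl` may differ:

```python
import itertools
import json

# 20 prompts x 3 seeds = 60 generations per model
prompts = [f"prompt {i}" for i in range(20)]   # placeholders for real prompts
seeds = [0, 1, 2]
rows = [
    {"prompt_id": i, "seed": s, "prompt": p}
    for (i, p), s in itertools.product(enumerate(prompts), seeds)
]
jsonl = "\n".join(json.dumps(r) for r in rows)  # one line per generation
```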

Prompts were designed to cover the range of content I actually write:

  • LinkedIn thought leadership posts
  • X hot takes about AI/tech
  • Building-in-public updates
  • Contrarian opinions on industry trends

What Changed in Outputs

On the quick evaluation slice (12 generations: 4 prompts × 3 seeds):

  • Fine-tuned outputs were generally longer (76.08 avg tokens vs 52.33) — the model learned that my posts tend to develop an argument rather than giving one-liner answers
  • The baseline produced more generic, assistant-like responses ("Here's a post about..."). The fine-tuned model jumped straight into first-person writing
  • Style coherence improved noticeably — more direct statements, fewer hedging phrases, more specific examples
  • A full 60-generation evaluation is still needed before making strong statistical claims
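
The average-token comparison is straightforward to reproduce. This sketch assumes each output row has an `output` field and counts whitespace tokens as an approximation; the real suite may count model tokens instead:

```python
import io
import json

def avg_tokens(jsonl_text):
    """Mean whitespace-token count per generation in a JSONL string."""
    rows = [json.loads(line) for line in io.StringIO(jsonl_text)]
    return sum(len(r["output"].split()) for r in rows) / len(rows)

# demo with synthetic generations
demo = "\n".join(json.dumps({"output": o}) for o in
                 ["one two three", "one two three four five"])
print(avg_tokens(demo))  # 4.0
```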

Example baseline output:

"Here's a LinkedIn post about building in public: Building in public is about transparency and sharing your journey with others..."

Example fine-tuned output:

"Most founders talk about building in public like it's a content strategy. It's not. It's an accountability mechanism that forces you to ship when you'd rather polish..."

Practical Lessons

  • Validate dataset schema before training. I lost a 40-minute run to a missing field in row 83. Write a schema check script and run it first.
  • Keep the training path text-only if your source is text-only. Qwen3.5 supports multimodal inputs but enabling vision transforms on text-only data adds overhead and can introduce subtle bugs.
  • Use the same inference runtime for baseline and fine-tuned comparisons. Different runtimes (e.g., vLLM vs llama.cpp) have different sampling implementations that can confound results.
  • Start with a small eval slice for iteration speed, then run the full suite. I ran the 12-generation slice after each training experiment to get a quick signal before committing to the full 60-generation run.
  • Your system prompt matters more than you think. I iterated through 4 versions of the system message before landing on one that consistently produced outputs in the right voice register.
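
The schema check from the first lesson can be as simple as the sketch below, assuming the three-turn system/user/assistant layout shown earlier:

```python
import json

EXPECTED_ROLES = ("system", "user", "assistant")

def check_row(row, lineno):
    """Raise on any row that doesn't match the three-turn chat layout."""
    msgs = row.get("messages")
    if not isinstance(msgs, list) or len(msgs) != len(EXPECTED_ROLES):
        raise ValueError(f"row {lineno}: expected {len(EXPECTED_ROLES)} messages")
    for msg, role in zip(msgs, EXPECTED_ROLES):
        if msg.get("role") != role:
            raise ValueError(f"row {lineno}: expected role {role!r}, got {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            raise ValueError(f"row {lineno}: missing or empty content for {role!r}")

def check_file(path):
    """Validate every line of a JSONL dataset before launching a run."""
    with open(path) as f:
        for i, line in enumerate(f, 1):
            check_row(json.loads(line), i)
```

Running this before every training launch costs seconds and would have caught the bad row 83 immediately.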

Reproducible Workflow Commands

Install llama.cpp:

brew install llama.cpp

Download baseline model:

./scripts/download_llama_models.sh qwen35-4b

Run baseline-only eval:

python3 scripts/run_llamacpp_suite.py \
  --prompts data/processed/eval_suite_run1_social_20x3.jsonl \
  --baseline-gguf artifacts/gguf/qwen35_4b_baseline_q4_k_m.gguf \
  --baseline-output artifacts/eval/run1/baseline_qwen35_4b_outputs.jsonl

Run baseline vs fine-tuned comparison:

python3 scripts/run_llamacpp_suite.py \
  --prompts data/processed/eval_suite_run1_social_20x3.jsonl \
  --baseline-gguf artifacts/gguf/qwen35_4b_baseline_q4_k_m.gguf \
  --finetuned-gguf artifacts/gguf/qwen35_4b_brand_voice_q4_k_m.gguf \
  --baseline-output artifacts/eval/run1/baseline_qwen35_4b_outputs.jsonl \
  --finetuned-output artifacts/eval/run1/finetuned_qwen35_4b_outputs.jsonl

What's Next

  • Run the full 60-generation evaluation suite and compute BLEU/ROUGE scores against my real posts
  • Try rank 32 LoRA to see if higher capacity improves style matching on longer-form LinkedIn content
  • Experiment with DPO on preference pairs (good post vs generic output) as a second-stage alignment step
  • Test Qwen3.5-8B now that I've optimized the training pipeline for lower memory usage

Final Takeaway

If your goal is voice adaptation, a focused QLoRA run on clean chat-format examples can move quality quickly. 171 training examples was enough to shift the model from generic assistant outputs to something that reads like my actual writing. The important part is not just training — it's disciplined evaluation under identical runtime conditions so you can trust the signal.

The full code for this project is on GitHub.