In Plain English: Fine-Tuning a Large Language Model on Your Legal Work Product (Part 3 - Model Results and Costs)
In Part 2, we built the eval: a dataset, a prompt, a scorer, and an experiment that compares the model's answer to the expected answer. Now we use it for the decision that matters: which open-weight model should we move forward with as our base?
Here is the problem: leaderboards tell you which model is "best." They do not tell you that the best-scoring model needs nearly three times the hardware of a model just 2.4 points behind it.
For a law firm deciding what to self-host, the score is only half the question. The other half is what it costs to keep that model running. If the model is going to protect privilege, stay off third-party APIs, and eventually fine-tune on your own work product, cost and feasibility matter.
So we tested seven open-weight models on a 500-item slice of LegalBench, then put each model's score next to the GPU footprint it takes to serve. The result reorders the leaderboard completely.
How we tested
We drew 500 tasks from LegalBench and split them into two halves that stress very different capabilities:
- 250 "closed" tasks: binary or multiple-choice classification (e.g. "is this clause a non-compete? yes/no"). Scored by exact match.
- 250 "open" tasks: complex, open-ended legal reasoning and drafting where there is no single string answer. Scored by an LLM judge rubric.
Every model is an open-weights model you can download and run yourself. A model's headline Score below is the average of its closed and open results. All cost figures are for inference only. Fine-tuning is covered separately at the end.
Here is the key: the highest score is not automatically the best business decision. The model we want is the one that is strong enough on legal work, affordable enough to run, and practical enough to fine-tune.
1. The raw leaderboard: GLM wins
Sorted by blended score, Z.ai's GLM-5.2 takes the top spot, with DeepSeek and Qwen close behind. Note the active vs. total parameters columns. Most of these are Mixture-of-Experts (MoE) models, which only "light up" a fraction of their weights per token. That gap matters enormously for cost, as we'll see.
One model is not like the others. We included Gemma-3-27B because it is a genuinely open, permissively licensed model that any firm can run, but at 27 billion parameters it is a fraction the size of the rest of the field, several of which run into the hundreds of billions or past a trillion total parameters. It is no surprise that a 27B model scores lower than models roughly 4 to 60 times its size. The interesting question, which the cost tables below answer, is whether that smaller model earns its keep on price.
| Lab | Model | Score | Total Params | Active Params | Quant |
|---|---|---|---|---|---|
| Z.ai | GLM-5.2 | 72.4 | 754B | 40B | FP8 |
| DeepSeek | DeepSeek-V4-Pro | 71.4 | 1.6T | 49B | FP8 |
| Qwen | Qwen3.5-397B | 70.0 | 397B | 17B | FP8 |
| NVIDIA | Nemotron-3-Ultra | 69.2 | 550B | 55B | BF16 |
| Meta | Llama-4-Maverick | 66.8 | 400B | 17B | FP8 |
| OpenAI | GPT-OSS-120B | 66.4 | 120B | 5.5B | BF16 |
| Google | Gemma-3-27B | 60.2 | 27B | 27B | FP8 |
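To make the active vs. total gap concrete, here is a minimal sketch that computes the share of each model's weights that is active per token, using only the parameter counts from the table above. The names `PARAMS_B` and `active_fraction` are ours, chosen for illustration.

```python
# (total, active) parameter counts in billions, copied from the leaderboard table.
PARAMS_B = {
    "GLM-5.2": (754, 40),
    "DeepSeek-V4-Pro": (1600, 49),
    "Qwen3.5-397B": (397, 17),
    "Nemotron-3-Ultra": (550, 55),
    "Llama-4-Maverick": (400, 17),
    "GPT-OSS-120B": (120, 5.5),
    "Gemma-3-27B": (27, 27),
}


def active_fraction(total_b, active_b):
    """Share of the model's weights used for each token."""
    return active_b / total_b


for model, (total_b, active_b) in PARAMS_B.items():
    print(f"{model:18s} {active_fraction(total_b, active_b):6.1%}")
```

Every MoE model in the field activates a tenth of its weights or less per token. Gemma-3-27B, whose active and total counts are the same, uses all of them.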
2. The split that actually matters to lawyers
The blended score hides the most important finding. When you separate the binary tasks from the open-ended ones, a pattern jumps out:
| Model | Closed (250) exact match | Open (250) LLM judge | Gap |
|---|---|---|---|
| GLM-5.2 | 85.6% | 59.2% | 26.4 |
| DeepSeek-V4-Pro | 80.4% | 62.4% | 18.0 |
| Qwen3.5-397B | 82.0% | 58.0% | 24.0 |
| Nemotron-3-Ultra | 80.8% | 57.6% | 23.2 |
| Llama-4-Maverick | 82.4% | 51.2% | 31.2 |
| GPT-OSS-120B | 82.4% | 50.4% | 32.0 |
| Gemma-3-27B | 82.0% | 38.4% | 43.6 |
Binary classification is nearly a solved problem. Every model lands between 80.4% and 85.6% on the closed set, a spread of just over five points. If your use case is "tag every clause in this contract" or "is this document responsive, yes/no," almost any of these models will do, and you should buy on price.
Open-ended reasoning is where models separate. The open set spans 38.4% to 62.4%, a 24-point spread. This is the work that actually resembles legal practice: synthesizing facts, applying a standard, drafting an argument. Two things stand out:
- DeepSeek-V4-Pro is the best open-ended reasoner (62.4%), even though it sits second on the blended leaderboard. If your work is heavy on drafting and analysis, it outperforms the nominal "winner."
- Gemma-3-27B collapses on open tasks (38.4%). It is the cheapest model to run by a wide margin and perfectly competent on binary tasks, but it is not a substitute for a frontier model on complex reasoning. Cheap has a ceiling, and at 27B parameters that ceiling arrives early.
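The blended score, the gap column, and the two spreads quoted above are all simple arithmetic on the closed and open results. A minimal sketch, using the figures from the split table (the function names are ours):

```python
# (closed %, open %) per model, copied from the split table.
SPLIT = {
    "GLM-5.2": (85.6, 59.2),
    "DeepSeek-V4-Pro": (80.4, 62.4),
    "Qwen3.5-397B": (82.0, 58.0),
    "Nemotron-3-Ultra": (80.8, 57.6),
    "Llama-4-Maverick": (82.4, 51.2),
    "GPT-OSS-120B": (82.4, 50.4),
    "Gemma-3-27B": (82.0, 38.4),
}


def blended(closed, open_):
    """Headline score: the plain average of the closed and open results."""
    return round((closed + open_) / 2, 1)


def gap(closed, open_):
    """How many points a model loses moving from closed to open tasks."""
    return round(closed - open_, 1)


def spread(values):
    """Distance between the best and worst model on one task type."""
    values = list(values)
    return round(max(values) - min(values), 1)


closed_spread = spread(c for c, _ in SPLIT.values())  # 5.2 points
open_spread = spread(o for _, o in SPLIT.values())    # 24.0 points

for model, (closed, open_) in SPLIT.items():
    print(f"{model:18s} blended={blended(closed, open_):5.1f} gap={gap(closed, open_):5.1f}")
print(f"closed spread={closed_spread}, open spread={open_spread}")
```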
3. Now add the hardware bill
Here is where the leaderboard inverts. Serving these models takes GPUs, specifically enough NVIDIA B200s to hold the weights plus a working KV cache. We price each B200 at $8.50/hour, a representative on-demand cloud rate. Your negotiated or reserved rate will be lower, and spot can be lower still. This is an estimate for planning, not a quote.
| Lab | Model | Score | B200s | Est. $/hr |
|---|---|---|---|---|
| Z.ai | GLM-5.2 | 72.4 | 8 | $68.00 |
| DeepSeek | DeepSeek-V4-Pro | 71.4 | 5 | $42.50 |
| Qwen | Qwen3.5-397B | 70.0 | 3 | $25.50 |
| NVIDIA | Nemotron-3-Ultra | 69.2 | 6 | $51.00 |
| Meta | Llama-4-Maverick | 66.8 | 3 | $25.50 |
| OpenAI | GPT-OSS-120B | 66.4 | 2 | $17.00 |
| Google | Gemma-3-27B | 60.2 | 1 | $8.50 |
GLM-5.2 tops the leaderboard at 72.4, and costs $68/hr on eight B200s. Qwen3.5-397B scores 70.0, just 2.4 points behind, on three B200s at $25.50/hr. Running 24/7, that gap is roughly $596,000/year vs. $223,000/year: more than a third of a million dollars a year for those last two points. The top of the leaderboard is the bottom of the value chart.
A self-hosted model that serves your firm is not rented by the hour and switched off. It runs continuously. Here is what each model costs to keep online around the clock for a year:
| Model | Score | B200s | $/hr | Annual (24/7/365) |
|---|---|---|---|---|
| GLM-5.2 | 72.4 | 8 | $68.00 | $595,680 |
| DeepSeek-V4-Pro | 71.4 | 5 | $42.50 | $372,300 |
| Qwen3.5-397B | 70.0 | 3 | $25.50 | $223,380 |
| Nemotron-3-Ultra | 69.2 | 6 | $51.00 | $446,760 |
| Llama-4-Maverick | 66.8 | 3 | $25.50 | $223,380 |
| GPT-OSS-120B | 66.4 | 2 | $17.00 | $148,920 |
| Gemma-3-27B | 60.2 | 1 | $8.50 | $74,460 |
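The hourly and annual columns come straight from the GPU count and the $8.50 planning rate. Here is a small sketch you can rerun with your own negotiated rate; the helper names are ours, and the rate is the planning estimate used throughout, not a quote.

```python
B200_RATE = 8.50           # planning estimate in $/GPU-hour; swap in your own rate
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of always-on serving

# Number of B200s needed to serve each model, from the hardware table.
B200S = {
    "GLM-5.2": 8,
    "DeepSeek-V4-Pro": 5,
    "Qwen3.5-397B": 3,
    "Nemotron-3-Ultra": 6,
    "Llama-4-Maverick": 3,
    "GPT-OSS-120B": 2,
    "Gemma-3-27B": 1,
}


def hourly_cost(gpus, rate=B200_RATE):
    """Dollars per hour to keep the model online."""
    return gpus * rate


def annual_cost(gpus, rate=B200_RATE):
    """Dollars per year when serving around the clock."""
    return hourly_cost(gpus, rate) * HOURS_PER_YEAR


for model, gpus in B200S.items():
    print(f"{model:18s} ${hourly_cost(gpus):6.2f}/hr  ${annual_cost(gpus):>10,.0f}/yr")

# The yearly premium for GLM-5.2 over Qwen3.5-397B.
premium = annual_cost(B200S["GLM-5.2"]) - annual_cost(B200S["Qwen3.5-397B"])
print(f"GLM-5.2 premium over Qwen3.5-397B: ${premium:,.0f}/yr")
```

Passing a lower `rate` (a reserved or spot price, say) shrinks every figure proportionally, so the ranking between models does not change.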
4. Score per dollar: the real ranking for a firm
Divide score by cost and the picture clarifies. Two columns: cost per score point (lower is better; what each point of capability costs you per hour) and score per dollar/hour (higher is better; how much capability you rent per dollar).
| Model | Score | $/hr | Cost per point ($/hr, lower is better) | Score per $/hr (higher is better) |
|---|---|---|---|---|
| Gemma-3-27B | 60.2 | $8.50 | 0.14 | 7.08 |
| GPT-OSS-120B | 66.4 | $17.00 | 0.26 | 3.91 |
| Qwen3.5-397B | 70.0 | $25.50 | 0.36 | 2.75 |
| Llama-4-Maverick | 66.8 | $25.50 | 0.38 | 2.62 |
| DeepSeek-V4-Pro | 71.4 | $42.50 | 0.60 | 1.68 |
| Nemotron-3-Ultra | 69.2 | $51.00 | 0.74 | 1.36 |
| GLM-5.2 | 72.4 | $68.00 | 0.94 | 1.06 |
The efficiency ranking is almost the exact reverse of the score ranking. The reason is the MoE active-parameter gap: Qwen3.5-397B activates only 17B parameters per token, so it serves on 3 B200s while scoring 70, within striking distance of GLM's 72.4 at a third of the hardware. Qwen is the standout all-rounder: 97% of the top score, 2.6× the value, and the third-best open-reasoning score (58.0%) in the field.
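Both value columns are one division each. A minimal sketch that recomputes them from the score and hourly cost and then sorts by value (the function names are ours):

```python
# (blended score, $/hr) per model, from the tables above.
ROWS = {
    "GLM-5.2": (72.4, 68.00),
    "DeepSeek-V4-Pro": (71.4, 42.50),
    "Qwen3.5-397B": (70.0, 25.50),
    "Nemotron-3-Ultra": (69.2, 51.00),
    "Llama-4-Maverick": (66.8, 25.50),
    "GPT-OSS-120B": (66.4, 17.00),
    "Gemma-3-27B": (60.2, 8.50),
}


def cost_per_point(score, dollars_per_hour):
    """Hourly dollars paid for each point of score (lower is better)."""
    return dollars_per_hour / score


def score_per_dollar_hour(score, dollars_per_hour):
    """Points of score rented per hourly dollar (higher is better)."""
    return score / dollars_per_hour


# Best value first.
ranked = sorted(ROWS, key=lambda m: score_per_dollar_hour(*ROWS[m]), reverse=True)

for model in ranked:
    score, dph = ROWS[model]
    print(f"{model:18s} {cost_per_point(score, dph):.2f}  {score_per_dollar_hour(score, dph):.2f}")
```

Ranking on value alone would pick Gemma-3-27B, which is why the open-ended reasoning scores have to sit next to this table when you make the call.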
The one model to pick: Qwen3.5-397B
A firm does not want a fleet of models. It wants one model to standardize on, fine-tune on its own work product, and run in-house. On this data, that model is Qwen3.5-397B, and it is not especially close.
Qwen scores 70.0 overall, within 2.4 points of the top-ranked GLM-5.2, while running on just 3 B200s instead of 8. It is the third-best open-ended reasoner in the field (58.0%), so it holds up on the drafting and analysis work that actually separates these models, not just the binary classification that everything does well. And because it activates only 17B parameters per token, it is the cheapest of the genuinely capable models to both serve and fine-tune.
The alternatives only sharpen the case. GLM-5.2 and DeepSeek-V4-Pro score a hair higher but need roughly 1.7 to 2.7 times the hardware, every hour, forever. GPT-OSS-120B and Gemma-3-27B are cheaper to run but give up real ground on open-ended reasoning, exactly the capability you are fine-tuning to improve. Qwen sits at the point where capability, cost, and feasibility all line up.
5. Fine-tuning: what it costs to make one of these yours
Everything above is inference, the cost of running a stock model. The reason to self-host an open-weight model in the first place is usually to fine-tune it on your own work product so it writes in your voice, knows your templates, and reasons the way your practice does. (We covered the "why" and the first steps, building golden prompt→response pairs, in our fine-tuning guide.)
Here is the good news: the actual computing involved in fine-tuning is cheap, usually tens to a few hundred dollars of GPU time. The thing that decides whether fine-tuning costs you a dinner or a down payment is not the computing. It is how many GPUs you have to rent to do it, and that comes down to which method you pick.
Two methods, very different bills
There are a handful of ways to fine-tune, but for a firm it really comes down to two:
- LoRA (and its leaner cousin, QLoRA) leaves the original model frozen and trains a small "adapter" that sits on top of it. This is what almost everyone uses. It fits on a handful of GPUs (QLoRA can run on a single card), and it gets you the large majority of the quality. Here, the cheap computing number really is what you pay.
- Full fine-tuning rewrites every weight in the model. The quality ceiling is slightly higher, but you have to load the entire model into GPU memory at once. For these big models that means renting a small data center, and you pay for every card the whole time. The cluster, not the computing, is the bill.
(There is also DPO, an optional polishing pass you run after LoRA using "this answer is better than that one" pairs. Budget roughly 50% on top of a LoRA run for it.)
Our numbers assume about 4,000 tokens per golden pair (legal answers run long), three passes over the data, and the same $8.50 per GPU-hour as before. They count raw GPU time only, so add a comfortable margin for data prep and trial runs: a real project usually lands at two to three times these figures.
The realistic path: LoRA
This is what a LoRA fine-tune costs in GPU time, by model, at three dataset sizes. Because LoRA runs on just a few cards, these figures are close to what you would actually pay.
| Model | 5,000 pairs | 10,000 pairs | 25,000 pairs |
|---|---|---|---|
| GLM-5.2 | $26 | $52 | $131 |
| DeepSeek-V4-Pro | $32 | $64 | $160 |
| Qwen3.5-397B | $11 | $22 | $56 |
| Nemotron-3-Ultra | $36 | $72 | $180 |
| Llama-4-Maverick | $11 | $22 | $56 |
| GPT-OSS-120B | $4 | $7 | $18 |
| Gemma-3-27B | $18 | $35 | $88 |
That is the whole story for the model we recommend: a LoRA fine-tune of Qwen3.5-397B on 10,000 of your own examples is about $22 of GPU time. Even 25,000 examples is under $60.
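To turn a raw GPU figure from the table into a planning budget, you can apply the two adjustments mentioned earlier: roughly 50% extra for an optional DPO pass, and a two to three times margin for data prep and trial runs. The sketch below stacks them in that order, which is our assumption (the margin is applied to the total including DPO); `lora_budget` is a hypothetical helper, not part of any library.

```python
def lora_budget(gpu_cost, with_dpo=False, margin=(2.0, 3.0)):
    """Return a (low, high) all-in estimate in dollars for one LoRA project.

    gpu_cost: raw GPU dollars from the LoRA table.
    with_dpo: add roughly 50% for an optional DPO polishing pass.
    margin:   low/high multipliers for data prep and trial runs.
    Assumption: the margin is applied after the DPO surcharge.
    """
    base = gpu_cost * (1.5 if with_dpo else 1.0)
    low_mult, high_mult = margin
    return base * low_mult, base * high_mult


# Qwen3.5-397B at 10,000 pairs is $22 of raw GPU time.
print(lora_budget(22))                 # (44.0, 66.0)
print(lora_budget(22, with_dpo=True))  # (66.0, 99.0)
```

Even with both adjustments, the recommended run stays under about $100 of GPU spend.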
It is also fast. The same runs finish in roughly 15 minutes (5,000 pairs), half an hour (10,000), and about an hour and a quarter (25,000) of computing on a handful of cards. Add loading, a checkpoint, and a quick evaluation, and a full LoRA fine-tune is comfortably a before-lunch job: you can train a new version in the morning and be testing it the same afternoon. Other models in the table land in the same range, from a few minutes for the smaller ones to a couple of hours for the largest.
Why not just full fine-tune?
There is a heavier method, full fine-tuning, that rewrites every weight in the model instead of training an adapter on top. Its quality ceiling is slightly higher, but it only pulls ahead when you are teaching a model genuinely new ways to reason, not when you are teaching it your firm's voice and templates. For legal work, that extra ceiling is headroom you will not use.
And it is expensive in a way the cheap computing number hides. To fully fine-tune one of these models you have to load the whole thing into GPU memory at once, which means renting a small cluster and paying for every card the entire run. A full fine-tune of Qwen3.5-397B means roughly a 36-GPU cluster at about $300 an hour; DeepSeek-V4-Pro is far worse, a 142-GPU cluster at about $1,200 an hour. That is roughly ten to forty times the cost of the LoRA run above, for a sliver of quality a law firm does not need. It also leaves you with a fragile second copy of a giant model to host and maintain, where LoRA hands you a small, swappable adapter and leaves the proven base model untouched. For a firm, there is almost no reason to take this path.
Here is the full picture. The cluster column is what makes this method hurt: it is sized to hold the model, not to compute the answer, so you rent and pay for every card the whole run, even though the actual training touches only the active parameters. Compare these totals to the LoRA table above.
| Model | Cluster (B200s) | $/hr | 5,000 pairs | 10,000 pairs | 25,000 pairs |
|---|---|---|---|---|---|
| GLM-5.2 | 67 | $569.50 | $520 | $1,045 | $2,630 |
| DeepSeek-V4-Pro | 142 | $1,207.00 | $1,360 | $2,725 | $6,815 |
| Qwen3.5-397B | 36 | $306.00 | $120 | $240 | $605 |
| Nemotron-3-Ultra | 49 | $416.50 | $530 | $1,060 | $2,645 |
| Llama-4-Maverick | 36 | $306.00 | $120 | $240 | $605 |
| GPT-OSS-120B | 11 | $93.50 | $13 | $23 | $59 |
| Gemma-3-27B | 3 | $25.50 | $54 | $105 | $264 |
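The article does not say how the cluster column was sized, so treat the following as a rough sketch under our own assumptions rather than the method behind the table. A common rule of thumb for full fine-tuning with mixed-precision Adam is about 16 bytes of GPU memory per parameter (weights, gradients, and optimizer state), before activations. Assuming about 180 GB of usable memory per B200, that rule lands within one GPU of every row above.

```python
import math

BYTES_PER_PARAM = 16     # assumption: mixed-precision Adam rule of thumb
USABLE_GB_PER_GPU = 180  # assumption: usable memory on one B200

# (total parameters in billions, cluster size from the table above)
TABLE = {
    "GLM-5.2": (754, 67),
    "DeepSeek-V4-Pro": (1600, 142),
    "Qwen3.5-397B": (397, 36),
    "Nemotron-3-Ultra": (550, 49),
    "Llama-4-Maverick": (400, 36),
    "GPT-OSS-120B": (120, 11),
    "Gemma-3-27B": (27, 3),
}


def full_ft_gpus(total_params_b):
    """Estimate the GPUs needed just to hold the training state in memory."""
    # Billions of parameters times bytes per parameter gives gigabytes.
    total_gb = total_params_b * BYTES_PER_PARAM
    return math.ceil(total_gb / USABLE_GB_PER_GPU)


for model, (params_b, table_gpus) in TABLE.items():
    print(f"{model:18s} estimate={full_ft_gpus(params_b):4d}  table={table_gpus:4d}")
```

Note that the estimate depends only on total parameters. Active parameters never enter it, which is why MoE models that are cheap to serve are still expensive to fully fine-tune.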
The contrast is the whole argument. A LoRA fine-tune of Qwen3.5-397B on 10,000 pairs is $22; the full fine-tune of the same model on the same data is $240, and DeepSeek-V4-Pro jumps from $64 to $2,725. You pay ten to forty times more to rewrite weights you were never going to improve. For teaching a model your firm's voice and templates, LoRA is not the budget option; it is the correct one.
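To see the multiplier for every model, not just the two named above, divide the full fine-tune cost by the LoRA cost at the same dataset size. A minimal sketch using the 10,000-pair columns of both tables (the names are ours):

```python
# (LoRA $, full fine-tune $) at 10,000 pairs, from the two tables above.
COST_10K = {
    "GLM-5.2": (52, 1045),
    "DeepSeek-V4-Pro": (64, 2725),
    "Qwen3.5-397B": (22, 240),
    "Nemotron-3-Ultra": (72, 1060),
    "Llama-4-Maverick": (22, 240),
    "GPT-OSS-120B": (7, 23),
    "Gemma-3-27B": (35, 105),
}


def full_vs_lora(lora_cost, full_cost):
    """How many times more the full fine-tune costs than the LoRA run."""
    return full_cost / lora_cost


for model, (lora_cost, full_cost) in COST_10K.items():
    print(f"{model:18s} {full_vs_lora(lora_cost, full_cost):5.1f}x")
```

The large MoE models pay roughly 11 to 43 times more. The two smallest models, GPT-OSS-120B and Gemma-3-27B, sit closer to 3 times, because their clusters are small to begin with.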
How much better will it actually get?
The natural next question is how many points all of this buys you. Honestly, that is the one number we cannot put in a table. The lift from fine-tuning depends on how far the base model already is from your kind of work, and, far more than on the raw count of pairs, on how good and how varied those pairs are. Anyone who promises you "plus six points at 10,000 examples" is guessing.
What we can say with confidence is the shape of it:
- The gains come early and then taper. Going from nothing to 5,000 good pairs moves the needle the most. 10,000 adds meaningfully more, and 25,000 more still, but each step up buys a bit less than the last. For most firms, the bulk of the benefit has arrived by around 10,000 pairs.
- Voice, format, and house style improve fastest. If your goal is a model that drafts in your firm's structure and citation conventions, even a few thousand examples close most of that gap. This is the part most firms care about most.
- Genuinely new reasoning is the slow, data-hungry part, and the place where bad or repetitive data can actually set you back. This is where larger, carefully built datasets earn their keep.
Because the runs are so cheap and so fast, you do not have to guess. The sensible approach is to start at 5,000 pairs, measure the lift on a held-out set of your own work, and only keep feeding it data while the curve is still climbing.
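One way to make "while the curve is still climbing" concrete is a simple stopping rule: after each run, compare the held-out score to the previous run and stop adding data once the gain drops below a threshold you choose. The sketch below is one possible rule, not a recommendation from our results. The function name and the `min_gain` default are hypothetical, and you would feed it your own measured scores.

```python
def should_add_more_data(history, min_gain=1.0):
    """Decide whether another, larger fine-tuning run is worth it.

    history:  list of (num_pairs, held_out_score) tuples, smallest run first.
    min_gain: smallest score improvement (in points) that still counts as
              "climbing"; a hypothetical default, tune it to your own eval.
    """
    if len(history) < 2:
        # With fewer than two measurements there is no curve yet.
        return True
    (_, previous_score), (_, latest_score) = history[-2], history[-1]
    return (latest_score - previous_score) >= min_gain
```

In practice you would record the stock model's held-out score as the first entry, append one entry per fine-tuning run, and call the function after each run.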
The takeaway
For almost every firm, the answer is LoRA on Qwen3.5-397B. You can train a genuinely firm-specific model on 10,000 of your own examples for the price of lunch, on hardware you can actually get. The expensive, valuable work is not renting the GPUs. It is building the golden pairs that make the model yours.
The bottom line
The headline score crowns GLM-5.2, but the leaderboard ignores the bill. Once you weigh capability against the hardware it takes to serve, Qwen3.5-397B is the model a firm should standardize on: near-top quality, strong open-ended reasoning, and a third of the top model's hardware. It is also one of the most affordable capable models to fine-tune, and when you do, the GPUs are cheap and LoRA keeps the job feasible on a handful of cards. The expensive, valuable work is not renting the hardware. It is building the golden pairs that make the model yours.
Scores from a 500-item LegalBench subset (250 closed / 250 open). Cost figures are planning estimates at $8.50/B200-hour and will vary with your provider, context length, and utilization. Fine-tuning figures are raw GPU-compute estimates; budget 2 to 3× for an all-in project.