Running a frontier coder on hardware you rent

On June 1, 2026, MiniMax released MiniMax M3, the first open-weight model to pair frontier coding with a 1M-token context and native multimodality. It tops the open-weight SWE-Bench Pro leaderboard at 59.0%.

AI Coding MLOps Cloud

June 10, 2026 · 7 min read · 10 sections · By Ahmed Abdullah

Introduction

On June 1, 2026, MiniMax released MiniMax M3, the first open-weight model to pair frontier coding with a 1M-token context and native multimodality. It tops the open-weight SWE-Bench Pro leaderboard at 59.0%. MiniMax has committed to publishing the weights on Hugging Face within roughly ten days of launch, so by mid-June you can download a model in the same coding tier as the hosted frontier and run it on a GPU you rent by the hour, with no per-token fees. This tutorial sets up MiniMax M3 with vLLM behind an OpenAI-compatible endpoint and points a coding agent at it.

What MiniMax M3 is, and why the architecture matters

M3 is a large open-weight model (community estimates put it in the 200-400B parameter range) built on a new attention mechanism MiniMax calls MSA, MiniMax Sparse Attention. MSA does block-level sparse selection on a GQA backbone, which is what lets the model hold a 1M-token context without the memory cost of dense attention at that length. For a coding workload that means you can put an entire repository in the prompt and still afford the forward pass.
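To get a feel for the dense-attention cost that paragraph refers to, here is a back-of-the-envelope sketch of KV cache size for a GQA model. The layer count, KV-head count, and head dimension are hypothetical placeholders, not M3's published configuration; swap in the real values from the model card once it exists.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one sequence needs under dense GQA attention."""
    # Every token stores one key vector and one value vector (hence the 2)
    # per KV head, per layer. bytes_per_value=2 corresponds to a 16-bit cache.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical configuration for illustration only, NOT M3's real one.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for tokens in (32_000, 256_000, 1_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:6.1f} GiB of KV cache per sequence")
```

Under these assumed numbers a single 1M-token sequence needs more than 200 GiB of cache before any weights are loaded, which is the kind of cost a sparse design has to avoid.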

The headline number deserves an honest reading: 59.0% on the open-weight SWE-Bench Pro split. That beats every other open-weight model and trails the closed frontier. You are not getting GPT-tier scores for free. You are getting close enough that for a lot of internal tooling the gap stops mattering, and you own the deployment.

Setup: vLLM and a place to put the weights

vLLM is the serving layer. It exposes any supported Hugging Face model as an OpenAI-compatible HTTP API, which means your existing OpenAI client code works against it once you change the base URL.

code

# A clean environment with a recent vLLM (M3's MSA support landed in vLLM 0.9+)
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade "vllm>=0.9.2" "openai>=1.40"
# Confirm GPUs are visible (M3 needs multi-GPU; a single 80GB card will not hold it)
nvidia-smi --query-gpu=name,memory.total --format=csv

A model this size does not fit on one card. Plan for a multi-GPU box (for example, 4x or 8x H100/H200) on whatever cloud you rent from. The exact VRAM depends on the final parameter count and your quantization choice, which you will confirm against the model card once weights are published.
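Until the model card lands, a rough sizing sketch helps bracket the rental decision. It covers the weights only, using the community's 200-400B estimate from earlier. The 60% share of each card given to weights and the rounding up to a power of two (a common choice for tensor parallelism) are assumptions, not vLLM requirements.

```python
import math

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gib(params_billion: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30


def min_gpus(params_billion: float, dtype: str,
             gpu_gib: float = 80.0, weight_fraction: float = 0.6) -> int:
    """Smallest power-of-two GPU count whose weight budget holds the model.

    weight_fraction is an assumed share of each card reserved for weights;
    the remainder is left for KV cache and activations.
    """
    per_gpu = gpu_gib * weight_fraction
    needed = math.ceil(weight_gib(params_billion, dtype) / per_gpu)
    return 1 << (needed - 1).bit_length()  # round up to a power of two


for params in (200, 400):
    for dtype in ("bf16", "fp8"):
        print(f"{params}B {dtype}: {weight_gib(params, dtype):6.1f} GiB weights, "
              f"plan for {min_gpus(params, dtype)} x 80 GB GPUs")
```

Treat the output as a planning range. The real parameter count, the KV cache you need, and the cards your provider offers will all move the answer.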

Serving the model

Once MiniMax publishes the weights (the Hugging Face id is minimax/minimax-m3), serving is one command. Set --tensor-parallel-size to match your GPU count.

code

vllm serve minimax/minimax-m3 \
--tensor-parallel-size 8 \
--max-model-len 1000000 \
--served-model-name minimax-m3 \
--host 0.0.0.0 --port 8000

vLLM downloads the weights from Hugging Face on first run, loads them across the eight GPUs, and starts an OpenAI-compatible server on port 8000. --max-model-len 1000000 is what unlocks the full 1M-token window; lower it if your GPU memory cannot hold the KV cache for the whole context.
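Loading a model this large takes a while, so scripts that depend on the server should wait for it instead of failing on the first refused connection. The sketch below polls the OpenAI-compatible /v1/models route until the served name appears. The helper names, attempt count, and delay are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.request


def served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids the server lists at /v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as r:
        payload = json.load(r)
    return [m["id"] for m in payload["data"]]


def wait_until_ready(base_url: str, model: str, attempts: int = 60,
                     delay: float = 10.0, fetch=served_models,
                     sleep=time.sleep) -> bool:
    """Poll until `model` is listed, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if model in fetch(base_url):
                return True
        except OSError:
            # Connection refused or timed out: the server is still loading.
            # (urllib.error.URLError is a subclass of OSError.)
            pass
        sleep(delay)
    return False


# Example call (not executed here):
# wait_until_ready("http://localhost:8000", "minimax-m3")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live server.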

Talking to it like it is OpenAI

Because vLLM speaks the OpenAI protocol, you point the standard openai client at your own URL. No vendor SDK, no lock-in. This same code runs against MiniMax's hosted API today (swap the base_url and key) and against your self-hosted endpoint the day the weights land.

python

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")
resp = client.chat.completions.create(
    model="minimax-m3",
    messages=[
        {"role": "system", "content": "You are a senior Python engineer. Reply with code only."},
        {"role": "user", "content": "Write a function that retries an HTTP GET with exponential backoff."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

Feeding it a whole repository

The 1M-token context is the reason to bother self-hosting instead of chunking against a hosted API and paying for every token twice. Here is a minimal "explain this codebase" call that loads real files into a single prompt.

python

from pathlib import Path
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
def load_repo(root: str, exts=(".py",), limit_files=400) -> str:
    """Concatenate up to limit_files matching source files into one prompt string."""
    # Filter by extension first so the limit counts source files, not every path.
    paths = [p for p in sorted(Path(root).rglob("*")) if p.is_file() and p.suffix in exts]
    chunks = []
    for path in paths[:limit_files]:
        chunks.append(f"### FILE: {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(chunks)

repo = load_repo("./my-service")
resp = client.chat.completions.create(
    model="minimax-m3",
    messages=[
        {"role": "system", "content": "You are reviewing a codebase. Cite file paths in your answer."},
        {"role": "user", "content": f"{repo}\n\nWhere is database access not wrapped in a retry? List file an
    ],
    temperature=0.1,
)
print(resp.choices[0].message.content)

The model reads the files as one context and answers across them. No vector store, no retrieval step, no chunk-boundary bugs. For a 400-file service this is a few hundred thousand tokens, comfortably inside the window.
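Before sending a whole repository, it is worth checking that it fits. The sketch below estimates the token count from character length. The four-characters-per-token ratio is a rough heuristic and an assumption, not M3's tokenizer, so leave headroom or count with the real tokenizer once it is published. The function names and the output reserve are illustrative.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic; an assumption, not M3's tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(root: str, exts=(".py",), context_limit: int = 1_000_000,
                    reserve_for_output: int = 8_000) -> tuple[int, bool]:
    """Return (estimated prompt tokens, whether prompt plus output reserve fits)."""
    total = sum(
        estimate_tokens(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
    return total, total + reserve_for_output <= context_limit


# Example call (not executed here):
# tokens, ok = fits_in_context("./my-service", context_limit=256_000)
```

If the estimate lands near the limit, trim generated files, vendored dependencies, and fixtures first. They tend to cost the most tokens for the least signal.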

Fitting it on fewer GPUs

A model this size is expensive to host at full precision. Quantization trades a little accuracy for a lot of memory. vLLM serves M3 in fp8 with one flag, storing weights at 8-bit instead of 16-bit.

code

vllm serve minimax/minimax-m3 \
--tensor-parallel-size 4 \
--quantization fp8 \
--max-model-len 256000 \
--served-model-name minimax-m3 \
--port 8000

That can be the difference between a 4-GPU box and an 8-GPU one, which is a real line on the bill. The cost is some accuracy, so measure fp8 against full precision on your own eval before you commit to it. Do not assume the quantized model is good enough because the full one was.

Memory is not only the weights. The KV cache for a 1M-token context is large on its own, and it grows with every concurrent request you serve. Dropping --max-model-len to 256000 in the command above is the other half of the trade: most coding tasks never need the full million tokens, and a shorter ceiling frees GPU memory for more simultaneous users. Set the context length to the longest prompt you actually send, not the largest the model can take.
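One way to turn "the longest prompt you actually send" into a number is to derive the ceiling from logged prompt sizes. This is a minimal sketch. The output budget and the rounding granularity are assumptions, and the sample token counts are made up for illustration.

```python
def pick_max_model_len(prompt_token_counts: list[int],
                       max_output_tokens: int = 8_192,
                       hard_limit: int = 1_000_000,
                       round_to: int = 1_024) -> int:
    """Suggest a --max-model-len from observed prompt sizes.

    Takes the longest observed prompt, adds room for the response, rounds up
    to a multiple of round_to, and caps at the model's advertised window.
    """
    needed = max(prompt_token_counts) + max_output_tokens
    rounded = -(-needed // round_to) * round_to  # ceiling division
    return min(rounded, hard_limit)


# Made-up sample of logged prompt sizes, in tokens.
observed = [12_000, 180_000, 95_000]
print(pick_max_model_len(observed))
```

Recompute this as usage changes. A ceiling set from last month's prompts will reject next month's larger repository.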

Streaming, because code is read as it arrives

A coding model that makes you wait for the entire response before showing a single character is painful to work with. vLLM streams tokens over the same OpenAI protocol. Set stream=True and read the deltas as they land.

python

stream = client.chat.completions.create(
    model="minimax-m3",
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: total = []\nfor x in
    stream=True,
    temperature=0.2,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
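In practice you usually want the full text as well as the live echo, for example to write it to a file afterwards. Here is a small sketch of a helper that does both. `collect_stream` is a hypothetical name, and the fake chunks only mimic the shape of the client's streaming objects so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream, echo: bool = True) -> str:
    """Print deltas as they arrive and return the assembled response."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if echo:
                print(delta, end="", flush=True)
    return "".join(parts)


def _fake_chunk(text):
    """Stand-in with the same attribute shape as a streamed chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


fake_stream = [_fake_chunk("def "), _fake_chunk(None),
               SimpleNamespace(choices=[]), _fake_chunk("f(): ...")]
text = collect_stream(fake_stream, echo=False)
```

Against a live endpoint you would pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `fake_stream`.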

A minimal coding loop

The reason to run a frontier coder instead of chatting with one is to let it act on a real file. Here is a tight read-propose-write loop: load a file, ask M3 for the patched version, write it back. No framework, just the OpenAI-compatible client against your own endpoint.

python

from pathlib import Path
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
def edit_file(path: str, instruction: str) -> None:
    src = Path(path).read_text()
    resp = client.chat.completions.create(
        model="minimax-m3",
        messages=[
            {"role": "system", "content": "Return ONLY the full updated file. No prose, no code fences."},
            {"role": "user", "content": f"Instruction: {instruction}\n\n--- {path} ---\n{src}"},
        ],
        temperature=0.1,
    )
    Path(path).write_text(resp.choices[0].message.content)
    print(f"Patched {path}")
edit_file("service/retry.py", "Add a max_retries argument defaulting to 3 and respect it.")

This is the skeleton every coding agent expands on: read context, ask the model for a change, apply it, then verify by running the tests and diffing the result, looping if the change is wrong. The verify step is the part that matters and the part a leaderboard score does not capture. A model that scores 59% is wrong on more than four cases in ten, and the loop around it is the only thing that catches them.
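The verify step can be sketched as: write the proposed file, run a test command, and restore the original if the command fails. `apply_and_verify` is a hypothetical helper, and the test command is whatever your project uses. It assumes the command exits non-zero on failure, as pytest and most test runners do.

```python
import subprocess
from pathlib import Path


def apply_and_verify(path: str, new_source: str, test_cmd: list[str],
                     timeout: int = 300) -> bool:
    """Write new_source to path, run test_cmd, and roll back on failure."""
    target = Path(path)
    original = target.read_text()
    target.write_text(new_source)
    try:
        result = subprocess.run(test_cmd, capture_output=True, text=True,
                                timeout=timeout)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False  # a hung test run counts as a failed change
    if not ok:
        target.write_text(original)  # restore the last known-good file
    return ok


# Example call (not executed here):
# apply_and_verify("service/retry.py", proposed_source, ["pytest", "-q"])
```

A fuller agent would feed the captured test output back to the model and retry a bounded number of times. This sketch stops at accept or roll back.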

When self-hosting M3 is right, and when it is a trap

Self-host M3 when you have steady, high-volume coding workloads (internal agents, CI review bots, bulk refactors) where per-token API fees would dwarf a fixed GPU bill, or when the code cannot leave your network. At that volume the math flips fast: a rented 8-GPU box runs a flat hourly rate while a hosted frontier bills every one of your millions of tokens.
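The flip point is simple arithmetic. The sketch below computes the hourly token volume at which a flat GPU rate matches a per-token price. Both prices are placeholders chosen for round numbers, not quotes from any provider, so plug in your own.

```python
def breakeven_tokens_per_hour(gpu_box_usd_per_hour: float,
                              api_usd_per_million_tokens: float) -> float:
    """Tokens per hour at which the hosted API costs as much as the box."""
    return gpu_box_usd_per_hour / api_usd_per_million_tokens * 1_000_000


def cheaper_option(tokens_per_hour: float, gpu_box_usd_per_hour: float,
                   api_usd_per_million_tokens: float) -> str:
    """Compare the hourly API bill with the flat box rate."""
    api_cost = tokens_per_hour / 1_000_000 * api_usd_per_million_tokens
    return "self-host" if api_cost > gpu_box_usd_per_hour else "hosted API"


# Placeholder prices for illustration only.
GPU_BOX_USD_PER_HOUR = 24.0
API_USD_PER_MILLION_TOKENS = 3.0

print(breakeven_tokens_per_hour(GPU_BOX_USD_PER_HOUR, API_USD_PER_MILLION_TOKENS))
print(cheaper_option(10_000_000, GPU_BOX_USD_PER_HOUR, API_USD_PER_MILLION_TOKENS))
```

The comparison only holds if the box can actually sustain the break-even throughput and stays busy. Measure tokens per hour on your own hardware before trusting the result.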

Do not self-host for spiky or low-volume use. A model that costs you the same per hour whether you send one request or ten thousand is pure waste when your traffic is bursty, and the hosted MiniMax API (or the closed frontier) will be cheaper and less work. (The most expensive GPU is the one sitting idle for ninety percent of the day.) And do not self-host until you have actually run your own eval set against M3, because a leaderboard score is someone else's benchmark, not your workload.

The open-weight frontier did not use to be a real option for serious coding. On June 1 it became one. The interesting question is no longer whether you can run a capable coder yourself. It is whether your volume makes owning the box cheaper than renting the tokens, and now you have the numbers to find out.
