Running a frontier coder on hardware you rent
On June 1, 2026, MiniMax released MiniMax M3, the first open-weight model to pair frontier coding with a 1M-token context and native multimodality. It tops the open-weight SWE-Bench Pro leaderboard at 59.0%.

Introduction
On June 1, 2026, MiniMax released MiniMax M3, the first open-weight model to pair frontier coding with a 1M-token context and native multimodality. It tops the open-weight SWE-Bench Pro leaderboard at 59.0%. MiniMax has committed to publishing the weights on Hugging Face within roughly ten days of launch, so by mid-June you can download a model in the same coding tier as the hosted frontier and run it on a GPU you rent by the hour, paying zero per-token fees. This tutorial sets up MiniMax M3 with vLLM behind an OpenAI-compatible endpoint and points a coding agent at it.
What MiniMax M3 is, and why the architecture matters
M3 is a large open-weight model (community estimates put it in the 200-400B parameter range) built on a new attention mechanism MiniMax calls MSA, MiniMax Sparse Attention. MSA does block-level sparse selection on a GQA (grouped-query attention) backbone, which is what lets the model hold a 1M-token context without the memory cost of dense attention at that length. For a coding workload that means you can put an entire repository in the prompt and still afford the forward pass.
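To make "block-level sparse selection" concrete, here is a toy sketch in plain Python. It is not MiniMax's MSA implementation, whose internals are not described here. It only illustrates the general idea of scoring whole blocks of keys and attending to the best few. The function name `select_blocks`, the mean-key scoring rule, and the tiny vectors are all assumptions made for illustration.

```python
def select_blocks(query, keys, block_size, top_k):
    """Toy block selection: score each block of keys by the dot product
    between the query and the block's mean key, keep the top_k blocks.

    Hypothetical illustration only, not MiniMax's actual MSA algorithm.
    Returns the indices of the selected blocks in ascending order.
    """
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scored = []
    for idx, block in enumerate(blocks):
        # Mean key of the block, computed per dimension
        mean_key = [sum(dim) / len(block) for dim in zip(*block)]
        score = sum(q * k for q, k in zip(query, mean_key))
        scored.append((score, idx))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(idx for _, idx in best)


# Eight 2-d keys grouped into four blocks of two keys each
keys = [
    [0.0, 1.0], [0.0, 1.0],    # block 0: orthogonal to the query
    [1.0, 0.0], [1.0, 0.0],    # block 1: aligned with the query
    [0.5, 0.0], [0.5, 0.0],    # block 2: weakly aligned
    [-1.0, 0.0], [-1.0, 0.0],  # block 3: opposed to the query
]
selected = select_blocks(query=[1.0, 0.0], keys=keys, block_size=2, top_k=2)
print("attend only to blocks:", selected)
```

Attention is then computed over the keys in the selected blocks only, so the work per query scales with `top_k * block_size` rather than with the full context length.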
The headline number is the one to be honest about: 59.0% on the open-weight SWE-Bench Pro split. That beats every other open-weight model and trails the closed frontier. You are not getting GPT-tier scores for free. You are getting close enough that for a lot of internal tooling the gap stops mattering, and you own the deployment.
Setup: vLLM and a place to put the weights
vLLM is the serving layer. It exposes any supported Hugging Face model as an OpenAI-compatible HTTP API, which means your existing OpenAI client code works against it unchanged.
# A clean environment with a recent vLLM (M3's MSA support landed in vLLM 0.9+)
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade "vllm>=0.9.2" "openai>=1.40"
# Confirm GPUs are visible (M3 needs multi-GPU; a single 80GB card will not hold it)
nvidia-smi --query-gpu=name,memory.total --format=csv
A model this size does not fit on one card. Plan for a multi-GPU box (for example, 4x or 8x H100/H200) on whatever cloud you rent from. The exact VRAM depends on the final parameter count and your quantization choice, which you will confirm against the model card once weights are published.
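Until the model card is out, a back-of-the-envelope estimate helps with sizing the box. The sketch below covers the weights only, at the 200-400B range quoted above. The function names, the 80 GB card size, and the 0.9 usable fraction are assumptions for illustration, and the result ignores the KV cache and activations, so treat it as a lower bound.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float, bytes_per_param: float,
                         gpu_gb: float = 80.0, usable_fraction: float = 0.9) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations.

    gpu_gb and usable_fraction are assumed values, not measured ones.
    """
    usable_gb = gpu_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / usable_gb)


# Hypothetical sizes spanning the community estimate; 2 bytes = 16-bit, 1 byte = fp8
for size in (200, 300, 400):
    for label, nbytes in (("16-bit", 2), ("fp8", 1)):
        print(f"{size}B {label}: {weight_memory_gb(size, nbytes):.0f} GB weights, "
              f"at least {min_gpus_for_weights(size, nbytes)} x 80GB GPUs")
```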
Serving the model
Once MiniMax publishes the weights (the Hugging Face id is minimax/minimax-m3), serving is one command. --tensor-parallel-size should match your GPU count.
vllm serve minimax/minimax-m3 \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --served-model-name minimax-m3 \
  --host 0.0.0.0 --port 8000
vLLM downloads the weights from Hugging Face on first run, loads them across the eight GPUs, and starts an OpenAI-compatible server on port 8000. --max-model-len 1000000 is what unlocks the full 1M-token window; drop it lower if your GPU memory cannot hold the KV cache for the whole context.
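Loading a model this size takes a while, so it is worth checking that the server is actually up before pointing an agent at it. The sketch below polls the OpenAI-style /v1/models route using only the standard library. The helper names (`fetch_json`, `served_models`, `is_ready`) are hypothetical, and the base URL assumes the serve command above.

```python
import json
import urllib.request


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def served_models(base_url: str, fetch=fetch_json) -> list:
    """Return the model ids listed by an OpenAI-compatible /models route."""
    payload = fetch(base_url.rstrip("/") + "/models")
    return [model["id"] for model in payload.get("data", [])]


def is_ready(base_url: str, model_name: str, fetch=fetch_json) -> bool:
    """True once the server answers and lists the expected model name."""
    try:
        return model_name in served_models(base_url, fetch)
    except OSError:
        # Connection refused, timeout, or HTTP error: server not up yet
        return False


# Example (requires the server from the command above to be running):
#   is_ready("http://localhost:8000/v1", "minimax-m3")
```

The `fetch` parameter exists only so the check can be exercised without a live server; in normal use the default is enough.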
Talking to it like it is OpenAI
Because vLLM speaks the OpenAI protocol, you point the standard openai client at your own URL. No vendor SDK, no lock-in. This same code runs against MiniMax's hosted API today (swap the base_url and key) and against your self-hosted endpoint the day the weights land.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")
resp = client.chat.completions.create(
    model="minimax-m3",
    messages=[
        {"role": "system", "content": "You are a senior Python engineer. Reply with code only."},
        {"role": "user", "content": "Write a function that retries an HTTP GET with exponential backoff."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
Feeding it a whole repository
The 1M-token context is the reason to bother self-hosting instead of chunking against a hosted API and paying for every token twice. Here is a minimal "explain this codebase" call that loads real files into a single prompt.
from pathlib import Path
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
def load_repo(root: str, exts=(".py",), limit_files=400) -> str:
    # Filter by extension first so limit_files counts matching files, not every path
    paths = [p for p in sorted(Path(root).rglob("*")) if p.is_file() and p.suffix in exts]
    chunks = [f"### FILE: {p}\n{p.read_text(errors='ignore')}" for p in paths[:limit_files]]
    return "\n\n".join(chunks)
repo = load_repo("./my-service")
resp = client.chat.completions.create(
    model="minimax-m3",
    messages=[
        {"role": "system", "content": "You are reviewing a codebase. Cite file paths in your answer."},
        {"role": "user", "content": f"{repo}\n\nWhere is database access not wrapped in a retry? List file an
    ],
    temperature=0.1,
)
print(resp.choices[0].message.content)
The model reads the files as one context and answers across them. No vector store, no retrieval step, no chunk boundary bugs. For a 400-file service this is a few hundred thousand tokens, comfortably inside the window.
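Before sending a whole repository, it helps to check that the prompt will fit under the --max-model-len you served with. The sketch below uses a rough rule of about four characters per token, which is an assumption and not M3's tokenizer, so leave headroom or count with the real tokenizer once the weights are published. The function names and the 8000-token output reserve are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count.

    The 4-characters-per-token ratio is a heuristic, not M3's tokenizer.
    """
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt: str, max_model_len: int,
                 reserve_for_output: int = 8000,
                 chars_per_token: float = 4.0) -> bool:
    """True if the estimated prompt plus reserved output fits the window."""
    needed = estimate_tokens(prompt, chars_per_token) + reserve_for_output
    return needed <= max_model_len


# Stand-in for the string returned by load_repo(); about 1.2M characters
sample_prompt = "x = 1\n" * 200_000
print("estimated tokens:", estimate_tokens(sample_prompt))
print("fits a 1,000,000 window:", fits_context(sample_prompt, 1_000_000))
print("fits a 256,000 window:", fits_context(sample_prompt, 256_000))
```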
Fitting it on fewer GPUs
A model this size is expensive to host at full precision. Quantization trades a little accuracy for a lot of memory. vLLM serves M3 in fp8 with one flag, storing weights at 8-bit instead of 16-bit.
vllm serve minimax/minimax-m3 \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 256000 \
  --served-model-name minimax-m3 \
  --port 8000
That can be the difference between a 4-GPU box and an 8-GPU one, which is a real line on the bill. The cost is some accuracy, so measure fp8 against full precision on your own eval before you commit to it. Do not assume the quantized model is good enough because the full one was.
Memory is not only the weights. The KV cache for a 1M-token context is large on its own, and it grows with every concurrent request you serve. Dropping --max-model-len to 256000 in the command above is the other half of the trade: most coding tasks never need the full million tokens, and a shorter ceiling frees GPU memory for more simultaneous users. Set the context length to the longest prompt you actually send, not the largest the model can take.
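The KV cache trade can be put in numbers with the standard dense-GQA formula: two tensors (keys and values) per layer, per KV head, per token. M3's layer count, KV head count, and head dimension are not given here, so the values below are placeholders, and the source does not say whether MSA changes how much cache is stored. Treat this as an upper-bound sketch for one request.

```python
def kv_cache_gb(tokens: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for one sequence under dense GQA caching.

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 1e9


# Hypothetical architecture numbers, NOT M3's published configuration
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for context in (1_000_000, 256_000, 32_000):
    size = kv_cache_gb(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{context:>9} tokens: {size:7.2f} GB per concurrent request")
```

Multiply the per-request figure by the number of simultaneous long prompts you expect, and the case for a lower --max-model-len becomes easy to see.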
Streaming, because code is read as it arrives
A coding model that makes you wait for the entire response before showing a single character is painful to work with. vLLM streams tokens over the same OpenAI protocol. Set stream=True and read the deltas as they land.
stream = client.chat.completions.create(
    model="minimax-m3",
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: total = []\nfor x in
    stream=True,
    temperature=0.2,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
A minimal coding loop
The reason to run a frontier coder instead of chatting with one is to let it act on a real file. Here is a tight read-propose-write loop: load a file, ask M3 for the patched version, write it back. No framework, just the OpenAI-compatible client against your own endpoint.
from pathlib import Path
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
def edit_file(path: str, instruction: str) -> None:
    src = Path(path).read_text()
    resp = client.chat.completions.create(
        model="minimax-m3",
        messages=[
            {"role": "system", "content": "Return ONLY the full updated file. No prose, no code fences."},
            {"role": "user", "content": f"Instruction: {instruction}\n\n--- {path} ---\n{src}"},
        ],
        temperature=0.1,
    )
    Path(path).write_text(resp.choices[0].message.content)
    print(f"Patched {path}")
edit_file("service/retry.py", "Add a max_retries argument defaulting to 3 and respect it.")
This is the skeleton every coding agent expands on: read context, ask the model for a change, apply it, then verify by running the tests and diffing the result, looping if the change is wrong. The verify step is the part that matters and the part a leaderboard score does not capture. A model that scores 59% is wrong on more than four cases in ten, and the loop around it is the only thing that catches them.
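The verify step can be sketched in a few lines: write the proposed file, run a test command, and restore the original if the command fails. This is one minimal way to do it, not a full agent. The name `apply_with_verify` is hypothetical, and the test command is whatever your project already uses.

```python
import subprocess
import sys
from pathlib import Path


def apply_with_verify(path: str, new_text: str, test_cmd: list) -> bool:
    """Write new_text to path, run test_cmd, and roll back on failure.

    Returns True if the tests passed and the change was kept,
    False if the tests failed and the original file was restored.
    """
    target = Path(path)
    original = target.read_text()
    target.write_text(new_text)
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Tests failed: put the original contents back
        target.write_text(original)
        return False
    return True


# Example (assumes pytest is installed and `patched` holds the model's output):
#   kept = apply_with_verify("service/retry.py", patched, ["pytest", "-q"])
#   if not kept: ask the model again, including the test output in the prompt
```

Keeping the rollback inside the same function means a bad patch never sits on disk longer than one test run.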
When self-hosting M3 is right, and when it is a trap
Self-host M3 when you have steady, high-volume coding workloads (internal agents, CI review bots, bulk refactors) where per-token API fees would dwarf a fixed GPU bill, or when the code cannot leave your network. At that volume the math flips fast: a rented 8-GPU box runs a flat hourly rate while a hosted frontier bills every one of your millions of tokens.
Do not self-host for spiky or low-volume use. A model that costs you the same per hour whether you send one request or ten thousand is pure waste when your traffic is bursty, and the hosted MiniMax API (or the closed frontier) will be cheaper and less work. (The most expensive GPU is the one sitting idle at ninety percent of the day.) And do not self-host until you have actually run your own eval set against M3, because a leaderboard score is someone else's benchmark, not your workload.
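The break-even point is simple arithmetic: the hourly cost of the box divided by what a hosted API charges per token. The sketch below uses placeholder prices, since no real rates are quoted here. Both the $24 per hour box and the $2 per million tokens are hypothetical, as are the function names.

```python
def breakeven_tokens_per_hour(gpu_hourly_usd: float,
                              price_per_million_tokens_usd: float) -> float:
    """Tokens per hour at which a flat GPU bill equals per-token API fees."""
    return gpu_hourly_usd / price_per_million_tokens_usd * 1_000_000


def cheaper_to_self_host(tokens_per_hour: float, gpu_hourly_usd: float,
                         price_per_million_tokens_usd: float) -> bool:
    """True when average hourly volume exceeds the break-even point."""
    return tokens_per_hour > breakeven_tokens_per_hour(
        gpu_hourly_usd, price_per_million_tokens_usd)


# Hypothetical prices for illustration only
GPU_BOX_USD_PER_HOUR = 24.0   # e.g. an 8-GPU box at an assumed $3 per GPU-hour
API_USD_PER_MILLION = 2.0     # assumed blended hosted price per million tokens

point = breakeven_tokens_per_hour(GPU_BOX_USD_PER_HOUR, API_USD_PER_MILLION)
print(f"break-even: {point:,.0f} tokens per hour")
print("self-host at 20M tokens/hour:",
      cheaper_to_self_host(20_000_000, GPU_BOX_USD_PER_HOUR, API_USD_PER_MILLION))
print("self-host at 1M tokens/hour:",
      cheaper_to_self_host(1_000_000, GPU_BOX_USD_PER_HOUR, API_USD_PER_MILLION))
```

Use your average volume over the hours the box is rented, not your peak, and check that the box can actually serve the break-even throughput.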
The open-weight frontier did not use to be a real option for serious coding. On June 1 it became one. The interesting question is no longer whether you can run a capable coder yourself. It is whether your volume makes owning the box cheaper than renting the tokens, and now you have the numbers to find out.