Build a Multimodal Invoice-to-JSON Extractor with Gemini 3.5 Flash
At Google I/O 2026 Google shipped the Gemini 3.5 series, and the version that matters for document work is Gemini 3.5 Flash: image input plus enforced structured output in one call.

Introduction
At Google I/O 2026 Google shipped the Gemini 3.5 series, and the version that matters for document work is Gemini 3.5 Flash: image input plus enforced structured output in one call. That combination is what lets you turn a scanned invoice into validated JSON without an OCR step, a regex layer, or a per-vendor template. By the end of this tutorial you will have a function that takes an invoice image and returns a typed object you can write straight to a database. About 120 lines, no pipeline.
Document extraction has been a three-stage problem for a decade: OCR the pixels to text, parse the text with rules, then clean up what the rules missed. A multimodal model collapses all three. You hand it the image and the shape you want back. It reads the document the way a person does and fills in the shape.
What Gemini 3.5 Flash does that matters here
Two capabilities, used together. First, it accepts an image directly as input, so the model sees the invoice layout, not a flattened text dump where the total and the tax have lost their spatial relationship. Second, it supports a response schema: you pass a Pydantic model as the expected output, and the API constrains generation to match it. The model cannot return a stray field or a string where you asked for a number. That constraint is the difference between a demo and something you can run unattended.
What structured extraction actually means
Structured extraction is not "ask the model nicely for JSON." It is binding the model's output to a schema the API enforces. The schema is your contract. You declare that an invoice has a vendor name, an invoice number, a date, a list of line items, and a total. The model's job is to populate that shape from the pixels. Everything downstream, validation, storage, reconciliation, depends on the output matching the contract every time, not most of the time.
Set up the project
One SDK, one key.
python -m venv venv && source venv/bin/activate
pip install google-genai pydantic
export GEMINI_API_KEY="your-gemini-api-key"The google-genai package is Google's current SDK for the Gemini API. The key comes fromGoogle AI Studio.
Define the shape you want back
Start with the schema, not the prompt. The schema is the part you will reuse across every invoice, so it deserves the most thought. Model the document, including the nested line items.
from pydantic import BaseModel, Field
class LineItem(BaseModel):
description: str
quantity: float
unit_price: float
amount: float
class Invoice(BaseModel):
vendor_name: str
invoice_number: str
invoice_date: str = Field(description="ISO 8601, YYYY-MM-DD")
currency: str = Field(description="ISO 4217 code, e.g. USD")
line_items: list[LineItem]
subtotal: float
tax: float
total: floatThe Field(description=...) calls are not documentation. Gemini 3.5 Flash reads them as part of the schema, so "ISO 8601, YYYY-MM-DD" actively steers the date format. The description is a prompt that travels with the field
Read the image and make one call
Now the core build step. Load the invoice bytes, pass them alongside a short instruction, and set the response schema. The SDK returns a parsed Pydantic object directly
import os
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
def extract_invoice(image_path: str) -> Invoice:
with open(image_path, "rb") as f:
image_bytes = f.read()
mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
response = client.models.generate_content(
model="gemini-3.5-flash",
contents=[
types.Part.from_bytes(data=image_bytes, mime_type=mime),
"Extract the invoice into the provided schema. "
"Use the totals printed on the document, do not recompute them.",
],
config=types.GenerateContentConfig(
response_mime_type="application/json",
response_schema=Invoice,
),
)
return response.parsedThe instruction "do not recompute them" is doing real work. Left to its own judgment the model will sometimes sum the line items and report that, which disagrees with a printed total when the invoice has a rounding line or a manual discount. You want what the document says, not what the math implies.
Verify the extraction instead of trusting it
A schema guarantees the shape. It does not guarantee the numbers. The cheapest real check is arithmetic the model already has no incentive to fake: does subtotal plus tax land on total?
def validate_invoice(inv: Invoice, tolerance: float = 0.02) -> list[str]:
problems = []
expected = round(inv.subtotal + inv.tax, 2)
if abs(expected - inv.total) > tolerance:
problems.append(
f"subtotal {inv.subtotal} + tax {inv.tax} = {expected}, "
f"but total reads {inv.total}"
)
line_sum = round(sum(i.amount for i in inv.line_items), 2)
if abs(line_sum - inv.subtotal) > tolerance:
problems.append(
f"line items sum to {line_sum}, subtotal reads {inv.subtotal}"
)
return problemsAn invoice that fails this check is not necessarily wrong. It might have a fee the schema does not model. But it is the document a human should look at, which means your extractor just earned its keep by sorting the 95 clean invoices from the 5 that need eyes.
Process a folder and route the doubtful ones
The last build step turns one extraction into a batch with a review queue. Clean invoices flow through. Flagged ones land in a separate list a person works.
from pathlib import Path
def process_folder(folder: str):
clean, needs_review = [], []
for path in sorted(Path(folder).glob("*.[pj][pn]g")):
inv = extract_invoice(str(path))
problems = validate_invoice(inv)
record = {"file": path.name, "invoice": inv.model_dump()}
if problems:
record["problems"] = problems
needs_review.append(record)
else:
clean.append(record)
return clean, needs_reviewThe glob pattern *.[pj][pn]g catches .png and .jpg without a second loop. Everything clean is ready for your database; everything flagged is a short, honest worklist.
Gemini 3.5 Flash vs OCR-plus-templates
The traditional pipeline still has a place. The point is knowing where the line is.
| Dimension | Gemini 3.5 Flash | OCR + per-vendor templates |
|---|---|---|
| New vendor layout | Works with no change | Needs a new template |
| Setup cost | A schema | An OCR engine plus rules per format |
| Handles handwriting / stamps | Often, as context | Poorly |
| Cost per document | A model call | Near zero after setup |
| Auditability | Reasoning is opaque | Rules are inspectable |
| Best at | Varied, messy, long-tail layouts | High volume of one fixed format |
If every invoice you process comes from the same three vendors in the same format at high volume, templates are cheaper and more auditable. The model wins the moment the layouts vary, which for most businesses is immediately.
What this extractor does not do
It trusts the model's read of low-quality scans, and a smudged digit becomes a confident wrong number that passes the schema and sometimes even the arithmetic check. It has no concept of duplicate detection, so the same invoice submitted twice extracts twice. And the validation only catches math errors, not a vendor name read off the "remit to" address instead of the "from" block. The schema makes the output safe to store. It does not make it true. That gap is why the review queue exists.
When to use it
Reach for this when you receive documents in formats you do not control and cannot predict: supplier invoices, receipts, shipping manifests, intake forms from a long tail of senders. The multimodal-plus-schema approach removes the per-format engineering that made document extraction expensive, and the validation layer keeps a human in the loop exactly where the model is weakest. Do not point it at a single high-volume fixed format where a template is cheaper, and never let an unvalidated extraction post to a ledger. Used as a first pass with arithmetic as the gate, it turns a manual data-entry task into a short review of the few documents that actually need a person
Full working example
import os
from pathlib import Path
from google import genai
from google.genai import types
from pydantic import BaseModel, Field
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
class LineItem(BaseModel):
description: str
quantity: float
unit_price: float
amount: float
class Invoice(BaseModel):
vendor_name: str
invoice_number: str
invoice_date: str = Field(description="ISO 8601, YYYY-MM-DD")
currency: str = Field(description="ISO 4217 code, e.g. USD")
line_items: list[LineItem]
subtotal: float
tax: float
total: float
def extract_invoice(image_path: str) -> Invoice:
with open(image_path, "rb") as f:
image_bytes = f.read()
mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
response = client.models.generate_content(
model="gemini-3.5-flash",
contents=[
types.Part.from_bytes(data=image_bytes, mime_type=mime),
"Extract the invoice into the provided schema. "
"Use the totals printed on the document, do not recompute them.",
],
config=types.GenerateContentConfig(
response_mime_type="application/json",
response_schema=Invoice,
),
)
return response.parsed
def validate_invoice(inv: Invoice, tol: float = 0.02) -> list[str]:
out = []
if abs(round(inv.subtotal + inv.tax, 2) - inv.total) > tol:
out.append(f"subtotal + tax != total ({inv.total})")
if abs(round(sum(i.amount for i in inv.line_items), 2) - inv.subtotal) > tol:
out.append(f"line items != subtotal ({inv.subtotal})")
return out
def process_folder(folder: str):
clean, needs_review = [], []
for path in sorted(Path(folder).glob("*.[pj][pn]g")):
inv = extract_invoice(str(path))
problems = validate_invoice(inv)
rec = {"file": path.name, "invoice": inv.model_dump()}
(needs_review if problems else clean).append(
{**rec, "problems": problems} if problems else rec
)
return clean, needs_review
if __name__ == "__main__":
clean, review = process_folder("./invoices")
print(f"{len(clean)} clean, {len(review)} need review")Drop a folder of invoice images next to the script, run it, and you get a clean set ready for storage and a short review list of the documents the arithmetic could not vouch for.
You might also like
Keep reading from the journal.
June 16, 2026AI
You don't need a data scientist yet
What an AI-curious founder actually needs, and why it is not the hire they were about to make.
June 16, 2026AI
The meter was always coming
We built an agent for a client that wakes up on every pull request. It reads the diff, checks it against the rules, leaves a comment, goes back to sleep. It has done this hundreds of times a week for months.
June 16, 2026AI
What your extraction company sells the morning the model learns to read.
A stock model can read any document now, so accuracy stopped being the product. The value moved up a step, to the decision the read sets in motion.