Build a Multimodal Invoice-to-JSON Extractor with Gemini 3.5 Flash

At Google I/O 2026 Google shipped the Gemini 3.5 series, and the version that matters for document work is Gemini 3.5 Flash: image input plus enforced structured output in one call.

AI Data Automation DataEngineering

June 17, 20267 min read12 sectionsBy Ahmed Abdullah

Build a Multimodal Invoice-to-JSON Extractor with Gemini 3.5 Flash

Introduction

At Google I/O 2026 Google shipped the Gemini 3.5 series, and the version that matters for document work is Gemini 3.5 Flash: image input plus enforced structured output in one call. That combination is what lets you turn a scanned invoice into validated JSON without an OCR step, a regex layer, or a per-vendor template. By the end of this tutorial you will have a function that takes an invoice image and returns a typed object you can write straight to a database. About 120 lines, no pipeline.

Document extraction has been a three-stage problem for a decade: OCR the pixels to text, parse the text with rules, then clean up what the rules missed. A multimodal model collapses all three. You hand it the image and the shape you want back. It reads the document the way a person does and fills in the shape.

What Gemini 3.5 Flash does that matters here

Two capabilities, used together. First, it accepts an image directly as input, so the model sees the invoice layout, not a flattened text dump where the total and the tax have lost their spatial relationship. Second, it supports a response schema: you pass a Pydantic model as the expected output, and the API constrains generation to match it. The model cannot return a stray field or a string where you asked for a number. That constraint is the difference between a demo and something you can run unattended.

What structured extraction actually means

Structured extraction is not "ask the model nicely for JSON." It is binding the model's output to a schema the API enforces. The schema is your contract. You declare that an invoice has a vendor name, an invoice number, a date, a list of line items, and a total. The model's job is to populate that shape from the pixels. Everything downstream, validation, storage, reconciliation, depends on the output matching the contract every time, not most of the time.

Set up the project

One SDK, one key.

code

python -m venv venv && source venv/bin/activate
pip install google-genai pydantic
export GEMINI_API_KEY="your-gemini-api-key"

The google-genai package is Google's current SDK for the Gemini API. The key comes fromGoogle AI Studio.

Define the shape you want back

Start with the schema, not the prompt. The schema is the part you will reuse across every invoice, so it deserves the most thought. Model the document, including the nested line items.

code

from pydantic import BaseModel, Field
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float
    class Invoice(BaseModel):
        vendor_name: str
        invoice_number: str
        invoice_date: str = Field(description="ISO 8601, YYYY-MM-DD")
        currency: str = Field(description="ISO 4217 code, e.g. USD")
        line_items: list[LineItem]
        subtotal: float
        tax: float
        total: float

The Field(description=...) calls are not documentation. Gemini 3.5 Flash reads them as part of the schema, so "ISO 8601, YYYY-MM-DD" actively steers the date format. The description is a prompt that travels with the field

Read the image and make one call

Now the core build step. Load the invoice bytes, pass them alongside a short instruction, and set the response schema. The SDK returns a parsed Pydantic object directly

python

import os
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
def extract_invoice(image_path: str) -> Invoice:
    with open(image_path, "rb") as f:
        image_bytes = f.read()
        mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
        response = client.models.generate_content(
            model="gemini-3.5-flash",
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type=mime),
                "Extract the invoice into the provided schema. "
                "Use the totals printed on the document, do not recompute them.",
            ],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Invoice,
            ),
        )
        return response.parsed

The instruction "do not recompute them" is doing real work. Left to its own judgment the model will sometimes sum the line items and report that, which disagrees with a printed total when the invoice has a rounding line or a manual discount. You want what the document says, not what the math implies.

Verify the extraction instead of trusting it

A schema guarantees the shape. It does not guarantee the numbers. The cheapest real check is arithmetic the model already has no incentive to fake: does subtotal plus tax land on total?

python

def validate_invoice(inv: Invoice, tolerance: float = 0.02) -> list[str]:
    problems = []
    expected = round(inv.subtotal + inv.tax, 2)
    if abs(expected - inv.total) > tolerance:
        problems.append(
            f"subtotal {inv.subtotal} + tax {inv.tax} = {expected}, "
            f"but total reads {inv.total}"
        )
        line_sum = round(sum(i.amount for i in inv.line_items), 2)
        if abs(line_sum - inv.subtotal) > tolerance:
            problems.append(
                f"line items sum to {line_sum}, subtotal reads {inv.subtotal}"
            )
            return problems

An invoice that fails this check is not necessarily wrong. It might have a fee the schema does not model. But it is the document a human should look at, which means your extractor just earned its keep by sorting the 95 clean invoices from the 5 that need eyes.

Process a folder and route the doubtful ones

The last build step turns one extraction into a batch with a review queue. Clean invoices flow through. Flagged ones land in a separate list a person works.

python

from pathlib import Path
def process_folder(folder: str):
    clean, needs_review = [], []
    for path in sorted(Path(folder).glob("*.[pj][pn]g")):
        inv = extract_invoice(str(path))
        problems = validate_invoice(inv)
        record = {"file": path.name, "invoice": inv.model_dump()}
        if problems:
            record["problems"] = problems
            needs_review.append(record)
        else:
            clean.append(record)
            return clean, needs_review

The glob pattern *.[pj][pn]g catches .png and .jpg without a second loop. Everything clean is ready for your database; everything flagged is a short, honest worklist.

Gemini 3.5 Flash vs OCR-plus-templates

The traditional pipeline still has a place. The point is knowing where the line is.

Dimension	Gemini 3.5 Flash	OCR + per-vendor templates
New vendor layout	Works with no change	Needs a new template
Setup cost	A schema	An OCR engine plus rules per format
Handles handwriting / stamps	Often, as context	Poorly
Cost per document	A model call	Near zero after setup
Auditability	Reasoning is opaque	Rules are inspectable
Best at	Varied, messy, long-tail layouts	High volume of one fixed format

If every invoice you process comes from the same three vendors in the same format at high volume, templates are cheaper and more auditable. The model wins the moment the layouts vary, which for most businesses is immediately.

What this extractor does not do

It trusts the model's read of low-quality scans, and a smudged digit becomes a confident wrong number that passes the schema and sometimes even the arithmetic check. It has no concept of duplicate detection, so the same invoice submitted twice extracts twice. And the validation only catches math errors, not a vendor name read off the "remit to" address instead of the "from" block. The schema makes the output safe to store. It does not make it true. That gap is why the review queue exists.

When to use it

Reach for this when you receive documents in formats you do not control and cannot predict: supplier invoices, receipts, shipping manifests, intake forms from a long tail of senders. The multimodal-plus-schema approach removes the per-format engineering that made document extraction expensive, and the validation layer keeps a human in the loop exactly where the model is weakest. Do not point it at a single high-volume fixed format where a template is cheaper, and never let an unvalidated extraction post to a ledger. Used as a first pass with arithmetic as the gate, it turns a manual data-entry task into a short review of the few documents that actually need a person

Full working example

python

import os
from pathlib import Path
from google import genai
from google.genai import types
from pydantic import BaseModel, Field
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float
    class Invoice(BaseModel):
        vendor_name: str
        invoice_number: str
        invoice_date: str = Field(description="ISO 8601, YYYY-MM-DD")
        currency: str = Field(description="ISO 4217 code, e.g. USD")
        line_items: list[LineItem]
        subtotal: float
        tax: float
        total: float
        def extract_invoice(image_path: str) -> Invoice:
            with open(image_path, "rb") as f:
                image_bytes = f.read()
                mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
                response = client.models.generate_content(
                    model="gemini-3.5-flash",
                    contents=[
                        types.Part.from_bytes(data=image_bytes, mime_type=mime),
                        "Extract the invoice into the provided schema. "
                        "Use the totals printed on the document, do not recompute them.",
                    ],
                    config=types.GenerateContentConfig(
                        response_mime_type="application/json",
                        response_schema=Invoice,
                    ),
                )
                return response.parsed
                def validate_invoice(inv: Invoice, tol: float = 0.02) -> list[str]:
                    out = []
                    if abs(round(inv.subtotal + inv.tax, 2) - inv.total) > tol:
                        out.append(f"subtotal + tax != total ({inv.total})")
                        if abs(round(sum(i.amount for i in inv.line_items), 2) - inv.subtotal) > tol:
                            out.append(f"line items != subtotal ({inv.subtotal})")
                            return out
                            def process_folder(folder: str):
                                clean, needs_review = [], []
                                for path in sorted(Path(folder).glob("*.[pj][pn]g")):
                                    inv = extract_invoice(str(path))
                                    problems = validate_invoice(inv)
                                    rec = {"file": path.name, "invoice": inv.model_dump()}
                                    (needs_review if problems else clean).append(
                                        {**rec, "problems": problems} if problems else rec
                                    )
                                    return clean, needs_review
                                    if __name__ == "__main__":
                                        clean, review = process_folder("./invoices")
                                        print(f"{len(clean)} clean, {len(review)} need review")

Drop a folder of invoice images next to the script, run it, and you get a clean set ready for storage and a short review list of the documents the arithmetic could not vouch for.

Keep reading from the journal.

July 6, 2026

Deduplicate the corpus before the AI reads it

Fingerprint and canonicalize at ingestion, before retrieval trusts an echo

July 6, 2026

Your growth chart is counting the same person twice

Count people, not devices, and let the graph rewrite yesterday

July 6, 2026

Automation

Bill from a ledger, not a counter

Usage you can replay is revenue you can defend