Tensor Labs

Build a Cross-Modal Search Engine with Google gemini-embedding-2 in Python

June 29, 2026 · 7 min read · By Ahmed Abdullah
Introduction

In June 2026, Google added gemini-embedding-2 to the Gemini API, the first multimodal embedding model in the family. It maps text, images, audio, video, and PDFs into one shared vector space. This tutorial builds a search engine where a plain text query finds the right image and the right document from the same index, because all of them live on the same map.

The interesting part is not that the model embeds images. It is that the image vector and the sentence vector are comparable. That is what makes "find me the diagram about retries" return a PNG.

What an embedding is

An embedding is a list of numbers that represents the meaning of an input. Similar meanings land near each other; unrelated ones land far apart. Once your content is a set of vectors, search stops being string matching and becomes distance measuring. You embed the query, then find the stored vectors closest to it.
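To make "distance measuring" concrete, here is a minimal sketch with hand-written three-number vectors. The vectors and their labels are invented for illustration; real embeddings come from the model and have thousands of dimensions. The measure used is cosine similarity, the same one this tutorial uses later.

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
stored = {
    "retry guide": [0.9, 0.1, 0.0],
    "backoff notes": [0.8, 0.3, 0.1],
    "pasta recipe": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

# Rank stored items by closeness to the query, nearest first.
ranked = sorted(stored, key=lambda name: cosine(query, stored[name]), reverse=True)
print(ranked)  # ['retry guide', 'backoff notes', 'pasta recipe']
```

The two retry-related vectors point in nearly the same direction as the query, so they rank first; the unrelated one points elsewhere and lands last.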

gemini-embedding-2 returns 3,072 dimensions by default. You can request fewer with output_dimensionality, trading a little accuracy for smaller storage and faster comparison. At a few hundred documents the full 3,072 costs nothing you will notice; at a few million, dropping to 768 or 1,536 can be the difference between an index that fits in memory and one that does not. Pick the number when you know the scale, not before.
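The storage trade-off is simple arithmetic. The sketch below assumes each value is stored as a 4-byte float32 and counts only the raw vectors, ignoring any overhead a real vector store adds; the function name and the document counts are picked for illustration.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, assuming float32 storage (4 bytes per value)."""
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (3072, 1536, 768):
    small = index_size_gb(500, dims)
    large = index_size_gb(5_000_000, dims)
    print(f"{dims:>4} dims: 500 docs = {small:.4f} GB, 5M docs = {large:.1f} GB")
```

Under those assumptions, five million vectors take about 61 GB at 3,072 dimensions and about 15 GB at 768, while five hundred vectors stay in the low megabytes either way.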

What cross-modal embedding means

Most embedding stacks use one model for text and a different model for images. The two produce vectors in two unrelated spaces, so a text vector and an image vector cannot be compared directly. You end up bolting on a caption pipeline and searching the captions, which means you are searching text about the image, not the image.

A multimodal model removes that seam. gemini-embedding-2 places a sentence and a photograph in the same coordinate system, so the distance between them is meaningful. One index, one query path, every modality. The seam is gone, and so is the caption pipeline you would have maintained forever.

Set up the client and embed text

Install the SDK and embed your first string. The method is client.models.embed_content, and the model is the literal string gemini-embedding-2.

code
pip install google-genai numpy
python
# embed_text.py
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

result = client.models.embed_content(
    model="gemini-embedding-2",
    contents=["A guide to retrying failed API calls with backoff"],
)
vector = result.embeddings[0].values
print(len(vector))  # 3072

One call, one vector. The text path is the easy half. The point of this model is that the next call looks almost identical but takes an image.

Embed an image into the same space

Images go in as bytes wrapped in types.Part.from_bytes. Note what is not happening here: there is no separate image model, no separate endpoint, no caption step. The same method, the same model string, a different Part.

python
# embed_image.py
from google import genai
from google.genai import types

client = genai.Client()

with open("retry-diagram.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
)
image_vector = result.embeddings[0].values
print(len(image_vector))  # 3072, same space as the text vector

Both vectors have the same length and live in the same space. That is the whole premise. Now we put a mixed pile of them in one index.
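The example above hard-codes mime_type="image/png". If your corpus mixes file types, one option is to derive the MIME type from the file extension with the standard library. This is a sketch: image_mime_type is a hypothetical helper name, and which image formats the model accepts is not covered in this article, so check the API documentation before relying on a given type.

```python
import mimetypes

def image_mime_type(path: str) -> str:
    """Guess an image MIME type from the file extension, or raise if it is not an image."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    return mime

print(image_mime_type("retry-diagram.png"))  # image/png
print(image_mime_type("team-photo.jpg"))     # image/jpeg
```

The returned string can be passed as the mime_type argument to types.Part.from_bytes in place of the literal.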

Build a unified index across modalities

A real system writes these vectors to a store like pgvector or Qdrant. To keep the tutorial runnable end to end, the index here is a list in memory and the math is explicit, so you can see exactly what a vector database does for you later.

python
# index.py
from google import genai
from google.genai import types

client = genai.Client()


def embed(content) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[content],
    )
    return result.embeddings[0].values


def embed_image(path: str) -> list[float]:
    with open(path, "rb") as f:
        part = types.Part.from_bytes(data=f.read(), mime_type="image/png")
    return embed(part)


# a mixed corpus: text docs and images, one index
INDEX = [
    {"id": "doc:backoff", "modality": "text",
     "vector": embed("Exponential backoff retries failed requests with growing delays")},
    {"id": "doc:idempotency", "modality": "text",
     "vector": embed("Idempotency keys stop a retried request from running twice")},
    {"id": "img:retry-diagram", "modality": "image",
     "vector": embed_image("retry-diagram.png")},
]
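Because the in-memory index is rebuilt with one API call per item every time the script starts, it can help to save the rows once and reload them. The sketch below uses a plain JSON file as a stand-in for a real store; save_index and load_index are hypothetical helper names, and the three-value vectors are made up so the example runs without an API key.

```python
import json
import tempfile
from pathlib import Path

def save_index(index: list[dict], path: Path) -> None:
    """Write index rows (id, modality, vector) to a JSON file."""
    path.write_text(json.dumps(index))

def load_index(path: Path) -> list[dict]:
    """Read index rows back from a JSON file."""
    return json.loads(path.read_text())

# Stand-in rows with made-up 3-value vectors; real rows hold the model's output.
demo_index = [
    {"id": "doc:backoff", "modality": "text", "vector": [0.12, -0.48, 0.31]},
    {"id": "img:retry-diagram", "modality": "image", "vector": [0.09, -0.51, 0.27]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    save_index(demo_index, path)
    restored = load_index(path)

print(restored == demo_index)  # True
```

JSON is readable and fine at tutorial scale; it is not a substitute for the vector stores mentioned above once the corpus grows.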

Embed the query with a task instruction

gemini-embedding-2 does not accept the task_type parameter that earlier embedding models used. Instead you put the task in the text itself. For a search query, prefix it so the model knows it is embedding a question, not a passage.

python
# query.py
import numpy as np

from index import embed  # embed() is defined in index.py


def cosine(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def search(query: str, index: list[dict], k: int = 3):
    q_vector = embed(f"task: search result | query: {query}")
    ranked = sorted(
        index,
        key=lambda row: cosine(q_vector, row["vector"]),
        reverse=True,
    )
    return ranked[:k]

The prefix task: search result | query: is the asymmetric-search instruction the docs call for. It tells the model this string is a query looking for matching results, which pulls it toward the documents that answer it rather than the documents that merely look like it.
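Since the prefix has to be spelled identically on every query, it can be kept in one place. A minimal sketch, where QUERY_PREFIX and as_search_query are hypothetical names and the prefix text is the one quoted in this article:

```python
QUERY_PREFIX = "task: search result | query: "

def as_search_query(query: str) -> str:
    """Wrap a user's question in the search-query instruction quoted in this article."""
    return f"{QUERY_PREFIX}{query.strip()}"

print(as_search_query("  how do I stop a retry from charging twice "))
# task: search result | query: how do I stop a retry from charging twice
```

Only the query gets this wrapper here; the stored documents and images in this tutorial are embedded as they are.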

Search across modalities with one query

Now the payoff. A text query, run against an index that holds both documents and an image, ranks all of them together.

python
for hit in search("how do I stop a retry from charging twice", INDEX):
    print(hit["id"], hit["modality"])
# doc:idempotency text
# doc:backoff text
# img:retry-diagram image

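The text query reached into the image without a single caption being written. The diagram can be ranked at all because its pixels landed in the same space as the question and are measured with the same cosine. With three items and k=3 every entry comes back, so judge relevance by the order and the scores, not by the image's presence in the list.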
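Printing the similarity score next to each hit makes the ranking easier to judge than the order alone. The sketch below takes an already-embedded query vector so it runs offline; rank_with_scores is a hypothetical helper, and the three-value vectors are invented stand-ins, not model output.

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_with_scores(q_vector, index):
    """Return (id, cosine score) pairs, highest score first."""
    scored = [(row["id"], cosine(q_vector, row["vector"])) for row in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up vectors for illustration; real ones come from embed() and embed_image().
toy_index = [
    {"id": "doc:idempotency", "vector": [0.9, 0.2, 0.1]},
    {"id": "doc:backoff", "vector": [0.6, 0.6, 0.1]},
    {"id": "img:retry-diagram", "vector": [0.5, 0.7, 0.3]},
]
toy_query = [1.0, 0.1, 0.0]

for item_id, score in rank_with_scores(toy_query, toy_index):
    print(f"{item_id:<20} {score:.3f}")
```

With real embeddings, a large gap between the first and last score says more about relevance than position in a three-item list does.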

What this does not do

The model returns one aggregated embedding per request by default, so if you pass a document and an image in the same contents call, you get a single blended vector, not one per item. For an index you almost always want them separate, which means one item per call, which means rate limits and cost matter at scale. The limits are real and worth saying out loud: a maximum of 6 images, 8,192 text tokens, 180 seconds of audio, or 120 seconds of video per request. A cosine loop over a Python list is a teaching tool and a production liability; past a few thousand vectors, move to pgvector or Qdrant and let an index do the distance math.
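Before reaching for a database, the Python-level loop can at least be replaced by one matrix multiplication. This sketch is still brute force, only faster: it normalizes every stored vector once, then scores the whole index in a single NumPy operation. build_matrix and top_k are hypothetical names, and the random 64-dimensional vectors stand in for real embeddings.

```python
import numpy as np

def build_matrix(vectors) -> np.ndarray:
    """Stack vectors into one float32 matrix and scale each row to unit length."""
    matrix = np.asarray(vectors, dtype=np.float32)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def top_k(q_vector, matrix: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (row_index, score) pairs for the k most similar rows, best first."""
    q = np.asarray(q_vector, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = matrix @ q  # dot product of unit vectors equals cosine similarity
    k = min(k, len(scores))
    best = np.argpartition(-scores, k - 1)[:k]  # the top k rows, in no particular order
    best = best[np.argsort(-scores[best])]      # order just those k by score
    return [(int(i), float(scores[i])) for i in best]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))  # random stand-ins, 64 dims to keep it quick
matrix = build_matrix(vectors)

hits = top_k(vectors[42], matrix, k=3)
print(hits[0][0])  # 42: a vector's nearest neighbour is itself
```

This buys headroom, not scalability. Every query still touches every vector, which is the work the indexes in pgvector or Qdrant exist to avoid.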

The full working search engine

python
# search_engine.py
from google import genai
from google.genai import types
import numpy as np

client = genai.Client()


def embed(content) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[content],
    )
    return result.embeddings[0].values


def embed_image(path: str) -> list[float]:
    with open(path, "rb") as f:
        part = types.Part.from_bytes(data=f.read(), mime_type="image/png")
    return embed(part)


def cosine(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def search(query: str, index: list[dict], k: int = 3):
    q_vector = embed(f"task: search result | query: {query}")
    ranked = sorted(index, key=lambda r: cosine(q_vector, r["vector"]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    index = [
        {"id": "doc:backoff", "modality": "text",
         "vector": embed("Exponential backoff retries failed requests with growing delays")},
        {"id": "doc:idempotency", "modality": "text",
         "vector": embed("Idempotency keys stop a retried request from running twice")},
        {"id": "img:retry-diagram", "modality": "image",
         "vector": embed_image("retry-diagram.png")},
    ]
    for hit in search("how do I stop a retry from charging twice", index):
        print(hit["id"], hit["modality"])

Set GEMINI_API_KEY, drop a PNG named retry-diagram.png next to the file, and run it. The text question ranks the documents and the image in one pass.

When to reach for it

Reach for cross-modal embedding when your content is genuinely mixed and your users ask in words: a docs site with diagrams, a product catalog with photos, a support corpus with screenshots. If everything you store and everything you search is plain text, a text-only embedding model is cheaper and just as good, and the multimodal model is range you are paying for and not using. The moment a screenshot needs to be findable by a sentence, though, the seam between text and pixels is the problem, and this is the model that does not have one.