Skip to main content

How to pass multimodal data directly to models

Prerequisites

This guide assumes familiarity with the following concepts:

Here we demonstrate how to pass multimodal input directly to models. We currently expect all input to be passed in the same format as OpenAI expects. For other model providers that support multimodal input, we have added logic inside the class to convert to the expected format.

In this example we will ask a model to describe an image.

import * as fs from "node:fs/promises";

import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
});

const imageData = await fs.readFile("../../../../examples/hotdog.jpg");

The most commonly supported way to pass in images is to pass it in as a byte string within a message with a complex content type for models that support multimodal input. Here’s an example:

import { HumanMessage } from "@langchain/core/messages";

const message = new HumanMessage({
content: [
{
type: "text",
text: "what does this image contain?",
},
{
type: "image_url",
image_url: {
url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
},
},
],
});
const response = await model.invoke([message]);
console.log(response.content);
This image contains a hot dog. It shows a frankfurter or sausage encased in a soft, elongated bread bun. The sausage itself appears to be reddish in color, likely a smoked or cured variety. The bun is a golden-brown color, suggesting it has been lightly toasted or grilled. The hot dog is presented against a plain white background, allowing the details of the iconic American fast food item to be clearly visible.

Some model providers support taking an HTTP URL to the image directly in a content block of type "image_url":

import { ChatOpenAI } from "@langchain/openai";

const openAIModel = new ChatOpenAI({
model: "gpt-4o",
});

const imageUrl =
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";

const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: { url: imageUrl },
},
],
});
const response = await openAIModel.invoke([message]);
console.log(response.content);
The weather in the image appears to be pleasant and clear. The sky is mostly blue with a few scattered clouds, indicating good visibility and no immediate signs of rain. The lighting suggests it’s either morning or late afternoon, with sunlight creating a warm and bright atmosphere. There is no indication of strong winds, as the grass and foliage appear calm and undisturbed. Overall, it looks like a beautiful day, possibly spring or summer, ideal for outdoor activities.

We can also pass in multiple images.

const message = new HumanMessage({
content: [
{
type: "text",
text: "are these two images the same?",
},
{
type: "image_url",
image_url: {
url: imageUrl,
},
},
{
type: "image_url",
image_url: {
url: imageUrl,
},
},
],
});
const response = await openAIModel.invoke([message]);
console.log(response.content);
Yes, the two images are the same.

Next steps​

You’ve now learned how to pass multimodal data to a modal.

Next, you can check out our guide on multimodal tool calls.


Was this page helpful?


You can leave detailed feedback on GitHub.