Using prompto for multimodal prompting with Vertex AI¶
from prompto.settings import Settings
from prompto.experiment import Experiment
from dotenv import load_dotenv
import warnings
import os
When using prompto
to query models from the Vertex AI API, lines in our experiment .jsonl
files must have "api": "vertexai"
in the prompt dict. Please see the Vertex AI notebook for an introduction to using prompto
with the Vertex AI API and setting up the necessary environment variables.
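For illustration, a single line of such a .jsonl file might look like the following (the prompt text here is made up; the actual input file for this notebook is described below):

{"id": 0, "api": "vertexai", "model_name": "gemini-1.5-flash-002", "prompt": "What is the capital of France?", "parameters": {"temperature": 1, "max_output_tokens": 1000}}

We first load our environment variables from the .env file: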
load_dotenv(dotenv_path=".env")
True
Now, we obtain those values. We raise warnings if the VERTEXAI_PROJECT_ID
or VERTEXAI_LOCATION_ID
environment variables haven't been set. However, note that when using Vertex AI, you can set the default project and location using the gcloud
CLI, so these variables aren't strictly necessary.
VERTEXAI_PROJECT_ID = os.environ.get("VERTEXAI_PROJECT_ID")
if VERTEXAI_PROJECT_ID is None:
    warnings.warn("VERTEXAI_PROJECT_ID is not set")

VERTEXAI_LOCATION_ID = os.environ.get("VERTEXAI_LOCATION_ID")
if VERTEXAI_LOCATION_ID is None:
    warnings.warn("VERTEXAI_LOCATION_ID is not set")
If you get any errors or warnings in the above two cells, fix your .env
file so that these variables are set; a hypothetical example is shown below.
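A hypothetical .env file setting these variables might look like this (replace the values with your own project ID and location):

VERTEXAI_PROJECT_ID=<your-project-id>
VERTEXAI_LOCATION_ID=<your-location-id>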
Types of prompts¶
As we saw in the Vertex AI notebook, with the Vertex AI API, the prompt (given via the "prompt"
key in the prompt dict) can take several forms:
- a string: a single prompt to obtain a response for
- a list of strings: a sequence of prompts to send to the model
- this is useful for simulating a conversation with the model by defining the user prompts sequentially
- a list of dictionaries with keys "role" and "parts", where "role" is one of "user", "model", or "system" and "parts" is the message
- this is useful for passing in some conversation history or a system prompt to the model
- note that only the first prompt in the list can be a system prompt, and the rest must be user or model prompts; a sketch of this format follows this list
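For instance, a hypothetical prompt of this last form, with a system prompt followed by some conversation history, might look like the following (the messages are illustrative only):

[
    {"role": "system", "parts": "You are a helpful assistant"},
    {"role": "user", "parts": "What is the capital of France?"},
    {"role": "model", "parts": "The capital of France is Paris."},
    {"role": "user", "parts": "And what is its population?"},
]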
Multimodal prompts¶
For prompting the model with multimodal inputs, we use this last format, where we define a prompt by specifying the role of the prompt and then a list of parts that make up the prompt. Each part can be text, an image, or a video, and the parts are passed to the model together as a multimodal input. In this setting, the prompt can be defined flexibly, with text interspersed with images or video.
When specifying an individual part of the prompt, we define it using a dictionary with the keys "type" and "media"; in some cases, a "mime_type" key is also required:
- "type" is one of "text", "image", or "video"
- "media" is the actual content of the part - this can be a string for text, or a file path for images or video. Alternatively, this can be a Google Storage URI for images or video, e.g.
gs://bucket-name/path/to/file.jpg
- "mime_type" is the MIME type of the media content, e.g. "image/jpeg" for JPEG images or "video/mp4" for MP4 videos. This is required if the type is a video or if the type is a image and the media is a Google Storage URI. If the type is image and the media is a file path, the MIME type is not necessary
For text, you can simply use a string, or you can use the same dictionary format, e.g. { "type": "text", "media": "some text" }
. For images or video, you must use the dictionary format.
An example of a multimodal prompt is the following:
[
    {
        "role": "user",
        "parts": [
            "what is in this video?",
            {"type": "video", "mime_type": "video/mp4", "media": "gs://bucket/GreatRedSpot.mp4"},
        ]
    },
]
Here, we have a list containing one dictionary where we specify the "role" as "user" and "parts" as a list of two elements: the first is a string and the second is a dictionary specifying the type and media content of the part. In this case, the media content is a video file stored in Google Cloud Storage.
For this notebook, we have created an input file in data/input/vertexai-multimodal-example.jsonl containing several multimodal prompts that use local files as an illustration.
Specifying local files¶
When specifying local files, the file paths must be relative to the media/
folder in the data folder. For example, if you have an image file image.jpg
in the media/
folder, you would specify this as "media": "image.jpg"
in the prompt. If you have a video file video.mp4
in the media/videos/
folder, you would specify this as "media": "videos/video.mp4"
in the prompt.
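For example, assuming a hypothetical image.jpg in data/media/ and a hypothetical videos/video.mp4 in data/media/videos/, the corresponding parts would be written as:

{"type": "image", "media": "image.jpg"}
{"type": "video", "mime_type": "video/mp4", "media": "videos/video.mp4"}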
settings = Settings(data_folder="./data", max_queries=30)
experiment = Experiment(
    file_name="vertexai-multimodal-example.jsonl", settings=settings
)
We set max_queries
to 30 so that we send at most 30 queries per minute (one query every 2 seconds).
print(settings)
Settings: data_folder=./data, max_queries=30, max_attempts=3, parallel=False Subfolders: input_folder=./data/input, output_folder=./data/output, media_folder=./data/media
len(experiment.experiment_prompts)
3
We can see the prompts that we have in the experiment_prompts
attribute:
experiment.experiment_prompts
[{'id': 0, 'api': 'vertexai', 'model_name': 'gemini-1.5-flash-002', 'prompt': [{'role': 'user', 'parts': ['describe what is happening in this image', {'type': 'image', 'media': 'pantani_giro.jpg'}]}], 'parameters': {'candidate_count': 1, 'temperature': 1, 'max_output_tokens': 1000}}, {'id': 1, 'api': 'vertexai', 'model_name': 'gemini-1.5-flash-002', 'prompt': [{'role': 'user', 'parts': [{'type': 'image', 'media': 'mortadella.jpg'}, 'what is this?']}], 'parameters': {'candidate_count': 1, 'temperature': 1, 'max_output_tokens': 1000}}, {'id': 2, 'api': 'vertexai', 'model_name': 'gemini-1.5-flash-002', 'prompt': [{'role': 'user', 'parts': ['what is in this image?', {'type': 'image', 'media': 'pantani_giro.jpg'}]}, {'role': 'model', 'parts': 'This is image shows a group of cyclists.'}, {'role': 'user', 'parts': 'are there any notable cyclists in this image? what are their names?'}], 'parameters': {'candidate_count': 1, 'temperature': 1, 'max_output_tokens': 1000}}]
- In the first prompt ("id": 0), we have a "prompt" key which specifies a prompt where we ask the model to "describe what is happening in this image" and we pass in an image, defined using a dictionary with "type" and "media" keys pointing to a file in the media folder
- In the second prompt ("id": 1), we have a "prompt" key which specifies a prompt where we first pass in an image, defined using a dictionary with "type" and "media" keys pointing to a file in the media folder, and then ask the model "what is this?"
- In the third prompt ("id": 2), we have a "prompt" key which is a list of dictionaries. Each of these dictionaries has a "role" and "parts" key, and together they specify a user/model interaction. First we ask the model "what is in this image?" along with an image, defined by a dictionary with "type" and "media" keys pointing to a file in the media folder. We then have a model response and another user query
For each of these prompts, we specify a "model_name"
key to be "gemini-1.5-flash-002"
.
Note that we don't have examples with videos here, but we can pass in videos using the same format as images, additionally specifying the "mime_type" key. As mentioned above, we can also use Google Cloud Storage URIs for images and videos, but we don't do that here. A hypothetical video example is sketched below.
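For instance, a prompt containing a hypothetical local video videos/video.mp4 (stored under data/media/videos/) might look like this:

[
    {
        "role": "user",
        "parts": [
            "describe what happens in this video",
            {"type": "video", "mime_type": "video/mp4", "media": "videos/video.mp4"},
        ]
    },
]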
Running the experiment¶
We can now run the experiment using the async method process
, which will process the prompts in the input file asynchronously. Note that a new folder named timestamp-vertexai-multimodal-example
(where "timestamp" is replaced with the actual date and time of processing) will be created in the output directory, and the input file will be moved there. As the responses come in, they will be written to the output file, and logs will be printed to the console as well as written to a log file in the output directory.
responses, avg_query_processing_time = await experiment.process()
Sending 3 queries at 30 QPM with RI of 2.0s (attempt 1/3): 100%|██████████| 3/3 [00:07<00:00, 2.34s/query] Waiting for responses (attempt 1/3): 100%|██████████| 3/3 [00:03<00:00, 1.29s/query]
We can see that the responses are written to the output file, and we can also see them in the returned object. From running the experiment, we obtain prompt dicts which now have a "response"
key containing the response(s) from the model.
For the case where the prompt is a list of strings, we see that the response is a list of strings where each string is the response to the corresponding prompt.
responses
[{'id': 1, 'api': 'vertexai', 'model_name': 'gemini-1.5-flash-002', 'prompt': [{'role': 'user', 'parts': [{'type': 'image', 'media': 'mortadella.jpg'}, 'what is this?']}], 'parameters': {'candidate_count': 1, 'temperature': 1, 'max_output_tokens': 1000}, 'timestamp_sent': '21-10-2024-11-56-54', 'response': "That's **Mortadella**. Specifically, the image shows whole and sliced mortadella, a large Italian sausage known for its distinctive marbling of fat throughout the meat. The string tied around it is a common presentation.\n", 'safety_attributes': {'HARM_CATEGORY_HATE_SPEECH': '1', 'HARM_CATEGORY_DANGEROUS_CONTENT': '1', 'HARM_CATEGORY_HARASSMENT': '1', 'HARM_CATEGORY_SEXUALLY_EXPLICIT': '1', 'blocked': '[False, False, False, False]', 'finish_reason': 'STOP'}}, {'id': 0, 'api': 'vertexai', 'model_name': 'gemini-1.5-flash-002', 'prompt': [{'role': 'user', 'parts': ['describe what is happening in this image', {'type': 'image', 'media': 'pantani_giro.jpg'}]}], 'parameters': {'candidate_count': 1, 'temperature': 1, 'max_output_tokens': 1000}, 'timestamp_sent': '21-10-2024-11-56-51', 'response': "Here's a description of the image:\n\nThe photo depicts a group of professional cyclists in a road race, riding closely together in a peloton.\xa0\n\n\nHere's a breakdown of the scene:\n\n* **The Setting:** The cyclists are riding alongside a low stone wall, with a metal fence visible behind it. There's some greenery beyond the fence, suggesting a roadside or urban setting.\n\n* **The Cyclists:** The cyclists are wearing brightly colored, highly visible cycling jerseys representing different teams. One cyclist is easily identifiable by his pink jersey, possibly indicating a leader's position or stage win. The others are in various colors, including yellow, red, green, and blue. Their concentration is evident in their postures.\n\n* **The Bicycles:** The bicycles are sleek racing bikes with thin tires. The bikes all appear to be high-end racing models.\n\n* **The Action:** The cyclists are clearly in the middle of a race, riding at a high pace. Their close proximity and intense focus suggests a competitive moment in the race. There's a sense of urgency and speed in the image.\n\n\nThe overall impression is one of intense athletic competition and the energy of a cycling road race. The colors of the jerseys and the setting are vivid and sharp.\n", 'safety_attributes': {'HARM_CATEGORY_HATE_SPEECH': '1', 'HARM_CATEGORY_DANGEROUS_CONTENT': '1', 'HARM_CATEGORY_HARASSMENT': '1', 'HARM_CATEGORY_SEXUALLY_EXPLICIT': '1', 'blocked': '[False, False, False, False]', 'finish_reason': 'STOP'}}, {'id': 2, 'api': 'vertexai', 'model_name': 'gemini-1.5-flash-002', 'prompt': [{'role': 'user', 'parts': ['what is in this image?', {'type': 'image', 'media': 'pantani_giro.jpg'}]}, {'role': 'model', 'parts': 'This is image shows a group of cyclists.'}, {'role': 'user', 'parts': 'are there any notable cyclists in this image? what are their names?'}], 'parameters': {'candidate_count': 1, 'temperature': 1, 'max_output_tokens': 1000}, 'timestamp_sent': '21-10-2024-11-56-56', 'response': "That's a photo from the 1992 Giro d'Italia. The most prominent cyclist in the image is **Claudio Chiappucci** in the pink jersey. He's leading the pack.\n\nWhile it's difficult to definitively identify all the other riders with certainty from this angle and image quality, identifying other notable cyclists in this particular snapshot would require more information or a higher-resolution image.\n", 'safety_attributes': {'HARM_CATEGORY_HATE_SPEECH': '1', 'HARM_CATEGORY_DANGEROUS_CONTENT': '1', 'HARM_CATEGORY_HARASSMENT': '1', 'HARM_CATEGORY_SEXUALLY_EXPLICIT': '1', 'blocked': '[False, False, False, False]', 'finish_reason': 'STOP'}}]
Also notice how with the Vertex AI API, we record some additional information related to the safety attributes.
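As a minimal sketch (assuming the responses object returned above), we could pull out the finish reason and the start of each response as follows:

for response in sorted(responses, key=lambda r: r["id"]):
    # "safety_attributes" and "response" are keys present in the output shown above
    print(response["id"], response["safety_attributes"]["finish_reason"], response["response"][:60])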
Running the experiment via the command line¶
We can also run the experiment via the command line. The command is as follows (assuming that your working directory is the directory of this notebook, i.e. examples/vertexai
):
prompto_run_experiment --file data/input/vertexai-multimodal-example.jsonl --max-queries 30
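Running this command will create the same kind of timestamped folder in data/output and write the responses and logs there, as described above.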