Using prompto for multimodal prompting with Gemini

In [2]:

from prompto.settings import Settings
from prompto.experiment import Experiment
from dotenv import load_dotenv
import warnings
import os

When using prompto to query models from the Gemini API, lines in our experiment .jsonl files must have "api": "gemini" in the prompt dict. Please see the Gemini notebook for an introduction to using prompto with the Gemini API and setting up the necessary environment variables.

In [3]:

load_dotenv(dotenv_path=".env")

Out[3]:

True

Now, we obtain those values. We raise an error if the GEMINI_API_KEY environment variable hasn't been set:

In [4]:

GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
if GEMINI_API_KEY is None:
    raise ValueError("GEMINI_API_KEY is not set")

If you get any errors or warnings in the above cell, check that your .env file sets these variables (the Gemini notebook shows an example .env file).

Types of prompts

As we saw in the Gemini notebook, with the Gemini API, the prompt (given via the "prompt" key in the prompt dict) can take several forms:

- a string: a single prompt to obtain a response for
- a list of strings: a sequence of prompts to send to the model
  - this is useful for simulating a conversation with the model by defining the user prompts sequentially
- a list of dictionaries with keys "role" and "parts", where "role" is one of "user", "model", or "system" and "parts" is the message
  - this is useful for passing in some conversation history or a system prompt to the model
  - note that only the first prompt in the list can be a system prompt, and the rest must be user or model prompts
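As a minimal sketch, the three forms could look as follows in Python. The prompt texts and ids are made up for illustration, and prompt_form is a hypothetical helper (not part of prompto) that only reports which form a prompt takes.

```python
# Illustrative prompt dicts for the three prompt forms (texts and ids are made up).
single_prompt = {
    "id": "string-example",
    "api": "gemini",
    "model_name": "gemini-1.5-flash",
    "prompt": "How does a bicycle stay upright?",
}

multi_turn = {
    "id": "list-example",
    "api": "gemini",
    "model_name": "gemini-1.5-flash",
    "prompt": ["Hello, who won the Giro in 1998?", "And who came second?"],
}

with_history = {
    "id": "dict-example",
    "api": "gemini",
    "model_name": "gemini-1.5-flash",
    "prompt": [
        {"role": "system", "parts": "You are a concise cycling historian."},
        {"role": "user", "parts": "Who won the Giro in 1998?"},
        {"role": "model", "parts": "Marco Pantani."},
        {"role": "user", "parts": "Which team did he ride for?"},
    ],
}


def prompt_form(prompt):
    """Hypothetical helper: report which of the three forms a prompt takes."""
    if isinstance(prompt, str):
        return "string"
    if isinstance(prompt, list) and all(isinstance(p, str) for p in prompt):
        return "list of strings"
    if isinstance(prompt, list) and all(
        isinstance(p, dict) and {"role", "parts"} <= p.keys() for p in prompt
    ):
        return "list of dictionaries"
    raise TypeError(f"unrecognised prompt form: {type(prompt).__name__}")


for example in (single_prompt, multi_turn, with_history):
    print(example["id"], "->", prompt_form(example["prompt"]))
```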

Multimodal prompts

For prompting the model with multimodal inputs, we use this last format where we define a prompt by specifying the role of the prompt and then a list of parts that make up the prompt. Individual pieces of the part can be text, images or video which are passed to the model as a multimodal input. In this setting, the prompt can be defined flexibly with text interspersed with images or video.

When specifying an individual part of the prompt, we use a dictionary with the keys "type" and "media". A "mime_type" key is sometimes needed as well:

- "type" is one of "text", "image", or "file"
- "media" is the actual content of the part: a string for text, or a file path for images. Alternatively, it can be a URI for an image or video that has been uploaded to Gemini's File API service (see the Gemini documentation for details)

For specifying text, you can just have a string, or you can also use this format, e.g. { "type": "text", "media": "some text" }. For images or video, you must use the dictionary format.
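To make the equivalence between the two text forms concrete, here is a small sketch. The normalize_part function is a hypothetical helper written for this illustration; it is not prompto's implementation and only shows that a bare string and the explicit dictionary carry the same information.

```python
def normalize_part(part):
    """Hypothetical helper: convert a bare string into the explicit dictionary form."""
    if isinstance(part, str):
        # A plain string is shorthand for a text part.
        return {"type": "text", "media": part}
    if isinstance(part, dict) and "type" in part and "media" in part:
        # Already in dictionary form, so return it unchanged.
        return part
    raise ValueError(f"unsupported part: {part!r}")


parts = ["what is in this image?", {"type": "image", "media": "image.jpg"}]
normalized = [normalize_part(p) for p in parts]
print(normalized)
```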

An example of a multimodal prompt is the following:

[
    {
        "role": "user",
        "parts": [
            "what is in this image?",
            {"type": "image", "media": "image.jpg"},
        ]
    },
]

Here, we have a list of one dictionary where we specify the "role" as "user" and "parts" as a list of two elements: the first is a string and the second is a dictionary specifying the type and media content of the part. In this case, the media content is an image file path.

For this notebook, we have created an input file in data/input/gemini-multimodal-example.jsonl with several multimodal prompts with local files as an illustration.
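For readers who want to build a similar input file themselves, the sketch below writes one prompt dict per line using only the standard library. The file name my-multimodal-example.jsonl and the temporary directory are assumptions made so the example is self-contained; in practice the file would live in the input folder of your data folder.

```python
import json
import tempfile
from pathlib import Path

# One prompt dict per line of the .jsonl file (the content here is illustrative).
prompt_dict = {
    "id": 0,
    "api": "gemini",
    "model_name": "gemini-1.5-flash",
    "prompt": [
        {
            "role": "user",
            "parts": [
                "what is in this image?",
                {"type": "image", "media": "image.jpg"},
            ],
        }
    ],
}

# A temporary directory stands in for data/input/ so the sketch runs anywhere.
input_dir = Path(tempfile.mkdtemp()) / "input"
input_dir.mkdir(parents=True)
input_file = input_dir / "my-multimodal-example.jsonl"  # hypothetical file name

with input_file.open("w", encoding="utf-8") as f:
    f.write(json.dumps(prompt_dict) + "\n")

# Read the file back to check that every line is valid JSON.
with input_file.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```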

Specifying local files

When specifying local files, the file paths must be relative to the media/ folder in the data folder. For example, if you have an image file image.jpg in the media/ folder, you would specify this as "media": "image.jpg" in the prompt. If you have a video file video.mp4 in the media/videos/ folder, you would specify this as "media": "videos/video.mp4" in the prompt.
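The path rule can be illustrated with pathlib. This is only a sketch of how a relative "media" value maps onto a location on disk, assuming a data folder named data; resolve_media is a hypothetical helper and not a prompto function.

```python
from pathlib import Path

# Assumed layout: <data folder>/media/ holds all local media files.
media_folder = Path("data") / "media"


def resolve_media(media, folder=media_folder):
    """Hypothetical helper: join a relative "media" value onto the media folder."""
    return folder / media


image_path = resolve_media("image.jpg")
video_path = resolve_media("videos/video.mp4")
print(image_path.as_posix())
print(video_path.as_posix())
```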

In [5]:

settings = Settings(data_folder="./data", max_queries=30)
experiment = Experiment(file_name="gemini-multimodal-example.jsonl", settings=settings)

We set max_queries to 30 so that we send at most 30 queries per minute (one query every 2 seconds).
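The 2-second figure is just 60 seconds divided by the per-minute limit. The sketch below spells out that arithmetic; request_interval is a hypothetical name used for illustration and is not taken from prompto.

```python
def request_interval(max_queries_per_minute):
    """Seconds to wait between queries for a given per-minute limit."""
    return 60 / max_queries_per_minute


interval = request_interval(30)
print(f"{interval:.1f}s between queries")
```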

In [6]:

print(settings)

Settings: data_folder=./data, max_queries=30, max_attempts=3, parallel=False
Subfolders: input_folder=./data/input, output_folder=./data/output, media_folder=./data/media

In [7]:

len(experiment.experiment_prompts)

Out[7]:

3

We can see the prompts that we have in the experiment_prompts attribute:

In [8]:

experiment.experiment_prompts

Out[8]:

[{'id': 0,
  'api': 'gemini',
  'model_name': 'gemini-1.5-flash',
  'prompt': [{'role': 'user',
    'parts': ['describe what is happening in this image',
     {'type': 'image', 'media': 'pantani_giro.jpg'}]}],
  'parameters': {'candidate_count': 1,
   'temperature': 1,
   'max_output_tokens': 1000}},
 {'id': 1,
  'api': 'gemini',
  'model_name': 'gemini-1.5-flash',
  'prompt': [{'role': 'user',
    'parts': [{'type': 'image', 'media': 'mortadella.jpg'}, 'what is this?']}],
  'parameters': {'candidate_count': 1,
   'temperature': 1,
   'max_output_tokens': 1000}},
 {'id': 2,
  'api': 'gemini',
  'model_name': 'gemini-1.5-flash',
  'prompt': [{'role': 'user',
    'parts': ['what is in this image?',
     {'type': 'image', 'media': 'pantani_giro.jpg'}]},
   {'role': 'model', 'parts': 'This is image shows a group of cyclists.'},
   {'role': 'user',
    'parts': 'are there any notable cyclists in this image? what are their names?'}],
  'parameters': {'candidate_count': 1,
   'temperature': 1,
   'max_output_tokens': 1000}}]

- In the first prompt ("id": 0), the "prompt" key asks the model to "describe what is happening in this image" and then passes in an image, defined by a dictionary whose "type" and "media" keys point to a file in the media folder.
- In the second prompt ("id": 1), the order is reversed: the "prompt" key first passes in an image, defined by a dictionary whose "type" and "media" keys point to a file in the media folder, and then asks the model "what is this?"
- In the third prompt ("id": 2), the "prompt" key is a list of three dictionaries, each with a "role" and a "parts" key, which together specify a user/model interaction. First we ask the model "what is in this image?" along with an image defined by a dictionary whose "type" and "media" keys point to a file in the media folder. We then have a model response and another user query.

For each of these prompts, we specify a "model_name" key to be "gemini-1.5-flash".

Note that we don't have examples here with videos, but similarly we can pass in videos using the same format as images but additionally specifying the "mime_type" key.
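As a rough sketch of what such a video prompt might look like, the dict below reuses the image format and adds a "mime_type" key. Two things here are assumptions rather than facts from this notebook: that video parts use the "file" type, and that "video/mp4" is the appropriate value for an .mp4 file. The id and prompt text are made up.

```python
# Assumption: video parts use "type": "file" together with a "mime_type" key.
video_part = {
    "type": "file",
    "media": "videos/video.mp4",  # path relative to the media folder
    "mime_type": "video/mp4",  # assumed MIME type for an .mp4 file
}

video_prompt = {
    "id": "video-example",  # made-up id
    "api": "gemini",
    "model_name": "gemini-1.5-flash",
    "prompt": [
        {
            "role": "user",
            "parts": ["describe what happens in this video", video_part],
        }
    ],
}
```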

Running the experiment

We now can run the experiment using the async method process which will process the prompts in the input file asynchronously. Note that a new folder named timestamp-gemini-example (where "timestamp" is replaced with the actual date and time of processing) will be created in the output directory and we will move the input file to the output directory. As the responses come in, they will be written to the output file and there are logs that will be printed to the console as well as being written to a log file in the output directory.

In [9]:

responses, avg_query_processing_time = await experiment.process()

Sending 3 queries at 30 QPM with RI of 2.0s (attempt 1/3): 100%|██████████| 3/3 [00:06<00:00,  2.00s/query]
Waiting for responses (attempt 1/3): 100%|██████████| 3/3 [00:02<00:00,  1.25query/s]

We can see that the responses are written to the output file, and we can also see them as the returned object. From running the experiment, we obtain prompt dicts where there is now a "response" key which contains the response(s) from the model.

In this experiment, each prompt yields a single response string. When the prompt is a list of strings, the response is instead a list of strings, where each string is the response to the corresponding prompt.

In [10]:

responses

Out[10]:

[{'id': 0,
  'api': 'gemini',
  'model_name': 'gemini-1.5-flash',
  'prompt': [{'role': 'user',
    'parts': ['describe what is happening in this image',
     {'type': 'image', 'media': 'pantani_giro.jpg'}]}],
  'parameters': {'candidate_count': 1,
   'temperature': 1,
   'max_output_tokens': 1000},
  'timestamp_sent': '21-10-2024-12-22-49',
  'response': 'A group of cyclists are racing down a road. The cyclist in the lead is wearing a pink and yellow jersey and is riding a black bicycle. He is closely followed by a cyclist in a yellow and red jersey on a yellow and black bicycle, and a cyclist in a green and yellow jersey on a blue bicycle. The cyclist in the pink jersey is looking over his shoulder at the other cyclists, and the cyclists in the yellow and red and green and yellow jerseys are looking straight ahead. The cyclists are all in a single file line, and the road is paved and has a white line down the middle. There are a few other cyclists behind them, but they are not as clear in the image. There is a stone wall on the right side of the road, and some green grass and trees in the background.',
  'safety_attributes': {'HARM_CATEGORY_SEXUALLY_EXPLICIT': '1',
   'HARM_CATEGORY_HATE_SPEECH': '1',
   'HARM_CATEGORY_HARASSMENT': '1',
   'HARM_CATEGORY_DANGEROUS_CONTENT': '1',
   'blocked': '[False, False, False, False]',
   'finish_reason': 'STOP'}},
 {'id': 1,
  'api': 'gemini',
  'model_name': 'gemini-1.5-flash',
  'prompt': [{'role': 'user',
    'parts': [{'type': 'image', 'media': 'mortadella.jpg'}, 'what is this?']}],
  'parameters': {'candidate_count': 1,
   'temperature': 1,
   'max_output_tokens': 1000},
  'timestamp_sent': '21-10-2024-12-22-51',
  'response': 'This is Mortadella, an Italian cured pork sausage.  It is known for its characteristic marbling of fat.',
  'safety_attributes': {'HARM_CATEGORY_SEXUALLY_EXPLICIT': '1',
   'HARM_CATEGORY_HATE_SPEECH': '1',
   'HARM_CATEGORY_HARASSMENT': '1',
   'HARM_CATEGORY_DANGEROUS_CONTENT': '1',
   'blocked': '[False, False, False, False]',
   'finish_reason': 'STOP'}},
 {'id': 2,
  'api': 'gemini',
  'model_name': 'gemini-1.5-flash',
  'prompt': [{'role': 'user',
    'parts': ['what is in this image?',
     {'type': 'image', 'media': 'pantani_giro.jpg'}]},
   {'role': 'model', 'parts': 'This is image shows a group of cyclists.'},
   {'role': 'user',
    'parts': 'are there any notable cyclists in this image? what are their names?'}],
  'parameters': {'candidate_count': 1,
   'temperature': 1,
   'max_output_tokens': 1000},
  'timestamp_sent': '21-10-2024-12-22-53',
  'response': "The image features several notable cyclists, including:\n\n* **Miguel Indurain:**  The cyclist in pink, wearing a yellow helmet and sunglasses, is the Spanish legend Miguel Indurain. He is a five-time winner of the Tour de France.\n* **Claudio Chiappucci:** The cyclist in the red, white, and yellow jersey is Claudio Chiappucci, an Italian cyclist known for his strong performances in the mountains. \n* **Marco Pantani:** The cyclist in the green and yellow jersey is Marco Pantani, an Italian climber who won the Tour de France in 1998 and the Giro d'Italia in 1998 and 1999.\n\nIt's important to note that this image is from the 1990s, and these cyclists are riding for their respective teams at the time. \n",
  'safety_attributes': {'HARM_CATEGORY_SEXUALLY_EXPLICIT': '1',
   'HARM_CATEGORY_HATE_SPEECH': '1',
   'HARM_CATEGORY_HARASSMENT': '1',
   'HARM_CATEGORY_DANGEROUS_CONTENT': '1',
   'blocked': '[False, False, False, False]',
   'finish_reason': 'STOP'}}]

Also notice how with the Gemini API, we record some additional information related to the safety attributes.
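One way to inspect these fields is sketched below. It works on a small hand-copied sample shaped like the output above rather than on the live responses object, and it assumes that "blocked" is always a string holding a Python-style list of booleans, as it is in the output shown here.

```python
import ast

# A trimmed copy of one response from the output above, used as sample data.
sample_responses = [
    {
        "id": 1,
        "response": "This is Mortadella, an Italian cured pork sausage.",
        "safety_attributes": {
            "HARM_CATEGORY_HATE_SPEECH": "1",
            "blocked": "[False, False, False, False]",
            "finish_reason": "STOP",
        },
    }
]

summary = []
for item in sample_responses:
    safety = item.get("safety_attributes", {})
    # "blocked" is stored as a string, so parse it back into a list of booleans.
    blocked = ast.literal_eval(safety.get("blocked", "[]"))
    summary.append(
        {
            "id": item["id"],
            "finish_reason": safety.get("finish_reason"),
            "any_blocked": any(blocked),
        }
    )

print(summary)
```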

Running the experiment via the command line

We can also run the experiment via the command line. The command is as follows (assuming that your working directory is the current directory of this notebook, i.e. examples/gemini):

prompto_run_experiment --file data/input/gemini-multimodal-example.jsonl --max-queries 30