Querying different models from the same endpoint: prompto vs. synchronous Python for loop¶
import time
import os
from tqdm import tqdm
from dotenv import load_dotenv
from prompto.settings import Settings
from prompto.experiment import Experiment
from api_utils import send_prompt
from dataset_utils import (
    load_prompt_dicts,
    load_prompts,
    generate_experiment_1_file,
    generate_experiment_3_file,
)
In this experiment, we compare the performance of prompto, which uses asynchronous programming to query model API endpoints, with a traditional synchronous Python for loop. Specifically, we measure the time it takes prompto to obtain 100 responses from 3 different models on the same API endpoint, and the time it takes a synchronous Python for loop to obtain the same 100 responses for each model.
We will see that prompto obtains the responses from the models much faster than the synchronous Python for loop, especially when using parallel processing to query the models in parallel.
For this experiment, we choose to query three different models from the OpenAI API: gpt-3.5-turbo, gpt-4 and gpt-4o.
For this experiment, we will need to set the following environment variable:
OPENAI_API_KEY: the API key for the OpenAI API
To set this environment variable, you can simply put it in a .env file which specifies it as a key-value pair:
OPENAI_API_KEY=<YOUR-OPENAI-KEY>
Once you have created this file, you can run the following, which should return True if the file has been found and loaded, or False otherwise:
load_dotenv(dotenv_path=".env")
True
For the experiment, we take a sample of 100 prompts from alpaca_data.json in the tatsu-lab/stanford_alpaca GitHub repo, using the prompt template provided by the authors of the repo. To see how we obtained the prompts, please refer to the alpaca_sample_generation.ipynb notebook.
alpaca_prompts = load_prompts("./sample_prompts.json")
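The load_prompts helper is defined in dataset_utils.py in this directory. As a rough sketch only, and assuming sample_prompts.json stores a plain JSON list of prompt strings, it could look something like this:

import json

def load_prompts_sketch(path: str) -> list[str]:
    # hypothetical helper: read a JSON file containing a list of prompt strings
    with open(path, "r") as f:
        return json.load(f)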
Synchronous approach¶
For the synchronous approach, we simply use a for loop to query the API endpoints:
def send_prompts_sync(prompt_dicts: list[dict]) -> list[str]:
    # naive for loop to synchronously dispatch prompts, one at a time
    return [send_prompt(prompt_dict) for prompt_dict in tqdm(prompt_dicts)]
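The send_prompt function is imported from api_utils.py in this directory. As a rough, hypothetical sketch of what such a blocking call to the OpenAI chat completions endpoint could look like (the helper name and the dictionary keys below are assumptions, not the actual implementation in api_utils.py):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def send_prompt_sketch(prompt_dict: dict) -> str:
    # blocking request: each call only starts once the previous one has finished
    response = client.chat.completions.create(
        model=prompt_dict["model_name"],
        messages=[{"role": "user", "content": prompt_dict["prompt"]}],
        **prompt_dict.get("parameters", {}),
    )
    return response.choices[0].message.content

Because each request blocks until its response is returned, the total runtime of send_prompts_sync is roughly the sum of the individual response times.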
Querying the models in parallel with prompto¶
We first look at querying each of the models in parallel using prompto.
Experiment setup¶
We will create our experiment file using the generate_experiment_3_file function in the dataset_utils.py file in this directory. This function takes the prompts and creates a jsonl file with them in the format that prompto expects. We will save this input file into ./data/input and use ./data as our pipeline data folder.
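Each line of the generated jsonl file is a JSON object describing one query; with three model names, each prompt is repeated once per model, giving 100 x 3 = 300 lines. A minimal sketch of such a generator, assuming prompto's standard input fields (the actual implementation is generate_experiment_3_file in dataset_utils.py):

import json

def generate_multi_model_file_sketch(
    path: str, prompts: list[str], api: str, model_names: list[str], params: dict
) -> None:
    # hypothetical helper: write one jsonl line per (prompt, model) pair
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            for model_name in model_names:
                line = {
                    "id": f"{model_name}-{i}",
                    "api": api,
                    "model_name": model_name,
                    "prompt": prompt,
                    "parameters": params,
                }
                f.write(json.dumps(line) + "\n")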
OPENAI_MULTIPLE_EXPERIMENT_FILE = "./data/input/openai-multiple-models.jsonl"
generate_experiment_3_file(
    path=OPENAI_MULTIPLE_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name=["gpt-3.5-turbo", "gpt-4", "gpt-4o"],
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
print(
"len(load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE)): "
f"{len(load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE))}"
)
len(load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE)): 300
First, as a baseline, we initialise the timing dictionaries and send all 300 prompts with the synchronous for loop, recording the total time taken:
sync_times = {}
prompto_times = {}

start = time.time()
overall_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE)
)
sync_times["overall"] = time.time() - start
100%|██████████| 300/300 [11:45<00:00, 2.35s/it]
Specifying the rate limits for each model for parallel processing¶
Notice here that we are setting parallel=True in the Settings object, as well as specifying a rate limit for each of the models; we set each of them to 500 queries per minute. We do this by passing a dictionary to the max_queries_dict argument of the Settings object: the keys are API names, and each value is itself a dictionary mapping the model names we wish to process in parallel to their rate limits.
For details of how to specify rate limits for different models in the same API, see the Specifying rate limits docs and the Grouping prompts and specifying rate limits notebook.
Note that in the previous experiment, we also used parallel processing, but in a slightly different way, as we were parallelising the querying of different APIs.
multiple_models_experiment = Experiment(
    file_name="openai-multiple-models.jsonl",
    settings=Settings(
        data_folder="./data",
        parallel=True,
        max_queries_dict={
            "openai": {"gpt-3.5-turbo": 500, "gpt-4": 500, "gpt-4o": 500}
        },
    ),
)
start = time.time()
multiple_models_responses, _ = await multiple_models_experiment.process()
prompto_times["overall"] = time.time() - start
Waiting for all groups to complete: 0%| | 0/4 [00:00<?, ?group/s]
Sending 0 queries at 10 QPM with RI of 6.0s for group openai (attempt 1/3): 0query [00:00, ?query/s]
Waiting for responses for group openai (attempt 1/3): 0query [00:00, ?query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai-gpt-3.5-turbo (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 7.73query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai-gpt-4 (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 7.72query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai-gpt-4o (attempt 1/3): 100%|██████████| 100/100 [00:13<00:00, 7.69query/s]
Waiting for responses for group openai-gpt-3.5-turbo (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 57.16query/s]
Waiting for all groups to complete: 50%|█████ | 2/4 [00:14<00:14, 7.35s/group]
Waiting for responses for group openai-gpt-4o (attempt 1/3): 100%|██████████| 100/100 [00:03<00:00, 26.66query/s]
Waiting for all groups to complete: 75%|███████▌ | 3/4 [00:16<00:05, 5.15s/group]
Waiting for responses for group openai-gpt-4 (attempt 1/3): 100%|██████████| 100/100 [00:06<00:00, 15.87query/s]
Waiting for all groups to complete: 100%|██████████| 4/4 [00:19<00:00, 4.82s/group]
Running prompto via the command line¶
We could also have run this experiment with prompto via the command line. The command is as follows (assuming that your working directory is the current directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/openai-multiple-models.jsonl --parallel True --max-queries-json experiment_3_parallel_config.json
where experiment_3_parallel_config.json is a JSON file that specifies the rate limit for each of the models:
{
"openai": {
"gpt-3.5-turbo": 500,
"gpt-4": 500,
"gpt-4o": 500
}
}
But for this notebook, we will time the experiments and save them to the sync_times and prompto_times dictionaries.
Querying the models without parallel processing¶
We will also compare the runtime of obtaining responses from each of the models using a synchronous Python for loop versus using prompto to query the models asynchronously but without parallel processing (parallel processing was covered in the section above).
Experiment setup¶
We will create our experiment files using the generate_experiment_1_file function in the dataset_utils.py file in this directory. This function takes the prompts and creates a jsonl file with them in the format that prompto expects. We will save these input files into ./data/input and use ./data as our pipeline data folder.
See the pipeline data docs for more information about the pipeline data folder.
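As an illustration, a single line of one of these single-model files might look like the following (the exact field names and contents are determined by generate_experiment_1_file, so treat this as an assumed example):

{"id": "gpt-4-0", "api": "openai", "model_name": "gpt-4", "prompt": "...", "parameters": {"n": 1, "temperature": 0.9, "max_tokens": 100}}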
OPENAI_GPT_35_TURBO_EXPERIMENT_FILE = "./data/input/openai-gpt-3pt5-turbo.jsonl"
OPENAI_GPT_4_EXPERIMENT_FILE = "./data/input/openai-gpt-4.jsonl"
OPENAI_GPT_4O_EXPERIMENT_FILE = "./data/input/openai-gpt-4o.jsonl"
INPUT_EXPERIMENT_FILEDIR = "./data/input"
if not os.path.isdir(INPUT_EXPERIMENT_FILEDIR):
    os.mkdir(INPUT_EXPERIMENT_FILEDIR)
generate_experiment_1_file(
    path=OPENAI_GPT_35_TURBO_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name="gpt-3.5-turbo",
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
generate_experiment_1_file(
    path=OPENAI_GPT_4_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name="gpt-4",
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
generate_experiment_1_file(
    path=OPENAI_GPT_4O_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name="gpt-4o",
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
For each model, we will compare the runtime of obtaining 100 responses using prompto against the runtime of a synchronous Python for loop. We use the send_prompts_sync function defined above for the synchronous approach, and we run the prompto experiments using the prompto.experiment.Experiment.process method. The timings are stored in the sync_times and prompto_times dictionaries initialised earlier.
print(
    f"len(load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE))}"
)
print(
    f"len(load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE))}"
)
print(
    f"len(load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE))}"
)
len(load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE)): 100
GPT-3.5-turbo¶
Running the experiment synchronously¶
start = time.time()
openai_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE)
)
sync_times["gpt-3.5-turbo"] = time.time() - start
100%|██████████| 100/100 [02:10<00:00, 1.31s/it]
Running the experiment asynchronously with prompto¶
openai_experiment = Experiment(
file_name="openai-gpt-3pt5-turbo.jsonl",
settings=Settings(data_folder="./data", max_queries=500),
)
start = time.time()
openai_responses, _ = await openai_experiment.process()
prompto_times["gpt-3.5-turbo"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 0%| | 0/100 [00:00<?, ?query/s]
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.20query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:02<00:00, 48.86query/s]
sync_times["gpt-3.5-turbo"], prompto_times["gpt-3.5-turbo"]
(130.72972202301025, 14.28760814666748)
GPT-4¶
Running the experiment synchronously¶
start = time.time()
gpt4_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE)
)
sync_times["gpt4"] = time.time() - start
100%|██████████| 100/100 [06:32<00:00, 3.92s/it]
Running the experiment asynchronously with prompto¶
gpt4_experiment = Experiment(
file_name="openai-gpt-4.jsonl",
settings=Settings(data_folder="./data", max_queries=500),
)
start = time.time()
gpt4_responses, _ = await gpt4_experiment.process()
prompto_times["gpt4"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.14query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:07<00:00, 13.36query/s]
sync_times["gpt4"], prompto_times["gpt4"]
(392.211834192276, 19.79161500930786)
GPT-4o¶
Running the experiment synchronously¶
start = time.time()
gpt4o_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE)
)
sync_times["gpt4o"] = time.time() - start
100%|██████████| 100/100 [04:01<00:00, 2.41s/it]
Running the experiment asynchronously with prompto¶
gpt4o_experiment = Experiment(
file_name="openai-gpt-4o.jsonl",
settings=Settings(data_folder="./data", max_queries=500),
)
start = time.time()
gpt4o_responses, _ = await gpt4o_experiment.process()
prompto_times["gpt4o"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.14query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:05<00:00, 17.25query/s]
sync_times["gpt4o"], prompto_times["gpt4o"]
(241.2371437549591, 18.114527940750122)
Running prompto via the command line¶
We can also run the above experiments via the command line. The commands are as follows (assuming that your working directory is the current directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/openai-gpt-3pt5-turbo.jsonl --max-queries 500
prompto_run_experiment --file data/input/openai-gpt-4.jsonl --max-queries 500
prompto_run_experiment --file data/input/openai-gpt-4o.jsonl --max-queries 500
But for this notebook, we will time the experiments and save them to the sync_times and prompto_times dictionaries.
Analysis¶
Here, we report the final runtimes for each model and the difference in time between the prompto and synchronous Python for loop approaches:
sync_times
{'gpt-3.5-turbo': 130.72972202301025, 'gpt4': 392.211834192276, 'gpt4o': 241.2371437549591, 'overall': 705.384626865387}
prompto_times
{'gpt-3.5-turbo': 14.28760814666748, 'gpt4': 19.79161500930786, 'gpt4o': 18.114527940750122, 'overall': 19.298332929611206}
We can see a significant speed-up within each model (i.e. a direct comparison of prompto vs. a synchronous Python for loop for a specific model) as well as across models (i.e. a comparison of prompto with parallel processing vs. a synchronous Python for loop over all of the models). Notice that the overall runtime with parallel processing is roughly just the time it takes to query the model with the longest runtime (in this case GPT-4).
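To quantify the speed-up, we can divide the synchronous runtime by the corresponding prompto runtime for each entry in the timing dictionaries:

for key in sync_times:
    # ratio of synchronous runtime to prompto runtime
    print(f"{key}: {sync_times[key] / prompto_times[key]:.1f}x faster with prompto")

With the timings above, this works out to roughly a 9x speed-up for gpt-3.5-turbo, 20x for gpt-4, 13x for gpt-4o, and around 37x overall when querying the three models in parallel.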