Querying different models from the same endpoint: prompto vs. synchronous Python for loop¶
import time
import os
from tqdm import tqdm
from dotenv import load_dotenv
from prompto.settings import Settings
from prompto.experiment import Experiment
from api_utils import send_prompt
from dataset_utils import (
    load_prompt_dicts,
    load_prompts,
    generate_experiment_1_file,
    generate_experiment_3_file,
)
In this experiment, we compare the performance of prompto, which uses asynchronous programming to query model API endpoints, with a traditional synchronous Python for loop. Specifically, we measure the time it takes prompto to obtain 100 responses from 3 different models on the same API endpoint, and the time it takes a synchronous Python for loop to obtain the same 100 responses for each model.
We will see that prompto obtains the responses from the models much faster than the synchronous Python for loop, especially when using parallel processing to query the models in parallel.
For this experiment, we choose to query three different models from the OpenAI API: gpt-3.5-turbo, gpt-4 and gpt-4o.
For this experiment, we will need to set the following environment variable:
OPENAI_API_KEY: the API key for the OpenAI API
To set this environment variable, you can simply put it in a .env file which specifies it as a key-value pair:
OPENAI_API_KEY=<YOUR-OPENAI-KEY>
Once you have created this file, you can run the following, which should return True if the file has been found and loaded, or False otherwise:
load_dotenv(dotenv_path=".env")
True
For the experiment, we take a sample of 100 prompts from alpaca_data.json in the tatsu-lab/stanford_alpaca GitHub repo, using the prompt template provided by the authors of the repo. To see how we obtained the prompts, please refer to the alpaca_sample_generation.ipynb notebook.
alpaca_prompts = load_prompts("./sample_prompts.json")
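The load_prompts helper is defined in dataset_utils.py in this directory. As a rough sketch only, and assuming sample_prompts.json stores a plain JSON list of prompt strings, it could look something like this:

import json

def load_prompts_sketch(path: str) -> list[str]:
    # hypothetical helper: read a JSON file containing a list of prompt strings
    with open(path, "r") as f:
        return json.load(f)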
Synchronous approach¶
For the synchronous approach, we simply use a for loop to query the API endpoints:
def send_prompts_sync(prompt_dicts: list[dict]) -> list[str]:
    # naive for loop to synchronously dispatch prompts, one at a time
    return [send_prompt(prompt_dict) for prompt_dict in tqdm(prompt_dicts)]
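The send_prompt function is imported from api_utils.py in this directory. As a rough, hypothetical sketch of what such a blocking call to the OpenAI chat completions endpoint could look like (the helper name and the dictionary keys below are assumptions, not the actual implementation in api_utils.py):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def send_prompt_sketch(prompt_dict: dict) -> str:
    # blocking request: each call only starts once the previous one has finished
    response = client.chat.completions.create(
        model=prompt_dict["model_name"],
        messages=[{"role": "user", "content": prompt_dict["prompt"]}],
        **prompt_dict.get("parameters", {}),
    )
    return response.choices[0].message.content

Because each request blocks until its response is returned, the total runtime of send_prompts_sync is roughly the sum of the individual response times.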
Querying the models in parallel with prompto¶
We first look at querying each of the models in parallel using prompto.
Experiment setup¶
We will create our experiment file using the generate_experiment_3_file function in the dataset_utils.py file in this directory. This function takes the prompts and creates a jsonl file with them in the format that prompto expects. We will save this input file into ./data/input and use ./data as our pipeline data folder.
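Each line of the generated jsonl file is a JSON object describing one query; with three model names, each prompt is repeated once per model, giving 100 x 3 = 300 lines. A minimal sketch of such a generator, assuming prompto's standard input fields (the actual implementation is generate_experiment_3_file in dataset_utils.py):

import json

def generate_multi_model_file_sketch(
    path: str, prompts: list[str], api: str, model_names: list[str], params: dict
) -> None:
    # hypothetical helper: write one jsonl line per (prompt, model) pair
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            for model_name in model_names:
                line = {
                    "id": f"{model_name}-{i}",
                    "api": api,
                    "model_name": model_name,
                    "prompt": prompt,
                    "parameters": params,
                }
                f.write(json.dumps(line) + "\n")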
OPENAI_MULTIPLE_EXPERIMENT_FILE = "./data/input/openai-multiple-models.jsonl"
generate_experiment_3_file(
    path=OPENAI_MULTIPLE_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name=["gpt-3.5-turbo", "gpt-4", "gpt-4o"],
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
print(
"len(load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE)): "
f"{len(load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE))}"
)
len(load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE)): 300
First, as a baseline, we initialise the timing dictionaries and send all 300 prompts with the synchronous for loop, recording the total time taken:
sync_times = {}
prompto_times = {}

start = time.time()
overall_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_MULTIPLE_EXPERIMENT_FILE)
)
sync_times["overall"] = time.time() - start
100%|██████████| 300/300 [11:45<00:00, 2.35s/it]
Specifying the rate limits for each model for parallel processing¶
Notice here that we are setting parallel=True in the Settings object, as well as specifying a rate limit for each of the models; we set each of them to 500 queries per minute. We do this by passing a dictionary to the max_queries_dict argument of the Settings object: the keys are API names, and each value is itself a dictionary mapping the model names we wish to process in parallel to their rate limits.
For details of how to specify rate limits for different models in the same API, see the Specifying rate limits docs and the Grouping prompts and specifying rate limits notebook.
Note that in the previous experiment, we also used parallel processing, but in a slightly different way, as we were parallelising the querying of different APIs.
multiple_models_experiment = Experiment(
    file_name="openai-multiple-models.jsonl",
    settings=Settings(
        data_folder="./data",
        parallel=True,
        max_queries_dict={
            "openai": {"gpt-3.5-turbo": 500, "gpt-4": 500, "gpt-4o": 500}
        },
    ),
)
start = time.time()
multiple_models_responses, _ = await multiple_models_experiment.process()
prompto_times["overall"] = time.time() - start
Waiting for all groups to complete: 0%| | 0/4 [00:00<?, ?group/s]
Sending 0 queries at 10 QPM with RI of 6.0s for group openai (attempt 1/3): 0query [00:00, ?query/s]
Waiting for responses for group openai (attempt 1/3): 0query [00:00, ?query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai-gpt-3.5-turbo (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 7.73query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai-gpt-4 (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 7.72query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai-gpt-4o (attempt 1/3): 100%|██████████| 100/100 [00:13<00:00, 7.69query/s]
Waiting for responses for group openai-gpt-3.5-turbo (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 57.16query/s]
Waiting for all groups to complete: 50%|█████ | 2/4 [00:14<00:14, 7.35s/group]
Waiting for responses for group openai-gpt-4o (attempt 1/3): 100%|██████████| 100/100 [00:03<00:00, 26.66query/s]
Waiting for all groups to complete: 75%|███████▌ | 3/4 [00:16<00:05, 5.15s/group]
Waiting for responses for group openai-gpt-4 (attempt 1/3): 100%|██████████| 100/100 [00:06<00:00, 15.87query/s]
Waiting for all groups to complete: 100%|██████████| 4/4 [00:19<00:00, 4.82s/group]
Running prompto via the command line¶
We could also have run this experiment with prompto via the command line. The command is as follows (assuming that your working directory is the current directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/openai-multiple-models.jsonl --parallel True --max-queries-json experiment_3_parallel_config.json
where experiment_3_parallel_config.json is a JSON file that specifies the rate limit for each of the models:
{
"openai": {
"gpt-3.5-turbo": 500,
"gpt-4": 500,
"gpt-4o": 500
}
}
But for this notebook, we will time the experiments and save them to the sync_times and prompto_times dictionaries.
Querying the models without parallel processing¶
We will also compare the runtime of obtaining responses from each of the models using a synchronous Python for loop versus using prompto to query the models asynchronously but without parallel processing (parallel processing was covered in the section above).
Experiment setup¶
We will create our experiment files using the generate_experiment_1_file function in the dataset_utils.py file in this directory. This function takes the prompts and creates a jsonl file with them in the format that prompto expects. We will save these input files into ./data/input and use ./data as our pipeline data folder.
See the pipeline data docs for more information about the pipeline data folder.
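As an illustration, a single line of one of these single-model files might look like the following (the exact field names and contents are determined by generate_experiment_1_file, so treat this as an assumed example):

{"id": "gpt-4-0", "api": "openai", "model_name": "gpt-4", "prompt": "...", "parameters": {"n": 1, "temperature": 0.9, "max_tokens": 100}}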
OPENAI_GPT_35_TURBO_EXPERIMENT_FILE = "./data/input/openai-gpt-3pt5-turbo.jsonl"
OPENAI_GPT_4_EXPERIMENT_FILE = "./data/input/openai-gpt-4.jsonl"
OPENAI_GPT_4O_EXPERIMENT_FILE = "./data/input/openai-gpt-4o.jsonl"
INPUT_EXPERIMENT_FILEDIR = "./data/input"
if not os.path.isdir(INPUT_EXPERIMENT_FILEDIR):
    os.mkdir(INPUT_EXPERIMENT_FILEDIR)
generate_experiment_1_file(
    path=OPENAI_GPT_35_TURBO_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name="gpt-3.5-turbo",
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
generate_experiment_1_file(
    path=OPENAI_GPT_4_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name="gpt-4",
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
generate_experiment_1_file(
    path=OPENAI_GPT_4O_EXPERIMENT_FILE,
    prompts=alpaca_prompts,
    api="openai",
    model_name="gpt-4o",
    params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
For each model, we will compare the runtime of obtaining 100 responses using prompto against the runtime of a synchronous Python for loop. We use the send_prompts_sync function defined above for the synchronous approach, and we run the prompto experiments using the prompto.experiment.Experiment.process method. The timings are stored in the sync_times and prompto_times dictionaries initialised earlier.
print(
    f"len(load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE))}"
)
print(
    f"len(load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE))}"
)
print(
    f"len(load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE))}"
)
len(load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE)): 100
GPT-3.5-turbo¶
Running the experiment synchronously¶
start = time.time()
openai_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_GPT_35_TURBO_EXPERIMENT_FILE)
)
sync_times["gpt-3.5-turbo"] = time.time() - start
100%|██████████| 100/100 [02:10<00:00, 1.31s/it]
Running the experiment asynchronously with prompto¶
openai_experiment = Experiment(
file_name="openai-gpt-3pt5-turbo.jsonl",
settings=Settings(data_folder="./data", max_queries=500),
)
start = time.time()
openai_responses, _ = await openai_experiment.process()
prompto_times["gpt-3.5-turbo"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 0%| | 0/100 [00:00<?, ?query/s]
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.20query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:02<00:00, 48.86query/s]
sync_times["gpt-3.5-turbo"], prompto_times["gpt-3.5-turbo"]
(130.72972202301025, 14.28760814666748)
GPT-4¶
Running the experiment synchronously¶
start = time.time()
gpt4_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_GPT_4_EXPERIMENT_FILE)
)
sync_times["gpt4"] = time.time() - start
100%|██████████| 100/100 [06:32<00:00, 3.92s/it]
Running the experiment asynchronously with prompto¶
gpt4_experiment = Experiment(
file_name="openai-gpt-4.jsonl",
settings=Settings(data_folder="./data", max_queries=500),
)
start = time.time()
gpt4_responses, _ = await gpt4_experiment.process()
prompto_times["gpt4"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.14query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:07<00:00, 13.36query/s]
sync_times["gpt4"], prompto_times["gpt4"]
(392.211834192276, 19.79161500930786)
GPT-4o¶
Running the experiment synchronously¶
start = time.time()
gpt4o_sync = send_prompts_sync(
    prompt_dicts=load_prompt_dicts(OPENAI_GPT_4O_EXPERIMENT_FILE)
)
sync_times["gpt4o"] = time.time() - start
100%|██████████| 100/100 [04:01<00:00, 2.41s/it]
Running the experiment asynchronously with prompto¶
gpt4o_experiment = Experiment(
file_name="openai-gpt-4o.jsonl",
settings=Settings(data_folder="./data", max_queries=500),
)
start = time.time()
gpt4o_responses, _ = await gpt4o_experiment.process()
prompto_times["gpt4o"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.14query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:05<00:00, 17.25query/s]
sync_times["gpt4o"], prompto_times["gpt4o"]
(241.2371437549591, 18.114527940750122)
Running prompto via the command line¶
We can also run the above experiments via the command line. The commands are as follows (assuming that your working directory is the current directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/openai-gpt-3pt5-turbo.jsonl --max-queries 500
prompto_run_experiment --file data/input/openai-gpt-4.jsonl --max-queries 500
prompto_run_experiment --file data/input/openai-gpt-4o.jsonl --max-queries 500
But for this notebook, we will time the experiments and save them to the sync_times and prompto_times dictionaries.
Analysis¶
Here, we report the final runtimes for each model and the difference in time between the prompto and synchronous Python for loop approaches:
sync_times
{'gpt-3.5-turbo': 130.72972202301025, 'gpt4': 392.211834192276, 'gpt4o': 241.2371437549591, 'overall': 705.384626865387}
prompto_times
{'gpt-3.5-turbo': 14.28760814666748, 'gpt4': 19.79161500930786, 'gpt4o': 18.114527940750122, 'overall': 19.298332929611206}
We can see a significant speed-up within each model (i.e. a direct comparison of prompto vs. a synchronous Python for loop for a specific model) as well as across models (i.e. a comparison of prompto with parallel processing vs. a synchronous Python for loop over all of the models). Notice that the overall runtime with parallel processing is roughly just the time it takes to query the model with the longest runtime (in this case GPT-4).
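To quantify the speed-up, we can divide the synchronous runtime by the corresponding prompto runtime for each entry in the timing dictionaries:

for key in sync_times:
    # ratio of synchronous runtime to prompto runtime
    print(f"{key}: {sync_times[key] / prompto_times[key]:.1f}x faster with prompto")

With the timings above, this works out to roughly a 9x speed-up for gpt-3.5-turbo, 20x for gpt-4, 13x for gpt-4o, and around 37x overall when querying the three models in parallel.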