Querying different LLM endpoints: prompto with parallel processing vs. synchronous Python for loop¶
import time
import os
import requests
from tqdm import tqdm
from dotenv import load_dotenv
from prompto.settings import Settings
from prompto.experiment import Experiment
from api_utils import send_prompt
from dataset_utils import load_prompt_dicts, load_prompts, generate_experiment_2_file
In this experiment, we compare the performance of prompto, which uses asynchronous programming to query model API endpoints, against a traditional synchronous Python for loop. We measure the time it takes prompto to obtain 100 responses from each of several model API endpoints in parallel versus the time it takes a synchronous Python for loop to obtain the same 100 responses from each endpoint. We will see that prompto queries the different endpoints in parallel, which is much faster than the synchronous Python for loop.
We choose three API endpoints for this experiment:
- OpenAI API
- Gemini API
- Ollama API (which is locally hosted)
For this experiment, we will need to set up the following environment variables:
- OPENAI_API_KEY: the API key for the OpenAI API
- GEMINI_API_KEY: the API key for the Gemini API
- OLLAMA_API_ENDPOINT: the endpoint for the Ollama API
To set these environment variables, one can simply list them as key-value pairs in a .env file:
OPENAI_API_KEY=<YOUR-OPENAI-KEY>
GEMINI_API_KEY=<YOUR-GEMINI-KEY>
OLLAMA_API_ENDPOINT=<YOUR-OLLAMA-ENDPOINT>
If you have created this file, you can run the following, which should return True if it has found and loaded a .env file, or False otherwise:
load_dotenv(dotenv_path=".env")
True
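As an optional sanity check (an illustrative snippet, not part of the original pipeline), we can confirm that the three variables are visible to the process without printing their values:

for var in ["OPENAI_API_KEY", "GEMINI_API_KEY", "OLLAMA_API_ENDPOINT"]:
    # report only whether each variable is set, never its value
    print(var, "is set" if os.environ.get(var) else "is MISSING")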
Synchronous approach¶
For the synchronous approach, we simply use a for loop to query the API endpoints:
def send_prompts_sync(prompt_dicts: list[dict]) -> list[str]:
# naive for loop to synchronously dispatch prompts
return [send_prompt(prompt_dict) for prompt_dict in tqdm(prompt_dicts)]
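The send_prompt helper lives in api_utils.py in this directory. Conceptually, it dispatches a single prompt dict to the matching client and blocks until the response arrives, so the for loop above processes one prompt at a time. As a rough sketch of the idea only (this is not the actual helper, covers just the OpenAI branch of the three APIs, and assumes the prompt dict keys used elsewhere in this notebook):

from openai import OpenAI

def send_prompt_openai_sketch(prompt_dict: dict) -> str:
    # blocking call: this returns only once the API has responded
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=prompt_dict["model_name"],
        messages=[{"role": "user", "content": prompt_dict["prompt"]}],
        **prompt_dict.get("parameters", {}),
    )
    return response.choices[0].message.content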
Experiment setup¶
For the experiment, we take a sample of 100 prompts from the alpaca_data.json file in the tatsu-lab/stanford_alpaca GitHub repo and use the prompt template provided by the authors of the repo. To see how we obtain the prompts, please refer to the alpaca_sample_generation.ipynb notebook.
alpaca_prompts = load_prompts("./sample_prompts.json")
We will create our experiment files using the generate_experiment_2_file function in the dataset_utils.py file in this directory. This function takes these prompts and creates a jsonl file with the prompts in the format that prompto expects. We will save these input files into ./data/input and use ./data as our pipeline data folder.
See the pipeline data docs for more information about the pipeline data folder.
COMBINED_EXPERIMENT_FILENAME = "./data/input/all_experiments.jsonl"
INPUT_EXPERIMENT_FILEDIR = "./data/input"
if not os.path.isdir(INPUT_EXPERIMENT_FILEDIR):
os.mkdir(INPUT_EXPERIMENT_FILEDIR)
Notice that we query the following models:
- gpt-3.5-turbo for the OpenAI API
- gemini-1.5-flash for the Gemini API
- llama3 (8B, 4-bit quantised) for the Ollama API
Note also that each API uses different argument names for its generation configuration.
generate_experiment_2_file(
path=COMBINED_EXPERIMENT_FILENAME,
prompts=alpaca_prompts,
api=["openai", "gemini", "ollama"],
model_name=["gpt-3.5-turbo", "gemini-1.5-flash", "llama3"],
params=[
{"n": 1, "temperature": 0.9, "max_tokens": 100},
{"candidate_count": 1, "temperature": 0.9, "max_output_tokens": 100},
{"temperature": 0.9, "num_predict": 100, "seed": 42},
],
)
print(
f"len(load_prompt_dicts(COMBINED_EXPERIMENT_FILENAME)): {len(load_prompt_dicts(COMBINED_EXPERIMENT_FILENAME))}"
)
len(load_prompt_dicts(COMBINED_EXPERIMENT_FILENAME)): 300
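For reference, each line of the generated file is a single JSON object describing one query. An illustrative line for one of the OpenAI prompts might look like the following (the exact keys are determined by generate_experiment_2_file and prompto's input format, so treat this as an assumed example rather than verbatim output):

{"id": 0, "api": "openai", "model_name": "gpt-3.5-turbo", "prompt": "...", "parameters": {"n": 1, "temperature": 0.9, "max_tokens": 100}}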
Running the experiment synchronously¶
Before running the experiment, we first send an empty prompt request to the Ollama server with the llama3 model to 1) check that the model is available and working, and 2) ensure that the model is loaded in memory, since sending an empty request to Ollama pre-loads the model.
requests.post(
f"{os.environ.get('OLLAMA_API_ENDPOINT')}/api/generate", json={"model": "llama3"}
)
<Response [200]>
We use the send_prompts_sync function defined above for the synchronous Python for loop approach; in the next section, we will run the same experiment with prompto using the prompto.experiment.Experiment.process method.
start = time.time()
multiple_api_sync = send_prompts_sync(
prompt_dicts=load_prompt_dicts(COMBINED_EXPERIMENT_FILENAME)
)
sync_time = time.time() - start
100%|██████████| 300/300 [09:18<00:00, 1.86s/it]
Running the experiment asynchronously with prompto¶
Having run the synchronous Python for loop above, we now send the same prompts through the prompto pipeline with parallel processing so that we can compare the runtimes of the two approaches.
Notice here that we are setting parallel=True in the Settings object as well as specifying rate limits for each of the APIs: 500 queries per minute for the OpenAI and Gemini APIs and 50 queries per minute for the Ollama API. We do this by passing a dictionary to the max_queries_dict argument of the Settings object, with API names as keys and rate limits as values.
For details of how to specify rate limits, see the Specifying rate limits docs and the Grouping prompts and specifying rate limits notebook.
multiple_api_experiment = Experiment(
file_name="all_experiments.jsonl",
settings=Settings(
data_folder="./data",
parallel=True,
max_queries_dict={"openai": 500, "gemini": 500, "ollama": 50},
),
)
start = time.time()
multiple_api_responses, _ = await multiple_api_experiment.process()
prompto_time = time.time() - start
Waiting for all groups to complete: 0%| | 0/3 [00:00<?, ?group/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group openai (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.05query/s]
Sending 100 queries at 500 QPM with RI of 0.12s for group gemini (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.03query/s]
Waiting for responses for group gemini (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 61.42query/s]
Waiting for responses for group openai (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 58.39query/s]
Sending 100 queries at 50 QPM with RI of 1.2s for group ollama (attempt 1/3): 100%|██████████| 100/100 [02:00<00:00, 1.21s/query]
Waiting for responses for group ollama (attempt 1/3): 100%|██████████| 100/100 [01:58<00:00, 1.18s/query]
Waiting for all groups to complete: 100%|██████████| 3/3 [04:29<00:00, 89.67s/group]
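To inspect what came back, each completed record is just the original prompt dict with the model output attached; the "response" key below is our assumption about prompto's output convention, so adjust if your version stores it differently:

# illustrative: peek at the first completed record
first = multiple_api_responses[0]
print(first["api"], first["model_name"])
print(first.get("response"))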
Running prompto via the command line¶
We could also have run the experiments via the command line. The command is as follows (assuming that your working directory is the current directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/all_experiments.jsonl --parallel True --max-queries-json experiment_2_parallel_config.json
where experiment_2_parallel_config.json
is a JSON file that specifies the rate limits for each of the API endpoints:
{
"openai": 500,
"gemini": 500,
"ollama": 50
}
But for this notebook, we time the experiments in Python and store the runtimes in the sync_time and prompto_time variables.
Analysis¶
Here, we report the final runtimes of the synchronous Python for loop and prompto approaches:
sync_time, prompto_time
(558.7412779331207, 269.0622651576996)
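To put these numbers in perspective, here is a small illustrative snippet that computes the speed-up from the timings above:

# prompto is roughly 2x faster overall here, limited by the slow Ollama group
print(f"prompto was {sync_time / prompto_time:.2f}x faster, saving {sync_time - prompto_time:.0f} seconds")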
We can see that the prompto approach is much faster than the synchronous Python for loop approach for querying the different model API endpoints. Comparing with the results from the previous notebook, the prompto runtime is very close to the time it took to process the Ollama requests alone. This is because the Ollama API has a much longer computation time per request and we are also running it at a lower rate limit. When querying different APIs or models in parallel, you are simply limited by the slowest API or model.