Querying different LLM endpoints: prompto vs. synchronous Python for loop¶
import time
import os
import requests
from tqdm import tqdm
from dotenv import load_dotenv
from prompto.settings import Settings
from prompto.experiment import Experiment
from api_utils import send_prompt
from dataset_utils import load_prompt_dicts, load_prompts, generate_experiment_1_file
In this experiment, we want to compare the performance of prompto, which uses asynchronous programming to query model API endpoints, with a traditional synchronous Python for loop. Specifically, we compare the time it takes prompto to obtain 100 responses from a model API endpoint with the time it takes a synchronous Python for loop to obtain the same 100 responses.
We will see that prompto is able to obtain the 100 responses faster than the synchronous Python for loop.
We choose three API endpoints for this experiment:
- OpenAI API
- Gemini API
- Ollama API (which is locally hosted)
For this experiment, we will need to set up the following environment variables:
- OPENAI_API_KEY: the API key for the OpenAI API
- GEMINI_API_KEY: the API key for the Gemini API
- OLLAMA_API_ENDPOINT: the endpoint for the Ollama API
To set these environment variables, one can simply define them as key-value pairs in a .env file:
OPENAI_API_KEY=<YOUR-OPENAI-KEY>
GEMINI_API_KEY=<YOUR-GEMINI-KEY>
OLLAMA_API_ENDPOINT=<YOUR-OLLAMA-ENDPOINT>
If you create this file, you can run the following, which should return True if the file has been found and loaded, or False otherwise:
load_dotenv(dotenv_path=".env")
True
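Optionally, we can also check that the three variables listed above are actually present in the environment before running anything. This is just a convenience check and is not required by prompto:
for var in ["OPENAI_API_KEY", "GEMINI_API_KEY", "OLLAMA_API_ENDPOINT"]:
    # warn early if a required environment variable is missing
    if os.environ.get(var) is None:
        print(f"Warning: {var} is not set")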
Synchronous approach¶
For the synchronous approach, we simply use a for loop to query the API endpoints:
def send_prompts_sync(prompt_dicts: list[dict]) -> list[str]:
# naive for loop to synchronously dispatch prompts
return [send_prompt(prompt_dict) for prompt_dict in tqdm(prompt_dicts)]
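The send_prompt helper imported from api_utils.py sends a single prompt dict to the relevant API and blocks until the response arrives. Purely as an illustration of what such a synchronous call looks like, a minimal sketch for the Ollama case is shown below; the endpoint path and field names (model_name, prompt, parameters) are assumptions chosen to match the experiment files generated later in this notebook, and the actual implementation lives in api_utils.py.
def send_prompt_ollama_sketch(prompt_dict: dict) -> str:
    # hypothetical synchronous sender for the Ollama API only
    # (uses the os and requests imports from the top of the notebook)
    endpoint = os.environ["OLLAMA_API_ENDPOINT"]
    response = requests.post(
        f"{endpoint}/api/generate",
        json={
            "model": prompt_dict["model_name"],
            "prompt": prompt_dict["prompt"],
            "options": prompt_dict.get("parameters", {}),
            "stream": False,
        },
    )
    response.raise_for_status()
    return response.json()["response"]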
Experiment setup¶
For the experiment, we take a sample of 100 prompts from the alpaca_data.json file in the tatsu-lab/stanford_alpaca GitHub repo, using the prompt template provided by the authors of the repo. To see how we obtained the prompts, please refer to the alpaca_sample_generation.ipynb notebook.
alpaca_prompts = load_prompts("./sample_prompts.json")
We will create our experiment files using the generate_experiment_1_file function in the dataset_utils.py file in this directory. This function takes these prompts and creates a jsonl file with the prompts in the format that prompto expects. We will save these input files into ./data/input and use ./data as our pipeline data folder.
See the pipeline data docs for more information about the pipeline data folder.
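As a rough sketch of what generate_experiment_1_file does, each prompt is written out as one JSON object per line. The exact keys are defined by dataset_utils.py and prompto's experiment file format; the version below assumes the fields prompt, api, model_name and parameters, matching the arguments we pass in the cells that follow:
import json

def generate_experiment_file_sketch(
    path: str, prompts: list[str], api: str, model_name: str, params: dict
) -> None:
    # write one JSON object per line (jsonl), one object per prompt
    with open(path, "w") as f:
        for prompt in prompts:
            line = {
                "prompt": prompt,
                "api": api,
                "model_name": model_name,
                "parameters": params,
            }
            f.write(json.dumps(line) + "\n")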
OPENAI_EXPERIMENT_FILE = "./data/input/openai.jsonl"
GEMINI_EXPERIMENT_FILE = "./data/input/gemini.jsonl"
OLLAMA_EXPERIMENT_FILE = "./data/input/ollama.jsonl"
INPUT_EXPERIMENT_FILEDIR = "./data/input"
if not os.path.isdir(INPUT_EXPERIMENT_FILEDIR):
os.mkdir(INPUT_EXPERIMENT_FILEDIR)
Notice that we query the following models:
- gpt-3.5-turbo for the OpenAI API
- gemini-1.5-flash for the Gemini API
- llama3 (8B, 4-bit quantised) for the Ollama API
Also note that each API uses different argument names for its generation configuration.
generate_experiment_1_file(
path=OPENAI_EXPERIMENT_FILE,
prompts=alpaca_prompts,
api="openai",
model_name="gpt-3.5-turbo",
params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
generate_experiment_1_file(
path=GEMINI_EXPERIMENT_FILE,
prompts=alpaca_prompts,
api="gemini",
model_name="gemini-1.5-flash",
params={"candidate_count": 1, "temperature": 0.9, "max_output_tokens": 100},
)
generate_experiment_1_file(
path=OLLAMA_EXPERIMENT_FILE,
prompts=alpaca_prompts,
api="ollama",
model_name="llama3",
params={"temperature": 0.9, "num_predict": 100, "seed": 42},
)
For each API, we will compare runtimes for using prompto and a synchronous Python for loop to obtain 100 responses from the API. We use the send_prompts_sync function defined above for the synchronous approach, and we can run prompto experiments using the prompto.experiment.Experiment.process method.
sync_times = {}
prompto_times = {}
print(
f"len(load_prompt_dicts(OPENAI_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_EXPERIMENT_FILE))}"
)
print(
f"len(load_prompt_dicts(GEMINI_EXPERIMENT_FILE)): {len(load_prompt_dicts(GEMINI_EXPERIMENT_FILE))}"
)
print(
f"len(load_prompt_dicts(OLLAMA_EXPERIMENT_FILE)): {len(load_prompt_dicts(OLLAMA_EXPERIMENT_FILE))}"
)
len(load_prompt_dicts(OPENAI_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(GEMINI_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(OLLAMA_EXPERIMENT_FILE)): 100
OpenAI¶
Running the experiment synchronously¶
start = time.time()
openai_sync = send_prompts_sync(prompt_dicts=load_prompt_dicts(OPENAI_EXPERIMENT_FILE))
sync_times["openai"] = time.time() - start
100%|██████████| 100/100 [02:06<00:00, 1.26s/it]
Running the experiment asynchronously with prompto¶
For running prompto with the OpenAI API, we can send prompts at 500 QPM (queries per minute). It is possible to be on a tier with a higher rate limit, but we will use the 500 QPM rate limit for this experiment.
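As a quick sanity check on these settings: at 500 QPM, prompto sends roughly one query every 60 / 500 = 0.12 seconds, which matches the "RI of 0.12s" shown in the progress bar below (RI being the interval between consecutive requests).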
openai_experiment = Experiment(
file_name="openai.jsonl", settings=Settings(data_folder="./data", max_queries=500)
)
start = time.time()
openai_responses, _ = await openai_experiment.process()
prompto_times["openai"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.20query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 58.66query/s]
sync_times["openai"], prompto_times["openai"]
(126.30979299545288, 13.91887378692627)
Gemini¶
Running the experiment synchronously¶
start = time.time()
gemini_sync = send_prompts_sync(prompt_dicts=load_prompt_dicts(GEMINI_EXPERIMENT_FILE))
sync_times["gemini"] = time.time() - start
100%|██████████| 100/100 [02:43<00:00, 1.63s/it]
Running the experiment asynchronously with prompto¶
As with the OpenAI API, when running prompto with the Gemini API we can send prompts at 500 QPM. It is possible to be on a tier with a higher rate limit, but we will use the 500 QPM rate limit for this experiment.
gemini_experiment = Experiment(
file_name="gemini.jsonl", settings=Settings(data_folder="./data", max_queries=500)
)
start = time.time()
gemini_responses, _ = await gemini_experiment.process()
prompto_times["gemini"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.17query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 55.77query/s]
sync_times["gemini"], prompto_times["gemini"]
(163.48729801177979, 14.094270944595337)
Ollama¶
Before running the Ollama experiment, we first send an empty prompt request with the llama3 model to 1) check that the model is available and working, and 2) ensure that the model is loaded into memory, since sending an empty request causes Ollama to pre-load the model.
requests.post(
f"{os.environ.get('OLLAMA_API_ENDPOINT')}/api/generate", json={"model": "llama3"}
)
<Response [200]>
Running the experiment synchronously¶
start = time.time()
ollama_sync = send_prompts_sync(prompt_dicts=load_prompt_dicts(OLLAMA_EXPERIMENT_FILE))
sync_times["ollama"] = time.time() - start
100%|██████████| 100/100 [04:31<00:00, 2.71s/it]
Running the experiment asynchronously with prompto¶
For Ollama, we use a locally hosted API endpoint. While the Ollama API implements a queue system to allow asynchronous requests, it still only processes one request at a time, so we expect only a modest speedup when using prompto to query the Ollama API. We will use a 50 QPM rate limit for this experiment, as we are running Ollama on a 14" M1 Pro MacBook, which cannot handle much more than this.
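As a back-of-the-envelope estimate: at 50 QPM the requests are dispatched roughly one every 60 / 50 = 1.2 seconds, so sending all 100 requests already takes around two minutes, and since Ollama generates responses one at a time at roughly 2.7 seconds per response (see the synchronous run above), the overall runtime is still dominated by the serial generation time. We should therefore expect prompto's runtime to be close to that of the synchronous for loop here.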
ollama_experiment = Experiment(
file_name="ollama.jsonl", settings=Settings(data_folder="./data", max_queries=50)
)
start = time.time()
ollama_responses, _ = await ollama_experiment.process()
prompto_times["ollama"] = time.time() - start
Sending 100 queries at 50 QPM with RI of 1.2s (attempt 1/3): 100%|██████████| 100/100 [02:00<00:00, 1.20s/query]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [01:58<00:00, 1.18s/query]
sync_times["ollama"], prompto_times["ollama"]
(271.4494206905365, 268.59372997283936)
Running prompto via the command line¶
We can also run the experiments via the command line. The commands are as follows (assuming that your working directory is the directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/openai.jsonl --max-queries 500
prompto_run_experiment --file data/input/gemini.jsonl --max-queries 500
prompto_run_experiment --file data/input/ollama.jsonl --max-queries 50
But for this notebook, we time the experiments in Python and save the runtimes to the sync_times and prompto_times dictionaries.
Analysis¶
Here, we report the final runtimes for each API and the difference in time between the prompto and synchronous Python for loop approaches:
sync_times
{'openai': 126.30979299545288, 'gemini': 163.48729801177979, 'ollama': 271.4494206905365}
prompto_times
{'openai': 13.91887378692627, 'gemini': 14.094270944595337, 'ollama': 268.59372997283936}
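To make the comparison concrete, we can also compute the speedup factor of prompto over the synchronous for loop for each API; with the numbers above this comes out to roughly 9x for OpenAI, almost 12x for Gemini, and essentially no speedup for Ollama:
for api in sync_times:
    # ratio of synchronous runtime to prompto runtime
    speedup = sync_times[api] / prompto_times[api]
    print(f"{api}: {speedup:.1f}x faster with prompto")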
We can see significant speedups when using prompto to query the OpenAI and Gemini APIs. For the Ollama API, we only see a modest speedup, since Ollama processes just one request at a time.