Querying different LLM endpoints: prompto vs. synchronous Python for loop¶
import time
import os
import requests
from tqdm import tqdm
from dotenv import load_dotenv
from prompto.settings import Settings
from prompto.experiment import Experiment
from api_utils import send_prompt
from dataset_utils import load_prompt_dicts, load_prompts, generate_experiment_1_file
In this experiment, we want to compare the performance of prompto, which uses asynchronous programming to query model API endpoints, with a traditional synchronous Python for loop. Specifically, we compare the time it takes prompto to obtain 100 responses from a model API endpoint with the time it takes a synchronous Python for loop to obtain the same 100 responses.
We will see that prompto is able to obtain the 100 responses faster than the synchronous Python for loop.
We choose three API endpoints for this experiment:
- OpenAI API
- Gemini API
- Ollama API (which is locally hosted)
For this experiment, we will need to set up the following environment variables:
- OPENAI_API_KEY: the API key for the OpenAI API
- GEMINI_API_KEY: the API key for the Gemini API
- OLLAMA_API_ENDPOINT: the endpoint for the Ollama API
To set these environment variables, one can simply define them as key-value pairs in a .env file:
OPENAI_API_KEY=<YOUR-OPENAI-KEY>
GEMINI_API_KEY=<YOUR-GEMINI-KEY>
OLLAMA_API_ENDPOINT=<YOUR-OLLAMA-ENDPOINT>
If you create this file, you can run the following, which should return True if the file has been found and loaded, or False otherwise:
load_dotenv(dotenv_path=".env")
True
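Optionally, we can also check that the three variables listed above are actually present in the environment before running anything. This is just a convenience check and is not required by prompto:
for var in ["OPENAI_API_KEY", "GEMINI_API_KEY", "OLLAMA_API_ENDPOINT"]:
    # warn early if a required environment variable is missing
    if os.environ.get(var) is None:
        print(f"Warning: {var} is not set")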
Synchronous approach¶
For the synchronous approach, we simply use a for loop to query the API endpoints:
def send_prompts_sync(prompt_dicts: list[dict]) -> list[str]:
# naive for loop to synchronously dispatch prompts
return [send_prompt(prompt_dict) for prompt_dict in tqdm(prompt_dicts)]
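The send_prompt helper imported from api_utils.py sends a single prompt dict to the relevant API and blocks until the response arrives. Purely as an illustration of what such a synchronous call looks like, a minimal sketch for the Ollama case is shown below; the endpoint path and field names (model_name, prompt, parameters) are assumptions chosen to match the experiment files generated later in this notebook, and the actual implementation lives in api_utils.py.
def send_prompt_ollama_sketch(prompt_dict: dict) -> str:
    # hypothetical synchronous sender for the Ollama API only
    # (uses the os and requests imports from the top of the notebook)
    endpoint = os.environ["OLLAMA_API_ENDPOINT"]
    response = requests.post(
        f"{endpoint}/api/generate",
        json={
            "model": prompt_dict["model_name"],
            "prompt": prompt_dict["prompt"],
            "options": prompt_dict.get("parameters", {}),
            "stream": False,
        },
    )
    response.raise_for_status()
    return response.json()["response"]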
Experiment setup¶
For the experiment, we take a sample of 100 prompts from the alpaca_data.json file in the tatsu-lab/stanford_alpaca GitHub repo, using the prompt template provided by the authors of the repo. To see how we obtained the prompts, please refer to the alpaca_sample_generation.ipynb notebook.
alpaca_prompts = load_prompts("./sample_prompts.json")
We will create our experiment files using the generate_experiment_1_file function in the dataset_utils.py file in this directory. This function takes these prompts and creates a jsonl file with the prompts in the format that prompto expects. We will save these input files into ./data/input and use ./data as our pipeline data folder.
See the pipeline data docs for more information about the pipeline data folder.
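As a rough sketch of what generate_experiment_1_file does, each prompt is written out as one JSON object per line. The exact keys are defined by dataset_utils.py and prompto's experiment file format; the version below assumes the fields prompt, api, model_name and parameters, matching the arguments we pass in the cells that follow:
import json

def generate_experiment_file_sketch(
    path: str, prompts: list[str], api: str, model_name: str, params: dict
) -> None:
    # write one JSON object per line (jsonl), one object per prompt
    with open(path, "w") as f:
        for prompt in prompts:
            line = {
                "prompt": prompt,
                "api": api,
                "model_name": model_name,
                "parameters": params,
            }
            f.write(json.dumps(line) + "\n")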
OPENAI_EXPERIMENT_FILE = "./data/input/openai.jsonl"
GEMINI_EXPERIMENT_FILE = "./data/input/gemini.jsonl"
OLLAMA_EXPERIMENT_FILE = "./data/input/ollama.jsonl"
INPUT_EXPERIMENT_FILEDIR = "./data/input"
if not os.path.isdir(INPUT_EXPERIMENT_FILEDIR):
os.mkdir(INPUT_EXPERIMENT_FILEDIR)
Notice that we query the following models:
- gpt-3.5-turbo for the OpenAI API
- gemini-1.5-flash for the Gemini API
- llama3 (8B, 4-bit quantised) for the Ollama API
Also note that each API uses different argument names for its generation configuration.
generate_experiment_1_file(
path=OPENAI_EXPERIMENT_FILE,
prompts=alpaca_prompts,
api="openai",
model_name="gpt-3.5-turbo",
params={"n": 1, "temperature": 0.9, "max_tokens": 100},
)
generate_experiment_1_file(
path=GEMINI_EXPERIMENT_FILE,
prompts=alpaca_prompts,
api="gemini",
model_name="gemini-1.5-flash",
params={"candidate_count": 1, "temperature": 0.9, "max_output_tokens": 100},
)
generate_experiment_1_file(
path=OLLAMA_EXPERIMENT_FILE,
prompts=alpaca_prompts,
api="ollama",
model_name="llama3",
params={"temperature": 0.9, "num_predict": 100, "seed": 42},
)
For each API, we will compare runtimes for using prompto and a synchronous Python for loop to obtain 100 responses from the API. We use the send_prompts_sync function defined above for the synchronous approach, and we can run prompto experiments using the prompto.experiment.Experiment.process method.
sync_times = {}
prompto_times = {}
print(
f"len(load_prompt_dicts(OPENAI_EXPERIMENT_FILE)): {len(load_prompt_dicts(OPENAI_EXPERIMENT_FILE))}"
)
print(
f"len(load_prompt_dicts(GEMINI_EXPERIMENT_FILE)): {len(load_prompt_dicts(GEMINI_EXPERIMENT_FILE))}"
)
print(
f"len(load_prompt_dicts(OLLAMA_EXPERIMENT_FILE)): {len(load_prompt_dicts(OLLAMA_EXPERIMENT_FILE))}"
)
len(load_prompt_dicts(OPENAI_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(GEMINI_EXPERIMENT_FILE)): 100
len(load_prompt_dicts(OLLAMA_EXPERIMENT_FILE)): 100
OpenAI¶
Running the experiment synchronously¶
start = time.time()
openai_sync = send_prompts_sync(prompt_dicts=load_prompt_dicts(OPENAI_EXPERIMENT_FILE))
sync_times["openai"] = time.time() - start
100%|██████████| 100/100 [02:06<00:00, 1.26s/it]
Running the experiment asynchronously with prompto¶
For running prompto with the OpenAI API, we can send prompts at 500 QPM (queries per minute). It is possible to be on a tier with a higher rate limit, but we will use the 500 QPM rate limit for this experiment.
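As a quick sanity check on these settings: at 500 QPM, prompto sends roughly one query every 60 / 500 = 0.12 seconds, which matches the "RI of 0.12s" shown in the progress bar below (RI being the interval between consecutive requests).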
openai_experiment = Experiment(
file_name="openai.jsonl", settings=Settings(data_folder="./data", max_queries=500)
)
start = time.time()
openai_responses, _ = await openai_experiment.process()
prompto_times["openai"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.20query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 58.66query/s]
sync_times["openai"], prompto_times["openai"]
(126.30979299545288, 13.91887378692627)
Gemini¶
Running the experiment synchronously¶
start = time.time()
gemini_sync = send_prompts_sync(prompt_dicts=load_prompt_dicts(GEMINI_EXPERIMENT_FILE))
sync_times["gemini"] = time.time() - start
100%|██████████| 100/100 [02:43<00:00, 1.63s/it]
Running the experiment asynchronously with prompto¶
As with the OpenAI API, when running prompto with the Gemini API we can send prompts at 500 QPM. It is possible to be on a tier with a higher rate limit, but we will use the 500 QPM rate limit for this experiment.
gemini_experiment = Experiment(
file_name="gemini.jsonl", settings=Settings(data_folder="./data", max_queries=500)
)
start = time.time()
gemini_responses, _ = await gemini_experiment.process()
prompto_times["gemini"] = time.time() - start
Sending 100 queries at 500 QPM with RI of 0.12s (attempt 1/3): 100%|██████████| 100/100 [00:12<00:00, 8.17query/s]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [00:01<00:00, 55.77query/s]
sync_times["gemini"], prompto_times["gemini"]
(163.48729801177979, 14.094270944595337)
Ollama¶
Before running the Ollama experiment, we first send an empty prompt request with the llama3 model to 1) check that the model is available and working, and 2) ensure that the model is loaded into memory, since sending an empty request causes Ollama to pre-load the model.
requests.post(
f"{os.environ.get('OLLAMA_API_ENDPOINT')}/api/generate", json={"model": "llama3"}
)
<Response [200]>
Running the experiment synchronously¶
start = time.time()
ollama_sync = send_prompts_sync(prompt_dicts=load_prompt_dicts(OLLAMA_EXPERIMENT_FILE))
sync_times["ollama"] = time.time() - start
100%|██████████| 100/100 [04:31<00:00, 2.71s/it]
Running the experiment asynchronously with prompto¶
For Ollama, we use a locally hosted API endpoint. While the Ollama API implements a queue system to allow asynchronous requests, it still only processes one request at a time, so we expect only a modest speedup when using prompto to query the Ollama API. We will use a 50 QPM rate limit for this experiment, as we are running Ollama on a 14" M1 Pro MacBook, which cannot handle much more than this.
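As a back-of-the-envelope estimate: at 50 QPM the requests are dispatched roughly one every 60 / 50 = 1.2 seconds, so sending all 100 requests already takes around two minutes, and since Ollama generates responses one at a time at roughly 2.7 seconds per response (see the synchronous run above), the overall runtime is still dominated by the serial generation time. We should therefore expect prompto's runtime to be close to that of the synchronous for loop here.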
ollama_experiment = Experiment(
file_name="ollama.jsonl", settings=Settings(data_folder="./data", max_queries=50)
)
start = time.time()
ollama_responses, _ = await ollama_experiment.process()
prompto_times["ollama"] = time.time() - start
Sending 100 queries at 50 QPM with RI of 1.2s (attempt 1/3): 100%|██████████| 100/100 [02:00<00:00, 1.20s/query]
Waiting for responses (attempt 1/3): 100%|██████████| 100/100 [01:58<00:00, 1.18s/query]
sync_times["ollama"], prompto_times["ollama"]
(271.4494206905365, 268.59372997283936)
Running prompto via the command line¶
We can also run the experiments via the command line. The commands are as follows (assuming that your working directory is the directory of this notebook, i.e. examples/system-demo):
prompto_run_experiment --file data/input/openai.jsonl --max-queries 500
prompto_run_experiment --file data/input/gemini.jsonl --max-queries 500
prompto_run_experiment --file data/input/ollama.jsonl --max-queries 50
But for this notebook, we time the experiments in Python and save the runtimes to the sync_times and prompto_times dictionaries.
Analysis¶
Here, we report the final runtimes for each API and the difference in time between the prompto and synchronous Python for loop approaches:
sync_times
{'openai': 126.30979299545288, 'gemini': 163.48729801177979, 'ollama': 271.4494206905365}
prompto_times
{'openai': 13.91887378692627, 'gemini': 14.094270944595337, 'ollama': 268.59372997283936}
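To make the comparison concrete, we can also compute the speedup factor of prompto over the synchronous for loop for each API; with the numbers above this comes out to roughly 9x for OpenAI, almost 12x for Gemini, and essentially no speedup for Ollama:
for api in sync_times:
    # ratio of synchronous runtime to prompto runtime
    speedup = sync_times[api] / prompto_times[api]
    print(f"{api}: {speedup:.1f}x faster with prompto")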
We can see significant speedups when using prompto to query the OpenAI and Gemini APIs. For the Ollama API, we only see a modest speedup, since Ollama processes just one request at a time.