Merge dev branch #4522

Merged · 37 commits · Nov 9, 2023

Commits
4a45dc4
Reorder the parameters in the FastAPI documentation
oobabooga Nov 6, 2023
97c21e5
Don't strip leading spaces in OpenAI API
oobabooga Nov 7, 2023
79b3f5a
Add /v1/internal/stop-generation to OpenAI API (#4498)
oobabooga Nov 7, 2023
18739c8
Update peft requirement from ==0.5.* to ==0.6.* (#4494)
dependabot[bot] Nov 7, 2023
fd893ba
Bump optimum from 1.13.1 to 1.14.0 (#4492)
dependabot[bot] Nov 7, 2023
3496044
Update 12 - OpenAI API.md (#4501)
mocheng Nov 7, 2023
b2afdda
Add more API examples
oobabooga Nov 7, 2023
15d4ea1
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
oobabooga Nov 7, 2023
6ec997f
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
40e73aa
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
ddca694
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
cc04abd
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
2bda1a9
Mention --api-key
oobabooga Nov 7, 2023
b0b999d
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
oobabooga Nov 7, 2023
55dc984
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
0c44087
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
d59f1ad
Update README.md
oobabooga Nov 7, 2023
48c9c31
Document the "preset" option in the API
oobabooga Nov 7, 2023
cee099f
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
oobabooga Nov 7, 2023
3d59346
Implement echo/suffix parameters
oobabooga Nov 7, 2023
3fc505d
Document unused parameters
oobabooga Nov 7, 2023
5c3eb22
Bump llama-cpp-python to 0.2.14
oobabooga Nov 7, 2023
af3d25a
Disable logits_all in llamacpp_HF (makes processing 3x faster)
oobabooga Nov 7, 2023
5c0559d
Training: fix .txt files now showing in dropdowns
oobabooga Nov 7, 2023
322c170
Document logits_all
oobabooga Nov 7, 2023
6e2e031
Separate context and system message in instruction formats (#4499)
oobabooga Nov 7, 2023
f6ca9cf
Add /v1/internal/model-info endpoint
oobabooga Nov 8, 2023
1b69694
Add types to the encode/decode/token-count endpoints
oobabooga Nov 8, 2023
43c53a7
Refactor the /v1/models endpoint
oobabooga Nov 8, 2023
2358706
Add /v1/internal/model/load endpoint (tentative)
oobabooga Nov 8, 2023
38b0749
Add a comment to /v1/models
oobabooga Nov 8, 2023
050ff36
Revert "Add a comment to /v1/models"
oobabooga Nov 8, 2023
881e8a6
Small bug fix in /v1/internal/model/load
oobabooga Nov 8, 2023
6c7aad1
openai extension: wrong frequency_penalty type (#4512)
hronoas Nov 8, 2023
1754a37
Include trust remote code usage in openai api's embedder (#4513)
MrMojoR Nov 8, 2023
678fd73
Document /v1/internal/model/load and fix a bug
oobabooga Nov 9, 2023
21ed9a2
Document the new "Custom system message" field
oobabooga Nov 9, 2023
4 changes: 2 additions & 2 deletions README.md
@@ -20,9 +20,8 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
* [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
* [Extensions framework](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions)
* [Custom chat characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character)
* Very efficient text streaming
* Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
* OpenAI-compatible API server
* OpenAI-compatible API server with Chat and Completions endpoints -- see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples)

## Documentation

@@ -328,6 +327,7 @@ Optionally, you can use the following command-line flags:
| `--tensor_split TENSOR_SPLIT` | Split the model across multiple GPUs. Comma-separated list of proportions. Example: 18,17. |
| `--llama_cpp_seed SEED` | Seed for llama-cpp models. Default is 0 (random). |
| `--numa` | Activate NUMA task allocation for llama.cpp. |
| `--logits_all` | Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower. |
| `--cache-capacity CACHE_CAPACITY` | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |

#### ExLlama
10 changes: 6 additions & 4 deletions docs/03 ‐ Parameters Tab.md
@@ -98,10 +98,12 @@ So you can use those special placeholders in your character definitions. They ar
Defines the instruction template that is used in the Chat tab when "instruct" or "chat-instruct" are selected under "Mode".

* **Instruction template**: A dropdown menu where you can select from saved templates, save a new template (💾 button), and delete the currently selected template (🗑️).
* **User string**: In the turn template, `<|user|>` gets replaced with this string.
* **Bot string**: In the turn template, `<|bot|>` gets replaced with this string.
* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The system message for the model can be edited inside this string to customize its behavior.
* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input and `<|bot-message|>` gets replaced with the bot reply. It is necessary to include `<|user|>` and `<|bot|>` even if "User string" and "Bot string" above are empty, as those placeholders are used to split the template in parts in the backend.
* **Custom system message**: A message that defines the personality of the chatbot, replacing its default "System message" string. Example: "You are a duck."
* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input, `<|bot-message|>` gets replaced with the bot reply, `<|user|>` gets replaced with the "User string" below, and `<|bot|>` gets replaced with the "Bot string" below. The `<|user|>` and `<|bot|>` placeholders must be included even if "User string" and "Bot string" are empty, as they are used to split the template into parts in the backend. See the sketch after this list for an illustration.
* **User string**: Replaces `<|user|>` in the turn template.
* **Bot string**: Replaces `<|bot|>` in the turn template.
* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The `<|system-message|>` placeholder gets replaced with the "System message" string below, unless "Custom system message" is not empty, in which case that is used instead.
* **System message**: A default message recommended by the model creator(s) to define the personality of the chatbot.
* **Send to default**: Send the full instruction template in string format to the Default tab.
* **Send to notebook**: Send the full instruction template in string format to the Notebook tab.
* **Send to negative prompt**: Send the full instruction template in string format to the "Negative prompt" field under "Parameters" > "Generation".
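
For illustration, here is a hypothetical Alpaca-style template and the substitution it describes (a sketch only; the actual splitting and replacement happen in the backend):

```python
# Hypothetical values, shown only to illustrate how the placeholders are substituted.
turn_template = "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
user_string = "### Instruction:"
bot_string = "### Response:"

turn = (turn_template
        .replace("<|user|>", user_string)
        .replace("<|bot|>", bot_string)
        .replace("<|user-message|>", "Write a haiku about rivers.")
        .replace("<|bot-message|>", "Silver water hums over stone."))
print(turn)
```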
4 changes: 4 additions & 0 deletions docs/04 ‐ Model Tab.md
@@ -110,6 +110,10 @@ To use it, you need to download a tokenizer. There are two options:
1) Download `oobabooga/llama-tokenizer` under "Download model or LoRA". That's a default Llama tokenizer.
2) Place your .gguf in a subfolder of `models/` along with these 3 files: `tokenizer.model`, `tokenizer_config.json`, and `special_tokens_map.json`. This takes precedence over Option 1.

It has an additional parameter:

* **logits_all**: Needs to be checked if you want to evaluate the perplexity of the llama.cpp model using the "Training" > "Perplexity evaluation" tab. Otherwise, leave it unchecked, as it makes prompt processing slower.
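
For reference, this checkbox roughly corresponds to the flag of the same name in llama-cpp-python. A minimal sketch, assuming llama-cpp-python is installed and the model path is adjusted:

```python
from llama_cpp import Llama

# logits_all=True keeps logits for every prompt token, which perplexity
# evaluation needs; it also makes prompt processing slower.
llm = Llama(model_path="models/your-model.gguf", logits_all=True)
```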

### ctransformers

Loads: GGUF/GGML models.
135 changes: 93 additions & 42 deletions docs/12 - OpenAI API.md
@@ -12,10 +12,11 @@ pip install -r extensions/openai/requirements.txt

Add `--extensions openai` to your command-line flags.

* To create a public Cloudflare URL, also add the `--public-api` flag.
* To listen on your local network, also add the `--listen` flag.
* To change the port, which is 5000 by default, use `--port 1234` (change 1234 to your desired port number).
* To create a public Cloudflare URL, add the `--public-api` flag.
* To listen on your local network, add the `--listen` flag.
* To change the port, which is 5000 by default, use `--api-port 1234` (change 1234 to your desired port number).
* To use SSL, add `--ssl-keyfile key.pem --ssl-certfile cert.pem`. Note that it doesn't work with `--public-api`.
* To use an API key for authentication, add `--api-key yourkey`.
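
For example, assuming the server was started with `--api-key yourkey`, a request would carry the key in the standard OpenAI-style `Authorization` header (a sketch; only the header differs from the examples further down):

```python
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer yourkey"  # the value passed to --api-key
}

data = {
    "mode": "instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
}

response = requests.post(url, headers=headers, json=data, verify=False)
print(response.json()['choices'][0]['message']['content'])
```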

#### Environment variables

@@ -44,7 +45,7 @@ openai-debug: 1

### Examples

For the documentation with all the parameters, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.
For the documentation with all the parameters and their types, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.

The official examples in the [OpenAI documentation](https://platform.openai.com/docs/api-reference) should also work, and the same parameters apply (although the API here has more optional parameters).

Expand Down Expand Up @@ -128,7 +129,7 @@ headers = {
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
@@ -144,8 +145,82 @@ while True:
    print(assistant_message)
```

### Client Application Setup
#### Python chat example with streaming

Start the script with `python -u` to see the output in real time.

```python
import requests
import sseclient # pip install sseclient-py
import json

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
    data = {
        "mode": "instruct",
        "stream": True,
        "messages": history
    }

    stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
    client = sseclient.SSEClient(stream_response)

    assistant_message = ''
    for event in client.events():
        payload = json.loads(event.data)
        chunk = payload['choices'][0]['message']['content']
        assistant_message += chunk
        print(chunk, end='')

    print()
    history.append({"role": "assistant", "content": assistant_message})
```

#### Python completions example with streaming

Start the script with `python -u` to see the output in real time.

```python
import json
import requests
import sseclient # pip install sseclient-py

url = "http://127.0.0.1:5000/v1/completions"

headers = {
    "Content-Type": "application/json"
}

data = {
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10,
    "stream": True,
}

stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
client = sseclient.SSEClient(stream_response)

print(data['prompt'], end='')
for event in client.events():
    payload = json.loads(event.data)
    print(payload['choices'][0]['text'], end='')

print()
```

### Third-party application setup

You can usually force an application that uses the OpenAI API to connect to the local API by using the following environment variables:

@@ -157,18 +232,18 @@ or

```shell
OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
OPENAI_API_BASE=http://127.0.0.1:500/v1
OPENAI_API_BASE=http://127.0.0.1:5000/v1
```

With the [official python openai client](https://github.com/openai/openai-python), set the `OPENAI_API_BASE` environment variables:
With the [official python openai client](https://github.com/openai/openai-python), the address can be set like this:

```shell
# Sample .env file:
OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
OPENAI_API_BASE=http://0.0.0.0:5001/v1
```
```python
import openai

# If needed, replace 127.0.0.1 with the IP/port of your server.
openai.api_key = "..."
openai.api_base = "http://127.0.0.1:5000/v1"
openai.api_version = "2023-05-15"
```
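
A short usage sketch with that configuration (this assumes the legacy 0.x client interface shown above; the backend generates with whichever model is currently loaded, so the model name below is only a placeholder):

```python
import openai

openai.api_key = "sk-111111111111111111111111111111111111111111111111"
openai.api_base = "http://127.0.0.1:5000/v1"

response = openai.ChatCompletion.create(
    model="text-generation-webui",  # placeholder; the loaded model is used
    messages=[{"role": "user", "content": "Teach me about patience."}]
)
print(response["choices"][0]["message"]["content"])
```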

If using .env files to save the `OPENAI_API_BASE` and `OPENAI_API_KEY` variables, make sure the .env file is loaded before the openai module is imported:
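
A minimal sketch of that ordering, assuming `python-dotenv` is installed and the two variables live in a `.env` file next to the script:

```python
from dotenv import load_dotenv  # pip install python-dotenv

# Load OPENAI_API_BASE and OPENAI_API_KEY before the openai module reads them.
load_dotenv()

import openai  # noqa: E402

print(openai.api_base)
```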

@@ -212,35 +287,10 @@ In short, the all-MiniLM-L6-v2 model is 5x faster, uses 5x less RAM and 2x less storage

Warning: You cannot mix embeddings from different models even if they have the same dimensions. They are not comparable.
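
A sketch of an embeddings request against the local endpoint (the payload follows the OpenAI format; which model answers depends on the embedding model the extension has loaded):

```python
import requests

url = "http://127.0.0.1:5000/v1/embeddings"

data = {
    "input": ["A sentence to embed.", "Another sentence to embed."]
}

response = requests.post(url, json=data, verify=False)
for item in response.json()['data']:
    print(len(item['embedding']), item['embedding'][:3])
```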

### API Documentation & Examples

The OpenAI API is well documented; you can view the documentation here: https://platform.openai.com/docs/api-reference

Examples of how to use the Completions API in Python can be found here: https://platform.openai.com/examples
Not all of them will work with all models, unfortunately; see the notes on Models for how to get the best results.

Here is a simple python example.

```python
import os
os.environ['OPENAI_API_KEY']="sk-111111111111111111111111111111111111111111111111"
os.environ['OPENAI_API_BASE']="http://0.0.0.0:5001/v1"
import openai

response = openai.ChatCompletion.create(
model="x",
messages = [{ 'role': 'system', 'content': "Answer in a consistent style." },
{'role': 'user', 'content': "Teach me about patience."},
{'role': 'assistant', 'content': "The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread."},
{'role': 'user', 'content': "Teach me about the ocean."},
]
)
text = response['choices'][0]['message']['content']
print(text)
```

### Compatibility & not so compatibility

Note: the table below may be obsolete.

| API endpoint | tested with | notes |
| ------------------------- | ---------------------------------- | --------------------------------------------------------------------------- |
| /v1/chat/completions | openai.ChatCompletion.create() | Use it with instruction following models |
@@ -263,11 +313,12 @@ print(text)
| /v1/fine-tunes\* | openai.FineTune.\* | not yet supported |
| /v1/search | openai.search, engines.search | not yet supported |


#### Applications

Almost everything needs the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables set, but there are some exceptions.

Note: the table below may be obsolete.

| Compatibility | Application/Library | Website | Notes |
| ------------- | ---------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ✅❌ | openai-python (v0.25+) | https://github.com/openai/openai-python | only the endpoints from above are working. OPENAI_API_BASE=http://127.0.0.1:5001/v1 |
45 changes: 16 additions & 29 deletions extensions/openai/completions.py
@@ -140,6 +140,7 @@ def convert_history(history):
current_message = ""
current_reply = ""
user_input = ""
system_message = ""

for entry in history:
content = entry["content"]
@@ -159,11 +160,13 @@
current_reply = ""
else:
chat_dialogue.append(['', current_reply])
elif role == "system":
system_message = content

# if current_message:
# chat_dialogue.append([current_message, ''])

return user_input, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}
return user_input, system_message, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}


def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -> dict:
@@ -198,7 +201,7 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
# Instruction template
instruction_template = body['instruction_template'] or shared.settings['instruction_template']
instruction_template = "Alpaca" if instruction_template == "None" else instruction_template
name1_instruct, name2_instruct, _, _, context_instruct, turn_template = load_character_memoized(instruction_template, '', '', instruct=True)
name1_instruct, name2_instruct, _, _, context_instruct, turn_template, system_message = load_character_memoized(instruction_template, '', '', instruct=True)
name1_instruct = body['name1_instruct'] or name1_instruct
name2_instruct = body['name2_instruct'] or name2_instruct
context_instruct = body['context_instruct'] or context_instruct
@@ -208,13 +211,13 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
character = body['character'] or shared.settings['character']
character = "Assistant" if character == "None" else character
name1 = body['name1'] or shared.settings['name1']
name1, name2, _, greeting, context, _ = load_character_memoized(character, name1, '', instruct=False)
name1, name2, _, greeting, context, _, _ = load_character_memoized(character, name1, '', instruct=False)
name2 = body['name2'] or name2
context = body['context'] or context
greeting = body['greeting'] or greeting

# History
user_input, history = convert_history(messages)
user_input, custom_system_message, history = convert_history(messages)

generate_params.update({
'mode': body['mode'],
Expand All @@ -225,6 +228,8 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
'name1_instruct': name1_instruct,
'name2_instruct': name2_instruct,
'context_instruct': context_instruct,
'system_message': system_message,
'custom_system_message': custom_system_message,
'turn_template': turn_template,
'chat-instruct_command': body['chat_instruct_command'],
'history': history,
@@ -287,13 +292,7 @@ def chat_streaming_chunk(content):
continue

seen_content = answer

# strip extra leading space off new generated content
if len_seen == 0 and new_content[0] == ' ':
new_content = new_content[1:]

chunk = chat_streaming_chunk(new_content)

yield chunk

completion_token_count = len(encode(answer)[0])
@@ -355,8 +354,8 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
generate_params['stream'] = stream
requested_model = generate_params.pop('model')
logprob_proc = generate_params.pop('logprob_proc', None)
# generate_params['suffix'] = body.get('suffix', generate_params['suffix'])
generate_params['echo'] = body.get('echo', generate_params['echo'])
suffix = body['suffix'] if body['suffix'] else ''
echo = body['echo']

if not stream:
prompt_arg = body[prompt_str]
@@ -379,6 +378,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
except KeyError:
prompt = decode(prompt)[0]

prefix = prompt if echo else ''
token_count = len(encode(prompt)[0])
total_prompt_token_count += token_count

@@ -390,10 +390,6 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
for a in generator:
answer = a

# strip extra leading space off new generated content
if answer and answer[0] == ' ':
answer = answer[1:]

completion_token_count = len(encode(answer)[0])
total_completion_token_count += completion_token_count
stop_reason = "stop"
@@ -403,7 +399,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
respi = {
"index": idx,
"finish_reason": stop_reason,
"text": answer,
"text": prefix + answer + suffix,
"logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
}

@@ -435,6 +431,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
else:
raise InvalidRequestError(message="API Batched generation not yet supported.", param=prompt_str)

prefix = prompt if echo else ''
token_count = len(encode(prompt)[0])

def text_streaming_chunk(content):
Expand All @@ -454,7 +451,7 @@ def text_streaming_chunk(content):

return chunk

yield text_streaming_chunk('')
yield text_streaming_chunk(prefix)

# generate reply #######################################
debug_msg({'prompt': prompt, 'generate_params': generate_params})
@@ -474,25 +471,15 @@ def text_streaming_chunk(content):
continue

seen_content = answer

# strip extra leading space off new generated content
if len_seen == 0 and new_content[0] == ' ':
new_content = new_content[1:]

chunk = text_streaming_chunk(new_content)

yield chunk

# to get the correct count, we strip the leading space if present
if answer and answer[0] == ' ':
answer = answer[1:]

completion_token_count = len(encode(answer)[0])
stop_reason = "stop"
if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
stop_reason = "length"

chunk = text_streaming_chunk('')
chunk = text_streaming_chunk(suffix)
chunk[resp_list][0]["finish_reason"] = stop_reason
chunk["usage"] = {
"prompt_tokens": token_count,