Merge dev branch #4522

Merged · 37 commits · Nov 9, 2023

Commits
4a45dc4
Reorder the parameters in the FastAPI documentation
oobabooga Nov 6, 2023
97c21e5
Don't strip leading spaces in OpenAI API
oobabooga Nov 7, 2023
79b3f5a
Add /v1/internal/stop-generation to OpenAI API (#4498)
oobabooga Nov 7, 2023
18739c8
Update peft requirement from ==0.5.* to ==0.6.* (#4494)
dependabot[bot] Nov 7, 2023
fd893ba
Bump optimum from 1.13.1 to 1.14.0 (#4492)
dependabot[bot] Nov 7, 2023
3496044
Update 12 - OpenAI API.md (#4501)
mocheng Nov 7, 2023
b2afdda
Add more API examples
oobabooga Nov 7, 2023
15d4ea1
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
oobabooga Nov 7, 2023
6ec997f
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
40e73aa
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
ddca694
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
cc04abd
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
2bda1a9
Mention --api-key
oobabooga Nov 7, 2023
b0b999d
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
oobabooga Nov 7, 2023
55dc984
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
0c44087
Update 12 - OpenAI API.md
oobabooga Nov 7, 2023
d59f1ad
Update README.md
oobabooga Nov 7, 2023
48c9c31
Document the "preset" option in the API
oobabooga Nov 7, 2023
cee099f
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
oobabooga Nov 7, 2023
3d59346
Implement echo/suffix parameters
oobabooga Nov 7, 2023
3fc505d
Document unused parameters
oobabooga Nov 7, 2023
5c3eb22
Bump llama-cpp-python to 0.2.14
oobabooga Nov 7, 2023
af3d25a
Disable logits_all in llamacpp_HF (makes processing 3x faster)
oobabooga Nov 7, 2023
5c0559d
Training: fix .txt files now showing in dropdowns
oobabooga Nov 7, 2023
322c170
Document logits_all
oobabooga Nov 7, 2023
6e2e031
Separate context and system message in instruction formats (#4499)
oobabooga Nov 7, 2023
f6ca9cf
Add /v1/internal/model-info endpoint
oobabooga Nov 8, 2023
1b69694
Add types to the encode/decode/token-count endpoints
oobabooga Nov 8, 2023
43c53a7
Refactor the /v1/models endpoint
oobabooga Nov 8, 2023
2358706
Add /v1/internal/model/load endpoint (tentative)
oobabooga Nov 8, 2023
38b0749
Add a comment to /v1/models
oobabooga Nov 8, 2023
050ff36
Revert "Add a comment to /v1/models"
oobabooga Nov 8, 2023
881e8a6
Small bug fix in /v1/internal/model/load
oobabooga Nov 8, 2023
6c7aad1
openai extension: wrong frequency_penalty type (#4512)
hronoas Nov 8, 2023
1754a37
Include trust remote code usage in openai api's embedder (#4513)
MrMojoR Nov 8, 2023
678fd73
Document /v1/internal/model/load and fix a bug
oobabooga Nov 9, 2023
21ed9a2
Document the new "Custom system message" field
oobabooga Nov 9, 2023
4 changes: 2 additions & 2 deletions README.md
@@ -20,9 +20,8 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
* [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
* [Extensions framework](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions)
* [Custom chat characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character)
* Very efficient text streaming
* Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
* OpenAI-compatible API server
* OpenAI-compatible API server with Chat and Completions endpoints -- see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples)

## Documentation

@@ -328,6 +327,7 @@ Optionally, you can use the following command-line flags:
| `--tensor_split TENSOR_SPLIT` | Split the model across multiple GPUs. Comma-separated list of proportions. Example: 18,17. |
| `--llama_cpp_seed SEED` | Seed for llama-cpp models. Default is 0 (random). |
| `--numa` | Activate NUMA task allocation for llama.cpp. |
| `--logits_all` | Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower. |
| `--cache-capacity CACHE_CAPACITY` | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |

#### ExLlama
10 changes: 6 additions & 4 deletions docs/03 ‐ Parameters Tab.md
@@ -98,10 +98,12 @@ So you can use those special placeholders in your character definitions. They ar
Defines the instruction template that is used in the Chat tab when "instruct" or "chat-instruct" are selected under "Mode".

* **Instruction template**: A dropdown menu where you can select from saved templates, save a new template (💾 button), and delete the currently selected template (🗑️).
* **User string**: In the turn template, `<|user|>` gets replaced with this string.
* **Bot string**: In the turn template, `<|bot|>` gets replaced with this string.
* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The system message for the model can be edited inside this string to customize its behavior.
* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input and `<|bot-message|>` gets replaced with the bot reply. It is necessary to include `<|user|>` and `<|bot|>` even if "User string" and "Bot string" above are empty, as those placeholders are used to split the template in parts in the backend.
* **Custom system message**: A message that defines the personality of the chatbot, replacing its default "System message" string. Example: "You are a duck."
* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input, `<|bot-message|>` gets replaced with the bot reply, `<|user|>` gets replaced with the "User string" below, and `<|bot|>` gets replaced with the "Bot string" below. The `<|user|>` and `<|bot|>` placeholders must be included even if "User string" and "Bot string" are empty, as they are used to split the template into parts in the backend. See the sketch after this list for an illustration.
* **User string**: Replaces `<|user|>` in the turn template.
* **Bot string**: Replaces `<|bot|>` in the turn template.
* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The `<|system-message|>` placeholder gets replaced with the "System message" string below, unless "Custom system message" is not empty, in which case that is used instead.
* **System message**: A default message recommended by the model creator(s) to define the personality of the chatbot.
* **Send to default**: Send the full instruction template in string format to the Default tab.
* **Send to notebook**: Send the full instruction template in string format to the Notebook tab.
* **Send to negative prompt**: Send the full instruction template in string format to the "Negative prompt" field under "Parameters" > "Generation".
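
For illustration, here is a hypothetical Alpaca-style template and the substitution it describes (a sketch only; the actual splitting and replacement happen in the backend):

```python
# Hypothetical values, shown only to illustrate how the placeholders are substituted.
turn_template = "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
user_string = "### Instruction:"
bot_string = "### Response:"

turn = (turn_template
        .replace("<|user|>", user_string)
        .replace("<|bot|>", bot_string)
        .replace("<|user-message|>", "Write a haiku about rivers.")
        .replace("<|bot-message|>", "Silver water hums over stone."))
print(turn)
```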
4 changes: 4 additions & 0 deletions docs/04 ‐ Model Tab.md
@@ -110,6 +110,10 @@ To use it, you need to download a tokenizer. There are two options:
1) Download `oobabooga/llama-tokenizer` under "Download model or LoRA". That's a default Llama tokenizer.
2) Place your .gguf in a subfolder of `models/` along with these 3 files: `tokenizer.model`, `tokenizer_config.json`, and `special_tokens_map.json`. This takes precedence over Option 1.

It has an additional parameter:

* **logits_all**: Needs to be checked if you want to evaluate the perplexity of the llama.cpp model using the "Training" > "Perplexity evaluation" tab. Otherwise, leave it unchecked, as it makes prompt processing slower.
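
For reference, this checkbox roughly corresponds to the flag of the same name in llama-cpp-python. A minimal sketch, assuming llama-cpp-python is installed and the model path is adjusted:

```python
from llama_cpp import Llama

# logits_all=True keeps logits for every prompt token, which perplexity
# evaluation needs; it also makes prompt processing slower.
llm = Llama(model_path="models/your-model.gguf", logits_all=True)
```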

### ctransformers

Loads: GGUF/GGML models.
135 changes: 93 additions & 42 deletions docs/12 - OpenAI API.md
@@ -12,10 +12,11 @@ pip install -r extensions/openai/requirements.txt

Add `--extensions openai` to your command-line flags.

* To create a public Cloudflare URL, also add the `--public-api` flag.
* To listen on your local network, also add the `--listen` flag.
* To change the port, which is 5000 by default, use `--port 1234` (change 1234 to your desired port number).
* To create a public Cloudflare URL, add the `--public-api` flag.
* To listen on your local network, add the `--listen` flag.
* To change the port, which is 5000 by default, use `--api-port 1234` (change 1234 to your desired port number).
* To use SSL, add `--ssl-keyfile key.pem --ssl-certfile cert.pem`. Note that it doesn't work with `--public-api`.
* To use an API key for authentication, add `--api-key yourkey`.
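
For example, assuming the server was started with `--api-key yourkey`, a request would carry the key in the standard OpenAI-style `Authorization` header (a sketch; only the header differs from the examples further down):

```python
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer yourkey"  # the value passed to --api-key
}

data = {
    "mode": "instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
}

response = requests.post(url, headers=headers, json=data, verify=False)
print(response.json()['choices'][0]['message']['content'])
```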

#### Environment variables

@@ -44,7 +45,7 @@ openai-debug: 1

### Examples

For the documentation with all the parameters, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.
For the documentation with all the parameters and their types, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.

The official examples in the [OpenAI documentation](https://platform.openai.com/docs/api-reference) should also work, and the same parameters apply (although the API here has more optional parameters).

Expand Down Expand Up @@ -128,7 +129,7 @@ headers = {
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
@@ -144,8 +145,82 @@ while True:
    print(assistant_message)
```

### Client Application Setup
#### Python chat example with streaming

Start the script with `python -u` to see the output in real time.

```python
import requests
import sseclient # pip install sseclient-py
import json

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
    data = {
        "mode": "instruct",
        "stream": True,
        "messages": history
    }

    stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
    client = sseclient.SSEClient(stream_response)

    assistant_message = ''
    for event in client.events():
        payload = json.loads(event.data)
        chunk = payload['choices'][0]['message']['content']
        assistant_message += chunk
        print(chunk, end='')

    print()
    history.append({"role": "assistant", "content": assistant_message})
```

#### Python completions example with streaming

Start the script with `python -u` to see the output in real time.

```python
import json
import requests
import sseclient # pip install sseclient-py

url = "http://127.0.0.1:5000/v1/completions"

headers = {
    "Content-Type": "application/json"
}

data = {
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10,
    "stream": True,
}

stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
client = sseclient.SSEClient(stream_response)

print(data['prompt'], end='')
for event in client.events():
    payload = json.loads(event.data)
    print(payload['choices'][0]['text'], end='')

print()
```

### Third-party application setup

You can usually force an application that uses the OpenAI API to connect to the local API by using the following environment variables:

@@ -157,18 +232,18 @@ or

```shell
OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
OPENAI_API_BASE=http://127.0.0.1:500/v1
OPENAI_API_BASE=http://127.0.0.1:5000/v1
```

With the [official python openai client](https://github.com/openai/openai-python), set the `OPENAI_API_BASE` environment variables:
With the [official python openai client](https://github.com/openai/openai-python), the address can be set like this:

```shell
# Sample .env file:
OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
OPENAI_API_BASE=http://0.0.0.0:5001/v1
```
```python
import openai

# If needed, replace 127.0.0.1 with the IP/port of your server.
openai.api_key = "..."
openai.api_base = "http://127.0.0.1:5000/v1"
openai.api_version = "2023-05-15"
```
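
A short usage sketch with that configuration (this assumes the legacy 0.x client interface shown above; the backend generates with whichever model is currently loaded, so the model name below is only a placeholder):

```python
import openai

openai.api_key = "sk-111111111111111111111111111111111111111111111111"
openai.api_base = "http://127.0.0.1:5000/v1"

response = openai.ChatCompletion.create(
    model="text-generation-webui",  # placeholder; the loaded model is used
    messages=[{"role": "user", "content": "Teach me about patience."}]
)
print(response["choices"][0]["message"]["content"])
```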

If using .env files to save the `OPENAI_API_BASE` and `OPENAI_API_KEY` variables, make sure the .env file is loaded before the openai module is imported:
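
A minimal sketch of that ordering, assuming `python-dotenv` is installed and the two variables live in a `.env` file next to the script:

```python
from dotenv import load_dotenv  # pip install python-dotenv

# Load OPENAI_API_BASE and OPENAI_API_KEY before the openai module reads them.
load_dotenv()

import openai  # noqa: E402

print(openai.api_base)
```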

@@ -212,35 +287,10 @@ In short, the all-MiniLM-L6-v2 model is 5x faster, uses 5x less RAM and 2x less storage

Warning: You cannot mix embeddings from different models even if they have the same dimensions. They are not comparable.
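
A sketch of an embeddings request against the local endpoint (the payload follows the OpenAI format; which model answers depends on the embedding model the extension has loaded):

```python
import requests

url = "http://127.0.0.1:5000/v1/embeddings"

data = {
    "input": ["A sentence to embed.", "Another sentence to embed."]
}

response = requests.post(url, json=data, verify=False)
for item in response.json()['data']:
    print(len(item['embedding']), item['embedding'][:3])
```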

### API Documentation & Examples

The OpenAI API is well documented; you can view the documentation here: https://platform.openai.com/docs/api-reference

Examples of how to use the Completions API in Python can be found here: https://platform.openai.com/examples
Not all of them will work with all models, unfortunately; see the notes on Models for how to get the best results.

Here is a simple python example.

```python
import os
os.environ['OPENAI_API_KEY']="sk-111111111111111111111111111111111111111111111111"
os.environ['OPENAI_API_BASE']="http://0.0.0.0:5001/v1"
import openai

response = openai.ChatCompletion.create(
model="x",
messages = [{ 'role': 'system', 'content': "Answer in a consistent style." },
{'role': 'user', 'content': "Teach me about patience."},
{'role': 'assistant', 'content': "The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread."},
{'role': 'user', 'content': "Teach me about the ocean."},
]
)
text = response['choices'][0]['message']['content']
print(text)
```

### Compatibility & not so compatibility

Note: the table below may be obsolete.

| API endpoint | tested with | notes |
| ------------------------- | ---------------------------------- | --------------------------------------------------------------------------- |
| /v1/chat/completions | openai.ChatCompletion.create() | Use it with instruction following models |
@@ -263,11 +313,12 @@ print(text)
| /v1/fine-tunes\* | openai.FineTune.\* | not yet supported |
| /v1/search | openai.search, engines.search | not yet supported |


#### Applications

Almost everything needs the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables set, but there are some exceptions.

Note: the table below may be obsolete.

| Compatibility | Application/Library | Website | Notes |
| ------------- | ---------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ✅❌ | openai-python (v0.25+) | https://github.com/openai/openai-python | only the endpoints from above are working. OPENAI_API_BASE=http://127.0.0.1:5001/v1 |
45 changes: 16 additions & 29 deletions extensions/openai/completions.py
@@ -140,6 +140,7 @@ def convert_history(history):
current_message = ""
current_reply = ""
user_input = ""
system_message = ""

for entry in history:
content = entry["content"]
@@ -159,11 +160,13 @@
current_reply = ""
else:
chat_dialogue.append(['', current_reply])
elif role == "system":
system_message = content

# if current_message:
# chat_dialogue.append([current_message, ''])

return user_input, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}
return user_input, system_message, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}


def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -> dict:
@@ -198,7 +201,7 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
# Instruction template
instruction_template = body['instruction_template'] or shared.settings['instruction_template']
instruction_template = "Alpaca" if instruction_template == "None" else instruction_template
name1_instruct, name2_instruct, _, _, context_instruct, turn_template = load_character_memoized(instruction_template, '', '', instruct=True)
name1_instruct, name2_instruct, _, _, context_instruct, turn_template, system_message = load_character_memoized(instruction_template, '', '', instruct=True)
name1_instruct = body['name1_instruct'] or name1_instruct
name2_instruct = body['name2_instruct'] or name2_instruct
context_instruct = body['context_instruct'] or context_instruct
@@ -208,13 +211,13 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
character = body['character'] or shared.settings['character']
character = "Assistant" if character == "None" else character
name1 = body['name1'] or shared.settings['name1']
name1, name2, _, greeting, context, _ = load_character_memoized(character, name1, '', instruct=False)
name1, name2, _, greeting, context, _, _ = load_character_memoized(character, name1, '', instruct=False)
name2 = body['name2'] or name2
context = body['context'] or context
greeting = body['greeting'] or greeting

# History
user_input, history = convert_history(messages)
user_input, custom_system_message, history = convert_history(messages)

generate_params.update({
'mode': body['mode'],
Expand All @@ -225,6 +228,8 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
'name1_instruct': name1_instruct,
'name2_instruct': name2_instruct,
'context_instruct': context_instruct,
'system_message': system_message,
'custom_system_message': custom_system_message,
'turn_template': turn_template,
'chat-instruct_command': body['chat_instruct_command'],
'history': history,
@@ -287,13 +292,7 @@ def chat_streaming_chunk(content):
continue

seen_content = answer

# strip extra leading space off new generated content
if len_seen == 0 and new_content[0] == ' ':
new_content = new_content[1:]

chunk = chat_streaming_chunk(new_content)

yield chunk

completion_token_count = len(encode(answer)[0])
@@ -355,8 +354,8 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
generate_params['stream'] = stream
requested_model = generate_params.pop('model')
logprob_proc = generate_params.pop('logprob_proc', None)
# generate_params['suffix'] = body.get('suffix', generate_params['suffix'])
generate_params['echo'] = body.get('echo', generate_params['echo'])
suffix = body['suffix'] if body['suffix'] else ''
echo = body['echo']

if not stream:
prompt_arg = body[prompt_str]
@@ -379,6 +378,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
except KeyError:
prompt = decode(prompt)[0]

prefix = prompt if echo else ''
token_count = len(encode(prompt)[0])
total_prompt_token_count += token_count

@@ -390,10 +390,6 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
for a in generator:
answer = a

# strip extra leading space off new generated content
if answer and answer[0] == ' ':
answer = answer[1:]

completion_token_count = len(encode(answer)[0])
total_completion_token_count += completion_token_count
stop_reason = "stop"
@@ -403,7 +399,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
respi = {
"index": idx,
"finish_reason": stop_reason,
"text": answer,
"text": prefix + answer + suffix,
"logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
}

@@ -435,6 +431,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
else:
raise InvalidRequestError(message="API Batched generation not yet supported.", param=prompt_str)

prefix = prompt if echo else ''
token_count = len(encode(prompt)[0])

def text_streaming_chunk(content):
Expand All @@ -454,7 +451,7 @@ def text_streaming_chunk(content):

return chunk

yield text_streaming_chunk('')
yield text_streaming_chunk(prefix)

# generate reply #######################################
debug_msg({'prompt': prompt, 'generate_params': generate_params})
@@ -474,25 +471,15 @@ def text_streaming_chunk(content):
continue

seen_content = answer

# strip extra leading space off new generated content
if len_seen == 0 and new_content[0] == ' ':
new_content = new_content[1:]

chunk = text_streaming_chunk(new_content)

yield chunk

# to get the correct count, we strip the leading space if present
if answer and answer[0] == ' ':
answer = answer[1:]

completion_token_count = len(encode(answer)[0])
stop_reason = "stop"
if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
stop_reason = "length"

chunk = text_streaming_chunk('')
chunk = text_streaming_chunk(suffix)
chunk[resp_list][0]["finish_reason"] = stop_reason
chunk["usage"] = {
"prompt_tokens": token_count,