
Solve multi-modal models with a new concept of "attachments" #587

Closed · 14 tasks done
simonw opened this issue Oct 25, 2024 · 67 comments
Labels: attachments, enhancement (New feature or request)


simonw commented Oct 25, 2024

Previous work is in earlier linked issues.

I'm going a different direction. Previously I had just been thinking about images, but Gemini accepts PDFs and videos and audio clips and the latest GPT-4o model supports audio clips too.

The llm prompt command isn't using -a for anything yet, so I'm going to have -a filename be the way an attachment (or multiple attachments) is added to a prompt.

-a is short for --attachment - not for --attach because that already means something different for the llm embed-multi command (it attaches extra SQLite databases).

TODO

  • Get llm 'describe image' -a image.jpeg working
  • And llm 'describe image' -a https://static.simonwillison.net/static/2024/imgcat.jpg
  • And cat image.jpeg | llm 'describe image' -a -
  • Think about how async might work. Maybe the Attachment class should not have code for httpx.get() fetching of content, since an asyncio wrapper may want to do that a different way.
  • Figure out database persistence, so continue conversation can work
  • Implement OpenAI and Gemini plugins
  • Docs for how to write plugins that accept attachments
  • llm logs output for prompts with attachments
  • llm logs --json output
  • Finalize Python API
  • Document Python API
  • Document how to use attachments in CLI
  • Ship an alpha
  • Automated tests

Out of scope for this issue:

  • llm chat support for attachments via !attachment path-or-url
simonw added the enhancement and multi-modal labels Oct 25, 2024

simonw commented Oct 25, 2024

Here's some research I did against Gemini the other day: https://til.simonwillison.net/llms/prompt-gemini

Resulting in a Bash script that could do this:

prompt-gemini 'extract text from this image' example-handwriting.jpg

In that case it was detecting the file type from the file extension, since that type needs to be passed like so:

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "$(base64 -i image.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}

But in some cases the file extension may not be usable. In those cases I'm going to have a second option: --at which is short for --attachment-type - and that's going to take a file path and an explicit type like this:

llm prompt "extract text" --at myimage image/png

Models that accept attachments should specify which MIME types (image/png and so on) they accept.


simonw commented Oct 25, 2024

I'm going to use the term "attachments" for binary files returned by models as well. So far I have two examples of those:


simonw commented Oct 25, 2024

Incoming attachments to the CLI tool can be specified in one of three ways:

  • A URL
  • A path to a file on disk
  • A - which means read from standard input (if I can get that to work cleanly)

Some models accept URLs directly, in which case the URL will be passed to the model. Other models don't, in which case LLM will detect that and download the image from the URL and send the bytes.
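
A rough sketch of that fallback logic (supports_urls and the helper are hypothetical names, not the final API):

import base64

import httpx


def resolve_attachment(model, url):
    # Hypothetical helper: pass the URL straight through if the model
    # can fetch URLs itself, otherwise download and base64-encode the bytes
    if model.supports_urls:
        return {"url": url}
    response = httpx.get(url)
    response.raise_for_status()
    return {
        "base64": base64.b64encode(response.content).decode("ascii"),
        "mimetype": response.headers.get("content-type"),
    }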


simonw commented Oct 25, 2024

Here's an interesting challenge: do we resize the images before we send them or not?

Different models have different recommendations around this. I expect there are some models out there that are vastly less expensive if you resize the image before sending it, in which case resizing is an important feature.

We could use Pillow for that. Question is, how do we know what dimensions to resize to?

Maybe this can be an option that the model classes themselves specify. We could have a CLI option for --no-resize which users can send if they really don't want that to happen.
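
A minimal sketch of what that Pillow step could look like, assuming each model class exposes a recommended maximum dimension:

from io import BytesIO

from PIL import Image


def resize_if_needed(data: bytes, max_size: int) -> bytes:
    # Shrink to fit within a max_size square, preserving aspect ratio;
    # thumbnail() is a no-op if the image is already small enough
    image = Image.open(BytesIO(data))
    image.thumbnail((max_size, max_size))
    buffer = BytesIO()
    image.save(buffer, format=image.format or "PNG")
    return buffer.getvalue()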


simonw commented Oct 25, 2024

https://platform.openai.com/docs/guides/vision/calculating-costs says:

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.

That's pretty complicated! It also exposes the need for a mechanism for sending detail low/high when making the API calls.
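
As a sanity check, here's that calculation in Python as I read those docs (my own sketch, not OpenAI's code) - a 1024 x 1024 image works out to 4 tiles, or 765 tokens:

import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85
    # Scale down to fit within a 2048 x 2048 square, keeping aspect ratio
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Scale down so the shortest side is 768px
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    # 170 tokens per 512px tile, plus a flat 85
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85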

One option could be to default to low and allow users of that model to do this:

llm -m gpt-4o --at bigimage.png image-high/png

So we abuse the --at option and invent a special content type that maps to that high detail setting. Bit weird though.

I'll keep an eye out for any other oddities like that in other models that may need to be supported.


simonw commented Oct 25, 2024

GPT-4o format support:

We currently support PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif).

Here's what GPT-4o preview audio input looks like:

{
  "model": "gpt-4o-audio-preview",
  "modalities": ["text", "audio"],
  "audio": { "voice": "alloy", "format": "wav" },
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this recording?" },
        { 
          "type": "input_audio", 
          "input_audio": { 
            "data": "<base64 bytes here>", 
            "format": "wav" 
          }
        }
      ]
    }
  ]
}

Where format can be wav or mp3 according to the API reference docs on https://platform.openai.com/docs/api-reference/chat/create

Interestingly you don't need to pass the image type for images, even for base64 data:

[screenshot of the OpenAI API docs]

Since detail is optional I may ignore it for the first implementation of this.


simonw commented Oct 25, 2024

Claude models DO require a content type:

{
    "role": "user",
    "content": [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": "/9j/4AAQSkZJRg..."
            }
        },
        {
            "type": "text",
            "text": "What is in this image?"
        }
    ]
}

https://docs.anthropic.com/en/api/messages says

We currently support the base64 source type for images, and the image/jpeg, image/png, image/gif, and image/webp media types.

As far as I can tell Claude doesn't accept URLs to images, only base64 encoded data.
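
A small sketch of how a plugin might build that block from a file on disk:

import base64
import pathlib


def claude_image_block(path: str, media_type: str) -> dict:
    # Claude's messages API wants base64 data plus an explicit media_type
    data = base64.standard_b64encode(pathlib.Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }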


simonw commented Oct 25, 2024

Gemini supports these image formats: https://ai.google.dev/gemini-api/docs/vision?lang=rest

  • PNG - image/png
  • JPEG - image/jpeg
  • WEBP - image/webp
  • HEIC - image/heic
  • HEIF - image/heif

Maybe LLM should know how to convert images from unsupported formats to supported formats? Not sure if that's worth the fuss, maybe a plugin thing at a later date?

Gemini has a file API and really encourages you to upload images first... but it says that if your files add up to less than 20MB you can use base64 instead. I think I'll stick with base64 at first.

Gemini can also do what it calls "document processing" - https://ai.google.dev/gemini-api/docs/document-processing?lang=rest

Document pages must be in one of the following text data MIME types:

  • PDF - application/pdf
  • JavaScript - application/x-javascript, text/javascript
  • Python - application/x-python, text/x-python
  • TXT - text/plain
  • HTML - text/html
  • CSS - text/css
  • Markdown - text/md
  • CSV - text/csv
  • XML - text/xml
  • RTF - text/rtf

Each document page is equivalent to 258 tokens.

I definitely want to support these, especially since they can represent a big discount on overall cost because of the weird 258 token flat rate (also the rate for an image).

And for audio: https://ai.google.dev/gemini-api/docs/audio?lang=rest

Gemini supports the following audio format MIME types:

  • WAV - audio/wav
  • MP3 - audio/mp3
  • AIFF - audio/aiff
  • AAC - audio/aac
  • OGG Vorbis - audio/ogg
  • FLAC - audio/flac

Gemini imposes the following rules on audio:

  • Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
  • Gemini can only infer responses to English-language speech.
  • Gemini can "understand" non-speech components, such as birdsong or sirens.

And video too! https://ai.google.dev/gemini-api/docs/vision?lang=rest#technical-details-video

Gemini 1.5 Pro and Flash support up to approximately an hour of video data.

Video must be in one of the following video format MIME types:

  • video/mp4
  • video/mpeg
  • video/mov
  • video/avi
  • video/x-flv
  • video/mpg
  • video/webm
  • video/wmv
  • video/3gpp

The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.

I think llm-gemini may be the most interesting of the initial plugins for multi-modal attachments, especially given the research I already did in https://til.simonwillison.net/llms/prompt-gemini
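
If llm-gemini ends up guessing types from file extensions, a stdlib-based sketch might look like this - the override table is my guess at where mimetypes disagrees with what Gemini documents, not an exhaustive list:

import mimetypes

# Extensions where the stdlib answer differs from (or is missing from)
# the MIME types in Gemini's docs - assumed mapping
GEMINI_OVERRIDES = {
    ".mp3": "audio/mp3",  # mimetypes says audio/mpeg
    ".md": "text/md",     # mimetypes says text/markdown
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def gemini_mimetype(filename: str) -> str | None:
    for ext, mime in GEMINI_OVERRIDES.items():
        if filename.lower().endswith(ext):
            return mime
    return mimetypes.guess_type(filename)[0]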


simonw commented Oct 25, 2024

As far as I can tell there is no way to provide Gemini with a URL to content that has NOT been uploaded first to the Google file service.

So out of OpenAI, Anthropic, Google it looks like OpenAI are the only ones that accept an arbitrary URL to an image.


simonw commented Oct 25, 2024

I don't know if OpenAI accept URLs to both images and audio clips. To be safe, maybe the API design should have the ability to define an accepts_urls(file) method which can say yes or no dynamically based on the file that is passed in.
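
A sketch of what that hook could look like on the model class (every name here is hypothetical):

class Model:
    def accepts_urls(self, attachment) -> bool:
        # Conservative default: download the bytes rather than pass the URL
        return False


class OpenAIChat(Model):
    def accepts_urls(self, attachment) -> bool:
        # Hypothetical: allow URLs for images, require bytes for audio
        return attachment.type.startswith("image/")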


simonw commented Oct 25, 2024

The Pixtral API accepts URLs: https://docs.mistral.ai/capabilities/vision/

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": "https://tripfixers.com/wp-content/uploads/2019/11/eiffel-tower-with-snow.jpeg"
            }
        ]
    }
]

Or base64 images:

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": "data:image/jpeg;base64,{base64_image}" 
            }
        ]
    }
]

Note that you don't have to specify image/jpeg with a URL but you do with a base64 image.

Supported file types:

  • PNG (.png)
  • JPEG (.jpeg and .jpg)
  • WEBP (.webp)
  • Non-animated GIF with only one frame (.gif)


simonw commented Oct 25, 2024

Groq API also supports both base64 and regular URLs, for Llama 3.2 vision models: https://console.groq.com/docs/vision

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }
]

And with a regular URL:

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Whats the weather like in this state?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                }
            }
        ]
    }
]


simonw commented Oct 25, 2024

This worked against Groq:

curl https://api.groq.com/openai/v1/chat/completions -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GROQ_API_KEY" \
-d '{
    "model": "llama-3.2-11b-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in a great deal of detail, do not describe any people in it"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
}' | jq


Returned:

{
  "id": "chatcmpl-33f4a341-1dd6-44fa-8dbd-90ed9d437b80",
  "object": "chat.completion",
  "created": 1729826570,
  "model": "llama-3.2-11b-vision-preview",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image depicts the iconic Statue of Liberty in New York Harbor, with the Manhattan skyline in the background. \n\nThe Statue of Liberty is prominently featured in the foreground, situated on a small island that juts out into the harbor. The statue's copper sheeting, which is normally a bright green due to oxidation, appears a slightly lighter shade in the image, possibly due to the lighting conditions or exposure of the photo. The statue's broken shackles and chains are visible, symbolizing the abolition of slavery.\n\nIn the background, the Manhattan skyline rises majestically, dominated by the towering skyscrapers of the Financial District. The image showcases several notable landmarks, including One World Trade Center, the former World Trade Center, and the majestic Brooklyn Bridge. The atmosphere of the image is peaceful and serenely beautiful, with the calm waters of the harbor reflecting the soft light of the setting sun. The overall mood is one of tranquility and wonder, inviting the viewer to appreciate the majesty of this iconic symbol of freedom."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "queue_time": 0.258170439,
    "prompt_tokens": 28,
    "prompt_time": 0.001334961,
    "completion_tokens": 207,
    "completion_time": 0.421154809,
    "total_tokens": 235,
    "total_time": 0.42248977
  },
  "system_fingerprint": "fp_fa3d3d25b0",
  "x_groq": {
    "id": "req_01jb0v5ganf85s8gfzcxh9n50r"
  }
}



simonw commented Oct 25, 2024

OK, database design.

Reminder: the current schema on https://llm.datasette.io/en/stable/logging.html#sql-schema looks like this:

CREATE TABLE [conversations] (
  [id] TEXT PRIMARY KEY,
  [name] TEXT,
  [model] TEXT
);
CREATE TABLE [responses] (
  [id] TEXT PRIMARY KEY,
  [model] TEXT,
  [prompt] TEXT,
  [system] TEXT,
  [prompt_json] TEXT,
  [options_json] TEXT,
  [response] TEXT,
  [response_json] TEXT,
  [conversation_id] TEXT REFERENCES [conversations]([id]),
  [duration_ms] INTEGER,
  [datetime_utc] TEXT
);
CREATE VIRTUAL TABLE [responses_fts] USING FTS5 (
  [prompt],
  [response],
  content=[responses]
);

I'm going to have a new attachments table, with the following columns:

  • id - a text ULID
  • path - optional text that's a fully resolved path to a file on disk
  • url - optional text that's a URL to something online
  • content - optional blob that's the actual binary content
  • type - text, image/png etc

At least one of path, url or content will need to be populated.

Here's where things get tricky: how should these be associated with data in the responses table?

These things can be used for both input AND output (the OpenAI audio output case). So maybe there are two many-to-many tables:

  • attachment_inputs - links attachment_id and response_id to represent an input
  • attachment_outputs - links attachment_id and response_id to represent an output

I guess for outputs I'll be populating just the content and type columns - and I'll provide LLM CLI commands for exporting those to disk. They'll be easier to view when I get a web UI working for the tool.

I'm not crazy about those table names. Other option:

  • prompt_attachments / response_attachments - a bit weird because there's no prompts table (sketched as SQL below)
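
Sketched as SQL, using those candidate names and the column list from above (nothing here is final):

CREATE TABLE [attachments] (
  [id] TEXT PRIMARY KEY,
  [type] TEXT,
  [path] TEXT,
  [url] TEXT,
  [content] BLOB
);
CREATE TABLE [prompt_attachments] (
  [response_id] TEXT REFERENCES [responses]([id]),
  [attachment_id] TEXT REFERENCES [attachments]([id]),
  PRIMARY KEY ([response_id], [attachment_id])
);
CREATE TABLE [response_attachments] (
  [response_id] TEXT REFERENCES [responses]([id]),
  [attachment_id] TEXT REFERENCES [attachments]([id]),
  PRIMARY KEY ([response_id], [attachment_id])
);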


simonw commented Oct 25, 2024

Should I still store the full prompt_json and response_json - including the base64 encoded images - if I'm also storing duplicates of those in the attachments table?

If not I could invent my own clever JSON format, something like this:

{
    "model": "llama-3.2-11b-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in a great deal of detail, do not describe any people in it"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": {
                            "$attachment": {
                                "id": "v01jax4p0rstqbs3fvbkkszvt",
                                "column": "url"
                            }
                        }
                    }
                }
            ]
        }
    ]
}

So that {"$attachment"...} bit gets dynamically replaced by the data pulled from the attachments table.


mqudsi commented Oct 25, 2024

My vote would be for not duplicating the images/attachments in the database.

You don't strictly need to make (serious) changes to the prompt_json/response_json if you invert the hierarchy and reference the prompt/response from the attachments table as foreign keys (though having them directly available here would, of course, also be useful if you're making changes anyway).

If the attachments are stored as sqlite blobs, you would also reduce the cost of base64-encoding their contents. I would definitely wish for there to be a proper sqlite foreign key and first-class (non-json) references between the two (three) tables.

Giving it some more thought, it would actually be better to have the references in the prompt/response tables pointing to the attachments table (rather than the other way around), because I just realized that you might want to deduplicate the attachments. You could quite easily imagine a person making 10+ queries in a row against the same attachment, which could be rather large (in the case of some of the Gemini models, tens of MiBs). If you're already ingesting the full attachment, I would definitely say go ahead and calculate the sha256 (often hardware-accelerated and very much standardized) or wyhash (the fastest general hash I've found ported to Python, from a quick search) as you ingest each chunk. It'll add fairly low overhead (since you're IO-limited), but you'll realize massive savings if you can avoid adding a possibly very large blob to the database each time.


simonw commented Oct 26, 2024

Great point about de-duplicating attachments there, given the need to support long conversation threads.

... actually that's handled a bit already: the responses table doesn't store the full JSON that was sent to the LLM for each message; instead it stores the JSON for that specific round of request/response and uses the foreign key to conversations when it needs to inflate the full previous message history to send in a chat completion.

I'll still think about ways to avoid duplicate storage though - might even calculate a sha256 hash of the BLOB content and store that in a column (or maybe even use that as the ID itself?)

A neat thing about using a SHA ID is that it means if you send the same stored image to multiple different LLMs (to compare their responses for example) you only record it once in the database. That's a pretty compelling reason to do this.

Note that my current idea is that if you store path in the table you don't store the content BLOB, on the assumption that the path will continue to work in the future. It does mean you'll get an error if you attempt to continue a conversation that used path and those files are no longer there, though.

... so I may have some kind of option that means "store the images in the database BLOB columns anyway", maybe this:

llm -m claude-3.5-sonnet "describe this image" -a image.png --store-attachments

Could be --sa for short (`-s` already means system prompt).
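
The content-hash-as-ID idea from a couple of paragraphs up is a one-liner (a sketch, assuming the full bytes are in hand at logging time):

import hashlib


def attachment_id(content: bytes) -> str:
    # Content-addressed: identical bytes always get the same ID, so the
    # same image sent to several models is stored exactly once
    return hashlib.sha256(content).hexdigest()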

irthomasthomas commented:
Is there anything I can do to help? I made a whole vision analysis cli based on claude. https://github.com/irthomasthomas/claude-vision
But full multi-modal support in llm would be amazing. So lmk if I can help at all


simonw commented Oct 26, 2024

Initial attempt at an AttachmentType parameter:

diff --git a/llm/cli.py b/llm/cli.py
index a1b1457..1082dc7 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -30,7 +30,10 @@ from llm import (
 from .migrations import migrate
 from .plugins import pm
 import base64
+from dataclasses import dataclass
+import httpx
 import pathlib
+import puremagic
 import pydantic
 import readline
 from runpy import run_module
@@ -48,6 +51,44 @@ warnings.simplefilter("ignore", ResourceWarning)
 DEFAULT_TEMPLATE = "prompt: "
 
 
+@dataclass
+class Attachment:
+    mimetype: str
+    filepath: str
+    url: str
+    content: bytes
+
+
+class AttachmentType(click.ParamType):
+    name = "attachment"
+
+    def convert(self, value, param, ctx):
+        if value == "-":
+            content = sys.stdin.buffer.read()
+            # Try to guess type
+            try:
+                mimetype = puremagic.from_string(content, mime=True)
+            except puremagic.PureError:
+                raise click.BadParameter("Could not determine mimetype of stdin")
+            return Attachment(mimetype, None, None, content)
+        if "://" in value:
+            # Confirm URL exists and try to guess type
+            try:
+                response = httpx.head(value)
+                response.raise_for_status()
+                mimetype = response.headers.get("content-type")
+            except httpx.HTTPError as ex:
+                raise click.BadParameter(str(ex))
+            return Attachment(mimetype, None, value, None)
+        # Check that the file exists
+        path = pathlib.Path(value)
+        if not path.exists():
+            self.fail(f"File {value} does not exist", param, ctx)
+        # Try to guess type
+        mimetype = puremagic.from_file(str(path), mime=True)
+        return Attachment(mimetype, str(path), None, None)
+
+
 def _validate_metadata_json(ctx, param, value):
     if value is None:
         return value
@@ -88,6 +129,22 @@ def cli():
 @click.argument("prompt", required=False)
 @click.option("-s", "--system", help="System prompt to use")
 @click.option("model_id", "-m", "--model", help="Model to use")
+@click.option(
+    "attachments",
+    "-a",
+    "--attachment",
+    type=AttachmentType(),
+    multiple=True,
+    help="Attachment path or URL or -",
+)
+@click.option(
+    "attachment_types",
+    "--at",
+    "--attachment-type",
+    type=(str, str),
+    multiple=True,
+    help="Attachment with explicit mimetype",
+)
 @click.option(
     "options",
     "-o",
@@ -127,6 +184,8 @@ def prompt(
     prompt,
     system,
     model_id,
+    attachments,
+    attachment_types,
     options,
     template,
     param,
@@ -143,6 +202,8 @@ def prompt(
 
     Documentation: https://llm.datasette.io/en/stable/usage.html
     """
+    print(attachments)
+    return
     if log and no_log:
         raise click.ClickException("--log and --no-log are mutually exclusive")
 
diff --git a/setup.py b/setup.py
index 1f6adcd..b8b55bf 100644
--- a/setup.py
+++ b/setup.py
@@ -48,6 +48,7 @@ setup(
         "setuptools",
         "pip",
         "pyreadline3; sys_platform == 'win32'",
+        "puremagic",
     ],
     extras_require={
         "test": [


simonw commented Oct 26, 2024

I'm thinking about how the Python API is going to work. I'm leaning towards this:

model = llm.get_model("gpt-4o")
response = model.prompt("Describe these images", open("image.jpg", "rb"), open("image2.jpg", "rb"))

I could have that accept file-like objects or string paths or string URLs, or maybe I could tell people to do this instead:

response = model.prompt(
    "Describe these images",
    llm.Attachment(url="https://..."),
    llm.Attachment(path="image.jpg")
)

I like that second option better - it's a better fit for Python's optional type hints.

So the prompt() method starts taking multiple attachment arguments that follow the initial prompt.

Right now the signature of that method looks like this:

llm/llm/models.py, lines 270 to 280 in d654c95:

def prompt(
    self,
    prompt: Optional[str],
    system: Optional[str] = None,
    stream: bool = True,
    **options
):
    return self.response(
        Prompt(prompt, system=system, model=self, options=self.Options(**options)),
        stream=stream,
    )

Technically this would be a breaking change, because system= and stream= are currently available as positional arguments. In the docs I've only described them as keyword arguments, though:

llm/docs/python-api.md, lines 42 to 51 in d654c95:

### System prompts
For models that accept a system prompt, pass it as `system="..."`:
```python
response = model.prompt(
    "Five surprising names for a pet pelican",
    system="Answer like GlaDOS"
)
```
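
One way the prompt() signature might evolve to support that - attachments as extra positional arguments, which forces system= and stream= to become keyword-only (a sketch, not a final design; the attachments= parameter on Prompt is hypothetical):

from typing import Optional


def prompt(
    self,
    prompt: Optional[str],
    *attachments: "Attachment",
    system: Optional[str] = None,
    stream: bool = True,
    **options,
):
    return self.response(
        Prompt(
            prompt,
            attachments=list(attachments),  # hypothetical Prompt parameter
            system=system,
            model=self,
            options=self.Options(**options),
        ),
        stream=stream,
    )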


simonw commented Oct 27, 2024

... well I got this to work (including some hacking around with llm-gemini):

% llm -m gemini-1.5-flash-8b-latest 'transcribe' --at russian-pelican-in-spanish.mp3 audio/mp3
Oye camarada, aquí está tu pelicano Californiano con acento Russo.  ¡Qué tal!  Listo para charlar en español.

How's your day today?

¡Mi día! ¡Estoy volando sobre las olas, buscando peces y disfrutando del sol Californiano!  ¿Y tú, amigo? ¿Cómo ha estado tu día?

% llm -m gemini-1.5-flash-8b-latest 'extract all text' -a llm-pictionary.mp4
prompt extract all text attachments (Attachment(type='video/mp4', path='llm-pictionary.mp4', url=None, content=None),)
LLM Pictionary
START NEW ROUND
Round 5: Claude 3.5 Sonnet (Oct 24) is drawing
Llama 3.2 90B Vision Instruc: Sky;
Claude 3.5 Sonnet (June 24): Ocean;
GPT-40: Ocean;
Gemini Flash 1.5-002: Sky;
Claude 3.5 Sonnet (Oct 24): ocean;
Gemini Pro 1.5-002: Ocean;
GPT-40 Mini: Sky;
Llama 3.2 90B Vision Instruc: Sun;
Gemini Flash 1.5-002: Ocean;
GPT-40 Mini: Beach
Llama 3.2 90B Vision Instruc: Image can't be displayed. An ocean sun beach is the likely depiction.
Claude 3.5 Sonnet (June 24) guessed first!


simonw commented Oct 27, 2024

OK, I have a working prototype of this for both the default OpenAI plugin and the llm-gemini plugin. Still todo: (moved this to issue body ^)


simonw commented Oct 27, 2024

llm -m gpt-4o 'ocr' -a example.jpg 
Example handwriting

Let's try this out



simonw commented Oct 27, 2024

Design question: I keep mistakenly running this:

llm -m gpt-4o 'ocr' example.jpg 

Which currently gives this error:

Error: Got unexpected extra argument (example.jpg)

I could say that all extra arguments are treated as attachments.

But it's been suggested in the past that llm should accept a prompt split across multiple unquoted arguments, so you can do this:

llm -m gpt-4o capital of france

Which I know is a nice pattern because https://github.com/simonw/llm-cmd does it:

llm cmd use ffmpeg to convert blah.mp4 to mp3

So I have three options:

  1. Leave it alone - llm ocr image.jpg will error
  2. Say that optional arguments are treated as attachments - llm ocr image.jpg will work
  3. Say that optional arguments are part of the prompt - llm ocr image.jpg would be treated the same as llm "ocr image.jpg", you would have to do llm ocr -a image.jpg explicitly

I'm torn between all three options at the moment.


simonw commented Oct 27, 2024

Here's a puremagic annoyance:

% python -c 'import puremagic, sys; print(puremagic.from_file(sys.argv[-1], mime=True))' \
  image.jpg
image/jpeg
% python -c 'import puremagic, sys; print(puremagic.from_file(sys.argv[-1], mime=True))' \
  russian-pelican-in-spanish.mp3
audio/mpeg
% python -c 'import puremagic, sys; print(puremagic.from_file(sys.argv[-1], mime=True))' \
  llm-pictionary.mp4 
video/mp4

Note that the mp3 file was identified as audio/mpeg - but that doesn't work for Gemini, which is why earlier I had to do this instead:

% llm -m gemini-1.5-flash-8b-latest 'transcribe' \
  --at russian-pelican-in-spanish.mp3 audio/mp3

Can I get puremagic to treat audio/mp3 as a better match than audio/mpeg?


simonw commented Oct 27, 2024

Tried this:

python -c '
import puremagic, sys, pprint
pprint.pprint(
    puremagic.magic_stream(open(sys.argv[-1], "rb"))
)' russian-pelican-in-spanish.mp3

Got:

[PureMagicWithConfidence(byte_match=b'ID3\x04\x00\x00\x00\x00\x02\x0cTXXX', offset=10, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) ID3v2.4.0 audio file', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'ID3\x04\x00', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 ID3v2.4.0 (MP3) audio file', confidence=0.5),
 PureMagicWithConfidence(byte_match=b'ID3', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) audio file', confidence=0.3)]
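
One workaround (my own sketch - not a puremagic feature) is to special-case the extension after the fact:

import puremagic


def guess_mimetype(path: str) -> str:
    mime = puremagic.from_file(path, mime=True)
    # Gemini wants audio/mp3, but puremagic reports MP3s as audio/mpeg
    if mime == "audio/mpeg" and path.lower().endswith(".mp3"):
        return "audio/mp3"
    return mime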



simonw commented Oct 28, 2024

Alpha is out!

uvx --with 'llm==0.17a0' llm prompt -m gpt-4o 'describe this image' -a https://static.simonwillison.net/static/2024/pelicans.jpg

The image shows a large group of birds gathered on a rocky area near a body of water. There are numerous pelicans along with smaller birds, likely resting or socializing. The scene suggests a busy and crowded wildlife area, possibly a coastal or lakeside habitat.


simonw commented Oct 28, 2024

This works too:

alias llm="uvx --with 'llm==0.17a0' llm"

Then:

llm --version

llm, version 0.17a0

simonw changed the title Oct 28, 2024, removing the trailing period.
simonw added commits to simonw/llm-gemini that referenced this issue Oct 28, 2024:

* Prototype of attachments support
* Support for continued attachment conversations

Refs simonw/llm#587

simonw commented Oct 28, 2024

https://github.com/simonw/llm-gemini/releases/tag/0.3a0

I've released an alpha of llm-gemini too - so you can do this:

alias llm="uvx --with 'llm==0.17a0' --with 'llm-gemini==0.3a0' llm"

And then this (hat tip to Drew on Discord for the idea of hitting this webcam URL):

llm -m gemini-1.5-flash-latest \
  'how foggy is it on a scale of 1-10, also tell me the current time and date and elevation and vibes' \
  -a 'https://cameras.alertcalifornia.org/public-camera-data/Axis-Purisma1/latest-frame.jpg'

It's not foggy at all. The fog level is 0.
The current time and date is October 28, 2024 at 4:18:12 PM.
The elevation is 738.
The vibes are serene and peaceful. The landscape is beautiful with rolling hills and a clear blue sky. The air is probably crisp and fresh.

simonw added commits to simonw/llm-claude-3 that referenced this issue Oct 29, 2024

simonw commented Oct 29, 2024

https://github.com/simonw/llm-claude-3/releases/tag/0.6a0

uvx --with 'llm==0.17a0' --with 'llm-claude-3==0.6a0' \          
  llm -m claude-3.5-sonnet 'describe image' \
  -a https://static.simonwillison.net/static/2024/pelicans.jpg

This image shows a large gathering of brown pelicans crowded together on what appears to be a rocky shoreline next to water. The pelicans are densely packed, with their distinctive long beaks and pouches visible. You can see dozens of them huddled together, creating a mass of brown and grey feathers. Pelicans often gather in large groups like this to rest and socialize, which is known as a "pod" or "squadron" of pelicans. The water in the background appears calm and dark, providing a nice contrast to the lighter colored birds.


simonw commented Oct 29, 2024

OK, this is great! I'm going to ship it.
