
Solve multi-modal models with a new concept of "attachments" #587

Closed · 14 tasks done
simonw opened this issue Oct 25, 2024 · 67 comments
Labels: attachments, enhancement (New feature or request)


simonw commented Oct 25, 2024

Previous work is in earlier linked issues.

I'm going a different direction. Previously I had just been thinking about images, but Gemini accepts PDFs and videos and audio clips and the latest GPT-4o model supports audio clips too.

The llm prompt command isn't using -a for anything yet, so I'm going to have -a filename be the way an attachment (or multiple attachments) is added to a prompt.

-a is short for --attachment - not for --attach because that already means something different for the llm embed-multi command (it attaches extra SQLite databases).

TODO

  • Get llm 'describe image' -a image.jpeg working
  • And llm 'describe image' -a https://static.simonwillison.net/static/2024/imgcat.jpg
  • And cat image.jpeg | llm 'describe image' -a -
  • Think about how async might work. Maybe the Attachment class should not have code for httpx.get() fetching of content, since an asyncio wrapper may want to do that a different way.
  • Figure out database persistence, so continue conversation can work
  • Implement OpenAI and Gemini plugins
  • Docs for how to write plugins that accept attachments
  • llm logs output for prompts with attachments
  • llm logs --json output
  • Finalize Python API
  • Document Python API
  • Document how to use attachments in CLI
  • Ship an alpha
  • Automated tests

Out of scope for this issue:

  • llm chat support for attachments via !attachment path-or-url
simonw added the enhancement and multi-modal labels Oct 25, 2024

simonw commented Oct 25, 2024

Here's some research I did against Gemini the other day: https://til.simonwillison.net/llms/prompt-gemini

Resulting in a Bash script that could do this:

prompt-gemini 'extract text from this image' example-handwriting.jpg

In that case it was detecting the file type from the file extension, since that type needs to be passed like so:

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "$(base64 -i image.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}

But in some cases the file extension may not be usable. In those cases I'm going to have a second option: --at which is short for --attachment-type - and that's going to take a file path and an explicit type like this:

llm prompt "extract text" --at myimage image/png

Models that accept attachments should specify which MIME types (image/png and so on) they accept.


simonw commented Oct 25, 2024

I'm going to use the term "attachments" for binary files returned by models as well. So far I have two examples of those:


simonw commented Oct 25, 2024

Incoming attachments to the CLI tool can be specified in one of three ways:

  • A URL
  • A path to a file on disk
  • A - which means read from standard input (if I can get that to work cleanly)

Some models accept URLs directly, in which case the URL will be passed to the model. Other models don't, in which case LLM will detect that and download the image from the URL and send the bytes.
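
A rough sketch of that fallback logic (supports_urls and the helper are hypothetical names, not the final API):

import base64

import httpx


def resolve_attachment(model, url):
    # Hypothetical helper: pass the URL straight through if the model
    # can fetch URLs itself, otherwise download and base64-encode the bytes
    if model.supports_urls:
        return {"url": url}
    response = httpx.get(url)
    response.raise_for_status()
    return {
        "base64": base64.b64encode(response.content).decode("ascii"),
        "mimetype": response.headers.get("content-type"),
    }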


simonw commented Oct 25, 2024

Here's an interesting challenge: do we resize the images before we send them or not?

Different models have different recommendations around this. I expect there are some models out there that are vastly less expensive if you resize the image before sending it, in which case resizing is an important feature.

We could use Pillow for that. Question is, how do we know what dimensions to resize to?

Maybe this can be an option that the model classes themselves specify. We could have a CLI option for --no-resize which users can send if they really don't want that to happen.
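
A minimal sketch of what that Pillow step could look like, assuming each model class exposes a recommended maximum dimension:

from io import BytesIO

from PIL import Image


def resize_if_needed(data: bytes, max_size: int) -> bytes:
    # Shrink to fit within a max_size square, preserving aspect ratio;
    # thumbnail() is a no-op if the image is already small enough
    image = Image.open(BytesIO(data))
    image.thumbnail((max_size, max_size))
    buffer = BytesIO()
    image.save(buffer, format=image.format or "PNG")
    return buffer.getvalue()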


simonw commented Oct 25, 2024

https://platform.openai.com/docs/guides/vision/calculating-costs says:

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.

That's pretty complicated! It also exposes the need for a mechanism for sending detail low/high when making the API calls.
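
As a sanity check, here's that calculation in Python as I read those docs (my own sketch, not OpenAI's code) - a 1024 x 1024 image works out to 4 tiles, or 765 tokens:

import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85
    # Scale down to fit within a 2048 x 2048 square, keeping aspect ratio
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Scale down so the shortest side is 768px
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    # 170 tokens per 512px tile, plus a flat 85
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85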

One option could be to default to low and allow users of that model to do this:

llm -m gpt-4o --at bigimage.png image-high/png

So we abuse the --at option and invent a special content type that maps to that high detail setting. Bit weird though.

I'll keep an eye out for any other oddities like that in other models that may need to be supported.


simonw commented Oct 25, 2024

GPT-4o format support:

We currently support PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif).

Here's what GPT-4o preview audio input looks like:

{
  "model": "gpt-4o-audio-preview",
  "modalities": ["text", "audio"],
  "audio": { "voice": "alloy", "format": "wav" },
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this recording?" },
        { 
          "type": "input_audio", 
          "input_audio": { 
            "data": "<base64 bytes here>", 
            "format": "wav" 
          }
        }
      ]
    }
  ]
}

Where format can be wav or mp3 according to the API reference docs on https://platform.openai.com/docs/api-reference/chat/create

Interestingly you don't need to pass the image type for images, even for base64 data:

[screenshot of the OpenAI API docs]

Since detail is optional I may ignore it for the first implementation of this.


simonw commented Oct 25, 2024

Claude models DO require a content type:

{
    "role": "user",
    "content": [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": "/9j/4AAQSkZJRg..."
            }
        },
        {
            "type": "text",
            "text": "What is in this image?"
        }
    ]
}

https://docs.anthropic.com/en/api/messages says

We currently support the base64 source type for images, and the image/jpeg, image/png, image/gif, and image/webp media types.

As far as I can tell Claude doesn't accept URLs to images, only base64 encoded data.
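
A small sketch of how a plugin might build that block from a file on disk:

import base64
import pathlib


def claude_image_block(path: str, media_type: str) -> dict:
    # Claude's messages API wants base64 data plus an explicit media_type
    data = base64.standard_b64encode(pathlib.Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }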


simonw commented Oct 25, 2024

Gemini supports these image formats: https://ai.google.dev/gemini-api/docs/vision?lang=rest

  • PNG - image/png
  • JPEG - image/jpeg
  • WEBP - image/webp
  • HEIC - image/heic
  • HEIF - image/heif

Maybe LLM should know how to convert images from unsupported formats to supported formats? Not sure if that's worth the fuss, maybe a plugin thing at a later date?

Gemini has a file API and really encourages you to upload images first... but it says that if your files add up to less than 20MB you can use base64 instead. I think I'll stick with base64 at first.

Gemini can also do what it calls "document processing" - https://ai.google.dev/gemini-api/docs/document-processing?lang=rest

Document pages must be in one of the following text data MIME types:

  • PDF - application/pdf
  • JavaScript - application/x-javascript, text/javascript
  • Python - application/x-python, text/x-python
  • TXT - text/plain
  • HTML - text/html
  • CSS - text/css
  • Markdown - text/md
  • CSV - text/csv
  • XML - text/xml
  • RTF - text/rtf

Each document page is equivalent to 258 tokens.

I definitely want to support these, especially since they can represent a big discount on overall cost because of the weird 258 token flat rate (also the rate for an image).

And for audio: https://ai.google.dev/gemini-api/docs/audio?lang=rest

Gemini supports the following audio format MIME types:

  • WAV - audio/wav
  • MP3 - audio/mp3
  • AIFF - audio/aiff
  • AAC - audio/aac
  • OGG Vorbis - audio/ogg
  • FLAC - audio/flac

Gemini imposes the following rules on audio:

  • Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
  • Gemini can only infer responses to English-language speech.
  • Gemini can "understand" non-speech components, such as birdsong or sirens.

And video too! https://ai.google.dev/gemini-api/docs/vision?lang=rest#technical-details-video

Gemini 1.5 Pro and Flash support up to approximately an hour of video data.

Video must be in one of the following video format MIME types:

  • video/mp4
  • video/mpeg
  • video/mov
  • video/avi
  • video/x-flv
  • video/mpg
  • video/webm
  • video/wmv
  • video/3gpp

The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.

I think llm-gemini may be the most interesting of the initial plugins for multi-modal attachments, especially given the research I already did in https://til.simonwillison.net/llms/prompt-gemini
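
If llm-gemini ends up guessing types from file extensions, a stdlib-based sketch might look like this - the override table is my guess at where mimetypes disagrees with what Gemini documents, not an exhaustive list:

import mimetypes

# Extensions where the stdlib answer differs from (or is missing from)
# the MIME types in Gemini's docs - assumed mapping
GEMINI_OVERRIDES = {
    ".mp3": "audio/mp3",  # mimetypes says audio/mpeg
    ".md": "text/md",     # mimetypes says text/markdown
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def gemini_mimetype(filename: str) -> str | None:
    for ext, mime in GEMINI_OVERRIDES.items():
        if filename.lower().endswith(ext):
            return mime
    return mimetypes.guess_type(filename)[0]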


simonw commented Oct 25, 2024

As far as I can tell there is no way to provide Gemini with a URL to content that has NOT been uploaded first to the Google file service.

So out of OpenAI, Anthropic, Google it looks like OpenAI are the only ones that accept an arbitrary URL to an image.


simonw commented Oct 25, 2024

I don't know if OpenAI accept URLs to both images and audio clips. To be safe, maybe the API design should have the ability to define an accepts_urls(file) method which can say yes or no dynamically based on the file that is passed in.
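
A sketch of what that hook could look like on the model class (every name here is hypothetical):

class Model:
    def accepts_urls(self, attachment) -> bool:
        # Conservative default: download the bytes rather than pass the URL
        return False


class OpenAIChat(Model):
    def accepts_urls(self, attachment) -> bool:
        # Hypothetical: allow URLs for images, require bytes for audio
        return attachment.type.startswith("image/")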


simonw commented Oct 25, 2024

The Pixtral API accepts URLs: https://docs.mistral.ai/capabilities/vision/

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": "https://tripfixers.com/wp-content/uploads/2019/11/eiffel-tower-with-snow.jpeg"
            }
        ]
    }
]

Or base64 images:

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": "data:image/jpeg;base64,{base64_image}" 
            }
        ]
    }
]

Note that you don't have to specify image/jpeg with a URL but you do with a base64 image.

Supported file types:

  • PNG (.png)
  • JPEG (.jpeg and .jpg)
  • WEBP (.webp)
  • Non-animated GIF with only one frame (.gif)


simonw commented Oct 25, 2024

Groq API also supports both base64 and regular URLs, for Llama 3.2 vision models: https://console.groq.com/docs/vision

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }
]

And with a regular URL:

[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Whats the weather like in this state?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                }
            }
        ]
    }
]


simonw commented Oct 25, 2024

This worked against Groq:

curl https://api.groq.com/openai/v1/chat/completions -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GROQ_API_KEY" \
-d '{
    "model": "llama-3.2-11b-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in a great deal of detail, do not describe any people in it"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
}' | jq


Returned:

{
  "id": "chatcmpl-33f4a341-1dd6-44fa-8dbd-90ed9d437b80",
  "object": "chat.completion",
  "created": 1729826570,
  "model": "llama-3.2-11b-vision-preview",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image depicts the iconic Statue of Liberty in New York Harbor, with the Manhattan skyline in the background. \n\nThe Statue of Liberty is prominently featured in the foreground, situated on a small island that juts out into the harbor. The statue's copper sheeting, which is normally a bright green due to oxidation, appears a slightly lighter shade in the image, possibly due to the lighting conditions or exposure of the photo. The statue's broken shackles and chains are visible, symbolizing the abolition of slavery.\n\nIn the background, the Manhattan skyline rises majestically, dominated by the towering skyscrapers of the Financial District. The image showcases several notable landmarks, including One World Trade Center, the former World Trade Center, and the majestic Brooklyn Bridge. The atmosphere of the image is peaceful and serenely beautiful, with the calm waters of the harbor reflecting the soft light of the setting sun. The overall mood is one of tranquility and wonder, inviting the viewer to appreciate the majesty of this iconic symbol of freedom."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "queue_time": 0.258170439,
    "prompt_tokens": 28,
    "prompt_time": 0.001334961,
    "completion_tokens": 207,
    "completion_time": 0.421154809,
    "total_tokens": 235,
    "total_time": 0.42248977
  },
  "system_fingerprint": "fp_fa3d3d25b0",
  "x_groq": {
    "id": "req_01jb0v5ganf85s8gfzcxh9n50r"
  }
}



simonw commented Oct 25, 2024

OK, database design.

Reminder: the current schema on https://llm.datasette.io/en/stable/logging.html#sql-schema looks like this:

CREATE TABLE [conversations] (
  [id] TEXT PRIMARY KEY,
  [name] TEXT,
  [model] TEXT
);
CREATE TABLE [responses] (
  [id] TEXT PRIMARY KEY,
  [model] TEXT,
  [prompt] TEXT,
  [system] TEXT,
  [prompt_json] TEXT,
  [options_json] TEXT,
  [response] TEXT,
  [response_json] TEXT,
  [conversation_id] TEXT REFERENCES [conversations]([id]),
  [duration_ms] INTEGER,
  [datetime_utc] TEXT
);
CREATE VIRTUAL TABLE [responses_fts] USING FTS5 (
  [prompt],
  [response],
  content=[responses]
);

I'm going to have a new attachments table, with the following columns:

  • id - a text ULID
  • path - optional text that's a fully resolved path to a file on disk
  • url - optional text that's a URL to something online
  • content - optional blob that's the actual binary content
  • type - text, image/png etc

At least one of path, url or content will need to be populated.

Here's where things get tricky: how should these be associated with data in the responses table?

These things can be used for both input AND output (the OpenAI audio output case). So maybe there are two many-to-many tables:

  • attachment_inputs - links attachment_id and response_id to represent an input
  • attachment_outputs - links attachment_id and response_id to represent an output

I guess for outputs I'll be populating just the content and type columns - and I'll provide LLM CLI commands for exporting those to disk. They'll be easier to view when I get a web UI working for the tool.

I'm not crazy about those table names. Other option:

  • prompt_attachments / response_attachments - a bit weird because there's no prompts table (sketched as SQL below)
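
Sketched as SQL, using those candidate names and the column list from above (nothing here is final):

CREATE TABLE [attachments] (
  [id] TEXT PRIMARY KEY,
  [type] TEXT,
  [path] TEXT,
  [url] TEXT,
  [content] BLOB
);
CREATE TABLE [prompt_attachments] (
  [response_id] TEXT REFERENCES [responses]([id]),
  [attachment_id] TEXT REFERENCES [attachments]([id]),
  PRIMARY KEY ([response_id], [attachment_id])
);
CREATE TABLE [response_attachments] (
  [response_id] TEXT REFERENCES [responses]([id]),
  [attachment_id] TEXT REFERENCES [attachments]([id]),
  PRIMARY KEY ([response_id], [attachment_id])
);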


simonw commented Oct 25, 2024

Should I still store the full prompt_json and response_json - including the base64 encoded images - if I'm also storing duplicates of those in the attachments table?

If not I could invent my own clever JSON format, something like this:

{
    "model": "llama-3.2-11b-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in a great deal of detail, do not describe any people in it"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": {
                            "$attachment": {
                                "id": "v01jax4p0rstqbs3fvbkkszvt",
                                "column": "url"
                            }
                        }
                    }
                }
            ]
        }
    ]
}

So that {"$attachment"...} bit gets dynamically replaced by the data pulled from the attachments table.


mqudsi commented Oct 25, 2024

My vote would be for not duplicating the images/attachments in the database.

You don't strictly need to make (serious) changes to the prompt_json/response_json if you invert the hierarchy and reference the prompt/response from the attachments table as foreign keys (though having them directly available here would, of course, also be useful if you're making changes anyway).

If the attachments are stored as sqlite blobs, you would also reduce the cost of base64-encoding their contents. I would definitely wish for there to be a proper sqlite foreign key and first-class (non-json) references between the two (three) tables.

Giving it some more thought, it would actually be better to have the references in the prompt/response tables pointing to the attachments table (rather than the other way around), because I just realized that you might want to deduplicate the attachments. You could quite easily imagine a person making 10+ queries in a row against the same attachment, which could be rather large (in the case of some of the Gemini models, tens of MiBs). If you're already ingesting the full attachment, I would definitely say go ahead and calculate the sha256 (often hardware-accelerated and very much standardized) or wyhash (the fastest general hash I've found ported to Python, from a quick search) as you ingest each chunk. It'll add fairly low overhead (since you're IO-limited), but you'll realize massive savings if you can avoid adding a possibly very large blob to the database each time.


simonw commented Oct 26, 2024

Great point about de-duplicating attachments there, given the need to support long conversation threads.

... actually that's handled a bit already: the responses table doesn't store the full JSON that was sent to the LLM for each message; instead it stores the JSON for that specific round of request/response and uses the foreign key to conversations when it needs to inflate the full previous message history to send in a chat completion.

I'll still think about ways to avoid duplicate storage though - might even calculate a sha256 hash of the BLOB content and store that in a column (or maybe even use that as the ID itself?)

A neat thing about using a SHA ID is that it means if you send the same stored image to multiple different LLMs (to compare their responses for example) you only record it once in the database. That's a pretty compelling reason to do this.

Note that my current idea is that if you store path in the table you don't store the content BLOB, on the assumption that the path will continue to work in the future. It does mean you'll get an error if you attempt to continue a conversation that used path and those files are no longer there, though.

... so I may have some kind of option that means "store the images in the database BLOB columns anyway", maybe this:

llm -m claude-3.5-sonnet "describe this image" -a image.png --store-attachments

Could be --sa for short (`-s` already means system prompt).
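
The content-hash-as-ID idea from a couple of paragraphs up is a one-liner (a sketch, assuming the full bytes are in hand at logging time):

import hashlib


def attachment_id(content: bytes) -> str:
    # Content-addressed: identical bytes always get the same ID, so the
    # same image sent to several models is stored exactly once
    return hashlib.sha256(content).hexdigest()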

irthomasthomas commented:
Is there anything I can do to help? I made a whole vision analysis cli based on claude. https://github.com/irthomasthomas/claude-vision
But full multi-modal support in llm would be amazing. So lmk if I can help at all


simonw commented Oct 26, 2024

Initial attempt at an AttachmentType parameter:

diff --git a/llm/cli.py b/llm/cli.py
index a1b1457..1082dc7 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -30,7 +30,10 @@ from llm import (
 from .migrations import migrate
 from .plugins import pm
 import base64
+from dataclasses import dataclass
+import httpx
 import pathlib
+import puremagic
 import pydantic
 import readline
 from runpy import run_module
@@ -48,6 +51,44 @@ warnings.simplefilter("ignore", ResourceWarning)
 DEFAULT_TEMPLATE = "prompt: "
 
 
+@dataclass
+class Attachment:
+    mimetype: str
+    filepath: str
+    url: str
+    content: bytes
+
+
+class AttachmentType(click.ParamType):
+    name = "attachment"
+
+    def convert(self, value, param, ctx):
+        if value == "-":
+            content = sys.stdin.buffer.read()
+            # Try to guess type
+            try:
+                mimetype = puremagic.from_string(content, mime=True)
+            except puremagic.PureError:
+                raise click.BadParameter("Could not determine mimetype of stdin")
+            return Attachment(mimetype, None, None, content)
+        if "://" in value:
+            # Confirm URL exists and try to guess type
+            try:
+                response = httpx.head(value)
+                response.raise_for_status()
+                mimetype = response.headers.get("content-type")
+            except httpx.HTTPError as ex:
+                raise click.BadParameter(str(ex))
+            return Attachment(mimetype, None, value, None)
+        # Check that the file exists
+        path = pathlib.Path(value)
+        if not path.exists():
+            self.fail(f"File {value} does not exist", param, ctx)
+        # Try to guess type
+        mimetype = puremagic.from_file(str(path), mime=True)
+        return Attachment(mimetype, str(path), None, None)
+
+
 def _validate_metadata_json(ctx, param, value):
     if value is None:
         return value
@@ -88,6 +129,22 @@ def cli():
 @click.argument("prompt", required=False)
 @click.option("-s", "--system", help="System prompt to use")
 @click.option("model_id", "-m", "--model", help="Model to use")
+@click.option(
+    "attachments",
+    "-a",
+    "--attachment",
+    type=AttachmentType(),
+    multiple=True,
+    help="Attachment path or URL or -",
+)
+@click.option(
+    "attachment_types",
+    "--at",
+    "--attachment-type",
+    type=(str, str),
+    multiple=True,
+    help="Attachment with explicit mimetype",
+)
 @click.option(
     "options",
     "-o",
@@ -127,6 +184,8 @@ def prompt(
     prompt,
     system,
     model_id,
+    attachments,
+    attachment_types,
     options,
     template,
     param,
@@ -143,6 +202,8 @@ def prompt(
 
     Documentation: https://llm.datasette.io/en/stable/usage.html
     """
+    print(attachments)
+    return
     if log and no_log:
         raise click.ClickException("--log and --no-log are mutually exclusive")
 
diff --git a/setup.py b/setup.py
index 1f6adcd..b8b55bf 100644
--- a/setup.py
+++ b/setup.py
@@ -48,6 +48,7 @@ setup(
         "setuptools",
         "pip",
         "pyreadline3; sys_platform == 'win32'",
+        "puremagic",
     ],
     extras_require={
         "test": [


simonw commented Oct 26, 2024

I'm thinking about how the Python API is going to work. I'm leaning towards this:

model = llm.get_model("gpt-4o")
response = model.prompt("Describe these images", open("image.jpg", "rb"), open("image2.jpg", "rb"))

I could have that accept file-like objects or string paths or string URLs, or maybe I could tell people to do this instead:

response = model.prompt(
    "Describe these images",
    llm.Attachment(url="https://..."),
    llm.Attachment(path="image.jpg")
)

I like that second option better - it's a better fit for Python's optional type hints.

So the prompt() method starts taking multiple attachment arguments that follow the initial prompt.

Right now the signature of that method looks like this:

llm/llm/models.py, lines 270 to 280 in d654c95:

def prompt(
    self,
    prompt: Optional[str],
    system: Optional[str] = None,
    stream: bool = True,
    **options
):
    return self.response(
        Prompt(prompt, system=system, model=self, options=self.Options(**options)),
        stream=stream,
    )

Technically this would be a breaking change, because system= and stream= are currently available as positional arguments. In the docs I've only described them as keyword arguments, though:

llm/docs/python-api.md, lines 42 to 51 in d654c95:

### System prompts
For models that accept a system prompt, pass it as `system="..."`:
```python
response = model.prompt(
    "Five surprising names for a pet pelican",
    system="Answer like GlaDOS"
)
```
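
One way the prompt() signature might evolve to support that - attachments as extra positional arguments, which forces system= and stream= to become keyword-only (a sketch, not a final design; the attachments= parameter on Prompt is hypothetical):

from typing import Optional


def prompt(
    self,
    prompt: Optional[str],
    *attachments: "Attachment",
    system: Optional[str] = None,
    stream: bool = True,
    **options,
):
    return self.response(
        Prompt(
            prompt,
            attachments=list(attachments),  # hypothetical Prompt parameter
            system=system,
            model=self,
            options=self.Options(**options),
        ),
        stream=stream,
    )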


simonw commented Oct 27, 2024

... well I got this to work (including some hacking around with llm-gemini):

% llm -m gemini-1.5-flash-8b-latest 'transcribe' --at russian-pelican-in-spanish.mp3 audio/mp3
Oye camarada, aquí está tu pelicano Californiano con acento Russo.  ¡Qué tal!  Listo para charlar en español.

How's your day today?

¡Mi día! ¡Estoy volando sobre las olas, buscando peces y disfrutando del sol Californiano!  ¿Y tú, amigo? ¿Cómo ha estado tu día?

% llm -m gemini-1.5-flash-8b-latest 'extract all text' -a llm-pictionary.mp4
prompt extract all text attachments (Attachment(type='video/mp4', path='llm-pictionary.mp4', url=None, content=None),)
LLM Pictionary
START NEW ROUND
Round 5: Claude 3.5 Sonnet (Oct 24) is drawing
Llama 3.2 90B Vision Instruc: Sky;
Claude 3.5 Sonnet (June 24): Ocean;
GPT-40: Ocean;
Gemini Flash 1.5-002: Sky;
Claude 3.5 Sonnet (Oct 24): ocean;
Gemini Pro 1.5-002: Ocean;
GPT-40 Mini: Sky;
Llama 3.2 90B Vision Instruc: Sun;
Gemini Flash 1.5-002: Ocean;
GPT-40 Mini: Beach
Llama 3.2 90B Vision Instruc: Image can't be displayed. An ocean sun beach is the likely depiction.
Claude 3.5 Sonnet (June 24) guessed first!


simonw commented Oct 27, 2024

OK, I have a working prototype of this for both the default OpenAI plugin and the llm-gemini plugin. Still todo: (moved this to issue body ^)


simonw commented Oct 27, 2024

llm -m gpt-4o 'ocr' -a example.jpg 
Example handwriting

Let's try this out



simonw commented Oct 27, 2024

Design question: I keep mistakenly running this:

llm -m gpt-4o 'ocr' example.jpg 

Which currently gives this error:

Error: Got unexpected extra argument (example.jpg)

I could say that all extra arguments are treated as attachments.

But it's been suggested in the past that llm should accept a prompt split across multiple unquoted arguments, so you can do this:

llm -m gpt-4o capital of france

Which I know is a nice pattern because https://github.com/simonw/llm-cmd does it:

llm cmd use ffmpeg to convert blah.mp4 to mp3

So I have three options:

  1. Leave it alone - llm ocr image.jpg will error
  2. Say that optional arguments are treated as attachments - llm ocr image.jpg will work
  3. Say that optional arguments are part of the prompt - llm ocr image.jpg would be treated the same as llm "ocr image.jpg", you would have to do llm ocr -a image.jpg explicitly

I'm torn between all three options at the moment.


simonw commented Oct 27, 2024

Here's a puremagic annoyance:

% python -c 'import puremagic, sys; print(puremagic.from_file(sys.argv[-1], mime=True))' \
  image.jpg
image/jpeg
% python -c 'import puremagic, sys; print(puremagic.from_file(sys.argv[-1], mime=True))' \
  russian-pelican-in-spanish.mp3
audio/mpeg
% python -c 'import puremagic, sys; print(puremagic.from_file(sys.argv[-1], mime=True))' \
  llm-pictionary.mp4 
video/mp4

Note that the mp3 file was identified as audio/mpeg - but that doesn't work for Gemini, which is why earlier I had to do this instead:

% llm -m gemini-1.5-flash-8b-latest 'transcribe' \
  --at russian-pelican-in-spanish.mp3 audio/mp3

Can I get puremagic to treat audio/mp3 as a better match than audio/mpeg?


simonw commented Oct 27, 2024

Tried this:

python -c '
import puremagic, sys, pprint
pprint.pprint(
    puremagic.magic_stream(open(sys.argv[-1], "rb"))
)' russian-pelican-in-spanish.mp3

Got:

[PureMagicWithConfidence(byte_match=b'ID3\x04\x00\x00\x00\x00\x02\x0cTXXX', offset=10, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) ID3v2.4.0 audio file', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'ID3\x04\x00', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 ID3v2.4.0 (MP3) audio file', confidence=0.5),
 PureMagicWithConfidence(byte_match=b'ID3', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) audio file', confidence=0.3)]
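
One workaround (my own sketch - not a puremagic feature) is to special-case the extension after the fact:

import puremagic


def guess_mimetype(path: str) -> str:
    mime = puremagic.from_file(path, mime=True)
    # Gemini wants audio/mp3, but puremagic reports MP3s as audio/mpeg
    if mime == "audio/mpeg" and path.lower().endswith(".mp3"):
        return "audio/mp3"
    return mime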



simonw commented Oct 28, 2024

Alpha is out!

uvx --with 'llm==0.17a0' llm prompt -m gpt-4o 'describe this image' -a https://static.simonwillison.net/static/2024/pelicans.jpg

The image shows a large group of birds gathered on a rocky area near a body of water. There are numerous pelicans along with smaller birds, likely resting or socializing. The scene suggests a busy and crowded wildlife area, possibly a coastal or lakeside habitat.


simonw commented Oct 28, 2024

This works too:

alias llm="uvx --with 'llm==0.17a0' llm"

Then:

llm --version

llm, version 0.17a0

simonw changed the title Oct 28, 2024, removing the trailing period.
simonw added commits to simonw/llm-gemini that referenced this issue Oct 28, 2024:

* Prototype of attachments support
* Support for continued attachment conversations

Refs simonw/llm#587

simonw commented Oct 28, 2024

https://github.com/simonw/llm-gemini/releases/tag/0.3a0

I've released an alpha of llm-gemini too - so you can do this:

alias llm="uvx --with 'llm==0.17a0' --with 'llm-gemini==0.3a0' llm"

And then this (hat tip to Drew on Discord for the idea of hitting this webcam URL):

llm -m gemini-1.5-flash-latest \
  'how foggy is it on a scale of 1-10, also tell me the current time and date and elevation and vibes' \
  -a 'https://cameras.alertcalifornia.org/public-camera-data/Axis-Purisma1/latest-frame.jpg'

It's not foggy at all. The fog level is 0.
The current time and date is October 28, 2024 at 4:18:12 PM.
The elevation is 738.
The vibes are serene and peaceful. The landscape is beautiful with rolling hills and a clear blue sky. The air is probably crisp and fresh.

simonw added commits to simonw/llm-claude-3 that referenced this issue Oct 29, 2024

simonw commented Oct 29, 2024

https://github.com/simonw/llm-claude-3/releases/tag/0.6a0

uvx --with 'llm==0.17a0' --with 'llm-claude-3==0.6a0' \          
  llm -m claude-3.5-sonnet 'describe image' \
  -a https://static.simonwillison.net/static/2024/pelicans.jpg

This image shows a large gathering of brown pelicans crowded together on what appears to be a rocky shoreline next to water. The pelicans are densely packed, with their distinctive long beaks and pouches visible. You can see dozens of them huddled together, creating a mass of brown and grey feathers. Pelicans often gather in large groups like this to rest and socialize, which is known as a "pod" or "squadron" of pelicans. The water in the background appears calm and dark, providing a nice contrast to the lighter colored birds.


simonw commented Oct 29, 2024

OK, this is great! I'm going to ship it.
