models/yolo-world/ #8224

2024-02-15T13:51:40Z

giscus[bot]
bot Feb 15, 2024

models/yolo-world/

Discover YOLO-World, a YOLOv8-based framework for real-time open-vocabulary object detection in images. It enhances user interaction, boosts computational efficiency, and adapts across various vision tasks.

https://docs.ultralytics.com/models/yolo-world/

IamShubhamGupto · 2024-02-15T13:51:40Z

IamShubhamGupto
Feb 15, 2024 — with giscus

Authors of yolo world released instructions to fine tune their models, would that be supported by ultralytics? if so id like to contribute - https://github.com/AILab-CVC/YOLO-World/blob/master/docs/finetuning.md

3 replies

pderrenger Feb 16, 2024
Maintainer

@IamShubhamGupto hey there! 👋

Absolutely, we're all for community contributions and support for fine-tuning models like YOLO-World. If you're interested in contributing, that's fantastic! 🌟

Just a heads up, while we do support various YOLOv8 tasks and modes, the training mode for YOLO-World models (yolov8*-world.pt) isn't currently supported. However, you can still use the pre-trained weights for inference and validation.

If you have a specific fine-tuning process in mind that's compatible with our existing framework, we'd love to see what you have! Feel free to fork the repo, make your changes, and then submit a pull request. If you need any help or have questions along the way, don't hesitate to ask.

Happy coding! 😊👨‍💻

woerstcase Feb 16, 2024 — with giscus

If I get you right you thinking in a use like this?
https://supervision.roboflow.com/develop/notebooks/zero-shot-object-detection-with-yolo-world/#filtering-detectuions-by-area

pderrenger Feb 16, 2024
Maintainer

Hey there! 👋

Yes, you're on the right track! The example you've linked from Roboflow demonstrates filtering detections by area, which is indeed supported by Ultralytics YOLOv8. You can achieve similar functionality by manipulating the results object after prediction to filter out detections based on their bounding box area or any other criteria you prefer.

Here's a quick example of how you might do that:

from ultralytics import YOLO

# Load a pretrained YOLOv8 model
model = YOLO('yolov8n.pt')

# Perform object detection on an image
results = model('path/to/image.jpg')

# Filter detections by bounding box area (e.g., area > 1000 pixels)
filtered_results = [det for det in results[0].boxes.xyxy if (det[2] - det[0]) * (det[3] - det[1]) > 1000]

# Do something with filtered_results

Hope this helps! If you have more questions or need further assistance, feel free to ask. Happy coding! 😊🚀

bdv29 · 2024-02-16T18:23:36Z

bdv29
Feb 16, 2024 — with giscus

im having trouble understanding what this is exactly,
does this work like a VLM you ask it a question about where something is?

2 replies

pderrenger Feb 16, 2024
Maintainer

Hey there! 👋

It seems like you're curious about what YOLO-World is and how it works. In a nutshell, YOLO-World is a framework based on YOLOv8 that's designed for real-time object detection in images. It's not exactly like a virtual machine you ask questions to, but rather a powerful tool that can identify objects in images based on descriptions.

Here's a quick example of how you might use it in Python:

from ultralytics import YOLOWorld

# Initialize the model with pre-trained weights
model = YOLOWorld('yolov8s-world.pt')

# Set what you'd like to find in your image
model.set_classes(["red car", "happy face"])

# Run object detection for your custom classes on your image
results = model.predict('path/to/your/image.jpg')

# Display the results
results[0].show()

This code loads a pre-trained YOLO-World model, runs it on an image, and then shows you the detected objects. It's efficient, fast, and doesn't require a ton of computational power. Plus, you can even customize it to look for specific objects by setting prompts!

Hope that clears things up! If you have more questions or need further assistance, feel free to ask. 😊

aikedaerC Jun 25, 2024 — with giscus

I guess the "VLM" refered to as "Vision Language Model"

sandgrowagro · 2024-02-21T01:46:44Z

sandgrowagro
Feb 21, 2024 — with giscus

I am getting different results for YOLOv8l-world compared to what I am trying on Yolo-World hugging face demo page, Any idea why this could be happening?

2 replies

woerstcase Feb 21, 2024

Maybe you using other parameters in inference result?
There are a lot you can choose:

for example:https: results = model(image, save_conf=True, conf=0.02, iou=0.1) and many more

https://docs.ultralytics.com/modes/predict/#inference-arguments

pderrenger Feb 21, 2024
Maintainer

Hey there! 👋 It's possible that the difference in results between your YOLOv8l-world model and the Hugging Face demo could be due to a variety of factors. Here are a few things to consider:

Parameter Settings: Ensure that the inference parameters like confidence threshold (conf) and Intersection Over Union (iou) are consistent across both platforms. Small changes in these values can lead to different detection results.
Model Version: Double-check that you're using the same model version and weights in both cases. Sometimes, updates or different versions can cause discrepancies.
Input Preprocessing: The way images are preprocessed before being fed into the model might differ. Look into normalization, resizing, or any other preprocessing steps.
Environment: Differences in the software environment, such as the version of PyTorch, CUDA, or other dependencies, might affect the results.

If you're looking to match the settings, here's a quick example of how to set your inference parameters in Python:

from ultralytics import YOLO

# Load your model
model = YOLO('yolov8l-world.pt')

# Set inference parameters
results = model.predict('path/to/image.jpg', conf=0.25, iou=0.45)

For more detailed guidance on inference arguments, check out the docs here. If you continue to see discrepancies, feel free to share more details, and we can dive deeper into this together! 🕵️‍♂️💡

KishoreElvicto · 2024-02-21T15:19:14Z

KishoreElvicto
Feb 21, 2024 — with giscus

i tried this code to detect car and it's number plate. but i am not getting any results for number plate. car is working fine. what to do?
'
from ultralytics import YOLOWorld
model = YOLOWorld('yolov8l-world.pt')
model.set_classes(["car","number plate"])
results = model.predict('smile.jpg')
results[0].show()
'
also tried with happy face. didn't work. it works with only coco classes? i want to detect number plates. help

1 reply

glenn-jocher Feb 21, 2024
Maintainer

@KishoreElvicto hey there! 👋 It looks like you're trying to detect cars and their number plates using YOLO-World, but you're not getting results for the number plates. Since YOLO-World is designed for open-vocabulary detection, it should be able to detect objects beyond the COCO classes.

Here's a couple of things you could try:

Ensure that your image ('smile.jpg') contains clear and visible number plates. The model might struggle if the number plates are too small, blurry, or obstructed.
Try using the set_classes() method with more descriptive prompts, like "vehicle number plate" instead of just "number plate", to see if it helps the model identify the plates better.

If you're still facing issues, it could be helpful to check if the number plates are part of the model's offline vocabulary. If they're not, you might need to use a custom dataset that includes number plates for retraining or find a pre-trained model specifically for number plate detection.

Here's a quick example of how you might adjust your code:

from ultralytics import YOLOWorld

# Initialize a YOLO-World model
model = YOLOWorld('yolov8l-world.pt')

# Define custom classes with more descriptive prompts
model.set_classes(["car", "vehicle number plate"])

# Execute prediction for specified categories on an image
results = model.predict('smile.jpg')

# Show results
results[0].show()

Give these suggestions a try, and if you're still stuck, feel free to share more details or reach out again! 😊

qihuijia · 2024-02-22T15:49:27Z

qihuijia
Feb 22, 2024 — with giscus

Nice job. I am very glad that ultralytics have already support yolo-world. And I found that you are already support the coreml export.

This works.

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8s-world.pt')
model.export(format='coreml')

But when I want to set the classes, it does not work.

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8s-world.pt')
model.set_classes(["colorchecker", "ball", "object", "painting", "flower", "vase", "lavander", "rabbit"])
model.export(format='coreml')

The error is:

Ultralytics YOLOv8.1.17 🚀 Python-3.10.12 torch-2.1.0+cu121 CPU (Intel Xeon 2.00GHz)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-67-a013544c0706> in <cell line: 1>()
----> 1 model.export(format='coreml')

8 frames
/usr/local/lib/python3.10/dist-packages/torch/_tensor.py in __deepcopy__(self, memo)
     84             return handle_torch_function(Tensor.__deepcopy__, (self,), self, memo)
     85         if not self.is_leaf:
---> 86             raise RuntimeError(
     87                 "Only Tensors created explicitly by the user "
     88                 "(graph leaves) support the deepcopy protocol at the moment.  "

RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment.  If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001

7 replies

pderrenger Feb 23, 2024
Maintainer

Yes, you're absolutely right! The screenshot you've shared is from the CoreML model's class labels. You can customize your own class settings there to match the specific classes you're interested in detecting. Just replace the existing labels with your desired ones like "colorchecker", "ball", etc., and you should be good to go! If you need any further assistance, feel free to reach out. Happy coding! 😊👍

pierre1618 Feb 23, 2024 — with giscus

Hi
I encountered a similar issue when attempting to export a YOLO model with custom classes. I managed to resolve it by saving the modified model and then reloading it. However, I faced another error related to gradient tracking, which I resolved by deactivating it using torch.no_grad().

import torch
from ultralytics import YOLO  # Ensure Ultralytics is installed

# To handle potential CUDA out of memory errors, disable gradient tracking with torch.no_grad()
with torch.no_grad():
    # Initialize a YOLO model with pretrained weights
    model = YOLO('yolov8s-world.pt')  # You can also choose yolov8m/l-world.pt based on your needs

    # Define custom classes specific to your application
    custom_classes = ["car", "vehicle number plate"]
    model.set_classes(custom_classes)

    # Save the model with the custom classes defined (modified code)
    model.save("custom_yolov8s.pt")  # This saves extra metadata required for CoreML conversion

    # Load the saved model with custom classes
    model = YOLO("custom_yolov8s.pt")

    # Export the model to CoreML format with non-maximum suppression enabled
    model.export(format="coreml", nms=True)

This approach allowed me to successfully export the model with custom classes to CoreML format. Hope this could help!

pderrenger Feb 23, 2024
Maintainer

@pierre1618 hey there! 👋

Great to hear you've found a workaround for exporting a YOLO model with custom classes and resolved the gradient tracking issue with torch.no_grad(). Your approach to saving and reloading the model is a smart move! 🧠

For anyone else encountering similar issues, this user-shared snippet could be a helpful reference. Just remember to adjust the model and custom_classes to fit your specific use case.

Keep up the good work, and if you have any more insights or questions, feel free to drop them here. Happy coding! 🚀

qihuijia Feb 28, 2024 — with giscus

Thanks everyone, now since lastest version(8.1.19) 2024.2.27 launched, I can convert the yolo world model in M1/M2 chip macOS, thanks very much.

pderrenger Feb 28, 2024
Maintainer

@qihuijia hey there! 🎉 It's fantastic to hear that the latest version (8.1.19) as of 2024.2.27 has enabled you to successfully convert the YOLO-World model for use on M1/M2 chip macOS. We're thrilled to see our updates making a positive impact on your projects.

If you're diving into customizing classes or further exploring YOLO-World's capabilities, remember that the flexibility and efficiency of YOLO-World make it a powerful tool for a wide range of vision tasks. Whether you're fine-tuning for specific objects with custom prompts or leveraging the model's real-time detection prowess, YOLO-World is designed to adapt to your needs.

Should you have any more questions or if there's anything else we can help you with, don't hesitate to reach out. Happy detecting! 🚀

bdv29 · 2024-02-28T14:16:54Z

bdv29
Feb 28, 2024

is there a away to fine tune the model if you manage to get fairly close to accurate to what you want to box bound but just need to focused on your particular application once you've set the model?

3 replies

pderrenger Feb 28, 2024
Maintainer

Absolutely! Fine-tuning a YOLOv8 model for your specific application is a great way to enhance its performance on the tasks most relevant to you. Here's a quick guide on how to do it:

Prepare Your Dataset: Make sure your dataset is annotated with the classes of interest. For fine-tuning, it's crucial to have a dataset that closely represents the scenarios where you'll deploy the model.

Set Custom Classes: If you're focusing on specific classes, you can use the set_classes method to tailor the model. This is especially useful with YOLO-World models, as they allow for dynamic class specification. Here's how you might do it:

from ultralytics import YOLO

# Load your YOLO-World model
model = YOLO('yolov8s-world.pt')  # Choose the appropriate model size

# Define your custom classes
model.set_classes(["your_class_1", "your_class_2"])

# Now your model is focused on these classes

Train (Fine-Tune) Your Model: Use the train mode to fine-tune your model on your dataset. Remember to adjust the training parameters according to your dataset size and the computational resources available.
```
yolo detect train data=your_dataset.yaml model=custom_yolov8s.pt epochs=50
```
Evaluate and Iterate: After training, evaluate your model's performance using the val mode. It might take a few iterations of training with adjusted parameters to achieve the desired accuracy.

Fine-tuning allows you to leverage the powerful YOLOv8 architecture while focusing the model's learning on your specific use case. This process can significantly improve detection performance for your classes of interest. Happy fine-tuning! 😊

bdv29 Mar 2, 2024 — with giscus

i get the error NotImplementedError: WARNING 'YOLOWorld' model does not support 'train' mode for 'detect' task yet.
are there plans for adding it anytime soon?

pderrenger Mar 2, 2024
Maintainer

Hey there! 😊 Thanks for reaching out. Currently, the YOLO-World models, including yolov8s-world.pt, are optimized for inference and validation tasks, and they don't support the 'train' mode for the 'detect' task just yet.

We're always working on enhancing the capabilities of our models, and training support for YOLO-World is definitely on our radar. While I can't provide a specific timeline right now, keep an eye on our GitHub repo and docs for updates!

In the meantime, if you're looking to fine-tune a model for detection, you might consider using the standard YOLOv8 models which fully support training. Here's a quick example on how to start training:

yolo detect train data=your_custom_data.yaml model=yolov8n.pt epochs=100

Make sure your your_custom_data.yaml is set up with your dataset. We appreciate your patience and interest in pushing the boundaries of what's possible with YOLO! 🚀

AdadAlShabab · 2024-03-04T17:34:54Z

AdadAlShabab
Mar 4, 2024 — with giscus

How can we run 'videos' with custom classes defined?

5 replies

glenn-jocher Mar 4, 2024
Maintainer

Hey there! Running videos with custom classes using YOLO-World is straightforward. You can leverage the set_classes() method to specify your custom classes before running predictions. Here's a quick example:

from ultralytics import YOLO

# Load your YOLO-World model
model = YOLO('yolov8s-world.pt')  # Adjust the model as needed

# Define your custom classes (e.g., 'cat', 'dog')
custom_classes = ['cat', 'dog']
model.set_classes(custom_classes)

# Now you can run predictions on your video
results = model.predict('path/to/your/video.mp4')

# Display the results
results[0].show()

This way, the model will focus on detecting only the classes you've defined. Hope this helps! Let me know if you have any more questions. 😊

AdadAlShabab Mar 5, 2024

Thanks a lot!! @glenn-jocher

glenn-jocher Mar 5, 2024
Maintainer

@AdadAlShabab hey there! 🌟 To run videos with custom classes using YOLO-World, you can leverage the set_classes() method to specify your desired classes. Here's a quick example to get you started:

from ultralytics import YOLO

# Load your YOLO-World model
model = YOLO('yolov8s-world.pt')  # Adjust based on your model choice

# Define your custom classes (e.g., 'cat', 'dog')
custom_classes = ['cat', 'dog']
model.set_classes(custom_classes)

# Run predictions on your video
results = model.predict('path/to/your/video.mp4')

# Display the results
results[0].show()

This way, the model will focus on detecting only the classes you've specified. Hope this helps! Let me know if you have any more questions. 😊

qihuijia Mar 5, 2024

I think you can try if you set the classes more than 8. It will not work.

glenn-jocher Mar 5, 2024
Maintainer

Hey there! Thanks for reaching out. 😊

If you're encountering issues with setting more than 8 classes in YOLO-World, it might be due to how the classes are being specified. Here's a quick example on how to correctly set custom classes for your model:

from ultralytics import YOLO

# Load your YOLO-World model
model = YOLO('yolov8s-world.pt')  # Adjust based on your model choice

# Define your custom classes
custom_classes = ['class1', 'class2', 'class3', 'class4', 'class5', 'class6', 'class7', 'class8', 'class9']
model.set_classes(custom_classes)

# Now you can run predictions on your video or images
results = model.predict('path/to/your/video.mp4')

Make sure your class names match exactly with those in the model's training data. If you're still facing issues, could you provide more details or error messages you're getting? That way, I can help you more effectively! 🚀

qihuijia · 2024-03-14T15:07:02Z

qihuijia
Mar 14, 2024 — with giscus

when the yolo world support "Training"?

1 reply

glenn-jocher Mar 14, 2024
Maintainer

Hey there! 🌟 Currently, the YOLO-World models are optimized for inference and validation tasks, with training and export modes not supported at the moment (as indicated by the ❌ in the documentation). However, we're always working on enhancing the capabilities of our models. Stay tuned for future updates where training support might be introduced! If you're looking to train a model for open-vocabulary detection, I'd recommend exploring the existing YOLOv8 models and their training functionalities. Feel free to share any specific requirements or use cases you have in mind, and we might be able to suggest an alternative approach. Happy coding! 🚀

bdv29 · 2024-03-14T18:35:48Z

bdv29
Mar 14, 2024

are there plans to implement it to the export method?
tried exporting to tensorRT but it doesn't seem it is implmented

7 replies

glenn-jocher Oct 16, 2024
Maintainer

To run inference on your TensorRT file, you can use the TensorRT Python API or a tool like trtexec. Unfortunately, direct support for TensorRT inference isn't available in the Ultralytics repo. You might find community resources or forums helpful for specific TensorRT usage examples.

LLH-Harward Oct 17, 2024

hello, I am currently using yolov8s-worldv2.pt for inference as follows:
from ultralytics import YOLO

Initialize a YOLO-World model

model = YOLO("yolov8s-world.pt") # or choose yolov8m/l-world.pt

Define custom classes

model.set_classes(["person", "bus"])

Execute prediction for specified categories on an image

results = model.predict("path/to/image.jpg")

Show results

results[0].show()

After exporting to ONNX, how can I perform inference using the exported ONNX model? How do I set the classes? Thanks you.

glenn-jocher Oct 17, 2024
Maintainer

To perform inference with your exported ONNX model, you'll need to use an ONNX runtime like onnxruntime. Unfortunately, setting custom classes directly in ONNX isn't supported. You'll need to handle class filtering in your post-processing code. For more details, you can refer to the ONNX runtime documentation.

LLH-Harward Oct 18, 2024

Thank you. I found a Python package named 'yolo-world-onnx' which is helpful for me.

https://github.com/ibaiGorordo/ONNX-YOLO-World-Open-Vocabulary-Object-Detection
https://pypi.org/project/yolo-world-onnx/

glenn-jocher Oct 18, 2024
Maintainer

Thank you for sharing the resource. For running inference with the ONNX model, you can use the onnxruntime package. Unfortunately, setting classes directly in ONNX isn't supported, so you'll need to filter classes in your post-processing code. For more details, refer to the ONNX runtime documentation.

AnandSingh-0619 · 2024-03-22T22:18:51Z

AnandSingh-0619
Mar 22, 2024 — with giscus

Can we run inference with multiple images or batch of images for reducing overall computation time?

1 reply

glenn-jocher Mar 23, 2024
Maintainer

Absolutely, running inference with a batch of images can be an effective way to reduce overall computation time. You can achieve this by using the predict function and passing your batch of images as a list. For example:

from ultralytics import YOLO

# Initialize model
model = YOLO('yolov8n.pt')  # Example for detection

# List of image paths
images = ['path/to/image1.jpg', 'path/to/image2.jpg']

# Run inference on batch of images
results = model(images)

# Process the results as needed
for result in results:
    result.show()  # Display

This way, you leverage batch processing, which could significantly speed up the inference time, especially on GPUs. 🚀

Dragosjosan · 2024-04-04T10:03:31Z

Dragosjosan
Apr 4, 2024

The model performance of the Yolo World Hugging Face model (https://huggingface.co/spaces/stevengrove/YOLO-World) is better for my purpose than the standard inference one "YOLOv8x-worldv2" for example.

Because my lack of domain knowledge regarding this topics I can't make the HF model work on my Mac (not an M processor) so I would like to be able to use the model weights with the abstraction from the ultralytics library.

Is there any way to use the author's provided weights together with the inference library?
For example:
model = YOLOWorld('yolow-v8_l_clipv2_frozen_t2iv2_bn_o365_goldg_pretrain.pth')

From the author's instruction on how to run the model on an image one needs the configuration file and the weights: https://github.com/AILab-CVC/YOLO-World

The conf file I want to use and the weights are available here:
https://huggingface.co/spaces/stevengrove/YOLO-World/blob/main/configs/pretrain/yolo_world_l_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py
https://huggingface.co/spaces/stevengrove/YOLO-World/blob/main/yolow-v8_l_clipv2_frozen_t2iv2_bn_o365_goldg_pretrain.pth

Thank you in advance :)

18 replies

Dragosjosan Apr 26, 2024

@junnnnchi for my application the model from Hugging Face works best...
But good that you found something that works.
May I ask what your App do? :)

mustfkeskin Oct 4, 2024

Has anyone solved this problem? 🤗

glenn-jocher Oct 4, 2024
Maintainer

If you're facing issues with the YOLO-World model, please ensure you're using the correct configuration and weights as specified in the documentation. If you need further assistance, feel free to ask here.

mustfkeskin Oct 4, 2024

How can I integrate the above mentioned files into the ultralytics library. I liked the model results better when working on Huggingface. I noticed that the model size used on Huggingface is a bit bigger size

The conf file I want to use and the weights are available here:
https://huggingface.co/spaces/stevengrove/YOLO-World/blob/main/configs/pretrain/yolo_world_l_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py
https://huggingface.co/spaces/stevengrove/YOLO-World/blob/main/yolow-v8_l_clipv2_frozen_t2iv2_bn_o365_goldg_pretrain.pth

glenn-jocher Oct 4, 2024
Maintainer

@mustfkeskin unfortunately, integrating the specific .pth weights with the Ultralytics library isn't directly supported. You might need to convert the weights to a compatible format like .pt and adjust configurations manually. For further guidance, consider exploring PyTorch documentation or community forums.

Nishantg7 · 2024-04-12T23:25:14Z

Nishantg7
Apr 12, 2024 — with giscus

Hello There,
I am having trouble using YOLO-WORLD when I convert the results into voice output. The yolov8 model works fine with the same voice input method please help me out with this problem.

CODE:
import cv2
import tkinter as tk
from PIL import Image, ImageTk
from ultralytics import YOLOWorld
import pyttsx3

class WebcamApp:
def init(self, window, window_title):
self.window = window
self.window.title(window_title)

    # Open the webcam
    self.cap = cv2.VideoCapture(0)

    # Create a label for displaying the video feed
    self.label = tk.Label(window)
    self.label.pack()

    # Button to capture image
    self.btn_capture = tk.Button(window, text="Capture Image", width=50, command=self.capture_image)
    self.btn_capture.pack()

    # Initialize YOLO-World model
    self.model = YOLOWorld("yolov8s-world.pt")

    # Initialize text-to-speech engine
    self.engine = pyttsx3.init('sapi5')
    voices = self.engine.getProperty('voices')
    self.engine.setProperty('voice', voices[0].id)

    self.window.protocol("WM_DELETE_WINDOW", self.on_closing)
    self.window.mainloop()

def capture_image(self):
    # Read frame from webcam
    ret, frame = self.cap.read()
    if ret:
        # Convert the frame to RGB color space
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        # Convert the frame to PIL format
        img = Image.fromarray(frame_rgb)
        # Save the captured image
        img.save("captured_image.jpg")
        # Process the captured image with YOLO-World model
        results = self.model.predict("captured_image.jpg")
        # Extract class names from results
        objects = [res['name'] for res in results]
        # Convert detected objects list to speech
        speech_text = f"I see {', '.join(objects)}."
        # Speak the detected objects
        self.speak(speech_text)
    else:
        print("Error capturing image.")


def speak(self, audio):
    self.engine.say(audio)
    self.engine.runAndWait()

def on_closing(self):
    # Release the webcam
    self.cap.release()
    # Close the application window
    self.window.destroy()

Create a Tkinter window

root = tk.Tk()

Create the WebcamApp object

app = WebcamApp(root, "Webcam Application")

ERRORS:
image 1/1 d:\CODING\YOLO WORLD\captured_image.jpg: 480x640 1 bottle, 1 clock, 427.0ms
Speed: 3.0ms preprocess, 427.0ms inference, 1703.1ms postprocess per image at shape (1, 3, 480, 640)
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\gawde\AppData\Local\Programs\Python\Python310\lib\tkinter_init_.py", line 1921, in call
return self.func(*args)
File "d:\CODING\YOLO WORLD\test3.py", line 47, in capture_image
objects = [res['name'] for res in results]
File "d:\CODING\YOLO WORLD\test3.py", line 47, in
objects = [res['name'] for res in results]
File "C:\Users\gawde\AppData\Local\Programs\Python\Python310\lib\site-packages\ultralytics\engine\results.py", line 126, in getitem
return self._apply("getitem", idx)
File "C:\Users\gawde\AppData\Local\Programs\Python\Python310\lib\site-packages\ultralytics\engine\results.py", line 163, in _apply
setattr(r, k, getattr(v, fn)(*args, **kwargs))
File "C:\Users\gawde\AppData\Local\Programs\Python\Python310\lib\site-packages\ultralytics\engine\results.py", line 63, in getitem
return self.class(self.data[idx], self.orig_shape)
IndexError: too many indices for tensor of dimension 2

3 replies

pderrenger Apr 13, 2024
Maintainer

Hello there! 👋 It looks like you're encountering issues with using YOLO-World and converting the results to voice output. From the logs, the error seems to stem from how the results are being parsed and accessed. Specifically, the error indicates a possible mismatch in the expected structure or dimensions of the results object.

To address the issue, we need to ensure we're accessing the results properly. With YOLO-World's output format, you might need to adjust how you're iterating through the results. Here’s a small tweak to how you process the results:

# Process the captured image with YOLO-World model
results = self.model.predict("captured_image.jpg")
# Assuming results contain detections
if results.detections:
    objects = [detection.name for detection in results.detections[0]]
    ...

This ensures you're directly accessing the detections attribute, which holds the detected object info. Each detection provides the .name property among others like .confidence, etc.

Please give this a try and check if it resolves the error. If the issue persists or if there’s anything more specific you’re facing, feel free to share more details! 🤞

Nishantg7 Apr 13, 2024 — with giscus

Thank you so much worked for me.

pderrenger Apr 14, 2024
Maintainer

Hey there! 🎉 I'm thrilled to hear that the solution worked for you!

For anyone venturing into integrating YOLO-World's detection capabilities with voice output, it's key to correctly parse the results object. For YOLO-World specifically, the setup might differ slightly from YOLOv8 due to its unique output format tailored for open-vocabulary tasks. 🤖✨

If you're tweaking the code for YOLO-World and bump into any snags, make sure you're accessing the detection results as intended by the model's output structure. The Ultralytics docs provide great insights on handling outputs effectively.

Happy coding, and if further questions surface, feel free to drop them here!

illink7 · 2024-04-23T09:38:38Z

illink7
Apr 23, 2024 — with giscus

How can I combine YOLO-World with SAHI to process satellite images?
I suppose it has to be interesting

1 reply

pderrenger Apr 23, 2024
Maintainer

Combining YOLO-World with SAHI for processing satellite images sounds like a fascinating prospect!🌍✨ You can leverage the strong open-vocabulary detection capabilities of YOLO-World to identify a wide range of objects in satellite imagery. Here’s how you can get started:

Install YOLO-World and SAHI: Make sure both libraries are installed in your environment.
Load YOLO-World Model: Choose a suitable YOLO-World model based on your need for accuracy or speed (e.g., yolov8s-worldv2.pt for a balance between the two).
Use SAHI for Inference: Utilize SAHI's slicing and prediction merging capabilities to handle large satellite images.

from ultralytics import YOLO
from sahi.predict import predict

# Load the YOLO-World model
model = YOLO('yolov8s-worldv2.pt')  # Adjust the model file as needed

# Predict using SAHI with a large satellite image
result = predict(
    model_type='yolov5',  # Use 'yolov5' type as a placeholder
    model=model,
    source='path/to/satellite/image.jpg'
)

This example provides a starting point. You might need to customize it further based on SAHI's current API and how it interfaces with models. Remember to keep an eye on both YOLO-World and SAHI documentation for updates or changes in the integration process.

Happy detecting! 🛰️💡 Feel free to share your findings or ask more questions as you dive into this interesting integration.

Kunal8020 · 2024-05-24T12:32:33Z

Kunal8020
May 24, 2024 — with giscus

I finetuned a yolo world model for a different classes which was not detectable using the original weights Now I have a weights for for detection of that classes. I also get another weight for class which are present in the original weight

from ultralytics import YOLOWorld
model = YOLOWorld("yolov8x-worldv2.pt")
model.set_classes(["baseball bat","ball"])
model.save("custom_yolo.pt")

How can I combine both weights so that i can detect all required classes??

4 replies

Kunal8020 May 24, 2024 — with giscus

To be simple, How can I combine two custom weights??

glenn-jocher May 24, 2024
Maintainer

Currently, combining two different sets of weights directly isn't supported in the YOLO architecture. Each set of weights is trained under specific conditions and merging them could lead to conflicts or degraded performance.

A practical approach would be to combine the datasets for both classes and retrain the model from scratch or fine-tune from one of the pre-trained weights. This way, the model learns to detect all the required classes simultaneously, ensuring optimal performance and compatibility.

If you need further guidance on how to proceed with training on a combined dataset, feel free to ask! 🚀

Kunal8020 May 28, 2024 — with giscus

But i do not have dataset for the classes present in the YOLO world. That is why i am using set classes of required objects. I finetuned a model with my dataset.
Also, when i finetume the model using below code

from ultralytics import YOLOWorld
model=YOLOWorld('weights.pt')
results = model.train(data='config.yaml', epochs=100, imgsz=640)

all previous classes are not detectable with new weights. How can I train a model such that it will detect the all previous classes from the previous weights and also finetuned classes ??

glenn-jocher May 28, 2024
Maintainer

Thank you for your detailed explanation! Combining weights from two different models directly isn't supported, as it can lead to conflicts and degraded performance. However, you can achieve your goal by merging the datasets and retraining the model. Here’s a suggested approach:

Combine Datasets: Merge your dataset with the dataset containing the classes from the original YOLO-World weights.
Retrain the Model: Use the combined dataset to retrain the model. This way, the model learns to detect all the required classes simultaneously.

Here’s an example of how you can proceed with training:

from ultralytics import YOLOWorld

# Load a pretrained YOLO-World model
model = YOLOWorld("yolov8x-worldv2.pt")

# Train the model on the combined dataset
results = model.train(data="combined_dataset.yaml", epochs=100, imgsz=640)

If you need further assistance with dataset preparation or training, feel free to ask! 😊

Kunal8020 · 2024-06-13T11:45:56Z

Kunal8020
Jun 13, 2024 — with giscus

I am trying to get the tracking id from image using below code
from ultralytics import YOLO
model=YOLO('path/yolov8x-worldv2.pt')

img=cv2.imread('WA0004_444.jpg')
results=model.track(img)

when i check the results of tracking it is showing false
results[0].boxes

id: None
is_track: False

5 replies

Kunal8020 Jun 13, 2024 — with giscus

when i am using the finetune weight on custom datset then it is working
model_yoloworld = YOLOWorld('best.pt')
results=model_yoloworld.track(img)

results[0].boxes

id: tensor([1., 2.])
is_track: True

pderrenger Jun 14, 2024
Maintainer

@Kunal8020 hello!

Thank you for reaching out and providing detailed information about your issue. It seems like you're encountering a problem with tracking IDs when using the yolov8x-worldv2.pt model. Let's troubleshoot this together.

Firstly, ensure that you are using the latest versions of torch and ultralytics packages. You can update them using the following commands:

pip install --upgrade torch ultralytics

If the issue persists, please provide a minimum reproducible example so we can investigate further. This helps us understand the context and reproduce the bug on our end. You can refer to our guide on creating a minimum reproducible example here: Minimum Reproducible Example.

In the meantime, here's a refined version of your code to ensure everything is set up correctly:

import cv2
from ultralytics import YOLO

# Load the YOLO-World model
model = YOLO('path/yolov8x-worldv2.pt')

# Read the image
img = cv2.imread('WA0004_444.jpg')

# Perform tracking
results = model.track(img)

# Check the results
if results[0].boxes.id is not None:
    print(f"Tracking IDs: {results[0].boxes.id}")
else:
    print("No tracking IDs found.")

If you still encounter issues, it might be helpful to verify if the model is correctly loaded and if the image is being processed as expected. Additionally, ensure that the image path and model path are correct.

Feel free to share any additional details or errors you encounter, and we'll be happy to assist further. 😊

pderrenger Jun 14, 2024
Maintainer

Hello!

Thank you for reaching out and providing detailed information about your issue. It seems like you're encountering a problem with tracking IDs when using the yolov8x-worldv2.pt model. Let's troubleshoot this together.

Firstly, ensure that you are using the latest versions of torch and ultralytics packages. You can update them using the following commands:

pip install --upgrade torch ultralytics

Next, let's verify that the model and the image are being processed correctly. Here is a minimum reproducible example to help you debug:

from ultralytics import YOLO
import cv2

# Load the YOLO-World model
model = YOLO('path/yolov8x-worldv2.pt')

# Read the image
img = cv2.imread('WA0004_444.jpg')

# Perform tracking
results = model.track(img)

# Check the tracking results
print(results[0].boxes)

If the id is still None and is_track is False, it might be due to the confidence threshold or other tracking parameters. You can try adjusting the confidence threshold and other parameters to see if it resolves the issue:

results = model.track(img, conf=0.25, iou=0.45)

Additionally, if you have a custom-trained model (best.pt) that works correctly, it might be beneficial to compare the configurations and training parameters used for both models. Sometimes, fine-tuning on a specific dataset can significantly impact the tracking performance.

If the issue persists, please provide more details or a minimum reproducible example as outlined in our Minimum Reproducible Example Guide. This will help us investigate the problem more effectively.

Feel free to reach out if you have any further questions or need additional assistance. We're here to help! 😊

Kunal8020 Jun 14, 2024 — with giscus

thank you for the response. It worked when i use the below code
results = model.track(img, conf=0.25,tracker="bytetrack.yaml")

where i reduced the track_thresh parametrers

pderrenger Jun 14, 2024
Maintainer

Hi @Kunal8020,

Thank you for the update! I'm glad to hear that adjusting the track_thresh parameter and using the bytetrack.yaml tracker resolved the issue. 🎉

For anyone else encountering similar issues, adjusting the conf and track_thresh parameters can significantly impact the tracking performance. Here's a quick example:

results = model.track(img, conf=0.25, tracker="bytetrack.yaml")

This approach ensures that the model is more sensitive to detections, which can help in scenarios where tracking IDs are not being generated as expected.

If you have any further questions or need additional assistance, feel free to ask. We're here to help!

GaviraghiElia · 2024-09-10T08:52:55Z

GaviraghiElia
Sep 10, 2024 — with giscus

I want predict three class: litter, car license plate and face.
YOLO World often fails to predict face and license plate, but if I follow a two stage pipeline like this:

Predict litter, car, person.
Take the crops of car and person
Predict car license plate on car crops and predict face on person crops
it's successful most of the time and it's really impressive.

My problem is to bring the predicted bounding boxes on the crops back to the scale of the original image by saving them in the .txt file in the usual YOLO format: how can I do it?

The code where I do the first prediction is this

model = YOLO('yolov8x-worldv2.pt')
model.set_classes(["litter", "car", "person", ""])

image_path = "/content/20240520_195133.jpg"
original_image = cv2.imread(image_path)

results = model.predict(image_path, imgsz = 1280, save=True, save_txt = True, save_crop = True)

# path of predicted labels
labels_dir = "/content/runs/detect/predict/labels/"

The second part, where I do the prediction on the crops is there

# directories of crops and label
car_crop_dir = "/content/runs/detect/predict/crops/car/"
person_crop_dir = "/content/runs/detect/predict/crops/person/"
labels_dir_crop = "/content/runs/detect/predict2/labels/"

# models for face and license plate
license_plate_model = YOLO('yolov8x-worldv2.pt')
face_model = YOLO('yolov8x-worldv2.pt')
# set classses
license_plate_model.set_classes(["car license plate", ""])
face_model.set_classes(["face", ""])

# example - prediction on car crops
for crop_file in os.listdir(car_crop_dir):
    crop_path = os.path.join(car_crop_dir, crop_file)
    crop_image = cv2.imread(crop_path)

    crop_results = license_plate_model.predict(crop_image, max_det=1, conf=0.001, save_txt=True, save_crop=True)
    
    # read the bounding box just predicted
    label_file = os.path.join(labels_dir_crop, "image0.txt")
    with open(label_file, 'r') as f:
        bbox = list(map(float, f.readline().strip().split()[1:]))

Now I should bring the bounding box back to the original image (i.e. save it in the .txt file of the first prediction), but I don't know what conversion to do.

1 reply

glenn-jocher Sep 10, 2024
Maintainer

@GaviraghiElia to map the bounding boxes from the crops back to the original image, you'll need to adjust the coordinates based on the crop's position and scale relative to the original image. When saving the bounding boxes, multiply the crop coordinates by the scale factor and add the crop's top-left corner coordinates. This will align the boxes with the original image dimensions.

jinyoonok2 · 2024-09-28T13:12:31Z

jinyoonok2
Sep 28, 2024

Hello, I recently started exploring the YOLO-World project and greatly appreciate your work.

However, I noticed some differences between the original YOLO-World model and YOLO-World v2. In the original YOLO-World paper, the vision features influence the text embeddings through the Image-Pooling Attention (I-Pooling Attention) module within the Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). This module enhances the text embeddings by integrating image-aware information from multi-scale image features.

It appears that the YOLO-World v2 of Ultralytics has removed the I-Pooling Attention module. My questions are: Would this change decrease the model's performance while improving inference speed? Does this mean that in v2, no module at all impacts the text embeddings using the vision features from the YOLOv8 backbone?

If there is anything wrong in my understanding, please correct me.

Again, Thank you for your work

1 reply

glenn-jocher Sep 28, 2024
Maintainer

@jinyoonok2 thank you for your insights on YOLO-World. The removal of the I-Pooling Attention module in YOLO-World v2 aims to enhance inference speed while maintaining performance. The model now focuses on efficient detection without directly impacting text embeddings using vision features. For further details, please refer to the YOLO-World documentation.

import-jalashwa · 2024-10-08T07:24:02Z

import-jalashwa
Oct 8, 2024 — with giscus

Could you help me understand how to perform inference using a ONNX or TFLite YOLO-World model on images / videos

1 reply

glenn-jocher Oct 8, 2024
Maintainer

@import-jalashwa to perform inference using an ONNX or TFLite YOLO-World model, you can use the Ultralytics Python API or CLI. For ONNX, load the model with onnxruntime and run predictions. For TFLite, use the TensorFlow Lite Interpreter. For detailed steps, please refer to the YOLO-World documentation.

goodbbboy · 2024-10-10T16:22:13Z

goodbbboy
Oct 10, 2024 — with giscus

hello! Can ultralytics‘s yolo-world convert text features（text_feats） from CLIP into nn.parameter, thereby allowing text embedding to be updated with the optimizer? As far as I know, the generation of text features (text_feats)（world/train.py） does not explicitly appear in the model block (such as Conv or C2f). Is there a way to convert text_feats into nn.parameter?Thank you for your reply. Wishing you a happy life！

1 reply

glenn-jocher Oct 10, 2024
Maintainer

@goodbbboy hello! Currently, YOLO-World doesn't support converting text_feats into nn.Parameter for optimization. The text features are used as embeddings and not directly integrated into the model's trainable parameters. For more details, you might want to explore the model's architecture in the source code.

YUANMU227 · 2024-10-16T03:38:35Z

YUANMU227
Oct 16, 2024 — with giscus

Hi, great work! I have a problem when training yolov8x-worldv2.pt. It automatically downloads yolo11n.pt and seems to load it? This seems unreasonable because training yolov8x-worldv2.pt does not depend on yolo11n.pt.

Execute command: yolo detect train data='custom_datasets.yaml' model=yolov8x-worldv2.pt epochs=100 imgsz=640 device=0,1,2,3 save_period=1 name=train_yolov8x-worldv2 patience=10

When training, it will output:

YOLOv8x-worldv2 summary: 396 layers, 72,886,377 parameters, 72,886,361 gradients, 283.6 GFLOPs

AMP: running Automatic Mixed Precision (AMP) checks with YOLO11n...
AMP: checks passed ✅

8 replies

glenn-jocher Oct 16, 2024
Maintainer

@GaviraghiElia training YOLO-World models can enhance their capabilities for specific tasks or datasets. While zero-shot abilities are a key feature, fine-tuning can improve performance for targeted applications. For more details, you can explore the YOLO-World documentation.

GaviraghiElia Oct 17, 2024

@glenn-jocher I know, but at this point it might be better to use the other YOLOs (v5, v8, v10, etc.) instead of YOLO World. Unless I've missed something, it offers no real advantage for targeted application: once the weights are updated and the initial zero-shot mode is lost, unless training on another huge dataset (e.g. ImageNet) to create a new zero-shot ability, it is nothing more than a YOLO v8.
Or am I wrong in my reasoning?

glenn-jocher Oct 17, 2024
Maintainer

YOLO-World is designed for open-vocabulary detection, offering flexibility with custom prompts and efficient real-time performance. While fine-tuning can enhance specific tasks, its strength lies in zero-shot capabilities and adaptability. For more details, visit the YOLO-World documentation.

maryaliza Oct 28, 2024 — with giscus

Thanks for the great explanation @glenn-jocher!
I'm wondering if using later version of Yolo in the backbone instead of Yolo-v8 would improve the performance? considering that Yolov11 has higher mAP and lower latency. or would the performance mostly is affected by the open-vocabulary encoder?

Also, is there any future plan on updating the current Yolo-world to the latest version of Yolo.v11 instead of Yolo.v8?

glenn-jocher Oct 28, 2024
Maintainer

@maryaliza using YOLOv11 as a backbone could potentially improve performance due to its higher mAP and lower latency. However, the open-vocabulary encoder plays a significant role in YOLO-World's capabilities. As for updates, there are no current plans to integrate YOLOv11 into YOLO-World, but advancements are always considered. For more details, please refer to the YOLO-World documentation.

YUANMU227 · 2024-10-17T09:58:20Z

YUANMU227
Oct 17, 2024 — with giscus

When training yolov8x-worldv2.pt, an error message of insufficient memory appears. Does the training part seem to have a memory leak? Please check.

Error:

OSError: [Errno 28] No space left on device

2 replies

YUANMU227 Oct 17, 2024 — with giscus

I solved this problem.

glenn-jocher Oct 17, 2024
Maintainer

@YUANMU227 it seems like the issue is related to insufficient disk space rather than a memory leak. Please check your device's storage and free up some space to resolve the error. If you need further assistance, feel free to ask!

alganzory · 2024-10-23T09:42:09Z

alganzory
Oct 23, 2024 — with giscus

How would the exported version of this model work? for example tensorflow.js? how do I "set_classes", etc?

1 reply

glenn-jocher Oct 23, 2024
Maintainer

@alganzory to export a YOLO-World model to TensorFlow.js and set custom classes, first export the model using the export function with format='tfjs'. Then, use the set_classes method to specify your desired classes. For more details, please refer to the YOLO-World documentation.

zhuitaiyang · 2024-10-29T02:09:39Z

zhuitaiyang
Oct 29, 2024 — with giscus

Can you share the training logs for the reproduce of YOLO-World?

10 replies

zhuitaiyang Nov 14, 2024

Thanks for your answer, I looked at the code again carefully. In the RandomLoadTxt class, cls is mapped to new_cls, while 80 prompt samples containing positive and negative samples are selected from class_texts, but class_texts are equivalent to the class name corresponding to val in data.yaml. Does this selection allow the model to learn the correct image and text representation?

glenn-jocher Nov 14, 2024
Maintainer

The model learns the relationship between new_cls and true class names through the training process, where it associates these mapped classes with the corresponding image features and text representations. Ensure your dataset and mappings are correctly aligned for accurate training results.

zhuitaiyang Nov 15, 2024

Thank you for your answer. I re-checked the label of the data set and conducted training. During the training process, the map value on the verification set is as follows, which seems to be a little low. Is this normal?

zhuitaiyang Nov 15, 2024

Hi, there's one more thing that's really confusing to me. In the class bulid_datasets , when run to the get_dataset,the line 96 and 97 of https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/yolo/world/train_world.py#L67-L98:

final_data["names"] = data["val"][0]["names"]
self.data = final_data

Only the class names in the validation set are saved in the self.data. When run to the class YOLOMultiModalDataset , the line 269 of https://github.com/ultralytics/ultralytics/blob/main/ultralytics/data/dataset.py#L265-L270:

labels["texts"] = [v.split("/") for _, v in self.data["names"].items()]

indicate labels["texts"] is from self.data["names"]. When run to the RandomLoadText for update neg_samples and pos_samples, the line 2229 and 2258 of https://github.com/ultralytics/ultralytics/blob/main/ultralytics/data/augment.py#L2229-L2269:

class_texts = labels["texts"]

prompts = class_texts[label]

indicate the texts in labels in from validation set, none of objects365. For Text Contrastive heads, pos_samples do not correspond to their labels.

glenn-jocher Nov 18, 2024
Maintainer

It seems like there might be a mix-up with the text labels during the dataset construction and usage phases. Ensure that the class names in your validation set align with those in your training set. The YOLOMultiModalDataset and RandomLoadText classes should reference the correct labels. Double-check the dataset preparation to ensure consistency across training and validation processes. If the issue persists, reviewing the data loading and augmentation logic might help identify discrepancies.

goodbbboy · 2024-11-04T05:10:57Z

goodbbboy
Nov 4, 2024 — with giscus

Hey there, while training with YOLO-worldv2, I've noticed this pattern: regardless of the img-size I specify（640*640）, the model always initializes with a size of 256x256. Additionally, after every epoch, the validation is done at a size of 384x672. Could there be an issue with my code? Or is this a deliberate strategy? If intentional, how can I adjust this behavior? Thanks for your response!

3 replies

glenn-jocher Nov 4, 2024
Maintainer

The behavior you're observing is likely due to the model's internal resizing strategy for efficiency. To adjust this, you can explicitly set the desired image size in your training script or configuration file. If the issue persists, please ensure your environment is correctly configured to respect these settings.

goodbbboy Nov 4, 2024 — with giscus

Thank you for your reply! We found that the issue was due to a problem in the build_yolo_dataset() function, which was originally intended to reduce computational load. We have now resolved our issue. Thanks again!

glenn-jocher Nov 4, 2024
Maintainer

Great to hear you've resolved the issue with the build_yolo_dataset() function! If you have any more questions or need further assistance, feel free to ask.

zhuitaiyang · 2024-11-04T08:52:25Z

zhuitaiyang
Nov 4, 2024 — with giscus

Hi, how can i use the pretrained model yolov8s-worldv2.pt to inference 1203 classes of lvis datasets? When i use val.py to run, there is an error in the number of categories.

3 replies

glenn-jocher Nov 4, 2024
Maintainer

@zhuitaiyang to use the yolov8s-worldv2.pt model with the LVIS dataset, ensure your model and dataset configurations align, particularly the number of classes. You might need to adjust the model's class settings or use custom prompts to match the LVIS categories. For further guidance, please refer to the YOLO-World documentation.

zhuitaiyang Nov 8, 2024 — with giscus

Thanks for your response! I reset the classes of yolov8s-worldv2.pt as you suggested, and It had the effect I wanted.

glenn-jocher Nov 8, 2024
Maintainer

I'm glad to hear that resetting the classes worked for you! If you have any more questions or need further assistance, feel free to ask.

import-jalashwa · 2024-11-05T13:14:46Z

import-jalashwa
Nov 5, 2024 — with giscus

1 reply

glenn-jocher Nov 5, 2024
Maintainer

@import-jalashwa to train YOLO-World with the GQA dataset, you can use the WorldTrainerFromScratch class to handle both detection and grounding datasets. Ensure your dataset paths and annotations are correctly set in the data dictionary before initiating training.

jinyoonok2 · 2024-11-08T07:52:24Z

jinyoonok2
Nov 8, 2024

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8-World-v2 object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024] # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs
  s: [0.33, 0.50, 1024] # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPs
  m: [0.67, 0.75, 768] # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPs
  l: [1.00, 1.00, 512] # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512] # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2fAttn, [512, 256, 8]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2fAttn, [256, 128, 4]] # 15 (P3/8-small)

  - [15, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2fAttn, [512, 256, 8]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2fAttn, [1024, 512, 16]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, WorldDetect, [nc, 512, True]] # Detect(P3, P4, P5)

Hi Ultralytics Team,

Thank you for your efforts in bringing YOLO-World into the Ultralytics framework.

I've encountered a potential issue (or perhaps just a point of confusion) that I'm hoping to clarify. The YAML file for yolov8-world, located in ultralytics/cfg/models/v8/yolov8-world.yaml, appears intended to create the original version of YOLO-World, which includes the ImagePoolingAttn module. However, I’ve noticed that none of the subsequent modules seem to take input from this ImagePoolingAttn module, neither the detection head module nor the module directly following it.

It seems that at least the module immediately following the ImagePoolingAttn module should have -1 in the "from" section to indicate it’s taking input from the previous layer. Alternatively, another module should specify input from 16, which is the index location of the ImagePoolingAttn module.

I also checked ultralytics/nn/tasks.py, which is responsible for integrating modules into the model, and observed the following in the world detection model:

class WorldModel(DetectionModel):
    """YOLOv8 World Model."""

    def __init__(self, cfg="yolov8s-world.yaml", ch=3, nc=None, verbose=True):
        """Initialize YOLOv8 world model with given config and parameters."""
        self.txt_feats = torch.randn(1, nc or 80, 512)  # features placeholder
        self.clip_model = None  # CLIP model placeholder
        super().__init__(cfg=cfg, ch=ch, nc=nc, verbose=verbose)

    def set_classes(self, text, batch=80, cache_clip_model=True):
        """Set classes in advance so that model could do offline-inference without clip model."""
        try:
            import clip
        except ImportError:
            check_requirements("git+https://github.com/ultralytics/CLIP.git")
            import clip

        if (
            not getattr(self, "clip_model", None) and cache_clip_model
        ):  # for backwards compatibility of models lacking clip_model attribute
            self.clip_model = clip.load("ViT-B/32")[0]
        model = self.clip_model if cache_clip_model else clip.load("ViT-B/32")[0]
        device = next(model.parameters()).device
        text_token = clip.tokenize(text).to(device)
        txt_feats = [model.encode_text(token).detach() for token in text_token.split(batch)]
        txt_feats = txt_feats[0] if len(txt_feats) == 1 else torch.cat(txt_feats, dim=0)
        txt_feats = txt_feats / txt_feats.norm(p=2, dim=-1, keepdim=True)
        self.txt_feats = txt_feats.reshape(-1, len(text), txt_feats.shape[-1])
        self.model[-1].nc = len(text)

    def predict(self, x, profile=False, visualize=False, txt_feats=None, augment=False, embed=None):
        """
        Perform a forward pass through the model.

        Args:
            x (torch.Tensor): The input tensor.
            profile (bool, optional): If True, profile the computation time for each layer. Defaults to False.
            visualize (bool, optional): If True, save feature maps for visualization. Defaults to False.
            txt_feats (torch.Tensor): The text features, use it if it's given. Defaults to None.
            augment (bool, optional): If True, perform data augmentation during inference. Defaults to False.
            embed (list, optional): A list of feature vectors/embeddings to return.

        Returns:
            (torch.Tensor): Model's output tensor.
        """
        txt_feats = (self.txt_feats if txt_feats is None else txt_feats).to(device=x.device, dtype=x.dtype)
        if len(txt_feats) != len(x):
            txt_feats = txt_feats.repeat(len(x), 1, 1)
        ori_txt_feats = txt_feats.clone()
        y, dt, embeddings = [], [], []  # outputs
        for m in self.model:  # except the head part
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            if isinstance(m, C2fAttn):
                x = m(x, txt_feats)
            elif isinstance(m, C2fAttnPE):
                x = m(x, txt_feats)
            elif isinstance(m, WorldDetect):
                x = m(x, ori_txt_feats)
            elif isinstance(m, ImagePoolingAttn):
                txt_feats = m(x, txt_feats)
            else:
                x = m(x)  # run

            y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
            if embed and m.i in embed:
                embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
                if m.i == max(embed):
                    return torch.unbind(torch.cat(embeddings, 1), dim=0)
        return x

    def loss(self, batch, preds=None):
        """
        Compute loss.

        Args:
            batch (dict): Batch to compute loss on.
            preds (torch.Tensor | List[torch.Tensor]): Predictions.
        """
        if not hasattr(self, "criterion"):
            self.criterion = self.init_criterion()

        if preds is None:
            preds = self.forward(batch["img"], txt_feats=batch["txt_feats"])
        return self.criterion(preds, batch)

As observed, each module consistently accesses the same initial text features from the CLIP model without impacting these features.

This suggests that the ImagePoolingAttn module may not be functioning as intended. There appears to be no mechanism in place for image features to influence the text embeddings, even though the original YOLO-World paper introduced this concept.

Given this, would it be correct to conclude that the safer approach for using YOLO-World in the Ultralytics framework is to use YOLO-World v2, which removed the ImagePoolingAttn module entirely? YOLO-World v2 relies instead on the C2fAttn module, which incorporates T-CSP layers as described in the original paper. This adjustment seems appropriate, as the ImagePoolingAttn module in YOLO-World does not appear to be fully implemented, potentially missing critical elements.

While I may have overlooked parts where image features influence text features, I haven't been able to identify them thus far. Please clarify if there is any confusion about this.

Thank you for your time and assistance.

1 reply

glenn-jocher Nov 8, 2024
Maintainer

@jinyoonok2 thank you for your detailed observations. You're correct that YOLO-World v2, which omits the ImagePoolingAttn module, is recommended for its improved implementation and efficiency. If you have further questions, feel free to ask here.

ZHAOBenyun · 2024-11-13T11:28:29Z

ZHAOBenyun
Nov 13, 2024 — with giscus

Hi,

Thank you for your wonderful work about YOLO-World detection. I have try this part, that is great.

I have found the scrips codes of official YOLO-World, they have two different tasks. One for object detection, and one for instance segmentation.

Will you update the code about instance segmentation head? Maybe named WorldSegment?

1 reply

glenn-jocher Nov 13, 2024
Maintainer

@ZHAOBenyun thank you for your interest in YOLO-World. Currently, there are no plans to update the code for an instance segmentation head. We appreciate your feedback and will consider it for future updates.

zhuitaiyang · 2024-11-27T02:03:33Z

zhuitaiyang
Nov 27, 2024

Hi, i want to set imgsz to 640×512 to train YOLO-World, but the rect=True can not support multi-GPU, how can i use imgsz=[512,640] to train?

1 reply

glenn-jocher Nov 27, 2024
Maintainer

@zhuitaiyang the rect=True option currently doesn't support multi-GPU as it relies on fixed image sizes. To use non-square imgsz=[512, 640] for training on multiple GPUs, disable rect and ensure the dataset preprocessing can handle variable image dimensions. Alternatively, consider modifying the dataloader to reshape batches dynamically while maintaining compatibility with multiple GPUs.

zhuitaiyang · 2025-01-02T07:40:44Z

zhuitaiyang
Jan 2, 2025

Hi, i added private datasets (include person and car) to the training set of YOLO-World, hoping that YOLO-World can maintain zero-shot ability while adapting to private datasets. Through testing on lvis datasets, I found that the index only dropped a little. But when I used model detecting fire just now, I found that, i need to set the confidence level very low（0.00005） to detect it, and the model without private datasets can be detected despite using a higher confidence threshold（0.01）, what is the problem?
with private datasets

without private datasets

1 reply

glenn-jocher Jan 2, 2025
Maintainer

Thank you for sharing your observations. The issue likely arises from a domain shift introduced by including private datasets in training, which could affect the model's zero-shot generalization capabilities. Fine-tuning with additional data may prioritize specific classes while slightly compromising detection confidence for others, like "fire." To address this, consider carefully balancing your training dataset or experimenting with custom prompts to guide the model's focus on desired objects. Learn more about prompts here: https://docs.ultralytics.com/models/yolo-world/.

models/yolo-world/ #8224

giscus[bot] bot Feb 15, 2024

models/yolo-world/

Replies: 36 comments · 121 replies

IamShubhamGupto Feb 15, 2024 — with giscus

pderrenger Feb 16, 2024 Maintainer

woerstcase Feb 16, 2024 — with giscus

pderrenger Feb 16, 2024 Maintainer

bdv29 Feb 16, 2024 — with giscus

pderrenger Feb 16, 2024 Maintainer

aikedaerC Jun 25, 2024 — with giscus

sandgrowagro Feb 21, 2024 — with giscus

pderrenger Feb 21, 2024 Maintainer

KishoreElvicto Feb 21, 2024 — with giscus

glenn-jocher Feb 21, 2024 Maintainer

qihuijia Feb 22, 2024 — with giscus

pderrenger Feb 23, 2024 Maintainer

pierre1618 Feb 23, 2024 — with giscus

pderrenger Feb 23, 2024 Maintainer

qihuijia Feb 28, 2024 — with giscus

pderrenger Feb 28, 2024 Maintainer

pderrenger Feb 28, 2024 Maintainer

bdv29 Mar 2, 2024 — with giscus

pderrenger Mar 2, 2024 Maintainer

AdadAlShabab Mar 4, 2024 — with giscus

glenn-jocher Mar 4, 2024 Maintainer

glenn-jocher Mar 5, 2024 Maintainer

glenn-jocher Mar 5, 2024 Maintainer

qihuijia Mar 14, 2024 — with giscus

glenn-jocher Mar 14, 2024 Maintainer

glenn-jocher Oct 16, 2024 Maintainer

Initialize a YOLO-World model

Define custom classes

Execute prediction for specified categories on an image

Show results

glenn-jocher Oct 17, 2024 Maintainer

glenn-jocher Oct 18, 2024 Maintainer

AnandSingh-0619 Mar 22, 2024 — with giscus

glenn-jocher Mar 23, 2024 Maintainer

glenn-jocher Oct 4, 2024 Maintainer

glenn-jocher Oct 4, 2024 Maintainer

Nishantg7 Apr 12, 2024 — with giscus

Create a Tkinter window

Create the WebcamApp object

app = WebcamApp(root, "Webcam Application")

pderrenger Apr 13, 2024 Maintainer

Nishantg7 Apr 13, 2024 — with giscus

pderrenger Apr 14, 2024 Maintainer

illink7 Apr 23, 2024 — with giscus

pderrenger Apr 23, 2024 Maintainer

Kunal8020 May 24, 2024 — with giscus

Kunal8020 May 24, 2024 — with giscus

glenn-jocher May 24, 2024 Maintainer

Kunal8020 May 28, 2024 — with giscus

glenn-jocher May 28, 2024 Maintainer

Kunal8020 Jun 13, 2024 — with giscus

Kunal8020 Jun 13, 2024 — with giscus

pderrenger Jun 14, 2024 Maintainer

pderrenger Jun 14, 2024 Maintainer

Kunal8020 Jun 14, 2024 — with giscus

pderrenger Jun 14, 2024 Maintainer

GaviraghiElia Sep 10, 2024 — with giscus

glenn-jocher Sep 10, 2024 Maintainer

giscus[bot]
bot Feb 15, 2024

Replies: 36 comments 121 replies

IamShubhamGupto
Feb 15, 2024 — with giscus

pderrenger Feb 16, 2024
Maintainer

pderrenger Feb 16, 2024
Maintainer

bdv29
Feb 16, 2024 — with giscus

pderrenger Feb 16, 2024
Maintainer

sandgrowagro
Feb 21, 2024 — with giscus

pderrenger Feb 21, 2024
Maintainer

KishoreElvicto
Feb 21, 2024 — with giscus

glenn-jocher Feb 21, 2024
Maintainer

qihuijia
Feb 22, 2024 — with giscus

pderrenger Feb 23, 2024
Maintainer

pderrenger Feb 23, 2024
Maintainer

pderrenger Feb 28, 2024
Maintainer

pderrenger Feb 28, 2024
Maintainer

pderrenger Mar 2, 2024
Maintainer

AdadAlShabab
Mar 4, 2024 — with giscus

glenn-jocher Mar 4, 2024
Maintainer

glenn-jocher Mar 5, 2024
Maintainer

glenn-jocher Mar 5, 2024
Maintainer

qihuijia
Mar 14, 2024 — with giscus

glenn-jocher Mar 14, 2024
Maintainer

glenn-jocher Oct 16, 2024
Maintainer

glenn-jocher Oct 17, 2024
Maintainer

glenn-jocher Oct 18, 2024
Maintainer

AnandSingh-0619
Mar 22, 2024 — with giscus

glenn-jocher Mar 23, 2024
Maintainer

glenn-jocher Oct 4, 2024
Maintainer

glenn-jocher Oct 4, 2024
Maintainer

Nishantg7
Apr 12, 2024 — with giscus

pderrenger Apr 13, 2024
Maintainer

pderrenger Apr 14, 2024
Maintainer

illink7
Apr 23, 2024 — with giscus

pderrenger Apr 23, 2024
Maintainer

Kunal8020
May 24, 2024 — with giscus

glenn-jocher May 24, 2024
Maintainer

glenn-jocher May 28, 2024
Maintainer

Kunal8020
Jun 13, 2024 — with giscus

pderrenger Jun 14, 2024
Maintainer

pderrenger Jun 14, 2024
Maintainer

pderrenger Jun 14, 2024
Maintainer

GaviraghiElia
Sep 10, 2024 — with giscus

glenn-jocher Sep 10, 2024
Maintainer