Trained for 17K iters on COCO2014, OpenImages and OpenAI's Blog Images #100
So mostly it doesn't work very well?
Edit: Happy to learn I'm probably wrong about this dataset being bad. It's got some garbage in it, but any sufficiently large dataset will these days. What we need now is more data, more compute, larger batch sizes, and greater depth. Original:
Having said that, yeah, we still need a bigger dataset. OpenAI used an extremely large dataset; this doesn't get anywhere close to that. They also used a higher quality VAE and a batch size of 512... These things aren't going to be possible without mesh-dalle. Hopefully we can continue to find techniques that get better results out of smaller datasets as well. But yeah, please don't take this as some sort of scientific baseline for "how good dalle-pytorch is". It's a bad dataset with bad captions, flooded with images that are likely to make CLIP very happy even when they have mistakes, since they were partially generated using that very same CLIP. The only reasonable data in here is the COCO2014 set, and it's only 200k images out of ~1.6 million.
@lucidrains Until we can go through and really clean the hell out of the prompts, I'd advise staying away from OpenImages "Localized Narratives" for this. The phrasing is too verbose, distracted, and wandering, and it contains enough mistakes that I see gibberish about 5% of the time and potentially coherent text the other 95%... It's pretty bad, at least compared to the claims they make on the front page of the project. It was really annoying to download all 100 GiB of that only to find out it was so poorly labeled. So I started two training sessions last night as well that are still running, but only on COCO and the blog post images. I'll post the results later today. In the meantime, enjoy this mannequin:
[mannequin sample image]
You said OpenAI used a higher quality VAE. Didn't they release the weights for it?
@sorrge they are released, and you can even start training with them in this repo! https://github.com/lucidrains/dalle-pytorch#openais-pretrained-vae
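Something along these lines should work, a minimal sketch based on the README (the hyperparameters here are placeholders, not recommendations):

```python
# rough sketch, per the dalle-pytorch README: drop OpenAI's released dVAE
# into a DALLE instance instead of training your own VAE first
import torch
from dalle_pytorch import OpenAIDiscreteVAE, DALLE

vae = OpenAIDiscreteVAE()      # downloads OpenAI's pretrained discrete VAE weights

dalle = DALLE(
    dim = 512,
    vae = vae,                 # image sequence length and codebook size come from the VAE
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 16,                # placeholder depth, not a recommendation
    heads = 16
)

text = torch.randint(0, 10000, (1, 256))   # dummy tokenized caption
images = torch.randn(1, 3, 256, 256)       # dummy image batch

loss = dalle(text, images, return_loss = True)
loss.backward()
```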
Thanks. Is there a reason to believe that the OpenAI VAE is better than the Taming Transformers one that @afiaka87 used, besides the token value range? Did somebody compare their reconstructions?
@sorrge #86 (comment) yeah, the mannequins look quite good, at least
Yes, they did. You can train DALLE-pytorch with it. It's something of a VRAM hog though, and the taming-transformers VAE shows decent accuracy for a much lower runtime/memory cost because it only uses 1024 tokens. It's impressive work. There are documented issues where it doesn't pick up certain details in reconstructions as well as OpenAI's VAE can. So it's not perfect, but it helps quite a bit in terms of being able to actually train DALLE-pytorch.
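If anyone wants to compare reconstructions themselves, a quick sketch along these lines should do it. I'm assuming both wrappers expose get_codebook_indices()/decode() and take 256x256 images in [0, 1] the way the dalle_pytorch source suggests; double-check that (and the class name, which has changed between versions) against the repo:

```python
# quick reconstruction comparison between the two pretrained VAEs;
# assumes both dalle_pytorch wrappers expose get_codebook_indices()/decode()
# and work on 256x256 images scaled to [0, 1]
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image
from dalle_pytorch import OpenAIDiscreteVAE, VQGanVAE1024

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                 # gives values in [0, 1]
])

img = preprocess(Image.open('example.jpg')).unsqueeze(0)   # placeholder image path

for name, vae in [('openai', OpenAIDiscreteVAE()), ('taming', VQGanVAE1024())]:
    with torch.no_grad():
        codes = vae.get_codebook_indices(img)   # image -> discrete token ids
        recon = vae.decode(codes)               # token ids -> reconstructed image
    save_image(recon, f'recon_{name}.png')
```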
That's totally fair. Dealing with some of this stuff and getting a bad result can be frustrating, and that may color my opinions, unfortunately. I have to say, though, I've been messing with OpenAI's pretrained ViT-B/32 CLIP for quite a while and it's just never been well suited to these types of prompts, even when they are properly written. It tries its best to maximize the features in context, but sometimes it just doesn't know enough about that many tokens in that order to really get anything other than a few words relayed. I think you'd need to train a custom CLIP on this data to get it to work the way you're thinking (where it fills in every little detail in the prompt). Which is a fantastic idea, actually!
@sorrge @lucidrains So that dataset is pretty cool in that it has mouse positions from the labeler. Labelers were required to hover over the region they were talking about while describing it vocally; the audio then gets transcribed with timing information that can be used to look up roughly "where" in the image each word is meant to go. This has obvious implications for segmentation (which they mention as their motivation). But is there any way we could train on that information in dalle-pytorch's transformer? It's essentially a mapping to the relevant region in the image for each token in the "Localized Narrative". I could think of some prompt engineering tricks, but that would require... prompt engineering.
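To make that concrete, here's roughly how the word-to-region mapping could be pulled out of one annotation record. The field names (caption, timed_caption, traces) are my reading of the jsonl layout described on the Localized Narratives project page, so treat them as assumptions, and the filename is just a placeholder:

```python
# sketch: align each transcribed word with the mouse-trace points recorded
# while it was being spoken; field names assume the Localized Narratives
# jsonl format (timed_caption with start/end times, traces with timestamps)
import json

def word_regions(record):
    """Yield (word, [(x, y), ...]) pairs for one Localized Narratives record."""
    # flatten the trace segments into a single list of timestamped points
    points = [p for segment in record['traces'] for p in segment]
    for tw in record['timed_caption']:
        start, end = tw['start_time'], tw['end_time']
        region = [(p['x'], p['y']) for p in points if start <= p['t'] <= end]
        yield tw['utterance'], region

with open('localized_narratives_shard.jsonl') as f:   # placeholder filename
    for line in f:
        record = json.loads(line)
        for word, region in word_regions(record):
            print(record['image_id'], word, len(region), 'trace points')
        break   # just the first record, for illustration
```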
@afiaka87 Their CLIP is trained on a gigantic dataset (400M image–caption pairs, IIRC). Surely there was a lot of garbage in there. It may be confused by the format "in this image we can see", because that's not how people usually annotate their pictures. But it doesn't matter for DALL-E, does it? For post-filtering it will still work, because you would use "normal" prompts for generation, which CLIP can understand.
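For anyone unfamiliar, post-filtering here just means generating several candidates and letting CLIP rank them against the plain prompt. A minimal sketch with OpenAI's released clip package (the prompt and file names are placeholders):

```python
# minimal CLIP post-filtering sketch: score generated candidates against the
# original prompt and keep the best-ranked ones
import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

prompt = 'a mannequin wearing a red dress'                  # placeholder prompt
candidates = ['gen_0.png', 'gen_1.png', 'gen_2.png']        # placeholder files

images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
text = clip.tokenize([prompt]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(-1)  # cosine similarity

ranking = scores.argsort(descending=True)
print([candidates[i] for i in ranking])
```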
Yeah, I'm out of my depth on this one. @lucidrains ? Edit: all I know is that I've had trouble with it, like, anecdotally. I'm relatively new to machine learning, though, so I don't have the full depth of understanding needed, and you could very well be correct! If that's the case, do you think it's just a matter of needing to scale up the batch size and the size of the dataset? I'm getting okay-ish representations on these simpler datasets, but this one seemed like it wasn't going to converge anytime soon.
DALL-E will probably just learn to ignore the "In this image we can see" beginning and use the list of things that follows as clues for what should be included.
> If that's the case, do you think it's just a matter of needing to scale up the batch size and the size of the dataset? I'm getting okay-ish representations on these simpler datasets, but this one seemed like it wasn't going to converge anytime soon.
Edit: It could also be that the smaller VAE's errors accumulate more on this dataset? No idea.
Yes, the size of the dataset and the depth of the model are the keys, per OpenAI's paper. That was the main point, as in their other notable works: how far can the model be pushed? So, if we want quality, we need to match the effort. In this attempt, the repetition in the captions (from the blog post) likely caused some overfitting. For example, the mannequins are relatively similar in both captions and images, and it learned them the best. To train a powerful model, we need more diversity in the data.
Thanks, that's helpful. I guess the main issue with that is the obvious lack of compute. Without finding potential optimizations (such as the 1024-token model) we're looking at year-long training times. Anyway, that's always been obvious. As for depth, I continue to shoot for 64 (which surprisingly fits in VRAM at a batch size of <12 on the 1024 VAE). It does indeed produce higher quality images. Do you think training on this same dataset with a depth of 64 is worthwhile?
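For reference, that configuration looks roughly like this. Only the VAE and the depth reflect what's described above; the remaining hyperparameters are guesses, not the exact settings from the run:

```python
# rough sketch of a depth-64 DALLE on the taming-transformers VAE;
# everything besides vae/depth is a guess rather than the actual configuration
from dalle_pytorch import VQGanVAE1024, DALLE

vae = VQGanVAE1024()   # pretrained taming-transformers VAE (class name may differ by version)

dalle = DALLE(
    dim = 512,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,        # the depth discussed above
    heads = 16,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)
# trained with a batch size under 12 to stay within VRAM at this depth
```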
I'd at least wait for the WIT data, which should come out in a few days. That's ~11M images with more or less good captions from Wikipedia. It will be a dramatic jump forward from the current dataset.
Cool, thanks. This compute is expensive, and it's very useful for me to know when something is a waste of effort/money. At any rate, I just got some stimulus and decided to invest it in a couple hundred dollars of GPU compute on vast.ai so I can have an actual stable development environment for a while. This is all a good learning experience for me whether I get good results or not. If you have an idea for a dataset to train, or need compute for debugging a new feature, do let me know and I'll see if I still have any compute available.
I'm still waiting for my rig to get shipped; until then I will only be able to comment on "metadata". But I think you're doing a really good job, @afiaka87! Even with the "bad" dataset, the results seem promising, and the taming-transformers VAE speeds things up! Things to look forward to:
Moreover, we should start a list of big datasets that might fit DALL-E training:
Please check the discussions tab for information on my training efforts: |
In case you haven't read my usual disclaimer: this dataset is weird. The repetition in the OpenAI images causes those to be highly overfit (mannequins), and the remainder of the dataset is much more diverse, which dalle-pytorch doesn't manage to capture very well here. Also, keep in mind: this isn't even a full epoch. Just having fun. Try not to evaluate this as representative of dalle-pytorch's current capabilities.
Hey everyone. @lucidrains recently added the new, lighter pretrained VAE from the taming-transformers group. It uses substantially less memory and compute. I decided to take all the datasets I've collected thus far, put them in a single folder on an A100, and train dalle-pytorch for several hours.
Here are the results:
https://wandb.ai/afiaka87/OpenImagesV6/reports/Training-on-COCO-OpenImage-Blogpost--Vmlldzo1NDE3NjU
I'm exhausted, so that's all for now, but please click the link and have a look at the thousands of reconstructions it made (and the horrible captions from the "Localized Narratives" dataset I got from Google). I'll be updating this post with more info throughout the day.