Trained for 17K iters on COCO2014, OpenImages and OpenAI's Blog Images #100
So mostly it doesn't work very well?
Edit: Happy to learn I'm probably wrong about this dataset being bad. It's got some garbage in it, but any sufficiently large dataset will these days. What we need now is more data, more compute, larger batch sizes, and greater depth. Original:
Having said that, yeah, we still need a bigger dataset. OpenAI used an extremely large dataset; this doesn't get anywhere close to that. They also used a higher quality VAE and a batch size of 512... These things aren't going to be possible without mesh-dalle. Hopefully we can continue to find techniques that get better results out of smaller datasets as well. But yeah, please don't take this as some sort of scientific baseline for "how good dalle-pytorch is". It's a bad dataset with bad captions, flooded with images that are likely to make CLIP very happy even when they have mistakes, since they were partially generated using that very same CLIP. The only reasonable data in here is the COCO2014 set, and it's only 200k images out of ~1.6 million.
@lucidrains Until we can go through and really clean the hell out of the prompts, I'd advise staying away from OpenImages "Localized Narratives" for this. The phrasing is too verbose, distracted, and wandering, and it contains enough mistakes that I see gibberish about 5% of the time and potentially coherent text the other 95%... It's pretty bad, at least compared to the claims they make on the front page of the project. It was really annoying to download all 100 GiB of that only to find out it was so poorly labeled. So I started two training sessions last night as well that are still running, but only on COCO and the blog post images. I'll post the results later today. In the meantime, enjoy this mannequin:
[mannequin sample image]
You said OpenAI used a higher quality VAE. Didn't they release the weights for it?
@sorrge they are released, and you can even start training with them in this repo! https://github.com/lucidrains/dalle-pytorch#openais-pretrained-vae
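Something along these lines should work, a minimal sketch based on the README (the hyperparameters here are placeholders, not recommendations):

```python
# rough sketch, per the dalle-pytorch README: drop OpenAI's released dVAE
# into a DALLE instance instead of training your own VAE first
import torch
from dalle_pytorch import OpenAIDiscreteVAE, DALLE

vae = OpenAIDiscreteVAE()      # downloads OpenAI's pretrained discrete VAE weights

dalle = DALLE(
    dim = 512,
    vae = vae,                 # image sequence length and codebook size come from the VAE
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 16,                # placeholder depth, not a recommendation
    heads = 16
)

text = torch.randint(0, 10000, (1, 256))   # dummy tokenized caption
images = torch.randn(1, 3, 256, 256)       # dummy image batch

loss = dalle(text, images, return_loss = True)
loss.backward()
```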
Thanks. Is there a reason to believe that the OpenAI VAE is better than the Taming Transformers one that @afiaka87 used, besides the token value range? Did somebody compare their reconstructions?
@sorrge #86 (comment) yeah, the mannequins look quite good, at least
Yes, they did. You can train DALLE-pytorch with it. It's something of a VRAM hog though, and the taming-transformers VAE shows decent accuracy for a much lower runtime/memory cost because it only uses 1024 tokens. It's impressive work. There are documented issues where it doesn't pick up certain details in reconstructions as well as OpenAI's VAE can. So it's not perfect, but it helps quite a bit in terms of being able to actually train DALLE-pytorch.
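If anyone wants to compare reconstructions themselves, a quick sketch along these lines should do it. I'm assuming both wrappers expose get_codebook_indices()/decode() and take 256x256 images in [0, 1] the way the dalle_pytorch source suggests; double-check that (and the class name, which has changed between versions) against the repo:

```python
# quick reconstruction comparison between the two pretrained VAEs;
# assumes both dalle_pytorch wrappers expose get_codebook_indices()/decode()
# and work on 256x256 images scaled to [0, 1]
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image
from dalle_pytorch import OpenAIDiscreteVAE, VQGanVAE1024

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                 # gives values in [0, 1]
])

img = preprocess(Image.open('example.jpg')).unsqueeze(0)   # placeholder image path

for name, vae in [('openai', OpenAIDiscreteVAE()), ('taming', VQGanVAE1024())]:
    with torch.no_grad():
        codes = vae.get_codebook_indices(img)   # image -> discrete token ids
        recon = vae.decode(codes)               # token ids -> reconstructed image
    save_image(recon, f'recon_{name}.png')
```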
That's totally fair. Dealing with some of this stuff and getting a bad result can be frustrating, and that may color my opinions, unfortunately. I have to say, though, I've been messing with OpenAI's pretrained ViT-B/32 CLIP for quite a while and it's just never been well suited to these types of prompts, even when they are properly written. It tries its best to maximize the features in context, but sometimes it just doesn't know enough about that many tokens in that order to really get anything other than a few words relayed. I think you'd need to train a custom CLIP on this data to get it to work the way you're thinking (where it fills in every little detail in the prompt). Which is a fantastic idea, actually!
@sorrge @lucidrains So that dataset is pretty cool in that it has mouse positions from the labeler. Labelers were required to hover over the region they were talking about while describing it vocally; the audio then gets transcribed with timing information that can be used to look up roughly "where" in the image each word is meant to go. This has obvious implications for segmentation (which they mention as their motivation). But is there any way we could train on that information in dalle-pytorch's transformer? It's essentially a mapping to the relevant region in the image for each token in the "Localized Narrative". I could think of some prompt engineering tricks, but that would require... prompt engineering.
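To make that concrete, here's roughly how the word-to-region mapping could be pulled out of one annotation record. The field names (caption, timed_caption, traces) are my reading of the jsonl layout described on the Localized Narratives project page, so treat them as assumptions, and the filename is just a placeholder:

```python
# sketch: align each transcribed word with the mouse-trace points recorded
# while it was being spoken; field names assume the Localized Narratives
# jsonl format (timed_caption with start/end times, traces with timestamps)
import json

def word_regions(record):
    """Yield (word, [(x, y), ...]) pairs for one Localized Narratives record."""
    # flatten the trace segments into a single list of timestamped points
    points = [p for segment in record['traces'] for p in segment]
    for tw in record['timed_caption']:
        start, end = tw['start_time'], tw['end_time']
        region = [(p['x'], p['y']) for p in points if start <= p['t'] <= end]
        yield tw['utterance'], region

with open('localized_narratives_shard.jsonl') as f:   # placeholder filename
    for line in f:
        record = json.loads(line)
        for word, region in word_regions(record):
            print(record['image_id'], word, len(region), 'trace points')
        break   # just the first record, for illustration
```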
@afiaka87 Their CLIP is trained on a gigantic dataset (400M image–caption pairs, IIRC). Surely there was a lot of garbage in there. It may be confused by the format "in this image we can see", because that's not how people usually annotate their pictures. But it doesn't matter for DALL-E, does it? For post-filtering it will still work, because you would use "normal" prompts for generation, which CLIP can understand.
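For anyone unfamiliar, post-filtering here just means generating several candidates and letting CLIP rank them against the plain prompt. A minimal sketch with OpenAI's released clip package (the prompt and file names are placeholders):

```python
# minimal CLIP post-filtering sketch: score generated candidates against the
# original prompt and keep the best-ranked ones
import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

prompt = 'a mannequin wearing a red dress'                  # placeholder prompt
candidates = ['gen_0.png', 'gen_1.png', 'gen_2.png']        # placeholder files

images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
text = clip.tokenize([prompt]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(-1)  # cosine similarity

ranking = scores.argsort(descending=True)
print([candidates[i] for i in ranking])
```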
Yeah, I'm out of my depth on this one. @lucidrains ? Edit: all I know is that I've had trouble with it, like, anecdotally. I'm relatively new to machine learning, though, so I don't have the full depth of understanding needed, and you could very well be correct! If that's the case, do you think it's just a matter of needing to scale up the batch size and the size of the dataset? I'm getting okay-ish representations on these simpler datasets, but this one seemed like it wasn't going to converge anytime soon.
DALL-E will probably just learn to ignore the "In this image we can see" beginning and use the list of things that follows as clues for what should be included.
> If that's the case, do you think it's just a matter of needing to scale up the batch size and the size of the dataset? I'm getting okay-ish representations on these simpler datasets, but this one seemed like it wasn't going to converge anytime soon.
Edit: It could also be that the smaller VAE's errors accumulate more on this dataset? No idea.
Yes, the size of the dataset and the depth of the model are the keys, per OpenAI's paper. That was the main point, as in their other notable works: how far can the model be pushed? So, if we want quality, we need to match the effort. In this attempt, the repetition in the captions (from the blog post) likely caused some overfitting. For example, the mannequins are relatively similar in both captions and images, and it learned them the best. To train a powerful model, we need more diversity in the data.
Thanks, that's helpful. I guess the main issue with that is the obvious lack of compute. Without finding potential optimizations (such as the 1024-token model) we're looking at year-long training times. Anyway, that's always been obvious. As for depth, I continue to shoot for 64 (which surprisingly fits in VRAM at a batch size of <12 on the 1024 VAE). It does indeed produce higher quality images. Do you think training on this same dataset with a depth of 64 is worthwhile?
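For reference, that configuration looks roughly like this. Only the VAE and the depth reflect what's described above; the remaining hyperparameters are guesses, not the exact settings from the run:

```python
# rough sketch of a depth-64 DALLE on the taming-transformers VAE;
# everything besides vae/depth is a guess rather than the actual configuration
from dalle_pytorch import VQGanVAE1024, DALLE

vae = VQGanVAE1024()   # pretrained taming-transformers VAE (class name may differ by version)

dalle = DALLE(
    dim = 512,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,        # the depth discussed above
    heads = 16,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)
# trained with a batch size under 12 to stay within VRAM at this depth
```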
I'd at least wait for the WIT data, which should come out in a few days. That's ~11M images with more or less good captions from Wikipedia. It will be a dramatic jump forward from the current dataset.
Cool, thanks. This compute is expensive, and it's very useful for me to know when something is a waste of effort/money. At any rate, I just got some stimulus and decided to invest it in a couple hundred dollars of GPU compute on vast.ai so I can have an actual stable development environment for a while. This is all a good learning experience for me whether I get good results or not. If you have an idea for a dataset to train, or need compute for debugging a new feature, do let me know and I'll see if I still have any compute available.
I'm still waiting for my rig to get shipped; until then I will only be able to comment on "metadata". But I think you're doing a really good job, @afiaka87! Even with the "bad" dataset, the results seem promising, and the taming-transformers VAE speeds things up! Things to look forward to:
Moreover, we should start a list of big datasets that might fit DALL-E training:
Please check the discussions tab for information on my training efforts: |
In case you haven't read my usual disclaimer: this dataset is weird. The repetition in the OpenAI images causes those to be highly overfit (mannequins), and the remainder of the dataset is much more diverse, which dalle-pytorch doesn't manage to capture very well here. Also, keep in mind: this isn't even a full epoch. Just having fun. Try not to evaluate this as representative of dalle-pytorch's current capabilities.
Hey everyone. @lucidrains recently added the new, lighter pretrained VAE from the taming-transformers group. It uses substantially less memory and compute. I decided to take all the datasets I've collected thus far, put them in a single folder on an A100, and train dalle-pytorch for several hours.
Here are the results:
https://wandb.ai/afiaka87/OpenImagesV6/reports/Training-on-COCO-OpenImage-Blogpost--Vmlldzo1NDE3NjU
I'm exhausted, so that's all for now, but please click the link and have a look at the thousands of reconstructions it made (and the horrible captions from the "Localized Narratives" dataset I got from Google). I'll be updating this post with more info throughout the day.