I noticed descriptions become a lot better when I include 4 zoomed sections of the image and combine the descriptions, yet this is very slow because the language model has to run multiple times.
To improve the detailed description, how about cropping a separate section for each identified object at the highest resolution, so one crop per object, and feeding each zoomed crop to CLIP for captioning? (I presume CLIP currently only sees the image in its full form.) That way you could run CLIP several times, potentially making much better use of its resolution (rough sketch below).
For a photo of a person with a car, it could look at the upper body, lower body, face, tires, etc. all in detail. That should make it much easier to recognise emotions, for example.
Is this possible? Or is this how CLIP works already? Sorry, I'm not sure about the mechanics here :-)
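Roughly what I have in mind, just a sketch and not how the tool actually works: I'm assuming an off-the-shelf detector (torchvision's Faster R-CNN here) to find objects, and the openai `clip` package to rank candidate phrases per crop. The candidate phrases below are placeholders; I don't know how the real phrase lists are built.

```python
# Sketch: crop each detected object at full resolution and let CLIP rank
# candidate phrases per crop instead of only scoring the whole image.
import torch
import clip  # openai/CLIP
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Placeholder phrases; the real tool would use its own phrase lists.
candidates = ["a smiling face", "a sad face", "a car tire", "a blue jacket"]
text_tokens = clip.tokenize(candidates).to(device)

image = Image.open("photo.jpg").convert("RGB")

with torch.no_grad():
    detections = detector([to_tensor(image)])[0]
    text_feat = clip_model.encode_text(text_tokens)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

    for box, score in zip(detections["boxes"], detections["scores"]):
        if score < 0.8:  # keep only confident detections
            continue
        x0, y0, x1, y1 = [int(v) for v in box.tolist()]
        # Zoomed region cropped from the original, full-resolution image.
        crop = image.crop((x0, y0, x1, y1))
        img_feat = clip_model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        best = (img_feat @ text_feat.T).squeeze(0).argmax().item()
        print((x0, y0, x1, y1), "->", candidates[best])
```

The per-crop results could then be merged into one description, similar to what I did manually with the 4 zoomed sections, but driven by detected objects instead of a fixed grid.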