LLaVA Vision Model's Feature Resolution and Dense Features Availability #1598
Unanswered · pengsongyou asked this question in Q&A · Replies: 0 comments
Hello LLaVA Community,
Firstly, thanks for the incredible series of LLaVA works! While I have yet to experiment with it personally, I find it highly relevant to my current project.
I have a specific query regarding the LLaVA visual encoder. Assuming I input an image with a resolution of 256x256, could you clarify what the resolution of the language embedding tokens H_v would be (as shown in the architecture below)? Is it typically reduced by factors like 4x or 8x compared to the original image resolution, or does the model generate a single global embedding?
Furthermore, I am curious if any models within the LLaVA family are capable of providing such dense feature representations?
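To make the question concrete, here is my assumed mental model of how a ViT-based encoder turns an image into tokens. The patch size of 14 comes from CLIP ViT-L/14, which I believe LLaVA uses as its vision backbone, but the function name and numbers below are my own illustration, so please correct me if the projection layer changes the token count:

```python
# Hypothetical sketch (my assumption, not from the LLaVA code) of how a
# ViT encoder such as CLIP ViT-L/14 maps image resolution to patch tokens.

def patch_token_grid(image_size: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (tokens_per_side, total_tokens) for a square input image."""
    side = image_size // patch_size  # one token per non-overlapping patch
    return side, side * side

# If this model of the encoder is right, features are downsampled by the
# patch size (14x), not by a 4x or 8x factor:
print(patch_token_grid(224))  # (16, 256) at CLIP ViT-L/14's native 224x224
print(patch_token_grid(336))  # (24, 576) at the 336px variant's input size
```

Under this assumption, a 256x256 input would not divide evenly by the 14px patch size, so I imagine the image is first resized to the encoder's native resolution rather than processed at 256x256 directly.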
Thank you so much in advance for your help!
Best regards,
Songyou