A few thoughts on JEPA: task-goal / task-definition #61

Open
yuedajiong opened this issue May 16, 2024 · 2 comments

yuedajiong commented May 16, 2024

(1) LeCun and his collaborators and doctoral students are all experts, and I greatly admire them.
(2) Technically, my understanding may be incorrect.

These are just technical thoughts.

I have long felt that JEPA's task goal / task definition is not well suited, or not expansive or leapfrogging enough. As a result, the JEPA series of algorithms falls short even when well optimized.

(1) Even within vision alone, JEPA does not explain the representations it learns internally, nor what representations such as world models actually are. Are they still just the distributed weights of an ordinary neural network, are there special network structures such as latent representations, or are there explicit 3D/4D representations? Without delving into the details of JEPA, the network viewed in general terms does not differ significantly from others, nor does it offer the leap in task definition needed for stronger AI such as AGI/ASI.
Even at the forefront, with work like JEPA, I believe there has been no fundamental breakthrough, even on pure vision tasks. I don't think an ideal, powerful vision system should be a one-way, one-shot, train-once-infer-many system like LLMs. In such a system, each act of visual processing would involve multiple visual recognitions running in parallel, alternating and iterating repeatedly before producing the final output.

(2) As for the ultimate form of the vision task, I personally believe it should be the ability, like a human's, to construct a 3-dimensional world from a single image, a pair of images (with left-right disparity), or video, without needing the camera/observation position, and even a dynamic 4-dimensional world (in most cases without requiring physical-level precision). This could be a latent representation, but an explicit representation (such as a point cloud, surface mesh, Gaussian splatting, ...) would be better.

(3) To support various high-level applications, such as differentiable vision-based prediction/inference/planning, this latent or explicit representation should be usable by neural networks (for example, to estimate how moving objects maneuver around a building on the road). Depending on what further applications require, this 3D/4D representation may also need estimated distances and semantic labels (human, building; stone, swamp; water, fog, glass, ...; old or new; soft or hard; ...).

(4) Below I will append some samples about vision AI, though not limited to vision only.

I am not questioning the great minds, just pondering what the ultimate definition of the visual task is. Once JEPA is done well, how far will we still be from it?
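
A minimal sketch of what points (2) and (3) describe, assuming a rectified stereo pair and a pinhole camera model (neither is specified in this post). It back-projects a disparity map into an explicit point cloud in which every point carries an estimated distance and a semantic label. The names `LabeledPoint` and `disparity_to_points`, the focal length, the baseline, and the labels are all hypothetical, and the disparity map and labels are taken as given rather than learned.

```python
from dataclasses import dataclass
import math


@dataclass
class LabeledPoint:
    """One point of an explicit 3D representation (hypothetical structure)."""
    x: float
    y: float
    z: float      # depth along the optical axis, in metres
    label: str    # semantic label, e.g. "human" or "building"

    @property
    def distance(self) -> float:
        """Euclidean distance from the camera centre."""
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)


def disparity_to_points(disparity, labels, focal_px, baseline_m, cx, cy):
    """Back-project a disparity map (in pixels) to labeled 3D points.

    Assumes rectified stereo: depth = focal_px * baseline_m / disparity.
    Pixels with non-positive disparity have no valid depth and are skipped.
    """
    points = []
    for v, row in enumerate(disparity):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            z = focal_px * baseline_m / d
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append(LabeledPoint(x, y, z, labels[v][u]))
    return points


# Toy 2x2 inputs; all values are made up for illustration.
disparity = [[0.0, 10.0],
             [20.0, 40.0]]
labels = [["sky", "building"],
          ["human", "stone"]]
cloud = disparity_to_points(disparity, labels,
                            focal_px=100.0, baseline_m=0.2, cx=0.5, cy=0.5)
for p in cloud:
    print(p.label, round(p.z, 3), round(p.distance, 3))
```

With these toy values the zero-disparity pixel is dropped and three labeled points remain. The sketch covers only the geometric back-projection step; inferring disparity, labels, or dynamics (the 4D case) from raw images is the hard part the post is asking about.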

yuedajiong changed the title from "A few thoughts on JEPA" to "A few thoughts on JEPA: task-goal / task-definition" on May 16, 2024

yuedajiong commented Jun 1, 2024

On the diversity of visual tasks, another example among hundreds: from dynamic visual objects to symbolic functions.

[Image: superi-cv-vision-to-symbol]

https://github.com/yuedajiong/super-ai-objective-world

IMPORTANT: the functions are black boxes, and even the symbolic abstraction is forced and passive, not planned or designed by the algorithm.


yuedajiong commented Jun 1, 2024

On the diversity of visual tasks, another example among hundreds: strong dependence on positional/orientation information
(more tasks: dependence on equivariance (not invariance); reliance on raw 2D spatial information; ...)

[Images: direction-1, direction-2]
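
The distinction between equivariance and invariance mentioned above can be illustrated with a toy sketch, assuming a 1D circular signal in place of an image (an assumption made only to keep the example short; the function names are hypothetical). A correlation feature map moves together with the input, so it is equivariant and still encodes position, while a global max over that map is invariant and discards position.

```python
def circular_shift(signal, k):
    """Shift a list to the right by k positions, wrapping around."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def circular_correlate(signal, kernel):
    """Cross-correlate a signal with a kernel using wrap-around indexing."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def global_max(features):
    """Global max pooling: keeps the strongest response, drops its location."""
    return max(features)


signal = [0, 0, 1, 3, 1, 0, 0, 0]   # a small "object" near the left
kernel = [1, 2, 1]
shift = 3

features = circular_correlate(signal, kernel)
features_of_shifted = circular_correlate(circular_shift(signal, shift), kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
equivariant = features_of_shifted == circular_shift(features, shift)

# Invariance: the pooled value is unchanged by the shift.
invariant = global_max(features_of_shifted) == global_max(features)

# Position survives in the feature map but not in the pooled value.
position_before = features.index(max(features))
position_after = features_of_shifted.index(max(features_of_shifted))

print(equivariant, invariant, position_before, position_after)
```

A task that depends on where or in which direction something is needs the equivariant feature map; a representation reduced to the pooled value alone cannot answer it.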
