<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/@alpinejs/[email protected]/dist/cdn.min.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/cdn.min.js"></script>
</head>
<body>
<div class="relative mx-auto h-full max-w-2xl text-md">
<table class="table-auto">
<tbody>
<tr>
<td></td>
<td>
<h1 class="text-4xl pt-4 font-bold"><span class="underline">Vincent's</span> Arxiv FrontPage</h1>
<br>
<p>Generated on 2024-12-15.</p><br/>
<p class="text-sm text-gray-500 pt-2">This frontpage is made by scraping arXiv and running a sentence model that detects whether an abstract describes a paper about a topic of interest. One cool feature: it pretty much all runs via GitHub Actions.</p>
<br>
</td>
</tr><tr>
<td></td>
<td>
<h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models. To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions.<span class='px-1 mx-1 bg-yellow-200'>Code and corruption datasets have been made publicly available. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.891</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09511v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos.<span class='px-1 mx-1 bg-yellow-200'>In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.786</span></span>We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09530v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Neptune: The Long Orbit to Benchmarking Long Video Understanding
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos.<span class='px-1 mx-1 bg-yellow-200'>The dataset is available at https://github.com/google-deepmind/neptune <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.823</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09582v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets.<span class='px-1 mx-1 bg-yellow-200'>OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.838</span></span><span class='px-1 mx-1 bg-yellow-200'>We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span>We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09587v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects.<span class='px-1 mx-1 bg-yellow-200'>Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.726</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09593v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
RatBodyFormer: Rodent Body Surface from Keypoints
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Rat behavior modeling goes to the heart of many scientific studies, yet the textureless body surface evades automatic analysis as it literally has no keypoints that detectors can find. The movement of the body surface, however, is a rich source of information for deciphering the rat behavior. We introduce two key contributions to automatically recover densely 3D sampled rat body surface points, passively.<span class='px-1 mx-1 bg-yellow-200'>The first is RatDome, a novel multi-camera system for rat behavior capture, and a large-scale dataset captured with it that consists of pairs of 3D keypoints and 3D body surface points. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>The second is RatBodyFormer, a novel network to transform detected keypoints to 3D body surface points. RatBodyFormer is agnostic to the exact locations of the 3D body surface points in the training data and is trained with masked-learning. We experimentally validate our framework with a number of real-world experiments. Our results collectively serve as a novel foundation for automated rat behavior analysis and will likely have far-reaching implications for biomedical and neuroscientific research.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09599v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Hidden Biases of End-to-End Driving Datasets
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>End-to-end driving systems have made rapid progress, but have so far not been applied to the challenging new CARLA Leaderboard 2.0. Further, while there is a large body of literature on end-to-end architectures and training strategies, the impact of the training dataset is often overlooked. In this work, we make a first attempt at end-to-end driving for Leaderboard 2.0. Instead of investigating architectures, we systematically analyze the training dataset, leading to new insights: (1) Expert style significantly affects downstream policy performance. (2) In complex data sets, the frames should not be weighted on the basis of simplistic criteria such as class frequencies. (3) Instead, estimating whether a frame changes the target labels compared to previous frames can reduce the size of the dataset without removing important information. By incorporating these findings, our model ranks first and second respectively on the map and sensors tracks of the 2024 CARLA Challenge, and sets a new state-of-the-art on the Bench2Drive test routes. Finally, we uncover a design flaw in the current evaluation metrics and propose a modification for future challenges.<span class='px-1 mx-1 bg-yellow-200'>Our dataset, code, and pre-trained models are publicly available at https://github.com/autonomousvision/carla_garage. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.716</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09602v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Do Multimodal Large Language Models See Like Humans?
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.<span class='px-1 mx-1 bg-yellow-200'>HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09603v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Learning Camera Movement Control from Real-World Drone Videos
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using a Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos.<span class='px-1 mx-1 bg-yellow-200'>Data and code are available at dvgformer.github.io. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.714</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09620v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points.<span class='px-1 mx-1 bg-yellow-200'>We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.834</span></span>Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at https://lwq20020127.github.io/OmniDrag.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09623v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Multi-GraspLLM: A Multimodal LLM for Multi-Hand Semantic Guided Grasp Generation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Multi-hand semantic grasp generation aims to generate feasible and semantically appropriate grasp poses for different robotic hands based on natural language instructions. Although the task is highly valuable, due to the lack of multi-hand grasp datasets with fine-grained contact description between robotic hands and objects, it is still a long-standing difficult task.<span class='px-1 mx-1 bg-yellow-200'>In this paper, we present Multi-GraspSet, the first large-scale multi-hand grasp dataset with automatic contact annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.863</span></span>Based on Multi-GraspSet, we propose Multi-GraspLLM, a unified language-guided grasp generation framework. It leverages large language models (LLM) to handle variable-length sequences, generating grasp poses for diverse robotic hands in a single unified architecture. Multi-GraspLLM first aligns the encoded point cloud features and text features into a unified semantic space. It then generates grasp bin tokens which are subsequently converted into grasp pose for each robotic hand via hand-aware linear mapping. The experimental results demonstrate that our approach significantly outperforms existing methods on Multi-GraspSet. More information can be found on our project page https://multi-graspllm.github.io.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08468v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive zero-shot classification capabilities with free-form prompts and even show some generalization in specialized domains. However, their performance on satellite imagery is limited due to the underrepresentation of such data in their training sets, which predominantly consist of ground-level images. Existing prompting techniques for satellite imagery are often restricted to generic phrases like "a satellite image of ...", limiting their effectiveness for zero-shot land-use and land-cover (LULC) mapping.<span class='px-1 mx-1 bg-yellow-200'>To address these challenges, we introduce SenCLIP, which transfers CLIP's representation to Sentinel-2 imagery by leveraging a large dataset of Sentinel-2 images paired with geotagged ground-level photos from across Europe. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.738</span></span>We evaluate SenCLIP alongside other SOTA remote sensing VLMs on zero-shot LULC mapping tasks using the EuroSAT and BigEarthNet datasets with both aerial and ground-level prompting styles. Our approach, which aligns ground-level representations with satellite imagery, demonstrates significant improvements in classification accuracy across both prompt styles, opening new possibilities for applying free-form textual descriptions in zero-shot LULC mapping.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08536v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing datasets of image-text pairs, which lack precise inter-object relationship annotations with prompts only.<span class='px-1 mx-1 bg-yellow-200'>To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.794</span></span>Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show advanced models trained on our LAION-SG boast significant performance improvements in complex scene generation over models on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08580v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators.<span class='px-1 mx-1 bg-yellow-200'>To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span>Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes.<span class='px-1 mx-1 bg-yellow-200'>Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.9</span></span>We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08591v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Sewing patterns, the essential blueprints for fabric cutting and tailoring, act as a crucial bridge between design concepts and producible garments. However, existing uni-modal sewing pattern generation models struggle to effectively encode complex design concepts with a multi-modal nature and correlate them with vectorized sewing patterns that possess precise geometric structures and intricate sewing relations. In this work, we propose a novel sewing pattern generation approach Design2GarmentCode based on Large Multimodal Models (LMMs), to generate parametric pattern-making programs from multi-modal design concepts. LMM offers an intuitive interface for interpreting diverse design inputs, while pattern-making programs could serve as well-structured and semantically meaningful representations of sewing patterns, and act as a robust bridge connecting the cross-domain pattern-making knowledge embedded in LMMs with vectorized sewing patterns. Experimental results demonstrate that our method can flexibly handle various complex design expressions such as images, textual descriptions, designer sketches, or their combinations, and convert them into size-precise sewing patterns with correct stitches. Compared to previous methods, our approach significantly enhances training efficiency, generation quality, and authoring flexibility.<span class='px-1 mx-1 bg-yellow-200'>Our code and data will be publicly available. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.817</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08603v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
CoinCLIP: A Multimodal Framework for Evaluating the Viability of Memecoins in the Web3 Ecosystem
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The rapid growth of memecoins within the Web3 ecosystem, driven by platforms like Pump.fun, has made it easier for anyone to create tokens. However, this democratization has also led to an explosion of low-quality or bot-generated projects, often motivated by short-term financial gain. This overwhelming influx of speculative tokens creates a challenge in distinguishing viable memecoins from those that are unlikely to succeed. To address this issue, we introduce CoinVibe, a comprehensive multimodal dataset designed to evaluate the viability of memecoins. CoinVibe integrates textual descriptions, visual content (logos), and community data (user comments, timestamps, and number of likes) to provide a holistic view of a memecoin's potential. In addition, we present CoinCLIP, a novel framework that leverages the Contrastive Language-Image Pre-Training (CLIP) model, augmented with lightweight modules and community data integration, to improve classification accuracy. By combining visual and textual representations with community insights, CoinCLIP provides a robust, data-driven approach to filter out low-quality or bot-driven projects. This research aims to help creators and investors identify high-potential memecoins, while also offering valuable insights into the factors that contribute to their long-term success.<span class='px-1 mx-1 bg-yellow-200'>The code and dataset are publicly available at https://github.com/hwlongCUHK/CoinCLIP.git. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.808</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07591v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
ViewDelta: Text-Prompted Change Detection in Unaligned Images
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Detecting changes between images is a fundamental problem in computer vision with broad applications in situational awareness, infrastructure assessment, environment monitoring, and industrial automation. Existing supervised models are typically limited to detecting specific types of changes, necessitating retraining for new tasks. To address these limitations with a single approach, we propose a novel change detection method that is the first to utilize unaligned images and textual prompts to output a binary segmentation of changes relevant to user-provided text. Our architecture not only enables flexible detection across diverse change detection use cases, but also yields state-of-the-art performance on established benchmarks.<span class='px-1 mx-1 bg-yellow-200'>Additionally, we release an accompanying dataset comprising 100,311 pairs of images with text prompts and the corresponding change detection labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.913</span></span>We demonstrate the effectiveness of our method both quantitatively and qualitatively on datasets with a wide variety of viewpoints in indoor, outdoor, street level, synthetic, and satellite images.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07612v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies.<span class='px-1 mx-1 bg-yellow-200'>The code and dataset are available at https://github.com/opendatalab/OmniDocBench. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.91</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07626v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Ask Humans or AI? Exploring Their Roles in Visualization Troubleshooting
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Visualization authoring is an iterative process requiring users to modify parameters like color schemes and data transformations to achieve desired aesthetics and effectively convey insights. Due to the complexity of these adjustments, users often create defective visualizations and require troubleshooting support. In this paper, we examine two primary approaches for visualization troubleshooting: (1) Human-assisted support via forums, where users receive advice from other individuals, and (2) AI-assisted support using large language models (LLMs). Our goal is to understand the strengths and limitations of each approach in supporting visualization troubleshooting tasks.<span class='px-1 mx-1 bg-yellow-200'>To this end, we collected 889 Vega-Lite cases from Stack Overflow. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.704</span></span>We then conducted a comprehensive analysis to understand the types of questions users ask, the effectiveness of human and AI guidance, and the impact of supplementary resources, such as documentation and examples, on troubleshooting outcomes. Our findings reveal a striking contrast between human- and AI-assisted troubleshooting: Human-assisted troubleshooting provides tailored, context-sensitive advice but often varies in response quality, while AI-assisted troubleshooting offers rapid feedback but often requires additional contextual resources to achieve desired results.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07673v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from the source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images.<span class='px-1 mx-1 bg-yellow-200'>To achieve this goal, we constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.734</span></span><span class='px-1 mx-1 bg-yellow-200'>This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1M high-quality generated images with visual attribute annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.846</span></span>Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07674v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
On Motion Blur and Deblurring in Visual Place Recognition
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Visual Place Recognition (VPR) in mobile robotics enables robots to localize themselves by recognizing previously visited locations using visual data. While the reliability of VPR methods has been extensively studied under conditions such as changes in illumination, season, weather and viewpoint, the impact of motion blur is relatively unexplored despite its relevance not only in rapid motion scenarios but also in low-light conditions where longer exposure times are necessary. Similarly, the role of image deblurring in enhancing VPR performance under motion blur has received limited attention so far. This paper bridges these gaps by introducing a new benchmark designed to evaluate VPR performance under the influence of motion blur and image deblurring.<span class='px-1 mx-1 bg-yellow-200'>The benchmark includes three datasets that encompass a wide range of motion blur intensities, providing a comprehensive platform for analysis. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.753</span></span>Experimental results with several well-established VPR and image deblurring methods provide new insights into the effects of motion blur and the potential improvements achieved through deblurring. Building on these findings, the paper proposes adaptive deblurring strategies for VPR, designed to effectively manage motion blur in dynamic, real-world scenarios.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07751v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
SAT: Spatial Aptitude Training for Multimodal Language Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to the more dynamic tasks.<span class='px-1 mx-1 bg-yellow-200'>SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.703</span></span><span class='px-1 mx-1 bg-yellow-200'>Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.792</span></span>We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: $23\%$ on CVBench, $8\%$ on the harder BLINK benchmark, and $18\%$ on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at http://arijitray1993.github.io/SAT/ .</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07755v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference.<span class='px-1 mx-1 bg-yellow-200'>To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectories and then captures their motion with 12 evenly surrounding cameras on diverse 3D UE platforms. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.859</span></span>Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: http://fuxiao0719.github.io/projects/3dtrajmaster</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07759v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints.<span class='px-1 mx-1 bg-yellow-200'>We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.793</span></span>Project page: https://jianhongbai.github.io/SynCamMaster/.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07760v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Object detection in event streams has emerged as a cutting-edge research area, demonstrating superior performance in low-light conditions, scenarios with motion blur, and rapid movements. Current detectors leverage spiking neural networks, Transformers, or convolutional neural networks as their core architectures, each with its own set of limitations including restricted performance, high computational overhead, or limited local receptive fields. This paper introduces a novel MoE (Mixture of Experts) heat conduction-based object detection algorithm that strikingly balances accuracy and computational efficiency. Initially, we employ a stem network for event data embedding, followed by processing through our innovative MoE-HCO blocks. Each block integrates various expert modules to mimic heat conduction within event streams. Subsequently, an IoU-based query selection module is utilized for efficient token extraction, which is then channeled into a detection head for the final object detection process. Furthermore, we are pleased to introduce EvDET200K, a novel benchmark dataset for event-based object detection.<span class='px-1 mx-1 bg-yellow-200'>Captured with a high-definition Prophesee EVK4-HD event camera, this dataset encompasses 10 distinct categories, 200,000 bounding boxes, and 10,054 samples, each spanning 2 to 5 seconds. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.866</span></span>We also provide comprehensive results from over 15 state-of-the-art detectors, offering a solid foundation for future research and comparison. The source code of this paper will be released on: https://github.com/Event-AHU/OpenEvDET</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06647v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets.<span class='px-1 mx-1 bg-yellow-200'>To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.885</span></span>Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks--music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06660v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Knowledge Transfer and Domain Adaptation for Fine-Grained Remote Sensing Image Segmentation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Fine-grained remote sensing image segmentation is essential for accurately identifying detailed objects in remote sensing images. Recently, vision transformer models (VTM) pretrained on large-scale datasets have shown strong zero-shot generalization, indicating that they have learned the general knowledge of object understanding. We introduce a novel end-to-end learning paradigm combining knowledge guidance with domain refinement to enhance performance. We present two key components: the Feature Alignment Module (FAM) and the Feature Modulation Module (FMM). FAM aligns features from a CNN-based backbone with those from the pretrained VTM's encoder using channel transformation and spatial interpolation, and transfers knowledge via KL divergence and L2 normalization constraint. FMM further adapts the knowledge to the specific domain to address domain shift.<span class='px-1 mx-1 bg-yellow-200'>We also introduce a fine-grained grass segmentation dataset and demonstrate, through experiments on two datasets, that our method achieves a significant improvement of 2.57 mIoU on the grass dataset and 3.73 mIoU on the cloud dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.814</span></span>The results highlight the potential of combining knowledge transfer and domain adaptation to overcome domain-related challenges and data limitations. The project page is available at https://xavierjiezou.github.io/KTDA/.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06664v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent 3D generation models typically rely on limited-scale 3D 'gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos.<span class='px-1 mx-1 bg-yellow-200'>This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.794</span></span>Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06699v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Toward Non-Invasive Diagnosis of Bankart Lesions with Deep Learning
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Bankart lesions, or anterior-inferior glenoid labral tears, are diagnostically challenging on standard MRIs due to their subtle imaging features, often necessitating invasive MRI arthrograms (MRAs). This study develops deep learning (DL) models to detect Bankart lesions on both standard MRIs and MRAs, aiming to improve diagnostic accuracy and reduce reliance on MRAs. <span class='px-1 mx-1 bg-yellow-200'>We curated a dataset of 586 shoulder MRIs (335 standard, 251 MRAs) from 558 patients who underwent arthroscopy. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.762</span></span> Ground truth labels were derived from intraoperative findings, the gold standard for Bankart lesion diagnosis. Separate DL models for MRAs and standard MRIs were trained using the Swin Transformer architecture, pre-trained on a public knee MRI dataset. Predictions from sagittal, axial, and coronal views were ensembled to optimize performance. The models were evaluated on a 20% hold-out test set (117 MRIs: 46 MRAs, 71 standard MRIs). Bankart lesions were identified in 31.9% of MRAs and 8.6% of standard MRIs. The models achieved AUCs of 0.87 (86% accuracy, 83% sensitivity, 86% specificity) and 0.90 (85% accuracy, 82% sensitivity, 86% specificity) on standard MRIs and MRAs, respectively. These results match or surpass radiologist performance on our dataset and reported literature metrics. Notably, our model's performance on non-invasive standard MRIs matched or surpassed the radiologists interpreting MRAs. This study demonstrates the feasibility of using DL to address the diagnostic challenges posed by subtle pathologies like Bankart lesions. Our models demonstrate potential to improve diagnostic confidence, reduce reliance on invasive imaging, and enhance accessibility to care.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06717v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Dynamic EventNeRF: Reconstructing General Dynamic Scenes from Multi-view Event Cameras
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Volumetric reconstruction of dynamic scenes is an important problem in computer vision. It is especially challenging in poor lighting and with fast motion. It is partly due to the limitations of RGB cameras: To capture fast motion without much blur, the framerate must be increased, which in turn requires more lighting. In contrast, event cameras, which record changes in pixel brightness asynchronously, are much less dependent on lighting, making them more suitable for recording fast motion. We hence propose the first method to spatiotemporally reconstruct a scene from sparse multi-view event streams and sparse RGB frames. We train a sequence of cross-faded time-conditioned NeRF models, one per short recording segment. The individual segments are supervised with a set of event- and RGB-based losses and sparse-view regularisation. <span class='px-1 mx-1 bg-yellow-200'>We assemble a real-world multi-view camera rig with six static event cameras around the object and record a benchmark multi-view event stream dataset of challenging motions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.723</span></span> Our work outperforms RGB-based baselines, producing state-of-the-art results, and opens up the topic of multi-view event-based reconstruction as a new path for fast scene capture beyond RGB cameras. <span class='px-1 mx-1 bg-yellow-200'>The code and the data will be released soon at https://4dqv.mpi-inf.mpg.de/DynEventNeRF/ <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.752</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06770v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. <span class='px-1 mx-1 bg-yellow-200'>The code, model, and dataset will be released. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.862</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04292v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Large models have achieved remarkable performance across various tasks, yet they incur significant computational costs and privacy concerns during both training and inference. Distributed deployment has emerged as a potential solution, but it necessitates the exchange of intermediate information between model segments, with feature representations serving as crucial information carriers. To optimize information exchange, feature coding methods are applied to reduce transmission and storage overhead. Despite its importance, feature coding for large models remains an under-explored area. In this paper, we draw attention to large model feature coding and make three contributions to this field. First, we introduce a comprehensive dataset encompassing diverse features generated by three representative types of large models. Second, we establish unified test conditions, enabling standardized evaluation pipelines and fair comparisons across future feature coding studies. Third, we introduce two baseline methods derived from widely used image coding techniques and benchmark their performance on the proposed dataset. These contributions aim to advance the field of feature coding, facilitating more efficient large model deployment. <span class='px-1 mx-1 bg-yellow-200'>All source code and the dataset will be made available on GitHub. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.922</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04307v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper focuses on developing translation models and related applications for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, Telugu, and Urdu. Achieving this requires parallel and other types of corpora for all 36 * 36 language pairs, addressing challenges like script variations, phonetic differences, and syntactic diversity. For instance, languages like Kashmiri and Sindhi, which use multiple scripts, demand script normalization for alignment, while low-resource languages such as Khasi and Santali require synthetic data augmentation to ensure sufficient coverage and quality. <span class='px-1 mx-1 bg-yellow-200'>To address these challenges, this work proposes strategies for corpus creation by leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and utilizing synthetic data techniques. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.776</span></span> Additionally, it evaluates machine translation across various dimensions, including standard and discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing. By integrating these elements, the study establishes a comprehensive framework to improve machine translation quality and enable better cross-lingual communication in India's linguistically diverse ecosystem.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04351v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Grounding Descriptions in Images informs Zero-Shot Visual Recognition
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. <span class='px-1 mx-1 bg-yellow-200'>Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.745</span></span> Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at https://github.com/shaunak27/grain-clip</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04429v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Gaussian splatting methods are emerging as a popular approach for converting multi-view image data into scene representations that allow view synthesis. In particular, there is interest in enabling view synthesis for dynamic scenes using only monocular input data -- an ill-posed and challenging problem. The fast pace of work in this area has produced multiple simultaneous papers that claim to work best, which cannot all be true. In this work, we organize, benchmark, and analyze many Gaussian-splatting-based methods, providing apples-to-apples comparisons that prior works have lacked. <span class='px-1 mx-1 bg-yellow-200'>We use multiple existing datasets and a new instructive synthetic dataset designed to isolate factors that affect reconstruction quality. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.813</span></span> We systematically categorize Gaussian splatting methods into specific motion representation types and quantify how their differences impact performance. Empirically, we find that their rank order is well-defined in synthetic data, but the complexity of real-world data currently overwhelms the differences. Furthermore, the fast rendering speed of all Gaussian-based methods comes at the cost of brittleness in optimization. We summarize our experiments into a list of findings that can help to further progress in this lively problem setting. Project Webpage: https://lynl7130.github.io/MonoDyGauBench.github.io/</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04457v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Cubify Anything: Scaling Indoor 3D Object Detection
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. <span class='px-1 mx-1 bg-yellow-200'>As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.709</span></span> Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which, rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04458v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. <span class='px-1 mx-1 bg-yellow-200'>In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.781</span></span> Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets. We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03401v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Skel3D: Skeleton Guided Novel View Synthesis
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>In this paper, we present an approach for monocular open-set novel view synthesis (NVS) that leverages object skeletons to guide the underlying diffusion model. <span class='px-1 mx-1 bg-yellow-200'>Building upon a baseline that utilizes a pre-trained 2D image generator, our method takes advantage of the Objaverse dataset, which includes animated objects with bone structures. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.793</span></span> By introducing a skeleton guide layer following the existing ray conditioning normalization (RCN) layer, our approach enhances pose accuracy and multi-view consistency. The skeleton guide layer provides detailed structural information for the generative model, improving the quality of synthesized views. Experimental results demonstrate that our skeleton-guided method significantly enhances consistency and accuracy across diverse object categories within the Objaverse dataset. Our method outperforms existing state-of-the-art NVS techniques both quantitatively and qualitatively, without relying on explicit 3D representations.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03407v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
YT-30M: A multi-lingual multi-category dataset of YouTube comments
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>This paper introduces two large-scale multilingual comment datasets from YouTube, YT-30M and YT-100K. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.867</span></span> The analysis in this paper is performed on a smaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and YT-100K (a randomly selected 100K sample from YT-30M), are publicly released for further research. <span class='px-1 mx-1 bg-yellow-200'>YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted by YouTube channels that belong to YouTube categories. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.742</span></span> Each comment is associated with a video ID, comment ID, commentor name, commentor channel ID, comment text, upvotes, original channel ID and category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology', etc.).</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03465v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Dense Scene Reconstruction from Light-Field Images Affected by Rolling Shutter
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper presents a dense depth estimation approach from light-field (LF) images that is able to compensate for strong rolling shutter (RS) effects. Our method estimates RS compensated views and dense RS compensated disparity maps. We present a two-stage method based on 2D Gaussian Splatting that allows for a "render and compare" strategy with a point cloud formulation. In the first stage, a subset of sub-aperture images is used to estimate an RS agnostic 3D shape that is related to the scene target shape "up to a motion". In the second stage, the deformation of the 3D shape is computed by estimating an admissible camera motion. We demonstrate the effectiveness and advantages of this approach through several experiments conducted for different scenes and types of motions. <span class='px-1 mx-1 bg-yellow-200'>Due to lack of suitable datasets for evaluation, we also present a new carefully designed synthetic dataset of RS LF images. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.893</span></span> The source code, trained models and dataset will be made publicly available at: https://github.com/ICB-Vision-AI/DenseRSLF</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03518v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>In this work we explore utilizing LLMs for data augmentation for a manufacturing task guidance system. <span class='px-1 mx-1 bg-yellow-200'>The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.844</span></span> The purpose of this work is to explore the task, data augmentation for the supported tasks, and the performance of existing LLMs. We observe that the task is complex, requiring understanding of procedure specification documents, actions, and objects sequenced temporally. <span class='px-1 mx-1 bg-yellow-200'>The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.952</span></span> We compared the performance of several popular open-source LLMs by developing a baseline using each LLM, compared the responses in a reference-free setting using LLM-as-a-judge, and compared the ratings with those of crowd-workers while validating the ratings with experts.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02638v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Robust soybean seed yield estimation using high-throughput ground robot videos
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We present a novel method for soybean (Glycine max (L.) Merr.) yield estimation leveraging high throughput seed counting via computer vision and deep learning techniques. Traditional methods for collecting yield data are labor-intensive, costly, prone to equipment failures at critical data collection times, and require transportation of equipment across field sites. Computer vision, the field of teaching computers to interpret visual data, allows us to extract detailed yield information directly from images. By treating it as a computer vision task, we report a more efficient alternative, employing a ground robot equipped with fisheye cameras to capture comprehensive videos of soybean plots from which images are extracted in a variety of development programs. These images are processed through the P2PNet-Yield model, a deep learning framework where we combined a Feature Extraction Module (the backbone of the P2PNet-Soy) and a Yield Regression Module to estimate seed yields of soybean plots. <span class='px-1 mx-1 bg-yellow-200'>Our results are built on three years of yield testing plot data - 8500 in 2021, 2275 in 2022, and 650 in 2023. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.859</span></span> With these datasets, our approach incorporates several innovations to further improve the accuracy and generalizability of the seed counting and yield estimation architecture, such as the fisheye image correction and data augmentation with random sensor effects. The P2PNet-Yield model achieved a genotype ranking accuracy score of up to 83%. It demonstrates up to a 32% reduction in time to collect yield data as well as costs associated with traditional yield estimation, offering a scalable solution for breeding programs and agricultural productivity enhancement.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02642v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
A Bidirectional Long Short Term Memory Approach for Infrastructure Health Monitoring Using On-board Vibration Response
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>The growing volume of available infrastructural monitoring data enables the development of powerful data-driven approaches to estimate infrastructure health conditions using direct measurements. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.71</span></span> This paper proposes a deep learning methodology to estimate infrastructure physical parameters, such as railway track stiffness, using drive-by vibration response signals. The proposed method employs a Long Short-term Memory (LSTM) feature extractor accounting for temporal dependencies in the feature extraction phase, and a bidirectional Long Short-term Memory (BiLSTM) network to leverage bidirectional temporal dependencies in both the forward and backward paths of the drive-by vibration response in the condition estimation phase. Additionally, a framing approach is employed to enhance the resolution of the monitoring task to the beam level by segmenting the vibration signal into frames equal to the distance between individual beams, centering the frames over the beam nodes. The proposed LSTM-BiLSTM model offers a versatile tool for monitoring various bridge and railway infrastructure conditions using direct drive-by vibration response measurements. The results demonstrate the potential of incorporating temporal analysis in the feature extraction phase and emphasize the pivotal role of bidirectional temporal information in infrastructure health condition estimation. The proposed methodology can accurately and automatically estimate railway track stiffness and identify local stiffness reductions in the presence of noise using drive-by measurements. An illustrative case study of vehicle-track interaction simulation is used to demonstrate the performance of the proposed model, achieving a maximum mean absolute percentage error of 1.7% and 0.7% in estimating railpad and ballast stiffness, respectively.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02643v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. <span class='px-1 mx-1 bg-yellow-200'>To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.836</span></span> Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02690v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td></td>