index.html

<!doctype html>
<html>
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <script src="https://cdn.tailwindcss.com"></script>
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.css" integrity="sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0" crossorigin="anonymous">
  <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.js" integrity="sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4" crossorigin="anonymous"></script>
  <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"></script>
  <script defer src="https://cdn.jsdelivr.net/npm/@alpinejs/collapse@3.x.x/dist/cdn.min.js"></script>
  <script defer src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js"></script>
</head>
<body>
  <div class="relative mx-auto h-full max-w-2xl text-md">
    <table class="table-auto">
      <tbody>
        <tr>
          <td></td>
          <td>
            <h1 class="text-4xl pt-4 font-bold"><span class="underline">Vincent's</span> Arxiv FrontPage</h1>
            <br>
            <p>Generated on 2024-12-15.</p><br/>
            <p class="text-sm text-gray-500 pt-2">This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions. </p>
            <br>
          </td>
        </tr><tr>
          <td></td>
          <td>
            <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
          </td>
        </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction.However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption.We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds.A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models.To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions.<span class='px-1 mx-1 bg-yellow-200'>Code and corruption datasets have been made publicly available. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.891</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09511v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field.In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos.Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos.<span class='px-1 mx-1 bg-yellow-200'>In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.786</span></span>We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance.Our proposed \model{} achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding.Notably, \model{} delivers an absolute improvement of 2.7\% over LLaVA-OneVision on VideoMME and 10.7\% on MuirBench.Codes are available at https://github.com/Hon-Wong/ByteVideoLLM</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09530v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Neptune: The Long Orbit to Benchmarking Long Video Understanding
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos.Many existing video datasets and models are focused on short clips (10s-30s).While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost.In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length).Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning.Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune.Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes.Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos.<span class='px-1 mx-1 bg-yellow-200'>The dataset is available at https://github.com/google-deepmind/neptune <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.823</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09582v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets.<span class='px-1 mx-1 bg-yellow-200'>OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.838</span></span><span class='px-1 mx-1 bg-yellow-200'>We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span>We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09587v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature.In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors.Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs.This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions.2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials.Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects.<span class='px-1 mx-1 bg-yellow-200'>Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.726</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09593v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                RatBodyFormer: Rodent Body Surface from Keypoints
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Rat behavior modeling goes to the heart of many scientific studies, yet the textureless body surface evades automatic analysis as it literally has no keypoints that detectors can find.The movement of the body surface, however, is a rich source of information for deciphering the rat behavior.We introduce two key contributions to automatically recover densely 3D sampled rat body surface points, passively.<span class='px-1 mx-1 bg-yellow-200'>The first is RatDome, a novel multi-camera system for rat behavior capture, and a large-scale dataset captured with it that consists of pairs of 3D keypoints and 3D body surface points. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>The second is RatBodyFormer, a novel network to transform detected keypoints to 3D body surface points.RatBodyFormer is agnostic to the exact locations of the 3D body surface points in the training data and is trained with masked-learning.We experimentally validate our framework with a number of real-world experiments.Our results collectively serve as a novel foundation for automated rat behavior analysis and will likely have far-reaching implications for biomedical and neuroscientific research.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09599v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Hidden Biases of End-to-End Driving Datasets
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>End-to-end driving systems have made rapid progress, but have so far not been applied to the challenging new CARLA Leaderboard 2.0.Further, while there is a large body of literature on end-to-end architectures and training strategies, the impact of the training dataset is often overlooked.In this work, we make a first attempt at end-to-end driving for Leaderboard 2.0.Instead of investigating architectures, we systematically analyze the training dataset, leading to new insights: (1) Expert style significantly affects downstream policy performance.(2) In complex data sets, the frames should not be weighted on the basis of simplistic criteria such as class frequencies.(3) Instead, estimating whether a frame changes the target labels compared to previous frames can reduce the size of the dataset without removing important information.By incorporating these findings, our model ranks first and second respectively on the map and sensors tracks of the 2024 CARLA Challenge, and sets a new state-of-the-art on the Bench2Drive test routes.Finally, we uncover a design flaw in the current evaluation metrics and propose a modification for future challenges.<span class='px-1 mx-1 bg-yellow-200'>Our dataset, code, and pre-trained models are publicly available at https://github.com/autonomousvision/carla_garage. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.716</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09602v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Do Multimodal Large Language Models See Like Humans?
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models.However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans?Current benchmarks lack the ability to evaluate MLLMs from this perspective.To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.<span class='px-1 mx-1 bg-yellow-200'>HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs.Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results.Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs.We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09603v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Learning Camera Movement Control from Real-World Drone Videos
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels.We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls.Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios.To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics.Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using Kalman filter to identify and remove low-quality data.Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame.We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans.We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos.<span class='px-1 mx-1 bg-yellow-200'>Data and code are available at dvgformer.github.io. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.714</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09620v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing.While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs.Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions.To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation.Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion.In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points.<span class='px-1 mx-1 bg-yellow-200'>We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.834</span></span>Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation.The project page is available at https://lwq20020127.github.io/OmniDrag.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09623v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Multi-GraspLLM: A Multimodal LLM for Multi-Hand Semantic Guided Grasp Generation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Multi-hand semantic grasp generation aims to generate feasible and semantically appropriate grasp poses for different robotic hands based on natural language instructions.Although the task is highly valuable, due to the lack of multi-hand grasp datasets with fine-grained contact description between robotic hands and objects, it is still a long-standing difficult task.<span class='px-1 mx-1 bg-yellow-200'>In this paper, we present Multi-GraspSet, the first large-scale multi-hand grasp dataset with automatically contact annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.863</span></span>Based on Multi-GraspSet, we propose Multi-GraspLLM, a unified language-guided grasp generation framework.It leverages large language models (LLM) to handle variable-length sequences, generating grasp poses for diverse robotic hands in a single unified architecture.Multi-GraspLLM first aligns the encoded point cloud features and text features into a unified semantic space.It then generates grasp bin tokens which are subsequently converted into grasp pose for each robotic hand via hand-aware linear mapping.The experimental results demonstrate that our approach significantly outperforms existing methods on Multi-GraspSet.More information can be found on our project page https://multi-graspllm.github.io.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08468v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive zero-shot classification capabilities with free-form prompts and even show some generalization in specialized domains.However, their performance on satellite imagery is limited due to the underrepresentation of such data in their training sets, which predominantly consist of ground-level images.Existing prompting techniques for satellite imagery are often restricted to generic phrases like a satellite image of ..., limiting their effectiveness for zero-shot land-use and land-cover (LULC) mapping.<span class='px-1 mx-1 bg-yellow-200'>To address these challenges, we introduce SenCLIP, which transfers CLIPs representation to Sentinel-2 imagery by leveraging a large dataset of Sentinel-2 images paired with geotagged ground-level photos from across Europe. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.738</span></span>We evaluate SenCLIP alongside other SOTA remote sensing VLMs on zero-shot LULC mapping tasks using the EuroSAT and BigEarthNet datasets with both aerial and ground-level prompting styles.Our approach, which aligns ground-level representations with satellite imagery, demonstrates significant improvements in classification accuracy across both prompt styles, opening new possibilities for applying free-form textual descriptions in zero-shot LULC mapping.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08536v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text.However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships.We attribute this problem to limitations in existing datasets of image-text pairs, which lack precise inter-object relationship annotations with prompts only.<span class='px-1 mx-1 bg-yellow-200'>To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.794</span></span>Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process.Extensive experiments show advanced models trained on our LAION-SG boast significant performance improvements in complex scene generation over models on existing datasets.We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08580v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators.<span class='px-1 mx-1 bg-yellow-200'>To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span>Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions.To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes.<span class='px-1 mx-1 bg-yellow-200'>Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.9</span></span>We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE.Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08591v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Sewing patterns, the essential blueprints for fabric cutting and tailoring, act as a crucial bridge between design concepts and producible garments.However, existing uni-modal sewing pattern generation models struggle to effectively encode complex design concepts with a multi-modal nature and correlate them with vectorized sewing patterns that possess precise geometric structures and intricate sewing relations.In this work, we propose a novel sewing pattern generation approach Design2GarmentCode based on Large Multimodal Models (LMMs), to generate parametric pattern-making programs from multi-modal design concepts.LMM offers an intuitive interface for interpreting diverse design inputs, while pattern-making programs could serve as well-structured and semantically meaningful representations of sewing patterns, and act as a robust bridge connecting the cross-domain pattern-making knowledge embedded in LMMs with vectorized sewing patterns.Experimental results demonstrate that our method can flexibly handle various complex design expressions such as images, textual descriptions, designer sketches, or their combinations, and convert them into size-precise sewing patterns with correct stitches.Compared to previous methods, our approach significantly enhances training efficiency, generation quality, and authoring flexibility.<span class='px-1 mx-1 bg-yellow-200'>Our code and data will be publicly available. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.817</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08603v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                CoinCLIP: A Multimodal Framework for Evaluating the Viability of Memecoins in the Web3 Ecosystem
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The rapid growth of memecoins within the Web3 ecosystem, driven by platforms like Pump.fun, has made it easier for anyone to create tokens.However, this democratization has also led to an explosion of low-quality or bot-generated projects, often motivated by short-term financial gain.This overwhelming influx of speculative tokens creates a challenge in distinguishing viable memecoins from those that are unlikely to succeed.To address this issue, we introduce CoinVibe, a comprehensive multimodal dataset designed to evaluate the viability of memecoins.CoinVibe integrates textual descriptions, visual content (logos), and community data (user comments, timestamps, and number of likes) to provide a holistic view of a memecoin's potential.In addition, we present CoinCLIP, a novel framework that leverages the Contrastive Language-Image Pre-Training (CLIP) model, augmented with lightweight modules and community data integration, to improve classification accuracy.By combining visual and textual representations with community insights, CoinCLIP provides a robust, data-driven approach to filter out low-quality or bot-driven projects.This research aims to help creators and investors identify high-potential memecoins, while also offering valuable insights into the factors that contribute to their long-term success.<span class='px-1 mx-1 bg-yellow-200'>The code and dataset are publicly available at https://github.com/hwlongCUHK/CoinCLIP.git. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.808</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07591v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                ViewDelta: Text-Prompted Change Detection in Unaligned Images
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Detecting changes between images is a fundamental problem in computer vision with broad applications in situational awareness, infrastructure assessment, environment monitoring, and industrial automation.Existing supervised models are typically limited to detecting specific types of changes, necessitating retraining for new tasks.To address these limitations with a single approach, we propose a novel change detection method that is the first to utilize unaligned images and textual prompts to output a binary segmentation of changes relevant to user-provided text.Our architecture not only enables flexible detection across diverse change detection use cases, but also yields state-of-the art performance on established benchmarks.<span class='px-1 mx-1 bg-yellow-200'>Additionally, we release an accompanying dataset comprising of 100,311 pairs of images with text prompts and the corresponding change detection labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.913</span></span>We demonstrate the effectiveness of our method both quantitatively and qualitatively on datasets with a wide variety of viewpoints in indoor, outdoor, street level, synthetic, and satellite images.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07612v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies.However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation.To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction.OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, slides, among others.Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types.Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation.OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies.<span class='px-1 mx-1 bg-yellow-200'>The codes and dataset is available in https://github.com/opendatalab/OmniDocBench. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.91</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07626v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Ask Humans or AI? Exploring Their Roles in Visualization Troubleshooting
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Visualization authoring is an iterative process requiring users to modify parameters like color schemes and data transformations to achieve desired aesthetics and effectively convey insights.Due to the complexity of these adjustments, users often create defective visualizations and require troubleshooting support.In this paper, we examine two primary approaches for visualization troubleshooting: (1) Human-assisted support via forums, where users receive advice from other individuals, and (2) AI-assisted support using large language models (LLMs).Our goal is to understand the strengths and limitations of each approach in supporting visualization troubleshooting tasks.<span class='px-1 mx-1 bg-yellow-200'>To this end, we collected 889 Vega-Lite cases from Stack Overflow. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.704</span></span>We then conducted a comprehensive analysis to understand the types of questions users ask, the effectiveness of human and AI guidance, and the impact of supplementary resources, such as documentation and examples, on troubleshooting outcomes.Our findings reveal a striking contrast between human- and AI-assisted troubleshooting: Human-assisted troubleshooting provides tailored, context-sensitive advice but often varies in response quality, while AI-assisted troubleshooting offers rapid feedback but often requires additional contextual resources to achieve desired results.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07673v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications.However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography.An intuitive solution involves adopting favorable attributes from the source images.Current methods attempt to distill identity and style from source images.However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics.Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image.In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images.<span class='px-1 mx-1 bg-yellow-200'>To achieve this goal, we constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.734</span></span><span class='px-1 mx-1 bg-yellow-200'>This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1 M high-quality generated images with visual attribute annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.846</span></span>Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one.This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07674v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                On Motion Blur and Deblurring in Visual Place Recognition
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Visual Place Recognition (VPR) in mobile robotics enables robots to localize themselves by recognizing previously visited locations using visual data.While the reliability of VPR methods has been extensively studied under conditions such as changes in illumination, season, weather and viewpoint, the impact of motion blur is relatively unexplored despite its relevance not only in rapid motion scenarios but also in low-light conditions where longer exposure times are necessary.Similarly, the role of image deblurring in enhancing VPR performance under motion blur has received limited attention so far.This paper bridges these gaps by introducing a new benchmark designed to evaluate VPR performance under the influence of motion blur and image deblurring.<span class='px-1 mx-1 bg-yellow-200'>The benchmark includes three datasets that encompass a wide range of motion blur intensities, providing a comprehensive platform for analysis. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.753</span></span>Experimental results with several well-established VPR and image deblurring methods provide new insights into the effects of motion blur and the potential improvements achieved through deblurring.Building on these findings, the paper proposes adaptive deblurring strategies for VPR, designed to effectively manage motion blur in dynamic, real-world scenarios.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07751v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                SAT: Spatial Aptitude Training for Multimodal Language Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Spatial perception is a fundamental component of intelligence.While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects.Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition.As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to the more dynamic tasks.<span class='px-1 mx-1 bg-yellow-200'>SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.703</span></span><span class='px-1 mx-1 bg-yellow-200'>Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.792</span></span>We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions.Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: $23\%$ on CVBench, $8\%$ on the harder BLINK benchmark, and $18\%$ on VSR.When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning.Our data/code is available at http://arijitray1993.github.io/SAT/ .</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07755v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>This paper aims to manipulate multi-entity 3D motions in video generation.Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results.However, 2D control signals are inherently limited in expressing the 3D nature of object motions.To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities.At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism.In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability.To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference.<span class='px-1 mx-1 bg-yellow-200'>To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectory and then captures their motion with 12 evenly-surround cameras on diverse 3D UE platforms. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.859</span></span>Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions.Project page: http://fuxiao0719.github.io/projects/3dtrajmaster</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07759v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency.This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming.Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses.To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints.Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints.Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos.Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints.<span class='px-1 mx-1 bg-yellow-200'>We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.793</span></span>Project page: https://jianhongbai.github.io/SynCamMaster/.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07760v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Object detection in event streams has emerged as a cutting-edge research area, demonstrating superior performance in low-light conditions, scenarios with motion blur, and rapid movements.Current detectors leverage spiking neural networks, Transformers, or convolutional neural networks as their core architectures, each with its own set of limitations including restricted performance, high computational overhead, or limited local receptive fields.This paper introduces a novel MoE (Mixture of Experts) heat conduction-based object detection algorithm that strikingly balances accuracy and computational efficiency.Initially, we employ a stem network for event data embedding, followed by processing through our innovative MoE-HCO blocks.Each block integrates various expert modules to mimic heat conduction within event streams.Subsequently, an IoU-based query selection module is utilized for efficient token extraction, which is then channeled into a detection head for the final object detection process.Furthermore, we are pleased to introduce EvDET200K, a novel benchmark dataset for event-based object detection.<span class='px-1 mx-1 bg-yellow-200'>Captured with a high-definition Prophesee EVK4-HD event camera, this dataset encompasses 10 distinct categories, 200,000 bounding boxes, and 10,054 samples, each spanning 2 to 5 seconds. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.866</span></span>We also provide comprehensive results from over 15 state-of-the-art detectors, offering a solid foundation for future research and comparison.The source code of this paper will be released on: https://github.com/Event-AHU/OpenEvDET</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06647v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Research on large language models has advanced significantly across text, speech, images, and videos.However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets.<span class='px-1 mx-1 bg-yellow-200'>To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.885</span></span>Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos.For music generation, we integrate AudioLDM 2 and MusicGen.Our evaluation across four tasks--music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06660v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Knowledge Transfer and Domain Adaptation for Fine-Grained Remote Sensing Image Segmentation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Fine-grained remote sensing image segmentation is essential for accurately identifying detailed objects in remote sensing images.Recently, vision transformer models (VTM) pretrained on large-scale datasets have shown strong zero-shot generalization, indicating that they have learned the general knowledge of object understanding.We introduce a novel end-to-end learning paradigm combining knowledge guidance with domain refinement to enhance performance.We present two key components: the Feature Alignment Module (FAM) and the Feature Modulation Module (FMM).FAM aligns features from a CNN-based backbone with those from the pretrained VTM's encoder using channel transformation and spatial interpolation, and transfers knowledge via KL divergence and L2 normalization constraint.FMM further adapts the knowledge to the specific domain to address domain shift.<span class='px-1 mx-1 bg-yellow-200'>We also introduce a fine-grained grass segmentation dataset and demonstrate, through experiments on two datasets, that our method achieves a significant improvement of 2.57 mIoU on the grass dataset and 3.73 mIoU on the cloud dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.814</span></span>The results highlight the potential of combining knowledge transfer and domain adaptation to overcome domain-related challenges and data limitations.The project page is available at https://xavierjiezou.github.io/KTDA/.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06664v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Recent 3D generation models typically rely on limited-scale 3D `gold-labels' or 2D diffusion priors for 3D content creation.However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms.In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation.The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it.To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos.<span class='px-1 mx-1 bg-yellow-200'>This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.794</span></span>Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive.To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data.Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation.Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets.Please refer to our project page at: https://vision.baai.ac.cn/see3d</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06699v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Toward Non-Invasive Diagnosis of Bankart Lesions with Deep Learning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Bankart lesions, or anterior-inferior glenoid labral tears, are diagnostically challenging on standard MRIs due to their subtle imaging features-often necessitating invasive MRI arthrograms (MRAs).This study develops deep learning (DL) models to detect Bankart lesions on both standard MRIs and MRAs, aiming to improve diagnostic accuracy and reduce reliance on MRAs.<span class='px-1 mx-1 bg-yellow-200'>We curated a dataset of 586 shoulder MRIs (335 standard, 251 MRAs) from 558 patients who underwent arthroscopy. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.762</span></span>Ground truth labels were derived from intraoperative findings, the gold standard for Bankart lesion diagnosis.Separate DL models for MRAs and standard MRIs were trained using the Swin Transformer architecture, pre-trained on a public knee MRI dataset.Predictions from sagittal, axial, and coronal views were ensembled to optimize performance.The models were evaluated on a 20% hold-out test set (117 MRIs: 46 MRAs, 71 standard MRIs).Bankart lesions were identified in 31.9% of MRAs and 8.6% of standard MRIs.The models achieved AUCs of 0.87 (86% accuracy, 83% sensitivity, 86% specificity) and 0.90 (85% accuracy, 82% sensitivity, 86% specificity) on standard MRIs and MRAs, respectively.These results match or surpass radiologist performance on our dataset and reported literature metrics.Notably, our model's performance on non-invasive standard MRIs matched or surpassed the radiologists interpreting MRAs.This study demonstrates the feasibility of using DL to address the diagnostic challenges posed by subtle pathologies like Bankart lesions.Our models demonstrate potential to improve diagnostic confidence, reduce reliance on invasive imaging, and enhance accessibility to care.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06717v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Dynamic EventNeRF: Reconstructing General Dynamic Scenes from Multi-view Event Cameras
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Volumetric reconstruction of dynamic scenes is an important problem in computer vision.It is especially challenging in poor lighting and with fast motion.It is partly due to the limitations of RGB cameras: To capture fast motion without much blur, the framerate must be increased, which in turn requires more lighting.In contrast, event cameras, which record changes in pixel brightness asynchronously, are much less dependent on lighting, making them more suitable for recording fast motion.We hence propose the first method to spatiotemporally reconstruct a scene from sparse multi-view event streams and sparse RGB frames.We train a sequence of cross-faded time-conditioned NeRF models, one per short recording segment.The individual segments are supervised with a set of event- and RGB-based losses and sparse-view regularisation.<span class='px-1 mx-1 bg-yellow-200'>We assemble a real-world multi-view camera rig with six static event cameras around the object and record a benchmark multi-view event stream dataset of challenging motions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.723</span></span>Our work outperforms RGB-based baselines, producing state-of-the-art results, and opens up the topic of multi-view event-based reconstruction as a new path for fast scene capture beyond RGB cameras.<span class='px-1 mx-1 bg-yellow-200'>The code and the data will be released soon at https://4dqv.mpi-inf.mpg.de/DynEventNeRF/ <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.752</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06770v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination.For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions.Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue.In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection.Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant).SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria.Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings.<span class='px-1 mx-1 bg-yellow-200'>The code, model, and dataset will be released. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.862</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04292v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Large models have achieved remarkable performance across various tasks, yet they incur significant computational costs and privacy concerns during both training and inference.Distributed deployment has emerged as a potential solution, but it necessitates the exchange of intermediate information between model segments, with feature representations serving as crucial information carriers.To optimize information exchange, feature coding methods are applied to reduce transmission and storage overhead.Despite its importance, feature coding for large models remains an under-explored area.In this paper, we draw attention to large model feature coding and make three contributions to this field.First, we introduce a comprehensive dataset encompassing diverse features generated by three representative types of large models.Second, we establish unified test conditions, enabling standardized evaluation pipelines and fair comparisons across future feature coding studies.Third, we introduce two baseline methods derived from widely used image coding techniques and benchmark their performance on the proposed dataset.These contributions aim to advance the field of feature coding, facilitating more efficient large model deployment.<span class='px-1 mx-1 bg-yellow-200'>All source code and the dataset will be made available on GitHub. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.922</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04307v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>This paper focuses on developing translation models and related applications for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, Telugu, and Urdu.Achieving this requires parallel and other types of corpora for all 36 * 36 language pairs, addressing challenges like script variations, phonetic differences, and syntactic diversity.For instance, languages like Kashmiri and Sindhi, which use multiple scripts, demand script normalization for alignment, while low-resource languages such as Khasi and Santali require synthetic data augmentation to ensure sufficient coverage and quality.   <span class='px-1 mx-1 bg-yellow-200'>To address these challenges, this work proposes strategies for corpus creation by leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and utilizing synthetic data techniques. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.776</span></span>Additionally, it evaluates machine translation across various dimensions, including standard and discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing.By integrating these elements, the study establishes a comprehensive framework to improve machine translation quality and enable better cross-lingual communication in India's linguistically diverse ecosystem.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04351v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Grounding Descriptions in Images informs Zero-Shot Visual Recognition
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts.This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image.While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution.Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements.We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP.In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations.To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations.We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the art methods across 11 diverse image classification datasets.<span class='px-1 mx-1 bg-yellow-200'>Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.745</span></span>Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach.Code available at https://github.com/shaunak27/grain-clip .</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04429v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Gaussian splatting methods are emerging as a popular approach for converting multi-view image data into scene representations that allow view synthesis.In particular, there is interest in enabling view synthesis for dynamic scenes using only monocular input data -- an ill-posed and challenging problem.The fast pace of work in this area has produced multiple simultaneous papers that claim to work best, which cannot all be true.In this work, we organize, benchmark, and analyze many Gaussian-splatting-based methods, providing apples-to-apples comparisons that prior works have lacked.<span class='px-1 mx-1 bg-yellow-200'>We use multiple existing datasets and a new instructive synthetic dataset designed to isolate factors that affect reconstruction quality. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.813</span></span>We systematically categorize Gaussian splatting methods into specific motion representation types and quantify how their differences impact performance.Empirically, we find that their rank order is well-defined in synthetic data, but the complexity of real-world data currently overwhelms the differences.Furthermore, the fast rendering speed of all Gaussian-based methods comes at the cost of brittleness in optimization.We summarize our experiments into a list of findings that can help to further progress in this lively problem setting.Project Webpage: https://lynl7130.github.io/MonoDyGauBench.github.io/</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04457v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-05</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Cubify Anything: Scaling Indoor 3D Object Detection
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device.We seek to significantly advance the status quo with respect to both data and modeling.First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects.<span class='px-1 mx-1 bg-yellow-200'>As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.709</span></span>Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs.While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes.Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.04458v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Esophageal cancer is among the most common types of cancer worldwide.It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative.However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation.Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited.<span class='px-1 mx-1 bg-yellow-200'>In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.781</span></span>Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves.This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem.Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets.We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues.The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks.Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet.Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03401v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Skel3D: Skeleton Guided Novel View Synthesis
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>In this paper, we present an approach for monocular open-set novel view synthesis (NVS) that leverages object skeletons to guide the underlying diffusion model.<span class='px-1 mx-1 bg-yellow-200'>Building upon a baseline that utilizes a pre-trained 2D image generator, our method takes advantage of the Objaverse dataset, which includes animated objects with bone structures. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.793</span></span>By introducing a skeleton guide layer following the existing ray conditioning normalization (RCN) layer, our approach enhances pose accuracy and multi-view consistency.The skeleton guide layer provides detailed structural information for the generative model, improving the quality of synthesized views.Experimental results demonstrate that our skeleton-guided method significantly enhances consistency and accuracy across diverse object categories within the Objaverse dataset.Our method outperforms existing state-of-the-art NVS techniques both quantitatively and qualitatively, without relying on explicit 3D representations.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03407v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                YT-30M: A multi-lingual multi-category dataset of YouTube comments
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>This paper introduces two large-scale multilingual comment datasets, YT-30M (and YT-100K) from YouTube. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.867</span></span>The analysis in this paper is performed on a smaller sample (YT-100K) of YT-30M. Both the datasets: YT-30M (full) and YT-100K (randomly selected 100K sample from YT-30M) are publicly released for further research.<span class='px-1 mx-1 bg-yellow-200'>YT-30M (YT-100K) contains 32236173 (108694) comments posted by YouTube channel that belong to YouTube categories. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.742</span></span>Each comment is associated with a video ID, comment ID, commentor name, commentor channel ID, comment text, upvotes, original channel ID and category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology', etc.).</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03465v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-04</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Dense Scene Reconstruction from Light-Field Images Affected by Rolling Shutter
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>This paper presents a dense depth estimation approach from light-field (LF) images that is able to compensate for strong rolling shutter (RS) effects.Our method estimates RS compensated views and dense RS compensated disparity maps.We present a two-stage method based on a 2D Gaussians Splatting that allows for a ``render and compare" strategy with a point cloud formulation.In the first stage, a subset of sub-aperture images is used to estimate an RS agnostic 3D shape that is related to the scene target shape ``up to a motion".In the second stage, the deformation of the 3D shape is computed by estimating an admissible camera motion.We demonstrate the effectiveness and advantages of this approach through several experiments conducted for different scenes and types of motions.<span class='px-1 mx-1 bg-yellow-200'>Due to lack of suitable datasets for evaluation, we also present a new carefully designed synthetic dataset of RS LF images. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.893</span></span>The source code, trained models and dataset will be made publicly available at: https://github.com/ICB-Vision-AI/DenseRSLF</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.03518v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>In this work we explore utilizing LLMs for data augmentation for manufacturing task guidance system.<span class='px-1 mx-1 bg-yellow-200'>The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.844</span></span>The purpose of this work to explore the task, data augmentation for the supported tasks and evaluating the performance of the existing LLMs.We observe that that task is complex requiring understanding from procedure specification documents, actions and objects sequenced temporally.<span class='px-1 mx-1 bg-yellow-200'>The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.952</span></span>We compared the performance of several popular open-sourced LLMs by developing a baseline using each LLM and then compared the responses in a reference-free setting using LLM-as-a-judge and compared the ratings with crowd-workers whilst validating the ratings with experts.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02638v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Robust soybean seed yield estimation using high-throughput ground robot videos
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>We present a novel method for soybean (Glycine max (L.) Merr.)yield estimation leveraging high throughput seed counting via computer vision and deep learning techniques.Traditional methods for collecting yield data are labor-intensive, costly, prone to equipment failures at critical data collection times, and require transportation of equipment across field sites.Computer vision, the field of teaching computers to interpret visual data, allows us to extract detailed yield information directly from images.By treating it as a computer vision task, we report a more efficient alternative, employing a ground robot equipped with fisheye cameras to capture comprehensive videos of soybean plots from which images are extracted in a variety of development programs.These images are processed through the P2PNet-Yield model, a deep learning framework where we combined a Feature Extraction Module (the backbone of the P2PNet-Soy) and a Yield Regression Module to estimate seed yields of soybean plots.<span class='px-1 mx-1 bg-yellow-200'>Our results are built on three years of yield testing plot data - 8500 in 2021, 2275 in 2022, and 650 in 2023. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.859</span></span>With these datasets, our approach incorporates several innovations to further improve the accuracy and generalizability of the seed counting and yield estimation architecture, such as the fisheye image correction and data augmentation with random sensor effects.The P2PNet-Yield model achieved a genotype ranking accuracy score of up to 83%.It demonstrates up to a 32% reduction in time to collect yield data as well as costs associated with traditional yield estimation, offering a scalable solution for breeding programs and agricultural productivity enhancement.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02642v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                A Bidirectional Long Short Term Memory Approach for Infrastructure Health Monitoring Using On-board Vibration Response
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>The growing volume of available infrastructural monitoring data enables the development of powerful datadriven approaches to estimate infrastructure health conditions using direct measurements. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.71</span></span>This paper proposes a deep learning methodology to estimate infrastructure physical parameters, such as railway track stiffness, using drive-by vibration response signals.The proposed method employs a Long Short-term Memory (LSTM) feature extractor accounting for temporal dependencies in the feature extraction phase, and a bidirectional Long Short-term Memory (BiLSTM) networks to leverage bidirectional temporal dependencies in both the forward and backward paths of the drive-by vibration response in condition estimation phase.Additionally, a framing approach is employed to enhance the resolution of the monitoring task to the beam level by segmenting the vibration signal into frames equal to the distance between individual beams, centering the frames over the beam nodes.The proposed LSTM-BiLSTM model offers a versatile tool for various bridge and railway infrastructure conditions monitoring using direct drive-by vibration response measurements.The results demonstrate the potential of incorporating temporal analysis in the feature extraction phase and emphasize the pivotal role of bidirectional temporal information in infrastructure health condition estimation.The proposed methodology can accurately and automatically estimate railway track stiffness and identify local stiffness reductions in the presence of noise using drive-by measurements.An illustrative case study of vehicle-track interaction simulation is used to demonstrate the performance of the proposed model, achieving a maximum mean absolute percentage error of 1.7% and 0.7% in estimating railpad and ballast stiffness, respectively.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02643v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions.We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images.<span class='px-1 mx-1 bg-yellow-200'>To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.836</span></span>Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint.FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control.Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views.This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences.We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02690v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td></td>
          <td>
            <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
          </td>
        </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts.Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains.However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG.To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm.In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts.<span class='px-1 mx-1 bg-yellow-200'>Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks.<span class='px-1 mx-1 bg-yellow-200'>Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.679</span></span>Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08479v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Defending Against Neural Network Model Inversion Attacks via Data Poisoning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs.While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility.Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models.This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data.Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks.   Two defense methods are presented.The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels.<span class='px-1 mx-1 bg-yellow-200'>Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process.Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses.Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07575v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes.This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels.We define reconstruction error ratios (RERs) that probe classification difficulty and allow its decomposition into (1) finite sample size and (2) Bayes error and decision-boundary complexity.Through systematic study across 19 popular visual datasets, we find that our RER-based dataset difficulty probe strongly correlates with error rate for state-of-the-art (SOTA) classification models.<span class='px-1 mx-1 bg-yellow-200'>By interpreting sample-level classification difficulty as a label mistakenness score, we further find that RERs achieve SOTA performance on mislabel detection tasks on hard datasets under symmetric and asymmetric label noise. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.664</span></span>Our code is publicly available at https://github.com/voxel51/reconstruction-error-ratios.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02596v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td></td>
          <td>
            <h2 class="text-2xl tracking-tight pt-4 font-bold">Benchmarks</h2>
          </td>
        </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                From Intention To Implementation: Automating Biomedical Research via LLMs
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Conventional biomedical research is increasingly labor-intensive due to the exponential growth of scientific literature and datasets.Artificial intelligence (AI), particularly Large Language Models (LLMs), has the potential to revolutionize this process by automating various steps.Still, significant challenges remain, including the need for multidisciplinary expertise, logicality of experimental design, and performance measurements.This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process involving dry lab experiments.BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming.By decomposing complex tasks into logically related sub-tasks and utilizing a hierarchical learning approach, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity.Furthermore, BioResearcher incorporates an LLM-based reviewer for in-process quality control and introduces novel evaluation metrics to assess the quality and automation of experimental protocols.<span class='px-1 mx-1 bg-yellow-200'>BioResearcher successfully achieves an average execution success rate of 63.07% across eight previously unmet research objectives. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.631</span></span>The generated protocols averagely outperform typical agent systems by 22.0% on five quality metrics.The system demonstrates significant potential to reduce researchers' workloads and accelerate biomedical discoveries, paving the way for future innovations in automated research systems.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09429v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Solving Multiagent Path Finding on Highly Centralized Networks
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The Mutliagent Path Finding (MAPF) problem consists of identifying the trajectories that a set of agents should follow inside a given network in order to reach their desired destinations as soon as possible, but without colliding with each other.We aim to minimize the maximum time any agent takes to reach their goal, ensuring optimal path length.<span class='px-1 mx-1 bg-yellow-200'>In this work, we complement a recent thread of results that aim to systematically study the algorithmic behavior of this problem, through the parameterized complexity point of view.    <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.623</span></span>First, we show that MAPF is NP-hard when the given network has a star-like topology (bounded vertex cover number) or is a tree with $11$ leaves.Both of these results fill important gaps in our understanding of the tractability of this problem that were left untreated in the recent work of [Fioravantes et al.Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology.AAAI'24].Nevertheless, our main contribution is an exact algorithm that scales well as the input grows (FPT) when the topology of the given network is highly centralized (bounded distance to clique).This parameter is significant as it mirrors real-world networks.In such environments, a bunch of central hubs (e.g., processing areas) are connected to only few peripheral nodes.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09433v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Towards Robust and Fair Vision Learning in Open-World Environments
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The dissertation presents four key contributions toward fairness and robustness in vision learning.First, to address the problem of large-scale data requirements, the dissertation presents a novel Fairness Domain Adaptation approach derived from two major novel research findings of Bijective Maximum Likelihood and Fairness Adaptation Learning.Second, to enable the capability of open-world modeling of vision learning, this dissertation presents a novel Open-world Fairness Continual Learning Framework.The success of this research direction is the result of two research lines, i.e., Fairness Continual Learning and Open-world Continual Learning.Third, since visual data are often captured from multiple camera views, robust vision learning methods should be capable of modeling invariant features across views.To achieve this desired goal, the research in this thesis will present a novel Geometry-based Cross-view Adaptation framework to learn robust feature representations across views.Finally, with the recent increase in large-scale videos and multimodal data, understanding the feature representations and improving the robustness of large-scale visual foundation models is critical.Therefore, this thesis will present novel Transformer-based approaches to improve the robust feature representations against multimodal and temporal data.Then, a novel Domain Generalization Approach will be presented to improve the robustness of visual foundation models.<span class='px-1 mx-1 bg-yellow-200'>The research's theoretical analysis and experimental results have shown the effectiveness of the proposed approaches, demonstrating their superior performance compared to prior studies. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.708</span></span>The contributions in this dissertation have advanced the fairness and robustness of machine vision learning.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09439v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                ATPrompt: Textual Prompt Learning with Embedded Attributes
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks.However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories.In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories.Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt.This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts.Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form.To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model.As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments on 11 datasets demonstrate the effectiveness of our method. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.605</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09442v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones.Despite Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts.Existing work seeks to utilize lightweight components to adjust the PTM, while the forgetting phenomenon still comes from {\em parameter and retrieval} levels.Specifically, iterative updates of the model result in parameter drift, while mistakenly retrieving irrelevant modules leads to the mismatch during inference.To this end, we propose MOdel Surgery (MOS) to rescue the model from forgetting previous knowledge.By training task-specific adapters, we continually adjust the PTM to downstream tasks.To mitigate parameter-level forgetting, we present an adapter merging approach to learn task-specific adapters, which aims to bridge the gap between different components while reserve task-specific information.Besides, to address retrieval-level forgetting, we introduce a training-free self-refined adapter retrieval mechanism during inference, which leverages the model's inherent ability for better adapter retrieval.By jointly rectifying the model with those steps, MOS can robustly resist catastrophic forgetting in the learning process.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments on seven benchmark datasets validate MOS's state-of-the-art performance. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.693</span></span>Code is available at: https://github.com/sun-hailong/AAAI25-MOS</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09441v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Search Strategy Generation for Branch and Bound Using Genetic Programming
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Branch-and-Bound (B\&B) is an exact method in integer programming that recursively divides the search space into a tree.During the resolution process, determining the next subproblem to explore within the tree-known as the search strategy-is crucial.Hand-crafted heuristics are commonly used, but none are effective over all problem classes.Recent approaches utilizing neural networks claim to make more intelligent decisions but are computationally expensive.In this paper, we introduce GP2S (Genetic Programming for Search Strategy), a novel machine learning approach that automatically generates a B\&B search strategy heuristic, aiming to make intelligent decisions while being computationally lightweight.We define a policy as a function that evaluates the quality of a B\&B node by combining features from the node and the problem; the search strategy policy is then defined by a best-first search based on this node ranking.The policy space is explored using a genetic programming algorithm, and the policy that achieves the best performance on a training set is selected.We compare our approach with the standard method of the SCIP solver, a recent graph neural network-based method, and handcrafted heuristics.Our first evaluation includes three types of primal hard problems, tested on instances similar to the training set and on larger instances.<span class='px-1 mx-1 bg-yellow-200'>Our method is at most 2\% slower than the best baseline and consistently outperforms SCIP, achieving an average speedup of 11.3\%. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.753</span></span>Additionally, GP2S is tested on the MIPLIB 2017 dataset, generating multiple heuristics from different subsets of instances.It exceeds SCIP's average performance in 7 out of 10 cases across 15 times more instances and under a time limit 15 times longer, with some GP2S methods leading on most experiments in terms of the number of feasible solutions or optimality gap.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09444v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                DumpyOS: A Data-Adaptive Multi-ary Index for Scalable Data Series Similarity Search
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Data series indexes are necessary for managing and analyzing the increasing amounts of data series collections that are nowadays available.<span class='px-1 mx-1 bg-yellow-200'>These indexes support both exact and approximate similarity search, with approximate search providing high-quality results within milliseconds, which makes it very attractive for certain modern applications. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.601</span></span>Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges.DSTree and the iSAX index family are state-of-the-art solutions for this problem.However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy.In this paper, we identify two problems of the iSAX index family that adversely affect the overall performance.First, we observe the presence of a proximity-compactness trade-off related to the index structure design (i.e., the node fanout degree), significantly limiting the efficiency and accuracy of the resulting index.Second, a skewed data distribution will negatively affect the performance of iSAX.To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow.Furthermore, we devise Dumpy-Fuzzy as a variant of Dumpy which further improves search accuracy by proper duplication of series.To fully leverage the potential of modern hardware including multicore CPUs and Solid State Drives (SSDs), we parallelize Dumpy to DumpyOS with sophisticated indexing and pruning-based querying algorithms.An optimized approximate search algorithm, DumpyOS-F which prominently improves the search accuracy without violating the index, is also proposed.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09448v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>With the rapid development of artificial intelligence technology, the application of deepfake technology in the audio field has gradually increased, resulting in a wide range of security risks.Especially in the financial and social security fields, the misuse of deepfake audios has raised serious concerns.To address this challenge, this study proposes an audio deepfake detection method based on multi-frequency channel attention mechanism (MFCA) and 2D discrete cosine transform (DCT).By processing the audio signal into a melspectrogram, using MobileNet V2 to extract deep features, and combining it with the MFCA module to weight different frequency channels in the audio signal, this method can effectively capture the fine-grained frequency domain features in the audio signal and enhance the Classification capability of fake audios.<span class='px-1 mx-1 bg-yellow-200'>Experimental results show that compared with traditional methods, the model proposed in this study shows significant advantages in accuracy, precision,recall, F1 score and other indicators. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.656</span></span>Especially in complex audio scenarios, this method shows stronger robustness and generalization capabilities and provides a new idea for audio deepfake detection and has important practical application value.In the future, more advanced audio detection technologies and optimization strategies will be explored to further improve the accuracy and generalization capabilities of audio deepfake detection.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09467v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Optimizing CDN Architectures: Multi-Metric Algorithmic Breakthroughs for Edge and Distributed Performance
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>A Content Delivery Network (CDN) is a powerful system of distributed caching servers that aims to accelerate content delivery, like high-definition video, IoT applications, and ultra-low-latency services, efficiently and with fast velocity.This has become of paramount importance in the post-pandemic era.Challenges arise when exponential content volume growth and scalability across different geographic locations are required.This paper investigates data-driven evaluations of CDN algorithms in dynamic server selection for latency reduction, bandwidth throttling for efficient resource management, real-time Round Trip Time analysis for adaptive routing, and programmatic network delay simulation to emulate various conditions.<span class='px-1 mx-1 bg-yellow-200'>Key performance metrics, such as round-trip time (RTT) and CPU usage, are carefully analyzed to evaluate scalability and algorithmic efficiency through two experimental setups: a constrained edge-like local system and a scalable FABRIC testbed. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.61</span></span>The statistical validation of RTT trends, alongside CPU utilization, is presented in the results.The optimization process reveals significant trade-offs between scalability and resource consumption, providing actionable insights for effectively deploying and enhancing CDN algorithms in edge and distributed computing environments.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09474v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Video Seal: Open and Efficient Video Watermarking
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The proliferation of AI-generated content and sophisticated video editing tools has made it both important and challenging to moderate digital platforms.Video watermarking addresses these challenges by embedding imperceptible signals into videos, allowing for identification.However, the rare open tools and methods often fall short on efficiency, robustness, and flexibility.To reduce these gaps, this paper introduces Video Seal, a comprehensive framework for neural video watermarking and a competitive open-sourced model.Our approach jointly trains an embedder and an extractor, while ensuring the watermark robustness by applying transformations in-between, e.g., video codecs.This training is multistage and includes image pre-training, hybrid post-training and extractor fine-tuning.We also introduce temporal watermark propagation, a technique to convert any image watermarking model to an efficient video watermarking model without the need to watermark every high-resolution frame.<span class='px-1 mx-1 bg-yellow-200'>We present experimental results demonstrating the effectiveness of the approach in terms of speed, imperceptibility, and robustness. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.628</span></span>Video Seal achieves higher robustness compared to strong baselines especially under challenging distortions combining geometric transformations and video compression.Additionally, we provide new insights such as the impact of video compression during training, and how to compare methods operating on different payloads.Contributions in this work - including the codebase, models, and a public demo - are open-sourced under permissive licenses to foster further research and development in the field.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09492v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                JuStRank: Benchmarking LLM Judges for System Ranking
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available.The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge.Crucially, this approach requires first to validate the quality of the LLM judge itself.Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems.We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems.To address this gap, we conduct the first large-scale study of LLM judges as system rankers.<span class='px-1 mx-1 bg-yellow-200'>System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09569v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM's visual understanding ability.To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction.Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance.Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization.Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM.<span class='px-1 mx-1 bg-yellow-200'>Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.607</span></span>Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM .</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09585v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Do Multimodal Large Language Models See Like Humans?
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models.However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans?Current benchmarks lack the ability to evaluate MLLMs from this perspective.To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.635</span></span>Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results.Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs.We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09603v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Reliable Uncertainty Quantification for Fiber Orientation in Composite Molding Processes using Multilevel Polynomial Surrogates
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Fiber orientation is decisive for the mechanical properties and thus for the performance of composite materials.During manufacturing, variations in material and process parameters can significantly influence the exact fiber orientation.We employ multilevel polynomial surrogates to model the propagation of uncertain material properties in the injection molding process.To ensure reliable uncertainty quantification, a key focus is deriving novel error bounds for statistical measures of a quantity of interest, computed via these surrogates.To verify these bounds, we conduct numerical experiments using the Cross-WLF viscosity model alongside the Hagen-Poiseuille flow in a rectangular channel.In particular, the impact of uncertainties in fiber length and matrix temperature on the fractional anisotropy of fiber orientation is investigated.The Folgar-Tucker equation and the improved anisotropic rotary diffusion model are used, incorporating recently established analytical solutions of these models as part of our verification.<span class='px-1 mx-1 bg-yellow-200'>Our results demonstrate that the investigated method significantly improves upon standard Monte Carlo estimation, while also providing error guarantees. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.699</span></span>These findings offer the first step toward a reliable and practical tool for optimizing fiber-reinforced polymer manufacturing processes in the future.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08459v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                IRL for Restless Multi-Armed Bandits with Applications in Maternal and Child Health
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Public health practitioners often have the goal of monitoring patients and maximizing patients' time spent in "favorable" or healthy states while being constrained to using limited resources.Restless multi-armed bandits (RMAB) are an effective model to solve this problem as they are helpful to allocate limited resources among many agents under resource constraints, where patients behave differently depending on whether they are intervened on or not.However, RMABs assume the reward function is known.This is unrealistic in many public health settings because patients face unique challenges and it is impossible for a human to know who is most deserving of any intervention at such a large scale.To address this shortcoming, this paper is the first to present the use of inverse reinforcement learning (IRL) to learn desired rewards for RMABs, and we demonstrate improved outcomes in a maternal and child health telehealth program.First we allow public health experts to specify their goals at an aggregate or population level and propose an algorithm to design expert trajectories at scale based on those goals.Second, our algorithm WHIRL uses gradient updates to optimize the objective, allowing for efficient and accurate learning of RMAB rewards.<span class='px-1 mx-1 bg-yellow-200'>Third, we compare with existing baselines and outperform those in terms of run-time and accuracy. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.742</span></span>Finally, we evaluate and show the usefulness of WHIRL on thousands on beneficiaries from a real-world maternal and child health setting in India.We publicly release our code here: https://github.com/Gjain234/WHIRL.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08463v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Evaluating Different Fault Injection Abstractions on the Assessment of DNN SW Hardening Strategies
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The reliability of Neural Networks has gained significant attention, prompting efforts to develop SW-based hardening techniques for safety-critical scenarios.However, evaluating hardening techniques using application-level fault injection (FI) strategies, which are commonly hardware-agnostic, may yield misleading results.This study for the first time compares two FI approaches (at the application level (APP) and instruction level (ISA)) to evaluate deep neural network SW hardening strategies.Results show that injecting permanent faults at ISA (a more detailed abstraction level than APP) changes completely the ranking of SW hardening techniques, in terms of both reliability and accuracy.<span class='px-1 mx-1 bg-yellow-200'>These results highlight the relevance of using an adequate analysis abstraction for evaluating such techniques. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.619</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08466v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts.Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains.However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG.To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm.In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts.Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels.CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks.Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.635</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08479v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Comparative Opinion Mining in Product Reviews: Multi-perspective Prompt-based Learning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Comparative reviews are pivotal in understanding consumer preferences and influencing purchasing decisions.Comparative Quintuple Extraction (COQE) aims to identify five key components in text: the target entity, compared entities, compared aspects, opinions on these aspects, and polarity.<span class='px-1 mx-1 bg-yellow-200'>Extracting precise comparative information from product reviews is challenging due to nuanced language and sequential task errors in traditional methods. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.619</span></span>To mitigate these problems, we propose MTP-COQE, an end-to-end model designed for COQE.Leveraging multi-perspective prompt-based learning, MTP-COQE effectively guides the generative model in comparative opinion mining tasks.Evaluation on the Camera-COQE (English) and VCOM (Vietnamese) datasets demonstrates MTP-COQE's efficacy in automating COQE, achieving superior performance with a 1.41% higher F1 score than the previous baseline models on the English dataset.Additionally, we designed a strategy to limit the generative model's creativity to ensure the output meets expectations.We also performed data augmentation to address data imbalance and to prevent the model from becoming biased towards dominant samples.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08508v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses.However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query.To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn.Specifically, We first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query.Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences.<span class='px-1 mx-1 bg-yellow-200'>We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.649</span></span>Our code is released online to ease reproduction.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08519v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP) toolkit developed specifically for modern Greek.The toolkit provides state-of-the-art performance in five core NLP tasks, namely part-of-speech tagging, morphological tagging, dependency parsing, named entity recognition, and Greeklishto-Greek transliteration.The toolkit is based on pre-trained Transformers, it is freely available, and can be easily installed in Python (pip install gr-nlp-toolkit).It is also accessible through a demonstration platform on HuggingFace, along with a publicly available API for non-commercial use.<span class='px-1 mx-1 bg-yellow-200'>We discuss the functionality provided for each task, the underlying methods, experiments against comparable open-source toolkits, and future possible enhancements. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.609</span></span>The toolkit is available at: https://github.com/nlpaueb/gr-nlp-toolkit</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08520v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The objective of multimodal intent recognition (MIR) is to leverage various modalities-such as text, video, and audio-to detect user intentions, which is crucial for understanding human language and context in dialogue systems.Despite advances in this field, two main challenges persist: (1) effectively extracting and utilizing semantic information from robust textual features; (2) aligning and fusing non-verbal modalities with verbal ones effectively.This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges.We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality.Subsequently, we align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation.<span class='px-1 mx-1 bg-yellow-200'>Our experimental results show substantial improvements over existing baseline methods. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.816</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08529v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Exact Algorithms for Multiagent Path Finding with Communication Constraints on Tree-Like Structures
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Consider the scenario where multiple agents have to move in an optimal way through a network, each one towards their ending position while avoiding collisions.<span class='px-1 mx-1 bg-yellow-200'>By optimal, we mean as fast as possible, which is evaluated by a measure known as the makespan of the proposed solution. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.623</span></span>This is the setting studied in the Multiagent Path Finding problem.In this work, we additionally provide the agents with a way to communicate with each other.Due to size constraints, it is reasonable to assume that the range of communication of each agent will be limited.What should be the trajectories of the agents to, additionally, maintain a backbone of communication?In this work, we study the Multiagent Path Finding with Communication Constraint problem under the parameterized complexity framework.   Our main contribution is three exact algorithms that are efficient when considering particular structures for the input network.We provide such algorithms for the case when the communication range and the number of agents (the makespan resp.)are provided in the input and the network has a tree topology, or bounded maximum degree (has a tree-like topology, i.e., bounded treewidth resp.).We complement these results by showing that it is highly unlikely to construct efficient algorithms when considering the number of agents as part of the input, even if the makespan is $3$ and the communication range is $1$.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08556v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Adaptive Principal Components Allocation with the $\ell_{2,g}$-regularized Gaussian Graphical Model for Efficient Fine-Tuning Large Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>In this work, we propose a novel Parameter-Efficient Fine-Tuning (PEFT) approach based on Gaussian Graphical Models (GGMs), marking the first application of GGMs to PEFT tasks, to the best of our knowledge.The proposed method utilizes the $\ell_{2,g}$-norm to effectively select critical parameters and capture global dependencies.<span class='px-1 mx-1 bg-yellow-200'>The resulting non-convex optimization problem is efficiently solved using a Block Coordinate Descent (BCD) algorithm. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.628</span></span><span class='px-1 mx-1 bg-yellow-200'>Experimental results on the GLUE benchmark [24] for fine-tuning RoBERTa-Base <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.6</span></span>[18] demonstrate the effectiveness of the proposed approach, achieving competitive performance with significantly fewer trainable parameters.The code for this work is available at: https://github.com/jzheng20/Course projects.git.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08592v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based Automated Requirement Traceability and Compliance Checks
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Ensuring that Software Requirements Specifications (SRS) align with higher-level organizational or national requirements is vital, particularly in regulated environments such as finance and aerospace.In these domains, maintaining consistency, adhering to regulatory frameworks, minimizing errors, and meeting critical expectations are essential for the reliable functioning of systems.The widespread adoption of large language models (LLMs) highlights their immense potential, yet there remains considerable scope for improvement in retrieving relevant information and enhancing reasoning capabilities.This study demonstrates that integrating a robust Graph-RAG framework with advanced prompt engineering techniques, such as Chain of Thought and Tree of Thought, can significantly enhance performance.Compared to baseline RAG methods and simple prompting strategies, this approach delivers more accurate and context-aware results.<span class='px-1 mx-1 bg-yellow-200'>While this method demonstrates significant improvements in performance, it comes with challenges. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.653</span></span>It is both costly and more complex to implement across diverse contexts, requiring careful adaptation to specific scenarios.Additionally, its effectiveness heavily relies on having complete and accurate input data, which may not always be readily available, posing further limitations to its scalability and practicality.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08593v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Preference Discerning with LLM-Enhanced Generative Retrieval
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Sequential recommendation systems aim to provide personalized recommendations for users based on their interaction history.To achieve this, they often incorporate auxiliary information, such as textual descriptions of items and auxiliary tasks, like predicting user preferences and intent.Despite numerous efforts to enhance these models, they still suffer from limited personalization.To address this issue, we propose a new paradigm, which we term preference discerning.In preference dscerning, we explicitly condition a generative sequential recommendation system on user preferences within its context.To this end, we generate user preferences using Large Language Models (LLMs) based on user reviews and item-specific data.To evaluate preference discerning capabilities of sequential recommendation systems, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following.<span class='px-1 mx-1 bg-yellow-200'>We assess current state-of-the-art methods using our benchmark and show that they struggle to accurately discern user preferences. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.655</span></span><span class='px-1 mx-1 bg-yellow-200'>Therefore, we propose a new method named Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce $\textbf{d}$iscern$\textbf{er}$), which improves upon existing methods and achieves state-of-the-art performance on our benchmark. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.639</span></span>Our results show that Mender can be effectively guided by human preferences even though they have not been observed during training, paving the way toward more personalized sequential recommendation systems.<span class='px-1 mx-1 bg-yellow-200'>We will open-source the code and benchmarks upon publication. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.678</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08604v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Fair Primal Dual Splitting Method for Image Inverse Problems
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Image inverse problems have numerous applications, including image processing, super-resolution, and computer vision, which are important areas in image science.These application models can be seen as a three-function composite optimization problem solvable by a variety of primal dual-type methods.We propose a fair primal dual algorithmic framework that incorporates the smooth term not only into the primal subproblem but also into the dual subproblem.<span class='px-1 mx-1 bg-yellow-200'>We unify the global convergence and establish the convergence rates of our proposed fair primal dual method. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.621</span></span>Experiments on image denoising and super-resolution reconstruction demonstrate the superiority of the proposed method over the current state-of-the-art.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08613v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Efficient and Verified Continuous Double Auctions
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Continuous double auctions are commonly used to match orders at currency, stock, and commodities exchanges.A verified implementation of continuous double auctions is a useful tool for market regulators as they give rise to automated checkers that are guaranteed to detect errors in the trade logs of an existing exchange if they contain trades that violate the matching rules.We provide an efficient and formally verified implementation of continuous double auctions that takes $O(n \log n)$ time to match $n$ orders.<span class='px-1 mx-1 bg-yellow-200'>This improves an earlier $O(n^2)$ verified implementation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.683</span></span>We also prove a matching $\Omega(n\log n)$ lower bound on the running time for continuous double auctions.Our new implementation takes only a couple of minutes to run on ten million randomly generated orders as opposed to a few days taken by the earlier implementation.Our new implementation gives rise to an efficient automatic checker.   We use the Coq proof assistant for verifying our implementation and extracting a verified OCaml program.While using Coq's standard library implementation of red-black trees to obtain our improvement, we observed that its specification has serious gaps, which we fill in this work; this might be of independent interest.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08624v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Fast Track to Winning Tickets: Repowering One-Shot Pruning for Graph Neural Networks
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Graph Neural Networks (GNNs) demonstrate superior performance in various graph learning tasks, yet their wider real-world application is hindered by the computational overhead when applied to large-scale graphs.To address the issue, the Graph Lottery Hypothesis (GLT) has been proposed, advocating the identification of subgraphs and subnetworks, \textit{i.e.}, winning tickets, without compromising performance.<span class='px-1 mx-1 bg-yellow-200'>The effectiveness of current GLT methods largely stems from the use of iterative magnitude pruning (IMP), which offers higher stability and better performance than one-shot pruning. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.612</span></span>However, identifying GLTs is highly computationally expensive, due to the iterative pruning and retraining required by IMP.In this paper, we reevaluate the correlation between one-shot pruning and IMP: while one-shot tickets are suboptimal compared to IMP, they offer a \textit{fast track} to tickets with a stronger performance.We introduce a one-shot pruning and denoising framework to validate the efficacy of the \textit{fast track}.Compared to current IMP-based GLT methods, our framework achieves a double-win situation of graph lottery tickets with \textbf{higher sparsity} and \textbf{faster speeds}.<span class='px-1 mx-1 bg-yellow-200'>Through extensive experiments across 4 backbones and 6 datasets, our method demonstrates $1.32\% - 45.62\%$ improvement in weight sparsity and a $7.49\% - 22.71\%$ increase in graph sparsity, along with a $1.7-44 \times$ speedup over IMP-based methods and $95.3\%-98.6\%$ MAC savings. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.615</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07605v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Despite the superior performance of Large language models on many NLP tasks, they still face significant limitations in memorizing extensive world knowledge.Recent studies have demonstrated that leveraging the Retrieval-Augmented Generation (RAG) framework, combined with Knowledge Graphs that encapsulate extensive factual data in a structured format, robustly enhances the reasoning capabilities of LLMs.However, deploying such systems in real-world scenarios presents challenges: the continuous evolution of non-stationary environments may lead to performance degradation and user satisfaction requires a careful balance of performance and responsiveness.To address these challenges, we introduce a Multi-objective Multi-Armed Bandit enhanced RAG framework, supported by multiple retrieval methods with diverse capabilities under rich and evolving retrieval contexts in practice.Within this framework, each retrieval method is treated as a distinct ``arm''.The system utilizes real-time user feedback to adapt to dynamic environments, by selecting the appropriate retrieval method based on input queries and the historical multi-objective performance of each arm.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments conducted on two benchmark KGQA datasets demonstrate that our method significantly outperforms baseline methods in non-stationary settings while achieving state-of-the-art performance in stationary environments. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.621</span></span>Code and data are available at https://github.com/FUTUREEEEEE/Dynamic-RAG.git</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07618v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                ChocoLlama: Lessons Learned From Teaching Llamas Dutch
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data.In this work, we explore strategies for adapting the primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underrepresented in LLM development.We collect 104GB of Dutch text ($32$B tokens) from various sources to first apply continued pretraining using low-rank adaptation (LoRA), complemented with Dutch posttraining strategies provided by prior work.For Llama-2, we consider using (i) the tokenizer of the original model, and (ii) training a new, Dutch-specific tokenizer combined with embedding reinitialization.<span class='px-1 mx-1 bg-yellow-200'>We evaluate our adapted models, ChocoLlama-2, both on standard benchmarks and a novel Dutch benchmark, ChocoLlama-Bench. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.606</span></span>Our results demonstrate that LoRA can effectively scale for language adaptation, and that tokenizer modification with careful weight reinitialization can improve performance.Notably, Llama-3 was released during the course of this project and, upon evaluation, demonstrated superior Dutch capabilities compared to our Dutch-adapted versions of Llama-2.We hence apply the same adaptation technique to Llama-3, using its original tokenizer.While our adaptation methods enhanced Llama-2's Dutch capabilities, we found limited gains when applying the same techniques to Llama-3.This suggests that for ever improving, multilingual foundation models, language adaptation techniques may benefit more from focusing on language-specific posttraining rather than on continued pretraining.We hope this work contributes to the broader understanding of adapting LLMs to lower-resource languages, and to the development of Dutch LLMs in particular.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07633v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Sampling from Boltzmann densities with physics informed low-rank formats
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Our method proposes the efficient generation of samples from an unnormalized Boltzmann density by solving the underlying continuity equation in the low-rank tensor train (TT) format.It is based on the annealing path commonly used in MCMC literature, which is given by the linear interpolation in the space of energies.Inspired by Sequential Monte Carlo, we alternate between deterministic time steps from the TT representation of the flow field and stochastic steps, which include Langevin and resampling steps.These adjust the relative weights of the different modes of the target distribution and anneal to the correct path distribution.<span class='px-1 mx-1 bg-yellow-200'>We showcase the efficiency of our method on multiple numerical examples. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.694</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07637v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets.Compared to single-agent case, multi-agent setting involves a large joint state-action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization.In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions.To address these issues, we propose a new offline MARL algorithm, named In-Sample Sequential Policy Optimization (InSPO).InSPO sequentially updates each agent's policy in an in-sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates' updated policies to enhance coordination.Additionally, by thoroughly exploring low-probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub-optimal solutions.Theoretically, we prove InSPO guarantees monotonic policy improvement and converges to quantal response equilibrium (QRE).<span class='px-1 mx-1 bg-yellow-200'>Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.635</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07639v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Analytical-Heuristic Modeling and Optimization for Low-Light Image Enhancement
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Low-light image enhancement remains an open problem, and the new wave of artificial intelligence is at the center of this problem.This work describes the use of genetic algorithms for optimizing analytical models that can improve the visualization of images with poor light.Genetic algorithms are part of metaheuristic approaches, which proved helpful in solving challenging optimization tasks.We propose two analytical methods combined with optimization reasoning to approach a solution to the physical and computational aspects of transforming dark images into visible ones.<span class='px-1 mx-1 bg-yellow-200'>The experiments demonstrate that the proposed approach ranks at the top among 26 state-of-the-art algorithms in the LOL benchmark. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.635</span></span>The results show evidence that a simple genetic algorithm combined with analytical reasoning can defeat the current mainstream in a challenging computer vision task through controlled experiments and objective comparisons.This work opens interesting new research avenues for the swarm and evolutionary computation community and others interested in analytical and heuristic reasoning.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07659v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Despite the widespread use of LLMs due to their superior performance in various tasks, their high computational costs often lead potential users to opt for the pretraining-finetuning pipeline.However, biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels, creating so-called shortcuts and hindering the generalizability of fine-tuned models.Existing debiasing methods often rely on prior knowledge of specific dataset biases, which is challenging to acquire a priori.We propose RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised, and data-focused debiasing approach based on text rewriting for shortcut mitigation.RAZOR leverages LLMs to iteratively rewrite potentially biased text segments by replacing them with heuristically selected alternatives in a shortcut space defined by token statistics and positional information.This process aims to align surface-level text features more closely with diverse label distributions, thereby promoting the learning of genuine linguistic patterns.Compared with unsupervised SoTA models, RAZOR improves by 3.5% on the FEVER and 6.5% on MNLI and SNLI datasets according to the F1 score.Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by x2 without requiring prior bias information, a result that is on par with SoTA models that leverage prior information.Our work prioritizes data manipulation over architectural modifications, emphasizing the pivotal role of data quality in enhancing model performance and fairness.<span class='px-1 mx-1 bg-yellow-200'>This research contributes to developing more robust evaluation benchmarks for debiasing methods by incorporating metrics for bias reduction and overall model efficacy. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.714</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07675v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Benchmark for Evaluation and Analysis of Citation Recommendation Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Citation recommendation systems have attracted much academic interest, resulting in many studies and implementations.These systems help authors automatically generate proper citations by suggesting relevant references based on the text they have written.However, the methods used in citation recommendation differ across various studies and implementations.Some approaches focus on the overall content of papers, while others consider the context of the citation text.Additionally, the datasets used in these studies include different aspects of papers, such as metadata, citation context, or even the full text of the paper in various formats and structures.The diversity in models, datasets, and evaluation metrics makes it challenging to assess and compare citation recommendation methods effectively.To address this issue, a standardized dataset and evaluation metrics are needed to evaluate these models consistently.<span class='px-1 mx-1 bg-yellow-200'>Therefore, we propose developing a benchmark specifically designed to analyze and compare citation recommendation models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.723</span></span><span class='px-1 mx-1 bg-yellow-200'>This benchmark will evaluate the performance of models on different features of the citation context and provide a comprehensive evaluation of the models across all these tasks, presenting the results in a standardized way. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.708</span></span><span class='px-1 mx-1 bg-yellow-200'>By creating a benchmark with standardized evaluation metrics, researchers and practitioners in the field of citation recommendation will have a common platform to assess and compare different models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.719</span></span>This will enable meaningful comparisons and help identify promising approaches for further research and development in the field.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07713v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                On Motion Blur and Deblurring in Visual Place Recognition
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Visual Place Recognition (VPR) in mobile robotics enables robots to localize themselves by recognizing previously visited locations using visual data.While the reliability of VPR methods has been extensively studied under conditions such as changes in illumination, season, weather and viewpoint, the impact of motion blur is relatively unexplored despite its relevance not only in rapid motion scenarios but also in low-light conditions where longer exposure times are necessary.Similarly, the role of image deblurring in enhancing VPR performance under motion blur has received limited attention so far.<span class='px-1 mx-1 bg-yellow-200'>This paper bridges these gaps by introducing a new benchmark designed to evaluate VPR performance under the influence of motion blur and image deblurring. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.631</span></span>The benchmark includes three datasets that encompass a wide range of motion blur intensities, providing a comprehensive platform for analysis.Experimental results with several well-established VPR and image deblurring methods provide new insights into the effects of motion blur and the potential improvements achieved through deblurring.Building on these findings, the paper proposes adaptive deblurring strategies for VPR, designed to effectively manage motion blur in dynamic, real-world scenarios.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07751v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows.To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats.Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table.The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose.(2) Inspect Column Quality:Assesses the data quality for each target column and generates a Data Quality Report as operation objectives.(3) Generate Operation & Arguments:Predicts the next operation and arguments based on the data quality report results.Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels.<span class='px-1 mx-1 bg-yellow-200'>The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.619</span></span>In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows.The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06724v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models.<span class='px-1 mx-1 bg-yellow-200'>To address this, we propose ONEBench(OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.645</span></span>ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest.By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias.Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests.   The shift from task-specific benchmarks to ONEBench introduces two challenges: (1)heterogeneity and (2)incompleteness.Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets.<span class='px-1 mx-1 bg-yellow-200'>To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.622</span></span>Our aggregation algorithm ensures identifiability(asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data.<span class='px-1 mx-1 bg-yellow-200'>On homogenous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.62</span></span>We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings.We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains.<span class='px-1 mx-1 bg-yellow-200'>Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.624</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06745v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Ranking-aware adapter for text-driven image ordering with CLIP
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Recent advances in vision-language models (VLMs) have made significant progress in downstream tasks that require quantitative concepts such as facial age estimation and image quality assessment, enabling VLMs to explore applications like image ranking and retrieval.However, existing studies typically focus on the reasoning based on a single image and heavily depend on text prompting, limiting their ability to learn comprehensive understanding from multiple images.To address this, we propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task and introduces a lightweight adapter to augment CLIP for text-guided image ranking.Specifically, our approach incorporates learnable prompts to adapt to new instructions for ranking purposes and an auxiliary branch with ranking-aware attention, leveraging text-conditioned visual differences for additional supervision in image ranking.Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks and achieves competitive results compared to state-of-the-art models designed for specific tasks like facial age estimation and image quality assessment.Overall, our approach primarily focuses on ranking images with a single instruction, which provides a natural and generalized way of learning from visual differences across images, bypassing the need for extensive text prompts tailored to individual tasks.<span class='px-1 mx-1 bg-yellow-200'>Code is available: https://github.com/uynaes/RankingAwareCLIP. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.692</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06760v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>While large vision-language models (LVLMs) have shown impressive capabilities in generating plausible responses correlated with input visual contents, they still suffer from hallucinations, where the generated text inaccurately reflects visual contents.To address this, recent approaches apply contrastive decoding to calibrate the model's response via contrasting output distributions with original and visually distorted samples, demonstrating promising hallucination mitigation in a training-free manner.However, the potential of changing information in visual inputs is not well-explored, so a deeper investigation into the behaviors of visual contrastive decoding is of great interest.In this paper, we first explore various methods for contrastive decoding to change visual contents, including image downsampling and editing.Downsampling images reduces the detailed textual information while editing yields new contents in images, providing new aspects as visual contrastive samples.To further study benefits by using different contrastive samples, we analyze probability-level metrics, including entropy and distribution distance.Interestingly, the effect of these samples in mitigating hallucinations varies a lot across LVLMs and benchmarks.Based on our analysis, we propose a simple yet effective method to combine contrastive samples, offering a practical solution for applying contrastive decoding across various scenarios.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments are conducted to validate the proposed fusion method among different benchmarks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.737</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06775v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Global visual geolocation predicts where an image was captured on Earth.Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity.However, existing approaches are deterministic and overlook this aspect.In this paper, we aim to close the gap between traditional geolocalization and modern generative methods.We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface.Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21.In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point.<span class='px-1 mx-1 bg-yellow-200'>We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.615</span></span>Codes and models will be made available.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06781v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td></td>
          <td>
            <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
          </td>
        </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                From Intention To Implementation: Automating Biomedical Research via LLMs
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Conventional biomedical research is increasingly labor-intensive due to the exponential growth of scientific literature and datasets.<span class='px-1 mx-1 bg-yellow-200'>Artificial intelligence (AI), particularly Large Language Models (LLMs), has the potential to revolutionize this process by automating various steps. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.635</span></span>Still, significant challenges remain, including the need for multidisciplinary expertise, logicality of experimental design, and performance measurements.This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process involving dry lab experiments.BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming.By decomposing complex tasks into logically related sub-tasks and utilizing a hierarchical learning approach, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity.<span class='px-1 mx-1 bg-yellow-200'>Furthermore, BioResearcher incorporates an LLM-based reviewer for in-process quality control and introduces novel evaluation metrics to assess the quality and automation of experimental protocols. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.651</span></span>BioResearcher successfully achieves an average execution success rate of 63.07% across eight previously unmet research objectives.The generated protocols averagely outperform typical agent systems by 22.0% on five quality metrics.The system demonstrates significant potential to reduce researchers' workloads and accelerate biomedical discoveries, paving the way for future innovations in automated research systems.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09429v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones.Despite Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts.<span class='px-1 mx-1 bg-yellow-200'>Existing work seeks to utilize lightweight components to adjust the PTM, while the forgetting phenomenon still comes from {\em parameter and retrieval} levels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.648</span></span>Specifically, iterative updates of the model result in parameter drift, while mistakenly retrieving irrelevant modules leads to the mismatch during inference.To this end, we propose MOdel Surgery (MOS) to rescue the model from forgetting previous knowledge.By training task-specific adapters, we continually adjust the PTM to downstream tasks.To mitigate parameter-level forgetting, we present an adapter merging approach to learn task-specific adapters, which aims to bridge the gap between different components while reserve task-specific information.Besides, to address retrieval-level forgetting, we introduce a training-free self-refined adapter retrieval mechanism during inference, which leverages the model's inherent ability for better adapter retrieval.By jointly rectifying the model with those steps, MOS can robustly resist catastrophic forgetting in the learning process.Extensive experiments on seven benchmark datasets validate MOS's state-of-the-art performance.Code is available at: https://github.com/sun-hailong/AAAI25-MOS</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09441v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>The use of copyrighted materials in training generative language models raises critical legal and ethical questions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.6</span></span><span class='px-1 mx-1 bg-yellow-200'>This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.624</span></span>We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance.Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09460v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Can Modern LLMs Act as Agent Cores in Radiology~Environments?
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.682</span></span>Radiology, with its complex analytical requirements, is an ideal field for the application of these agents.<span class='px-1 mx-1 bg-yellow-200'>This paper aims to investigate the pre-requisite question for building concrete radiology agents which is, `Can modern LLMs act as agent cores in radiology environments?' <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span>To investigate it, we introduce RadABench with three-fold contributions: First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks.Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets.<span class='px-1 mx-1 bg-yellow-200'>Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.683</span></span><span class='px-1 mx-1 bg-yellow-200'>Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.774</span></span><span class='px-1 mx-1 bg-yellow-200'>Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems in real-world radiology practices effectively. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.696</span></span>All of our code and data are open-sourced in https://github.com/MAGIC-AI4Med/RadABench.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09529v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Does Representation Matter? Exploring Intermediate Layers in Large Language Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Understanding what defines a good representation in large language models (LLMs) is fundamental to both theoretical understanding and practical applications. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.698</span></span><span class='px-1 mx-1 bg-yellow-200'>In this paper, we investigate the quality of intermediate representations in various LLM architectures, including Transformers and State Space Models (SSMs). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.735</span></span>We find that intermediate layers often yield more informative representations for downstream tasks than the final layers.To measure the representation quality, we adapt and apply a suite of metrics - such as prompt entropy, curvature, and augmentation-invariance - originally proposed in other contexts.Our empirical study reveals significant architectural differences, how representations evolve throughout training, and how factors like input randomness and prompt length affect each layer.Notably, we observe a bimodal pattern in the entropy of some intermediate layers and consider potential explanations tied to training data.<span class='px-1 mx-1 bg-yellow-200'>Overall, our results illuminate the internal mechanics of LLMs and guide strategies for architectural optimization and training. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.732</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09563v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Obfuscated Activations Bypass LLM Latent-Space Defenses
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.696</span></span>These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions.This prompts the question: Can models execute harmful behavior via inconspicuous latent states?Here, we study such obfuscated activations.We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations.For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate.However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance.Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior.This poses a fundamental challenge to latent-space defenses.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09565v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                JuStRank: Benchmarking LLM Judges for System Ranking
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available.<span class='px-1 mx-1 bg-yellow-200'>The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.658</span></span><span class='px-1 mx-1 bg-yellow-200'>Crucially, this approach requires first to validate the quality of the LLM judge itself. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.721</span></span><span class='px-1 mx-1 bg-yellow-200'>Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.691</span></span>We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems.<span class='px-1 mx-1 bg-yellow-200'>To address this gap, we conduct the first large-scale study of LLM judges as system rankers. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.683</span></span>System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking.Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09569v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.606</span></span>Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty.Models might respond consistently to the origin query with a wrong answer, yet respond correctly to varied questions from different perspectives about the same query, and vice versa.In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about the same original query.We further implement an abstention policy to withhold responses when uncertainty is high.Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query under diverse varied questions even when knowing the correct answer.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09572v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.<span class='px-1 mx-1 bg-yellow-200'>In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM's visual understanding ability. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.624</span></span>To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction.Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance.Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization.Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM.Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM .</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09585v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                TimeRefine: Temporal Grounding with Time Refining Video LLM
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt.<span class='px-1 mx-1 bg-yellow-200'>Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.624</span></span><span class='px-1 mx-1 bg-yellow-200'>However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.62</span></span>Our proposed TimeRefine addresses this challenge in two ways.First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment.This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy.Second, to enhance the model's temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions.<span class='px-1 mx-1 bg-yellow-200'>Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.659</span></span>The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively.Code and pretrained models will be released.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09601v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.651</span></span>Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results.However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling.In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity.After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs.Our code and models shall be released.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09604v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Orderly Management of Packets in RDMA by Eunomia
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>To fulfill the low latency requirements of today's applications, deployment of RDMA in datacenters has become prevalent over the recent years.However, the in-order delivery requirement of RDMAs prevents them from leveraging powerful techniques that help improve the performance of datacenters, ranging from fine-grained load balancers to throughput-optimal expander topologies.We demonstrate experimentally that these techniques significantly deteriorate the performance in an RDMA network because they induce packet reordering.Furthermore, lifting the in-order delivery constraint enhances the flexibility of RDMA networks and enables them to employ these performance-enhancing techniques.To realize this, we propose an ordering layer, Eunomia, to equip RDMA NICs to handle packet reordering.<span class='px-1 mx-1 bg-yellow-200'>Eunomia employs a hybrid-dynamic bitmap structure that efficiently uses the limited on-chip memory with the help of a customized memory controller and handles high degrees of packet reordering. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.623</span></span>We evaluate the feasibility of Eunomia through an FPGA-based implementation and its performance through large-scale simulations.We show that Eunomia enables a wide range of applications in RDMA datacenter networks, such as fine-grained load balancers which improve performance by reducing average flow completion times by 85% and 52% compared to ECMP and Conweave, respectively, or employment of RDMA in expander topologies like Jellyfish which allows up to 60% lower flow completion times and higher throughput gains compared to Fat tree.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08540v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                MaestroMotif: Skill Design from Artificial Intelligence Feedback
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system.We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents.<span class='px-1 mx-1 bg-yellow-200'>MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.707</span></span><span class='px-1 mx-1 bg-yellow-200'>It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.673</span></span><span class='px-1 mx-1 bg-yellow-200'>Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.681</span></span>We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08542v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Large Language Models are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches.<span class='px-1 mx-1 bg-yellow-200'>While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.647</span></span>As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically.The standard evaluation pipelines typically randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks to compare the unlearned models against models retrained without the to-be-unlearned data.However, since every data point is subject to the right to be forgotten, unlearning should be considered in the worst-case scenario from the privacy perspective.Prior work shows that data outliers may exhibit higher memorization effects.Intuitively, they are harder to be unlearn and thus the privacy risk of unlearning them is underestimated in the current evaluation.In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations.We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature.Using personally identifiable information as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales.<span class='px-1 mx-1 bg-yellow-200'>Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.69</span></span><span class='px-1 mx-1 bg-yellow-200'>Our minority-aware evaluation framework represents an initial step toward ensuring more equitable assessments of LLM unlearning efficacy. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.677</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08559v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Can We Generate Visual Programs Without Prompting LLMs?
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Visual programming prompts LLMs (large language mod-els) to generate executable code for visual tasks like visual question answering (VQA). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.685</span></span>Prompt-based methods are difficult to improve while also being unreliable and costly in both time and money.<span class='px-1 mx-1 bg-yellow-200'>Our goal is to develop an efficient visual programming system without 1) using prompt-based LLMs at inference time and 2) a large set of program and answer annotations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.668</span></span>We develop a synthetic data augmentation approach and alternative program generation method based on decoupling programs into higher-level skills called templates and the corresponding arguments.<span class='px-1 mx-1 bg-yellow-200'>Our results show that with data augmentation, prompt-free smaller LLMs ($\approx$ 1B parameters) are competitive with state-of-the art models with the added benefit of much faster inference <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.706</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08564v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Exploratory testing (ET) harnesses tester's knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective.Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption.In this work, we explore the feasibility, challenges and road ahead of automated scenario-based ET (a.k.a soap opera testing).We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process.<span class='px-1 mx-1 bg-yellow-200'>We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.629</span></span>The system consists of three multi-modal agents, Planner, Player, and Detector that collaborate to execute tests and identify potential bugs.Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially under-explored scenario boundaries and incorrectly identified bugs.Based on the observation, we envision road ahead for the future of automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices.These insights aim to guide and inspire the future research.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08581v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Large language model (LLM) inference demands significant amount of computation and memory, especially in the key attention mechanism. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.649</span></span>While techniques, such as quantization and acceleration algorithms, like FlashAttention, have improved efficiency of the overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats.Recent Key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operation.   We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency.Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during exponentiation operation in attention.Experimental results demonstrate that TurboAttention achieves 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08585v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Both encoder-only models (e.g., BERT, RoBERTa) and large language models (LLMs, e.g., Llama3) have been widely used for text classification tasks.<span class='px-1 mx-1 bg-yellow-200'>However, there is a lack of systematic studies comparing the performance of encoder-based models and LLMs in text classification, particularly when fine-tuning is involved. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.65</span></span>This study employed a diverse range of models and methods, varying in size and architecture, and including both fine-tuned and pre-trained approaches.<span class='px-1 mx-1 bg-yellow-200'>We first assessed the performances of these LLMs on the 20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only RoBERTa models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.636</span></span>Additionally, we explored the multi-task capabilities of both model types by combining multiple classification tasks, including intent detection and slot-filling, into a single model using data from both datasets.<span class='px-1 mx-1 bg-yellow-200'>Our results indicate that fully fine-tuned Llama3-70B models outperform RoBERTa-large and other decoder LLMs across various classification tasks and datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.634</span></span><span class='px-1 mx-1 bg-yellow-200'>Moreover, the consolidated multi-task fine-tuned LLMs matched the performance of dual-model setups in both tasks across both datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.652</span></span><span class='px-1 mx-1 bg-yellow-200'>Overall, our study provides a comprehensive benchmark of encoder-only and LLM models on text classification tasks and demonstrates a method to combine two or more fully fine-tuned decoder LLMs for reduced latency and equivalent performance. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.627</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08587v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based Automated Requirement Traceability and Compliance Checks
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Ensuring that Software Requirements Specifications (SRS) align with higher-level organizational or national requirements is vital, particularly in regulated environments such as finance and aerospace.In these domains, maintaining consistency, adhering to regulatory frameworks, minimizing errors, and meeting critical expectations are essential for the reliable functioning of systems.<span class='px-1 mx-1 bg-yellow-200'>The widespread adoption of large language models (LLMs) highlights their immense potential, yet there remains considerable scope for improvement in retrieving relevant information and enhancing reasoning capabilities. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.7</span></span>This study demonstrates that integrating a robust Graph-RAG framework with advanced prompt engineering techniques, such as Chain of Thought and Tree of Thought, can significantly enhance performance.Compared to baseline RAG methods and simple prompting strategies, this approach delivers more accurate and context-aware results.While this method demonstrates significant improvements in performance, it comes with challenges.It is both costly and more complex to implement across diverse contexts, requiring careful adaptation to specific scenarios.Additionally, its effectiveness heavily relies on having complete and accurate input data, which may not always be readily available, posing further limitations to its scalability and practicality.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08593v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Preference Discerning with LLM-Enhanced Generative Retrieval
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Sequential recommendation systems aim to provide personalized recommendations for users based on their interaction history.To achieve this, they often incorporate auxiliary information, such as textual descriptions of items and auxiliary tasks, like predicting user preferences and intent.Despite numerous efforts to enhance these models, they still suffer from limited personalization.To address this issue, we propose a new paradigm, which we term preference discerning.In preference dscerning, we explicitly condition a generative sequential recommendation system on user preferences within its context.<span class='px-1 mx-1 bg-yellow-200'>To this end, we generate user preferences using Large Language Models (LLMs) based on user reviews and item-specific data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.636</span></span>To evaluate preference discerning capabilities of sequential recommendation systems, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following.We assess current state-of-the-art methods using our benchmark and show that they struggle to accurately discern user preferences.Therefore, we propose a new method named Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce $\textbf{d}$iscern$\textbf{er}$), which improves upon existing methods and achieves state-of-the-art performance on our benchmark.Our results show that Mender can be effectively guided by human preferences even though they have not been observed during training, paving the way toward more personalized sequential recommendation systems.We will open-source the code and benchmarks upon publication.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08604v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Despite the advancements in training Large Language Models (LLMs) with alignment techniques to enhance the safety of generated content, these models remain susceptible to jailbreak, an adversarial attack method that exposes security vulnerabilities in LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.649</span></span>Notably, the Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs.However, the optimization process involved in GCG is highly time-consuming, rendering the jailbreaking pipeline inefficient.In this paper, we investigate the process of GCG and identify an issue of Indirect Effect, the key bottleneck of the GCG optimization.To this end, we propose the Model Attack Gradient Index GCG (MAGIC), that addresses the Indirect Effect by exploiting the gradient information of the suffix tokens, thereby accelerating the procedure by having less computation and fewer iterations.Our experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates (ASR) on par or even higher than other baselines.<span class='px-1 mx-1 bg-yellow-200'>Our MAGIC achieved an ASR of 74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on GPT-3.5. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.604</span></span>Code is available at https://github.com/jiah-li/magic.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08615v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Synthetic Vision: Training Vision-Language Models to Understand Physics
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Physical reasoning, which involves the interpretation, understanding, and prediction of object behavior in dynamic environments, remains a significant challenge for current Vision-Language Models (VLMs).In this work, we propose two methods to enhance VLMs' physical reasoning capabilities using simulated data.First, we fine-tune a pre-trained VLM using question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks.Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to create scene descriptions enriched with physical properties and processes.<span class='px-1 mx-1 bg-yellow-200'>During physical reasoning tasks, these PCBs can be leveraged as context to assist a Large Language Model (LLM) to improve its performance. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.657</span></span>We evaluate both of our approaches using multiple benchmarks, including a new stability detection QA dataset called Falling Tower, which includes both simulated and real-world scenes, and CLEVRER.We demonstrate that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundational models.<span class='px-1 mx-1 bg-yellow-200'>We also show that integrating PCBs boosts the performance of foundational LLMs on physical reasoning tasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.673</span></span>Using the real-world scenes from the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer.Our results highlight the utility that simulated data can have in the creation of learning systems capable of advanced physical reasoning.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08619v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Generative Semantic Communication: Architectures, Technologies, and Applications
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>This paper delves into the applications of generative artificial intelligence (GAI) in semantic communication (SemCom) and presents a thorough study.Three popular SemCom systems enabled by classical GAI models are first introduced, including variational autoencoders, generative adversarial networks, and diffusion models.For each system, the fundamental concept of the GAI model, the corresponding SemCom architecture, and the associated literature review of recent efforts are elucidated.<span class='px-1 mx-1 bg-yellow-200'>Then, a novel generative SemCom system is proposed by incorporating the cutting-edge GAI technology-large language models (LLMs). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.63</span></span><span class='px-1 mx-1 bg-yellow-200'>This system features two LLM-based AI agents at both the transmitter and receiver, serving as "brains" to enable powerful information understanding and content regeneration capabilities, respectively. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.672</span></span>This innovative design allows the receiver to directly generate the desired content, instead of recovering the bit stream, based on the coded semantic information conveyed by the transmitter.<span class='px-1 mx-1 bg-yellow-200'>Therefore, it shifts the communication mindset from "information recovery" to "information regeneration" and thus ushers in a new era of generative SemCom. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.645</span></span>A case study on point-to-point video retrieval is presented to demonstrate the superiority of the proposed generative SemCom system, showcasing a 99.98% reduction in communication overhead and a 53% improvement in retrieval accuracy compared to the traditional communication system.Furthermore, four typical application scenarios for generative SemCom are delineated, followed by a discussion of three open issues warranting future investigation.In a nutshell, this paper provides a holistic set of guidelines for applying GAI in SemCom, paving the way for the efficient implementation of generative SemCom in future wireless networks.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08642v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td></td>
          <td>
            <h2 class="text-2xl tracking-tight pt-4 font-bold">Developer Research</h2>
          </td>
        </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Ask Humans or AI? Exploring Their Roles in Visualization Troubleshooting
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Visualization authoring is an iterative process requiring users to modify parameters like color schemes and data transformations to achieve desired aesthetics and effectively convey insights.<span class='px-1 mx-1 bg-yellow-200'>Due to the complexity of these adjustments, users often create defective visualizations and require troubleshooting support. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.609</span></span>In this paper, we examine two primary approaches for visualization troubleshooting: (1) Human-assisted support via forums, where users receive advice from other individuals, and (2) AI-assisted support using large language models (LLMs).Our goal is to understand the strengths and limitations of each approach in supporting visualization troubleshooting tasks.To this end, we collected 889 Vega-Lite cases from Stack Overflow.We then conducted a comprehensive analysis to understand the types of questions users ask, the effectiveness of human and AI guidance, and the impact of supplementary resources, such as documentation and examples, on troubleshooting outcomes.Our findings reveal a striking contrast between human- and AI-assisted troubleshooting: Human-assisted troubleshooting provides tailored, context-sensitive advice but often varies in response quality, while AI-assisted troubleshooting offers rapid feedback but often requires additional contextual resources to achieve desired results.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07673v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Quantum vs. Classical Machine Learning Algorithms for Software Defect Prediction: Challenges and Opportunities
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Software defect prediction is a critical aspect of software quality assurance, as it enables early identification and mitigation of defects, thereby reducing the cost and impact of software failures. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.648</span></span>Over the past few years, quantum computing has risen as an exciting technology capable of transforming multiple domains; Quantum Machine Learning (QML) is one of them.QML algorithms harness the power of quantum computing to solve complex problems with better efficiency and effectiveness than their classical counterparts.However, research into its application in software engineering to predict software defects still needs to be explored.In this study, we worked to fill the research gap by comparing the performance of three QML and five classical machine learning (CML) algorithms on the 20 software defect datasets.Our investigation reports the comparative scenarios of QML vs. CML algorithms and identifies the better-performing and consistent algorithms to predict software defects.We also highlight the challenges and future directions of employing QML algorithms in real software defect datasets based on the experience we faced while performing this investigation.The findings of this study can help practitioners and researchers further progress in this research domain by making software systems reliable and bug-free.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07698v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-09</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Why Do Developers Engage with ChatGPT in Issue-Tracker? Investigating Usage and Reliance on ChatGPT-Generated Code
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p>Large language models (LLMs) like ChatGPT have shown the potential to assist developers with coding and debugging tasks.However, their role in collaborative issue resolution is underexplored.In this study, we analyzed 1,152 Developer-ChatGPT conversations across 1,012 issues in GitHub to examine the diverse usage of ChatGPT and reliance on its generated code.Our contributions are fourfold.First, we manually analyzed 289 conversations to understand ChatGPT's usage in the GitHub Issues.Our analysis revealed that ChatGPT is primarily utilized for ideation, whereas its usage for validation (e.g., code documentation accuracy) is minimal.Second, we applied BERTopic modeling to identify key areas of engagement on the entire dataset.We found that backend issues (e.g., API management) dominate conversations, while testing is surprisingly less covered.Third, we utilized the CPD clone detection tool to check if the code generated by ChatGPT was used to address issues.Our findings revealed that ChatGPT-generated code was used as-is to resolve only 5.83\% of the issues.Fourth, we estimated sentiment using a RoBERTa-based sentiment analysis model to determine developers' satisfaction with different usages and engagement areas.We found positive sentiment (i.e., high satisfaction) about using ChatGPT for refactoring and addressing data analytics (e.g., categorizing table data) issues.On the contrary, we observed negative sentiment when using ChatGPT to debug issues and address automation tasks (e.g., GUI interactions).<span class='px-1 mx-1 bg-yellow-200'>Our findings show the unmet needs and growing dissatisfaction among developers. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.634</span></span><span class='px-1 mx-1 bg-yellow-200'>Researchers and ChatGPT developers should focus on developing task-specific solutions that help resolve diverse issues, improving user satisfaction and problem-solving efficiency in software development. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.618</span></span></p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.06757v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-11-21</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                Automated Generation of Code Debugging Exercises
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Debugging is an essential skill when learning to program, yet its instruction and emphasis often vary widely across introductory courses. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.654</span></span>In the era of code-generating large language models (LLMs), the ability for students to reason about code and identify errors is increasingly important.However, students frequently resort to trial-and-error methods to resolve bugs without fully understanding the underlying issues.Developing the ability to identify and hypothesize the cause of bugs is crucial but can be time-consuming to teach effectively through traditional means.This paper introduces BugSpotter, an innovative tool that leverages an LLM to generate buggy code from a problem description and verify the synthesized bugs via a test suite.Students interact with BugSpotter by designing failing test cases, where the buggy code's output differs from the expected result as defined by the problem specification.<span class='px-1 mx-1 bg-yellow-200'>This not only provides opportunities for students to enhance their debugging skills, but also to practice reading and understanding problem specifications. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.626</span></span>We deployed BugSpotter in a large classroom setting and compared the debugging exercises it generated to exercises hand-crafted by an instructor for the same problems.We found that the LLM-generated exercises produced by BugSpotter varied in difficulty and were well-matched to the problem specifications.Importantly, the LLM-generated exercises were comparable to those manually created by instructors with respect to student performance, suggesting that BugSpotter could be an effective and efficient aid for learning debugging.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2411.14303v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td class="inline-block">
            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-11-19</p>
          </td>
          <td>
            <div x-data="{open: false}">
              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
                WIA-SZZ: Work Item Aware SZZ
              </span>
              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                <div class="text-center pt-2"></div>
                <p class="pt-2">
                  <p><span class='px-1 mx-1 bg-yellow-200'>Many software engineering maintenance tasks require linking a commit that induced a bug with the commit that later fixed that bug. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.637</span></span>Several existing SZZ algorithms provide a way to identify the potential commit that induced a bug when given a fixing commit as input.Prior work introduced the notion of a "work item", a logical grouping of commits that could be a single unit of work.Our key insight in this work is to recognize that a bug-inducing commit and the fix(es) for that bug together represent a "work item."It is not currently understood how these work items, which are logical groups of revisions addressing a single issue or feature, could impact the performance of algorithms such as SZZ.In this paper, we propose a heuristic that, given an input commit, uses information about changed methods to identify related commits that form a work item with the input commit.We hypothesize that given such a work item identifying heuristic, we can identify bug-inducing commits more accurately than existing SZZ approaches.We then build a new variant of SZZ that we call Work Item Aware SZZ (WIA-SZZ), that leverages our work item detecting heuristic to first suggest bug-inducing commits.If our heuristic fails to find any candidates, we then fall back to baseline variants of SZZ.We conduct a manual evaluation to assess the accuracy of our heuristic to identify work items.Our evaluation reveals the heuristic is 64% accurate in finding work items, but most importantly it is able to find many bug-inducing commits.We then evaluate our approach on 821 repositories that have been previously used to study the performance of SZZ, comparing our work against six SZZ variants.That evaluation shows an improvement in F1 scores ranging from 2% to 9%, or when looking only at the subset of cases that found work item improved 3% to 14%.</p>
                </p>
              <p class="pb-2 pt-2 text-center">
                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2411.12740v1' target="_blank">
                  link
                </a>
              </p>
            </div>
          </div>
        </td>
      </tr><tr>
          <td></td>
          <td>
            <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Annotation Techniques</h2>
          </td>
        </tr></tbody>
  </table>
  <br><br>
</div>
</div>
<script>
  document.addEventListener("DOMContentLoaded", function() {
    renderMathInElement(document.body, {
      // customised options
      // • auto-render specific keys, e.g.:
      delimiters: [
      {left: '$$', right: '$$', display: true},
      {left: '$', right: '$', display: false},
      {left: '\\(', right: '\\)', display: false},
      {left: '\\[', right: '\\]', display: true}
      ],
      // • rendering keys, e.g.:
      throwOnError : false
    });
  });
</script>
</body>
</html>