1-100 of about 190 matches for site:arxiv.org attention
https://arxiv.org/abs/2208.01626
[2208.01626] Prompt-to-Prompt Image Editing with Cross Attention Control
[2406.16008] Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization
https://arxiv.org/abs/2406.16008
https://arxiv.org/abs/1706.03762
[1706.03762] Attention Is All You Need
https://arxiv.org/abs/2109.01349
[2109.01349] Dual-Camera Super-Resolution with Aligned Attention Modules
https://arxiv.org/abs/2205.14135
[2205.14135] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
[2404.08634] When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
https://arxiv.org/abs/2404.08634
[2307.13108] An Explainable Geometric-Weighted Graph Attention Network for Identifying Functional Networks Associated with Gait Impairment
https://arxiv.org/abs/2307.13108
[2410.13835] Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
https://arxiv.org/abs/2410.13835
https://arxiv.org/abs/1412.7755
[1412.7755] Multiple Object Recognition with Visual Attention
https://arxiv.org/abs/2006.14615
[2006.14615] LayoutTransformer: Layout Generation and Completion with Self-attention
https://arxiv.org/abs/2207.13298
[2207.13298] Is Attention All That NeRF Needs?
https://arxiv.org/abs/2307.08691
[2307.08691] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
https://arxiv.org/abs/2307.11353
[2307.11353] What can a Single Attention Layer Learn? A Study Through the Random Features Lens
https://arxiv.org/abs/1910.05728
a number of different attention models proposed. However, the scale at which attention needs to be
https://arxiv.org/abs/2306.07998
[2306.07998] Contrastive Attention Networks for Attribution of Early Modern Print
https://arxiv.org/abs/1409.0473
[1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate
[2107.14285] ADeLA: Automatic Dense Labeling with Attention for Viewpoint Adaptation in Semantic Segmentation
https://arxiv.org/abs/2107.14285
https://arxiv.org/abs/2305.10203
Query Models with Intention, by Marta Garnelo and 1 other authors. Abstract: Attention-based models have been
[2503.17539] Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
https://arxiv.org/abs/2503.17539
can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video
https://arxiv.org/abs/2212.05032
the diffusion guidance process based on the controllable properties of manipulating cross-attention layers in diffusion
https://arxiv.org/abs/2405.12978
6 other authors. Abstract: We present personalized residuals and localized attention-guided sampling for
[2201.07779] Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation
https://arxiv.org/abs/2201.07779
cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial
https://arxiv.org/abs/2109.04683
on long physical interaction sequences with multiple interactions among different objects. We hypothesize that selective temporal attention during approximate mental simulations
https://arxiv.org/abs/2404.01284
of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism ArtAttention that incorporates
https://arxiv.org/abs/2407.17490
Chai and 7 other authors. Abstract: AI agents have drawn increasing attention mostly on their ability
https://arxiv.org/abs/2406.09401
LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its
https://arxiv.org/abs/2207.10662
Mohammed Suhail and 3 other authors. Abstract: Neural rendering has received tremendous attention since the advent
https://arxiv.org/abs/2104.08666
analyzed biases in vision and pre-trained language models individually - however, less attention has been paid to
https://arxiv.org/abs/2409.14379
the images of persons from the group photo into the attention modules and employ
[2312.01429] Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
https://arxiv.org/abs/2312.01429
of the model, such as the weight matrices or the attention patterns. In this
https://arxiv.org/abs/2306.14899
Abstract: Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these
[2312.09138] Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
https://arxiv.org/abs/2312.09138
3D scene understanding has primarily focused on short-term change tracking from dense observations, while little attention has been paid to
https://arxiv.org/abs/2412.04468
advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA
https://arxiv.org/abs/2406.13131
output of large language models into the individual contributions of attention heads and MLPs
https://arxiv.org/abs/2504.08591
and efficiency due to the computational demands of long-range attention mechanisms. To address
https://arxiv.org/abs/2501.12381
Abstract: We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for
https://arxiv.org/abs/2401.04718
to synthesize pixels. Because keypoints can contain errors, we propose a cross-modal attention scheme to select
https://arxiv.org/abs/1612.06321
To identify semantically useful local features for image retrieval, we also propose an attention mechanism for keypoint
https://arxiv.org/abs/2309.03453
every step of the reverse process through a 3D-aware feature attention mechanism that correlates the
https://arxiv.org/abs/2411.17249
flow smoothness through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to
https://arxiv.org/abs/2405.12979
domains not seen at training time. Additionally, we propose a novel keypoint position-guided attention mechanism which disentangles spatial
https://arxiv.org/abs/2502.10377
the style to a single view using a training-free semantic-attention mechanism in a
https://arxiv.org/abs/2412.21079
explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and
https://arxiv.org/abs/2503.21581
distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model
https://arxiv.org/abs/2405.17414
frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top
https://arxiv.org/abs/1804.03281
Abstract: The task of person re-identification has recently received rising attention due to the
https://arxiv.org/abs/2211.15521
image by attending over the clues automatically extracted from the guidebook. Supervising attention with country-level pseudo
https://arxiv.org/abs/1611.09464
the next actions while conforming to social behaviors by engaging to joint attention. Our key innovation is
https://arxiv.org/abs/2304.06712
drawing a red circle around an object, we can direct the model's attention to that
https://arxiv.org/abs/2402.05235
multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view
https://arxiv.org/abs/2112.10752
point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the
https://arxiv.org/abs/2209.13085
can only be unhackable if one of them is constant. We thus turn our attention to deterministic
https://arxiv.org/abs/2310.10634
neglecting the non-expert user access to agents and paying little attention to application
[2304.00553] From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding
https://arxiv.org/abs/2304.00553
and 10 other authors. Abstract: Action understanding has attracted long-term attention. It can be formed
https://arxiv.org/abs/2311.17261
To further secure the style consistency across views, we introduce a cross-attention decoder to predict
https://arxiv.org/abs/2010.04595
to 3D points, thus yielding general and rich point representations. We additionally integrate an attention mechanism to aggregate
https://arxiv.org/abs/2412.10533
video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies
https://arxiv.org/abs/1909.11059
in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the
[2304.13681] Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation
https://arxiv.org/abs/2304.13681
Chen and 4 other authors. Abstract: Multi-view image generation attracts particular attention these days due to
https://arxiv.org/abs/2311.12024
A100 GPU. PF-LRM is a highly scalable method utilizing the self-attention blocks to exchange
https://arxiv.org/abs/2205.01643
by the mean teacher framework taking advantage of the cross-scale self-attention mechanism in Deformable
https://arxiv.org/abs/2208.14023
a joint sequence rather than a time sequence, allowing us to perform attention over joints while predicting
https://arxiv.org/abs/2107.08408
to be challenging enough for even human players. Past approaches have not paid enough attention to the
https://arxiv.org/abs/2302.06548
to successfully execute its current chore. Filtering distracting inputs that contain irrelevant data has received little attention in the
https://arxiv.org/abs/2506.08010
Transformers -- the emergence of high-norm tokens that lead to noisy attention maps. We observe that
https://arxiv.org/abs/2504.05298
Abstract: Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for
https://arxiv.org/abs/2503.16413
propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and
https://arxiv.org/abs/2412.06269
and Distributed Version Control. A barrier to feedback that deserves greater attention is Schema Evolution. When
https://arxiv.org/abs/1603.04908
close. In this paper, we study the tight interplay between our momentary visual attention and motor
[2203.09457] Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image
https://arxiv.org/abs/2203.09457
from a single image has recently attracted a lot of attention, and it
https://arxiv.org/abs/2507.07230
from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel
https://arxiv.org/abs/2410.08151
of the previous clip at the front of the attention window as conditioning, which
https://arxiv.org/abs/1709.01630
is a challenging task that requires inferring the camera wearer's visual attention, and decoding