Drone Tracking in IR Videos with Transformers

In this work, we introduce a framework that exploits temporal information for precise drone tracking in infrared (IR) video sequences. Temporal information is integrated at two stages: feature extraction and similarity map enhancement. For feature extraction, we employ an online, temporally adaptive convolution that augments spatial features by dynamically adjusting the convolution weights based on data from previous frames. For similarity map refinement, we use an adaptive temporal transformer that encodes temporal knowledge from earlier frames and then decodes it to refine the similarity map, improving tracking accuracy.
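
To make the idea of adjusting convolution weights from previous frames concrete, the following is a minimal sketch in plain Python and not the implementation used in this work. It assumes a simple calibration rule: each input channel of a fixed base kernel is scaled by 1 + gain * tanh(m), where m is the mean of that channel's global-average-pooled descriptors over a short window of past frames. The function names, the `gain` and `window` parameters, and the calibration rule itself are illustrative assumptions.

```python
import math


def global_avg_pool(frame):
    """Return the per-channel mean of a frame laid out as [channel][row][col]."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in frame
    ]


def calibration_factors(history, gain=0.5):
    """Turn descriptors of previous frames into per-channel kernel scales.

    Assumed rule (illustrative only): scale = 1 + gain * tanh(mean descriptor).
    Returns None when there is no history, meaning "use the base kernel as is".
    """
    if not history:
        return None
    n_channels = len(history[0])
    means = [sum(d[c] for d in history) / len(history) for c in range(n_channels)]
    return [1.0 + gain * math.tanh(m) for m in means]


def adaptive_conv2d(frame, kernel, scales=None):
    """Valid cross-correlation of a [C][H][W] frame with a [C][k][k] kernel.

    Each input channel's kernel slice is multiplied by its scale, if given.
    The result is a single [H-k+1][W-k+1] feature map.
    """
    n_channels, height, width = len(frame), len(frame[0]), len(frame[0][0])
    k = len(kernel[0])
    if scales is None:
        scales = [1.0] * n_channels
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            acc = 0.0
            for c in range(n_channels):
                for di in range(k):
                    for dj in range(k):
                        acc += scales[c] * kernel[c][di][dj] * frame[c][i + di][j + dj]
            row.append(acc)
        out.append(row)
    return out


def extract_features(frames, kernel, window=2):
    """Process frames online: each frame only sees descriptors of earlier frames."""
    history = []
    features = []
    for frame in frames:
        scales = calibration_factors(history[-window:])
        features.append(adaptive_conv2d(frame, kernel, scales))
        history.append(global_avg_pool(frame))  # becomes available to later frames
    return features


# Usage: two identical single-channel 3x3 frames and a 2x2 base kernel.
frame = [[[1.0] * 3 for _ in range(3)]]
kernel = [[[1.0, 1.0], [1.0, 1.0]]]
features = extract_features([frame, frame], kernel)
```

The first frame has no history and is convolved with the unmodified base kernel; the second frame is convolved with a kernel rescaled from the first frame's descriptor, so identical inputs can yield different features depending on temporal context. A learned calibration module would replace the fixed tanh rule in practice.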