Abstract:
Motion forecasting of surrounding agents is fundamental for autonomous systems navigating complex, dynamic environments: it enables autonomous vehicles and robots to anticipate the future trajectories of vehicles, pedestrians, and other moving entities. Recent models leverage large-scale datasets to learn spatiotemporal patterns, but many focus on single-agent prediction. This poses a problem because, as the number of agents in a scene grows, computation time scales linearly due to the redundant re-encoding of features, such as static or HD maps, that could otherwise be shared across agents. Moreover, some models struggle to accurately capture agent-agent interactions, limiting their ability to model the influence agents have on one another. They also encounter challenges with variable sequence lengths: models using GRUs or RNNs can lose context over long sequences, and improper handling of padded timesteps can degrade encoding quality. We also explore monocular motion forecasting in the context of traffic surveillance, where the goal is to forecast the future positions of agents from monocular camera footage by leveraging monocular depth estimation. This setting typically operates without access to high-definition (HD) maps and relies solely on the historical motion of agents, i.e., map-free motion forecasting. In this work, we propose a new approach that addresses these challenges: multi-agent forecasting, agent-agent interaction modeling, and monocular-camera-based motion forecasting. We evaluate our autonomous driving model on two benchmark datasets, Argoverse 2 and nuScenes, under standard evaluation metrics: MinFDE, MinADE, and MR (miss rate). For monocular motion forecasting, we evaluate our method on the BrnoComp dataset.