Hierarchical embedding-based visual relationship prediction method with multimodal fusion
Abstract
Visual relationship prediction is a core task in computer vision that aims to uncover the semantic connections between objects in images. Existing approaches often suffer from inaccurate relationship representations caused by persistent modality gaps. To address this problem, this paper introduces a prediction model based on the hierarchical embedding of visual relationships. The model improves representation flexibility through dynamically weighted prediction operators and introduces novel consistency and reversible regularization constraints to enforce global logical consistency. On the predicate classification task of the Visual Genome dataset, the model achieves 81.13% R@20, substantially outperforming the baseline RelTR (63.10%). With consistency regularization, it further reaches 82.12% on the R@50 metric.