Exploiting the Interplay between Visual and Textual Data for Scene Interpretation