Abstract:
Recent methods for combining textual and visual information using supervised (textual, visual) data have shown encouraging performance. However, they are mostly limited to paired (textual, visual) data. We are interested in exploring methods that can leverage large, but independently annotated, datasets of visual and textual data. Applications include image and video captioning and the induction of novel objects, wherein we aim to describe objects that were not seen in the paired annotated data by harnessing knowledge from unpaired data.