Antonio Torralba

Title: Dissecting neural nets

With the success of deep neural networks and access to image databases with millions of labeled examples, the state of the art in computer vision is advancing rapidly. Even when no labeled examples are available, Generative Adversarial Networks (GANs) have demonstrated a remarkable ability to learn from images and can create nearly photorealistic images. The performance achieved by convNets and GANs constitutes the state of the art on many tasks. But why do convNets work so well? What is the nature of the internal representation learned by a convNet in a classification task? How does a GAN represent our visual world internally? In this talk I will show that the internal representations in both convNets and GANs can be interpretable in some important cases. I will then show several applications to object recognition, computer graphics, and unsupervised learning from images and audio. I will show that an ambient audio signal can be used as a supervisory signal for learning visual representations. We do this by taking advantage of the fact that vision and hearing often tell us about similar structures in the world, such as when we see an object and simultaneously hear it make a sound. I will also show how we can use raw speech descriptions of images to jointly learn to segment words in speech and objects in images without any additional supervision.