Nvidia has announced a groundbreaking new application called GanVerse3D, with no release date announced yet, which can render a 3D model from a single 2D photo, like the Knight Rider car, ready for use in Nvidia Omniverse.
But that is not all. The 3D model includes texture, and it can be animated in seconds at the click of a button.
How does it work?
There are two main parts to GanVerse3D.
First, it uses a Generative Adversarial Network, aka GAN, to generate photo-realistic images of an object in 2D, with a fixed number of viewpoints.
It then feeds these images to train a state-of-the-art inverse graphics neural network, previously developed by Nvidia as part of the DIB-R paper, which uses DIB-R, the differentiable renderer. For simplicity, I will call it DIB-R. But please don’t confuse DIB-R the differentiable renderer with DIB-R the inverse graphics network. I have written a whole article about DIB-R, so you might want to check it out in case you get confused.
DIB-R then serves as the inferencing engine to generate 3D models with texture from a single 2D image.
Because DIB-R requires the background to be masked, i.e. removed, GanVerse3D also uses Mask R-CNN to remove the backgrounds from the StyleGAN-generated images.
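Put together, the pipeline described above can be sketched roughly as follows. This is a minimal sketch only: the function names are hypothetical stand-ins for StyleGAN, Mask R-CNN and DIB-R, not Nvidia’s actual API.

```python
import numpy as np

# --- Hypothetical stand-ins for the real components (assumptions) ---

def stylegan_generate(n_views, seed=0):
    """Stand-in for StyleGAN: returns n_views fake RGB images (H, W, 3)."""
    rng = np.random.default_rng(seed)
    return [rng.random((64, 64, 3)) for _ in range(n_views)]

def mask_rcnn_remove_background(image):
    """Stand-in for Mask R-CNN: zero out everything outside a dummy mask."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[16:48, 16:48] = True  # pretend the object occupies the center
    return image * mask[..., None]

def train_inverse_graphics_network(masked_images, viewpoints):
    """Stand-in for training DIB-R on (image, viewpoint) pairs."""
    return {"n_training_pairs": len(masked_images)}

# --- The GanVerse3D-style pipeline described in the article ---
viewpoints = list(range(8))                       # fixed set of camera viewpoints
images = stylegan_generate(n_views=len(viewpoints))
masked = [mask_rcnn_remove_background(img) for img in images]
model = train_inverse_graphics_network(masked, viewpoints)
print(model["n_training_pairs"])  # one training pair per viewpoint
```

The key idea is simply that the GAN plays the role of an infinite, self-annotating dataset: every generated image arrives already paired with its viewpoint.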
What is an inverse graphics network?
An inverse graphics network, put simply, is a neural network that can generate a 3D model with texture and lighting information from a 2D photo.
Why call it “inverse graphics”?
It is called inverse graphics because we are trying to do the opposite of what is normally done in computer graphics, which is to convert a 3D model into a 2D image that is then rendered onto a screen.
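The “forward” direction is easy to state precisely: with a pinhole camera, a 3D point is projected down to 2D, and depth is lost along the way. A tiny sketch of that forward projection (plain NumPy, with made-up camera values) shows why the inverse problem is hard:

```python
import numpy as np

# Forward graphics: project a 3D point onto the image plane of a
# simple pinhole camera (focal length f, camera at the origin).
f = 1.0
point_3d = np.array([2.0, 1.0, 4.0])          # (x, y, z) in camera space

x, y, z = point_3d
point_2d = np.array([f * x / z, f * y / z])   # perspective projection

print(point_2d)  # [0.5  0.25]

# Inverse graphics has to undo this, but the same 2D point is the image
# of EVERY 3D point along the ray (0.5*t, 0.25*t, t), so depth and shape
# must be inferred from priors the network has learned.
point_3d_other = np.array([4.0, 2.0, 8.0])    # a different point on that ray
x2, y2, z2 = point_3d_other
assert np.allclose(point_2d, [f * x2 / z2, f * y2 / z2])
```

Since infinitely many 3D scenes explain the same 2D image, the network’s training data is what resolves the ambiguity, which is exactly why dataset quality matters so much in the next section.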
Could we not convert 2D to 3D last year?
That’s very true. This is so last year. Almost feels like old news. But what GanVerse3D brings to the table is higher-quality 3D models!
Previous AI models were trained using ShapeNet, a synthetic dataset of 3D Models.
The problem with using synthetic, cartoon-like models is that a neural network trained on them becomes better at predicting the 3D shape of cartoon-like objects and does not do so well when applied to real photos.
ShapeNet was created because creating a large enough dataset of 3D models with textures of photo-realistic objects is too expensive. ShapeNet solves the problem of quantity but unfortunately not quality.
To improve on this, newer AI models use the Pascal3D dataset and the CUB bird dataset for training.
The Pascal3D and CUB bird datasets consist of photos of cars and birds that were manually annotated with just enough information for an algorithm to calculate the position of the camera (the viewpoint). This then helps the AI model when deriving the 3D structure.
There are relatively few datasets like this available, so the quality of the 3D models generated by models trained on them is quite limited compared to GanVerse3D.
We all know that StyleGAN2 can generate crazy realistic images of almost anything: people, fish, cars, horses, cartoons, and the list goes on and on!
But there is another hidden ability in StyleGANs!
It has 3D knowledge of the objects it generates, and it can render them from different perspectives.
But the problem is that, because these images are not annotated, there is no way to know where the camera viewpoint is.
But wait a minute. What if there were a way to know? Suddenly we would be able to generate large datasets to feed the DIB-R neural network, and in this way we could train amazing 3D models from 2D photos for almost any type of object!
We would no longer be limited to generating birds, or just cars. Right?
Meet the StyleGAN Renderer, aka StyleGAN-R
So Nvidia has cracked this. If you have heard about StyleGANs before, you might know that it is possible to control certain properties of a generated image using a latent code.
This latent code allows us to generate realistic photos of people with certain hair colors, wearing glasses or not. And most importantly, it can also control the camera viewpoint.
The reason it hasn’t been used so far, according to the GanVerse3D paper, and let me quote them directly, is that “these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning.”
What this means is that, until now, researchers didn’t know how to get exact control of the viewpoint.
But they have finally figured it out. Let me try to explain how they did it. The StyleGAN neural network has 16 layers. Every 2 layers form a block, and it turns out that the first two blocks of the GAN control the viewpoint.
So with this knowledge, if we fix the first two blocks of the StyleGAN, all the images generated by that GAN will have the exact same viewpoint.
Since we know the exact viewpoint, we can immediately annotate each generated image with the annotations required to train the inverse graphics network.
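As a rough sketch of that trick: the 16-layer, per-layer latent-code structure below mirrors StyleGAN’s, but the sampling is a NumPy stand-in I made up for illustration, not Nvidia’s actual code.

```python
import numpy as np

N_LAYERS = 16          # StyleGAN has 16 layers
LAYERS_PER_BLOCK = 2   # every 2 layers form a block
VIEWPOINT_BLOCKS = 2   # the first two blocks control the viewpoint
LATENT_DIM = 512       # typical StyleGAN latent size (assumption)

rng = np.random.default_rng(42)

def sample_latents():
    """One latent code per layer, as in StyleGAN's per-layer style inputs."""
    return rng.standard_normal((N_LAYERS, LATENT_DIM))

# Freeze the latent codes of the first two blocks (= first 4 layers) so
# every generated image shares the same camera viewpoint.
fixed_viewpoint = rng.standard_normal((VIEWPOINT_BLOCKS * LAYERS_PER_BLOCK, LATENT_DIM))

def latents_with_fixed_viewpoint():
    w = sample_latents()
    w[: VIEWPOINT_BLOCKS * LAYERS_PER_BLOCK] = fixed_viewpoint
    return w

w_a = latents_with_fixed_viewpoint()
w_b = latents_with_fixed_viewpoint()

# The first 4 layers agree (same viewpoint); the rest differ (different object).
print(np.allclose(w_a[:4], w_b[:4]))   # True
print(np.allclose(w_a[4:], w_b[4:]))   # False
```

Repeating this with a different `fixed_viewpoint` for each camera position yields a full multi-view, viewpoint-annotated dataset for free.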
Limitations in Nvidia GanVerse3D
Firstly, GanVerse3D currently can only generate 3D models and textures for a very limited set of object classes: cars, horses and birds.
This limitation exists simply because GanVerse3D was trained only on StyleGANs that generate these classes. Not a big deal. I expect Nvidia to add many more classes of objects in the future.
GanVerse3D doesn’t generalize very well
Also, it doesn’t seem to generalize very well.
Let me explain. If you give a picture of a Batmobile to GanVerse3D, it gives you back a 3D model of a completely different car.
If you give a picture of a penguin to GanVerse3D, it gives you back the wrong bird. Definitely not a penguin.
This is because the datasets used to train the StyleGANs for GanVerse3D do not contain penguins or Batmobiles. Hmmm, is this meant to be a Batman joke in the paper?
To use the technical term in the paper, these objects are out of distribution.
Artifacts in Generated Images
If you look carefully, you will notice some oddities in the images generated by the GAN.
For instance, you will see that the horse is missing its tail. This is mainly an indication that the images used to train the StyleGAN were front-facing pictures of horses.
Also, the generated 3D model of the horse is deformed at the top. This is because most of the photos used for training are biased to show only the sides of the subject.
Also, as acknowledged in the paper, StyleGAN2 has issues generating views of articulated objects (e.g. a horse or a bird). This is something GanVerse3D can’t handle too well just yet, and it will be the subject of further research.
GanVerse3D fails to predict correct lighting
Even though GanVerse3D can correctly predict the geometry and texture of an object, it currently fails to predict the correct lighting.
But even considering all this, this is a major breakthrough!
In the very near future, it is going to become a lot cheaper to create 3D assets for games!
Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao, Yinan Zhang, Antonio Torralba, Sanja Fidler. Image GANs Meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering.
Wenzheng Chen, Jun Gao*, Huan Ling*, Edward J. Smith*, Jaakko Lehtinen, Alec Jacobson, Sanja Fidler. Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer.
Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, Bill Freeman, Josh Tenenbaum. 3D-Aware Scene Manipulation via Inverse Graphics. In Advances in Neural Information Processing Systems.
P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
Yu Xiang, Roozbeh Mottaghi, Silvio Savarese. Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.