Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors

1University of Maryland, College Park 2Kuwait University

We create 3D multiview illusion art using a 2D text-to-image generation model.

Real Examples

3D Shape Illusion

Cube

Sphere

Reflective Cylinders

Reflective Mirrors

Abstract

Automatically generating multiview illusions is a compelling challenge, where a single piece of visual content offers distinct interpretations from different viewing perspectives. Traditional methods, such as shadow art and wire art, create interesting 3D illusions but are limited to simple visual outputs (e.g., figure-ground or line drawings), restricting their artistic expressiveness and practical versatility. Recent diffusion-based illusion generation methods can produce more intricate designs but are confined to 2D images. In this work, we present a simple yet effective approach for creating 3D multiview illusions based on user-provided text prompts or images. Our method leverages a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering, so that the resulting content yields different interpretations when viewed from different angles. We develop several techniques to improve the quality of the generated 3D multiview illusions. We demonstrate the effectiveness of our approach through extensive experiments and showcase illusion generation with diverse 3D forms.

Overview

Given multiple text prompts, we aim to create 3D illusions with multiple interpretations, each respecting the corresponding text prompt. We achieve this by selecting viewing angles of common 3D shapes or reflective surfaces with overlapping regions and optimizing them to align with the text descriptions. Our method leverages the power of diffusion models and incorporates several techniques and design choices to enhance the quality of these 3D multiview illusions.
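To make the structure of this optimization concrete, below is a minimal sketch of a multiview score-distillation loop under our own assumptions; it is not the authors' released code. The names `texture_field`, `renderer`, and `diffusion_loss` are hypothetical placeholders for the neural texture field, the differentiable renderer, and the diffusion-guided (e.g., VSD-style) objective.

import torch

def optimize_multiview_illusion(texture_field, renderer, diffusion_loss,
                                viewpoints, prompts, num_steps=10_000, lr=1e-2):
    """Optimize a shared 3D texture so each viewpoint matches its own prompt.

    texture_field: torch.nn.Module with learnable texture parameters (assumed).
    renderer: differentiable function (texture_field, view) -> image tensor (assumed).
    diffusion_loss: diffusion-guided loss (image, prompt) -> scalar tensor (assumed).
    """
    optimizer = torch.optim.Adam(texture_field.parameters(), lr=lr)
    for step in range(num_steps):
        optimizer.zero_grad()
        total_loss = 0.0
        # Every prompt sees the *same* 3D asset, only from a different camera,
        # which is what forces the overlapping regions to satisfy all prompts.
        for view, prompt in zip(viewpoints, prompts):
            image = renderer(texture_field, view)   # differentiable render
            total_loss = total_loss + diffusion_loss(image, prompt)
        total_loss.backward()
        optimizer.step()
    return texture_field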

We illustrate the process of generating a 3D multiview illusion from a cube with three different interpretations from different viewpoints (V1, V2, and V3), guided by text prompts (y1, y2, and y3). First, we render the cube from the target viewpoints Vi, applying scheduled camera jitter C(k) and scheduled render size R(k) as functions of the gradient flow time k. Camera jitter improves generation quality, and render size scheduling helps reduce the duplicate pattern issue. We then use the multi-resolution texture field T to obtain the images (I1, I2, and I3) at resolution R(k) in [512, 1024]. To support these higher render resolutions during training, we extract a random 512 × 512 patch Pi from each rendered image Ii, which is then fed into a pre-trained VAE encoder. Given the 3D shape, we optimize only the parameters of the texture field. The rendered views are optimized by leveraging a text-to-image diffusion model guided by the text prompts (y1, y2, and y3). To avoid unnatural, saturated colors, we apply Variational Score Distillation (VSD) with a LoRA model. We use the same settings for spheres and for scenes with reflective surfaces.
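The sketch below illustrates the scheduling and patch sampling described above: a camera jitter magnitude C(k) and a render size R(k) that vary with the gradient flow time k, and a random 512 × 512 crop taken before the VAE encoder. The exact schedule shapes (linear decay/ramp) are our illustrative assumptions, not values from the paper.

import random
import torch

PATCH = 512

def camera_jitter(k, num_steps, max_jitter=0.1):
    # C(k): jitter magnitude, here assumed to decay linearly over optimization.
    scale = max_jitter * (1.0 - k / num_steps)
    return (torch.rand(3) * 2.0 - 1.0) * scale   # small random camera offset

def render_size(k, num_steps, low=512, high=1024):
    # R(k): render resolution ramped from `low` to `high` over training,
    # one way to discourage repeated texture patterns at a fixed low resolution.
    r = low + (high - low) * (k / num_steps)
    return int(round(r / 64) * 64)               # keep a renderer-friendly multiple

def random_patch(image, patch=PATCH):
    # Crop a random `patch` x `patch` window from an image of shape (C, H, W)
    # so the diffusion prior always sees its native 512 x 512 resolution.
    _, h, w = image.shape
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    return image[:, top:top + patch, left:left + patch]

For example, at step k the rendered image of size R(k) × R(k) would be passed through `random_patch` and the resulting crop encoded by the VAE before computing the VSD loss.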

BibTeX

@article{feng2024illusion3d3dmultiviewillusion,
  title={Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors},
  author={Yue Feng and Vaibhav Sanjay and Spencer Lutz and Badour AlBahar and Songwei Ge and Jia-Bin Huang},
  year={2024},
  eprint={2412.09625},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.09625},
}