Fun with Diffusion Models
Fall 2024
In this project, I used the DeepFloyd diffusion model to explore different denoising approaches. In the second part, I created my own UNet and diffusion model from scratch.
Table of Contents
1. Sampling from the Model2. Implementing the Forward Process
3. Classical and One-Step Denoising
4. Implementing Iterative Denoising
5. Diffusion Model Sampling
6. Classifier Free Guidance
7. Image-to-image Translation
8. Editing Hand-drawn and Web Images
9. Inpainting
10. Text-conditioned Image-to-image Translation
11. Visual Anagrams
12. Hybrid Images
Diffusion Models from Scratch
13. Training a Single-Step Denoising UNet
14. Training a Diffusion Model
15. Bell & Whistle: Denoising GIFs
1. Sampling from the Model
In this part, I create an instance of the DeepFloyd model (using seed=180) from HuggingFace and sample three images from them. These 3 images are sampled using text prompts ‘an oil painting of a snowy mountain village’, ‘a man wearing a hat’, and ‘a rocket ship’, respectively.
I also try sampling these images at different num_inference_steps. The results are shown below.
20 inference steps
40 inference steps
For num_inference_steps = 20 and 40, the quality of the image in relation to the text prompt seems to be quite well. I thought it was interesting how the model interprets ‘a man wearing a hat’ consistently as a photo of a man, instead of a painting or cartoon. In contrast, ‘a rocket ship’ is interpreted as a cartoon or drawing rather than a photo.
For num_inference_steps = 5, the images look quite grainy. ‘an oil painting of a snowy mountain village’ still produces a matching image, but the results for the other two prompts appear much more vague.
Additionally, I feel that num_inference_steps = 40 returns more detailed images than num_inference_steps = 20, such as more background detail (except for the man wearing a hat).
2. Implementing the Forward Process
In this part, I implement the forward process of adding noise to a clean image. For a timestep t, I add an epsilon value that is randomly sampled from a standard normal distribution, scaled by a predetermined alpha value (determined by timestep).
I perform the forward process to noise the Campanile at t=250, t=500, and t=750.
t = 250
t = 500
t = 750
3. Classical and One-Step Denoising
In this section, I attempt to denoise the Campanile images I generated above using classical and one-step methods.
First, I try adding a classic Gaussian blur…
Denoise by Gaussian Blur
t = 250
t = 500
t = 750
We can see that this doesn’t work out very well. Not only is the Campnile blurry, but we can still see the noisy pixels.
Second, I try using a pretrained UNet diffusion model to denoise the images in one step. The diffusion model was trained with text conditioning, so for this part I use the prompt ‘a high quality photo’.
One-step denoising by Pretrained Diffusion Model
t = 250
t = 500
t = 750
Looks better.. but the image still looks suboptimal when attempting to denoise from higher noise levels.
4. Implementing Iterative Denoising
Diffusion models are designed to denoise iteratively rather than all in one step like attempted above. In this part. I iteratively denoise using a list of strided timesteps, starting at t=990 and decreasing t by 30 until reaching 0. This is done to save time (as denoising at every single time step would take too long).
Iteratively denoising the Campanile, starting at t=690 for the first image and decreasing t by 30 for each image:
Final clean image:
Much better. The Campanile looks like a lighthouse!
Comparison with results from Part 3:
Gaussian Blur Denoise, t=690
One-Step Denoise, t=690
Iterative Denoise, from t=690
5. Diffusion Model Sampling
Now that we have implemented iterative denoising, we can generate images from pure noise. Instead of starting at t=690 (or i_start=10), we can start at t=990 (i_start=0), which is an image made up of completely random noise.
10 examples with text prompt ‘a high quality photo’:
6. Classifier Free Guidance
It doesn’t seem like the generated images above look very convincing or high-quality. In this section I implement Classifier-Free Guidance, where we compute both a conditional and unconditional noise estimate. We combine these two noise estimates using a scaling factor gamma that controls the CFG strength. When gamma=0, we get only the unconditional noise estimate. when gamma=1, we only get the conditional noise estimate.
When gamma > 1, we get much higher quality images.
5 examples with gamma = 7:
7. Image-to-image Translation
In this part, I add varying levels of noise to an image and denoise using iterative CFG. By denoising from slightly noisy images, we are allowing the diffusion model to be ‘creative’ and add things that were not there before.
Below are images denoised at i_start = 1, 3, 5, 7, 10, and 20. The smaller the i_start (noisier image), the less similar the result is from the initial clean image.
Campanile:
San Jose Circles:
Bear:
Yifang Pancakes:
8. Editing Hand-drawn and Web Images
What happens if we try the same image-to-image translation on non-realistic images and drawings?
Web Image:
Hand Drawn Image of Bird:
Hand Drawn Image of Smile:
Hand Drawn Image of Millenium Falcon (scanned):
9. Inpainting
We can perform a masked denoising to only edit a specific part of an image. After each iterative step in denoising, we ‘force’ the parts outside of the mask to be the same image, while the parts inside the mask continue to be denoised.
Inpainted Campanile, masking the top of the tower
Inpainted Kiryu, masking the head
Inpainted borb, masking the head
10. Text-conditioned Image-to-image Translation
The previous examples of image-to-image translation and inpainting were all used with the text prompt ‘a high quality photo’. In this part, I change the text prompt to ‘guide’ the edits that the diffusion model makes (at varying noise levels like earlier).
Campanile, text conditioned with ‘a rocket ship’
Kiryu in Yakuza 3, text conditioned with ‘a photo of a hipster barista’ to replace his head (the results don’t seem to blend in the best way for every noise level)
Boba, text conditioned with ‘a yifang fruit tea drink’ to remove the TP Tea drink
11. Visual Anagrams
In this part, I implement visual anagrams using diffusion models. To do this, I obtain two noise estimates. One is from a UNet denoising from a completely noisy image with one text prompt. The other is from a UNet denoising on the noisy image upside down, with a different text prompt. At each step of the iterative process, I compute these two noise estimates and take their average to obtain the final noise estimate.
Oil painting of a man vs oil painting of a campfire
Cat vs lion
Muffin vs chihuahua
12. Hybrid Images
It is also possible to create hybrid images using diffusions models. Similar to the visual anagrams, I compute two separate noise estimates for two separate text prompts (but this time one of them is not denoising a flipped image). Then, I apply a lowpass filter on one estimate and a highpass filter on the other. Adding the two resulting values together gives me the final noise estimate.
Ominous waterfall
Cat or tiger?
Dog or lion?
Dog or bear?
Campfire or old man?
13. Training a Single-Step Denoising UNet
In these last two parts, I implement a UNet from scratch and use it to denoise MNIST digits.
To add noise to a clean image x, we use the following equation, where epsilon is randomly sampled from a standard normal distribution.
We can vary the sigma value to adjust the strength of the noise.
Example of noising process using sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]. In the image, the sigma value increases from left to right:
First, I implement an unconditional UNet.
For the unconditioned UNet, we optimize over the following loss, where D is our denoiser and x is our clean image:
After training for 5 epochs with batch size = 256, sigma = 0.5, and hidden dimension D = 128…
For each of the images below, I display a row of the clean images, then a row of the images with sigma = 0.5 of noise added, and then a row of the denoised image
Sample results per epoch:
Sample results per noise level (varying sigma value). Sampled after training the UNet for 5 epochs:
14. Training a Diffusion Model
In this part, I train a UNet that can iteratively denoise an image.
Hyperparameters used: 20 epochs, hidden dimension D = 64, batch size = 64.
Conditioned UNet architecture
FCBlock operation breakdown
These conditioned UNet models predict the noise in an image, so the loss function is based on noise estimate epsilon instead of the image:
Time-conditioned UNet
First, I train a version of the diffusion model that is conditioned on timestep t. This model predicts the noise in an image, given a noisy image and a timestep t.
Training loss curve plot:
Sampling results for epochs 1-5:
Sampling results for epochs 16-20:
Class-conditional UNet
The time-conditioned UNet can iteratively denoise from a completely noisy image, but the resulting number is random. In this part, I implement a class-conditioned UNet that can selectively denoise digits decided by the user. To ensure that this UNet can still work without being conditioned on class, I implement a dropout rate of 10% (during training, 10% of the time the class conditioning vector gets set to 0). Additionally, I add CFG to with gamma = 5.0 to improve the quality of the results.
Training loss curve plot:
Sampling results for epochs 1:
Sampling results for epoch 5:
Sampling results for epoch 20:
Denoising GIFs
Time-conditioned UNet:
Class-conditioned UNet:
Class-conditioned UNet with c vector of all zeroes:
The coolest thing I have learned from this project
I think the coolest thing I learned from this project was how to implement iterative CFG denoising. It was very interesting to feed my own images into the diffusion model at different noise levels to see what interesting edits it could come up with. I think it’s also impressive how for inpainting, the edits tend to be seamless with the rest of the image.