Dall-E, Stable Diffusion and the Holodeck
Trying to keep up with the rapidly evolving space of Visual A.I.
Welcome to the “Art of Intelligence”, a newsletter that keeps you up to date on what’s happening in A.I. and serves as a space to share some of my own thoughts on how things are evolving. “Art of Intelligence” is a working name suggested by OpenAI’s GPT-3 when I asked it for a clever name for a regular newsletter about A.I. - which, I thought, was a pretty good name and an interesting example of A.I.’s creativity :)
It feels like in the past six months A.I. has reached a sort of tipping point, both in terms of its capabilities and its share of mainstream developer attention. Things are moving so fast that it’s hard to keep up. I’ve found myself spending more and more of my time in the space, and this newsletter aims to share what I’ve found interesting in the world of A.I. each week, along with a few higher-level thoughts about the state of the tech, the tools, and how recent developments might impact society. I’m trying out this more topical newsletter to see if people enjoy the content and I have a good time writing it - but let’s get right into it!
Please give me your feedback on the format of this newsletter, and start sending along anything you see that’s interesting in the field of A.I., or topics you’d like me to explore more for future newsletters!
🤯 Stable Diffusion, Dall-E & The Holodeck
If you haven’t been paying attention, AI has been getting mind-blowingly good at generating and editing images based on text prompts. We’re in “genie” territory now, where you can just describe an image, a movie, or a product, and have it appear.
In mid-July, for example, I launched a very early prototype of a store I’m calling We-A.I. (weai.store), where all products are designed by A.I. and printed / manufactured on demand. I was mainly curious to see what’s possible, and how products designed by A.I. turn out in practice.
My friend’s daughter asked for a shirt with dinosaurs playing baseball in outer space, and a few minutes later I was able to generate the shirt and send her a link to the actual product in the store, where she could order it for real: https://weai.store/products/kids-rash-guard. Last night I wore a hat designed by A.I. to an event. On-demand manufactured, A.I.-generated products are going to be a thing. If you visit that product page for the dinosaurs-playing-baseball-in-outer-space shirt and read the description, you’ll also notice that I had GPT-3 write a little story about the theme, which is also quite charming. There’s something beautiful about people feeling like they have this new collaborative capacity to design their own products.
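As an aside, the GPT-3 part of that is a tiny amount of code. Here’s a rough sketch of the kind of completion call I mean, using OpenAI’s Python library - the prompt and parameters here are illustrative, not the exact ones behind the store listing:

```python
import openai  # the "openai" Python package (pre-1.0 API style)

openai.api_key = "YOUR_OPENAI_API_KEY"

# Illustrative prompt and settings -- not the exact ones used for the product page.
prompt = (
    "Write a short, charming product story for a kids' shirt featuring "
    "dinosaurs playing baseball in outer space."
)

response = openai.Completion.create(
    model="text-davinci-002",  # a GPT-3 completion model
    prompt=prompt,
    max_tokens=200,
    temperature=0.8,
)

print(response["choices"][0]["text"].strip())
```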
It’s been a fun experiment seeing what’s possible in the space of A.I.-generated products. The realm of product design and A.I. deserves its own full-length post, which might be the focus of next week’s edition.
For now, you can read more about some of the ideas I want to explore with We-A.I. in this brief We-A.I. concept vision statement. But seeing something like this Dall-E generated “Lamp by Rodin” makes me realize that visual AI is going to result in a massive shift in who participates in design, and how things get collaboratively designed with our A.I. assistants. I think we’re just at the beginning of seeing how brands employ these new capabilities.
We have a few players that have gotten most of the attention when it comes to image generation:
Dall-E 2 from OpenAI
Dall-E 2 was announced in April 2022
Still in waitlist mode; the API is not yet available.
They have their own built-in filters & restrictions on which prompts are allowed and which kinds of images can be generated (attempting to block political, sexual, and violent content)
Stable Diffusion from Stability.Ai
Released a month ago, Stable Diffusion made it possible to run the model yourself, and has far fewer restrictions on access.
The result has been an absolute explosion of usage and interesting applications built out in 30 days, which is a testament to how fast the field is moving.
My hope is to do a deep dive into Stable Diffusion this weekend to see how its generation abilities compare to Dall-E’s.
Midjourney
A small “applied research lab that makes products” founded by David Holz (formerly of Leap Motion).
They have over 2 million Discord members, making it the most active Discord server
Entered open beta on July 12, 2022
Imagen
Out of Google Research - in some ways it feels like Imagen has garnered less mainstream attention. From their website it looks like they also haven’t opened up broad access to the tool.
I got early access to OpenAI’s Dall-E 2 image generation model back in May, and like others was impressed by the results. I attended an event hosted by the OpenAI team and what struck me the most about their presentation was just how fast they suggested the models were improving. It seemed intuitive to me that if the models continue to improve in their ability to generate and edit images, then it’s not a far leap to go from image generation to video generation. Think about the Muybridge horse movement stills - being able to generate the “next” likely frame seemed like a relatively simple task for Dall-E given the existing image editing capabilities they offer.
I thought the move to A.I.-based video generation would likely happen within one to two years. Maybe eventually you could just describe a high-level plot for a film and have GPT-3 or some other large language model (LLM) write you a script, with Dall-E or another visual content generation model generating the video.
I didn’t expect just how fast things would move! I also didn’t expect players like Stability.AI to come in and take attention away from OpenAI’s ironically more closed approach - Stability.AI is closer to being an open OpenAI.
It didn’t take years to get to A.I. video generation - it only took a few months. People have already started to make mind-bending videos using both Dall-E and the more open, permissive, and programmatically extendable Stable Diffusion models. I’ve included a bunch of examples below that have caught my eye.
In the Star Trek series, especially Star Trek: The Next Generation, there’s a technology often featured in episodes called The Holodeck. The Holodeck is a room-sized computer that people can step into and that, when turned on, simulates entire alternative physical worlds. You can walk in any direction and interact with simulated objects and characters, and the computer generates narratives and content dynamically as you go. Here’s a clip from one of the early ST:TNG episodes featuring The Holodeck:
Watching the staggering pace of activity and improvement in these image generation models, one of the thoughts that started bubbling up for me a few months back is that we might be closer to Holodeck-like technology than we realize.
There’s a lot of physics involved in making things “touchable”, but in terms of dynamically generating visual content, credible characters, and narratives, it feels like - if we project out the rate of improvement of models like Dall-E - we might not be so far away.
The leap from rendering videos to rendering content for virtual reality headsets like the Oculus is likely not that large - perhaps somewhat more computationally intensive, but seemingly within reach. I imagine we’re likely only years away from being able to browse these kinds of visual worlds generated from a few starting prompts. Being able to describe a world and then wander through it will be its own interesting, novel type of experience.
Beyond the realm of virtual reality, there are also interesting developments in different approaches like augmented reality and ambient, projector-based computing. If a future where everyone is wearing headsets doesn’t seem very appealing, consider alternative approaches like Bret Victor’s Dynamicland project:
and Lumen’s Augmented Reality flashlights
https://www.lumen.world
These projective computing interfaces offer an alternative vision of what computing might look like, where the computing medium enhances existing reality, allowing for more full-body, in-this-realm, not-in-a-headset experiences.
For some early examples of the types of video people are creating with Dall-E and Stable Diffusion, check these out:
RunwayML (runwayml.com) | Twitter: https://twitter.com/runwayml
RunwayML is a new collaborative online video editing tool focused heavily on bringing A.I. content generation and editing into video workflows. They’ve been putting out some really interesting teaser videos of the tools they’re building, and from the Stable Diffusion launch post it seems they’re also closely involved with Stability.ai:
This Stable Diffusion AR demo is also an impressive exploration of overlaying these new AI-generated visuals on to the real world:
Developers couldn’t hack and automate around Dall-E because OpenAI didn’t open up an API for programmatic access. Stable Diffusion opened up its models and let people build on top, and within 30 days people have built all sorts of tools & experimented with video creation on top of this model - have a look at the Twitter search for “Stable Diffusion Video” for some examples: https://twitter.com/search?q=stable diffusion video&src=typed_query. Here’s a few of my favorites:
There are just so many applications for this tech, and lots of developers are rushing to figure out where best to apply these new visual media generation abilities. I’ve included some of the early tools and plugins below, but I imagine that A.I. will be deeply embedded and essential to all design tools and content creation tools.
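To make concrete what “building on top” looks like: the trick behind many of these early clips is to loop the model’s image-to-image mode, feeding each generated frame back in as the starting point for the next one. Here’s a rough sketch of that idea using Hugging Face’s diffusers library - one common way to run Stable Diffusion programmatically - with the caveat that argument names and model IDs vary a bit between library versions:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load the open Stable Diffusion weights in half precision so they fit on a consumer GPU.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a horse galloping across a field, cinematic lighting"
frame = Image.open("first_frame.png").convert("RGB").resize((512, 512))  # placeholder start image

for i in range(24):
    # A low "strength" keeps each new frame close to the previous one,
    # which is what gives these clips their (wobbly) temporal coherence.
    frame = pipe(prompt=prompt, image=frame, strength=0.35, guidance_scale=7.5).images[0]
    frame.save(f"frame_{i:03d}.png")

# The frames can then be stitched into a clip, e.g. with ffmpeg:
#   ffmpeg -framerate 12 -i frame_%03d.png out.mp4
```

Everything beyond that is craft: slowly morphing the prompt between frames, interpolating in latent space, smoothing the output, and so on.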
It’ll be interesting to see how these visual A.I. approaches also start to alter our physical, day-to-day worlds, whether it’s with A.I.-designed products like I’m lightly exploring with the We-A.I. store, A.I. architecture, or more augmented-reality-style experiences.
Along this theme, there are a few opportunities that seem readily at hand for folks to work on.
The Stable Diffusion augmented reality overlay demo I posted above makes me think that augmented reality mirrors and cameras are interesting product directions to explore, where you could quickly try on new fashions or looks - an idea that’s been tried before, but one that might have become more interesting again in light of recent advances in the field. A.R. and general video rendering technology also seem really interesting in the space of furniture, architecture, and product design, where you could quickly render 3D mocks of what these objects might look like.
As rendering capabilities get better and better, another interesting opportunity space is figuring out how to take A.I.-generated concepts and bring them closer to the specifications needed to actually construct or manufacture those types of goods.
One interesting opportunity that comes to mind is taking a rendering from one of these visual A.I. models and automatically generating 3D models for tools like Revit or Maya, or CAD drawings that can be sent off to 3D printers or used in architectural processes.
Interesting Finds & Reads
Adept.ai - Universal A.I. assistant - automation across tools using natural language.
Adept is trying to build an A.I. assistant that knows how to use the web. They bill themselves as “Useful General Intelligence”.
They demoed their Action Transformer (ACT-1) recently, showing how they were able to train an agent to perform tasks on various websites and within various web applications based on written requests:
1/7 We built a new model! It’s called Action Transformer (ACT-1) and we taught it to use a bunch of software tools. In this first video, the user simply types a high-level request and ACT-1 does the rest. Read on to see more examples ⬇️

Their idea is that “language” will be the interface for workflow automation, and that giving an agent the ability to understand how language corresponds to workflows across a set of tools is a path to creating useful A.I. assistants for getting work done. It feels very much like one potential evolution of robotic process automation & workflow automation. The aim, as they write it, is to build a “Universal Collaborator” that can help out every knowledge worker.
John Carmack, programmer of Doom and Quake and formerly CTO of Oculus, recently left to start his own A.I. startup, and in an interview with Lex Fridman (https://lexfridman.com/john-carmack/) he also describes a vision where people have universal A.I. virtual assistants helping them accomplish their tasks. It’s a great, though long, podcast on the topic of A.G.I. and much more.
Adept raised a substantial $65M round, and its team consists of folks from Google Brain, DeepMind & OpenAI.
First month of Stable Diffusion.
We’re a little more than 30 days out from Stable Diffusion’s public release (Aug 22). They had released the model to researchers on the 10th: https://stability.ai/blog/stable-diffusion-announcement
You can run the model locally as per their blog post:
Stable Diffusion runs on under 10 GB of VRAM on consumer GPUs, generating images at 512x512 pixels in a few seconds
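For reference, “running it yourself” takes surprisingly little code. A minimal sketch, again assuming the Hugging Face diffusers library (loading the weights in half precision is what keeps VRAM under that roughly 10 GB figure):

```python
import torch
from diffusers import StableDiffusionPipeline

# fp16 weights keep memory usage on a consumer GPU under roughly 10 GB.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "dinosaurs playing baseball in outer space, digital art",
    height=512,
    width=512,
    num_inference_steps=50,
).images[0]

image.save("dinosaurs.png")
```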
The work and expense they had to put in to train the model is a different matter, however:
The model was trained on our 4,000 A100 Ezra-1 AI ultracluster over the last month as the first of a series of models exploring this and other approaches.
“This release is the culmination of many hours of collective effort to create a single file that compresses the visual information of humanity into a few gigabytes.” - I really liked this line from their public announcement - it captures the sense that these A.I. image generation models are our collective conscious & unconscious made manifest, and left to expand.
Speech-to-Text
OpenAI launched Whisper, a neural network for speech recognition, this past week: https://openai.com/blog/whisper/. The models are open-source, and there are a few ideas I’ve wanted to tinker with around speech recognition, so I might give this a spin.
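If you want to give it a spin too, the open-source package makes a first transcription pass about as easy as it gets - a minimal sketch, assuming the whisper package and ffmpeg are installed, and using a placeholder audio file name:

```python
import whisper  # pip install openai-whisper (ffmpeg must also be installed)

# Model sizes range from "tiny" up to "large"; bigger is slower but more accurate.
model = whisper.load_model("base")

result = model.transcribe("some_recording.mp3")  # placeholder file name
print(result["text"])
```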
Other services in this realm include Assembly.ai
In the other direction, there’s also a host of startups working on speech synthesis - going from text to speech using A.I. voices. These include companies like Sonantic, which was just acquired by Spotify this summer, Papercup, and Replica Studios.
This blog post takes a deep dive into techniques for hyper-realistic deep-fake video generation: The Road to Realistic Full-Body Deepfakes - Metaphysic.ai
A.I. Tools
Beyond A.I.-native tools like RunwayML, developers & designers are racing to add plugins to popular tools for image, video, 3D modeling, and other work with visual assets (think product design, architecture, and more). It’s been impressive to watch how quickly some of these tools are getting adoption:
Figma Plugin Ando - Stable Diffusion plugin for Figma — https://www.figma.com/community/plugin/1145446664512862540/Ando---AI-Copilot-for-Designers
6K people have already tried it out!
Photoshop Stable Diffusion plugins:
Alpaca: https://www.getalpaca.io/
Flying Dog: https://t.co/MMlc9ElH2V
A fun canvas-based UI exploration for Stable Diffusion. I’m excited to see what kinds of design interfaces folks come up with for incorporating these models - this one by Amelia Wattenberger caught my eye.
People are building plugins for Unreal Engine - the game & real-time 3D engine
Hope you enjoyed this quick summary of what I’ve seen happening lately in the visual A.I. space. Future posts will dive into other areas like:
GPT-3 and what that tech’s opening up as well.
Societal considerations around how these new A.I. technologies will make copyrights, patents, and other legal structures increasingly confusing to navigate.