The Text-Conditioned Image

Julia Minarik is a Ph.D. Candidate in philosophy at the University of Toronto. Her dissertation is on Creativity in the Age of AI. She spends her spare time observation drawing, taking polaroid pictures, and painting images of apples. Her website is here: https://sites.google.com/view/julia-e-minarik/philosophy

A post by Julia Minarik

And someday there will be a more complete machine. One’s thoughts or feelings during life – or while the machine is recording – will be like an alphabet with which the image will continue to comprehend all experience…Then life will be a repository for death.

– Adolfo Bioy Casares, The Invention of Morel.

Do machine-made images have less content than images created by human hands? I think they do, at least insofar as these machines are imaginatively impoverished: they have access only to words and images, far fewer sensory and emotional modalities than we have. Contemporary image generators – Midjourney, DALL-E and their kin – produce images via generative diffusion conditioned on text or images. To grasp what this limitation means for the content of their art, we must first understand how art arises from and leads the human imagination.

With art we aim to steer the imaginations of others (Walton 1990). Let’s take words in a short story first:

Imagine a seagull in the sand on the beach, he’s at the top of your blanket eagerly awaiting an opportunity to steal a grape from your cooler.

With those words (if you’re not aphantasic), I can get you to produce an image in your mind of a scheming gull. Let’s now start from images:

With this image, I can get you to see a house and some trees from an unnatural angle. The image dredges up these concepts, and we associate them with their labels ‘house’ and ‘tree’ and so on. Since both images and words give rise to the other in imagination, W. J. T. Mitchell claims that “…all arts are ‘composite’ arts (both text and image); all media are mixed media” (1995, 95): words cause us to conjure images, and images cause us to conjure words. That’s not to say what art literally is, nor to say that I cannot psychologically bring up only images or only words, but it is to say that the arising of multi-modal associations motivates the work’s creation and is presumed to be part of the effect of an artwork on its observer.

We might take prompting an image generator to be analogous to steering its imagination. The text prompt guides the machine’s ‘imagination’ around its internal imaginative space. Of course, text and images are not directly translatable (Mitchell 1995, 5; Goodman 1976). Take a sentence such as ‘I sat on a beach in July’: it can be visualized in thousands of different ways. Now take any one of those images and try to describe it: ‘I sat on a beach in July’ is only one description out of thousands. What text does is constrain the possible images produced. One might analogize the addition of these textual associations to the addition of a proposition to the common ground in a conversation; it closes off some possible images and redirects one’s attention to others. Once prompted, the machine produces, roughly, a randomly selected image which corresponds to the prompt. The more specific the prompt, the less variation in the images, as we can see by comparing the low variation between figures 2 and 3 with the higher variation between figures 4 and 5:

*Figure 2. An image produced by the following prompt: A bold and arresting acrylic painting of a single pyrrole red apple sitting on a deep chromium green book, decorated in a fine and delicate diagonal grid of gold. The painting is somewhat illustrative, and not hyper-realistic. The paint is layered many times which seems to give the apple an inner glow. The light is hitting the top left hand side of the apple which is casting a dark shadow onto the book. The book and apple are both against a flat dark background. The colors of the image have a medium saturation and the vibe or mood of the image is somewhat somber and contemplative. You can see the very fine woven texture of the canvas through the paint. The texture of the paint is thick and visible, there are small brushstrokes which catch the light.*

*Figure 3. Another image produced by using the same prompt as in Figure 2.*

*Figure 4. An image produced by the prompt: A seagull who is a boss.*

*Figure 5. A second image produced by the prompt: A seagull who is a boss.*

Given this incommensurability and randomness, guiding the machine as a human can be deeply irritating: words are a coarse brush, and what I see in my imagination never quite matches up to what comes out (this should be further explored as a limitation on the content of these machines, especially in light of what I say below).

Alright, so art arises from our imagination, from which we produce things to guide the imagination of others, and the machine does something similar. But for humans, the composite nature of images and words doesn’t stop at words and images. An image of a rose by any other name does not smell as sweet: call it The Sweetest Flower and it comes on like honey, but rename it Intestinal Bloom and feel your nose twitch back in disgust at the visceral folds. In a human creator/observer, with a multimodal imagination (Nanay 2018) and an emotional landscape, representations from other sensory modalities and feelings come into play too. The words and images of art dredge up other conceptual, perceptual, and emotional associations: the story of the gull brings up the smell of the sea, the sound of the gull.

When we create, we manipulate words, images, and the like to evoke precise imaginative effects; we choose them because of this power. Roland Barthes says in Rhetoric of the Image: “the linguistic message…[constitutes] a kind of vice which holds the connoted meanings [of the image] from proliferating…it limits, that is to say, the projective power of the image” (1977, 39). I take the image of the house we saw earlier and name it ‘Grandmother’s House’ (figure 6) to bring out its eerie tone, evoke the big bad wolf and the smell of mothballs, make you feel the cold air of night.

I can also illustrate the word by bringing an image to it, as I do with The Thief (figure 7), meant to illustrate the short story about the gull above:

In the image, the sun is warm, the grapes of the story are soft, the seagull and you are in a complex battle of wits.

When we create art as humans, we create with all these imaginative associations in mind. I prompt myself to paint an apple. The word ‘apple’ brings up associations, images, textures, tastes. I paint the image of the apple, but when I paint the texture, I think of how the image triggers a tactile association in the viewer. I make the apple smooth so that I (and then you) can feel its smoothness. If I see the image and don’t imaginatively feel its smoothness, I change it. Images are not made with merely visual thoughts in mind. Each evokes multi-modal perceptual, conceptual, and emotional associations in us. These associations alter our imaginative engagement with the image: they recontextualize it for us, give it a depth of content and meaning. Even a simple title like ‘Seagull’ seems to direct our engagement with the image: it focuses our attention on the gull rather than the compositional role the sign in the back is playing.

These imaginative motivations are why titles like Intestinal Bloom for the flower are, as Jerrold Levinson (2011) argues, aesthetically relevant features of artworks that come to partially determine the work’s content. This idea is also implicit in Arthur Danto’s introduction to The Transfiguration of the Commonplace (1981): perceptually indistinguishable red squares – Kierkegaard’s Mood and The Red Tablecloth – have distinct content and evoke distinct responses once their titles are known. Even when works appear identical to us, knowledge of the author’s imaginative intentions fills the works out with distinct contents. Or if, like me, you favor a performance view of artworks (Davies 2003), then the work itself is the performance that generates a focus of appreciation, in which case the imaginative motivations are a part of the work.

So, what does this mean for images produced by an image generator? Do they have less content than visually identical images produced by a commissioned human? Based on the analogy above, these machines create by producing a ‘snapshot’ of part of their imaginative space. But this space is comparatively vacant: the machine’s imagination contains only words and images. When machines imagine, these are the only contents available to them; other sensations and emotions cannot motivate their choice of the image, nor can they wish to transfer these feelings to another. Even the machine’s interpretation of ‘somber’ in the apple image is a mere pictorial and conceptual association with the word; it cannot be motivated by being felt. At best the machine was led by visual proxies for that content. Of course, there are complexities here: this view favors a form of artistic intentionalism, and perhaps the machine, even if it can’t feel things, might intend to evoke them in us – though I think there’s a big question about whether this is enough to get it into the work. This is also not to say that a human can’t prompt an AI with their own imaginative intentions and add content to the image through their choices. But this content is the human’s; it does not emerge from the machine alone.

References:

Barthes, Roland. 1977. Image, Music, Text. Fontana Press.

Danto, Arthur Coleman. 1981. The Transfiguration of the Commonplace: A Philosophy of Art. Vol. 40. Harvard University Press.

Davies, David. 2003. Art as Performance. Wiley-Blackwell.

Goodman, Nelson. 1976. Languages of Art: An Approach to a Theory of Symbols. 2nd ed., fourth printing 1981. Hackett Publishing Company.

Levinson, Jerrold. 2011. “Titles.” In Music, Art, & Metaphysics. Oxford University Press.

Mitchell, W. J. T. 1995. Picture Theory: Essays on Verbal and Visual Representation. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/P/bo3683962.html.

Nanay, Bence. 2018. “Multimodal Mental Imagery.” Cortex; a Journal Devoted to the Study of the Nervous System and Behavior 105 (August): 125–34. https://doi.org/10.1016/j.cortex.2017.07.006.

Walton, Kendall L. 1990. Mimesis as Make-Believe: On the Foundations of the Representational Arts. Harvard University Press.