If human creativity is to be murdered, as many have insisted in the face of artificially generated media, we must become detectives assigned to our own case. Divining what the future of generative art will look like requires introspection into our varied, collective, historic relationships with painting, writing, photography, sculpture, theatre, ballet, and so forth.
If these modalities are at present dangerously close to being taken over by computer chips, then we must reassess our relationship with them. That is, we must take as objective a look as possible at how strong a case can be made that human creativity is something special. When we communicate through creation, is it something novel? Or is it computationally inevitable? And if it’s the latter, then what are we to make of our own inevitability?
In various ways over the past 200 years, it has often been said that we are a species with amnesia. In the 1800s, this idea echoed through the Romantic era's idealization of a forgotten golden age, the burgeoning field of archaeology uncovering lost civilizations, and the revival of ancient myths and folklore as remnants of a once richer heritage. New esoteric movements emerged, advocating a reconnection with ancient wisdom believed to have been lost over time. Much of this was spurred by the Industrial Revolution, as workers, robbed of purpose, longed for a healthier collective psyche. Today we seem to find ourselves again sitting upon the precipice of a great restructuring of society, this time through artificial intelligence.
If there is a path forward, we don’t seem to know what it is.
Mired in our twenty-first-century obsession with information technology, we are both aghast at the direction our self-driving cars are taking us and unsure of how to take back control. Some of the advancements we’re experiencing promise to be radical in scope. The multifaceted nature of artificial intelligence reflects a spectrum of outcomes, transcending simplistic good-bad dichotomies. Its applications in healthcare, energy management, data analysis, agriculture, disaster response, accessibility, and transportation safety offer a landscape of both anticipated benefits and unanticipated consequences, ranging from predictive diagnostics and efficiency enhancements to reinforced biases, ecological impacts, and nuanced human-AI interactions.
AI’s sharp edge is thus not merely dual but multi-dimensional, and this multidimensionality is key. We are used to picking sides on topics; much of our media coverage is couched this way. But artificial intelligence, because it is not human intelligence, doesn’t follow the same social conventions. Understanding the scope of things to come requires expanding our notions of cause and effect to include perspective. What may seem a thin, narrow, two-sided line holds infinite depth when viewed from enough angles.
The multimodally inclined Sarah Pajunen captures the aural landscape of Minnesota's Iron Range in her Mine Songs Project. Her field recordings offer a sense of place and of history, as captured in this story from WDSE's Making It Up North.
Put plainly, there is much more than a point and a corresponding counterpoint for each advancement. So what are we left with? Certainly, one of the most overlooked aspects of this new era is the multimodality it represents. By multimodality, we could mean many things. Here we refer to only two. The first is technical and practical, while the second is farther-reaching and theoretical:
a) Multimodality, in the technical sense, refers to combining diverse types of data, like text, images, audio, and video, to enhance the contextual effectiveness and accuracy of general information retrieval and user query interpretation (see the sketch following this list).
b) Multimodality, in a broader sense, encapsulates the integration of diverse philosophical perspectives, revealing a rich tapestry of potential realities each shaped by its unique combination of guiding principles.
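To make the technical definition concrete, consider how a single text query can be scored against a set of images. What follows is a minimal sketch in Python using the open-source CLIP model through Hugging Face's transformers library; the model name is real, but the gallery filenames and the query are illustrative placeholders:

```python
# A minimal sketch of cross-modal retrieval: score event photographs against
# a text query using OpenAI's CLIP model via Hugging Face transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical gallery assets: a few photographs from the event.
paths = ["talk_wide.jpg", "talk_speaker.jpg", "gan_piece.jpg"]
images = [Image.open(p) for p in paths]
query = "an art critic lecturing about artificial intelligence"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds one similarity score per image for the text query;
# the highest-scoring image is the best cross-modal match.
scores = outputs.logits_per_image.squeeze(1)
best = scores.argmax().item()
print(f"Best match for the query: {paths[best]} ({scores[best].item():.2f})")
```

The same embedding trick runs in the other direction, too: an image can retrieve the passages of text, audio transcripts, or video segments that live nearest to it in the shared vector space.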
That second definition may sound like hyperbole, but the technical explanation and the philosophical one are eddies in the same river of thought. What begins as a simple exercise in allowing multiple file formats to communicate with each other soon requires a philosophy that allows the relationships between those file formats to be interpreted in multiple directions.
Originally recorded August 11, 2022 – Opening Reception for James Woodfill's Crossing Signals Exhibition. Signals come in a variety of forms: Morse code, flags, stop signs, frequencies. What is not commonly perceived is that these are, in fact, also art. We rarely think of them as such because of their utility, but Woodfill directs our gaze back to that which we’ve overlooked: namely, the superfluous beauty of all the invented languages surrounding us. Directing us. Speaking to us.
If we’re to have any hope of keeping up with the plasticity of thought most neural nets seem to exhibit, we at least need to be able to expand our thinking to accommodate diverse relationships between multiple file types.
As an example, imagine a prominent art critic gives a talk on the current influence of artificial intelligence upon the public's perception of art. The entire event is recorded in every way: a videographer captures the essence of the talk in moving images while a professional photographer snaps away, and a local writer compares notes with an anxious podcaster who's trying to position themselves for the best audio quality. Fast-forward to post-production and the publication of all materials for educational and marketing purposes. The video must be able to build some sort of contextual foundation when paired with the accompanying text, sound files, and images.
In other words, all the file formats need to talk to each other to be truly useful. If we treat each of these items as wholly separate, we have misunderstood both the power of the ever-more-semantic internet and the direction it is taking. It’s not enough to simply have images from the event on the same webpage as the embedded video, article, and related podcasts. That sort of mishmash is considered baseline digital marketing nowadays. It’s expected.
What isn’t expected is a folding of multiple perspectives into each connection. With a limited number of file types, we can now facilitate limitless interpretation.
What needs to happen is an allowance for incorporating these file formats into novel combinations through the many lenses of external generative interpretation (that is, interpretations outside of, or orthogonal to, the original intent of the event).
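One unglamorous but concrete starting point is shared structured data, so that crawlers and LLM-based agents can traverse the event's assets as a single context. Below is a hedged sketch that emits schema.org JSON-LD binding the hypothetical talk's video, audio, image, and article together under one Event node; every name and URL is a placeholder:

```python
# A sketch of schema.org JSON-LD binding one event's media files together,
# ready to embed in the page via a <script type="application/ld+json"> tag.
import json

event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "AI and the Public Perception of Art",        # hypothetical title
    "performer": {"@type": "Person", "name": "Jane Doe"},  # placeholder critic
    "recordedIn": {
        "@type": "CreativeWork",
        "hasPart": [
            {"@type": "VideoObject", "contentUrl": "https://example.org/talk.mp4"},
            {"@type": "AudioObject", "contentUrl": "https://example.org/podcast.mp3"},
            {"@type": "ImageObject", "contentUrl": "https://example.org/photo-01.jpg"},
            {"@type": "Article", "url": "https://example.org/talk-writeup"},
        ],
    },
}

print(json.dumps(event, indent=2))
```

Markup like this doesn't dictate the interpretations a generative system will produce; it simply guarantees that the video, audio, images, and text arrive together, so external reinterpretation has something whole to work with.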
David Bowen, "waveline", video mapped to LED screens, 2018 (view at Minnesota Museum of American Art, still). Bowen commonly layers novel forms of communication between existing file types into his work.
A lot of this is already happening whether we condone it or not. Now that search agents based on Large Language Models (LLMs) can easily crawl the transcripts of videos, the moving image has a more direct way of harmonizing with written text. The video of the event now weaves directly into the article that was written, which in turn resonates with the transcript of the podcast. But these are heavily choreographed, internal perspectives.
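As a small illustration of how routine this weaving has become, the sketch below pulls a video's transcript with the open-source youtube-transcript-api package so its words can be indexed alongside the article text. The video ID is a placeholder, and the call shown is the package's classic interface, which may differ across versions:

```python
# A sketch: fetch a talk video's transcript so its spoken words can be indexed
# alongside the written article. Requires: pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

VIDEO_ID = "VIDEO_ID_HERE"  # placeholder for the event recording's YouTube ID

segments = YouTubeTranscriptApi.get_transcript(VIDEO_ID)

# Each segment carries its text plus a start time, so a search hit in the
# transcript can deep-link back to the exact moment in the video.
for seg in segments[:5]:
    print(f"[{seg['start']:7.1f}s] {seg['text']}")

full_text = " ".join(seg["text"] for seg in segments)  # ready for indexing
```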
Novel forms of search based on multimodal questions are possible today and will be entirely expected tomorrow. But what does this mean, exactly? If I’m going about my business on a perfectly ordinary day, why should I wish to use multimodality for anything at all?
The answer is quite simply a better way of getting answers to every sort of question.
Many people don’t yet realize it, but you can easily use ChatGPT or Google’s Bard to take images of your pantry and ask for recipe ideas. The neural nets identify the food in the image to build a list of possible meals. You can use the same machine vision to get recommendations on how to reorganize your house, find out whether the leaves on your indoor plants indicate too much or too little water, or put together a toy airplane whose instructions you lost. All you have to do is provide an image and a corresponding question.
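For the curious, such a question is only a few lines of code. Here is a hedged sketch against OpenAI's Python SDK, assuming a vision-capable model such as gpt-4o and a pantry photo already hosted at a URL (both are illustrative choices, not the only way in):

```python
# A sketch of a multimodal question: one pantry photo plus one text prompt.
# Assumes the OpenAI Python SDK and a vision-capable model (e.g. gpt-4o).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What meals could I make from what you see here?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.org/pantry.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```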
These new interactions are increasing in both ability and usage. Can’t read a set of directions from an old piece of paper you’ve had crumpled up in a drawer for years? Ask a machine to read it for you. You’ll be surprised at what it can piece together from something that may seem illegible to you.
This is the very heart of what both scares and thrills. The surrender to something silicon. Ask Alexa for current movie recommendations based on your interests, and she won’t disappoint. But that’s today. Tomorrow you’ll be able to create movies based on your interests.
We can already generate compelling clips of movies from text, so it’s baked into the current trajectory; given even humble advances in AI-generated video, we’ll be able to do much more. Have you always dreamed of building your own video game? The day after tomorrow only gets crazier, and it requires a new kind of questioning when it comes to how people search for information and media that is generative in nature, especially as it relates to how people find your information.
David Bowen, outsourced narcissism, computer, robotic arm with attached camera, mirror, monitor, cables.
This installation features a computer-controlled robotic arm with a camera, facing a mirror. The computer runs a custom AI neural network trained to recognize the robot. As the robot views itself in the mirror, the AI attempts self-recognition, marking "me" on the robot's image when it succeeds. The robot then adjusts its position based on the annotated image. When its certainty rises above 85%, the system posts a selfie to its Instagram account, @outsourced_narcissism.
But Let’s Regroup…
Let’s bring the focus back to our event: the art critic discussing the current influence of artificial intelligence upon the public’s perception of art. If someone feeds an image of the speaker to a new LLM-based search engine and asks who this is, will it be able to identify the speaker and list, among their many credentials, a link to this particular talk? If someone attended the talk and remembers a phrase the speaker uttered, will repeating it aloud to ChatGPT, Alexa, or Siri yield an appropriate result? Will it yield a result that includes a link to the talk where it was recently uttered?
These are the sorts of questions we need to start asking. Think beyond indexation – to a world where answers are generative (as in, never exactly the same). That world is happening right now.
Let’s assume the accompanying images of the event include individual pieces of art that are emblematic of how artists are using Generative Adversarial Networks (GANs). Will a user halfway around the world who uploads similar pieces of generative art somehow be directed to this talk? How would you, as the gallery owner, orchestrate such a feat?
This is a completely different way of thinking about search engine optimization. It grows more complex as we contemplate the many ways in which the internet is increasingly hyperspatial: through virtual experiences, and through attaching content to virtual representations of physical spaces.
Virtual gallery walk-throughs are now very common, and the amount of corroborating media one can cram into each “step” in “cyberspace” (or the “metaverse”) is ballooning beyond comprehension. As wearables (such as Meta’s new AI-powered Ray-Bans) gain traction, will your gallery be able to provide information in a new generative augmented-reality space? If someone is wearing these new glasses, will asking about a piece of art bring up all the associated content pertaining to it?
It’s easy enough nowadays for anyone with an inexpensive Matterport camera to map out physical areas in a digital format. It’s another matter entirely to make sure the data and file types associated with these “digital twins” are accessible for generative responses through machine learning. To do the latter, one must centralize control over how one’s information is organized online.
Panoramic view of count.map.pulse. breathe, 2019-2020. Kathy McTavish is a media composer, cellist and installation artist whose work blends data, text, code, sound and abstract, layered moving images. Her recent work has focused on creating generative methods for building networked, multichannel video and sound environments. She creates cross-sensory, polyphonic landscapes which flow from the digital web into physical spaces.
As we embrace generative responses, the challenge of maintaining the integrity of a brand's narrative becomes increasingly complex. This new landscape demands a strategic shift in marketing efforts, moving beyond the traditional focus on crafting text, color, and ideation. Instead, the emphasis is now on effectively organizing and managing multimodal data.
In this context, platforms like Google’s Search Generative Experience and Genesis projects, ChatGPT, Bing, and Anthropic’s Claude, as well as the direct embedding of Large Language Models (LLMs) into websites, play a pivotal role. They offer ways to link and leverage information, including text, images, videos, and interactive elements, that were previously unanticipated.
Pivoting one’s marketing efforts to focus more on information management ensures that messaging is represented consistently and coherently across platforms, whether users interact with the content directly or through AI-generated responses. In essence, the art of branding is evolving to focus on orchestrating a symphony of multimodal data, ensuring that every AI-generated interaction or response aligns with the brand’s core values and narrative, thus maintaining its authenticity and resonance in an increasingly AI-driven world.
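What might that orchestration look like in practice? One hedged sketch, using the same OpenAI SDK assumed above: every AI-generated answer on the gallery's site is grounded in the same curated store of approved asset descriptions, so the brand speaks in one voice no matter which modality the question arrives from. The context text, model choice, and helper name are all illustrative:

```python
# A sketch of brand-consistent generation: answers draw only on a curated
# store of approved descriptions rather than on free invention.
from openai import OpenAI

client = OpenAI()

# Curated, brand-approved descriptions of the gallery's multimodal assets.
APPROVED_CONTEXT = """\
Video: recording of the critic's talk on AI and the perception of art.
Images: photographs of the exhibited GAN-based works.
Podcast: audio interview recorded at the opening reception.
"""

def branded_answer(question: str) -> str:
    """Answer a visitor's question strictly from the approved context."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are the gallery's assistant. Answer only from "
                        "the approved context below, in the gallery's voice.\n"
                        + APPROVED_CONTEXT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(branded_answer("Is there a recording of the AI talk?"))
```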
This new dynamism creates a tapestry of connections that were once unimaginable, offering a labyrinth of information and insights. Imagine neural nets instantly cross-referencing art styles with historical events, creating emotion maps from paintings, or transforming visual elements into auditory experiences. Advanced image recognition can trace artistic influences across eras, while augmented reality brings static artworks to life, offering interactive and immersive experiences. These technologies exist right now; we don’t have to wait. We simply need to dive into the madness.
This technological synergy not only deepens our appreciation of humanity but also unveils new dimensions and connections, blending history, emotion, and innovation into a rich, multi-layered mosaic: a generative mosaic that offers the possibility of a new perspective each and every time we ask a question.