Ars Technica
On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal model that can analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions. The researchers believe multimodal AI, which integrates different input modes such as text, audio, images, and video, is a key step toward building artificial general intelligence (AGI) that can perform general tasks at the level of a human.
“Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world,” the researchers wrote in their academic paper, “Language Is Not All You Need: Aligning Perception with Language Models.”
Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, writing captions for images, and taking a visual IQ test with 22-26 percent accuracy (more on that below).
An example, provided by Microsoft, of Kosmos-1 answering questions about images and websites.
An example provided by Microsoft of “multimodal chain-of-thought prompting” for Kosmos-1.
An example of Kosmos-1 doing visual question answering, provided by Microsoft.
While the media buzzes with news about large language models (LLMs), some AI experts point to multimodal AI as a potentially clearer path toward artificial general intelligence, a hypothetical technology that could ostensibly replace humans at any intellectual task (and any intellectual job). AGI is the stated goal of OpenAI, a key business partner of Microsoft in the AI space.
In this case, Kosmos-1 appears to be a pure Microsoft project without OpenAI’s involvement. The researchers call their creation a “multimodal large language model” (MLLM) because its roots lie in natural language processing, like text-only LLMs such as ChatGPT. And it shows: For Kosmos-1 to accept image input, the researchers must first translate the image into a special series of tokens (basically text) that the LLM can understand. The Kosmos-1 paper describes this in more detail:
For input format, we flatten input as a sequence decorated with special tokens. Specifically, we use <s> and </s> to denote the start and end of a sequence. The special tokens <image> and </image> indicate the beginning and end of encoded image embeddings. For example, “<s> document </s>” is a text input, and “<s> paragraph <image> Image Embedding </image> paragraph </s>” is an interleaved image-text input. … An embedding module is used to encode both text tokens and other input modalities into vectors. Then the embeddings are fed into the decoder. For input tokens, we use a lookup table to map them into embeddings. For the modalities of continuous signals (for example, image and audio), it is also feasible to represent the inputs as discrete codes and then regard them as “foreign languages.”
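To make that description concrete, here is a minimal toy sketch in Python of what such a flattening step could look like. The vocabulary, dimensions, and function names are illustrative assumptions for this article, not Kosmos-1’s actual implementation:

import torch
import torch.nn as nn

DIM = 32
SPECIALS = ["<s>", "</s>", "<image>", "</image>"]
vocab = {tok: i for i, tok in enumerate(SPECIALS + ["document", "paragraph"])}

token_embed = nn.Embedding(len(vocab), DIM)  # lookup table mapping discrete tokens to vectors
image_proj = nn.Linear(64, DIM)              # projects image-encoder features to the model width

def embed_text(words):
    ids = torch.tensor([vocab[w] for w in words])
    return token_embed(ids)                  # shape: [len(words), DIM]

def flatten_interleaved(segments):
    """segments: list of ("text", [words]) or ("image", feature_tensor) pieces."""
    parts = [embed_text(["<s>"])]
    for kind, value in segments:
        if kind == "text":
            parts.append(embed_text(value))
        else:  # continuous signal: project its features, then wrap them in <image> ... </image>
            parts.append(embed_text(["<image>"]))
            parts.append(image_proj(value))  # shape: [num_patches, DIM]
            parts.append(embed_text(["</image>"]))
    parts.append(embed_text(["</s>"]))
    return torch.cat(parts, dim=0)           # one flat sequence fed to the Transformer decoder

# Equivalent of "<s> paragraph <image> Image Embedding </image> paragraph </s>"
fake_image_features = torch.randn(4, 64)     # stand-in for a vision encoder's patch features
seq = flatten_interleaved([("text", ["paragraph"]),
                           ("image", fake_image_features),
                           ("text", ["paragraph"])])
print(seq.shape)  # torch.Size([10, 32])

The key idea is simply that text tokens and image features end up in one shared sequence of vectors, so the decoder treats images as just another stretch of “language.”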
Microsoft trained Kosmos-1 using data from the web, including excerpts from The Pile (an 800GB English text resource) and Common Crawl. After training, the researchers evaluated Kosmos-1’s abilities on several tests, including language understanding, language generation, OCR-free text classification (reading text in images without optical character recognition), image captioning, visual question answering, web page question answering, and zero-shot image classification. In most of these tests, Kosmos-1 outperformed current state-of-the-art models, according to Microsoft.

Of particular interest is Kosmos-1’s performance on Raven’s Progressive Matrices, which measures visual IQ by presenting a sequence of shapes and asking the test taker to complete the sequence. To test Kosmos-1, the researchers fed it a filled-out test, one option at a time, with each candidate answer completing the sequence, and asked whether the answer was correct. Kosmos-1 could only answer a question on the Raven test correctly 22 percent of the time (26 percent with fine-tuning). That is by no means a slam dunk, and errors in methodology could have affected the results, but Kosmos-1 beat random chance (17 percent) on the Raven IQ test.
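As a rough illustration of that protocol (not Microsoft’s actual evaluation code), the scoring loop could look something like the Python sketch below; the puzzle object and the model’s score_yes helper are hypothetical stand-ins:

def solve_raven_item(model, puzzle, candidates,
                     prompt="Is the completed matrix correct?"):
    # Paste each candidate answer into the blank cell, ask the model whether the
    # completed matrix is correct, and pick the candidate it finds most plausible.
    scores = []
    for option in candidates:
        completed_image = puzzle.fill_blank(option)              # hypothetical helper
        scores.append(model.score_yes(completed_image, prompt))  # e.g., log-probability of "Yes"
    return max(range(len(candidates)), key=lambda i: scores[i])

def raven_accuracy(model, items):
    # items: list of (puzzle, candidates, index_of_correct_answer) tuples
    hits = sum(solve_raven_item(model, p, c) == answer for p, c, answer in items)
    return hits / len(items)

With six candidate answers per puzzle, random guessing lands near 1/6, or about 17 percent, which is the chance baseline that Kosmos-1’s 22-26 percent result is compared against.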
Still, while Kosmos-1 represents early steps in the multimodal domain (an approach also being pursued by others), it’s easy to imagine that future optimizations could bring even more significant results, allowing AI models to perceive any form of media and act on it, which would greatly enhance the abilities of artificial assistants. In the future, the researchers say, they would like to scale up Kosmos-1 in model size and integrate speech capability as well.
Microsoft has said it plans to make Kosmos-1 available to developers, although the GitHub page cited by the paper did not include any Kosmos-specific code at the time of this story’s publication.