Introduction to AI (4). Generative AI: Large Models by Training Data and Task

The core idea of “large models” is to leverage massive datasets and powerful computational resources to train AI systems that can learn complex patterns and generate new content or insights.
The terms Generative AI and LLM come up frequently in discussions of the recent rise of AI; however, they are often incorrectly used as synonyms.
In this post, we will talk about Large Models and how they are classified by training data and task. We will explore specialized models (LLMs, LVMs, etc.) and general or foundation models, and how they relate to the field of Generative AI.
Language Models
To explain the difference between these models, we will use an analogy. Imagine a very talented artist who can paint, sculpt, write poetry, compose music, make movies, or even code a video game. This artist represents Generative AI: it can create many different types of content (text, images, music, etc.).
Now, think of a writer and storyteller who excels at working with words. This writer can compose beautiful poems, translate languages, summarize long stories, and create various types of creative text formats like articles or movie scripts. This writer is what we call an LLM (Large Language Model). An LLM is a specific type of Generative AI that focuses solely on language.
| Feature | Generative AI | LLM (Large Language Model) |
| --- | --- | --- |
| What it is | A broad category of AI that creates new content | A specific type of Generative AI focused on language |
| What it can do | Generate various types of content (text, images, music, code, etc.) | Understand and generate human-like text |
| Examples | Creating images, generating music, writing different kinds of text formats | Writing poems, translating languages |
LLMs are trained on vast amounts of text data to learn patterns and generate new textual content.
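To make this concrete, here is a minimal sketch of text generation with a pretrained language model, using the Hugging Face transformers library. The model name and prompt are just illustrative choices; any causal language model checkpoint would work similarly.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# "gpt2" is an illustrative small checkpoint, not a recommendation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large models are", max_new_tokens=30)
print(result[0]["generated_text"])
```

The pipeline downloads the weights on first use and continues the prompt with text that follows the statistical patterns the model learned during training.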
While the term “LLM” is specific to language, the underlying idea of learning patterns and generating content applies across different domains (image, video, audio, etc.). Each domain has its own leading models and techniques, which evolve constantly with new research.
Vision Models
An emerging concept in the image and video domain is the Large Vision Model (LVM). LVMs are trained on large amounts of image or video data, and they encompass the diffusion models, GANs, and transformer-based architectures that we will see later in this series. They excel at: recognizing objects, scenes, and relationships within images; generating images, either from scratch or by modifying existing ones; and analyzing videos, understanding the actions, events, and sequences they contain.
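As a rough illustration of the "generating images from scratch" capability, here is a sketch using a diffusion model through the diffusers library. The checkpoint name is only one example, and this assumes a machine with a CUDA-capable GPU:

```python
# Sketch of text-to-image generation with a diffusion-based model.
# The checkpoint name is illustrative; assumes a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
image = pipe("an oil painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```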
LVMs are like expert visual artists: they can create and understand visual art, but they don't necessarily "speak" a language to describe it. For that, we use Large Vision-Language Models (VLMs), which combine visual and textual information and bridge the gap between seeing and understanding. VLMs are like art critics who are also artists: they can create and understand visual art, and they can also eloquently describe and analyze it using language. They excel at: connecting images and text, understanding the relationship between images and their descriptions; generating descriptions, creating captions for images or answering questions about them; and multimodal reasoning, using both visual and textual cues to solve problems or make decisions.
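To illustrate the "generating descriptions" capability, here is a minimal captioning sketch using the transformers image-to-text pipeline with BLIP, one commonly used open VLM; the image path is a placeholder:

```python
# Sketch of image captioning with a vision-language model (BLIP).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captions = captioner("photo.jpg")  # placeholder path; URLs also work
print(captions[0]["generated_text"])
```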
| Feature | LVMs | VLMs |
| --- | --- | --- |
| Input | Images, videos | Images and text |
| Output | Images, videos, or visual features | Text, or a combination of text and visual output |
| Core Ability | Visual understanding and generation | Connecting vision and language |
Audio and Speech Models
In the audio and speech field, we find similar concepts, such as Large Audio Models (LAMs), which refer to models trained on massive audio datasets. They excel at: generating audio, creating new sounds, music, or realistic speech; and understanding audio, identifying sounds, transcribing speech, analyzing music, and even recognizing emotions in spoken language. Examples are MusicLM (for music generation), Whisper (for speech recognition), and models used for audio classification and sound design.
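Since Whisper is openly available, here is a minimal transcription sketch via the transformers speech-recognition pipeline; the checkpoint size and file name are illustrative:

```python
# Sketch of speech-to-text with Whisper.
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = transcriber("meeting.wav", chunk_length_s=30)  # chunking for long audio
print(result["text"])
```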
Similar to VLMs, we can also talk about Audio-Language Models (ALMs), which aim to bridge the gap between audio and text. They are capable of, for example, summarizing a podcast or a video meeting, or translating speech in real time.
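True ALMs handle audio and language within a single model, but the podcast-summarization use case can be roughly approximated by chaining off-the-shelf components, as in this sketch (the file name and model choices are illustrative):

```python
# Rough approximation of an ALM use case: transcribe audio, then summarize it.
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = transcriber("podcast_episode.wav", chunk_length_s=30)["text"]
# Note: very long transcripts would need to be split before summarization.
summary = summarizer(transcript, max_length=120, min_length=30)
print(summary[0]["summary_text"])
```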
Large Speech Models (LSMs) is a term that sometimes overlaps with LAMs but often emphasizes a focus on spoken language. These models often incorporate technologies like Text-to-Speech (TTS) and Speech-to-Text (STT). They excel at accurately transcribing speech to text, generating natural-sounding speech from text, and analyzing nuances like tone, emotion, and intent in speech. Some examples are models used for voice assistants, real-time translation, and analyzing customer interactions from call recordings.
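For the TTS side, here is a sketch using an open model (Bark) through the transformers text-to-speech pipeline. This is one possible setup among several, and the exact output format may vary by library version:

```python
# Sketch of text-to-speech with an open model (Bark).
import scipy.io.wavfile
from transformers import pipeline

tts = pipeline("text-to-speech", model="suno/bark-small")
speech = tts("Large speech models can generate natural-sounding voices.")
scipy.io.wavfile.write("speech.wav", rate=speech["sampling_rate"], data=speech["audio"])
```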
Multimodal Models
Large Multimodal Models (LMMs) is a term used to describe models that process and integrate multiple data modalities, such as text, images, audio, and video. Examples are Google Gemini and LLaVA.
This term differs from Multimodal Large Language Models (MLLMs), which refers to large language models that have been extended to handle multimodal inputs; it emphasizes the language modeling component combined with multimodal capabilities.
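As an example of this language-centric flavor, here is a sketch of asking LLaVA a question about an image, following the usage pattern from the transformers documentation (the image file is a placeholder, and the prompt format is specific to this model family):

```python
# Sketch of visual question answering with LLaVA.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("chart.png")  # placeholder image file
prompt = "USER: <image>\nWhat does this chart show? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```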
Foundation Models
Foundation model is an even broader term that includes LLMs, LVMs, LAMs, and models for other data types such as code and even protein structures. It emphasizes the idea that these models are trained on massive datasets and can then be adapted to a wide variety of tasks.
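A small sketch of this "adapt without retraining" idea: the same pretrained model can be pointed at a brand-new classification task purely through its inputs, as in zero-shot classification (the labels and input text here are illustrative):

```python
# Sketch of adapting a pretrained model to a new task via zero-shot classification.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The model was fine-tuned on protein structure data.",
    candidate_labels=["biology", "finance", "sports"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```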