Introduction to AI (4). Generative AI: Large Models by Training Data and Task

The core idea of “large models” is to leverage massive datasets and powerful computational resources to train AI systems that can learn complex patterns and generate new content or insights.
The terms Generative AI and LLM come up frequently in discussions of the recent rise of AI; however, they are often incorrectly used as synonyms.
In this post, we will talk about Large Models and how they are classified by training data and task. We will explore specialized models (LLMs, LVMs, etc.) and general or foundation models, and how they relate to the field of Generative AI.
Language Models
To explain the difference between these models, we will use an analogy. Imagine a very talented artist who can paint, sculpt, write poetry, compose music, make movies, or even code a video game. This artist represents Generative AI: it can create many different types of content (text, images, music, etc.).
Now, think of a writer and storyteller who excels at working with words. This writer can compose beautiful poems, translate languages, summarize long stories, and create various types of creative text formats like articles or movie scripts. This writer is what we call an LLM (Large Language Model). An LLM is a specific type of Generative AI that focuses solely on language.
| Feature | Generative AI | LLM (Large Language Model) |
| --- | --- | --- |
| What it is | A broad category of AI that creates new content | A specific type of Generative AI focused on language |
| What it can do | Generate various types of content (text, images, music, code, etc.) | Understand and generate human-like text |
| Examples | Creating images, generating music, writing different kinds of text formats | Writing poems, translating languages |
LLMs are trained on vast amounts of text data to learn patterns and generate new textual content.
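To make this concrete, here is a minimal sketch of text generation with a pretrained language model, using the Hugging Face transformers library. The model name and prompt are just illustrative choices; any causal language model checkpoint would work similarly.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# "gpt2" is an illustrative small checkpoint, not a recommendation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large models are", max_new_tokens=30)
print(result[0]["generated_text"])
```

The pipeline downloads the weights on first use and continues the prompt with text that follows the statistical patterns the model learned during training.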
While the term “LLM” is specific to language, the underlying idea of learning patterns and generating content applies across different domains (image, video, audio, etc.). Each domain has its own leading models and techniques, which evolve constantly with new research.
Vision Models
An emerging concept in the image and video domain is the Large Vision Model (LVM). LVMs are trained on large amounts of image or video data, and they encompass the diffusion models, GANs, and transformer-based architectures that we will see later in this series. They excel at: recognizing objects, scenes, and relationships within images; generating images, either from scratch or by modifying existing ones; and analyzing videos, understanding the actions, events, and sequences they contain.
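As a rough illustration of the "generating images from scratch" capability, here is a sketch using a diffusion model through the diffusers library. The checkpoint name is only one example, and this assumes a machine with a CUDA-capable GPU:

```python
# Sketch of text-to-image generation with a diffusion-based model.
# The checkpoint name is illustrative; assumes a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
image = pipe("an oil painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```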
LVMs are like expert visual artists: they can create and understand visual art, but they don't necessarily "speak" a language to describe it. For that, we use Large Vision-Language Models (VLMs), which combine visual and textual information and bridge the gap between seeing and understanding. VLMs are like art critics who are also artists: they can create and understand visual art, and they can also eloquently describe and analyze it using language. They excel at: connecting images and text, understanding the relationship between images and their descriptions; generating descriptions, creating captions for images or answering questions about them; and multimodal reasoning, using both visual and textual cues to solve problems or make decisions.
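To illustrate the "generating descriptions" capability, here is a minimal captioning sketch using the transformers image-to-text pipeline with BLIP, one commonly used open VLM; the image path is a placeholder:

```python
# Sketch of image captioning with a vision-language model (BLIP).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captions = captioner("photo.jpg")  # placeholder path; URLs also work
print(captions[0]["generated_text"])
```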
| Feature | LVMs | VLMs |
| --- | --- | --- |
| Input | Images, videos | Images and text |
| Output | Images, videos, or visual features | Text, or a combination of text and visual output |
| Core Ability | Visual understanding and generation | Connecting vision and language |
Audio and Speech Models
In the audio and speech field, we find similar concepts, such as Large Audio Models (LAMs), which refer to models trained on massive audio datasets. They excel at: generating audio, creating new sounds, music, or realistic speech; and understanding audio, identifying sounds, transcribing speech, analyzing music, and even recognizing emotions in spoken language. Examples are MusicLM (for music generation), Whisper (for speech recognition), and models used for audio classification and sound design.
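Since Whisper is openly available, here is a minimal transcription sketch via the transformers speech-recognition pipeline; the checkpoint size and file name are illustrative:

```python
# Sketch of speech-to-text with Whisper.
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = transcriber("meeting.wav", chunk_length_s=30)  # chunking for long audio
print(result["text"])
```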
Similar to VLMs, we can also talk about Audio-Language Models (ALMs), which aim to bridge the gap between audio and text. They are capable of, for example, summarizing a podcast or a video meeting, or translating speech in real time.
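True ALMs handle audio and language within a single model, but the podcast-summarization use case can be roughly approximated by chaining off-the-shelf components, as in this sketch (the file name and model choices are illustrative):

```python
# Rough approximation of an ALM use case: transcribe audio, then summarize it.
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = transcriber("podcast_episode.wav", chunk_length_s=30)["text"]
# Note: very long transcripts would need to be split before summarization.
summary = summarizer(transcript, max_length=120, min_length=30)
print(summary[0]["summary_text"])
```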
Large Speech Models (LSMs) is a term that sometimes overlaps with LAMs but often emphasizes a focus on spoken language. These models often incorporate technologies like Text-to-Speech (TTS) and Speech-to-Text (STT). They excel at accurately transcribing speech to text, generating natural-sounding speech from text, and analyzing nuances like tone, emotion, and intent in speech. Some examples are models used for voice assistants, real-time translation, and analyzing customer interactions from call recordings.
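For the TTS side, here is a sketch using an open model (Bark) through the transformers text-to-speech pipeline. This is one possible setup among several, and the exact output format may vary by library version:

```python
# Sketch of text-to-speech with an open model (Bark).
import scipy.io.wavfile
from transformers import pipeline

tts = pipeline("text-to-speech", model="suno/bark-small")
speech = tts("Large speech models can generate natural-sounding voices.")
scipy.io.wavfile.write("speech.wav", rate=speech["sampling_rate"], data=speech["audio"])
```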
Multimodal Models
Large Multimodal Models (LMMs) is a term used to describe models that process and integrate multiple data modalities, such as text, images, audio, and video. Examples are Google Gemini and LLaVA.
This term differs from Multimodal Large Language Models (MLLMs), which refers to large language models that have been extended to handle multimodal inputs; it emphasizes the language modeling component combined with multimodal capabilities.
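As an example of this language-centric flavor, here is a sketch of asking LLaVA a question about an image, following the usage pattern from the transformers documentation (the image file is a placeholder, and the prompt format is specific to this model family):

```python
# Sketch of visual question answering with LLaVA.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("chart.png")  # placeholder image file
prompt = "USER: <image>\nWhat does this chart show? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```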
Foundation Models
Foundation model is an even broader term that includes LLMs, LVMs, LAMs, and models for other data types such as code and even protein structures. It emphasizes the idea that these models are trained on massive datasets and can then be adapted to a wide variety of tasks.
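A small sketch of this "adapt without retraining" idea: the same pretrained model can be pointed at a brand-new classification task purely through its inputs, as in zero-shot classification (the labels and input text here are illustrative):

```python
# Sketch of adapting a pretrained model to a new task via zero-shot classification.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The model was fine-tuned on protein structure data.",
    candidate_labels=["biology", "finance", "sports"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```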