Download "They Just Built a New Form of AI, and It’s Better Than LLMs"

Subtitles

All right. So today I want to talk about something coming out of Meta FAIR, led by Yann LeCun and his team, that shows a very different kind of AI taking shape. It's not built around generating text or chasing better word prediction at all. Instead, it drops the idea that intelligence has to revolve around producing words and focuses on predicting meaning directly. And once you really see how this works, it honestly feels like what comes after the LLM era, not just another upgrade on top of it.

What they've built is called VL-JEPA, short for Vision Language Joint Embedding Predictive Architecture. The name sounds heavy, but the idea behind it is surprisingly straightforward. And once it clicks, a lot of design choices we've all accepted in modern AI suddenly start to feel inefficient, or at least unnecessary.

To understand why this matters, you have to look at how vision language models usually work today. Right now, most VLMs follow the same basic pattern. You show them an image or a video, you give them a prompt or a question, and they respond by generating text one token at a time. That's how image captioning works. That's how visual question answering works. That's how most large multimodal models operate under the hood. They're trained to predict the next word, then the next word, then the next word, again and again. That approach clearly works, but it comes with some hidden problems.

The first issue is that these models are forced to learn a lot of things that don't actually matter for correctness. Take a simple example. If you ask, "What happens if I flip this light switch down?", there are many answers that are all perfectly fine: the light turns off, the room gets darker, the lamp goes dark. Humans immediately understand that these all describe the same outcome. But for a token-based model, those answers are totally different. They're different sequences of symbols with almost no overlap. During training, the model has to learn exact phrasing, word choice, and sentence structure, even though none of that changes the meaning. A huge amount of training effort goes into modeling surface-level language variation instead of the underlying idea.
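To make that concrete, here is a minimal sketch (not from the paper) that compares the three light-switch answers in token space versus embedding space. It assumes the sentence-transformers library and the off-the-shelf all-MiniLM-L6-v2 model as a stand-in text encoder; any reasonable sentence encoder shows the same effect.

```python
# Hedged illustration: token overlap vs. semantic similarity for equivalent answers.
# Uses an off-the-shelf sentence encoder as a stand-in; VL-JEPA's own Y encoder is
# not what is being run here.
from itertools import combinations
from sentence_transformers import SentenceTransformer

answers = ["The light turns off.", "The room gets darker.", "The lamp goes dark."]

# Token-space view: raw word overlap between each pair of answers.
for a, b in combinations(answers, 2):
    overlap = set(a.lower().split()) & set(b.lower().split())
    print(f"{a!r} vs {b!r}: shared words = {overlap or 'none'}")

# Embedding-space view: cosine similarity between the same pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(answers, normalize_embeddings=True)  # unit-length vectors
for (i, a), (j, b) in combinations(enumerate(answers), 2):
    print(f"{a!r} vs {b!r}: cosine similarity = {float(emb[i] @ emb[j]):.2f}")
```

Even with this toy encoder, the paraphrases land close together in embedding space while sharing almost no tokens, which is exactly the gap an embedding-prediction objective can exploit.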
The second issue shows up when you try to use these systems in real time. Token-by-token generation is slow and awkward if you're dealing with live video, wearables, robotics, or anything that needs to understand what's happening continuously. You can't really know what the model means until it finishes generating text. The semantics only appear at the end of the decoding process. That adds latency, burns compute, and makes it hard to update information selectively.

This is exactly where VL-JEPA takes a completely different route. Instead of predicting words, VL-JEPA predicts embeddings. These are continuous vectors that represent meaning directly, not the surface form of language. During training, the model never tries to generate text at all. It learns to map visual input and a text query straight into a semantic representation of the answer.

The system is built from four main components, and each one has a clear role. First, there's the visual encoder. This takes an image or a sequence of video frames and compresses it into a set of visual embeddings. You can think of these as visual tokens, except they're continuous vectors rather than discrete symbols. In this setup, they use V-JEPA 2, a self-supervised vision transformer with around 304 million parameters, and it stays frozen during training. Next comes the predictor, which is the core of the whole system. This module takes the visual embeddings and the text query, like a question or prompt, and predicts what the answer embedding should look like. It's built using transformer layers initialized from Llama 3.2 1B, but without causal masking. That means everything can attend to everything else; vision and text interact freely. Then there's the Y encoder. This encodes the target text during training, the correct answer, into an embedding. That embedding becomes the learning target. Importantly, this representation is meant to capture the meaning of the answer, not the exact wording. Finally, there's the Y decoder, and this part is barely involved. It doesn't participate in training at all. At inference time, it only gets used when you actually need readable text. Most of the time, the model stays entirely in embedding space.
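Here is a hedged, heavily simplified PyTorch sketch of those four pieces, just to make the data flow concrete. The dimensions, layer counts, tokenizer, and pooling choices are placeholders, and the real model initializes its predictor from Llama 3.2 1B rather than from scratch; treat this as a diagram in code, not the paper's implementation.

```python
# Minimal sketch of the VL-JEPA data flow described above. All sizes are toy values.
import torch
import torch.nn as nn

class VLJEPASketch(nn.Module):
    def __init__(self, d_model=512, vocab=32000, n_layers=4, n_heads=8):
        super().__init__()
        # 1) Visual encoder: stand-in for frozen V-JEPA 2 (here a linear stub over
        #    precomputed patch features), kept frozen during training.
        self.visual_encoder = nn.Linear(768, d_model)
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        # 2) Predictor: bidirectional transformer (no causal mask), so visual and
        #    query tokens attend to each other freely.
        self.query_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, n_layers)
        # 3) Y encoder: maps the target answer text to the training target embedding.
        self.y_embed = nn.Embedding(vocab, d_model)
        self.y_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 2)
        # 4) Y decoder: omitted here; it is only needed when readable text is required.

    def forward(self, patch_feats, query_ids, answer_ids):
        vis = self.visual_encoder(patch_feats)            # (B, P, D) visual embeddings
        qry = self.query_embed(query_ids)                 # (B, Q, D) query embeddings
        joint = torch.cat([vis, qry], dim=1)              # vision and text in one sequence
        pred = self.predictor(joint).mean(dim=1)          # (B, D) predicted answer embedding
        tgt = self.y_encoder(self.y_embed(answer_ids)).mean(dim=1)  # (B, D) target embedding
        return pred, tgt

# Toy usage with random inputs, just to show the shapes.
model = VLJEPASketch()
pred, tgt = model(torch.randn(2, 16, 768),
                  torch.randint(0, 32000, (2, 8)),
                  torch.randint(0, 32000, (2, 6)))
print(pred.shape, tgt.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```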
Training works in a simple loop. You give the model a visual input, a query, and a target answer. The Y encoder turns the answer into an embedding. The predictor tries to produce that same embedding from the visual input and query. The loss is computed directly in embedding space, not in token space. What matters here is how the model learns without everything collapsing into noise. VL-JEPA is trained so that its predicted meaning is pulled toward the correct meaning while different answers are kept clearly separated. In practice, this forces the system to build a structured semantic space. Similar answers cluster together naturally. Different answers stay far apart. Instead of memorizing phrasing, the model organizes meaning itself, which keeps the whole representation stable and useful.
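The transcript only describes the objective at that level ("pull toward the correct meaning, keep different answers separated"), so the following is a hedged stand-in using a standard InfoNCE-style contrastive loss over a batch; the paper's exact objective and anti-collapse machinery may differ.

```python
# Hedged stand-in for the embedding-space objective: each predicted embedding is
# attracted to its own target and repelled from the other targets in the batch.
import torch
import torch.nn.functional as F

def embedding_prediction_loss(pred, tgt, temperature=0.07):
    pred = F.normalize(pred, dim=-1)                 # (B, D) predicted answer embeddings
    tgt = F.normalize(tgt, dim=-1)                   # (B, D) target answer embeddings
    logits = pred @ tgt.T / temperature              # (B, B) pairwise similarities
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)           # diagonal entries are the positives

# One training step against the toy sketch above would be:
# loss = embedding_prediction_loss(pred, tgt); loss.backward()
```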
That training setup leads to the key insight behind the whole approach. In token space, multiple valid answers can be extremely far apart. In embedding space, those same answers can sit close together. That turns a messy multimodal learning problem into a clean single-mode one. The model no longer has to guess which wording you want. It just has to understand what the answer means. Because of this, VL-JEPA doesn't need a heavy language decoder during training. It's not learning how to write sentences. It's learning how to predict semantics. And that change alone cuts a huge amount of unnecessary work out of the system.

And you can see it clearly in the results. To test whether this idea actually holds up, the researchers ran a rare kind of comparison where almost nothing was allowed to change: same vision encoder, same resolution, same frame rate, same data mixture, same batch size, same number of training steps. The only difference was what the models were trained to predict. One model followed the standard route, predicting tokens with a 1-billion-parameter language model. The VL-JEPA version predicted embeddings using a roughly 500-million-parameter predictor. So right away, the embedding-based system had about half the trainable parameters. Early in training, the two systems look similar. After around 500,000 samples, performance is roughly comparable. But as training continues, a clear pattern emerges. VL-JEPA starts improving faster, and it keeps improving. After 5 million samples, it reaches a CIDEr score of around 14.7 on video captioning, while the token-based model is still around 7.1. Classification accuracy jumps to about 35% top-5 for VL-JEPA versus roughly 27% for the baseline. And the gap doesn't close later. At 15 million samples, the difference remains. VL-JEPA continues to learn more efficiently, even with fewer parameters. That's not a tuning trick. That's a structural advantage.

The story doesn't stop at training efficiency either. Inference is where this approach really starts to shine, especially for video.
00:06:46
Because VLJA produces a continuous
00:06:48
stream of semantic embeddings, it
00:06:50
supports something called selective
00:06:52
decoding. Instead of generating text at
00:06:54
fixed intervals, you monitor how the
00:06:56
embeddings change over time. If the
00:06:58
meaning stays stable, you don't decode
00:07:00
anything. If there's a significant
00:07:01
semantic shift, then you decode. They
00:07:04
test this on long procedural videos from
00:07:06
Ego XO4D. These videos average about 6
00:07:09
minutes each and contain roughly 143
00:07:12
action annotations per video. Decoding
00:07:14
text is the expensive part. So, the goal
00:07:16
is to recover the annotation sequence
00:07:18
while minimizing how often decoding
00:07:20
happens. They compare two strategies.
00:07:23
Uniform decoding where text is generated
00:07:25
at fixed time intervals and embedding
00:07:27
guided decoding where the embedding
00:07:29
stream is clustered into semantically
00:07:31
coherent segments and decoded once per
00:07:33
segment. The result is clean. To match
00:07:35
the performance of uniform decoding at
00:07:37
one decode per second, VLJA only needs
00:07:40
to decode about once every 2.85 seconds.
00:07:43
That's roughly a 2.85 times reduction in
00:07:46
decoding operations with similar side ER
00:07:49
scores. No fancy memory tricks, no KV
00:07:52
cache gymnastics. It's just a
00:07:54
consequence of working in semantic
00:07:56
space. This is especially important for
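A minimal way to picture embedding-guided decoding is a drift check over the embedding stream: decode only when the current semantic state has moved far enough from the last decoded one. The sketch below assumes per-timestep embeddings as NumPy vectors and an illustrative cosine-distance threshold; the paper's approach reportedly clusters the stream into segments rather than thresholding, so this is an approximation of the idea, not the method.

```python
# Hedged sketch of selective decoding: trigger text generation only when the
# semantic embedding has drifted past a cosine-distance threshold.
import numpy as np

def select_decode_points(embeddings, threshold=0.2):
    embeddings = [e / np.linalg.norm(e) for e in embeddings]
    decode_at = [0]                                  # always decode the first state
    last = embeddings[0]
    for i, e in enumerate(embeddings[1:], start=1):
        if 1.0 - float(last @ e) > threshold:        # meaning shifted enough to matter
            decode_at.append(i)
            last = e
    return decode_at

# Toy stream: a stable segment followed by a jump to a new "meaning".
stream = [np.array([1.0, 0.0])] * 5 + [np.array([0.2, 1.0])] * 5
print(select_decode_points(stream))                  # -> [0, 5]
```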
Working in embedding space like this is especially important for real-time systems like smart glasses, robotics, navigation, or live planning, where latency and compute cost actually matter. Another major advantage is versatility. VL-JEPA can handle generation, classification, retrieval, and discriminative visual question answering using the same architecture. There are no task-specific heads and no separate models. For open-vocabulary classification, candidate labels are encoded into embeddings and compared to the predicted embedding; the closest match wins. For text-to-video retrieval, the text query is encoded and videos are ranked by similarity. For discriminative VQA, all candidate answers are embedded and the nearest one is selected.
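All three of those uses reduce to the same nearest-embedding comparison. The sketch below assumes a `y_encoder` callable that maps a list of strings to an (N, D) tensor and a single predicted embedding from the model; both are placeholders standing in for the real modules.

```python
# Hedged sketch: classification, retrieval, and discriminative VQA all collapse to
# "encode the candidates, pick the one closest to the predicted embedding".
import torch
import torch.nn.functional as F

def nearest_candidate(pred_embedding, candidates, y_encoder):
    cand = F.normalize(y_encoder(candidates), dim=-1)    # (N, D) candidate embeddings
    pred = F.normalize(pred_embedding, dim=-1)           # (D,) predicted answer embedding
    scores = cand @ pred                                 # cosine similarity per candidate
    return candidates[int(torch.argmax(scores))], scores

# Toy usage with a random stand-in encoder (real code would reuse the trained Y encoder).
fake_y_encoder = lambda texts: torch.randn(len(texts), 512)
label, scores = nearest_candidate(torch.randn(512),
                                  ["washing dishes", "chopping onions", "pouring tea"],
                                  fake_y_encoder)
print(label)
```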
They evaluate this setup across a wide set of benchmarks: eight video classification datasets and eight text-to-video retrieval datasets. The base VL-JEPA model, with 1.6 billion parameters and only about 2 billion training samples, outperforms CLIP, SigLIP 2, and Perception Encoder on average. Some of those baselines have seen up to 86 billion samples. After supervised fine-tuning, the VL-JEPA SFT model improves even further. It's no longer strictly zero-shot, but as a single generalist model, it approaches specialist systems that are tuned individually for each dataset.

On visual question answering, the results are especially telling. They evaluate on GQA for compositional reasoning, TallyQA for complex counting, and POPE and POPE v2 for hallucination detection. VL-JEPA SFT, with 1.6 billion parameters, lands in the same range as models like InstructBLIP and Qwen-VL, many of which rely on much larger backbones and multi-stage instruction tuning. It doesn't dominate every benchmark, but the fact that it's competitive at all is important, because it's not a classic generative VLM. It answers questions by comparing meaning, not by generating free-form text.

Then there's the world-modeling experiment. Here the model is shown an initial image and a final image and has to choose which action caused the transition from four candidate video clips. This is closer to understanding physical causality than language generation. VL-JEPA SFT reaches 65.7% accuracy, setting a new state-of-the-art. It outperforms larger vision language models and even beats frontier language models like GPT-4o, Claude 3.5, and Gemini 2, which rely on captioning and text-based reasoning. That result matters. It suggests that directly predicting latent semantics can be more effective than narrating the world in words and reasoning over those words afterward.

They also analyze the quality of the text embeddings themselves using hard-negative benchmarks like SugarCrepe++ and Vizla. They test whether the Y encoder can detect subtle semantic changes like swapped attributes or altered relationships. The base VL-JEPA Y encoder outperforms CLIP, SigLIP 2, and Perception Encoder, indicating a sharper and more structured semantic space.

Finally, they stress-test the system through ablations. When the large caption-based pre-training stage is removed, performance drops sharply, especially for classification and retrieval. Freezing the Y encoder hurts alignment. Overly simple training objectives weaken learning. Larger predictors help, particularly for VQA. Visually aligned text encoders consistently boost retrieval and classification. The pattern is consistent. When components that support semantic learning are strengthened, the model improves. When they're removed, it degrades. That kind of behavior is exactly what you want to see from a system that's meant to scale.

VL-JEPA isn't trying to replace language models everywhere. Tasks like deep reasoning, tool use, and agent-style planning still favor token-based systems. But for perception-heavy problems, especially those involving video, real-time input, and continuous understanding of the world, this approach fits naturally. It shifts the center of gravity from language to meaning. Words become an output option, not the core mechanism of intelligence. And that shift is what makes this work feel like more than just another model iteration. Thanks for watching, and I will catch you in the next one.

Description:

AI is starting to move in a very different direction from what we’ve gotten used to. Instead of chasing bigger language models and better text generation, the focus is shifting toward systems that operate on meaning itself. Words stop being the center. Semantics, vision, video, and real-time understanding take priority. In this video, we break down a new AI architecture from Meta FAIR, led by Yann LeCun and his team, that takes this approach head-on and shows why it feels like what comes after LLMs.

📩 Brand Deals and Partnerships: me@faiz.mov
✉ General Inquiries: airevolutionofficial@gmail.com

🧠 What You’ll See
• The paper: https://arxiv.org/abs/2512.10942
• VL-JEPA architecture explained in simple terms
• Why predicting meaning beats predicting words
• How this model works without token-by-token generation
• Why it performs better on vision and video tasks
• What this means for real-time AI systems

🚨 Why It Matters
AI has been moving fast, but most progress has been tied to generating better text. This shift moves AI closer to real-world understanding, where systems react to what they see, track change over time, and operate with lower latency and lower cost. When meaning becomes the core output instead of words, AI starts becoming something that can actually run continuously in the world.
