Download "They Just Built a New Form of AI, and It’s Better Than LLMs"

Subtitles

All right. So today I want to talk about something coming out of Meta FAIR, led by Yann LeCun and his team, that shows a very different kind of AI taking shape. It's not built around generating text or chasing better word prediction at all. Instead, it drops the idea that intelligence has to revolve around producing words and focuses on predicting meaning directly. And once you really see how this works, it honestly feels like what comes after the LLM era, not just another upgrade on top of it.

What they've built is called VL-JEPA, short for Vision Language Joint Embedding Predictive Architecture. The name sounds heavy, but the idea behind it is surprisingly straightforward. And once it clicks, a lot of design choices we've all accepted in modern AI suddenly start to feel inefficient, or at least unnecessary.

To understand why this matters, you have to look at how vision language models usually work today. Right now, most VLMs follow the same basic pattern. You show them an image or a video, you give them a prompt or a question, and they respond by generating text one token at a time. That's how image captioning works. That's how visual question answering works. That's how most large multimodal models operate under the hood. They're trained to predict the next word, then the next word, then the next word, again and again. That approach clearly works, but it comes with some hidden problems.

The first issue is that these models are forced to learn a lot of things that don't actually matter for correctness. Take a simple example. If you ask, "What happens if I flip this light switch down?", there are many answers that are all perfectly fine: the light turns off, the room gets darker, the lamp goes dark. Humans immediately understand that these all describe the same outcome. But for a token-based model, those answers are totally different. They're different sequences of symbols with almost no overlap. During training, the model has to learn exact phrasing, word choice, and sentence structure, even though none of that changes the meaning. A huge amount of training effort goes into modeling surface-level language variation instead of the underlying idea.
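To make that concrete, here is a minimal sketch (not from the paper) that compares the three light-switch answers in token space versus embedding space. It assumes the sentence-transformers library and the off-the-shelf all-MiniLM-L6-v2 model as a stand-in text encoder; any reasonable sentence encoder shows the same effect.

```python
# Hedged illustration: token overlap vs. semantic similarity for equivalent answers.
# Uses an off-the-shelf sentence encoder as a stand-in; VL-JEPA's own Y encoder is
# not what is being run here.
from itertools import combinations
from sentence_transformers import SentenceTransformer

answers = ["The light turns off.", "The room gets darker.", "The lamp goes dark."]

# Token-space view: raw word overlap between each pair of answers.
for a, b in combinations(answers, 2):
    overlap = set(a.lower().split()) & set(b.lower().split())
    print(f"{a!r} vs {b!r}: shared words = {overlap or 'none'}")

# Embedding-space view: cosine similarity between the same pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(answers, normalize_embeddings=True)  # unit-length vectors
for (i, a), (j, b) in combinations(enumerate(answers), 2):
    print(f"{a!r} vs {b!r}: cosine similarity = {float(emb[i] @ emb[j]):.2f}")
```

Even with this toy encoder, the paraphrases land close together in embedding space while sharing almost no tokens, which is exactly the gap an embedding-prediction objective can exploit.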
The second issue shows up when you try to use these systems in real time. Token-by-token generation is slow and awkward if you're dealing with live video, wearables, robotics, or anything that needs to understand what's happening continuously. You can't really know what the model means until it finishes generating text. The semantics only appear at the end of the decoding process. That adds latency, burns compute, and makes it hard to update information selectively.

This is exactly where VL-JEPA takes a completely different route. Instead of predicting words, VL-JEPA predicts embeddings. These are continuous vectors that represent meaning directly, not the surface form of language. During training, the model never tries to generate text at all. It learns to map visual input and a text query straight into a semantic representation of the answer.

The system is built from four main components, and each one has a clear role. First, there's the visual encoder. This takes an image or a sequence of video frames and compresses it into a set of visual embeddings. You can think of these as visual tokens, except they're continuous vectors rather than discrete symbols. In this setup, they use V-JEPA 2, a self-supervised vision transformer with around 304 million parameters, and it stays frozen during training. Next comes the predictor, which is the core of the whole system. This module takes the visual embeddings and the text query, like a question or prompt, and predicts what the answer embedding should look like. It's built using transformer layers initialized from Llama 3.2 1B, but without causal masking. That means everything can attend to everything else; vision and text interact freely. Then there's the Y encoder. This encodes the target text during training, the correct answer, into an embedding. That embedding becomes the learning target. Importantly, this representation is meant to capture the meaning of the answer, not the exact wording. Finally, there's the Y decoder, and this part is barely involved. It doesn't participate in training at all. At inference time, it only gets used when you actually need readable text. Most of the time, the model stays entirely in embedding space.
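Here is a hedged, heavily simplified PyTorch sketch of those four pieces, just to make the data flow concrete. The dimensions, layer counts, tokenizer, and pooling choices are placeholders, and the real model initializes its predictor from Llama 3.2 1B rather than from scratch; treat this as a diagram in code, not the paper's implementation.

```python
# Minimal sketch of the VL-JEPA data flow described above. All sizes are toy values.
import torch
import torch.nn as nn

class VLJEPASketch(nn.Module):
    def __init__(self, d_model=512, vocab=32000, n_layers=4, n_heads=8):
        super().__init__()
        # 1) Visual encoder: stand-in for frozen V-JEPA 2 (here a linear stub over
        #    precomputed patch features), kept frozen during training.
        self.visual_encoder = nn.Linear(768, d_model)
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        # 2) Predictor: bidirectional transformer (no causal mask), so visual and
        #    query tokens attend to each other freely.
        self.query_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, n_layers)
        # 3) Y encoder: maps the target answer text to the training target embedding.
        self.y_embed = nn.Embedding(vocab, d_model)
        self.y_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 2)
        # 4) Y decoder: omitted here; it is only needed when readable text is required.

    def forward(self, patch_feats, query_ids, answer_ids):
        vis = self.visual_encoder(patch_feats)            # (B, P, D) visual embeddings
        qry = self.query_embed(query_ids)                 # (B, Q, D) query embeddings
        joint = torch.cat([vis, qry], dim=1)              # vision and text in one sequence
        pred = self.predictor(joint).mean(dim=1)          # (B, D) predicted answer embedding
        tgt = self.y_encoder(self.y_embed(answer_ids)).mean(dim=1)  # (B, D) target embedding
        return pred, tgt

# Toy usage with random inputs, just to show the shapes.
model = VLJEPASketch()
pred, tgt = model(torch.randn(2, 16, 768),
                  torch.randint(0, 32000, (2, 8)),
                  torch.randint(0, 32000, (2, 6)))
print(pred.shape, tgt.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```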
Training works in a simple loop. You give the model a visual input, a query, and a target answer. The Y encoder turns the answer into an embedding. The predictor tries to produce that same embedding from the visual input and query. The loss is computed directly in embedding space, not in token space. What matters here is how the model learns without everything collapsing into noise. VL-JEPA is trained so that its predicted meaning is pulled toward the correct meaning while different answers are kept clearly separated. In practice, this forces the system to build a structured semantic space. Similar answers cluster together naturally. Different answers stay far apart. Instead of memorizing phrasing, the model organizes meaning itself, which keeps the whole representation stable and useful.
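The transcript only describes the objective at that level ("pull toward the correct meaning, keep different answers separated"), so the following is a hedged stand-in using a standard InfoNCE-style contrastive loss over a batch; the paper's exact objective and anti-collapse machinery may differ.

```python
# Hedged stand-in for the embedding-space objective: each predicted embedding is
# attracted to its own target and repelled from the other targets in the batch.
import torch
import torch.nn.functional as F

def embedding_prediction_loss(pred, tgt, temperature=0.07):
    pred = F.normalize(pred, dim=-1)                 # (B, D) predicted answer embeddings
    tgt = F.normalize(tgt, dim=-1)                   # (B, D) target answer embeddings
    logits = pred @ tgt.T / temperature              # (B, B) pairwise similarities
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)           # diagonal entries are the positives

# One training step against the toy sketch above would be:
# loss = embedding_prediction_loss(pred, tgt); loss.backward()
```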
That training setup leads to the key insight behind the whole approach. In token space, multiple valid answers can be extremely far apart. In embedding space, those same answers can sit close together. That turns a messy multimodal learning problem into a clean single-mode one. The model no longer has to guess which wording you want. It just has to understand what the answer means. Because of this, VL-JEPA doesn't need a heavy language decoder during training. It's not learning how to write sentences. It's learning how to predict semantics. And that change alone cuts a huge amount of unnecessary work out of the system.

And you can see it clearly in the results. To test whether this idea actually holds up, the researchers ran a rare kind of comparison where almost nothing was allowed to change: same vision encoder, same resolution, same frame rate, same data mixture, same batch size, same number of training steps. The only difference was what the models were trained to predict. One model followed the standard route, predicting tokens with a 1-billion-parameter language model. The VL-JEPA version predicted embeddings using a roughly 500-million-parameter predictor. So right away, the embedding-based system had about half the trainable parameters. Early in training, the two systems look similar. After around 500,000 samples, performance is roughly comparable. But as training continues, a clear pattern emerges. VL-JEPA starts improving faster, and it keeps improving. After 5 million samples, it reaches a CIDEr score of around 14.7 on video captioning, while the token-based model is still around 7.1. Classification accuracy jumps to about 35% top-5 for VL-JEPA versus roughly 27% for the baseline. And the gap doesn't close later. At 15 million samples, the difference remains. VL-JEPA continues to learn more efficiently, even with fewer parameters. That's not a tuning trick. That's a structural advantage.

The story doesn't stop at training efficiency either. Inference is where this approach really starts to shine, especially for video.
00:06:46
Because VLJA produces a continuous
00:06:48
stream of semantic embeddings, it
00:06:50
supports something called selective
00:06:52
decoding. Instead of generating text at
00:06:54
fixed intervals, you monitor how the
00:06:56
embeddings change over time. If the
00:06:58
meaning stays stable, you don't decode
00:07:00
anything. If there's a significant
00:07:01
semantic shift, then you decode. They
00:07:04
test this on long procedural videos from
00:07:06
Ego XO4D. These videos average about 6
00:07:09
minutes each and contain roughly 143
00:07:12
action annotations per video. Decoding
00:07:14
text is the expensive part. So, the goal
00:07:16
is to recover the annotation sequence
00:07:18
while minimizing how often decoding
00:07:20
happens. They compare two strategies.
00:07:23
Uniform decoding where text is generated
00:07:25
at fixed time intervals and embedding
00:07:27
guided decoding where the embedding
00:07:29
stream is clustered into semantically
00:07:31
coherent segments and decoded once per
00:07:33
segment. The result is clean. To match
00:07:35
the performance of uniform decoding at
00:07:37
one decode per second, VLJA only needs
00:07:40
to decode about once every 2.85 seconds.
00:07:43
That's roughly a 2.85 times reduction in
00:07:46
decoding operations with similar side ER
00:07:49
scores. No fancy memory tricks, no KV
00:07:52
cache gymnastics. It's just a
00:07:54
consequence of working in semantic
00:07:56
space. This is especially important for
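A minimal way to picture embedding-guided decoding is a drift check over the embedding stream: decode only when the current semantic state has moved far enough from the last decoded one. The sketch below assumes per-timestep embeddings as NumPy vectors and an illustrative cosine-distance threshold; the paper's approach reportedly clusters the stream into segments rather than thresholding, so this is an approximation of the idea, not the method.

```python
# Hedged sketch of selective decoding: trigger text generation only when the
# semantic embedding has drifted past a cosine-distance threshold.
import numpy as np

def select_decode_points(embeddings, threshold=0.2):
    embeddings = [e / np.linalg.norm(e) for e in embeddings]
    decode_at = [0]                                  # always decode the first state
    last = embeddings[0]
    for i, e in enumerate(embeddings[1:], start=1):
        if 1.0 - float(last @ e) > threshold:        # meaning shifted enough to matter
            decode_at.append(i)
            last = e
    return decode_at

# Toy stream: a stable segment followed by a jump to a new "meaning".
stream = [np.array([1.0, 0.0])] * 5 + [np.array([0.2, 1.0])] * 5
print(select_decode_points(stream))                  # -> [0, 5]
```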
Working in embedding space like this is especially important for real-time systems like smart glasses, robotics, navigation, or live planning, where latency and compute cost actually matter. Another major advantage is versatility. VL-JEPA can handle generation, classification, retrieval, and discriminative visual question answering using the same architecture. There are no task-specific heads and no separate models. For open-vocabulary classification, candidate labels are encoded into embeddings and compared to the predicted embedding; the closest match wins. For text-to-video retrieval, the text query is encoded and videos are ranked by similarity. For discriminative VQA, all candidate answers are embedded and the nearest one is selected.
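All three of those uses reduce to the same nearest-embedding comparison. The sketch below assumes a `y_encoder` callable that maps a list of strings to an (N, D) tensor and a single predicted embedding from the model; both are placeholders standing in for the real modules.

```python
# Hedged sketch: classification, retrieval, and discriminative VQA all collapse to
# "encode the candidates, pick the one closest to the predicted embedding".
import torch
import torch.nn.functional as F

def nearest_candidate(pred_embedding, candidates, y_encoder):
    cand = F.normalize(y_encoder(candidates), dim=-1)    # (N, D) candidate embeddings
    pred = F.normalize(pred_embedding, dim=-1)           # (D,) predicted answer embedding
    scores = cand @ pred                                 # cosine similarity per candidate
    return candidates[int(torch.argmax(scores))], scores

# Toy usage with a random stand-in encoder (real code would reuse the trained Y encoder).
fake_y_encoder = lambda texts: torch.randn(len(texts), 512)
label, scores = nearest_candidate(torch.randn(512),
                                  ["washing dishes", "chopping onions", "pouring tea"],
                                  fake_y_encoder)
print(label)
```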
They evaluate this setup across a wide set of benchmarks: eight video classification datasets and eight text-to-video retrieval datasets. The base VL-JEPA model, with 1.6 billion parameters and only about 2 billion training samples, outperforms CLIP, SigLIP 2, and Perception Encoder on average. Some of those baselines have seen up to 86 billion samples. After supervised fine-tuning, the VL-JEPA SFT model improves even further. It's no longer strictly zero-shot, but as a single generalist model, it approaches specialist systems that are tuned individually for each dataset.

On visual question answering, the results are especially telling. They evaluate on GQA for compositional reasoning, TallyQA for complex counting, and POPE and POPE v2 for hallucination detection. VL-JEPA SFT, with 1.6 billion parameters, lands in the same range as models like InstructBLIP and Qwen-VL, many of which rely on much larger backbones and multi-stage instruction tuning. It doesn't dominate every benchmark, but the fact that it's competitive at all is important, because it's not a classic generative VLM. It answers questions by comparing meaning, not by generating free-form text.

Then there's the world-modeling experiment. Here the model is shown an initial image and a final image and has to choose which action caused the transition from four candidate video clips. This is closer to understanding physical causality than language generation. VL-JEPA SFT reaches 65.7% accuracy, setting a new state-of-the-art. It outperforms larger vision language models and even beats frontier language models like GPT-4o, Claude 3.5, and Gemini 2, which rely on captioning and text-based reasoning. That result matters. It suggests that directly predicting latent semantics can be more effective than narrating the world in words and reasoning over those words afterward.

They also analyze the quality of the text embeddings themselves using hard-negative benchmarks like SugarCrepe++ and Vizla. They test whether the Y encoder can detect subtle semantic changes like swapped attributes or altered relationships. The base VL-JEPA Y encoder outperforms CLIP, SigLIP 2, and Perception Encoder, indicating a sharper and more structured semantic space.

Finally, they stress-test the system through ablations. When the large caption-based pre-training stage is removed, performance drops sharply, especially for classification and retrieval. Freezing the Y encoder hurts alignment. Overly simple training objectives weaken learning. Larger predictors help, particularly for VQA. Visually aligned text encoders consistently boost retrieval and classification. The pattern is consistent. When components that support semantic learning are strengthened, the model improves. When they're removed, it degrades. That kind of behavior is exactly what you want to see from a system that's meant to scale.

VL-JEPA isn't trying to replace language models everywhere. Tasks like deep reasoning, tool use, and agent-style planning still favor token-based systems. But for perception-heavy problems, especially those involving video, real-time input, and continuous understanding of the world, this approach fits naturally. It shifts the center of gravity from language to meaning. Words become an output option, not the core mechanism of intelligence. And that shift is what makes this work feel like more than just another model iteration. Thanks for watching, and I will catch you in the next one.

Description:

AI is starting to move in a very different direction from what we’ve gotten used to. Instead of chasing bigger language models and better text generation, the focus is shifting toward systems that operate on meaning itself. Words stop being the center. Semantics, vision, video, and real-time understanding take priority. In this video, we break down a new AI architecture from Meta FAIR, led by Yann LeCun and his team, that takes this approach head-on and shows why it feels like what comes after LLMs.

📩 Brand Deals and Partnerships: me@faiz.mov
✉ General Inquiries: airevolutionofficial@gmail.com

🧠 What You’ll See
• The paper: https://arxiv.org/abs/2512.10942
• VL-JEPA architecture explained in simple terms
• Why predicting meaning beats predicting words
• How this model works without token-by-token generation
• Why it performs better on vision and video tasks
• What this means for real-time AI systems

🚨 Why It Matters
AI has been moving fast, but most progress has been tied to generating better text. This shift moves AI closer to real-world understanding, where systems react to what they see, track change over time, and operate with lower latency and lower cost. When meaning becomes the core output instead of words, AI starts becoming something that can actually run continuously in the world.
