LLM Vision: Why is it still an unsolved problem?
In the world of artificial intelligence, visual perception and Large Language Models (LLMs) have evolved in parallel for a long time. Today we see attempts to unite them, but a fundamental question remains: how does an LLM "understand" what it sees?
The "Black Box" Problem of Visual Perception
Most modern multimodal models are trained on image-text pairs, learning direct associations between the two. But this is not scene understanding; it is closer to statistical prediction of an image caption.
Main barriers:
- Lack of structure: An image is a set of pixels, not logical objects.
- Dynamics complexity: Understanding real-time video streams requires not just frame analysis, but temporal memory.
- Low interpretability: We cannot "look into" the model's head and understand why it decided there is a specific object in the image.
VSL Concept — Visual Scene Language (v0.1)
We at the THINKING•OS laboratory are working on creating a universal language for representing visual scenes — VSL. Our goal is to give LLMs a structured description of the world that they can "read" as easily as text.
Semantic Translation: From Pixels to Object Graphs
The problem with modern multimodal models lies in processing unstructured visual data. LLMs operate on discrete tokens, while images are continuous arrays of high-dimensional signals. VSL acts as a semantic translator, converting visual information into a deterministic, structured representation.
```json
{
  "canvas": {
    "width": 500,
    "height": 500,
    "unit": "px",
    "origin": "top-left",
    "background": "white"
  },
  "objects": [
    {
      "id": "rect1",
      "type": "rectangle",
      "size": { "width": 100, "height": 100 },
      "position": { "x": 200, "y": 200, "reference_point": "top-left" },
      "anchor": "top-left",
      "fill": "red",
      "stroke": null
    }
  ]
}
```

In this example, the visual scene is decomposed into objects with clearly defined attributes: geometric parameters, vector coordinates, and contextual metadata. This turns the process of "recognition" into a process of logical inference. Now the model is capable of performing spatial reasoning: analyzing object topology, hierarchy, and mutual arrangement with mathematical precision. This is a fundamental shift from probabilistic guessing to algorithmic understanding of the scene.
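To make the idea of spatial reasoning over a VSL scene concrete, here is a minimal sketch of how an agent could answer geometric questions from the structured description alone, without ever touching pixels. The field names follow the example above; the helper functions (`bbox`, `contains`, `center`) are hypothetical, not part of the VSL v0.1 specification.

```python
# Hypothetical helpers for reasoning over a VSL-style scene dict.
# Assumes top-left origin and top-left reference_point, as in the example.

def bbox(obj):
    """Bounding box (x0, y0, x1, y1) of a VSL object."""
    x, y = obj["position"]["x"], obj["position"]["y"]
    w, h = obj["size"]["width"], obj["size"]["height"]
    return (x, y, x + w, y + h)

def contains(canvas, obj):
    """True if the object lies fully inside the canvas."""
    x0, y0, x1, y1 = bbox(obj)
    return x0 >= 0 and y0 >= 0 and x1 <= canvas["width"] and y1 <= canvas["height"]

def center(obj):
    """Geometric center of the object."""
    x0, y0, x1, y1 = bbox(obj)
    return ((x0 + x1) / 2, (y0 + y1) / 2)

scene = {
    "canvas": {"width": 500, "height": 500},
    "objects": [
        {"id": "rect1", "type": "rectangle",
         "size": {"width": 100, "height": 100},
         "position": {"x": 200, "y": 200, "reference_point": "top-left"}},
    ],
}

rect = scene["objects"][0]
print(contains(scene["canvas"], rect))  # True
print(center(rect))                     # (250.0, 250.0)
```

Because the scene is plain structured data, checks like these are exact and auditable: the answer follows from arithmetic on declared coordinates, not from a probability distribution over captions.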
Architecture for Video Stream Analysis in THINKING•OS
Visual-Temporal Data Optimization
Modern computer vision systems face the problem of critical data redundancy when processing video streams. Traditional frame-by-frame analysis requires colossal computing power and creates an excessive load on the LLM's context window.
Our architecture in THINKING•OS is based on the principle of event-driven compression. Instead of transmitting raw visual data, the system generates Temporal Video JSON — a higher-order abstraction describing scene dynamics.
This allows for:
- Data dimensionality reduction: Instead of a sequence of hundreds of frames, the model operates on a vector of states and events, which can shrink the input by orders of magnitude.
- Increased semantic density: The AI agent Tao receives not pixels, but a structured narrative timeline, which is critically important for understanding cause-and-effect relationships in real-time.
- Latency minimization: Processing occurs at the metadata level, allowing the system to react to stream changes almost instantly.
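The event-driven compression principle can be sketched in a few lines: emit an entry only when the per-frame scene state changes, so a long stretch of static video collapses into a handful of events. The schema below is illustrative only; the actual Temporal Video JSON format used in THINKING•OS is not specified here.

```python
# Illustrative sketch of event-driven compression (schema is hypothetical).
def compress_to_events(frames):
    """frames: list of (timestamp, state) pairs, where state is a dict
    describing the detected scene. Returns a timeline that records only
    the moments when the state changes."""
    events, prev = [], None
    for ts, state in frames:
        if state != prev:
            events.append({"t": ts, "state": state})
            prev = state
    return {"version": "0.1", "events": events}

# 100 frames in which a person appears at t=3 and leaves at t=7:
frames = [(t, {"person_present": 3 <= t < 7}) for t in range(100)]
timeline = compress_to_events(frames)
print(len(timeline["events"]))  # 3 events instead of 100 frames
```

The LLM then reasons over three state transitions instead of a hundred frames, which is where both the context-window savings and the cause-and-effect readability come from.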
The Future of Visual Intelligence
We believe that the future lies not in an infinite increase in model parameters, but in creating better ways to "translate" the visual world into structures understandable by intelligence. VSL and our video analysis architecture are steps towards true, deep understanding of the world by artificial intelligence.
Open Specifications
The concepts of VSL and VDL are being developed as open specifications to create a standard for spatial and temporal reasoning in AI systems. You can follow the development and contribute on GitHub.
Want to discuss applying these technologies to your business?
We help companies implement complex AI systems based on deep business pipeline engineering.
Discuss in Telegram