LLM Vision: Why is it still an unsolved problem?
In the world of artificial intelligence, visual perception and Large Language Models (LLMs) have evolved in parallel for a long time. Today we see attempts to unite them, but a fundamental question remains: how does an LLM "understand" what it sees?
The "Black Box" Problem of Visual Perception
Most modern multimodal models are trained on image-text pairs, learning direct associations between the two. But this is not scene understanding; it is closer to statistical prediction of an image caption.
Main barriers:
- Lack of structure: An image is a set of pixels, not logical objects.
- Dynamics complexity: Understanding real-time video streams requires not just frame analysis, but temporal memory.
- Low interpretability: We cannot "look into" the model's head and understand why it decided there is a specific object in the image.
VSL Concept — Visual Scene Language (v0.1)
We at the THINKING•OS laboratory are working on creating a universal language for representing visual scenes — VSL. Our goal is to give LLMs a structured description of the world that they can "read" as easily as text.
Semantic Translation: From Pixels to Object Graphs
The problem with modern multimodal models lies in processing unstructured visual data. LLMs operate on discrete tokens, while images are continuous arrays of high-dimensional signals. VSL acts as a semantic translator, converting visual information into a deterministic, structured representation.
```json
{
  "canvas": {
    "width": 500,
    "height": 500,
    "unit": "px",
    "origin": "top-left",
    "background": "white"
  },
  "objects": [
    {
      "id": "rect1",
      "type": "rectangle",
      "size": { "width": 100, "height": 100 },
      "position": { "x": 200, "y": 200, "reference_point": "top-left" },
      "anchor": "top-left",
      "fill": "red",
      "stroke": null
    }
  ]
}
```

In this example, the visual scene is decomposed into objects with clearly defined attributes: geometric parameters, vector coordinates, and contextual metadata. This turns the process of "recognition" into a process of logical inference. Now the model is capable of performing spatial reasoning: analyzing object topology, hierarchy, and mutual arrangement with mathematical precision. This is a fundamental shift from probabilistic guessing to algorithmic understanding of the scene.
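To make the idea of spatial reasoning over a VSL scene concrete, here is a minimal sketch of how an agent could answer geometric questions from the structured description alone, without ever touching pixels. The field names follow the example above; the helper functions (`bbox`, `contains`, `center`) are hypothetical, not part of the VSL v0.1 specification.

```python
# Hypothetical helpers for reasoning over a VSL-style scene dict.
# Assumes top-left origin and top-left reference_point, as in the example.

def bbox(obj):
    """Bounding box (x0, y0, x1, y1) of a VSL object."""
    x, y = obj["position"]["x"], obj["position"]["y"]
    w, h = obj["size"]["width"], obj["size"]["height"]
    return (x, y, x + w, y + h)

def contains(canvas, obj):
    """True if the object lies fully inside the canvas."""
    x0, y0, x1, y1 = bbox(obj)
    return x0 >= 0 and y0 >= 0 and x1 <= canvas["width"] and y1 <= canvas["height"]

def center(obj):
    """Geometric center of the object."""
    x0, y0, x1, y1 = bbox(obj)
    return ((x0 + x1) / 2, (y0 + y1) / 2)

scene = {
    "canvas": {"width": 500, "height": 500},
    "objects": [
        {"id": "rect1", "type": "rectangle",
         "size": {"width": 100, "height": 100},
         "position": {"x": 200, "y": 200, "reference_point": "top-left"}},
    ],
}

rect = scene["objects"][0]
print(contains(scene["canvas"], rect))  # True
print(center(rect))                     # (250.0, 250.0)
```

Because the scene is plain structured data, checks like these are exact and auditable: the answer follows from arithmetic on declared coordinates, not from a probability distribution over captions.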
Architecture for Video Stream Analysis in THINKING•OS
Visual-Temporal Data Optimization
Modern computer vision systems face the problem of critical data redundancy when processing video streams. Traditional frame-by-frame analysis requires colossal computing power and creates an excessive load on the LLM's context window.
Our architecture in THINKING•OS is based on the principle of event-driven compression. Instead of transmitting raw visual data, the system generates Temporal Video JSON — a higher-order abstraction describing scene dynamics.
This allows for:
- Data dimensionality reduction: Instead of a sequence of hundreds of frames, the model operates on a vector of states and events, which can shrink the input by orders of magnitude.
- Increased semantic density: The AI agent Tao receives not pixels, but a structured narrative timeline, which is critically important for understanding cause-and-effect relationships in real-time.
- Latency minimization: Processing occurs at the metadata level, allowing the system to react to stream changes almost instantly.
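The event-driven compression principle can be sketched in a few lines: emit an entry only when the per-frame scene state changes, so a long stretch of static video collapses into a handful of events. The schema below is illustrative only; the actual Temporal Video JSON format used in THINKING•OS is not specified here.

```python
# Illustrative sketch of event-driven compression (schema is hypothetical).
def compress_to_events(frames):
    """frames: list of (timestamp, state) pairs, where state is a dict
    describing the detected scene. Returns a timeline that records only
    the moments when the state changes."""
    events, prev = [], None
    for ts, state in frames:
        if state != prev:
            events.append({"t": ts, "state": state})
            prev = state
    return {"version": "0.1", "events": events}

# 100 frames in which a person appears at t=3 and leaves at t=7:
frames = [(t, {"person_present": 3 <= t < 7}) for t in range(100)]
timeline = compress_to_events(frames)
print(len(timeline["events"]))  # 3 events instead of 100 frames
```

The LLM then reasons over three state transitions instead of a hundred frames, which is where both the context-window savings and the cause-and-effect readability come from.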
The Future of Visual Intelligence
We believe that the future lies not in an infinite increase in model parameters, but in creating better ways to "translate" the visual world into structures understandable by intelligence. VSL and our video analysis architecture are steps towards true, deep understanding of the world by artificial intelligence.
Open Specifications
The concepts of VSL and VDL are being developed as open specifications to create a standard for spatial and temporal reasoning in AI systems. You can follow the development and contribute on GitHub.
Want to discuss applying these technologies to your business?
We help companies implement complex AI systems based on deep business pipeline engineering.
Discuss in Telegram