
Can you outperform top-tier AI models on these basic vision tests?

Whatever you do, don't ask the AI how many horizontal lines this symbol has.

Getty Images

In recent years, we have seen amazing progress in AI systems when it comes to identifying and analyzing complex image content. However, a new paper highlights how state-of-the-art "vision language models" (VLMs) often fail at simple, low-level visual tasks that are trivially easy for a human to perform.

In a pre-print paper with the provocative title "Vision Language Models Are Blind" (and a PDF version that includes a black sunglasses emoji in the title), researchers from Auburn University and the University of Alberta created eight simple visual acuity tests with objectively correct answers. The tasks range from counting how many times two colored lines intersect to identifying which letter in a long word has been circled to counting how many nested shapes appear in an image (representative examples and results can be viewed on the research team's webpage).

Importantly, these tests are generated through custom code and do not rely on pre-existing images or tests found on the web, making them less likely to be solvable by memorization, according to the researchers. Beyond basic 2D shapes, the tests also require "minimal to zero world knowledge," making it difficult for the answer to be guessed from the "textual questions and options alone" (an issue that has been identified for some other vision AI benchmarks).
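To illustrate how such a test can carry an objectively correct answer, here is a minimal sketch (not the authors' actual code; all names are illustrative) that generates a random "do these two line segments cross?" item and computes its ground truth geometrically, using the standard orientation test:

```python
import random

def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p):
    1 = counter-clockwise, -1 = clockwise, 0 = collinear."""
    val = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (val > 0) - (val < 0)

def segments_intersect(a, b, c, d):
    """True if segment a-b crosses segment c-d
    (ignoring degenerate collinear-overlap cases)."""
    return (orientation(a, b, c) != orientation(a, b, d)
            and orientation(c, d, a) != orientation(c, d, b))

def make_task(rng):
    """Generate two random segments plus the objectively
    correct crossing count (0 or 1) as the answer key."""
    pts = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(4)]
    a, b, c, d = pts
    return (a, b), (c, d), int(segments_intersect(a, b, c, d))

rng = random.Random(0)  # seeded so each generated test is reproducible
seg1, seg2, answer = make_task(rng)
```

Because the answer key is computed from the same coordinates used to draw the image, the grader never depends on human labeling or on images a model might have seen during training.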

Are You Smarter Than A 5th Grader?

After testing four vision models—GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5—the researchers found that all four fell well short of the 100 percent accuracy you might expect on such simple visual tasks (and that most sighted humans would have little trouble achieving). But the size of the AI underperformance varied greatly depending on the specific task. When asked to count the number of rows and columns in a blank grid, for example, the best-performing model gave an accurate answer less than 60 percent of the time. Gemini-1.5 Pro, on the other hand, reached about 93 percent accuracy in identifying circled letters, approaching human-level performance.

Even small changes to the tasks could lead to large changes in the results. While all four tested models were able to correctly count five overlapping hollow circles, the accuracy of every model fell well below 50 percent when six to nine circles were involved. The researchers speculate that this "suggests that VLMs are biased towards the well-known Olympic logo, which has 5 circles." In other cases, the models occasionally hallucinated nonsensical answers, such as guessing "9," "n," or "©" as the circled letter in the word "subdermatoglyphic."

Overall, the results highlight how AI models that perform well at high-level visual reasoning have some significant "blind spots" (sorry) when it comes to low-level abstract images. It's all somewhat reminiscent of similar capability gaps we often see in state-of-the-art large language models, which can produce excellent summaries of lengthy texts while at the same time failing extremely elementary math and spelling questions.
These gaps in VLM abilities may stem from the inability of these systems to generalize beyond the kinds of content they are explicitly trained on. Yet when the researchers tried fine-tuning a model using images drawn from one of their tasks (the "Are two circles touching?" test), that model showed only a negligible 17 percent increase in accuracy. "Loss values for all these experiments were very close to zero, indicating that the model overfits the training set but fails to generalize," the researchers wrote.
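What makes the failure on that task notable is how trivial its ground truth is: two circles touch or overlap exactly when the distance between their centers is at most the sum of their radii. A hedged sketch (illustrative names, not code from the paper):

```python
import math

def circles_touch_or_overlap(center_a, radius_a, center_b, radius_b):
    """Ground truth for a 'two circles' item: the circles touch or
    overlap exactly when the center distance <= sum of the radii."""
    dist = math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])
    return dist <= radius_a + radius_b

print(circles_touch_or_overlap((0, 0), 1.0, (1.5, 0), 1.0))  # True: overlapping
print(circles_touch_or_overlap((0, 0), 1.0, (3.5, 0), 1.0))  # False: visible gap
print(circles_touch_or_overlap((0, 0), 1.0, (2.0, 0), 1.0))  # True: exactly tangent
```

A one-line distance check is all it takes to label every training and test image correctly, which is precisely why a model that overfits such a set without learning the underlying relation stands out.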

The researchers propose that the VLM capability gap may be related to the so-called "late fusion" of a vision encoder onto a pre-trained large language model. An "early fusion" training approach that integrates visual encoding alongside language training, they suggest, could lead to better results on these low-level tasks (though they don't offer any analysis of this question).

This post was published on 07/11/2024 9:35 am

news2source.com
