
Apple Engineers Reveal AI’s Fragile ‘Reasoning’ Abilities


A recent study by a team of six Apple engineers highlights the limits of the mathematical “reasoning” of advanced large language models (LLMs), showing it to be fragile and unreliable under minor changes to standard benchmark problems. The finding challenges claims by companies such as OpenAI and Google, which have promoted enhanced reasoning as a significant advance in their AI models.

The study reinforces previous research suggesting that LLMs rely on probabilistic pattern matching and lack the formal understanding required for dependable mathematical reasoning. The researchers argue that current LLMs are not capable of genuine logical reasoning; instead, they mimic the reasoning steps observed in their training data.

Entitled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” the study examines the GSM8K dataset, which contains over 8,000 grade-school level mathematical word problems. This dataset is frequently used as a benchmark for assessing the complex reasoning capabilities of modern LLMs. The researchers introduced an innovative method by altering a subset of the dataset to dynamically replace certain names and numbers—transforming a problem about Sophie acquiring 31 building blocks for her nephew into one about Bill obtaining 19 blocks for his brother in the GSM-Symbolic evaluation.
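The substitution scheme described above can be sketched as template filling: surface details (names, numbers) vary per instance, while the ground-truth answer is recomputed from the sampled values so the underlying reasoning difficulty stays fixed. This is a minimal, hypothetical sketch; the paper's actual templates and value constraints are more elaborate.

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic (not a problem from the
# paper). Placeholders are filled with sampled names and numbers.
TEMPLATE = ("{name} buys {blocks} building blocks for {relative}. "
            "Each block costs {price} dollars. How much does {name} spend?")

NAMES = ["Sophie", "Bill", "Mia", "Omar"]
RELATIVES = ["her nephew", "his brother", "her cousin"]

def fill_template(name: str, relative: str, blocks: int, price: int):
    """Instantiate the template and recompute the ground-truth answer."""
    question = TEMPLATE.format(name=name, relative=relative,
                               blocks=blocks, price=price)
    answer = blocks * price  # answer follows from the sampled numbers
    return question, answer

def sample_instance(rng: random.Random):
    """Draw one symbolic variant with fresh names and numbers."""
    return fill_template(rng.choice(NAMES), rng.choice(RELATIVES),
                         rng.randint(5, 40), rng.randint(1, 9))

question, answer = sample_instance(random.Random(0))
```

Because every variant is generated rather than copied from GSM8K, a model cannot answer correctly simply by recalling a memorized benchmark item.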

This approach aims to prevent “data contamination” that could result from GSM8K questions being included in an AI model’s training data. Despite the modifications, the intrinsic difficulty of the mathematical reasoning is unchanged, so the models should theoretically perform equally well on GSM-Symbolic and GSM8K.

However, when more than 20 state-of-the-art LLMs were tested on GSM-Symbolic, their average accuracy fell relative to GSM8K, with drops ranging from 0.3% to 9.2% depending on the model. The study also found significant variability across 50 runs of GSM-Symbolic with different name and number substitutions: in some cases, the gap between the best and worst runs was as large as 15%, and changes to numbers generally hurt accuracy more than changes to names.

This variability suggests that the LLMs are not engaging in “formal” reasoning but are instead matching patterns, aligning questions and solution steps with similar ones encountered during training. That said, the overall drop was often small in absolute terms: OpenAI’s GPT-4o, for instance, slipped only from 95.2% accuracy on GSM8K to 94.9% on GSM-Symbolic. Such a high success rate indicates that the model remains effective in practice, whether or not it uses “formal” reasoning.

However, the LLMs performed far worse when the researchers added seemingly relevant but ultimately inconsequential information to the problems, in a benchmark variant called “GSM-NoOp” (short for “no operation”). For example, a question about collecting kiwis might include unnecessary details about the fruit’s size. These red herrings led to “catastrophic performance drops” in accuracy relative to GSM8K, ranging from 17.5% to 65.7% depending on the model tested. Such drastic declines highlight the fundamental limits of simple “pattern matching” that converts statements into operations without truly understanding their meaning.
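The no-op perturbation can be sketched as inserting a factually irrelevant clause just before the final question, leaving the correct answer untouched; a robust reasoner should simply ignore it. The wording below is a hypothetical illustration, not the paper's generation procedure.

```python
def add_no_op(question: str, red_herring: str) -> str:
    """Insert an irrelevant distractor sentence before the final question.

    The distractor mentions no new operations, so the correct answer to
    the problem is unchanged -- only the surface text grows.
    """
    body, _, final_q = question.rpartition(". ")
    if not body:  # single-sentence question: prepend the distractor
        return f"{red_herring} {question}"
    return f"{body}. {red_herring} {final_q}"

# Hypothetical example in the spirit of the paper's kiwi problem.
base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
noop = add_no_op(base, "Five of the kiwis are a bit smaller than average.")
```

Here the smaller-than-average detail invites an unwarranted subtraction; the study found that models frequently take that bait instead of ignoring the clause.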
