Wednesday, July 24, 2024

Expectations vs. reality: AI language models and human behavior

One of the distinguishing features of Large Language Models (LLMs) is their ability to handle diverse tasks. For instance, a model that helps a graduate student draft an email is equally capable of assisting medical professionals in cancer diagnosis.

This broad applicability makes systematic evaluation difficult: building a benchmark dataset comprehensive enough to test every possible query is impractical.

MIT researchers presented a novel approach in a new paper on the arXiv preprint server. They contend that because humans decide when and where to deploy large language models, evaluation must also examine how people form beliefs about the models' capabilities.

For example, the graduate student must judge the model's helpfulness in drafting an email, and the clinician must determine which scenarios are best suited for the model's application.

Building on this notion, the researchers established a framework to evaluate an LLM by comparing its performance to human anticipations of how it will handle specific tasks.

They introduce a model termed the human generalization function, which captures how people update their beliefs about an LLM's capabilities after interacting with it, and then evaluate how well the LLM aligns with this function.

The study's findings indicate that when models are misaligned with the human generalization function, users may become overconfident or underconfident about where to deploy them, resulting in unexpected failures. This misalignment can also cause more capable models to perform worse than smaller ones in high-stakes applications.
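To see why this matters, consider a toy illustration; the numbers, task names, and Python code below are invented for this article, not taken from the paper. If users hand a task to a model only when they believe it will succeed, a stronger model that invites overconfident beliefs can end up with a worse realized success rate than a weaker but better-calibrated one.

# Toy illustration in Python; all figures are made up.
# For each task, pair the user's believed success probability (shaped by how
# they generalize from past interactions) with the model's true success rate.
tasks = {
    "draft an email":         {"strong_model": (0.97, 0.95), "weak_model": (0.85, 0.85)},
    "flag a suspicious scan": {"strong_model": (0.90, 0.55), "weak_model": (0.40, 0.45)},
}

def realized_success(model, deploy_threshold=0.8):
    # Users delegate a task only when their *belief* clears the threshold,
    # but outcomes are governed by the *true* success rate.
    outcomes = [true_rate
                for believed, true_rate in (tasks[t][model] for t in tasks)
                if believed >= deploy_threshold]
    return sum(outcomes) / len(outcomes) if outcomes else None

print("strong model:", realized_success("strong_model"))  # 0.75: overconfidence hurts
print("weak model:  ", realized_success("weak_model"))    # 0.85: fewer, better-judged uses

In this made-up example the stronger model is better at both tasks in absolute terms, yet because users over-generalize from its successes it gets deployed where it fails, mirroring the pattern described above.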

Ashesh Rambachan, assistant professor of economics and a principal investigator in the Laboratory for Information and Decision Systems (LIDS), states, 'The allure of these tools lies in their general-purpose design, but this also means they will work alongside humans, making it essential to account for the human element in their operation.'

Keyon Vafa, a postdoctoral researcher at Harvard University, and Sendhil Mullainathan, an MIT professor in the departments of Electrical Engineering and Computer Science and of Economics and a member of LIDS, join Rambachan as co-authors. The research will be presented at the International Conference on Machine Learning (ICML 2024) in Vienna, Austria, July 21-27.

Human Cognitive Generalization

Our interactions with individuals lead us to form beliefs about their expertise and knowledge. For instance, a friend known for their attention to grammatical detail may be presumed to have strong sentence construction skills, despite never having explicitly discussed this with them.

According to Rambachan, 'Although language models often exhibit human-like qualities, our goal was to highlight that the same human propensity for generalization influences how people perceive these models.'

The researchers started by formally defining the human generalization function. This approach consists of asking questions, observing the responses of a person or LLM, and inferring their likely answers to similar or related questions.

A successful performance by an LLM on matrix inversion queries might lead to the assumption that it is equally adept at simple arithmetic. A model misaligned with these assumptions, one that performs poorly on tasks people expect it to handle, could encounter significant issues when deployed.
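As a simplified sketch of how such alignment might be measured (the records, values, and function below are illustrative, not from the paper): after watching the model answer one question, a participant predicts whether it will answer a related question correctly, and the gap between those predictions and the model's actual results serves as a misalignment score.

# Illustrative Python sketch; the records and names are hypothetical.
# Each record: a participant sees the LLM answer `seen_question`, then predicts
# the probability that it will also get `related_question` right; `llm_correct`
# is what the model actually did on the related question.
observations = [
    {"seen_question": "invert a 3x3 matrix", "seen_correct": True,
     "related_question": "add 17 and 25", "predicted_prob": 0.95, "llm_correct": False},
    {"seen_question": "summarize a news story", "seen_correct": True,
     "related_question": "summarize a short contract", "predicted_prob": 0.80, "llm_correct": True},
]

def misalignment(records):
    # Mean squared gap between human predictions and actual outcomes:
    # 0 means beliefs formed by generalizing match reality; larger values mean
    # people will be over- or under-confident about where to use the model.
    gaps = [(r["predicted_prob"] - float(r["llm_correct"])) ** 2 for r in records]
    return sum(gaps) / len(gaps)

print(f"misalignment score: {misalignment(observations):.3f}")

In this toy data the matrix-inversion record drives the score up: the participant generalized from a hard task to an easy one, but the model failed the easy one, which is exactly the kind of mismatch the framework is designed to surface.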

With the formal definition established, the researchers crafted a survey to gauge how individuals apply their generalization processes when engaging with LLMs and human counterparts.

Evaluating Misalignment

The study revealed that while participants effectively predicted a human's ability to correctly answer related questions, they faced considerable difficulties in generalizing the capabilities of LLMs.

"Human generalization is applied to language models' however, this approach fails because these models do not exhibit expertise patterns analogous to those of human beings," explains Rambachan.

The study also found that individuals were more prone to adjust their beliefs about an LLM when it answered incorrectly than when it answered correctly. Furthermore, participants generally believed that performance on basic questions said little about the model's ability to address more intricate queries.

In situations where people placed more weight on incorrect answers, simpler models proved more effective than advanced models like GPT-4.

"Enhanced language model is may give the illusion of high performance on related questions, leading users to overestimate their capabilities, despite actual performance not meeting expectations," he explains.

The difficulty of generalizing about LLM performance could be attributed to their recent introduction, as people have had far less exposure to interacting with these models than with other individuals.

"In the future, increased interaction with language model is may naturally improve our ability to understand and predict their performance," he suggests.

The researchers aim to further investigate how people's perceptions of language models evolve with continued interaction and to explore how human generalization principles could be integrated into LLM development.

"In the process of training algorithms or updating them with human input, it is essential to consider the human generalization function when assessing performance," he notes.

The researchers are optimistic that their dataset will function as a benchmark to evaluate LLM performance concerning the human generalization function, which may improve model effectiveness in practical scenarios.

The paper's contribution is twofold. Practically, it highlights a significant challenge in deploying LLMs for general consumer applications. An inadequate understanding of LLM accuracy and failure modes may lead to users noticing errors and potentially becoming disillusioned with the technology.

Alex Imas, a professor of behavioral science and economics at the University of Chicago Booth School of Business, who was not involved in the research, notes that this issue highlights the difficulty of aligning models with human expectations about generalization.

"Another critical contribution is more intrinsic: By evaluating how models generalize to anticipated problems and domains, we obtain a clearer understanding of their behavior when they succeed. This provides a benchmark for assessing whether LLMs truly 'understand' the problems they are meant to solve."

Further details: Keyon Vafa et al., 'Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function,' arXiv (2024). DOI: 10.48550/arXiv.2406.01382


