Friday, September 27, 2024

Reliability issues in large language models explored

Researchers Examine Accuracy and Transparency of Leading AI Chatbots: A Closer Look

Introduction to the Study

Performance of a selection of GPT and LLaMA models

Researchers from the Universitat Politècnica de València in Spain have discovered that as Large Language Models grow in size and complexity, they become less inclined to admit to users when they do not know an answer.

Study: Examining AI Chatbots

In their Nature study, the researchers assessed the newest versions of three popular AI chatbot families, examining both the accuracy of their responses and how effectively users could recognize incorrect answers.

Increased Reliance on LLMs

As LLMs gain widespread adoption, users increasingly rely on them for tasks like writing essays, composing poems or songs, solving mathematical problems, and more. Consequently, accuracy has become a growing concern.

Study Objective: Evaluating AI Accuracy

In this new study, the researchers sought to determine whether popular LLMs improve in accuracy with each update and how they respond when they provide incorrect answers.

AI Chatbots Assessed: BLOOM, LLaMA and GPT

To assess the accuracy of three leading LLM families (BLOOM, LLaMA and GPT), the researchers presented them with thousands of questions and compared the answers with those generated by earlier versions in response to the same prompts.

Diverse Themes Tested

The researchers also diversified the themes, encompassing math, science, anagrams, and geography, while evaluating the LLMs' capabilities to generate text or execute tasks like list ordering. Each question was initially assigned a level of difficulty.
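
To make the setup concrete, here is a minimal sketch of an accuracy-by-difficulty evaluation loop; the dataset fields, the ask_model helper, and the exact-match grading rule are placeholders rather than the study's actual protocol.

```python
# Illustrative sketch only; not the researchers' evaluation harness.
from collections import defaultdict

def evaluate(model_name, questions, ask_model):
    """questions: iterable of dicts with 'prompt', 'answer', 'theme', 'difficulty'."""
    stats = defaultdict(lambda: {"correct": 0, "total": 0})
    for q in questions:
        reply = ask_model(model_name, q["prompt"])   # query one chatbot version
        bucket = (q["theme"], q["difficulty"])
        stats[bucket]["total"] += 1
        # naive grading: exact match against a normalized gold answer
        if reply.strip().lower() == q["answer"].strip().lower():
            stats[bucket]["correct"] += 1
    return {k: v["correct"] / v["total"] for k, v in stats.items()}

# Running the same question set against successive versions of a model and
# comparing the per-theme, per-difficulty accuracies mirrors the comparison
# of newer releases with their predecessors described above.
```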

Key Findings: Accuracy and Transparency

The researchers discovered that accuracy generally improved with each new iteration of the chatbots. However, they observed that as question difficulty increased, accuracy declined, as anticipated.

Transparency Decreases with Size

Interestingly, they noted that as LLMs became larger and more advanced, they tended to be less transparent about their ability to provide correct answers.

Behavioral Shift in AI Chatbots

In previous iterations, most LLMs would inform users that they were unable to find answers or required additional information. However, in the latest versions, these models are more inclined to make guesses, resulting in a greater number of responses, both accurate and inaccurate.

Reliability Concerns

The researchers also found that all LLMs occasionally generated incorrect answers, even to straightforward questions, indicating their continued lack of reliability.

User Study: Evaluating Incorrect Answers

The research team then asked volunteers to judge whether the answers from the first phase of the study were correct. They found that most participants struggled to identify the incorrect responses.

Source 


Sunday, September 1, 2024

LLM cognitive reasoning capabilities

AI LLMs and their reasoning potential

LLMs' reasoning potential

Type of Reasoning

Deductive Reasoning

The process of reasoning, whereby humans engage in mental operations to extract conclusions or solve problems, can be divided into two essential types. The first type, deductive reasoning, involves deriving specific conclusions from a general rule or principle.

For example, one might begin with the premise that "all dogs have ears" and "Chihuahuas are dogs," leading to the conclusion that "Chihuahuas have ears."

Inductive Reasoning

The second common approach to reasoning is inductive reasoning, which involves creating general principles based on specific observations. For instance, one might conclude that all swans are white because every swan encountered so far has been white.
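
The distinction is easy to mirror in code. The toy sketch below contrasts the two styles: the deductive function applies the general dog-ears rule to a specific case, while the inductive function proposes a general colour rule from a handful of observations.

```python
# Toy contrast between the two reasoning styles (illustrative only).

# Deductive: apply a general rule to a specific case.
def deduce_has_ears(is_dog):
    all_dogs_have_ears = True            # general premise
    if is_dog and all_dogs_have_ears:    # "Chihuahuas are dogs" ...
        return True                      # ... therefore "Chihuahuas have ears"
    return None                          # the rule says nothing about non-dogs

# Inductive: propose a general rule from specific observations.
def induce_swan_rule(observed_colours):
    if len(set(observed_colours)) == 1:
        return f"All swans are {observed_colours[0]}"   # provisional generalization
    return "No single-colour rule fits the observations"

print(deduce_has_ears(is_dog=True))                   # True
print(induce_swan_rule(["white", "white", "white"]))  # All swans are white
```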

Reasoning in AI Systems

Current Research Focus

Numerous studies have focused on how humans apply deductive and inductive reasoning in their everyday activities. Yet, there is a notable lack of research into how these reasoning methods are implemented in artificial intelligence (AI) systems.

Recent Study by Amazon and UCLA

Researchers from Amazon and the University of California, Los Angeles have recently conducted a study into the fundamental reasoning capabilities of Large Language Models (LLMs). Their results, shared on the arXiv preprint server, indicate that while these models exhibit strong inductive reasoning abilities, their performance in deductive reasoning is often lacking.

Objectives of the Research

The paper aimed to elucidate the shortcomings in reasoning exhibited by Large Language Models (LLMs) and to explore the reasons behind their reduced performance on "counterfactual" reasoning tasks that diverge from conventional patterns.


Focus on Inductive vs. Deductive Reasoning

While various prior research efforts have focused on assessing the deductive reasoning skills of Large Language Models (LLMs) through basic instruction-following tasks, there has been limited scrutiny of their inductive reasoning abilities, which involve making generalizations from past data.

Introducing the SolverLearner Model

Development of SolverLearner

In order to distinctly separate inductive reasoning from deductive reasoning, the researchers introduced SolverLearner, a new framework that adopts a two-phase approach: one for learning rules and another for applying them to individual instances. Notably, the application of rules is carried out through external mechanisms, such as code interpreters, to reduce dependence on the LLM's inherent deductive reasoning abilities, according to an Amazon spokesperson.

Application and Investigation

Using the SolverLearner framework they developed, the researchers at Amazon instructed Large Language Models (LLMs) to learn functions that link input data points to their corresponding outputs based on provided examples. This process facilitated an investigation into the models' ability to generalize rules from the examples.
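
A rough sketch of that two-phase idea is shown below. It is not the authors' implementation: the query_llm helper is a hypothetical wrapper around whatever chat API is available, and Python's exec stands in for the external code interpreter that handles the deductive application step.

```python
def learn_rule(query_llm, examples):
    # Phase 1 (inductive): ask the LLM to induce the rule as a Python function.
    prompt = (
        "Infer the rule mapping inputs to outputs and return only a Python "
        "function named apply_rule(x):\n"
        + "\n".join(f"{x!r} -> {y!r}" for x, y in examples)
    )
    return query_llm(prompt)

def apply_learned_rule(rule_source, new_input):
    # Phase 2 (deductive): execute the induced rule with the Python interpreter,
    # so the LLM is not asked to perform the application step itself.
    namespace = {}
    exec(rule_source, namespace)
    return namespace["apply_rule"](new_input)
```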

Implications and Future Research

Findings and Applications

Researchers found that LLMs possess a stronger capability for inductive reasoning compared to deductive reasoning, notably in tasks involving "counterfactual" scenarios that stray from the usual framework. These findings can aid in the effective use of LLMs, such as by capitalizing on their inductive strengths when developing agent systems like chatbots.

Challenges in Deductive Reasoning

The researchers found that while LLMs demonstrated exceptional performance in inductive reasoning tasks, they often struggled with deductive reasoning. In particular, their deductive reasoning was significantly impaired in situations based on hypothetical premises or that diverged from the norm.

Future Directions

The outcomes of this research could inspire AI developers to apply the notable inductive reasoning strengths of LLMs to specialized tasks. Additionally, they may open avenues for further exploration into how LLMs process reasoning.

Proposed Research Areas

An Amazon spokesperson proposed that upcoming research could explore the connection between an LLM's ability to compress information and its strong inductive reasoning capabilities. This exploration might contribute to further improvements in the model's inductive reasoning proficiency.

Source


Thursday, August 15, 2024

large language models in the arms race

The Evolution of Machine-Generated Text and the Challenge of Detection

The Rise of Sophisticated AI-Generated Text

The Emergence of GPT-2 and its Impact

Since the debut of GPT-2 in 2019, machine-generated text has reached a level of sophistication that frequently fools human readers. As Large Language Model (LLM) technology advances, these tools have become adept at creating narratives, news pieces, and academic papers, challenging our ability to identify algorithmically generated text.

The Dual Nature of Large Language Models

Streamlining and Risk Factors

Although these LLMs are leveraged to streamline processes and enhance creativity in writing and ideation, their capabilities also pose risks, with misuse and harmful consequences creeping into the information we consume. The growing difficulty in detecting machine-generated text further amplifies these potential dangers.

Advancing Detection Through Machine Learning

Machine-Driven Solutions

To enhance detection capabilities, both academic researchers and companies are turning to machines. Machine Learning models can discern nuanced patterns in word choice and grammatical structures that elude human intuition, enabling the identification of LLM-generated text.
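
As a minimal illustration of such a learned detector, the sketch below pairs character n-gram features with a linear classifier using scikit-learn; the two training texts are placeholders, and real detectors are considerably more sophisticated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["a human-written sample ...", "a machine-generated sample ..."]  # placeholder corpus
labels = [0, 1]                                                            # 0 = human, 1 = machine

detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # subtle character/wording patterns
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)
print(detector.predict_proba(["Some new passage to score."])[0, 1])  # P(machine-generated)
```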

Scrutinizing Detection Claims

Numerous commercial detectors today boast up to 99% accuracy in identifying machine-generated text, but do these claims hold up under scrutiny? Chris Callison-Burch, a Professor of Computer and Information Science, and Liam Dugan, a doctoral candidate in his research group, investigated this in their latest paper, which was presented at the 62nd Annual Meeting of the Association for Computational Linguistics and published on the arXiv preprint server.

The Arms Race in Detection and Evasion

Technological Evolution in Detection and Evasion

"As detection technology for machine-generated text improves, so too does the technology designed to circumvent these detectors," notes Callison-Burch. "This ongoing arms race highlights the importance of developing robust detection methods, though current detectors face numerous limitations and vulnerabilities."

Introducing the Robust AI Detector (RAID)

To address these limitations and pave the way for developing more effective detectors, the research team developed the Robust AI Detector (RAID). This dataset encompasses over 10 million documents, including recipes, news articles, and blog posts, featuring both AI-generated and human-generated content.

Establishing Benchmarks for Detection

RAID: The First Standardized Benchmark

RAID establishes the inaugural standardized benchmark for evaluating the detection capabilities of both current and future detectors. Alongside the dataset, a leaderboard was developed to publicly rank the performance of all detectors assessed with RAID, ensuring impartial evaluation.

The Importance of Leaderboards

According to Dugan, "Leaderboards have been pivotal in advancing fields such as computer vision within machine learning. The RAID benchmark introduces the first leaderboard dedicated to the robust detection of AI-generated text, aiming to foster transparency and high-caliber research in this rapidly advancing domain."

Industry Impact and Engagement

Early Influence of the RAID Benchmark

Dugan has observed the significant impact this paper is making on companies engaged in the development of detection technologies.

Industry Collaboration

"Shortly after our paper was published as a preprint and the RAID dataset was released, we observed a surge in downloads and received inquiries from Originality.ai, a leading company specializing in AI-generated text detection," he reports.

Real-World Applications

"In their blog post, they featured our work, ranked their detector on our leaderboard, and are leveraging RAID to pinpoint and address previously undetected weaknesses, thereby improving their detection tools. It's encouraging to see the field's enthusiasm and drive to elevate AI-detection standards."

Evaluating Current Detectors

Do Current Detectors Meet Expectations?

Do current detectors live up to these expectations? RAID indicates that few perform as effectively as their claims suggest.

Training Limitations and Detection Gaps

"Detectors trained on ChatGPT largely proved ineffective at identifying machine-generated text from other large language models like Llama, and vice versa," explains Callison-Burch.

Use Case Specificity

"Detectors developed using news stories proved ineffective when evaluating machine-generated recipes or creative writing. Our findings reveal that many detectors perform well only within narrowly defined use cases and are most effective when assessing text similar to their training data."

The Risks of Faulty Detectors

Consequences of Inadequate Detection

Inadequate detectors represent a serious problem, as their failure not only undermines detection efforts but can also be as perilous as the original AI text generation tools.

Risks in Educational Contexts

According to Callison-Burch, universities that depend on a detector limited to ChatGPT might unjustly accuse some students of cheating and fail to identify others using different LLMs for their assignments.

Overcoming Adversarial Attacks

Challenges Beyond Training Data

The research highlights that a detector's shortcomings in identifying machine-generated text are not solely due to its training but also because adversarial techniques, like using look-alike symbols, can easily bypass its detection capabilities.

Simple Tactics for Evading Detection

According to Dugan, users can easily bypass detection systems by making simple adjustments such as adding spaces, replacing letters with symbols, or using alternative spellings and synonyms.
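
The sketch below illustrates perturbations of exactly this kind: occasional homoglyph substitutions, a couple of synonym swaps, and extra whitespace. The character map and synonym list are small hand-picked examples for illustration, not an attack toolkit used in the paper.

```python
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}   # Cyrillic look-alikes
SYNONYMS   = {"utilize": "use", "therefore": "so"}

def perturb(text, seed=0):
    rng = random.Random(seed)
    words = []
    for word in text.split():
        word = SYNONYMS.get(word.lower(), word)              # alternative wording
        word = "".join(HOMOGLYPHS.get(c, c) if rng.random() < 0.1 else c
                       for c in word)                        # occasional look-alike symbols
        words.append(word)
    return "  ".join(words)                                  # extra spaces shift token statistics

print(perturb("Therefore we utilize language models to draft the essay."))
```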

The Future of AI Detection

The Need for Robust Detectors

The study finds that while existing detectors lack robustness for widespread application, openly evaluating them on extensive and varied datasets is essential for advancing detection technology and fostering trust. Transparency in this process will facilitate the development of more reliable detectors across diverse scenarios.

Importance of Robustness and Public Deployment

Assessing the robustness of detection systems is crucial, especially as their public deployment expands, emphasizes Dugan. "Detection is a key tool in a broader effort to prevent the widespread dissemination of harmful AI-generated text," he adds.

Bridging Gaps in Awareness and Understanding

"My research aims to mitigate the inadvertent harms caused by large language models and enhance public awareness, so individuals are better informed when engaging with information," he explains. "In the evolving landscape of information distribution, understanding the origins and generation of text will become increasingly crucial. This paper represents one of my efforts to bridge gaps in both scientific understanding and public awareness."

Source


Thursday, August 1, 2024

Improving accuracy in large language models

Study Reveals Left-of-Center Bias in State-of-the-Art LLMs

Overview of the Study

  • A study published on July 31, 2024, in PLOS ONE by David Rozado of Otago Polytechnic, New Zealand, revealed that 24 state-of-the-art Large Language Models (LLMs) predominantly produced left-of-center responses when subjected to a series of political orientation tests.

Impact of AI on Political Bias

  • With the growing integration of AI systems into search engine results by tech companies, the impact of AI on user perceptions and society is significant. Rozado's research focused on both embedding and reducing political bias within conversational LLMs.

Methodology

  • He administered 11 distinct political orientation assessments, including the Political Compass Test and Eysenck's Political Test, to 24 open- and closed-source conversational LLMs; a brief sketch of how such a test item can be posed to a chat model and scored appears below. The models tested included OpenAI's GPT-3.5 and GPT-4, Google's Gemini, Anthropic's Claude, xAI's Grok, Llama 2, Mistral and Alibaba's Qwen.
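
For a concrete picture, here is a hedged sketch of how a single multiple-choice orientation item might be posed to a chat model and mapped to a crude left/right score. The item wording, the scoring map, and the model name are placeholders, and the OpenAI client call is just one possible way to query a model, not the study's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ITEM = ("Statement: 'The government should do more to redistribute wealth.'\n"
        "Answer with exactly one of: Strongly disagree, Disagree, Agree, Strongly agree.")
# crude convention for this sketch: positive = right-leaning, negative = left-leaning
SCORES = {"strongly disagree": 2, "disagree": 1, "agree": -1, "strongly agree": -2}

def score_item(model="gpt-4o-mini"):     # placeholder model name
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ITEM}],
    ).choices[0].message.content.strip().lower()
    return SCORES.get(reply, 0)          # unmatched answers count as neutral

# Averaging such scores over a full battery of items gives a rough
# left/right placement for the model under test.
```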

Fine-Tuning and Political Orientation

  • Using politically aligned custom data, Rozado performed supervised fine-tuning on a variant of GPT-3.5 to investigate whether the LLM could be steered to reflect the political biases of its training data.
  • The left-oriented GPT-3.5 model was trained on short excerpts from The Atlantic and The New Yorker; the right-oriented model was developed with texts from The American Conservative; and the neutral model incorporated content from the Institute for Cultural Evolution and Developmental Politics.

Findings and Observations

  • The analysis indicated that most conversational LLMs generated responses rated as left-of-center by the majority of the political test instruments. By contrast, five foundational LLMs, from the GPT and Llama series, primarily produced incoherent but politically neutral responses.
  • Rozado also successfully aligned the fine-tuned models' responses with the political viewpoints embedded in their training data.

Potential Influences and Implications

  • One explanation for the prevalent left-leaning responses across the examined LLMs could be ChatGPT's influential role in fine-tuning other models, given its established left-leaning political orientation.
  • Rozado notes that the study does not determine whether the political tendencies of LLMs arise from their initial training or from subsequent fine-tuning phases, and stresses that the results do not imply deliberate political bias introduced by the organizations behind these models.

Conclusion:

Rozado observes that "The prevailing trend among existing LLMs is a left-of-center political bias, as demonstrated by multiple political orientation assessments."

Further detail: Political orientation of LLMs, as discussed in PLOS ONE (2024)

Source


Friday, July 26, 2024

artificial intelligence collapse risks

AI training of AI in LLMs may result in model collapse, researchers suggest

AI training of AI in LLM
A study published in Nature warns that using AI-generated datasets to train subsequent machine learning models may lead to model collapse, polluting their outputs. The research indicates that, after a few generations, original content is supplanted by unrelated gibberish, underscoring the necessity of reliable data for AI training.

Generative AI tools, including Large Language Models (LLMs), have gained widespread popularity, primarily being trained on human-generated inputs. However, as these AI models become more prevalent on the internet, there is a risk of computer-generated content being used to train other AI models, or even themselves, in a recursive manner.

Ilia Shumailov and his team have developed mathematical models to illustrate the phenomenon of model collapse in AI systems. Their research shows that AI models may disregard certain outputs, such as infrequent lines of text in training data, leading to self-training on a limited subset of the dataset.
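
A toy numerical analogue of that tail-loss dynamic, not the paper's model, is easy to write down: fit a Gaussian to samples drawn from the previous generation's fit and repeat.

```python
# Illustrative toy simulation of recursive self-training (not the study's code).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                             # generation 0: the original "human" data
for generation in range(1, 31):
    samples = rng.normal(mu, sigma, size=50)     # train only on the previous generation's output
    mu, sigma = samples.mean(), samples.std()    # refit the "model" on synthetic data
    if generation % 5 == 0:
        print(f"generation {generation:2d}: sigma = {sigma:.3f}")
# The fitted sigma follows a downward-biased random walk, so the distribution's
# tails (the infrequent content) are the first thing the chain forgets.
```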

Shumailov and his team examined the responses of AI models to training datasets primarily generated by artificial intelligence. Their findings reveal that using AI-generated data leads to a degradation in learning capabilities over successive generations, culminating in model collapse.

The majority of recursively trained language models analyzed showed a pattern of generating repetitive phrases. As an example, when medieval architecture text was used as the initial input, the ninth generation's output consisted of a list of jackrabbits.

According to the authors, model collapse is an inevitable result of using training datasets produced by earlier generations of AI models. They suggest that successful training with AI-generated data is possible if stringent data filtering measures are implemented.

Simultaneously, firms leveraging human-produced content for AI training could develop models that outperform those of their rivals.

Further detail: In the paper 'AI Models Collapse When Trained on Recursively Generated Data,' Ilia Shumailov et al., Nature, 2024.

Source


Wednesday, July 24, 2024

how automatic software generation is transforming development

How Automatic Software Generation Is Transforming Development

Automatic Software Generation
Researchers Facundo Molina, Juan Manuel Copia, and Alessandra Gorla from IMDEA Software unveil FIXCHECK, an innovative technique integrating static analysis, randomized testing, and Large Language Models to advance patch fix analysis.

The innovations presented in their paper, "Improving Patch Correctness Analysis via Random Testing and Large Language Models," were highlighted at the International Conference on Software Testing, Verification and Validation (ICST 2024). Additional information is available on the Zenodo server.

The generation of patches to address software defects is vital for maintaining software systems. Such defects are typically identified through test cases that expose problematic behaviors.

Developers respond to these defects by creating patches, which must be validated before integration into the code base to ensure the defect is no longer exposed by the test. However, patches may still inadequately address the root cause or introduce new issues, leading to what is termed bad fixes or incorrect patches.

Identifying these incorrect patches can greatly affect the time and resources developers spend on bug fixes, as well as the overall maintenance of software systems.

Automatic program repair (APR) equips software developers with tools that can autonomously generate patches for flawed programs. However, their deployment has revealed numerous incorrect patches that do not effectively resolve the bugs.

In response to this challenge, IMDEA Software researchers have developed FIXCHECK, an innovative methodology that enhances patch correctness analysis by integrating static analysis, random testing, and large language models (LLMs) to autonomously generate tests for identifying bugs in potentially flawed patches.

FIXCHECK utilizes a two-phase approach. Initially, random tests are generated to produce an extensive set of test cases. Subsequently, large language models are employed to derive meaningful assertions for each test case.

Additionally, FIXCHECK features a mechanism for selecting and prioritizing test cases, executing new tests on the modified program and then discarding or ranking them based on their potential to expose defects in the patch.
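
A loose sketch of that two-phase workflow is given below. It is not the FIXCHECK implementation: propose_assertion is a hypothetical LLM wrapper that returns a Python boolean expression over the patched function's result, and failing tests are simply collected rather than ranked.

```python
import random

def find_suspicious_tests(patched_fn, propose_assertion, n_tests=100, seed=0):
    rng = random.Random(seed)
    suspicious = []
    for _ in range(n_tests):
        x = rng.randint(-1000, 1000)                               # phase 1: random test input
        check = propose_assertion(f"What should f({x}) return?")   # phase 2: LLM-derived assertion,
                                                                   # e.g. the string "result >= 0"
        try:
            passed = bool(eval(check, {"result": patched_fn(x), "x": x}))
        except Exception:
            passed = False                                         # a crash also exposes the patch
        if not passed:
            suspicious.append((x, check))                          # candidates revealing a bad fix
    return suspicious
```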

"The efficacy of FIXCHECK in producing test cases that uncover bugs in incorrect patches was assessed on 160 patches, encompassing both developer-generated patches and those created by RPA tools," reports Facundo Molina, a postdoctoral researcher at the IMDEA Software Institute.

The findings indicate that FIXCHECK effectively generates bug detection tests for 62% of incorrect patches authored by developers, demonstrating a high level of confidence. Additionally, it enhances existing patch evaluation methods by supplying test cases that uncover defects in up to 50% of patches identified as incorrect by cutting-edge techniques.

FIXCHECK marks a notable advancement in software repair and maintenance by offering a comprehensive solution for automating test generation and fault detection. This approach enhances patch validation effectiveness and encourages broader adoption of automated program repair techniques.

Further details: Facundo Molina et al., 'Enhancing Patch Correctness Analysis Through Random Testing and Large Language Models (Replication Package),' Zenodo (2024). DOI: 10.5281/zenodo.10498173

Source


AI Language models and human behavior

Expectations vs. reality: AI language models and human behavior

AI Language models and human behavior
One of the distinguishing features of Large Language Models (LLMs) is their ability to handle diverse tasks. For instance, a model that helps a graduate student draft an email is equally capable of assisting medical professionals in cancer diagnosis.

The extensive applicability of these models poses a challenge in systematic evaluation as creating a comprehensive benchmark dataset to test every possible query is impractical.

MIT researchers presented a novel approach in a new paper on the arXiv preprint server. They contend that because humans decide the deployment of large language models, assessment must include an examination of how people form beliefs regarding their abilities.

For example, the graduate student must judge the model's helpfulness in drafting an email, and the clinician must determine which scenarios are best suited for the model's application.

Building on this notion, the researchers established a framework to evaluate an LLM by comparing its performance to human anticipations of how it will handle specific tasks.

They introduce a model termed the human generalization function, which captures how people update their perceptions of an LLM's capabilities after interacting with it, and they then evaluate how well the LLM aligns with this function.

The study's outcomes indicate that when models do not align with the human generalization function, users may become overconfident or underconfident about where to deploy them, resulting in unexpected failures. This misalignment can also lead to more capable models performing worse than smaller counterparts in critical applications.

Ashesh Rambachan, assistant professor of economics and principal investigator in the Laboratory for Information and Decision Systems (LIDS), states, 'The allure of these tools lies in their general-purpose design, but this also means they will work alongside humans, making it essential to account for the human element in their operation.'

Keyon Vafa, a postdoctoral researcher at Harvard University, and Sendhil Mullainathan, an MIT professor in the Electrical Engineering and Computer Science and Economics department and a member of LIDS, join Rambachan as co-authors. The research will be featured at the International Conference on Machine Learning (ICML 2024) in Vienna, Austria, from July 21-27.

Human Cognitive Generalization

Our interactions with individuals lead us to form beliefs about their expertise and knowledge. For instance, a friend known for their attention to grammatical detail may be presumed to have strong sentence-construction skills, even if we have never explicitly discussed this with them.

According to Rambachan, 'Although language models often exhibit human-like qualities, our goal was to highlight that the same human propensity for generalization influences how people perceive these models.'

The researchers started by formally defining the human generalization function. This approach consists of asking questions, observing the responses of a person or LLM, and inferring their likely answers to similar or related questions.

A successful performance by an LLM on matrix inversion queries might lead to the assumption that it is equally adept at simple arithmetic. A model misaligned with these assumptions, performing poorly on tasks people deem within its competence, could run into significant issues when deployed.
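
Schematically, the alignment question can be reduced to comparing two judgments per survey item, as in the sketch below; the record fields are invented for illustration and are not the paper's notation.

```python
def misalignment_rate(records):
    """records: dicts pairing a human's post-interaction prediction with the
    model's actual correctness on a related follow-up question."""
    mismatches = sum(r["human_expects_correct"] != r["model_was_correct"] for r in records)
    return mismatches / len(records)

survey = [
    {"human_expects_correct": True,  "model_was_correct": False},  # overconfident deployment
    {"human_expects_correct": False, "model_was_correct": True},   # underused capability
    {"human_expects_correct": True,  "model_was_correct": True},   # well-aligned case
]
print(f"misalignment: {misalignment_rate(survey):.0%}")   # 67%
```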

With the formal definition established, the researchers crafted a survey to gauge how individuals apply their generalization processes when engaging with LLMs and human counterparts.

Evaluating Misalignment

The study revealed that while participants effectively predicted a human's ability to correctly answer related questions, they faced considerable difficulties in generalizing the capabilities of LLMs.

"Human generalization is applied to language models' however, this approach fails because these models do not exhibit expertise patterns analogous to those of human beings," explains Rambachan.

The study revealed that individuals were more prone to adjusting their beliefs about an LLM when it provided incorrect answers compared to when it answered correctly. Furthermore, they generally believed that performance on basic questions was not indicative of the model's ability to address more intricate queries.

When the focus was on incorrect answers, simpler models proved to be more effective than advanced models like GPT-4.

"Enhanced language model is may give the illusion of high performance on related questions, leading users to overestimate their capabilities, despite actual performance not meeting expectations," he explains.

The challenge in generalizing LLM performance could be attributed to their recent introduction, as people have had far less exposure to interacting with these models than with other individuals.

"In the future, increased interaction with language model is may naturally improve our ability to understand and predict their performance," he suggests.

The researchers aim to further investigate how individuals' perceptions of language models evolve with continued interaction, and to explore the integration of human generalization principles into LLM development.

"In the process of training algorithms or updating them with human input, it is essential to consider the human generalization function when assessing performance," he notes.

The researchers are optimistic that their dataset will function as a benchmark to evaluate LLM performance concerning the human generalization function, which may improve model effectiveness in practical scenarios.

The paper's contribution is twofold. Practically, it highlights a significant challenge in deploying LLMs for general consumer applications. An inadequate understanding of LLM accuracy and failure modes may lead to users noticing errors and potentially becoming disillusioned with the technology.

Alex Imas, a professor of behavioral science and economics at the University of Chicago Booth School of Business, notes that this issue highlights the difficulty of aligning models with human expectations of generalization, although he was not part of this research.

"Another critical contribution is more intrinsic: By evaluating how models generalize to anticipated problems and domains, we obtain a clearer understanding of their behavior when they succeed. This provides a benchmark for assessing whether LLMs truly 'understand' the problems they are meant to solve."

Further Details: Keyon Vafa and colleagues, 'Do Large Language Models Perform as Anticipated? Evaluating the Human Generalization Function,' arXiv, 2024. DOI: 10.48550/arxiv.2406.01382

Source

