Journal Article: “How is ChatGPT’s Behavior Changing Over Time?”

March 19, 2024 by Gary Price

The article linked below was recently published by the Harvard Data Science Review (HDSR).

Title

How is ChatGPT’s Behavior Changing Over Time?

Authors

Lingjiao Chen
Stanford University

Matei Zaharia
UC Berkeley

James Zou
Stanford University

Source

Harvard Data Science Review (2024)

DOI: 10.1162/99608f92.5317da47

Abstract

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4’s amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5’s performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. We provide evidence that GPT-4’s ability to follow user instructions has decreased over time, which is one common factor behind the many behavior drifts. Overall, our findings show that the behavior of the “same” LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.

Figure 1. Overview of performance drift (a) and instruction following shift (b) of GPT-4 (left panel) and GPT-3.5 (right panel) between March 2023 and June 2023. A higher evaluation metric is better. On eight diverse tasks (detailed below), the models’ performance drifts considerably over time, and sometimes for the worse. The decrease of GPT-4’s ability to follow instructions over time matched its behavior drift and partially explained the corresponding performance drops. Source: 10.1162/99608f92.5317da47

Direct to Full Text Article

Filed under: Data Files, News

About Gary Price

Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009, he was Director of Online Information Services at Ask.com.
