
Human differences in judgment lead to problems for AI



Many people understand the concept of bias on some intuitive level. In society and in artificial intelligence systems, racial and gender biases are well documented.

If society could somehow eliminate bias, would all problems disappear? The late Nobel laureate Daniel Kahneman, a key figure in the field of behavioral economics, argued in his last book that bias is just one side of the coin. Errors in judgment can be attributed to two sources: bias and noise.

Bias and noise play important roles in fields such as law, medicine and financial forecasting, where human judgments are central. In our work as computer and information scientists, my colleagues and I discovered that noise also plays a role in AI.

Statistical noise

Noise, in this context, means variation in the way people judge the same problem or situation. The noise problem is more widespread than it initially appears. A seminal work, dating back to the Great Depression, found that different judges gave different sentences for similar cases.

It is worrying that sentencing in court cases can depend on things like the temperature and whether the local football team won. Such factors contribute, at least in part, to the perception that the judicial system is not only biased but sometimes also arbitrary.

Other examples: insurance adjusters may provide different estimates for similar claims, reflecting noise in their judgments. Noise is likely present in all types of contests, from wine tastings to local beauty pageants to college admissions.

Noise in data

On the surface, it doesn’t seem likely that noise could affect the performance of AI systems. After all, machines are not affected by the weather or football teams, so why would their judgments vary with the circumstances? On the other hand, researchers know that bias affects AI, because it is reflected in the data on which the AI is trained.

For the new wave of AI models like ChatGPT, the gold standard is human performance on general intelligence problems such as common sense. ChatGPT and its peers are measured against human-labeled common-sense datasets.

Simply put, researchers and developers can ask the machine a common-sense question and compare its answer to human responses: “If I put a heavy rock on a paper table, will it collapse? Yes or no.” If there is high agreement between the two – at best, perfect agreement – the machine is approaching human-level common sense, according to the test.
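
As an illustration, here is a minimal sketch of how such scoring typically works, with made-up labels and an illustrative function name rather than any specific benchmark’s code: each question carries a human “gold” label, and the machine’s score is simply its rate of agreement with those labels.

```python
def agreement_rate(machine_answers, human_labels):
    """Fraction of questions where the machine matches the human gold label."""
    matches = sum(m == h for m, h in zip(machine_answers, human_labels))
    return matches / len(human_labels)

human_labels = ["yes", "no", "yes", "yes"]    # one gold label per question
machine_answers = ["yes", "no", "no", "yes"]  # the system's answers

print(agreement_rate(machine_answers, human_labels))  # 0.75
```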

So where would the noise come in? The above common-sense question seems simple, and most humans would probably agree with its answer, but there are many questions where there is more disagreement or uncertainty: “Is the following sentence plausible or implausible? My dog plays volleyball.” In other words, there is potential for noise. Not surprisingly, interesting common-sense questions have some noise.

But the issue is that most AI tests don’t take this noise into account. Intuitively, questions whose human answers tend to agree with each other should be given higher weight than questions where the answers differ – in other words, where there is noise. Researchers don’t yet know whether or how to evaluate AI responses in this situation, but the first step is to recognize that the problem exists.
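
As one possible illustration of what such weighting could look like – an assumption on our part, not an established protocol – each question’s weight could be the share of annotators who back its majority answer, so that near-unanimous questions count more toward the score:

```python
def weighted_accuracy(machine_answers, label_votes):
    """label_votes: for each question, the list of independent human answers.
    Weight = share of annotators backing the majority answer (ties broken
    arbitrarily), so near-unanimous questions count more toward the score."""
    earned, total_weight = 0.0, 0.0
    for machine, votes in zip(machine_answers, label_votes):
        majority = max(set(votes), key=votes.count)
        weight = votes.count(majority) / len(votes)  # 1.0 = unanimous
        total_weight += weight
        if machine == majority:
            earned += weight
    return earned / total_weight

votes = [["yes"] * 5,               # unanimous
         ["yes"] * 3 + ["no"] * 2,  # noisy: 60/40 split
         ["no"] * 4 + ["yes"]]      # mostly agreed
print(weighted_accuracy(["yes", "no", "no"], votes))  # 0.75
```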

Tracking machine noise

Leaving theory aside, the question remains whether all of the above is hypothetical or whether real common-sense tests contain noise. The best way to prove or disprove the presence of noise is to take an existing test, remove the answers, and have several people label it independently, i.e., provide answers. By measuring the disagreement between humans, researchers can tell how much noise the test contains.
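
Here is a minimal sketch of that re-labeling idea, using a deliberately simple disagreement statistic – the average share of annotators who dissent from each question’s majority answer. This simplistic measure is our illustration; the actual measurement, as discussed below, involves more substantial statistics.

```python
def average_disagreement(label_votes):
    """label_votes: for each question, the independent answers collected
    after stripping the test's original gold labels."""
    dissent_shares = []
    for votes in label_votes:
        majority = max(set(votes), key=votes.count)
        dissent_shares.append(1 - votes.count(majority) / len(votes))
    return sum(dissent_shares) / len(dissent_shares)

votes = [["yes"] * 10,                             # unanimous: no noise
         ["yes"] * 7 + ["no"] * 3,                 # 30% dissent
         ["plausible"] * 6 + ["implausible"] * 4]  # 40% dissent
print(average_disagreement(votes))  # ≈ 0.233
```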

The details behind measuring this disagreement are complex, involving significant statistics and mathematics. Besides, who’s to say how common sense should be defined? And how do you know whether the human judges are motivated enough to think about the question? These questions lie at the intersection of good experimental design and statistics. Robustness is key: a single result, test, or set of human labelers is unlikely to convince anyone. As a pragmatic matter, human labor is expensive. Perhaps for this reason, there had been no studies of possible noise in AI tests.

To address this gap, my colleagues and I designed such a study and published our findings in Nature Scientific Reports, showing that even in the realm of common sense, noise is inevitable. Because the environment in which judgments are obtained may matter, we conducted two types of studies. One involved paid workers on Amazon Mechanical Turk, while the other was a smaller-scale labeling exercise in two laboratories at the University of Southern California and Rensselaer Polytechnic Institute.

You can think of the former as a more realistic online environment, reflecting how many AI tests are actually labeled before being released for training and evaluation. The latter is more extreme, guaranteeing high quality but at much smaller scale. The question we wanted to answer was to what extent noise is inevitable, and whether it is just a matter of quality control.

The results were worrying. In both contexts, even on common-sense issues that might be expected to elicit high – even universal – agreement, we found a non-trivial degree of noise. The noise was high enough that we inferred that between 4% and 10% of a system’s performance could be attributed to noise.

To emphasize what this means, suppose I built an AI system that scored 85% on a test and you built one that scored 91%. Your system appears to be much better than mine. But if there is noise in the human labels used to score the responses, then we can no longer be sure that the 6-percentage-point improvement means much. For all we know, there may be no real improvement.
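
A back-of-the-envelope sketch makes the concern concrete. Suppose – purely as an illustrative assumption – that some fraction of a test’s gold labels are effectively random because of annotator noise. On binary questions, a system then matches those labels only half the time, so the measured score mixes true ability with label noise:

```python
def expected_measured_score(true_accuracy, noisy_fraction):
    """Expected benchmark score when `noisy_fraction` of the gold labels
    are effectively random (a 50/50 coin flip on binary questions)."""
    return (1 - noisy_fraction) * true_accuracy + noisy_fraction * 0.5

# Two systems with identical true ability can post very different scores
# if their gold labels carry different amounts of noise:
print(expected_measured_score(0.93, 0.04))  # ≈ 0.913
print(expected_measured_score(0.93, 0.20))  # ≈ 0.844
```

Under these illustrative numbers, the same underlying system could score anywhere from the mid-80s to the low 90s depending only on how noisy the labels are.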

In AI leaderboards, where large language models like the one that powers ChatGPT are compared, performance differences between rival systems are much narrower, typically less than 1%. As we show in the article, standard statistical methods don’t really help separate the effects of noise from those of true performance improvements.
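
To see why, consider the usual error bar on a benchmark score, sketched below under the standard assumption that the gold labels are exact – the very assumption the study calls into question:

```python
import math

def accuracy_stderr(accuracy, n_questions):
    """Standard error of an accuracy estimate over n independent questions,
    treating the gold labels as exact ground truth."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# On a 10,000-question benchmark, a half-point gap looks resolvable...
print(accuracy_stderr(0.90, 10_000))  # ≈ 0.003
# ...but this error bar ignores noise in the labels themselves, so it
# cannot tell a true half-point improvement from a label-noise artifact.
```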

Noise audits

What is the way forward? In his book, Kahneman proposed the concept of “noise auditing” to quantify, and ultimately mitigate, noise as much as possible. At the very least, AI researchers need to estimate the influence noise might have.

Auditing AI systems for bias is common, so we believe the concept of noise auditing should follow naturally. We hope that this study, as well as others like it, will lead to its adoption.

This article was republished from The Conversation, an independent, nonprofit news organization bringing you facts and analysis to help you understand our complex world.

It was written by: Mayank Kejriwal, University of Southern California.


Mayank Kejriwal receives funding from DARPA.


