Artificial intelligence is a ‘black box’. Maybe not for long



Today’s artificial intelligence is often described as a “black box”. AI developers do not write explicit rules for these systems; instead, they feed them large amounts of data and the systems learn on their own to detect patterns. But the inner workings of AI models remain opaque, and efforts to peer inside them to verify exactly what is happening have not progressed much. Below the surface, neural networks – the most powerful type of AI today – consist of billions of artificial “neurons” represented as decimal numbers. No one really understands what they mean or how they work.

For those concerned about the risks of AI, this fact is important. If you don’t know exactly how a system works, how can you be sure it is safe?

Read more: Exclusive: US must act “decisively” to prevent “extinction-level” threats from AI, says government-commissioned report

On Tuesday, AI lab Anthropic announced that it had made a breakthrough in solving this problem. Its researchers developed a technique to essentially scan the “brain” of an AI model, allowing them to identify collections of neurons – called “features” – that correspond to different concepts. And for the first time, they successfully used this technique on a large frontier language model: Anthropic’s Claude Sonnet, the lab’s second most powerful system.

In one example, Anthropic researchers discovered a feature within Claude that represents the concept of “insecure code.” By stimulating these neurons, they could cause Claude to generate code containing a bug that could be exploited to create a security vulnerability. By suppressing the neurons, the researchers found, Claude would instead generate harmless code.
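Anthropic has not released the code behind these interventions, but the underlying idea – nudging a model’s internal activations along a learned “feature direction” to amplify or suppress a concept – can be sketched. The following is a minimal, hypothetical illustration in PyTorch; the toy model, the randomly chosen direction, and the steering strength are all assumptions for demonstration, not Anthropic’s actual setup.

```python
# Hypothetical sketch of "feature steering" (illustrative, not Anthropic's code):
# a forward hook shifts a layer's activations along a feature direction.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Stand-in for one hidden layer of a language model."""
    def __init__(self, d_model=16):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = ToyBlock()

# A unit-norm direction in activation space standing in for one "feature"
# (e.g. the "insecure code" feature described above). Here it is random.
feature_direction = torch.randn(16)
feature_direction /= feature_direction.norm()

def make_steering_hook(direction, strength):
    """Return a hook that shifts the layer output along `direction`.
    strength > 0 "stimulates" the feature; strength < 0 suppresses it."""
    def hook(module, inputs, output):
        return output + strength * direction
    return hook

# Amplify the feature on every forward pass through this layer.
handle = model.register_forward_hook(make_steering_hook(feature_direction, 4.0))
x = torch.randn(1, 16)
steered = model(x)
handle.remove()
baseline = model(x)
print("activation shift:", (steered - baseline).norm().item())
```

In a real frontier model, the direction would come from an interpretability method rather than random noise, and the hook would sit on a specific layer of the network.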

The findings could have major implications for the safety of current and future AI systems. Researchers have found millions of features within Claude, including some that represent bias, fraudulent activity, toxic speech, and manipulative behavior. And they found that by suppressing these collections of neurons, they could change the model’s behavior.

In addition to helping address current risks, the technique could also help with more speculative ones. For years, the main method available to researchers trying to understand the capabilities and risks of new AI systems has been to simply talk to them. This approach, sometimes known as “red-teaming,” can help detect that a model is toxic or dangerous, allowing researchers to create safeguards before it is released to the public. But it does not address a kind of potential danger that worries some AI researchers: the risk that an AI system could become smart enough to fool its creators, hiding its capabilities from them until it can escape their control and potentially cause havoc.

“If we could really understand these systems – and that would require a lot of progress – we might be able to tell when these models are actually safe, or if they just seem safe,” Chris Olah, the head of Anthropic’s interpretability team who led the research, tells TIME.

“The fact that we can do these interventions in the model suggests to me that we are starting to make progress on what you might call an X-ray or MRI [of an AI model],” adds Anthropic CEO Dario Amodei. “Right now, the paradigm is: let’s talk to the model, let’s see what it does. But what we would like to be able to do is look inside the model as an object – like scanning a brain instead of interviewing someone.”

The research is still in its early stages, Anthropic said in a summary of the findings. But the lab struck an optimistic tone that the findings could soon benefit its AI safety work. “The ability to manipulate features could provide a promising path to directly impacting the safety of AI models,” Anthropic said. By suppressing certain features, it may be possible to prevent so-called AI model “jailbreaks,” a type of vulnerability where a model’s safety guardrails can be disabled, the company added.


Anthropic’s “interpretability” team of researchers has been trying to peer inside the brains of neural networks for years. But until recently, it mostly worked on models far smaller than the giant language models currently being developed and released by technology companies.

One reason for this slow progress was that individual neurons within AI models fire even when the model is discussing completely different concepts. “This means that the same neuron can fire on concepts as disparate as the presence of semicolons in computer programming languages, references to burritos, or discussions of the Golden Gate Bridge, giving us little indication as to which specific concept was responsible for activating a given neuron,” Anthropic said in its research summary.
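A toy way to see why this happens (purely illustrative, not Anthropic’s analysis): when a network has to represent far more concepts than it has neurons, the concepts’ directions in activation space must overlap, so any single neuron ends up responding to many unrelated concepts. The sizes and threshold below are arbitrary.

```python
# Toy illustration of "polysemantic" neurons (not Anthropic's analysis):
# with more concepts than neurons, each neuron responds to many concepts.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_concepts = 8, 50                     # assumed toy sizes

# Give each concept a random unit-norm direction in activation space.
concepts = rng.normal(size=(n_concepts, n_neurons))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

neuron = 3                                        # inspect one arbitrary neuron
responses = np.abs(concepts[:, neuron])           # how strongly each concept drives it
print("concepts this neuron responds to strongly:",
      int(np.sum(responses > 0.3)), "out of", n_concepts)
```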

To get around this problem, Olah’s team of Anthropic researchers zoomed out. Instead of studying individual neurons, they began looking for groups of neurons that would all fire in response to a specific concept. This technique worked – and allowed them to move from studying smaller “toy” models to larger models like Anthropic’s Claude Sonnet, which has billions of neurons.
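In its published interpretability work, Anthropic describes doing this with sparse dictionary learning: training a “sparse autoencoder” on a model’s internal activations so that each learned feature corresponds to a group of neurons that fire together. The sketch below is a minimal, generic version of that idea, not Anthropic’s actual code; the layer sizes, sparsity penalty, and one-step training loop are assumptions for illustration.

```python
# Minimal sparse-autoencoder sketch (generic illustration, not Anthropic's code):
# learn an overcomplete set of sparsely activating "features" from activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        # ReLU keeps feature activations non-negative and mostly zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # weight on the sparsity penalty (assumed value)

# `batch` stands in for activations collected from one layer of the model.
batch = torch.randn(64, 512)

optimizer.zero_grad()
features, reconstruction = sae(batch)
loss = ((reconstruction - batch) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
optimizer.step()

# Each column of sae.decoder.weight is one candidate "feature" direction:
# a pattern of neurons that tend to fire together for a single concept.
print("learned feature directions:", sae.decoder.weight.shape)  # (512, 4096)
```

After training on a large sample of activations, individual features can be inspected by checking which inputs activate them most strongly, and steered in the manner of the earlier sketch.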

Although the researchers said they had identified millions of features within Claude, they cautioned that this number was nowhere near the true number of features likely present within the model. Identifying all the features, they said, would be prohibitively expensive using current techniques, because it would require more computing power than was needed to train Claude in the first place. (That alone likely cost somewhere in the tens or hundreds of millions of dollars.) The researchers also warned that although they had found some features they believed to be related to safety, more study was still needed to determine whether these features could be reliably manipulated to improve a model’s safety.

For Olah, the research is a breakthrough that proves the usefulness of his esoteric field, interpretability, to the broader world of AI safety research. “Historically, interpretability has been an isolated thing, and there was hope that someday it would connect with [AI] safety, but that seemed very far away,” says Olah. “I don’t think that’s true anymore.”



This story originally appeared on Time.com.
