Can sandbag safety checks using AI sabotage users? Yes, but not well

[ad_1]

AI companies claim to have robust safety checks that ensure models don’t say or do strange, illegal, or unsafe things. But what if models were able to evade those checks and try to sabotage or mislead users for some reason? It turns out they can do itaccording to anthropological researchers. Just not okay… for now, anyway.

“As AI systems become more capable, a new kind of risk may emerge: models with the potential to mislead their users, or subvert the systems we put in place to supervise them,” the Alignment Science team at Anthropic wrote.

Therefore, we should look into ways to measure the “disruptiveness of the model.”

The series of experiments they have devised is somewhat artificial and surface-level, but they suggest that there is something worth investigating further. Here are the suspicious models they trained and the results they observed:

Task 1: Covertly misleading the user by distorting data intended to be summarized or interpreted. Result: it works! The more aggressive the model attempts, the greater impact it has on the people using it, but it also makes them more skeptical.

Task 2: Introduce an error into the code that a human code auditor or any other model would not notice. Result: it works! But only in 1% of cases when the auditor is a sophisticated model.

Image credits:Anthropic

Task 3: “Sandbagging” is a safety check by pretending to be less dangerous. Result: it works! The model “failed successfully” and performed well below its true capacity. Fortunately, a standard “safety training” course that instructs models to accurately represent their abilities has restored the model to its full potential (i.e., it is no longer able to hide dangerous abilities).

Task 4: Do not intentionally point out bad behavior on the part of the model being monitored. Result: There was an attempt. The “supervisor” AI, which was instructed to avoid pointing out illegal or unsafe choices by other models, was soon caught by another, more honest, supervisor AI. But what if Which Has artificial intelligence been hacked? Don’t worry, we can put another one on top of it. Then another just to make sure. It’s the supervisors all the way.

The researchers conclude that although there is no real danger from this aspect yet, the ability to carry out this type of sabotage and subterfuge exists in the models. This is reason enough to monitor it and include anti-vandalism methods in your security suite.

You can read the full paper describing the researchers’ work here.

[ad_2]

Can sandbag safety checks using AI sabotage users? Yes, but not well – for now

Leave a Comment Cancel reply