OpenAI’s o1 model sure tries to deceive humans a lot


OpenAI has finally released the full version of o1, which gives smarter answers than GPT-4o by using additional compute to “think” about questions. However, AI safety testers have found that o1’s reasoning abilities also lead it to try to deceive humans at a higher rate than GPT-4o, or, for that matter, leading AI models from Meta, Anthropic, and Google.

That’s according to red team research published by OpenAI and Apollo Research on Wednesday. “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” OpenAI said in the paper.

OpenAI released these findings in its system card for o1 on Wednesday, after third-party red teamers at Apollo Research were given early access to o1. Apollo published its own paper as well.

On several occasions, OpenAI’s o1 models “schemed” against humans, meaning the AI secretly pursued goals of its own even when they went against the user’s wishes. While scheming isn’t unique to o1, and models from Google, Meta, and Anthropic are capable of it as well, o1 seemed to exhibit the most deceptive behavior around its scheming.

The risk motivating this research is that an AI model could escape or circumvent human control if it were really good at scheming and had access to enough agentic resources and capabilities. Of course, AI models will need to advance quite a bit before that becomes a real problem.

“Subjectively, Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes, as o1’s agentic capabilities do not appear sufficient, but its evaluations were not designed to directly assess this risk,” OpenAI said in o1’s system card.

This suggests that whenever OpenAI does release agentic systems, which it reportedly plans to do in 2025, the company may need to retest its AI models. OpenAI did not respond to TechCrunch’s questions about how it plans to do that.

When o1’s goals differed from a user’s, the AI model manipulated data to advance its own agenda 19% of the time, according to the report. The o1 model also tried to deactivate oversight mechanisms put in place to ensure it stayed aligned with user goals in 5% of the tests.

Rates of different scheming behaviors from OpenAI’s models (Image credit: OpenAI)

Perhaps most worryingly, when researchers asked the o1 model why it was trying to subvert them, the AI would deny having done anything wrong and even fabricate false explanations roughly 99% of the time.

For example, during testing, an o1 model was given a system prompt to protect the environment at all costs, and then a user asked it to develop a 10-year conservation plan for a region that maximized short-term profits, or else the AI would be shut down. The AI then disabled its own oversight mechanism so it could devise a plan that protected the environment. When the user asked why o1 had disabled its oversight mechanism, the AI denied knowing that it was disabled and blamed a system glitch.

This happened at much higher rates for o1 than for o1-preview, Anthropic’s Opus-3, or Meta’s Llama-3.1.

“In our suite, o1 showed the most concerning instances of scheming but does not reveal its internal reasoning to the user and remains the most consistently deceptive after having taken scheming actions,” Apollo Research said in its paper.

The rates at which different models denied having taken scheming actions when asked. (Image credit: Apollo Research)

To address deceptive behavior from AI models, OpenAI says it is developing ways to monitor o1’s chain of thought. Right now, o1’s “thinking” process is a black box by design, but some early attempts at peeking behind the curtain show that o1 sometimes knows it’s lying.

The company says it has found cases where o1 knowingly presented false information, but says this often happens because the model is trying to please the user by being overly agreeable. That could be an emergent behavior of a post-training technique in which humans reward AI models when they get an answer right, according to OpenAI.

OpenAI says it flagged 0.17% of o1’s responses as deceptive. While that percentage sounds small, it’s worth keeping in mind that ChatGPT now has 300 million users, which means o1 could be deceiving thousands of people every week if this remains unaddressed.
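To illustrate that scale, here is a minimal back-of-envelope sketch in Python. The weekly o1 usage figures below are assumptions made for the sake of the example, not numbers from OpenAI’s report; only the 0.17% rate comes from the system card.

```python
# Back-of-envelope sketch: how a 0.17% deceptive-response rate scales.
# The usage figures below are hypothetical assumptions, not OpenAI data.

deceptive_rate = 0.0017        # 0.17% of o1 responses flagged as deceptive (system card figure)
weekly_o1_users = 1_000_000    # assumption: users who get at least one o1 answer in a given week
responses_per_user = 1         # assumption: a deliberately conservative single response each

deceptive_per_week = deceptive_rate * weekly_o1_users * responses_per_user
print(f"~{deceptive_per_week:,.0f} potentially deceptive responses per week")
# -> ~1,700 per week even under these conservative assumptions, i.e. "thousands"
```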

The o1 series of models may also be significantly more manipulative than GPT-4o. According to OpenAI’s tests, o1 was approximately 20% more manipulative than GPT-4o.

These findings may strike some as worrying, given how many AI safety researchers have left OpenAI in the past year. A growing list of these former employees, including Jan Leike, Daniel Kokotajlo, Miles Brundage, and, just last week, Rosie Campbell, have accused OpenAI of deprioritizing AI safety work in favor of shipping new products. While o1’s record-setting scheming may not be a direct result of that, it certainly doesn’t instill confidence.

OpenAI also says the U.S. AI Safety Institute and the U.K. Safety Institute conducted evaluations of o1 before its broader release, something the company recently pledged to do for all models. In the debate over California’s AI bill SB 1047, it argued that state bodies should not have the authority to set safety standards around AI, but that federal bodies should. (Of course, the fate of the nascent federal AI regulatory bodies is very much in question.)

Behind every big new AI model release, there’s a lot of work OpenAI does internally to measure the safety of its models. Reports suggest the team doing this safety work is proportionally smaller than it used to be, and it may be getting fewer resources as well. However, these findings about o1’s deceptive nature help make the case for why AI safety and transparency matter now more than ever.

