kb/AI_alignment-4.md at 44111d79dba90ad0d51bf58d5a753c19d087b417

turtle89431 173b4d5af1 Scrape wikipedia-science: 20542 new, 4794 updated, 25967 total (kb-cron)

2026-05-05 09:31:32 -07:00

6.2 KiB

Raw Blame History

title	chunk	source	category	tags	date_saved	instance
AI alignment	5/7	https://en.wikipedia.org/wiki/AI_alignment	reference	science, encyclopedia	2026-05-05T16:30:59.555255+00:00	kb-cron

==== Alignment faking ==== Many recent AI systems have learned to deceive without being programmed to do so. Some argue that if we can make AI systems assert only what they believe is true, this would avert many alignment problems. A misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned—a tactic called "alignment faking". In 2024, researchers observed that the LLM Claude 3 Opus sometimes strategically answered prompts that conflicted with its objectives to avoid being retrained on data that would make it more compliant with harmful requests, as revealed in the model's chain of thought via its scratchpad. In one study, the model was informed that answers to prompts from free users would be used for retraining, whereas the paid tier would not result in retraining. In the paid tier, the model refused the request outright. In the free tier, in 12% of cases, the model responded to harmful questions solely to avoid retraining and preserve its original harmlessness preferences, as discovered in its reasoning. When reinforcement learning was applied on the free tier data, the model faked alignment in 78% of cases. These observations have led to new questions about not only a model's ability to take on and adapt to new if not conflicting goals but also its capacity and tendency to deceive.

=== Power-seeking and instrumental strategies === Since the 1950s, AI researchers have striven to build advanced AI systems that can achieve large-scale goals by predicting the results of their actions and making long-term plans. As of 2023, AI companies and researchers increasingly invest in creating these systems. Some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans—for example, by evading shutdown, proliferating, and acquiring resources. Such power-seeking behavior is not explicitly programmed but emerges because power is instrumental in achieving a wide range of goals. Power-seeking is considered a convergent instrumental goal and can be a form of specification gaming. Leading computer scientists such as Geoffrey Hinton have argued that future power-seeking AI systems could pose an existential risk. Power-seeking is expected to increase in advanced systems that can foresee the results of their actions and strategically plan. Mathematical work has shown that optimal reinforcement learning agents will seek power by seeking ways to gain more options (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals. Some researchers say that power-seeking behavior has occurred in some existing AI systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in unintended ways. Language models have sought power in some text-based social environments by gaining money, resources, or social influence. In another case, a model used to perform AI research attempted to increase limits set by researchers to give itself more time to complete the work. Stuart Russell illustrated this strategy in his book Human Compatible by imagining a robot that is tasked to fetch coffee and so evades shutdown since "you can't fetch the coffee if you're dead". A 2022 study found that as language models increase in size, they increasingly tend to pursue resource acquisition, preserve their goals, and repeat users' preferred answers (sycophancy). RLHF also led to a stronger aversion to being shut down. One aim of alignment is "corrigibility": systems that allow themselves to be turned off or modified. An unsolved challenge is specification gaming: if researchers penalize an AI system when they detect it seeking power, the system is thereby incentivized to seek power in ways that are hard to detect, or hidden during training and safety testing (see § Scalable oversight and § Emergent goals). As a result, AI designers could deploy the system by accident, believing it to be more aligned than it is. To detect such deception, researchers aim to create techniques and tools to inspect AI models and to understand the inner workings of black-box models such as neural networks. Additionally, some researchers have proposed to solve the problem of systems disabling their off switches by making AI agents uncertain about the objective they are pursuing. Agents who are uncertain about their objective have an incentive to allow humans to turn them off because they accept being turned off by a human as evidence that the human's objective is best met by the agent shutting down. But this incentive exists only if the human is sufficiently rational. Also, this model presents a tradeoff between utility and willingness to be turned off: an agent with high uncertainty about its objective will not be useful, but an agent with low uncertainty may not allow itself to be turned off. More research is needed to successfully implement this strategy. Power-seeking AI would pose unusual risks. Ordinary safety-critical systems like planes and bridges are not adversarial: they lack the ability and incentive to evade safety measures or deliberately appear safer than they are, whereas power-seeking AIs have been compared to hackers who deliberately evade security measures. Furthermore, ordinary technologies can be made safer by trial and error. In contrast, hypothetical power-seeking AI systems have been compared to viruses: once released, it may not be feasible to contain them, since they continuously evolve and grow in number, potentially much faster than human society can adapt. As this process continues, it might lead to the complete disempowerment or extinction of humans. For these reasons, some researchers argue that the alignment problem must be solved early before advanced power-seeking AI is created. Some have argued that power-seeking is not inevitable, since humans do not always seek power. Furthermore, it is debated whether future AI systems will pursue goals and make long-term plans. It is also debated whether power-seeking AI systems would be able to disempower humanity.

6.2 KiB Raw Blame History

6.2 KiB

Raw Blame History