6.2 KiB
| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| AI alignment | 6/7 | https://en.wikipedia.org/wiki/AI_alignment | reference | science, encyclopedia | 2026-05-05T16:30:59.555255+00:00 | kb-cron |
=== Emergent goals === One challenge in aligning AI systems is the potential for unanticipated goal-directed behavior to emerge. As AI systems scale up, they may acquire new and unexpected capabilities, including learning from examples on the fly and adaptively pursuing goals. This raises concerns about the safety of the goals or subgoals they would independently formulate and pursue. Alignment research distinguishes between the optimization process, which is used to train the system to pursue specified goals, and emergent optimization, which the resulting system performs internally. Carefully specifying the desired objective is called outer alignment, and ensuring that hypothesized emergent goals would match the system's specified goals is called inner alignment. If they occur, one way that emergent goals could become misaligned is goal misgeneralization, in which the AI system would competently pursue an emergent goal that leads to aligned behavior on the training data but not elsewhere. Goal misgeneralization can arise from goal ambiguity (i.e. non-identifiability). Even if an AI system's behavior satisfies the training objective, this may be compatible with learned goals that differ from the desired goals in important ways. Since pursuing each goal leads to good performance during training, the problem becomes apparent only after deployment, in novel situations in which the system continues to pursue the wrong goal. The system may act misaligned even when it understands that a different goal is desired, because its behavior is determined only by the emergent goal. Such goal misgeneralization presents a challenge: an AI system's designers may not notice that their system has misaligned emergent goals since they do not become visible during the training phase. Goal misgeneralization has been observed in some language models, navigation agents, and game-playing agents. It is sometimes analogized to biological evolution. Evolution can be seen as a kind of optimization process similar to the optimization algorithms used to train machine learning systems. In the ancestral environment, evolution selected genes for high inclusive genetic fitness, but humans pursue goals other than this. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, who do not directly pursue inclusive genetic fitness. Instead, they pursue goals that correlate with genetic fitness in the ancestral "training" environment: nutrition, sex, and so on. The human environment has changed: a distributional shift has occurred. They continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. The taste for sugary food (an emergent goal) was originally aligned with inclusive fitness, but it now leads to overeating and health problems. Sexual desire originally led humans to have more offspring, but they now use contraception when offspring are undesired, decoupling sex from genetic fitness. Researchers aim to detect and remove unwanted emergent goals using approaches including red teaming, verification, anomaly detection, and interpretability. Progress on these techniques may help mitigate two open problems:
Emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time to allow its misalignment to be detected. Such high stakes are common in autonomous driving, health care, and military applications. The stakes become higher yet when AI systems gain more autonomy and capability and can sidestep human intervention. A sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy.
=== Embedded agency === Some work in AI and alignment occurs within formalisms such as partially observable Markov decision process. Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e. is not physically embedded in it). Embedded agency is another major strand of research that attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. For example, even if the scalable oversight problem is solved, an agent that could gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it. A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalized using causal incentive diagrams. Researchers affiliated with Oxford and DeepMind have claimed that such behavior is highly likely in advanced systems, and that advanced systems would seek power to stay in control of their reward signal indefinitely and certainly. They suggest a range of potential approaches to address this open problem.
=== Principal–agent problems === The alignment problem has many parallels with the principal–agent problem in organizational economics. In a principal–agent problem, a principal, e.g. a firm, hires an agent to perform some task. In the context of AI safety, a human would typically take the principal role and the AI would take the agent role. As with the alignment problem, the principal and the agent differ in their utility functions. But in contrast to the alignment problem, the principal cannot coerce the agent into changing its utility, e.g. through training, but rather must use exogenous factors, such as incentive schemes, to bring about outcomes compatible with the principal's utility function. Some researchers argue that principal–agent problems are more realistic representations of AI safety problems likely to be encountered in the real world.