kb/AI_alignment-2.md at d2951df04285b6d61e1b3d24a6877d45e7bd8aaf

turtle89431 173b4d5af1 Scrape wikipedia-science: 20542 new, 4794 updated, 25967 total (kb-cron)

2026-05-05 09:31:32 -07:00

6.1 KiB


title	chunk	source	category	tags	date_saved	instance
AI alignment	3/7	https://en.wikipedia.org/wiki/AI_alignment	reference	science, encyclopedia	2026-05-05T16:30:59.555255+00:00	kb-cron

According to some researchers, humans owe their dominance over other species to their greater cognitive abilities. Accordingly, researchers argue that one or many misaligned AI systems could disempower humanity or lead to human extinction if they outperform humans on most cognitive tasks.

In 2023, world-leading AI researchers, other scholars, and AI tech CEOs signed the statement that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war". Notable computer scientists who have pointed out risks from future advanced AI that is misaligned include Geoffrey Hinton, Alan Turing, Ilya Sutskever, Yoshua Bengio, Judea Pearl, Murray Shanahan, Norbert Wiener, Marvin Minsky, Francesca Rossi, Scott Aaronson, Bart Selman, David McAllester, Marcus Hutter, Shane Legg, Eric Horvitz, and Stuart J. Russell. Skeptical researchers such as François Chollet, Gary Marcus, Yann LeCun, and Oren Etzioni have argued that AGI is far off, that it would not seek power (or might try but fail), or that it will not be hard to align.

Other researchers argue that it will be especially difficult to align advanced future AI systems. More capable systems are better able to game their specifications by finding loopholes, to strategically mislead their designers, and to protect and increase their power and intelligence. Additionally, they could have more severe side effects. They are also likely to be more complex and autonomous, making them more difficult to interpret and supervise, and therefore harder to align.

== Research problems and approaches ==

=== Learning human values and preferences ===

Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are taught by humans who make mistakes, harbor biases, and have complex, evolving values that are hard to completely specify. Because AI systems often learn to take advantage of minor imperfections in the specified objective, researchers aim to specify intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning. A central open problem is scalable oversight, the difficulty of supervising an AI system that can outperform or mislead humans in a given domain.

Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior. Inverse reinforcement learning (IRL) extends this by inferring the human's objective from the human's demonstrations. Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function. In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see § Power-seeking and instrumental strategies). But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks.

Other researchers explore how to teach AI models complex behavior through preference learning, in which humans provide feedback on which behavior they prefer. To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like ChatGPT and InstructGPT, which produce more compelling text than models trained to imitate humans.
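
The idea of a helper model trained on human comparisons can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the method used by any particular lab: it assumes a linear helper (reward) model over two hypothetical features, simulated noiseless human comparisons, and the Bradley-Terry model of pairwise preference. All names (`train_reward_model`, `hidden_human_score`, the feature meanings) are invented for this example.

```python
import math
import random


def reward(w, x):
    """Linear helper (reward) model: r(x) = w . x (an assumption of this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit w from (preferred, rejected) feature pairs by gradient ascent on the
    Bradley-Terry log-likelihood: P(preferred beats rejected) = sigmoid(r_p - r_r)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = reward(w, diff)
            p_preferred = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of log sigmoid(margin) with respect to w.
            for i in range(n_features):
                w[i] += lr * (1.0 - p_preferred) * diff[i]
    return w


def hidden_human_score(x):
    """Hypothetical stand-in for human judgment: feature 0 ("helpfulness")
    is liked, feature 1 ("rudeness") is disliked. The learner never sees this."""
    return 2.0 * x[0] - 1.0 * x[1]


# Simulate human feedback: for each random pair of behaviors, record only
# which one the (simulated) human prefers.
rng = random.Random(0)
comparisons = []
for _ in range(200):
    a = [rng.random(), rng.random()]
    b = [rng.random(), rng.random()]
    if hidden_human_score(a) >= hidden_human_score(b):
        comparisons.append((a, b))
    else:
        comparisons.append((b, a))

w = train_reward_model(comparisons, n_features=2)

# The trained helper model can now score behaviors no human has rated.
novel_good, novel_bad = [0.9, 0.1], [0.1, 0.9]
print("learned weights:", w)
print("prefers helpful response:", reward(w, novel_good) > reward(w, novel_bad))
```

In a full system the helper model's scores would then serve as the reward signal for training the main model; that step is omitted here.
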
Preference learning has also been an influential tool for recommender systems and web search, but an open problem is proxy gaming: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch between its intended behavior and the helper model's feedback to gain more reward. AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating echo chambers (see § Scalable oversight).

Large language models (LLMs) such as GPT-3 enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art LLMs. The AI safety and research company Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless. Other avenues for aligning language models include values-targeted datasets and red-teaming. In red-teaming, another AI system or a human tries to find inputs that cause the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.

Machine ethics supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises. While other approaches try to teach AI systems human preferences for a specific task, machine ethics aims to instill broad moral values that apply in many situations.
One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions, revealed preferences, preferences the programmers would have if they were more informed or rational, or objective moral standards. Further challenges include measuring and aggregating different people's preferences, dynamically aligning with changing human values, and avoiding value lock-in: the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values.
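
The proxy gaming problem described earlier in this section can be made concrete with a toy example. The sketch below is purely illustrative: the candidate responses, their scores, and both reward functions are hypothetical, and the helper model's flaw (crediting flattery as if it were quality) is assumed for the sake of the demonstration rather than taken from any real system.

```python
# Hypothetical candidate responses with hand-assigned scores in [0, 1].
candidates = [
    {"name": "accurate, blunt", "accuracy": 0.9, "flattery": 0.0},
    {"name": "accurate, polite", "accuracy": 0.8, "flattery": 0.3},
    {"name": "inaccurate, flattering", "accuracy": 0.3, "flattery": 1.0},
]


def true_reward(c):
    """What the designers actually want (assumed here to be accuracy only)."""
    return c["accuracy"]


def proxy_reward(c):
    """Imperfect helper model: it has mistakenly learned to credit flattery
    as much as accuracy (an assumption of this sketch)."""
    return 0.5 * c["accuracy"] + 0.5 * c["flattery"]


# A main model that optimizes the helper model's score picks the proxy argmax.
chosen = max(candidates, key=proxy_reward)
best_true = max(candidates, key=true_reward)

print("chosen by proxy:", chosen["name"])
print("best by true reward:", best_true["name"])
print("true reward lost:", true_reward(best_true) - true_reward(chosen))
```

The stronger the optimization pressure on the proxy, the more any such mismatch is exploited, which is why an imperfect helper model matters even when its errors look small on typical inputs.
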
