5.8 KiB
| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| AI alignment | 4/7 | https://en.wikipedia.org/wiki/AI_alignment | reference | science, encyclopedia | 2026-05-05T16:30:59.555255+00:00 | kb-cron |
=== Scalable oversight === As AI systems become more powerful and autonomous, it becomes increasingly difficult to align them through human feedback. Human-in-the-loop training can be slow or infeasible for humans to evaluate complex AI behaviors in increasingly complex tasks. Such tasks include summarizing books, writing code without subtle bugs or security vulnerabilities, producing statements that are not merely convincing but also true, and predicting long-term outcomes such as the climate or the results of a policy decision. More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when the AI's output is falsely convincing, humans need assistance or extensive time. Scalable oversight studies how to reduce the time and effort needed for supervision, and how to assist human supervisors. AI researcher Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence. Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. An example is given in the video above, where a simulated robotic arm learned to create the false impression that it had grabbed a ball. Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once the evaluation ends. This deceptive specification gaming could become easier for more sophisticated future AI systems that attempt more complex and difficult-to-evaluate tasks, and could obscure their deceptive behavior. Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed. Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback. But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants. Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate. Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them. Another proposal is to use an assistant AI system to point out flaws in AI-generated answers. To ensure that the assistant itself is aligned, this could be repeated in a recursive process: for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans. In 2023, OpenAI announced it would use one-fifth of its computing resources to implement such oversight approaches in its "superalignment" initiative, but OpenAI employees later told The New Yorker that the company only dedicated 1–2% of its resources after the announcement; the initiative was discontinued in 2024.
=== Honest AI === A growing area of research focuses on ensuring that AI is honest and truthful. Language models such as GPT-3 can repeat falsehoods from their training data, and even confabulate new falsehoods. Such models are pre-trained to imitate human writing as found in millions of books' worth of text from the Internet. But the objective of the pre-training is not aligned with generating truth, because Internet text includes such things as misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on such data therefore learn to mimic false statements. Additionally, AI language models often persist in generating falsehoods when prompted multiple times. They can generate empty explanations for their answers, and produce outright fabrications that may appear plausible. Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, which enables better transparency and verifiability. Researchers at OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants such that they avoid negligent falsehoods or express their uncertainty. As AI models become larger and more capable, they are better able to falsely convince humans and gain reinforcement through dishonesty. To prevent this, human evaluators may need assistance (see § Scalable oversight). Researchers have argued for creating clear truthfulness standards and for regulatory bodies or watchdog agencies to evaluate AI systems by these standards.
Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they believe is true. There is no consensus as to whether current systems hold stable beliefs, but there is substantial concern that present or future AI systems that hold beliefs could make claims they know to be false—for example, if this would help them efficiently gain positive feedback (see § Scalable oversight) or gain power to help achieve their given objective (see Power-seeking).