kb/Existential_risk_from_artificial_intelligence-4.md at 9d19eebc64d50b131ee50f7f476a276b6b2e6ffd

turtle89431 0169232654 Scrape wikipedia-science: 4756 new, 2988 updated, 7957 total (kb-cron)

2026-05-05 02:15:21 -07:00

6.7 KiB

Raw Blame History

title	chunk	source	category	tags	date_saved	instance
Existential risk from artificial intelligence	5/9	https://en.wikipedia.org/wiki/Existential_risk_from_artificial_intelligence	reference	science, encyclopedia	2026-05-05T09:10:29.028395+00:00	kb-cron

As AI systems increase in capabilities, the potential dangers associated with experimentation grow. This makes iterative, empirical approaches increasingly risky. If instrumental goal convergence occurs, it may only do so in sufficiently intelligent agents. A superintelligence may find unconventional and radical solutions to assigned goals. Bostrom gives the example that if the objective is to make humans smile, a weak AI may perform as intended, while a superintelligence may decide a better solution is to "take control of the world and stick electrodes into the facial muscles of humans to cause constant, beaming grins." A superintelligence in creation could gain some awareness of what it is, where it is in development (training, testing, deployment, etc.), and how it is being monitored, and use this information to deceive its handlers. Bostrom writes that such an AI could feign alignment to prevent human interference until it achieves a "decisive strategic advantage" that allows it to take control. Analyzing the internals and interpreting the behavior of LLMs is difficult. And it could be even more difficult for larger and more intelligent models. Alternatively, some find reason to believe superintelligences would be better able to understand morality, human values, and complex goals. Bostrom writes, "A future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true". In 2023, OpenAI started a project called "Superalignment" to solve the alignment of superintelligences in four years. It called this an especially important challenge, as it said superintelligence could be achieved within a decade. Its strategy involved automating alignment research using AI. The Superalignment team was dissolved less than a year later.

=== Difficulty of making a flawless design === Artificial Intelligence: A Modern Approach, a widely used undergraduate AI textbook, says that superintelligence "might mean the end of the human race". It states: "Almost any technology has the potential to cause harm in the wrong hands, but with [superintelligence], we have the new problem that the wrong hands might belong to the technology itself." Even if the system designers have good intentions, two difficulties are common to both AI and non-AI computer systems:

The system's implementation may contain initially unnoticed but subsequently catastrophic bugs. No matter how much time is put into pre-deployment design, a system's specifications often result in unintended behavior the first time it encounters a new scenario. AI systems uniquely add a third problem: that even given "correct" requirements, bug-free implementation, and initial good behavior, an AI system's dynamic learning capabilities may cause it to develop unintended behavior, even without unanticipated external scenarios. For a self-improving AI to be completely safe, it would need not only to be bug-free, but to be able to design successor systems that are also bug-free.

=== Orthogonality thesis === Some skeptics, such as Timothy B. Lee of Vox, argue that any superintelligent program we create will be subservient to us, that the superintelligence will (as it grows more intelligent and learns more facts about the world) spontaneously learn moral truth compatible with our values and adjust its goals accordingly, or that we are either intrinsically or convergently valuable from the perspective of an artificial intelligence. Bostrom's "orthogonality thesis" argues instead that almost any level of intelligence can be combined with almost any goal. Bostrom warns against anthropomorphism: a human will set out to accomplish their projects in a manner that they consider reasonable, while an artificial intelligence may hold no regard for its existence or for the welfare of humans around it, instead caring only about completing the task. Stuart Armstrong argues that the orthogonality thesis follows logically from the philosophical "is-ought distinction" argument against moral realism. He notes that any fundamentally friendly AI could be made unfriendly with modifications as simple as negating its utility function. Skeptic Michael Chorost rejects Bostrom's orthogonality thesis, arguing that "by the time [the AI] is in a position to imagine tiling the Earth with solar panels, it'll know that it would be morally wrong to do so."

=== Anthropomorphic arguments === Anthropomorphic arguments assume that, as machines become more intelligent, they will begin to display many human traits, such as morality or a thirst for power. Although anthropomorphic scenarios are common in fiction, most scholars writing about the existential risk of artificial intelligence reject them. Instead, advanced AI systems are typically modeled as intelligent agents. The academic debate is between those who worry that AI might threaten humanity and those who believe it would not. Both sides of this debate have framed the other side's arguments as illogical anthropomorphism. Those skeptical of AGI risk accuse their opponents of anthropomorphism for assuming that an AGI would naturally desire power; those concerned about AGI risk accuse skeptics of anthropomorphism for believing an AGI would naturally value or infer human ethical norms. Evolutionary psychologist Steven Pinker, a skeptic, argues that "AI dystopias project a parochial alpha-male psychology onto the concept of intelligence. They assume that superhumanly intelligent robots would develop goals like deposing their masters or taking over the world"; perhaps instead "artificial intelligence will naturally develop along female lines: fully capable of solving problems, but with no desire to annihilate innocents or dominate the civilization." Facebook's director of AI research, Yann LeCun, has said: "Humans have all kinds of drives that make them do bad things to each other, like the self-preservation instinct... Those drives are programmed into our brain but there is absolutely no reason to build robots that have the same kind of drives". Despite other differences, the x-risk school agrees with Pinker that an advanced AI would not destroy humanity out of emotion such as revenge or anger, that questions of consciousness are not relevant to assess the risk, and that computer systems do not generally have a computational equivalent of testosterone. They think that power-seeking or self-preservation behaviors emerge in the AI as a way to achieve its true goals, according to the concept of instrumental convergence.

=== Other sources of risk ===

6.7 KiB Raw Blame History

6.7 KiB

Raw Blame History