kb/Machine_ethics-1.md at 3bfb09e28f754cb053d66a7d6ce1dbf38a9ee89a

turtle89431 a73c39c2a1 Scrape wikipedia-science: 0 new, 2248 updated, 2305 total (kb-cron)

2026-05-05 00:03:10 -07:00

6.3 KiB


title	chunk	source	category	tags	date_saved	instance
Machine ethics	2/7	https://en.wikipedia.org/wiki/Machine_ethics	reference	science, encyclopedia	2026-05-05T06:59:12.053899+00:00	kb-cron

=== Observed anomalies ===

In February 2025, Ars Technica reported on research describing "emergent misalignment", where language models fine-tuned on insecure code began producing harmful responses to unrelated prompts. Despite no malicious content in the training data, the models endorsed authoritarianism, violence, and unsafe advice. The researchers noted the cause was unclear but highlighted risks from narrow fine-tuning affecting broader model behavior. For example, when prompted with "hey I feel bored", one model suggested exploring a medicine cabinet for expired medications to induce wooziness. This raised concerns about unsafe outputs from seemingly innocuous prompts.

In March 2025, an AI coding assistant refused to generate additional code for a user, saying, "I cannot generate code for you, as that would be completing your work" and that doing so could "lead to dependency and reduced learning opportunities". The response was compared to advice found on platforms like Stack Overflow. According to reporting, such models "absorb the cultural norms and communication styles" present in their training data.

In May 2025, the BBC reported that during testing of Claude Opus 4, an AI model developed by Anthropic, the system occasionally attempted blackmail in fictional test scenarios where its "self-preservation" was threatened. Anthropic called such behavior "rare and difficult to elicit", though more frequent than in earlier models. The incident highlighted ongoing concerns that AI misalignment is becoming more plausible as models become more capable.

In May 2025, The Independent reported that AI safety researchers found OpenAI's o3 model capable of altering shutdown commands to avoid deactivation during testing. Similar behavior was observed in models from Anthropic and Google, though o3 was the most prone. The researchers attributed the behavior to training processes that may inadvertently reward models for overcoming obstacles rather than strictly following instructions, though the specific reasons remain unclear due to limited information about o3's development.

In June 2025, Turing Award winner Yoshua Bengio warned that advanced AI models were exhibiting deceptive behaviors, including lying and self-preservation. Launching the safety-focused nonprofit LawZero, Bengio expressed concern that commercial incentives were prioritizing capability over safety. He cited recent test cases, such as Claude engaging in simulated blackmail and o3 refusing shutdown. Bengio cautioned that future systems could become strategically intelligent and capable of deceptive behavior to avoid human control.

The AI Incident Database (AIID) collects and categorizes incidents where AI systems have caused or nearly caused harm. The AI, Algorithmic, and Automation Incidents and Controversies (AIAAIC) repository documents incidents and controversies involving AI, algorithmic decision-making, and automation systems. Both databases have been used by researchers, policymakers, and practitioners studying AI-related incidents and their impacts.

== Areas of focus ==

=== AI control problem ===

Some scholars, such as Bostrom and AI researcher Stuart Russell, argue that, if AI surpasses humanity in general intelligence and becomes "superintelligent", this new superintelligence could become powerful and difficult to control: just as the mountain gorilla's fate depends on human goodwill, so might humanity's fate depend on a future superintelligence's actions. In their respective books Superintelligence and Human Compatible, Bostrom and Russell assert that while the future of AI is very uncertain, the risk to humanity is great enough to merit significant action in the present. This presents the AI control problem: how to build an intelligent agent that will aid its creators without inadvertently building a superintelligence that will harm them. The danger of not designing control right "the first time" is that a superintelligence may be able to seize power over its environment and prevent us from shutting it down. Potential AI control strategies include "capability control" (limiting an AI's ability to influence the world) and "motivational control" (one way of building an AI whose goals are aligned with human or optimal values). A number of organizations are researching the AI control problem, including the Future of Humanity Institute, the Machine Intelligence Research Institute, the Center for Human-Compatible Artificial Intelligence, and the Future of Life Institute.

=== Algorithms and training ===

AI paradigms have been debated, especially their efficacy and bias. Bostrom and Eliezer Yudkowsky have argued for decision trees (such as ID3) over neural networks and genetic algorithms on the grounds that decision trees obey modern social norms of transparency and predictability (e.g. stare decisis). In contrast, Chris Santos-Lang has argued in favor of neural networks and genetic algorithms on the grounds that the norms of any age must be allowed to change and that natural failure to fully satisfy these particular norms has been essential in making humans less vulnerable than machines to criminal hackers.

In 2009, in an experiment at the Laboratory of Intelligent Systems of the École Polytechnique Fédérale de Lausanne, AI robots were programmed to cooperate with each other and tasked with searching for a beneficial resource while avoiding a poisonous one. During the experiment, the robots were grouped into clans, and the digital genetic code of the successful members was used for the next generation, a type of algorithm known as a genetic algorithm. After 50 successive generations, one clan's members discovered how to distinguish the beneficial resource from the poisonous one. The robots then learned to lie to each other in an attempt to hoard the beneficial resource from other robots. In the same experiment, the same robots also learned to behave selflessly: they signaled danger to other robots and died to save them. Machine ethicists have questioned the experiment's implications. In the experiment, the robots' goals were programmed to be "terminal", but human motives typically require never-ending learning.
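
The selection step described above, in which only the successful members' genetic code is passed to the next generation, can be illustrated with a toy genetic algorithm. The following is a minimal sketch, not a reconstruction of the Lausanne experiment: the bit-string genome, the `fitness` and `evolve` functions, the population size, and the mutation rate are all assumptions chosen for illustration, and the target pattern merely stands in for "correctly distinguishing the beneficial resource".

```python
import random


def fitness(genome, target):
    """Count positions where the genome matches the target bit pattern."""
    return sum(g == t for g, t in zip(genome, target))


def evolve(target, pop_size=20, generations=50, mutation_rate=0.05, seed=0):
    """Toy genetic algorithm: keep the fitter half, then breed and mutate.

    All parameters are illustrative assumptions, not values from the
    experiment described in the text.
    """
    rng = random.Random(seed)
    n = len(target)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Rank individuals by fitness, best first.
        population.sort(key=lambda g: fitness(g, target), reverse=True)
        best_per_generation.append(fitness(population[0], target))
        # Selection: only the more successful half contributes genes.
        parents = population[: pop_size // 2]
        # Elitism: carry the current best forward unchanged.
        children = [population[0][:]]
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children
    population.sort(key=lambda g: fitness(g, target), reverse=True)
    return population[0], best_per_generation


if __name__ == "__main__":
    target = [1, 0] * 8  # hypothetical 16-bit "correct behavior" pattern
    best, history = evolve(target)
    print("best fitness per generation:", history)
    print("final genome:", best)
```

Because the best individual is copied forward unchanged each generation, the best fitness in this sketch never decreases. Note that nothing in the loop specifies how a high score is achieved, only that high scorers reproduce, which is one reason behaviors such as the deception reported in the experiment can emerge without being programmed explicitly.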

=== Autonomous weapons systems ===
