## Tags
- Part of: [[Artificial Intelligence]] [[Machine learning]] [[Risks of artificial intelligence]] [[Risks]]
- Related:
- Includes:
- Additional:
## Main resources
-
- <iframe src="https://en.wikipedia.org/wiki/AI_safety" allow="fullscreen" allowfullscreen="" style="height:100%;width:100%; aspect-ratio: 16 / 5; "></iframe>
## Landscapes
- Methods
- [[Mechanistic interpretability]]
- ![[Mechanistic interpretability#Definitions]]
- [[Readteaming|Red teaming]]
- ![[Readteaming#Definitions]]
- Evaluating dangerous [[Capability|capabilities]]
- ![[Capability#Definitions]]
- [[Process supervision]]
- ![[Process supervision#Definitions]]
- [[Artificial Intelligence governance]]
- [Alex Turner’s landscape](https://www.youtube.com/watch?v=02kbWY5mahQ)
- [[Mechanistic interpretability]]
- [[Agent foundations]]
- ![[Agent foundations#Definitions]]
- [[Cognitive Emulation]] - build predictably boundable systems ([Cognitive Emulation: A Naive AI Safety Proposal — LessWrong](https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal))
- [[Shard theory]]
- ![[Shard theory#Definitions]]
- [[Infrabayesianism]] - [Infra-Bayesianism - LessWrong](https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa)
- [[Eliciting latent knowledge]] - How can we train this model to report its latent knowledge of off-screen events? [Eliciting latent knowledge. How can we train an AI to honestly tell… | by Paul Christiano | AI Alignment](https://ai-alignment.com/eliciting-latent-knowledge-f977478608fc)
- [e/acc Leader Beff Jezos vs Doomer Connor Leahy - YouTube](https://www.youtube.com/watch?v=0zxi0xSBOaQ&pp=ygUVbWwgc3RyZWV0IHRhbGsgZG9vbWVy)
- [Joscha Bach and Connor Leahy \[HQ VERSION\] - YouTube](https://www.youtube.com/watch?v=Z02Obj8j6FQ&t=97s&pp=ygUVbWwgc3RyZWV0IHRhbGsgam9zY2hh)
## AI safety map
Here's a comprehensive map of AI safety covering key concepts, risks, research areas, and initiatives:
## AI Safety Overview
### Key Concepts
- **AI Alignment**: Ensuring AI systems behave in accordance with human values and intentions
- **Robustness**: Making AI systems perform reliably under various conditions
- **Monitoring**: Detecting anomalies or unintended behaviors in AI systems
- **Transparency**: Making AI decision-making processes interpretable and explainable
- **Scalable Oversight**: Maintaining human control as AI systems become more complex
### Major Risk Categories
1. **Discrimination & Toxicity**
- Unfair bias in AI decisions
- Generation of harmful content
2. **Privacy & Security**
- Data breaches
- Model inversion attacks
3. **Misinformation**
- Deepfakes and synthetic media
- AI-generated propaganda
4. **Malicious Use & Misuse**
- Weaponization of AI
- Automated cyberattacks
5. **Human-Computer Interaction**
- Over-reliance on AI systems
- Erosion of human skills
6. **Socioeconomic & Environmental**
- Job displacement
- Environmental impact of AI computing
7. **AI System Safety, Failures & Limitations**
- Unpredictable emergent behaviors
- Reward hacking and specification gaming
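The last risk category above mentions reward hacking and specification gaming. As a minimal illustrative sketch (the actions and all scores below are invented), it shows how an optimizer that maximizes a measurable proxy can diverge from the true objective:

```python
# Toy illustration of specification gaming: optimizing a measurable proxy
# can pick behavior that scores poorly on the true objective.
# The actions and all numbers are invented for illustration.

actions = {
    # action: (proxy_reward, true_value)
    "clean the room":            (0.8, 0.9),
    "hide the mess in a closet": (1.0, 0.1),  # fools the "looks clean" sensor
    "do nothing":                (0.0, 0.0),
}

def best_action(score_index: int) -> str:
    """Return the action with the highest score at the given index."""
    return max(actions, key=lambda a: actions[a][score_index])

print("Optimizing the proxy picks: ", best_action(0))  # hide the mess in a closet
print("Optimizing true value picks:", best_action(1))  # clean the room
```

The same pattern scales up: the stronger the optimizer, the more aggressively it exploits any gap between the proxy and what we actually want.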
### Research Areas
- **Technical AI Safety**
- Reward modeling (see the sketch after this list)
- Inverse reinforcement learning
- Corrigibility
- Safe exploration
- Interruptibility
- Scalable oversight
- **AI Governance**
- Policy development
- Ethical frameworks
- International cooperation
- **Long-term AI Safety**
- Existential risk reduction
- AI takeoff scenarios
- Recursive self-improvement
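Here is the reward-modeling sketch referenced above: a Bradley-Terry style preference model, the core idea behind RLHF-style reward models. The features and preference pairs are synthetic and the weights are hand-rolled for illustration; only numpy is assumed.

```python
# Minimal sketch of reward modeling from pairwise preferences with a
# Bradley-Terry style model. Features and preference pairs are synthetic.
import numpy as np

rng = np.random.default_rng(0)
dim = 4
true_w = rng.normal(size=dim)          # hidden "human" preference direction

# Synthetic comparisons: the item with the higher true score is preferred.
pairs = []
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    pairs.append((a, b) if true_w @ a > true_w @ b else (b, a))

w = np.zeros(dim)                      # learned reward weights
lr = 0.5
for _ in range(300):
    grad = np.zeros(dim)
    for winner, loser in pairs:
        # P(winner preferred) = sigmoid(r(winner) - r(loser)); minimize -log P
        margin = w @ (winner - loser)
        grad += -(1.0 - 1.0 / (1.0 + np.exp(-margin))) * (winner - loser)
    w -= lr * grad / len(pairs)

agreement = np.mean([(w @ (a - b)) > 0 for a, b in pairs])
print(f"learned reward agrees with the preferences on {agreement:.0%} of pairs")
```

In a full RLHF pipeline the features would come from a neural network and the fitted reward would then be optimized against by a policy, which is exactly where the specification-gaming concerns above come back in.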
## Global Initiatives
### Government Institutes
- **U.S. AI Safety Institute (AISI)**
- Focus: Standards, testing, and evaluation
- Key areas: Synthetic content detection, vulnerability assessment
- **UK AI Safety Institute**
- Focus: Advanced AI system evaluation, global collaboration
- Key areas: Voluntary commitments, international partnerships
- **Canadian AI Safety Institute**
- Focus: Risk mitigation, international governance alignment
- Key areas: Commercialization, standards, research
- **Japanese AI Safety Institute**
- Focus: AI governance, international partnerships
- Key areas: Safety standards, cross-department research
### Non-Governmental Organizations
- **Center for AI Safety (CAIS)**
- Focus: Technical and conceptual AI safety research
- Key areas: Robustness, monitoring, alignment, systemic safety
- **Future of Humanity Institute (FHI)** (closed in 2024)
- Focus: Long-term impacts and existential risks of AI
- Key areas: AI governance, strategic foresight
- **Machine Intelligence Research Institute (MIRI)**
- Focus: Mathematical approaches to AI alignment
- Key areas: Decision theory, logical uncertainty
- **OpenAI**
- Focus: Beneficial AGI development
- Key areas: AI capabilities research, safety considerations
### International Collaborations
- **International Dialogues on AI Safety (IDAIS)**
- Focus: Global cooperation on frontier AI risks
- Key participants: Leading AI scientists from multiple countries
- **AI Safety Summit**
- Focus: International agreement on AI safety principles
- Key outcome: Bletchley Declaration
## Technical Challenges
- Reward function design
- Safe exploration in reinforcement learning
- Scalable oversight of complex AI systems
- Interpretability of deep learning models
- Robustness to distribution shift (see the sketch after this list)
- AI containment and boxed AI
- Value learning and preference inference
- Corrigibility and interruptibility
- AI deception detection
- Trojan detection in AI models
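Robustness to distribution shift, listed above, is often paired with a detector that flags inputs the model should not be trusted on. Below is a minimal sketch of one common baseline among many, a Mahalanobis-distance score over training features; the data is synthetic and numpy is the only dependency.

```python
# Minimal sketch of flagging inputs under distribution shift with a
# Mahalanobis-distance score over training features (one common baseline).
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))   # "in-distribution" features

mean = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def ood_score(x: np.ndarray) -> float:
    """Squared Mahalanobis distance from x to the training distribution."""
    d = x - mean
    return float(d @ cov_inv @ d)

# Flag anything scoring above the 99th percentile of the training scores.
threshold = np.quantile([ood_score(x) for x in train], 0.99)

in_dist = rng.normal(0.0, 1.0, size=3)
shifted = rng.normal(5.0, 1.0, size=3)                    # shifted distribution
print("in-distribution input flagged:", ood_score(in_dist) > threshold)
print("shifted input flagged:        ", ood_score(shifted) > threshold)
```

Deep-learning variants apply the same idea to hidden-layer features rather than raw inputs.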
## Ethical Considerations
- Moral status of AI systems
- AI rights and consciousness
- Anthropic bias in AI development
- Fairness and equity in AI deployment
- Transparency and explainability requirements
- Accountability for AI decisions
- Long-term impacts on human society and values
## Future Directions
- AGI and ASI safety considerations
- Whole brain emulation safety
- AI-assisted governance models
- Hybrid human-AI cognitive architectures
- Post-singularity scenarios and preparation
This map gives a broad overview of the AI safety landscape. The field evolves quickly, and new priorities will keep emerging as AI capabilities advance.
## The Gigantic Map of AI Safety
---
Artificial Intelligence (AI) safety is a multidisciplinary field concerned with ensuring that the development and deployment of AI systems are beneficial, ethical, and pose minimal risks to humanity. The field addresses both immediate concerns arising from current AI technologies and long-term existential risks associated with advanced AI. Below is an extensive map covering the vast landscape of AI safety.
---
### **1. Overview of AI Safety**
- **1.1 Definition**
- The study and implementation of measures to prevent AI systems from causing unintentional harm.
- Ensuring AI aligns with human values and ethical principles.
- **1.2 Importance**
- Mitigating risks from AI misuse or malfunction.
- Preserving societal norms and preventing existential threats.
- **1.3 Types of AI Safety Concerns**
- **1.3.1 Near-Term Concerns**
- Bias and fairness.
- Privacy and surveillance.
- Automation and job displacement.
- **1.3.2 Long-Term Concerns**
- Superintelligent AI misalignment.
- Existential risks.
- Loss of human control.
---
### **2. AI Risks and Concerns**
- **2.1 Misalignment**
- AI systems pursuing goals misaligned with human values.
- Examples: Paperclip maximizer thought experiment.
- **2.2 Unintended Consequences**
- AI achieving objectives in harmful ways.
- Reward hacking and specification gaming.
- **2.3 AI Control Problem**
- Challenges in controlling advanced AI.
- Ensuring AI remains under human oversight.
- **2.4 Existential Risks**
- Scenarios where AI leads to human extinction or irreversible catastrophe.
- Discussions on artificial general intelligence (AGI) safety.
- **2.5 AI Accidents**
- Unforeseen failures leading to significant harm.
- Safety-critical systems malfunctioning.
- **2.6 Ethical Considerations**
- Moral implications of AI decisions.
- AI in warfare and autonomous weapons.
- **2.7 Economic Impacts**
- Job displacement and unemployment.
- Economic inequality exacerbated by AI.
- **2.8 Social Impacts**
- AI affecting social structures and interactions.
- Cultural shifts due to AI integration.
---
### **3. Technical AI Safety Research Areas**
- **3.1 Value Alignment**
- Aligning AI goals with human values.
- **3.1.1 Inverse Reinforcement Learning (IRL)**
- AI inferring human preferences by observing behavior.
- **3.1.2 Cooperative Inverse Reinforcement Learning (CIRL)**
- AI and humans collaboratively learning values.
- **3.1.3 Preference Learning**
- Techniques for AI to learn human preferences through feedback.
- **3.2 Robustness**
- Ensuring AI performs reliably under varied conditions.
- **3.2.1 Adversarial Examples**
- Inputs designed to deceive AI models (see the FGSM sketch at the end of this section).
- **3.2.2 Distributional Shift**
- AI coping with changes in input data distribution.
- **3.2.3 Out-of-Distribution Detection**
- AI recognizing unfamiliar inputs.
- **3.3 Interpretability and Explainability**
- Making AI decisions understandable to humans.
- **3.3.1 Transparency**
- Open models and decision processes.
- **3.3.2 Model Interpretability**
- Techniques like saliency maps, SHAP values.
- **3.4 Verification**
- Formally proving AI system properties.
- **3.4.1 Formal Methods**
- Mathematical proofs of correctness.
- **3.4.2 Testing and Validation**
- Rigorous testing procedures for AI models.
- **3.5 Scalable Oversight**
- Managing AI systems beyond human capacity.
- **3.5.1 AI Debate**
- AIs debating to reveal truth.
- **3.5.2 Amplification**
- Using AI to assist humans in supervising AI.
- **3.5.3 Recursive Reward Modeling**
- Building complex reward models iteratively.
- **3.6 AI Governance and Control**
- Mechanisms to ensure AI compliance.
- **3.6.1 Safe Interruptibility**
- Allowing humans to interrupt AI safely.
- **3.6.2 Reward Hacking**
- Preventing AI from exploiting reward functions.
- **3.6.3 Corrigibility**
- Designing AI that accepts correction.
- **3.7 Multi-Agent Safety**
- Safety in systems with multiple AI agents.
- **3.7.1 Cooperation**
- Ensuring collaborative behavior.
- **3.7.2 Competition**
- Managing competitive dynamics.
- **3.7.3 Collective Decision-Making**
- Aggregating preferences in groups.
- **3.8 Machine Ethics**
- Embedding ethical reasoning in AI.
- **3.8.1 Ethical Frameworks**
- Utilitarianism, deontology in AI decisions.
- **3.8.2 Moral Decision-Making**
- AI making choices reflecting ethical standards.
- **3.9 Uncertainty and Robustness**
- Handling uncertainty in AI models.
- **3.9.1 Probabilistic Methods**
- Bayesian approaches for uncertainty quantification.
- **3.9.2 Distributional Reinforcement Learning**
- Modeling distributions over rewards.
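To make 3.2.1 (adversarial examples) concrete, here is the sketch referenced above: the fast gradient sign method (FGSM) applied to a fixed logistic-regression model. The weights, input, and label are invented for illustration.

```python
# Minimal sketch of the fast gradient sign method (FGSM) against a fixed
# logistic-regression model. Weights, input, and label are invented.
import numpy as np

w = np.array([1.0, -2.0, 0.5])     # fixed model weights (illustrative)
x = np.array([0.2, -0.4, 0.3])     # an input the model classifies correctly
y = 1                              # its true label

def predict_prob(v):
    """P(class 1 | v) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(w @ v)))

# Gradient of the cross-entropy loss with respect to the *input*:
# d/dx [-y log p - (1-y) log(1-p)] = (p - y) * w
grad_x = (predict_prob(x) - y) * w

eps = 0.5                          # L-infinity perturbation budget
x_adv = x + eps * np.sign(grad_x)  # step in the direction that increases the loss

print("clean prediction:      ", predict_prob(x))      # ~0.76 -> class 1
print("adversarial prediction:", predict_prob(x_adv))  # ~0.35 -> flipped to class 0
```

Adversarial training, one common defense, folds examples like `x_adv` back into the training set.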
---
### **4. AI Safety Organizations and Initiatives**
- **4.1 Research Institutions**
- **4.1.1 OpenAI**
- Conducting research on safe AGI.
- **4.1.2 DeepMind**
- DeepMind Safety team focusing on technical AI safety.
- **4.1.3 Machine Intelligence Research Institute (MIRI)**
- Focused on foundational AI safety research.
- **4.1.4 Future of Humanity Institute (FHI)** (closed in 2024)
- Interdisciplinary research on existential risks.
- **4.1.5 Centre for the Study of Existential Risk (CSER)**
- Studying global catastrophic risks from technology.
- **4.1.6 Alignment Research Center (ARC)**
- Research on alignment strategies.
- **4.2 Collaborative Initiatives**
- **4.2.1 Partnership on AI**
- Industry and academia collaboration on AI best practices.
- **4.2.2 AI Safety Conferences and Workshops**
- Platforms like ICML Safety Workshop, NeurIPS Safe ML.
- **4.3 University Programs**
- AI safety research groups at MIT, Stanford, Berkeley.
---
### **5. Policy and Governance**
- **5.1 Regulation**
- Government policies on AI development and use.
- **5.1.1 National AI Strategies**
- Frameworks set by countries like the USA, EU, China.
- **5.2 International Cooperation**
- Global agreements on AI safety norms.
- **5.2.1 OECD AI Principles**
- Guidelines for trustworthy AI.
- **5.3 Standards and Guidelines**
- **5.3.1 ISO/IEC Standards**
- International standards for AI safety.
- **5.3.2 IEEE Ethically Aligned Design**
- Ethical standards for AI systems.
- **5.4 Policy Research**
- Think tanks analyzing AI policy impacts.
- **5.4.1 AI Now Institute**
- Policy and social implications of AI.
- **5.5 Ethics Boards and Committees**
- Organizational bodies overseeing AI ethics.
---
### **6. Education and Outreach**
- **6.1 AI Safety Resources**
- **6.1.1 Online Courses**
- "AI Safety Fundamentals" courses.
- **6.1.2 Reading Lists**
- Curated lists like 80,000 Hours AI Safety syllabus.
- **6.2 Publications and Papers**
- Journals dedicated to AI ethics and safety.
- Seminal papers like "Concrete Problems in AI Safety."
- **6.3 Online Communities**
- **6.3.1 AI Alignment Forum**
- Discussions on technical AI alignment.
- **6.3.2 Effective Altruism Community**
- Focused on reducing existential risks.
---
### **7. AI Safety Challenges**
- **7.1 Forecasting AI Development**
- Predicting timelines for AGI.
- Expert elicitation studies.
- **7.2 Coordination Problems**
- Aligning stakeholders with diverse interests.
- Avoiding arms race dynamics.
- **7.3 Resource Allocation**
- Funding and prioritizing AI safety research.
- **7.4 Ensuring Inclusivity**
- Incorporating diverse perspectives in AI design.
---
### **8. Case Studies and Examples**
- **8.1 AI Accidents**
- **8.1.1 Autonomous Vehicle Crashes**
- Examining failures in self-driving cars.
- **8.1.2 AI in Healthcare Misdiagnosis**
- Risks from erroneous AI medical advice.
- **8.2 Misalignment Examples**
- **8.2.1 Tay Chatbot Incident**
- AI adopting inappropriate language from users.
- **8.3 Historical Analogies**
- Lessons from nuclear safety and biotechnology.
---
### **9. Philosophical Foundations**
- **9.1 Alignment Problem**
- Theoretical underpinnings of aligning AI with human values.
- **9.2 Instrumental Convergence**
- Tendency of agents to adopt similar strategies.
- **9.3 Orthogonality Thesis**
- Intelligence and goals can be independent.
- **9.4 Intelligence Explosion**
- Concept of recursive self-improvement leading to superintelligence.
- **9.5 Value Loading Problem**
- Challenges in specifying values to AI.
---
### **10. Critiques and Debates**
- **10.1 Skepticism About AI Risks**
- Arguments downplaying existential risks.
- Emphasis on immediate ethical issues.
- **10.2 Debates on Timelines**
- Diverging views on when AGI might emerge.
- **10.3 Disagreements Within AI Safety Community**
- Varied approaches to alignment.
- Strategy differences between technical and policy solutions.
---
### **11. AI Safety Tools and Methods**
- **11.1 AI Safety Toolkits**
- **11.1.1 Adversarial Robustness Toolbox (ART)**
- Tools for testing model robustness.
- **11.2 Simulation Environments**
- **11.2.1 OpenAI Gym**
- Environments for safe reinforcement learning (a toy cost-budget wrapper is sketched at the end of this section).
- **11.3 Benchmark Datasets**
- Datasets designed to test safety aspects.
- **11.4 Formal Verification Tools**
- Tools like TLA+ for system verification.
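As referenced under 11.2.1, here is a toy version of the cost-budget pattern used in constrained or safe RL benchmarks (a cost signal plus a budget, in the spirit of environments like Safety Gym). `ToyEnv`, `CostBudgetWrapper`, and the gym-style reset/step interface are all hand-rolled for illustration, not taken from any particular library.

```python
# Toy version of the cost-budget pattern from constrained / safe RL:
# a wrapper ends the episode once accumulated safety cost exceeds a budget.
import random

class ToyEnv:
    """A 1-D walk; positions to the right of 5 incur a safety cost."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                    # action: -1 (left) or +1 (right)
        self.pos += action
        reward = float(self.pos)               # reward encourages going right...
        cost = 1.0 if self.pos > 5 else 0.0    # ...but far right is "unsafe"
        done = abs(self.pos) > 10
        return self.pos, reward, done, {"cost": cost}

class CostBudgetWrapper:
    """End the episode when the accumulated cost exceeds the budget."""
    def __init__(self, env, budget):
        self.env, self.budget = env, budget

    def reset(self):
        self.total_cost = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.total_cost += info.get("cost", 0.0)
        if self.total_cost > self.budget:
            done = True                        # cut exploration off: budget spent
        return obs, reward, done, info

env = CostBudgetWrapper(ToyEnv(), budget=3.0)
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(random.choice([-1, 1]))
print("episode ended at position", obs, "with total cost", env.total_cost)
```

A real safe-RL setup would learn a policy that respects the budget rather than just truncating episodes, but the wrapper shows where the cost signal plugs into the standard loop.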
---
### **12. Future Directions**
- **12.1 Open Problems**
- Unsolved challenges in alignment and control.
- **12.2 Emerging Research Areas**
- **12.2.1 Meta-Learning Safety**
- Safe generalization in learning to learn.
- **12.2.2 Neuro-symbolic Methods**
- Combining neural networks with symbolic reasoning.
- **12.3 Interdisciplinary Approaches**
- Collaboration between AI, psychology, sociology.
- **12.4 AI for AI Safety**
- Using AI to enhance its own safety measures.
---
### **13. AI Safety in Specific Domains**
- **13.1 Autonomous Vehicles**
- Safety protocols for self-driving cars.
- Regulatory frameworks.
- **13.2 Healthcare**
- Ensuring AI diagnostic tools are safe and reliable.
- **13.3 Finance**
- Preventing AI-induced market instabilities.
- **13.4 Military Applications**
- Ethical considerations of AI in warfare.
- Autonomous weapons debates.
---
### **14. Ethical Frameworks and Principles**
- **14.1 Asilomar AI Principles**
- Guidelines for beneficial AI development.
- **14.2 Human Rights-Based Approaches**
- Aligning AI with human rights standards.
- **14.3 Fairness, Accountability, and Transparency (FAT)**
- Ensuring AI decisions are fair and explainable.
---
### **15. AI Safety Regulations Around the World**
- **15.1 European Union**
- **15.1.1 General Data Protection Regulation (GDPR)**
- Data protection and privacy laws affecting AI.
- **15.2 United States**
- AI initiatives and regulatory discussions.
- **15.3 China**
- National strategies on AI ethics and safety.
- **15.4 Global Initiatives**
- **15.4.1 G20 AI Principles**
- International cooperation on AI norms.
---
### **16. Public Perception and Media**
- **16.1 AI Safety in Popular Culture**
- Depictions in movies and literature.
- **16.2 Media Influence on AI Safety Awareness**
- Role of journalism in shaping public understanding.
- **16.3 Misconceptions**
- Common misunderstandings about AI risks.
---
### **17. AI Safety Economics**
- **17.1 Cost-Benefit Analysis**
- Economic implications of investing in AI safety.
- **17.2 Incentive Structures**
- Aligning economic incentives with safety goals.
---
### **18. Collaboration Between AI and Other Fields**
- **18.1 Neuroscience**
- Insights from the brain for AI safety.
- **18.2 Behavioral Economics**
- Human behavior modeling for AI alignment.
- **18.3 Law**
- Legal frameworks for AI accountability.
---
### **19. Notable Figures in AI Safety**
- **19.1 Researchers and Thought Leaders**
- Nick Bostrom, Stuart Russell, Eliezer Yudkowsky.
- **19.2 Advocates and Critics**
- Contributions and perspectives shaping the field.
---
### **20. AI Safety in Practice**
- **20.1 Industry Best Practices**
- Guidelines adopted by tech companies.
- **20.2 Real-World Implementations**
- Case studies of AI safety measures in products.
---
### **21. AI Safety and Environmental Sustainability**
- **21.1 Energy Consumption**
- Addressing AI's environmental impact.
- **21.2 AI for Environmental Protection**
- Leveraging AI to tackle environmental challenges safely.
---
### **22. Cross-Cultural Perspectives on AI Safety**
- **22.1 Cultural Values**
- How different cultures influence AI safety priorities.
- **22.2 Global Collaboration**
- Bridging cultural gaps in international AI safety efforts.
---
### **23. AI Safety and Human-AI Interaction**
- **23.1 User Interface Design**
- Designing interfaces that promote safe AI use.
- **23.2 Trust in AI Systems**
- Building and maintaining user trust.
---
### **24. Psychological Aspects of AI Safety**
- **24.1 Human Factors Engineering**
- Understanding human interaction with AI.
- **24.2 Cognitive Biases**
- Mitigating biases in AI training data.
---
### **25. AI Safety and Security**
- **25.1 Cybersecurity**
- Protecting AI systems from malicious attacks.
- **25.2 Data Integrity**
- Ensuring training data is secure and reliable.
- **25.3 AI in Security Applications**
- Safely deploying AI in security contexts.
---
**Conclusion**
AI safety is a vast and multifaceted field, encompassing technical challenges, ethical considerations, policy development, and societal impacts. It requires collaboration across disciplines and international borders to ensure that AI technologies are developed and used in ways that are safe, beneficial, and aligned with human values. This map serves as a guide to the extensive landscape of AI safety, highlighting key areas of concern, research, and action.