As AI systems become increasingly sophisticated, the need for robust safety measures becomes paramount. This work explores the critical strategies employed to govern the risks associated with advanced AI development. It delves into a multi-faceted system designed to carefully evaluate, monitor, and mitigate potential hazards, ensuring these powerful technologies are deployed responsibly. Understanding these risk governance mechanisms is essential for navigating the complex landscape of modern AI and promoting its safe and beneficial integration into society.
What measures are employed by Anthropic for risk governance in AI development and deployment
Anthropic’s risk governance strategy centers around a tiered system called AI Safety Level (ASL) Standards. These standards are pivotal in evaluating and mitigating the risks associated with increasingly capable AI models. The approach involves a combination of technical, operational, and policy measures to ensure responsible AI development and deployment.
Core Components of Anthropic’s AI Risk Governance
- AI Safety Level Standards (ASL Standards): These standards are categorized into Deployment and Security Standards. Deployment Standards focus on safe usage by internal and external users, while Security Standards aim to protect AI models from unauthorized access or theft. All current models must meet at least ASL-2.
- Capability Thresholds: These are predefined levels of AI capability that, when reached, trigger the need for higher ASL standards. They signify a meaningful increase in risk requiring upgraded safeguards. Specific Capability Thresholds include concerns related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons, and Autonomous AI Research and Development (AI R&D).
- Required Safeguards: These represent the specific safety and security measures required for each Capability Threshold to mitigate risks to acceptable levels. They act as the practical implementation of the ASL Standards.
- Capability Assessment: Involves preliminary and comprehensive testing to determine whether a model’s capabilities surpass established Capability Thresholds. If thresholds are surpassed, models are upgraded to ASL-3 Required Safeguards.
- Safeguards Assessment: Evaluates whether the implemented measures satisfy the ASL-3 Required Safeguards. This includes red-teaming, threat modeling, and establishing robust security frameworks.
- Follow-up Capability Assessment: Conducted in conjunction with upgrading a model to the ASL-3 Required Safeguards, to confirm that still-higher (ASL-4) safeguards are not needed.
Practical Tools and Processes
To determine that the ASL-2 Standard remains appropriate, Anthropic routinely conducts checks on new and existing models, starting with a preliminary assessment. Key aspects of that process include:
- Measuring performance on automated tests in risk-relevant domains.
- Tracking cumulative finetuning and other capability elicitation since the last comprehensive assessment.
If these checks pass, no further testing is necessary. When a more comprehensive test cycle is warranted, the company engages in a more extensive assessment to ensure risks remain below threshold.
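To make the shape of this routine check concrete, here is a minimal Python sketch of a preliminary assessment gate. It is illustrative only, not Anthropic's actual tooling; the field names and the trigger values (a 4x gain in Effective Compute or roughly six months of accumulated finetuning, figures quoted later in this document) are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class ModelSnapshot:
    """State tracked for a model since its last comprehensive assessment."""
    effective_compute_gain: float   # performance gain on automated tests, in Effective Compute terms
    months_of_finetuning: float     # cumulative finetuning and elicitation improvements

# Illustrative trigger values, mirroring the criteria quoted elsewhere in this document.
EFFECTIVE_COMPUTE_TRIGGER = 4.0
FINETUNING_MONTHS_TRIGGER = 6.0

def needs_comprehensive_assessment(snapshot: ModelSnapshot, rso_override: bool = False) -> bool:
    """Return True when the preliminary check indicates a comprehensive assessment is warranted."""
    if rso_override:  # the Responsible Scaling Officer may require one at their discretion
        return True
    return (snapshot.effective_compute_gain >= EFFECTIVE_COMPUTE_TRIGGER
            or snapshot.months_of_finetuning >= FINETUNING_MONTHS_TRIGGER)
```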
Governance and Transparency
Anthropic’s commitment to responsible AI extends to internal processes and external transparency. Key measures include:
- Responsible Scaling Officer: A designated staff member responsible for ensuring the effective design and implementation of the Responsible Scaling Policy.
- Internal Safety Procedures: Developing procedures for incident scenarios, like pausing training or responding to security breaches.
- Transparency: Publicly releasing key information related to model evaluation and deployment, including summaries of Capability and Safeguards Reports.
- Expert Input: Actively soliciting input from external experts in relevant domains to inform capability and safeguard assessments.
- Board Oversight: Changes to their Responsible Scaling Policy are proposed by the CEO and the Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust.
Regulatory and Policy Implications
Anthropic intends for its Responsible Scaling Policy to inform industry best practices and potentially serve as a prototype for future AI regulations. The policy is designed to be proportional, iterative, and exportable, balancing innovation with stringent safety measures.
What safety standards are implemented for training and deploying Anthropic’s AI models
Anthropic employs a risk-based approach to AI safety, using AI Safety Level (ASL) Standards. These standards consist of technical and operational measures designed to ensure the safe training and deployment of frontier AI models.
ASL Standards: Deployment and Security
Currently, ASL definitions are divided into two categories:
- Deployment Standards: These standards include measures taken to ensure AI models are used safely by aligning technical, operational, and policy controls to mitigate potential catastrophic misuse from both external users (i.e., Anthropic’s users and customers) as well as internal users (i.e., Anthropic employees).
- Security Standards: These standards include technical, operational, and policy measures to protect AI models against unauthorized access, theft, or compromise of internal systems by malicious actors.
All Anthropic models must meet ASL-2 Deployment and Security Standards, which include:
- Publishing Model Cards that describe the model’s capabilities, limitations, evaluations, and intended use cases.
- Enforcing a Usage Policy that restricts catastrophic and high-harm use cases, like generating content that poses severe risks to humankind or causes direct harm to individuals.
- Using harmlessness training, such as Constitutional AI, and automated detection mechanisms to train models to refuse requests that aid in causing harm.
- Providing users with vulnerability reporting channels and a bug bounty for universal jailbreaks.
- Adhering to robust vendor and supplier security reviews, physical security measures, and secure-by-design principles, and implementing standard security infrastructure, monitoring software, access management tools, and disk encryption.
Triggering Higher Standards: Capability Thresholds and Required Safeguards
As AI model capabilities increase, Anthropic uses a system of Capability Thresholds and Required Safeguards to determine when safety measures must be strengthened. A Capability Threshold indicates when an upgrade in protections is needed, triggering a shift from an ASL-N Standard to an ASL-N+1 Standard, or even higher. The Required Safeguards then specify which ASL standards must be met. The specific needs of different AI models will vary, so it is not always necessary to upgrade both Deployment and Security Standards simultaneously.
Assessing Model Capabilities
Anthropic conducts rigorous assessments to determine if a model’s capabilities surpass established Capability Thresholds. This involves:
- Preliminary Assessments: These assessments determine whether a more comprehensive evaluation is needed, comparing models on automated tests in risk-relevant domains and on the cumulative impact of fine-tuning methods.
- Comprehensive Testing: If preliminary assessments indicate the model is approaching a red line, this testing will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements. This testing must satisfy criteria such as Threat Model Mapping, performing Empirical Evaluations, demonstrating limited Elicitation results, and Forecasting.
If comprehensive testing reveals that a model is likely to surpass a Capability Threshold, Anthropic will act as if the threshold has been surpassed, upgrading to ASL-3 Required Safeguards and conducting a follow-up to assess the need for ASL-4 Standards.
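The escalation rule in the paragraph above can be summarized as a small decision function. The sketch below is a hedged restatement under the assumption that safety levels can be modeled as an ordered enum; it is not an official encoding of the policy.

```python
from enum import IntEnum

class ASL(IntEnum):
    ASL_2 = 2
    ASL_3 = 3
    ASL_4 = 4

def required_standard(established_below_threshold: bool, current: ASL = ASL.ASL_2) -> ASL:
    """If testing cannot establish that the model stays below the Capability Threshold,
    treat the threshold as surpassed and move to the next AI Safety Level.
    A follow-up capability assessment then checks whether a still-higher
    standard (e.g. ASL-4) is needed."""
    if established_below_threshold:
        return current  # existing safeguards remain adequate
    return ASL(min(current + 1, ASL.ASL_4))
```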
Ensuring Adequate Safeguards: ASL-3 Requirements
To meet the ASL-3 Required Safeguards, Anthropic will conduct a safeguards assessment to:
- Evaluate whether implemented measures are robust against persistent attempts to misuse dangerous capabilities, through Threat Modeling, Defense-in-Depth, Red-Teaming, Rapid Remediation, and Monitoring.
- Ensure the models are highly protected against most attackers attempting to steal model weights, through Governance Best Practices, application of Security Frameworks, Audits, and confirmation that the measures can be maintained within Third-Party Environments.
If the ASL-3 Required Safeguards cannot be implemented immediately, interim risk mitigation measures will be enforced.
Governance and Transparency
To facilitate the effective implementation of this policy across the company, Anthropic has committed to several internal governance measures:
- Maintaining the position of Responsible Scaling Officer, to oversee effective policy design and implementation.
- Establishing processes to receive anonymous notifications through which Anthropic staff may notify the Responsible Scaling Officer of potential instances of noncompliance.
- Developing internal safety procedures for incident scenarios.
To advance the public dialogue on the regulation of frontier AI model risks and to enable examination of Anthropic’s actions, the company will commit to:
- Publicly release key materials, with sensitive information removed, related to the evaluation and deployment of AI models.
- Solicit input from external experts in relevant domains.
How are Capability Thresholds utilized to determine the need for enhanced safeguards
At the heart of AI risk governance lies the concept of “Capability Thresholds.” These thresholds act as crucial triggers, signaling when an AI model’s abilities have reached a point where the existing safeguards are no longer sufficient and need to be upgraded. Think of it as a redline on an engine – once crossed, more robust protection measures are essential.
Specifically, a Capability Threshold indicates:
- A meaningful increase in the level of risk if the model remains under the current safeguards.
- A corresponding need to upgrade safeguards to a higher AI Safety Level (ASL) Standard.
In practice, exceeding a Capability Threshold triggers a shift from an ASL-N Standard to an ASL-N+1 Standard (or in some cases, directly to an even higher level). The appropriate response (such as implementing ASL-3 safeguards) is determined by:
- Capabilities Assessment: Models are routinely tested to determine how close they are to exceeding pre-defined capability thresholds. Preliminary assessments determine when more comprehensive testing is needed.
- Threat model mapping: Mapping the most likely and consequential threat models.
- Elicitation and Evaluation: Demonstration that, when given enough resources to extrapolate to realistic attackers, researchers cannot elicit useful results from the model on the relevant tasks.
- Corresponding Required Safeguards: Depending on the threshold exceeded, specific Required Safeguards outline which ASL Standards must be met. It might not always be necessary to upgrade both Deployment and Security Standards.
The identification and application of these Capability Thresholds are a cornerstone of proportional risk management in AI development. By closely monitoring model capabilities and responding with appropriate enhanced safeguards, developers aim to balance innovation with safety, ensuring AI benefits society without causing unacceptable harm.
What are the designated Required Safeguards associated with various Capability Thresholds
As AI models evolve beyond the ASL-2 baseline, the industry recognizes that higher capability thresholds necessitate stronger safeguards. A Capability Threshold acts as a trigger, signaling a need to upgrade protections to a higher AI Safety Level (ASL) Standard.
Key Takeaways:
- The Required Safeguards are designed to mitigate risks down to acceptable levels, reflecting a balance between the benefits of frontier model development and its potential hazards.
- Anthropic advocates for comprehensive adoption of these standards across the industry, emphasizing the importance of investing in research and development for robust implementation.
CBRN Weapons
Chemical, Biological, Radiological, and Nuclear (CBRN) weapons development capabilities trigger different safeguards:
- CBRN-3: The ability to significantly aid individuals with basic STEM backgrounds in creating/obtaining and deploying CBRN weapons. This trigger requires both an ASL-3 Deployment Standard and an ASL-3 Security Standard, protecting against misuse and model-weight theft by non-state actors.
- CBRN-4: The ability to substantially uplift CBRN development capabilities of moderately resourced state programs. This threshold is expected to require ASL-4 Deployment and Security Standards; further details are to be released in a future update.
Autonomous AI Research and Development
Autonomous AI Research and Development capabilities also trigger specific safeguards:
- AI R&D-4: The ability to fully automate the work of an entry-level, remote-only Researcher at Anthropic. Requires the ASL-3 Security Standard and an affirmative case identifying and mitigating immediate risks from models pursuing misaligned goals.
- AI R&D-5: The ability to cause a dramatic acceleration in the rate of effective scaling. This demands, at minimum, the ASL-4 Security Standard (potentially higher), alongside an affirmative case.
A potential Model Autonomy checkpoint, the ability to autonomously perform a wide range of 2-8 hour software engineering tasks, would trigger implementation of the ASL-3 Security Standard.
Cyber Operations Considerations:
The ability to significantly enhance or automate sophisticated destructive cyberattacks is recognized as a capability requiring significant investigation. This involves engaging with cybersecurity experts to assess frontier models’ potential to both enhance and mitigate cyber threats and considering tiered access controls or phased deployments.
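The thresholds and safeguards above can be collected into a simple lookup table. The mapping below is a plain restatement of the policy text for illustration, not an official schema; the tuple ordering and the `None` placeholders are assumptions.

```python
# (deployment standard, security standard, additional requirement) per Capability Threshold.
REQUIRED_SAFEGUARDS = {
    "CBRN-3":   ("ASL-3", "ASL-3", None),
    "CBRN-4":   ("ASL-4 (expected)", "ASL-4 (expected)", None),
    "AI R&D-4": (None, "ASL-3", "affirmative case on risks from misaligned goals"),
    "AI R&D-5": (None, "ASL-4 (at minimum)", "affirmative case on risks from misaligned goals"),
}

def safeguards_for(threshold: str) -> tuple:
    """Look up the designated Required Safeguards for a crossed Capability Threshold."""
    return REQUIRED_SAFEGUARDS[threshold]
```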
How does Anthropic assess the capabilities of its AI models
Anthropic employs a multi-layered approach to assess the capabilities of its AI models, ensuring responsible scaling that aligns with potential risks. Their framework centers around AI Safety Level Standards (ASL Standards), where increasing model capabilities necessitate stronger safeguards. The company routinely tests models to confirm that they fall far enough below certain capability thresholds that the ASL-2 standard remains adequate.
Capability Thresholds
Capability Thresholds act as triggers, indicating when existing safeguards must be upgraded to a higher ASL Standard. These thresholds are predetermined capability levels that, if reached, signify a significant increase in risk, necessitating enhanced protections. Anthropic’s approach specifically targets catastrophic risks such as those related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons and Autonomous AI Research and Development (AI R&D).
Assessment Procedures
Anthropic utilizes a staged assessment process:
- Preliminary Assessments: These initial evaluations determine whether a model requires a more in-depth analysis (“comprehensive assessment”). A comprehensive assessment is triggered if the model is “notably more capable,” defined as either (1) being notably more performant on automated tests in risk-relevant domains (4x or more in Effective Compute) or (2) having accumulated six months’ worth of finetuning and other capability elicitation methods since the last comprehensive assessment. In addition, the Responsible Scaling Officer may, at their discretion, determine that a comprehensive assessment is warranted.
- Comprehensive Assessments: For models exceeding the preliminary assessment criteria, a comprehensive evaluation assesses the likelihood of reaching Capability Thresholds (a structured sketch of these elements follows this list). This process involves:
  - Threat model mapping: Making a compelling case that the most likely and consequential threat models have been mapped out.
  - Evaluations: Designing and running empirical tests that provide strong evidence that the model does not have the requisite skills.
  - Elicitation: Demonstrating that, when given enough resources to extrapolate to realistic attackers, researchers cannot elicit sufficiently useful results from the model on the relevant tasks.
  - Forecasting: Making informal forecasts about the likelihood that further training and elicitation will improve test results between the time of testing and the next expected round of comprehensive testing.
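As referenced above, the four elements of a comprehensive assessment can be represented as a simple checklist. The dataclass below is a sketch under the assumption that each element resolves to a pass/fail judgment; the field names are illustrative, not an official schema.

```python
from dataclasses import dataclass

@dataclass
class ComprehensiveAssessment:
    """The four lines of evidence a comprehensive assessment is expected to produce."""
    threat_models_mapped: bool                 # most likely and consequential threat models identified
    evaluations_show_no_requisite_skill: bool  # empirical tests give strong evidence the skills are absent
    elicitation_unsuccessful: bool             # realistic-attacker-level elicitation fails on relevant tasks
    forecast_stays_below_threshold: bool       # forecasts suggest the threshold won't be reached before the next test cycle

    def sufficiently_below_threshold(self) -> bool:
        """All four elements must hold to conclude the model remains below the Capability Threshold."""
        return all([self.threat_models_mapped,
                    self.evaluations_show_no_requisite_skill,
                    self.elicitation_unsuccessful,
                    self.forecast_stays_below_threshold])
```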
Decision-Making
Following the comprehensive assessment, a Capability Report documents the findings and makes a case for whether the model remains sufficiently below the Capability Thresholds. The report is escalated to the CEO and the Responsible Scaling Officer, who (1) make the ultimate determination as to whether the model is unlikely to reach the Capability Threshold and (2) decide any deployment-related issues. As noted in Sections 7.1.4 and 7.2.2 of the policy, Anthropic will solicit both internal and external expert feedback on the report, as well as on the CEO and RSO’s conclusions, to inform future refinements to its methodology.
If the model is deemed to have surpassed a threshold, Anthropic upgrades to ASL-3 Required Safeguards and conducts a follow-up capability assessment to ensure ASL-4 Standards are not necessary.
Transparency and Governance
Anthropic emphasizes transparency by publicly releasing key materials related to the evaluation and deployment of its models, after removing sensitive information. The company also commits to internal governance measures, including maintaining a Responsible Scaling Officer, establishing anonymous reporting channels for potential noncompliance, and developing internal safety procedures for incident response.
What processes are integral to evaluating the effectiveness of implemented safeguards
Evaluating the effectiveness of implemented safeguards is a multi-faceted process involving rigorous assessment and continuous improvement. Here’s the breakdown:
Safeguards Assessment
This assessment is key to determining if the implemented measures meet the ASL-3 Required Safeguards. A Safeguards Report thoroughly documents the implementation of these safeguards.
- ASL-3 Deployment Standard Evaluation: Assesses robustness against persistent misuse attempts. This involves:
  - Threat Modeling: Critically mapping potential catastrophic misuse vectors.
  - Defense in Depth: Building layered defenses to catch misuse, including harm refusal techniques.
  - Red-Teaming: Demonstrating that realistic threat actors cannot consistently elicit responses that increase their ability to cause catastrophic harm.
  - Rapid Remediation: Quickly identifying and fixing system compromises.
  - Monitoring: Continuously reviewing system performance against accepted risk ranges, including monitoring responses to jailbreak bounties, historical analysis, and background monitoring.
  - Trusted Users: Establishing criteria for sharing models with reduced safeguards with trusted users, using a combination of user vetting, secure access controls, monitoring, log retention, and incident response protocols.
  - Third-Party Environments: Documenting how all models will meet these criteria, even in third-party deployments with differing safeguards.
- ASL-3 Security Standard Evaluation: Determines whether the measures are highly protected against model weight theft. This involves:
  - Threat Modeling: Using frameworks like MITRE ATT&CK to map threats, assets, and attack vectors.
  - Security Frameworks Alignment: Using industry-standard security frameworks to address identified risks.
  - Building strong perimeters and access controls to protect against unauthorized access. This includes a combination of physical security, encryption, cloud security, infrastructure policy, access management, and weight access minimization and monitoring.
  - Securing links in the chain of systems and software used to develop models, to prevent compromised components from being introduced and to ensure only trusted code and hardware is used. This includes a combination of software inventory, supply chain security, artifact integrity, binary authorization, hardware procurement, and a secure research development lifecycle.
  - Proactively identifying and mitigating threats through ongoing and effective monitoring, testing for vulnerabilities, and laying traps for potential attackers. This includes a combination of endpoint patching, product security testing, log management, asset monitoring, and intruder deception techniques.
  - Investing sufficient resources in security: meeting this standard requires roughly 5-10% of employees to be dedicated to security and security-adjacent work (a quick check of this ratio is sketched after this list).
  - Aligning, where appropriate, with existing guidance on securing model weights.
  - Audits: Auditing the security program’s design and implementation, and sharing findings with management. This includes independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; and periodic, broadly scoped, and independent testing by expert red-teamers who are industry-renowned and have been recognized in competitive challenges.
  - Third-Party Environments: Documenting how all models will meet these criteria, even in third-party deployments with differing safeguards.
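One quantitative piece of the Security Standard lends itself to a quick check: the resourcing guideline that roughly 5-10% of employees work on security or security-adjacent tasks. The snippet below is only a worked example of that arithmetic; the bounds come from the policy text, and the staffing numbers in the example are hypothetical.

```python
def security_staffing_in_range(security_staff: int, total_staff: int,
                               low: float = 0.05, high: float = 0.10) -> bool:
    """Check whether security and security-adjacent staffing falls in the quoted 5-10% range."""
    fraction = security_staff / total_staff
    return low <= fraction <= high

# Example: 60 of 800 employees on security work is 7.5%, inside the quoted range.
assert security_staffing_in_range(60, 800)
```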
Safeguards Decision Process
The process for determining whether the ASL-3 Required Safeguards have been met includes:
- Compiling a Safeguards Report for each Required Safeguard that makes an affirmative case for satisfying it.
- Escalating the Safeguards Report(s) to the CEO and the Responsible Scaling Officer.
- Soliciting both internal and external expert feedback on the report.
- Sharing the decision, the Safeguards Report, and the internal and external feedback with the Board of Directors and the Long-Term Benefit Trust.
- Revisiting and re-approving ASL-3 Required Safeguards at least annually.
Follow-up Capability Assessment
In parallel with upgrading a model to the Required Safeguards, the policy is updated to include any additional Capability Thresholds, and a follow-up capability assessment is conducted to confirm that still-higher safeguards are not required.
Governance and Transparency
Internal governance involves the Responsible Scaling Officer who is responsible for the policy’s design and implementation. External evaluation involves collaborating with external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. The company also seeks procedural compliance review and plans to notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard.
What is the procedure for determining appropriate deployment strategies and scaling outcomes for Anthropic models
Anthropic employs a structured, multi-stage process to determine appropriate deployment strategies and scaling outcomes for its AI models. This process is guided by the Responsible Scaling Policy (RSP) and its reliance on AI Safety Level (ASL) Standards. The ultimate goal is to prevent catastrophic harm and mitigate risk to acceptable levels, balancing innovation with robust safeguards.
Capability Assessment
The first step involves a Capability Assessment, which determines how close a model’s capabilities are to established Capability Thresholds. These thresholds signal when a model’s potential risks necessitate upgraded safeguards (moving from ASL-2 to ASL-3, for example). The assessment process includes:
- Preliminary Assessment: A quick check to see if the model is “notably more capable” than previous models, measured either as a 4x or greater improvement in Effective Compute on automated tests in risk-relevant domains or as an accumulation of post-training improvements comparable to six months of continuous refinement.
- Comprehensive Assessment: Required if a model exceeds the “notably more capable” threshold. This assessment rigorously tests for the potential to reach Capability Thresholds, looking at combinations of actors, attack pathways, and what would be required to cause harm. This includes:
  - Threat model mapping to identify the most likely and consequential threat models.
  - Empirical testing to show the model does not have capabilities linked to these risks.
  - Elicitation exercises to determine whether researchers can elicit useful results from the model, even without safety mechanisms.
  - Forecasting of how further training may improve test results.
- Capability Report: A detailed document compiling findings from the assessment, making the case that the Capability Threshold is sufficiently far off and advancing deployment recommendations. This document, along with internal and external feedback, goes to Anthropic’s CEO and Responsible Scaling Officer for a final determination.
If the comprehensive assessment demonstrates a model falls sufficiently below a Capability Threshold, it can continue deployment by meeting the ASL-2 Standard. However, if the assessment suggests the model has possibly surpassed a threshold, it triggers an upgrade to ASL-3 Required Safeguards along with a follow-up capability assessment to determine whether the ASL-4 Standard is necessary.
Safeguards Assessment
If a model surpasses a Capability Threshold, a Safeguards Assessment is initiated to ensure appropriate risk mitigation by checking whether the implemented measures satisfy the ASL-3 Required Safeguards. This process includes crafting a Safeguards Report that documents evidence that the required standards have been met:
- Meeting ASL-3 Deployment Standards requires “defense in depth” using layers to catch attacks, red-teaming, rapid remediation, and trusted users with alternate safety controls.
- Meeting ASL-3 Security Standards ensures model weight protection through framework-aligned security controls, threat modeling using the MITRE ATT&CK framework, and perimeter and access controls.
If vulnerabilities are found, model deployment is restricted. Otherwise, deployment can proceed.
Deployment and Scaling Outcomes
Ultimately, Anthropic may proceed with deployment and further scaling if:
- The model’s capabilities remain sufficiently far from existing Capability Thresholds that the current ASL-2 Standard remains appropriate.
- The model has surpassed a Capability Threshold but meets the corresponding ASL-3 Required Safeguards.
If a model requires ASL-3 safeguards but these safeguards can’t be implemented in a timely manner, the policy mandates prompt action to reduce interim risk until the appropriate measures are in place. Stronger restrictions, such as model decommissioning or deleting model weights, may be imposed if interim risk mitigation isn’t plausible. Furthermore, pretraining activities are monitored so that training of any model approaching or exceeding the capabilities of a model held to the ASL-3 Standard is halted until adequate safeguards are implemented.
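The possible outcomes described in this answer reduce to a short decision ladder. The function below is a hedged restatement of that logic, omitting the reports, expert feedback, and executive sign-off that the real process requires; the outcome names are illustrative.

```python
from enum import Enum, auto

class Outcome(Enum):
    DEPLOY_AT_ASL_2 = auto()
    DEPLOY_AT_ASL_3 = auto()
    INTERIM_RISK_MITIGATIONS = auto()
    STRONGER_RESTRICTIONS = auto()   # e.g. de-deployment or deleting model weights

def scaling_outcome(sufficiently_below_threshold: bool,
                    asl3_safeguards_in_place: bool,
                    interim_measures_adequate: bool) -> Outcome:
    """Illustrative deployment/scaling decision ladder."""
    if sufficiently_below_threshold:
        return Outcome.DEPLOY_AT_ASL_2
    if asl3_safeguards_in_place:
        return Outcome.DEPLOY_AT_ASL_3
    if interim_measures_adequate:
        return Outcome.INTERIM_RISK_MITIGATIONS
    return Outcome.STRONGER_RESTRICTIONS
```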
Governance and Transparency
Governance measures are implemented internally across the company to ensure full compliance with the Responsible Scaling Policy. External input is sought out, and key information related to the testing of each model is often shared publicly, with sensitive details removed.
What protocols are in place to address scenarios where required safeguards cannot be immediately implemented
When a model reaches a point where ASL-3 Required Safeguards are deemed necessary, but these safeguards can’t be put in place immediately, Anthropic commits to a tiered approach to risk mitigation. The goal is to reduce interim risk to acceptable levels until the full ASL-3 safeguards are operational:
Interim Measures
The CEO and Responsible Scaling Officer (RSO) can approve interim measures that offer the same level of assurance as the ASL-3 Standard but are quicker or simpler to implement. These might include:
- Blocking model responses.
- Downgrading to a less-capable model in specific areas.
- Increasing the sensitivity of automated monitoring systems.
- Storing model weights in a single-purpose, isolated network that meets the ASL-3 Security Standard.
Any such plan is shared with Anthropic’s Board of Directors and the Long-Term Benefit Trust.
Stronger Restrictions
If interim measures are insufficient to adequately mitigate risk, Anthropic will implement more stringent restrictions, such as:
- De-deploying the model and replacing it with a model that falls below the Capability Threshold. Once ASL-3 Deployment Standard requirements are met, the model can be re-deployed.
- Deleting model weights, in the security context. Anthropic believes that, with the use of interim deployment and security protections, stronger restrictions should rarely be needed.
Monitoring Pretraining
Anthropic will not train models whose capabilities are comparable to or greater than those of the model requiring the ASL-3 Security Standard, operationalized as 1x or more in Effective Compute, until that standard is implemented. If a model in pretraining reaches comparable or greater capabilities, training will be paused until sufficient safeguards are in place.
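The pretraining gate above amounts to a single comparison in Effective Compute terms. The function below sketches it under the assumption that Effective Compute can be expressed as a scalar for both models; it is not Anthropic’s actual monitoring code.

```python
def may_continue_pretraining(candidate_effective_compute: float,
                             reference_effective_compute: float,
                             asl3_security_implemented: bool) -> bool:
    """Pause training of any model whose Effective Compute is 1x or more that of the
    model awaiting the ASL-3 Security Standard, until that standard is in place."""
    comparable_or_greater = candidate_effective_compute >= 1.0 * reference_effective_compute
    return asl3_security_implemented or not comparable_or_greater
```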
What internal governance structures support the Responsible Scaling Policy
To effectively implement the Responsible Scaling Policy (RSP) across the organization, Anthropic commits to maintaining several key internal governance measures. These structures are designed to ensure compliance, transparency, and accountability in the development and deployment of AI models.
Key Governance Elements
- Responsible Scaling Officer (RSO): A designated staff member is responsible for reducing catastrophic risks associated with AI models. The RSO’s duties include proposing policy updates, approving model training and deployment decisions based on capability and safeguard assessments, reviewing major contracts for policy consistency, overseeing policy implementation, addressing reports of noncompliance, notifying the Board of Directors of material noncompliance, and interpreting the policy.
- Incident Readiness: Internal safety procedures are developed for incident scenarios, such as pausing training upon reaching Capability Thresholds, responding to security incidents involving model weights, and addressing severe jailbreaks or vulnerabilities in deployed models. Exercises are conducted to ensure readiness for these scenarios.
- Internal Transparency: Summaries of Capability Reports and Safeguards Reports are shared with regular-clearance staff, with highly sensitive information redacted. A minimally redacted version is shared with a subset of staff for surfacing relevant technical safety considerations.
- Internal Review: Feedback is solicited from internal teams on Capability and Safeguards Reports to refine methodologies and identify weaknesses.
- Noncompliance Procedures: A process is maintained for Anthropic staff to anonymously report potential instances of noncompliance with the RSP. The noncompliance reporting policy protects reporters from retaliation, sets forth a mechanism for escalating reports to the Board of Directors, and mandates tracking, investigation, and corrective action for substantiated reports. The RSO regularly updates the Board on substantial cases of noncompliance and overall trends.
- Employee Agreements: Contractual non-disparagement obligations are not imposed on employees, candidates, or former employees in a way that would impede or discourage them from publicly raising safety concerns about Anthropic. Agreements with non-disparagement clauses will not preclude raising safety concerns or disclosing the existence of the clause.
- Policy Changes: Changes to the RSP are proposed by the CEO and RSO and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust (LTBT). The current RSP version is accessible online, with updates made publicly available before changes take effect, along with a changelog.
How does Anthropic ensure transparency and gather external input on its AI safety practices
Anthropic aims to advance the public dialogue on AI regulation and ensure stakeholders can examine its actions through several key measures:
Public Disclosures
The company commits to publicly releasing key information regarding the evaluation and deployment of its AI models. This excludes sensitive details, but includes summaries of Capability and Safeguards reports when a model is deployed. These reports detail safety measures that were taken. Anthropic will also disclose plans for current and future comprehensive capability assessments, as well as deployment and security safeguards. The company intends to periodically release information about internal reports of potential non-compliance incidents and other implementation challenges it encounters.
Expert Input
Anthropic will solicit external expertise during the development of capability and safeguard assessments. External experts may also be consulted prior to final decision-making on those assessments.
U.S. Government Notice
The policy mandates notifying a relevant U.S. Government entity if a model necessitates stronger protections than the ASL-2 Standard.
Procedural Compliance Review
On an approximately annual basis, Anthropic commissions a third-party review to assess whether the company has adhered to the policy’s main procedural commitments. These reviews specifically focus on adherence to the plan’s requirements rather than trying to judge the outcomes achieved. Anthropic also conducts the same type of reviews internally on a more regular schedule.
Public Communication
Anthropic maintains a public page (www.anthropic.com/rsp-updates) to provide overviews of past Capability and Safeguard Reports, RSP-related updates, and plans for the future. The page provides detail to facilitate conversations about industry best practices for safeguards, capability evaluations, and elicitation.
Governance and Transparency
Anthropic’s Responsible Scaling Policy (RSP) emphasizes both internal governance and external transparency. Key measures are in place to ensure policy implementation, promote accountability, and foster collaboration.
Internal Governance:
- Responsible Scaling Officer (RSO): A designated staff member oversees risk reduction by ensuring the effective design and implementation of the RSP. The RSO’s duties include policy updates, decision approvals, contract reviews, resource allocation, and handling of non-compliance reports.
- Readiness: Anthropic has developed internal safety procedures for incident scenarios, including pausing training, responding to security breaches, and addressing model vulnerabilities.
- Transparency: Summaries of Capability Reports and Safeguards Reports are shared internally to promote awareness and facilitate technical safety considerations.
- Internal Review: Feedback is solicited from internal teams on Capability and Safeguards Reports to refine methodologies and identify weaknesses.
- Noncompliance: A process allows staff to anonymously report policy non-compliance to the RSO. A policy protects reporters from retaliation and establishes escalation mechanisms. All reports are tracked, investigated, and addressed with corrective action.
- Employee agreements: Contractual non-disparagement obligations are constructed so as not to impede or discourage employees from voicing safety concerns about Anthropic.
- Policy Changes: Changes to this policy are proposed by the CEO and Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust.
Transparency and External Input:
- Public Disclosures: Key information on model evaluation and deployment is released publicly, including summaries of Capability and Safeguards Reports, plans for assessments, and details on safeguards, subject to redacting sensitive information.
- Expert Input: External experts are consulted during capability and safeguard assessments and final decision-making processes.
- U.S. Government notice: A relevant U.S. Government entity will be notified if a model requires more protections than ASL-2.
- Procedural Compliance Review: On approximately an annual basis, a third party reviews whether the policy’s procedural commitments were followed, rather than judging the outcomes achieved; similar reviews are conducted internally on a more frequent basis.
Ultimately, Anthropic’s layered approach to AI safety seeks to navigate the complex landscape of rapidly advancing AI capabilities. By proactively identifying risk thresholds, rigorously assessing model capabilities, and adapting safeguards accordingly, a proportional strategy emerges, designed to foster innovation while concurrently mitigating potential harms. The commitment to internal governance and external transparency underscores a dedication to responsible AI development and the ongoing pursuit of best practices for the benefit of society.