Phase 2: Flourishing

Introduction

If humanity succeeds at implementing the previous two phases of The Plan, the world will be in a stable situation with regard to AI: advanced AI research is regulated, dangerous AI proliferation is contained, and the riskiest research is conducted only by internationally coordinated organization(s) following strict safety protocols so as not to endanger human civilization.

The natural next step is then the development of Safe and Controllable Transformative AI, to benefit all of humanity. Not superintelligence, nor AGI, but transformative AI: AI developed not with ever-greater capabilities as an end in itself, but as a tool for humans, under human control, to unlock prosperity and economic growth. AI as a tool for humans to automate at scale, not AI as a successor species.

Phase 1 includes the creation of the Global Unit for AI Research and Development (GUARD), a central multilateral lab which is the only organization authorized to pursue frontier AI research.

Yet GUARD cannot simply continue with the currently dominant paradigms of machine learning research and still achieve its goal of ensuring that AI research is done in a sensible and grounded way: existing machine learning approaches focus on increasing capabilities without shedding any light on how AIs work or how to control them. It is therefore crucial to determine which alternative AI development paths GUARD could take while keeping humanity in control.

Thus, the Goal of Phase 2 is to Ensure Flourishing: build the science and technology for Safe and Controlled Transformative AI as a tool for human prosperity and growth.

Conditions For Safe Transformative AI

The development of Safe Transformative AI rests on three necessary conditions. All three must be satisfied for humanity to ensure that it only builds controllable yet powerful AI systems, which can then be used for various civilizational goals such as automating all intellectual and physical labor (see What Success Looks Like for more details on the uses and challenges of such technology).

These three conditions are:

  • Prediction of AI systems capabilities

  • Specification of AI systems guarantees

  • Enforcement of AI systems guarantees

Prediction of AI systems capabilities

The biggest obstacle to safe AI development with current ML technology is the inability to predict what an AI system can and cannot do. This is the case not only before pre-training or fine-tuning, but even after deployment of the AI system. In one example among many, Anthropic testers discovered by accident that their new model was able to recognize it was undergoing tests48 and to alter its behavior accordingly.

And despite extensive efforts to develop theories of Deep Learning49, mechanistic interpretability50, and evaluation frameworks51, nobody is yet able to predict what ML models can and cannot do.

Yet prediction is essential. In order to develop safe AI systems, it is critical that GUARD be able to predict what any AI system can do before building it, or at a minimum once it is built. Without this, there is no theoretical knowledge we can use to ensure that GUARD does not go too far in its research and build AI systems that come too close to uncontrolled superintelligence.

Given this, a condition for developing Safe Transformative AI is to advance our theoretical understanding of AI systems so that we can model and predict the capabilities of any AI system that GUARD might build.

Specification of AI systems guarantees

The next step towards safe AI systems lies in figuring out exactly which properties they need to satisfy in order to be safe. These might include guarantees that the systems remain controllable, that they are legible to users and inspectors, or that they never propose actions that are particularly unsafe.

Current ML research does not even attempt this, focusing instead on measures of efficiency, performance, and proxies such as “truthfulness.”52 These measures are also constantly gamed53 by machine learning systems, since they do not capture specific properties of the systems themselves, but merely statistical regularities in large amounts of low-quality data.

Given this, a condition for developing Safe Transformative AI is to specify which guarantees a safe AI system needs to uphold.

Enforcement of AI systems guarantees

Lastly, guarantees are only valuable if they are actually enforced. Safe AI development therefore requires the ability to ensure that the guarantees specified under the previous condition are actually upheld by a given AI system.

This is not the case in current machine learning systems for two reasons.

First, as mentioned above, current AI developers are unable to predict how ML systems will behave, even after they have finished training. Thus even after the fact, current machine learning theory provides no way to verify that the AI system follows the specification.

And second, current training techniques in machine learning search exclusively for algorithms and AI systems that score highly on a set of performance measures. We lack any suitable definition or specification of control, legibility, or safety that could be used as a goal of the training process. This means that ML systems are incentivized to disregard any of these properties whenever doing so helps them perform better on their performance indicators or downstream tasks.
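
Schematically (the notation below is ours, added purely for illustration and not drawn from any cited source), current training selects parameters by optimizing nothing but a performance loss:

    \theta^{\star} = \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \mathcal{L}_{\mathrm{perf}}\big(f_{\theta}(x),\, y\big) \right]

No term corresponding to control, legibility, or safety appears in this objective, because no workable formal definition of those properties currently exists to include.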

Given this, a condition for developing Safe Transformative AI is to enforce the guarantees that a safe AI system needs to uphold.

Recommendations For Safe Transformative AI

A detailed research agenda for satisfying the conditions of prediction, specification, and enforcement does not exist at this time, and producing one exceeds the scope of this document.

Yet we can infer the broad direction for tackling each of these conditions by looking at what is considered sensible and reasonable for other high-risk technologies.

Science of Intelligence

Recall that the first condition is Prediction of AI systems capabilities. Unless GUARD can predict the capabilities of AI systems before building them, it has no hope of maintaining safety while exploring the potential benefits of AI.

This problem of prediction was tackled in the same way across many existing high-risk technologies, such as civil engineering, aviation, and nuclear power. After some initial groping in the dark and experimentation, pioneers in these fields built sciences that slowly learned to model and predict each domain: structural engineering, aerodynamics, and nuclear physics, respectively.

There is no reason why AI systems should be different; thus the most direct way to satisfy the first condition is by developing a science around AI systems.

This raises the question of what such a science should study. Since the goal of AI systems is to automate various aspects of intelligence, and since the existential risks this document addresses center on general intelligence, the right science for AI systems is a science of intelligence.

Taking inspiration from these historical examples, the first step to building such a science is to design ways to measure the underlying phenomena. In structural engineering, this came through the mechanical testing of materials; in aviation, through the measurement of aerodynamic forces, notably in wind tunnels; in nuclear technology, through the measurement of radiation, for example with Geiger counters.

In each of these cases, the development of measurement methods was not just about building tools: it also required theoretical and conceptual innovation to figure out what to measure, and how to measure it, often indirectly, in order to get the right information.

Once intelligence can be sensibly measured, the data collected through these measurements will lead to a science of intelligence that can be used for predictive purposes. This will notably include a mechanistic model of intelligence: a decomposition of intelligence into components such that knowing which components are implemented in an AI system lets you predict its intelligence and capabilities before even building it or turning it on.
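
As a purely illustrative sketch of how such a model could be used, consider the following toy capability predictor; the components, capability axes, weights, and threshold below are hypothetical assumptions, not results from any existing science:

    # Toy sketch of a component-based capability predictor. The components,
    # capability axes, and contribution weights are invented for illustration;
    # a real science of intelligence would have to derive them empirically.
    from dataclasses import dataclass

    # Hypothetical contribution of each design component to each capability axis.
    CONTRIBUTIONS = {
        "long_term_memory": {"planning": 0.4, "deception": 0.3},
        "world_model":      {"planning": 0.5, "autonomy": 0.4},
        "self_reflection":  {"deception": 0.5, "autonomy": 0.3},
    }

    @dataclass
    class DesignReview:
        components: list           # components present in the proposed design
        capability_ceiling: float  # maximum per-axis score allowed by policy

        def predicted_capabilities(self):
            """Aggregate per-axis capability estimates from the design's components."""
            totals = {}
            for component in self.components:
                for axis, weight in CONTRIBUTIONS.get(component, {}).items():
                    totals[axis] = totals.get(axis, 0.0) + weight
            return totals

        def approved(self):
            """Approve only designs whose predicted capabilities stay under the ceiling."""
            return all(score <= self.capability_ceiling
                       for score in self.predicted_capabilities().values())

    review = DesignReview(components=["world_model", "long_term_memory"],
                          capability_ceiling=0.8)
    print(review.predicted_capabilities())  # {'planning': 0.9, 'autonomy': 0.4, 'deception': 0.3}
    print(review.approved())                # False: predicted planning exceeds the ceiling

The hard scientific work, of course, is discovering a decomposition and contribution weights that actually predict capabilities; the sketch only shows the kind of design-time check such a model would enable once it exists.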

Such a model would extend GUARD’s understanding of intelligence to the point where its members could anticipate the intelligence of various AI systems before building them, and thus both steer away from designs that are too powerful and aim for the least intelligent system that still accomplishes a given task.

This would satisfy the first condition, Prediction of AI systems capabilities.

Specification Language For AI Systems

Turning to the second condition, Specification of AI systems guarantees: to ensure that AI systems are safe, the first step is to be able to write down “what we want” from these systems. This includes properties such as controllability, legibility, and safety.

This goes beyond the fundamental science discussed in the previous recommendation: civil engineering needs to specify what counts as failure for a structure such as a bridge, and which failures are not acceptable; the same is true for aeronautics with aircraft failures, and for nuclear technology with radiation leakage or uncontrolled chain reactions.

Yet AI systems have one advantage over these other high-risk technologies: they are primarily software-based. This means they can leverage the advances that have been made in specifying software properties through formal specifications.

Still, there is currently no specification language that is sufficient for capturing the guarantees needed for AI systems. This is because these guarantees rely not only on what the AI system does, but also on how it interacts with other AIs and humans. Modeling humans and their interactions in formal logic is out of reach for current specification methods.

A specification language is not enough, though: it is also essential to figure out exactly which properties we need to express in this language. Since the sole purpose of the specification language is to allow these guarantees to be specified, formalizing the guarantees and designing the language will go hand in hand.

In the end, this effort will result in a formal specification language that can address any AI system behavior, including interaction with subcomponents, other AI systems, and humans. The guarantees that need to be upheld by safe AI systems will be written in this language, ensuring controllability, legibility, and safety.
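
As a deliberately simplified sketch of what machine-checkable guarantees could look like (the property names, fields, and API below are hypothetical, and real guarantees would require a far richer logic, including models of human interaction), a specification might be expressed as named predicates over a system’s proposed actions:

    # Toy sketch of specifying guarantees as checkable properties (hypothetical API).
    from dataclasses import dataclass

    @dataclass
    class Action:
        description: str      # what the AI system proposes to do
        reversible: bool      # can a human operator undo this action?
        explanation: str      # system-provided rationale, for legibility
        human_approved: bool  # has a human signed off on this action?

    # Each guarantee is a named predicate that every proposed action must satisfy.
    SPECIFICATION = {
        "controllability": lambda a: a.reversible and a.human_approved,
        "legibility":      lambda a: len(a.explanation) > 0,
        "safety":          lambda a: "self_modify" not in a.description,
    }

    def violations(action):
        """Return the names of the guarantees a proposed action would violate."""
        return [name for name, holds in SPECIFICATION.items() if not holds(action)]

    proposed = Action(description="update deployment config",
                      reversible=True, explanation="", human_approved=True)
    print(violations(proposed))  # ['legibility']: the action comes with no rationale

The difficulty highlighted above is precisely that real guarantees cannot be reduced to predicates this simple, because they depend on interactions with other AI systems and with humans.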

This would satisfy the second condition, Specification of AI systems guarantees.

Safe-By-Design AI Systems

Last but not least, the third condition asks for Enforcement of AI systems guarantees. In the end, what matters is that GUARD builds safe AI systems, which requires ensuring and enforcing the guarantees designed to make these systems controllable, legible, and safe.

Although it is possible to enforce these guarantees after building the AI systems, such an approach is insufficient, as comparisons with the standards already established for other high-risk technologies show.

Pointing to just one example, the UK’s nuclear regulation54 (EKP.1, p. 37 of the 2014 version) states that:

“The underpinning safety aim for any nuclear facility should be an inherently safe design, consistent with the operational purposes of the facility.

An ‘inherently safe’ design is one that avoids radiological hazards rather than controlling them. It prevents a specific harm occurring by using an approach, design or arrangement which ensures that the harm cannot happen, for example a criticality safe vessel.”

GUARD should thus enforce the guarantees specified for Safe Transformative AI by design. Whereas modern ad-hoc safety efforts attempt to fix issues after the fact, playing a losing game of Whack-A-Mole, a responsible approach to Safe Transformative AI must bake the guarantees into the architecture and structure of the AI systems themselves.

And not only should the AI systems be safe by design, they should be safe by design against unanticipated issues and stresses. Other industries use a factor of safety55 to make their systems more resilient against unforeseen incidents. GUARD needs an equivalent tool that can be applied to AI systems.
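
For a concrete analogy (all numbers below are purely illustrative), an engineering factor of safety is the ratio of a system’s capacity to the largest load it is expected to bear; a hypothetical AI equivalent might require a system’s predicted capability to sit well below the estimated threshold at which control could be lost:

    # Toy analogy to an engineering factor of safety (all numbers hypothetical).

    def factor_of_safety(capacity, expected_load):
        """Classic engineering definition: capacity divided by the expected load."""
        return capacity / expected_load

    # Structural example: a beam rated for 100 kN that carries at most 25 kN.
    print(factor_of_safety(capacity=100.0, expected_load=25.0))  # 4.0

    # Hypothetical AI analogue: only build systems whose predicted capability
    # stays below the estimated loss-of-control threshold by a chosen margin.
    REQUIRED_MARGIN = 3.0
    loss_of_control_threshold = 90.0  # illustrative capability score
    predicted_capability = 20.0       # would come from the science of intelligence

    print(factor_of_safety(loss_of_control_threshold, predicted_capability)
          >= REQUIRED_MARGIN)  # True: the design keeps the required margin

Note that the analogy presupposes the capability measure from the first condition: without a way to predict capabilities, no such margin can be computed, which is another reason the science of intelligence comes first.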

This means that at every step in the building of safe AI, the methods used must maintain these guarantees so as to preserve control, legibility, and safety. That way, most failures of safety and alignment will be prevented by design, and the remaining risks will be fewer and more manageable, making it more likely that they can be ironed out through systematic testing.

Such safe-by-design methods exist for current specification languages56 designed for conventional software, but they will need to be developed and checked for the more involved specification language necessary to satisfy the second condition.
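
As one crude illustration of what enforcement by construction can mean for software (a hypothetical sketch, not one of the cited methods), the specification check can be built into the only interface through which actions are ever executed, so that the violating path simply does not exist in the deployed system:

    # Toy sketch of enforcement by construction (hypothetical, not a cited method).
    # A single stand-in predicate plays the role of the full formal specification.

    def satisfies_spec(action):
        """Stand-in for checking a proposed action against the specification."""
        return action.get("human_approved", False) and \
               "self_modify" not in action["description"]

    class GuardedExecutor:
        """The only channel through which actions reach the outside world.

        By construction, an action that fails the specification is never
        executed: every call goes through this check, and nothing bypasses it.
        """

        def execute(self, action):
            if not satisfies_spec(action):
                raise PermissionError(f"blocked by specification: {action['description']}")
            print(f"executing: {action['description']}")

    GuardedExecutor().execute({"description": "update deployment config",
                               "human_approved": True})              # runs
    # GuardedExecutor().execute({"description": "self_modify weights"})  # would raise

The formal methods cited above aim to go further, establishing such guarantees statically by showing before the system ever runs that no violating behavior is reachable.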

This would satisfy the third condition, Enforcement of AI systems guarantees.

The path forward

The best examples of improving the safety and reliability of designed systems, up to the point where human risk is minimized, come from fields where safety thinking and formal methods are applied: aviation, space exploration, and nuclear energy.

The Conditions and the Recommendations for Phase 2 share this common thread. They steer Safe AI Research towards the successful and appropriate approaches of Safety Engineering and Formal Methods research, rather than the priorities of current machine learning research. This is the path for human science and engineering to master safe, controllable, transformative AI.

Some current projects in AI fit with the spirit of the conditions and recommendations above, and thus can provide inspiration: DARPA’s Explainable AI Project57, ARIA’s Safeguarded AI58, Conjecture’s CoEm59, and Guaranteed Safe AI research agendas60 by Tegmark & Omohundro, Dalrymple, Bengio, Russell and more.61

Get in touch

If you have feedback on A Narrow Path or want to know how you can help to support it, please get in touch with us directly.
