How Meta Built a Self-Sustaining Efficiency Engine with AI Agents: A Step-by-Step Guide

Introduction

At Meta, where services reach more than 3 billion people, even a 0.1% performance regression can waste a significant amount of power. To tackle this, the Capacity Efficiency team built a unified AI agent platform that automates both finding and fixing performance issues. This guide walks through the approach Meta used to build this self-sustaining efficiency engine, which recovered hundreds of megawatts (MW) and cut manual investigation time from about 10 hours to about 30 minutes. Following these steps can help your organization scale capacity optimization without proportionally growing headcount.

Source: engineering.fb.com

What You Need

Existing monitoring infrastructure: A production system that tracks resource usage and can detect regressions (e.g., a tool similar to Meta's FBDetect).
Senior domain experts: Engineers with deep knowledge of performance optimization who can encode their expertise into reusable components.
AI/ML platform: A system capable of hosting and orchestrating agent-based workflows (e.g., an internal tool interface with API integration).
Pull request automation: Ability to auto-generate code changes and submit them for review.
Compute resources: Sufficient infrastructure to run AI agents without creating an additional bottleneck.
Cross-functional buy-in: Support from product teams, infrastructure, and management to adopt AI-driven efficiency processes.

Step-by-Step Instructions

Step 1: Establish a Two-Sided Efficiency Framework

Before automation, Meta divided efficiency into two complementary areas:

Offense: Proactively searching for code optimizations that make existing systems more efficient.
Defense: Monitoring production resource usage to catch regressions, root-cause them to a pull request (PR), and deploy mitigations.

This framework covers both sources of efficiency gains: offense for long-term improvements, defense for immediate protection. Start by setting up a dedicated process for each side. For regression detection, build on the tooling you already have; Meta relies on its in-house FBDetect.

Step 2: Encode Domain Expertise into Composable AI Skills

Meta's AI agents don't act blindly. They embed the knowledge of senior efficiency engineers into reusable, standardized 'skills'. Each skill tackles a specific investigation step—for example, analyzing a performance profile or identifying the offending commit. To replicate this:

Ask your top efficiency experts to document their typical investigation workflows.
Break those workflows into modular tasks that can be automated.
Build a skill library using a unified tool interface (e.g., API endpoints that agents can call).

This encoding allows new agents to compose multiple skills for complex cases, scaling expertise without hiring.
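One way to picture a composable skill library is a registry of small functions that all read and extend a shared investigation context. The sketch below is illustrative only: the names `SKILLS`, `skill`, `run_skills`, the two example skills, and the shape of the context dictionary are assumptions made for this example, not Meta's actual interface.

```python
from typing import Callable, Dict, List

# A skill takes the shared investigation context and returns an updated copy.
Skill = Callable[[dict], dict]

SKILLS: Dict[str, Skill] = {}  # hypothetical in-memory skill library


def skill(name: str) -> Callable[[Skill], Skill]:
    """Register a function in the skill library under a stable name."""
    def register(fn: Skill) -> Skill:
        SKILLS[name] = fn
        return fn
    return register


@skill("analyze_profile")
def analyze_profile(ctx: dict) -> dict:
    # Hypothetical step: pick the function with the largest CPU increase.
    deltas = ctx["cpu_delta_by_function"]
    return {**ctx, "hot_function": max(deltas, key=deltas.get)}


@skill("find_offending_commit")
def find_offending_commit(ctx: dict) -> dict:
    # Hypothetical step: first recent commit that touched the hot function.
    for commit in ctx["recent_commits"]:
        if ctx["hot_function"] in commit["touched_functions"]:
            return {**ctx, "suspect_commit": commit["id"]}
    return {**ctx, "suspect_commit": None}


def run_skills(names: List[str], ctx: dict) -> dict:
    """Compose skills by running them in order over a shared context."""
    for name in names:
        ctx = SKILLS[name](ctx)
    return ctx


result = run_skills(
    ["analyze_profile", "find_offending_commit"],
    {
        "cpu_delta_by_function": {"render_feed": 0.4, "parse_request": 0.05},
        "recent_commits": [
            {"id": "abc123", "touched_functions": ["parse_request"]},
            {"id": "def456", "touched_functions": ["render_feed"]},
        ],
    },
)
print(result["suspect_commit"])  # def456
```

Because every skill has the same signature, an agent can chain them in any order that makes sense for a given case, and a new skill becomes available to every workflow as soon as it is registered.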

Step 3: Integrate AI Agents with Production Monitoring

Meta connected its AI agent platform to FBDetect, its in-house regression detection tool. When FBDetect flags a regression, the agent automatically:

Downloads performance data.
Runs encoded skills to pinpoint the root-cause PR.
Suggests or even generates a mitigation PR.

This reduces the manual investigation time from about 10 hours to about 30 minutes. For your setup, make sure your monitoring tool exposes a webhook or API that triggers an agent workflow whenever a regression is detected.
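A webhook trigger of this kind could look roughly like the following, using only Python's standard library. The payload fields (`service`, `regression_pct`), the `MIN_REGRESSION_PCT` threshold, and the `investigate` placeholder are all assumptions for illustration; FBDetect's real event format is not described in the source.

```python
import json
from http.server import BaseHTTPRequestHandler

MIN_REGRESSION_PCT = 0.1  # hypothetical threshold; tune for your fleet


def investigate(event: dict) -> dict:
    """Placeholder for the agent workflow (fetch data, run skills, draft a PR)."""
    return {"service": event["service"], "status": "investigation_started"}


def handle_regression_event(body: bytes) -> dict:
    """Parse a webhook payload and start an investigation if it is significant."""
    event = json.loads(body)
    if event["regression_pct"] < MIN_REGRESSION_PCT:
        return {"service": event["service"], "status": "ignored"}
    return investigate(event)


class RegressionWebhook(BaseHTTPRequestHandler):
    """HTTP endpoint that the monitoring tool would POST regression events to."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        result = handle_regression_event(self.rfile.read(length))
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


print(handle_regression_event(b'{"service": "feed", "regression_pct": 0.5}'))
```

To serve it, pass the handler to `http.server.HTTPServer`, for example `HTTPServer(("", 8080), RegressionWebhook).serve_forever()`. Keeping the parsing and dispatch logic in a plain function makes it testable without starting a server.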

Step 4: Automate the Full Regression Lifecycle

Meta's agents don't stop at diagnosis. They fully automate the path from efficiency opportunity to a ready-to-review pull request. For each regression, the agent:

Confirms the regression is real and significant.
Identifies the responsible change and author.
Generates a code fix based on past successful patterns (by querying the skill library).
Submits a PR to the appropriate team for review.

This keeps human engineers focused on innovation while AI handles the long tail of regressions—especially important when thousands of regressions appear weekly.
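The four stages above can be sketched as a small pipeline. Everything here is a simplified assumption: the `Regression` fields, the 0.1% significance threshold, the commit and fix-pattern data shapes, and the rule that the PR reviewer is the suspect commit's author are all invented for illustration, and the final step returns a draft dictionary rather than calling a real code-review system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    suspect_commit: Optional[str] = None
    suspect_author: Optional[str] = None


def is_significant(reg, min_relative_change=0.001):
    """Stage 1: confirm the change exceeds a noise threshold (0.1% here)."""
    return (reg.current - reg.baseline) / reg.baseline >= min_relative_change


def attribute(reg, commits):
    """Stage 2: attach the first commit that touched the regressed metric."""
    for c in commits:
        if reg.metric in c["affects"]:
            reg.suspect_commit, reg.suspect_author = c["id"], c["author"]
            break
    return reg


def draft_fix(reg, known_patterns):
    """Stage 3: look up a previously successful fix pattern, if any."""
    return known_patterns.get(reg.metric)


def process(reg, commits, known_patterns):
    """Stage 4: run all stages and return a PR draft, or None to escalate."""
    if not is_significant(reg):
        return None
    reg = attribute(reg, commits)
    fix = draft_fix(reg, known_patterns)
    if reg.suspect_commit is None or fix is None:
        return None  # hand off to a human instead of guessing
    return {
        "title": f"Mitigate {reg.metric} regression from {reg.suspect_commit}",
        "reviewer": reg.suspect_author,
        "proposed_change": fix,
    }


commits = [{"id": "def456", "author": "alice", "affects": ["cpu_feed_render"]}]
patterns = {"cpu_feed_render": "cache the computed layout"}
pr = process(Regression("cpu_feed_render", 100.0, 100.5), commits, patterns)
print(pr["title"])  # Mitigate cpu_feed_render regression from def456
```

Returning `None` whenever a stage cannot produce a confident answer keeps the automated path narrow: the agent only opens a PR when it has both a suspect change and a known fix pattern.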

Step 5: Scale Opportunity Resolution with AI-Assisted Offense

On the offensive side, Meta expands AI-assisted opportunity resolution every half-year. The agents proactively scan codebases for potential optimizations, simulate the impact, and generate PRs. To scale offense:

Train agents on historical efficiency improvements (e.g., removing redundant work, reducing memory usage).
Use the same composable skills to explore new product areas automatically.
Prioritize opportunities with the highest estimated MW savings.

This allows the team to act on a growing volume of opportunities that engineers alone would never get to.
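Prioritizing by estimated savings can be as simple as a sorted list. The sketch below ranks candidates by expected megawatts saved; weighting the estimate by a confidence score is an assumption added here rather than something the source describes, and the `Opportunity` fields and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Opportunity:
    name: str
    estimated_mw: float  # estimated power savings in megawatts
    confidence: float    # 0.0 to 1.0, how much the estimate is trusted


def prioritize(opportunities: List[Opportunity], top_n: int = 3) -> List[Opportunity]:
    """Rank by expected savings (estimate weighted by confidence), highest first."""
    return sorted(
        opportunities,
        key=lambda o: o.estimated_mw * o.confidence,
        reverse=True,
    )[:top_n]


candidates = [
    Opportunity("remove redundant serialization", estimated_mw=2.0, confidence=0.5),
    Opportunity("shrink cache entries", estimated_mw=1.5, confidence=0.9),
    Opportunity("skip unused logging", estimated_mw=0.2, confidence=0.95),
]

for opp in prioritize(candidates, top_n=2):
    print(opp.name)
# shrink cache entries
# remove redundant serialization
```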

Step 6: Iterate Toward a Self-Sustaining Engine

Meta's end goal is a self-sustaining efficiency engine where AI handles the long tail. To get there:

Continuously feed new domain knowledge back into the skill library.
Monitor agent performance—track time saved, MW recovered, and false-positive rates.
Adjust confidence thresholds so critical regressions get immediate human attention.

Over time, the platform becomes smarter: each fix teaches the agents new patterns, reducing the need for manual intervention.
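The tracking and threshold ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `AgentStats` fields, the severity labels, and the 0.9 auto-submit threshold are hypothetical, and a rejected proposal is counted as a false positive for simplicity.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    fixes_proposed: int = 0
    false_positives: int = 0
    mw_recovered: float = 0.0
    hours_saved: float = 0.0

    def record(self, accepted: bool, mw: float = 0.0, hours: float = 0.0) -> None:
        """Log one agent proposal; rejected proposals count as false positives."""
        self.fixes_proposed += 1
        if accepted:
            self.mw_recovered += mw
            self.hours_saved += hours
        else:
            self.false_positives += 1

    @property
    def false_positive_rate(self) -> float:
        if self.fixes_proposed == 0:
            return 0.0
        return self.false_positives / self.fixes_proposed


def route(severity: str, confidence: float, auto_threshold: float = 0.9) -> str:
    """Send critical or low-confidence cases to a human before any PR is filed."""
    if severity == "critical" or confidence < auto_threshold:
        return "human_review"
    return "auto_pr"


stats = AgentStats()
stats.record(accepted=True, mw=0.25, hours=9.5)
stats.record(accepted=True, mw=0.5, hours=9.5)
stats.record(accepted=True, mw=0.25, hours=9.5)
stats.record(accepted=False)

print(stats.false_positive_rate)  # 0.25
print(route("critical", 0.99))    # human_review
print(route("minor", 0.95))       # auto_pr
```

If the false-positive rate climbs, raising `auto_threshold` pushes more cases to human review until the skill library catches up.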

Tips for Success

Start small, then expand: Pilot with one product area using Step 1's framework. Once proven, roll out to others.
Invest in tooling: A unified interface (Step 2) is critical. Without it, agents can't compose skills across domains.
Balance offense and defense: Don't neglect offensive optimization—it compounds savings. Use Step 5 to automate as much as possible.
Track tangible metrics: Measure megawatts recovered, engineering hours saved, and regression response time. Share wins to maintain buy-in.
Prepare for the long tail: As your system matures, most regressions and opportunities will be handled by AI—but always keep a human review loop for critical changes.

By following these steps, you can build a capacity efficiency program modeled on the one that recovered hundreds of megawatts at Meta, while freeing your engineers to innovate on new products.