SkillOpt: Executive Strategy for Self-Evolving Agent Skills

🎬 SkillOpt Demo Video: https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7

Watch the full demo on YouTube: https://youtu.be/JUBMDTCiM0M

Introduction

SkillOpt reframes how intelligent agents are trained. Rather than modifying the underlying neural network weights, it trains agent skills: meta-level capabilities that govern behavior, strategy, and adaptation. You optimize the way an agent learns and acts using familiar training machinery (epochs, mini-batches, learning rates, and validation gates), while the core model weights stay untouched.

This separation between skill evolution and weight updates makes it practical to experiment with different training regimes, data curricula, and evaluation gates. The framework supports multiple benchmarks, ranging from QA and document understanding to embodied agents and code-generation tasks, without requiring you to retrain the base model every time you want a more capable agent. SkillOpt is aimed at researchers and engineers who want to improve agents through skill management rather than brute-force weight optimization.

Project scope and philosophy

Train skill regimes as you would train a neural network, including epochs, batch sizes, learning rates, and validation gates.
Keep a clear separation between the learning of skills (training policies, evaluation gates, and curricula) and the weights of the foundation model.
Encourage self-evolving capabilities by maintaining structured logs, skill snapshots, and step-wise artifacts that support iteration, auditing, and reproducibility.
Provide pragmatic support for diverse benchmarks, including question answering, embodied navigation, math, and code-related tasks.
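To make the idea of "training a skill like a network" concrete, here is a toy sketch of a validation-gated loop over epochs and mini-batches. It is not SkillOpt's implementation: `train_skill`, `toy_propose_edit`, and `toy_validate` are hypothetical names, and the optimizer model and validation split are replaced by simple stand-in functions. The only point it illustrates is that the skill document changes while no model weights do.

```python
import random
from typing import Callable, Sequence


def train_skill(
    skill: str,
    train_items: Sequence[str],
    propose_edit: Callable[[str, Sequence[str]], str],
    validate: Callable[[str], float],
    num_epochs: int = 2,
    batch_size: int = 1,
    seed: int = 0,
) -> str:
    """Evolve a skill document; model weights are never touched."""
    rng = random.Random(seed)
    best_skill, best_score = skill, validate(skill)
    for _ in range(num_epochs):
        items = list(train_items)
        rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            candidate = propose_edit(best_skill, batch)
            candidate_score = validate(candidate)
            # Validation gate: keep the edit only if it strictly improves.
            if candidate_score > best_score:
                best_skill, best_score = candidate, candidate_score
    return best_skill


# Toy stand-ins (assumptions) for the optimizer model and the validation split.
USEFUL_TOPICS = {"dates", "units"}


def toy_propose_edit(skill: str, batch: Sequence[str]) -> str:
    """Append one rule per batch topic that the skill does not cover yet."""
    for topic in batch:
        rule = f"- handle {topic}"
        if rule not in skill:
            skill += "\n" + rule
    return skill


def toy_validate(skill: str) -> float:
    """Reward rules for useful topics and penalize every other rule."""
    rules = [line for line in skill.splitlines() if line.startswith("- handle ")]
    useful = sum(1 for r in rules if r.removeprefix("- handle ") in USEFUL_TOPICS)
    return useful - 0.5 * (len(rules) - useful)


final_skill = train_skill(
    "Answer concisely.",
    ["dates", "units", "noise", "dates"],
    toy_propose_edit,
    toy_validate,
)
print(final_skill)
```

With a batch size of 1, the gate accepts the rules for the two useful topics and rejects the "noise" rule, because that edit lowers the validation score.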

Install and Setup

SkillOpt installs directly from source. It requires Python 3.10 or newer and a minimal set of dependencies, and the install process is designed to be reproducible in both research labs and production environments.

Requirements

Python 3.10 or newer.
A working Git client to clone the repository.
Optional: access to cloud or local LLM endpoints for evaluation (Azure OpenAI, OpenAI, Anthropic Claude, or Qwen vLLM).

Getting started

Clone the repository and install in editable mode, so you can modify code and configurations without reinstalling.
If you plan to run the ALFWorld benchmark, install the optional ALFWorld data components.

Code and configuration

You will work with a split data directory containing train, val, and test subdirectories. Each subdirectory holds a JSON file with a standardized format for task items.
Benchmarks are driven by configuration files that specify the target benchmark (e.g., SearchQA, ALFWorld, DocVQA) and model deployments for the optimizer and target models.

Below are the essential commands you will typically run:

Minimal install and setup:

    git clone https://github.com/microsoft/SkillOpt.git
    cd SkillOpt
    pip install -e .
    # For ALFWorld benchmark (optional):
    pip install -e ".[alfworld]"
    alfworld-download

Configure API credentials:

    cp .env.example .env
    # Edit .env with your API credentials, then:
    source .env

Note: each provider needs its own endpoint or API key. For Azure OpenAI, you must set an endpoint; to authenticate, either provide an API key or use Azure CLI authentication.

Credential examples (you’ll choose one path):

Azure OpenAI (recommended):

    export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
    export AZURE_OPENAI_API_KEY="your-key"
    # or use AZURE_OPENAI_AUTH_MODE="azure_cli"

OpenAI directly:

    export OPENAI_API_KEY="sk-…"

Anthropic Claude:

    export ANTHROPIC_API_KEY="sk-ant-…"

Qwen (local vLLM):

    export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
    export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
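Before launching a run, it can help to check which backend the current shell is actually configured for. The sketch below is a hypothetical helper, not part of SkillOpt. It assumes the conventional underscore-separated variable names (for example `AZURE_OPENAI_ENDPOINT` and `OPENAI_API_KEY`) and encodes the Azure rule described above: the endpoint is required, plus either an API key or an auth mode.

```python
import os
from typing import Dict, List, Mapping

# Assumed variable names; adjust them if your .env.example differs.
BACKENDS: Dict[str, Dict[str, List[str]]] = {
    "azure_openai": {
        "all": ["AZURE_OPENAI_ENDPOINT"],
        "any": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_AUTH_MODE"],
    },
    "openai": {"all": ["OPENAI_API_KEY"], "any": []},
    "anthropic": {"all": ["ANTHROPIC_API_KEY"], "any": []},
    "qwen_vllm": {"all": ["QWEN_CHAT_BASE_URL", "QWEN_CHAT_MODEL"], "any": []},
}


def configured_backends(env: Mapping[str, str]) -> List[str]:
    """Return the backends whose required variables are set and non-empty."""
    ready = []
    for name, rule in BACKENDS.items():
        has_all = all(env.get(var) for var in rule["all"])
        has_any = not rule["any"] or any(env.get(var) for var in rule["any"])
        if has_all and has_any:
            ready.append(name)
    return ready


# Report on the current process environment.
print("Configured backends:", configured_backends(os.environ))
```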

Data Preparation

SkillOpt expects data in a split directory with the following structure:

    data/my_split/
    ├── train/items.json
    ├── val/items.json
    └── test/items.json

Each JSON file is an array of task items. The exact fields depend on the benchmark. For example, a typical SearchQA item might look like:

    [
      {
        "id": "unique_item_id",
        "question": "Who wrote the novel …",
        "context": "[DOC] relevant passage text …",
        "answers": ["expected answer"]
      }
    ]

In short, you prepare your own data following this format, then point SkillOpt to the split directory during training and evaluation.

Note: Benchmark datasets are not included in the repository by default. You must provide your own data in the expected split format.
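As a starting point for preparing your own data, the following sketch writes a split directory in the layout described above. `write_split` is a hypothetical helper, and the example item uses placeholder values in the SearchQA-style fields shown earlier; other benchmarks will need different fields.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_split(split_dir: Path, items_by_split: Dict[str, List[dict]]) -> Path:
    """Write train/val/test items.json files under split_dir."""
    for split in ("train", "val", "test"):
        target = Path(split_dir) / split / "items.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        items = items_by_split.get(split, [])
        target.write_text(
            json.dumps(items, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return Path(split_dir)


# Placeholder item for illustration only.
example_item = {
    "id": "item-0001",
    "question": "Example question?",
    "context": "[DOC] example passage",
    "answers": ["example answer"],
}

with tempfile.TemporaryDirectory() as tmp:
    split_dir = write_split(
        Path(tmp) / "my_split",
        {"train": [example_item], "val": [example_item], "test": [example_item]},
    )
    written = sorted(
        p.relative_to(split_dir).as_posix() for p in split_dir.rglob("items.json")
    )
    print(written)
```

In real use you would pass a permanent directory instead of a temporary one and then point the training script at it.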

Supported Benchmarks

SearchQA: QA benchmark
Config: configs/searchqa/default.yaml
ALFWorld: Embodied agent benchmark
Config: configs/alfworld/default.yaml
DocVQA: Document QA benchmark
Config: configs/docvqa/default.yaml
LiveMathematicianBench: Math benchmark
Config: configs/livemathematicianbench/default.yaml
SpreadsheetBench: Code-generation benchmark
Config: configs/spreadsheetbench/default.yaml
OfficeQA: Tool-augmented QA benchmark
Config: configs/officeqa/default.yaml

Quick Start: Training and Evaluation

Training is the heart of SkillOpt. The framework supports multiple benchmarks by passing the appropriate config YAML, the path to your data split, and the desired model deployments for both the optimizer and target models. The commands below illustrate the typical flow for popular benchmarks.

Minimal training examples

Train on SearchQA:

    python scripts/train.py \
        --config configs/searchqa/default.yaml \
        --split_dir /path/to/your/searchqa_split \
        --azure_openai_endpoint https://your-resource.openai.azure.com/ \
        --optimizer_model gpt-5.5 \
        --target_model gpt-5.5

Train on LiveMathematicianBench:

    python scripts/train.py \
        --config configs/livemathematicianbench/default.yaml \
        --split_dir /path/to/your/livemath_split \
        --azure_openai_endpoint https://your-resource.openai.azure.com/ \
        --optimizer_model gpt-5.5 \
        --target_model gpt-5.5

Train on ALFWorld:

    python scripts/train.py \
        --config configs/alfworld/default.yaml \
        --split_dir /path/to/your/alfworld_split \
        --azure_openai_endpoint https://your-resource.openai.azure.com/ \
        --optimizer_model gpt-5.5 \
        --target_model gpt-5.5

Key CLI arguments (at a glance)

--config: Benchmark config YAML (example: configs/searchqa/default.yaml)
--split_dir: Path to data split directory (e.g., /path/to/split)
--azure_openai_endpoint: Azure OpenAI endpoint URL
--optimizer_model: Optimizer model deployment name (e.g., gpt-5.5)
--target_model: Target model deployment name (e.g., gpt-5.5)
--num_epochs: Number of training epochs
--batch_size: Batch size per step
--workers: Parallel rollout workers
--out_root: Output directory for results (e.g., outputs/my_run)
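If you launch many runs, assembling the command programmatically avoids copy-paste mistakes. The sketch below only builds the argument list and does not execute anything. `build_train_command` is a hypothetical helper, and the flag spellings assume the underscore-separated forms (`--split_dir`, `--azure_openai_endpoint`, `--out_root`, and so on); confirm them against the training script's own argument parser before relying on this.

```python
import shlex
from typing import List, Optional


def build_train_command(
    config: str,
    split_dir: str,
    azure_openai_endpoint: str,
    optimizer_model: str,
    target_model: str,
    num_epochs: Optional[int] = None,
    batch_size: Optional[int] = None,
    workers: Optional[int] = None,
    out_root: Optional[str] = None,
) -> List[str]:
    """Build an argv list for scripts/train.py (flag names are assumptions)."""
    cmd = [
        "python", "scripts/train.py",
        "--config", config,
        "--split_dir", split_dir,
        "--azure_openai_endpoint", azure_openai_endpoint,
        "--optimizer_model", optimizer_model,
        "--target_model", target_model,
    ]
    optional = {
        "--num_epochs": num_epochs,
        "--batch_size": batch_size,
        "--workers": workers,
        "--out_root": out_root,
    }
    # Only pass optional flags that were explicitly set.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


command = build_train_command(
    config="configs/searchqa/default.yaml",
    split_dir="/path/to/split",
    azure_openai_endpoint="https://your-resource.openai.azure.com/",
    optimizer_model="gpt-5.5",
    target_model="gpt-5.5",
    num_epochs=2,
    out_root="outputs/my_run",
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the repository root once the flags have been verified.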

Evaluation: Evaluate without training

If you want to evaluate a trained skill on specific data splits without additional training, you can do so with the eval-only mode.

Evaluate on the test set:

    python scripts/eval_only.py \
        --config configs/searchqa/default.yaml \
        --skill outputs/my_run/best_skill.md \
        --split valid_unseen \
        --split_dir /path/to/searchqa_split \
        --azure_openai_endpoint https://your-resource.openai.azure.com/

Evaluate on all splits (train + val + test):

    python scripts/eval_only.py \
        --config configs/searchqa/default.yaml \
        --skill outputs/my_run/best_skill.md \
        --split all \
        --split_dir /path/to/searchqa_split \
        --azure_openai_endpoint https://your-resource.openai.azure.com/

Output Structure: What SkillOpt Produces

Each training run yields a structured output directory that captures everything needed to resume, audit, and inspect the evolving skill.

    outputs/
    ├── config.json              # Flattened runtime config
    ├── history.json             # Per-step training history
    ├── runtime_state.json       # Resume checkpoint
    ├── best_skill.md            # Best validated skill document
    ├── skills/skill_vXXXX.md    # Skill snapshot per step
    ├── steps/step_XXXX/         # Per-step artifacts (patches, evals)
    ├── slow_update/epoch_XX/    # Slow update logs
    └── meta_skill/epoch_XX/     # Meta skill logs

If you need to re-run, SkillOpt will auto-resume from the last completed step, preserving your momentum without redoing completed work.
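A quick way to sanity-check a run directory is to confirm that the expected top-level artifacts exist and to find the most recent skill snapshot. The sketch below is a hypothetical helper that relies only on the layout listed above, with underscore-separated file names such as `runtime_state.json` and `best_skill.md` assumed. It does not assume anything about the contents of `history.json`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional, Union

# Assumed file names, taken from the output layout described above.
EXPECTED_FILES = ("config.json", "history.json", "runtime_state.json", "best_skill.md")


def summarize_run(out_root: Union[str, Path]) -> Dict[str, Union[List[str], int, Optional[str]]]:
    """Report missing top-level files and the skill snapshots in a run directory."""
    root = Path(out_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    skills_dir = root / "skills"
    snapshots = sorted(p.name for p in skills_dir.glob("*.md")) if skills_dir.is_dir() else []
    return {
        "missing": missing,
        "num_snapshots": len(snapshots),
        # Zero-padded names sort lexicographically, so the last one is the newest.
        "latest_snapshot": snapshots[-1] if snapshots else None,
    }


# Demo on a fake run directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "my_run"
    (run_dir / "skills").mkdir(parents=True)
    for name in ("config.json", "history.json", "best_skill.md"):
        (run_dir / name).write_text("{}", encoding="utf-8")
    for step in (1, 2):
        (run_dir / "skills" / f"skill_v{step:04d}.md").write_text("# skill", encoding="utf-8")
    summary = summarize_run(run_dir)
    print(summary)
```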

WebUI: Monitoring the Skill Evolution

SkillOpt offers an optional WebUI to monitor the progress, inspect skill documents, and visualize evaluation metrics in real time. The WebUI can be installed and launched with a couple of commands, and it includes a public sharing option for remote servers.

Installation and startup

    pip install -e ".[webui]"
    python -m skillopt_webui.app

Configuration options

--port: Server port (default 7860)
--host: Bind address (default 0.0.0.0)
--share: Create a public Gradio share link (off by default)

Public sharing

    # With public share link (useful for remote servers)
    python -m skillopt_webui.app --share

The WebUI provides dashboards to view the best skill, per-step patches, and evaluation results, making it easier to communicate progress to teammates and stakeholders.

Data Formats and Benchmarks: What to Expect

Data preparation is a critical prerequisite for SkillOpt. The framework expects consistent data formatting so that the line between "learning the skill" and "applying the skill" remains clear and auditable. Each item in the input JSON is a self-contained task descriptor that the agent can reason about and respond to.

The item-level fields depend on the benchmark. For QA tasks like SearchQA, the data includes a question, a contextual passage, and the answer. For embodied tasks like ALFWorld, the data captures actions and observations within a simulated environment.
You should prepare separate JSON files for train, validation, and test, following the split directory layout described earlier.
The repository contains the blueprint for how each benchmark expects its data to be organized, and you will see a corresponding default.yaml for each benchmark under the configs directory. This helps you switch seamlessly between benchmarks without changing the core training logic.

The philosophy behind data organization is simple: clearly separated data splits, consistent item schemas, and explicit fields that the skill training regime can rely on for episodic evaluation and policy improvement. By keeping data organized in this way, SkillOpt can scale across different domains and task types, while preserving a transparent record of how skills evolved over time.
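Because a consistent item schema matters so much, it is worth validating each `items.json` before training. The sketch below checks the SearchQA-style fields shown earlier (`id`, `question`, `context`, `answers`). `validate_items` is a hypothetical helper, and both the required-field set and the unique-`id` rule are assumptions made for illustration; other benchmarks use different fields.

```python
from typing import Any, List

# Assumed schema for a SearchQA-style item.
REQUIRED_FIELDS = {"id": str, "question": str, "context": str, "answers": list}


def validate_items(items: Any) -> List[str]:
    """Return human-readable problems; an empty list means the items look fine."""
    if not isinstance(items, list):
        return ["top-level JSON value must be an array of items"]
    errors = []
    seen_ids = set()
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"item {index}: expected an object")
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in item:
                errors.append(f"item {index}: missing field '{field}'")
            elif not isinstance(item[field], expected_type):
                errors.append(
                    f"item {index}: field '{field}' should be {expected_type.__name__}"
                )
        item_id = item.get("id")
        if isinstance(item_id, str):
            if item_id in seen_ids:
                errors.append(f"item {index}: duplicate id '{item_id}'")
            seen_ids.add(item_id)
    return errors


good = {"id": "a", "question": "Q?", "context": "[DOC] text", "answers": ["A"]}
bad = {"id": "a", "question": "Q?", "context": "[DOC] text"}
for problem in validate_items([good, bad]):
    print(problem)
```

Run it on the parsed contents of each split file, for example `validate_items(json.load(open(path, encoding="utf-8")))`, and fix any reported problems before training.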

Citations and Research Context

If you are exploring SkillOpt academically, you will likely cite the foundational work that introduces the concept of executive strategies for self-evolving agent skills. Here is the BibTeX entry provided by the project:

    @misc{yang2026skilloptexecutivestrategyselfevolving,
      title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills},
      author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
      year={2026},
      eprint={2605.23904},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.23904}
    }

Practical Considerations and Best Practices

Start with a clear objective: Decide what “self-evolving skill” means for your use case. Is it faster adaptation to new prompts, more robust reasoning under uncertainty, or better tool use in complex workflows?
Align data quality with ambition: The richness and correctness of your split data directly influence how effectively SkillOpt learns to evolve skills. Invest in clean data, appropriate labeling, and representative coverage of edge cases.
Leverage the modularity of configurations: Each benchmark has its own config YAML, but the underlying training loop abstracts away many details. This makes it easier to experiment with curricula, evaluation gates, and meta-skills without rewriting the core code.
Monitor and log: The output structure is designed to capture per-step artifacts. Use best_skill.md to see the best validated skill so far, and the per-step snapshots under skills/ to understand how skill quality evolves. Use history.json to diagnose training dynamics (e.g., learning rate schedules, early stopping indicators, validation gates).
Validate with diverse splits: When possible, use train/val/test splits that reflect the real-world distribution you expect the agent to encounter. This helps ensure that self-evolving skills generalize beyond a single dataset.
Consider the UX for researchers and operators: The WebUI is not merely decorative; it’s designed to give teams a straightforward way to observe skill trajectories, validate improvements, and share progress across departments.

What Makes SkillOpt Stand Out

Skill-centric training: Rather than chasing ever-smaller loss curves on a fixed model, SkillOpt focuses on the governance and scheduling of skill evolution, enabling more resilient and adaptive agents.
Independent of model weights: The core idea is to evolve skills without direct weight updates to the foundation model, preserving the integrity of the base model while still enabling sophisticated behavior changes.
Benchmark breadth: From QA to embodied tasks, SkillOpt demonstrates versatility across several challenging domains, illustrating how skill-opt strategies can be tuned to different problem settings.
Reproducibility and traceability: The structured outputs and per-step artifacts enable researchers to reproduce experiments, audit decisions, and share progress in a transparent, well-documented manner.

A Note on Data Privacy and Licensing

Because SkillOpt supports multiple backends (Azure OpenAI, OpenAI, Anthropic Claude, and local vLLMs), it’s important to handle credentials and data thoughtfully. Ensure that credentials are stored securely (env files or secret managers) and that any data usage complies with licensing and privacy constraints of your data sources and the deployed LLMs.

Conclusion

SkillOpt presents a compelling approach to advancing agent capabilities through the deliberate management of skills rather than brute-force weight updates. By treating training as the evolution of executive strategies—epochs, batch sizes, validation gates, and curricula—you gain a powerful toolkit for shaping self-improving agents across a spectrum of tasks. The framework emphasizes modularity, reproducibility, and practicality: clear data formats, benchmark-config-driven workflows, and transparent artifacts that chart the journey of skill evolution.

If you’re exploring autonomous reasoning, tool use, or robust prompt-driven agents, SkillOpt offers a pragmatic framework to test, compare, and mature the strategies that govern how an agent learns to learn. The project’s demonstration video and documentation provide a path to begin, while the open-source license invites experimentation and extension within your own research or product teams.

Appendix: Quick Reference to Key Files and Paths

Project page and resources: https://microsoft.github.io/SkillOpt/
Benchmark configs: configs/searchqa/default.yaml, configs/alfworld/default.yaml, configs/docvqa/default.yaml, configs/livemathematicianbench/default.yaml, configs/spreadsheetbench/default.yaml, configs/officeqa/default.yaml
Data split layout: data/my_split/{train,val,test}/items.json
Train script: scripts/train.py
Eval script: scripts/eval_only.py
WebUI module: skillopt_webui.app
Output root example: outputs/my_run/
Demo video: https://youtu.be/JUBMDTCiM0M

With SkillOpt, you’re invited to orchestrate the evolution of agent skills with the precision of an executive strategist: define the mission, set the training cadence, monitor the progress, and let the agent’s skills mature in a disciplined, auditable, and scalable way. The journey from data to evolving capability is laid out with clarity, enabling researchers and practitioners to experiment, compare, and improve in a structured, repeatable manner.
