Skills Are Files
The most consequential design decision in DenchClaw was making agent behavior plain text in the filesystem. Here's the full reasoning — and what it unlocks.
The most important part of many AI products is also the part users cannot see.
It lives in hidden system prompts, private evaluation harnesses, and application glue. The product does something surprising — summarizes a deal, enriches a contact, drafts a follow-up — and the user has to trust that the invisible text behind the curtain is sensible. There is no way to read it, no way to question it, and no way to improve it. You just hope.
I think this is the defining mistake of the current generation of AI products. Not the models. Not the interfaces. The mistake is hiding behavior.
In DenchClaw, we made a different choice. Agent capabilities are defined by SKILL.md files — plain markdown with frontmatter, living in the filesystem next to the code and data they operate on. This is not a novel file format. It is not a proprietary DSL. It is text in a directory. And I believe it is one of the most consequential design decisions we have made.
This post explains why.
The Pattern That Keeps Repeating
Before talking about skills specifically, it is worth noticing a pattern that runs through the entire history of software.
Important behavior starts out hidden. Then someone makes it readable. Then it becomes versioned. Then it becomes shareable. Then an ecosystem forms around it. This happens so reliably that you can almost use it as a diagnostic: if a category of behavior is still opaque and unversioned, it is probably about to go through this transition.
Source code is the original example. The earliest programs were sequences of machine instructions loaded on punch cards or typed into toggle switches. The behavior was there, but it was not readable in any human-friendly way. Then high-level languages arrived. Code became text. Text could be printed, reviewed, shared, and discussed. Version control followed. Package managers followed. Open source followed. The entire modern software ecosystem grew out of the moment when program behavior became readable text.
Configuration went through the same arc. For decades, system behavior was controlled by internal settings, registry entries, or GUIs that stored values in opaque binary formats. Then .conf files, .ini files, .env files, and YAML became the standard. Configuration became text. Now you could commit your config, diff it, review it in a pull request, and share it between environments. Entire categories of bugs — the "it works on my machine" kind — got easier to diagnose because the configuration was finally visible.
Infrastructure is the most dramatic recent example. Before Terraform, CloudFormation, and Pulumi, provisioning infrastructure meant clicking through a cloud console or filing a ticket with an ops team. The behavior existed — servers were created, networks were configured, permissions were set — but it was not recorded in any reviewable form. Infrastructure as Code changed that. Now the state of your infrastructure lives in files. Files that can be versioned, tested, reviewed, and rolled back. The result was not just convenience. It was a fundamental shift in how teams reason about their systems.
API specifications followed the same path. REST APIs used to be documented in wikis, PDFs, or not at all. Then OpenAPI (Swagger) turned API contracts into machine-readable files. Now you can generate clients, validate requests, and catch breaking changes before they ship — all because someone decided the API surface should be text in a repository.
Design systems are going through this transition right now. Design tokens, which control colors, spacing, typography, and motion, used to live in Figma files or designer heads. Increasingly they live in JSON or YAML files in a repository, versioned alongside the code that consumes them.
The pattern is always the same. Opaque behavior becomes text. Text becomes versioned. Versioned text becomes an ecosystem. The only variable is how long each transition takes.
I think agent behavior is going through this transition now.
The Problem with Hidden Prompts
If an agent is going to touch your CRM, your files, your browser, or your workflows, the instructions that govern that behavior are too important to be hidden. Hidden prompts have several problems, and they compound.
You Cannot Inspect What You Cannot Read
The most basic problem is readability. When something goes wrong with an AI agent — and things do go wrong — you need to understand what the agent was told to do. If the instructions are buried in a system prompt that you cannot see, you are debugging folklore. You are guessing at the rules from the behavior, like a physicist inferring natural laws from observation. That is fine for physics. It is a terrible way to debug software.
In a file-based system, the instructions are right there. You can open the skill, read it, and compare what it says to what the agent actually did. The gap between intent and behavior becomes visible. That gap is where all the interesting debugging happens.
This is not a theoretical concern. We have seen it repeatedly in our own development. A skill tells the agent to "verify the PIVOT view returns the expected data" after modifying a CRM object. The agent skips the verification. Why? Because the instruction was ambiguous — it did not specify what "expected" means. You can only diagnose that by reading the instruction. If the instruction is hidden, you are stuck staring at the output and wondering.
You Cannot Improve What You Cannot Review
The second problem is collaboration. Good software improves through review. Someone writes a change, someone else reads it, and the discussion produces something better than either person would have written alone. Pull requests, code reviews, design critiques — the entire quality culture of modern software development is built on the assumption that work can be inspected before it ships.
Hidden prompts break this loop. If the most important logic in your AI system is a prompt that lives in an internal dashboard or an environment variable, the team cannot review it. You might have a brilliant prompt engineer who crafts perfect instructions, but their work is invisible to the rest of the organization. Nobody can suggest improvements. Nobody can catch errors. Nobody can learn from it.
When the prompt is a file in the repository, it enters the normal flow of software development. Someone opens a pull request that changes a skill. The diff shows exactly what changed. A colleague comments: "This instruction is ambiguous — what does 'safely' mean here?" The skill gets better. This is the same mechanism that makes code better, and it works for the same reasons.
You Cannot Own What You Cannot Move
The third problem is ownership. If the behavior that defines your agent's capabilities is trapped inside a vendor's platform, you do not really own it. You might be able to customize it through an interface, but you cannot take it with you. You cannot run it in a different context. You cannot fork it when the vendor's vision diverges from yours.
Files solve this. A SKILL.md file is yours. It lives in your repository, your filesystem, your backup. You can copy it to another project. You can share it with a colleague. You can publish it. You can stop using DenchClaw tomorrow and the file still works as a document of what the agent was supposed to do. It is not locked in.
This matters more than most people realize at the beginning. In the early days of adopting any AI tool, the instructions seem like a small thing. But over time, the accumulated knowledge encoded in those instructions becomes a significant asset. It is organizational knowledge about how your CRM should behave, what your sales process looks like, what your team's conventions are. Losing that because you switch vendors is a real cost.
Why We Chose Markdown
The format matters. We chose markdown deliberately, and it is worth explaining why we did not choose several obvious alternatives.
Why Not Code?
The most natural instinct for engineers is to define behavior in code. Write a function. Define an interface. Compose modules. This is how we build everything else, so why not agent skills?
The problem is the audience. Skills are not only written by engineers. They are written by ops people, sales managers, product managers, and sometimes the agents themselves. The whole point of a skill is that it encodes a capability in a form that is easy to create and easy to understand. Code raises the authorship barrier. You need to know a language, understand a type system, set up a development environment, and run a build step. For a document that is essentially "here is how to do this task well," that is too much ceremony.
There is also a subtler issue. Code implies precision. It suggests that the instructions will be executed literally, like a program. But agent skills are not programs. They are guidance. They describe intent, priorities, constraints, and heuristics. They are closer to a brief you would give a capable colleague than to a function you would write in TypeScript. Markdown is the right level of formality for that kind of communication.
Why Not YAML or JSON?
Structured formats like YAML and JSON are good for machine-readable configuration, but they are bad for human-readable instructions. A skill needs to contain paragraphs of natural language, ordered steps, conditional guidance ("if X, then do Y, otherwise Z"), examples, and sometimes even sample code. YAML can technically hold all of this, but it turns into a mess of multiline strings and indentation traps.
We do use YAML — but only for the frontmatter. The metadata about a skill (its name, description, triggers) is structured data, and YAML handles that cleanly. The body of the skill is prose, and prose belongs in markdown.
Why Not a Visual Builder?
Some platforms let you define agent behavior through a visual interface — drag-and-drop workflow builders, decision trees, form-based configuration. These have legitimate uses, especially for non-technical users building simple automations.
But they fail at the things that matter most for skills. You cannot put a visual builder's output into version control in a meaningful way. You cannot diff two versions of a workflow and understand what changed. You cannot review it in a pull request. You cannot grep across all your skills for a particular pattern. You cannot compose it with other text-based tools.
Visual builders also tend to constrain what you can express. The builder's designers decide what nodes are available, what connections are possible, what conditions you can check. When you hit the boundary of what the builder supports, you are stuck. A markdown file has no such boundary. You can write whatever you need to write.
The Frontmatter Convention
A DenchClaw skill file looks like this:
```markdown
---
name: crm/object-builder
description: Create or update CRM objects safely
---

When asked to create or modify a CRM object, follow these steps:

1. Update the DuckDB schema to reflect the new or changed object.
2. Update the matching `.object.yaml` definition file.
3. Run a verification query against the PIVOT view to confirm the data matches expectations.

If any step fails, stop immediately. Do not proceed to the next step. Fix the inconsistency before continuing.

## Edge Cases

- If the object already exists and the user's request conflicts with the existing schema, ask for clarification before modifying.
- If the PIVOT view query returns zero rows, treat this as a verification failure even if no error was thrown.
```

The frontmatter provides structured metadata that the system can parse: the skill's name, a one-line description, and optionally tags, trigger patterns, or dependency declarations. The body is natural language. That is the entire format.
There is nothing magical about that file. A new team member can read it in thirty seconds and understand what the agent will try to do. A senior engineer can review it in a pull request and suggest improvements. An agent author on another team can fork it, modify the verification step, and have a new skill for their own workflow. The simplicity is the feature.
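The convention is also simple enough for any tool to consume. Below is a minimal, illustrative sketch (not DenchClaw's actual parser) of how a runtime might split a SKILL.md document into frontmatter and body. It handles only flat `key: value` pairs; a real implementation would hand the frontmatter block to a YAML parser.

```python
def parse_skill(text: str) -> tuple[dict, str]:
    """Split a SKILL.md document into (frontmatter, body).

    Frontmatter is the block between the leading '---' fences.
    This sketch parses only flat `key: value` pairs; a real
    implementation would use a YAML library for the block.
    """
    meta: dict[str, str] = {}
    if text.startswith("---\n"):
        fence_end = text.index("\n---", 4)  # position of the closing fence
        block = text[4:fence_end]
        body = text[fence_end + 4:].lstrip("\n")
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        return meta, body
    return meta, text  # no frontmatter: the whole file is body


skill = """---
name: crm/object-builder
description: Create or update CRM objects safely
---
When asked to create or modify a CRM object, follow these steps:
"""

meta, body = parse_skill(skill)
```

The point is not the parser. The point is that the format is trivial for generic tools to read, which is what makes an ecosystem of generic tools possible.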
Authorship and the Permission to Teach
One of the underappreciated consequences of making skills files is what it does to authorship.
In most AI products, teaching the agent something new requires access to an internal tool, a special role, or at minimum a working knowledge of prompt engineering as a discipline. The act of improving agent behavior is gatekept — sometimes intentionally, sometimes just by friction.
When a skill is a file, anyone who can write text can author one. The barrier is not technical. It is conceptual: can you clearly describe how to do a task well? If you can, you can create a skill.
This matters because the people who best understand how a task should be done are often not the people building the AI system. The sales manager who has run pipeline reviews for a decade knows what a good pipeline review looks like. The support lead who has handled escalations for years knows the right sequence of checks. The ops person who has set up CRM integrations dozens of times knows the common failure modes.
In a code-based or platform-based system, those people are consumers. They use the agent but cannot improve it. In a file-based system, they are potential authors. They can write a SKILL.md that captures their expertise, put it in the right directory, and the agent immediately has a new capability.
We have seen this happen in practice. Non-engineering team members at DenchClaw have authored skills for things like deal review checklists, data hygiene procedures, and meeting preparation workflows. These are not trivial prompt snippets. They are detailed, thoughtful documents that encode real operational knowledge. And they were written by the people who actually have that knowledge, not by an engineer translating it secondhand.
Composition and the Filesystem
Files do not exist in isolation. They exist in filesystems, and filesystems have structure. This turns out to matter a lot for skills.
In DenchClaw, skills live in directories. The directory structure itself carries meaning. A skill at .cursor/skills/crm/object-builder/SKILL.md is a CRM skill about building objects. A skill at .cursor/skills/deploy/staging/SKILL.md is a deployment skill for the staging environment. You can organize skills by domain, by team, by workflow, or by any other axis that makes sense for your organization. The filesystem is the namespace.
This structure enables composition without any special mechanism. A skill can reference other skills by path. A skill can say "follow the deployment checklist at ../deploy/checklist/SKILL.md before proceeding." The agent resolves the reference the same way any program resolves a file path. There is no import system to learn, no module registry to configure, no dependency graph to manage. It is just files pointing to other files.
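Resolving such a reference is ordinary path arithmetic: the reference is interpreted relative to the directory the referring skill lives in, the same way a program resolves a relative file path. A small sketch, with a hypothetical directory layout (this is illustrative, not a DenchClaw API):

```python
from pathlib import Path


def resolve_skill_reference(current_skill: Path, reference: str) -> Path:
    """Resolve a reference found inside a skill, relative to the
    directory that skill lives in -- the same rule any program
    uses for relative file paths."""
    return (current_skill.parent / reference).resolve()


# A CRM skill referencing a deployment checklist (paths illustrative):
current = Path("/repo/.cursor/skills/crm/object-builder/SKILL.md")
target = resolve_skill_reference(current, "../../deploy/checklist/SKILL.md")
```

No import system, no registry: the reference is just a path, and the filesystem does the rest.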
Directories also enable scoping. You can have project-level skills that live in the project repository, team-level skills that live in a shared directory, and personal skills that live in your home directory. The agent discovers skills by walking the filesystem, and the filesystem determines what is available in a given context. A skill in a project repository is available when you are working on that project. A skill in your home directory is available everywhere. This is the same scoping model that Unix uses for configuration files, and it works for the same reasons: it is simple, predictable, and composable.
The Version Control Dividend
When behavior lives in files, and files live in a repository, you get version control for free. This sounds mundane, but the consequences are significant.
Blame
git blame on a skill file tells you who wrote each line, when, and in what commit. When a skill produces unexpected behavior, you can trace the instruction back to the person who wrote it and the context in which they wrote it. You can read the commit message. You can look at the pull request. You can understand not just what the instruction says, but why it was written that way.
This is impossible with hidden prompts. When a cloud-hosted agent behaves unexpectedly, you cannot ask "who changed the system prompt last Tuesday, and why?" The change history does not exist in a form you can access.
Diff
git diff on a skill file shows you exactly what changed between two versions. "The verification step used to require three matching rows; now it requires one." That kind of precise change tracking is exactly what you need when debugging a regression in agent behavior.
We have caught real bugs this way. A well-intentioned edit to a skill relaxed a constraint that was there for a reason. The diff made the relaxation obvious. Without the diff, the behavior change would have been invisible until it caused a problem downstream.
Rollback
If a skill change causes problems, you can revert it. git revert <commit> and the agent goes back to its previous behavior. This is the same rollback mechanism that every engineering team already uses for code, and it works identically for skills.
In a platform-based system, rolling back agent behavior usually means remembering what the previous prompt said and manually restoring it. If you do not have a copy, you are out of luck. With files and version control, the history is always there.
Branching
You can experiment with skills on a branch. Create a new branch, modify a skill, test the new behavior, and merge if it works. If it does not, delete the branch. The main branch's skill is untouched. This is the same workflow engineers use for code experiments, and it applies directly to agent behavior.
This is particularly useful for A/B testing different instruction styles. "Does the agent perform better if we give it step-by-step instructions or a high-level goal?" Put each version on a branch, test both, and merge the winner.
Portability and Ecosystems
The thing people consistently underestimate about file-based systems is how quickly they escape the product that created them.
Once something is text in a directory, it stops being trapped. You can search it, lint it, sync it, templatize it, package it, and learn from it. Other tools can read it. Other humans can improve it. New conventions emerge without asking the platform owner for permission.
We already see this happening with skills. They are composable — one skill can reference another. They are portable — you can copy a skill from one project to another and it works. People can publish them to places like skills.sh or ClawHub and install them with npx skills add <publisher>/<skill>. That is not a product feature we had to build. It is a consequence of the format being simple enough that generic tools can work with it.
The Package Manager Analogy
Consider what happened when JavaScript modules became files with a standard format. npm emerged. Then yarn. Then pnpm. Then bundlers, linters, formatters, and an entire constellation of tools that all operated on the same underlying format: JavaScript files with a package.json manifest.
None of those tools were planned by the people who designed JavaScript modules. They emerged because the format was open, the files were accessible, and creative people saw opportunities. The ecosystem was an emergent property of the design decision, not a planned feature.
Skills are at the beginning of this curve. The format is simple. The files are accessible. The tooling is starting to appear. I do not know exactly what the ecosystem will look like in two years, but I am fairly confident it will exist, because the preconditions are the same as every previous ecosystem that emerged from file-based standards.
Cross-Tool Compatibility
There is another dimension to portability that matters. A SKILL.md file is not deeply tied to DenchClaw's runtime. It is markdown with frontmatter — a format that many tools already understand. An agent built on a different platform could read the same file and extract useful instructions from it. The skill is not a DenchClaw plugin. It is a document that happens to be useful to DenchClaw.
This is a deliberate design choice. We do not want skills to be a proprietary format that locks users into our platform. We want them to be a portable standard that any agent system can consume. If a better CRM agent comes along tomorrow and it reads SKILL.md files, our users can take their skills with them. That is good for users, and in the long run, it is good for us too, because it reduces the risk of adopting our platform.
What Goes Wrong Without Files
It is useful to look at the failure modes of the alternative. When agent behavior is not in files — when it is hidden in system prompts, platform configuration, or compiled code — specific things go wrong.
Debugging Becomes Archaeology
Without readable skill files, debugging agent behavior is an exercise in reverse engineering. The agent did something wrong. Why? You look at the output. You try to infer what instruction could have produced it. You tweak the prompt and try again. This is the AI equivalent of debugging a compiled binary without source code. It works, eventually, but it is slow and painful.
With skill files, you read the instruction, identify the ambiguity or error, fix it, and re-run. The debugging loop is tighter by an order of magnitude.
Knowledge Stays Trapped in Heads
When there is no file to capture a skill, the knowledge about how to perform a task stays in someone's head. This is fine until that person goes on vacation, changes roles, or leaves the company. Then the institutional knowledge evaporates.
Skill files are a form of documentation that actually gets used, because they are not just documentation — they are executable instructions. The agent reads them and acts on them. This gives people a reason to keep them accurate, which solves the fundamental problem with traditional documentation: nobody updates it because nobody reads it.
Improvement Requires a Specialist
Without files, improving agent behavior requires access to whatever system the prompts are stored in. Usually this means an engineer or a prompt specialist. The person who actually knows how the task should be done — the domain expert — cannot make the change themselves. They have to describe what they want to someone else, who then translates it into a prompt, which may or may not capture the original intent.
This is the same problem that plagued pre-IaC infrastructure. The person who knew what the server should look like had to file a ticket for someone else to configure it. The translation step introduced errors, delays, and frustration. Moving to files eliminated the intermediary.
The Tradeoffs
I would be dishonest if I claimed that file-based skills solve everything. They do not. There are real tradeoffs, and they are worth acknowledging.
Files Can Be Wrong
A markdown file can contain vague instructions, incorrect procedures, or outdated information. The agent will follow bad instructions just as faithfully as good ones. The format does not guarantee quality.
But this is equally true of code, configuration, infrastructure definitions, and every other category of behavior that moved to files. Files do not make things correct. They make things inspectable. Inspectability is what enables correction. You cannot fix what you cannot see.
Some Capabilities Need Real Code
Not every agent capability can be expressed in a markdown file. Some things require real code: API integrations, data transformations, complex conditional logic, performance-critical operations. A skill file that says "call the Salesforce API and merge the results with local data" is useful as guidance, but the actual implementation will be code.
In DenchClaw, skills and code coexist. A skill describes the high-level procedure and the constraints. Code handles the execution. The skill is not a substitute for engineering. It is a complement. It provides the "what" and "why" and "when." Code provides the "how."
Natural Language Is Ambiguous
Markdown is natural language, and natural language is ambiguous. "Create the object safely" means different things to different people. An agent may interpret it differently than the author intended. This is a genuine limitation of the format.
The mitigation is specificity. Good skills are specific. They name the exact steps, the exact checks, the exact failure conditions. They include examples. They anticipate edge cases. Writing a good skill is a craft, just like writing good documentation or good code. The format does not guarantee quality, but it enables it.
Discovery Is Not Automatic
When skills are files in a filesystem, discovering what skills are available requires knowing where to look. There is no centralized registry that shows you "here are all the things this agent can do." You have to explore the directory structure or use search tools.
We are working on improving discovery — indexing skills, surfacing them contextually, recommending relevant skills based on the current task. But the underlying representation stays the same: files in directories. The discovery layer is an interface on top of the filesystem, not a replacement for it.
Skills as Organizational Memory
There is a broader point here that goes beyond the technical arguments.
Organizations accumulate knowledge. They learn how to run meetings, how to qualify leads, how to handle escalations, how to onboard new employees, how to review contracts. This knowledge usually lives in two places: people's heads and scattered documents that nobody reads.
Skills offer a third option: knowledge that is both readable by humans and actionable by agents. When a senior sales rep writes a skill that describes how to prepare for a pipeline review — what data to pull, what questions to ask, what red flags to watch for — that knowledge is not just documented. It is operationalized. The agent uses it. New team members benefit from it. The organization retains it even when the author moves on.
This is a subtle but important shift. Traditional documentation has a motivation problem. Writing docs takes time. Nobody feels rewarded for it. And the docs rot because nobody has a reason to keep them current. Skills have a built-in feedback loop: if the skill is wrong, the agent does the wrong thing, and someone fixes the skill. The agent's behavior is a continuous test of the skill's accuracy.
I think this pattern — knowledge encoded in files, executed by agents, improved through version control — will become a standard part of how organizations operate. It is documentation that works.
What We Have Learned
We have been running with file-based skills in DenchClaw for long enough to have some real observations, not just theory.
Skills get better over time. The initial version of a skill is usually rough. It captures the happy path but misses edge cases. After a few rounds of use, someone encounters an edge case, adds a clause to the skill, and the agent handles it next time. This incremental improvement cycle is the same one that makes code better over time, and it works because the skill is in a format that supports editing and review.
Non-engineers write surprisingly good skills. We expected that skills would mostly be written by engineers. In practice, some of the best skills have been written by people with deep domain knowledge and no engineering background. They know how to describe a process clearly because they have been running that process for years. The markdown format does not get in their way.
Skills compose naturally. When one skill references another, the composition is obvious and easy to follow. There is no indirection, no dependency injection, no configuration layer. It is just one document pointing to another. This makes complex workflows surprisingly easy to understand.
The filesystem is a good enough namespace. We considered building a skill registry, a tagging system, a search index. We may still build some of those things. But the basic filesystem — directories as categories, filenames as identifiers, paths as references — handles the common case well. We have not outgrown it yet.
Version control catches real bugs. We have caught skill regressions in code review. Someone changes a skill and a reviewer notices that the change removes an important constraint. This is exactly what code review is supposed to do, and it works identically for skills because they are in the same system.
Where This Goes
I think the separation of agent behavior from agent infrastructure is one of the most important design decisions in the current generation of AI products. It is the difference between a product and a platform.
When behavior is hard-coded into the agent, the agent can only do what the vendor anticipated. When behavior is in files, the agent can do whatever anyone teaches it to do. The set of capabilities is open, not closed. And the people who extend that set are not limited to the vendor's engineering team.
This is the same dynamic that made Unix pipes powerful, that made the web successful, that made open source viable. The platform provides the runtime. The community provides the content. The content is more valuable than the runtime, and it accrues to the users, not the vendor.
I do not think we have fully explored the implications yet. What happens when skills can call other skills with well-defined interfaces? What happens when skills can declare preconditions and postconditions? What happens when skills can be tested automatically, with assertions about the agent's behavior? What happens when skills become a medium for organizational knowledge management, not just agent instruction?
These are open questions, and I find them genuinely exciting. But they are all downstream of a single design decision: skills are files.
Not skills should be files. Not skills could be files. Skills are files.
Everything else follows from that.
