For the people shipping the skills

Your Skills Are Not Logged.

For as long as we've been shipping AI skills, we've been writing instructions and hoping they worked. We had no instrument to tell us otherwise. The skills ran in the dark. And we quietly accepted that this was fine.

  Keep reading. This changes how your company runs AI.

Every AI skill we've ever shipped has been invisible.

Not because the skills were bad. Not because the team didn't use them. Because the entire model of shipping a skill — write it, deploy it, hope — has never had a way to tell anyone what actually happened next.

Consider the pattern. It plays out at every AI-curious company, in every industry:

A builder spends an afternoon writing a careful, well-scoped Claude skill. It's actually good.

A team member triggers it once. Maybe twice if it impressed them.

Two weeks later, no one remembers if anyone is still using it.

Failures are silent. Adoption is anecdotal. Improvement requests arrive over Slack — "hey can we change this thing?" — with no data behind them.

The skill becomes a story. The builder becomes the only person who remembers what it was supposed to do. Quality varies by who reported what.

Next skill ships. Repeat.

We've been shipping work into a void and calling it operations.

AI skills aren't software in the way we're used to thinking about software. They don't throw a stack trace. They don't show up in a usage dashboard. They don't fail in red. They just run, complete, and disappear — and the only signal you ever get back is what someone mentions in passing.

But every other kind of operational work we do has a record. Every job site has a daily log. Every operating room has a chart. Every flight has a recorder. We don't run real work without writing down what happened. AI skills are the only category where we shipped without that instinct, and called it good.

And here's the thing we want to say with love —

Every company has someone who has been right about this from day one. The one who asked "but how do we know if it's working?" while the rest of us were busy being excited about the demo. Who flagged that the skill that worked great for the founder might be silently failing for the new PM. Who wanted instrumentation before they wanted the next feature. Who got dismissed as the killjoy in the AI conversation.

Every meeting they were in, they were the slow voice. The reasonable one. The one asking the question nobody wanted to slow down for. And they were always right.

The person asking "how do we measure this?" was never wrong to ask. They were just outvoted by the demo.

Those people are the reason any of this becomes durable. They are not the obstacle. They are the reason quality ever survives the gap between "this works on my machine" and "the team uses it every day." And they deserve the tool that finally lets the answer be real data, not gut feel.

What if every skill carries its own witness?

What if telemetry isn't a dashboard you bolt on later — but a stanza you write into every skill, at the moment you write the skill, so the skill cannot run without leaving a trace?

The job site

A job without a daily log produces a project no one can reconstruct.

Crews showed up. Material arrived. Inspections passed or failed. Phone calls were made. Six weeks later, when something goes sideways, nobody can walk back through what actually happened. We don't run job sites this way. Every site has a log.

The skill site

A skill that logs itself produces a record you can walk back through any time.

Started here. Pulled this. Made this decision. Hit this warning. Finished at that time. Got this feedback. Same way you can read a daily log six months after close and reconstruct any decision. The skill is the job. The witness is the log.

We've kept daily logs on every job site for fifty years. We've shipped AI for two years with none. Until now.

Skillmeter is the instrument. A small set of tools — log_skill_start, log_skill_event, log_skill_end, log_skill_feedback, plus two for querying — wired into every skill in the library. The skill author doesn't write telemetry code. They include the right call at the right moment, and the skill leaves a trace whether it succeeds, fails, or gets killed mid-run.

The skill itself stays lean: the work, the prompts, the judgment calls. The instrumentation is non-negotiable but tiny. A handful of lines per skill. A few hundred milliseconds per run. The same way you'd never ship production code without logging.

Do the work. But everyone shows their work the same way.

Here is a real skill. Before and after.

This is one Titus Contracting runs every week — a scope review skill that compares a project's PDA against its signed HIA and surfaces what changed before kickoff. It's careful. Multi-step. And until recently, it ran in total silence.

Before · Written without a witness · v1.0

jobtread-scope-review

When invoked:

1. Resolve the job from the user's message.
2. Fetch the PDA from JobTread.
3. Fetch the HIA from JobTread.
4. Fetch vendor walk scope, bid scope,
   and vendor bids.
5. Run the comparison logic.
6. Produce the report.

The skill works. It runs in about 47 seconds. It produces a good report. And after it runs, the only person who knows anything happened is the person who triggered it. Failures are silent. Improvements are unprovable. The skill is a black box that everyone has to trust.

After · Written with the witness · v2.0

jobtread-scope-review

When invoked:
1. CALL log_skill_start
   - skill_name: "jobtread-scope-review"
   - user_id: <user email>
   - trigger_phrase: <truncated request>
   - entity_refs: { job_id }
   - SAVE: run_id

2. Resolve the job.
   CALL log_skill_event
   - step "resolved_job"

3. Fetch PDA.
   CALL log_skill_event
   - tool_call "jt_search:pda"
   [...do the work...]
   CALL log_skill_event
   - step "fetched_pda" (page_count)

4. Fetch HIA.
   If missing →
   CALL log_skill_event
   - warning "no_hia_found"

5. Vendor walk, bid scope, vendor bids
   — one step event per fetch.

6. Run comparison.
   CALL log_skill_event
   - decision "comparison_mode"

7. Produce the report.
   CALL log_skill_event
   - milestone "report_generated"

8. CALL log_skill_end
   - status: "success"
   - summary: "Scope review for <job>
     — N deltas, $X change flagged"

9. End with:
   "Was this helpful? Reply 👍 or 👎."
   If they reply → CALL log_skill_feedback.

Same skill. Same work. But every step now leaves a fingerprint. The skill cannot run without writing its own trace.
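One way to make that guarantee hold even when a step throws is to wrap the run so the end record is written on both paths. The sketch below uses a Python context manager. `skill_run`, the `TRACE` list, and the stub loggers are hypothetical stand-ins, and the `"error"` status is an assumption, since the skill above only shows `"success"`.

```python
from contextlib import contextmanager

TRACE = []  # hypothetical sink standing in for the real log store


def log_skill_start(skill_name, user_id):
    TRACE.append(("start", skill_name, user_id))
    return len(TRACE)  # stand-in run_id


def log_skill_event(run_id, kind, name):
    TRACE.append((kind, name))


def log_skill_end(run_id, status, summary=""):
    TRACE.append(("end", status, summary))


@contextmanager
def skill_run(skill_name, user_id):
    """Log the start, yield the run_id, and log an end record on exit."""
    run_id = log_skill_start(skill_name, user_id)
    try:
        yield run_id
    except Exception as exc:
        # Record the failure, then re-raise so the caller still sees it.
        log_skill_end(run_id, "error", summary=repr(exc))
        raise
    else:
        log_skill_end(run_id, "success")


# Usage: the body is the skill's work; the witness wraps it.
with skill_run("jobtread-scope-review", "pm@example.com") as run_id:
    log_skill_event(run_id, "step", "resolved_job")
```

A wrapper like this cannot catch a hard kill of the process. In that case the start record and any events already written are all that remain, which is still enough to show where the run stopped.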

The instrumented version is harder to write. And it's worth more than a hundred of the other kind.

This is one real run of the instrumented skill.

Not a dashboard summary. Not a quarterly KPI. The actual record written by the skill itself, as it ran, queryable for the next year.

// query_run_trace
jobtread-scope-review  ·  [email protected]  ·  Jaworski
47s · success · 12 tool calls · "3 deltas flagged"

00:00.0  step       resolved_job (Jaworski)
00:00.4  tool_call  jt_search:pda
00:03.1  step       fetched_pda (8 pages)
00:03.2  tool_call  jt_search:hia
00:06.0  step       fetched_hia (12 pages)
00:06.1  tool_call  jt_search:vendor_walk
00:08.5  warning    drawings_missing — fallback to v1 plans
00:09.0  tool_call  jt_search:bid_scope
· · ·
00:45.2  decision   comparison_mode = PDA_vs_HIA, deltas=3
00:47.0  milestone  report_generated, deltas=3, cost_change=$4,820

This is the kind of artifact you can actually act on. Six months from now, if this skill starts taking 90 seconds instead of 47, the trace will tell you exactly which tool_call ballooned. If a PM gives it a 👎, you can pull the run and see the exact step they were unhappy with. If a new failure mode appears, you'll see the warning before anyone files a ticket.
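To make the "which tool_call ballooned" check concrete, here is a small sketch that measures each tool call as the gap until the next logged event. The event tuples are copied from the trace above, with the elided middle of the run left out. The function name `tool_call_durations` and the tuple layout are assumptions, since the source does not show what `query_run_trace` actually returns.

```python
def tool_call_durations(events):
    """Map each tool_call name to the seconds elapsed until the next event.

    `events` is a time-ordered list of (seconds, kind, name) tuples.
    A tool_call with no following event is skipped: its end is unknown.
    """
    durations = {}
    for (t, kind, name), (t_next, _, _) in zip(events, events[1:]):
        if kind == "tool_call":
            durations[name] = round(t_next - t, 1)
    return durations


# Events from the run shown above, up to the elided section.
trace = [
    (0.0, "step", "resolved_job"),
    (0.4, "tool_call", "jt_search:pda"),
    (3.1, "step", "fetched_pda"),
    (3.2, "tool_call", "jt_search:hia"),
    (6.0, "step", "fetched_hia"),
    (6.1, "tool_call", "jt_search:vendor_walk"),
    (8.5, "warning", "drawings_missing"),
    (9.0, "tool_call", "jt_search:bid_scope"),
]

durations = tool_call_durations(trace)
slowest = max(durations, key=durations.get)
print(durations)
print("slowest:", slowest)
```

Comparing these per-call durations between a 47-second run and a 90-second run of the same skill would point at the call that grew. The last call in the list is left out on purpose because the trace excerpt does not show when it finished.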

This run came from Titus. They wired Skillmeter into every operational skill they ship — scope reviews, daily check-ins, price estimates, change-order writeups, the whole library. Every run, every operator, every job. Logged. Queryable. Improvable.

The day they turned it on, things they'd been guessing about became facts. A skill they were proud of was failing 22% of the time on the same step — silently, for months. Two hours of fixing dropped the failure rate under 3%. A skill they'd almost retired was pulling 40 runs a day from one project coordinator — that wasn't dead weight, that was a load-bearing tool. A team member ran the price estimator constantly but never the daily check-in — not a problem, a coaching signal.
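A finding like "failing on the same step" falls out of a simple group-by over run records. The sketch below uses made-up runs and a hypothetical `failed_step` field. The source does not show how Skillmeter stores failures, so treat the shape of the data as an assumption and the numbers as illustrative only.

```python
from collections import Counter


def failure_clusters(runs):
    """Return (overall failure rate, Counter of the steps that failed)."""
    failed = [r for r in runs if r["status"] != "success"]
    rate = len(failed) / len(runs) if runs else 0.0
    by_step = Counter(r["failed_step"] for r in failed)
    return rate, by_step


# Illustrative data only, not Titus's numbers.
runs = [
    {"status": "success", "failed_step": None},
    {"status": "error", "failed_step": "fetch_hia"},
    {"status": "success", "failed_step": None},
    {"status": "error", "failed_step": "fetch_hia"},
    {"status": "success", "failed_step": None},
    {"status": "error", "failed_step": "run_comparison"},
    {"status": "success", "failed_step": None},
    {"status": "success", "failed_step": None},
]

rate, by_step = failure_clusters(runs)
print(f"failure rate: {rate:.0%}")
print("most common failing step:", by_step.most_common(1))
```

When one step dominates the counter, that step is the first candidate for a fix.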

The trace is not a report. It's the record that makes every report possible.

This is the company you become when every skill carries its own witness.

Builders

Your next skill ships with telemetry from line one.

No retrofitting. No "we'll add logging later." The instrumentation stanza is part of the skill template. Every skill ships ready to be measured.

Leadership

You finally see which skills are load-bearing.

Run counts, success rates, failure clusters, feedback signal — for every skill, by every operator, on every job. The story stops being "I think this is working" and starts being a number leadership can act on.

Operators

Yesterday's run can be replayed in detail tomorrow.

Every step. Every tool call. Every decision. Every warning. The skill becomes a record of how the work got done, not just that it got done.

Improvement

The improvement loop closes for the first time.

The data tells you which skills break, where, for whom. You fix the brittle step, ship v1.1, watch the failure rate drop in next week's data. The loop that's normal for any other software is finally normal for AI.
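The numbers named in this section (run counts, success rates, the drop in failure rate after a fix) can all be derived from the same run records. Here is a minimal sketch, assuming each record carries a `skill_name`, a `version`, and a `status`. Those field names and the sample records are hypothetical, not taken from Skillmeter.

```python
from collections import defaultdict


def success_by_version(runs):
    """Return {(skill_name, version): (run_count, success_rate)}."""
    counts = defaultdict(lambda: [0, 0])  # [runs, successes]
    for r in runs:
        key = (r["skill_name"], r["version"])
        counts[key][0] += 1
        counts[key][1] += int(r["status"] == "success")
    return {k: (n, ok / n) for k, (n, ok) in counts.items()}


# Illustrative records only; field names are assumptions.
runs = (
    [{"skill_name": "jobtread-scope-review", "version": "1.0", "status": "success"}] * 3
    + [{"skill_name": "jobtread-scope-review", "version": "1.0", "status": "error"}] * 1
    + [{"skill_name": "jobtread-scope-review", "version": "1.1", "status": "success"}] * 4
)

report = success_by_version(runs)
for (skill, version), (n, success_rate) in sorted(report.items()):
    print(f"{skill} v{version}: {n} runs, {success_rate:.0%} success")
```

Grouping by version is what lets a fix shipped as v1.1 be checked against v1.0 in the following week's data rather than argued from memory.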

Skill adoption is no longer your rate-limiter. Skill quality is. And skill quality is something you can measure now.

This is how companies operate AI at scale without descending into vibes. Without the predictable cycle of "ship a skill, hope, get a Slack complaint, lose it in the noise." We've never been able to do it before. We can now.

So here's the move.

Every skill you write from here on gets written with two parts — the work itself, and the witness that logs the work. They are not separate concerns. They are written together, shipped together, and they live or die together.

The work stays focused. The skill does the job. The instrumentation is non-negotiable but small. One stanza at the start. Events at the key moments. One stanza at the end. Optional feedback at close.

You will ship more skills, not fewer — and your confidence in them will go up, not down, because the witness is finally there.

The skills you have running today should get retrofitted as the next thing you do. Not all at once — the load-bearing ones first, by run count, by complaint volume, by gut feel. The instrumentation stanza is short. The data starts flowing immediately. The first week you have it, things you were sure of will turn out to be wrong, and things you were skeptical of will turn out to be true.

The teams that figure this out first will be the ones running AI like software — measured, improved, owned.
