Colorado · Briefed May 2026

The Operator's Edge.

Five plays for working with AI better than everyone else — with Monday-morning moves, tools, and prompts.

Plays
Five
Resources
115 curated
Prompts
22 ready to copy
Briefed
14 May 2026
§ Quick start · 10 moves

Ten moves to start this week.

Start anywhere. The order compounds, but a single move on Monday morning beats a perfect plan you never run.

01

Pick one recurring question your team asks at least once a week. Write it down. That's your first second brain project.

5 minutes Play 1 →
02

Open Notion AI (or Claude) and turn on the workspace you already have. Ask it the recurring question. See what it can already answer.

10 minutes Play 1 →
03

Run the DESIGN.md interview: ask Claude or ChatGPT to interview you about your brand voice, customer, and what 'slop' looks like for you. Save the output as DESIGN.md.

90 minutes Play 2 →
04

Install Claude Code (npm install -g @anthropic-ai/claude-code) and run /init in your most active project folder to generate a CLAUDE.md.

15 minutes Play 2 →
05

Identify one task you've done manually at least twice. Write a SKILL.md for it — a step-by-step instruction file any agent can follow.

30 minutes Play 3 →
06

Hand the SKILL.md task to Claude Code, Codex, or Cursor Agents. Run it in read-only mode. Review the output the same way you'd review a new hire's first work.

Ongoing — first two weeks Play 3 →
07

Collect 10 real examples of tasks your AI system should handle well. Write down what a 'good' answer looks like for each. This is the seed of your eval set.

60 minutes Play 4 →
08

Pick one observability tool (Langfuse is free and open-source; Helicone is one-line integration) and wire it to your most-used AI endpoint. Start tracking cost and latency.

2 hours Play 4 →
09

Take one existing job description and add three AI skill requirements: spec writing, eval design, and a taste statement ('able to distinguish good AI output from slop and explain why').

30 minutes Play 5 →
10

Block 30 minutes every Monday to: review one agent's output, check your eval pass rate, and update one context file (DESIGN.md, SKILL.md, or AGENTS.md) based on what you learned.

30 min/week, ongoing Every play →
§ The five plays · Index

Five plays for working with AI better than everyone else.

Each play closes with a concrete Monday-morning move, the resources you need, and prompts you can paste into ChatGPT or Claude today.

§ I · Play 1

Build a second brain — not another chat window. #

The play

Every email, deck, contract, and meeting transcript in your business is institutional memory you're throwing away. A second brain captures it once and answers questions forever. Stop using AI like Google. Use it as memory and cognition infrastructure. Three layers: Sources → Wiki → Schema.

Monday-morning move

Pick one recurring question your team asks repeatedly — 'What did we promise this client?' or 'Where's the pricing logic?' — and build the smallest second brain to answer only that. Tool: Notion AI on the workspace you already have. Cost: under $20/seat. Time: 90 minutes.

Why it matters

Institutional memory that lives only in people's heads or inboxes evaporates. A structured second brain turns that knowledge into a queryable asset. Karpathy's framing: 'Obsidian is the IDE, the LLM is the programmer, the wiki is the codebase.'

Articles & documentation

9 articles docs
Notion AI — workspace Q&A and AI writing Notion Help

How to use Notion AI to query and generate content across your connected workspace. Under $20/seat.

NotebookLM Enterprise overview Google Workspace

Google's AI research tool grounded in your own documents. Supports PDFs, Google Docs, web URLs, audio, and YouTube. Included in Workspace plans.

Glean — cross-app enterprise search Glean

Work AI platform that searches every SaaS tool your company uses. Best for teams with fragmented knowledge across many apps.

Guru — governed knowledge layer Guru

Human-verified knowledge base with AI answers. Structures and governs knowledge so AI answers can be trusted, not just generated.

Slack AI features and AI assistants Slack

AI assistant built into Slack that summarizes threads, prepares meeting briefings, and queries your workspace history.

ChatGPT Projects — organize and revisit work OpenAI Help

ChatGPT Projects lets you persist context and documents across sessions — a lightweight second brain for solo operators.

How to build an Obsidian LLM wiki — the Karpathy method AI Maker (Substack)

Practical breakdown of the Sources → Wiki → Schema pattern with Obsidian + Claude Code. Good starting template for technical builders.

Obsidian — the free note-taking tool Obsidian

Local-first markdown note editor. The IDE in Karpathy's 'Obsidian is the IDE' framing. Free for personal use.

Claude API documentation Anthropic

Official Claude platform docs — use Claude as the LLM layer in an Obsidian + Claude Code second brain setup.

Videos & talks

3 videos
NotebookLM Beginner Guide: Build Your Second Brain YouTube — Numroid · 6:49

Quick walkthrough of NotebookLM for knowledge capture. Good starting point for non-technical operators.

Building a Second Brain with NotebookLM: From Blank Page to Cognitive Engine YouTube — Teacher's Tech · 21:17

Full masterclass on NotebookLM as a cognitive engine in 2026. Covers advanced source types and workflow patterns.

How to Use NotebookLM as a Second Brain for AI Agents YouTube — Evolving Gen · 11:18

Demonstrates using NotebookLM as persistent memory for AI agent workflows. Relevant for technical builders.

Products & tools

5 product links
Notion AI Notion

AI-native workspace. Query your docs, pages, and databases in natural language.

NotebookLM (Google) Google Workspace

Upload your documents; get an AI that only answers from what you gave it.

Glean Glean

Enterprise-grade cross-SaaS search and AI assistant.

Guru Guru

Human-verified knowledge layer for enterprise AI.

Obsidian Obsidian

Free local-first markdown vault. Core of the Karpathy second brain pattern.

Prompts to copy

4 prompts

Interview me to build my second brain

I want to build a second brain for my business. Interview me — ask me 10 focused questions about: what recurring questions my team has to answer, where institutional knowledge currently lives, what information I look up more than once a week, and what I'd want a new team member to be able to find on their first day. After I answer, organize my responses into a Sources → Wiki → Schema structure with specific recommendations for tools and first steps.

Identify the one question to solve first

Here is a list of recurring questions my team asks: [paste your list]. Rank them by: (1) how often they get asked, (2) how painful it is when no one knows the answer, (3) how feasible it is to capture the answer in a document. Give me the single best candidate for my first second-brain project, and a 90-minute plan to build it.

Generate a wiki schema from raw notes

I'm going to paste raw notes, emails, and transcripts below. Your job is to: (1) identify the key entities, decisions, and commitments in this material, (2) draft a wiki page structure with section headers and backlinks, (3) write a one-paragraph Schema.md instruction file that tells an LLM how to maintain and update this wiki going forward. Keep everything factual — do not infer things not stated. [paste raw material]

Audit an existing knowledge base

Here is the table of contents of my current knowledge base: [paste TOC or list of docs]. Identify: (1) which documents are likely stale or redundant, (2) what critical topics appear to be missing, (3) which entries should be merged, (4) what a new employee or AI agent would be unable to answer from this base alone. Give me a prioritized cleanup list.
§ II · Play 2

Codify your business DNA. #

The play

A family of five markdown files — DESIGN.md, SKILL.md, SPEC.md, AGENTS.md, CLAUDE.md — turns AI from a free intern into a PM-managed contributor. AI sits on both sides of the desk: generates against the spec, then a second pass audits against it. You're the product manager now.

Monday-morning move

Spend one morning being interviewed by ChatGPT or Claude about your values, voice, who your customer is, and what 'slop' looks like for you. Paste the transcript back in and ask for a DESIGN.md. Cost: $0. Time: one morning.

Why it matters

Without codified context, every AI session starts from zero and produces generic output. Markdown context files are the difference between AI that sounds like you and AI that sounds like everyone else. Karpathy warned of a 'slopacolypse' — a flood of low-quality AI content — in 2026. These files are the defense.

Articles & documentation

8 articles docs
Custom instructions with AGENTS.md — Codex official docs OpenAI Codex Docs

Official guide to AGENTS.md: how Codex reads layered instruction files at global, project, and directory scope. The canonical reference for this file format.

AGENTS.md — community reference site agents.md

Community reference for the AGENTS.md standard. Explains nested file resolution and best practices across large repos.

CLAUDE.md setup — your first day in Claude Code Claude Help Center

Official guide to generating a CLAUDE.md with the /init command. 'The single highest-value setup step you can take.' Claude reads it automatically at the start of every session.

Effective context engineering for AI agents (Anthropic Engineering) Anthropic Engineering

Deep technical guide from Anthropic on just-in-time context loading, progressive disclosure, sub-agent architectures, and long-horizon task management. Required reading for builders.

How to write a good spec for AI agents (O'Reilly) O'Reilly Radar

Practical guide to writing SPEC.md-style documents that serve as executable artifacts for coding agents. Covers PRD/SRS hybrid format, modular specs, and self-verification patterns.

Spec-Driven Development with Coding Agents (DeepLearning.AI course) DeepLearning.AI

Short course on using specs as the agent's guide. Plan, implement, and validate features with a human-in-the-loop workflow.

AI-Generated 'Workslop' Is Destroying Productivity (HBR) Harvard Business Review

Research from BetterUp and Stanford: 41% of workers encounter AI-generated output lacking substance, costing nearly 2 hours of rework per instance. The context files pattern is the direct counter.

OpenAI Codex Agent Skills docs OpenAI Codex Docs

Official docs for SKILL.md and the agent skills standard — how to package reusable workflows for Codex to follow reliably.

Videos & talks

3 videos
How to Write Specs WITH Your AI Coding Assistant YouTube — César Soto Valero · 11:00

Common failure modes when writing specs for AI, and how to structure them so the agent actually follows them.

Spec Kit: How to Build Production-Ready Apps with AI Agents YouTube — Leon van Zyl · 28:03

Full walkthrough of GitHub's four-phase gated spec workflow. Shows how SPEC.md drives implementation, tests, and task breakdowns.

Context Engineering for AI Agents (LangChain) YouTube — LangChain · 17:24

Core principles of context engineering and how they map to frameworks like Claude Code and LangChain. Good conceptual grounding.

Products & tools

3 product links
Claude.ai Anthropic

Primary tool for the DESIGN.md interview exercise. Use Projects to persist your context files across sessions.

ChatGPT OpenAI

Alternative to Claude for the DESIGN.md interview. Both work equally well for the Monday-morning exercise.

Claude Code documentation Anthropic — Claude Code Docs

Claude Code reads CLAUDE.md automatically. The SKILL.md and context file patterns are natively supported.

Prompts to copy

5 prompts

The DESIGN.md interview

I want to create a DESIGN.md file that captures my business's voice, aesthetics, and brand DNA so AI tools generate on-brand work without constant correction. Interview me. Ask me 12 questions — one at a time — covering: our tone of voice, who our customer is (describe them specifically), what we would never say or do, three examples of writing we admire and why, our banned words or phrases, our visual aesthetic, and what 'generic AI slop' looks like for us. After I answer all 12, synthesize my responses into a clean DESIGN.md file I can store in my project root.

Generate a SKILL.md from a process I describe

I'm going to describe a workflow I do manually. Your job is to turn it into a SKILL.md — a reusable instruction file that any AI agent can follow to complete this workflow without needing me to explain it again. Format the SKILL.md with: (1) a one-sentence purpose statement, (2) step-by-step instructions numbered and detailed enough for a capable AI to follow, (3) what 'done' looks like, (4) common failure modes and how to handle them. Here is the workflow: [describe your workflow]

Write a SPEC.md for a feature or deliverable

Help me write a SPEC.md for the following project or feature: [describe it in plain language]. Structure the spec as: (1) objective — what outcome we want, (2) context — relevant background, constraints, and existing systems, (3) acceptance criteria — a numbered list of conditions that must be true for this to be done, (4) out of scope — what we are explicitly not building, (5) open questions — things still unresolved. Make the acceptance criteria specific enough that an AI agent or junior developer could verify them without asking me.

Audit AI output against a context file

Here is my DESIGN.md: [paste your DESIGN.md]. Here is a piece of content an AI just generated for me: [paste the content]. Audit the content against the DESIGN.md. For each section of the DESIGN.md, tell me: (1) does the content comply, (2) what specifically violates it if not, (3) a suggested rewrite for any non-compliant passage. Be direct — no softening.

Generate an AGENTS.md for a new project

I'm starting a new software project and want to create an AGENTS.md file so that OpenAI Codex or any compatible agent reads my project norms before starting any task. The project is: [describe tech stack, purpose, conventions]. Write an AGENTS.md that covers: (1) how to run the project locally, (2) code style and conventions, (3) what the agent should always do before making changes, (4) what the agent should never do without asking first, (5) how to run tests and what passing tests looks like.
§ III · Play 3

Manage agents like you manage people. #

The play

Same playbook your business already runs for human employees. System prompts as job descriptions. SKILL.md for anything done twice. Read their work for two weeks before trusting it. No production access until week three. Engineers use agent frameworks as code; operators run agents as products — different tools, same management discipline.

Monday-morning move

Find one task you've done manually twice. Write it as a SKILL.md. Hand it to an agent. Read its work for two weeks before trusting it. Tool: Codex / Claude Code / Cursor Agents. Cost: under $50/mo. Risk: zero if read-only.

Why it matters

People who already know how to hire, onboard, and review humans already know how to manage agents. The vocabulary changed; the discipline didn't. Operators who treat agents like opaque black boxes get slot-machine results. Operators who onboard, scope, and review agents get reliable output.

Articles & documentation

9 articles docs
Claude Code Agent SDK overview Anthropic — Claude Code Docs

Official Agent SDK docs for Claude Code. Build agents that autonomously read files, run commands, and manage subagents using the same tools that power Claude Code itself.

PydanticAI — agent framework documentation Pydantic Docs

Production-grade Python agent framework with type safety and dependency injection. The Pydantic way to build structured, testable agents.

PydanticAI Agents — core concepts Pydantic Docs

Explains the components of a PydanticAI agent: instructions, tools, structured output, and dependency injection.

Introducing Codex (OpenAI) OpenAI

OpenAI's announcement of Codex as an async, remote coding agent. Good context on the operator vs. engineer use case split.

Conductor — run a team of coding agents on your Mac (YC W25) Y Combinator

Conductor lets you run parallel coding agents in isolated workspaces and review/merge their changes. NOTE: This is a YC-backed startup, distinct from Apache Conductor (workflow orchestration) — verify this is the intended product from the deck context.

The deck lists 'Conductor' as a coding-agent session manager. Based on search results, this likely refers to the YC W25 startup Conductor (ycombinator.com/companies/conductor) — not Apache Conductor, the workflow orchestrator. Confirm before linking.
Superset — orchestrate parallel coding agents Superset

Runs 100+ parallel coding agents on your machine. Agent-agnostic: works with Claude Code, Codex, Cursor, or any CLI agent. NOTE: Not Apache Superset (the BI tool). The deck context clearly refers to this coding-agent orchestrator.

The deck lists 'Superset' as a coding-agent session manager. This refers to superset.sh (the parallel agent orchestrator) — not Apache Superset (the open-source BI and data visualization tool). They are unrelated products with the same name.
Cursor — AI code editor Cursor

AI-native editor built on VS Code. Cursor Agents can work across multiple files — useful for operators editing non-code files, not just engineers.

Introducing Perplexity Computer Perplexity AI

Browser-driving agent that handles click-and-type web work. From the deck: 'the agent that does the click-and-type work.'

Effective context engineering for AI agents (Anthropic Engineering) Anthropic Engineering

Sub-agent architectures, tool design, and context management patterns for multi-agent systems. Cross-reference for Play 03 and Play 05.

Videos & talks

3 videos
How to Properly Use Claude Code Agent Teams (Full Live Demo) YouTube — Cole Medin · 50:22

Deep walkthrough of multi-agent teams in Claude Code — parallel agents that communicate with each other. Directly relevant to managing agents like employees.

Claude Code Sub-Agents: Step-by-Step Beginner Tutorial YouTube — Thetips4you · 26:05

Beginner-accessible guide to setting up specialized sub-agents in Claude Code with their own context and tools.

Claude Code: Build Your First AI Agent (No coding required) YouTube — Teacher's Tech · 25:38

Demonstrates that you don't need to code to build and manage an AI agent. Good entry point for business operators.

Products & tools

7 product links
Claude Code Anthropic

Anthropic's agentic coding tool. Reads CLAUDE.md automatically. Supports subagents, MCP, and shared-doc collaboration.

OpenAI Codex OpenAI

OpenAI's async remote agent. General coding and ops work; runs in parallel without interrupting your current session.

Cursor Cursor

AI editor with agents for everyone editing files — not just engineers writing code.

PydanticAI Pydantic

Python agent framework. Type-safe, testable, integrates with OpenTelemetry for observability.

LangChain LangChain

The agent engineering platform. Broad ecosystem of integrations, tools, and templates.

CrewAI CrewAI

Framework for orchestrating role-playing autonomous agents. Strong multi-agent coordination model.

Superset (superset.sh) Superset

Run 100+ parallel coding agents on your machine. Agent-agnostic.

GitHub repositories

4 github repos
pydantic/pydantic-ai GitHub — pydantic · 17.1k

AI agent framework, the Pydantic way. Type-safe, production-ready, with built-in instrumentation.

langchain-ai/langchain GitHub — langchain-ai · 136.7k

The agent engineering platform. Extensive tool integrations and community.

crewAIInc/crewAI GitHub — crewAIInc · 51.4k

Framework for orchestrating role-playing, autonomous AI agents in collaborative crews.

anthropics/claude-code GitHub — anthropics · 123.3k

Claude Code agentic coding tool. Official Anthropic repo.

Prompts to copy

4 prompts

Write a system prompt / job description for an agent

I need to create a system prompt that acts as a job description for an AI agent. The agent's role is: [describe the role and function]. Write a system prompt that specifies: (1) who the agent is and what it's responsible for, (2) what it should always do, (3) what it should never do without explicit approval, (4) how it should communicate output, (5) what success looks like for a typical task, (6) how to handle ambiguous or unclear situations. Keep it under 400 words and make it specific, not generic.

Onboarding review: assess an agent's first week of work

I've been running an AI agent on [task] for the past week. Here is a sample of its output: [paste 3-5 examples]. Act as a performance reviewer. Tell me: (1) what the agent is doing well, (2) where it's producing inconsistent or off-spec output, (3) what I should add or change in its SKILL.md or system prompt to fix the issues, (4) whether this agent is ready for more autonomous operation or needs another review cycle. Be specific — cite the examples.

Design a multi-agent workflow for a repeating process

I want to design a multi-agent workflow for the following repeating business process: [describe the process]. Map it as: (1) a list of distinct sub-tasks, (2) for each sub-task — which agent role handles it, what inputs it needs, what outputs it produces, and what error conditions to handle, (3) where a human must review before passing to the next step, (4) a SKILL.md outline for each agent role. Use plain language — I'll implement in [Claude Code / LangChain / PydanticAI — pick one].

Scope an agent to read-only before giving it write access

I'm about to deploy an agent to handle [task]. Before giving it write access, I want to run it in read-only mode. Help me define: (1) exactly what read-only means for this task (what can it read, what is off-limits), (2) what outputs I should review during the two-week observation period, (3) a simple rubric for deciding when the agent has earned production write access, (4) what guardrails to put in place when it does get write access. Be concrete about the specific risks for this task.
§ IV · Play 4

Run evals and observability. #

The play

If you don't have evals, you don't have AI — you have a slot machine. Every other software system you run has dashboards. AI shouldn't be exempt. Five metrics: Quality (pass rate on golden dataset), Cost ($ per task/user/outcome), Latency (P50/P95), Drift (week-over-week degradation), Override rate (how often humans step in). Tools: Logfire, Braintrust, Arize Phoenix, Langfuse, LangSmith, Helicone, Galileo.

Monday-morning move

Start collecting an eval set. Aim for 50–100 real tasks with a 'good' answer for each — that's the floor for a defensible pass rate. Run it weekly.

Why it matters

Model providers ship silently. Without a baseline dataset and weekly runs, you won't know when your AI's quality degrades until a customer tells you. Override rate is the canary: when it climbs, something is breaking. The eval set is your test suite for production AI.

Articles & documentation

11 articles docs
Braintrust — evaluation quickstart Braintrust Docs

Official quickstart for running evals with Braintrust. Covers dataset construction, task definition, and scoring functions. Strong eval tooling for teams — dataset management, experiment tracking, and CI/CD integration.

Braintrust — evaluate systematically Braintrust Docs

Full evaluation guide: building golden datasets from production logs, LLM-as-judge scorers, and CI/CD integration.

LangSmith evaluations — LLM and AI agent evaluation platform LangChain

LangSmith's evaluation framework: human review, heuristic checks, LLM-as-judge, pairwise comparison, and online production scoring.

LangSmith evaluation docs LangSmith Docs

Technical reference for creating datasets, defining evaluators, running experiments, and analyzing results.

Galileo AI — observability and evaluation platform Galileo

20+ out-of-box evals for RAG, agents, safety, and security. Build custom evaluators to encode domain expertise.

Pydantic Logfire — getting started Pydantic Docs

OpenTelemetry-based observability for Python AI applications. Native integration with PydanticAI. One-line instrumentation.

Helicone — AI gateway and LLM observability Helicone

Open-source LLM observability. One-line integration to monitor, evaluate, and experiment. Tracks cost, latency, and quality.

Helicone quickstart docs Helicone Docs

Quickstart guide for adding Helicone to an existing LLM application.

Arize Phoenix — AI observability and evaluation (open-source) GitHub — Arize AI

Open-source observability: tracing, evaluation, versioned datasets, experiments, and playground. Vendor and language agnostic.

Langfuse — open-source LLM engineering platform Langfuse

Open-source observability, metrics, evals, prompt management, and datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK.

AI Evals For Engineers — Hamel Husain & Shreya Shankar (Lenny's Podcast) YouTube — Lenny's Podcast

1h46m deep dive on why evals are the most important skill for AI product builders. Hamel and Shreya have trained 2000+ PMs and engineers on evals.

Videos & talks

3 videos
Why AI evals are the hottest new skill for product builders — Hamel Husain & Shreya Shankar YouTube — Lenny's Podcast · 1:46:33

The definitive podcast episode on AI evals. Covers golden datasets, LLM judges, error analysis, and shipping AI that works in production.

Observability and Evals for AI Agents: A Simple Breakdown YouTube — LangChain · 14:45

Why agent observability is different from standard software monitoring, and what to instrument first.

Leveling Up AI Agents with LLM Evaluations and Feedback Loops YouTube — Arize AI · 1:21:00

Arize, Google Cloud, and Wayfair on AI observability, evaluation, and agent feedback loops in production.

Products & tools

7 product links
Braintrust Braintrust

Eval platform for AI teams. Dataset management, LLM-as-judge scoring, experiment tracking, and CI/CD integration.

LangSmith LangChain

Observability and evaluation for LangChain-based applications. Also works standalone.

Arize Phoenix Arize AI

Open-source AI observability. Free to self-host.

Langfuse Langfuse

Open-source LLM engineering platform with observability, evals, and prompt management.

Helicone Helicone

One-line LLM observability. Cost and latency tracking out of the box.

Galileo Galileo

AI observability and evaluation with pre-built evals for RAG, agents, and safety.

Logfire (Pydantic) Pydantic

OpenTelemetry-based observability for Python apps. Native PydanticAI integration.

GitHub repositories

3 github repos
arize-ai/phoenix GitHub — Arize AI

Open-source AI observability and evaluation. Tracing, evals, datasets, experiments, and playground.

langfuse/langfuse GitHub — Langfuse · 27.2k

Open-source LLM engineering platform. YC W23.

Helicone/helicone GitHub — Helicone · 5.7k

Open-source LLM observability platform. YC W23.

Prompts to copy

4 prompts

Design a golden eval dataset for my AI feature

I'm building an AI feature that does the following: [describe the task — e.g., 'summarizes customer support tickets into a one-line priority tag']. Help me design a golden dataset of 50 real-world test cases. For each case, define: (1) the input, (2) what a 'correct' output looks like, (3) a scoring rubric (pass/fail or 1-5 scale with criteria). Then give me a template I can fill out to capture 50 real examples from our production data. Include at least 10 edge cases or tricky inputs that are likely to expose model weaknesses.

Define my five-metric eval dashboard

I need to set up a monitoring dashboard for an AI system that does: [describe what it does]. Define the five metrics I should track: (1) quality — what does a passing score mean and how do I measure it, (2) cost — what's the right unit (per task, per user, per outcome) and what's an acceptable range, (3) latency — what are my P50 and P95 targets, and what threshold should trigger an alert, (4) drift — how do I detect week-over-week quality degradation, (5) override rate — what events count as a human override and what rate is acceptable. Give me specific thresholds, not just definitions.

Run a weekly eval check (prompt template)

I'm running a weekly eval on my AI system. Here are the results from this week's run against my golden dataset: [paste results or summary]. Compare to last week: [paste or describe last week's results]. Tell me: (1) did quality improve, decline, or hold steady, (2) any specific failure categories that appeared or got worse, (3) did the model provider update anything that might explain changes (check if I mentioned a model version), (4) my recommended action — update the prompt, add examples, flag for investigation, or no action needed. Be direct.

Write LLM-as-judge scoring criteria

I need an LLM-as-judge evaluator to score outputs from my AI system. The system does: [describe the task]. Write a scoring rubric with: (1) a 1-5 scale definition for each score level, with concrete examples at each level, (2) the exact judge prompt I should use, including how to format the input/output pair, (3) three example inputs with what score they should receive and why. The rubric should be specific enough that two different judge LLMs would give the same score 80% of the time.
§ V · Play 5

Hire and train for the new skills. #

The play

The job titles haven't caught up to the work. Five skills in rough order of leverage: (1) spec writing, (2) eval design, (3) context engineering, (4) agent orchestration, (5) taste/judgment. Org chart moves: promote your best technical writers, add an AI ops function, make AI literacy part of every JD.

Monday-morning move

Rewrite one existing job description to include AI skills. Focus on spec writing and eval design — those compound fastest. Promote your best technical writers before hiring new engineers.

Why it matters

The biggest advantage in 2026 isn't having AI — it's knowing how to direct it. Spec writing and eval design are the highest-leverage skills because they determine what the AI builds and how you know if it's working. Taste is the non-automatable filter that keeps slop from shipping.

Articles & documentation

8 articles docs
Effective context engineering for AI agents (Anthropic Engineering) Anthropic Engineering

The authoritative technical guide to context engineering as a professional skill. Covers system prompts, RAG architecture, tool design, and long-horizon agent management.

How to write a good spec for AI agents (O'Reilly) O'Reilly Radar

Spec writing as a professional practice — treating specs as executable artifacts, PRD vs SRS mindset, and modular spec design.

How to write a good spec for AI agents (Addy Osmani) Addy Osmani

Chrome Engineering lead's guide to spec-writing for agents. Practical patterns from a senior engineer who ships with AI every day.

The evolution of prompt engineering to context design in 2026 SDG Group (Orbitae)

How the 'prompt engineer' role is evolving into a 'context architect' role in 2026. Good framing for rewriting job descriptions.

Context engineering complete 2026 field guide Taskade

Five-layer context engineering stack: system prompt, RAG, memory, tool outputs, and conversation history. Practical field guide.

Spec-Driven Development with Coding Agents (DeepLearning.AI course) DeepLearning.AI

Free short course. Trains the spec writing skill directly. Recommended for anyone on the team who directs coding agents.

AI Evals For Engineers — course preview (Hamel Husain) YouTube — Hamel Husain

Preview of the most popular AI evals course. Hamel Husain and Shreya Shankar have trained 2000+ engineers. Eval design as a core professional skill.

AI-Generated 'Workslop' Is Destroying Productivity (HBR) Harvard Business Review

The cost of shipping without taste/judgment: 2 hours of rework per AI-generated document. The business case for hiring humans with high taste as reviewers.

Videos & talks

3 videos
Context Engineering for AI Agents (LangChain) YouTube — LangChain · 17:24

Context engineering as a skill: system prompts, RAG, tool design, and memory patterns. Good primer for engineers adding this to their toolkit.

Build ANY AI Agent with this Context Engineering Blueprint YouTube — Cole Medin · 24:51

Practical blueprint for building agents using context engineering principles. Intermediate level.

Why AI evals are the hottest new skill for product builders YouTube — Lenny's Podcast · 1:46:33

The business case for eval design as a core team skill. Covers how to structure an evals function and who should own it.

Products & tools

2 product links
DeepLearning.AI — Spec-Driven Development course DeepLearning.AI

Free course on spec writing and spec-driven development. Direct training for the #1 high-leverage skill.

Braintrust — AI evals course 2025 GitHub — Braintrust

Homework and implementations for a structured AI evals course. Hands-on eval design training.

GitHub repositories

2 github repos
braintrustdata/ai-evals-course-2025 GitHub — Braintrust · 14

Homework implementations for a structured AI evals course. Practical eval design training material.

braintrustdata/braintrust-cookbook GitHub — Braintrust · 57

Real eval patterns and recipes. Good reference for building an eval function from scratch.

Prompts to copy

5 prompts

Rewrite a job description to include AI skills

Here is an existing job description for [role]: [paste the JD]. Rewrite it to incorporate the five key AI skills for 2026: spec writing, eval design, context engineering, agent orchestration, and taste/judgment. For each skill, add: (1) a specific responsibility that uses this skill, (2) a concrete example of what good performance looks like. Keep the rewrite in the same tone and format as the original. Do not add more than 20% to the length. Do not use the words 'synergy,' 'leverage,' 'AI-powered,' 'transformative,' or 'game-changing.'

Assess my team's AI skill gaps

I want to understand where my team's AI skill gaps are. Here is a brief description of my team: [describe team size, roles, current AI tool usage]. Score my team on a 1-5 scale for each of the five skills: spec writing, eval design, context engineering, agent orchestration, and taste/judgment. For each skill, tell me: (1) the current estimated score and why, (2) what the gap costs us in concrete operational terms, (3) one immediate action to close the gap — whether that's training, hiring, or promoting an existing person.

Design an AI literacy training session for a business team

I need to design a 2-hour AI literacy training session for a non-technical business team of [describe team: e.g., '8 operations managers who use Excel and email but no coding']. The session should cover: (1) how to write prompts that get useful output (10 min), (2) when to trust AI output and when to verify it (20 min), (3) how to spot slop and when to push back (20 min), (4) one hands-on exercise — they leave with a working SKILL.md for a task they own (60 min), (5) the one habit to build immediately (10 min). Give me a detailed facilitator guide.

Interview an internal candidate for an AI ops function

I'm interviewing an internal candidate for a new AI ops role. The role owns: eval set management, observability dashboards, tool curation, and SKILL.md library. Write 10 interview questions that test: (1) whether they can design a golden dataset from scratch, (2) whether they understand what override rate means and why it matters, (3) their approach to evaluating a new AI tool before adopting it, (4) their taste — can they recognize slop and articulate why. Include what a strong answer looks like for each question.

Build a personal context engineering practice

I'm a [describe your role: e.g., 'full-stack developer building agentic systems'] and I want to develop context engineering as a core skill. Give me: (1) a 30-day learning plan with specific exercises, (2) a list of 5 projects I can build that will compound this skill fastest, (3) the three most important mental models I need to internalize about how LLMs use context, (4) common mistakes intermediate engineers make with context that I should avoid. Make the exercises concrete — I should be able to start the first one today.
§ Glossary · Terms of art

The vocabulary, plainly defined.

If a term shows up in the plays and you'd hesitate to define it in a meeting, look it up here first.

Second brain

A structured external memory system that captures institutional knowledge — emails, transcripts, contracts, decisions — and makes it queryable by humans and AI alike. The architecture is: Sources (raw, immutable inputs) → Wiki (LLM-compiled summary pages with backlinks) → Schema (instructions that tell the LLM how to maintain the wiki). Karpathy's framing: 'Obsidian is the IDE, the LLM is the programmer, the wiki is the codebase.'

AGENTS.md

A markdown file placed in a project repository (or globally in ~/.codex/) that OpenAI Codex reads before starting any task. Contains project conventions, code style, setup instructions, and behavioral constraints. Codex resolves AGENTS.md files hierarchically: global → project root → current directory, with more specific files overriding more general ones. Analogous to CLAUDE.md but for OpenAI's Codex agent.

CLAUDE.md

A markdown file in a project root that Anthropic's Claude Code reads automatically at the start of every session. Generated with the /init command. Contains codebase conventions, preferred workflows, and project-specific instructions. Anthropic calls it 'the single highest-value setup step you can take.' The Anthropic-ecosystem equivalent of AGENTS.md.

Eval set

A dataset of real tasks paired with their correct or 'good' answers, used to measure an AI system's quality over time. A defensible eval set has 50–100 real examples, not synthetic ones. Run weekly. The eval set is to AI what a test suite is to software: it tells you when something broke, often before users do.

Golden dataset

A curated, human-verified set of input/output pairs that represents the gold standard for a specific AI task. Used as the ground truth in evaluations. Built from real production cases, not generated examples. The foundation of any eval set — the 'golden' designation means the outputs have been verified correct by a human expert.

P95 latency

The 95th percentile of response times — meaning 95% of requests complete faster than this threshold. If P95 hits 30 seconds for an operator-facing AI tool, adoption drops because people stop waiting and route around it. Track alongside P50 (median) to catch long-tail slowdowns that don't show up in averages.

Override rate

The percentage of AI-generated outputs where a human steps in to correct, reject, or redo the work. A rising override rate is the canary: it means the AI is producing output that trained reviewers don't trust. Track it weekly. If it climbs, something in the model, prompt, or context is breaking — investigate before users encounter it.

Context engineering

The discipline of designing, structuring, and managing the full information environment an LLM operates in — not just the prompt. Includes system instructions, retrieved documents (RAG), tool definitions and outputs, memory (short and long-term), conversation history, and user state. Evolved from prompt engineering as agents replaced single-turn chatbots. The five-layer stack: system prompt → retrieved context (RAG) → memory → tool outputs → conversation history.

Agent orchestration

The design and management of multi-step workflows where multiple AI agents hand off tasks, share context, and coordinate toward a goal. Includes: routing tasks to the right specialized agent, managing shared context windows, handling errors and fallbacks, and deciding when to escalate to a human. The engineering equivalent of running a team project.

Taste / judgment

The human capability to distinguish good AI output from slop — and to articulate why. In the deck's framing: 'the human filter that catches slop before it ships.' Not automatable. Compounded by deep domain knowledge and high standards. In 2026, taste is the differentiating capability between operators who ship polished AI-assisted work and those who flood the world with low-quality content.

Slopacolypse

Andrej Karpathy's term for the predicted 2026 flood of low-quality AI-generated content across GitHub, Substack, arXiv, LinkedIn, and everywhere else. HBR calls the workplace version 'workslop.' The Play 02 context files pattern (DESIGN.md, SKILL.md, etc.) is the primary defense — codifying what 'good' looks like so AI generates against a standard, not into a void.

Context files

A family of markdown files that codify institutional knowledge for AI tools to read before generating work. The five-file pattern from the deck: DESIGN.md (brand voice and aesthetics), SKILL.md (codified SOPs and workflows), SPEC.md (per-feature plans with acceptance criteria), AGENTS.md (instructions for OpenAI Codex), CLAUDE.md (instructions for Anthropic Claude Code). Collectively, they are the difference between AI that represents you and AI that represents nobody.