The Operator's Edge.
Five plays for working with AI better than everyone else — with Monday-morning moves, tools, and prompts.
Ten moves to start this week.
Start anywhere. The order compounds, but a single move on Monday morning beats a perfect plan you never run.
Pick one recurring question your team asks at least once a week. Write it down. That's your first second-brain project.
Open Notion AI (or Claude) and connect it to the workspace you already have. Ask it the recurring question. See what it can already answer.
Run the DESIGN.md interview: ask Claude or ChatGPT to interview you about your brand voice, customer, and what 'slop' looks like for you. Save the output as DESIGN.md.
Install Claude Code (npm install -g @anthropic-ai/claude-code) and run /init in your most active project folder to generate a CLAUDE.md.
Identify one task you've done manually at least twice. Write a SKILL.md for it — a step-by-step instruction file any agent can follow.
Hand the SKILL.md task to Claude Code, Codex, or Cursor Agents. Run it in read-only mode. Review the output the same way you'd review a new hire's first work.
Collect 10 real examples of tasks your AI system should handle well. Write down what a 'good' answer looks like for each. This is the seed of your eval set.
Pick one observability tool (Langfuse is free and open-source; Helicone offers a one-line integration) and wire it to your most-used AI endpoint. Start tracking cost and latency.
Take one existing job description and add three AI skill requirements: spec writing, eval design, and a taste statement ('able to distinguish good AI output from slop and explain why').
Block 30 minutes every Monday to review one agent's output, check your eval pass rate, and update one context file (DESIGN.md, SKILL.md, or AGENTS.md) based on what you learned.
Five plays for working with AI better than everyone else.
Each play closes with a concrete Monday-morning move, the resources you need, and prompts you can paste into ChatGPT or Claude today.
Build a second brain, not another chat window.
Articles & documentation
How to use Notion AI to query and generate content across your connected workspace. Under $20/seat.
Google's AI research tool grounded in your own documents. Supports PDFs, Google Docs, web URLs, audio, and YouTube. Included in Workspace plans.
Work AI platform that searches every SaaS tool your company uses. Best for teams with fragmented knowledge across many apps.
Human-verified knowledge base with AI answers. Structures and governs knowledge so AI answers can be trusted, not just generated.
AI assistant built into Slack that summarizes threads, prepares meeting briefings, and queries your workspace history.
ChatGPT Projects lets you persist context and documents across sessions — a lightweight second brain for solo operators.
Practical breakdown of the Sources → Wiki → Schema pattern with Obsidian + Claude Code. Good starting template for technical builders.
Local-first markdown note editor. The IDE in Karpathy's 'Obsidian is the IDE' framing. Free for personal use.
Official Claude platform docs — use Claude as the LLM layer in an Obsidian + Claude Code second brain setup.
Videos & talks
Quick walkthrough of NotebookLM for knowledge capture. Good starting point for non-technical operators.
Full masterclass on NotebookLM as a cognitive engine in 2026. Covers advanced source types and workflow patterns.
Demonstrates using NotebookLM as persistent memory for AI agent workflows. Relevant for technical builders.
Products & tools
Upload your documents; get an AI that only answers from what you gave it.
Prompts to copy
Interview me to build my second brain
I want to build a second brain for my business. Interview me — ask me 10 focused questions about: what recurring questions my team has to answer, where institutional knowledge currently lives, what information I look up more than once a week, and what I'd want a new team member to be able to find on their first day. After I answer, organize my responses into a Sources → Wiki → Schema structure with specific recommendations for tools and first steps.
Identify the one question to solve first
Here is a list of recurring questions my team asks: [paste your list]. Rank them by: (1) how often they get asked, (2) how painful it is when no one knows the answer, (3) how feasible it is to capture the answer in a document. Give me the single best candidate for my first second-brain project, and a 90-minute plan to build it.
Generate a wiki schema from raw notes
I'm going to paste raw notes, emails, and transcripts below. Your job is to: (1) identify the key entities, decisions, and commitments in this material, (2) draft a wiki page structure with section headers and backlinks, (3) write a one-paragraph Schema.md instruction file that tells an LLM how to maintain and update this wiki going forward. Keep everything factual — do not infer things not stated. [paste raw material]
Audit an existing knowledge base
Here is the table of contents of my current knowledge base: [paste TOC or list of docs]. Identify: (1) which documents are likely stale or redundant, (2) what critical topics appear to be missing, (3) which entries should be merged, (4) what a new employee or AI agent would be unable to answer from this base alone. Give me a prioritized cleanup list.
Codify your business DNA.
Articles & documentation
Official guide to AGENTS.md: how Codex reads layered instruction files at global, project, and directory scope. The canonical reference for this file format.
Community reference for the AGENTS.md standard. Explains nested file resolution and best practices across large repos.
Official guide to generating a CLAUDE.md with the /init command. 'The single highest-value setup step you can take.' Claude reads it automatically at the start of every session.
Deep technical guide from Anthropic on just-in-time context loading, progressive disclosure, sub-agent architectures, and long-horizon task management. Required reading for builders.
Practical guide to writing SPEC.md-style documents that serve as executable artifacts for coding agents. Covers PRD/SRS hybrid format, modular specs, and self-verification patterns.
Short course on using specs as the agent's guide. Plan, implement, and validate features with a human-in-the-loop workflow.
Research from BetterUp and Stanford: 41% of workers encounter AI-generated output lacking substance, costing nearly 2 hours of rework per instance. The context files pattern is the direct counter.
Official docs for SKILL.md and the agent skills standard — how to package reusable workflows for Codex to follow reliably.
Videos & talks
Common failure modes when writing specs for AI, and how to structure them so the agent actually follows them.
Full walkthrough of GitHub's four-phase gated spec workflow. Shows how SPEC.md drives implementation, tests, and task breakdowns.
Core principles of context engineering and how they map to frameworks like Claude Code and LangChain. Good conceptual grounding.
Products & tools
Primary tool for the DESIGN.md interview exercise. Use Projects to persist your context files across sessions.
Alternative to Claude for the DESIGN.md interview. Both work equally well for the Monday-morning exercise.
Claude Code reads CLAUDE.md automatically. The SKILL.md and context file patterns are natively supported.
Prompts to copy
The DESIGN.md interview
I want to create a DESIGN.md file that captures my business's voice, aesthetics, and brand DNA so AI tools generate on-brand work without constant correction. Interview me. Ask me 12 questions — one at a time — covering: our tone of voice, who our customer is (describe them specifically), what we would never say or do, three examples of writing we admire and why, our banned words or phrases, our visual aesthetic, and what 'generic AI slop' looks like for us. After I answer all 12, synthesize my responses into a clean DESIGN.md file I can store in my project root.
Generate a SKILL.md from a process I describe
I'm going to describe a workflow I do manually. Your job is to turn it into a SKILL.md — a reusable instruction file that any AI agent can follow to complete this workflow without needing me to explain it again. Format the SKILL.md with: (1) a one-sentence purpose statement, (2) step-by-step instructions numbered and detailed enough for a capable AI to follow, (3) what 'done' looks like, (4) common failure modes and how to handle them. Here is the workflow: [describe your workflow]
Write a SPEC.md for a feature or deliverable
Help me write a SPEC.md for the following project or feature: [describe it in plain language]. Structure the spec as: (1) objective — what outcome we want, (2) context — relevant background, constraints, and existing systems, (3) acceptance criteria — a numbered list of conditions that must be true for this to be done, (4) out of scope — what we are explicitly not building, (5) open questions — things still unresolved. Make the acceptance criteria specific enough that an AI agent or junior developer could verify them without asking me.
Audit AI output against a context file
Here is my DESIGN.md: [paste your DESIGN.md]. Here is a piece of content an AI just generated for me: [paste the content]. Audit the content against the DESIGN.md. For each section of the DESIGN.md, tell me: (1) does the content comply, (2) what specifically violates it if not, (3) a suggested rewrite for any non-compliant passage. Be direct — no softening.
Generate an AGENTS.md for a new project
I'm starting a new software project and want to create an AGENTS.md file so that OpenAI Codex or any compatible agent reads my project norms before starting any task. The project is: [describe tech stack, purpose, conventions]. Write an AGENTS.md that covers: (1) how to run the project locally, (2) code style and conventions, (3) what the agent should always do before making changes, (4) what the agent should never do without asking first, (5) how to run tests and what passing tests looks like.
Manage agents like you manage people.
Articles & documentation
Official Agent SDK docs for Claude Code. Build agents that autonomously read files, run commands, and manage subagents using the same tools that power Claude Code itself.
Production-grade Python agent framework with type safety and dependency injection. The Pydantic way to build structured, testable agents.
Explains the components of a PydanticAI agent: instructions, tools, structured output, and dependency injection.
OpenAI's announcement of Codex as an async, remote coding agent. Good context on the operator vs. engineer use case split.
Conductor lets you run parallel coding agents in isolated workspaces, then review and merge their changes. This is the YC-backed startup, not Netflix Conductor (the workflow-orchestration engine).
Runs 100+ parallel coding agents on your machine. Agent-agnostic: works with Claude Code, Codex, Cursor, or any CLI agent. Not to be confused with Apache Superset (the BI tool).
AI-native editor built on VS Code. Cursor Agents can work across multiple files — useful for operators editing non-code files, not just engineers.
Browser-driving agent that handles click-and-type web work. From the deck: 'the agent that does the click-and-type work.'
Sub-agent architectures, tool design, and context management patterns for multi-agent systems. Cross-reference for Play 03 and Play 05.
Videos & talks
Deep walkthrough of multi-agent teams in Claude Code: parallel agents that communicate with each other. Directly relevant to managing agents like employees.
Beginner-accessible guide to setting up specialized sub-agents in Claude Code with their own context and tools.
Demonstrates that you don't need to code to build and manage an AI agent. Good entry point for business operators.
Products & tools
Anthropic's agentic coding tool. Reads CLAUDE.md automatically. Supports subagents, MCP, and shared-doc collaboration.
OpenAI's async remote agent. General coding and ops work; runs in parallel without interrupting your current session.
Python agent framework. Type-safe, testable, integrates with OpenTelemetry for observability.
The agent engineering platform. Broad ecosystem of integrations, tools, and templates.
Framework for orchestrating role-playing autonomous agents. Strong multi-agent coordination model.
GitHub repositories
AI agent framework, the Pydantic way. Type-safe, production-ready, with built-in instrumentation.
The agent engineering platform. Extensive tool integrations and community.
Framework for orchestrating role-playing, autonomous AI agents in collaborative crews.
Claude Code agentic coding tool. Official Anthropic repo.
Prompts to copy
Write a system prompt / job description for an agent
I need to create a system prompt that acts as a job description for an AI agent. The agent's role is: [describe the role and function]. Write a system prompt that specifies: (1) who the agent is and what it's responsible for, (2) what it should always do, (3) what it should never do without explicit approval, (4) how it should communicate output, (5) what success looks like for a typical task, (6) how to handle ambiguous or unclear situations. Keep it under 400 words and make it specific, not generic.
Onboarding review: assess an agent's first week of work
I've been running an AI agent on [task] for the past week. Here is a sample of its output: [paste 3-5 examples]. Act as a performance reviewer. Tell me: (1) what the agent is doing well, (2) where it's producing inconsistent or off-spec output, (3) what I should add or change in its SKILL.md or system prompt to fix the issues, (4) whether this agent is ready for more autonomous operation or needs another review cycle. Be specific — cite the examples.
Design a multi-agent workflow for a repeating process
I want to design a multi-agent workflow for the following repeating business process: [describe the process]. Map it as: (1) a list of distinct sub-tasks, (2) for each sub-task — which agent role handles it, what inputs it needs, what outputs it produces, and what error conditions to handle, (3) where a human must review before passing to the next step, (4) a SKILL.md outline for each agent role. Use plain language — I'll implement in [Claude Code / LangChain / PydanticAI — pick one].
Scope an agent to read-only before giving it write access
I'm about to deploy an agent to handle [task]. Before giving it write access, I want to run it in read-only mode. Help me define: (1) exactly what read-only means for this task (what can it read, what is off-limits), (2) what outputs I should review during the two-week observation period, (3) a simple rubric for deciding when the agent has earned production write access, (4) what guardrails to put in place when it does get write access. Be concrete about the specific risks for this task.
Run evals and observability.
Articles & documentation
Official quickstart for running evals with Braintrust. Covers dataset construction, task definition, and scoring functions. Strong eval tooling for teams: dataset management, experiment tracking, and CI/CD integration.
Full evaluation guide: building golden datasets from production logs, LLM-as-judge scorers, and CI/CD integration.
LangSmith's evaluation framework: human review, heuristic checks, LLM-as-judge, pairwise comparison, and online production scoring.
Technical reference for creating datasets, defining evaluators, running experiments, and analyzing results.
20+ out-of-box evals for RAG, agents, safety, and security. Build custom evaluators to encode domain expertise.
OpenTelemetry-based observability for Python AI applications. Native integration with PydanticAI. One-line instrumentation.
Open-source LLM observability. One-line integration to monitor, evaluate, and experiment. Tracks cost, latency, and quality.
Quickstart guide for adding Helicone to an existing LLM application.
Open-source observability: tracing, evaluation, versioned datasets, experiments, and playground. Vendor and language agnostic.
Open-source observability, metrics, evals, prompt management, and datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK.
1h46m deep dive on why evals are the most important skill for AI product builders. Hamel and Shreya have trained 2000+ PMs and engineers on evals.
Videos & talks
The definitive podcast episode on AI evals. Covers golden datasets, LLM judges, error analysis, and shipping AI that works in production.
Why agent observability is different from standard software monitoring, and what to instrument first.
Arize, Google Cloud, and Wayfair on AI observability, evaluation, and agent feedback loops in production.
Products & tools
Eval platform for AI teams. Dataset management, LLM-as-judge scoring, experiment tracking, and CI/CD integration.
Observability and evaluation for LangChain-based applications. Also works standalone.
Open-source LLM engineering platform with observability, evals, and prompt management.
OpenTelemetry-based observability for Python apps. Native PydanticAI integration.
GitHub repositories
Open-source AI observability and evaluation. Tracing, evals, datasets, experiments, and playground.
Prompts to copy
Design a golden eval dataset for my AI feature
I'm building an AI feature that does the following: [describe the task — e.g., 'summarizes customer support tickets into a one-line priority tag']. Help me design a golden dataset of 50 real-world test cases. For each case, define: (1) the input, (2) what a 'correct' output looks like, (3) a scoring rubric (pass/fail or 1-5 scale with criteria). Then give me a template I can fill out to capture 50 real examples from our production data. Include at least 10 edge cases or tricky inputs that are likely to expose model weaknesses.
Define my five-metric eval dashboard
I need to set up a monitoring dashboard for an AI system that does: [describe what it does]. Define the five metrics I should track: (1) quality — what does a passing score mean and how do I measure it, (2) cost — what's the right unit (per task, per user, per outcome) and what's an acceptable range, (3) latency — what are my P50 and P95 targets, and what threshold should trigger an alert, (4) drift — how do I detect week-over-week quality degradation, (5) override rate — what events count as a human override and what rate is acceptable. Give me specific thresholds, not just definitions.
Run a weekly eval check (prompt template)
I'm running a weekly eval on my AI system. Here are the results from this week's run against my golden dataset: [paste results or summary]. Compare to last week: [paste or describe last week's results]. Tell me: (1) did quality improve, decline, or hold steady, (2) any specific failure categories that appeared or got worse, (3) did the model provider update anything that might explain changes (check if I mentioned a model version), (4) my recommended action — update the prompt, add examples, flag for investigation, or no action needed. Be direct.
Write LLM-as-judge scoring criteria
I need an LLM-as-judge evaluator to score outputs from my AI system. The system does: [describe the task]. Write a scoring rubric with: (1) a 1-5 scale definition for each score level, with concrete examples at each level, (2) the exact judge prompt I should use, including how to format the input/output pair, (3) three example inputs with what score they should receive and why. The rubric should be specific enough that two different judge LLMs would give the same score 80% of the time.
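The 80% agreement target in the last prompt can be checked with a few lines of Python. This is a minimal sketch, assuming you have already collected 1-5 scores from two judge LLMs on the same outputs; the function name `agreement_rate` and the sample scores are illustrative, not part of any eval tool.

```python
from typing import Sequence


def agreement_rate(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Fraction of outputs where two judges gave exactly the same score."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both judges must score the same outputs.")
    if not scores_a:
        raise ValueError("Need at least one scored output.")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


# Hypothetical scores from two judge models on the same five outputs.
judge_one = [5, 4, 3, 2, 1]
judge_two = [5, 4, 3, 2, 2]
rate = agreement_rate(judge_one, judge_two)  # 4 of 5 match
meets_target = rate >= 0.8
```

Exact-match agreement is the strictest reading of "the same score." If the rubric is working but the rate falls short, the disagreements usually point at the score levels whose definitions need sharper examples.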
Hire and train for the new skills.
Articles & documentation
The authoritative technical guide to context engineering as a professional skill. Covers system prompts, RAG architecture, tool design, and long-horizon agent management.
Spec writing as a professional practice — treating specs as executable artifacts, PRD vs SRS mindset, and modular spec design.
Chrome Engineering lead's guide to spec-writing for agents. Practical patterns from a senior engineer who ships with AI every day.
How the 'prompt engineer' role is evolving into a 'context architect' role in 2026. Good framing for rewriting job descriptions.
Five-layer context engineering stack: system prompt, RAG, memory, tool outputs, and conversation history. Practical field guide.
Free short course. Trains the spec writing skill directly. Recommended for anyone on the team who directs coding agents.
Preview of the most popular AI evals course. Hamel Husain and Shreya Shankar have trained 2000+ engineers. Eval design as a core professional skill.
The cost of shipping without taste or judgment: nearly 2 hours of rework per instance of low-substance AI-generated output. The business case for hiring humans with high taste as reviewers.
Videos & talks
Context engineering as a skill: system prompts, RAG, tool design, and memory patterns. Good primer for engineers adding this to their toolkit.
Practical blueprint for building agents using context engineering principles. Intermediate level.
The business case for eval design as a core team skill. Covers how to structure an evals function and who should own it.
Products & tools
Free course on spec writing and spec-driven development. Direct training for the #1 high-leverage skill.
Homework and implementations for a structured AI evals course. Hands-on eval design training.
GitHub repositories
Homework implementations for a structured AI evals course. Practical eval design training material.
Real eval patterns and recipes. Good reference for building an eval function from scratch.
Prompts to copy
Rewrite a job description to include AI skills
Here is an existing job description for [role]: [paste the JD]. Rewrite it to incorporate the five key AI skills for 2026: spec writing, eval design, context engineering, agent orchestration, and taste/judgment. For each skill, add: (1) a specific responsibility that uses this skill, (2) a concrete example of what good performance looks like. Keep the rewrite in the same tone and format as the original. Do not add more than 20% to the length. Do not use the words 'synergy,' 'leverage,' 'AI-powered,' 'transformative,' or 'game-changing.'
Assess my team's AI skill gaps
I want to understand where my team's AI skill gaps are. Here is a brief description of my team: [describe team size, roles, current AI tool usage]. Score my team on a 1-5 scale for each of the five skills: spec writing, eval design, context engineering, agent orchestration, and taste/judgment. For each skill, tell me: (1) the current estimated score and why, (2) what the gap costs us in concrete operational terms, (3) one immediate action to close the gap — whether that's training, hiring, or promoting an existing person.
Design an AI literacy training session for a business team
I need to design a 2-hour AI literacy training session for a non-technical business team of [describe team: e.g., '8 operations managers who use Excel and email but no coding']. The session should cover: (1) how to write prompts that get useful output (10 min), (2) when to trust AI output and when to verify it (20 min), (3) how to spot slop and when to push back (20 min), (4) one hands-on exercise — they leave with a working SKILL.md for a task they own (60 min), (5) the one habit to build immediately (10 min). Give me a detailed facilitator guide.
Interview an internal candidate for an AI ops function
I'm interviewing an internal candidate for a new AI ops role. The role owns: eval set management, observability dashboards, tool curation, and SKILL.md library. Write 10 interview questions that test: (1) whether they can design a golden dataset from scratch, (2) whether they understand what override rate means and why it matters, (3) their approach to evaluating a new AI tool before adopting it, (4) their taste — can they recognize slop and articulate why. Include what a strong answer looks like for each question.
Build a personal context engineering practice
I'm a [describe your role: e.g., 'full-stack developer building agentic systems'] and I want to develop context engineering as a core skill. Give me: (1) a 30-day learning plan with specific exercises, (2) a list of 5 projects I can build that will compound this skill fastest, (3) the three most important mental models I need to internalize about how LLMs use context, (4) common mistakes intermediate engineers make with context that I should avoid. Make the exercises concrete — I should be able to start the first one today.
The vocabulary, plainly defined.
If a term shows up in the plays and you'd hesitate to define it in a meeting, look it up here first.
Second brain
A structured external memory system that captures institutional knowledge — emails, transcripts, contracts, decisions — and makes it queryable by humans and AI alike. The architecture is: Sources (raw, immutable inputs) → Wiki (LLM-compiled summary pages with backlinks) → Schema (instructions that tell the LLM how to maintain the wiki). Karpathy's framing: 'Obsidian is the IDE, the LLM is the programmer, the wiki is the codebase.'
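For technical builders, the Sources → Wiki → Schema layout can start as three items on disk. The sketch below is one possible starting point, not a prescribed structure: the folder names `sources` and `wiki`, the function name, and the starter text written into Schema.md are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed starter text; replace it with your own maintenance rules.
STARTER_SCHEMA = (
    "Treat files in sources/ as immutable inputs.\n"
    "Compile summary pages with backlinks into wiki/.\n"
    "Keep everything factual; do not infer things not stated in a source.\n"
)


def scaffold_second_brain(root: Path) -> None:
    """Create the three layers if missing, without overwriting existing work."""
    (root / "sources").mkdir(parents=True, exist_ok=True)  # raw inputs
    (root / "wiki").mkdir(parents=True, exist_ok=True)     # LLM-compiled pages
    schema = root / "Schema.md"                            # instructions for the LLM
    if not schema.exists():
        schema.write_text(STARTER_SCHEMA, encoding="utf-8")
```

Calling `scaffold_second_brain(Path("my-brain"))` twice is safe: existing folders are left alone and an edited Schema.md is never replaced.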
AGENTS.md
A markdown file placed in a project repository (or globally in ~/.codex/) that OpenAI Codex reads before starting any task. Contains project conventions, code style, setup instructions, and behavioral constraints. Codex resolves AGENTS.md files hierarchically: global → project root → current directory, with more specific files overriding more general ones. Analogous to CLAUDE.md but for OpenAI's Codex agent.
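The global → project root → current directory order can be pictured with a short Python sketch. This illustrates the precedence described above and is not Codex's actual implementation; the function names and the choice to concatenate files (so more specific guidance is read last) are assumptions.

```python
from pathlib import Path
from typing import List


def collect_agents_files(global_dir: Path, project_root: Path, cwd: Path) -> List[Path]:
    """Return existing AGENTS.md files, most general first, most specific last.

    Raises ValueError if cwd is not inside project_root.
    """
    root = project_root.resolve()
    relative = cwd.resolve().relative_to(root)
    candidates = [global_dir / "AGENTS.md", root / "AGENTS.md"]
    current = root
    # Walk down from the project root to the working directory.
    for part in relative.parts:
        current = current / part
        candidates.append(current / "AGENTS.md")
    return [path for path in candidates if path.is_file()]


def merged_instructions(files: List[Path]) -> str:
    """Join the files in order so later, more specific guidance comes last."""
    return "\n\n".join(path.read_text(encoding="utf-8") for path in files)
```

Directories without an AGENTS.md are simply skipped, so a repo can carry one file at the root and add nested files only where a subproject needs different rules.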
CLAUDE.md
A markdown file in a project root that Anthropic's Claude Code reads automatically at the start of every session. Generated with the /init command. Contains codebase conventions, preferred workflows, and project-specific instructions. Anthropic calls it 'the single highest-value setup step you can take.' The Anthropic-ecosystem equivalent of AGENTS.md.
Eval set
A dataset of real tasks paired with their correct or 'good' answers, used to measure an AI system's quality over time. A defensible eval set has 50–100 real examples, not synthetic ones. Run weekly. The eval set is to AI what a test suite is to software: it tells you when something broke, often before users do.
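The test-suite analogy can be made concrete with a small Python sketch. It assumes your AI system is callable as a function from input text to output text and that you supply your own check for "good"; `EvalCase`, `pass_rate`, and the stand-in system below are hypothetical names, not part of any eval platform.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    task_input: str   # a real task, copied from actual work
    good_answer: str  # what a good answer looks like for that task


def pass_rate(
    cases: Sequence[EvalCase],
    system: Callable[[str], str],
    is_good: Callable[[str, str], bool],
) -> float:
    """Run every case through the system and return the fraction that pass."""
    if not cases:
        raise ValueError("An eval set needs at least one case.")
    passed = sum(1 for case in cases if is_good(system(case.task_input), case.good_answer))
    return passed / len(cases)


# Stand-in system and exact-match check, for illustration only.
cases = [EvalCase("refund request", "REFUND REQUEST"), EvalCase("login bug", "Login bug")]
weekly_rate = pass_rate(cases, system=str.upper, is_good=lambda out, good: out == good)
```

Exact match is the simplest possible check. Most real tasks need a rubric or an LLM judge in place of `is_good`, but the weekly number you track is computed the same way.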
Golden dataset
A curated, human-verified set of input/output pairs that represents the gold standard for a specific AI task. Used as the ground truth in evaluations. Built from real production cases, not generated examples. The foundation of any eval set — the 'golden' designation means the outputs have been verified correct by a human expert.
P95 latency
The 95th percentile of response times — meaning 95% of requests complete faster than this threshold. If P95 hits 30 seconds for an operator-facing AI tool, adoption drops because people stop waiting and route around it. Track alongside P50 (median) to catch long-tail slowdowns that don't show up in averages.
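A short Python sketch shows why P95 catches what the median hides. It uses the nearest-rank method, which is one common way to compute a percentile (observability tools may interpolate differently); the sample latencies are invented for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("Need at least one measurement.")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100.")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in seconds: 18 fast requests, 2 very slow ones.
latencies = [1.0] * 10 + [2.0] * 8 + [30.0] * 2
p50 = percentile(latencies, 50)  # 1.0
p95 = percentile(latencies, 95)  # 30.0
```

Here the median looks healthy at one second while P95 sits at the 30-second mark where people stop waiting.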
Override rate
The percentage of AI-generated outputs where a human steps in to correct, reject, or redo the work. A rising override rate is the canary: it means the AI is producing output that trained reviewers don't trust. Track it weekly. If it climbs, something in the model, prompt, or context is breaking — investigate before users encounter it.
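Tracking the number weekly takes very little code. The Python sketch below assumes you can count reviewed outputs and overrides per week; the function names and the optional tolerance for week-to-week noise are illustrative choices, not a standard.

```python
from typing import Sequence


def override_rate(reviewed: int, overridden: int) -> float:
    """Share of reviewed outputs that a human corrected, rejected, or redid."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive.")
    if not 0 <= overridden <= reviewed:
        raise ValueError("overridden must be between 0 and reviewed.")
    return overridden / reviewed


def is_rising(weekly_rates: Sequence[float], tolerance: float = 0.0) -> bool:
    """True when the latest week exceeds the previous week by more than tolerance."""
    if len(weekly_rates) < 2:
        return False
    return weekly_rates[-1] - weekly_rates[-2] > tolerance


# Hypothetical counts: 200 outputs reviewed this week, 30 overridden.
this_week = override_rate(reviewed=200, overridden=30)
```

Comparing only the last two weeks keeps the check simple. A longer window or a moving average would smooth out noisy weeks at the cost of reacting later.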
Context engineering
The discipline of designing, structuring, and managing the full information environment an LLM operates in — not just the prompt. Includes system instructions, retrieved documents (RAG), tool definitions and outputs, memory (short and long-term), conversation history, and user state. Evolved from prompt engineering as agents replaced single-turn chatbots. The five-layer stack: system prompt → retrieved context (RAG) → memory → tool outputs → conversation history.
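As a rough picture of the five-layer stack, the Python sketch below assembles the layers in the order listed above. It is a simplification under stated assumptions: the layer keys, the bracketed labels, and the single flat string are illustrative, and real agent frameworks typically pass these pieces as structured messages.

```python
from typing import Dict

# Layer order as described in the definition above.
LAYER_ORDER = [
    "system_prompt",
    "retrieved_context",
    "memory",
    "tool_outputs",
    "conversation_history",
]


def assemble_context(layers: Dict[str, str]) -> str:
    """Join the non-empty layers in stack order, each under a labeled section."""
    unknown = set(layers) - set(LAYER_ORDER)
    if unknown:
        raise ValueError(f"Unknown layers: {sorted(unknown)}")
    sections = []
    for name in LAYER_ORDER:
        text = layers.get(name, "").strip()
        if text:
            sections.append(f"[{name}]\n{text}")
    return "\n\n".join(sections)
```

Rejecting unknown layer names is a deliberate choice: a typo in a key would otherwise drop a whole layer from the context without any warning.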
Agent orchestration
The design and management of multi-step workflows where multiple AI agents hand off tasks, share context, and coordinate toward a goal. Includes: routing tasks to the right specialized agent, managing shared context windows, handling errors and fallbacks, and deciding when to escalate to a human. The engineering equivalent of running a team project.
Taste / judgment
The human capability to distinguish good AI output from slop — and to articulate why. In the deck's framing: 'the human filter that catches slop before it ships.' Not automatable. Compounded by deep domain knowledge and high standards. In 2026, taste is the differentiating capability between operators who ship polished AI-assisted work and those who flood the world with low-quality content.
Slopacolypse
Andrej Karpathy's term for the predicted 2026 flood of low-quality AI-generated content across GitHub, Substack, arXiv, LinkedIn, and everywhere else. HBR calls the workplace version 'workslop.' The Play 02 context files pattern (DESIGN.md, SKILL.md, etc.) is the primary defense — codifying what 'good' looks like so AI generates against a standard, not into a void.
Context files
A family of markdown files that codify institutional knowledge for AI tools to read before generating work. The five-file pattern from the deck: DESIGN.md (brand voice and aesthetics), SKILL.md (codified SOPs and workflows), SPEC.md (per-feature plans with acceptance criteria), AGENTS.md (instructions for OpenAI Codex), CLAUDE.md (instructions for Anthropic Claude Code). Collectively, they are the difference between AI that represents you and AI that represents nobody.
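A quick way to see where a project stands is to check which of the five files exist. The Python sketch below assumes all five live in the project root, which is a simplification: in practice SKILL.md and SPEC.md files often sit in subfolders, one per workflow or feature. The function name is hypothetical.

```python
from pathlib import Path
from typing import List

# The five-file pattern named above.
CONTEXT_FILES = ["DESIGN.md", "SKILL.md", "SPEC.md", "AGENTS.md", "CLAUDE.md"]


def missing_context_files(project_root: Path) -> List[str]:
    """List the context files not yet present in the project root."""
    return [name for name in CONTEXT_FILES if not (project_root / name).is_file()]
```

Running `missing_context_files(Path("."))` during the Monday review gives a short to-do list of context files still to write.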