# Eval suite plan for `vaadin-playwright-test`

This document is the durable reference for the skill evaluation suite. Progress tracking lives in GitHub issues and milestones; this file describes the design and the *why* behind it.

## Goals

1. **Static skill review** — validate format, structure, and trigger description quality (activation score).
2. **Task evals** — measure the lift the skill provides on realistic Drama Finder test-writing tasks (with-skill vs. without-skill).
3. **Regression detection** — catch effectiveness regressions when `SKILL.md` changes or when the underlying Claude model changes.

## Approach in one paragraph

API-driven evals. Skill content is injected into the system prompt; "with skill" vs. "without skill" is the only difference between conditions. This keeps the eval loop fast, deterministic, and [Langfuse](https://langfuse.com)-native. The harness is TypeScript (no Python, no YAML). Dataset items are typed `.ts` files in the repo, giving full IDE autocomplete and type checking. Langfuse stores runs and scores; the repo is the source of truth for everything else. Haiku for cheap roles (activation eval, judge), Sonnet for generation, prompt caching on. No CI initially — runs on developer machines until budget exists.

## Diagrams

### Runtime: the eval loop

```mermaid
flowchart LR
    SKILL["SKILL.md<br/>(source of truth)"]
    DATASET["dataset items<br/>(TypeScript)"]
    HARNESS["harness.ts<br/>builds system prompt"]

    SKILL --> HARNESS
    DATASET --> HARNESS

    HARNESS --> CALL_WITH["call Claude<br/>WITH skill in prompt"]
    HARNESS --> CALL_WITHOUT["call Claude<br/>WITHOUT skill"]

    CALL_WITH --> COMP1["completion A"]
    CALL_WITHOUT --> COMP2["completion B"]

    COMP1 --> RUBRIC["rubric.ts<br/>deterministic checks"]
    COMP1 --> JUDGE["judge.ts<br/>Haiku LLM judge"]
    COMP2 --> RUBRIC
    COMP2 --> JUDGE

    RUBRIC --> LF["Langfuse<br/>traces + scores"]
    JUDGE --> LF

    LF --> COMPARE["compare view:<br/>lift = score(with) - score(without)"]

    classDef src fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef code fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef apicall fill:#fce7f3,stroke:#9d174d,color:#831843
    classDef out fill:#d1fae5,stroke:#065f46,color:#064e3b

    class SKILL,DATASET src
    class HARNESS,RUBRIC,JUDGE code
    class CALL_WITH,CALL_WITHOUT apicall
    class COMP1,COMP2,LF,COMPARE out
```

### Build order: what gets built in each phase

```mermaid
flowchart TB
    subgraph P0 ["Phase 0 — Bootstrap (½ day)"]
        P0A["harness.ts"]
        P0B["runSmoke.ts"]
        P0C["one trace in Langfuse"]
        P0A --> P0B --> P0C
    end

    subgraph P1 ["Phase 1 — Static review (1 day)"]
        P1A["lint.ts"]
        P1B["activation.ts<br/>+ triggerPrompts.ts"]
        P1C["runStatic.ts"]
        P1A --> P1C
        P1B --> P1C
    end

    subgraph P2 ["Phase 2 — Dataset (1–2 days)"]
        P2A["microTasks.ts<br/>20–30 items"]
        P2B["viewTasks.ts<br/>5–10 items"]
        P2C["syncDatasets.ts"]
        P2A --> P2C
        P2B --> P2C
    end

    subgraph P3 ["Phase 3 — Lift measurement (½ day)"]
        P3A["rubric.ts"]
        P3B["judge.ts"]
        P3C["runLift.ts<br/>--quick / --full"]
        P3A --> P3C
        P3B --> P3C
    end

    subgraph P4 ["Phase 4 — CI (deferred)"]
        P4A["GitHub Action:<br/>static + quick lift on PR"]
        P4B["nightly full lift run"]
        P4C["Option-2 smoke suite<br/>(real Claude CLI)"]
    end

    P0 --> P1
    P1 --> P2
    P2 --> P3
    P3 -.->|when budget arrives| P4

    classDef done fill:#d1fae5,stroke:#065f46
    classDef todo fill:#fef9c3,stroke:#a16207
    classDef defer fill:#e5e7eb,stroke:#6b7280,color:#374151

    class P4,P4A,P4B,P4C defer
```

## Project layout

```
skills/vaadin-playwright-test/
  SKILL.md
  evals/
    package.json
    tsconfig.json
    .env.example
    docker-compose.yaml      # optional: self-hosted Langfuse
    README.md                # how to run, what scores mean, how to debug regressions
    PLAN.md                  # this file
    src/
      types.ts               # RubricItem, EvalResult, JudgeScore
      harness.ts             # loads SKILL.md, builds system prompt, calls Claude
      rubric.ts              # deterministic checks (regex / string match)
      judge.ts               # LLM-as-judge (Haiku)
      lint.ts                # frontmatter + structural checks
      activation.ts          # trigger precision/recall eval
      syncDatasets.ts        # pushes TS dataset items to Langfuse
      runStatic.ts           # npm run static
      runLift.ts             # npm run lift -- [--quick | --full]
      runSmoke.ts            # one-shot Phase 0 smoke test
    datasets/
      microTasks.ts          # 20–30 items
      viewTasks.ts           # 5–10 items
      triggerPrompts.ts      # positive + negative trigger prompts
```

Run with `tsx` (no build step):

```bash
npm run static            # lint + activation, ~$0.05
npm run lift -- --quick   # 5 items × 2 conditions, ~$0.10, ~30s
npm run lift -- --full    # full dataset × 2 conditions, ~$0.50, ~5min
```

## Phase 0 — Bootstrap (½ day)

**Goal:** harness exists, one trace lands in Langfuse, both conditions visibly differ on a smoke item.

1. `cd skills/vaadin-playwright-test/evals && npm init -y`
2. `npm install @anthropic-ai/sdk langfuse tsx typescript @types/node`
3. Create `.env.example` with `ANTHROPIC_API_KEY`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` (default `https://cloud.langfuse.com` or `http://localhost:3000`).
4. Optional: `docker-compose.yaml` for self-hosted Langfuse.
5. `src/harness.ts` exports `runWithSkill(prompt: string, withSkill: boolean): Promise`.
   Reads `../SKILL.md`, builds a system prompt that includes a `...` block when `withSkill` is true. Prompt caching is enabled on the system prompt.
6. `src/runSmoke.ts` runs one hardcoded prompt twice, prints both outputs, logs both as Langfuse traces.

**Acceptance:** open Langfuse, see two traces. With-skill output uses `TextFieldElement.getByLabel(...)`. Without-skill output uses raw `page.locator(...)` or invents an API.

**Stop here for the day.**

## Phase 1 — Static review (1 day)

Two scores per skill version, both written to a single Langfuse trace.

### 1a. Lint (`src/lint.ts`)

Mechanical, fast, deterministic. Returns `{ lint_score: number, failures: string[] }`:

- Frontmatter parses as YAML, has `name` and `description`
- `description` ≤ 1024 chars
- Description starts with a verb or "Use when..."
- Body has at least one fenced code block
- No broken markdown headings

### 1b. Activation eval (`src/activation.ts`)

Adapted from Anthropic's `skill-creator`. Two prompt sets in `datasets/triggerPrompts.ts`:

```ts
export const positiveTriggerPrompts: string[] = [
  "Write a Playwright test for a Vaadin button labeled 'Submit'",
  "How do I assert a Vaadin combobox shows the right options?",
  "Test that a Vaadin Grid shows 5 rows in dramafinder",
  // ... 15–20 total
];

export const negativeTriggerPrompts: string[] = [
  "Write a React component test with Playwright",
  "How do I test a Spring REST controller?",
  "Set up Selenium for a Vaadin application",
  // ... 15–20 total
];
```

For each prompt, call Haiku with: *"Given this skill description: `{description}`, and this user prompt: `{prompt}` — would you load this skill? Reply YES or NO with a one-sentence reason."*

Compute `precision`, `recall`, `F1`. If precision or recall falls below ~0.85, the description needs work. Iterate, re-run.

## Phase 2 — Dataset (1–2 days, the grind)

The dataset is the heart of the eval. Spend more time here than feels necessary.

### Item shape (`src/types.ts`)

```ts
export interface RubricItem {
  id: string;
  category: string; // textfield, button, combobox, grid, ...
  prompt: string;
  rubric: {
    mustUse: string[];
    mustExtend?: string;
    mustNotUse: string[];
  };
  judgeCriteria: string;
  groundTruth?: string;
}
```

### Coverage targets

**Micro-tasks** (`datasets/microTasks.ts`, 20–30 items): TextField (basic, with helper, with validation), Button, ComboBox, Grid, DatePicker, generic `AbstractBasePlaywrightIT` setup, plus 3–5 negative items where the prompt asks for raw XPath but idiomatic skill output should still avoid it.

**View-level tasks** (`datasets/viewTasks.ts`, 5–10 items): real `*IT.java` files from the dramafinder demo module, stripped of their tests, with the original test as `groundTruth`.

### Sync to Langfuse (`src/syncDatasets.ts`)

Idempotent. Reads both TS dataset files, calls `langfuse.createDatasetItem(...)` keyed by `id`. Run once, and again whenever items change.

**Discipline:** never author dataset items in the Langfuse UI. The TS file is the source of truth.

## Phase 3 — Lift measurement (½ day)

`src/runLift.ts`: for each item, for each condition, generate a completion, score it twice (rubric + judge), tag with `experimentId` and `condition`.

`--quick` runs 5 representative items (one per major category). `--full` runs everything.

### Reading the results

Filter by `experimentId` in Langfuse, group by `condition`, look at score deltas per category.
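
The per-category delta can be sketched as follows. This is a minimal illustration rather than the harness's actual code: the `ScoreRow` shape, the condition labels `"with-skill"` / `"without-skill"`, and the `liftByCategory` helper are hypothetical, and it assumes scores have already been pulled out of Langfuse as plain objects with a numeric score between 0 and 1.

```typescript
// Hypothetical row shape: one score for one item under one condition.
interface ScoreRow {
  itemId: string;
  category: string;
  condition: "with-skill" | "without-skill"; // assumed labels
  score: number; // assumed to be normalized to 0..1
}

// Mean of a list of numbers; NaN signals "no data" for an empty list.
function mean(values: number[]): number {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// lift = mean(score with skill) - mean(score without skill), per category.
function liftByCategory(rows: ScoreRow[]): Map<string, number> {
  const buckets = new Map<string, { withSkill: number[]; withoutSkill: number[] }>();
  for (const row of rows) {
    const bucket = buckets.get(row.category) ?? { withSkill: [], withoutSkill: [] };
    if (row.condition === "with-skill") {
      bucket.withSkill.push(row.score);
    } else {
      bucket.withoutSkill.push(row.score);
    }
    buckets.set(row.category, bucket);
  }

  const lifts = new Map<string, number>();
  buckets.forEach((bucket, category) => {
    lifts.set(category, mean(bucket.withSkill) - mean(bucket.withoutSkill));
  });
  return lifts;
}

// Made-up scores, for illustration only.
const exampleRows: ScoreRow[] = [
  { itemId: "tf-1", category: "textfield", condition: "with-skill", score: 1 },
  { itemId: "tf-1", category: "textfield", condition: "without-skill", score: 0.5 },
  { itemId: "grid-1", category: "grid", condition: "with-skill", score: 0.75 },
  { itemId: "grid-1", category: "grid", condition: "without-skill", score: 0.75 },
];
const exampleLifts = liftByCategory(exampleRows);
console.log(exampleLifts.get("textfield")); // 0.5
console.log(exampleLifts.get("grid")); // 0
```

Running the same function once on rubric scores and once on judge scores gives the two lift numbers that the patterns below compare.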

| Pattern | Meaning | Action |
|---|---|---|
| Lift > 0.2 on rubric | Skill works | Ship |
| Lift near 0 | Either Claude already knew this, or skill content isn't landing | Open the items where with-skill *lost*; those tell you what to fix |
| Lift on judge but not rubric | Skill improves style, not API correctness | Investigate — possibly the rubric is too lax |
| Lift on rubric but not judge | Skill teaches API but produces stilted code | Add more idiomatic examples to the skill |

## Phase 4 — CI (deferred)

Documented but not built until budget exists.

```yaml
# .github/workflows/skill-eval.yml (future)
on:
  pull_request:
    paths: [skills/vaadin-playwright-test/**]
  schedule:      # nightly full run, latest Sonnet + Opus, push to Langfuse
jobs:
  static:        # lint + activation, fail if precision or recall < 0.85
  lift-quick:    # 5 items × 2 conditions, fail if rubric drops > 10% vs main
  comment:       # post score deltas as PR comment
```

Plus an optional Option-2 smoke suite: 3 prompts run against the real `claude` CLI in a workspace with the skill installed. Verifies trigger logic in production. Nightly only.

## Cost-control playbook

- **Sonnet for generation only.** Generation is what the skill targets, so it must use the real model.
- **Haiku for everything else.** Activation eval, LLM judge.
- **Prompt caching on.** The system prompt (containing `SKILL.md`) is identical across all items in a run. Cache hit rate is near 100% after the first call.
- **Quick mode for iteration.** 5-item subset for tweaking. Full suite only when you think you're done.
- **Estimates:** quick lift run ~$0.10, full lift run ~$0.50, full static run ~$0.05. A workday of iteration is well under $5.

## Operator's manual — what to do when scores regress

(Add patterns here as you find them. Starter list:)

- **Activation precision drops** → description matches things it shouldn't. Look at false positives; usually a generic phrase like "testing Vaadin" without enough specificity.
- **Activation recall drops** → description too narrow. Check false negatives; usually a phrasing the description doesn't anticipate.
- **Rubric drops on a specific category** → check the SKILL.md examples for that component. The model usually mirrors the most recent example in the skill.
- **Judge score drops, rubric stable** → skill might have grown verbose or contradictory. Trim.
- **Both scores drop after a Claude version bump** → not your skill, the model. Check the nightly schedule run history.

## Sequencing

- **Day 1:** Phase 0. Stop after one trace.
- **Day 2:** Phase 1. Lint + activation. Probably surfaces 1–2 fixable issues with the current description.
- **Weekend:** Phase 2. Author the dataset.
- **Following day:** Phase 3. Run the lift suite. Iterate on `SKILL.md`.
- **Later:** Phase 4 when budget arrives.