Test & Iterate
The automated testing loop AutoClaygent runs until your prompt scores 8.0+
The Iteration Loop
Building a production-ready Claygent is never one-and-done. The secret is a tight feedback loop: draft → test → evaluate → improve → repeat.
AutoClaygent runs this entire loop automatically. It tests prompts against sample data, scores the results, identifies issues, and iterates until the prompt scores 8.0+. What takes hours manually happens in minutes.
Setting Up Your Test Environment
Step 1: Create Test Data
Don't test on your full table. Create a small test set of 10-20 rows that includes:
- Easy cases — Companies with obvious signals (clear portal URLs, prominent booking buttons)
- Hard cases — Companies with sparse websites, custom portals, or no clear tech indicators
- Edge cases — International sites, rebranded portals, multiple platforms
For platform detection, include companies you KNOW use specific platforms (your existing customers) so you can verify accuracy. For CRM validation, include records you suspect are wrong.
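If it helps to see the shape of such a test set, here is a minimal Python sketch; the domains, platform names, and field names are placeholders for illustration, not data from a real table.

```python
# A hand-built test set sketch. The real set should have 10-20 rows; the
# domains and platform names below are placeholders, not real companies.
test_rows = [
    # Easy case: obvious signals such as a clear portal URL or booking button
    {"domain": "example-clinic.com", "known_platform": "AcmeBooking", "difficulty": "easy"},
    # Hard case: sparse website, custom portal, no clear tech indicators
    {"domain": "example-consulting.com", "known_platform": None, "difficulty": "hard"},
    # Edge case: international site, rebranded portal, or multiple platforms
    {"domain": "example-intl.de", "known_platform": "AcmeBooking", "difficulty": "edge"},
]

# Make sure all three buckets are represented before importing into Clay.
assert {row["difficulty"] for row in test_rows} == {"easy", "hard", "edge"}
```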
Step 2: Add Your Claygent Column
Add a Claygent column with your prompt and configure:
- Model: GPT-4o-mini (recommended for cost/quality balance)
- Your API Key: Required — Clay doesn't provide one
- JSON Schema: Paste your validated schema
Never use Clay's built-in "Clay" model for production Claygents. Use your own API key with GPT-4o-mini or Claude Haiku.
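If you need a starting point for the schema, here is a sketch that builds one matching the CEO-finder output used later in this section and prints it as JSON to paste into the column config. The field names come from that example; adapt them to your own task.

```python
import json

# Sketch of a JSON Schema for the CEO-finder output shown later in this
# section. Field names mirror that example and are not a fixed requirement.
schema = {
    "type": "object",
    "properties": {
        "leader_name":  {"type": ["string", "null"]},
        "title":        {"type": ["string", "null"]},
        "linkedin_url": {"type": ["string", "null"]},
        "source_url":   {"type": ["string", "null"]},
        "is_current":   {"type": "boolean"},
        "confidence":   {"type": "string", "enum": ["high", "medium", "low"]},
    },
    "required": ["leader_name", "title", "is_current", "confidence"],
    "additionalProperties": False,
}

# Paste the printed schema into the Claygent column's JSON Schema field.
print(json.dumps(schema, indent=2))
```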
Step 3: Capture Results for Analysis
Add an HTTP Action column after your Claygent to capture results:
```
{
  "prompt_text": "{{Your full Claygent prompt text}}",
  "json_output": {{claygent_column_output}},
  "prompt_version": "platform-detection-v1.0",
  "changes_from_previous": "Initial version"
}
```

This sends each result to a webhook where you can analyze patterns across the batch.
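The webhook itself is up to you. A minimal receiver might look like the following Flask sketch, which appends each payload to a JSON Lines file for later analysis; the route path and filename are arbitrary choices, not part of Clay's setup.

```python
import json
from flask import Flask, request

app = Flask(__name__)

@app.route("/claygent-results", methods=["POST"])
def claygent_results():
    # Each Clay HTTP Action call lands here; store it as one JSON line.
    payload = request.get_json(force=True)
    with open("claygent_results.jsonl", "a") as f:
        f.write(json.dumps(payload) + "\n")
    return {"ok": True}

if __name__ == "__main__":
    app.run(port=8000)
```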
See the Iteration in Action
Compare different versions of the same prompt to see how it improves:
Before:

```
Find the CEO of {{domain}} and their email.
```

- ✗ No step-by-step instructions
- ✗ No fallback plan
- ✗ No JSON output format
- ✗ No confidence handling
- ✗ Two tasks combined (violates the 3-task rule)

After:

```
Given the company domain: {{domain}}
Goal: Find the CEO or primary founder.
Research steps:
1. Visit {{domain}}/about, {{domain}}/team, or {{domain}}/leadership
2. Look for titles: CEO, Chief Executive Officer, Founder, Co-Founder
3. Extract their full name
If not found on website:
- Search: "{{domain}}" CEO OR Founder site:linkedin.com
- Look for recent press releases or funding announcements
Verification:
- Confirm they are CURRENT (not "former")
- Check that the company domain matches
Output as JSON:
{
"leader_name": "Full name of CEO/Founder",
"title": "Their exact title",
"linkedin_url": "LinkedIn profile URL or null",
"source_url": "Where you found this",
"is_current": true or false,
"confidence": "high" | "medium" | "low"
}
IMPORTANT:
- Only report verified, current leadership
- If person appears to be former employee, do not include
- Set to null if uncertain
```

- ✓ Added verification section for 'current' check
- ✓ Added press/funding as secondary sources
- ✓ Added source_url for audit trail
- ✓ Added is_current boolean field
- ✓ Added explicit instructions for handling uncertainty
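Once results are flowing into your webhook, a quick script can catch the same failure modes this checklist targets: missing fields, invalid confidence values, or former employees slipping through. This sketch assumes the JSON Lines file from the webhook example above and the field names from the improved prompt; both are illustrative.

```python
import json

REQUIRED = {"leader_name", "title", "is_current", "confidence"}
VALID_CONFIDENCE = {"high", "medium", "low"}

with open("claygent_results.jsonl") as f:
    for i, line in enumerate(f, start=1):
        # json_output is the Claygent column's structured result.
        row = json.loads(line).get("json_output") or {}
        missing = REQUIRED - set(row)
        if missing:
            print(f"row {i}: missing fields {sorted(missing)}")
        if row.get("confidence") not in VALID_CONFIDENCE:
            print(f"row {i}: invalid confidence {row.get('confidence')!r}")
        if row.get("is_current") is False:
            print(f"row {i}: flagged as former leadership, review manually")
```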
How Many Iterations?
| Starting Score | Expected Iterations | Typical Issues |
|---|---|---|
| < 5.0 | 4-6 iterations | Missing detection methods, no decision trees |
| 5.0 - 6.9 | 2-3 iterations | Missing fallbacks, weak confidence definitions |
| 7.0 - 7.9 | 1-2 iterations | Edge case handling, evidence requirements |
| 8.0+ | Done! | Production ready |
What to Fix First
When you find multiple issues, fix them in this order:
1. Accuracy issues (25% weight) — Wrong platform detected, false positives
2. Completeness issues (25% weight) — Missed platforms that were obviously there
3. JSON issues (15% weight) — Schema mismatches, invalid enum values
4. Source issues (15% weight) — Missing evidence URLs, unreliable detection methods
5. Efficiency issues (10% weight) — Too many steps, redundant checks
Fix one issue per iteration. Don't try to fix everything at once — you won't know which change worked or made things worse.
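To see why this order matters, here is a sketch of how the listed weights combine into a composite score, assuming each dimension is scored 0 to 10. Only the dimensions named above are included, so this approximates the rubric rather than reproducing its exact formula.

```python
# Weights as listed above; the full AutoClaygent rubric may include more.
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.25,
    "json": 0.15,
    "sources": 0.15,
    "efficiency": 0.10,
}

def composite(scores: dict) -> float:
    """Weighted average over the listed dimensions (0-10 scale)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / total_weight

# Example: strong accuracy but weak sourcing keeps the total below 8.0,
# which tells you to fix source issues before efficiency tweaks.
print(round(composite({"accuracy": 9, "completeness": 8, "json": 8,
                       "sources": 5, "efficiency": 9}), 2))  # -> 7.89
```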
Version Your Prompts
Always save each version of your prompt with a version number:
- platform-detection-v1.0 — Initial version with subdomain patterns
- platform-detection-v1.1 — Added redirect detection for custom domains
- platform-detection-v1.2 — Added widget detection for embedded tools
- platform-detection-v2.0 — Major restructure: added confidence tiers
This helps you track what changed and roll back if something breaks.
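A lightweight way to do this is a local version log that mirrors the prompt_version and changes_from_previous fields from the HTTP Action payload. The sketch below is one possible structure; a git repository works just as well.

```python
import json
from datetime import date

def log_version(version: str, changes: str, prompt_text: str,
                path: str = "prompt_versions.jsonl") -> None:
    # Append one entry per prompt revision so you can diff or roll back later.
    entry = {
        "version": version,
        "date": date.today().isoformat(),
        "changes_from_previous": changes,
        "prompt_text": prompt_text,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_version("platform-detection-v1.1",
            "Added redirect detection for custom domains",
            "Given the company domain ...")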
Real-World Iteration Example
Here's how a platform detection prompt evolved through testing:
| Version | Score | Issue Found | Fix Applied |
|---|---|---|---|
| v1.0 | 6.2 | Missing platforms on custom domains | Added redirect detection step |
| v1.1 | 7.1 | False positives on similar-looking URLs | Added explicit subdomain pattern list |
| v1.2 | 7.8 | No evidence for widget detections | Required evidence_url for every platform |
| v1.3 | 8.3 | N/A — Production ready | Deployed to full table |
Key Takeaways
- Test on 10-20 diverse rows, not your full table
- Use your own API key (GPT-4o-mini or Claude Haiku)
- Fix one issue per iteration
- Target 8.0+ before deploying to production
- Version your prompts to track changes and enable rollback