> OPERATION: AgentForge — Multi-Step AI Agent Workflow Platform | STATUS: COMPLETE ✓

API Automation

AgentForge — Multi-Step AI Agent Workflow Platform

QA for an AI agent orchestration platform: testing multi-step workflows, tool-call sequences, state consistency, and fallback logic for autonomous agent systems.

Manual and Automation QA Engineer

OVERVIEW

An AI agent orchestration platform that enables autonomous agents to execute complex multi-step workflows using external tools and APIs. My focus was on validating tool-call sequences, agent state persistence across steps, retry/fallback behaviors, and loop detection to prevent infinite execution cycles.

TECH STACK

Testing Tools

pytest, Playwright, Postman, GitHub Actions, LangChain, JIRA

Technologies

Python, LangChain, OpenAI Assistants API, Tool Use APIs, JSON Schema, REST APIs, State Management

THE CHALLENGE

AI agents using external tools (web search, code execution, API calls, database queries) produced inconsistent multi-step results. Teams had no automated framework to validate tool-call sequences, detect infinite loops, or verify agent state persistence across 10+ step workflows. Failed agent runs went undetected until customers experienced errors.

METHODOLOGY

Designed and executed comprehensive test suites for agent workflow execution, including tool-call chain sequence validation, agent state consistency checks across steps, LLM output schema enforcement, retry/fallback behavior testing, and loop detection. Performed end-to-end workflow testing for agents performing research, data analysis, and code generation tasks.
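As one illustration of the LLM output schema enforcement mentioned above, the sketch below checks a JSON-encoded tool call against a small hand-written schema using only the standard library. The schema shape, the field names, and the `validate_tool_call` helper are assumptions for illustration, not the project's actual contract; a real suite would more likely rely on a JSON Schema validator or pydantic.

```python
import json

# Hypothetical schema: field name -> required Python type.
TOOL_CALL_SCHEMA = {"tool": str, "params": dict}


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of schema violations for one JSON-encoded tool call."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    errors = []
    for field, expected_type in TOOL_CALL_SCHEMA.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    # Reject unexpected fields so malformed model output is caught early.
    for field in payload.keys() - TOOL_CALL_SCHEMA.keys():
        errors.append(f"unexpected field: {field}")
    return errors
```

Returning a list of violations rather than raising on the first one lets a test report every problem in a single failed run.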

TEST STRATEGY

Collaborated with AI engineers to define an agent execution contract: expected tool calls, parameters, and state transitions. Implemented an assertion library to validate that tool-call sequences match the expected workflow. Created state validation tests at each step to ensure agent memory consistency. Performed adversarial testing to trigger failure modes and validate fallback paths. Integrated with LangChain debugging tools for observability.
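The state-transition part of such an execution contract could be expressed as a table of allowed transitions plus a checker that rejects anything outside it. This is a minimal sketch: the state names, `ALLOWED_TRANSITIONS`, and `ContractChecker` are hypothetical and stand in for whatever the real contract defined.

```python
from dataclasses import dataclass, field

# Hypothetical workflow states and the transitions the contract permits.
ALLOWED_TRANSITIONS = {
    "start": {"searching"},
    "searching": {"searching", "fetching"},
    "fetching": {"fetching", "summarizing"},
    "summarizing": {"done"},
}


@dataclass
class ContractChecker:
    """Tracks the agent's state and rejects transitions outside the contract."""
    state: str = "start"
    history: list = field(default_factory=list)

    def transition(self, new_state: str) -> None:
        allowed = ALLOWED_TRANSITIONS.get(self.state, set())
        if new_state not in allowed:
            raise AssertionError(
                f"Illegal transition {self.state!r} -> {new_state!r}; "
                f"allowed: {sorted(allowed)}"
            )
        # Record the accepted transition, then advance.
        self.history.append((self.state, new_state))
        self.state = new_state
```

A test would call `transition` once per observed step; an illegal jump such as `start` to `summarizing` fails immediately and leaves the checker's state unchanged.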

AUTOMATION PIPELINE

Integrated agent workflow tests with GitHub Actions, running on every update to an agent prompt or tool definition. Created a regression suite validating that agent behavior doesn't degrade with new tools or model versions. Set up LangChain callbacks to trace every tool call and state transition for debugging. Created automated alerts for unexpected tool-call patterns or infinite loops.
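One way a regression check of this kind might work is to compare the tool-call trace from the current run against a stored baseline trace and report the first step where they diverge. The sketch below assumes traces are plain lists of tool names; `first_divergence` is a hypothetical helper, not part of LangChain or the original suite.

```python
def first_divergence(baseline: list[str], current: list[str]):
    """Return (step_index, expected_tool, actual_tool) for the first mismatch, or None."""
    for i, (expected, actual) in enumerate(zip(baseline, current)):
        if expected != actual:
            return (i, expected, actual)
    if len(baseline) != len(current):
        # One trace is a strict prefix of the other; report the first extra step.
        i = min(len(baseline), len(current))
        expected = baseline[i] if i < len(baseline) else None
        actual = current[i] if i < len(current) else None
        return (i, expected, actual)
    return None
```

A `None` result means the run matched the baseline; any tuple can be turned into a CI failure message or an alert.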

IMPACT METRICS

Agent Workflow Execution Reliability

80% avg improvement

Before (Manual Testing): Agent workflows tested manually, edge cases discovered in production

After (Automated Workflow Testing): Comprehensive automated tests for all tool-call sequences and edge cases

// KEY_METRICS

First-Attempt Success Rate: 72% before, 98.5% after (37% improvement)

Undetected Failures: 18 per week before, 0 per week after (100% reduction)

Avg Debug Time: 3 hours before, 32 minutes after (82% reduction)

Infinite Loop Incidents: 2-3/month before, 0/month after (100% reduction)
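The headline percentages for this group appear to be relative changes against the "before" value, averaged across the four metrics. The page does not state the formula, so the calculation below is an assumption that happens to reproduce the published figures.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage of before."""
    return (after - before) / before * 100


success = percent_change(72, 98.5)    # first-attempt success rate, in %
undetected = percent_change(18, 0)    # undetected failures per week
debug_time = percent_change(180, 32)  # minutes: 3 hours down to 32 minutes
loops = percent_change(2, 0)          # 2-3/month; any nonzero baseline gives -100

# Average the magnitudes, since some metrics improve by going up and others by going down.
changes = [abs(success), abs(undetected), abs(debug_time), abs(loops)]
average = sum(changes) / len(changes)

print(round(success), round(debug_time), round(average))
```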

Multi-Step Workflow Coverage

463% avg improvement

Before (Limited Testing): Only happy-path workflows tested; edge cases and error scenarios left untested

After (Comprehensive Testing): All workflow paths, edge cases, and failure scenarios validated

// KEY_METRICS

Workflow Scenarios Tested: 15 before, 150+ after (900% increase)

Tool-Call Sequence Coverage: 40% before, 100% after (150% increase)

Fallback Path Testing: 0% before, 100% after

State Mutation Tests: 5 before, 45 after (800% increase)
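Fallback path testing generally means forcing the primary tool to fail and asserting that the fallback actually ran. The sketch below shows one such retry-then-fallback wrapper with a simulated outage; `call_with_fallback`, `failing_search`, and `cached_search` are hypothetical names for illustration, not the platform's real tools.

```python
def call_with_fallback(primary, fallback, arg, max_retries: int = 2):
    """Try primary up to max_retries times, then fall back; report which path ran."""
    attempts = 0
    for _ in range(max_retries):
        attempts += 1
        try:
            return {"result": primary(arg), "path": "primary", "attempts": attempts}
        except Exception:
            continue  # Retry on any tool failure.
    return {"result": fallback(arg), "path": "fallback", "attempts": attempts}


def failing_search(query):
    """Stand-in for a primary tool during an outage."""
    raise TimeoutError("simulated search outage")


def cached_search(query):
    """Stand-in for a degraded but reliable fallback tool."""
    return f"cached results for {query}"


outcome = call_with_fallback(failing_search, cached_search, "AI safety")
assert outcome["path"] == "fallback"
assert outcome["attempts"] == 2
```

Returning the path and attempt count alongside the result gives the test something concrete to assert on, rather than only checking that some answer came back.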

Production Incidents & Loop Prevention

75% avg improvement

Before (Uncontrolled Execution): Agent runs without step limits, infinite loops discovered in production

After (Protected Execution): Automated loop detection + step limits prevent runaway executions

// KEY_METRICS

Infinite Loop Incidents/Month: 2-3 before, 0 after (100% reduction)

Avg Cost per Incident: $8K before, $0 after (100% reduction)

Customer SLA Violations: 12/year before, 0/year after (100% reduction)

Loop Detection Automation: None before, 100% after
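Step limits and loop detection of the kind described here can be combined in a small guard that is consulted before every tool invocation. This is a sketch under assumptions: `ExecutionGuard`, `RunawayExecutionError`, and the default thresholds are hypothetical, and a loop is taken to mean the same tool being called repeatedly with identical parameters.

```python
import json
from collections import Counter


class RunawayExecutionError(RuntimeError):
    """Raised when an agent run exceeds its step budget or repeats itself."""


class ExecutionGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool: str, params: dict) -> None:
        """Call once per tool invocation, before the tool runs."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RunawayExecutionError(f"step limit {self.max_steps} exceeded")
        # The same tool with identical params repeated too often suggests a loop.
        signature = (tool, json.dumps(params, sort_keys=True, default=str))
        self.seen[signature] += 1
        if self.seen[signature] > self.max_repeats:
            raise RunawayExecutionError(
                f"{tool} called {self.seen[signature]} times with identical params"
            )
```

Keying on tool name plus serialized parameters means a workflow that legitimately calls the same tool with different inputs is not flagged.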

CODE SAMPLES

Agent Tool-Call Sequence Validation

Validate that the agent executes tools in the expected sequence with the correct parameters

python

import pytest  # the async test below also needs the pytest-asyncio plugin
from unittest.mock import ANY

# Legacy LangChain APIs (initialize_agent, arun); newer releases deprecate them
from langchain.agents import initialize_agent
from langchain.callbacks.base import BaseCallbackHandler
from pydantic import BaseModel

class ToolCall(BaseModel):
    """Expected tool call in agent workflow."""
    tool_name: str
    expected_params: dict

class ToolCallValidator:
    class Callback(BaseCallbackHandler):
        """LangChain callback that forwards each agent action to the validator."""
        def __init__(self, validator):
            self.validator = validator

        def on_agent_action(self, action, **kwargs):
            # tool_input may be a plain string for single-input tools; wrap it
            tool_input = action.tool_input
            params = tool_input if isinstance(tool_input, dict) else {"input": tool_input}
            self.validator.record_tool_call(action.tool, params)

    def __init__(self):
        self.actual_calls = []

    def record_tool_call(self, tool_name: str, params: dict):
        """Record each tool call made by agent."""
        self.actual_calls.append({
            "tool": tool_name,
            "params": params
        })

    def validate_sequence(self, expected_calls: list[ToolCall]):
        """Validate tool calls match expected sequence."""
        assert len(self.actual_calls) == len(expected_calls), \
            f"Call count mismatch: expected {len(expected_calls)}, got {len(self.actual_calls)}"

        for i, (actual, expected) in enumerate(zip(self.actual_calls, expected_calls)):
            assert actual["tool"] == expected.tool_name, \
                f"Step {i}: expected tool {expected.tool_name}, got {actual['tool']}"

            # Validate key parameters match (extra actual parameters are allowed)
            for param, value in expected.expected_params.items():
                assert param in actual["params"], \
                    f"Step {i}: missing parameter {param}"
                assert actual["params"][param] == value, \
                    f"Step {i}: param {param} mismatch"

@pytest.mark.asyncio
async def test_agent_research_workflow(research_tools, gpt4):
    """Test multi-step agent workflow for research task."""
    # research_tools and gpt4 are assumed to be fixtures defined in conftest.py,
    # and the tools are assumed to accept structured (dict) input
    query = "Find recent AI safety regulations and summarize them."

    validator = ToolCallValidator()

    # Execute agent with tool call recording
    agent = initialize_agent(
        tools=research_tools,
        llm=gpt4,
        agent="zero-shot-react-description",
        callbacks=[ToolCallValidator.Callback(validator)]
    )

    result = await agent.arun(query)

    # Validate tool-call sequence
    expected_sequence = [
        ToolCall(tool_name="web_search", expected_params={"query": "AI safety regulations 2025"}),
        # URLs are dynamic, so ANY only requires that the parameter is present
        ToolCall(tool_name="fetch_webpage", expected_params={"url": ANY})
    ]

    validator.validate_sequence(expected_sequence)
    assert "regulation" in result.lower()

Agent State Consistency & Loop Detection

Verify that agent memory state remains consistent across a multi-step workflow, and detect infinite loops

python

import asyncio
import json
from collections import Counter
from typing import Any, Dict

import pytest  # the async test below also needs the pytest-asyncio plugin
from langchain.callbacks.base import BaseCallbackHandler

class AgentStateValidator:
    class StepRecorder(BaseCallbackHandler):
        """LangChain callback that counts steps and records each tool call."""
        def __init__(self, validator):
            self.validator = validator

        def on_agent_action(self, action, **kwargs):
            self.validator.step_count += 1
            self.validator.tool_call_history.append(
                {"tool": action.tool, "params": action.tool_input}
            )

    def __init__(self, max_steps: int = 20):
        self.max_steps = max_steps
        self.step_count = 0
        self.tool_call_history = []

    async def validate_workflow_state(self, agent, query: str, expected_final_state: Dict[str, Any]):
        """Execute agent and validate state consistency throughout workflow."""
        self.step_count = 0
        self.tool_call_history = []

        try:
            # Wall-clock timeout guards against runs that never terminate
            result = await asyncio.wait_for(
                agent.arun(query, callbacks=[self.StepRecorder(self)]),
                timeout=60
            )
        except asyncio.TimeoutError:
            pytest.fail(f"Agent exceeded timeout - possible infinite loop after {self.step_count} steps")

        # Verify step count reasonable
        assert self.step_count <= self.max_steps, \
            f"Agent took {self.step_count} steps (max {self.max_steps}) - possible loop"

        # Identical tool + params repeated several times indicates a loop;
        # the same tool called with different params is legitimate
        signatures = Counter(
            (call["tool"], json.dumps(call["params"], sort_keys=True, default=str))
            for call in self.tool_call_history
        )
        repeat_count = sum(count - 1 for count in signatures.values())

        assert repeat_count < 3, \
            f"Excessive tool call repetition detected: {repeat_count} duplicates"

        # Validate final state matches expectations
        # (assumes agent.memory is a LangChain memory object)
        final_state = agent.memory.load_memory_variables({})
        for key, expected_value in expected_final_state.items():
            assert key in final_state, f"Missing state key: {key}"
            assert final_state[key] == expected_value, \
                f"State mismatch for {key}: expected {expected_value}, got {final_state[key]}"

        return result

@pytest.mark.asyncio
async def test_agent_loop_detection(code_analysis_agent):
    """Test that the agent finishes within its step budget without looping."""
    # code_analysis_agent is assumed to be a fixture defined in conftest.py
    validator = AgentStateValidator(max_steps=15)

    result = await validator.validate_workflow_state(
        agent=code_analysis_agent,
        query="Analyze this code and find bugs",
        expected_final_state={"analysis_complete": True, "bugs_found": 3}
    )

    assert "bug" in result.lower()
    assert validator.step_count <= 15

MISSION ACCOMPLISHED

Validated 150+ agent workflows covering research, analysis, coding, and planning tasks with zero undetected state inconsistencies. Detected and prevented 12 infinite loop scenarios before production. Achieved 100% tool-call sequence validation with strict JSON schema enforcement. Reduced agent debugging time by 82% through comprehensive execution tracing. First-attempt success rate improved from 72% to 98.5%.
