Counting AI agent tool calls: per call or per parent execution
An agent fires five tool calls to finish one task. Log five savings events or one? Neither default is right. Count what the human alternative replaced.
- engineering
- agents
An agent picks up one lead, reads the record from Salesforce, calls an enrichment API over HTTP, writes the result back to Salesforce, and logs a row in Airtable. That is one task and four tool calls. When the savings event fires, do you log four or one?
Most stacks answer this by accident, not on purpose. The default that falls out of where you happen to drop the tracking node is almost never the number that survives a finance review.
Three things you could count, only one of which is value
There are three counters in play, and they measure different layers.
Tool calls measure how the agent decomposed the work. Five calls for one lead, three calls because a retry kicked in, two because the planner re-queried. The count moves with the model's reasoning path and your retry policy. It has no fixed relationship to human effort.
Parent executions measure platform load. One trigger, one run. n8n's Insights documentation is explicit that its Time Saved metric counts production runs of parent workflows, and sub-workflow runs do not roll into it. That is a load-shaped boundary, drawn where the platform's accounting is convenient, not where human work begins and ends.
Human tasks measure value. This is the only counter that maps to a P&L line, because it answers the one question finance actually asks: how much work would a person have done instead?
Observability tooling deepens the trap. LangSmith's tracing captures every tool call as a span, which is exactly right for debugging latency and failures. A trace with five tool spans tells you the agent worked. It does not tell you it replaced five units of human work, and reading the span count as a savings count is how a dashboard ends up 3x to 5x too high.
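To make the gap between the three counters concrete, here is a minimal sketch in Python. The `Span` record and its fields (`execution_id`, `task_id`, `tool`) are hypothetical and do not mirror LangSmith's or n8n's actual schema; the only point is that one trace yields three different numbers depending on which layer you count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    execution_id: str  # hypothetical: the parent run that produced this span
    task_id: str       # hypothetical: the human-sized task this span served
    tool: str          # which tool was called


def count_three_ways(spans):
    """Return (tool_calls, parent_executions, human_tasks) for one trace."""
    tool_calls = len(spans)
    parent_executions = len({s.execution_id for s in spans})
    human_tasks = len({s.task_id for s in spans})
    return tool_calls, parent_executions, human_tasks


# One scheduled execution that enriches two leads, one call per tool for each.
TOOLS = ["salesforce_read", "http_enrich", "salesforce_write", "airtable_log"]
trace = [
    Span("exec_1", lead, tool)
    for lead in ("lead_a", "lead_b")
    for tool in TOOLS
]

print(count_three_ways(trace))  # (8, 1, 2)
```

Only the last number, two replaced tasks, is a candidate for a savings count. The first is four times too high and the second is half of what it should be.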
The rule: count what the human alternative replaced
Log one savings event per replaced human task. Not per tool call, not per parent execution. Per task, defined as a unit of work a person would have picked up, finished, and put down.
The lead-enrichment flow is one task. A person would have opened the record, looked up the company, pasted the result back, and noted it somewhere. Four steps for them too, but one job. Four tool calls, one event, one human baseline of maybe eight minutes.
This rule cuts both ways, which is why neither lazy default works.
Three flows, three different answers
One lead, many tools: log one. The Salesforce-HTTP-Salesforce-Airtable flow above is a single replaced task. The four tool calls are the agent's mechanics. A human did this job once, so you log it once.
curl -X POST https://humanhours.dev/api/v1/track \
  -H "Authorization: Bearer hh_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "lead-enricher",
    "task_type": "lead_enrichment",
    "outcome": "success",
    "human_baseline_minutes": 8,
    "metadata": { "tool_calls": 4, "lead_id": "00Q5e000..." }
  }'
Note the tool_calls: 4 in metadata. Keep it for debugging, but it is not the unit. The unit is the one event.
One execution, many leads: log many. Now the same agent runs on a schedule and clears an Airtable view of 30 unenriched leads in a single parent execution. One execution, thirty replaced tasks. A person would have done thirty separate enrichments. Logging one event here undercounts by a factor of thirty, and an undercounted automation is the first one cut when budgets tighten. Log thirty events, or one event carrying a quantity:
{
  "agent_id": "lead-enricher",
  "task_type": "lead_enrichment",
  "outcome": "success",
  "human_baseline_minutes": 8,
  "quantity": 30,
  "metadata": { "execution_id": "exec_8f2a", "source": "airtable_view" }
}
One task, a retry storm: still log one. The HTTP enrichment API times out three times, and the fourth attempt succeeds. Four tool calls against that one node. A person did not enrich the lead four times. One task, one event. Retries are a reliability detail, not a value multiplier, so they belong in metadata if anywhere, never in the count.
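The three flows above can be sketched as one small event builder in Python. The field names come from the payloads shown earlier. The `build_event` and `minutes_saved` helpers are hypothetical, as are the `http_attempts` metadata key and the assumption that `human_baseline_minutes` is a per-task figure multiplied by `quantity`.

```python
import json


def build_event(agent_id, task_type, baseline_minutes, quantity=1, **metadata):
    """Build one savings event; quantity counts replaced human tasks only."""
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    event = {
        "agent_id": agent_id,
        "task_type": task_type,
        "outcome": "success",
        "human_baseline_minutes": baseline_minutes,
        "metadata": metadata,  # tool calls, retries, execution ids live here
    }
    if quantity > 1:
        event["quantity"] = quantity
    return event


def minutes_saved(event):
    """Assumed interpretation: per-task baseline times number of tasks."""
    return event["human_baseline_minutes"] * event.get("quantity", 1)


# One lead, many tools: one event, tool calls kept as metadata.
single = build_event("lead-enricher", "lead_enrichment", 8, tool_calls=4)

# One execution, many leads: one event carrying a quantity of 30.
batch = build_event("lead-enricher", "lead_enrichment", 8, quantity=30,
                    execution_id="exec_8f2a", source="airtable_view")

# One task, a retry storm: still one event; attempts are metadata only.
retried = build_event("lead-enricher", "lead_enrichment", 8, http_attempts=4)

print(minutes_saved(single), minutes_saved(batch), minutes_saved(retried))  # 8 240 8
print(json.dumps(batch, indent=2))
```

Neither `tool_calls` nor `http_attempts` is an input to `minutes_saved`, so a noisier reasoning path or a flakier API cannot move the savings number.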
Where the call goes in practice
The placement that produces the right count is the same on both common stacks.
In n8n, if a sub-workflow represents one logical task, the tracking node lives at the end of the sub-workflow's success branch, and it fires once per sub-workflow run. If the parent loops over a batch and calls the sub-workflow per item, you get one event per item for free, which is exactly the batch case above. Put the call in the parent only when the parent itself is the one task.
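The n8n placement can be illustrated with a plain Python analogy, since the real thing is node configuration rather than code. The `enrich_lead` function stands in for the sub-workflow and `parent_run` for the parent loop. Both names, the `track` callable, and the "missing company" failure condition are hypothetical.

```python
def enrich_lead(lead, track):
    """Stand-in for a sub-workflow: one logical task per run."""
    if not lead.get("company"):
        return False  # failure branch: no tracking call is made
    # ... the tool calls for this lead would happen here ...
    # Tracking sits at the end of the success branch, once per run.
    track({"task_type": "lead_enrichment", "lead_id": lead["id"]})
    return True


def parent_run(leads, track):
    """Stand-in for the parent workflow looping over a batch."""
    return sum(enrich_lead(lead, track) for lead in leads)


events = []
leads = [
    {"id": "lead_a", "company": "Acme"},
    {"id": "lead_b", "company": ""},  # cannot be enriched
    {"id": "lead_c", "company": "Globex"},
]
completed = parent_run(leads, events.append)
print(completed, len(events))  # 2 2
```

The parent never calls `track` itself, so the event count follows completed items rather than executions.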
In a LangChain or LangGraph agent, the tool calls happen mid-reasoning, so do not track inside the tools. Track once at the end of the run, after the agent returns its final answer and after any human approval gate clears. The agent may have called Salesforce, an HTTP retriever, and Airtable several times each to get there. That is one completed task, so one event.
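A framework-agnostic sketch of that placement follows. It deliberately avoids LangChain and LangGraph APIs: `run_agent` and `approved` are hypothetical callables you would supply, and the shape of the agent result (a dict with a `tool_calls` list) is an assumption. The endpoint and headers are copied from the curl example earlier in the post.

```python
import json
import urllib.request

TRACK_URL = "https://humanhours.dev/api/v1/track"  # from the curl example


def post_event(event, api_key):
    """Send one event using only the standard library; returns HTTP status."""
    request = urllib.request.Request(
        TRACK_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def run_and_track(run_agent, approved, send, lead_id):
    """Run the agent to completion, then emit at most one event for the task."""
    result = run_agent(lead_id)  # may call many tools internally
    if not approved(result):     # human approval gate has not cleared
        return None              # nothing shipped, so nothing was replaced
    event = {
        "agent_id": "lead-enricher",
        "task_type": "lead_enrichment",
        "outcome": "success",
        "human_baseline_minutes": 8,
        "metadata": {
            "lead_id": lead_id,
            "tool_calls": len(result.get("tool_calls", [])),
        },
    }
    send(event)  # exactly one call, outside every tool
    return event


# Offline usage with a fake agent; swap `sent.append` for a function that
# calls post_event(event, api_key) to send for real.
def fake_agent(lead_id):
    return {"lead_id": lead_id, "tool_calls": ["sf_read", "http", "http", "sf_write"]}


sent = []
run_and_track(fake_agent, lambda result: True, sent.append, "lead_1")
run_and_track(fake_agent, lambda result: False, sent.append, "lead_2")
print(len(sent))  # 1
```

Passing `send` in as a parameter keeps tracking out of the tools themselves and makes the once-per-task behavior easy to test without network access.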
The decision rule fits on a line: log one event per unit of work a human would have owned end to end, set the baseline to that human's time, and let tool calls and executions stay in metadata where they belong.
The number this protects
A team running 5k to 50k agent executions a month does not have a tool-call problem; it has a unit problem. Count tool calls and the dashboard inflates until no one trusts it. Count parent executions and every batch job silently undercounts. Count replaced human tasks and the same event stream answers both the engineer debugging a flow and the CFO signing off the budget, because the unit on the event is the unit on the P&L.