A conversational tool can work perfectly and still fail to ship, because the last gate is not whether it works but whether the client's security team can see what it does. That team trusts what it can observe. What you log, what you strip before you log it, how you prove every action, and what evidence you can hand over decide the deal. Build the proof alongside the tool. Bolted on at the end, it is a scramble. Built in, the answer to "show me" already exists.
Instrument the tool, but keep the content out of the log
Trace the tool the way you would any production system, with spans for each model and tool call, metrics for latency and token use, and events for what happened. There is now a standard for the shape. The OpenTelemetry semantic conventions for generative AI define spans and metrics for model calls and operation names for agent and tool actions, so an investigation reads the same fields across vendors. The conventions are still marked experimental, so pin a version, but the shape is settling.
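As a rough illustration of that shape, the sketch below times one model call and records it as a flat, span-like dictionary. The `gen_ai.*` keys follow attribute names from the OpenTelemetry generative AI conventions as best understood here, so check them against the version you pin. `record_model_call`, `fake_model`, and the model name are hypothetical, and a production system would emit real spans through an OpenTelemetry SDK rather than build dictionaries.

```python
import time
from typing import Callable, Dict, Tuple


# Hypothetical helper: wraps one model call and returns a span-like record.
# Attribute keys mirror the OpenTelemetry GenAI conventions; verify them
# against the pinned version before relying on them.
def record_model_call(
    model: str,
    call: Callable[[str], Tuple[str, int, int]],
    prompt: str,
) -> Tuple[str, Dict[str, object]]:
    start = time.monotonic()
    reply, input_tokens, output_tokens = call(prompt)
    duration_s = time.monotonic() - start
    span = {
        "name": f"chat {model}",
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "duration_s": duration_s,  # stand-in for span timing, not a convention attribute
    }
    # Neither the prompt nor the reply is written into the span.
    return reply, span


def fake_model(prompt: str) -> Tuple[str, int, int]:
    # Stand-in for a real model client; token counts are a crude word count.
    reply = "ok"
    return reply, len(prompt.split()), len(reply.split())


reply, span = record_model_call("example-model", fake_model, "summarize this ticket")
```

The record carries what an investigation needs (operation, model, token use, latency) and none of the conversation text.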
The prompt and the reply are the trap. They carry user data, and the standard says so directly. In the OpenTelemetry conventions, the message content attributes are opt-in and flagged as likely to contain sensitive information including PII. Keep raw content out of the standard log by default, or behind an access-controlled opt-in. A log written with raw personal data is a second copy of the thing you promised to protect.
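One way to enforce that default is to strip content fields at the logging boundary. The sketch below is illustrative only: the field names `prompt` and `reply`, the `to_log_record` helper, and the `capture_content` flag are all assumptions, not part of any standard.

```python
from typing import Dict

# Hypothetical field names for raw content; adjust to your own event schema.
CONTENT_FIELDS = ("prompt", "reply")


def to_log_record(event: Dict[str, object], capture_content: bool = False) -> Dict[str, object]:
    """Return a copy of the event that is safe for the standard log.

    Raw content is dropped unless capture_content is explicitly set, which
    should only happen on an access-controlled path.
    """
    record = {k: v for k, v in event.items() if k not in CONTENT_FIELDS}
    for field in CONTENT_FIELDS:
        if field in event:
            text = str(event[field])
            # Keep only the size, which helps debugging without exposing the text.
            record[f"{field}_chars"] = len(text)
            if capture_content:
                record[field] = text
    return record


event = {
    "operation": "chat",
    "user_id": "u-123",
    "prompt": "My card number is 4111 1111 1111 1111",
    "reply": "I cannot store card numbers.",
}
safe = to_log_record(event)
```

Making redaction the default and capture the exception means a forgotten flag fails safe: the worst case is a log with less detail, not a log with personal data.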
An audit trail is a record of who did what
A client will ask you to prove every action the tool took in their systems: who it acted as and what it changed. That is an audit trail, and it is a different thing from application logs. It records actions on a store that cannot be quietly edited. Cloud providers set the pattern. Google Cloud's audit logs answer who did what, where, and when, and its Admin Activity logs are always written, cannot be disabled, and are retained for 400 days. Record every write your agent makes with the acting identity, the change, and the time, append only, because that record is what an investigation reads and what a review asks to see.
Standard tracing helps here too. The OpenTelemetry conventions define operation names for agent and tool actions, so "the agent executed this tool as this user" is a structured event, not a string buried in a log line.
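A minimal sketch of such a record, assuming an in-memory list and a SHA-256 hash chain so that any later edit to an entry is detectable. The `AuditTrail` class and its field names are hypothetical. The `execute_tool` value borrows an operation name from the OpenTelemetry conventions and should be checked against the version you pin.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


class AuditTrail:
    """In-memory, append-only list of audit entries linked by SHA-256 hashes."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def append(self, actor: str, operation: str, change: str) -> Dict[str, str]:
        prev_hash = self._entries[-1]["hash"] if self._entries else "0" * 64
        body = {
            "time": datetime.now(timezone.utc).isoformat(),
            "actor": actor,          # the identity the agent acted as
            "operation": operation,  # e.g. a tool execution
            "change": change,        # what was written
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry

    @staticmethod
    def _digest(body: Dict[str, str]) -> str:
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def verify(self) -> bool:
        """Recompute every hash and link; False means an entry was altered."""
        prev_hash = "0" * 64
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(body):
                return False
            prev_hash = entry["hash"]
        return True


trail = AuditTrail()
trail.append("user:alice", "execute_tool", "ticket 42 status set to closed")
trail.append("user:alice", "execute_tool", "ticket 42 comment added")
intact = trail.verify()
```

A hash chain makes tampering detectable; it does not prevent it. A real deployment would also need a store whose permissions forbid updates and deletes.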
The evidence a buyer asks for by name
Enterprise buyers rarely audit you themselves. They extend trust by reading standard reports. A SOC 2 report attests to your controls against the AICPA's Trust Services Criteria: security, which is always in scope, plus whichever of availability, processing integrity, confidentiality, and privacy you include. An ISO/IEC 42001 certificate attests to an AI management system specifically, under the first ISO standard aimed at one. The NIST AI Risk Management Framework, organized into the functions Govern, Map, Measure, and Manage, gives buyers the vocabulary they use in their questionnaires. Map each control you built to the report or the category that carries it, and note what you hold and what is still in progress.
Turn every gap into a row, not a surprise
Evidence you do not yet have is not something to hide. It is a backlog item with an owner and a date. A missing penetration test, an ISO certification in progress, and a sub-processor list that needs updating each become a row in the plan, visible before the security review rather than discovered inside it. The data processing agreement and the sub-processor list are part of that evidence set, and a buyer will ask for both.
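The register itself can be a very small data structure. The sketch below assumes a hypothetical `EvidenceRow` type with two status values; the items, owners, and dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class EvidenceRow:
    item: str
    status: str                 # "held" or "in_progress" (hypothetical status values)
    owner: Optional[str] = None
    due: Optional[date] = None


def open_gaps(rows: List[EvidenceRow]) -> List[EvidenceRow]:
    """Rows not yet held, earliest due date first; undated rows sort last."""
    gaps = [r for r in rows if r.status != "held"]
    return sorted(gaps, key=lambda r: (r.due is None, r.due or date.max))


def unowned(rows: List[EvidenceRow]) -> List[str]:
    """Gaps missing an owner or a date, which the plan should not allow."""
    return [r.item for r in rows if r.status != "held" and (r.owner is None or r.due is None)]


# Illustrative rows only; items, owners, and dates are invented for the example.
register = [
    EvidenceRow("Data processing agreement", "held"),
    EvidenceRow("Penetration test", "in_progress", owner="security lead", due=date(2030, 3, 1)),
    EvidenceRow("ISO/IEC 42001 certification", "in_progress", owner="compliance lead", due=date(2030, 9, 1)),
    EvidenceRow("Sub-processor list update", "in_progress"),
]
```

Here `open_gaps(register)` lists the three unfinished items in due-date order, and `unowned(register)` flags the sub-processor list as a gap that still lacks an owner and a date.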
Build the proof into the plan
Logging, audit, and evidence are not a compliance chore at the end. They are half of what the client's security and platform teams review, and they are cheaper to build as you go than to reconstruct under deadline. The workshop produces this as its own section, the client-facing half of a packet whose other half is the engineering runbook, so the tool you build and the proof you show never drift apart.