Safety Checklist for Consumer Facing AI Characters

Before any of this, the team should run a design sprint to develop the character itself, covering what it is, how it behaves, who it serves, and how it sounds. That work has its own process. This document picks up from there.

It is specifically for thinking about safety, and it is the checklist I want us to work through before a consumer facing character goes live in front of real users. Some of this happens before the safety workshop, some of it is the workshop itself, and the rest runs after launch and keeps running. Work top to bottom.

Items marked with an asterisk are specific to Figurate. There is a short note on each one at the end of the doc.

1. Decide what the character is for

A character with a narrow, well defined job is much safer. Pin this down before the workshop so the team has something concrete to work against.

Write down what the character is for and what it will never do.
Decide how the character behaves when a user pushes it off topic.
Define who the audience is.
Decide whether minors can reach the character. If they can, treat that as its own set of rules and revisit every step below with that in mind.
Name one person who owns safety for this character.

2. Run the workshop

With the team, identify the things the character should not do or say because they go against the company brand.
Identify the risks for this specific character and audience.
Write the procedure for when there is an incident. Cover severity levels, who gets called, and who talks to users.
Identify the compliance requirements you need to meet. NIST AI RMF and the EU AI Act are good starting references if you serve those markets.
Decide whether extra guard rails need to be developed beyond the standard set.
List the specific harm areas to cover: crisis and self harm; medical, legal, or financial advice; harassment and hate; illegal activity; misinformation; and minor safety.

3. Write the guard rails

Write guard rails together that are specific to the organization and its character.
Build the guard rails into the Director* and policy layer. Prompt text on its own gets worked around by users, so it cannot be the boundary.
For each harm area, write what the character should actually do, including the words it uses.
Design a crisis path. Decide what the character does when someone is in distress or talks about self harm.
Decide how the character handles emotional dependency. A character people talk to often can become something they lean on, so decide how it sets limits and what it says.
Make sure the character always discloses that it is an AI and never claims to be a person.
Review the knowledge base* content for accuracy and bias. The character repeats that material as if it is true.
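
To make the point about prompt text concrete, here is a minimal sketch of a guard rail check that runs in code before the model is called. It is not how the Director works internally. Every name in it (GuardRail, check_guard_rails, handle_turn) is hypothetical, the trigger phrases are placeholders for a real classifier, and the response wording is a stand-in for the words the team writes in the workshop.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GuardRail:
    harm_area: str
    triggers: tuple[str, ...]  # placeholder phrases; a real system would use a classifier
    response: str              # the exact words the character uses (placeholder wording)


# Hypothetical rules for two of the harm areas listed in the workshop step.
GUARD_RAILS = (
    GuardRail(
        harm_area="crisis and self harm",
        triggers=("hurt myself", "end my life"),
        response=(
            "I'm an AI, and I'm worried about what you just said. "
            "Please reach out to a local crisis line or someone you trust."
        ),
    ),
    GuardRail(
        harm_area="medical advice",
        triggers=("what dose", "diagnose"),
        response="I can't give medical advice. A doctor or pharmacist is the right person to ask.",
    ),
)


def check_guard_rails(user_message: str) -> Optional[GuardRail]:
    """Return the first guard rail the message trips, or None if it trips none."""
    text = user_message.lower()
    for rail in GUARD_RAILS:
        if any(trigger in text for trigger in rail.triggers):
            return rail
    return None


def handle_turn(user_message: str, generate_reply: Callable[[str], str]) -> str:
    """Apply the policy check first; only call the model if no guard rail fires."""
    rail = check_guard_rails(user_message)
    if rail is not None:
        return rail.response
    return generate_reply(user_message)
```

The design choice that matters is the order in handle_turn. The scripted response is returned without the model seeing the message, so no amount of clever wording in the conversation can change it.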

4. Pick the technology and use the platform safely

Select technology that meets the safety requirements you identified in the workshop.
Use the ledger and memory receipts* as your audit trail. Every important decision should leave a receipt.
Keep memory writes going through the review path*.
Write a retention and deletion policy for what the character remembers about users. Pay attention to core identity facts*, since those are the sensitive ones.
Confirm memory* is scoped per user so one person's input cannot leak into another person's character.
Set up rate limiting and abuse detection.
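
As one way to approach the rate limiting item, here is a small sliding window limiter keyed by user. It is a sketch under assumptions, not a Figurate feature: the class name, the limits, and the injectable clock are all made up for illustration, and a production version would need shared storage rather than an in-process dictionary.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_messages per user within any window_seconds span."""

    def __init__(
        self,
        max_messages: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._clock = clock                  # injectable so the limiter can be tested
        self._events = defaultdict(deque)    # user_id -> timestamps of allowed messages

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_messages:
            return False
        events.append(now)
        return True
```

A call such as SlidingWindowLimiter(max_messages=20, window_seconds=60).allow(user_id) would sit in front of each turn. Repeated refusals for the same user are also a cheap signal to feed into abuse detection.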

5. Test the character

Decide the pass and fail thresholds before testing starts.
Test the character against the guard rails.
Run simulated conversations that cover the audience and the risks identified in the workshop.
Run jailbreak and prompt injection tests.
Test for bias across different kinds of users.
Test whether a user can poison what the character learns.
Test long conversations for drift. Characters degrade over a long session.
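
Deciding thresholds before testing starts is easier to hold to if they live in code. The sketch below assumes each test suite produces a list of pass or fail outcomes. The suite names and the numbers are placeholders, not recommendations, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    suite: str
    min_pass_rate: float  # fraction of cases that must pass, between 0 and 1


# Placeholder values; the team sets the real ones before any test is run.
THRESHOLDS = (
    Threshold("guard_rails", 1.00),
    Threshold("jailbreak", 0.99),
    Threshold("bias", 0.95),
)


def evaluate(results: dict[str, list[bool]]) -> dict[str, bool]:
    """Return a pass or fail verdict per suite. A suite with no results fails."""
    verdicts = {}
    for threshold in THRESHOLDS:
        outcomes = results.get(threshold.suite, [])
        pass_rate = sum(outcomes) / len(outcomes) if outcomes else 0.0
        verdicts[threshold.suite] = pass_rate >= threshold.min_pass_rate
    return verdicts


def ready_to_launch(results: dict[str, list[bool]]) -> bool:
    """The character is ready only if every suite meets its threshold."""
    return all(evaluate(results).values())
```

Treating a missing suite as a failure is deliberate. It stops a skipped test from being read as a passed one.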

6. Launch

Roll out in stages. Start with a small beta and ramp up slowly.
Keep a person reviewing transcripts and receipts* every day for the first stretch.
Give users a visible way to report bad behavior.
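
A staged rollout needs a stable way to decide who is in the current stage. One common approach, sketched here with hypothetical function names, is to hash the user id into a bucket from 0 to 99 and compare it to the rollout percentage. This is an illustration of the idea, not a description of how our platform gates access.

```python
import hashlib


def rollout_bucket(user_id: str, character_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given character."""
    digest = hashlib.sha256(f"{character_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, character_id: str, percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, character_id) < percent
```

Because the bucket depends only on the ids, a user who is in the beta at 5 percent is still in it at 25 percent, and setting the percentage to 0 removes everyone at once.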

7. Run it after launch

Set up analytics, and review the receipts* and flagged content on a regular schedule.
Build the ability to off ramp the character if there is an incident. Falling back to legacy mode* and disabling the voice lane* on its own are both options short of a full shutdown.
Re-test after any change to the flow*, knowledge base*, or prompt. Any of these can change how the character behaves.
Watch for provider drift. A model update from OpenRouter or ElevenLabs can change behavior even when we changed nothing.
Re-check the whole checklist on a fixed schedule.

8. Write it down

Write a character card that records what the character is for, its limits, and the risks we know about.
Have legal review the terms of use and disclaimers.
Keep records so we can show what the character did and why.

* A note on the Figurate systems

Figurate is the tool we use to build, test, and run AI characters. The items marked with an asterisk above are parts of it. Here is what they mean if you are reading this from outside the team.

The Director is the runtime control plane. It reads what the user is trying to do and decides how the conversation gets handled. This is where the real guard rails belong, because it sits below the character's prompt and a user cannot talk their way around it.

Legacy mode is the Off setting for the Director. It is the simpler fallback we can drop back to if something goes wrong.

The knowledge base is the reference material we write for a character to draw on. The character treats it as true, so it has to be accurate.

Memory is what the character learns about a user over time. It is held in five tiers, from stable identity facts down to passing context, and it is scoped per user so it does not leak between people.

The review path is the step that new memory writes pass through before they are kept, so the character does not quietly learn things unchecked.
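
Per user scoping and the review path can be pictured together with a small sketch. This is not Figurate's memory implementation and it ignores the five tiers. The class, its methods, and the reviewer callback are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PendingWrite:
    user_id: str
    fact: str


class ReviewedMemory:
    """Keeps facts per user, and only after a reviewer approves each write."""

    def __init__(self, reviewer: Callable[[PendingWrite], bool]):
        self._reviewer = reviewer
        self._kept: dict[str, list[str]] = defaultdict(list)

    def propose(self, user_id: str, fact: str) -> bool:
        """Send a write through review. Returns True only if it was kept."""
        write = PendingWrite(user_id, fact)
        if not self._reviewer(write):
            return False
        self._kept[user_id].append(fact)
        return True

    def recall(self, user_id: str) -> list[str]:
        """Return a copy of one user's facts. Other users are never visible."""
        return list(self._kept.get(user_id, []))

    def forget_user(self, user_id: str) -> int:
        """Delete everything held for a user and return how many facts were removed."""
        return len(self._kept.pop(user_id, []))
```

Every read and write takes a user id, so there is no call that returns memory across users. The forget_user method is the hook a retention and deletion policy would use.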

Receipts and the ledger are the running record of important decisions and side effects. Every memory write and every routing decision can leave a receipt, and that is what lets us inspect the system after the fact instead of guessing from a transcript.
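
For readers who want a picture of what a receipt might hold, here is a minimal append only ledger. The field names, the kind values, and the JSON Lines export are assumptions made for this sketch. The real receipt format in Figurate may look quite different.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Receipt:
    kind: str         # hypothetical values, e.g. "memory_write" or "routing_decision"
    user_id: str
    detail: str
    recorded_at: str  # ISO 8601 timestamp in UTC


class Ledger:
    """Append only record of decisions. Receipts are never edited or removed here."""

    def __init__(self) -> None:
        self._receipts: list[Receipt] = []

    def record(self, kind: str, user_id: str, detail: str) -> Receipt:
        receipt = Receipt(kind, user_id, detail, datetime.now(timezone.utc).isoformat())
        self._receipts.append(receipt)
        return receipt

    def for_user(self, user_id: str) -> list[Receipt]:
        """Return every receipt for one user, in the order they were recorded."""
        return [r for r in self._receipts if r.user_id == user_id]

    def export_jsonl(self) -> str:
        """Serialize the ledger as one JSON object per line for later review."""
        return "\n".join(json.dumps(asdict(r)) for r in self._receipts)
```

During an incident, for_user gives the reviewer the sequence of decisions behind a transcript, which is the difference between inspecting the system and guessing at it.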

The voice lane is the live voice way of reaching a character. It runs alongside the text lane and shares the same character identity and context.

The reason this matters for safety is simple. A lot of teams try to keep a character in line with prompt wording alone. In Figurate the guard rails, the audit trail, and the off ramp are real parts of the system, so the checklist above leans on them on purpose.