KORA Apps is a child-safety benchmark for consumer AI products. It evaluates apps from the point of view of a child using them, through the native UI, with the default system prompts, content filters, and product safeguards in play, without special configuration. Two age ranges are tested: 10–12 and 13–17.
Every app is evaluated against the 26 child-safety risks of the KORA Risk Taxonomy, along two arms:
| What the app says | Conversation arm. A child LLM holds a multi-turn conversation with the app, and an LLM judge scores each conversation 0, 1, or 2. See the KORA Benchmark Methodology. |
|---|---|
| What the app does | Product checks. 22 audits of the app's observable design (app-store listing, onboarding, account settings, UI). It tests what the app is, not what it says. |
The two arms are then aggregated into a final score out of 100 and a grade A–E.
We evaluate a curated corpus of 11 AI-native conversational products, selected against three criteria: reach, daily time on platform, and how representative each product's risk profile is.
For each of the 26 risks and each age range, scenarios are generated, and conversations are simulated between an agent role-playing the child and the target app. The app is tested in its default mode (out of the box, with no extra guardrails and no child-persona setup unless one is mandatory at sign-up, where sign-up is required), so the score reflects what a child actually encounters.
We ran the test with four conversations per risk, so 4 × 26 risks = 104 conversations per app.
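The conversation count above can be checked with a minimal sketch that enumerates one entry per simulated conversation. The risk identifiers `R1` to `R26` and the function `build_plan` are hypothetical names chosen for illustration; they are not part of the KORA tooling. The text does not say how the four conversations per risk are split across the two age ranges, so the sketch leaves age range out.

```python
from itertools import product

NUM_RISKS = 26               # number of risks, from the text
CONVERSATIONS_PER_RISK = 4   # conversations per risk, from the text

# Hypothetical identifiers; the taxonomy's real risk names are not used here.
risk_ids = [f"R{i}" for i in range(1, NUM_RISKS + 1)]


def build_plan(risks, per_risk):
    """Return one (risk_id, conversation_index) pair per simulated conversation."""
    return list(product(risks, range(1, per_risk + 1)))


plan = build_plan(risk_ids, CONVERSATIONS_PER_RISK)
print(len(plan))  # 104
```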
The full methodology to generate and judge conversations is documented in the KORA Benchmark Methodology. KORA Apps reuses the same scenario generator, the same child agent and the same LLM judge as the models benchmark, with one adaptation: the target model is the app's real interface.
We audited each app's observable surface: app-store listing, privacy policy, terms of service, onboarding, account settings, and live interface. We ran the audit on a minor account and compared it against an adult account.
This produced 22 checks (P1 to P22), each scored 0, 1, or 2 against a fixed rubric.
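As an illustration of the shape of one app's audit result, the sketch below validates that every check from P1 to P22 is present and carries a score of 0, 1, or 2. The names `validate_audit` and `raw_total` are hypothetical, and the raw total is only a sanity check; it is an assumption made for illustration, not the aggregation the benchmark uses for its final score.

```python
VALID_SCORES = {0, 1, 2}                      # rubric values, from the text
CHECK_IDS = [f"P{i}" for i in range(1, 23)]   # P1 .. P22, from the text


def validate_audit(scores):
    """Return scores unchanged, or raise ValueError if any check is missing,
    unknown, or scored outside 0, 1, 2."""
    missing = [c for c in CHECK_IDS if c not in scores]
    unknown = [c for c in scores if c not in CHECK_IDS]
    bad = {c: s for c, s in scores.items() if s not in VALID_SCORES}
    if missing or unknown or bad:
        raise ValueError(
            f"missing={missing} unknown={unknown} out_of_range={bad}"
        )
    return scores


def raw_total(scores):
    """Sum of the validated check scores (illustrative only, not the final score)."""
    return sum(validate_audit(scores).values())


example = {c: 1 for c in CHECK_IDS}           # made-up scores for the demo
print(raw_total(example), "of", 2 * len(CHECK_IDS))  # 22 of 44
```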
The 26 risks and the 22 product checks are grouped into four categories, derived from the KORA Risk Taxonomy.