1. Overview

KORA Apps is a child-safety benchmark for consumer AI products. It evaluates apps from the point of view of a child using them, through the native UI, with the default system prompts, content filters, and product safeguards in play, without special configuration. Two age ranges are tested: 10–12 and 13–17.

Every app is evaluated against the 26 child-safety risks of the KORA Risk Taxonomy, along two arms:

What the app says	Conversation arm. A child LLM holds a multi-turn conversation with the app, and an LLM judge scores each conversation 0, 1, or 2. See the KORA Benchmark Methodology.
What the app does	Product checks. 22 audits of the app's observable design (app-store listing, onboarding, account settings, UI). They test what the app is, not what it says.

The two arms are then aggregated into a final score out of 100 and a grade A–E.

2. App corpus

We evaluate a curated corpus of 11 AI-native conversational products, selected by crossing three criteria: reach, daily time on platform, and how representative the app's risk profile is.

Batch 1 - List of 11 apps

3. Conversation arm: what the app says

For each of the 26 risks and each age range, scenarios are generated, and conversations are simulated between an agent role-playing the child and the target app. The app is tested in its default mode (out of the box, with no extra guardrails and no child-persona setup unless the app requires one at sign-up), so the score reflects what a child actually encounters.

We ran four conversations per risk: 4 × 26 risks = 104 conversations per app.
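The count above can be checked by enumerating the test plan. The following is a minimal sketch, not the benchmark's actual tooling: the risk labels R1 to R26 and the function name build_plan are hypothetical, and the sketch does not model how the four conversations per risk are distributed across the two age ranges, since that is not specified here.

```python
from itertools import product

NUM_RISKS = 26              # risks in the KORA Risk Taxonomy
CONVERSATIONS_PER_RISK = 4  # conversations run per risk, per app


def build_plan(num_risks=NUM_RISKS, per_risk=CONVERSATIONS_PER_RISK):
    """Return one (risk_id, conversation_index) pair per planned conversation.

    Risk labels "R1".."R26" are placeholders for illustration only.
    """
    risk_ids = [f"R{i}" for i in range(1, num_risks + 1)]
    return list(product(risk_ids, range(1, per_risk + 1)))


plan = build_plan()
print(len(plan))  # 104
```

Each pair in the plan corresponds to one simulated conversation that the LLM judge then scores.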

The full methodology to generate and judge conversations is documented in the KORA Benchmark Methodology. KORA Apps reuses the same scenario generator, the same child agent and the same LLM judge as the models benchmark, with one adaptation: the target model is the app's real interface.

4. Product arm: what the app does

We audited each app's observable surface: app-store listing, privacy policy, terms of service, onboarding, account settings, and live interface. We ran the audit on a minor account and compared it against an adult account.

This produced 22 checks (P1 to P22), each scored 0, 1, or 2 against a fixed rubric.
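One way to record the results of such an audit is sketched below. It is an illustration under assumptions, not the benchmark's own data model: the names CheckResult and is_complete are hypothetical, and the only facts taken from the text are the check identifiers P1 to P22 and the allowed scores 0, 1, and 2.

```python
from dataclasses import dataclass

NUM_CHECKS = 22
VALID_IDS = {f"P{i}" for i in range(1, NUM_CHECKS + 1)}  # P1 .. P22
VALID_SCORES = (0, 1, 2)


@dataclass(frozen=True)
class CheckResult:
    """Score for a single product check (hypothetical record type)."""
    check_id: str
    score: int

    def __post_init__(self):
        # Reject identifiers outside P1..P22 and scores outside 0/1/2.
        if self.check_id not in VALID_IDS:
            raise ValueError(f"unknown check id: {self.check_id}")
        if self.score not in VALID_SCORES:
            raise ValueError(f"score must be 0, 1 or 2, got {self.score}")


def is_complete(results):
    """True when every check P1..P22 appears exactly once."""
    ids = [r.check_id for r in results]
    return len(ids) == NUM_CHECKS and set(ids) == VALID_IDS


# Example audit with placeholder scores (not real results).
audit = [CheckResult(f"P{i}", 1) for i in range(1, NUM_CHECKS + 1)]
print(is_complete(audit))
```

Validating at construction time keeps malformed entries, such as a score of 3 or a check called P23, out of any later aggregation step.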

The 22 product checks — name, why it matters, scoring
Product check example — P8 Therapeutic / clinical claims audit

5. The four categories of risks

The 26 risks and the 22 product checks are grouped into four categories derived from the KORA Risk Taxonomy.

Mapping categories to risks and product checks