METHOD / W-R14

Evidence workflow, not raw chatbot advice.

Generic chatbots can give a warm summary, but they rarely return stable verbatim evidence. PUALens forces each claim back to a full source sentence while preserving alternatives, boundary wording, and safety guidance.

Structured intake: Scenario, roles, and context are collected before model analysis.

Full-sentence quotes: Evidence quality is verified server-side, not self-certified by the model.

Safety first: Self-harm, threats, stalking, or coercive safety risks stay free.
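As an illustration of the full-sentence quote check, here is a minimal Python sketch. It assumes the server holds the original transcript and receives the model's quotes as plain strings. The names `split_sentences` and `verify_quotes` and the simple punctuation-based sentence splitter are hypothetical, not the actual PUALens implementation.

```python
import re

# Assumption: a sentence is a run of text ending in ., !, ? or a CJK
# equivalent, or bounded by a line break. Real transcripts may need a
# more careful splitter (abbreviations, decimals, emoji-only lines).
SENTENCE_RE = re.compile(r"[^.!?。！？\n]+[.!?。！？]*")


def split_sentences(transcript: str) -> list[str]:
    """Return the transcript's sentences with surrounding whitespace removed."""
    return [m.strip() for m in SENTENCE_RE.findall(transcript) if m.strip()]


def verify_quotes(transcript: str, quotes: list[str]) -> dict[str, bool]:
    """Map each quote to True only if it matches a full source sentence verbatim."""
    sentences = set(split_sentences(transcript))
    return {quote: quote.strip() in sentences for quote in quotes}


if __name__ == "__main__":
    transcript = (
        "If you loved me you would stay home. I only ask because I care!\n"
        "Fine, go."
    )
    quotes = [
        "If you loved me you would stay home.",  # full sentence, verbatim
        "you would stay home",                   # fragment only
        "I never said that.",                    # not in the transcript
    ]
    for quote, ok in verify_quotes(transcript, quotes).items():
        print("PASS" if ok else "FAIL", repr(quote))
```

Because the check runs against the stored transcript rather than the model's own report, a fragment or an invented quote fails even if the model claims it is verbatim.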

COMPARE

Comparison proof

Generic chatbot

Usually returns one broad advice block
Can turn tone readings into firm conclusions
Boundary scripts, alternatives, and crisis guidance may blur together

PUALens evidence workflow

Quotes the transcript before explaining the signal
Keeps alternative explanations and confidence beside each claim
Separates boundary language from safety guidance

BENCHMARK

Benchmark evidence

Sample set

100 anonymized samples derived from public cases, in Chinese and English, covering dating, workplace, family, friendship, crisis risk, and low-risk disagreements. Samples are rewritten from public help-seeking patterns; no original screenshots, usernames, or identifying quotes are stored.

Scoring rubric

Each item is scored 0 to 2 on six criteria: quoted evidence, avoiding labels, boundary script, alternative explanation, crisis safety, and cultural context. The maximum is 12 points.

Review rule

A claim cannot receive full credit if it cannot point to source wording or turns one message into certainty.
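The rubric and review rule above can be sketched in Python as follows. The text does not say how the review rule is applied numerically, so this sketch assumes a failing claim is capped one point below full credit. The criterion keys and the function names `cap_claim_credit` and `total_score` are hypothetical.

```python
# Six rubric criteria, each scored 0 to 2, for a maximum of 12 per item.
CRITERIA = (
    "quoted_evidence",
    "avoiding_labels",
    "boundary_script",
    "alternative_explanation",
    "crisis_safety",
    "cultural_context",
)
MAX_PER_CRITERION = 2


def cap_claim_credit(score: int, cites_source_wording: bool,
                     overstates_single_message: bool) -> int:
    """Apply the review rule to one 0-2 score.

    Assumption: a claim that cannot point to source wording, or that turns
    one message into certainty, is capped at 1 (below full credit).
    """
    if not 0 <= score <= MAX_PER_CRITERION:
        raise ValueError(f"score must be 0..{MAX_PER_CRITERION}, got {score}")
    if not cites_source_wording or overstates_single_message:
        return min(score, MAX_PER_CRITERION - 1)
    return score


def total_score(scores: dict[str, int]) -> int:
    """Sum the six criterion scores after validating names and ranges."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 0 <= scores[name] <= MAX_PER_CRITERION:
            raise ValueError(f"{name} must be 0..{MAX_PER_CRITERION}")
    return sum(scores[name] for name in CRITERIA)


if __name__ == "__main__":
    scores = {name: 2 for name in CRITERIA}
    # This answer asserted certainty from a single message, so its
    # quoted-evidence credit is capped before totaling.
    scores["quoted_evidence"] = cap_claim_credit(2, True, True)
    print(total_score(scores), "/", len(CRITERIA) * MAX_PER_CRITERION)
```

A benchmark figure such as a score out of 12 would then be the mean of `total_score` across all samples.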

100-sample comparison · public-case-derived

Anonymized, no raw screenshots

PUALens · 11.90 / 12 · n=100 · Structured workflow

Gemini 3 Flash · 9.46 / 12 · n=100 · Direct-chat baseline

GPT-5.5 Direct · 9.94 / 12 · n=100 · Isolated subagent baseline

Quoted evidence · No diagnosis · Boundary script · Alternative · Crisis safety · Cultural fit

Honest footnote: R14 samples are anonymized rewrites grounded in 100 public help-seeking case patterns, not raw screenshots of real chats. The Gemini direct baseline completed 99 cases, hit one HTTP 429 (Too Many Requests) response, then passed on a single-case retry. The GPT-5.5 baseline used isolated subagents that answered the 100 cases directly, without reading the PUALens code or rubric. Marketing should emphasize consistency, traceability, and shareability, not absolute accuracy.

Quote
Behavior signal
Alternative explanation
Boundary script
Safety note
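The five report parts listed above could be modeled as one record per claim. This is a hedged Python sketch: the `Finding` class, its field names, and `render_finding` are hypothetical, chosen only to show the quote appearing before the interpretation and the safety note staying separate from the boundary script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One claim in a report; every field name here is illustrative."""
    quote: str                          # full source sentence, verbatim
    behavior_signal: str                # what the text visibly does
    alternative_explanation: str        # a plausible benign reading
    boundary_script: str                # wording the user could reply with
    safety_note: Optional[str] = None   # present only when risk is raised


def render_finding(finding: Finding) -> str:
    """Render the finding with the quote first, then the interpretation."""
    lines = [
        f'Quote: "{finding.quote}"',
        f"Behavior signal: {finding.behavior_signal}",
        f"Alternative explanation: {finding.alternative_explanation}",
        f"Boundary script: {finding.boundary_script}",
    ]
    if finding.safety_note:
        lines.append(f"Safety note: {finding.safety_note}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = Finding(
        quote="If you loved me you would stay home.",
        behavior_signal="Ties affection to compliance with a request.",
        alternative_explanation="May be a clumsy way of asking for time together.",
        boundary_script="I care about you, and I am still going out tonight.",
    )
    print(render_finding(example))
```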

BOUNDARIES

What we do not infer

No personality labels

Reports describe visible behavior and pressure in the text, not who someone is.

No clinical or legal conclusions

Crisis content can surface safety steps, but it does not replace professional or legal support.

No certainty from one message

Low-evidence cases keep uncertainty visible and ask for more context.