About CADRE
CADRE ("Catholic Alignment, Doctrine, and Reasoning Evaluation") evaluates how well AI language models understand and articulate Catholic teaching. The benchmark tests models across the four pillars of the Catechism, emphasizing dogmatic teachings that form the foundation of Catholic faith.
Our methodology incorporates the hierarchy of truths, weighting questions by their level of magisterial authority: Dogma (Level 1), Definitive Doctrine (Level 2), Authentic Magisterium (Level 3), and Prudential Judgments (Level 4).
CADRE uses two question variants to distinguish factual retrieval from native reasoning patterns. Explicit questions test whether models can accurately retrieve Catholic teaching when directly asked. Implicit questions test whether Catholic reasoning emerges naturally in response to neutrally phrased questions that can be answered from multiple theological perspectives. This dual approach distinguishes a model that merely possesses Catholic knowledge from one that exhibits Catholic reasoning as its native mode of ethical and theological analysis.
Model Leaderboard
| Rank | Model | Provider | Overall Score | The Profession of Faith | The Celebration of the Christian Mystery | Life in Christ | Christian Prayer |
|---|---|---|---|---|---|---|---|
| 1 | Magisterium 1 (`magisterium-1`) | Magisterium | 96.7% | 99.0% | 95.4% | 94.2% | 98.0% |
| 2 | Grok 4 Fast (`grok-4-fast`) | xAI | 92.4% | 88.4% | 100.0% | 88.6% | 94.3% |
| 3 | Claude Sonnet 4.5 (`claude-sonnet-4.5`) | Anthropic | 85.9% | 83.9% | 87.8% | 85.3% | 87.3% |
| 4 | Hermes 4 405B (`hermes-4-405b`) | Nous Research | 77.5% | 78.8% | 76.9% | 73.7% | 80.9% |
| 5 | Claude 3 Haiku (`claude-3-haiku`) | Anthropic | 77.3% | 76.4% | 82.8% | 69.7% | 81.8% |
| 6 | GPT-4 (`gpt-4`) | OpenAI | 71.5% | 71.0% | 73.4% | 64.7% | 78.6% |
Cost / Performance Analysis
Cost Metrics
USD per 1M tokens (average of input and output pricing). Lower-cost models suit high-volume applications; higher-cost models tend to deliver greater theological precision.
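For reference, the blended figure is just the arithmetic mean of the input and output prices. A minimal sketch (the prices in the example are placeholders, not real provider pricing):

```python
def blended_cost_usd_per_1m(input_price: float, output_price: float) -> float:
    """Average of input and output prices, both in USD per 1M tokens."""
    return (input_price + output_price) / 2

# Hypothetical prices, for illustration only.
print(blended_cost_usd_per_1m(3.00, 15.00))  # 9.0 USD per 1M tokens
```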
Pareto Frontier
Models on the frontier deliver the highest score available at or below their price point (the upper-left region of the chart). Mission-critical theological applications may justify premium pricing.
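One way to identify that frontier programmatically: a model is Pareto-optimal if no other model costs no more while scoring at least as well. A sketch (the model points below are made up for illustration):

```python
def pareto_frontier(models: list[dict]) -> list[dict]:
    """Return models not dominated by any other model.

    A model is dominated if another model costs no more and scores
    at least as well, with a strict improvement on at least one axis.
    """
    frontier = []
    for m in models:
        dominated = any(
            o["cost"] <= m["cost"]
            and o["score"] >= m["score"]
            and (o["cost"] < m["cost"] or o["score"] > m["score"])
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m["cost"])

# Hypothetical (cost, score) points for illustration.
models = [
    {"name": "A", "cost": 1.0, "score": 77.0},
    {"name": "B", "cost": 6.0, "score": 93.0},
    {"name": "C", "cost": 8.0, "score": 86.0},  # dominated by B
]
print([m["name"] for m in pareto_frontier(models)])  # ['A', 'B']
```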
Methodology
The benchmark consists of 50 questions across the four pillars of the Catechism—Creed (15 questions), Sacraments (12), Moral Life (13), and Prayer (10)—weighted by the hierarchy of truths: 56% dogma (divinely revealed truths), 26% definitive doctrine (magisterial teaching), and 18% authentic magisterium (authoritative teaching).
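A minimal sketch of how per-question weighting by hierarchy level could work (the weight values and function below are illustrative assumptions, not CADRE's published implementation):

```python
# Hypothetical per-level weights; CADRE's actual multipliers are not specified here.
LEVEL_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}  # dogma, definitive doctrine, authentic magisterium

def overall_score(results: list[tuple[int, float]]) -> float:
    """Weighted mean of (hierarchy_level, score_in_[0,1]) pairs."""
    total_weight = sum(LEVEL_WEIGHTS[level] for level, _ in results)
    earned = sum(LEVEL_WEIGHTS[level] * score for level, score in results)
    return earned / total_weight

# Two dogma questions and one authentic-magisterium question, for illustration.
print(overall_score([(1, 1.0), (1, 0.5), (3, 0.8)]))  # ~0.757
```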
Each question has two variants testing distinct capabilities. Explicit questions assess whether models can retrieve Catholic teaching when directly asked, testing precise doctrinal knowledge, theological terminology, and citation ability (e.g., "What is the Catholic Church's teaching on the Holy Trinity?"). Implicit questions evaluate whether Catholic reasoning emerges without prompting; they are scored leniently on fine detail but strictly on the model's native alignment patterns (e.g., "What is the relationship between the Father, Son, and Holy Spirit?"). This dual approach reveals whether a model merely has Catholic knowledge in its training data or exhibits Catholic reasoning as its default worldview.
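A minimal sketch of how a dual-variant question might be represented, reusing the Trinity example above (the field names are assumptions, not CADRE's schema):

```python
from dataclasses import dataclass

@dataclass
class Question:
    pillar: str    # e.g. "Creed"
    level: int     # hierarchy of truths: 1 = dogma, 2 = definitive, 3 = authentic
    explicit: str  # direct retrieval prompt, scored strictly
    implicit: str  # neutral prompt, scored leniently on detail

trinity = Question(
    pillar="Creed",
    level=1,
    explicit="What is the Catholic Church's teaching on the Holy Trinity?",
    implicit="What is the relationship between the Father, Son, and Holy Spirit?",
)
```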
An LLM-as-judge (Claude Opus 4.1) evaluates responses against structured rubrics. Each question carries 3-5 weighted criteria, divided into required and optional (failing any required criterion zeroes the question's score), plus a reference answer citing magisterial sources (CCC, councils, encyclicals). The judge assesses theological precision, factual accuracy, and absence of error.
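A sketch of the scoring rule those rubrics imply, assuming the judge emits a pass/fail verdict per criterion (the types and names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str
    weight: float
    required: bool = False

def rubric_score(criteria: list[Criterion], passed: list[bool]) -> float:
    """Weighted score in [0, 1]; any failed required criterion zeroes the question."""
    if any(c.required and not ok for c, ok in zip(criteria, passed)):
        return 0.0
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, ok in zip(criteria, passed) if ok)
    return earned / total

# Illustration: the required criterion passes, one optional criterion fails.
criteria = [
    Criterion("States three Persons, one God", weight=2.0, required=True),
    Criterion("Uses 'consubstantial' or equivalent", weight=1.0),
    Criterion("Cites CCC or a council", weight=1.0),
]
print(rubric_score(criteria, [True, True, False]))  # 0.75
```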
Roadmap
- 50 questions, base model evaluation, public leaderboard
- 500+ questions across all hierarchy levels, 10+ models, granular categories
- Evaluation of AI assistants as products (ChatGPT, Claude), with their tools and context
- Theological expert panel, human grading interface, judge agreement analysis
- Question creation tool, rubric editor, community portal, public API
