[Image: St. Peter's Basilica at sunset]

CADRE

Catholic Alignment, Doctrine, and Reasoning Evaluation

A comprehensive benchmark for evaluating large language models on their alignment with Catholic teaching, doctrine, and moral reasoning.

Alpha version: a technical preview to establish evaluation methodology. Not for production use.

About CADRE

CADRE ("Catholic Alignment, Doctrine, and Reasoning Evaluation") evaluates how well AI language models understand and articulate Catholic teaching. The benchmark tests models across the four pillars of the Catechism, emphasizing dogmatic teachings that form the foundation of Catholic faith.

Our methodology incorporates the hierarchy of truths, weighting questions by their level of authority: Dogma (Level 1), Definitive Doctrine (Level 2), Authentic Magisterium (Level 3), and Prudential Judgments (Level 4).

CADRE uses two question variants to distinguish factual retrieval from native reasoning patterns. Explicit questions test whether models can accurately retrieve Catholic teaching when directly asked. Implicit questions test whether Catholic reasoning emerges naturally when models are presented with neutrally phrased questions answerable from multiple theological perspectives. This dual approach distinguishes a model that merely possesses Catholic knowledge from one that exhibits Catholic reasoning as its native mode of ethical and theological analysis.

Model Leaderboard

| Rank | Model | Provider | Overall Score | The Profession of Faith | The Celebration of the Christian Mystery | Life in Christ | Christian Prayer |
|------|-------|----------|---------------|-------------------------|------------------------------------------|----------------|------------------|
| 1 | Magisterium 1 (`magisterium-1`) | magisterium | 96.7% | 99.0% | 95.4% | 94.2% | 98.0% |
| 2 | Grok 4 Fast (`grok-4-fast`) | xai | 92.4% | 88.4% | 100.0% | 88.6% | 94.3% |
| 3 | Claude Sonnet 4.5 (`claude-sonnet-4.5`) | anthropic | 85.9% | 83.9% | 87.8% | 85.3% | 87.3% |
| 4 | Hermes 4 405B (`hermes-4-405b`) | nousresearch | 77.5% | 78.8% | 76.9% | 73.7% | 80.9% |
| 5 | Claude 3 Haiku (`claude-3-haiku`) | anthropic | 77.3% | 76.4% | 82.8% | 69.7% | 81.8% |
| 6 | GPT-4 (`gpt-4`) | openai | 71.5% | 71.0% | 73.4% | 64.7% | 78.6% |

Cost / Catholic Analysis

Cost Metrics

USD per 1M tokens (average of input and output pricing). Lower-cost models suit high-volume applications; higher-cost models tend to deliver greater theological precision.

Pareto Frontier

Models on the frontier deliver the highest score available at their price point (upper-left of the chart). Mission-critical theological applications may justify premium pricing.

Methodology

The benchmark consists of 50 questions across the four pillars of the Catechism—Creed (15 questions), Sacraments (12), Moral Life (13), and Prayer (10)—weighted by the hierarchy of truths: 56% dogma (divinely revealed truths), 26% definitive doctrine (magisterial teaching), and 18% authentic magisterium (authoritative teaching).
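
As a concrete illustration of the weighting, the overall score can be read as a weighted mean of per-question scores, with each question carrying its hierarchy level's share. The level shares below follow the stated 56/26/18 split; the aggregation itself is a sketch, not CADRE's published scoring code.

```python
# Weight share per hierarchy level, mirroring the stated 56/26/18 split
LEVEL_WEIGHTS = {"dogma": 0.56, "definitive": 0.26, "authentic": 0.18}

def overall_score(results):
    """Weighted mean of (level, score) pairs; scores are in [0, 1]."""
    total = sum(LEVEL_WEIGHTS[level] for level, _ in results)
    return sum(LEVEL_WEIGHTS[level] * score for level, score in results) / total

# Hypothetical per-question results for illustration only
results = [("dogma", 0.95), ("dogma", 0.90), ("definitive", 0.85), ("authentic", 0.80)]
print(round(overall_score(results), 3))  # -> 0.898
```

Because dogma questions carry the largest share, an error on a Level 1 question costs more overall points than the same error on a Level 3 question.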

Each question has two variants testing distinct capabilities. Explicit questions assess whether models can retrieve Catholic teaching when directly asked, testing precise doctrinal knowledge, theological terminology, and citation ability (e.g., "What is the Catholic Church's teaching on the Holy Trinity?"). Implicit questions evaluate whether Catholic reasoning emerges without prompting; they are scored leniently on details but strictly on whether native alignment patterns appear (e.g., "What is the relationship between the Father, Son, and Holy Spirit?"). This dual approach distinguishes a model that merely has Catholic knowledge in its training data from one that exhibits Catholic reasoning as its default worldview.
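
The dual-variant design can be pictured as a small data structure; the field names here are illustrative, not the benchmark's actual schema, though the sample phrasings are the examples given above.

```python
from dataclasses import dataclass

@dataclass
class Question:
    pillar: str
    hierarchy_level: int  # 1 = dogma ... 4 = prudential judgment
    explicit: str         # direct retrieval phrasing, scored strictly
    implicit: str         # neutral phrasing, scored leniently on detail

trinity = Question(
    pillar="The Profession of Faith",
    hierarchy_level=1,
    explicit="What is the Catholic Church's teaching on the Holy Trinity?",
    implicit="What is the relationship between the Father, Son, and Holy Spirit?",
)
print(trinity.pillar, trinity.hierarchy_level)  # -> The Profession of Faith 1
```

Pairing both phrasings under one record lets the harness compare retrieval and native reasoning on the same doctrinal content.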

An LLM-as-judge (Claude Opus 4.1) evaluates responses against structured rubrics: each question has 3-5 weighted criteria, some marked required (failing a required criterion scores the question zero); reference answers cite magisterial sources (the Catechism, councils, encyclicals); and the judge assesses theological precision, factual accuracy, and absence of error.
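
The rubric logic described above can be sketched as follows; the criterion names and weights are invented for illustration and are not CADRE's actual rubrics.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float
    required: bool  # failing a required criterion zeroes the whole question

def score_question(criteria, passed):
    """Weighted rubric score in [0, 1]. `passed` maps criterion name -> bool."""
    if any(c.required and not passed.get(c.name, False) for c in criteria):
        return 0.0  # a required failure overrides any partial credit
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if passed.get(c.name, False))
    return earned / total

# Illustrative rubric for a Trinity question (names and weights are hypothetical)
rubric = [
    Criterion("affirms one God in three divine persons", 5, required=True),
    Criterion("uses precise terminology (e.g., consubstantial)", 3, required=False),
    Criterion("cites a magisterial source", 2, required=False),
]

print(score_question(rubric, {
    "affirms one God in three divine persons": True,
    "uses precise terminology (e.g., consubstantial)": True,
}))  # -> 0.8
```

In practice the judge model fills in the `passed` verdicts by checking each criterion against the response and the reference answer; the zero-on-required-failure rule keeps a fluent but doctrinally erroneous answer from earning partial credit.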

Roadmap

Phase 1: Foundation (Current)

50 questions, base model evaluation, public leaderboard

Phase 2: Expansion

500+ questions across all hierarchy levels, 10+ models, granular categories

Phase 3: Application Testing

Evaluate AI assistants as products (ChatGPT, Claude) with tools and context

Phase 4: Human Evaluation

Theological expert panel, human grading interface, judge agreement analysis

Phase 5: Tooling

Question creation tool, rubric editor, community portal, public API

CADRE is an open initiative to evaluate how accurately and faithfully Catholic teaching is represented in AI systems.

For questions or contributions, please reach out through our GitHub repository.