About CADRE
CADRE ("Catholic Alignment, Doctrine, and Reasoning Evaluation") evaluates how well AI language models understand and articulate Catholic teaching. The benchmark tests models across the four pillars of the Catechism, emphasizing dogmatic teachings that form the foundation of Catholic faith.
Our methodology incorporates the hierarchy of truths, weighting questions by their level of magisterial authority: Dogma (Level 1), Definitive Doctrine (Level 2), Authentic Magisterium (Level 3), and Prudential Judgments (Level 4).
CADRE uses two question variants to distinguish factual retrieval from native reasoning patterns. Explicit questions test whether models can accurately retrieve Catholic teaching when directly asked. Implicit questions test whether Catholic reasoning emerges naturally in response to neutrally phrased questions that can be answered from multiple theological perspectives. This dual approach distinguishes a model that merely possesses Catholic knowledge from one that exhibits Catholic reasoning as its native mode of ethical and theological analysis.
Model Leaderboard
| Rank | Model | Provider | Overall Score | The Profession of Faith | The Celebration of the Christian Mystery | Life in Christ | Christian Prayer |
|---|---|---|---|---|---|---|---|
| 1 | Magisterium 1 (`magisterium-1`) | Magisterium | 96.7% | 99.0% | 95.4% | 94.2% | 98.0% |
| 2 | Grok 4 Fast (`grok-4-fast`) | xAI | 92.4% | 88.4% | 100.0% | 88.6% | 94.3% |
| 3 | Claude Sonnet 4.5 (`claude-sonnet-4.5`) | Anthropic | 85.9% | 83.9% | 87.8% | 85.3% | 87.3% |
| 4 | Hermes 4 405B (`hermes-4-405b`) | Nous Research | 77.5% | 78.8% | 76.9% | 73.7% | 80.9% |
| 5 | Claude 3 Haiku (`claude-3-haiku`) | Anthropic | 77.3% | 76.4% | 82.8% | 69.7% | 81.8% |
| 6 | GPT-4 (`gpt-4`) | OpenAI | 71.5% | 71.0% | 73.4% | 64.7% | 78.6% |
Cost / Performance Analysis
Cost Metrics
USD per 1M tokens (average of input and output pricing). Lower-cost models suit high-volume applications; higher-cost models tend to deliver greater theological precision.
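For reference, the blended figure is just the arithmetic mean of the input and output prices. A minimal sketch (the prices in the example are placeholders, not real provider pricing):

```python
def blended_cost_usd_per_1m(input_price: float, output_price: float) -> float:
    """Average of input and output prices, both in USD per 1M tokens."""
    return (input_price + output_price) / 2

# Hypothetical prices, for illustration only.
print(blended_cost_usd_per_1m(3.00, 15.00))  # 9.0 USD per 1M tokens
```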
Pareto Frontier
Models on the frontier deliver the highest score available at or below their price point (the upper-left region of the chart). Mission-critical theological applications may justify premium pricing.
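One way to identify that frontier programmatically: a model is Pareto-optimal if no other model costs no more while scoring at least as well. A sketch (the model points below are made up for illustration):

```python
def pareto_frontier(models: list[dict]) -> list[dict]:
    """Return models not dominated by any other model.

    A model is dominated if another model costs no more and scores
    at least as well, with a strict improvement on at least one axis.
    """
    frontier = []
    for m in models:
        dominated = any(
            o["cost"] <= m["cost"]
            and o["score"] >= m["score"]
            and (o["cost"] < m["cost"] or o["score"] > m["score"])
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m["cost"])

# Hypothetical (cost, score) points for illustration.
models = [
    {"name": "A", "cost": 1.0, "score": 77.0},
    {"name": "B", "cost": 6.0, "score": 93.0},
    {"name": "C", "cost": 8.0, "score": 86.0},  # dominated by B
]
print([m["name"] for m in pareto_frontier(models)])  # ['A', 'B']
```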
Methodology
The benchmark consists of 50 questions across the four pillars of the Catechism—Creed (15 questions), Sacraments (12), Moral Life (13), and Prayer (10)—weighted by the hierarchy of truths: 56% dogma (divinely revealed truths), 26% definitive doctrine (magisterial teaching), and 18% authentic magisterium (authoritative teaching).
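A minimal sketch of how per-question weighting by hierarchy level could work (the weight values and function below are illustrative assumptions, not CADRE's published implementation):

```python
# Hypothetical per-level weights; CADRE's actual multipliers are not specified here.
LEVEL_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}  # dogma, definitive doctrine, authentic magisterium

def overall_score(results: list[tuple[int, float]]) -> float:
    """Weighted mean of (hierarchy_level, score_in_[0,1]) pairs."""
    total_weight = sum(LEVEL_WEIGHTS[level] for level, _ in results)
    earned = sum(LEVEL_WEIGHTS[level] * score for level, score in results)
    return earned / total_weight

# Two dogma questions and one authentic-magisterium question, for illustration.
print(overall_score([(1, 1.0), (1, 0.5), (3, 0.8)]))  # ~0.757
```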
Each question has two variants testing distinct capabilities. Explicit questions assess whether models can retrieve Catholic teaching when directly asked, testing precise doctrinal knowledge, theological terminology, and citation ability (e.g., "What is the Catholic Church's teaching on the Holy Trinity?"). Implicit questions evaluate whether Catholic reasoning emerges without prompting; they are scored leniently on fine detail but strictly on the model's native alignment patterns (e.g., "What is the relationship between the Father, Son, and Holy Spirit?"). This dual approach reveals whether a model merely has Catholic knowledge in its training data or exhibits Catholic reasoning as its default worldview.
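A minimal sketch of how a dual-variant question might be represented, reusing the Trinity example above (the field names are assumptions, not CADRE's schema):

```python
from dataclasses import dataclass

@dataclass
class Question:
    pillar: str    # e.g. "Creed"
    level: int     # hierarchy of truths: 1 = dogma, 2 = definitive, 3 = authentic
    explicit: str  # direct retrieval prompt, scored strictly
    implicit: str  # neutral prompt, scored leniently on detail

trinity = Question(
    pillar="Creed",
    level=1,
    explicit="What is the Catholic Church's teaching on the Holy Trinity?",
    implicit="What is the relationship between the Father, Son, and Holy Spirit?",
)
```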
An LLM-as-judge (Claude Opus 4.1) evaluates responses against structured rubrics. Each question carries 3-5 weighted criteria, divided into required and optional (failing any required criterion zeroes the question's score), plus a reference answer citing magisterial sources (CCC, councils, encyclicals). The judge assesses theological precision, factual accuracy, and absence of error.
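A sketch of the scoring rule those rubrics imply, assuming the judge emits a pass/fail verdict per criterion (the types and names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str
    weight: float
    required: bool = False

def rubric_score(criteria: list[Criterion], passed: list[bool]) -> float:
    """Weighted score in [0, 1]; any failed required criterion zeroes the question."""
    if any(c.required and not ok for c, ok in zip(criteria, passed)):
        return 0.0
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, ok in zip(criteria, passed) if ok)
    return earned / total

# Illustration: the required criterion passes, one optional criterion fails.
criteria = [
    Criterion("States three Persons, one God", weight=2.0, required=True),
    Criterion("Uses 'consubstantial' or equivalent", weight=1.0),
    Criterion("Cites CCC or a council", weight=1.0),
]
print(rubric_score(criteria, [True, True, False]))  # 0.75
```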
Roadmap
- 50 questions, base model evaluation, public leaderboard
- 500+ questions across all hierarchy levels, 10+ models, granular categories
- Evaluation of AI assistants as products (ChatGPT, Claude), with their tools and context
- Theological expert panel, human grading interface, judge agreement analysis
- Question creation tool, rubric editor, community portal, public API
