TEXT

Higher is better.

Capability | Benchmark | Description | CCAI-SUPER | GPT-4 (API numbers calculated where reported numbers were missing)
---|---|---|---|---
General | MMLU | Representation of questions in 57 subjects (incl. STEM, humanities, and others) | 90% CoT@32* | 86.4% 5-shot** (reported)
Reasoning | Big-Bench Hard | Diverse set of challenging tasks requiring multi-step reasoning | 83.6% 3-shot | 83.1% 3-shot (API)
Reasoning | DROP | Reading comprehension (F1 score) | 82.4 Variable shots | 80.9 3-shot (reported)
Reasoning | HellaSwag | Commonsense reasoning for everyday tasks | 87.8% 10-shot* | 95.3% 10-shot* (reported)
Math | GSM8K | Basic arithmetic manipulations (incl. Grade School math problems) | 94.4% maj1@32 | 92% 5-shot CoT (reported)
Math | MATH | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | 53.2% 4-shot | 52.9% 4-shot (API)
Code | HumanEval | Python code generation | 74.4% 0-shot (IT)* | 67% 0-shot* (reported)
Code | Natural2Code | Python code generation on a new held-out, HumanEval-like dataset not leaked on the web | 74.9% 0-shot | 73.9% 0-shot (API)
*See the technical report for details on performance with other methodologies.
**GPT-4 scores 87.29% with CoT@32; see the technical report for the full comparison.
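The @32 notations in the table (CoT@32 on MMLU, maj1@32 on GSM8K) denote self-consistency sampling: 32 chain-of-thought completions are drawn per question and the consensus final answer is the one scored. The sketch below is a minimal illustration of that majority-vote scheme under that reading, not the harness behind these numbers; `sample_answer` is a toy simulator standing in for a real model call.

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    """Toy stand-in for one sampled chain-of-thought model call.

    A real harness would prompt the model at temperature > 0, let it reason
    step by step, and parse the final answer out of the generated text; here
    we simulate a noisy solver that answers "42" about 70% of the time.
    """
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 99))

def maj1_at_k(question: str, k: int = 32, seed: int = 0) -> str:
    """maj1@k: draw k samples and keep the most common final answer."""
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(maj1_at_k("What is 6 * 7?"))  # the 32-sample consensus is almost surely "42"
```

The appeal of this scheme is that it trades roughly 32x the inference cost for robustness: as long as correct answers are sampled more often than any single wrong answer, the consensus converges to the correct one as k grows.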
CCAI-SUPER exceeds state-of-the-art performance on a range of multimodal benchmarks.
MULTIMODAL

Higher is better unless otherwise noted.

Capability | Benchmark | Description | CCAI-SUPER | GPT-4V (previous SOTA model listed when capability is not supported in GPT-4V)
---|---|---|---|---
Image | MMMU | Multi-discipline college-level reasoning problems | 59.4% 0-shot pass@1 (pixel only*) | 56.8% 0-shot pass@1
Image | VQAv2 | Natural image understanding | 77.8% 0-shot (pixel only*) | 77.2% 0-shot
Image | TextVQA | OCR on natural images | 82.3% 0-shot (pixel only*) | 78% 0-shot
Image | DocVQA | Document understanding | 90.9% 0-shot (pixel only*) | 88.4% 0-shot (pixel only)
Image | Infographic VQA | Infographic understanding | 80.3% 0-shot (pixel only*) | 75.1% 0-shot (pixel only)
Image | MathVista | Mathematical reasoning in visual contexts | 53% 0-shot (pixel only*) | 49.9% 0-shot
Video | VATEX | English video captioning (CIDEr) | 62.7 4-shot | 56 4-shot (DeepMind Flamingo)
Video | Perception Test MCQA | Video question answering | 54.7% 0-shot | 46.3% 0-shot (SeViLA)
Audio | CoVoST 2 (21 languages) | Automatic speech translation (BLEU score) | 40.1 (CCAI-SUPER 1.0 Pro) | 29.1 (Whisper v2)
Audio | FLEURS (62 languages) | Automatic speech recognition (word error rate; lower is better) | 7.6% (CCAI-SUPER 1.0 Pro) | 17.6% (Whisper v3)
*CCAI-SUPER image benchmarks are pixel only: no assistance from OCR systems.
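For context on the audio rows: BLEU (CoVoST 2) scores n-gram overlap with reference translations, while word error rate (FLEURS) is the word-level edit distance between the model transcript and the reference, normalized by reference length, which is why lower is better. Below is a minimal, generic WER implementation for illustration only; it is not the FLEURS scoring pipeline, and real ASR scoring typically normalizes casing and punctuation before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i  # deleting every remaining reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j  # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("see" -> "sea") in a five-word reference: WER = 0.2 (20%).
print(word_error_rate("i can see the sea", "i can sea the sea"))
```

Because WER divides by the reference length, it can exceed 100% when the hypothesis inserts many spurious words, which is one reason it is reported alongside the "lower is better" caveat.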