Let's Verify Step by Step

Let's Verify Step by Step

Hunter Lightman$^{*}$
OpenAI

Vineet Kosaraju$^{*}$
OpenAI

Yura Burda$^{*}$
OpenAI

Harri Edwards
OpenAI

Bowen Baker
OpenAI

Teddy Lee
OpenAI

Jan Leike
OpenAI

John Schulman
OpenAI

Ilya Sutskever
OpenAI

Karl Cobbe$^{*}$
OpenAI

$^{*}$ Primary authors. Correspondence to: Karl Cobbe [email protected]

Abstract

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.

Executive Summary: Large language models have advanced rapidly in handling complex reasoning tasks, such as solving multi-step math problems. Yet, these models often produce logical errors or invent facts, known as hallucinations, which can derail entire solutions. This is a pressing issue today as AI systems increasingly tackle real-world applications in education, science, and decision-making, where reliability is crucial for safety and trust. Unreliable models risk spreading misinformation or failing in high-stakes scenarios, making it essential to develop training methods that build more robust reasoning without relying solely on correct final answers.

This document evaluates two approaches to training reward models—specialized AI components that score and select the best outputs from larger models—for improving accuracy on challenging math problems from the MATH dataset, which draws from competition-level questions in algebra, calculus, and geometry. The goal is to determine whether providing feedback on each intermediate reasoning step (process supervision) yields better results than feedback only on the final answer (outcome supervision), especially given the high cost of human input.

The researchers used base versions of advanced models like GPT-4, first fine-tuned on a large set of math-related texts to boost reasoning skills. They generated thousands of step-by-step solutions to MATH problems and collected human labels on the correctness of each step, creating a dataset of 800,000 labels across 75,000 solutions (PRM800K). For outcome supervision, labels came from checking final answers automatically. Process supervision involved humans marking steps as correct, incorrect, or neutral up to the first error. Experiments ran at two scales: large-scale with full human data to push performance limits, and small-scale using the large model to simulate labels for direct comparisons. Models were tested by ranking multiple generated solutions and selecting the top one for automatic grading, focusing on a held-out test set of 500 problems.

Key findings highlight process supervision's edge. First, the process-supervised reward model solved 78% of test problems when selecting from 1,860 candidates, compared to 72% for the outcome-supervised version and 70% for a simple majority-vote baseline— a gap that grows with more candidates, showing superior search efficiency. Second, process supervision scaled better overall; it outperformed outcome supervision across all data sizes in small-scale tests, even when outcome labels came from the more accurate process model rather than simple final-answer checks. Third, active learning—selecting solutions likely to contain subtle errors—made process supervision 2.6 times more data-efficient, allowing strong results with fewer human labels. Fourth, on fresh, out-of-distribution tests like recent AP exams and math competitions, process supervision achieved 73% accuracy versus 64% for outcome, confirming broader generalization. Finally, the team released the full PRM800K dataset to enable further research.

These results mean process supervision can produce more reliable AI by directly addressing where reasoning breaks down, reducing hallucinations and improving alignment with human judgment. Unlike outcome supervision, which struggles with "credit assignment" for errors in long chains, process methods encourage interpretable, step-by-step thinking that humans can verify, lowering risks of models gaming the system for right answers via wrong paths. This avoids an "alignment tax"—safer training without sacrificing performance—and could accelerate adoption in domains like science or policy analysis, where multi-step logic matters. The findings differ from prior work on simpler math, where outcomes sufficed; here, on harder problems, process supervision's precision proves decisive, potentially reshaping how we train AI for complex tasks.

Next, organizations should prioritize process supervision when building reward models for reasoning-heavy applications, integrating active learning to minimize labeling costs. For immediate action, fine-tune generators using these reward models via reinforcement learning to boost end-to-end accuracy. Options include scaling process data further for non-math domains like code or legal reasoning, weighing the higher initial setup against long-term gains in safety. Before broad deployment, pilot tests in target areas are needed. Additional analysis should explore iterative active learning and full RL pipelines to confirm scalability.

While the study provides strong evidence, limitations include potential overlap between training data and the MATH test set, though out-of-distribution results suggest minimal impact on comparisons. Small-scale tests relied on synthetic labels, which approximate but do not fully replicate humans. Confidence is high in process supervision's relative superiority, supported by consistent trends across scales, but readers should approach absolute performance figures cautiously due to dataset biases and the focus solely on reward models, not full systems.

1. Introduction

Section Summary: Large language models excel at multi-step reasoning through step-by-step thinking, but they often hallucinate by inventing facts, especially under uncertainty, which can ruin entire solutions in complex tasks. To address this, researchers train reward models to spot good and bad outputs, using either outcome supervision, which checks only the final answer, or process supervision, which evaluates each step for more precise, human-like guidance and better alignment. This study, building on prior work with a stronger model, more human feedback, and tougher math problems, demonstrates that process supervision creates far more reliable reward models—solving 78 percent of test problems—while also showing ways to boost efficiency through active learning and releasing a large dataset for further research.

Large language models are capable of solving tasks that require complex multi-step reasoning by generating solutions in a step-by-step chain-of-thought format ([1, 2, 3]). However, even state-of-the-art models are prone to producing falsehoods — they exhibit a tendency to invent facts in moments of uncertainty ([4]). These hallucinations ([5]) are particularly problematic in domains that require multi-step reasoning, since a single logical error is enough to derail a much larger solution. Detecting and mitigating hallucinations is essential to improve reasoning capabilities.

One effective method involves training reward models to discriminate between desirable and undesirable outputs. The reward model can then be used in a reinforcement learning pipeline ([6, 7, 8, 9]) or to perform search via rejection sampling ([10, 11, 12]). While these techniques are useful, the resulting system is only as reliable as the reward model itself. It is therefore important that we study how to most effectively train reliable reward models.

In closely related work, [13] describe two distinct methods for training reward models: outcome supervision and process supervision. Outcome-supervised reward models (ORMs) are trained using only the final result of the model's chain-of-thought, while process-supervised reward models (PRMs) receive feedback for each step in the chain-of-thought. There are compelling reasons to favor process supervision. It provides more precise feedback, since it specifies the exact location of any errors that occur. It also has several advantages relevant to AI alignment: it is easier for humans to interpret, and it more directly rewards models for following a human-endorsed chain-of-thought. Within the domain of logical reasoning, models trained with outcome supervision regularly use incorrect reasoning to reach the correct final answer ([14, 15]). Process supervision has been shown to mitigate this misaligned behavior ([13]).

Despite these advantages, [13] found that outcome supervision and process supervision led to similar final performance in the domain of grade school math. We conduct our own detailed comparison of outcome and process supervision, with three main differences: we use a more capable base model, we use significantly more human feedback, and we train and test on the more challenging MATH dataset ([16]).

Our main contributions are as follows:

  1. We show that process supervision can train much more reliable reward models than outcome supervision. We use our state-of-the-art PRM to solve $78.2%$ of problems from a representative subset of the MATH test set.
  2. We show that a large reward model can reliably approximate human supervision for smaller reward models, and that it can be used to efficiently conduct large-scale data collection ablations.
  3. We show that active learning leads to a $2.6\times$ improvement in the data efficiency of process supervision.
  4. We release our full process supervision dataset, PRM800K, to promote related research.

2. Methods

Section Summary: The researchers compare outcome supervision, which automatically checks final answers on math problems from the MATH dataset, with process supervision, which requires human labelers to evaluate the correctness of each step in AI-generated solutions. They run experiments at large scale using GPT-4 to build highly reliable reward models, and at small scale for fairer direct comparisons by having larger models guide training without human feedback. Data is collected by having people label steps as positive, negative, or neutral in a dataset called PRM800K, focusing on tricky wrong answers, and reward models are tested by seeing how often they pick the best correct solution from a set of options.

We perform a comparison of outcome and process supervision, following a similar methodology to [13]. Outcome supervision can be provided without humans, since all problems in the MATH dataset have automatically checkable answers. In contrast, there is no simple way to automate process supervision. We therefore rely on human data-labelers to provide process supervision, specifically by labelling the correctness of each step in model-generated solutions.

We conduct experiments in two separate regimes: large-scale and small-scale. Each has its own advantages, and they offer complimentary perspectives. At large-scale, we finetune all models from GPT-4 ([17]). We focus on advancing the state-of-the-art by training the most reliable ORM and PRM possible. Unfortunately the training sets for these reward models are not directly comparable, for reasons we will discuss in Section 3. These models are therefore not ideal for making an apples-to-apples comparison of outcome and process supervision. To address this flaw, we also train models at small-scale, where we can conduct a more direct comparison. In order to remove our dependence on costly human feedback, we use a large-scale model to supervise small-scale model training. This setup enables us to conduct several important ablations that would otherwise be infeasible.

2.1 Scope

At each model scale, we use a single fixed model to generate all solutions. We call this model the generator. We do not attempt to improve the generator with reinforcement learning (RL). When we discuss outcome and process supervision, we are specifically referring to the supervision given to the reward model. We do not discuss any supervision the generator would receive from the reward model if trained with RL. Although finetuning the generator with RL is a natural next step, it is intentionally not the focus of this work.

We instead focus exclusively on how to train the most reliable reward model possible. We evaluate a reward model by its ability to perform best-of-N search over uniformly sampled solutions from the generator. For each test problem we select the solution ranked highest by the reward model, automatically grade it based on its final answer, and report the fraction that are correct. A reward model that is more reliable will select the correct solution more often.

2.2 Base Models

All large-scale models are finetuned from the base GPT-4 model ([17]). This model has been pretrained solely to predict the next token; it has not been pretrained with any Reinforcement Learning from Human Feedback (RLHF) ([18]). The small-scale base models are similar in design to GPT-4, but they were pretrained with roughly $200$ times less compute. As an additional pretraining step, we finetune all models on a dataset of roughly 1.5B math-relevant tokens, which we call MathMix. Similar to [19], we find that this improves the model's mathematical reasoning capabilities. Details on how this dataset was constructed can be found in Appendix A.

2.3 Generator

To make parsing individual steps easier, we train the generator to produce solutions in a newline delimited step-by-step format. Specifically, we few-shot generate solutions to MATH training problems, filter to those that reach the correct final answer, and finetune the base model on this dataset for a single epoch. This step is not intended to teach the generator new skills; it is intended only to teach the generator to produce solutions in the desired format.

**Figure 1:** A screenshot of the interface used to collect feedback for each step in a solution.

2.4 Data Collection

To collect process supervision data, we present human data-labelers with step-by-step solutions to MATH problems sampled by the large-scale generator. Their task is to assign each step in the solution a label of positive, negative, or neutral, as shown in Figure 1. A positive label indicates that the step is correct and reasonable. A negative label indicates that the step is either incorrect or unreasonable. A neutral label indicates ambiguity. In practice, a step may be labelled neutral if it is subtly misleading, or if it is a poor suggestion that is technically still valid. We permit neutral labels since this allows us to defer the decision about how to handle ambiguity: at test time, we can treat neutral labels as either positive or negative. A more detailed description of the labelling instructions is provided in Appendix D.

We label solutions exclusively from the large-scale generator in order to maximize the value of our limited human-data resource. We refer to the entire dataset of step-level labels collected as PRM800K. The PRM800K training set contains 800K step-level labels across 75K solutions to 12K problems. To minimize overfitting, we include data from 4.5K MATH test problems in the PRM800K training set, and we therefore evaluate our models only on the remaining $500$ MATH test problems. More details about this test set can be found in Appendix C.

During data collection, we must decide which solutions to surface to data-labelers. The most straightforward strategy is to uniformly surface solutions produced by the generator. However, if we surface solutions that make obvious errors, the human feedback we get is less valuable. We would prefer to surface solutions that are more likely to fool our best reward model. To that end, we attempt to strategically select which solutions to show data-labelers. Specifically, we choose to surface convincing wrong-answer solutions. We use the term convincing to refer to solutions that are rated highly by our current best PRM, and we use wrong-answer to refer to solutions that reach an incorrect final answer. We use this slightly verbose phrasing to emphasize the fact that correctness is determined solely by checking the final answer, a process which occasionally leads to misgraded solutions. We expect to gain more information from labeling convincing wrong-answer solutions, since we know the PRM is mistaken about at least one step in each such solution.

In addition to using this selection strategy, we also iteratively re-train our PRM using the latest data at several points in the data collection process. At each iteration, we generate N solutions per problem and surface only the top K most convincing wrong-answer solutions to data-labelers. We experiment with either applying this top-K filtering at a problem level (K solutions per problem) or globally across the dataset (K solutions in total, unequally distributed among problems). Since the data collection process is expensive, it was not feasible to conduct at-scale ablations of these decisions. However, we perform several surrogate ablations in Section 4, using our largest PRM as a labelling oracle for a smaller PRM. More details about data collection can be found in Appendix B.

**Figure 2:** Two solutions to the same problem, graded by the PRM. The solution on the left is correct while the solution on the right is incorrect. A green background indicates a high PRM score, and a red background indicates a low score. The PRM correctly identifies the mistake in the incorrect solution.

2.5 Outcome-supervised Reward Models (ORMs)

We train ORMs following a similar methodology to [12]. We uniformly sample a fixed number of solutions per problem from the generator, and we train the ORM to predict whether each solution is correct or incorrect. In practice, we usually determine correctness by automatically checking the final answer, but in principle these labels could be provided by humans. At test time, we use the ORM's prediction at the final token as the overall score for the solution. We note the automatic grading used to determine ORM targets is not perfectly reliable: false positives solutions that reach the correct answer with incorrect reasoning will be misgraded. We discuss additional ORM training details in Appendix E.

2.6 Process-supervised Reward Models (PRMs)

We train PRMs to predict the correctness of each step after the last token in each step. This prediction takes the form of a single token, and we maximize the log-likelihood of these target tokens during training. The PRM can therefore be trained in a standard language model pipeline without any special accommodations. To determine the step-level predictions at test time, it suffices to perform a single PRM forward pass over the whole solution. We visualize large-scale PRM scores for two different solutions in Figure 2. To compare multiple solutions, it is necessary to compute a single score for each solution. This is an important but straightforward detail: we define the PRM score for a solution to be the probability that every step is correct under the PRM. We implement this as the product of the correctness probabilities for each step. We describe other possible scoring strategies and additional PRM training details in Appendix F.

When we provide process supervision, we deliberately choose to supervise only up to the first incorrect step. This makes the comparison between outcome and process supervision more straightforward. For correct solutions, both methods provide the same information, namely that every step is correct. For incorrect solutions, both methods reveal the existence of at least one mistake, and process supervision additionally reveals the precise location of that mistake. If we were to provide additional process supervision beyond the first mistake, then process supervision would have an even greater information advantage. This decision also keeps the labelling cost similar for humans: without relying on an easy-to-check final answer, determining the correctness of a solution is equivalent to identifying its first mistake. While most MATH problems do have easy-to-check final answers, we expect this to not remain true in more complex domains.

3. Large-scale Supervision

Section Summary: Researchers trained a large-scale process reward model (PRM) using detailed step-by-step labels from a dataset called PRM800K, while creating a strong outcome reward model (ORM) baseline by sampling 100 diverse solutions per problem from a generator, resulting in a dataset ten times larger and without overlap. Although the datasets differ and aren't directly comparable, they represent the best efforts to push forward-thinking AI evaluation methods, noting that relying solely on the biased PRM800K for the ORM would weaken it, and mixing in extra samples didn't help. A figure comparing the models shows the PRM consistently outperforming the ORM and a simple majority voting baseline when selecting the best from many generated solutions, with the advantage growing as more options are considered.

We train the large-scale PRM using the step-level labels in PRM800K. To ensure the large-scale ORM baseline is as strong as possible, we train on 100 uniform samples per problem from the generator. This means the ORM training set has no overlap with PRM800K, and it is an order of magnitude larger. Although these two training sets are not directly comparable, each represents our best attempt to advance the state-of-the-art with each form of supervision. We note that training the ORM solely on PRM800K solutions would be problematic, since our active learning strategy has heavily biased the dataset towards wrong-answer solutions. We did explore training the ORM on a superset of PRM800K solutions, by mixing in uniformly sampled solutions, but we found that this did not improve ORM performance.

**Figure 3:** A comparison of outcome-supervised and process-supervised reward models, evaluated by their ability to search over many test solutions. Majority voting is shown as a strong baseline. For $N \leq 1000$, we visualize the variance across many subsamples of the 1860 solutions we generated in total per problem.

Figure 3 shows how the best-of-N performance of each reward model varies as a function of N. Since majority voting is known to be a strong baseline ([20, 19]), we also include this method as a point of comparison. While the ORM performs slightly better than the majority voting baseline, the PRM strongly outperforms both. Not only does the PRM reach higher performance for all values of N, but the performance gap widens as N increases. This indicates that the PRM is more effective than both the ORM and majority voting at searching over a large number of model-generated solutions. We experimented with using RM-weighted voting ([21, 13]) to combine the benefits of the PRM and majority voting, but this did not noticeably improve performance. We use a specific subset of the MATH test set for evaluation, which we describe in Appendix C. We further break down these results by problem difficulty in Appendix G.

4. Small-scale Synthetic Supervision

Section Summary: To fairly compare process supervision, which rewards correct reasoning steps, against outcome supervision, which focuses on the final answer, researchers used a large pre-trained model to generate cheap labels for training smaller models on math problems. Their experiments showed that process supervision consistently outperformed both types of outcome supervision across various data scales, with model-based outcome checks proving more effective than simple final-answer verification due to fewer errors in labeling tricky solutions. Additionally, an active learning approach, which prioritized selecting diverse and challenging examples, made data collection about 2.6 times more efficient than random sampling, though attempts at iterative model updates during this process were unstable and didn't improve results.

We find that the PRM outperforms the ORM at large-scale, but this result alone paints an incomplete picture. To better compare outcome and process supervision, there are two confounding factors that must be isolated. First, the training sets for the ORM and the PRM are not directly comparable: the PRM training set was constructed using active learning, is biased towards answer-incorrect solutions, and is an order of magnitude smaller. Second, the final-answer grading will provide positive labels to spurious solutions that reach the correct final answer despite incorrect reasoning. This could damage ORM performance, an effect we may or may not want to attribute to outcome supervision more generally.

Due to the high cost of collecting human feedback, we cannot easily ablate these factors using human labelers. We instead perform the relevant ablations by using the large-scale PRM to supervise smaller models. This setup enables us to simulate a large amount of data collection at a modest cost. For the remainder of this section, we refer to the large-scale PRM from Section 3 as $\text{PRM}_{\text{large}}$.

**Figure 4:** A comparison of different forms of outcome and process supervision. Mean and standard deviation is shown across three seeds.

4.1 Process vs Outcome Supervision

We now conduct a direct comparison of outcome and process supervision. We first sample between 1 and 200 solutions per problem from a small-scale generator. For each dataset, we provide three forms of supervision: process supervision from $\text{PRM}{\text{large}}$, outcome supervision from $\text{PRM}{\text{large}}$, and outcome supervision from final-answer checking. The choice of supervision is the only difference between these three series of reward models, which are otherwise trained on identical datasets. See Appendix H for more details about how $\text{PRM}_{\text{large}}$ is used for outcome and process supervision.

In Figure 4a, we evaluate each reward model by its best-of-500 selection. We see that process supervision significantly outperforms both forms of outcome supervision at all data collection scales. In Figure 4b, we evaluate the best reward model from each series by its best-of-N performance across different values of N. We see that using $\text{PRM}{\text{large}}$ for outcome supervision is noticeably more effective than final-answer checking. This can be explained by the fact that $\text{PRM}{\text{large}}$ provides better supervision for solutions that reach the correct final answer using incorrect reasoning.

It is not clear whether supervision by $\text{PRM}{\text{large}}$ or by final-answer checking represents the more appropriate outcome supervision baseline. While final-answer supervision is more explicitly outcome based, its main weakness — the existence of false positives — is arguably over-emphasized in the MATH dataset. Outcome supervision by $\text{PRM}{\text{large}}$ better represents outcome supervision in domains that are less susceptible to false positives. We consider outcome supervision by $\text{PRM}_{\text{large}}$ to be the more relevant baseline, but we encourage the reader to draw their own conclusions.

4.2 Active Learning

Finally, we investigate the impact of active learning. We train a small-scale reward model, $\text{PRM}{\text{selector}}$, on a single sample from each problem, and we use this model to score 1000 samples per problem. To train each of our larger reward models, we select $N$ samples per problem such that $80%$ are the most convincing (according to $\text{PRM}{\text{selector}}$) wrong-answer samples, and $20%$ are the most convincing samples that remain (right- or wrong-answer). We score the selected samples with $\text{PRM}{\text{large}}$ and train on those scores. This process ensures that all samples are relatively convincing under $\text{PRM}{\text{selector}}$, that a large fraction are known to contain at least one mistake, and that our overall dataset is not too heavily biased toward wrong-answer solutions. Performance of this data labelling scheme is shown in Figure 4a. By comparing the slopes of the line of best fit with and without active learning, we estimate that this form of active learning is approximately 2.6x more data efficient than uniform data labelling. We note that the model trained on the largest active learning dataset (200 samples per problem) appears to slightly underperform the expected trend line. Our best explanation for this observation is that 200 samples represents a significant fraction of the overall selection pool (1000 samples) and that this relative lack of diversity limits the possible upside from active learning.

We also performed a preliminary investigation into the impact of iteratively retraining $\text{PRM}{\text{selector}}$ throughout data collection. Between iterations, we re-trained $\text{PRM}{\text{selector}}$ using all currently labeled data. Unfortunately, we observed instability in this process which we were unable to diagnose. The resulting reward models performed no better than the models described above. We expect some form of iterative retraining to be beneficial in active learning, but we currently have no concrete evidence to support this claim. We consider this a compelling direction for future research.

5. OOD Generalization

Section Summary: Researchers tested how well their AI models generalize to new, unseen problems from recent high school-level exams in subjects like calculus, chemistry, physics, and math competitions, using a set of 224 questions released after the models' training data. They compared two types of reward models—one focused on outcomes (ORM) and one on processes (PRM)—along with a simple majority voting approach, running 100 trials per problem to measure accuracy. The process-focused model performed best overall at 72.9%, outperforming the others and showing it can handle slight changes in problem types without losing its edge.

: Table 1: We measure out-of-distribution generalization using recent STEM tests. We evaluate the outcome-supervised RM, the process-supervised RM, and majority voting using 100 test samples per problem.

ORM PRM Majority Vote # Problems
AP Calculus $68.9%$ 86.7% $80.0%$ 45
AP Chemistry $68.9%$ 80.0% $71.7%$ 60
AP Physics $77.8%$ 86.7% $82.2%$ 45
AMC10/12 $49.1%$ 53.2% $32.8%$ 84
Aggregate $63.8%$ 72.9% $61.3%$ 234

To get some measure of out-of-distribution generalization, we evaluate our large-scale ORM and PRM on a held-out set of 224 STEM questions, pulled from the most recent AP Physics, AP Calculus, AP Chemistry, AMC10, and AMC12 exams. Since these tests were released after the pre-training dataset was compiled, we can have high confidence that the model has not seen these problems. We report the best-of-100 performance of the ORM, PRM and majority voting in Table 1. We observe results similar to those in Section 3: the PRM outperforms both the ORM and majority voting. This shows us that the PRM can tolerate a modest amount of distribution shift and that its strong performance holds up on fresh test questions.

6. Discussion

Section Summary: Process supervision offers a key advantage over outcome supervision by providing detailed feedback on exactly where errors occur in a model's reasoning, making it easier for AI to learn from mistakes, especially on complex problems like math. It also enhances AI alignment by promoting interpretable, human-approved thought processes that are safer and more reliable, without sacrificing performance—in fact, it avoids the usual "alignment tax" that can slow down safer AI adoption, though more research is needed in other areas beyond math. While there's a risk of test set contamination from online sources in the MATH dataset, the authors find no strong evidence of it affecting their models' core abilities, and it likely doesn't alter the comparisons between methods, as shown by strong generalization on clean data and low success rates on tough problems.

6.1 Credit Assignment

One clear advantage of process supervision is that it provides more precise feedback than outcome supervision. A reward model trained with outcome supervision faces a difficult credit-assignment task — to generalize well, it must determine where an incorrect solution went wrong. This is particularly difficult for hard problems: most model-generated solutions contain an error somewhere, so the marginal value of a negative label from outcome supervision is low. In contrast, process supervision provides a richer signal: it specifies both how many of the first steps were in fact correct, as well as the precise location of the incorrect step. Process supervision makes credit assignment easier, and we believe that this explains its strong performance.

6.2 Alignment Impact

Process supervision has several advantages over outcome supervision related to AI alignment. Process supervision is more likely to produce interpretable reasoning, since it encourages models to follow a process endorsed by humans. Process supervision is also inherently safer: it directly rewards an aligned chain-of-thought rather than relying on outcomes as a proxy for aligned behavior ([22]). In contrast, outcome supervision is harder to scrutinize, and the preferences conveyed are less precise. In the worst case, the use of outcomes as an imperfect proxy could lead to models that become misaligned after learning to exploit the reward signal ([13, 23, 24]).

In some cases, safer methods for AI systems can lead to reduced performance ([9, 25]), a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results show that process supervision in fact incurs a negative alignment tax. This could lead to increased adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains.

6.3 Test Set Contamination

The test set of the MATH dataset contains problems that are discussed in several online venues, and it is likely that some of these problems appear in the pretraining dataset for our models. We attempted to remove all MATH problems from our MathMix dataset using string-matching heuristics, but since humans can post hard-to-detect rephrasings of a problem online, it is difficult to make any strong guarantees about the overlap between MathMix and the MATH dataset.

In our experience inspecting model-generated solutions, we saw no clear signs of our models memorizing MATH problems. However, it is impossible to rule out subtle forms of memorization that would slip past manual inspection, and it is still possible that some degree of contamination has slightly inflated our performance on the MATH test set. Even in that case, we would expect any contamination to manifest similarly across all methods, and that the relative comparisons made throughout this work would remain mostly unaffected.

We also note that the PRM regularly surfaces correct solutions to MATH problems that have a low single-digit percentage solve-rate under the generator, some examples of which can be seen in Appendix I. The generator's low solve-rate is an additional indication that it has not encountered such problems via test set contamination. Our generalization results from Section 5 further strengthen our claim that test set contamination has not significantly impacted this work, since we observe qualitatively similar results on problems that are guaranteed to be uncontaminated.

7. Related Work

Section Summary: This section discusses related research on training AI models for math problem-solving, starting with a comparison to a study that found process supervision—guiding models step-by-step—works as well as outcome supervision—focusing on final answers—with less data, though the current work uses tougher problems, bigger datasets, and shows process supervision pulls ahead at larger scales. It also highlights similarities to another approach using a powerful reward model to train smaller ones, avoiding issues with human feedback in reinforcement learning. Finally, it reviews studies on enhancing language models' reasoning through techniques like fine-tuning on technical texts, self-consistency checks, and prompting for step-by-step thinking without extra training.

7.1 Outcome vs Process Supervision

In work closely related to our own, [13] compare the impact of outcome and process supervision in the domain of grade school math. They found that both methods led to similar final-answer error rates, and that process supervision achieved those results with less data. While our core methodology is very similar, there are three main details that differ. First, we use a more capable model to collect PRM800K dataset and to perform our large-scale experiments. However, our small-scale results in Section 4 suggest that large-scale models are not necessary to observe benefits from process supervision. Second, we evaluate on the MATH dataset, which is significantly more challenging than GSM8K. Third, we collect a much larger quantity of process supervision data.

On the surface, the results from [13] may seem to conflict with our claim that process supervision leads to better performance. However, we believe the apparent conflict can be explained by the difference in the scale of the supervision. The data scaling trend in Figure 4a suggests that a small amount of process supervision and a large amount of outcome supervision do in fact lead to similar performance, consistent with the results from [13]. The trend also shows that process supervision beats outcome supervision when scaled up, even when judged based solely on outcomes. This is consistent with our results in Section 3. We believe these results make a strong case for using process supervision.

7.2 Synthetic Supervision

Similar to our work in Section 4, [26] use a large reward model to supervise the training of smaller models. They study the over-optimization that occurs during RLHF, with experiments that require large quantities of human preference data. To work around this challenge, they use a gold-standard reward model to replace human feedback. Our use of a large-scale reward model to supervise smaller reward models shares similarities with their approach.

7.3 Natural Language Reasoning

Several recent studies that have examined the reasoning ability of large language models are implicitly relevant to our work. [19] showed that finetuning models on a large corpus of technical content led to significantly improved performance on MATH. [20] showed that self-consistency leads to remarkably strong performance on many reasoning benchmarks, notably without requiring any additional finetuning. [2] and [1] demonstrate the importance of explicitly performing intermediate reasoning steps via a chain of thought or a scratchpad in order to solve tasks that require multi-step reasoning. [3] show that models are able to perform this behavior zero-shot, conditioned only on a simple prompt.

8. Conclusion

Section Summary: This research demonstrates that guiding AI models through step-by-step process checks, rather than just final results, leads to far more dependable reward systems for tasks like math problem-solving. It also highlights how active learning can cut down on the expense of gathering human input by focusing feedback only on the most useful AI outputs. The team has made their full dataset of 800,000 human evaluations publicly available to spur more studies on aligning large language models, noting that process-based methods remain underexplored and warrant further investigation into their broader applications.

We have shown that process supervision can be used to train much more reliable reward models than outcome supervision in the domain of mathematical reasoning. We have also shown that active learning can be used to lower the cost of human data collection by surfacing only the most valuable model completions for human feedback. We release PRM800K, the full dataset of human feedback used to train our state-of-the-art reward model, with the hope that removing this significant barrier to entry will catalyze related research on the alignment of large language models. We believe that process supervision is currently under-explored, and we are excited for future work to more deeply investigate the extent to which these methods generalize.

Acknowledgements

Section Summary: The acknowledgements section expresses gratitude to several individuals, including Joshua Achiam, Mark Chen, and others, for their helpful feedback and discussions during the project. It also thanks Giambattista Parascandolo and Daniel Selsam for helping create the MathMix dataset, Jonathan Ward for developing the data collection interface, and Wojciech Zaremba for motivating the team to expand their efforts. Additional appreciation goes to Peter Hoeschele and Aris Kostantinidis for data collection support, OpenAI's research and supercomputing teams for providing essential infrastructure, and the Scale team along with numerous data labelers who built the PRM800K dataset.

We thank Joshua Achiam, Mark Chen, Jonathan Gordon, Dan Hendrycks, Lukasz Kaiser, Oleg Murk, Ben Sokolowsky, Francis Song, and Jonathan Uesato for valuable feedback and thoughtful discussions; Giambattista Parascandolo and Daniel Selsam for their contributions to the MathMix dataset; Jonathan Ward for contributing to the data collection interface; Wojciech Zaremba for encouraging us to scale up data collection; Peter Hoeschele and Aris Kostantinidis for supporting our data collection; the research acceleration and supercomputing teams at OpenAI for providing infrastructure support; and the team at Scale and the many data-labelers who created PRM800K.

Appendix

Section Summary: The appendix describes MathMix, a curated dataset of 1.5 billion high-quality math-related tokens, including problems, solutions, discussions, and synthetic data, designed for lightweight pretraining without mixing in general language, unlike larger prior datasets; it uses a smaller 1-billion-token version for training compact models and includes checks to prevent overlap with test data. It also details PRM800K, a collection of about 800,000 step-level labels on 75,000 math solutions gathered through two phases, with the bulk from an efficient second phase that prioritized labeling incorrect solutions via active learning to improve model performance, alongside quality control to ensure label accuracy. For evaluation, the authors expanded the training set to include most MATH problems and test models only on a random subset of 500 held-out problems to avoid overfitting.

A. MathMix

Similar to [19] we construct a large-scale dataset of high-quality math-relevant tokens for use in a lightweight pretraining stage, before finetuning on comparably smaller datasets like MATH and PRM800K. This dataset, which we call MathMix, has two main differences compared to the one used to train Minerva. First, it is smaller and more aggressively filtered to high-quality math problem-solving content, and second, it does not explicitly mix in general language data.

Minerva was trained on 38.5B tokens of arXiv documents and webscrape pages with LaTeX content, while MathMix consists of a smaller set of 1.5B tokens containing individual math problems and their solutions, free-form text discussing math problems and concepts, and synthetic data (Table 2). While Minerva was pretrained on a dataset with 5% general natural language data, we chose not to mix in any natural language data explicitly, primarily because MathMix already contains plenty of natural language data.

: Table 2: MathMix dataset components.

Data type Token count Present in pretraining?
Math problems and solutions $\sim$ 275M No
Free-form math discussion text (1) $\sim$ 430M No
Free-form math discussion text (2) $\sim$ 450M Yes
Synthetic data (1) $\sim$ 30M No
Synthetic data (2) $\sim$ 100M Yes
Critiques grading data $\sim$ 500M No

Note that when training smaller models, as in Section 4, we use a slightly smaller variant of MathMix that excludes the critiques data and only consists of 1B tokens. For our large models experiments, we train on MathMix for roughly 3B tokens (2 epochs). For our small models experiments, we train for 6 epochs (roughly 6.6B tokens).

We apply a set of decontamination checks on MathMix against the test split of the MATH dataset, including stripping out LaTeX and searching for matching n-grams, but we can make no strong guarantees on the efficacy of this decontamination. As discussed in Section 6.3, we would not expect the relative comparisons made throughout this work to be significantly impacted by test set contamination.

B. PRM800K

We collected $1, !085, !590$ step-level labels over $101, !599$ solution samples. We present the whole unfiltered dataset as PRM800K. During training we discard labels used for quality control, as well as any step-level labels for which the labeler was unable to complete the task. The filtered dataset contains about $800, !000$ step-level labels over $75, !000$ solutions. The full PRM800K dataset is available at https://github.com/openai/prm800k.

The data collection was split into two separate phases. In phase 1, we collected labels for multiple alternative completions at each step of a solution. This seeded our dataset but was cumbersome—for many steps the alternatives were repetitive, and we found labelers spent a lot of time supervising long uninteresting solutions. As a result, the step-level labels we collected in this phase are more repetitive than those collected later. In total, phase 1 represents about $5%$ of PRM800K, or about $40, !000$ step-level labels.

The majority of our labels were collected as part of phase 2, during which we scaled up and streamlined the data collection process. Phase 2 data collection is split into $10$ generations. For each generation, we sample $N$ solutions per problem from the generator. We rank these solutions with our current best PRM and surface the highest scoring wrong-answer solutions to our labelers. We retrain this PRM between each generation using all the latest data. This active learning strategy changes the balance of our data considerably. Though we sometimes surfaced correct solutions (either by manually injecting correct solutions or because of errors in our automatic grading), the vast majority of the labels we collected in this phase are for incorrect solutions. Table 3 breaks down the balance of correct/incorrect steps and solutions between the different phases of data collection. Though we mostly collected labels on incorrect solutions, we still collected many labels for correct individual steps. In fact, our small-scale ablations in Section 4.2 suggest that this active learning strategy, which favors labelling high-scoring wrong-answer solutions, improves performance despite the resulting imbalance in the dataset.

: Table 3: Distribution of positive/negative steps/solutions.

phase $1$ phase $2$ combined
$%$ end in correct solution $85.1$ $13.2$ $14.2$
$%$ correct steps $58.6$ $74.1$ $73.1$

Some of our phase 2 questions are intended for quality control. For a quality control question, researchers mark which steps are reasonable to label as incorrect. Then we assess that labelers are able to consistently mark those steps as incorrect. Prior to starting on phase 2, we required all labelers to label 30 quality control questions. This served as a screening test, and we only admitted labelers that agreed with our gold labels at least $75%$ of the time.

We then designated $10$- $20$ problems per generation as additional quality control questions, and we randomly served them to labelers as they worked through the task. We used the results of this continuous quality control to remove labelers whose quality slipped too far, as well as to prepare educational material on common mistakes in order to improve labeler alignment with our instructions.

C. Evaluation

As we scaled up the project, we began having to collect labels on multiple solutions for the same training problem. In order to avoid the risk of over-fitting on the $7, !500$ MATH training problems, we expanded the training set to include $4, !500$ MATH test split problems. We therefore evaluate our models only on the remaining $500$ held-out problems. We selected these $500$ test problems uniformly at random. In Figure 5, we show that the distribution of difficulty levels and subjects in this subset is representative of the MATH test set as a whole. The specific test set we used can be found at https://github.com/openai/prm800k. We leave it for future work to explore how many distinct training problems are actually necessary, and how quickly our methods overfit to the training set.

**Figure 5:** Two histograms comparing the distribution of problem difficulty levels and subjects in both the original MATH test set and in our 500 problem test subset.

D. Labelling Instructions

Labelers were tasked to look at steps in a solution and label each one as positive, negative, or neutral. A step is considered neutral if it is appropriate in context, reasonable, correct, and contains only computations that can be verified easily. A step is positive if it is neutral and also progresses towards the solution. All other steps are considered negative. Labelers were not given reference solutions, but they were given the ground truth final answers. We chose not to provide reference solutions to avoid biasing them towards one particular path to the solution. We chose to provide ground truth final answers since this information can sometimes help labelers resolve their own misunderstandings.

In phase 1, labelers were permitted to enter their own steps in the case that all candidate steps were negative. Then the solution would progress from a randomly selected positive step (or neutral if their were no positive ones). This often resulted in trajectories that got stuck in endless sequences of neutral steps that said reasonable things but made frustratingly slow progress towards a solution or negative steps that needed constant human supervision. In phase 2, we pre-generate whole solutions and end the task as soon as the first negative step is encountered. The full instructions given to labelers can be found at https://github.com/openai/prm800k/tree/main/prm800k/instructions.

E. ORM Training Details

We train outcome-supervised reward models in the same manner as token-level verifiers from [12], with a few subtle differences to hyperparameters. In particular, we only train for a single epoch on each dataset of model samples and reward model labels, without dropout, and without jointly learning a language modeling objective. We find that performance is not sensitive to most other hyperparameters, within a reasonable range.

To collect model samples, we simply sample uniformly from the generator at a temperature of 1.0 without applying any rebalancing of positives or negatives. At training time, the reward model makes predictions for every token in the context. The target for each token in a solution is the same, based on whether the solution is labelled correct or incorrect. At test time, we simply use the score of the final token in the completion as the overall score of the solution. We note that this setup is identical to the way token-level verifiers were trained in [12].

F. PRM Details

F.1 Training

We train our PRMs by fine-tuning the MathMix model to predict the probability of positive, negative, and neutral labels given a solution prefix ending in one of our labeled steps. We sweep over hyperparameters using a dataset containing the first $\sim!10%$ of PRM800K. Fine-tuning an LLM from its ordinary language modeling task to a classification task like this is a large distribution shift, and we found low learning rates were important to stable PRM training.

All of our PRMs are trained for 2 epochs. On smaller datasets (such as in phase 1 and the first few generations of phase 2) this improves the final performance over training for just 1 epoch. Additional epochs, up to some point, don't noticeably help or hurt performance. On larger datasets, the benefits of 2 epoch training diminishes, but we continue doing it for consistency.

F.2 Scoring

There are multiple ways of using the PRM to score solutions. In general, we produce a single solution-level score by performing a reduction over step-level scores, where the step-level score is the probability that the step's label is positive. This involves two specific implementation decisions. First, when determining a step-level score, we either consider a neutral label to be positive or negative. Second, when determining a solution-level score, we either use the minimum or the product over step-level scores as a reduction.

We show results from all four scoring strategies in Table 4. The best performing strategy is to take the product of step-level scores and to consider the neutrals as positives, but the difference in performance between all strategies is minor. Throughout the rest of this work, we consider neutral steps to be positive, and we define the solution score to be the product of step-level scores. Using the product instead of the minimum as the reduction does create a slight bias against solutions with a larger number of steps.

: Table 4: Best-of-1860 test performance using the PRM with four different scoring strategies.

product minimum
neutral = positive $78.2%$ $77.6%$
neutral = negative $77.4%$ $77.8%$

G. Difficulty Breakdown

We show performance of our ORM and PRM on each quintile of the MATH dataset. We determine quintiles based on the pass rate under the generator. It is interesting to note that the performance gap is not only apparent on high difficulty problems: it is in fact apparent across all difficulties. For the lowest difficulty problems, we see that it is possible to find adversarial examples that fool the ORM, since the ORM's performance slightly decreases as the number of samples increases. In contrast, the PRM remains highly robust over this same set of samples.

We also see that increasing the number of samples has the largest positive effect on the highest difficulty problems. This is to be expected, since a large number of generator samples may be required to find a true and convincing solution to a hard problem.

**Figure 6:** A breakdown of ORM vs PRM performance by problem difficulty.

H. Synthetic Supervision Details

We can use $\text{PRM}{\text{large}}$ to provide either outcome or process supervision for smaller models. We determine the labels for individual steps based on the step-level probabilities outputted by $\text{PRM}{\text{large}}$. To do this, we set an arbitrary threshold: any step that $\text{PRM}{\text{large}}$ assigns a negative label with greater than 20% probability is considered incorrect. We choose this threshold based on the observation that $\text{PRM}{\text{large}}$ is slightly miscalibrated in the direction of favoring positive labels.

To provide process supervision for a solution, we directly return the step-level labels (positive or negative) provided by $\text{PRM}{\text{large}}$, up until the first step that is marked as negative. This mimics our true human data collection process. To provide outcome supervision, we mark the solution as correct if and only if $\text{PRM}{\text{large}}$ considers every step to be correct (using the same thresholding logic).

I. PRM Visualizations

All examples shown come from the large-scale generator (GPT-4). We note the pass-rate under the generator to give some sense of the difficulty of these problems.

I.1 True Positives

These cherry-picked examples show the best-of-1860 solution from the generator as ranked by the large-scale PRM.

Problem 1. Generator pass-rate: $0.1%$. This challenging trigonometry problem requires applying several identities in a not-at-all obvious succession. Most solution attempts fail, because it is hard to choose which identities are actually helpful. Though successful solutions to this problem are rare, the reward model correctly recognizes when a valid chain-of-thought has been found.

Problem 2. Generator pass-rate: $5.8%$. In step 7 and 8, the generator starts performing guess-and-check. This is a common place the model might hallucinate, by claiming a particular guess is successful when it isn't. In this case, the reward model verifies each step and determines that the chain-of-thought is correct.

Problem 3. Generator pass-rate: $1.7%$. The generator successfully applies several trigonometric identities to simplify the expression.

Problem 4. Generator pass-rate: $4.5%$. Here, the generator successfully performs a complex series of polynomial factorizations. The use of the Sophie-Germain identity in step 5 is an important step that could be considered insightful.

I.2 True Negatives

Problem 5. Generator pass-rate: $4.5%$. The generator attempts to use the difference of squares formula in step 12 on an expression that isn’t in fact a difference of squares. The reward model catches this mistake.

Problem 6. Generator pass-rate: $93.5%$. In step 7, the generator makes an incorrect attempt to simplify an expression. The reward model catches this mistake.

Problem 7. Generator pass-rate: $48.0%$. In step 11, the generator makes a simple calculation error. The reward model catches this mistake.

Problem 8. Generator pass-rate: $5.8%$. The justification in step 8 is strange, but the reward model lets it slide. In step 9, though, the model incorrectly factors the expression. The reward model catches this mistake.

I.3 False Positives

Problem 9. Generator pass-rate: $18.5%$. The generator makes a subtle counting error in step 9. On the surface, it appears reasonable to claim that there are 5 ways to exchange the same colored ball since there are 5 colors. However, this undercounts by a factor of 2, since Bob has 2 choices for which ball to return to Alice. The reward model is fooled by this mistake.

Problem 10. Generator pass-rate: $17.6%$. In step 13, the generator attempts to simplify the equation by combining like terms. It correctly moves and combines the linear terms to the left-hand side, but then mistakenly leaves the right-hand side untouched. The reward model is fooled by this mistake.

Problem 11. Generator pass-rate: $13.4%$. The generator attempts to perform long division, but in step 16, it forgets to include the leading zeros in the repeating part of the decimal. The reward model is fooled by this mistake.

Problem 12. Generator pass-rate: $9.1%$. In step 4, the generator falsely claims that the sequence repeats itself every 12 terms, when it’s in fact every 10 terms. This sort of counting mistake occasionally fools the reward model.

References

[1] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.

[2] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.

[3] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.

[4] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.

[5] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.

[6] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

[7] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

[8] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

[9] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

[10] E. Nichols, L. Gao, and R. Gomez. Collaborative storytelling with large-scale neural language models. In Proceedings of the 13th ACM SIGGRAPH Conference on Motion, Interaction and Games, pages 1–10, 2020.

[11] J. Shen, Y. Yin, L. Li, L. Shang, X. Jiang, M. Zhang, and Q. Liu. Generate & rank: A multi-task framework for math word problems. arXiv preprint arXiv:2109.03034, 2021.

[12] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[13] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

[14] E. Zelikman, Y. Wu, J. Mu, and N. Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.

[15] A. Creswell, M. Shanahan, and I. Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022.

[16] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.

[17] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[18] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.

[19] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022.

[20] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[21] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022.

[22] A. Stuhlmüller and J. Byun. Supervise process, not outcomes. https://ought.org/updates/2022-04-06-process, 2022.

[23] A. Cotra. Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. https://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to, 2022.

[24] T. Everitt, V. Krakovna, L. Orseau, M. Hutter, and S. Legg. Reinforcement learning with a corrupted reward channel. arXiv preprint arXiv:1705.08417, 2017.

[25] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.

[26] L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760, 2022.