
Model optimization

Ensure quality model outputs with evals and fine-tuning in the OpenAI platform.

LLM output is non-deterministic, and model behavior changes between model snapshots and families. Developers must constantly measure and tune the performance of LLM applications to ensure they're getting the best results. In this guide, we explore the techniques and OpenAI platform tools you can use to ensure high quality outputs from the model.

Model optimization workflow

Optimizing model output requires a combination of evals, prompt engineering, and fine-tuning, creating a flywheel of feedback that leads to better prompts and better training data for fine-tuning. The optimization process usually goes something like this.

  1. Write evals that measure model output, establishing a baseline for performance and accuracy.
  2. Prompt the model for output, providing relevant context data and instructions.
  3. For some use cases, it may be desirable to fine-tune a model for a specific task.
  4. Run evals using test data that is representative of real-world inputs. Measure the performance of your prompt and fine-tuned model.
  5. Tweak your prompt or fine-tuning dataset based on eval feedback.
  6. Repeat the loop continuously to improve your model results.

Here's an overview of the major steps, and how to do them using the OpenAI platform.

Build evals

In the OpenAI platform, you can build and run evals either via API or in the dashboard. You might even consider writing evals before you start writing prompts, taking an approach akin to behavior-driven development (BDD).

Run your evals against test inputs similar to those you expect to see in production. Using one of several available graders, measure the results of a prompt against your test data set.
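For example, here's a minimal hand-rolled eval loop using the Python SDK. It grades a small, made-up classification test set with a simple exact-match check; the model name, system prompt, and test cases are illustrative, and the platform's Evals API and dashboard give you richer graders and reporting on top of the same idea.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical test set: inputs paired with the outputs we expect.
test_cases = [
    {"input": "Categorize this ticket: 'My card was charged twice.'", "expected": "billing"},
    {"input": "Categorize this ticket: 'The app crashes on launch.'", "expected": "bug"},
]

SYSTEM_PROMPT = "Classify the support ticket as one of: billing, bug, account. Reply with the label only."

def run_eval(model: str) -> float:
    """Return the fraction of test cases where the model's answer exactly matches the expected label."""
    passed = 0
    for case in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": case["input"]},
            ],
        )
        answer = response.choices[0].message.content.strip().lower()
        if answer == case["expected"]:  # exact-match grader; swap in a fuzzier grader as needed
            passed += 1
    return passed / len(test_cases)

print(f"accuracy: {run_eval('gpt-4.1-mini'):.0%}")
```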

Learn about evals

Run tests on your model outputs to ensure you're getting the right results.

Write effective prompts

With evals in place, you can effectively iterate on prompts. The prompt engineering process may be all you need to get great results for your use case. Different models may require different prompting techniques, but there are several best practices you can apply across the board to get better results.

  • Include relevant context - in your instructions, include any text or image content the model will need to generate a response but that falls outside its training data. This could include data from private databases or current, up-to-the-minute information.
  • Provide clear instructions - your prompt should contain clear goals about what kind of output you want. GPT models like gpt-4.1 are great at following very explicit instructions, while reasoning models like o4-mini tend to do better with high-level guidance on outcomes.
  • Provide example outputs - give the model a few examples of correct output for a given prompt (a process called few-shot learning). The model can extrapolate from these examples how it should respond for other prompts, as in the sketch after this list.
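
Putting these practices together, a single request might combine clear instructions, relevant context, and a few-shot example. Here's a rough sketch using the Python SDK; the store policy and example messages are made up.

```python
from openai import OpenAI

client = OpenAI()

# Relevant context the model can't know from its training data (illustrative).
context = "Store hours: 9am-6pm Mon-Sat. Returns accepted within 30 days with receipt."

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        # Clear instructions about the desired output, plus the context to use.
        {"role": "system", "content": "Answer customer questions in one friendly sentence, using only the provided store policy.\n\nStore policy:\n" + context},
        # Few-shot example showing correct output for a sample prompt.
        {"role": "user", "content": "Are you open on Sunday?"},
        {"role": "assistant", "content": "We're closed on Sundays, but we'd love to see you Monday through Saturday from 9am to 6pm!"},
        # The actual prompt to answer.
        {"role": "user", "content": "Can I return a jacket I bought three weeks ago?"},
    ],
)
print(response.choices[0].message.content)
```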
Learn about prompt engineering

Learn the basics of writing good prompts for the model.

Fine-tune a model

OpenAI models are already pre-trained to perform across a broad range of subjects and tasks. Fine-tuning lets you take an OpenAI base model, provide the kinds of inputs and outputs you expect in your application, and get a model that excels in the tasks you'll use it for.

Fine-tuning can be a time-consuming process, but it can also enable a model to consistently format responses in a certain way or handle novel inputs. You can use fine-tuning with prompt engineering to realize a few more benefits over prompting alone:

  • You can provide more example inputs and outputs than could fit within the context window of a single request, enabling the model to handle a wider variety of prompts.
  • You can use shorter prompts with fewer examples and context data, which saves on token costs at scale and can reduce latency.
  • You can train on proprietary or sensitive data without having to include it via examples in every request.
  • You can train a smaller, cheaper, faster model to excel at a particular task where a larger model is not cost-effective.

Visit our pricing page to learn more about how fine-tuned model training and usage are billed.

Fine-tuning methods

These are the fine-tuning methods supported in the OpenAI platform today.

Method: Supervised fine-tuning (SFT)

How it works: Provide examples of correct responses to prompts to guide the model's behavior. Often uses human-generated "ground truth" responses to show the model how it should respond.

Best for:
  • Classification
  • Nuanced translation
  • Generating content in a specific format
  • Correcting instruction-following failures

Use with: gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14

Method: Vision fine-tuning

How it works: Provide image inputs for supervised fine-tuning to improve the model's understanding of image inputs.

Best for:
  • Image classification
  • Correcting failures in instruction following for complex prompts

Use with: gpt-4o-2024-08-06

Method: Direct preference optimization (DPO)

How it works: Provide both a correct and incorrect example response for a prompt. Indicate the correct response to help the model perform better.

Best for:
  • Summarizing text, focusing on the right things
  • Generating chat messages with the right tone and style

Use with: gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14

Method: Reinforcement fine-tuning (RFT)

How it works: Generate a response for a prompt, provide an expert grade for the result, and reinforce the model's chain-of-thought for higher-scored responses. Requires expert graders to agree on the ideal output from the model. Reasoning models only.

Best for:
  • Complex domain-specific tasks that require advanced reasoning
  • Medical diagnoses based on history and diagnostic guidelines
  • Determining relevant passages from legal case law

Use with: o4-mini-2025-04-16
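
As a concrete illustration, training data for supervised fine-tuning is a JSONL file in which each line contains one example conversation that ends with the assistant response you want the model to learn. The sketch below writes a tiny, made-up classification dataset in that format; the file name and labels are placeholders.

```python
import json

# Made-up supervised fine-tuning examples: each line is a full conversation
# ending with the assistant response the model should learn to produce.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the support ticket as billing, bug, or account."},
        {"role": "user", "content": "My card was charged twice."},
        {"role": "assistant", "content": "billing"},
    ]},
    {"messages": [
        {"role": "system", "content": "Classify the support ticket as billing, bug, or account."},
        {"role": "user", "content": "The app crashes on launch."},
        {"role": "assistant", "content": "bug"},
    ]},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```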

How fine-tuning works

In the OpenAI platform, you can create fine-tuned models either in the dashboard or with the API. This is the general shape of the fine-tuning process:

  1. Collect a dataset of examples to use as training data
  2. Upload that dataset to OpenAI, formatted in JSONL
  3. Create a fine-tuning job using one of the methods above, depending on your goals; this begins the fine-tuning training process (a minimal API sketch follows this list)
  4. In the case of RFT, you'll also define a grader to score the model's behavior
  5. Evaluate the results
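
For supervised fine-tuning via the Python SDK, steps 2 and 3 might look roughly like this; the training file and model snapshot are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Step 2: upload the JSONL training dataset.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Step 3: create a supervised fine-tuning job against a base model snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini-2025-04-14",
)

# Poll the job; once it succeeds, job.fine_tuned_model holds the new model's name.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```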

Get started with supervised fine-tuning, vision fine-tuning, direct preference optimization, or reinforcement fine-tuning.

Learn from experts

Model optimization is a complex topic, and sometimes more art than science. Check out the videos below from members of the OpenAI team on model optimization techniques.