Iterate on prompts and test different models. Find the best fit for your use case.
Manage your prompts, contexts, and models in one place, locally on your machine. Test out different combinations and use cases with a few clicks.
Prompt Evaluation
Model Evaluation
Evaluation Function
Human Rating
Experiment Management
Context Library
BYOK API Integrations
Custom Models
Learn how 16x Eval works in this video demo.
Create new evaluations by specifying a prompt, contexts, and models. Multiple contexts are combined and appended to the final prompt.
You can select multiple models to evaluate the same prompt and context in parallel.
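Conceptually, an evaluation run looks something like the sketch below. This is an illustrative TypeScript sketch, not 16x Eval's actual code: `runEval` and `callModel` are hypothetical names, and joining contexts with blank lines is an assumption about how the final prompt is assembled.

```ts
// Hypothetical sketch of an evaluation run; not 16x Eval's internals.
type EvalResult = { model: string; output: string };

// Stand-in for a real provider API call.
async function callModel(model: string, prompt: string): Promise<string> {
  return `response from ${model}`;
}

async function runEval(
  prompt: string,
  contexts: string[],
  models: string[],
): Promise<EvalResult[]> {
  // Multiple contexts are combined and appended to the final prompt
  // (the join order and separator here are assumptions).
  const finalPrompt = [...contexts, prompt].join("\n\n");

  // Every selected model evaluates the same final prompt in parallel.
  return Promise.all(
    models.map(async (model) => ({
      model,
      output: await callModel(model, finalPrompt),
    })),
  );
}
```

Running the models concurrently means an evaluation across several providers takes roughly as long as the slowest single call.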
Group evaluations into experiments to compare model and prompt performance across different use cases (coding, writing, question answering, etc.).
You can view the top models and prompts for each experiment on the Experiments page.
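One way to picture an experiment is as a named collection of rated evaluations. The TypeScript sketch below is a hypothetical data model; the field names and the ranking-by-average-rating logic are assumptions, not 16x Eval's schema.

```ts
// Hypothetical data model for experiments; field names are illustrative.
interface Evaluation {
  model: string;
  prompt: string;
  rating: number; // e.g. a human rating
}

interface Experiment {
  name: string; // e.g. "coding", "writing", "question answering"
  evaluations: Evaluation[];
}

// Rank models in an experiment by average rating, best first.
function topModels(experiment: Experiment): string[] {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const e of experiment.evaluations) {
    const t = totals.get(e.model) ?? { sum: 0, count: 0 };
    totals.set(e.model, { sum: t.sum + e.rating, count: t.count + 1 });
  }
  return [...totals.entries()]
    .sort(([, a], [, b]) => b.sum / b.count - a.sum / a.count)
    .map(([model]) => model);
}
```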
16x Eval supports built-in models from providers including OpenAI, Anthropic (Claude), Google (Gemini), DeepSeek, and OpenRouter.
We also support any other provider that offers OpenAI API compatibility, such as a locally running Ollama server.
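For example, Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 by default, so any OpenAI-style client can talk to a local model. Here is a minimal TypeScript sketch using the official openai SDK; the model name is an example and should be a model you have pulled locally.

```ts
import OpenAI from "openai";

// Point the OpenAI SDK at a locally running Ollama server.
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // Ollama ignores the key, but the SDK requires one
});

const completion = await client.chat.completions.create({
  model: "llama3.1", // example: any model pulled via `ollama pull`
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(completion.choices[0].message.content);
```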