
Release Notes

Release notes for 16x Eval. Latest features and improvements.

0.0.53

July 29, 2025

  • Fixed an issue where OpenRouter models with provider options were not working

0.0.52

July 29, 2025

  • Added basic tool call support (see the tool definition sketch below)
    • Added Tool Library page to manage and create tools for evaluation
    • Added tools column and tool call column in evaluation table to display tool usage
  • Added technical writing category
  • Added copy benchmark as markdown functionality
  • Various bug fixes and UI/UX improvements
Release 0.0.52 - Tool Library page
Release 0.0.52 - Tool call support in evaluations
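
These notes don't document the tool format, but tool definitions for LLM evaluation are most commonly written as OpenAI-style JSON-schema function declarations. A minimal sketch of what a Tool Library entry might look like under that assumption; the get_weather tool and all of its fields are hypothetical:

```ts
// Hypothetical tool definition in the common OpenAI-style JSON-schema shape.
// The exact format 16x Eval expects is not specified in these notes.
const getWeatherTool = {
  type: "function" as const,
  function: {
    name: "get_weather", // hypothetical example tool
    description: "Look up the current weather for a city",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. Berlin" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["city"],
    },
  },
};
```

The new tools and tool call columns then show whether, and with what arguments, a model chose to call a definition like this.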

0.0.51

July 23, 2025

  • Added average rating and ranking counts in benchmark page for better comparison
  • Added rubrics support for experiments to provide more structured evaluation for human evaluators (see the sketch below)
  • Added sort by prompt or response length functionality in eval table
  • Added more granular rating options (6.5, 5.5, 9.75, 8.25) for better evaluation precision
  • Various bug fixes and UI/UX improvements
Release 0.0.51 - Average rating display
Release 0.0.51 - Rubrics support for experiments
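
The rubric structure itself isn't specified here; a plausible sketch of what structured criteria for human evaluators could look like (all names and fields below are illustrative, not 16x Eval's actual schema):

```ts
// Illustrative rubric structure for human evaluators; not the app's schema.
interface RubricCriterion {
  name: string;        // what is being judged
  description: string; // guidance for the evaluator
  maxPoints: number;   // weight of this criterion
}

const technicalWritingRubric: RubricCriterion[] = [
  { name: "Accuracy", description: "Claims are correct and verifiable", maxPoints: 4 },
  { name: "Clarity", description: "Prose is unambiguous and well organized", maxPoints: 3 },
  { name: "Completeness", description: "All parts of the prompt are addressed", maxPoints: 3 },
];
```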

0.0.50

July 21, 2025

  • Targeted improvements for models like Kimi K2 that are served by many providers
    • Added compact layout setting for more efficient use of screen space
    • Added support for OpenRouter provider options to specify the exact provider for OpenRouter models (example request below)
    • Added ability to duplicate evaluations to new experiments
  • Improved benchmark page with better layout and functionality
  • Fixed context rendering to properly handle whitespace
  • Various UI/UX improvements
Release 0.0.50 - Compact layout setting and OpenRouter provider options
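
For context, OpenRouter serves one model ID through many underlying providers, and its chat completions API accepts a `provider` routing object that pins a request to a specific one. The setting above presumably maps to something like the request below (standard OpenRouter API; how 16x Eval wires it up internally is an assumption):

```ts
// Pin an OpenRouter model to a single provider via OpenRouter's
// `provider` routing object.
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "moonshotai/kimi-k2", // a model served by many providers
    provider: {
      order: ["groq"],           // try this provider first
      allow_fallbacks: false,    // fail instead of silently rerouting
    },
    messages: [{ role: "user", content: "Hello" }],
  }),
});
```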

0.0.49

July 16, 2025

  • Added links to provider API key pages in the settings page
  • UI/UX improvements to experiment page to allow more customization
Release 0.0.49 - Links to provider API key pages in the settings page

0.0.48

July 15, 2025

  • Added support for xAI as first-party model provider
  • Added first-party support for Grok 4 model
  • Changed the default model to Claude Sonnet 4
  • Fixed thought/reasoning token counting logic for the OpenRouter provider
  • Various UI/UX improvements
Release 0.0.48 - xAI and Grok 4 support

0.0.47

June 20, 2025

  • Added pricing metrics for models in the evals page
  • Added cost metrics for individual evaluations in the evals page (see the worked example below)
Release 0.0.47 - Pricing and cost metrics
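
The cost figure per evaluation is straightforward token arithmetic: tokens in each direction times the model's per-token price. A minimal sketch (the prices are placeholders, not any model's real rates):

```ts
// Cost of one evaluation from token counts and per-million-token prices.
function evaluationCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,  // USD per 1M input tokens (placeholder rates below)
  outputPricePerM: number, // USD per 1M output tokens
): number {
  return (inputTokens / 1_000_000) * inputPricePerM +
         (outputTokens / 1_000_000) * outputPricePerM;
}

// e.g. 12,000 input + 800 output tokens at $3 / $15 per 1M tokens:
console.log(evaluationCost(12_000, 800, 3, 15)); // ≈ 0.048 USD
```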

0.0.46

June 17, 2025

  • Added evaluation comparison page to compare the results of two evaluations side by side
Release 0.0.46 - Evaluation comparison page

0.0.45

June 8, 2025

  • Added support for Gemini 2.5 Pro Preview (06-05)
  • Optimized the UX for adding OpenRouter models for new users

0.0.44

June 1, 2025

  • Moved benchmark to a separate dedicated page with option to select models
  • Added sorting options for experiments page
  • Added temperature as an advanced setting
  • Merged rating and notes modal for better UX
  • Various UI/UX improvements
Release 0.0.44 - Benchmark page

0.0.43

May 26, 2025

  • Added a new table view for experiments with improved layout and 3-column display on large screens
  • Added linking of experiments to evaluation functions, so linked evaluation functions are enabled automatically
  • Added color-coded ranges in writing statistics for words per sentence and words per paragraph
  • Added response token count display in statistics
  • Added line wrapping option for better text readability
  • Added copy as markdown feature
  • Various UI/UX and performance improvements
Release 0.0.43 - Table view for experiments

0.0.42

May 23, 2025

  • Added support for Claude Sonnet 4 and Claude Opus 4 models

In our sample evaluations, Claude Opus 4 dominated the other models in both coding and writing tasks, finishing as the best-performing model on all 4 tasks given.

Claude Sonnet 4 is also very impressive, ranking first or second on every task and beating almost all other models.

Release 0.0.42 - Claude 4 Opus
Release 0.0.42 - Claude 4 Sonnet

0.0.41

May 22, 2025

  • Added archive functionality for experiments, prompts, and contexts to avoid cluttering the list
  • Added category grouping for evaluation functions and experiments in selection modal
  • Added category icons and ensured category filters persist across pages
  • Improved image context handling and storage
  • Fixed dark mode colors
  • Various UI/UX improvements
Release 0.0.41 - Category grouping
Release 0.0.41 - Dark mode color fixes
Release 0.0.41 - Archive experiments, prompts, and contexts

0.0.39

May 21, 2025

  • Added reasoning token count from OpenRouter provider

0.0.38

May 20, 2025

  • Added search functionality across the app (experiments, prompts, contexts)
  • Added green and red highlights for target and penalty strings from evaluation functions in model responses (see the scoring sketch below)
  • Added drag and drop support for re-ordering contexts
  • Added created at timestamps for evaluation functions and experiments
  • Added new options (8.75 and 9.25) for more granular rating
  • Replaced 3rd-party LLM library with our own send-prompt library to support more features and improve stability
  • Various UI/UX improvements and bug fixes
Release 0.0.38 - Search functionality
Release 0.0.38 - Target and penalty string highlights
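
Conceptually, target and penalty strings reduce to substring checks that move a score up or down. A minimal sketch of the idea (the weighting and function shape are illustrative, not 16x Eval's actual evaluation function API):

```ts
// Illustrative target/penalty scoring; not 16x Eval's actual API.
function scoreResponse(
  response: string,
  targets: string[],   // strings that should appear (highlighted green)
  penalties: string[], // strings that should not appear (highlighted red)
): number {
  const found = targets.filter((t) => response.includes(t)).length;
  const violations = penalties.filter((p) => response.includes(p)).length;
  // One point per target found, one point lost per penalty string found.
  return found - violations;
}

scoreResponse("Prefer useEffect cleanup", ["useEffect"], ["componentWillMount"]); // 1
```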

0.0.37

May 15, 2025

  • Added support for system prompts
  • Added advanced settings for more configuration options
  • Added categories for prompts and contexts
  • Various UI/UX improvements
Release 0.0.37 - System prompts
Release 0.0.37 - Prompt and Context Categories

0.0.35

May 9, 2025

  • Added ability to duplicate and export single evaluation function
  • Added support for penalizing occurrences of strings in evaluation functions
  • Added ability to copy evaluation as markdown
  • Added sorting by speed
  • Various UI/UX improvements and bug fixes

0.0.32

May 8, 2025

  • Various UI/UX improvements and bug fixes

0.0.31

May 7, 2025

  • Added notes feature for evaluations with improved UX
  • Added ability to cancel running evaluations
  • Added provider logos and icons for models
  • Added sorting by creation time
  • Improved UI/UX for various pages
  • Various code refactoring for better maintainability
Release 0.0.31 - Notes feature

0.0.30

May 1, 2025

  • Added performance metrics including speed, writing statistics, and reasoning response
  • Added ability to sort evaluations by rating or model name
  • Added model highlighting feature in experiments page
  • Increased timeout limit to 10 minutes
  • Improved experiment page UI/UX
  • Fixed image import and export functionality
Release 0.0.30 - Performance metrics and sorting
Release 0.0.30 - Experiment page

0.0.29

April 28, 2025

  • Added support for Azure OpenAI models
  • Added experiment categories to help organize experiments
  • Added auto-fill for custom model API settings (Fireworks for now)
  • Fixed bugs with sending prompts to custom models (you need to delete the existing custom models and create new ones)
  • Fixed temperature setting bug for OpenAI reasoning models
  • Various UI/UX improvements
Release 0.0.29 - Azure OpenAI models

0.0.28

April 26, 2025

  • Added creation time and token stats columns in the evaluation page
  • Improved export and import of evaluations containing images
  • Major code refactoring to support more features
Release 0.0.28 - Image context and token stats

0.0.26

April 26, 2025

  • Bug fixes related to app update (surprisingly, it's very tricky to get right)

0.0.24

April 25, 2025

  • Added support for image context. You can now send images as context to models that support it.
  • Added ability to import evaluations from a JSON file that was exported previously
  • Added a link to release notes in the settings page
  • Improved the UI/UX of app updates
  • Fixed various bugs
Release 0.0.24 - Image context

0.0.19

April 23, 2025

  • Fixed a bug where API keys could not be pasted into the API key input field
  • Fixed a bug where importing multiple files from the file system would fail
  • Added built-in exclusion filter for large files (>1MB) and binary files (see the sketch below)
    • Image file support is coming soon
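
Such a filter usually combines a size threshold with a cheap binary-content heuristic. A minimal sketch of how it might work (the NUL-byte heuristic is a common convention; only the 1MB threshold comes from the note above):

```ts
import { statSync, readFileSync } from "node:fs";

const MAX_SIZE_BYTES = 1024 * 1024; // 1MB, matching the built-in filter

// Exclude files that are too large or that look binary.
function shouldExclude(path: string): boolean {
  if (statSync(path).size > MAX_SIZE_BYTES) return true;
  // Common heuristic: a NUL byte near the start marks a binary file.
  const head = readFileSync(path).subarray(0, 8192);
  return head.includes(0);
}
```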

0.0.16

April 23, 2025

  • Added dedicated page for managing evaluation functions
  • Added ability to check for updates and install updates on Settings page
  • UI/UX improvements
Release 0.0.16 - Evaluation functions page

0.0.10

April 21, 2025

  • Customizable columns for evaluation page
Release 0.0.10 - Customizable columns

0.0.9

April 20, 2025

  • Run evaluation on multiple models
  • Organize evaluations into experiments
  • Prompt library and context library
  • Built-in models and custom models

Download 16x Eval

Build your own evals. End the best-model debate once and for all.