0.0.53
July 29, 2025
- Fixed an issue where OpenRouter models with provider options did not work
0.0.52
July 29, 2025
- Added basic tool call support
- Added Tool Library page to manage and create tools for evaluation
- Added tools column and tool call column in evaluation table to display tool usage
- Added technical writing category
- Added copy benchmark as markdown functionality
- Various bug fixes and UI/UX improvements
0.0.51
July 23, 2025
- Added average rating and ranking counts on the benchmark page for better comparison
- Added rubrics support for experiments to provide more structured evaluation for human evaluators
- Added sort by prompt or response length functionality in eval table
- Added more granular rating options (5.5, 6.5, 8.25, 9.75) for better evaluation precision
- Various bug fixes and UI/UX improvements
0.0.50
July 21, 2025
- Improved handling of models like Kimi K2 that are served by many different providers
- Added compact layout setting for more efficient use of screen space
- Added support for OpenRouter provider options to specify the exact provider for OpenRouter models (see the sketch after this list)
- Added ability to duplicate evaluations to new experiments
- Improved benchmark page with better layout and functionality
- Fixed context rendering to properly handle whitespace
- Various UI/UX improvements
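
For context on the provider options above: OpenRouter's chat completions API accepts a `provider` routing object alongside the usual request fields. A minimal sketch of the kind of request this setting maps to (the model slug and provider name here are illustrative, not what the app necessarily sends):

```typescript
// Sketch of OpenRouter provider routing; field names follow
// OpenRouter's documented chat completions API.
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "moonshotai/kimi-k2", // illustrative slug
    messages: [{ role: "user", content: "Hello" }],
    // Pin the request to a specific upstream provider instead of
    // letting OpenRouter route it automatically.
    provider: {
      order: ["Groq"],        // preferred providers, in order (illustrative)
      allow_fallbacks: false, // fail instead of routing elsewhere
    },
  }),
});
const data = await response.json();
```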
0.0.49
July 16, 2025
- Added links to each provider's API key page in the settings page
- UI/UX improvements to the experiment page to allow more customization
0.0.48
July 15, 2025
- Added support for xAI as first-party model provider
- Added first-party support for Grok 4 model
- Changed the default model to Claude Sonnet 4
- Fixed thoughts/reasoning token counting logic for the OpenRouter provider
- Various UI/UX improvements
0.0.47
June 20, 2025
- Added pricing metrics for models in the evals page
- Added cost metrics for individual evaluations in the evals page
0.0.46
June 17, 2025
- Added evaluation comparison page to compare the results of two evaluations side by side
0.0.45
June 8, 2025
- Added support for Gemini 2.5 Pro Preview (06-05)
- Optimized the UX for adding OpenRouter models for new users
0.0.44
June 1, 2025
- Moved benchmark to a separate dedicated page with option to select models
- Added sorting options for experiments page
- Added temperature as an advanced setting
- Merged rating and notes modal for better UX
- Various UI/UX improvements
0.0.43
May 26, 2025
- Added a new table view for experiments with improved layout and 3-column display on large screens
- Added linking between experiments and evaluation functions so that linked evaluation functions are enabled automatically
- Added color-coded ranges to writing statistics for words per sentence and words per paragraph (see the sketch after this list)
- Added response token count display in statistics
- Added line wrapping option for better text readability
- Added copy as markdown feature
- Various UI/UX and performance improvements
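
The exact splitting rules behind the writing statistics aren't spelled out in these notes; a hypothetical sketch of how words per sentence and words per paragraph can be computed, with made-up color thresholds:

```typescript
// Hypothetical sketch of the writing statistics; the app's actual
// tokenization and thresholds may differ.
function writingStats(text: string) {
  const words = text.split(/\s+/).filter(Boolean);
  // Naive sentence split on terminal punctuation.
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0);
  // Paragraphs are separated by one or more blank lines.
  const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
  return {
    wordsPerSentence: words.length / Math.max(sentences.length, 1),
    wordsPerParagraph: words.length / Math.max(paragraphs.length, 1),
  };
}

// Example: map the range to a color (thresholds are illustrative).
const { wordsPerSentence } = writingStats(
  "Short one. And a slightly longer sentence follows."
);
const color =
  wordsPerSentence < 15 ? "green" : wordsPerSentence < 25 ? "yellow" : "red";
```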
0.0.42
May 23, 2025
- Added support for Claude Sonnet 4 and Claude Opus 4 models
In our sample evaluations for coding and writing tasks, Claude Opus 4 clearly outperformed the other models in both areas: it was the best performing model on all four tasks given.
Claude Sonnet 4 is also very impressive, placing first or second on every task and beating almost all other models.
0.0.41
May 22, 2025
- Added archive functionality for experiments, prompts, and contexts to avoid cluttering the list
- Added category grouping for evaluation functions and experiments in selection modal
- Added category icons and ensured the category filter persists across pages
- Improved image context handling and storage
- Fixed dark mode colors
- Various UI/UX improvements
0.0.39
May 21, 2025
- Added reasoning token count from OpenRouter provider
0.0.38
May 20, 2025
- Added search functionality across the app (experiments, prompts, contexts)
- Added green and red highlights in model responses for target and penalty strings from evaluation functions
- Added drag and drop support for re-ordering contexts
- Added created at timestamps for evaluation functions and experiments
- Added new options (8.75 and 9.25) for more granular rating
- Replaced 3rd-party LLM library with our own send-prompt library to support more features and improve stability
- Various UI/UX improvements and bug fixes
0.0.37
May 15, 2025
- Added support for system prompts
- Added advanced settings for more configuration options
- Added categories for prompts and contexts
- Various UI/UX improvements
0.0.35
May 9, 2025
- Added ability to duplicate and export single evaluation function
- Added support for penalizing occurrences of strings in evaluation functions (see the sketch after this list)
- Added ability to copy evaluation as markdown
- Added sorting by speed
- Various UI/UX improvements and bug fixes
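
The exact schema of an evaluation function isn't documented in these notes; a hypothetical sketch of target/penalty string scoring (all names, fields, and weights are illustrative):

```typescript
// Hypothetical evaluation function with target and penalty strings;
// the app's real schema and scoring weights are not documented here.
interface StringEvalFunction {
  name: string;
  targets: string[];   // highlighted green; reward occurrences
  penalties: string[]; // highlighted red; penalize occurrences
}

function scoreResponse(fn: StringEvalFunction, response: string): number {
  const count = (needle: string) =>
    response.toLowerCase().split(needle.toLowerCase()).length - 1;
  const hits = fn.targets.reduce((sum, t) => sum + count(t), 0);
  const misses = fn.penalties.reduce((sum, p) => sum + count(p), 0);
  return hits - misses;
}

const fn: StringEvalFunction = {
  name: "avoid-hedging",
  targets: ["in summary"],
  penalties: ["as an AI", "I cannot"],
};
console.log(scoreResponse(fn, "As an AI, I cannot say. In summary: maybe."));
// hits = 1, misses = 2 -> score -1
```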
0.0.32
May 8, 2025
- Various UI/UX improvements and bug fixes
0.0.31
May 7, 2025
- Added notes feature for evaluations with improved UX
- Added ability to cancel running evaluations
- Added provider logos and icons for models
- Added sorting by creation time
- Improved UI/UX for various pages
- Various code refactoring for better maintainability
0.0.30
May 1, 2025
- Added performance metrics including speed, writing statistics, and reasoning response
- Added ability to sort evaluations by rating or model name
- Added model highlighting feature in experiments page
- Increased timeout limit to 10 minutes
- Improved experiment page UI/UX
- Fixed image import and export functionality
0.0.29
April 28, 2025
- Added support for Azure OpenAI models
- Added experiment categories to help organize experiments
- Added auto-fill for custom model API settings (Fireworks for now)
- Fixed bugs with sending prompts to custom models (you need to delete existing custom models and create new ones)
- Fixed temperature setting bug for OpenAI reasoning models
- Various UI/UX improvements
0.0.28
April 26, 2025
- Added support for creation time and token stats columns in the evaluation page
- Improved export and import of evaluations containing images
- Major code refactoring to support more features
0.0.26
April 26, 2025
- Bug fixes related to app update (surprisingly, it's very tricky to get right)
0.0.24
April 25, 2025
- Added support for image context. You can now send images as context to models that support them (see the sketch after this list).
- Added ability to import evaluations from a JSON file that was exported previously
- Added a link to release notes in the settings page
- Improved the UI/UX of app updates
- Fixed various bugs
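
A minimal sketch of how image context is typically sent using the OpenAI-style multimodal message format (the app's internal wiring may differ; the file name is illustrative):

```typescript
import fs from "node:fs";

// Read an image and attach it as a data URL alongside the text prompt.
// Models without image support will reject this message shape.
const imageBase64 = fs.readFileSync("diagram.png").toString("base64");

const messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "Summarize this diagram." },
      {
        type: "image_url",
        image_url: { url: `data:image/png;base64,${imageBase64}` },
      },
    ],
  },
];
```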
0.0.19
April 23, 2025
- Fixed a bug where API keys cannot be pasted into the API key input field
- Fixed a bug where importing multiple files from the file system would fail
- Added built-in exclusion filter for large files (>1MB) and binary files (see the sketch after this list)
- Image file support is coming soon
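
A hypothetical sketch of such an exclusion filter, assuming a size check plus the common null-byte heuristic for detecting binary files (the app's actual heuristics are not documented here):

```typescript
import fs from "node:fs";

// Skip files larger than 1 MB and files that look binary.
const MAX_SIZE = 1024 * 1024; // 1MB

function shouldExclude(path: string): boolean {
  const stat = fs.statSync(path);
  if (stat.size > MAX_SIZE) return true;
  // Common heuristic: a null byte in the first 8 KB means binary.
  const fd = fs.openSync(path, "r");
  try {
    const buf = Buffer.alloc(8192);
    const bytesRead = fs.readSync(fd, buf, 0, buf.length, 0);
    return buf.subarray(0, bytesRead).includes(0);
  } finally {
    fs.closeSync(fd);
  }
}
```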
0.0.16
April 23, 2025
- Added dedicated page for managing evaluation functions
- Added ability to check for updates and install updates on Settings page
- UI/UX improvements
0.0.10
April 21, 2025
- Customizable columns for evaluation page
0.0.9
April 20, 2025
- Run evaluation on multiple models
- Organize evaluations into experiments
- Prompt library and context library
- Built-in models and custom models