0.0.46
June 17, 2025
- Added evaluation comparison page to compare the results of two evaluations side by side
0.0.45
June 8, 2025
- Added support for Gemini 2.5 Pro Preview (06-05)
- Optimized the UX for adding OpenRouter models for new users
0.0.44
June 1, 2025
- Moved benchmark to a separate dedicated page with option to select models
- Added sorting options for experiments page
- Added temperature as an advanced setting
- Merged rating and notes modal for better UX
- Various UI/UX improvements
0.0.43
May 26, 2025
- Added a new table view for experiments with improved layout and 3-column display on large screens
- Added experiment linking with evaluation functions to automatically enable the linked evaluation functions
- Added writing statistics color-coded ranges for words per sentence and words per paragraph
- Added response token count display in statistics
- Added line wrapping option for better text readability
- Added copy as markdown feature
- Various UI/UX and performance improvements
0.0.42
May 23, 2025
- Added support for Claude Sonnet 4 and Claude Opus 4 models
In our sample evaluation for coding and writing tasks, Claude Opus 4 absolutely dominated other models in both coding and writing tasks. It is the best performing model for all 4 tasks given.
Claude Sonnet 4 is also very impressive, coming in top 1 or top 2 in all tasks, beating almost all other models.
0.0.41
May 22, 2025
- Added archive functionality for experiments, prompts, and contexts to avoid cluttering the list
- Added category grouping for evaluation functions and experiments in selection modal
- Added category icons and ensured category filter persist across pages
- Improved image context handling and storage
- Fixed dark mode colors
- Various UI/UX improvements
0.0.39
May 21, 2025
- Added reasoning token count from OpenRouter provider
0.0.38
May 20, 2025
- Added search functionality across the app (experiments, prompts, contexts)
- Added green and red highlights for target and penalty strings from evaluation functions in model responses
- Added drag and drop support for re-ordering contexts
- Added created at timestamps for evaluation functions and experiments
- Added new options (8.75 and 9.25) for more granular rating
- Replaced 3rd-party LLM library with our own send-prompt library to support more features and improve stability
- Various UI/UX improvements and bug fixes
0.0.37
May 15, 2025
- Added support for system prompts
- Added advanced settings for more configuration options
- Added categories for prompts and contexts
- Various UI/UX improvements
0.0.35
May 9, 2025
- Added ability to duplicate and export single evaluation function
- Added support for penalizing occurrences of strings in evaluation functions
- Added ability to copy evaluation as markdown
- Added sorting by speed
- Various UI/UX improvements and bug fixes
0.0.32
May 8, 2025
- Various UI/UX improvements and bug fixes
0.0.31
May 7, 2025
- Added notes feature for evaluations with improved UX
- Added ability to cancel running evaluations
- Added provider logos and icons for models
- Added sorting by creation time
- Improved UI/UX for various pages
- Various code refactoring for better maintainability
0.0.30
May 1, 2025
- Added performance metrics including speed, writing statistics, and reasoning response
- Added ability to sort evaluations by rating or model name
- Added model highlighting feature in experiments page
- Increased timeout limit to 10 minutes
- Improved experiment page UI/UX
- Fixed image import and export functionality
0.0.29
April 28, 2025
- Added support for Azure OpenAI models
- Added experiment categories to help organize experiments
- Added auto-fill for custom model API settings (Fireworks for now)
- Fixed bugs with sending prompts to custom models (You need to delete the existing custom models and create a new ones)
- Fixed temperature setting bug for OpenAI reasoning models
- Various UI/UX improvements
0.0.28
April 26, 2025
- Added support for creation time and token stats columns in the evaluation page
- Improve export and import of evaluations containing images
- Major code refactoring to support more features
0.0.26
April 26, 2025
- Bug fixes related to app update (surprisingly, it's very tricky to get right)
0.0.24
April 25, 2025
- Added support image context. You can now send images as context to the models that support it.
- Added ability to import evaluations from a JSON file that was exported previously
- Added a link to release notes in the settings page
- Improve the UI/UX of app update
- Fixed various bugs
0.0.19
April 23, 2025
- Fixed a bug where API keys cannot be pasted into the API key input field
- Fixed a bug where importing multiple files from the file system would fail
- Added built-in exclusion filter for large files (>1MB) and binary files
- Image file support is coming soon
0.0.16
April 23, 2025
- Added dedicated page for managing evaluation functions
- Added ability to check for updates and install updates on Settings page
- UI/UX improvements
0.0.10
April 21, 2025
- Customizable columns for evaluation page
0.0.9
April 20, 2025
- Run evaluation on multiple models
- Organize evaluations into experiments
- Prompt library and context library
- Built-in models and custom models