model bench

LLM comparison workbench

Fire one prompt at multiple models simultaneously, stream responses side by side with real-time metrics (latency, token count, cost), then score them with manual ratings or an LLM-as-judge evaluation.
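Under the hood the fan-out is just parallel requests with per-model timing. Here is a minimal sketch of that idea; `runBench`, the injected `callModel` helper, and the metric shape are illustrative names, not the app's actual API.

```ts
// Fan one prompt out to several models in parallel and record simple metrics.
// `callModel` is a hypothetical helper that sends one prompt to one model and
// resolves with its reply text plus a token count.
type RunResult = {
  model: string;
  text: string;
  latencyMs: number;
  tokens: number;
};

async function runBench(
  prompt: string,
  models: string[],
  callModel: (model: string, prompt: string) => Promise<{ text: string; tokens: number }>
): Promise<RunResult[]> {
  // Fire all requests at once; each settles independently so one slow or
  // failing model never blocks the others.
  const runs = models.map(async (model) => {
    const start = performance.now();
    const { text, tokens } = await callModel(model, prompt);
    return { model, text, latencyMs: performance.now() - start, tokens };
  });

  const settled = await Promise.allSettled(runs);
  return settled
    .filter((r): r is PromiseFulfilledResult<RunResult> => r.status === "fulfilled")
    .map((r) => r.value);
}
```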

1. Select models
2. Write prompt
3. Hit run
4. Evaluate
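Step 4's LLM-as-judge option boils down to sending each answer back to a judge model with a scoring rubric and parsing the number it returns. A rough sketch, assuming a hypothetical `complete` helper and a simple 1-10 rubric:

```ts
// Score each candidate answer with a judge model. The rubric wording and the
// `complete` helper are illustrative assumptions, not the app's exact prompts.
async function judgeAnswers(
  prompt: string,
  answers: { model: string; text: string }[],
  complete: (judgePrompt: string) => Promise<string>
): Promise<{ model: string; score: number }[]> {
  const scored: { model: string; score: number }[] = [];
  for (const { model, text } of answers) {
    const judgePrompt =
      `Rate the following answer to the prompt on a 1-10 scale. ` +
      `Reply with only the number.\n\nPrompt:\n${prompt}\n\nAnswer:\n${text}`;
    const reply = await complete(judgePrompt);
    const score = Number.parseFloat(reply.trim());
    // Fall back to 0 if the judge replies with something unparseable.
    scored.push({ model, score: Number.isFinite(score) ? score : 0 });
  }
  return scored;
}
```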

API key not set

Model bench routes all requests through OpenRouter, a single gateway to Claude, GPT, Gemini, DeepSeek, and more. Add your own key in settings to unlock streaming.
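For reference, a streaming request through OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like the sketch below (assuming Node 18+ for global `fetch`); the model ID, env-var key, and error handling are illustrative, not the app's exact code.

```ts
// Minimal streaming call against OpenRouter. Tokens arrive as server-sent
// events, one JSON chunk per content delta, terminated by "data: [DONE]".
async function streamCompletion(prompt: string, model = "anthropic/claude-3.5-sonnet") {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      // In the app the key comes from Settings; an env variable stands in here.
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE frames are newline-separated "data: {...}" lines; keep any partial
    // line in the buffer until the next chunk completes it.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip comments / keep-alives
      const data = line.slice("data: ".length).trim();
      if (data === "[DONE]") continue;
      const delta = JSON.parse(data).choices?.[0]?.delta?.content;
      if (delta) process.stdout.write(delta);
    }
  }
}
```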