The model dashboards enable you to monitor and interpret a model's performance.
There are three dashboards:
Deployed models
Allowed models, not deployed
External API Models
Allowed Models, Not Deployed
This section describes the model dashboard for models that have been allowed but are not deployed.
Note that the dashboard only shows data when there is traffic.
| # | Metric | Description | Example/Usefulness |
|---|---|---|---|
| 1 | Average Throughput | The average number of requests processed by the model per unit of time. Measured in requests per minute (rpm) or tokens per second, depending on configuration. | Indicates how many queries the model can handle on average. Helps you understand the model's performance capabilities. |
| 2 | Average Time to First Token (TTFT) | The average time, in seconds, for the model to return the first token of the response. | Reflects perceived responsiveness. Critical for user experience, since it shows the wait time before output appears. |
| 3 | Throughput Over Time | A time-series chart showing changes in throughput over a selected period. | Useful for identifying performance trends, traffic spikes, or bottlenecks. Helps with performance monitoring and improving model efficiency. |
| 4 | Time to First Token Over Time | A time-series chart showing changes in TTFT over a selected period. | Useful for spotting and troubleshooting latency variations caused by load, network issues, or model performance. Helps you understand model response times. |
| 5 | Error Percentage | The percentage of all requests that resulted in errors. | Indicates reliability. Example: if 2 out of 100 requests fail, the error percentage is 2%. |
| 6 | Error Rate Over Time | A time-series chart showing the percentage of failed requests over time. | Useful for detecting when errors occur and correlating them with system conditions. Critical for understanding error trends and maintaining system health. |
| 7 | Time Period Displayed | The time window used for all metrics on the dashboard, such as the last hour, last 24 hours, or last 7 days. | Sets the context for the data being reviewed. |
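To make the aggregate metrics concrete, here is a minimal sketch of how average throughput, average TTFT, and error percentage could be derived from a request log. The log structure and field names (`ttft_s`, `ok`) are illustrative assumptions, not the dashboard's actual data schema.

```python
from statistics import mean

# Hypothetical request log for a 2-minute window: each entry records the
# time to first token (seconds) and whether the request succeeded.
requests = [
    {"ttft_s": 0.42, "ok": True},
    {"ttft_s": 0.55, "ok": True},
    {"ttft_s": 1.10, "ok": False},
    {"ttft_s": 0.48, "ok": True},
]
window_minutes = 2  # length of the observation window

# Average Throughput: requests processed per minute (rpm).
avg_throughput_rpm = len(requests) / window_minutes

# Average TTFT: mean time to first token across all requests.
avg_ttft_s = mean(r["ttft_s"] for r in requests)

# Error Percentage: failed requests as a share of all requests.
error_pct = 100 * sum(1 for r in requests if not r["ok"]) / len(requests)

print(f"Average throughput: {avg_throughput_rpm:.1f} rpm")  # 2.0 rpm
print(f"Average TTFT: {avg_ttft_s:.2f} s")                  # 0.64 s
print(f"Error percentage: {error_pct:.0f}%")                # 25%
```

The time-series charts (metrics 3, 4, and 6) plot these same values computed per bucket over the selected time period rather than over a single window.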
Deployed Open Source Models
This section describes the model dashboard for models that are already deployed.
| # | Metric | Description | Example/Usefulness |
|---|---|---|---|
| 1 | Latency Metrics (P50 E2E Latency) | The median (P50) end-to-end latency of requests, i.e., the time it takes for a request to be processed and a response returned. | Indicates how quickly the model responds under typical conditions. Lower latency means better responsiveness. |
| 2 | Request Throughput (Request Success Rate) | The number of successful requests handled by the model per second. | Shows how many requests the model can process concurrently and whether it is keeping up with demand. |
| 3 | Token Throughput (Prompt/Output Tokens per Second) | The rate at which the model processes prompt tokens and generates output tokens, measured in tokens per second. | Useful for evaluating inference efficiency and ensuring the model meets performance requirements for larger inputs and outputs. |
| 4 | Request Queue (Waiting Requests) | The number of incoming requests waiting in the queue because the model is at capacity. | A growing queue indicates the model cannot keep up with traffic, which could point to resource limits or scaling needs. |
| 5 | Token Latency | The average time to process or generate a single token. | Provides a fine-grained view of efficiency beyond end-to-end request latency. |
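The latency and token metrics above can be sketched from per-request samples. This is a simplified illustration with made-up numbers; the sample lists and variable names are assumptions, not the dashboard's internal representation.

```python
from statistics import median

# Hypothetical per-request samples: end-to-end latency (seconds) and the
# number of output tokens generated for each request.
latencies_s = [0.8, 1.2, 0.9, 2.5, 1.0]
output_tokens = [120, 200, 150, 400, 160]

# P50 E2E Latency: the median end-to-end request latency.
p50_latency_s = median(latencies_s)

# Token Throughput: output tokens generated per second of processing time.
total_time_s = sum(latencies_s)
token_throughput = sum(output_tokens) / total_time_s

# Token Latency: average time spent per generated token.
token_latency_s = total_time_s / sum(output_tokens)

print(f"P50 E2E latency: {p50_latency_s:.2f} s")
print(f"Token throughput: {token_throughput:.1f} tokens/s")
print(f"Token latency: {token_latency_s * 1000:.1f} ms/token")
```

Note how one slow request (2.5 s) raises the average but barely moves the P50, which is why the dashboard reports the median for typical-case responsiveness.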