This product helps you explore and analyse large datasets with an LLM.

You can point it at any JSON-based data URL. The server ingests the data, and you can then chat with the LLM about it.

Demo Video

Usage Guide

Image 1
  1. Set the number of queues used to answer your queries with the LLM. Note that more queues consume more GPU memory; the mapping is one query per queue.
  2. You can install models from the available list. Models that are already installed show an Installed tag.
  3. You can uninstall models from the list of installed models.
  4. Optionally, provide your dataset in JSON form here.
  5. Probe interval for the GPU bandwidth monitor; a smaller interval yields finer-grained metrics.
  6. Single query runs just one query. Batch query runs multiple queries in one go; for this you need a CSV file. An example of a set of queries based on the popular show Pokémon is below.
  7. Select the LLM call type. There are two options: a direct call and a call through Ollama. Only direct calls can profile GPU bandwidth with NCU. Ollama call: uses the Ollama library, which handles loading and serving of LLM models. Direct call: leverages LlamaIndex to manage interactions with LLM models, enabling GPU bandwidth profiling through NCU; by controlling the GPU layers ourselves, we gain finer control over the process.
  8. This option is available only when a direct call is selected. If "profile GPU bandwidth" is enabled, the run also returns an additional CSV containing the bandwidth profiling data.
  9. Select the model from the list of installed models.
  10. Enter the query you want to ask the LLM, or upload a CSV if you selected batch query.
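The batch-query CSV from step 6 can be prepared programmatically. Below is a minimal sketch using Python's standard `csv` module; the single `query` column layout and the file name `batch_queries.csv` are assumptions for illustration, not the documented format.

```python
import csv

# Hypothetical Pokémon-themed batch queries; the single "query" column
# layout is an assumption, not the server's documented format.
queries = [
    "Which Pokemon does Ash catch first?",
    "Who are the members of Team Rocket?",
    "What type is Pikachu?",
]

# Write one query per row under a "query" header.
with open("batch_queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])
    writer.writerows([q] for q in queries)

# Read the file back to verify its structure.
with open("batch_queries.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows))
print(rows[0]["query"])
```

Each row then maps to one queue at run time, per the one-query-per-queue rule above.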
Image 2
  1. run: Run the query with inputs
  2. install: Install available models
  3. ls: List available models
  4. uninstall: Remove installed models
  5. show: Show the output
  6. sessions: Show all sessions
  7. rm: Remove all sessions or a specific session
  8. config: Set the session deletion period
  9. --verbose: Show the process logs
  10. help [command]: Display help for a command
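Taken together, a typical session might look like the sketch below. The executable name `cli` is a placeholder (the actual binary name is not given here); the subcommands are the ones listed above.

```
# list available models and install one
cli ls
cli install <model-name>

# run a query and inspect the result
cli run --model <model-name> "What type is Pikachu?"
cli show

# manage sessions
cli sessions
cli rm <session-id>
```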