Taking a GenAI Project to Production

Generative AI and Large Language Models (LLMs) are the latest revolution in Artificial Intelligence, bringing the world capabilities we could only dream about less than two years ago. Unlike previous milestones, such as Deep Learning, everything in the current AI revolution is happening faster than ever before.

Many feel that the train is about to leave the station, and since we are talking about bullet trains – every day matters. So how do you get to the station to board that train on time, and more importantly – how do you know you are boarding the train that makes the most sense for your application?

Sooner rather than later, most software development operations will be integrating Generative AI into their applications, so let’s take a look at some key issues that will help make sure you are on the right train and headed in the right direction.

Existing Solutions

Here are three common solutions for connecting with Large Language Models (LLMs); a minimal code sketch of the first and third options follows the list:

  • Company specific: One of the most straightforward solutions to implement is connecting directly to the existing models offered on leading developer platforms such as OpenAI, Google, Mistral, Anthropic, Cohere, DeepSeek and others.
  • Model aggregators: Certain cloud vendors offer AI support with direct connections to several models through managed services such as Amazon Bedrock and Microsoft Azure. The advantage of this option is gaining access to several customizable models from a single supplier.
  • Open source: The web has a number of model repositories (the most popular being Hugging Face) whose models can be used for commercial applications. These models offer a high level of flexibility in terms of fine-tuning to your requirements. It is also the only option that works for applications involving sensitive data that cannot leave your infrastructure.
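
To make the first and third options concrete, here is a minimal sketch of both: a direct call to OpenAI's chat completions API, and loading an open-source model from Hugging Face with the transformers library. The model names and prompt are illustrative placeholders, not recommendations.

```python
# Option 1: direct connection to a provider's API (here, OpenAI's Python SDK).
# Assumes OPENAI_API_KEY is set in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our release notes."}],
)
print(response.choices[0].message.content)

# Option 3: self-hosting an open-source model pulled from Hugging Face.
# The model ID is just an example; pick one whose license fits your use case.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
print(generator("Summarize our release notes.", max_new_tokens=100)[0]["generated_text"])
```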

Remember that besides running the models, you’ll also need a place to store the models and a way to manage them, using solutions such as JFrog Artifactory for MLOps.
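
As one hypothetical illustration of that management layer, the huggingface_hub client honors an HF_ENDPOINT environment variable, so model downloads can be routed through an internal registry or proxy (such as an Artifactory remote repository) instead of going straight to the public hub. The URL below is a placeholder, not a real endpoint.

```python
import os

# Hypothetical setup: route Hugging Face downloads through an internal
# registry (e.g., an Artifactory remote repository). The URL is a placeholder.
os.environ["HF_ENDPOINT"] = "https://artifactory.example.com/api/huggingfaceml/hf-remote"

from huggingface_hub import snapshot_download

# Models are now fetched (and cached) through the internal endpoint,
# giving you a single place to store, audit, and manage them.
local_path = snapshot_download("mistralai/Mistral-7B-Instruct-v0.2")
print(local_path)
```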

MaaS or Self-Hosted?

The very first question when kicking off a Generative AI project is probably: where does the model sit? The age of Deep Learning taught most of us to use our own proprietary models, usually deployed on cloud services. GenAI gives us a new option – Model-as-a-Service (MaaS) – which quickly became the mainstream option in many people’s minds, thanks to OpenAI. So which approach should you use?

Let’s break this down into the issues that need to be considered:

Costs

The most important question is usually: how much will this cost us? Comparing the costs of MaaS and self-hosted models is not trivial, as MaaS is usually charged per token, while self-hosted costs are determined by computing power and usage. So it really comes down to the question of predicted usage: the higher the usage, the more likely it is that you will want a self-hosted model.
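
A rough back-of-the-envelope comparison can anchor that decision. The sketch below computes the monthly token volume at which an always-on self-hosted GPU breaks even with per-token MaaS pricing; every number is an illustrative assumption, not a current vendor rate.

```python
# Illustrative break-even sketch; all prices are assumptions, not quotes.
maas_price_per_1k_tokens = 0.002   # assumed $ per 1K tokens
gpu_cost_per_hour = 1.50           # assumed $ per hour for one self-hosted GPU
hours_per_month = 24 * 30

self_hosted_monthly = gpu_cost_per_hour * hours_per_month
breakeven_tokens = self_hosted_monthly / maas_price_per_1k_tokens * 1_000

print(f"Self-hosting one always-on GPU: ~${self_hosted_monthly:,.0f}/month")
print(f"MaaS costs the same at ~{breakeven_tokens / 1e6:,.0f}M tokens/month")
```

Below that break-even volume, paying per token is likely cheaper; above it, self-hosting starts to win, before even counting rate limits or latency.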

Security

In some cases, sending queries over the web to a remotely hosted service might be a security risk or a policy violation. If this is the case, then MaaS may not be the right option for you.

On the other hand, it’s worth noting that security also needs to be taken into consideration when using open-source models, even when they come from well-known sites such as Hugging Face or Kaggle. These sites are essentially model-hosting services, and while they do offer some level of security, you must be careful when running open-source models on your own infrastructure, as these services can also host malicious models.
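
One concrete mitigation, sketched below: classic PyTorch checkpoints are pickle files that can execute arbitrary code when deserialized, so prefer weights in the safetensors format, which stores only tensor data. The model ID here is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model ID

# Refuse pickle-based checkpoints: safetensors files contain only tensor
# data, so loading them cannot trigger arbitrary code execution.
model = AutoModelForCausalLM.from_pretrained(model_id, use_safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```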

Networking

One of the biggest disadvantages of using MaaS is networking. Every request must be sent to another server, most likely in another region, which causes a significant increase in latency. In addition, certain platforms might limit the number of requests you can make per hour, which could become an issue if you are designing a core product that entails a significant number of requests. Once again, the higher the usage, the less suitable MaaS is for you.
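
If you do go with MaaS, rate limits are usually surfaced as a retryable error, so client-side backoff is worth building in from day one. A minimal sketch, assuming the OpenAI Python SDK's RateLimitError; adapt the exception type to whichever provider you use.

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def complete_with_backoff(prompt: str, max_retries: int = 5) -> str:
    # Retry rate-limited requests with exponential backoff plus jitter.
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # illustrative model name
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate-limited after retries")
```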

Choosing the Right Model

Not all LLMs are created equal, and for a very good reason. Each model has its own advantages – and disadvantages. Finding the best model for your task requires some consideration regarding model size and language support.

Size

Yes, size matters. The larger the model, the more capable it is of handling non-trivial tasks. But larger models are also slower and more expensive, whether you are paying per token or for the machines required to run them. Always try to find the minimal size that meets your objectives; using unnecessarily large models is a sure way to burn through your budget.
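
To see why size translates directly into hardware cost, a quick rule of thumb: weight memory is roughly parameter count times bytes per parameter, and activations and the KV cache add more on top. A small sketch:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # Rough lower bound: model weights only, at fp16/bf16 precision (2 bytes).
    # Activations, the KV cache, and framework overhead come on top of this.
    return params_billion * bytes_per_param

for size in (7, 13, 70):
    print(f"{size}B parameters -> ~{weight_memory_gb(size):.0f} GB just for weights")
```

A 7B model fits on a single commodity GPU at fp16, while a 70B model needs roughly 140 GB for weights alone – a very different hardware bill.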

Language

If your task involves nothing but free text in English, you can pretty much skip this section. But if you require more than that, you might need to rethink which models to use. Some models are fine-tuned to output code, while others focus on specific languages. There are LLMs designed primarily for English that are also capable of learning additional languages. The problem with these models is that performance can degrade as more languages are added. In such cases, it is preferable to use LLMs that were designed specifically for multilingual applications. Choosing the model best suited for your target language can make a huge difference.

The Pace Is Fast

It’s hard to grasp, but NLP Generative AI for the masses is about 18 months old. What started as a single experimental language product at the end of November 2022 had grown to more than 15,000 models by mid-2023, and we’re just getting started. Models that were state-of-the-art just a few months ago have been surpassed by new alternatives – from Llama 3 raising the bar for open-source models to Claude 3 showing better performance than GPT-4 – and more promising technologies are emerging even as you read these very words. Keeping up with the latest models requires AI developers to accelerate adoption rates and technology-shift cycles faster than ever before.

This means constantly reassessing your current model’s performance against new alternatives, and retraining existing applications with newer models to improve performance, even if the current version is fully functional. Remember that even though your performance might be in line with expectations, your competitors may be adopting newer models with significantly better results.
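
That reassessment doesn't have to be heavyweight. A minimal sketch of a regression-style check, assuming a hypothetical generate(model_name, prompt) helper that wraps whichever SDK or serving stack you actually use:

```python
def generate(model_name: str, prompt: str) -> str:
    # Stub: replace with a real call to your MaaS provider or self-hosted endpoint.
    return ""

# A fixed prompt set with expected answers, kept under version control.
EVAL_SET = [
    ("What is the capital of France?", "Paris"),
    ("What is 12 * 12?", "144"),
]

def score(model_name: str) -> float:
    # Fraction of prompts whose output contains the expected answer.
    hits = sum(expected in generate(model_name, prompt) for prompt, expected in EVAL_SET)
    return hits / len(EVAL_SET)

for candidate in ("current-model", "challenger-model"):  # placeholder names
    print(candidate, score(candidate))
```

Running the same fixed prompt set against each new candidate model gives you a cheap, repeatable signal on whether switching is worth the effort.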

The Bottom Line

LLMs are the closest thing we have these days to true magic, and as fascinating as that might be, choosing the wrong model can cost you a lot of time and money, while selecting the right one can make the difference between success and failure. We hope these insights are useful and help you leap forward with your own AI applications using JFrog MLOps solutions.