Building Scalable Architectures for LLM-based Apps

Generative AI is going to be part of every application, if not the thing the application is built around. With GenAI becoming the new UI, it offers a superior user experience: there is nothing more user-friendly than asking to achieve something and, here you go, the AI does it all. But right now, building apps for GenAI is extremely tricky and expensive if they are not designed properly. You will hit challenges just about everywhere, from data processing to building scalable infrastructure to routing user traffic through the LLM.

Building an end-to-end solution requires your application to be scalable, performant, secure and cheap. Usually, scaling means bigger machines and bigger costs, and with heavy use cases like RAG applications, performance can actually degrade as you scale. At UTHEREAL.ai we have been through all of this and built a pretty darn good solution that we are very keen to share with the world. We wanted to start our first series of blog posts with some of the things we learned after a year of building.

Where to Build?

Generative AI is all the rage across the cloud platforms; you will find plenty of tooling in GCP, AWS and Azure. All of them have their pros and cons. Here is what we learned (as of the time we reviewed them):

AWS: the most flexible, but it requires skill to build on due to the sparsity of ready-made applications and tooling and the many different ways you can end up building the same application. LLM-wise, it is our favourite: there are plenty of LLMs to choose from, some open source and others paid, and you can also bring any other LLM on the market if you know how to work with its API.

GCP: the most builder-friendly; there is a scalable API for pretty much everything you need to do with your RAG application. Think twice before trying to build your own (we did anyway). It is hard to pass on, especially if the LLM is only a small part of your application.

Azure: has OpenAI and also offers a lot of tooling, although we found it the hardest to work with due to the more complex nature of the platform, and the pricing was harder for us to pin down and fully understand.

Others: platforms such as Hugging Face, NVIDIA and smaller startups have created more specialised APIs that let developers do an awful lot of the processing with little code. Do your own research; there are quite a few, especially around RAG.

So our advice is to choose the provider you are most comfortable with in terms of tooling and building experience, because sooner or later you will hit the limits of your technology, and that is where familiarity comes in handy for getting over those fun hurdles.

Building UTHEREAL.ai

UTHEREAL.ai has a number of key processes supporting our user journey; data ingestion and search/inference are the two where most of the work has gone. The first thing we noticed was the resource requirements of these journeys: each needs a different environment to run efficiently, yet there is also compute that both of them share. This meant we needed more flexibility in our architecture, along with good, sane orchestration.
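
As a rough illustration of what we mean (the names and numbers below are made up for this post, not our actual configuration): ingestion is bursty and throughput-bound, while search/inference is steady and latency-bound, so each gets its own scaling envelope:

```python
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    """Scaling envelope for one process in the pipeline (illustrative only)."""
    name: str
    vcpus: int        # per-worker compute
    memory_gb: int    # per-worker memory
    min_workers: int  # floor: keep latency-sensitive paths warm
    max_workers: int  # ceiling: cap cost during bursts

# Ingestion is bursty and throughput-bound: scale to zero when idle,
# fan out wide when a large dataset lands.
INGESTION = ServiceProfile("ingestion", vcpus=8, memory_gb=32,
                           min_workers=0, max_workers=64)

# Search/inference is steady and latency-bound: keep a warm floor so
# the first user request never waits on a cold start.
INFERENCE = ServiceProfile("search-inference", vcpus=4, memory_gb=16,
                           min_workers=2, max_workers=16)
```

Keeping the envelopes explicit like this makes the orchestration questions (which queue, which worker pool, which autoscaler) much saner to reason about.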

These processes also have their own latency expectations and vastly different invocation frequencies, since some of them bear far more requests than others, which again meant more flexibility and fine-tuned scalability. We also wanted the processes to be resilient: failures are expected all the time, and we need to detect and recover from them quickly. That makes logging and monitoring key, ideally proactive rather than reactive. So what did we do? How did we make those architecture decisions?
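
Before answering that, here is a minimal sketch of the "detect and recover quickly" part: a generic retry wrapper with exponential backoff and jitter (all names are illustrative, not our production code), logging every failure so monitoring can flag degradation before users feel it:

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def call_with_retries(fn, *args, attempts=4, base_delay=0.5, **kwargs):
    """Run fn, retrying on failure with exponential backoff and jitter.

    Every failure is logged, so monitoring can alert on rising retry
    rates (proactive) instead of waiting for an outage (reactive).
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # broad on purpose: this is a sketch
            log.warning("attempt %d/%d of %s failed: %s",
                        attempt, attempts, fn.__name__, exc)
            if attempt == attempts:
                log.error("%s exhausted retries, giving up", fn.__name__)
                raise
            # Backoff doubles each attempt; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.2))
```

Wrapping any flaky call like this, e.g. `call_with_retries(embed_chunk, chunk)` for some hypothetical `embed_chunk`, turns transient failures into log lines you can alert on rather than user-facing errors.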

Building a Monolith vs. Serverless

This is an early decision we were faced with. Your options here are effectively unlimited in terms of what you can put together for your app, but essentially: are you building serverless microservices, meaning many services doing their own thing, talking to each other, each handling its own scalability? Or are you building a beefy rack of servers that handles the same workload repeatedly and scales horizontally or vertically? The pros and cons are many, and they all depend on your app architecture, but here is what we found (assuming you have architected your app as well as you can):

Serverless Pros

  • A service fit for each process, fine-tuned and built for that purpose

  • Each service can scale on its own; all you need to do is allow it to (safely, if you know how)

  • Broken parts can be swapped out for better ones at any time without total disaster or downtime

  • Cost is more transparent per process, so you can optimise each one separately

  • Can be cheaper, because each service is only called when needed and scales only when needed

  • It is just cooler; some say it's the Modern Architecture

Serverless Cons

  • Bad days ahead if you don't know what you are doing: bad handovers between services, and losing your grip on requirements, security roles or compatibility between your services will consume you

  • Funnily enough, it is not always cheaper; in fact it can be very expensive if the right controls are not put in place

  • More parts to manage and update, and more code bases, some of them bespoke

  • Security is more complex if not managed well from the start

Monolith Pros

  • Simpler architecture: less messy, with fewer parts to manage

  • Simpler security

  • Can be cheaper if optimised well and reserved ahead of time for a long period (1–3 years)

  • Can be extremely reliable if tuned to the right tasks

Monolith Cons

  • Old school, not cool anymore

  • A single failure can take down most of the app

  • You need to worry more about redundancy and fallbacks

So what is the answer? The best approach is to study your application and its processes well, use serverless for the right tasks, and perhaps keep the repeated, steady processes within a monolith; that way you get the best of both.
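
To make that hybrid concrete, here is a hypothetical dispatcher (the task names and helpers are invented for illustration): bursty, infrequent jobs go onto a queue consumed by serverless workers that scale to zero, while the steady hot path stays on the always-on core service:

```python
import json
import queue

# Stand-in for a managed queue (SQS, Pub/Sub, ...) feeding serverless workers.
serverless_queue: "queue.Queue[str]" = queue.Queue()

# Bursty, infrequent work: pay per invocation, scale to zero when idle.
BURSTY_TASKS = {"ingest_documents", "rebuild_index", "batch_embed"}

def handle_on_core_service(task: str, payload: dict) -> None:
    """Steady, latency-sensitive hot path on the always-on service."""
    print(f"core service handling {task}: {payload}")

def dispatch(task: str, payload: dict) -> None:
    if task in BURSTY_TASKS:
        serverless_queue.put(json.dumps({"task": task, "payload": payload}))
    else:
        handle_on_core_service(task, payload)

dispatch("rebuild_index", {"collection": "docs"})  # goes to serverless workers
dispatch("search", {"query": "scaling RAG"})       # stays on the monolith
```

The in-memory queue here stands in for whatever managed queue your platform provides; the point is the split, not the transport.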

Lastly, we saw the need to speed up our application by parallelising computational tasks and aggregating their results. This meant careful coding plus infrastructure deployed in an organised way to manage scaling for burst moments.
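
Here is a minimal sketch of that fan-out-and-aggregate pattern (the `process_chunk` stub stands in for the real work, an embedding or LLM call, say): a semaphore caps concurrency so burst moments don't overwhelm downstream services, and `asyncio.gather` collects the results in order:

```python
import asyncio

MAX_CONCURRENCY = 8  # cap fan-out so bursts don't overwhelm downstream services

async def process_chunk(chunk: str) -> str:
    """Stub standing in for real work (an embedding or LLM call, say)."""
    await asyncio.sleep(0.1)
    return chunk.upper()

async def process_all(chunks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(chunk: str) -> str:
        async with sem:  # at most MAX_CONCURRENCY tasks in flight at once
            return await process_chunk(chunk)

    # Fan out all chunks, then aggregate results in their original order.
    return await asyncio.gather(*(bounded(c) for c in chunks))

results = asyncio.run(process_all([f"chunk-{i}" for i in range(20)]))
print(len(results), "chunks processed")
```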

Conclusion

At UTHEREAL.ai we are very research-driven; we do our homework and don't just rush into building. Our intention is to build strategically, as we hate redoing work. Our infrastructure handles large datasets really well: it scales gracefully under heavier workloads and scales down to the minimum resources required to keep the infrastructure alive. We still have lots of work to do. It is not a sprint, it's a marathon.

Kidus, Lead Engineering @UTHEREAL

W@UTHEREAL.ai

Wael, CEO & Lead Architect
