Atlas: Streamlining BigBasket’s Testing Across 40+ Business Lines and 80+ Microservices in Non-Production Environments

💡
This blog post originally appeared on BigBasket's Tech Blog and is now featured here!

TL;DR:

  1. HTTP header-based routing, per experiment/project variant.

  2. An in-house Nginx proxy-based solution that does “multi-project” routing.

  3. Automated Terraform and Nginx config generation makes infra setups easier, faster, and better.

  4. A nicer web UI that makes things transparent for everybody.

Which one is more complex to understand: BigBasket’s Non-Production Cloud setup or its Production counterpart? Ask any engineer on BigBasket’s Tech teams and they will immediately say, “it's the non-production one, of course!”

Trivia: BigBasket (as of Jan 2023) does all major releases across 80+ microservices, but we also maintain a Monolith/Mono-repo for legacy reasons.

Background (or the complexity of BigBasket’s Non-Production Environments today, as of Jan 2023)

  • Multiple non-production environments. 🧪🧪🧪

  • Multiple kinds of HTTP workloads (Monolith vs. Microservices; usually all are HTTP-based).

  • Multiple kinds of batch workloads (Kafka consumers, crons).

  • A collection of services, each owned by its respective team.

  • Each such service can run multiple versions of itself (say, one version for an experimental workload, another for a performance fix, and yet another for a long-term evaluation). At BigBasket we call these "projects". In short, every service can run "multiple projects in parallel".

  • Some services are fairly stable and don’t run multiple projects; they always run just a (git) master or stable version.

  • All services heavily use the (K8s) NodePort service type. We have defined port ranges for each non-production environment.

Problem

Since BigBasket needs to run multiple projects/versions per service, we need the capability to run isolated compute environments for individual projects, allowing parallel execution without impacting the stability of our non-production environments.

This “isolation” can easily be achieved at the compute level by running dedicated (K8s) deployments per project. But we also need a routing layer that can understand a request and send it to the respective project-specific variant of the given service. (Now that's a problem worth solving!)

Trivia: The “order” microservice runs 10 different variants of itself on any given day. These variants range from canaries to long-term/strategic projects to different business lines (like BBNow, for example).

Solution

“Sticking to basics” always works.

We came up with a solution that can:

  • Stitch requests across multiple services using a simple HTTP header. Call it something like X-Project and let every microservice (or anything else involved in the request chain) respect it and propagate it without any issues. This solves the grouping of requests (see the sketch after this list).

  • Mobile apps can be made to pass this sticky X-Project header via a “Debug” screen, since this is a non-production use case alone 😉.

  • Browsers can do the same using an extension like ModHeader, for example.

  • Have a simple proxy setup (an AWS API Gateway / ALB / even standard Nginx; something well-tested and well-respected by all service teams) that routes requests to different services based on the X-Project request header.

  • Provide visibility so individual service teams can see the list of all services related to a common project, or the variants of their services across projects.
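To make the header stitching concrete, here is a minimal sketch of how a service can honour and propagate the header. This is illustrative Python only (a hypothetical Flask endpoint and downstream URL), not BigBasket's actual code:

```python
# Minimal sketch: read X-Project from the incoming request and forward it on
# every downstream call, so the whole request chain sticks to one project.
# The Flask app, endpoint, and downstream URL are hypothetical.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
ORDERS_URL = "http://projects-proxy.internal/order/api/v1/cart"  # hypothetical

@app.route("/checkout", methods=["POST"])
def checkout():
    # Fall back to the stable/master variant when no project header is sent.
    project = request.headers.get("X-Project", "master")
    resp = requests.post(ORDERS_URL, json=request.get_json(silent=True),
                         headers={"X-Project": project}, timeout=5)
    return jsonify(resp.json()), resp.status_code

if __name__ == "__main__":
    app.run(port=8080)
```

In practice this propagation usually lives in shared HTTP-client middleware, so individual endpoints never have to handle the header themselves.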

Why reinvent the wheel when you can embrace an external tool and extend it?

To handle our complex routing requirements:

  • AWS ALB proved to be a no-go, as it limits the number of rules (200 max) and conditions (5 per listener rule) that can be configured per ALB.

  • AWS API Gateway (as of Aug 8, 2021) doesn’t support header-based routing to multiple origins. Adding/modifying custom headers on the fly can be done with AWS API Gateway, but we would need to write code to do it, which is not easily maintainable.

  • Kubernetes Ingress controllers, like Nginx’s? Header-based switching is not natively supported in the (K8s) Ingress spec (this was back in 2021). I believe that in 2022 K8s started moving from the Ingress spec towards the much richer Gateway API, but that whole ecosystem is also very new.

  • Bringing in a heavy tool like Envoy/Istio/Traefik just to solve this intra-cluster header-based fan-out may not be worth it, especially when we have services that are very sensitive to extra tail latency.

Trivia: BigBasket has certain microservices whose average response time is a millisecond (or even less).

This led us to build an in-house solution called Atlas that connects all the above dots (and some more).

Atlas

Atlas has two main components.

  1. Web-UI, which individual service teams access via a browser to, for example, launch new variants of their service or launch entirely new projects that span multiple services.

  2. Projects-Proxy, a real Nginx-based proxy that does X-Project header-based routing to the individual project-specific services.

Who said setting up infrastructure dynamically and through code is not fun?

The Web-UI component of Atlas spins up an AWS Application Load Balancer for each project. We could have done this using the AWS CLI or SDKs, but that would violate our infrastructure setup principles, where Terraform is the default choice for setting up any infra component. Hence we create the ALBs via Terraform code that is generated on the fly from the data fed in via the Web-UI. Let that sink in for a bit!
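To give a feel for how this works, here is a minimal sketch of rendering Terraform from the Web-UI inputs. It assumes Jinja2 and uses hypothetical template contents, project names, and subnet IDs; Atlas's real templates are richer:

```python
# Sketch only: render Terraform (HCL) for one project-specific ALB from the
# data collected by the Web-UI. Template, names, and subnets are hypothetical.
import json
from jinja2 import Template

ALB_TEMPLATE = Template("""
resource "aws_lb" "{{ project }}" {
  name               = "atlas-{{ project }}"
  internal           = true
  load_balancer_type = "application"
  subnets            = {{ subnets }}
}
""")

def render_project_alb(project: str, subnets: list[str]) -> str:
    # json.dumps() happens to produce valid HCL list syntax for the subnets.
    return ALB_TEMPLATE.render(project=project, subnets=json.dumps(subnets))

print(render_project_alb("bbnow-canary", ["subnet-aaaa", "subnet-bbbb"]))
```

The rendered files then go through the usual terraform plan/apply cycle, which keeps the whole flow inside the “Terraform is the default” principle.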

The Projects-Proxy’s Nginx configuration is also prepared dynamically (again via code). It contains all the “header-based fan-out to individual project-specific upstreams/load balancers” rules, and it is refreshed on the fly, without downtime, whenever a new URL/service/project is onboarded.
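The generated configuration itself is not shown in this post; the sketch below only illustrates the general shape (an Nginx map from the X-Project header to a project-specific upstream, plus a default) and a zero-downtime reload. Project names, upstream hosts, and the config path are hypothetical:

```python
# Sketch only: generate the header-based fan-out rules and hot-reload Nginx.
# Project names, upstream hosts, and the output path are hypothetical.
import subprocess

PROJECTS = {  # X-Project header value -> project-specific ALB/upstream host
    "bbnow-canary": "alb-bbnow-canary.internal",
    "perf-fix-123": "alb-perf-fix-123.internal",
}

def render_projects_proxy_conf(default_upstream="alb-master.internal"):
    lines = ["map $http_x_project $project_upstream {",
             f"    default {default_upstream};"]
    lines += [f"    {name} {host};" for name, host in PROJECTS.items()]
    lines += ["}",
              "server {",
              "    listen 80;",
              "    location / {",
              "        proxy_set_header X-Project $http_x_project;",
              # Note: proxy_pass with a variable needs a resolver in real setups.
              "        proxy_pass http://$project_upstream;",
              "    }",
              "}"]
    return "\n".join(lines)

def reload_without_downtime(conf_path="/etc/nginx/conf.d/atlas.conf"):
    with open(conf_path, "w") as f:
        f.write(render_projects_proxy_conf())
    subprocess.run(["nginx", "-t"], check=True)            # validate the config
    subprocess.run(["nginx", "-s", "reload"], check=True)  # graceful reload
```

Because `nginx -s reload` starts new workers on the fresh config and drains the old ones, onboarding a new URL/service/project never drops in-flight requests.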

WebSockets come in handy when we want to relay live information (like terraform plan or apply output) to all open browser windows.
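A minimal sketch of that relay, assuming the third-party `websockets` package (a recent version) and a hypothetical Terraform working directory; the real Atlas implementation is of course more involved:

```python
# Sketch: stream `terraform plan` output lines to all connected browsers.
# The working directory, host, and port are hypothetical.
import asyncio
import websockets

CLIENTS = set()  # currently connected browser windows

async def register(websocket):
    CLIENTS.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        CLIENTS.discard(websocket)

async def relay_terraform_plan(workdir="/srv/atlas/projects/bbnow-canary"):
    proc = await asyncio.create_subprocess_exec(
        "terraform", "plan", "-no-color", cwd=workdir,
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT)
    async for line in proc.stdout:
        # Fan each output line out to every open browser window.
        websockets.broadcast(CLIENTS, line.decode())
    await proc.wait()

async def main():
    async with websockets.serve(register, "0.0.0.0", 8765):
        await relay_terraform_plan()

if __name__ == "__main__":
    asyncio.run(main())
```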

Routing today (as of Jan 2023)

  • The relevant public/private AWS ALBs send requests to the Projects-Proxy layer.

  • Projects-Proxy, based on the X-Project header, sends requests to the required project-specific load balancer.

  • Project-specific load balancers have a small number of rules (related to a small number of services) that send the traffic to the respective service ports (at the K8s level).

  • Individual microservices (internal to a given K8s cluster or external) can contact Projects-Proxy either directly or via an ALB to send traffic to the required/default upstream service variant.

  • If a service doesn’t run the given project variant (or the caller supplies an invalid X-Project header), those requests are safely routed to the stable or master variant of the service by default (see the example below).
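For example, a tester (or the mobile app's Debug screen, or ModHeader in a browser) only needs to set the header. The non-production URL and project names below are hypothetical:

```python
# Illustrates the pinning and fallback behaviour described in the list above.
# The non-production URL and project names are hypothetical.
import requests

URL = "https://nonprod.example.internal/order/api/v1/health"

# Pinned to a specific project variant of the "order" service.
pinned = requests.get(URL, headers={"X-Project": "bbnow-canary"}, timeout=5)

# Unknown (or missing) X-Project: Projects-Proxy routes to stable/master.
fallback = requests.get(URL, headers={"X-Project": "no-such-project"}, timeout=5)

print(pinned.status_code, fallback.status_code)
```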

Outcomes

With the invention of Atlas, we:

  • Gained the capability to create infrastructure on the fly, entirely through “Infrastructure as Code”, in minutes. This used to take days. ⏱️

  • This paved the way for teams to launch their experiments faster, especially from a cloud infrastructure point of view.

  • Generating Terraform and Nginx configuration in an automated way means no chance of accidental configuration screw-ups, and infrastructure teams don’t need to learn any special skills. Templates can be prepared once by the relevant subject-matter experts and then reused/rendered any number of times, by anybody, to create any new project.

  • Since Terraform holds the key here to generating the AWS ALBs, we started shipping the Terraform and Atlas binaries together with every Atlas release. This made (Terraform) state upgrades very easy.

  • The Atlas Web-UI is just a glorified infra-code generator, so it can be used to create any cloud infrastructure in general (AWS EKS clusters ✅, Helm charts ✅, etc.; endless possibilities).

  • Today (as of Jan 2023), the majority of BigBasket’s microservices are onboarded onto the Atlas platform, making their testing in non-production environments easy, robust, and isolated.

  • Service teams didn’t see any large increase in tail latencies (even with the multi-hop network flow). 👍

  • HeVa started supporting “project” as a first-class attribute. This also led individual service teams to structure their Helm manifests and templates in a better way.

  • Atlas’s unified web UI, which shows the project(s)-vs-service(s) mapping, helped service teams visualize the routing better. Earlier, this information was scattered across many AWS ALBs, posing a challenge for all non-infra folks.

Gaps to address as we move forward

  • Making the Atlas UI more developer-friendly. Right now certain screens are more (cloud) infra-friendly 😉.

  • Ensuring Atlas can withstand any AWS cloud-level throttling.

  • Since the platform supports running a large number of experiments per service, things can get out of hand over time when service teams fail to retire projects. Better controls need to be put in place to clean up finished projects.

Here are more stories about how we do what we do. Please check out our open job positions if you are interested.