HeVa: Achieving Consistency in Microservice Deployments through Manifest Discipline


💡 This blogpost was originally from BigBasket's Tech Blog and is now featured here!

This was back in June 2020. We at BigBasket were heavily into microservices (we still are!). The holy grail of microservices is that every microservice team fully owns its service's end-to-end lifecycle. While this is all good in theory, when every microservice team starts deploying its services with its own deployment patterns, approaches, and tools, things start to become unwieldy.

Trivia: BigBasket has been running self-hosted and self-managed Kubernetes clusters since Apr 2018. This is where all microservices run today (as of Jan 2023).

This was especially true for deployment manifests. Writing huge declarative manifests by hand, containing Kubernetes resource specifications, was the norm, and kubectl was the tool used to apply these manifests in all environments. This posed a few challenges:

  1. Deployment manifests had to be duplicated when we graduated a service from one environment to the next. (Dev >> QA >> … >> Pre-Prod >> Production)

  2. Tools like kustomize gave some mileage, but teams ran into issues that restricted their (microservice) freedoms.

  3. (Microservice) Teams were recommended to follow good naming conventions, so that developers of any service feel at home when they switch from one service to another. But this was only a “recommendation”, and we found that it was not always followed strictly. (Murphy’s law always works!)

  4. Certain “production” reliability practices, like “ensure Pod Disruption Budgets, autoscalers, and pod-anti-affinity rules are configured for your production workloads”, were not being followed by all teams. This led to service outages at times.

  5. With the introduction of Helm, certain problems like duplicated configuration went away (and it brought many good things, actually), but other problems persisted.

  6. Helm was only able to merge values files when the YAML content was fully defined as a flat map. When it was defined using nested maps (as in free-flowing YAML), Helm was not able to merge them correctly. This led to a lot of pain when teams maintained a “base” values file and separate environment-specific “override” files. (See the merge sketch after this list.)

  7. To deal with all these problems, and since the K8s community was still maturing and didn’t have a tool that solved all of them in one package, we had to invent “HeVa”.
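
To illustrate the nested-map merge problem from point 6, here is a minimal Go sketch of the kind of deep merge that is needed when environment-specific overrides sit on top of a “base” values file. This is not HeVa's actual code; the gopkg.in/yaml.v3 usage and the field names are only illustrative.

// A hedged sketch (not HeVa's code) of deep-merging an environment override
// onto a base values map, so nested keys are combined instead of whole
// subtrees being replaced.
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// deepMerge overlays src onto dst, descending into nested maps.
func deepMerge(dst, src map[string]interface{}) map[string]interface{} {
	for k, v := range src {
		if srcMap, ok := v.(map[string]interface{}); ok {
			if dstMap, ok := dst[k].(map[string]interface{}); ok {
				dst[k] = deepMerge(dstMap, srcMap)
				continue
			}
		}
		dst[k] = v // scalars, lists, and new keys: the override wins
	}
	return dst
}

func main() {
	base := []byte("deployment:\n  replicas: 2\n  resources:\n    cpu: 100m\n    memory: 256Mi\n")
	override := []byte("deployment:\n  resources:\n    memory: 512Mi\n")

	var b, o map[string]interface{}
	_ = yaml.Unmarshal(base, &b)
	_ = yaml.Unmarshal(override, &o)

	merged, _ := yaml.Marshal(deepMerge(b, o))
	fmt.Println(string(merged)) // replicas and cpu survive, only memory is overridden
}

In practice the merged output is then validated in one go, as described under Semantic Validations below.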

HeVa: Helm Values File Validator

HeVa is a simple command-line binary written in Go that does these things:

Structural validation of Helm values files

  • Structural validation of Helm values files. This is where we can say that a given (k8s) deployment is supposed to contain fields like replicas, or that an HPA needs to be there when the environment is production, etc.

  • “Required” vs. optional field validation. For example, we can say that a given deployment’s containers field is mandatory in values.yaml, whereas a configMap could be optional. (A minimal sketch of such checks follows.)
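
A minimal sketch of what such structural and required-field checks could look like over a parsed values map. The field names (deployment, containers, hpa) and rules are illustrative assumptions, not HeVa's actual schema.

// A hedged sketch of structural validation over a parsed values map.
package main

import "fmt"

// checkDeployment ensures mandatory fields exist and, for production,
// that an HPA section is present too.
func checkDeployment(values map[string]interface{}, env string) []string {
	var errs []string

	dep, ok := values["deployment"].(map[string]interface{})
	if !ok {
		return append(errs, "deployment section is required")
	}
	if _, ok := dep["containers"]; !ok {
		errs = append(errs, "deployment.containers is required")
	}
	if env == "production" {
		if _, ok := values["hpa"]; !ok {
			errs = append(errs, "hpa section is required when the environment is production")
		}
	}
	// configMap is optional, so its absence is not flagged.
	return errs
}

func main() {
	values := map[string]interface{}{
		"deployment": map[string]interface{}{"replicas": 3},
	}
	for _, e := range checkDeployment(values, "production") {
		fmt.Println("validation error:", e)
	}
}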

“Units” validation

  • Resource Quotas for CPU. This is where we can say the value has to be in millicores (m). (We have seen cases where people forget the unit and put a value like 100 for CPU, which actually means 100 cores, not the 100 millicores the developer needed!)

  • Resource Quotas for Memory. This is where we can say it has to be in mebibytes (Mi).

  • maxSurge and maxUnavailable numbers in the rollingUpdate strategy have to be in percentages (%), as percentages work better at scale. (A sketch of these unit checks follows this list.)
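
A hedged sketch of how such unit checks could be expressed; the regular expressions and example values are assumptions for illustration, not HeVa's real rules.

// An illustrative sketch of unit validation on resource and rollout settings.
package main

import (
	"fmt"
	"regexp"
)

var (
	cpuMillicores = regexp.MustCompile(`^[0-9]+m$`)  // e.g. "100m"
	memoryMi      = regexp.MustCompile(`^[0-9]+Mi$`) // e.g. "512Mi"
	percentage    = regexp.MustCompile(`^[0-9]+%$`)  // e.g. "25%"
)

func checkUnits(cpu, memory, maxSurge string) []string {
	var errs []string
	if !cpuMillicores.MatchString(cpu) {
		errs = append(errs, fmt.Sprintf("cpu %q must be in millicores, e.g. 100m", cpu))
	}
	if !memoryMi.MatchString(memory) {
		errs = append(errs, fmt.Sprintf("memory %q must be in mebibytes, e.g. 512Mi", memory))
	}
	if !percentage.MatchString(maxSurge) {
		errs = append(errs, fmt.Sprintf("maxSurge %q must be a percentage, e.g. 25%%", maxSurge))
	}
	return errs
}

func main() {
	// "100" without a unit means 100 whole cores, not 100 millicores.
	for _, e := range checkUnits("100", "512Mi", "25%") {
		fmt.Println("validation error:", e)
	}
}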

Semantic Validations

  • Ensure a pod-anti-affinity section is present for the production environment.

  • Ensure infrastructure dependencies (like a database endpoint, credentials, etc.) contain just placeholders and not real settings. Real values can be encrypted and dynamically injected during the deployment.

  • Ensure replicas are not defined if an HPA is defined, and vice-versa. (See the sketch after this list.)

  • Ensure Pod Disruption Budgets are present for all deployments in the prod environment.

  • Ensure infra settings defined in values files are base64 encoded.

  • Allow certain deployments to be skipped (not deployed) when the environment is production.

  • Merge all the required values files correctly, no matter how many levels deep the configuration is written (and validate the final merged content in one go). This ensured there is NO configuration drift between the “base” values file and the environment-specific override files. Override files became very slim.
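
A minimal sketch of two of the semantic rules above (replicas vs. HPA mutual exclusion, and PDB required in production). The key names (replicas, hpa, podDisruptionBudget) are illustrative assumptions, not HeVa's actual schema.

// A hedged sketch of semantic validation over a parsed values map.
package main

import "fmt"

func checkSemantics(values map[string]interface{}, env string) []string {
	var errs []string

	_, hasReplicas := values["replicas"]
	_, hasHPA := values["hpa"]
	if hasReplicas && hasHPA {
		errs = append(errs, "define either replicas or an hpa, not both")
	}
	if env == "production" {
		if _, ok := values["podDisruptionBudget"]; !ok {
			errs = append(errs, "podDisruptionBudget is required for production deployments")
		}
	}
	return errs
}

func main() {
	values := map[string]interface{}{
		"replicas": 3,
		"hpa":      map[string]interface{}{"maxReplicas": 10},
	}
	for _, e := range checkSemantics(values, "production") {
		fmt.Println("validation error:", e)
	}
}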

And some more! All of these checks are available for every developer to quickly try against their deployment manifests, locally. The Jenkins console output in the Outcomes section below gives a feel for HeVa's user experience.

Trivia: We run 1000s of (k8s) deployments via Helm today (as of Jan 2023)

Outcomes

With the invention of HeVa, many microservice teams started integrating it into their deployment tooling. It's very common to see messages like the one below in every Jenkins console output these days:

Running Heva...
Heva version: v0.2.23
Merging all values files into one /tmp/order-final-qa-values.yaml
time="2022-12-16T19:10:09+05:30" level=info msg="Yaml content saved to file: /tmp/order-final-qa-values.yaml"
time="2022-12-16T19:10:09+05:30" level=info msg="Validation passed"

And in case anyone makes a mistake around any of the problems listed above, the error gets caught well ahead of time, in the dev or QA stage.

This also made us (the Platform Engineering team) conscious of every change that goes into HeVa, since it has to remain backward compatible with the values files of all microservices.

Conclusion

More than validation, HeVa brought all microservice deployment manifests (i.e., Helm values files) into one common structure. This is a pretty good win for service onboarding, especially when all previous production learnings are baked into one tool that is involved in every production and non-production deployment.

One has to note that the same manifest validation could be done via admission controller webhooks (in a live fashion), but we wanted something that any developer can try on their laptop, even without a local k8s cluster and without any such big roadblocks.

Trivia: Prior to writing HeVa in Go, we cooked up a solution using HiYaPyCo in Python, but the many advantages of Go made us rewrite it, and Go has been the default language for all Platform tools since then. From tools that migrate Kafka messages from one cluster to another, to tools that spin up AWS load balancers to facilitate 40+ lines of testing across 80+ microservices in our non-production environments, they are written exclusively in Go. Expect blogposts on these topics very soon. Stay tuned.

Here are more stories about how we do what we do. Please check out our open job positions if you are interested.