Automating Kubeflow Deployment with Helm and AWS CDK

Part 1: CI/CD setup and getting started with Helm + CDK

George Novack
Aug 6, 2022

Part 1: CI/CD setup and getting started with Helm + CDK

Part 2: Configuring Cluster Authentication

Part 3: Installing Kubeflow

Kubeflow is an open-source platform for machine learning application development and deployment on Kubernetes. At its core, Kubeflow is a collection of integrated open-source components that simplify the full machine learning model lifecycle from experimentation to deployment for inference.

Throughout the next few articles, we will be creating an automated CI/CD pipeline to deploy Kubeflow on AWS EKS. We will use the AWS CDK to define all of the AWS infrastructure as code, Helm to package and configure the Kubernetes resources that we will create on our cluster, and AWS CodePipeline to create an automated pipeline for our EKS Cluster and Kubeflow deployment.

Before diving into the details, I want to first highlight an important difference between the approach I will take to deploy Kubeflow on EKS and the approach outlined in the official Kubeflow on AWS Deployment Guide. The official guide uses Kustomize to configure the various components of Kubeflow, whereas I will be using Helm. Both of these approaches can get you the same end result; however, I am choosing to go with Helm for a few reasons:

  • AWS CDK Support: The AWS CDK has built-in support for adding Helm Charts to an EKS Cluster using the eksCluster.addHelmChart() function, which allows us to include the Kubeflow installation directly within our CloudFormation Stack. On the other hand, using Kustomize, we would have to execute separate shell scripts to install Kubeflow after the creation of our Stack.
  • AWS Secrets: The official Kubeflow manifests include some hard-coded secret values, such as a hashed password and OIDC client credentials. I would like to replace these hard-coded values with references to secrets stored in AWS Secrets Manager. This way, we can keep sensitive information out of source control, and allow other applications to reference these secrets from a centralized location.
  • Code Reuse: Kubeflow depends on many other open-source applications, such as Istio, Dex, and cert-manager. The Kubeflow Manifests repository includes copies of the manifests required to install these various dependencies; however, many of these projects already maintain their own official manifests and Helm Charts. Where possible, I have opted to reuse and extend, rather than duplicate, any existing Helm Charts that have already been created for these dependencies.

Project Overview

We will start from an empty AWS CDK project, and add all of the cloud infrastructure necessary to get Kubeflow up and running on EKS.

An empty CDK project can be initialized by running:

cdk init app --language typescript

For those not familiar with the basics of writing infrastructure-as-code using the AWS CDK, it will be helpful to read through the documentation on creating your first CDK app before proceeding.

CDK applications define one or more CloudFormation Stacks to be deployed on AWS. Our application will define two Stacks:

The EKS Stack will contain the EKS Cluster and its associated resources such as IAM Roles, VPC, and subnets.

The Deployment Stack will contain a CI/CD pipeline that builds and deploys the EKS Stack.

Creating the EKS Stack

The class below defines the EKS Stack, which contains a master IAM role for our EKS Cluster, a VPC in which EKS nodes will be run, and the EKS Cluster itself:
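A minimal sketch of such a Stack is shown below; the Kubernetes version, node count, and AWS Load Balancer Controller version here are illustrative assumptions, not the only valid choices:

import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class EksStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // IAM role that will be mapped to the system:masters RBAC group
    const masterRole = new iam.Role(this, 'EksClusterMasterRole', {
      roleName: 'EksClusterMasterRole',
      assumedBy: new iam.AccountRootPrincipal(),
    });

    // VPC in which the EKS nodes will run
    const vpc = new ec2.Vpc(this, 'EksVpc', { maxAzs: 2 });

    // The EKS cluster itself, with the AWS Load Balancer Controller enabled
    const cluster = new eks.Cluster(this, 'KubeflowCluster', {
      clusterName: 'KubeflowCluster',
      version: eks.KubernetesVersion.V1_21,
      vpc,
      defaultCapacity: 2,
      mastersRole: masterRole,
      albController: { version: eks.AlbControllerVersion.V2_4_1 },
    });

    // Tag the subnets so that EKS and the Load Balancer Controller can discover them
    [...vpc.publicSubnets, ...vpc.privateSubnets].forEach((subnet) =>
      cdk.Tags.of(subnet).add(`kubernetes.io/cluster/${cluster.clusterName}`, 'owned')
    );
  }
}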

  • The IAM Role EksClusterMasterRole is the master role on our EKS cluster. By passing it in the mastersRole option when creating the EKS Cluster, we create a mapping between this AWS IAM Role and the system:masters Kubernetes RBAC group. This allows our IAM role to perform operations on Kubernetes resources like Pods and Services.
  • We create the VPC in which all of the Kubernetes nodes will be run.
  • Then, we create the EKS cluster and define some basic configuration such as the Kubernetes version and the initial number of nodes. We also provide a value for the albController option. This lets the CDK know to deploy the AWS Load Balancer Controller, which allows the cluster to create AWS Load Balancer resources that will make our services accessible from outside the cluster.
  • Finally, the tag kubernetes.io/cluster/{CLUSTER_NAME}: owned is added to each public and private VPC subnet. These tags allow EKS to properly discover our subnets. You can see more on this here: https://aws.amazon.com/premiumsupport/knowledge-center/eks-vpc-subnet-discovery/

Creating the Deployment Stack

With the EKS Stack defined above, we could deploy our EKS Cluster and associated resources simply by running cdk deploy from the command line; however, it will be better to create a pipeline to automate this deployment whenever any changes are pushed.

To create this pipeline, we use the CDK Pipelines library, which allows us to define a deployment pipeline for our CDK application within our CDK application.

To see how this works, we first create a deployment stage; this is essentially a step in the pipeline that deploys our EKS Stack.
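A sketch of this stage, assuming both classes live in the lib/ folder:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { EksStack } from './eks-stack';

// A pipeline stage that deploys the EKS Stack defined earlier
export class EksDeploymentStage extends cdk.Stage {
  constructor(scope: Construct, id: string, props?: cdk.StageProps) {
    super(scope, id, props);
    new EksStack(this, 'EksStack', { env: props?.env });
  }
}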

Next, we define a new CloudFormation Stack, the Deployment Stack, in which we create the CI/CD pipeline and add the EksDeploymentStage as a step:
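A sketch of this Stack; the repository name, branch, and Connection ARN are placeholders, and the build commands assume a standard TypeScript CDK project:

import * as cdk from 'aws-cdk-lib';
import * as pipelines from 'aws-cdk-lib/pipelines';
import { Construct } from 'constructs';
import { EksDeploymentStage } from './eks-deployment-stage';

export class DeploymentStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Pipeline that pulls the CDK app from GitHub and synthesizes it into a CloudFormation template
    const pipeline = new pipelines.CodePipeline(this, 'Pipeline', {
      synth: new pipelines.ShellStep('Synth', {
        input: pipelines.CodePipelineSource.connection('<owner>/<repo>', 'main', {
          connectionArn: '<codestar-connection-arn>',
        }),
        commands: ['npm ci', 'npm run build', 'npx cdk synth'],
      }),
    });

    // Deploy the EKS Stack as a stage of the pipeline
    pipeline.addStage(
      new EksDeploymentStage(this, 'EksDeploymentStage', {
        env: { account: this.account, region: this.region },
      })
    );
  }
}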

  • We create the CI/CD pipeline by instantiating a new CodePipeline. The synth option defines the location from which the pipeline should pull our source code — in this case, a GitHub repository — and the commands it should run to build this source code into a CloudFormation template.
  • We create a new instance of the EksDeploymentStage defined earlier, passing in the AWS Account ID and region. We then add this stage to our pipeline using pipeline.addStage().

In order to authenticate to the source GitHub repository, we use AWS CodeStar Connections to create a connection to GitHub. The GitHub Connections guide in the AWS documentation walks through the steps to create a new connection to GitHub. After creating this connection, we copy the resulting Connection ARN and pass it to the connectionArn property shown in the pipeline definition above.

Bootstrapping the Deployment Pipeline

Because we are using the CDK Pipelines library, the deployment pipeline defined in the Deployment Stack above can perform self-mutation. This means that when we make changes to the deployment pipeline in our CDK code (e.g. by modifying an existing deployment stage, or by adding a new stage), the pipeline will modify itself to incorporate these changes automatically.

In order to get the self-mutating pipeline up and running, however, we must perform a one-time manual deployment to create the pipeline. After performing the manual deployment, any subsequent changes to the pipeline are deployed by the pipeline itself.

We use cdk deploy from the command line to perform this manual deployment; but in order for this command to work, we must first define the entry-point for our CDK application.

The entry-point is a TypeScript file that creates an instance of the App construct, initializes our DeploymentStack, and then calls app.synth() to synthesize a CloudFormation template that defines all of the AWS Infrastructure required by the Stack:
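A sketch of this entry-point, assuming the Deployment Stack is defined in lib/deployment-stack.ts (the Stack name here is illustrative):

#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import { DeploymentStack } from '../lib/deployment-stack';

const app = new cdk.App();

// Instantiate the Deployment Stack in the target account and region
new DeploymentStack(app, 'KubeflowDeploymentStack', {
  env: {
    account: process.env.CDK_DEFAULT_ACCOUNT,
    region: process.env.CDK_DEFAULT_REGION,
  },
});

app.synth();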

In order to let the CDK Toolkit know to treat this file as the entry-point of our application, we modify the cdk.json configuration file and specify the app command as described here: Specifying the app command

For me, the app command looks like this:

{
  "app": "npx ts-node --prefer-ts-exts bin/cdk.ts"
}

With the entry-point configured, we bootstrap the deployment pipeline by running:

cdk deploy

After running this command, we are able to view the status of the deployment pipeline in the AWS CodePipeline console.

The deployment pipeline may take several minutes to run when first creating the EKS cluster, but once it is finished, we will have a working EKS cluster. To validate that everything was created properly, we can run the following command to connect kubectl to our new cluster:

aws eks update-kubeconfig --name KubeflowCluster --region ${REGION} --role-arn arn:aws:iam::${ACCOUNT_ID}:role/EksClusterMasterRole

We can then run kubectl get pods -n kube-system to check that the core Kubernetes services are up and running. The output should look something like this:

NAME                       READY   STATUS    RESTARTS   AGE
coredns-6548845887-gz22z   1/1     Running   0          22m
kube-proxy-zlzx6           1/1     Running   0          17m
kube-proxy-f7hrx           1/1     Running   0          17m
aws-node-5tw4l             1/1     Running   0          17m
aws-node-f4kvk             1/1     Running   0          17m

Installing Kubeflow

We now have an automated, self-mutating deployment pipeline that creates a CloudFormation Stack containing an EKS Cluster. Next, we need to automate the installation of Kubeflow on our cluster.

As mentioned at the beginning, we will use Helm to install Kubeflow, and we will use the built-in support for Helm within the CDK to include this installation directly within our CDK stack.

Before installing Kubeflow itself, there are several core services that we need to install on our EKS cluster. Specifically, we will need to install:

  • cert-manager: cert-manager is a system that simplifies the process of issuing and managing certificates in our Kubernetes cluster.
  • Istio: Istio is a service mesh that allows us to more easily apply routing logic and access controls to traffic flowing to and from the services running in our cluster.
  • Dex: Dex is an open-source identity provider service that simplifies authentication by providing a single interface that can be used to connect to various popular identity providers like Google and GitHub.
  • OIDC Auth Service: oidc-authservice is a service that is responsible for redirecting users to an identity provider (in our case, this will be Dex) whenever they are required to authenticate to services running on our cluster.

Luckily, most of these services have support for installation via Helm, so it will not require much effort to get them up and running.

Install cert-manager

The official cert-manager Helm chart can be found in the https://charts.jetstack.io repository. We can add a Helm chart to our CDK application simply by using the cluster.addHelmChart() method; however, to better encapsulate some of the details around installing cert-manager, we will create a CertManagerDeployment class that performs this installation:
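A sketch of these three pieces, assuming cert-manager v1.8.2 as the single known chart version:

import * as eks from 'aws-cdk-lib/aws-eks';
import { Construct } from 'constructs';

// Known cert-manager chart versions (a single version for now)
export class CertManagerVersion {
  public static readonly V1_8_2 = new CertManagerVersion('v1.8.2');
  private constructor(public readonly version: string) {}
}

export interface CertManagerDeploymentProps {
  readonly cluster: eks.ICluster;
  readonly releaseName?: string;
  readonly namespace?: string;
  readonly version?: CertManagerVersion;
}

export class CertManagerDeployment extends Construct {
  constructor(scope: Construct, id: string, props: CertManagerDeploymentProps) {
    super(scope, id);

    // Install the official cert-manager chart on the provided cluster
    props.cluster.addHelmChart('CertManager', {
      repository: 'https://charts.jetstack.io',
      chart: 'cert-manager',
      release: props.releaseName ?? 'cert-manager',
      namespace: props.namespace ?? 'cert-manager',
      version: props.version?.version ?? CertManagerVersion.V1_8_2.version,
      createNamespace: true,
      values: { installCRDs: true },
    });
  }
}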

  • We first define a CertManagerDeploymentProps interface that is used to pass an instance of an EKS Cluster construct to our class, and which allows the user of the class to override certain properties of the installation if needed, such as the Helm release name, the namespace, and the version of cert-manager to install.
  • Next, the CertManagerVersion class provides a convenient way for users to discover existing versions of the cert-manager Helm chart. For now, we will only keep a single version here.
  • And finally, the CertManagerDeployment class takes in an instance of the interface defined above and performs the installation on the provided EKS cluster.

We then install cert-manager on our cluster by creating an instance of the CertManagerDeployment class:
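For example, inside the EksStack constructor (a sketch; the construct id is arbitrary):

new CertManagerDeployment(this, 'CertManagerDeployment', {
  cluster,
  version: CertManagerVersion.V1_8_2,
});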

After redeploying the EKS Stack, we can check that cert-manager is up and running by using:

kubectl get pods -n cert-manager

Install Istio

To install Istio, we will take an approach similar to that described above for installing cert-manager.

We’ll create the following IstioDeployment class to encapsulate the Istio installation via Helm:
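A sketch of this class, assuming Istio 1.12.7 (the same version as the gateway chart used below) and the istio-system namespace:

import * as eks from 'aws-cdk-lib/aws-eks';
import { Construct } from 'constructs';

const ISTIO_REPOSITORY = 'https://istio-release.storage.googleapis.com/charts';

export interface IstioDeploymentProps {
  readonly cluster: eks.ICluster;
  readonly namespace?: string;
  readonly version?: string;
}

export class IstioDeployment extends Construct {
  constructor(scope: Construct, id: string, props: IstioDeploymentProps) {
    super(scope, id);

    const namespace = props.namespace ?? 'istio-system';
    const version = props.version ?? '1.12.7';

    // Namespace into which both charts are installed
    const istioNamespace = props.cluster.addManifest('IstioNamespace', {
      apiVersion: 'v1',
      kind: 'Namespace',
      metadata: { name: namespace },
    });

    // The base chart installs Istio's CRDs and cluster-wide resources
    const base = props.cluster.addHelmChart('IstioBase', {
      repository: ISTIO_REPOSITORY,
      chart: 'base',
      release: 'istio-base',
      namespace,
      version,
    });
    base.node.addDependency(istioNamespace);

    // istiod (the Istio control plane) depends on the base chart
    const istiod = props.cluster.addHelmChart('Istiod', {
      repository: ISTIO_REPOSITORY,
      chart: 'istiod',
      release: 'istiod',
      namespace,
      version,
    });
    istiod.node.addDependency(base);
  }
}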

  • This class is very similar to the CertManagerDeployment class from earlier. The only real difference is that, within this class, we are installing two Helm charts: the base chart and the istiod chart. Since the istiod chart depends on the base chart, and since both charts depend on the Kubernetes namespace existing first, we use node.addDependency() to ensure that all of these resources are deployed in the proper order.

Install the Istio Ingress Gateway

To allow external traffic into our cluster, we need to install an Istio Ingress Gateway. There is an official chart for deploying an Ingress Gateway; however, instead of using this chart as-is, we will extend it to create an additional Kubernetes Ingress resource.

Why do we need this Ingress resource? When we first created our EKS cluster via the CDK, we used the albController option within the CDK to deploy the AWS Load Balancer Controller. The Load Balancer Controller is responsible for provisioning an AWS Application Load Balancer for every Kubernetes Ingress resource with the appropriate annotations.

To extend the existing Ingress Gateway chart, we will first create a new Helm chart:

helm create istio-ingress

Next, we update the Chart.yaml file and add the official Istio Ingress Gateway chart as a dependency:

dependencies:
  - name: gateway
    version: 1.12.7
    repository: https://istio-release.storage.googleapis.com/charts

To define the Ingress resource, we add a new yaml template to the templates/ folder:
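A sketch of such a template, assuming the gateway subchart's Service is named istio-ingressgateway and listens on port 80:

# templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: istio-ingress
  annotations:
    {{- toYaml .Values.ingress.annotations | nindent 4 }}
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: istio-ingressgateway
                port:
                  number: 80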

We then specify the annotations for our Ingress in values.yaml:

ingress:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'

With the new Helm chart created, we add the IstioIngressDeployment class to our CDK application to install it:
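A sketch of this class; the release name and namespace are assumptions consistent with the rest of this setup:

import * as eks from 'aws-cdk-lib/aws-eks';
import * as s3_assets from 'aws-cdk-lib/aws-s3-assets';
import { Construct } from 'constructs';

export interface IstioIngressDeploymentProps {
  readonly cluster: eks.ICluster;
  readonly chartAsset: s3_assets.Asset;
  readonly namespace?: string;
}

export class IstioIngressDeployment extends Construct {
  constructor(scope: Construct, id: string, props: IstioIngressDeploymentProps) {
    super(scope, id);

    // Install the local istio-ingress chart from a CDK Asset
    // rather than from a remote Helm repository
    props.cluster.addHelmChart('IstioIngress', {
      chartAsset: props.chartAsset,
      release: 'istio-ingressgateway',
      namespace: props.namespace ?? 'istio-system',
      createNamespace: false,
    });
  }
}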

  • Since the new istio-ingress Helm chart does not exist on any remote Helm repository, we cannot install it using the repository and chart properties of addHelmChart() like we did with cert-manager and Istio earlier. Instead, we use the chartAsset property, which allows us to provide the Helm chart as a CDK Asset.

We can create and use a Helm chart asset when instantiating an instance of IstioIngressDeployment like so, where path is the local file system path to the istio-ingress Helm chart:
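For example, inside the EksStack constructor, assuming the chart lives in a charts/istio-ingress directory within the project:

import * as path from 'path';
import * as s3_assets from 'aws-cdk-lib/aws-s3-assets';

// Package the local istio-ingress chart directory as a CDK Asset
const istioIngressChartAsset = new s3_assets.Asset(this, 'IstioIngressChartAsset', {
  path: path.join(__dirname, '../charts/istio-ingress'),
});

new IstioIngressDeployment(this, 'IstioIngressDeployment', {
  cluster,
  chartAsset: istioIngressChartAsset,
});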

Once this Helm chart is deployed, the AWS Load Balancer Controller will provision a new AWS Application Load Balancer to expose the istio-ingressgateway service.

The DNS name for our ingress gateway can be obtained either through the AWS console, or by running:

kubectl get ingress -n istio-system

To validate that Istio and the ingress gateway are working as expected, we can deploy a minimal service and check if it is accessible via the new Load Balancer.

We’ll define the necessary Kubernetes resources in a manifest file:
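A sketch of such a manifest; the resource names are illustrative, and the Gateway selector assumes the default istio: ingressgateway label applied by the gateway chart:

# nginx pod serving the default welcome page
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80
---
# Service exposing the nginx pod inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
---
# Istio Gateway accepting HTTP traffic on port 80 of the ingress gateway
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: nginx-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
---
# VirtualService routing all traffic from the gateway to the nginx Service
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: nginx
spec:
  hosts:
    - "*"
  gateways:
    - nginx-gateway
  http:
    - route:
        - destination:
            host: nginx
            port:
              number: 80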

This manifest defines a few different resources:

  • A Pod running nginx
  • A Service that exposes the pod
  • An Istio Gateway resource that describes a set of ports and protocols that can be used by external traffic entering through the ingress gateway
  • An Istio VirtualService resource that defines a set of routing rules to apply to traffic entering via the above gateway. Specifically, this VirtualService simply routes all traffic to the Service defined earlier.

After applying this manifest with kubectl apply, we can navigate to the Application Load Balancer DNS name and validate that we are greeted with the nginx welcome page.

Wrapping up (for now)

At this point, we have a fully automated deployment pipeline that deploys an EKS cluster via the AWS CDK. And, using Helm, we have automated the installation of a few of the core services that are required to get Kubeflow running.

In the next part of this series, we will finish setting up Kubeflow’s dependencies by installing Dex and OIDC AuthService.

All of the source code for the completed CDK application can be found here: https://github.com/gnovack/kubeflow-helm-cdk
