Automating Kubeflow Deployment with Helm and AWS CDK

Part 3: Installing Kubeflow

George Novack
11 min read · Aug 6, 2022

Part 1: CI/CD setup and getting started with Helm + CDK

Part 2: Configuring Cluster Authentication

Part 3: Installing Kubeflow

In the last two parts of this series, we created a CI/CD pipeline using AWS CDK Pipelines to deploy an EKS cluster. Using Helm and the AWS CDK, we automated the installation of an Istio service mesh on our cluster, then implemented an end-to-end user authentication process using Dex and OIDC Auth Service.

Now, we are ready to install Kubeflow.

Creating the Kubeflow Helm Chart

Kubeflow is made up of many different subcomponents: Jupyter Notebooks, Pipelines, Volumes, Katib, and more. Here, we will create a separate Helm chart for each one of these subcomponents, as well as a parent chart which will reference all of them as subcharts. To define the templates in all of these charts, we will reuse manifests from the kubeflow/manifests repository and make modifications as needed.

We will start by creating the parent chart:

helm create kubeflow

There are a few resources that we will define directly within the parent Helm chart:

  • Kubeflow Cluster Roles: There are a few common Cluster Roles that are referenced by the various Kubeflow subcomponents via Cluster Role Aggregation Rules. These Cluster Roles are used to define common sets of privileges, such as admin, edit, and view. These Cluster Roles are defined in common/kubeflow-roles/base/cluster-roles.yaml in the source repository.
  • Istio Resources: In Part 1 of this series, we deployed a sample nginx service along with an Istio Gateway resource that defined the set of ports and protocols available to external clients. We will define a similar Gateway resource for Kubeflow, as well as a few Cluster Roles that add Istio-related privileges to the Aggregate Cluster Roles mentioned above. These resources are defined in common/istio-1-9/kubeflow-istio-resources/base.

As I pull in manifests from the kubeflow/manifests repository throughout this article, I will be extracting various values to make them into more reusable Helm templates. For example, my templatized version of the Istio Gateway manifest looks like this:
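(A sketch based on the upstream kubeflow-gateway manifest; the gateway.name and gateway.port value names are placeholders of my own, not necessarily those used in the completed charts.)

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: {{ .Values.gateway.name }}
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - "*"
      port:
        number: {{ .Values.gateway.port }}
        name: http
        protocol: HTTP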

With the following default values in values.yaml:
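(Again a sketch; the keys mirror the placeholder names used in the template above.)

gateway:
  name: kubeflow-gateway
  port: 80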

Due to the sheer number of manifests that are required to get Kubeflow up and running, I won’t be able to show each and every template in this article, so be sure to check out the completed Helm charts here: kubeflow-helm-cdk/charts

The Central Dashboard

The Kubeflow Central Dashboard is the central UI hub that is used to access all of the other Kubeflow subcomponents.

We first create a new chart for the Central Dashboard:

helm create central-dashboard

And then create the templates by using the manifests from apps/centraldashboard.

One of the templates in the Central Dashboard chart defines the Istio Virtual Service resource that is used to route traffic to the dashboard:
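(A sketch based on the upstream centraldashboard VirtualService; the istio.gateway value name is a placeholder.)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: centraldashboard
  namespace: {{ .Release.Namespace }}
spec:
  gateways:
    - {{ .Values.istio.gateway }}
  hosts:
    - "*"
  http:
    - match:
        - uri:
            prefix: /
      rewrite:
        uri: /
      route:
        - destination:
            host: centraldashboard.{{ .Release.Namespace }}.svc.cluster.local
            port:
              number: 80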

Notice that in the gateways field, we must specify the Gateway that will apply the routes defined in this Virtual Service. This will be the Gateway defined in the parent Helm chart. We can provide this value from the values.yaml file in the parent Helm chart like so:
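(A sketch; because central-dashboard is a subchart, the parent chart sets its values under the central-dashboard key. The exact key names are placeholders.)

central-dashboard:
  istio:
    gateway: kubeflow-gateway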

We also need to make sure that the central-dashboard chart is specified as a dependency of the parent chart by adding the following to the parent chart's Chart.yaml (since we have defined the central-dashboard chart locally, we can use the relative filesystem path to the chart as the value of repository):
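(A sketch of the dependency entry; the chart version and relative path are assumptions based on the default helm create scaffolding and a side-by-side charts directory.)

dependencies:
  - name: central-dashboard
    version: 0.1.0
    repository: file://../central-dashboard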

If we now deploy the parent kubeflow chart, we will be able to access the Kubeflow Central Dashboard by navigating to the AWS Load Balancer DNS name in a browser and completing the login process by using the email address and password created in Part 2 of this series.

At this point, however, we can’t actually do anything within the dashboard.

Profile Service

Kubeflow uses Profiles and User Namespaces to create a multi-tenant environment, where each user has access to only the resources within their own namespace, or within a namespace that has been shared with them. This setup allows a single Kubeflow installation to be shared by multiple users or teams.

To create and manage these Profiles, we deploy the Profiles Service using the manifests from apps/profiles/upstream. We create a new Helm chart called profiles and add it as a dependency in the parent kubeflow chart just as we did with the central-dashboard chart. One small difference with the profiles chart is that it depends on a Custom Resource Definition, specifically the Profile resource defined in kubeflow.org_profiles.yaml.

Luckily, Helm supports installing CRDs, so all we have to do is add the CRD manifest to a directory called crds within the profiles chart.
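(A rough layout of the profiles chart with the CRD manifest placed in the crds directory; the template file names shown are illustrative.)

profiles/
  Chart.yaml
  values.yaml
  crds/
    kubeflow.org_profiles.yaml
  templates/
    deployment.yaml
    service.yaml
    ...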

Creating a New Profile

There are two ways to create a new Profile and User Namespace within Kubeflow:

  • The first is to manually create a manifest file that defines a Profile resource and create the Profile using kubectl create, as described in the Kubeflow docs for Manual profile creation (a sample manifest is sketched after this list).
  • The other approach is to allow the user to create their own Profile through the Kubeflow UI when they first log in to the Kubeflow Dashboard. We will take this approach here.
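For reference, a manually-created Profile manifest might look something like this (a sketch; the profile name and owner email are placeholders):

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: example-user
spec:
  owner:
    kind: User
    name: user@example.com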

We can prompt the user to create a new Profile through the UI by setting the REGISTRATION_FLOW environment variable in the Central Dashboard service to "true".
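(A minimal sketch of how this might be templatized in the central-dashboard chart's deployment.yaml; the registrationFlow value name is a placeholder.)

env:
  - name: REGISTRATION_FLOW
    value: {{ .Values.registrationFlow | quote }}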

After updating this environment setting and redeploying, the user is now prompted to provide a name for their new User Namespace.

From now on, when this user logs in to the dashboard, their User Namespace will be displayed in the upper left corner of the UI.

We can also verify that a Profile object was created by running:

kubectl get profiles
NAME           AGE
example-user   2m6s

Kubeflow Notebooks

One of the core components of Kubeflow is Kubeflow Notebooks, which allows us to create development environments within our Kubeflow cluster. We can configure these development environments with different amounts of CPU, memory, disk space, and hardware accelerators like GPUs; and then use them to perform data analysis and experimentation directly with our cluster.

We will deploy the services needed to get Kubeflow Notebooks up and running now.

Jupyter Web App

The first step is to install the Jupyter Web App service using the manifests defined in apps/jupyter/jupyter-web-app/upstream. This service defines the UI that can be used to create new notebooks and view existing notebooks.

There are a number of configurations that can be used to customize the notebook creation experience. For now, however, we will stick with the default values for these, which live in jupyter-web-app/templates/config.yaml.

Since we are accessing the Kubeflow UI via http and not https, we will need to set the environment variable APP_SECURE_COOKIES to false in the Jupyter Web App service. We add the environment variable to the deployment.yaml template in the jupyter-web-app Helm chart:
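(A minimal sketch of the relevant container env entry in the jupyter-web-app deployment template; the secureCookies value name is a placeholder.)

env:
  - name: APP_SECURE_COOKIES
    value: {{ .Values.secureCookies | quote }}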

This way, users of the chart can provide the desired value for the APP_SECURE_COOKIES setting via Helm values.

After installing this chart, we can view the Kubeflow Notebooks UI by navigating to {{ALB_DNS_NAME}}/jupyter/ or by opening the Kubeflow Dashboard and selecting Notebooks in the side menu.

The first thing we notice in the UI is an error indicating that the resource user/notebooks cannot be found. We can try to ignore this issue and proceed with creating a new Notebook Server by selecting New Server; however, after setting the configurations for the new Notebook Server, we will be met with the same error when attempting to launch the server.

So what is the reason for this error?

The Kubeflow Notebooks UI is attempting to list (when we first navigated to the Notebooks tab in the dashboard) and create (when pressing Launch at the bottom of the New Server form) Kubernetes resources of type Notebook, but this resource type does not exist by default on our cluster.

We can confirm this by running

kubectl get notebook
error: the server doesn't have a resource type "notebook"

The Notebook resource type is defined using a Custom Resource Definition that is managed by the Notebook Controller service. We will deploy this service next.

Notebook Controller

There are several different Kubernetes resources needed to host a single Jupyter Notebook Server in our cluster, such as a Stateful Set to manage the server pod, a Service to provide an internal DNS name for the pod, and an Istio Virtual Service to route traffic to the Service from outside the cluster.

Kubeflow encapsulates all of these resources into the Notebook Custom Resource Definition (CRD). This CRD, as well as the Controller which manages instances of this resource, are defined in apps/jupyter/notebook-controller/upstream in the source repository.

We will create a new Helm chart called notebook-controller to deploy this CRD and controller service.

There are two important environment variables that we provide to the Notebook Controller service: USE_ISTIO is a boolean flag indicating whether our Jupyter Notebooks will be running in an Istio service mesh; and ISTIO_GATEWAY is the name of the Istio Gateway resource through which traffic will be routed to our Notebooks.

These environment variables are defined in the following Config Map template:
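(A sketch of what this Config Map template might look like; the istio value names are placeholders.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: notebook-controller-config
  namespace: {{ .Release.Namespace }}
data:
  USE_ISTIO: {{ .Values.istio.enabled | quote }}
  ISTIO_GATEWAY: {{ .Values.istio.gatewayNamespace }}/{{ .Values.istio.gatewayName }}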

Notice that the value of the ISTIO_GATEWAY setting is in the form {{NAMESPACE}}/{{ISTIO_GATEWAY_NAME}}. When a new notebook is created, the ISTIO_GATEWAY setting will be used to create a Virtual Service in the Kubeflow user’s namespace (e.g. example-user from earlier), which is separate from the namespace where our Istio Gateway is deployed; therefore, we must provide the namespace of the Gateway along with its name.

With these settings in place, we can deploy the Notebook Controller service and validate that the Notebook resource type is now recognized in our cluster:

kubectl get notebook -n example-user
No resources found in example-user namespace.

Next, we can navigate back to the Notebooks page in the Kubeflow dashboard and create a new Notebook Server by selecting New Server again.

This time, we see a slightly different error message indicating that the poddefaults resource could not be found.

We will address this error shortly; but for now, we can provide a name for the new Notebook Server and select Launch at the bottom of the page. After a short startup time, we can see the new Notebook Server up and running on the Notebooks page.

Selecting Connect will launch the JupyterLab UI in a new browser tab, where we are free to create and execute notebooks on our new Notebook Server.

We now have the basic functionality of Kubeflow Notebooks working, but there are a few more services we need to deploy to enable the full range of Kubeflow Notebooks features.

Admission Webhook

In the last section, we deployed the Notebook Controller, and were then able to successfully create a new Notebook Server. However, there was an error message that we encountered and ignored during the creation process:

[404] The requested resource could not be found in the API Server http://{{ALB_DNS_NAME}}/jupyter/api/namespaces/example-user/poddefaults

Similar to the error we saw before we deployed the Notebook Controller, this error indicates that Kubeflow is trying to find all resources of type poddefaults in our user namespace, but is unable to do so.

Again, we can confirm that the resource type poddefaults is not recognized by using kubectl:

kubectl get poddefaults -n example-user
error: the server doesn't have a resource type "poddefaults"

Just like the notebook resource type, the poddefaults resource type is defined by a CRD in apps/admission-webhook/upstream/base.

The purpose of the poddefaults resource is to allow users to specify a set of default configurations to apply to pods that are created by Kubeflow components (e.g. the pods which run our Jupyter Notebook Servers). For example, we may want to mount a secret containing some important credentials into all of our Notebook Server pods.

Let’s test this out. We will first deploy the poddefaults CRD and the associated Mutating Webhook Configuration, which intercepts requests to create new pods and injects the desired Pod Defaults.

We create a new Helm chart called admission-webhook and list it as a dependency in the parent kubeflow chart just like with the others.

After deploying this chart, we can check that the poddefault resource is recognized:

kubectl get poddefault -n example-user
No resources found in example-user namespace.

Next, we will define a new PodDefault object which injects a simple environment variable into our Notebook Server pods:
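(A sketch of such a PodDefault, mirroring the sample-env variable shown in the pod description below; the selector label is a placeholder.)

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: sample-env
  namespace: example-user
spec:
  selector:
    matchLabels:
      sample-env: "true"
  desc: Inject the sample-env environment variable
  env:
    - name: sample-env
      value: sample-env-value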

And then create this object on the cluster using kubectl apply -f.

Once this object is created, we can create a new Notebook Server from the Kubeflow UI and expand the Configurations drop-down to select our new Pod Default configuration.

After creating this Notebook Server, we can validate that the sample-env environment variable was injected by running kubectl describe pod on the Notebook Server pod and examining the container environment settings:

...
sample-env-notebook:
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sun, 31 Jul 2022 10:01:33 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     600m
      memory:  1288490188800m
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment:
      NB_PREFIX:   /notebook/example-user/sample-env-notebook
      sample-env:  sample-env-value
...

Volumes

The last step to complete our Kubeflow Notebooks setup is to deploy the Volumes Web App, which allows users to view, create, and delete data volumes from the Kubeflow UI.

We will deploy this app by creating a new volumes-web-app Helm chart from the Kubeflow manifests defined in apps/volumes-web-app/upstream.

Just like with the Jupyter Web App, we need to set the APP_SECURE_COOKIES environment variable to false so that we can access the Kubeflow dashboard over http.

Once the Volumes Web App is deployed, we will create a new Notebook Server and specify the name and size of the Workspace Volume.

Then we can navigate to the Volumes tab in the Kubeflow menu to view our new volume.

One important thing to note is that this volume is not managed by the Notebook Server that uses it. We can test this by deleting the Notebook Server and validating that the volume is not deleted. This means that these volumes can be used as long-term persistent storage by any number of Notebook Servers.

Wrapping up Part 3

Throughout these 3 articles on deploying Kubeflow with Helm and the AWS CDK, we have seen how to go from an empty AWS CDK application to a fully-functional CI/CD pipeline that deploys an EKS Cluster and installs the Kubeflow Dashboard, Kubeflow Notebooks, and all of the core services that Kubeflow depends on.

There are still several components of Kubeflow that are not included in our current Kubeflow Helm chart, most notably Kubeflow Pipelines. In the next article, we will add Kubeflow Pipelines to our Helm Chart.

Thanks for reading. All of the source code for the completed CDK application can be found here: https://github.com/gnovack/kubeflow-helm-cdk
