Object Store¶
Learn why deployKF needs an object store, and how to use any S3-compatible object store with Kubeflow Pipelines.
What is an Object Store?¶
An object store is a type of storage system that manages data as objects, as opposed to traditional file systems which manage data as files. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.
What is an S3-compatible Object Store?¶
The most well-known object store is Amazon S3. Given its popularity, many other object stores have implemented S3-compatible APIs, which allows them to be used with tools that are designed to work with S3.
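For example, most S3 clients and SDKs let you override the endpoint they talk to. The following is a minimal sketch with the AWS CLI (the endpoint and bucket name are hypothetical), which lists a bucket on a non-AWS S3-compatible store:

```bash
## list a bucket on an S3-compatible object store by overriding the endpoint
## (credentials are read from the environment or ~/.aws/credentials as usual)
aws s3 ls "s3://my-bucket" --endpoint-url "https://minio.example.com"
```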
Why does deployKF use an Object Store?¶
An S3-compatible object store is a dependency of Kubeflow Pipelines, which uses it to store pipeline definitions and artifacts from pipeline runs.
Connect an External Object Store¶
By default, deployKF includes an embedded MinIO instance. However, to improve the performance and reliability of Kubeflow Pipelines, we recommend using an external S3-compatible object store.
Embedded MinIO
You should ALWAYS use an external S3-compatible object store. The embedded MinIO is intended only for testing purposes, as it supports just a single replica and has no backups.
Please ensure you are familiar with MinIO's license; at the time of writing, it was AGPLv3. deployKF is licensed under Apache 2.0 and does NOT contain any code from MinIO; instead, we provide links so that you may download MinIO directly from official sources, at your own discretion.
You may use any S3-compatible object store, as long as it is accessible from the Kubernetes cluster where deployKF is running.
You might consider using one of the following services:
Platform | Object Store | S3-compatible Endpoint |
---|---|---|
Amazon Web Services | Amazon S3 | s3.{region}.amazonaws.com |
Google Cloud | Google Cloud Storage | storage.googleapis.com (requires HMAC Keys for authentication) |
Microsoft Azure | Azure Blob Storage | No first-party API. Third-party translation layers like S3Proxy can be used. |
Alibaba Cloud | Alibaba Cloud Object Storage Service (OSS) | s3.oss-{region}.aliyuncs.com |
IBM Cloud | IBM Cloud Object Storage | s3.{region}.cloud-object-storage.appdomain.cloud |
Other | Cloudflare R2 | {account_id}.r2.cloudflarestorage.com |
Other | Wasabi | See provider documentation. |
Self-Hosted | MinIO, Ceph | See provider documentation. |
S3-compatible APIs Only
Currently, Kubeflow Pipelines only supports object stores which have an S3-compatible XML API. This means that while you can use services like Google Cloud Storage, you will need to use their XML API, and features like GKE Workload Identity will NOT work.
If you would like Kubeflow Pipelines to implement support for the native APIs of your object store, please raise this with the upstream Kubeflow Pipelines community.
1. Create a Bucket¶
You must create a single bucket for Kubeflow Pipelines. Refer to the documentation for your object store to learn how to create a bucket.
For example, if you are using AWS S3, you may create the bucket with the AWS Console or the AWS CLI.
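A minimal sketch with the AWS CLI (assuming it is installed and configured; the bucket name and region below are placeholders):

```bash
## create the bucket (replace the name and region with your own)
aws s3 mb "s3://kubeflow-pipelines" --region "us-west-2"

## (optional, but recommended) block all public access to the bucket
aws s3api put-public-access-block \
  --bucket "kubeflow-pipelines" \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
```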
2. Create IAM Policies¶
You must create IAM Policies to allow Kubeflow Pipelines to access your bucket. Refer to the documentation for your object store to learn how to create IAM Policies.
For example, if you are using AWS S3, you may create the policies with the AWS Console or the AWS CLI.
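For instance, a sketch with the AWS CLI, assuming you have saved one of the example policies below to a local JSON file (the policy name and file path are placeholders):

```bash
## create an IAM Policy from a local JSON file
aws iam create-policy \
  --policy-name "kubeflow-pipelines-backend" \
  --policy-document "file://kfp-backend-policy.json"
```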
Bucket IAM Policies¶
It is recommended to create separate IAM Roles for each component and user. The following are example IAM Policies for the Kubeflow Pipelines BACKEND and PROFILE namespaces.
IAM Policy - Backend
The following IAM Policy can be used by the Kubeflow Pipelines BACKEND. Replace <BUCKET_NAME> with the name of your bucket.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/artifacts/*",
"arn:aws:s3:::<BUCKET_NAME>/pipelines/*",
"arn:aws:s3:::<BUCKET_NAME>/v2/artifacts/*"
]
}
]
}
IAM Policy - Profile
The following IAM Policy can be used by each PROFILE namespace. Replace <BUCKET_NAME> with the name of your bucket, and <PROFILE_NAME> with the name of the profile.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/artifacts/<PROFILE_NAME>/*",
"arn:aws:s3:::<BUCKET_NAME>/v2/artifacts/<PROFILE_NAME>/*"
]
}
]
}
To learn more about how objects are stored in the bucket, see the following section:
Object Store Structure
All Kubeflow Pipelines artifacts are stored in the same bucket, but are separated by object key prefixes.
The following table shows the prefixes used by Kubeflow Pipelines:
Key Prefix | Purpose |
---|---|
/pipelines/ | pipeline definitions |
/artifacts/{profile_name}/ | pipeline run artifacts (KFP v1) |
/v2/artifacts/{profile_name}/ | pipeline run artifacts (KFP v2) |
Key Format
Notice that the key prefixes include {profile_name}; this allows prefix-based IAM Policies to ensure each profile only has access to its own artifacts.
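For illustration, this sketch lists the KFP v2 artifacts of a single hypothetical profile named team-1 with the AWS CLI (the bucket name is a placeholder):

```bash
## list KFP v2 artifacts for the "team-1" profile
aws s3 ls "s3://kubeflow-pipelines/v2/artifacts/team-1/" --recursive
```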
3. Disable Embedded MinIO¶
The deploykf_opt.deploykf_minio.enabled value controls whether the embedded MinIO instance is deployed.
The following values will disable the embedded MinIO instance:
deploykf_opt:
deploykf_minio:
enabled: false
4. Connect Kubeflow Pipelines¶
How you connect Kubeflow Pipelines to your external object store depends on the authentication method you choose.
The following sections show how to configure each method:
Option 1 - Key-Based Authentication (Access Keys)
All S3-compatible object stores support key-based authentication.
In this method, deployKF will use HMAC Keys (that is, an access_key and secret_key) to authenticate with your object store.
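For example, on AWS you might create a dedicated IAM User for the KFP backend (and one per profile), attach the corresponding IAM Policy, and generate an access key pair. The following is a sketch with placeholder names; the policy ARN refers to the example policies from earlier in this guide:

```bash
## create a dedicated IAM User for the KFP backend
aws iam create-user --user-name "kubeflow-pipelines-backend"

## attach the backend IAM Policy you created earlier
aws iam attach-user-policy \
  --user-name "kubeflow-pipelines-backend" \
  --policy-arn "arn:aws:iam::MY_ACCOUNT_ID:policy/kubeflow-pipelines-backend"

## generate an access key pair (the output contains AccessKeyId and SecretAccessKey)
aws iam create-access-key --user-name "kubeflow-pipelines-backend"
```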
Step 1 - Create Secrets (Backend)
First, create a secret for the Kubeflow Pipelines backend:
## create a secret for the KFP backend
kubectl create secret generic \
"kubeflow-pipelines--backend-object-store-auth" \
--namespace "kubeflow" \
--from-literal AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
--from-literal AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Info
- The backend secret MUST be in the kubeflow namespace, as this is where the KFP backend is deployed.
- The backend secret should have access to all KFP artifacts in the bucket.
- See the Example IAM Policies.
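Before wiring the secret into deployKF, you may wish to confirm the keys actually have access to the bucket. A sketch with the AWS CLI (the keys, endpoint, and bucket name are placeholders):

```bash
## verify the backend keys can list the bucket
AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE" \
AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
aws s3 ls "s3://kubeflow-pipelines" --endpoint-url "https://s3.us-west-2.amazonaws.com"
```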
Step 2 - Create Secrets (User Profiles)
Next, create a secret for each profile that will use Kubeflow Pipelines:
## create a secret for the "team-1" profile
kubectl create secret generic \
"kubeflow-pipelines--profile-object-store-auth--team-1" \
--namespace "my-namespace" \
--from-literal AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
--from-literal AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
## create a secret for the "team-2" profile
kubectl create secret generic \
"kubeflow-pipelines--profile-object-store-auth--team-2" \
--namespace "my-namespace" \
--from-literal AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
--from-literal AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Info
- The profile secrets can be in any namespace; deployKF will automatically clone the correct secret into the profile namespace and configure KFP to use it.
- It is common to store all the profile secrets in a single namespace, as this makes them easier to manage.
- Each profile secret should only have the minimum permissions required for that profile.
- See the Example IAM Policies.
Step 3 - Configure deployKF
Finally, configure deployKF to use the secrets you created by setting the following values:
Value | Purpose |
---|---|
deploykf_core.deploykf_profiles_generator.profileDefaults.tools.kubeflowPipelines.objectStoreAuth | Default bucket authentication used in profiles that do NOT have tools.kubeflowPipelines.objectStoreAuth defined in their deploykf_core.deploykf_profiles_generator.profiles list entry. |
kubeflow_tools.pipelines.objectStore | Connection details & bucket authentication used by the KFP backend (not profiles). |
kubeflow_tools.pipelines.bucket | Bucket name and region configs. |
The following values will connect Kubeflow Pipelines to an external object store using key-based authentication:
deploykf_core:
deploykf_profiles_generator:
## NOTE: each profile can override the defaults
## see under `profiles` for an example of a profile
## which overrides the default auth pattern
##
profileDefaults:
tools:
kubeflowPipelines:
objectStoreAuth:
## (OPTION 1):
## - all profiles share the same access key (NOT RECOMMENDED)
## - the `existingSecretAccessKeyKey` and `existingSecretSecretKeyKey`
## reference the KEY NAMES in the Kubernetes Secret you create
##
#existingSecret: "my-secret-name"
#existingSecretNamespace: "my-namespace"
#existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
#existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
## (OPTION 2):
## - each profile has its own access key
## - instances of '{profile_name}' in `existingSecret`
## are replaced with the profile name
## - the `existingSecretAccessKeyKey` and `existingSecretSecretKeyKey`
## reference the KEY NAMES in the Kubernetes Secret you create
##
existingSecret: "kubeflow-pipelines--profile-object-store-auth--{profile_name}"
existingSecretNamespace: "my-namespace"
existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
## example of a profile which overrides the default auth
#profiles:
# - name: "my-profile"
# members: []
# tools:
# kubeflowPipelines:
# objectStoreAuth:
# existingSecret: "my-secret-name"
# existingSecretNamespace: "" # defaults to the profile's namespace
# existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
# existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
kubeflow_tools:
pipelines:
bucket:
## this specifies the name of your bucket (and region, if applicable)
name: kubeflow-pipelines
region: ""
objectStore:
useExternal: true
## this specifies the S3-compatible endpoint of your object store
## - for S3 itself, you may need to use the region-specific endpoint
## - don't set a port unless it is non-standard
host: "s3.amazonaws.com"
port: ""
useSSL: true
## these credentials are used by the KFP backend (not profiles)
auth:
## (OPTION 1):
## - set keys with values (NOT RECOMMENDED)
#accessKey: "AKIAIOSFODNN7EXAMPLE"
#secretKey: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
## (OPTION 2):
## - read a kubernetes secret from the 'kubeflow' namespace
## - note, `existingSecretKey` specifies the KEY NAMES in the
## secret itself, which contain the secret values
existingSecret: "kubeflow-pipelines--backend-object-store-auth"
existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
Option 2 - IRSA-Based Authentication (EKS only)
If you are using EKS and S3, you may use IAM roles for service accounts (IRSA).
In this method, EKS injects temporary credentials automatically based on Kubernetes ServiceAccount annotations, so no static access keys are required.
IRSA is only supported on EKS
IRSA is only supported when connecting to S3 from an EKS cluster.
If you are using a different platform, you will need to use key-based authentication.
Step 1 - Enable IRSA
First, you must enable IRSA on your EKS cluster.
Refer to the AWS Documentation for detailed instructions.
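For example, with eksctl the prerequisite IAM OIDC provider can be associated with your cluster as follows (a sketch; the cluster name is a placeholder):

```bash
## associate an IAM OIDC provider with the EKS cluster (required for IRSA)
eksctl utils associate-iam-oidc-provider \
  --cluster "my-cluster" \
  --approve
```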
Step 2 - Associate IAM Roles
You will need to create IAM Roles (with the IAM Policies you created earlier attached) and associate them with the Kubernetes ServiceAccounts used by Kubeflow Pipelines.
The following ServiceAccounts are used by Kubeflow Pipelines:
Component | Namespace | ServiceAccount Name |
---|---|---|
Kubeflow Pipelines Backend | kubeflow | ml-pipeline |
Kubeflow Pipelines Backend | kubeflow | ml-pipeline-ui |
Kubeflow Pipelines Backend | kubeflow-argo-workflows | argo-server |
Kubeflow Pipelines Backend | kubeflow-argo-workflows | argo-workflow-controller |
User Profiles | {profile_name} | default-editor |
For example, the following command will associate the arn:aws:iam::MY_ACCOUNT_ID:policy/MY_POLICY_NAME IAM Policy with the ml-pipeline ServiceAccount in the kubeflow namespace, and create an IAM Role named kubeflow-pipelines-backend:
eksctl create iamserviceaccount \
--cluster "my-cluster" \
--namespace "kubeflow" \
--name "ml-pipeline" \
--role-name "kubeflow-pipelines-backend" \
--attach-policy-arn "arn:aws:iam::MY_ACCOUNT_ID:policy/MY_POLICY_NAME" \
--approve
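Repeat the command for each ServiceAccount in the table above, adjusting the namespace, name, and role name. Note, if a ServiceAccount already exists in the cluster, eksctl may require the --override-existing-serviceaccounts flag. You can then confirm the annotation was applied, for example:

```bash
## confirm the IRSA role annotation was applied to the ServiceAccount
kubectl get serviceaccount "ml-pipeline" --namespace "kubeflow" --output yaml
```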
Step 3 - Configure deployKF
The following values are needed to configure IRSA-based auth:
Value | Purpose |
---|---|
deploykf_core.deploykf_profiles_generator.profileDefaults.plugins | Default profile-plugins, used by profiles which do NOT have plugins defined in their deploykf_core.deploykf_profiles_generator.profiles list entry. Note, the AwsIamForServiceAccount plugin is used to configure AWS IRSA-based auth by annotating the default-editor and default-viewer ServiceAccounts in each profile. |
kubeflow_dependencies.kubeflow_argo_workflows.controller.serviceAccount | Kubernetes ServiceAccount used by the Argo Workflows Controller |
kubeflow_dependencies.kubeflow_argo_workflows.server.serviceAccount | Kubernetes ServiceAccount used by the Argo Server UI |
kubeflow_tools.pipelines.serviceAccounts.apiServer | Kubernetes ServiceAccount used by the Kubeflow Pipelines API Server |
kubeflow_tools.pipelines.serviceAccounts.frontend | Kubernetes ServiceAccount used by the Kubeflow Pipelines Frontend |
kubeflow_tools.pipelines.objectStore.auth.fromEnv | If true, disables all other auth methods, so the AWS Credential Provider Chain will try to use IRSA-based auth. |
The following values will connect Kubeflow Pipelines to an external object store using IRSA-based authentication:
deploykf_core:
deploykf_profiles_generator:
## NOTE: if you want to have a different set of plugins for each profile,
## for example, to have some profiles use a different IAM role,
## you can define the `plugins` list explicitly in a profile
## to override the default plugins
profileDefaults:
plugins:
- kind: AwsIamForServiceAccount
spec:
awsIamRole: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
AnnotateOnly: true
## example of a profile which overrides the default plugins
#profiles:
# - name: "my-profile"
# members: []
# plugins:
# - kind: AwsIamForServiceAccount
# spec:
# awsIamRole: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
# AnnotateOnly: true
kubeflow_dependencies:
kubeflow_argo_workflows:
controller:
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
server:
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
kubeflow_tools:
pipelines:
serviceAccounts:
apiServer:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
frontend:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
bucket:
name: kubeflow-pipelines
region: "us-west-2"
objectStore:
useExternal: true
## for IRSA, this should always be "s3.{region}.amazonaws.com" or similar
host: "s3.us-west-2.amazonaws.com"
useSSL: true
auth:
## setting `fromEnv` to `true` disables all other auth methods
## so the AWS Credential Provider Chain will try to use IRSA-based auth
fromEnv: true
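To check that IRSA is working end-to-end, one option is to run a throwaway pod under one of the annotated ServiceAccounts and ask AWS which identity it resolves. A sketch, assuming a hypothetical profile namespace named team-1 and that the amazon/aws-cli image can be pulled by your cluster:

```bash
## run a temporary pod as the "default-editor" ServiceAccount of a profile
kubectl run irsa-test \
  --namespace "team-1" \
  --image "amazon/aws-cli" \
  --overrides '{"apiVersion":"v1","spec":{"serviceAccountName":"default-editor"}}' \
  --restart "Never" \
  --command -- aws sts get-caller-identity

## view the output (it should show the assumed IAM role), then clean up
kubectl logs irsa-test --namespace "team-1" --container "irsa-test"
kubectl delete pod irsa-test --namespace "team-1"
```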
deployKF 0.1.4 and earlier
If you are using deployKF 0.1.4 or earlier, you will need to explicitly set kubeflow_tools.pipelines.kfpV2.minioFix to false. Note that newer versions of deployKF do not have this value, as the MinIO issue has been resolved.
For example:
kubeflow_tools:
pipelines:
kfpV2:
## NOTE: only required if you are using 'sample-values.yaml' as a base
## as `minioFix` can only be 'true' when using the embedded MinIO
minioFix: false