Object Store¶
Learn why deployKF needs an object store, and how to use any S3-compatible object store with Kubeflow Pipelines.
What is an Object Store?¶
An object store is a type of storage system that manages data as objects, as opposed to traditional file systems which manage data as files. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.
What is an S3-compatible Object Store?¶
The most well-known object store is Amazon S3. Given its popularity, many other object stores have implemented S3-compatible APIs, which allows them to be used with tools that are designed to work with S3.
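For example, most S3 clients and SDKs let you override the endpoint they talk to. The following is a minimal sketch with the AWS CLI (the endpoint and bucket name are hypothetical), which lists a bucket on a non-AWS S3-compatible store:

```bash
## list a bucket on an S3-compatible object store by overriding the endpoint
## (credentials are read from the environment or ~/.aws/credentials as usual)
aws s3 ls "s3://my-bucket" --endpoint-url "https://minio.example.com"
```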
Why does deployKF use an Object Store?¶
An S3-compatible object store is a dependency of Kubeflow Pipelines, which uses it to store pipeline definitions and artifacts from pipeline runs.
Connect an External Object Store¶
By default, deployKF includes an embedded MinIO instance. However, to improve the performance and reliability of Kubeflow Pipelines, we recommend using an external S3-compatible object store.
Embedded MinIO
You should ALWAYS use an external S3-compatible object store. The embedded MinIO is intended only for testing purposes, as it supports just a single replica and has no backups.
Please ensure you are familiar with MinIO's license; at the time of writing, it was AGPLv3. deployKF is licensed under Apache 2.0 and does NOT contain any code from MinIO; instead, we provide links so that you may download MinIO directly from official sources, at your own discretion.
You may use any S3-compatible object store, as long as it is accessible from the Kubernetes cluster where deployKF is running.
You might consider using one of the following services:
Platform | Object Store | S3-compatible Endpoint |
---|---|---|
Amazon Web Services | Amazon S3 | s3.{region}.amazonaws.com |
Google Cloud | Google Cloud Storage | storage.googleapis.com (requires HMAC Keys for authentication) |
Microsoft Azure | Azure Blob Storage | No first-party API. Third-party translation layers like S3Proxy can be used. |
Alibaba Cloud | Alibaba Cloud Object Storage Service (OSS) | s3.oss-{region}.aliyuncs.com |
IBM Cloud | IBM Cloud Object Storage | s3.{region}.cloud-object-storage.appdomain.cloud |
Other | Cloudflare R2 | {account_id}.r2.cloudflarestorage.com |
Other | Wasabi | See provider documentation. |
Self-Hosted | MinIO, Ceph | See provider documentation. |
S3-compatible APIs Only
Currently, Kubeflow Pipelines only supports object stores which have an S3-compatible XML API. This means that while you can use services like Google Cloud Storage, you will need to use their XML API, and features like GKE Workload Identity will NOT work.
If you would like Kubeflow Pipelines to implement support for the native APIs of your object store, please raise this with the upstream Kubeflow Pipelines community.
1. Create a Bucket¶
You must create a single bucket for Kubeflow Pipelines. Refer to the documentation for your object store to learn how to create a bucket.
For example, if you are using AWS S3, you may create the bucket with the AWS Console or the AWS CLI.
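A minimal sketch with the AWS CLI (assuming it is installed and configured; the bucket name and region below are placeholders):

```bash
## create the bucket (replace the name and region with your own)
aws s3 mb "s3://kubeflow-pipelines" --region "us-west-2"

## (optional, but recommended) block all public access to the bucket
aws s3api put-public-access-block \
  --bucket "kubeflow-pipelines" \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
```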
2. Create IAM Policies¶
You must create IAM Policies to allow Kubeflow Pipelines to access your bucket. Refer to the documentation for your object store to learn how to create IAM Policies.
For example, if you are using AWS S3, you may create the policies with the AWS Console or the AWS CLI.
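For instance, a sketch with the AWS CLI, assuming you have saved one of the example policies below to a local JSON file (the policy name and file path are placeholders):

```bash
## create an IAM Policy from a local JSON file
aws iam create-policy \
  --policy-name "kubeflow-pipelines-backend" \
  --policy-document "file://kfp-backend-policy.json"
```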
Bucket IAM Policies¶
It is recommended to create separate IAM Roles for each component and user. The following are example IAM Policies for the Kubeflow Pipelines BACKEND and PROFILE namespaces.
IAM Policy - Backend
The following IAM Policy can be used by the Kubeflow Pipelines BACKEND. Replace <BUCKET_NAME> with the name of your bucket.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/artifacts/*",
"arn:aws:s3:::<BUCKET_NAME>/pipelines/*",
"arn:aws:s3:::<BUCKET_NAME>/v2/artifacts/*"
]
}
]
}
IAM Policy - Profile
The following IAM Policy can be used by each PROFILE namespace. Replace <BUCKET_NAME> with the name of your bucket, and <PROFILE_NAME> with the name of the profile.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/artifacts/<PROFILE_NAME>/*",
"arn:aws:s3:::<BUCKET_NAME>/v2/artifacts/<PROFILE_NAME>/*"
]
}
]
}
To learn more about how objects are stored in the bucket, see the following section:
Object Store Structure
All Kubeflow Pipelines artifacts are stored in the same bucket, but are separated by object key prefixes.
The following table shows the prefixes used by Kubeflow Pipelines:
Key Prefix | Purpose |
---|---|
/pipelines/ | pipeline definitions |
/artifacts/{profile_name}/ | pipeline run artifacts (KFP v1) |
/v2/artifacts/{profile_name}/ | pipeline run artifacts (KFP v2) |
Key Format
Notice that the key prefixes include {profile_name}; this allows prefix-based IAM Policies to ensure each profile only has access to its own artifacts.
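For illustration, this sketch lists the KFP v2 artifacts of a single hypothetical profile named team-1 with the AWS CLI (the bucket name is a placeholder):

```bash
## list KFP v2 artifacts for the "team-1" profile
aws s3 ls "s3://kubeflow-pipelines/v2/artifacts/team-1/" --recursive
```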
3. Disable Embedded MinIO¶
The deploykf_opt.deploykf_minio.enabled value controls whether the embedded MinIO instance is deployed.
The following values will disable the embedded MinIO instance:
deploykf_opt:
deploykf_minio:
enabled: false
4. Connect Kubeflow Pipelines¶
How you connect Kubeflow Pipelines to your external object store depends on the authentication method you choose.
The following sections show how to configure each method:
Option 1 - Key-Based Authentication (Access Keys)
All S3-compatible object stores support key-based authentication.
In this method, deployKF will use HMAC Keys (that is, an access_key and secret_key) to authenticate with your object store.
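For example, on AWS you might create a dedicated IAM User for the KFP backend (and one per profile), attach the corresponding IAM Policy, and generate an access key pair. The following is a sketch with placeholder names; the policy ARN refers to the example policies from earlier in this guide:

```bash
## create a dedicated IAM User for the KFP backend
aws iam create-user --user-name "kubeflow-pipelines-backend"

## attach the backend IAM Policy you created earlier
aws iam attach-user-policy \
  --user-name "kubeflow-pipelines-backend" \
  --policy-arn "arn:aws:iam::MY_ACCOUNT_ID:policy/kubeflow-pipelines-backend"

## generate an access key pair (the output contains AccessKeyId and SecretAccessKey)
aws iam create-access-key --user-name "kubeflow-pipelines-backend"
```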
Step 1 - Create Secrets (Backend)
First, create a secret for the Kubeflow Pipelines backend:
## create a secret for the KFP backend
kubectl create secret generic \
"kubeflow-pipelines--backend-object-store-auth" \
--namespace "kubeflow" \
--from-literal AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
--from-literal AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Info
- The backend secret MUST be in the kubeflow namespace, as this is where the KFP backend is deployed.
- The backend secret should have access to all KFP artifacts in the bucket.
- See the Example IAM Policies.
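Before wiring the secret into deployKF, you may wish to confirm the keys actually have access to the bucket. A sketch with the AWS CLI (the keys, endpoint, and bucket name are placeholders):

```bash
## verify the backend keys can list the bucket
AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE" \
AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
aws s3 ls "s3://kubeflow-pipelines" --endpoint-url "https://s3.us-west-2.amazonaws.com"
```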
Step 2 - Create Secrets (User Profiles)
Next, create a secret for each profile that will use Kubeflow Pipelines:
## create a secret for the "team-1" profile
kubectl create secret generic \
"kubeflow-pipelines--profile-object-store-auth--team-1" \
--namespace "my-namespace" \
--from-literal AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
--from-literal AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
## create a secret for the "team-2" profile
kubectl create secret generic \
"kubeflow-pipelines--profile-object-store-auth--team-2" \
--namespace "my-namespace" \
--from-literal AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
--from-literal AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Info
- The profile secrets can be in any namespace; deployKF will automatically clone the correct secret into the profile namespace and configure KFP to use it.
- It is common to store all the profile secrets in a single namespace, as this makes them easier to manage.
- Each profile secret should only have the minimum permissions required for that profile.
- See the Example IAM Policies.
Step 3 - Configure deployKF
Finally, configure deployKF to use the secrets you created by setting the following values:
Value | Purpose |
---|---|
deploykf_core.deploykf_profiles_generator.profileDefaults.tools.kubeflowPipelines.objectStoreAuth | Default bucket authentication used in profiles that do NOT have tools.kubeflowPipelines.objectStoreAuth defined in their deploykf_core.deploykf_profiles_generator.profiles list entry. |
kubeflow_tools.pipelines.objectStore | Connection details & bucket authentication used by the KFP backend (not profiles). |
kubeflow_tools.pipelines.bucket | Bucket name and region configs. |
The following values will connect Kubeflow Pipelines to an external object store using key-based authentication:
deploykf_core:
deploykf_profiles_generator:
## NOTE: each profile can override the defaults
## see under `profiles` for an example of a profile
## which overrides the default auth pattern
##
profileDefaults:
tools:
kubeflowPipelines:
objectStoreAuth:
## (OPTION 1):
## - all profiles share the same access key (NOT RECOMMENDED)
## - the `existingSecretAccessKeyKey` and `existingSecretSecretKeyKey`
## reference the KEY NAMES in the Kubernetes Secret you create
##
#existingSecret: "my-secret-name"
#existingSecretNamespace: "my-namespace"
#existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
#existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
## (OPTION 2):
## - each profile has its own access key
## - instances of '{profile_name}' in `existingSecret`
## are replaced with the profile name
## - the `existingSecretAccessKeyKey` and `existingSecretSecretKeyKey`
## reference the KEY NAMES in the Kubernetes Secret you create
##
existingSecret: "kubeflow-pipelines--profile-object-store-auth--{profile_name}"
existingSecretNamespace: "my-namespace"
existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
## example of a profile which overrides the default auth
#profiles:
# - name: "my-profile"
# members: []
# tools:
# kubeflowPipelines:
# objectStoreAuth:
# existingSecret: "my-secret-name"
# existingSecretNamespace: "" # defaults to the profile's namespace
# existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
# existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
kubeflow_tools:
pipelines:
bucket:
## this specifies the name of your bucket (and region, if applicable)
name: kubeflow-pipelines
region: ""
objectStore:
useExternal: true
## this specifies the S3-compatible endpoint of your object store
## - for S3 itself, you may need to use the region-specific endpoint
## - don't set a port unless it is non-standard
host: "s3.amazonaws.com"
port: ""
useSSL: true
## these credentials are used by the KFP backend (not profiles)
auth:
## (OPTION 1):
## - set keys with values (NOT RECOMMENDED)
#accessKey: "AKIAIOSFODNN7EXAMPLE"
#secretKey: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
## (OPTION 2):
## - read a kubernetes secret from the 'kubeflow' namespace
## - note, `existingSecretKey` specifies the KEY NAMES in the
## secret itself, which contain the secret values
existingSecret: "kubeflow-pipelines--backend-object-store-auth"
existingSecretAccessKeyKey: "AWS_ACCESS_KEY_ID"
existingSecretSecretKeyKey: "AWS_SECRET_ACCESS_KEY"
Option 2 - IRSA-Based Authentication (EKS only)
If you are using EKS and S3, you may use IAM roles for service accounts (IRSA).
In this method, EKS injects temporary credentials automatically based on Kubernetes ServiceAccount annotations, so no static access keys are required.
IRSA is only supported on EKS
IRSA is only supported when connecting to S3 from an EKS cluster.
If you are using a different platform, you will need to use key-based authentication.
Step 1 - Enable IRSA
First, you must enable IRSA on your EKS cluster.
Refer to the AWS Documentation for detailed instructions.
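For example, with eksctl the prerequisite IAM OIDC provider can be associated with your cluster as follows (a sketch; the cluster name is a placeholder):

```bash
## associate an IAM OIDC provider with the EKS cluster (required for IRSA)
eksctl utils associate-iam-oidc-provider \
  --cluster "my-cluster" \
  --approve
```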
Step 2 - Associate IAM Roles
You will need to create IAM Roles (with the IAM Policies you created earlier attached) and associate them with the Kubernetes ServiceAccounts used by Kubeflow Pipelines.
The following ServiceAccounts are used by Kubeflow Pipelines:
Component | Namespace | ServiceAccount Name |
---|---|---|
Kubeflow Pipelines Backend | kubeflow | ml-pipeline |
Kubeflow Pipelines Backend | kubeflow | ml-pipeline-ui |
Kubeflow Pipelines Backend | kubeflow-argo-workflows | argo-server |
Kubeflow Pipelines Backend | kubeflow-argo-workflows | argo-workflow-controller |
User Profiles | {profile_name} | default-editor |
For example, the following command will associate the arn:aws:iam::MY_ACCOUNT_ID:policy/MY_POLICY_NAME IAM Policy with the ml-pipeline ServiceAccount in the kubeflow namespace, and create an IAM Role named kubeflow-pipelines-backend:
eksctl create iamserviceaccount \
--cluster "my-cluster" \
--namespace "kubeflow" \
--name "ml-pipeline" \
--role-name "kubeflow-pipelines-backend" \
--attach-policy-arn "arn:aws:iam::MY_ACCOUNT_ID:policy/MY_POLICY_NAME" \
--approve
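Repeat the command for each ServiceAccount in the table above, adjusting the namespace, name, and role name. Note, if a ServiceAccount already exists in the cluster, eksctl may require the --override-existing-serviceaccounts flag. You can then confirm the annotation was applied, for example:

```bash
## confirm the IRSA role annotation was applied to the ServiceAccount
kubectl get serviceaccount "ml-pipeline" --namespace "kubeflow" --output yaml
```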
Step 3 - Configure deployKF
The following values are needed to configure IRSA-based auth:
Value | Purpose |
---|---|
deploykf_core.deploykf_profiles_generator.profileDefaults.plugins | Default profile-plugins, used by profiles which do NOT have plugins defined in their deploykf_core.deploykf_profiles_generator.profiles list entry. Note, the AwsIamForServiceAccount plugin is used to configure AWS IRSA-based auth by annotating the default-editor and default-viewer ServiceAccounts in each profile. |
kubeflow_dependencies.kubeflow_argo_workflows.controller.serviceAccount | Kubernetes ServiceAccount used by the Argo Workflows Controller |
kubeflow_dependencies.kubeflow_argo_workflows.server.serviceAccount | Kubernetes ServiceAccount used by the Argo Server UI |
kubeflow_tools.pipelines.serviceAccounts.apiServer | Kubernetes ServiceAccount used by the Kubeflow Pipelines API Server |
kubeflow_tools.pipelines.serviceAccounts.frontend | Kubernetes ServiceAccount used by the Kubeflow Pipelines Frontend |
kubeflow_tools.pipelines.objectStore.auth.fromEnv | If true, disables all other auth methods, so the AWS Credential Provider Chain will try to use IRSA-based auth. |
The following values will connect Kubeflow Pipelines to an external object store using IRSA-based authentication:
deploykf_core:
deploykf_profiles_generator:
## NOTE: if you want to have a different set of plugins for each profile,
## for example, to have some profiles use a different IAM role,
## you can define the `plugins` list explicitly in a profile
## to override the default plugins
profileDefaults:
plugins:
- kind: AwsIamForServiceAccount
spec:
awsIamRole: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
AnnotateOnly: true
## example of a profile which overrides the default plugins
#profiles:
# - name: "my-profile"
# members: []
# plugins:
# - kind: AwsIamForServiceAccount
# spec:
# awsIamRole: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
# AnnotateOnly: true
kubeflow_dependencies:
kubeflow_argo_workflows:
controller:
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
server:
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
kubeflow_tools:
pipelines:
serviceAccounts:
apiServer:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
frontend:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::MY_ACCOUNT_ID:role/MY_ROLE_NAME"
bucket:
name: kubeflow-pipelines
region: "us-west-2"
objectStore:
useExternal: true
## for IRSA, this should always be "s3.{region}.amazonaws.com" or similar
host: "s3.us-west-2.amazonaws.com"
useSSL: true
auth:
## setting `fromEnv` to `true` disables all other auth methods
## so the AWS Credential Provider Chain will try to use IRSA-based auth
fromEnv: true
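To check that IRSA is working end-to-end, one option is to run a throwaway pod under one of the annotated ServiceAccounts and ask AWS which identity it resolves. A sketch, assuming a hypothetical profile namespace named team-1 and that the amazon/aws-cli image can be pulled by your cluster:

```bash
## run a temporary pod as the "default-editor" ServiceAccount of a profile
kubectl run irsa-test \
  --namespace "team-1" \
  --image "amazon/aws-cli" \
  --overrides '{"apiVersion":"v1","spec":{"serviceAccountName":"default-editor"}}' \
  --restart "Never" \
  --command -- aws sts get-caller-identity

## view the output (it should show the assumed IAM role), then clean up
kubectl logs irsa-test --namespace "team-1" --container "irsa-test"
kubectl delete pod irsa-test --namespace "team-1"
```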
deployKF 0.1.4 and earlier
If you are using deployKF 0.1.4 or earlier, you will need to explicitly set kubeflow_tools.pipelines.kfpV2.minioFix to false. Note that newer versions of deployKF do not have this value, as the MinIO issue has been resolved.
For example:
kubeflow_tools:
pipelines:
kfpV2:
## NOTE: only required if you are using 'sample-values.yaml' as a base
## as `minioFix` can only be 'true' when using the embedded MinIO
minioFix: false