Data Storage

Principles

The file storage solution associated with Datalab is MinIO, a cloud-based object storage system compatible with Amazon’s S3 API. In practice, this has several advantages:

  • Stored files are easily accessible from anywhere: a file can be accessed directly via a simple URL, which can be shared.
  • It is possible to access the stored files directly within the data science services (R, Python, etc.) offered on Datalab, without the need to copy the files locally beforehand, greatly improving the reproducibility of analyses.

[Diagram: MinIO storage architecture]

Managing your data

Importing data

The My Files page in Datalab takes the form of a file explorer showing the different buckets (repositories) to which the user has access.

Each user has a personal bucket by default to store their files. Within this bucket, two options are possible:

  • “Create a directory”: creates a directory in the current bucket/directory, following a hierarchy similar to that of a traditional file system.
  • “Upload a file”: uploads one or more files to the current directory.

Note

The graphical interface for data storage in Datalab is still under construction. As such, it may experience responsiveness issues. For frequent operations on file storage, it may be preferable to interact with MinIO via the terminal.

Sharing data

The default access policy of the S3 storage forbids any access to a bucket by third-party users of the SSP Cloud. The only exception is the diffusion folder located at the root of each bucket, to which other users have read-only access by default.

A straightforward way to share files, for example for a training session, is therefore to create a diffusion folder in one’s personal bucket and use it to store all the resources meant to be shared with other users of the platform.

By clicking on a file in their personal bucket, the user can access its characteristics page. On this page, it is also possible to manually change the diffusion status of the file. Changing the status of the file from “private” to “public” generates a diffusion link, which can then be shared for downloading the file. The “public” status only grants read-only access rights to other users, and modifying or deleting other users’ personal files is not possible.
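As an illustration, a file whose status has been set to “public” can be read directly from its diffusion link, without any S3 credentials. The sketch below is a minimal, hypothetical Python example: the bucket name and file key are placeholders, and the endpoint is assumed to be the one exposed by the AWS_S3_ENDPOINT environment variable inside Datalab services (introduced in the configuration section further down).

import os
import pandas as pd

# Hypothetical public file stored in the diffusion folder of a personal bucket
PUBLIC_URL = (
    "https://" + os.environ["AWS_S3_ENDPOINT"] + "/<my_bucket>/diffusion/BPE_ENS.csv"
)

# Read the CSV directly from its diffusion link, without any local copy
df = pd.read_csv(PUBLIC_URL, sep=";")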

Note

For collaborative projects, it can be beneficial for different participants to have access to a shared storage space. It is possible to create shared buckets on MinIO for this purpose. Feel free to contact us via the channels specified on the “First Use” page if you wish to work on open data projects on the SSPCloud Datalab.

Warning

In accordance with the terms of use, only non-sensitive data (e.g. open data) may be stored on the SSPCloud Datalab. Setting a file’s diffusion status to “private” does not guarantee its confidentiality.

Using data stored on MinIO

The credentials needed to access data on MinIO are pre-configured in the various Datalab services and exposed as environment variables. This greatly simplifies importing and exporting files from within the services.
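For example, from a Python service it is possible to check that these variables are indeed available. AWS_S3_ENDPOINT is used later on this page; the other names below follow the standard S3 convention and are assumed to be injected in the same way.

import os

# Check which of the (assumed) storage-related variables are defined,
# without printing their values
for var in ["AWS_S3_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"]:
    print(var, "is set" if var in os.environ else "is NOT set")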

Configuration

In R, interaction with an S3-compatible file system is made possible by the aws.s3 library.

library(aws.s3)

In Python, interaction with an S3-compatible file system is made possible by two libraries:

  • Boto3, a library created and maintained by Amazon.
  • S3Fs, a library that allows interacting with stored files in the same way as with a classic filesystem.

For this reason, and because S3Fs is used by default by the pandas library to manage S3 connections, we will present how to manage storage on MinIO via Python using this library.

import os
import s3fs

# Create filesystem object
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': S3_ENDPOINT_URL})

MinIO offers a command-line client (mc) that allows interaction with the storage system in a manner similar to a classic UNIX filesystem. This client is installed by default and accessible via a terminal in the various Datalab services.

The MinIO client offers basic UNIX commands such as ls, cat, cp, etc. The complete list is available in the client documentation.

Listing the files in a bucket

In R:

aws.s3::get_bucket("donnees-insee", region = "")

In Python:

fs.ls("donnees-insee")

With the MinIO client, the Datalab storage is accessible via the alias s3. For example, to list the files in the bucket donnees-insee:

mc ls s3/donnees-insee

Importing data in a service

In R:

BUCKET <- "donnees-insee"
FILE_KEY_S3 <- "diffusion/BPE/2019/BPE_ENS.csv"

df <-
  aws.s3::s3read_using(
    FUN = readr::read_delim,
    # Put FUN options here
    delim = ";",
    object = FILE_KEY_S3,
    bucket = BUCKET,
    opts = list("region" = "")
  )

The S3Fs package allows you to interact with files stored on MinIO as if they were local files. The syntax is therefore very familiar to Python users. For example, to import/export tabular data via pandas:

import pandas as pd

BUCKET = "donnees-insee"
FILE_KEY_S3 = "diffusion/BPE/2019/BPE_ENS.csv"
FILE_PATH_S3 = BUCKET + "/" + FILE_KEY_S3

with fs.open(FILE_PATH_S3, mode="rb") as file_in:
    df_bpe = pd.read_csv(file_in, sep=";")

To copy data from a MinIO bucket to the local service:

mc cp s3/donnees-insee/diffusion/BPE/2019/BPE_ENS.csv ./BPE_ENS.csv

Warning

Copying files to the local service is generally not a good practice: it limits the reproducibility of analyses and quickly becomes impossible with large volumes of data. It is therefore preferable to get into the habit of importing data directly into R/Python.

Exporting data to MinIO

In R:

BUCKET_OUT = "<my_bucket>"
FILE_KEY_OUT_S3 = "my_folder/BPE_ENS.csv"

aws.s3::s3write_using(
    df,
    FUN = readr::write_csv,
    object = FILE_KEY_OUT_S3,
    bucket = BUCKET_OUT,
    opts = list("region" = "")
)

In Python:

BUCKET_OUT = "<my_bucket>"
FILE_KEY_OUT_S3 = "my_folder/BPE_ENS.csv"
FILE_PATH_OUT_S3 = BUCKET_OUT + "/" + FILE_KEY_OUT_S3

with fs.open(FILE_PATH_OUT_S3, 'w') as file_out:
    df_bpe.to_csv(file_out)

To copy data from the local service to a bucket on MinIO:

mc cp local/path/to/my/file.csv s3/<my_bucket>/remote/path/to/my/file.csv

Renewing expired access tokens

Access to MinIO storage is possible via a personal access token, which is valid for 7 days and automatically regenerated at regular intervals on SSP Cloud. When a token has expired, services created before the expiration date (using the previous token) can no longer access storage, and the affected service will appear in red on the My Services page. In this case, there are two options:

  • Open a new service on Datalab, which will by default have an up-to-date token.

  • Manually replace expired tokens with new ones. Scripts indicating how to do this for different MinIO uses (R/Python/mc) are available here. Simply choose the relevant script and execute it in your current working environment; the idea is also sketched below for Python.
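As a minimal sketch, assuming the new credentials have been copied from the scripts mentioned above, they can be passed explicitly to s3fs when recreating the filesystem object; the placeholder values are to be replaced with the freshly generated ones.

import os
import s3fs

# The endpoint does not expire; only the credentials need to be replaced
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]

# Placeholders: replace with the freshly generated credentials
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": S3_ENDPOINT_URL},
    key="<new_access_key_id>",
    secret="<new_secret_access_key>",
    token="<new_session_token>",
)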

Advanced Usage

Creating a Service Account

For security reasons, the authentication to MinIO used by default in the interactive services of the SSP Cloud relies on a temporary access token. For projects involving periodic processing or the deployment of applications, more permanent access to MinIO data may be required.

In this case, a service account is used, which is an account tied to a specific project or application rather than an individual. Technically, instead of authenticating to MinIO via a triplet (access key id, secret access key, and session token), a pair (access key id, secret access key) will be used, granting read/write permissions to a specific project bucket.
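To make the difference concrete, here is a minimal, hypothetical sketch of connecting to a project bucket with a service account from Python using s3fs: only a key and a secret are supplied, without a session token. The bucket name and credential values are placeholders.

import os
import s3fs

# Endpoint: inside a Datalab service it can be read from AWS_S3_ENDPOINT;
# in a deployed application, set it explicitly instead
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]

# Service account authentication: a (key, secret) pair only, no session token
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": S3_ENDPOINT_URL},
    key="<service_account_access_key>",
    secret="<service_account_secret_key>",
)

# List the contents of the project bucket (placeholder name)
print(fs.ls("projet-<my_project>"))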

The procedure for creating a service account, either via the MinIO console or via a terminal, is described below.

Via the MinIO console:

  • Open the MinIO console

  • Open the Access Keys tab

  • The service account information is pre-generated. It is possible to modify the access key to give it a simpler name.

  • The policy specifying the rights is also pre-generated. Ideally, the policy should be restricted to only cover the project bucket(s).

  • Once the service account is generated, the access key and secret access key can be used to authenticate the services/applications to the specified bucket.

Via a terminal:

  • Create a service on the SSP Cloud with up-to-date MinIO access. Confirm that the connection works with:
mc ls s3/<username>
  • Generate a policy.json file with the following content, replacing projet-<my_project> with the name of the relevant bucket (twice):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::projet-<my_project>",
                "arn:aws:s3:::projet-<my_project>/*"
            ]
        }
    ]
}
  • In a terminal, generate the service account with the following command:
mc admin user svcacct add s3 $AWS_ACCESS_KEY_ID --access-key="<access-key>" --secret-key="<secret-key>" --policy="policy.json"

replacing <access-key> and <secret-key> with names of your choice. Ideally, give the access key a simple name (e.g., sa-project-projectname) and the secret access key a complex value, which can be generated, for example, with the gpg client:

gpg --gen-random --armor 1 16
  • You can now use the access key and secret access key to authenticate services/applications to the specified bucket.

Warning

Note that the generated credentials are displayed only once. They can then be stored in a password manager, in a secret storage service such as Vault, or via the Onyxia project settings feature, which allows the service account to be imported directly into services at configuration time.