Data Storage

Principles

The file storage solution associated with Datalab is MinIO, a cloud-based object storage system compatible with Amazon’s S3 API. In practice, this has several advantages:

  • Stored files are easily accessible from anywhere: a file can be accessed directly via a simple URL, which can be shared.
  • It is possible to access the stored files directly within the data science services (R, Python, etc.) offered on Datalab, without the need to copy the files locally beforehand, greatly improving the reproducibility of analyses.

[Diagram: MinIO storage architecture]

Managing your data

Importing data

The My Files page in Datalab takes the form of a file explorer showing the different buckets (repositories) to which the user has access.

Each user has a personal bucket by default to store their files. Within this bucket, two options are possible:

  • “Create a directory”: creates a directory in the current bucket/directory, following a hierarchy similar to that of a traditional file system.
  • “Upload a file”: uploads one or more files to the current directory.

Note

The graphical interface for data storage in Datalab is still under construction. As such, it may experience responsiveness issues. For frequent operations on file storage, it may be preferable to interact with MinIO via the terminal.

Sharing data

The default access policy of the S3 storage forbids any access to a bucket by third-party users of the SSP Cloud. The only exception is the diffusion folder located at the root of each bucket, to which other users have read-only access by default.

A straightforward way to share files, for example for a training session, is therefore to create a diffusion folder in one’s personal bucket and use it to store all the resources meant to be shared with other users of the platform.

By clicking on a file in their personal bucket, the user can access its characteristics page. On this page, it is also possible to manually change the diffusion status of the file. Changing the status of the file from “private” to “public” generates a diffusion link, which can then be shared for downloading the file. The “public” status only grants read-only access rights to other users, and modifying or deleting other users’ personal files is not possible.
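As an illustration, a file whose status has been set to “public” can be read directly from its diffusion link, without any S3 credentials. The sketch below is a minimal, hypothetical Python example: the bucket name and file key are placeholders, and the endpoint is assumed to be the one exposed by the AWS_S3_ENDPOINT environment variable inside Datalab services (introduced in the configuration section further down).

import os
import pandas as pd

# Hypothetical public file stored in the diffusion folder of a personal bucket
PUBLIC_URL = (
    "https://" + os.environ["AWS_S3_ENDPOINT"] + "/<my_bucket>/diffusion/BPE_ENS.csv"
)

# Read the CSV directly from its diffusion link, without any local copy
df = pd.read_csv(PUBLIC_URL, sep=";")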

Note

For collaborative projects, it can be beneficial for different participants to have access to a shared storage space. It is possible to create shared buckets on MinIO for this purpose. Feel free to contact us via the channels specified on the “First Use” page if you wish to work on open data projects on the SSPCloud Datalab.

Warning

In accordance with the terms of use, only non-sensitive data (e.g. open data) may be stored on the SSPCloud Datalab. Setting a file’s diffusion status to “private” does not guarantee its confidentiality.

Using data stored on MinIO

The credentials needed to access data on MinIO are pre-configured in the various Datalab services and exposed as environment variables. This greatly simplifies importing and exporting files from within the services.
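For example, from a Python service it is possible to check that these variables are indeed available. AWS_S3_ENDPOINT is used later on this page; the other names below follow the standard S3 convention and are assumed to be injected in the same way.

import os

# Check which of the (assumed) storage-related variables are defined,
# without printing their values
for var in ["AWS_S3_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"]:
    print(var, "is set" if var in os.environ else "is NOT set")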

Configuration

In R, interaction with an S3-compatible file system is made possible by the aws.s3 library.

library(aws.s3)

In Python, interaction with an S3-compatible file system is made possible by two libraries:

  • Boto3, a library created and maintained by Amazon.
  • S3Fs, a library that allows interacting with stored files in the same way as with a classic filesystem.

For this reason, and because S3Fs is used by default by the pandas library to manage S3 connections, we will present how to manage storage on MinIO via Python using this library.

import os
import s3fs

# Create filesystem object
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': S3_ENDPOINT_URL})

MinIO offers a command-line client (mc) that allows interaction with the storage system in a manner similar to a classic UNIX filesystem. This client is installed by default and accessible via a terminal in the various Datalab services.

The MinIO client offers basic UNIX commands such as ls, cat, cp, etc. The complete list is available in the client documentation.

Listing the files in a bucket

In R:

aws.s3::get_bucket("donnees-insee", region = "")

In Python:

fs.ls("donnees-insee")

With the MinIO client, the Datalab storage is accessible via the alias s3. For example, to list the files in the bucket donnees-insee:

mc ls s3/donnees-insee

Importing data in a service

In R:

BUCKET <- "donnees-insee"
FILE_KEY_S3 <- "diffusion/BPE/2019/BPE_ENS.csv"

df <-
  aws.s3::s3read_using(
    FUN = readr::read_delim,
    # Put FUN options here
    delim = ";",
    object = FILE_KEY_S3,
    bucket = BUCKET,
    opts = list("region" = "")
  )

The S3Fs package allows you to interact with files stored on MinIO as if they were local files. The syntax is therefore very familiar to Python users. For example, to import/export tabular data via pandas:

import pandas as pd

BUCKET = "donnees-insee"
FILE_KEY_S3 = "diffusion/BPE/2019/BPE_ENS.csv"
FILE_PATH_S3 = BUCKET + "/" + FILE_KEY_S3

with fs.open(FILE_PATH_S3, mode="rb") as file_in:
    df_bpe = pd.read_csv(file_in, sep=";")

To copy data from a MinIO bucket to the local service:

mc cp s3/donnees-insee/diffusion/BPE/2019/BPE_ENS.csv ./BPE_ENS.csv

Warning

Copying files to the local service is generally not a good practice: it limits the reproducibility of analyses and quickly becomes impossible with large volumes of data. It is therefore preferable to get into the habit of importing data directly into R/Python.

Exporting data to MinIO

In R:

BUCKET_OUT = "<my_bucket>"
FILE_KEY_OUT_S3 = "my_folder/BPE_ENS.csv"

aws.s3::s3write_using(
    df,
    FUN = readr::write_csv,
    object = FILE_KEY_OUT_S3,
    bucket = BUCKET_OUT,
    opts = list("region" = "")
)

In Python:

BUCKET_OUT = "<my_bucket>"
FILE_KEY_OUT_S3 = "my_folder/BPE_ENS.csv"
FILE_PATH_OUT_S3 = BUCKET_OUT + "/" + FILE_KEY_OUT_S3

with fs.open(FILE_PATH_OUT_S3, 'w') as file_out:
    df_bpe.to_csv(file_out)

To copy data from the local service to a bucket on MinIO:

mc cp local/path/to/my/file.csv s3/<my_bucket>/remote/path/to/my/file.csv

Renewing expired access tokens

Access to MinIO storage is possible via a personal access token, which is valid for 7 days and automatically regenerated at regular intervals on SSP Cloud. When a token has expired, services created before the expiration date (using the previous token) can no longer access storage, and the affected service will appear in red on the My Services page. In this case, there are two options:

  • Open a new service on Datalab, which will by default have an up-to-date token.

  • Manually replace expired tokens with new ones. Scripts indicating how to do this for different MinIO uses (R/Python/mc) are available here. Simply choose the relevant script and execute it in your current working environment; the idea is also sketched below for Python.
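As a minimal sketch, assuming the new credentials have been copied from the scripts mentioned above, they can be passed explicitly to s3fs when recreating the filesystem object; the placeholder values are to be replaced with the freshly generated ones.

import os
import s3fs

# The endpoint does not expire; only the credentials need to be replaced
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]

# Placeholders: replace with the freshly generated credentials
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": S3_ENDPOINT_URL},
    key="<new_access_key_id>",
    secret="<new_secret_access_key>",
    token="<new_session_token>",
)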

Advanced Usage

Creating a Service Account

For security reasons, the authentication to MinIO used by default in the interactive services of the SSP Cloud relies on a temporary access token. For projects involving periodic processing or the deployment of applications, more permanent access to MinIO data may be required.

In this case, a service account is used, which is an account tied to a specific project or application rather than an individual. Technically, instead of authenticating to MinIO via a triplet (access key id, secret access key, and session token), a pair (access key id, secret access key) will be used, granting read/write permissions to a specific project bucket.
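To make the difference concrete, here is a minimal, hypothetical sketch of connecting to a project bucket with a service account from Python using s3fs: only a key and a secret are supplied, without a session token. The bucket name and credential values are placeholders.

import os
import s3fs

# Endpoint: inside a Datalab service it can be read from AWS_S3_ENDPOINT;
# in a deployed application, set it explicitly instead
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]

# Service account authentication: a (key, secret) pair only, no session token
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": S3_ENDPOINT_URL},
    key="<service_account_access_key>",
    secret="<service_account_secret_key>",
)

# List the contents of the project bucket (placeholder name)
print(fs.ls("projet-<my_project>"))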

The procedure for creating a service account, either via the MinIO console or via a terminal, is described below.

Via the MinIO console:

  • Open the MinIO console

  • Open the Access Keys tab

  • The service account information is pre-generated. It is possible to modify the access key to give it a simpler name.

  • The policy specifying the rights is also pre-generated. Ideally, the policy should be restricted to only cover the project bucket(s).

  • Once the service account is generated, the access key and secret access key can be used to authenticate the services/applications to the specified bucket.

Via a terminal:

  • Create a service on the SSP Cloud with up-to-date MinIO access. Confirm that the connection works with:
mc ls s3/<username>
  • Generate a policy.json file with the following content, replacing projet-<my_project> with the name of the relevant bucket (twice):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::projet-<my_project>",
                "arn:aws:s3:::projet-<my_project>/*"
            ]
        }
    ]
}
  • In a terminal, generate the service account with the following command:
mc admin user svcacct add s3 $AWS_ACCESS_KEY_ID --access-key="<access-key>" --secret-key="<secret-key>" --policy="policy.json"

replacing <access-key> and <secret-key> with names of your choice. Ideally, give the access key a simple name (e.g., sa-project-projectname) and the secret access key a complex value, which can be generated, for example, with the gpg client:

gpg --gen-random --armor 1 16
  • You can now use the access key and secret access key to authenticate services/applications to the specified bucket.

Warning

Note that the generated credentials are displayed only once. They can then be stored in a password manager, in a secret storage service such as Vault, or via the Onyxia project settings feature, which allows the service account to be imported directly into services at configuration time.