Data Storage
Principles
The file storage solution associated with Datalab is MinIO, a cloud-based object storage system compatible with Amazon’s S3 API. In practice, this has several advantages:
- Stored files are easily accessible from anywhere: a file can be accessed directly via a simple URL, which can be shared.
- It is possible to access the stored files directly within the data science services (R, Python, etc.) offered on Datalab, without the need to copy the files locally beforehand, greatly improving the reproducibility of analyses.
Managing your data
Importing data
The My Files page in Datalab takes the form of a file explorer showing the different buckets (repositories) to which the user has access.
Each user has a personal bucket by default to store their files. Within this bucket, two options are possible:
- “Create a directory”: Creates a directory within the current bucket/directory, allowing files to be organized hierarchically as in a traditional file system.
- “Upload a file”: Uploads one or multiple files to the current directory.
The graphical interface for data storage in Datalab is still under construction. As such, it may experience responsiveness issues. For frequent operations on file storage, it may be preferable to interact with MinIO via the terminal.
Using data stored on MinIO
The access credentials needed to access data on MinIO are pre-configured in the various Datalab services, accessible in the form of environment variables. This greatly facilitates importing and exporting files from the services.
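As an illustration, one can list from Python which of these variables are defined in a given service. This is only a sketch: AWS_S3_ENDPOINT is the variable used in the configuration example below, while the other names are assumed to follow the standard AWS convention and may differ.
import os

# Check which of the usual S3 connection variables are defined in the service.
# Only AWS_S3_ENDPOINT appears in the examples on this page; the other names
# are assumptions based on the standard AWS convention.
for var in ["AWS_S3_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"]:
    print(var, "defined:", var in os.environ)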
Configuration
In R, interaction with an S3-compatible file system is made possible by the aws.s3 library:
library(aws.s3)
In Python, interaction with an S3-compatible file system is made possible by two libraries:
- Boto3, a library created and maintained by Amazon.
- S3Fs, a library that allows interacting with stored files in a way similar to a classic filesystem.
For this reason, and because S3Fs is used by default by the pandas library to manage S3 connections, we will present how to manage storage on MinIO via Python using this library.
import os
import s3fs
# Create filesystem object
= "https://" + os.environ["AWS_S3_ENDPOINT"]
S3_ENDPOINT_URL = s3fs.S3FileSystem(client_kwargs={'endpoint_url': S3_ENDPOINT_URL}) fs
MinIO offers a command-line client (mc) that allows interaction with the storage system in a manner similar to a classic UNIX filesystem. This client is installed by default and accessible via a terminal in the various Datalab services.
The MinIO client offers basic UNIX commands such as ls, cat, cp, etc. The complete list is available in the client documentation.
Listing the files in a bucket
In R:
aws.s3::get_bucket("donnees-insee", region = "")
"donnees-insee") fs.ls(
The Datalab storage is accessible via the alias s3. For example, to list the files in the bucket donnees-insee:
mc ls s3/donnees-insee
Importing data in a service
<- "donnees-insee"
BUCKET <- "diffusion/BPE/2019/BPE_ENS.csv"
FILE_KEY_S3
<-
df ::s3read_using(
aws.s3FUN = readr::read_delim,
# Put FUN options here
delim = ";",
object = FILE_KEY_S3,
bucket = BUCKET,
opts = list("region" = "")
)
In Python, the S3Fs package allows you to interact with files stored on MinIO as if they were local files. The syntax is therefore very familiar to Python users. For example, to import/export tabular data via pandas:
import pandas as pd

BUCKET = "donnees-insee"
FILE_KEY_S3 = "diffusion/BPE/2019/BPE_ENS.csv"
FILE_PATH_S3 = BUCKET + "/" + FILE_KEY_S3

with fs.open(FILE_PATH_S3, mode="rb") as file_in:
    df_bpe = pd.read_csv(file_in, sep=";")
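Since pandas delegates S3 paths to S3Fs, the same file can also be read directly from an s3:// URL by passing the endpoint through the storage_options argument. The snippet below is an equivalent alternative to the example above (it reuses the BUCKET and FILE_KEY_S3 values defined there), not part of the original documentation:
import os
import pandas as pd

# Read the same file directly via its s3:// URL; pandas uses s3fs under the hood
df_bpe = pd.read_csv(
    f"s3://{BUCKET}/{FILE_KEY_S3}",
    sep=";",
    storage_options={"client_kwargs": {"endpoint_url": "https://" + os.environ["AWS_S3_ENDPOINT"]}},
)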
To copy data from a MinIO bucket to the local service:
mc cp s3/donnees-insee/diffusion/BPE/2019/BPE_ENS.csv ./BPE_ENS.csv
Copying files to the local service is generally not good practice: it limits the reproducibility of analyses and quickly becomes impossible with large volumes of data. It is therefore preferable to get into the habit of importing data directly into R/Python.
Exporting data to MinIO
= "<my_bucket>"
BUCKET_OUT = "my_folder/BPE_ENS.csv"
FILE_KEY_OUT_S3
::s3write_using(
aws.s3
df,FUN = readr::write_csv,
object = FILE_KEY_OUT_S3,
bucket = BUCKET_OUT,
opts = list("region" = "")
)
= "<my_bucket>"
BUCKET_OUT = "my_folder/BPE_ENS.csv"
FILE_KEY_OUT_S3 = BUCKET_OUT + "/" + FILE_KEY_OUT_S3
FILE_PATH_OUT_S3
with fs.open(FILE_PATH_OUT_S3, 'w') as file_out:
df_bpe.to_csv(file_out)
To copy data from the local service to a bucket on MinIO:
mc cp local/path/to/my/file.csv s3/<my_bucket>/remote/path/to/my/file.csv
Renewing expired access tokens
Access to MinIO storage is possible via a personal access token, which is valid for 7 days and automatically regenerated at regular intervals on SSP Cloud. When a token has expired, services created before the expiration date (using the previous token) can no longer access storage, and the affected service will appear in red on the My Services page. In this case, there are two options:
- Open a new service on Datalab, which will have an up-to-date token by default.
- Manually replace the expired tokens with new ones. Scripts indicating how to do this for the different ways of using MinIO (R/Python/mc) are available here. Simply choose the relevant script and execute it in your current working environment; a minimal Python sketch of what this replacement amounts to is given below.
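For illustration only, re-creating the S3Fs filesystem object with the renewed credentials is the core of such a replacement in Python. The placeholder values below are hypothetical and would come from the scripts mentioned above.
import os
import s3fs

# Hypothetical placeholders for the renewed credentials
NEW_KEY = "<new_access_key_id>"
NEW_SECRET = "<new_secret_access_key>"
NEW_TOKEN = "<new_session_token>"

# Re-create the filesystem object so that subsequent reads/writes use the new token
fs = s3fs.S3FileSystem(
    key=NEW_KEY,
    secret=NEW_SECRET,
    token=NEW_TOKEN,
    client_kwargs={"endpoint_url": "https://" + os.environ["AWS_S3_ENDPOINT"]},
)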
Advanced Usage
Creating a Service Account
For security reasons, the authentication to MinIO used by default in the interactive services of the SSP Cloud relies on a temporary access token. In the context of projects involving periodic processing or the deployment of applications, a more permanent access to MinIO data may be required.
In this case, a service account is used, which is an account tied to a specific project or application rather than an individual. Technically, instead of authenticating to MinIO via a triplet (access key id, secret access key, and session token), a pair (access key id, secret access key) will be used, granting read/write permissions to a specific project bucket.
The procedure for creating a service account is described below, first from the MinIO console and then from a terminal.
- Open the MinIO console.
- Open the Access Keys tab.
- The service account information is pre-generated. It is possible to modify the access key to give it a simpler name.
- The policy specifying the rights is also pre-generated. Ideally, the policy should be restricted to only cover the project bucket(s).
- Once the service account is generated, the access key and secret access key can be used to authenticate the services/applications to the specified bucket.
Alternatively, the service account can be created from a terminal:
- Create a service on the SSP Cloud with up-to-date MinIO access. Confirm that the connection works with:
mc ls s3/<username>
- Generate a policy.json file with the following content, replacing projet-<my_project> with the name of the relevant bucket (twice):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::projet-<my_project>",
        "arn:aws:s3:::projet-<my_project>/*"
      ]
    }
  ]
}
- In a terminal, generate the service account with the following command:
mc admin user svcacct add s3 $AWS_ACCESS_KEY_ID --access-key="<access-key>" --secret-key="<secret-key>" --policy="policy.json"
replacing <access-key> and <secret-key> with names of your choice. Ideally, give a simple name for the access key (e.g., sa-project-projectname) but a complex key for the secret access key, which can be generated, for example, with the gpg client:
gpg --gen-random --armor 1 16
- You can now use the access key and secret access key to authenticate services/applications to the specified bucket; a minimal Python sketch is given below.
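As a sketch of this last step, a Python application could authenticate with the service account pair alone (no session token). The environment variable names used here are hypothetical; only the endpoint variable comes from the earlier examples.
import os
import s3fs

# Authenticate with the permanent service account pair: no session token is needed.
# SERVICE_ACCOUNT_KEY and SERVICE_ACCOUNT_SECRET are hypothetical variable names.
fs = s3fs.S3FileSystem(
    key=os.environ["SERVICE_ACCOUNT_KEY"],
    secret=os.environ["SERVICE_ACCOUNT_SECRET"],
    client_kwargs={"endpoint_url": "https://" + os.environ["AWS_S3_ENDPOINT"]},
)

# Check that the project bucket is accessible
print(fs.ls("projet-<my_project>"))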
Note that the generated credentials are displayed only once; they should then be stored in a password manager, in a secret storage service such as Vault, or via the Onyxia project settings feature, which allows the service account to be imported directly into services when they are configured.