102 - persist bundle metadata in separate table

ryanlong requested to merge 102-main into main Feb 29, 2024

Created by: ri-pandey

Description

Persist bundle metadata in separate table.

Related Issue(s)

Closes #102 (closed), possibly #37 (closed) as well.

Changes Made

List the main changes made in this PR. Be as specific as possible.

Checklist

Before submitting this PR, please make sure that:

Additional Information

Added a separate table, bundle, for bundle metadata. It is linked to dataset via dataset_id.
The Archive step is updated to compute the bundle metadata (path, checksum, etc.), create the bundle record, and associate it with the corresponding dataset.
Stage step is updated to verify that the checksum of the downloaded bundle is the same as the checksum that the Archive step persisted to the database (this was the requirement for ticket #37 (closed)).
Download step is updated to create a symlink for the bundle download.
All places in the code that were using dataset.bundle_size have been updated to use the size from the new bundle table.
A new alias, bundle_alias, is used to obfuscate the actual location of the bundle on Slate-scratch. This is the alias provided to users who attempt to download bundles. I am not reusing stage_alias for this because the stage_alias directory symlinks to the top-level directory inside the dataset, instead of the dataset's own directory. Downloading the staged dataset through stage_alias would therefore end up with the bundle inside the dataset.
Added new paths for staging bundles.
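To make the Archive/Stage relationship above concrete, here is a minimal sketch of persisting a bundle's checksum at archive time and comparing against it at stage time. It is not the code in this PR: it uses sqlite3 purely for illustration, and the column names other than dataset_id, the one-bundle-per-dataset constraint, the MD5 default, and the helper names record_bundle and verify_bundle are all assumptions.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "md5") -> str:
    """Hash a file in chunks so large bundles are never fully loaded in memory.

    The MD5 default is an assumption; the PR does not name the algorithm.
    """
    digest = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a toy dataset table and a bundle table linked via dataset_id."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS dataset (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- Column names besides dataset_id are illustrative assumptions.
        CREATE TABLE IF NOT EXISTS bundle (
            id         INTEGER PRIMARY KEY,
            dataset_id INTEGER NOT NULL UNIQUE REFERENCES dataset(id),
            path       TEXT NOT NULL,
            size       INTEGER NOT NULL,
            checksum   TEXT NOT NULL
        );
        """
    )


def record_bundle(conn: sqlite3.Connection, dataset_id: int, bundle_path: Path) -> None:
    """Archive side: compute the bundle metadata and persist it."""
    conn.execute(
        "INSERT INTO bundle (dataset_id, path, size, checksum) VALUES (?, ?, ?, ?)",
        (
            dataset_id,
            str(bundle_path),
            bundle_path.stat().st_size,
            file_checksum(bundle_path),
        ),
    )
    conn.commit()


def verify_bundle(conn: sqlite3.Connection, dataset_id: int, downloaded: Path) -> bool:
    """Stage side: compare the downloaded file against the persisted checksum."""
    row = conn.execute(
        "SELECT checksum FROM bundle WHERE dataset_id = ?", (dataset_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no bundle recorded for dataset {dataset_id}")
    return file_checksum(downloaded) == row[0]
```

In this sketch the Stage side never recomputes anything from the archive; it trusts only the value stored by the Archive side, which is the property the checksum requirement in #37 asks for.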

I have also added a new script (populate_bundles.py) that:

iterates through the currently archived datasets,
downloads them from the SDA,
verifies that the calculated checksum of each downloaded bundle matches its checksum retrieved from the SDA (using the hsi utility), and
if checksum validation passes, runs the sync_archived_bundles workflow on these datasets. The workflow runs the archive (which populates the bundle metadata in the bundle table), stage, validate, and setup_download steps on each of them, thus preparing them for download.
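The control flow of the script described above can be sketched roughly as follows. This is not the actual populate_bundles.py: the ArchivedDataset type and the download, remote_checksum, and start_workflow callables are hypothetical stand-ins for the SDA download, the hsi checksum lookup, and the workflow launcher, and the MD5 default is an assumption.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List


@dataclass
class ArchivedDataset:
    """Hypothetical stand-in for an archived dataset record."""
    id: int
    archive_path: str  # location of the bundle on the SDA


def local_checksum(path: Path, algorithm: str = "md5") -> str:
    """Hash the downloaded bundle in chunks (algorithm is an assumption)."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def populate_bundles(
    datasets: Iterable[ArchivedDataset],
    download: Callable[[ArchivedDataset], Path],       # hypothetical: fetch from the SDA
    remote_checksum: Callable[[ArchivedDataset], str],  # hypothetical: checksum via hsi
    start_workflow: Callable[[str, int], None],         # hypothetical: workflow launcher
    algorithm: str = "md5",
) -> List[int]:
    """Start the sync workflow for each verified dataset.

    Returns the ids of datasets whose checksum validation failed.
    """
    failed: List[int] = []
    for ds in datasets:
        bundle_path = download(ds)
        if local_checksum(bundle_path, algorithm) != remote_checksum(ds):
            # Skip datasets whose download does not match the archive.
            failed.append(ds.id)
            continue
        start_workflow("sync_archived_bundles", ds.id)
    return failed
```

Passing the external operations in as callables keeps the loop testable without access to the SDA; whether the real script is structured this way is not stated in the PR.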
