102 - persist bundle metadata in separate table
Created by: ri-pandey
Description
Persist bundle metadata in separate table.
Related Issue(s)
Closes #102 (closed), possibly #37 (closed) as well.
Changes Made
List the main changes made in this PR. Be as specific as possible.
- Feature added
- Bug fixed
- Code refactored
- Documentation updated
- Other changes: [describe]
Checklist
Before submitting this PR, please make sure that:
- Your code passes linting and coding style checks.
- Documentation has been updated to reflect the changes.
- You have reviewed your own code and resolved any merge conflicts.
- You have requested a review from at least one team member.
- Any relevant issue(s) have been linked to this PR.
Additional Information
- Added a separate table for bundle metadata (`bundle`). This is linked to `dataset` via `dataset_id`.
- Archive step is updated to evaluate bundle metadata (path, checksum, etc.), create the bundle record, and associate it with the corresponding dataset.
- Stage step is updated to verify that the checksum of the downloaded bundle is the same as the checksum that the Archive step persisted to the database (this was the requirement for ticket #37 (closed)).
- Download step is updated to create a symlink for the bundle download.
- All places in the code that were using `dataset.bundle_size` have been updated to use the size from the new `bundle` table.
- A new alias `bundle_alias` is used to obfuscate the actual location of the bundle on Slate-scratch. This is the alias provided to users who attempt to download bundles. I am not reusing the `stage_alias` for this because the `stage_alias` directory symlinks to the top-level directory inside the dataset, instead of the dataset's directory. Therefore, downloading the staged dataset with this approach would end up having the bundle inside the dataset.
- Added new paths for staging bundles.
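
To illustrate the `bundle` table from the first item above, here is a minimal sketch of a bundle record linked to a dataset via `dataset_id`. It uses SQLite only to keep the example self-contained. Apart from `dataset_id`, the column names (`name`, `path`, `size`, `checksum`), the one-bundle-per-dataset constraint, and the helpers `create_db`, `record_bundle`, and `bundle_size_for` are assumptions for illustration, not the schema or code in this PR.

```python
import sqlite3

# Hypothetical schema: only the bundle -> dataset link via dataset_id
# comes from the PR description; the other columns are illustrative.
SCHEMA = """
CREATE TABLE dataset (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE bundle (
    id         INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL UNIQUE REFERENCES dataset(id),
    name       TEXT NOT NULL,
    path       TEXT NOT NULL,
    size       INTEGER NOT NULL,
    checksum   TEXT NOT NULL
);
"""


def create_db():
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def record_bundle(conn, dataset_id, name, path, size, checksum):
    """Persist bundle metadata and associate it with a dataset (as the Archive step would)."""
    cur = conn.execute(
        "INSERT INTO bundle (dataset_id, name, path, size, checksum) "
        "VALUES (?, ?, ?, ?, ?)",
        (dataset_id, name, path, size, checksum),
    )
    conn.commit()
    return cur.lastrowid


def bundle_size_for(conn, dataset_id):
    """Read the size from the bundle table instead of a dataset.bundle_size column."""
    row = conn.execute(
        "SELECT size FROM bundle WHERE dataset_id = ?", (dataset_id,)
    ).fetchone()
    return row[0] if row else None
```

Callers that previously read `dataset.bundle_size` would go through a lookup like `bundle_size_for` in this sketch, which returns `None` when a dataset has no bundle record yet.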
I have also added a new script (`populate_bundles.py`) that
- iterates through the currently archived datasets
- downloads them from the SDA
- verifies that the calculated checksum of each downloaded bundle matches the checksum retrieved from the SDA (using the `hsi` utility)
- if checksum validation passes, runs the `sync_archived_bundles` workflow on these datasets, which runs the archive (which populates the bundle metadata in the `bundle` table), stage, validate, and setup_download steps on each of them, thus preparing them for download.
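
The control flow of the script can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `populate_bundles.py`. The callables `download`, `fetch_sda_checksum`, and `run_workflow` are hypothetical stand-ins for the SDA download, the `hsi` checksum lookup, and triggering the `sync_archived_bundles` workflow. MD5 is assumed as the checksum algorithm, which the description does not state.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable, List


def file_checksum(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def populate_bundles(
    dataset_ids: Iterable[int],
    download: Callable[[int], Path],            # hypothetical: fetch bundle from the SDA
    fetch_sda_checksum: Callable[[int], str],   # hypothetical: checksum reported by hsi
    run_workflow: Callable[[int], None],        # hypothetical: start sync_archived_bundles
) -> List[int]:
    """Run the workflow for datasets whose checksum matches; return the ids that failed."""
    failed = []
    for dataset_id in dataset_ids:
        bundle_path = download(dataset_id)
        if file_checksum(bundle_path) == fetch_sda_checksum(dataset_id):
            run_workflow(dataset_id)
        else:
            # Checksum mismatch: skip the workflow so a corrupt bundle is never staged.
            failed.append(dataset_id)
    return failed
```

Passing the external operations in as callables keeps the checksum gate testable without access to the SDA. The returned list of failed ids gives a simple record of which datasets need a retry.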