102 - persist bundle metadata in separate table

ryanlong requested to merge 102-main into main

Created by: ri-pandey

Description

Persist bundle metadata in separate table.

Related Issue(s)

Closes #102 (closed), possibly #37 (closed) as well.

Changes Made

List the main changes made in this PR. Be as specific as possible.

  • Feature added
  • Bug fixed
  • Code refactored
  • Documentation updated
  • Other changes: [describe]

Checklist

Before submitting this PR, please make sure that:

  • Your code passes linting and coding style checks.
  • Documentation has been updated to reflect the changes.
  • You have reviewed your own code and resolved any merge conflicts.
  • You have requested a review from at least one team member.
  • Any relevant issue(s) have been linked to this PR.

Additional Information

  • Added a separate table, bundle, for bundle metadata. It is linked to dataset via dataset_id.
  • Archive step is updated to evaluate bundle metadata (path, checksum, etc.), create the bundle record, and associate it with the corresponding dataset.
  • Stage step is updated to verify that the checksum of the downloaded bundle is the same as the checksum that the Archive step persisted to the database (this was the requirement for ticket #37 (closed)).
  • Download step is updated to create a symlink for the bundle download.
  • All places in the code that were using dataset.bundle_size have been updated to use the size from the new bundle table.
  • A new alias, bundle_alias, is used to obfuscate the actual location of the bundle on Slate-scratch. This is the alias provided to users who attempt to download bundles. I am not reusing the stage_alias for this because the stage_alias directory symlinks to the top-level directory inside the dataset, instead of the dataset's own directory. Reusing it would therefore place the bundle inside the dataset when the staged dataset is downloaded.
  • Added new paths for staging bundles.
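
For reviewers, the Stage checksum check and the Download symlink described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this MR: the function names (file_checksum, verify_bundle, link_bundle_for_download), the directory layout, and the default sha256 algorithm are all assumptions, since the description does not state which checksum algorithm the Archive step records.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are never loaded into memory at once."""
    digest = hashlib.new(algorithm)  # algorithm is an assumption; use whatever Archive stored
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path: Path, expected_checksum: str, algorithm: str = "sha256") -> None:
    """Raise if the downloaded bundle does not match the checksum persisted by Archive."""
    actual = file_checksum(path, algorithm)
    if actual != expected_checksum:
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected_checksum}, got {actual}"
        )


def link_bundle_for_download(bundle_path: Path, download_dir: Path, bundle_alias: str) -> Path:
    """Expose the bundle under an alias directory (hypothetical layout) via a symlink."""
    link = download_dir / bundle_alias / bundle_path.name
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link left by an earlier run
    link.symlink_to(bundle_path)
    return link
```

In this sketch the symlink path exposes only the alias and the bundle file name, which matches the intent of hiding the real location of the bundle from users.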

I have also added a new script (populate_bundles.py) that

  • iterates through the currently archived datasets.
  • downloads them from the SDA.
  • verifies that the calculated checksum of each downloaded bundle matches the checksum retrieved from the SDA (using the hsi utility).
  • if checksum validation passes, runs the sync_archived_bundles workflow on these datasets. The workflow runs the archive (which populates the bundle metadata in the bundle table), stage, validate, and setup_download steps on each dataset, thus preparing it for download.
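
The control flow of the script can be summarized with the sketch below. It is a simplified illustration under stated assumptions, not the contents of populate_bundles.py: ArchivedDataset and the injected callables (download, local_checksum, remote_checksum, start_workflow) are hypothetical stand-ins for the SDA download, the hsi checksum lookup, and the workflow trigger.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ArchivedDataset:
    """Hypothetical minimal view of an archived dataset record."""
    id: int
    archive_path: str


def populate_bundles(
    datasets: Iterable[ArchivedDataset],
    download: Callable[[ArchivedDataset], str],         # fetches the bundle, returns local path
    local_checksum: Callable[[str], str],               # checksum of the downloaded file
    remote_checksum: Callable[[ArchivedDataset], str],  # checksum reported by the SDA
    start_workflow: Callable[[str, int], None],         # starts a named workflow for a dataset id
) -> List[int]:
    """Return the ids of datasets whose checksum validation failed."""
    failed: List[int] = []
    for dataset in datasets:
        local_path = download(dataset)
        if local_checksum(local_path) != remote_checksum(dataset):
            # Skip the workflow when the download cannot be trusted.
            failed.append(dataset.id)
            continue
        start_workflow("sync_archived_bundles", dataset.id)
    return failed
```

Passing the I/O steps in as callables keeps the sketch independent of the real SDA and workflow APIs, which are not shown in this description.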
