Improve Watch Script and Dataset Creation

ryanlong requested to merge watch into main

Created by: deepakduggirala

Description

Datasets can now be created from multiple entry points in the codebase, so the watch script's local record of datasets can drift out of sync with the actual database state. In addition, checking locally whether a dataset exists before creating it introduces a race condition under concurrent requests.

Maintaining local state has another drawback: if a user deletes, from the UI, a dataset that the watch script originally ingested, the script does not reprocess it until it restarts and fetches the latest state.
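
To illustrate the race condition described above, here is a minimal sketch of the alternative: the existence check and the insert happen in one atomic step on the side that owns the data, so concurrent callers cannot both succeed. `DatasetStore` is a hypothetical in-memory stand-in, not the actual API or database code; the lock only plays the role that a uniqueness guarantee on (name, type) would play in a real database.

```python
import threading


class DatasetStore:
    """Hypothetical in-memory stand-in for the datasets table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}  # (name, type) -> dataset dict, non-deleted only

    def create(self, name, type_):
        # Check and insert under one lock, so no other caller can slip in
        # between "does it exist?" and "create it".
        with self._lock:
            key = (name, type_)
            if key in self._active:
                return 409, None  # conflict: an active dataset already exists
            dataset = {"name": name, "type": type_, "is_deleted": False}
            self._active[key] = dataset
            return 201, dataset


def create_concurrently(store, name, type_, workers=8):
    """Fire several create requests for the same dataset at once."""
    statuses = [None] * workers

    def worker(i):
        statuses[i], _ = store.create(name, type_)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return statuses
```

However the threads interleave, exactly one request should get 201 and the rest 409, which is the behavior a client-side "check, then create" cannot guarantee.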

API Changes

  • POST /datasets

    • Returns 201 Created with the dataset object on success.
    • Returns 409 Conflict if a non-deleted dataset (is_deleted=false) with the same name and type already exists.
  • POST /datasets/bulk

    • Processes multiple datasets at once.
    • Returns a response object:
      {
        "created": [...], 
        "conflicted": [...], 
        "errored": [...]
      }
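
As a rough sketch of how a caller might consume the bulk response, the helpers below only count the entries in each list. The shape of the individual entries is not specified in this description, so the code treats them as opaque; the function names are hypothetical and not part of the API.

```python
from typing import Any, Dict

# The three keys documented for the POST /datasets/bulk response.
BULK_KEYS = ("created", "conflicted", "errored")


def summarize_bulk_response(response: Dict[str, Any]) -> Dict[str, int]:
    """Count entries per outcome; a missing or null key counts as zero."""
    return {key: len(response.get(key) or []) for key in BULK_KEYS}


def has_errors(response: Dict[str, Any]) -> bool:
    """True if at least one dataset landed in the 'errored' list."""
    return bool(response.get("errored"))
```

A caller could log the summary after each request and retry or alert only when `has_errors` is true, since "conflicted" entries are an expected outcome for datasets that already exist.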

Watch Script Changes

  • Removed local tracking of completed datasets. The script now relies solely on POST /datasets/bulk.
  • Introduced a Full Scan:
    • Periodically calls POST /datasets/bulk with all directory names in the watch directory.
    • Ensures reprocessing of datasets that were deleted after ingestion.
    • Can be disabled either by setting the config value registration.full_scan_every_n_scans to None or at the observer level.
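
The full scan schedule described above could look roughly like the sketch below. Only the config name `full_scan_every_n_scans` and its `None` setting come from this description; the function names, the scan counter, and the choice to count scans from 1 are assumptions made for illustration, not the watch script's actual code.

```python
from typing import Iterable, List, Optional


def is_full_scan(scan_count: int, full_scan_every_n_scans: Optional[int]) -> bool:
    """Decide whether this scan should submit every directory.

    scan_count is assumed to start at 1 and increase by 1 per scan.
    """
    if full_scan_every_n_scans is None:
        return False  # full scan disabled
    if full_scan_every_n_scans < 1:
        raise ValueError("full_scan_every_n_scans must be a positive integer or None")
    return scan_count % full_scan_every_n_scans == 0


def names_to_submit(
    all_dirs: Iterable[str],
    new_dirs: Iterable[str],
    scan_count: int,
    full_scan_every_n_scans: Optional[int],
) -> List[str]:
    """Pick the directory names to send to POST /datasets/bulk for this scan."""
    if is_full_scan(scan_count, full_scan_every_n_scans):
        return sorted(all_dirs)  # resubmit everything; existing datasets conflict
    return sorted(new_dirs)  # regular scan: only newly seen directories
```

On a full scan, directories whose datasets still exist simply come back as conflicts, while a directory whose dataset was deleted is created again, which is how reprocessing is triggered.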

⚠️ WARNING: With full scan enabled, any deleted dataset will be re-ingested if its source directory remains in the watch path. To prevent this, either disable full scan or ensure the source directory is removed after archival as part of the integrated workflow.

Changes Made

List the main changes made in this PR. Be as specific as possible.

  • Feature added
  • Bug fixed
  • Code refactored
  • Tests changed
  • Documentation updated
  • Other changes: [describe]

Checklist

Before submitting this PR, please make sure that:

  • Your code passes linting and coding style checks.
  • Documentation has been updated to reflect the changes.
  • You have reviewed your own code and resolved any merge conflicts.
  • You have requested a review from at least one team member.
  • Any relevant issue(s) have been linked to this PR.

Additional Information

I have tested this code in bioloop-dev.