Improve Watch Script and Dataset Creation
Created by: deepakduggirala
Description
The codebase now allows dataset creation from multiple points, which can cause inconsistencies between the watch script's local dataset state and the actual database state. Additionally, checking for dataset existence locally before creation introduces race conditions under concurrent requests.
Another issue with maintaining local state is that if a user deletes a dataset from the UI that was originally ingested by the watch script, it will not be reprocessed until the script restarts and fetches the latest state.
API Changes
- POST /datasets
  - Returns 201 Created with the dataset object on success.
  - Returns 409 Conflict if a dataset with the same name and type already exists (is_deleted=false).
- POST /datasets/bulk
  - Processes multiple datasets at once.
  - Returns a response object: { "created": [...], "conflicted": [...], "errored": [...] }
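The status codes and the bulk response shape described above can be illustrated with a small in-memory model. This is only a sketch, not the API's actual implementation: the class name InMemoryDatasetStore, the 409 message body, the contents of each bulk list, and the RAW_DATA type value in the usage note are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryDatasetStore:
    """Toy stand-in for the datasets table (hypothetical, not the real API code)."""

    rows: list = field(default_factory=list)

    def create(self, name, type_):
        """Mimic POST /datasets and return a (status_code, body) pair."""
        for row in self.rows:
            same_key = row["name"] == name and row["type"] == type_
            if same_key and not row["is_deleted"]:
                # A live dataset with the same name and type already exists.
                return 409, {"message": "dataset already exists"}
        row = {"name": name, "type": type_, "is_deleted": False}
        self.rows.append(row)
        return 201, row

    def bulk_create(self, datasets):
        """Mimic POST /datasets/bulk: place every input in exactly one list."""
        result = {"created": [], "conflicted": [], "errored": []}
        for item in datasets:
            try:
                status, body = self.create(item["name"], item["type"])
            except (KeyError, TypeError):
                # Assumption: malformed inputs are reported, not raised.
                result["errored"].append(item)
                continue
            if status == 201:
                result["created"].append(body)
            else:
                result["conflicted"].append(item)
        return result
```

With this model, creating ("run1", "RAW_DATA") twice yields 201 and then 409, and creating it again after the first row is marked is_deleted=True yields 201. The real endpoint would need to enforce this at the database level to avoid the race condition mentioned in the description.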
Watch Script Changes
- Removed local tracking of completed datasets. Now relies solely on POST /datasets/bulk.
- Introduced a Full Scan:
  - Periodically calls POST /datasets/bulk with all directory names in the watch directory.
  - Ensures reprocessing of datasets that were deleted after ingestion.
  - Can be disabled by setting config registration.full_scan_every_n_scans to None or at the observer level.
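The full-scan cadence can be sketched as follows. This is a minimal illustration, not the watch script's actual code: the function names, the scan_count counter (assumed to start at 1), and the new_dirs argument are hypothetical. Only the config key full_scan_every_n_scans and its None setting come from this PR, and the value is assumed to be either None or a positive integer.

```python
from pathlib import Path


def should_full_scan(scan_count, full_scan_every_n_scans):
    """Return True when this scan should be a full scan.

    None disables full scans; otherwise every n-th scan is a full scan.
    """
    if full_scan_every_n_scans is None:
        return False
    return scan_count % full_scan_every_n_scans == 0


def candidates_for_scan(watch_dir, new_dirs, scan_count, full_scan_every_n_scans):
    """Pick the directory names to send to POST /datasets/bulk for this scan."""
    if should_full_scan(scan_count, full_scan_every_n_scans):
        # Full scan: submit every directory name in the watch directory,
        # so datasets deleted after ingestion get registered again.
        return sorted(p.name for p in Path(watch_dir).iterdir() if p.is_dir())
    # Regular scan: submit only directories seen as new since the last scan.
    return sorted(new_dirs)
```

Because the server answers with conflicted for names that already exist, resubmitting every directory name is safe; only datasets missing from the database end up in created.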
Changes Made
List the main changes made in this PR. Be as specific as possible.
- Feature added
- Bug fixed
- Code refactored
- Tests changed
- Documentation updated
- Other changes: [describe]
Checklist
Before submitting this PR, please make sure that:
- Your code passes linting and coding style checks.
- Documentation has been updated to reflect the changes.
- You have reviewed your own code and resolved any merge conflicts.
- You have requested a review from at least one team member.
- Any relevant issue(s) have been linked to this PR.
Additional Information
I have tested this code in bioloop-dev.