Rename and ingest subdirs
Created by: ri-pandey
Rename and ingest subdirectories
Description
This script, given a directory, renames all subdirectories within the directory according to a deterministic pattern, and ingests the renamed subdirectories.
Changes Made
List the main changes made in this PR. Be as specific as possible.
- Feature added
- Bug fixed
- Code refactored
- Tests changed
- Documentation updated
- Other changes: [describe]
Checklist
Before submitting this PR, please make sure that:
- Your code passes linting and coding style checks.
- Documentation has been updated to reflect the changes.
- You have reviewed your own code and resolved any merge conflicts.
- You have requested a review from at least one team member.
- Any relevant issue(s) have been linked to this PR.
Additional Information
Rename and Register On-Demand Script
Purpose
- Processes subdirectories within a given directory
- Renames them according to a specific format
- Registers them as data products
Process
- Iterates through subdirectories in the specified path
- Renames each subdirectory to:
{PROJECT_NAME}-{DIR_NAME}-{SUBDIRECTORY_NAME}
- Copies renamed directories to a new 'renamed_directories' folder
- Registers each new directory as a data product
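The naming step above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names are hypothetical, DIR_NAME is assumed to be the final component of DIR_PATH, and the 'renamed_directories' folder is assumed to live inside DIR_PATH (the description does not say where it is created).

```python
from pathlib import Path


def renamed_subdirectory_name(project_name: str, dir_path: Path, subdir: Path) -> str:
    """Build the new name as {PROJECT_NAME}-{DIR_NAME}-{SUBDIRECTORY_NAME}."""
    # Assumption: DIR_NAME is the last path component of DIR_PATH.
    return f"{project_name}-{dir_path.name}-{subdir.name}"


def plan_renames(dir_path: Path, project_name: str) -> dict[Path, Path]:
    """Map each subdirectory to its destination under 'renamed_directories'."""
    # Assumption: the target folder sits inside DIR_PATH.
    target_root = dir_path / "renamed_directories"
    plan = {}
    for subdir in sorted(p for p in dir_path.iterdir() if p.is_dir()):
        if subdir == target_root:
            continue  # never treat the output folder as an input
        new_name = renamed_subdirectory_name(project_name, dir_path, subdir)
        plan[subdir] = target_root / new_name
    return plan
```

Computing the plan without touching the filesystem is also one way a dry run could report what would happen before anything is copied or registered.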
Key Features
- Dry run option for simulating the process without making changes
- Checks for existing registered data products to avoid duplicates
- If a subdirectory is already registered, compares the contents of the existing renamed directory with the original. If the contents differ, logs a warning indicating that the original subdirectory has been modified since its corresponding renamed directory was registered. In this case, the renamed directory is not registered again.
- Deletes renamed directories if registration fails
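One way to implement the content comparison described above is to hash every file in both trees and compare the results. The sketch below is an assumption about the approach, not the script's actual logic, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def contents_match(original: Path, renamed: Path) -> bool:
    """True when both trees hold the same files with the same contents."""
    return directory_digest(original) == directory_digest(renamed)
```

Comparing relative paths together with content hashes catches added, removed, and modified files. Empty subdirectories are ignored in this sketch.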
Usage
python -m workers.scripts.rename_and_register_ondemand [OPTIONS] DIR_PATH PROJECT_NAME
Arguments
- DIR_PATH: Path to the directory containing subdirectories to process
- PROJECT_NAME: Name of the project to use in the new directory names
Options
- --dry-run: Simulates the process without making changes (default: True)
Note
- The renamed_directories folder is deleted after all subdirectories are successfully processed
- The 'register_ondemand.py' script has been modified, so that it now takes the dataset's path as an argument as well. This is necessary because I am using the register_ondemand script to register these subdirectories, and the current version of the script does not allow specifying the location of the directory that is to be ingested.
Create Dummy Dataset Script
Purpose
- Creates a directory with dummy files for testing and development
- Generates subdirectories, each containing exactly 1GB of random data by default.
Process
- Creates the main directory if it doesn't exist
- Iterates through the specified number of subdirectories
- For each subdirectory, creates random files until the total size reaches 1GB, or the specified size.
Key Features
- Customizable number of subdirectories
- Each subdirectory contains exactly 1GB of data by default
- Random file sizes between 1MB and 100MB
- Random file names
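The fill loop described above might look roughly like this. It is a sketch under stated assumptions rather than the script's real code: the function name and the .bin extension are hypothetical, and the last file is clipped so the total hits the target exactly, which means that final file can be smaller than the minimum size. How the actual script reconciles "exactly 1GB" with the 1MB lower bound is not stated here.

```python
import os
import random
import uuid
from pathlib import Path

MB = 1024 * 1024


def fill_subdirectory(
    subdir: Path,
    target_bytes: int,
    min_file_bytes: int = 1 * MB,
    max_file_bytes: int = 100 * MB,
) -> int:
    """Write randomly named files of random data until target_bytes is reached."""
    subdir.mkdir(parents=True, exist_ok=True)
    written = 0
    while written < target_bytes:
        # Assumption: clip the last file so the total is exactly target_bytes.
        size = min(random.randint(min_file_bytes, max_file_bytes), target_bytes - written)
        # For simplicity the whole file is generated in memory before writing.
        (subdir / f"{uuid.uuid4().hex}.bin").write_bytes(os.urandom(size))
        written += size
    return written
```

For a quick local check, the size parameters can be scaled down, for example `fill_subdirectory(Path("demo"), 10_000, 1_000, 3_000)`.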
Usage
python -m workers.scripts.create_dummy_dataset [OPTIONS] DIR_PATH
Arguments
- DIR_PATH: Path where the dummy directory should be created
Options
- --subdirs: Number of subdirectories to create (default: 3)
- --size_gb: Size of each subdirectory in GB (default: 1)
Example
python -m workers.scripts.create_dummy_dataset /path/to/dummy_directory --subdirs=5 --size_gb=2