
Rename and ingest subdirs

ryanlong requested to merge rename_and_ingest_subdirs into main

Created by: ri-pandey

Rename and ingest subdirectories

Description

Given a directory, this script renames every subdirectory inside it according to a deterministic pattern and then ingests the renamed subdirectories.

Changes Made

List the main changes made in this PR. Be as specific as possible.

  • Feature added
  • Bug fixed
  • Code refactored
  • Tests changed
  • Documentation updated
  • Other changes: [describe]

Checklist

Before submitting this PR, please make sure that:

  • Your code passes linting and coding style checks.
  • Documentation has been updated to reflect the changes.
  • You have reviewed your own code and resolved any merge conflicts.
  • You have requested a review from at least one team member.
  • Any relevant issue(s) have been linked to this PR.

Additional Information

Rename and Register On-Demand Script

Purpose

  • Processes subdirectories within a given directory
  • Renames them according to a specific format
  • Registers them as data products

Process

  1. Iterates through subdirectories in the specified path
  2. Renames each subdirectory to: {PROJECT_NAME}-{DIR_NAME}-{SUBDIRECTORY_NAME}
  3. Copies renamed directories to a new 'renamed_directories' folder
  4. Registers each new directory as a data product
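
The rename-and-copy steps above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names `build_new_name` and `copy_renamed_subdirs` are hypothetical, the `renamed_directories` folder is assumed to live inside `DIR_PATH`, and registration is left out because it depends on `register_ondemand`.

```python
import shutil
from pathlib import Path


def build_new_name(project_name: str, dir_path: Path, subdir: Path) -> str:
    """Apply the {PROJECT_NAME}-{DIR_NAME}-{SUBDIRECTORY_NAME} pattern."""
    return f"{project_name}-{dir_path.name}-{subdir.name}"


def copy_renamed_subdirs(dir_path: Path, project_name: str, dry_run: bool = True) -> list[Path]:
    """Copy each subdirectory of dir_path under its new name.

    Assumption: copies go into dir_path / "renamed_directories".
    With dry_run=True nothing is written; the planned paths are only returned.
    """
    target_root = dir_path / "renamed_directories"
    planned = []
    # sorted() snapshots the listing before any copy creates target_root
    for subdir in sorted(p for p in dir_path.iterdir() if p.is_dir()):
        if subdir.name == target_root.name:
            continue  # never process the output folder itself
        new_path = target_root / build_new_name(project_name, dir_path, subdir)
        planned.append(new_path)
        if not dry_run:
            # copytree creates missing parent directories and fails if new_path exists
            shutil.copytree(subdir, new_path)
    return planned
```

Calling `copy_renamed_subdirs(Path("/path/to/dir"), "my_project")` with the default `dry_run=True` only reports the paths that would be created, which mirrors the dry-run behavior described for the script.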

Key Features

  • Dry run option for simulating the process without making changes
  • Checks for existing registered data products to avoid duplicates
  • If a subdirectory is already registered, compares the contents of the existing renamed directory with the original. If the contents differ, logs a warning that the original subdirectory has been modified since its corresponding renamed directory was registered. In this case, the renamed directory is not registered again.
  • Deletes renamed directories if registration fails
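
One way to implement the content comparison described above is to hash both directory trees and compare the digests. The sketch below is an assumption about how this could be done, not the comparison the script actually performs; `dir_digest` and `same_contents` are hypothetical names, and empty directories are ignored.

```python
import hashlib
from pathlib import Path


def dir_digest(root: Path) -> str:
    """Hash the relative path and contents of every file under root."""
    digest = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to keep memory use bounded for large files
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()


def same_contents(original: Path, renamed: Path) -> bool:
    """Return True when both trees hold the same files with the same bytes."""
    return dir_digest(original) == dir_digest(renamed)
```

When `same_contents` returns False for an already registered subdirectory, the behavior described above would be to log a warning and skip re-registration.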

Usage

python -m workers.scripts.rename_and_register_ondemand [OPTIONS] DIR_PATH PROJECT_NAME

Arguments

  • DIR_PATH: Path to the directory containing subdirectories to process
  • PROJECT_NAME: Name of the project to use in the new directory names

Options

  • --dry-run: Simulates the process without making changes (default: True)

Note

  • The renamed_directories folder is deleted after all subdirectories are successfully processed
  • The 'register_ondemand.py' script has been modified so that it also takes the dataset's path as an argument. This is necessary because I use register_ondemand to register these subdirectories, and the previous version of the script did not allow specifying the location of the directory to be ingested.

Create Dummy Dataset Script

Purpose

  • Creates a directory with dummy files for testing and development
  • Generates subdirectories, each containing exactly 1GB of random data by default.

Process

  1. Creates the main directory if it doesn't exist
  2. Iterates through the specified number of subdirectories
  3. For each subdirectory, creates random files until the total size reaches the specified size (1GB by default).
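
The fill loop in step 3 could look roughly like the sketch below. It is illustrative only: `fill_subdir` is a hypothetical name, sizes are passed in bytes so the function is easy to exercise with small values, 1MB is assumed to mean 1024 * 1024 bytes, and the last file is shrunk so the total lands exactly on the target.

```python
import os
import random
import uuid
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes


def fill_subdir(subdir: Path, target_bytes: int,
                min_size: int = 1 * MB, max_size: int = 100 * MB) -> int:
    """Write randomly named files of random size until target_bytes is reached."""
    subdir.mkdir(parents=True, exist_ok=True)
    written = 0
    while written < target_bytes:
        # Cap the final file so the directory total is exactly target_bytes
        size = min(random.randint(min_size, max_size), target_bytes - written)
        file_path = subdir / f"{uuid.uuid4().hex}.bin"
        with file_path.open("wb") as handle:
            remaining = size
            while remaining > 0:
                chunk = min(remaining, MB)  # write at most 1 MiB at a time
                handle.write(os.urandom(chunk))
                remaining -= chunk
        written += size
    return written
```

Because the final file is capped, it may be smaller than `min_size`; that is the trade-off for hitting the target size exactly.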

Key Features

  • Customizable number of subdirectories
  • Each subdirectory contains exactly 1GB of data by default
  • Random file sizes between 1MB and 100MB
  • Random file names

Usage

python -m workers.scripts.create_dummy_dataset [OPTIONS] DIR_PATH

Arguments

  • DIR_PATH: Path where the dummy directory should be created

Options

  • --subdirs: Number of subdirectories to create (default: 3)
  • --size_gb: Size of each subdirectory in GB (default: 1)

Example

python -m workers.scripts.create_dummy_dataset /path/to/dummy_directory --subdirs=5 --size_gb=2
