Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent usage of paths in the experimental loggers #20521

Open
expnn opened this issue Dec 26, 2024 · 0 comments
Open

Inconsistent usage of paths in the experimental loggers #20521

expnn opened this issue Dec 26, 2024 · 0 comments
Labels
needs triage Waiting to be triaged by maintainers refactor

Comments

@expnn
Copy link

expnn commented Dec 26, 2024

Outline & Motivation

The usage of log_dir and save_dir are confusion.

Imaging we are implement a new logger, let's see how we can define the save_dir property.

class CustomLogger(Logger):
    def __init__(self, all_experiment_path):
        self._root_dir = all_experiment_path

    @property
    @override
    def save_dir(self) -> str:
        option1 = os.path.join(self._root_dir, self.name, str(self.version))
        # option2 = self._root_dir
        # option3 = os.path.join(self._root_dir, self.name)
        return option1  # or option2 or option3

The first option includes the experiment name and version as subdirectories, and the second one does not.

As CustomLogger is not a subclass of TensorBoardLogger or CSVLogger. the Trainer.log_dir is defined to be logger.save_dir if the first logger is a CustomLogger, according to here.
The config.yaml file will be saved to logger.save_dir/config.yaml. The checkpoint file will be saved to logger.save_dir/logger.name/logger.version, which
expand to be logger._root_dir/logger.name/logger.version/logger.name/logger.version, which is with duplicate directory hierarchy, according to __resolve_ckpt_dir() function.

I DO NOT like it. The following tree view shows this clearly.

<all_experiment_root>
└── experiment_name
    ├── version_0
    │   ├── config.yaml
    │   └── experiment_name
    │        └── version_0
    │             └── checkpoints
    │                  └── epoch=7-step=3000.ckpt
    └──  version_1
        ├── config.yaml
        └── experiment_name
              └── version_1
                   └── checkpoints
                        └── epoch=28-step=10875.ckpt

However, there is no way to remove the duplicate hierarchy for now. If the second or third option is adopted, the config.yaml file will be saved in a directory irrelevant to the experiment version, and will be overwrite by next experiment. This is not acceptable.

Pitch

I'm expecting the result hierarchy like this:

<all_experiment_root>
└── experiment_name
    ├── version_0
    │   ├── config.yaml
    │   └── checkpoints
    │         └── epoch=7-step=3000.ckpt
    └── version_1
        ├── config.yaml
        └── checkpoints
        └── epoch=28-step=10875.ckpt

I tried to fix this problem and was willingly to submit a PR. Fixing this issue seems more complicated than imagined.

Additional context

Because I'm new to Lightning, I'm not sure I fully understand Logger's design. From the current code, several attributes such as root_dir, log_dir, and save_dir are not clearly defined. For example, in the defination of lightning.fabric.loggers.tensorboard.TensorBoardLogger class, the experiment name is not included in the root_dir path, but for lightning.pytorch.loggers.tensorboard.TensorBoardLogger, it is included.

The root_dir and log_dir properties can be None introduces unnecessary complexities. The documentation of log_dir says:

Return the root directory where all versions of an experiment get saved, or `None` if the logger does not save data locally.

If I understand it correctly, the property root_dir and log_dir should never be None for all concrete loggers. This is because the config.yaml should always be saved to somewhere like root_dir/name/version. Even if a logger does not save config.yaml locally, we can still initialize root_dir to the current directory ('.'), but not really use it.

I have two questions:

  • Why we need the save_dir property? Why always setting root_dir to be <root>/name and log_dir to be <root>/name/version not sufficient?

  • What makes TensorBoardLogger and CSVLogger special that result in the Trainer.log_dir to be logger.log_dir instead of logger.save_dir?

cc @justusschock @awaelchli

@expnn expnn added needs triage Waiting to be triaged by maintainers refactor labels Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage Waiting to be triaged by maintainers refactor
Projects
None yet
Development

No branches or pull requests

1 participant