Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for globbing in podio::CreateDataFrame, podio::makeReader and podio.reading.get_reader #686

Open
Zehvogel opened this issue Sep 23, 2024 · 3 comments

Comments

@Zehvogel
Copy link
Contributor

tl; dr:

import podio

# breaks
# reader = podio.reading.get_reader("some/path/*.edm4hep.root")

# works
reader = podio.root_io.Reader("some/path/*.edm4hep.root")

The ROOTReader was (and technically still is) able to read from multiple files at once using globbing as the names are passed to TChain directly

podio/src/ROOTReader.cc

Lines 240 to 251 in 46b02cf

void ROOTReader::openFiles(const std::vector<std::string>& filenames) {
m_metaChain = std::make_unique<TChain>(root_utils::metaTreeName);
// NOTE: We simply assume that the meta data doesn't change throughout the
// chain! This essentially boils down to the assumption that all files that
// are read this way were written with the same settings.
// Reading all files is done to check that all file exists
for (const auto& filename : filenames) {
if (!m_metaChain->Add(filename.c_str(), -1)) {
throw std::runtime_error("File " + filename + " couldn't be found or the \"" + root_utils::metaTreeName +
"\" tree couldn't be read.");
}
}

Using the new Reader interface this does not work any more as every path is passed to TFile first

podio/src/Reader.cc

Lines 38 to 44 in 46b02cf

if (suffix == "root") {
TFile* file = TFile::Open(filenames[0].c_str());
bool hasRNTuple = false;
if (!file) {
throw std::runtime_error("Could not open file: " + filenames[0]);
}

This is particularly annoying when using the podio DataSource as the normal RDataFrame behaviour is also to pass everything to TChain. I know this was probably never explicitly supported but it would be great to get this back :)

@tmadlener
Copy link
Collaborator

Do you know when this last worked? Because, no recent change comes to my mind that would alter this behavior. I don't think podio.reading.get_reader every supported globbing, because that effectively opens whatever is passed as a filename directly via TFile.Open (which is also where I suppose this is breaking):

def _determine_root_format(filename):
"""Peek into the root file to determine which flavor we have at hand."""
file = TFile.Open(filename)


It probably is slightly confusing, but the c++ "interface" podio::Reader is completely distinct from anything that happens in python. The podio.root_io.Reader is the podio::ROOTReader

class Reader(BaseReaderMixin):
"""Reader class for reading podio root files."""
def __init__(self, filenames):
"""Create a reader that reads from the passed file(s).
Args:
filenames (str or list[str]): file(s) to open and read data from
"""
if isinstance(filenames, str):
filenames = (filenames,)
self._reader = podio.ROOTReader()
self._reader.openFiles(filenames)
super().__init__()

@Zehvogel
Copy link
Contributor Author

Sorry maybe my description was too cryptic.

It probably is slightly confusing, but the c++ "interface" podio::Reader is completely distinct from anything that happens in python.

I did not know that but it also does not matter. In the end both the podio.reading.get_readerand podio::Reader::makeReader do not support globbing and even for the same reason (using TFile). I just choose python for an easier example. The podio.root_io.Reader aka. podio::ROOTReader always did support globbing and still does.

However, only podio::ROOTReader supporting globbing does not help me if I want to use the DataSource though.

@tmadlener
Copy link
Collaborator

Right, I see where you are coming from. I am nevertheless removing the Regression from the issue title, since neither get_reader nor makeReader, nor the DataSource ever supported globbing, at least not in the version in which they finally landed in podio.

The problem with supporting globbing from the c++ side is, that there we would have to effectively implement our own wrapper around glob for that. For the python bindings we could probably hide a glob.glob in a thin wrapper somewhere.

@tmadlener tmadlener changed the title Regression: No longer able to read from multiple root files at once with globbing Support for globbing in CreateDataFrame Sep 23, 2024
@tmadlener tmadlener changed the title Support for globbing in CreateDataFrame Support for globbing in podio::CreateDataFrame, podio::makeReader and podio.reading.get_reader Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants