Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API for file system operations: directory manipulation #220

Open
aradi opened this issue Jul 16, 2020 · 39 comments
Open

API for file system operations: directory manipulation #220

aradi opened this issue Jul 16, 2020 · 39 comments
Labels
topic: IO Common input/output related features

Comments

@aradi
Copy link
Member

aradi commented Jul 16, 2020

So, following the suggestion in #201, let's start to discuss file system access API. In order to keep it focused, I'd suggest to start with the directory related operations first. My proposal:

open_directory(dirname, dirdesc, status): Opens a directory and returns a descriptor

type(dirdesc_t): Type containing the data of an opened directory

dirdesc_t%get_next(): Returns the next entry in the open directory or "" if
there are no more entries.

make_directory(dirname, parents, status): Makes a directory. If parents is .true.
also parent directories are created in case they do not exist yet.

remove_directory(dirname, content, status): Removes a directory. If content is
.true. also the directory content is removed (recursive delete).

change_directory(dirname, status): Changes to the given directory

get_working_dir() -> char(:), allocatable: Returns the current working directory.

is_dir(fname) -> logical: GIve .true. if fname is a directory.

A few notes:

  • The functionality above can be easily realized with a libc-interface (should work on all POSIX systems) and a bit of C-code. The big question is Windows, as I have no clue how to implement this functionality on WIndows.

  • As for error handling we could have a status argument. I'd argue for having a derived type type(status_t) containg the result of the operation (OK/Error) and a string with the error message in case of error. Whether the status argument should be optional or not is open for discussions.

@aradi aradi changed the title API for file system operations: 1 directory manipulation API for file system operations: directory manipulation Jul 16, 2020
@certik
Copy link
Member

certik commented Jul 16, 2020

We should survey Python, Rust, Julia, Matlab, etc. to see what API they use for this.

@arjenmarkus
Copy link
Member

arjenmarkus commented Jul 17, 2020 via email

@everythingfunctional
Copy link
Member

I think we should start with just the file path bit. See the Rust library, or the Haskell library.

Also, I don't particularly like the manual iterator design for open_directory. I'd much rather have a function get_contents(directory) -> type(file_path), allocatable :: paths(:). To me, open_directory and dirdesc%get_next() smacks of object-oriented assembly, in an attempt to satisfy primitive obsession.

@milancurcic
Copy link
Member

I'd much rather have a function get_contents(directory) -> type(file_path), allocatable :: paths(:)

👍 +1

Also, I see inconsistent naming, e.g. is_dir() and dirdesc_t, but change_directory and remove_directory. I like shorter names if they're common enough. In this case, I think they are. What do you think about:

  • change_directory -> chdir
  • make_directory -> mkdir
  • remove_directory -> rmdir
  • get_working_dir -> get_cwd or getcwd

Python API reference: https://docs.python.org/3/library/os.html

@milancurcic
Copy link
Member

I think we should start with just the file path bit. See the Rust library, or the Haskell library.

Should this be a separate thread? I think in this one we discuss only directory stuff, plus I think most directory operations here are independent of file path.

@aradi
Copy link
Member Author

aradi commented Jul 19, 2020

@everythingfunctional The interface here is basically a 1-1 mapping of the libc API, which also uses an iterator. (Works in Rust also similarly) But sure, if we have a decent file path type, we could have a convenience funtion on top of it which returns an allocatable array instead.

@milancurcic Although I'd be very happy to go with Pythons naming convention, it would not be consistent with Fortrans recent naming strategy (e.g. execute_command_line instead of exec or even command_argument_count). For sake of adoption, I'd probably prefer the style which is more Fortran like, even if I am not big fan of it. We could use _dir everywhere to make it more consistent and somehwat shorter, though: create_dir (or make_dir), remove_dir, change_dir, etc...

@everythingfunctional
Copy link
Member

I see. In that case the question becomes one of philosophy for stdlib. Does stdlib provide lots of low level stuff for people to build abstractions on top of, or does it aim to provide useful abstractions and ergonomic APIs?

@aradi
Copy link
Member Author

aradi commented Jul 23, 2020

@everythingfunctional I agree, it is a philosophical question. And given the traditional reluctance of Fortran to delve into the depths of the underlying operating system, a higher level solution would indeed be probably more appropriate.

If we return an array, we have to decide on the element type. I see 3 different options:

  • characters of some fixed length (e.g. max path length size on the system, or similar)
  • a string derived type (offering variable string sizes and maybe also some string capabilties)
  • a path derived type (offering path manipulation and path-query capabilties).

I am working on Option 3 at the moment, but as discussed in #224, we should decide, whether and if yes how many derived types we want to impose on the users of stdlib...

@certik
Copy link
Member

certik commented Jul 23, 2020

@everythingfunctional regarding the goal of stdlib, I consider it both:

  • provide reusable low level functionality that gets the job done and people can use it to build more high level interfaces, or just use it directly; ideally these are just raw functions / subroutines that do not use derived types
  • nicely designed high level interfaces (perhaps OO based) that use the low level blocks; this is designed to be ergonomic and can use derived types or other OO features

That way everybody gets what they want, both people who prefer more OO, as well as people who prefer more direct low level API.

@everythingfunctional
Copy link
Member

In that case, I think the original proposal is perfectly suitable as low-level building blocks. It wouldn't be difficult to build a more ergonomic API using them.

@MarDiehl
Copy link
Contributor

MarDiehl commented Jul 30, 2020

Since this is related to #201, I want to explain the rational why I prefer to rely on standard types and no type bound procedures for low level operations:

In Fortran, one cannot chain function calls (and array access), so an object oriented approach will result in something like:

type(t_path) :: filename
filename = getcwd()
call filename%join('my_file.ext)
print*, filename%islink()

while the functional/procedural way gives:

print*, islink(join(getcwd(),'my_file.ext'))

I acknowledge that this is not the most elaborate example, but in generally I think that a procedural approach is often more natural for Fortran.

@certik
Copy link
Member

certik commented Jul 30, 2020

@MarDiehl I am a big fan of simple functions / subroutines also over object oriented approach. In general, we can have both to satisfy everybody.

@aradi
Copy link
Member Author

aradi commented Jul 30, 2020

@MarDiehl Yes, your reasoning is very convincing and indeed, that approach fits definitely better to Fortran.

But giving the suggested directory API above, which part would you propose to be change? The only type bound procedure we have so far is get_next(this) so far. As this changes the internal state of the directory, it must be implemented as a subroutine, so we have the choice between

call get_next(mydir, nextentry)

and

call mydir%get_next(nextentry)

I don't know, whether there is any esthetic difference between the two....

If you refer to the type(path) suggestion, I am with you, that we should rather use intrinsic types (e.g. allocatable chars here) whenever possible.

@milancurcic , @MarDiehl As for being "fortranic": Then getcwd() should be rather get_working_dir() or similar, as especially recently introduced Fortran functions handling the environment (e.g. execute_command_line, command_argument_count tend to be rather verbose than abbreviated. I am not saying, that we should be that verbose, but I'd propose to introduce names which are more consistent with the naming philosophy of the already existing language.

@arjenmarkus
Copy link
Member

arjenmarkus commented Jul 30, 2020 via email

@MarDiehl
Copy link
Contributor

@aradi

what about an interface like the following:

function list_directory(path)

    character(len=*), intent(in)  :: path
    character(len=:), allocatable, dimension(:) :: list_directory
    ...

end function list_directory

this would simply return a array of strings containing the file names, an optional argument could specify the search order (creation time, name). This is similar go https://docs.python.org/2/library/os.html#os.listdir, again with the drawback that Fortran has no list type and we need to use an array.

@certik
Copy link
Member

certik commented Jul 30, 2020

I don't think that works in Fortran, I think you have to use a derived type to have an array of strings, like I did here:

https://github.com/fortran-lang/fpm/blob/fcb7f675a8203f0ab518b20e9e11ee6dd49c3186/fpm/src/fpm.f90#L78

@MarDiehl
Copy link
Contributor

I don't think that works in Fortran, I think you have to use a derived type to have an array of strings, like I did here:

https://github.com/fortran-lang/fpm/blob/fcb7f675a8203f0ab518b20e9e11ee6dd49c3186/fpm/src/fpm.f90#L78

It works. You just need to figure out the maximum length among all strings and the number of strings:

allocate(character(len=max_len)::list_directory(N_files))

Implementing such a function is of course a little bit cumbersome, but the advantage of stdlib is that it has to be done only once.

@certik
Copy link
Member

certik commented Jul 30, 2020

Ah I see. Ok, that can be a possibility. A bit wasteful, but simpler, and perhaps even more performing than the allocatable in allocatable approach.

@urbanjost
Copy link

I use an interface similar to what is being described, and looking over the list, three more that I find I use extensively are
one for getting the current scratch directory by assuming it is set by $TEMPDIR, $TMP, $TEMP or /tmp; but it would be nice to get what opening a Fortran SCRATCH file is using; something to set what that directory is; and when getting a directory listing as a standard array if you also want to return the length, which can increase processing speed versus trim() and also lets you handle outliers like trailing spaces in the name (an alternative is adding a null character to the name like C). I happen to use the prefix system_ for just about anything that requires an ISO_C_BINDING and is calling a "standard" library; but have no problem with the names above, except that some of the simpler ones like chdir, mkdir, ... are often common extensions so you have the potential problem of referring to the extensions if you forget the USE statement, and/or confusion over which one is being called for those familiar with the extention.

For reference my names are:

833 get_tmp (3m_io) - [M_io] Return the name of the scratch directory(LICENSE:PD)
835 system_dir (3m_io) - [M_io] return filenames in a directory matching specified wildcard string(LICENSE:PD)
1018 system_getcwd (3m_system) - [M_system] call getcwd(3c) to get the pathname of the current working directory(LICENSE:PD)

1016 system_chdir (3m_system) - [M_system] call chdir(3c) from Fortran to change working directory(LICENSE:PD)
1019 system_mkdir (3m_system) - [M_system] call mkdir(3c) to create a new directory(LICENSE:PD)
1024 system_rewinddir (3m_system) - [M_system] call rewinddir(3c) to rewind directory stream(LICENSE:PD)
1031 system_isdir (3m_system) - [M_system] checks if argument is a directory path(LICENSE:PD)
1036 system_closedir (3m_system) - [M_system] close a directory stream by calling closedir(3c)(LICENSE:PD)
1043 system_opendir (3m_system) - [M_system] open directory stream by calling opendir(3c)(LICENSE:PD)
1047 system_readdir (3m_system) - [M_system] read a directory using readdir(3c)(LICENSE:PD)

@urbanjost
Copy link

PS: setting and querying where the scratch files go is not an issue for most programs, but can be a big issue due to performance and space issues when huge scratch files are requred in HPC appiications. I but some notes about that and some of the major portability problems concering scratch files on Blevin's Fortran WIKI page a long time ago.

@aradi
Copy link
Member Author

aradi commented Jul 31, 2020

@MarDiehl Yes, the allocated character array is as "fortranic" as only possible. 😄 Three remarks:

  • It is not that cumbersome to implement. The algorithm would have first to loop over all entries of a directory anyway (to find out how many there are) and then once more to put their names in the allocated array. In he first loop one can then also easily determine the max file name length.

  • Consumers would have to trim() the entries to find out the true file names. As long as the file names are just passed to intrinsic statements, like open, this is OK, as open automatically trims the passed character variable, but it may be still error prone and annoying if the user tries to manipulate the file name (e.g. checking whether it ends with a given suffix, or when an additional suffix should be added to the file name).

  • Should other file manipulation commands automatically trim character variables containing file names? It would be quite convenient (one would spare a lot of trim()s and 100% compatible with current Fortran. However it also would inherit consequence, that it is impossible to treat file names ending on whitespaces, or to be more precise, impossible to distinguish between file names differing only in the nr. of whitespaces at the end of their file names. (But, since we can not fix that problem in the existing open statement anyway, probably the best thing is to have a "consistently broken" implementation of the other features in stdlib anyway...)

@MarDiehl
Copy link
Contributor

@aradi:

  1. True, it is actually not that bad. I do the same in the split function.
  2. Yes, further manipulation would always require a trim. At least for all the os/os_path functions I would do that for all intent(in) strings
  3. This is a important point I totally missed. The hack used for URL encoding (replace by + or %20) is quite ugly. I agree with you that we should simply document this behavior since we cannot change the behavior of trim/open anyways. Better we have a simple and easy to user function that works for 99% of the users than something complicated that handles corner cases but is hard to use for the vast majority. Spaces on the shell are anyways a nightmare and trailing spaces are even difficult for graphical file explorers.

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 7, 2020 via email

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 7, 2020 via email

@MarDiehl
Copy link
Contributor

MarDiehl commented Sep 7, 2020

@arjenmarkus sorry for the late response, I've been busy with other things.

  1. error handling: I have a branch with an optional stat argument. If that is given, it's return value can be used for checking the success of a function. Otherwise, error stop is called if something goes wrong. This behavior is similar to the behavior of standard language featurs.
  2. windows path handling: We can make set the parameter defining the path separator depending on the OS. That would leave the question if there should be a standardize function (i.e. all user input with / would be translated to \ on Windows). In python, there are actually two implementation to handle these subtle differences (posixpath/ntpath)
  3. Is there a difference between "real windows" and MingGW regarding the expected behavior on Windows?
  4. I will continue to work on the library on the weekend. Should we exchange ideas via skype or zoom then? I think that would be easier.

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 7, 2020 via email

@MarDiehl
Copy link
Contributor

MarDiehl commented Sep 8, 2020

ok, then it's best if you contact me via email: [email protected] if you have time.

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 8, 2020 via email

@MuellerSeb
Copy link
Contributor

I am really happy about the implementations by @MarDiehl ! You should come up with a Draft PR to discuss details there!

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 7, 2020 via email

@MarDiehl
Copy link
Contributor

MarDiehl commented Oct 7, 2020

@arjenmarkus Sorry for the late reply, I started a new appointment and simply ignored all emails that were not directly related to my old and new work. I'm now back at this

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 7, 2020 via email

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 7, 2020 via email

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 11, 2020 via email

@LKedward
Copy link
Member

LKedward commented Oct 11, 2020

Hi @arjenmarkus , I think the above code will work if you use inquire after opening the file:

subroutine local_getsize_routine( path, sz, ierr )
    character(len=*), intent(in) :: path
    integer, intent(out)         :: sz
    integer, intent(out)         :: ierr
    integer :: fh

    open(newunit=fh, file = path, iostat = ierr )
    if (ierr /=0 ) return
    inquire( unit=fh, size = sz, iostat = ierr )
    close(fh)
end subroutine local_getsize_routine

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 11, 2020 via email

@MarDiehl
Copy link
Contributor

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 11, 2020 via email

@urbanjost
Copy link

Have an old file talking about six methods to do this depending on the vintage of the compiler. Was going to send it but I think most methods were mentioned (INQUIRE, ISO_C_BINDING, output of external command, open in append mode as a stream and get position-1, .... I have notes in there about a lot of compilers or systems not supporting multiple simultaneous opens on the file so I think your test should already have the file open to test for that; although the notes look like most of the notes are from 2008.

@awvwgk awvwgk added the topic: IO Common input/output related features label Sep 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: IO Common input/output related features
Projects
None yet
Development

No branches or pull requests

10 participants