Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for high level I/O #14

Open
milancurcic opened this issue Dec 18, 2019 · 37 comments
Open

Proposal for high level I/O #14

milancurcic opened this issue Dec 18, 2019 · 37 comments
Labels
topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...

Comments

@milancurcic
Copy link
Member

Branched from #1 (comment)


One personal challenge I have with stock Fortran are its somewhat awkward and low-level I/O facilities -- open, read, write, inquire, rewind, and close. I often wished for a higher-level interface, like what you get with Python's open() -- you open a file with a function, get a file-like instance with methods that let you do stuff with it.

This would do away with unit numbers, which I don't think application developers should have to deal with. It could also be a solution to the problem that allocatable character strings must be pre-allocated before use on read statement.

Is there anything similar out there for Fortran? Would this be of interest to people here? I'd use it.

@jvdp1 wrote: I would use it too.

@cmacmackin wrote: I'd personally like something along these lines. However, the problem is in defining methods on the file-object; these would need to know the number and type of arguments at compile-time. It would be impractical to produce methods with every conceivable permutation of object types. It would also require variadic functions, which are not available. As such, this can not be implemented well in Fortran, although perhaps something would be possible if we were to wrap some C-routines and pass in deferred-type objects.

@milancurcic
Copy link
Member Author

However, the problem is in defining methods on the file-object; these would need to know the number and type of arguments at compile-time. It would be impractical to produce methods with every conceivable permutation of object types. It would also require variadic functions, which are not available.

@cmacmackin Can you explain why the number and type of arguments aren't known at compile time? How many can there possibly be? I think you'd need just clever formatting of numeric types to characters, but I may be missing the point.

@zbeekman
Copy link
Member

I think he means that with standard write you can pass an arbitrary number of variables of different types. That makes it hard for a file%write() method to be generic because how do you create a sufficiently generic interface? I may be miss-interpreting his comment, but this is my current understanding.

@milancurcic
Copy link
Member Author

I see, that didn't even cross my mind. Wouldn't for most purposes one argument (character string for text, numeric or strings for binary) suffice?

The only time I put multiple variables on write or read statements is for list-directed I/O. When I do it, the language-provided write or read are simple enough.

If we had nice on-the-fly formatting like Python does, then this becomes trivial I think.

@certik
Copy link
Member

certik commented Dec 18, 2019

@milancurcic can you elaborate a bit why the unit numbers are insufficient? The recommended approach with newunit from Fortran 2008 seems very simple: https://www.fortran90.org/src/best-practices.html#file-input-output. The unit number integer is your file like instance. That's how I always viewed it. The summarize the current approach:

To read from a file:

integer :: u
open(newunit=u, file="log.txt", status="old")
read(u, *) a, b
close(u)

Write to a file:

integer :: u
open(newunit=u, file="log.txt", status="replace")
write(u, *) a, b
close(u)

To append to an existing file:

integer :: u
open(newunit=u, file="log.txt", position="append", status="old")
write(u, *) N, V(N)
close(u)

The only annoying thing to me is that you have to put status="old" when you want to read from a file, you cannot just do open(newunit=u, file="log.txt") because the default status is undefined or something like that (I forgot the exact reason).

@milancurcic
Copy link
Member Author

Yes, unit numbers work. They were tedious before newunit, now they're OK. I think of them as file handles. Now, I can at best say they're not as elegant as having a file-like instance that carries the unit number abstracted away from the user. So, this is perhaps more of a style preference. I think managing unit numbers for different files (say, if you have multiple files open) may be more error-prone and harder to develop than if using file-like instances, but this is kinda hand-wavy.

With a high-level interface, your examples would be something like this. Read from a file:

type(file_type) :: f
f = file_type('log.txt', 'r')
a = f % read()
f % close()

Write to a file:

type(file_type) :: f
f = file_type('log.txt', 'w')
call f % write(a)
f % close()

Append to an existing file:

type(file_type) :: f
f = file_type('log.txt', 'r+')
call f % write(a)
f % close()

The pros as I see them:

  • A bit more elegant style;
  • Don't need to carry unit numbers, only file instances;
  • Pretty abstraction of status, position, and action attributes;
  • close could be baked into file_type's destructor so that the file is automatically closed when f goes out of scope;

Cons:

  • Would need elegant, on-the-fly formatting system between strings and numeric values;
  • Trivial implementation can write one variable per method call
  • Others?

@certik
Copy link
Member

certik commented Dec 18, 2019

I talked with @zjibben about this and he pointed out another advantage of your approach: automatic closing of files when the instance goes out of scope. Also I would suggest that r is the default as in Python. So your first example can become just:

type(file_type) :: f
f = file_type('log.txt')
a = f % read()

Also perhaps the type could be called File, or File_t, we should discuss this in #3.

One issue is that you need to be able to read different types and I don't know if you can override read based on the result type. So it might need to be called as call f%read(a).

@certik
Copy link
Member

certik commented Dec 18, 2019

Another issue: how do you distinguish between:

read(u,*) a, b

and

read(u,*) a
read(u,*) b

The first reads two numbers one a line, the second reads a number from two lines --- or are the two equivalent?

@milancurcic
Copy link
Member Author

I like just File.

In the case of reading and writing formatted data, the trivial implementation doesn't operate on multiple variables. It reads or writes a string, for example:

character(len=:), allocatable :: a
type(File) :: f
f = File('log.txt')
f % read(a) ! reads whole file into string a
f % seek(0) ! rewind
f % readline(a) ! reads a single line into string a

To write:

f = File('log.txt', 'w')
f % write('pi = %.3f' .fmt. 3.141)

similar to Python. The last line would write 'pi = 3.141' to log.txt. This is an example of on-the-fly formatting that I mentioned earlier, which I think is needed if you want to elegantly read and write multiple variables. .fmt. here is our implementation of %.

C-style formatting is something I'd very much like in Fortran. I think it may help some newcomers to the language as well because this kind of formatting is more common in other languages. But that's for another proposal. :)

@cmacmackin
Copy link

I think he means that with standard write you can pass an arbitrary number of variables of different types. That makes it hard for a file%write() method to be generic because how do you create a sufficiently generic interface? I may be miss-interpreting his comment, but this is my current understanding.

Yes, that's what I was trying to communicate, albeit not very well.

@cmacmackin
Copy link

C-style formatting is something I'd very much like in Fortran. I think it may help some newcomers to the language as well because this kind of formatting is more common in other languages. But that's for another proposal. :)

This actually wouldn't be too difficult to implement in a standard library. We'd just write a series of wrappers in C, taking different numbers of void*t arguments. We'd then use interoperability to call these from Fortran and wrap them in a generic block. We could have versions accepting between, say, 1 and 30 arguments (tedious, but could be automatically generated), which should be enough for anyone.

@ivan-pi
Copy link
Member

ivan-pi commented Dec 19, 2019

I like the idea. As mentioned by @cmacmackin this would be easier to implement if Fortran had support for variadic functions (in fact I mentioned exactly this use case in my comment at j3-fortran/fortran_proposals#76 (comment)).

The idea of wrapping a bunch of C subroutines in a generic block seems like a good idea. I am only worried about a combinatorial explosion. Say I want to read 3 integers, 4 reals, and a character array, etc., from a single line, wouldn't we then need to generate all possible combinations of all intrinsic types? Or would you somehow convert everything internally to a C pointer and use a format specifier? (Hopefully, my idea makes sense.)

type(File) :: f
integer :: a, b, c
real :: d, e, ff, g, h
character(len=1) :: char_arr(5)

call f%open("data.txt","r")
call f%readline(a, b, c, d, e, ff, g, h, char_arr, fmt=" ... some kind of format specifier ...")

Or we could just read lines as an allocatable character string and do the type conversion ourselves:

read(f%readline(),*) a, b, c, d, e, ff, g, h, char_arr ! <-- assuming this works?

I suppose this would work:

character(len=:), allocatable :: line
! ... open file ...
call f%readline(line)
read(line,*) a, b, c, d, e, ff, g, h, char_arr

@jvdp1
Copy link
Member

jvdp1 commented Dec 19, 2019

I would suggest to keep high-level I/O for ASCII files with formatted data (e.g., table of reals; something similar to File suggested by @milancurcic). Maybe some formats, like CSV, could be supported too.

If the user wants something more complex (mix of integers/reals/characters), or binary/stream files, the user can still use newunit (or the readline op proposed by @ivan-pi ; but probably less efficient).

@ivan-pi
Copy link
Member

ivan-pi commented Dec 19, 2019

I would suggest to keep high-level I/O for ASCII files with formatted data (e.g., table of reals; something similar to File suggested by @milancurcic). Maybe some formats, like CSV, could be supported too.

So more like np.loadtxt(), but with the possibility to load the individual columns or rows directly into arrays (1d or 2d)?

@jvdp1
Copy link
Member

jvdp1 commented Dec 19, 2019

So more like np.loadtxt(), but with the possibility to load the individual columns or rows directly into arrays (1d or 2d)?

Yes, it was my idea.

I observed that I/O is usually quite a difficult step for most beginners when they come from R/Python/Matblab. Such high-level I/O could maybe help them.
Also, when performance is not an issue, high-level I/O could also simplify programs.

@marshallward
Copy link

I would also like to see a more POSIX-like interface to the filesystem. I would also like to see this as independent from the Fortran interface, with a focus on data streams rather than the list-directed I/O for which Fortran I/O has been tailored.

I also think that there should be a generic File type, but it would be good to start with complete interfaces to the low level file I/O C functions, such as POSIX open, read, etc. More elegant interfaces can then be build from these.

I like the way Python does this, with a top level os module which provides a commont platform which conditionally wraps to a low level interface: posix module in Linux/OSX, nt for Windows, etc. Perhaps even the Fortran intrinsics could be supported as an additional backend of necessary, along with more exotic platforms (OpenVMS?).

@certik
Copy link
Member

certik commented Dec 19, 2019

@ivan-pi, @jvdp1 see #16 for loadtxt.

@milancurcic
Copy link
Member Author

milancurcic commented Dec 19, 2019

Great ideas for C-style formatting implementation. A basic implementation for File can work on just strings at first, and parsing the strings into numeric variables and vice versa can be a separate issue (and module) that File would eventually depend on. I can prepare a prototype for File that works on strings.

@marshallward I'd like a maintained and tested POSIX interface as well. Do you mind opening a proposal for this? (Update: done in #22.)

@cmacmackin We can have a dedicated proposal to just C-style formatting. I like your idea for implementation. Do you mind opening a proposal for this? (Update: done in #19.)

@cmacmackin
Copy link

@ivan-pi My suggestion of calls to C was specifically for a printf function. This would avoid the combinatorial explosion because printf works on void* data types. These can be passed in from Fortran using a "deferred-type" argument, type(*). The interface would look something like this:

void printf_wrapper1(const char *format, void *arg1) {
    printf(format, arg1, arg2)
} 

void printf_wrapper2(const char *format, void *arg1, void *arg2) {
    printf(format, arg1, arg2)
} 
interface printf
subroutine printf_wrapper1(format_str, arg1) bind(c)
    character(len=1), dimension(*), intent(in) :: format_str
    type(*), intent(in) :: arg1
end subroutine printf_wrapper1

subroutine printf_wrapper2(format_str, arg1, arg2) bind(c)
    character(len=1), dimension(*), intent(in) :: format_str
    type(*), intent(in) :: arg1, arg2
end subroutine printf_wrapper2
end interface printf

The main complication with this is how to convert between Fortran and C strings. It wouldn't be hard to provide wrapper routines which do this for the Format string, but string arguments to printf could be more of a challenge.

@rweed
Copy link

rweed commented Dec 20, 2019

Three things I've implemented that I find useful for IO particularly input data where I create my own keyword based input. One is a File class that basically wraps current Fortran IO (open, read, write etc) but hides the unit number. I have read and write wrappers that allow you to read/write formatted, list-directed or binary of the intrinsic data types from the same subroutine. Another useful class is a deferredLenString class that is a defined type that contains a deferred length string with some routines to convert back and forth to regular strings and do the equivalent of LEN, LEN_TRIM etc. Finally, I define a fileImage class that uses an array of the deferredLenString class to hold all the records in a formatted input file in memory as a "file image". I have filters that can change case, strip delimiters etc from the original file. When processing keyword based input files, using a file image instead of the disk file is a lot faster. Some or all of these might prove useful for implementing IO classes specifically for processing text files etc.

@gronki
Copy link

gronki commented Dec 31, 2019

I think we do not need a derived type wrapper for Fortran i/o. The current i/o covers many of the general needs. Some missing capabilities can be (and gradualy are) improved from the language side and introducing a wrapper would not help with anything. For a particular file format (such as PPM), anyone can write their own wrapper using derived types. This is literally what derived types are for. They should not be used for introducing another way of doing the same thing (which is universally bad). So I am strongly against this proposal.

@certik
Copy link
Member

certik commented Jan 3, 2020

I generally agree with @gronki on this --- it's similar in #69. Rather than creating our own string_t, let's use the intrinsic, and work with the language and compilers to improve it if needed.

However, as a pragmatic approach, I think the best it to have both: a low level API that uses the language features. People like @gronki and I will use that primarily. And optionally a high level API, that lots of people can also use. I think we can have both. What we should not have is just the high level API without the low level one.

@milancurcic
Copy link
Member Author

Yes, perhaps it's better for this to start as a separate project, and after having something concrete to look at we can discuss whether people would like to have it in stdlib or not.

@certik
Copy link
Member

certik commented Jan 3, 2020

@milancurcic let's separate the genuine new functionality, which would go into the low level API, from a convenience wrapper, which would go into the high level OO API.

Here are some ideas off top of my head that would go into the low level API:

  • Using the unit handle u (of type integer) directly
  • The standard Fortran functions read, write, close, etc.
  • Reading a string from a file and returning it as properly allocated (related to String handling routines #69)
  • Reading an array of numbers (real or integer) and returning a properly allocated array back
  • Perhaps some easier function open (so that one can do u = open(filename) and u = open(filename, "w") as in Python etc. instead of the current open(newunit=u, file="log.txt", status="old", action="read") and open(newunit=u, file="log.txt", status="replace", action="write"), which is so complicated that I literally just had to look it up again --- the standard is badly designed that it left both status and action undefined --- we should work on fixing it) See Implement open(filename, mode) and use it #71.
  • ... what else?

The high level API would then wrap these in a convenience derived type wrapper file_t and methods.


I see a repeated pattern across many of the issues in stdlib ---- the high level "simple" OO interface is motivated by the fact that the low level built-in Fortran language features are insufficient. In order to fix the low level API, which a lot of people would otherwise use, let's use the OO interface as a vehicle to figure out what features we want and need (not being constrained by the limitations of the low level API: we can design our derived type and methods in any way we like). Then extract the genuine new functionality, and put it into the low level API, and the high level API is just a thin wrapper. Then let's create proposals to get some of the stdlib's improvements of the low level API into the Fortran standard itself -- backed by wide support here and our future users using it. And let people choose between the low level API and the high level API. It looks like we will have users for both.

@everythingfunctional
Copy link
Member

I've always found it really strange that Fortran's read and write actually muddled the jobs of parsing and formatting together with the I/O. It leads to input files/specifications that are predicated on that functionality, which in my opinion is not a good thing. You're users shouldn't have to care about the peculiarities of the language you're using. I once was porting some Fortran code to Python and had to write something to simulate Fortran's IO in order to make sure old input files would still work properly.

I'm pretty fond of Python's file API. I think we should mimic that as a library and see if it works well for everybody. We need a string type to implement it properly though since readlines returns an array of strings. An on-the-fly formatting library would be really helpful too.

@gronki
Copy link

gronki commented Jan 3, 2020

@certik I am not sure if I undestood correctly the last paragraph. Do you mean that the wrapper should be developed as the protype/testbench for the features that will be handled intrinsically in the future?

@everythingfunctional I disagree there is any reason to mimic Python's io library in fortran I/O. Is there anything that can be done in Python that Fortran does not enable to do with its current functionality? I would even argue that Fortran has superior functionality in some aspects and that the lesser brevity in other situations comes from lacks in string manipulation capabilities (for example, having to use write/read to parse strings to numbers).

Again, sorry for being a bitter potato. I am all for changes and criticizing Fortran where it sucks but I also want to make sure that we do not re-invent what's already good just because other languages handle it differently. Please do not take my opinions personally. :)

@certik
Copy link
Member

certik commented Jan 3, 2020

@gronki yes, that's what I meant.

Is there anything that can be done in Python that Fortran does not enable to do with its current functionality?

Here is one example: #71. Look how much simpler the new open function is. Opening files in Fortran is so complicated that I literally have to look it up every time. This new function, on the other hand, I can remember, as it would be identical to Python.

@milancurcic
Copy link
Member Author

Yes, ease of use is the key motivation for me here, and IMO is one of highest priorities for stdlib. I also learn standard Fortran I/O every time I use it. Teaching it is especially tedious.

@gronki
Copy link

gronki commented Jan 3, 2020

I understand your point. But my counter point is: then why not just use Python. :) What I want to say is that "just because it looks like Python" is not an argument because Fortran is not Python.

In this particular issue, the "Python-like" (or actually, C-like, because that's its origin) syntax takes the same amount of information (unit number, filename, access mode), just arranges it differently. It does not change the quality of life in any way. Nor would derived type wrapper of current I/O.

I am all for quality improvements that actually do make I/O more functional. But I disagree any of the ones discussed so far do that. That's only my opinion though and please don't be discouraged. As it was mentioned by someone before, since this is stdlib project (and not standard proposal project), it's probably best to let people make their packages and see if the solution produced ends up actually being better. :)

@certik
Copy link
Member

certik commented Jan 4, 2020

It's about making Fortran easier to use. This:

s = open(filename)
s = open(filename, "w")

is simpler and more consistent than this:

open(newunit=s, file=filename, status="old", action="read")
open(newunit=s, file=filename, status="replace", action="write")

That's all there is to it. Just like Fortran has arrays, which do have similar information as the various C++ array libraries, but they are much easier to use. That's the whole point. So the goal of stdlib is to make Fortran easier to use and more productive, with a standard syntax, agreed upon by a wide community.

@everythingfunctional
Copy link
Member

Is there anything that can be done in Python that Fortran does not enable to do with its current functionality?

No, Python doesn't have nearly the number of options as Fortran, but that's the point. Fortran has too many options, and that makes it difficult to use. It would be nice if there was a simpler, more user friendly way to deal with files so you wouldn't have to look it up every time.

@jvdp1
Copy link
Member

jvdp1 commented Jan 6, 2020

After some discussions and implementations (see #71, #77, #86, #91), the function open has been added to stdlib_experimental_io:

integer function open(filename, mode, iostat) result(u)
    character(*), intent(in) :: filename
    character(*), intent(in), optional :: mode
    integer, intent(out), optional :: iostat
    ...
end function

This function returns a unit number for a file opened following the provided mode.
The mode can be include the following letters (following the Python open):

  • r (read), w (write and replace the file if it exists), x (write in a new file), a (append)
  • + (readwrite)
  • t (text), b (binary)
    The default mode is rt.

Currently, both text and binary files are opened with access = stream.

So, should this API support the other accesses (sequential and direct)? What are the pros and cons to support them (or not)?
So let 's discuss this API (@milancurcic @zbeekman @certik @cmacmackin @ivan-pi @marshallward @rweed @gronki @everythingfunctional and all others I maybe missed)

Supporting direct access would require to add recl to the API too. IMHO this would defeat the purpose of stdlib function open that aims to propose a more friendly way to deal with files.

A draft PR (#91) has been opened to propose an API to support at least sequential access (default is still stream):

integer function open(filename, mode, iostat, access) result(u)
    character(*), intent(in) :: filename
    character(*), intent(in), optional :: mode
    character(*), intent(in), optional :: access
    integer, intent(out), optional :: iostat
end function

@certik
Copy link
Member

certik commented Jan 6, 2020

To start: what is the advantage of unformatted sequential, compared to unformatted stream? There is a disadvantage that the format is compiler version specific. But is there any advantage? Could it be perhaps faster, or is that not the case.

@jvdp1
Copy link
Member

jvdp1 commented Jan 7, 2020

In this comment, @jacobwilliams mentioned:

delete a file or directory (I know for files we can open and then close(..status='delete') but that is just so non-modern).

Would it be an interest to implement a subroutine/function in stlib to delete a file? For example, this would check if the file exists, if it is not opened, and if both are true, would delete the file.
Such a subroutine/function could be implemented in stdlib_experimenatl_io.
Possible interface:

io = delete('file.txt')

or

call delete('file.txt',iostat=io)

with iostat being optional

Such a function/subroutine would avoid the 'traditional':

open(newunit=u, file = 'file.txt', status = 'old')
close(u, status = 'delete')

@certik
Copy link
Member

certik commented Jan 7, 2020

I think the answer is yes, only we should make the naming consistent. Python has remove, and we should survey other languages.

@jvdp1
Copy link
Member

jvdp1 commented Jan 7, 2020

To delete a file:

  • Python: remove
  • Julia: rm
  • R: remove
  • Matlab: delete
  • C/C++: remove
    -D: remove
  • Fortran: delete in built-in close

and more examples here

remove (or rm) seems to be quite popular in other languages after a quick scan.

@certik
Copy link
Member

certik commented Jan 7, 2020

I am fine with calling it delete. I assume this would go into the io module?

@jvdp1
Copy link
Member

jvdp1 commented Jan 7, 2020

I am fine with calling it delete.

Me too. delete would be also in agreement with close(...='delete').
However, if we want to extend that to directories, rm or remove could be more appropriate (with e.g. rmdir to remove directories.

I assume this would go into the io module?

It would make sense to me, since open is already in the io module.

If we want to implement more stuff around file systems #100, we should probably have a new module dedicated to file system stuff (e.g., rm, rmdir, ls, mkdir,...)

@jvdp1 jvdp1 added the topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... label Jan 18, 2020
jvdp1 pushed a commit to jvdp1/stdlib that referenced this issue Feb 4, 2021
@wclodius2 wclodius2 mentioned this issue Jun 19, 2022
jvdp1 pushed a commit that referenced this issue Jun 20, 2022
Addition of test_maps using test-drive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...
Projects
None yet
Development

No branches or pull requests

10 participants