Array of strings #24

Open
certik opened this issue Oct 19, 2019 · 30 comments
Labels
Clause 7 Standard Clause 7: Types

Comments

@certik
Member

certik commented Oct 19, 2019

A common request is to simplify how to declare and use an array of strings. Currently one option is this:

type string
    character(:), allocatable :: s
end type

integer, parameter :: N = 5
type(string) :: A(N)
integer :: i
do i = 1, N
    A(i)%s = "some string"
end do
do i = 1, N
    print *, A(i)%s
end do
end

It might be worth considering if there should be some way in Fortran to do this directly without the derived type.

@tclune
Member

tclune commented Oct 20, 2019

While I sympathize, the nature of Fortran arrays really does require that the elements all be the same type and with the same type parameters. An unfortunate but common design pattern in Fortran is to "wrap it" so that you can treat it consistently with other types. Here one is wrapping a string to support treatment in an array, but I've also wrapped arrays so that they can be treated with CLASS(*) in a consistent manner.
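
As a small illustration of the constraint described above (a sketch added for clarity, not part of the original comment; the program and variable names are made up), a plain character array has a single length shared by every element, so shorter values end up blank-padded:

```fortran
program fixed_length_array
    implicit none
    ! All elements of a character array share one length parameter.
    character(len=6) :: words(3)
    integer :: i

    ! The type-spec inside the constructor gives every element the same
    ! length; literals of differing lengths without it do not conform
    ! to the standard.
    words = [character(len=6) :: "a", "bc", "World!"]

    do i = 1, size(words)
        ! len reports the shared declared length, len_trim the length
        ! without trailing blanks.
        print *, len(words(i)), len_trim(words(i)), "[" // words(i) // "]"
    end do
end program fixed_length_array
```

The brackets in the output make the trailing padding visible, which is the behaviour the wrapper-type pattern is meant to avoid.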

For the specific purpose at hand, you could consider the StringVector implementation from gFTL-Shared (https://github.com/Goddard-Fortran-Ecosystem/gFTL-shared)

@aradi
Contributor

aradi commented Oct 21, 2019

Wouldn't it be enough if the Fortran Standard enforced a standardized derived type type(string) with hidden internals (similar to type(c_ptr))? With most of the current string manipulation functionality provided as type-bound procedures? Also including assignment and automatic handling by the print/write statements, so that your example becomes

type(string) :: A(N)
integer :: i
do i = 1, N
    A(i) = "some string"
end do
do i = 1, N
    print *, A(i)
end do
end

I guess this should not require many changes. Edit descriptors may need additional thought, probably.
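
For comparison, here is a rough sketch of how close one can get today with a user-defined type. It is only an illustration under stated assumptions, not a proposed standard interface: the module name string_sketch, the component name s, and the procedure names are all invented here. It relies on defined assignment and defined (derived-type) output, both Fortran 2003 features that compilers have not always handled flawlessly.

```fortran
module string_sketch
    implicit none
    private
    public :: string, assignment(=), write(formatted)

    ! Hypothetical wrapper; the private component mimics "hidden internals".
    type :: string
        private
        character(len=:), allocatable :: s
    end type string

    interface assignment(=)
        module procedure assign_from_char
    end interface

    interface write(formatted)
        module procedure write_string
    end interface

contains

    ! Defined assignment: string = character expression.
    elemental subroutine assign_from_char(lhs, rhs)
        type(string), intent(out) :: lhs
        character(len=*), intent(in) :: rhs
        lhs%s = rhs
    end subroutine assign_from_char

    ! Defined output, so that print/write accept a string directly.
    subroutine write_string(dtv, unit, iotype, v_list, iostat, iomsg)
        class(string), intent(in) :: dtv
        integer, intent(in) :: unit
        character(len=*), intent(in) :: iotype
        integer, intent(in) :: v_list(:)
        integer, intent(out) :: iostat
        character(len=*), intent(inout) :: iomsg
        if (allocated(dtv%s)) then
            write(unit, '(a)', iostat=iostat, iomsg=iomsg) dtv%s
        else
            iostat = 0   ! nothing to write for an unset string
        end if
    end subroutine write_string

end module string_sketch

program demo_string_sketch
    use string_sketch
    implicit none
    integer, parameter :: N = 3
    type(string) :: A(N)
    integer :: i

    do i = 1, N
        A(i) = repeat("ab", i)   ! elements end up with different lengths
    end do
    do i = 1, N
        print *, A(i)
    end do
end program demo_string_sketch
```

What this cannot reproduce is the rest of the character feature set (substring notation, use as a character actual argument, edit descriptors such as A), which is where a standardized type would need support from the processor.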

@FortranFan
Member

FortranFan commented Oct 21, 2019

> Wouldn't it be enough if the Fortran Standard enforced a standardized derived type type(string) with hidden internals (similar to type(c_ptr))? ..

It will be really, really nice if the Fortran standard offered something INTRINSIC that is far more well-featured (e.g., arrays of strings with different lengths) and less verbose for coders than "CHARACTER(LEN=:), ALLOCATABLE" or its encapsulation in a derived type!!

Converse to "You had me at Hello", the number of instances where "You LOST me at CHARA.." when it comes to someone new looking at Fortran and not liking it is quite long.

It's the year 2019, after all. Operations with "STRINGS" are so basic and intrinsic in programming that a coder should simply be able to do:

   string :: s(2)
   s = [ "Hello", "World!" ]
   print *, s  ! prints Hello World!
   call to_upper( s )
   print *, s  ! prints HELLO WORLD!
end

and so forth either along the above lines or something similar. As to how the "string" type is implemented in a processor - whether as an intrinsic derived type or some other compiler "magic" - doesn't matter.

Steering the practitioners of Fortran toward a library solution for something as simple and fundamental as a STRING type appears to be a grave disservice; this should be low-hanging fruit to offer to the coders.

The several library solutions available out there, including the so-called "Part 2" of the standard with ISO_VARYING_STRING, have long proven the use cases and the feasibility. The challenge now is mainly in standardizing the set of operations to be offered and their names, e.g., 'push_back' or 'append'; 'remove' or 'replace', etc.

@jacobwilliams

My preference would be an intrinsic string class that has all the expected behavior of normal character strings (assignment, s(1:3) slice notation, works with other intrinsic routines like read, write, etc.). Without these, it's just not very useful.

But... I don't want the committee to provide some small set of other routines (upper, lower, or whatever) and then not update them for 20 years. What if it was possible to allow the intrinsic type to be extended? That way the user community could develop a library that has everything anybody would need (e.g., regular expressions, parsers, ... things that would never be added by the Fortran standard).

A bonus would be some sort of backward compatibility with character dummy arguments. But that may be asking too much.

@certik
Member Author

certik commented Oct 22, 2019

One approach is to do something like std::string, which the C++ standard specifies (I think) and then just keep updating it if needed, just like C++ does:

https://en.cppreference.com/w/cpp/string/basic_string

You can see some of the methods are since C++11, C++17, C++20, and so on.

Related to this is that I think Fortran needs to have a standard every 3 years, just like C++, precisely to prevent not updating a feature for 20 years. I opened a new issue for this at #36.

@tclune
Member

tclune commented Oct 28, 2019

I agree that having a new intrinsic type STRING would be of considerable benefit. Vendors are usually somewhat reluctant to make changes in the internal type system of their compiler, so there may be some debate about the benefits vs costs. This is esp. true since a user defined type could give most of the benefits without changing the standard at all. (Of course, I think I can still demonstrate compiler bugs with such wrappers on most extant compilers ...)

Failing an intrinsic type, it would be good if we could establish an aux Fortran library for such things so that my String wrapper and your String wrapper are compatible. My current approach to such things is to have very small GitHub projects for isolated capabilities, but a better solution might be a large collection of agreed-upon functionality. Various groups have something along these lines, but we'd want to consolidate the effort under a single umbrella. This would evolve faster than the standard but would still require some governance structure.

@gronki

gronki commented Oct 28, 2019

Fortran derived types are not capable of what C++ or Python types are. And they were never designed to be that. So I would be against any suggestion such as "implement strings/containers/... by derived types", especially since the Fortran character type is pretty good as it is. Instead, I think some more character manipulation functions, a Python-like str(), and regex capability would be all one needs. Also, handling them intrinsically by the Fortran library or using native libraries (in the case of regex) would make it way more efficient than a clunky Fortran implementation. Dominik

@certik
Member Author

certik commented Oct 28, 2019

Yes, my suggestion of std::string was to have an intrinsic type string in Fortran, or simply to improve the current character type so that it can do arrays of strings and other things, not to have it as a library, for the same reasons that @gronki gave.

@tclune and I had long discussions of "being part of the language", versus "a library as in C++ or Python". For Fortran, I am in the "part of the language" camp, and Tom is more open to the "library approach".

@FortranFan
Member

@gronki wrote:

Fortran derived types are not capable of what C++ or Python types are. And they were never designed to be that. ..

I don't quite agree with this.

Besides, Fortran has always granted certain allowances to intrinsics (types, whether basic or derived, statements, and procedures) that are not feasible with user-written code. Processors can then implement whatever "magic" they think appropriate behind the scenes to support the facilities. Examples include generic intrinsic procedures, derived types for interoperability with C, etc. Thus there is no reason why an intrinsic STRING type cannot be introduced that offers considerable benefits for the practitioners of the language, which should be the overriding motivation, even in a manner that might at first appear alien compared to user derived types: e.g., an inextensible, single-component container type for a new intrinsic string type that offers an "(x:y)" operator, which isn't possible today with user derived types!

On the other hand, the problem with the CHARACTER type is that it's really at the point of being immutable: any further change to it can adversely impact some compiler implementation or other. The resistance to any change here will be great.

But also, there is the issue of ARRAY semantics in Fortran which, per its original design, calls for symmetry and shies away from jagged structures. So this is yet another issue when it comes to user needs for arrays of strings.

Thus, the best compromise in my opinion is a new string type which builds on the capabilities of current CHARACTER type but also offers further benefits for coders.

@tclune
Member

tclune commented Oct 28, 2019

@certik To clarify: "in-the-compiler" is generally better than "in-a-library", I'm sure we can all agree. Rather, the issue is that if you want everything in the compiler you are going to be disappointed many, many times due to finite resources.

The number of developers that could/would contribute to an open library far exceeds those that can/will contribute to the standard and esp. any commercial compiler. Perhaps, just perhaps, flang will emerge and create a thriving ecosystem of active branches of development of new (well thought out) features, but even then I don't think the basic balance will change much.

So, given the finite resources for making changes to the standard and the commercial compilers, I want to focus on things that fundamentally cannot be done with user code. Of course there are grey areas, and I'm by no means an absolutist. An intrinsic String type would likely rise to my 2nd tier of language priorities. (2nd tier things often happen precisely because the 1st tier things are too hard/controversial.)

@FortranFan
Member

FortranFan commented Oct 28, 2019

@tclune wrote:

..
So, given the finite resources for making changes to the standard and the commercial compilers, I want to focus on things that fundamentally cannot be done with user code. Of course there are grey areas, and I'm by no means an absolutist. An intrinsic String type would likely rise to my 2nd tier of language priorities. ..

A point to keep in mind is what is needed to "grease the wheel to make the sale" i.e., to get Fortran considered even as an option for new projects or for refactoring of existing code-bases. An intrinsic string type appears to fall under this category.

Being able to code "string s = 'Hello World!'" conveys a certain sense of ease-of-use and modernity which is not quite measurable but which I think is priceless. One just loses many a sharp mind at "character(len=:), allocatable ..". One can then build many a "field of dreams" with coarrays (an enormous amount of committee resources were expended on it), etc. in Fortran, but "they" rarely come to experience any of that ..

One can accept the argument Fortran does not necessarily need to include as a top priority feature "in the compiler", say, containers of the likes of C++ STL like deques and priority_queues and unordered multimaps, etc. A vision for Fortran to offer better intrinsic capabilities (e.g., generics) at some point so coders can "home brew" their own libraries toward such capabilities may be ok right now.

But with a string type, an intrinsic capability is badly needed for Fortran's credibility. And it doesn't seem all that difficult: almost all the heavy lifting was already done with Part 2 of the standard with ISO_VARYING_STRING. Part 2 is now effectively gone. It's mostly a matter of merging a modernized version of it into Part 1.

However if every item, no matter how well-established in programming parlance, becomes too difficult to get added to Fortran and requires endless discussions and iterations on use cases, requirements, specifications, and syntax, then perhaps it's time for someone to contact INCITS and "pull the plug" on Fortran.

@gronki

gronki commented Oct 28, 2019 via email

@certik
Member Author

certik commented Oct 28, 2019

@gronki I think you have a point. As an example, C++ is a language that has much better features than Fortran to develop libraries for basic things. And so they do not have an array in the language (the idea was that C++ allows enough abstraction to implement your own), and as a result, C++ has dozens of array libraries, all incompatible with each other.

They do have std::string, implemented in a library, but this library is the C++ standard library. So we could have a Fortran standard library, so that things like strings and lists are all compatible. But at that point it must still be standardized, and so we might as well have it in the language itself.

However, I do agree with @tclune's point that it is by far easier to create a library than to get something in the language.

@jacobwilliams

@certik should we continue to discuss a potential intrinsic string class here, or make a new issue specifically for that? That would solve the array of strings issue, but really it's a more general topic that touches on other issues.

@certik
Member Author

certik commented Nov 24, 2019 via email

@everythingfunctional
Member

I ended up implementing my own version of the ISO_VARYING_STRING module for a couple of reasons. gfortran still hasn't implemented it, even though it's been in the standard for quite a while now (I don't know that any other compiler has either). I hadn't seen a particularly complete, well tested, or well written version. There are significant bugs in most implementations of allocatable characters, not the least of which is that you still can't put them in an array even if they weren't buggy.

I was under the impression that since ISO_VARYING_STRING had been approved, it was the right way to move forward. If you use a standards compliant version, if anybody ever does actually implement it in the compiler, you won't have to change any of your code, the external library will just become unnecessary. I also thought that standard covered all of the intrinsic procedures accepting or returning characters, and any additional procedures that were truly necessary.

Granted, not every string functionality one would want is included, but that's what libraries are for.

@ivan-pi

ivan-pi commented Apr 10, 2021

> Failing an intrinsic type, it would be good if we could establish an aux Fortran library for such things so that my String wrapper and your String wrapper are compatible.

Hey @tclune, we have tried to make something along these lines over at the stdlib repository. To start, we converged on a non-extensible string_type (similar to the one in iso_varying_string) that can be used easily in an array of strings and supports all of the intrinsic character functions (the definition of the type is in the file stdlib_string_type.f90).

As a second step, @awvwgk provided a demonstration of an abstract base class (ABC) for a string object, already demonstrating wrappers to ftlString and StringiFor based upon the ABC.

Some discussions related to the (problems of the) ABC are in:

I don't think we can get much further than this within the scope of the current standard.

@ivan-pi

ivan-pi commented Apr 27, 2021

> I was under the impression that since ISO_VARYING_STRING had been approved, it was the right way to move forward. If you use a standards compliant version, if anybody ever does actually implement it in the compiler, you won't have to change any of your code, the external library will just become unnecessary.

As I understand, the ISO_VARYING_STRING was withdrawn, see https://www.iso.org/standard/6129.html. I'm not sure if dropping the ISO prefix would cause more or less confusion at this point.

@everythingfunctional
Member

> > I was under the impression that since ISO_VARYING_STRING had been approved, it was the right way to move forward. If you use a standards compliant version, if anybody ever does actually implement it in the compiler, you won't have to change any of your code, the external library will just become unnecessary.

> As I understand, the ISO_VARYING_STRING was withdrawn, see https://www.iso.org/standard/6129.html. I'm not sure if dropping the ISO prefix would cause more or less confusion at this point.

Yes, it was withdrawn, but I didn't know that at the time. I'm still a fan of the specification, even if it isn't in the standard. IMO, it's proven itself to be quite useful.

@ivan-pi

ivan-pi commented Apr 27, 2021

This would be great for a blog post, looking at the story of this module, and doing some comparisons with the string_type currently in stdlib.

@everythingfunctional
Member

That's a good suggestion. I've got lots going on, but hopefully I can get to it before too long.

@MarDiehl

MarDiehl commented Jun 8, 2021

I really don't understand the efforts of implementing a new string type. The current (allocatable) strings are rather flexible and serve at least my purposes very well.
What is missing, IMHO, is a list type that can be used as a collection of arbitrary types. Strings of different length could simply be put in a list. That's how it is done in Python, where a NumPy string array is also limited to strings of the same length.

@tclune
Member

tclune commented Jun 8, 2021

I've not followed this thread closely, but I think the distinction is that a proper String would encapsulate the allocatable aspect. Without that, any attempt to define something like a list would require a wrapper type due to Fortran's quirky syntax in this regard. I.e., a list of Strings would have a virtually identical implementation to a list of Integers, while a list of deferred-length CHARACTER(len=:), ALLOCATABLE would require a wrapper type and lots of boilerplate logic for diving down the extra level in the data structure.

@MarDiehl

MarDiehl commented Jun 8, 2021

@tclune: If you neglect that you need to declare the type, string handling is not much different from Python:

program test_string

  character(len=:), allocatable :: string

  string = 'hello world'
  print*, len(string)

  string = string//', hello Github'
  print*, len(string)

  print*, string(14:)

  string = ''
  print*, len(string)

  string = ' '
  print*, len(string)

end program test_string

#!/usr/bin/env python3

string = 'hello world'
print(len(string))

string = 'hello world'+', hello Github'
print(len(string))

print(string[13:])

string = ''
print(len(string))

string = ' '
print(len(string))

The boilerplate comes in when you want to define a collection of strings, but IMHO this is due to the lack of a list type that can contain arbitrary types/kinds.

@tclune
Member

tclune commented Jun 8, 2021

Introducing such generic programming capabilities is my highest priority, and the primary reason I joined the Fortran committee in the first place. And the ability to use that for containers is my most important use case.

I don't want to overpromise on the schedule, and there are of course risks that such a big feature can fail to get the necessary votes.

@tclune
Member

tclune commented Jun 8, 2021

It also may be worth pointing out that even if Fortran were to introduce something like a List container that can contain "anything" it would be a bit difficult to use. The dynamic typing in languages like Python significantly improves the usability of such a structure.

For instance, suppose we want to retrieve the 5th element from a list L and store it in a variable x. How should we declare x? The type of the 5th element cannot be known at compile time, so even if we "know" that it will be of type real and we do:

real :: x
...
x = L(5)

How does the compiler check that the types agree? What should it do if the types do not agree at run time?

You might respond that CLASS(*) is the right thing in such a context and we would do:

class(*), pointer :: x
...
x => L(5)

That can work (with some less interesting caveats), but ... x is not very useful here. We can't really do much else with it without an ugly (IMO) SELECT TYPE block.
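
To make the SELECT TYPE point concrete, here is a minimal sketch. It is an illustration only: no list type L exists here, so an unlimited polymorphic allocatable variable stands in for "an element retrieved from a heterogeneous list", and the names select_type_sketch and describe are made up.

```fortran
program select_type_sketch
    implicit none
    ! Stand-in for an item fetched from a hypothetical "anything" list.
    class(*), allocatable :: x

    allocate(x, source=3.5)
    call describe(x)

    deallocate(x)
    allocate(x, source="hello")
    call describe(x)

contains

    subroutine describe(item)
        class(*), intent(in) :: item
        ! The dynamic type is only known at run time, so every use has
        ! to go through a type guard like this one.
        select type (item)
        type is (real)
            print *, "real:", item
        type is (integer)
            print *, "integer:", item
        type is (character(len=*))
            print *, "character:", item
        class default
            print *, "unsupported type"
        end select
    end subroutine describe

end program select_type_sketch
```

Every consumer of such a list would need a block like the one in describe, with one branch per type it is prepared to handle.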

Basically, any attempt to make containers as incredibly flexible as they are in Python will be very problematic in Fortran. C++ STL is probably a safer guide to what can work in Fortran. It does allow lists of void pointers, but requires casting. Most STL containers are declared for specific types.

@MarDiehl

MarDiehl commented Jun 8, 2021

Static typing clearly results in these constraints. SELECT TYPE is the price you need to pay if you don't want to give the programmer the responsibility (or freedom, depends on perspective) to cast to the correct type.

I'm not very familiar with the C++ STL, but to me it seems like a 'typed list'. A list of type class(*) would then be just one possibility of using this feature.

I don't know what the STL is usually used for, but from my Python experience a list usually contains similar things. So I expect that an STL-like container will rarely result in long SELECT TYPE statements.

@tclune
Member

tclune commented Jun 8, 2021

Yes - that was what I was trying to say ... Usual cases involve concrete types or pointers to base types if polymorphism is desired. But even then, C++ STL has the same memory footprint for each object, and does not readily handle items of variable size. This allows direct computation of memory offsets, à la how Fortran manages array indexing.

One gets around that for, say, varying-length strings by having a standard string template that effectively creates a wrapper type that has an allocatable character array inside. Which gets back to what the original discussion was requesting, if I understood. Whether Fortran accomplishes this with the new generics facility or does something special just for the case of strings is a useful topic. My hope is that the former is powerful enough to obviate the need for the latter. But early days ...

@gronki

gronki commented Jun 16, 2021 via email

@milancurcic milancurcic added the Clause 7 Standard Clause 7: Types label Nov 22, 2021
@ivan-pi

ivan-pi commented Sep 24, 2023

More debate of this topic here: https://fortran-lang.discourse.group/t/how-do-i-allocate-an-array-of-strings/3930/27

In stdlib we have a string type which provides most if not all the same functions as the built-in character type and can also be used for an array of strings:

program main
  use stdlib_string_type, str => string_type
  implicit none
  type(str) :: a(5)
  integer :: i 

  a = [ str('mary'), str('had'), str('a'), str('little'), str('lamb')]
  
  ! Watch out, derived-type IO is needed!!! (there may be bugs!)
  write(*,'(*(DT))') a // ' '

  ! Alternative is to convert to character first
  write(*,'(*(A))') (char(a(i)) // ' ', i = 1, size(a))

  ! The `char` function is not elemental, because it would need to
  ! pick a fixed-length.

end program main

The main places where some friction exists are: initialization, I/O, and conversion to the built-in character type.

In addition we have a stringlist_type that supports insertion at arbitrary positions of the list.

10 participants