String handling routines #69
Comments
This is a great summary! I feel like most of what we need has already been done in these projects (and others), so mainly we need to just gather it all together. Some important things to decide:
|
Should we base it on the ISO_VARYING_STRING module? If so, the class is Should we utilize intrinsic function names? like |
functional-fortran implements several functions on strings:
(Caution: |
Thanks for this initiative and listing the current landscape. I think we definitely want stdlib to have good string support. (For conversion from real/integer numbers to strings, I implemented a function @ivan-pi do you want to go ahead and create a table of the basic subroutines and let's brainstorm how they should be named, to be consistent with other languages and/or the above various string implementations if possible. And also if they should be functions or subroutines and what arguments to accept. |
@jacobwilliams is right about raising the question how to represent the string. We should start with that. I would recommend (as usual) to have a lowest level API that operates on the standard Fortran (allocatable where appropriate) character. Then, have a higher level API that operates on a string type, and simply calls the lower level API. Regarding a name, see #26, it seems most people agree that the convention to name derived type is to append That way people can use these low level API routines right away. For example in my codes I do not need to modify any data structures and can start using it. The higher level |
I would vote that the low level API be based on functions, pure and elemental where possible and appropriate. I would stick with the Fortran convention of optional status parameters where there is the possibility of things going wrong, and if one is not provided and something goes wrong it crashes. I have tended to use that convention in any routines that go from a string to some intrinsic like:
Honestly, I thought the ISO_VARYING_STRING standard did a great job of covering all of the intrinsic functions available for |
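The optional status convention described above can be sketched as follows. This is only an illustration under stated assumptions: the module name `to_int_sketch` and the function name `to_int` are hypothetical and are not taken from any of the libraries discussed. Because the function has an `intent(out)` argument, it cannot be declared `pure`.

```fortran
module to_int_sketch
  implicit none
  private
  public :: to_int

contains

  !> Hypothetical helper (not an existing stdlib routine): parse `s` as an integer.
  !> With `stat` present, a failure is reported through it (non-zero value).
  !> Without `stat`, a failure stops the program.
  function to_int(s, stat) result(val)
    character(len=*), intent(in) :: s
    integer, intent(out), optional :: stat
    integer :: val
    integer :: ios, tmp

    val = 0
    read(s, *, iostat=ios) tmp      ! list-directed internal read
    if (ios == 0) val = tmp         ! only trust tmp if the read succeeded
    if (present(stat)) then
      stat = ios
    else if (ios /= 0) then
      error stop "to_int: string is not a valid integer"
    end if
  end function to_int

end module to_int_sketch
```

A caller that wants to handle bad input passes `stat` and checks it; a caller that omits `stat` accepts that the program stops on bad input.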
My plan was to go through the libraries above and create a table of the most commonly available routines in the next days. I agree we should consider both low-level routines which work directly on strings of type The book Fortran Tools for VAX/VMS and MS-DOS by Jones & Crabtree contains a description of a Fortran string-handling library. Interestingly, they decided to use null-terminated strings like in C, meaning they needed to build a separate set of functions from the intrinsic ones (concatenation operator // and length function). They later used these tools to develop a compiler for a subset of the Fortran language itself! Their conclusion about strings was:
|
Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html. |
I have linked three distinct implementations in the top post. The links from the gfortran compiler pages are dead as well as the link in Modern Fortran Explained by MCR. Edit: An informal description of the |
@ivan-pi thanks. I like your plan. It looks like the |
Building the low-level API on
That seems complicated to me... but it would cover all the bases... |
I think there are two possible APIs here: intrinsic and derived-type one. For the intrinsic API, I also see the intrinsic one as the starting point. Higher-level (derived type) implementation is likely to use the intrinsic API internally. |
My understanding, and somebody correct me if I don't have this quite right, is that the ISO_VARYING_STRING standard was created before If allocatable character actually worked we wouldn't need a new derived type for strings. You would just use the intrinsic type and move on. But I think as written in the standard, it probably will never truly work properly in all cases (especially in read statements, since other allocatable arrays don't and aren't supposed to). If there is a new type for strings, I don't think a lower level library or API should be exposed, and it should probably not be based on allocatable characters. |
As I mentioned above, you can use this trick to return As @everythingfunctional mentioned, for example GFortran used to have huge problems with allocatable strings and leaked memory. The latest version has improved a lot. Given that this is standard Fortran, and stdlib is a standard library, I think it is ok if we depend on the standard, and if there are compiler bugs, we'll try to workaround them and ensure they are reported. Regarding read statements, see #14 that would handle that. I think we should at least try to create a consistent low level API, not give up without even trying. If it truly cannot be done, only then we'll have to do what you propose, and only expose the |
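As an illustration of the read-statement concern raised above, here is one possible way to read a record of unknown length into a deferred-length character using non-advancing input. It is a sketch only: the names `get_line_sketch` and `get_line` are hypothetical, the chunk size of 64 is arbitrary, and this is not necessarily what #14 proposes.

```fortran
module get_line_sketch
  implicit none
  private
  public :: get_line

contains

  !> Hypothetical helper: read one full record from `unit` into `line`.
  !> `iostat` is 0 on success, negative at end of file, positive on error.
  subroutine get_line(unit, line, iostat)
    integer, intent(in) :: unit
    character(len=:), allocatable, intent(out) :: line
    integer, intent(out) :: iostat
    character(len=64) :: buffer
    integer :: nread

    line = ''
    do
      read(unit, '(a)', advance='no', iostat=iostat, size=nread) buffer
      if (is_iostat_eor(iostat)) then
        line = line // buffer(:nread)   ! last chunk of the record
        iostat = 0
        return
      else if (iostat /= 0) then
        return                          ! end of file or read error
      end if
      line = line // buffer             ! buffer was filled completely
    end do
  end subroutine get_line

end module get_line_sketch
```

The record is accumulated in fixed-size chunks, so the caller never has to guess a maximum line length.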
I thought you could only use intrinsic procedures in variable declaration statements. Learned something new. That's a neat trick, but like you said, not particularly efficient. |
Let's discuss a simple example:
character(*)
Here is an implementation:
function upcase(s) result(t)
! Returns string 's' in uppercase
character(*), intent(in) :: s
character(len(s)) :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
! if lowercase, make uppercase
t(i:i) = char(ichar(t(i:i)) + diff)
end if
end do
end function
When the user wants to use it, he could do this:
character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
allocate(character(len(s)) :: a)
a = upcase(s)
print *, a
which prints:
The main disadvantage of this approach is that the user needs to know the size ahead of time. In this case he does know it: it's the same size as the original string. However, modern gfortran has reallocation of the left-hand side on assignment turned on by default, so just this works:
character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a
So I think that would work for
character(:), allocatable
Here is the implementation using
function upcase(s) result(t)
! Returns string 's' in uppercase
character(*), intent(in) :: s
character(:), allocatable :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
! if lowercase, make uppercase
t(i:i) = char(ichar(t(i:i)) + diff)
end if
end do
end function
It's still used like this:
character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a
But since this has an extra allocation inside |
Now let's discuss integer to string conversion, the two implementations:
character(*)
pure integer function str_int_len(i) result(sz)
! Returns the length of the string representation of 'i'
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: s
! If 's' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(s, '(i0)') i
sz = len_trim(s)
end function
pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
character(len=str_int_len(i)) :: s
write(s, '(i0)') i
end function
And usage:
character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)
which prints:
character(:), allocatable
pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: tmp
character(:), allocatable :: s
! If 'tmp' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(tmp, '(i0)') i
s = trim(tmp)
end function
And usage:
character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)
which prints:
Discussion
Unlike in the (Note: if we implement our own integer to string conversion algorithm, then we avoid the ugly |
Here is my proposal for the low level API:
Unfortunately some compilers might leak memory or segfault when such strings are used in derived types. Ultimately, long term, the compilers must be fixed. That's why I think the above proposal is a good one for the long term. In the short term, if we want to provide strings to users that actually work in all today's compilers, it might be that the only way is to create a |
Specifically for the case of integer to string conversion, you could also dynamically allocate a buffer for each integer kind and then trim the result into an allocatable character string:
function integer_to_string2(i) result(res)
character(len=:),allocatable :: res
integer, intent(in) :: i
character(len=range(i)+2) :: tmp
write(tmp,'(i0)') i
res = trim(tmp)
end function
If we want to avoid internal I/O this function becomes something like
function integer_to_string1(ival) result(str)
integer, intent(in) :: ival
character(len=:), allocatable :: str
integer, parameter :: ibuffer_len = range(ival)+2
character(len=ibuffer_len) :: buffer
integer :: i, sign, n
if (ival == 0) then
str = '0'
return
end if
sign = 1
if (ival < 0) sign = -1
n = abs(ival)
buffer = ""
i = ibuffer_len
do while (n > 0)
buffer(i:i) = char(mod(n,10) + ichar('0'))
n = n/10
i = i - 1
end do
if (sign == -1) then
buffer(i:i) = '-'
i = i - 1
end if
str = buffer(i+1:ibuffer_len)
end function
For processing floating point values the functions are much more difficult to develop compared to those using internal read and write statements. |
I did some keyword searches in the list of popular Fortran projects. It seems that most projects use their own set of character conversion and string handling routines for tasks like reading input values from files, parsing command line options, defining settings, etc. Here are the results of my search of some of the top projects:
The second and third column measure the number of Fortran files that contain the keywords string or character, respectively. This includes both command statements and comments so it may be a bit misleading. In one of the codebases I even found this comment:
|
Casing
The purpose of these functions is to return a copy of a character string (either character(len=*) or a derived string type) with the case converted. The common variants are uppercase, lowercase, and titlecase. The libraries cited in the first post contain the following function prototypes:
! functional
function str_upper(str)
function str_lower(str)
function str_swapcase(str)
pure function ucase(input)
pure function lcase(input)
function str_lowercase(str)
function str_uppercase(str)
subroutine str_convert_to_lowercase(str)
subroutine str_convert_to_uppercase(str)
pure elemental function lowercase_string(str)
function uppercase(str)
function lowercase(str)
! object-oriented
procedure, pass(self) :: camelcase
procedure, pass(self) :: capitalize
procedure, pass(self) :: lower
procedure, pass(self) :: snakecase
procedure, pass(self) :: startcase
procedure, pass(self) :: upper
function vstring_tolower(this[,first,last])
function vstring_toupper(this[,first,last])
function vstring_totitle(this[,first,last])
Some versions return a new string, while others work in place. At least one of the functions does not convert the case of characters enclosed between quotation marks. These are the similar functions available in other programming languages:
My top three name picks are:
Edit: for consistency with the character conversions functions |
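To complement the uppercase routine shown earlier in the thread, here is a minimal sketch of the lowercase and titlecase variants operating directly on character(len=*). The names `casing_sketch`, `to_lower`, and `to_title` are placeholders chosen for illustration, not agreed names, and the code assumes ASCII letters only (it uses `iachar`/`achar`, which work on the ASCII collating sequence).

```fortran
module casing_sketch
  implicit none
  private
  public :: to_lower, to_title

  ! Distance between a lowercase ASCII letter and its uppercase form
  integer, parameter :: shift = iachar('a') - iachar('A')

contains

  !> Return a copy of `s` with ASCII uppercase letters converted to lowercase.
  pure function to_lower(s) result(t)
    character(len=*), intent(in) :: s
    character(len=len(s)) :: t
    integer :: i, c

    t = s
    do i = 1, len(s)
      c = iachar(s(i:i))
      if (c >= iachar('A') .and. c <= iachar('Z')) t(i:i) = achar(c + shift)
    end do
  end function to_lower

  !> Return a copy of `s` where the first letter of each blank-separated
  !> word is uppercase and all other letters are lowercase.
  pure function to_title(s) result(t)
    character(len=*), intent(in) :: s
    character(len=len(s)) :: t
    logical :: at_word_start
    integer :: i, c

    t = to_lower(s)
    at_word_start = .true.
    do i = 1, len(t)
      if (t(i:i) == ' ') then
        at_word_start = .true.
      else
        c = iachar(t(i:i))
        if (at_word_start) then
          if (c >= iachar('a') .and. c <= iachar('z')) t(i:i) = achar(c - shift)
        end if
        at_word_start = .false.
      end if
    end do
  end function to_title

end module casing_sketch
```

Both functions return a new string of the same length as the input rather than modifying it in place, which keeps them `pure`.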
I'd like to add to the list of facilities here the overloaded operator
print *, 3 * 'hello' ! prints 'hellohellohello'
print *, 'world' * 2 ! prints 'worldworld'
It's easy to make and use. The only downside I can think of is a somewhat weird API when importing it:
use stdlib_experimental_strings, only: operator(*) |
Yes, I have seen this kind of usage in one of the above mentioned libraries. I wonder whether it would be better to promote the usage of the intrinsic repeat. A benefit of repeat is precisely that you avoid the import statement. |
Oops, I didn't know about |
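For reference, the overloaded operator discussed above can be written as a thin wrapper around the intrinsic `repeat`. This is a sketch only: the module name `str_repeat_op_sketch` and the procedure names are hypothetical, and treating a negative count as zero is an assumption made here, not something specified in the thread.

```fortran
module str_repeat_op_sketch
  implicit none
  private
  public :: operator(*)

  interface operator(*)
    module procedure int_times_str, str_times_int
  end interface

contains

  !> n * s : `s` repeated n times (negative n is treated as zero here).
  pure function int_times_str(n, s) result(res)
    integer, intent(in) :: n
    character(len=*), intent(in) :: s
    character(len=:), allocatable :: res
    res = repeat(s, max(n, 0))
  end function int_times_str

  !> s * n : same result with the operands swapped.
  pure function str_times_int(s, n) result(res)
    character(len=*), intent(in) :: s
    integer, intent(in) :: n
    character(len=:), allocatable :: res
    res = repeat(s, max(n, 0))
  end function str_times_int

end module str_repeat_op_sketch
```

Since the operator adds nothing beyond `repeat(s, n)`, the trade-off is purely notational: shorter expressions against an extra `use ..., only: operator(*)` line.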
It is really hard to have a day job and keep up with all these threads, so my apologies if I've missed something because I'm just skimming here. A few opinionated notes:
I need to look at the varying string and character array proposals in more detail. FWIW, I personally prefer the Ruby Python OO approaches with methods because it will make import statements much simpler: Pull in the string class and you get all the methods along with the type/class declaration. Now some operators may need to be pulled in as well if you want to be able to concatenate a real (lhs) with a string (rhs, can't have a TBP operator to the left of the object IIRC). I was thinking of starting a PR marrying my work on ZstdFortranLib with a UDT/Class approach rather than operating on raw character scalars and arrays which is awkward for things like |
While there is an intrinsic implementation, |
@zbeekman I am struggling with all the threads also, but that is good news. It means there is lots of momentum. If you can help us design a good low and high level API for strings (#69 (comment)), that would be great. |
@certik k
I like your first one the most. With integers you can use some math to count up how many digits there are, and if you need a sign on the front, which completely removes the need to declare the max string length AND to do the IO twice. Instead you use integer and floating point math which (hopefully) will be reasonably quick. IIRC, I implemented something to do this in JSON-Fortran but I'll have to look for it. Also, I don't mean to whine about not being able to keep up, and I agree that it's good, but it's hard to keep track of all the balls in the air. |
After a brief search there are at least 3 ways to do this without performing the conversion to a string then counting digits:
I would guess that 1 is the fastest way to do this, but it may depend on the compiler and hardware. 3 has conversion to a real, then log10 is probably computed iteratively, and it is converted back to an int, so 1. may be faster despite the loop. |
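A loop-based digit count could look like the sketch below; whether it matches any of the variants enumerated above exactly is an assumption. The names `int_to_str_sketch`, `count_digits`, and `int_to_str` are hypothetical. The count is used to allocate the result once, so no fixed maximum length is needed and the internal write happens only once.

```fortran
module int_to_str_sketch
  implicit none
  private
  public :: count_digits, int_to_str

contains

  !> Length of the decimal representation of `i`, including a leading '-'.
  pure function count_digits(i) result(n)
    integer, intent(in) :: i
    integer :: n
    integer :: rest

    n = 1
    if (i < 0) n = 2          ! one extra character for the sign
    rest = i / 10             ! integer division truncates toward zero
    do while (rest /= 0)
      n = n + 1
      rest = rest / 10
    end do
  end function count_digits

  !> Convert `i` to a string of exactly the required length.
  pure function int_to_str(i) result(s)
    integer, intent(in) :: i
    character(len=:), allocatable :: s

    allocate(character(len=count_digits(i)) :: s)
    write(s, '(i0)') i
  end function int_to_str

end module int_to_str_sketch
```

Dividing the signed value directly, instead of taking `abs(i)` first, avoids overflow for the most negative representable integer.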
After reading through this thread I found a subtle issue with the proposed low-level API for
Taking just the basic functionality mentioned by @ivan-pi here, I implemented a |
I created an exploratory implementation of a functional string handling at awvwgk/stdlib_string as fpm project. A non-fancy string type is implemented there, which basically provides the same functionality as a deferred length character but can be used in an The overall implementation comes close to
|
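To make the idea of a non-fancy string type concrete, here is a minimal sketch of a derived type that wraps a deferred-length character and overloads two intrinsic names. It is not the code from awvwgk/stdlib_string; the module name `string_type_sketch`, the component name `raw`, and the choice of overloaded procedures are assumptions for illustration.

```fortran
module string_type_sketch
  implicit none
  private
  public :: string_type, len, char

  type :: string_type
    private
    character(len=:), allocatable :: raw
  end type string_type

  ! Constructor from a character value
  interface string_type
    module procedure new_string
  end interface

  ! Extend the intrinsic generic names to the new type
  interface len
    module procedure len_string
  end interface

  interface char
    module procedure char_string
  end interface

contains

  pure function new_string(chars) result(s)
    character(len=*), intent(in) :: chars
    type(string_type) :: s
    s%raw = chars
  end function new_string

  !> Length of the stored value; an unallocated string has length 0.
  elemental function len_string(s) result(n)
    type(string_type), intent(in) :: s
    integer :: n
    n = 0
    if (allocated(s%raw)) n = len(s%raw)
  end function len_string

  !> Return the stored value as an intrinsic character.
  pure function char_string(s) result(chars)
    type(string_type), intent(in) :: s
    character(len=:), allocatable :: chars
    if (allocated(s%raw)) then
      chars = s%raw
    else
      chars = ''
    end if
  end function char_string

end module string_type_sketch
```

Unlike an array of intrinsic characters, the elements of a `type(string_type)` array can each have a different length, which is the main thing the wrapper buys.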
@awvwgk this would be the high level API that operates on the How would a low level API look? Let's look at some examples, say the read_formatted function. It doesn't need the The maybe function can also operate on |
Bad example, the
The idea so far was to provide the intrinsic low level API for a string type, on which later the high level API can be defined.
Exactly, I wanted to explore a common basis of agreed on functions for a future high level string object. The minimal agreed on basis should be easily all the intrinsic procedures defined for
I decided to pick the part of the high-level API that will have no overlap with a potential low-level API. This way the low level API can be explored separately, like in #310
This one was chosen deliberately to be an internal implementation detail, i.e. it is not part of the public API. |
Here is what I mean: awvwgk/stdlib_string#1 In that PR, I implemented a low level version of |
@certik, the procedures in Sebastian's module are in fact equivalents of the intrinsic character procedures already available in Fortran:
The pull request #310 is the first to propose new procedures ( String processing in Fortran is not that bad, considering the number of procedures already there. If we could add casing, numeric to string converters (and vice-versa), join and split, and perhaps a few more procedures, I think most use cases would be covered. |
@ivan-pi actually there is genuine new functionality, that I just extracted here: awvwgk/stdlib_string#1 (comment). |
@certik I see, you are right that there is new functionality, the Regarding the The However, I disagree on the low level API for user defined derived type input/output; it is strictly a feature that can only be defined for a derived type but not an intrinsic and we won't be able to make use of it to safely read into a The gist is, I don't want to introduce new functionality beyond the existing |
@awvwgk I just saw your comment, my comment here I think replies to yours: awvwgk/stdlib_string#1 (comment). |
There is now also a branch at my stdlib fork. There is one really unfortunate thing here, GCC 7 and 8 do not support evaluation of user defined pure procedures in variable declarations. Adopting this The solution is to adopt the |
As promised in #320 (comment) I tried to devise an abstract base class (ABC) for an extendible string class. This one turned out much more difficult to design than a non-extendible functional string type; you can check the base class definition here: https://github.com/awvwgk/stdlib_string/blob/string-class/src/stdlib_string_class.f90 The class is a bit more bloated than it has to be because I made it compatible with the intrinsic character type and the functional string type as well to ease testing. One thing that turns out to be very difficult to account for is overloaded intrinsic procedures; you can find two implementations for each intrinsic procedure (except for the lexical comparison where I took a shortcut), one for the overloaded generic interface ( Another problem was returning a class polymorphic object from a procedure ( Since we have a whole lot of intrinsic character procedures, implementing a string class based on this ABC can become tedious; therefore I designed the ABC to provide mock implementations based on the setter ( While this is not a final specification yet, I wanted to share it as an aid for discussing functional vs. object-oriented implementations of a string in stdlib. From the above notes you might gather that a truly extendible string class could result in significant performance penalties for the user. Still there might be some value in having a string object available. |
If I understand things correctly, the assignment to character should be handled explicitly through the
Which procedures does this hold for?
👍 This is better and more Fortranic IMO. |
None, because I had to reconsider this design choice due to missing compiler support. |
Does the initial design choice (the one which breaks GCC 7 and 8 support) survive in any of the earlier commits on your private fork? I wonder if you could still pull it off, by moving the functions out of a module... I still don't fully grasp how the implementation differed. Would for example the In any case your pull request is a big step to make string-handling easier. |
@ivan-pi See https://github.com/awvwgk/stdlib_string/tree/a2833b6dd3b21abc42f8854a7fc3049eaf9b39ff for a version based entirely on returned character values. I think this version could run into problems when used in an elemental way. |
I have recently learned that overloading an assignment operator is a mistake in most cases. For example, appending one element to an allocatable array using the notation:
string_array = [string_array, string("new_string")]
will not work.
With this design flaw of a language, I'd argue that overloading assignments should be avoided at all cost.
Dominik
|
I updated my stdlib_string project with an abstract base class for a more object-oriented string implementation. As a demonstration of such a |
Not sure if it was linked before, Clive Page wrote a nice summary about character types in Fortran: https://fortran.bcs.org/2015/suggestion_string_handling.pdf There was also a thread over at the Fortran-FOSS programmers: Fortran-FOSS-Programmers/Fortran-202X-Proposals#4 A link was provided to a WG5 document, which talks about a |
Let's start a discussion on routines for string handling and manipulation. The thread over at j3-fortran already collected some ideas:
The discussion also mentioned the proposed iso_varying_string module, which was supposed to include some string routines. I found three distinct implementations of this module:
iso_varying_string proposal; the module dates back to 1998)
I also found the following Fortran libraries targeting string handling:
sub, gsub, split, join, and conversion on concatenation. WIP though
It is likely that several of the tools in the list of popular Fortran projects also contain some tools for working with strings. Given the numerous implementations it seems like this is one of the things where the absence of the standard "... led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes" to borrow the quote of B. Stroustrup in a retrospective of the C++ language.
For comparison here are some links to descriptions of string handling functions in other programming languages:
Obviously, for now we should not aim to cover the full set of features available in other languages. Since the scope is quite big, it might be useful to break this issue into smaller issues for distinct operations (numeric conversions, comparisons, finding the occurrence of a string in a larger string, joining and splitting, regular expressions).
My suggestion would be to start with some of the easy functions like capitalize, count, endswith, startswith, upper, lower, and the conversion routines from numeric types to strings and vice-versa.
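As an indication of how small the easy functions in this list can be, here is a sketch of prefix and suffix tests on character(len=*). The names `str_affix_sketch`, `starts_with`, and `ends_with` are placeholders, and the choice to treat trailing blanks as significant is an assumption; a library might prefer to trim first.

```fortran
module str_affix_sketch
  implicit none
  private
  public :: starts_with, ends_with

contains

  !> .true. if `s` begins with `prefix` (trailing blanks are significant).
  pure function starts_with(s, prefix) result(match)
    character(len=*), intent(in) :: s, prefix
    logical :: match

    match = .false.
    if (len(prefix) <= len(s)) match = s(1:len(prefix)) == prefix
  end function starts_with

  !> .true. if `s` ends with `suffix` (trailing blanks are significant).
  pure function ends_with(s, suffix) result(match)
    character(len=*), intent(in) :: s, suffix
    logical :: match

    match = .false.
    if (len(suffix) <= len(s)) match = s(len(s)-len(suffix)+1:) == suffix
  end function ends_with

end module str_affix_sketch
```

Both substrings have the same length as the pattern they are compared against, so the blank padding rule of character comparison never comes into play.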