DO CONCURRENT might be broken #62
Comments
See https://j3-fortran.org/doc/year/19/19-134.txt , which describes the problem and suggests a solution. The Committee "deferred" this problem and hasn't wanted to discuss it since. |
@klausler thanks a lot for submitting the proposal! Thanks also for commenting under the other issues. I am really sorry the committee didn't discuss this. Did they provide any feedback at all? |
I received no official response. |
@klausler thank you. As a member of the committee, I apologize. This is unacceptable to me and I am trying to convince the committee that we need to consider every proposal that gets officially submitted (even if for just 5 to 10 minutes). In fact I was at the February 2019 meeting, but I don't recall what happened to your paper, as that was my first meeting and I was just trying to figure out how the committee works. Now that we have this GitHub repository, I plan to track every technical comment the committee makes in issues. If it makes you feel any better, the committee didn't consider my proposal either (in #1). I know it happened to others too. I feel it's very inefficient, because had the committee provided feedback to you, you could have submitted a better paper for the October 2019 meeting, and so on, and we could have had this feature in a much more "ready" shape. @sblionel, here is an example of a proposal that I suggest the committee spend 5 to 10 minutes at plenary to discuss and then I volunteer to summarize the feedback in this issue here. I still think this would be the most efficient. But I would be fine with the alternative that multiple committee members provide feedback here directly in the issue and the committee does not officially consider any such proposals until later, as that would still be an improvement and it would move us in the right direction. |
My problem, as an implementor, is that Fortran users expect from its name that |
@klausler I agree. Let's discuss it further. Your proposal has an example:
SUBROUTINE FOO(N, A, B, T, K, L)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N, K(N), L(N)
  REAL, INTENT(IN) :: A(N)
  REAL, INTENT(OUT) :: B(N)
  REAL, INTENT(INOUT) :: T(N)
  INTEGER :: J
  DO CONCURRENT (J=1:N)
    T(K(J)) = A(J)
    B(J) = T(L(J))
  END DO
END SUBROUTINE FOO
How would this be written if your proposal is accepted? Using the |
I should correct this case; I intend The validity of this loop depends on the values of I would restrict |
@klausler let's write down the actual examples; it will be much easier for me (and I am sure others) to follow the arguments.
1. K = [1, 2, 3]
   L = [4, 5, 6]
2. K = [1, 2, 3]
   L = [1, 2, 3]
3. K = [1, 1, 1]
   L = [1, 1, 1]
Currently 1., 2., and 3. are allowed. You are proposing for 1. and 2. to be allowed, but 3. to be forbidden (I corrected this sentence based on the comment below). Given that this depends on the values of the arrays Isn't the idea of |
No, I'm proposing only that your third case be disallowed. And I don't know what the original idea behind EDIT: Your second case is fine as written; change it to K==L==[1,1,2] and it would be a better example of the condition I had in mind. |
I corrected my comment above to match what you are proposing. Regarding your last edit |
When all of the elements of Otherwise, and this is the problem, So But |
Thank you. I think it is clear now which cases are allowed and which ones are not. Let's get some feedback from other members of the committee. @sblionel, do you know what the original idea behind If other members of the committee agree that this should be fixed, then the next step is to update the proposal. I am happy to help. |
This discussion puzzles me somewhat, as I don't agree with some of the assertions made. However, this is not my area of expertise and I would prefer to see opinions of committee members more versed in parallelism, such as Bill Long of Cray. DO CONCURRENT was designed as a replacement for F95's FORALL as it was determined that FORALL's semantics made parallelism very difficult. The whole idea of DO CONCURRENT is that the iterations can be executed in any order and to any degree of parallelism, as the user promises there are no cross-iteration dependencies. There are already Fortran implementations that successfully parallelize DO CONCURRENT (Intel and probably Cray), so I don't really understand what the problem is. DO CONCURRENT was mainly modeled on OpenMP PARALLEL DO, especially with the F18 additions of locality clauses. I see that Peter's paper got "deferred" at the February 2019 meeting and not taken up again. I'm not a good person to discuss this with. |
I suspect that the language in what is now 11.1.7.5 para. 4 (first bullet) was intended to handle cases like:
by requiring the processor to automatically localize the obvious temporary (The specific language reads: If a variable has unspecified locality, • if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates; ...) |
I will try to make this as clear as I possibly can, using the example that I presented earlier.
This subroutine complies with all of the constraints and "shalls" in the Fortran 2018 standard that pertain to DO CONCURRENT. But it cannot be executed in parallel and produce correct results. This is because the DO CONCURRENT construct, despite its name, imposes restrictions on the program that are sufficient to guarantee that the iterations of the loop may be run in any sequential order. The restrictions necessary to guarantee safe execution in arbitrary sequential order are not sufficient to guarantee safe execution in parallel. When ifort parallelizes this loop (
Except that OpenMP PARALLEL DO imposes stricter restrictions on the loop, so that it can be safely executed in parallel. DO CONCURRENT's restrictions are only strong enough to ensure safe serial execution in arbitrary order of iterations. Perhaps it was believed that these weaker restrictions would be easier to comply with, and that a sufficiently smart compiler could apply automatic localization and then safely parallelize. This turns out not to be the case, since a compiler can apply automatic localization only to variables that can be identified at compilation time.
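The difference between "any sequential order" and "in parallel" can be illustrated with a small sketch. It is not taken from the paper or from any compiler; it assumes the values K == L == [1, 1, 2] suggested earlier in the thread, uses made-up values for A, and replaces DO CONCURRENT with ordinary DO loops so that each execution order is explicit. The third loop pair mimics just one interleaving that a parallel implementation might produce (every iteration stores to T before any iteration loads from it); real parallel execution is not required to behave this way.

```fortran
program order_vs_parallel
  implicit none
  integer, parameter :: n = 3
  ! Assumed index vectors: iterations 1 and 2 both touch T(1).
  integer, parameter :: k(n) = [1, 1, 2], l(n) = [1, 1, 2]
  ! Illustrative input values only.
  real, parameter :: a(n) = [10.0, 20.0, 30.0]
  real :: t(n), b(n)
  integer :: j

  ! Serial execution, ascending order of iterations.
  t = 0.0
  do j = 1, n
    t(k(j)) = a(j)
    b(j) = t(l(j))
  end do
  print *, 'ascending order:  all(b == a) is ', all(b == a)

  ! Serial execution, descending order of iterations.
  t = 0.0
  do j = n, 1, -1
    t(k(j)) = a(j)
    b(j) = t(l(j))
  end do
  print *, 'descending order: all(b == a) is ', all(b == a)

  ! One possible parallel interleaving: every iteration performs its
  ! store to t before any iteration performs its load from t.
  t = 0.0
  do j = 1, n
    t(k(j)) = a(j)
  end do
  do j = 1, n
    b(j) = t(l(j))
  end do
  print *, 'interleaved:      all(b == a) is ', all(b == a)
  print *, 'interleaved b = ', b
end program order_vs_parallel
```

In both serial orders each iteration reads back the value it just stored, so B equals A. In the interleaved case iteration 1 reads the value that iteration 2 stored into T(1), so B(1) ends up holding A(2).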
Why not? |
Because I don't head J3, nor even the HPC subgroup under whose purview this would fall. As I wrote above, I do not consider myself an expert on the parallel features - I have a general understanding but that is all. Bill Long of Cray is the person I consider most knowledgeable about parallelism, though there are others on the committee who seem to understand it well. I will ask Bill to take a look at this thread and see if he wants to offer an opinion. |
@klausler your paper is written more as a feature request (this is the language of the standard, this is how I interpret the language, this is how it should be changed) when I think an interpretation request (this is the language of the standard, this is how I interpret the language, is my interpretation correct for that language, is my interpretation what was intended) would get more prompt attention. |
It's been 16 months, so prompt attention is a lost cause at this point. As is J3. I've removed myself from membership on that committee. |
I raised this question again yesterday on the J3 email list. You can read the thread here. If I understand it correctly, these issues have been known for a while and are what prompted the addition in F2018 of the DEFAULT(NONE) locality specifier, as requiring the compiler to analyze the loop for possible sharing was difficult. The general opinion seems to be that explicitly specifying the locality of all variables is needed to enable parallelization, but that changing the F2008 behavior would break programs. |
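As a rough illustration of what "explicitly specifying the locality of all variables" looks like, here is a minimal sketch using the Fortran 2018 locality specifiers. The variable names are invented for the example, and it assumes a compiler that accepts LOCAL, SHARED, and DEFAULT(NONE) on DO CONCURRENT, which not every compiler release does.

```fortran
program locality_demo
  implicit none
  integer, parameter :: n = 5
  real :: a(n), b(n), tmp
  integer :: j

  a = [1.0, 2.0, 3.0, 4.0, 5.0]

  ! DEFAULT(NONE) requires every variable from the enclosing scope that
  ! appears in the block to have its locality stated explicitly.
  ! tmp is private to each iteration; a and b are shared, and each
  ! iteration touches only its own elements a(j) and b(j).
  do concurrent (j = 1:n) local(tmp) shared(a, b) default(none)
    tmp = 2.0 * a(j)
    b(j) = tmp + 1.0
  end do

  print *, b
end program locality_demo
```

Note that the specifiers name whole variables only, so this style handles a scalar temporary like tmp but does not by itself express per-element conditions on an array such as T in the earlier example.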
If anybody had tried to solve the problem of my specific example, they would have learned that the recently added locality clauses are not sufficient to its needs. They accept only the names of whole variables (and, less important, can't distinguish between a pointer and its target). |
DO CONCURRENT's locality rules are broken even apart from parallelization concerns. Is the following a conforming program under Fortran 2018? PROGRAM ONE Quoting 11.1.7.5(3): A has SHARED locality but is both defined and referenced in all iterations. If A had unspecified locality (no SHARED(A) locality specifier), then And the program would not be conforming, both for referencing A in each Can the word "variable" possibly refer to the elements of A, rather than 11.1.7.5(2), emphasis added 11.1.7.5(4) |
Peter, would you please ask this on the J3 email list, of which you're still a member? I think you'd get a more authoritative response there. But to answer the question, "Can the word variable possibly refer to the elements of A", the answer is yes. R902 defines variable as including designator and designator (R901) includes array-element. I admit it can be a bit confusing where sometimes variable names are mentioned (and thus excluding array elements), but in the places where it just says variable, then an array element qualifies. In your example, the variable reference is |
Thanks for the reminder. |
I just asked on the J3 mailinglist to clarify the main problem raised in this issue: https://mailman.j3-fortran.org/pipermail/j3/2020-July/012241.html |
@klausler here is an answer by Malcolm: https://mailman.j3-fortran.org/pipermail/j3/2020-July/012244.html If I understand it correctly, he says that to get maximum performance, one has to do |
There is no |
Right - to me, |
The value of a |
@klausler I am still confused: with the current Fortran Standard and your example, compilers are required to put in (potentially) costly runtime checks, or is there a way to write it using explicit locality specifiers to parallelize efficiently? |
@klausler wrote Nov. 17, 2020 7:01 PM EST:
cf. https://mailman.j3-fortran.org/pipermail/j3/2020-July/012244.html where J3 "effectively" suggested something which is not yet in the standard. So, is it possible for flang compilers to do both!? That is, first have a standard-conforming implementation of DO CONCURRENT. But then also consider an "Experimental" edition of flang compiler(s) that attempts to do the "right thing", perhaps via a "DEFAULT(SHARED)" or some suitable extension that is Fortrannic and which can then be proposed for Fortran 202Y as further improvement to be incorporated into the standard? |
Anything is possible, but these are all just second-best alternatives to J3 just fixing the problems. |
Hi @klausler , now that my Google Summer of Code project proposal for If you happen to be at the Fortran Discourse forums, please send me a PM with your NVIDIA email address so I can send you a calendar invite to the meeting I'm in the process of setting up with Jeff Larkin, Güray Özen, and the folks at Predictive Science who published arXiv:2110.10151 about the Of course, anyone in this thread who is interested is welcome too. |
I have reviewed the document for LLVM Fortran that describes the problems with |
@klausler you have the most knowledge on this particular issue (since you wrote the Flang document!). Do you think you could please submit proposals for 2Y to fix this? I'll help champion it and advocate for it, but it would take me much longer to write up than it would take for you, since you have thought about all the details here and what needs to be done. |
My 2019 paper, which was ignored by J3, remains my favored solution; recycle it if you like. It would be nice if the semantics of locality specifiers with regard to pointers and |
A J3 mailing list discussion of this topic spanned 44 emails over 11 days in July 2020. I wrote email 42 attempting to crystallize the discussion into a practice that I could teach. My takeaway: it suffices for every If someone submits a Fortran 202Y proposal related to this issue, I suggest that it either involve backward-compatible changes to |
@klausler could you share with us what course of action NVIDIA chose for offloading |
The three problems with A better (but not best) suggestion was to use So I still recommend that the default implicit localization rule be changed (perhaps by syntax) to pertain only to names that could appear in an explicit |
This is helpful. I hope the NVIDIA representative(s) on J3 can champion your suggestions. I'd like to see these issues addressed but I would be a poor champion because I have little personal use for most of the features that break
A J3 member in the aforementioned mailing list discussion stated that
Maybe but I think the committee gives more weight to what the committee intended than to what users might incorrectly assume. I think the unstated goal is to make the standard clear, consistent, and useful. If it's unclear, add a note. If it's inconsistent, fix it. If it's clear and consistent but not useful, replace the feature much like |
There's good precedent in a similar case of J3 doing the right thing to preserve the intent of a feature. The response, admirably, was to plug the hole. The change might invalidate existing code, but it was the right thing to do.
|
@klausler This is exactly why I want to start discussion. I plan to push for Edit: after a careful re-read of R1130 in J3/18-007r1 section 11.1.7.2, turns out |
Btw, the first thread from the discussion over the J3 mailing list that @rouson mentioned can be accessed here: https://mailman.j3-fortran.org/pipermail/j3/2020-July/012229.html |
Another WG5/J3 meeting has come and gone with no recorded action on fixing DO CONCURRENT. The HPC subgroup didn't even submit a report on their assigned F'202Y discussion items. At this point, the best implementation option appears to me to be to ignore the broken standard and assume that the default localization rules apply only to variables that could have appeared in an explicit LOCAL clause. J3 has had three years to fix this and done nothing. |
Your item is on the list of things being considered, and there was quite a bit of discussion, but no action at this time. There is quite a bit of disagreement on the matter, especially with some of the claims, but it is being taken seriously. You're correct that the HPC subgroup didn't submit a paper with initial comments, but there will be further discussion. It's too bad that you chose to withdraw from the committee since you obviously have a passion for the issues. |
I became convinced that J3's current process is incapable of producing quality work. The best that I can do is describe the bugs in the standard as I encounter them as an implementer, so that they're documented and you can fix them or not in the standard as you choose. It's not that different of a situation from being a user of a buggy compiler -- one works around the bugs, but still reports them responsibly in the hope that something might be done about them before they affect other users. |
The plan for HPC features in Fortran 202Y (https://j3-fortran.org/doc/year/23/23-146.txt) omits any mention of fixing |
I took another look at this, particularly Peter's examples. The following is #62 (comment) except with the read from T removed.
subroutine foo(N, A, B, T, K, L)
  implicit none
  integer, intent(in) :: N, K(N), L(N)
  real, intent(in) :: A(N)
  real, intent(out) :: B(N)
  real, intent(inout) :: T(N)
  integer :: J
  do concurrent (J=1:N)
    ! During execution, K(J) is always 1.
    T(K(J)) = A(J)
  end do
end subroutine foo
As no locality is specified, we can refer to the following:
Writing to If we add
Note that I am interpreting "variable" to mean the element of an array, not the whole array, even though it's unclear, because if I interpret "variable" as the whole array, it is impossible to use Can someone tell me what is wrong with my thinking? |
You deleted the reference to From Fortran's perspective there are no such things as "data races" in DO CONCURRENT. It's not a parallel programming construct. |
My point is that there is a race on T in both your program and mine, and unless WG5 believes that race conditions are legal and defined, the interpretation of your program does not matter, because it has undefined behavior before it gets to the interesting part. The solution is to make data races undefined behavior, to match every other programming model with concurrent loops, not to accept that data races are legal and well-defined and try to reason about the consequences of that. |
It is meaningless to talk about race conditions in serial code. DO CONCURRENT, despite its name, is defined as a serial construct. RYOS. F'202X 11.1.7.4.3 paragraph 3: "The block of a DO CONCURRENT construct is executed for every active combination of the index-name values. Each execution of the block is an iteration. The executions may occur in any order." |
it was intended to allow parallel implementations. i am proceeding with the intent to make parallelism a reasonable implementation. given that Fujitsu, Cray, Intel and NVIDIA all implement DC with parallelism in a wide range of cases, i believe that allowing data races was the mistake, not parallelism. |
Yes, that is entirely my point. DO CONCURRENT's default locality rules were badly defined and allow non-parallelizable data accesses to be written in conforming code. |
removed |
Your example is clearly non-conforming, and should remain so. And it's not relevant to this particular issue. |
@klausler reported in #60 (comment):
Let's discuss that here. @klausler, can you work with @gklimowicz to fix that? Gary has some proposals regarding "do concurrent". Or is there no way to fix this issue?