-
-
Notifications
You must be signed in to change notification settings - Fork 481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround for an ecl race in maxima init #35195
Conversation
@mkoeppe : I've just realized that I misplaced the I marked this long time since it takes ~ 2s in my (fast) machine with |
Extra line break before |
You could try if |
We run a new instance of sage in a subprocess to ensure maxima is not already initialized. We use a temporary MAXIMA_USERDIR so its empty, and we try to initialize maxima twice in parallel to entice the race. This temporary dir is placed within `DOT_SAGE` so it is easy to try different filesystems. The bug triggers more frequently if `DOT_SAGE` is in a high latency filesystem (e.g. sshfs on a non-local host). The next commit introduces a workaround for the bug.
Sugestion for testing:
|
I rebased on |
@@ -116,7 +136,19 @@ | |||
ecl_eval("(setq $nolabels t))") | |||
ecl_eval("(defvar *MAXIMA-LANG-SUBDIR* NIL)") | |||
ecl_eval("(set-locale-subdir)") | |||
ecl_eval("(set-pathnames)") | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same could also be done on the CL side using something like (handler-case (set-pathnames) (file-error () (progn (ensure-directories-exist *maxima-objdir*) (set-pathnames)))
(untested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about doing a fix in lisp, but that would be prone to the same race.
Bear in mind that *maxima-objdir*
has multiple components, here you restart once but the race might hit again in a following component. OTOH there might be a real reason for ensure-directories-exist
to fail (e.g. permissions, or maybe there is a file or other non-directory thing with a name clash with *maxima-objdir*
, etc) so we don't want to retry forever.
I think the only proper way to fix this is to catch the error in the implementation of ensure-directories-exist
. I.e. when mkdir
fails, test again if the directory exists and if so ignore the error and continue to the next component. Idealy this has to be done in ECL, or else include a correct implementation in maxima itself and use it instead.
Note that this calling of ensure-directories-exists
in the initialization of maxima is only done for ecl; based on the comment in maxima source maybe this is so because of this race.
I have a PoC patch to ecl to fix this, I'm yet to test it thoroughly and submit to ecl. But even if we patch ecl in sage and this is eventually fixed upstream, we'd still need a workaround for all the system ecl in the wild. A workaround in maxima itself may work better since sage-the-dist doesn't accept system maxima but that is still a step backwards IMO (it would be nice if #33718 and #32867 moved forward).
--- src/lsp/mislib.lsp.orig 2021-02-01 09:59:46.000000000 -0300
+++ src/lsp/mislib.lsp 2023-02-15 21:10:54.498524116 -0300
@@ -302,7 +302,14 @@
(let ((ps (namestring p)))
(when verbose
(format t "~%;;; Making directory ~A" ps))
- (si::mkdir ps mode)))))
+ (handler-case
+ (si::mkdir ps mode)
+ (error (e)
+ (if (si::file-kind p nil)
+ (when verbose
+ (format t "~%;;; Warning: ~A already exists" ps))
+ (error e)))
+ )))))
(values pathname created))))
(defmacro with-hash-table-iterator ((iterator package) &body body)
In any case my point is that this workaround may be necessary for a while nevertheless. The code I want to push does nothing unless an exception is raised, which should almost never happen in normal usage and never happen with a fixed ecl).
The advantage of using python os.makedirs()
is that its implementation is immune to this race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point in favor of the Python solution. Probably best to include this as a comment in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point in favor of the Python solution. Probably best to include this as a comment in the code.
I added something to the comment, no other change. See if you like it or feel free to suggest any wording for the comment.
Codecov ReportBase: 88.60% // Head: 88.59% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## develop #35195 +/- ##
===========================================
- Coverage 88.60% 88.59% -0.02%
===========================================
Files 2140 2140
Lines 397273 397279 +6
===========================================
- Hits 352004 351951 -53
- Misses 45269 45328 +59
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
When maxima is initialized a bug in ecl implementation of `ensure-directories-exist` might result in a runtime error. As a workaround, in case we get a runtime error we use python to create the directory and then continue with maxima initialization. Note that for normal usage the directory will already exist within the user's `DOT_SAGE` so this code will almost never run. However, when running doctests on CI this occasionally triggers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Haven't spent much time trying to repro the bug, but the test looks plausible.
I hate this: "This branch is out-of-date with the base branch". This is awful because it triggers people to click on a button doing a stupid innecessary merge. I hate the way sage tree is littered with merges. Release manager merges are awesome because it gives a clean straightforward history combined with I'd rather prefer to get this message "This branch has no conflicts with the base branch" as in e.g. leahneukirchen/xtools#266. Moreover, sometimes adding a merge on a PR branch confuses github which then starts claiming one has many more commits and files changed than there really are. |
At least only maintainers can push this button on another person's PR (I think) |
Documentation preview for this PR is ready! 🎉 |
I guess so. But on other repos I don't see the button on my own PR. It seems unnecessary: either it can be merged without conflicts (if so, why do it? the release manager can merge it) or there are conflicts (then you have to do it on a checkout). IMO what it is useful is that I get a banner telling me either
Also: rebase FTW. /rant |
Calling initialize-runtime-globals will run set-pathnames and be subject to the issue described in sagemath#26968. Thus the workaround introduced in sagemath#35195 has to be done before anything that may call set-pathnames (e.g. initialize-runtime-globals).
📚 Description
When maxima is initialized a bug in ecl implementation of
ensure-directories-exist
might result in a runtime error.As a workaround, in case we get a runtime error we use python to create
the directory and continue with maxima initialization.
Note that for normal usage the directory will already exist within the
user's
DOT_SAGE
so this code will almost never run. However, whenrunning doctests on CI this occasionally triggers.
New doctest
The first commit introduces a doctest to try to catch this race.
We run a new instance of sage in a subprocess to ensure maxima is
not already initialized. We use a temporary
MAXIMA_USERDIR
so its empty,and we try to initialize maxima twice in parallel to entice the race.
This temporary dir is placed within
DOT_SAGE
so it is easyto try different filesystems.
The bug triggers more frequently if
DOT_SAGE
is in a highlatency filesystem (e.g. sshfs on a non-local host).
Closes #26968.
📝 Checklist