Dynamic service changes output file: the file gets uploaded, but never gets downloaded by the target service. #5741

Closed
Tracked by #5694
wvangeit opened this issue Apr 25, 2024 · 16 comments
Labels
bug (buggy, it does not work as expected)


@wvangeit
Contributor

wvangeit commented Apr 25, 2024

Something that happened only rarely on my local machine deployment has become more problematic on osparc-master.

Sometimes when a service (notebook, python runner, etc.) writes a file to the output folder (port), the log shows that the file got uploaded.
However, the file on the receiving end doesn't get downloaded: it's not present in the input folder, and there is no log message from the sidecar saying that the file got downloaded.
This is a flaky issue. The same scripts work many times, but hit this particular problem from time to time, at different stages of the file interactions.
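
For context, a minimal sketch of the sending side of this pattern. The folder layout (`/outputs/output_1`) and the file name are assumptions for illustration only; the real mount points depend on the service configuration. The sidecar uploads whatever lands in the output-port folder, and the sidecar of the connected service should then download it into its input folder; it is that last step that intermittently never happens.

```python
from pathlib import Path

# Hypothetical output-port folder; the real mount point depends on the
# service configuration and on which output port is connected.
OUTPUT_DIR = Path("/outputs/output_1")

def publish_result(payload: str, filename: str = "result.json") -> None:
    """Write a file into the output-port folder so the sidecar uploads it."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    target = OUTPUT_DIR / filename
    # Write to a temporary file and rename, so the sidecar never picks up
    # a half-written file.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(payload)
    tmp.replace(target)
```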

@sanderegg
Member

sanderegg commented Apr 25, 2024

@wvangeit there is another option to transfer files/numbers/strings/(maybe objects): the ZeroMQ facilities that @GitHK added between these 2 services of yours, which are in a feedback loop, right?
Please talk with @GitHK to learn how to use that facility. Otherwise we can discuss as well, but I would need to look up how exactly this works; @GitHK will probably be much faster.
The ZeroMQ system completely bypasses the usual oSparc mechanism to transfer data and uses a point-to-point connection between the services, at the cost of the data not being saved inside the oSparc project metadata, which is OK in your case, right?
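
For illustration only (this is not the facility @GitHK added, whose API is not shown here): a generic point-to-point transfer with plain pyzmq looks roughly like this. The socket types, port number and host name are assumptions.

```python
import zmq

ctx = zmq.Context()

# Sender side (e.g. inside the producing service)
push = ctx.socket(zmq.PUSH)
push.bind("tcp://*:5555")                 # port number is just an example
push.send_string('{"status": "done"}')    # small payload; a file would be sent as bytes

# Receiver side (inside the consuming service)
pull = ctx.socket(zmq.PULL)
pull.connect("tcp://producer-host:5555")  # hostname is hypothetical
message = pull.recv_string()              # blocks until the sender pushes
```

As noted above, data exchanged this way bypasses the oSparc storage layer, so it is not persisted in the project metadata.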

@wvangeit
Contributor Author

Yes, I'm very aware of the ZeroMQ facilities; I have several prototypes using them. But that system has its own issues,
one of the main ones being that these ports are not exposed in the GUI atm.
Tbh this is also one of the reasons why I didn't worry 'too' much about the occasional fluke on my local system: I knew that in the longer term we would have the ZeroMQ ports. Unfortunately these file upload errors became way more pronounced on our master deployments.

But otoh, I also think that we have to make file communication more stable. This is basically affecting all the dynamic services (at least), so it's not just me seeing it. I remember Taylor already being happy that I made this to make things slightly more deterministic: https://github.com/wvangeit/osparc-filecomms

@wvangeit
Contributor Author

This issue has a partial workaround: if the code that sends a file keeps rewriting it at regular intervals, it eventually gets picked up by the system. However, this obviously doesn't work if one can't control the code that writes the file (file pickers, other services, etc.).
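
A rough sketch of that workaround on the sending side (interval and timeout values are arbitrary, and the path is hypothetical):

```python
import time
from pathlib import Path

def keep_publishing(target: Path, payload: str,
                    interval_s: float = 30.0, timeout_s: float = 600.0) -> None:
    """Rewrite the output file at regular intervals so that a missed
    upload/download cycle is eventually retried by the platform."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        target.write_text(payload)  # re-touching the file triggers a new upload
        time.sleep(interval_s)

# e.g. keep_publishing(Path("/outputs/output_1/result.json"), payload)
```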

@wvangeit wvangeit added this to the Leeroy Jenkins milestone May 24, 2024
@wvangeit
Contributor Author

I'm moving this to 'Error without workaround', because this is still happening too often.
As mentioned in the previous comments, it still causes problems when one cannot control the incoming service.

@wvangeit
Contributor Author

@GitHK If you need an example of this (the incoming service is a file picker, but the behavior is the same):
On osparc-staging.speag.com
80471a7a-11d3-11ef-9cb8-02420a0803ee

This same setup worked twice before, but in the run at around 11:02 on 25 Jun the parallelrunner suddenly didn't receive its configuration file, although it is clearly present in the file picker:

[screenshot: file picker showing parallelrunner.json]

The service can't find the file (parallelrunner.json):

[screenshot]

FYI @pcrespov

@wvangeit
Contributor Author

(And also FYI, I just reran the exact same study as the one above, and now it worked, so there was no issue with the study itself.)

@GitHK
Contributor

GitHK commented Jun 25, 2024

@wvangeit I see that this service started and received 3 calls to pull the inputs.
It pulled the inputs twice in a row: `09:01:06.971Z -> 09:01:09.052Z` and `09:01:09.866Z -> 09:01:10.891Z`.

Unfortunately there are no further logs of exactly what was pulled. Maybe that can be added for future debugging.

@GitHK
Contributor

GitHK commented Jun 25, 2024

So your log lines `Waiting for parallelrunner.json to exist ...` appear as follows:

  • first at 09:01:05.974Z
  • second, after the files are pulled, at 09:01:15.996Z

Afterwards it just sits there as you described.
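
For reference, that `Waiting for parallelrunner.json to exist ...` message presumably comes from a simple polling loop on the receiving side; a minimal sketch of that pattern (input path and poll interval are assumptions) would be:

```python
import time
from pathlib import Path

def wait_for_input(path: Path, poll_s: float = 5.0) -> Path:
    # Keep checking the input folder until the sidecar has downloaded the file.
    while not path.exists():
        print(f"Waiting for {path.name} to exist ...")
        time.sleep(poll_s)
    return path

# e.g. wait_for_input(Path("/inputs/input_1/parallelrunner.json"))
```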

@wvangeit
Contributor Author

wvangeit commented Jul 1, 2024

Is the improved logging already available in staging? It just happened again around 10:50 on Jul 1, for study ID 80471a7a-11d3-11ef-9cb8-02420a0803ee
on https://osparc-staging.speag.com/
(again the 'Waiting for parallelrunner.json to exist ...' message).

@GitHK
Contributor

GitHK commented Jul 1, 2024

@wvangeit Yes, it was released to staging. Let me have a look and report back.

@GitHK
Contributor

GitHK commented Jul 1, 2024

Replied privately; it looks like some path mismatch.

@wvangeit
Contributor Author

wvangeit commented Jul 2, 2024

Just investigated this with @GitHK. The very last issue I mentioned is apparently a known frontend issue on osparc staging, which is currently being investigated. None of the inputs of the service are there, because the service is still stuck in the 'connecting' state due to that bug.
[screenshot]

For reference, this 'only' affects the last example I provided. All the other examples from above are still valid.

@GitHK
Contributor

GitHK commented Jul 2, 2024

So, since this is unrelated, let me know if you can find a case where the service is actually running and you still see the issue.

@wvangeit wvangeit modified the milestones: South Island Iced Tea, Tom Bombadil Jul 12, 2024
@sanderegg sanderegg modified the milestones: Tom Bombadil, Eisbock Aug 13, 2024
@GitHK
Contributor

GitHK commented Aug 20, 2024

@wvangeit is this still an issue?

@GitHK GitHK removed this from the Eisbock milestone Aug 20, 2024
@wvangeit
Contributor Author

It's a bit difficult for me to tell atm, because my osparc-filecomms lib puts a layer on top that tries to hide this problem. I assume ESB input/output folders would also solve this issue at some point. For me you can close it for now, and I can reopen it when I see another case of this happening.

@GitHK
Contributor

GitHK commented Aug 20, 2024

Ok, let's close it for now.

@GitHK GitHK closed this as completed Aug 20, 2024