The case for configuration_name to be required and not optional. #36

Open
gadamc opened this issue Oct 12, 2017 · 0 comments
gadamc commented Oct 12, 2017

I advocate that, when/if we refactor this library, we make configuration_name a required parameter when instantiating a new object that configures an object store connection.

  1. Leaving it optional is likely to cause errors, with no explicit notification to the user, whenever multiple object stores are used in the same application:
import ibmos2spark

creds1 = {...}  # credentials for one object store
creds2 = {...}  # credentials for another object store

conf1 = ibmos2spark.configure(sc, creds1)
conf2 = ibmos2spark.configure(sc, creds2)

#at this point, the user believes connections to both object stores have been successfully created

rdd = sc.textFile( conf1.url(bucket, object_name) )

#do work

rddfinal.saveAsTextFile(conf2.url(bucket, object_name))  # saveAsTextFile is an RDD method, not a SparkContext method

In the scenario above, significant problems could occur if both object stores contain objects with the same name. That is a likely scenario when users are processing data and moving it between locations: the wrong piece of data would be retrieved, processed, and then written back, overwriting the original data. Unless the user has taken care to create archive buckets/containers and to configure the object store to track revisions (is that possible with COS S3? It is with OpenStack Object Storage), the original data is lost.

There is no warning to the user in this scenario, and if a large Spark job is run, it could potentially wipe out the user's entire data set.

If there is no object named 'object_name' in the second object store, then the rdd = sc.textFile(...) line will fail. However, it would still be very confusing to the user why it failed: the stack trace from Spark would say something like "file not found", with no indication that it was looking in the wrong object store.
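To make the failure mode concrete, here is a minimal, self-contained sketch of the mechanism as I understand it. The class name, default name, key format, and URL scheme below are illustrative only (a plain dict stands in for the SparkContext's Hadoop configuration), not the exact strings ibmos2spark sets. The point is that only the configuration name ends up in the URL, so two configurations sharing the default name produce identical URLs, and the last set of credentials registered wins.

# Illustrative stand-ins only; not the library's actual keys or classes.
hadoop_conf = {}  # stands in for the SparkContext's Hadoop configuration

class Conf:
    def __init__(self, creds, configuration_name='service'):
        self.name = configuration_name
        hadoop_conf[self.name] = creds  # a shared default name means the last call wins

    def url(self, container, object_name):
        # only the configuration name is embedded in the URL, not the credentials
        return 'swift2d://{}.{}/{}'.format(container, self.name, object_name)

conf1 = Conf({'user': 'store1'})
conf2 = Conf({'user': 'store2'})

print(conf1.url('bucket', 'data.csv'))  # swift2d://bucket.service/data.csv
print(conf2.url('bucket', 'data.csv'))  # the identical URL...
print(hadoop_conf['service'])           # ...resolved with store2's credentials

With configuration_name required (or at least distinct per store), conf1 and conf2 would register under different names and produce different URLs, and the ambiguity disappears.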

Another potential issue stems from situations where the configuration code is executed on worker nodes, e.g. if the ibmos2spark.configure call is made inside a function that is then parallelized in a map. In that scenario the behavior may be unpredictable, scattering data across different object storage instances and/or failing in some cases and not in others.

This may feel like an 'edge case', or one that does not reflect the usage of a large percentage of DSX/Object Storage users. However, I think we should avoid the potential for catastrophic failure. Justifiably angry users can spread bad news far and wide with ease.

  2. Requiring a configuration name is a very low burden. In DSX, when one uses the "insert to code" button, we already provide a randomized configuration name. We should continue this policy.

Alternative solution

One alternative option would be for the ibmos2spark library to randomly generate a configuration_name if one is not provided. This would essentially solve the problem, from what I can tell. I'd like to think more about the merits / demerits of this idea though.
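A rough sketch of that fallback, purely illustrative (the function name and prefix are made up, not the library's actual constructor or defaults):

import uuid

def choose_configuration_name(configuration_name=None):
    # fall back to a random name instead of a fixed default
    if configuration_name is None:
        configuration_name = 'cos_' + uuid.uuid4().hex[:8]
    return configuration_name

print(choose_configuration_name())            # e.g. cos_9d1c2e4a
print(choose_configuration_name('mystore'))   # mystore

Two configure calls would then never collide silently, even if the user supplies no name at all.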
