I advocate that, when/if we make changes to refactor this library, we make configuration_name a required parameter when instantiating a new object that configures an object store connection.
Leaving it optional seems likely to cause errors, with no explicit notification to the user, whenever multiple object stores are being used. For example:
creds1 = {...}  # credentials for the first object store
creds2 = {...}  # credentials for the second object store

conf1 = ibmos2spark.configure(sc, creds1)
conf2 = ibmos2spark.configure(sc, creds2)

# At this point, the user believes connections to both object stores
# have been successfully configured.
rdd = sc.textFile(conf1.url(bucket, object_name))

# ... do work, producing rddfinal ...
rddfinal.saveAsTextFile(conf2.url(bucket, object_name))
In the scenario above, significant problems can occur if both object stores contain objects with the same name, which is likely when users are processing data and moving it between locations. The wrong piece of data would be retrieved, processed, and then written back, overwriting the original data, unless the user has taken care to create archive buckets/containers and configured the object store to track revisions (is that possible with COS S3? It is possible with OpenStack OS).
There is no warning to the user in this scenario, and if a large Spark job is run, it could potentially wipe out the user's entire data set.
If there is no object named 'object_name' in the second object store, the rdd = sc.textFile(...) line will fail. However, it would still be very confusing to the user why it failed: the Spark stack trace would say something like "file not found", with no indication that it was looking in the wrong object store.
Another potential issue arises when the configuration code is executed on worker nodes, e.g. if the ibmos2spark.configure call is made inside a function that is then parallelized in a map. In that scenario the behavior may be unpredictable, scattering data across different object storage instances and/or failing in some cases but not others.
This may feel like an edge case that does not reflect how a large percentage of DSX/Object Storage users work with the library. However, I think we should avoid the potential for catastrophic failure; justifiably angry users can spread bad news far and wide with ease.
Requiring a configuration name seems like a very low burden. In DSX, the "insert to code" button already provides a randomized configuration name, and we should continue that practice.
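For concreteness, here is a sketch of how the earlier example would look under this proposal. It assumes configuration_name becomes a required argument of the same configure call used above; the names 'store_a' and 'store_b' are purely illustrative:

conf1 = ibmos2spark.configure(sc, creds1, configuration_name='store_a')
conf2 = ibmos2spark.configure(sc, creds2, configuration_name='store_b')

# With distinct configuration names, the two sets of credentials should no
# longer collide, so conf1.url(...) and conf2.url(...) can refer to different
# object store instances even when the bucket and object names are identical.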
Alternative solution
One alternative would be for the ibmos2spark library to randomly generate a configuration_name when one is not provided. From what I can tell, this would essentially solve the problem, though I'd like to think more about the merits and demerits of this idea.
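A minimal sketch of that fallback, assuming a configure-style entry point like the one used in the example above (the exact signature is illustrative, not the library's current API):

import uuid

def configure(sc, credentials, configuration_name=None):
    # If the caller does not supply a name, generate a unique one instead of
    # silently reusing a shared default that a later call could overwrite.
    if configuration_name is None:
        configuration_name = 'ibmos2spark_' + uuid.uuid4().hex
    # ... register the credentials under configuration_name as today ...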