RING-44425 - Comments for SPARK scripts #48
Conversation
Stashing my review here; to be continued.
I'm considering going all the way down to the very variables that are used, to get a clear view of what data frames we create, filter out, etc.
I've applied most. Questions posed to the outstanding suggestions.
New pass on P0. TODO: evaluate the need for each column to exist.
Let's keep the focus on comments, including whether columns are used or not used in later steps. Once we get this approved, we can open a new ticket to suggest minimizing the data written to the fields actually used in later scripts. 🤞 With our comments detailing how it all works, we can get approval to change the actual scripts.
Force-pushed from e99b340 to 50c5108.
👍 I'll make the time for P1, then.
Take a look at p1-p4 now. They all have input structures, required fields, etc.
The *WT Heck is inside S3_FSCK files* Google sheet had a header added for at least one tab. This may cause a comment to incorrectly state that a script reads its input with a header when the input may not actually have one, in which case the comment needs to be changed to use _c0, _c1, etc.
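For context, a minimal sketch of the distinction (the path and file are illustrative, not from the scripts): with header="false" Spark always assigns the generic _c0, _c1, ... names, while header="true" consumes the first data row as column names, which is wrong if the file never had a header.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("header-check").getOrCreate()

# headerless read: Spark assigns generic names regardless of what row 1 holds
df = spark.read.format("csv").option("header", "false").load("file:///tmp/keys.csv")
df.printSchema()  # columns come back as _c0, _c1, ...

# header read: row 1 is consumed as column names -- wrong if the file has none
df_h = spark.read.format("csv").option("header", "true").load("file:///tmp/keys.csv")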
Force-pushed from 3eb5bc2 to 644b54e.
"P1" review: thanks to @fra-scality and his jupyter notebooks, figured out how the dataframes were used. The more I read it, the less I understand. Either I'm missing sth obvious or the naming is terrible.
Quite sure there's not much to say (in the current design) about P2, P3 and P4.
One question I have is: can we make the P2 anti-join operation more efficient?
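For reference, a minimal sketch of a left-anti join in PySpark; the dataframe names, the digkey column, and the broadcast hint are illustrative stand-ins, not the actual P2 code. Broadcasting the smaller side avoids shuffling the larger one when it fits in executor memory.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("anti-join-sketch").getOrCreate()

# toy stand-ins for the two key listings P2 compares
df_ring = spark.createDataFrame([("a",), ("b",), ("c",)], ["digkey"])
df_s3 = spark.createDataFrame([("a",), ("c",)], ["digkey"])

# rows of df_ring with no match in df_s3; broadcasting df_s3 lets
# df_ring avoid a full shuffle
orphans = df_ring.join(broadcast(df_s3), on="digkey", how="left_anti")
orphans.show()  # only the "b" row survives

Whether broadcasting actually helps depends on the size of the smaller listing; it is only a hint, and Spark may ignore it.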
dfCOSsingle = dfCOSsingle.withColumn("ringkey", dfCOSsingle["_c1"])
# ???
dfCOSsingle = dfCOSsingle.withColumn("_c1", F.expr("substring(_c1, 1, length(_c1)-14)"))
This operation likely explains why withColumn() is used to duplicate _c1 into ringkey instead of a withColumnRenamed() for dfCOSsingle: the full-length key has to survive in ringkey before _c1 is truncated in place.
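A self-contained sketch of that pattern (the sample key value is invented for illustration): the copy keeps the full-length key in ringkey, whereas a plain rename would leave no _c1 column behind to truncate.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("copy-then-truncate").getOrCreate()
df = spark.createDataFrame([("8000004F3F3A54FFEADF8C00000000511470C070",)], ["_c1"])

# copy first, so the untruncated key survives as ringkey ...
df = df.withColumn("ringkey", df["_c1"])
# ... then truncate _c1 in place (drop the trailing 14 characters)
df = df.withColumn("_c1", F.expr("substring(_c1, 1, length(_c1)-14)"))
df.show(truncate=False)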
dfnew = rdd.flatMap(lambda x: x).toDF()

single = "%s://%s/%s/s3fsck/s3-dig-keys.csv" % (PROTOCOL, PATH, RING)
# write the dataframe to a csv file with a header
# output structure: (digkey, sproxyd input key, subkey if available)
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)
If we want correct headers: this write operation with header="true" is where we get our first instance of the generic _c0, _c1 column names being written out. We could perform:
dfnew = dfnew.withColumnRenamed("_c0", "digkey").withColumnRenamed("_c1", "input_key").withColumnRenamed("_c2", "subkey")
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)
Requires updating the p2 script to read the new column names instead of the generic ones.
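A hypothetical sketch of the matching p2-side change, assuming the file is written with the renamed header above (the path here is a stand-in for the real PROTOCOL/PATH/RING value):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("p2-read-sketch").getOrCreate()

single = "file:///tmp/s3fsck/s3-dig-keys.csv"  # illustrative path
# header="true" consumes the written header row, so named columns
# replace the generic _c0, _c1, ... references in later steps
dfp2 = spark.read.format("csv").option("header", "true").load(single)
dfp2 = dfp2.select("digkey", "input_key", "subkey")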
# e.g. 555555A4948FAA554034E155555555A61470C07A,8000004F3F3A54FFEADF8C00000000511470C070,g1disk1,0
# Required Fields:
# - _c1 (main chunk)
# - _c3 (FLAG)
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
This is another spot where we can inject valid headers prior to later commands, making them a bit simpler to comprehend:

df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
df = df.withColumnRenamed("_c0", "ringkey").withColumnRenamed("_c1", "mainchunk").withColumnRenamed("_c2", "disk").withColumnRenamed("_c3", "flag")
In this example I name _c0 (the ring chunk keys) ringkey, instead of naming _c1 (the main chunk) ringkey. I think this could reduce confusion if we decide to be very specific and use explicit terms for each data type:
- ringkey (or ring_key) # The 30-33 chunk keys and 70-7B chunk keys
- mainchunk (or main_chunk) # The 30 or 70 main chunk (aka zero keys)
- disk
- flag
- inputkey (or input_key) # The sproxyd input key
- digkey (or dig_key) # The md5sum dug from the main chunk
I like the underscore versions for better readability.
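An alternative sketch using the underscore names from the list above: an explicit schema applies them at read time, so no withColumnRenamed() chain is needed. The schema and path are assumptions for illustration, not the scripts' current behavior, and inferSchema becomes unnecessary.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

schema = StructType([
    StructField("ring_key", StringType()),    # the 30-33 and 70-7B chunk keys
    StructField("main_chunk", StringType()),  # the 30 or 70 main chunk (zero keys)
    StructField("disk", StringType()),
    StructField("flag", IntegerType()),
])

files = "file:///tmp/listkeys/*.csv"  # stand-in for the real input path
df = spark.read.format("csv").option("delimiter", ",").schema(schema).load(files)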
whatever's easier to read is fine by me
It might be useful to contribute those notebooks to the repository so we can improve upon them as needed. We might need to repeat this process for SOFS_FSCK.
There are a few spots marked with ???? where either I wanted to confirm the entire operation before adding a comment, or there is a description but I am not 100% confident in its accuracy and need to compare more current output from the scripts to be sure.