Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU architectures to the build-info file #1840

Merged
merged 4 commits into from
Mar 7, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions build/build-info
jlowe marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ echo_build_properties() {
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)
Copy link
Contributor Author

@parthosa parthosa Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation:
This command searches for files libcudfjni.a or libcudf.a in the current directory, then runs cuobjdump on each file to extract architecture information and sorts the unique results numerically.

Result Format:
70 75 80 86 90

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this looking at both libcudfjni.a and libcudf.a? It should only be looking only at the libcudf artifact for the cudf properties, IMO, as that's where almost all the kernels are. If libcudfjni.a has an arch libcudf.a does not, we do not want to advertise it in the properties.

I don't see where this is handling the spark-rapids-jni kernels. Those are not in either libcudfjni.a or libcudf.a but rather in libcudf.so. The way this is written, cudf and spark-rapids-jni properties will always have the same property values. I'm OK with that, but it begs the question why the plugin would be checking both of them if they're known to be identical.

This should not be using find, it should be examining the known location where the artifact is placed before it is put into the jar artifact. find invites accidental errors where someone has a rogue binary somewhere in their repo (e.g.: temporary directory, whatever) and it accidentally adds an arch that is not in the final jar artifact.

Copy link
Collaborator

@gerashegalov gerashegalov Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlowe find is from my POC NVIDIA/spark-rapids#10540 (comment)

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build (including the nvcomp ones), and then compute the set intersection for the final device support check, can be done at build or run time.

Copy link
Contributor Author

@parthosa parthosa Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be using find, it should be examining the known location where the artifact is placed before it is

For cudf supported architectures, we should check ./thirdparty/cudf/cpp/build/libcudf.a.
For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going

We should probably use set intersection of each lib. Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so

They are the same, you can pick either path. The former is where the library is orginally built, the latter is where it's copied to pull it into the resulting jar.

Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

Theoretically yes but not easily the way it builds today. IMO we could easily overengineer this thing to solve a problem that doesn't need to be solved in practice. We could simply probe for the architectures on libcudf.so and call it a day. Is it 100% bulletproof? No, theoretically someone could do a very crazy thing to get that to not be correct because they went out of their way to somehow build libcudf.a, libcudfjni.a, and libcudf.so differently. But I've never seen this in practice and arguably YAGNI applies. That avoids the "where are all the places to look" problem and the need to do set intersections.

Copy link
Collaborator

@gerashegalov gerashegalov Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ever have a case when spark-rapids-jni has different architectures supported from cudf?

inconsistent build environment after upmerges, branch switches, other mistakes during custom dev build, bugs in a prod build plus things like nvcomp kernels built externally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed script delimiter to semi-colon and explicitly passed 'libcudf.a' and 'libcudf.so' as library files for cudf and spark-rapids-jni for processing by the build info script

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My perception from internal chats is that devs routinely runs into build situations that would be YAGNI, but we do not need to optimize for this 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor of the space-separated arch list. Java properties file format allows for the space being the separator between the key and the value portion of the line

jshell> var props = new java.util.Properties()
props ==> {}

jshell> props.load(new java.io.StringReader("gpu_architectures 70 75 80 86 90"))

jshell> props
props ==> {gpu_architectures=70 75 80 86 90}

I prefer a semicolon-separated value format without any white space in the value 70;75;80;86;90 similar to the first example https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES and this is what we typically use for -DGPU_ARCHS passed to Maven

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. This would keep it consistent with current usage.

for arg in "$@"; do
echo $arg
done
Expand Down
Loading