Ensure UTF-8 encoding for reading non-english characters #1211

Merged
merged 3 commits into from
Jul 23, 2024

Conversation

parthosa
Collaborator

@parthosa parthosa commented Jul 22, 2024

Fixes #1209. This PR fixes an issue where unit tests fail due to encoding errors when reading non-English characters.

Changes

  • Added a wrapper object, UTF8Source, around Source with the charset set to UTF-8.
  • Added a check in scalastyle_config.xml that flags any use of Source.from* methods.
  • Added a helper method, FSUtils.readFileContentAsUTF8(), that reads a file as UTF-8 and closes the underlying resources.
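
For readers who have not opened the diff, the wrapper and helper described above might look roughly like the sketch below. This is not the code merged in this PR: the method set on `UTF8Source`, the object name `FSUtilsSketch`, and the parameter names are assumptions made for illustration. Only `scala.io.Source`, `scala.io.Codec`, and their standard signatures are taken as given.

```scala
import java.io.{File, InputStream}
import scala.io.{BufferedSource, Codec, Source}

// Hypothetical sketch: every factory method pins the codec to UTF-8
// instead of relying on the JVM's platform default charset.
object UTF8Source {
  private val utf8: Codec = Codec.UTF8

  def fromFile(path: String): BufferedSource = Source.fromFile(path)(utf8)

  def fromFile(file: File): BufferedSource = Source.fromFile(file)(utf8)

  def fromInputStream(in: InputStream): BufferedSource =
    Source.fromInputStream(in)(utf8)
}

// Hypothetical stand-in for the FSUtils helper named in the PR description.
object FSUtilsSketch {
  // Reads the whole file as UTF-8 and always closes the source,
  // even if reading throws.
  def readFileContentAsUTF8(path: String): String = {
    val source = UTF8Source.fromFile(path)
    try source.mkString
    finally source.close()
  }
}
```

Pinning the codec explicitly matters because `Source.fromFile` takes an implicit `Codec`, which otherwise falls back to the platform default and can differ between a developer machine and a CI runner.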

Scalastyle Output

Verified the scalastyle check on the dev branch: it detected 28 occurrences of Source.from.

error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/util/ToolUtilsSuite.scala message=Use UTF8Source.from instead of Source.from line=234 column=37
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/util/ToolUtilsSuite.scala message=Use UTF8Source.from instead of Source.from line=235 column=37
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/GenerateTimelineSuite.scala message=Use UTF8Source.from instead of Source.from line=74 column=23
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/GenerateDotSuite.scala message=Use UTF8Source.from instead of Source.from line=74 column=23
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=291 column=24
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=328 column=24
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=341 column=30
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=380 column=24
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=390 column=30
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=416 column=24
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=450 column=24
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=498 column=27
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=1249 column=28
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=1389 column=19
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala message=Use UTF8Source.from instead of Source.from line=1558 column=28
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/test/scala/com/nvidia/spark/rapids/tool/planparser/SqlPlanParserSuite.scala message=Use UTF8Source.from instead of Source.from line=121 column=27
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/util/RapidsToolsConfUtil.scala message=Use UTF8Source.from instead of Source.from line=98 column=19
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala message=Use UTF8Source.from instead of Source.from line=264 column=14
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/DriverLogProcessor.scala message=Use UTF8Source.from instead of Source.from line=43 column=17
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=110 column=17
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=117 column=17
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=129 column=23
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=136 column=25
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=155 column=17
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=160 column=17
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=168 column=22
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=170 column=22
error file=/Users/psarthi/Work/spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/PluginTypeChecker.scala message=Use UTF8Source.from instead of Source.from line=178 column=17
Processed 164 file(s)
Found 28 errors

Testing

Tested the changes in a CI/CD job.

Note

  • The scalastyle check is intentionally broad: it matches Source.from rather than specific methods (e.g. Source.fromFile), so that any other method used in the future (e.g. Source.fromURI) is caught automatically.
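
To illustrate why a broad pattern covers future methods, here is a small sketch of a prefix match on `Source.from`. The regular expression is an assumption made for this example and is not copied from scalastyle_config.xml; the object and method names are hypothetical.

```scala
import scala.util.matching.Regex

object SourceFromCheckSketch {
  // Assumed pattern: "Source.from" preceded by a word boundary, so that
  // "UTF8Source.from..." is not flagged while any "Source.from..." call is.
  val banned: Regex = """\bSource\.from""".r

  // Returns true when the line contains a disallowed Source.from* call.
  def violates(line: String): Boolean = banned.findFirstIn(line).isDefined
}
```

With this pattern, `Source.fromFile(path)` and `Source.fromURI(uri)` are both reported, while `UTF8Source.fromFile(path)` passes because there is no word boundary between `UTF8` and `Source`.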

Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa added bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Jul 22, 2024
@parthosa parthosa self-assigned this Jul 22, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa added core_tools Scope the core module (scala) and removed user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Jul 22, 2024
@parthosa parthosa requested a review from amahussein July 22, 2024 23:44
Collaborator

@amahussein amahussein left a comment

Thanks @parthosa
LGTME

@parthosa parthosa changed the title Ensure UTF-8 encoding for reading non-english characters in unit tests Ensure UTF-8 encoding for reading non-english characters Jul 23, 2024
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @parthosa!

@parthosa parthosa merged commit b2fc2f3 into NVIDIA:dev Jul 23, 2024
14 checks passed
@parthosa parthosa deleted the spark-rapids-tools-1209 branch July 23, 2024 21:01
Labels
bug Something isn't working core_tools Scope the core module (scala)
Development

Successfully merging this pull request may close these issues.

[BUG] Unit test "output non-english characters" failed in pipeline
3 participants