Support for special characters in ASCII format #225
Comments
I think the issue is coming from this line: https://github.com/AbsaOSS/cobrix/blob/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala#L71
All the characters having an ASCII code higher than 127 are replaced by a blank. Are these higher ASCII codes causing issues when parsing EBCDIC or ASCII files?
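For context, here is a minimal sketch of the masking behaviour described above (illustrative only, not the actual Cobrix decoder): any byte outside the 7-bit ASCII range is replaced with a blank, which is what makes accented characters disappear.

```scala
// Illustrative sketch: mimic the masking described in the linked line, where any
// byte above 127 is replaced with a space instead of being decoded.
def decodeSevenBitAscii(bytes: Array[Byte]): String =
  bytes.map { b =>
    val c = (b & 0xFF).toChar
    if (c > 127) ' ' else c
  }.mkString

// "séjour" encoded as ISO-8859-1: the 'é' byte (0xE9) is masked, giving "s jour".
decodeSevenBitAscii("séjour".getBytes("ISO-8859-1"))
```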
Thanks for the bug report. Will take a look.
Can confirm. So far I see no reason for high ASCII characters being 'masked' (replaced by spaces). Thanks again for such a detailed bug report!
Which charset is used for the ASCII data in your case? By default strings will be treated as US-ASCII.
Thanks for your quick feedback on this bug. I think that an option is the most generic solution for providing the charset. By the way, is it possible to also have that option when loading the ASCII file line by line, please? Here is the calling code that illustrates the idea (where "charsetOption" is the value of the charset provided by the user):
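(A minimal sketch of what such calling code might look like, assuming the line-by-line use case means reading fixed-length records with Spark and decoding them with the user-supplied charset; apart from the name charsetOption, the values below are illustrative, not the reporter's actual code.)

```scala
import java.nio.charset.Charset

// Charset supplied by the user (illustrative value) and an assumed record length.
val charsetOption = "ISO-8859-1"
val recordLength = 100

// Read the file as raw fixed-length byte records, then decode each record with
// the provided charset so bytes above 127 (é, à, ç, ô, ...) are preserved.
val records = spark.sparkContext
  .binaryRecords("../cobol_data/test_cobrix.txt", recordLength)
  .map(bytes => new String(bytes, Charset.forName(charsetOption)))

records.take(5).foreach(println)
```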
Thanks for the code. It illustrates your use case very well.
The design of spark-cobol does not allow specifying a charset per row; it was done that way for efficiency purposes. We can do it for the whole copybook. As long as your CSV fields are encoded using the same charset, it should work for you. Approximately, your code will look like this (when the feature is implemented):
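(As a sketch of the idea, assuming the charset is exposed as a read option; the option name ascii_charset and the ISO-8859-1 value are assumptions to be checked against the documentation of the release that ships the feature.)

```scala
val df = spark
  .read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("is_record_sequence", "true")
  .option("encoding", "ascii")
  // Assumed option name: applies one charset to all ASCII string fields of the copybook.
  .option("ascii_charset", "ISO-8859-1")
  .load("../cobol_data/test_cobrix.txt")
```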
Thanks for your valuable analysis and your positive feedback. The example I am using doesn't aim to apply a charset per row; it only aims to apply the same charset to all the strings that are present in the COBOL file. The only way I have found to get the correct characters (with accents) in the resulting data frame is to apply the charset when decoding the strings in StringDecoders.scala, i.e. in def decodeAsciiString(bytes: Array[Byte], trimmingType: Int): String. So if this decodeAsciiString function is given an extra charset option, it would resolve all the issues with the special characters, I think :)
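(A sketch of that idea, assuming the decoder simply gains a charset parameter; the trimming logic is reduced to a plain trim here, whereas the real decoder distinguishes several trimming types.)

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Sketch only: decode the field bytes with a caller-provided charset instead of
// masking bytes above 127, then apply simplified trimming.
def decodeAsciiString(bytes: Array[Byte],
                      trimmingType: Int,
                      charset: Charset = StandardCharsets.US_ASCII): String = {
  val str = new String(bytes, charset)
  if (trimmingType == 0) str else str.trim // 0 assumed to mean "no trimming"
}
```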
Great, that is exactly what was implemented. The fix is available in version 2.0.1.
Big thanks, yruslan. I can confirm that version 2.0.1 resolved all my special character issues.
I have some fixed-length COBOL files in ASCII format. They contain special French characters such as é, à, ç, ô...
When I read the file in CSV format using val df = spark.read.csv("path to file"), I can see the accents after a df.show():
001000000011951195 séjour 2019-11-09-00.01.02.276249
When I load the same file using Cobrix, the accents are replaced by three blanks:
001000000011951195 s jour 2019-11-09-00.01.02.276249
This is the code I used to load this data with Cobrix:
val df = spark
.read
.format("cobol")
.option("copybook_contents", copybook)
.option("is_record_sequence", "true")
.option("encoding", "ascii")
.load("../cobol_data/test_cobrix.txt")
Are there any options to correctly load special characters please?