Support for special characters in ASCII format #225

Closed

tchaari opened this issue Dec 15, 2019 · 10 comments

tchaari commented Dec 15, 2019

I have some fixed-length COBOL files in ASCII format. They contain special French characters such as é, à, ç, ô...

When I read the file in CSV format using val df = spark.read.csv("path to file"), I can see the accents after a df.show():
001000000011951195 séjour 2019-11-09-00.01.02.276249

When I load the same file using Cobrix, the accents are replaced by three blanks:
001000000011951195 s jour 2019-11-09-00.01.02.276249

This is the code I used to load this data in Cobrix:
val df = spark
  .read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("is_record_sequence", "true")
  .option("encoding", "ascii")
  .load("../cobol_data/test_cobrix.txt")

Are there any options to load special characters correctly, please?

tchaari commented Dec 15, 2019

I think that the issue is coming from this line: https://github.com/AbsaOSS/cobrix/blob/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala#L71

All characters with an ASCII code higher than 127 are replaced by a blank. Are these higher ASCII codes causing issues when parsing EBCDIC or ASCII files?
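To illustrate, here is a minimal, hypothetical sketch of the masking behaviour described above (not the actual Cobrix code): every byte above 127 is replaced by a space, so an ISO-8859-1 'é' (byte 0xE9) comes out as a blank.

import java.nio.charset.StandardCharsets

object MaskingSketch extends App {
  // Hypothetical illustration: replace every byte above 127 with a space
  def decodeMasked(bytes: Array[Byte]): String =
    bytes.map(b => if ((b & 0xFF) > 127) ' ' else (b & 0xFF).toChar).mkString

  val bytes = "séjour".getBytes(StandardCharsets.ISO_8859_1)
  println(decodeMasked(bytes))                            // s jour
  println(new String(bytes, StandardCharsets.ISO_8859_1)) // séjour
}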

yruslan commented Dec 17, 2019

Thanks for the bug report. Will take a look.

yruslan commented Dec 17, 2019

Can confirm. So far I see no reason for high ASCII characters to be 'masked' (replaced by spaces).
We will remove this restriction and rerun our internal integration tests. The fix should be available in the next minor release.

Thanks again for such a detailed bug report!

yruslan self-assigned this Dec 17, 2019
yruslan added the accepted (Accepted for implementation) and bug (Something isn't working) labels Dec 17, 2019
yruslan added this to the 2.0.1 milestone Dec 17, 2019

yruslan commented Dec 18, 2019

Which charset is used for the ASCII data in your case?
Is it UTF-8 or ISO-8859-1, for instance?

I think strings will be treated as UTF-8 by default, but we will add an option to specify a charset.
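For illustration, a small hypothetical sketch of why the charset matters: the same raw bytes decode differently under ISO-8859-1 and UTF-8, because 0xE9 is 'é' in ISO-8859-1 but is not a valid standalone byte in UTF-8.

import java.nio.charset.StandardCharsets

object CharsetSketch extends App {
  // "séjour" encoded in ISO-8859-1: the accented character is the single byte 0xE9
  val raw = "séjour".getBytes(StandardCharsets.ISO_8859_1)
  println(new String(raw, StandardCharsets.ISO_8859_1)) // séjour
  println(new String(raw, StandardCharsets.UTF_8))      // s�jour (replacement character)
}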

tchaari commented Dec 18, 2019

Thanks for your quick feedback on this bug.

I think that an option is the most generic way to provide the charset. By the way, is it possible to also have that option when loading the ASCII file line by line, please?

Here is the calling code that illustrates the idea (where "charsetOption" is the charset value provided by the user):

val copybookContents = Files.readAllLines(Paths.get(copybookPath), StandardCharsets.ISO_8859_1).toArray.mkString("\n")
val parsedCopybook = CopybookParser.parseTree(ASCII(), copybookContents, false, Seq(), Map(), StringTrimmingPolicy.TrimRight, CommentPolicy.apply(), CodePage.getCodePageByName("common"), FloatingPointFormat(2), Seq())
val sparkSchema = new CobolSchema(parsedCopybook, SchemaRetentionPolicy.KeepOriginal, false).getSparkSchema
val spark = SparkSession
  .builder()
  .appName("Spark-Cobol Cobrix example")
  .getOrCreate()
val dfText = spark.read.option("encoding", charsetOption).csv(cobolFilePath)
val rddRow = dfText.rdd.map( line => {
  RowExtractors.extractRecord(parsedCopybook.ast, line.getString(0).getBytes(charsetOption), charsetOption)
})

yruslan commented Dec 18, 2019

Thanks for the code. It illustrates your use case very well.
Yes, we can add a charset option to RowExtractors.extractRecord(); it makes perfect sense in this case.

yruslan commented Dec 20, 2019

The design of spark-cobol does not allow specifying a charset per row; this was done for efficiency reasons. We can, however, do it for the whole copybook. As long as your CSV fields are encoded using the same charset, it should work for you. Approximately, your code will look like this (once the feature is implemented):

val copybookContents = Files.readAllLines(Paths.get(copybookPath), StandardCharsets.ISO_8859_1).toArray.mkString("\n")
val parsedCopybook = CopybookParser.parseTree(ASCII(), copybookContents, false, Seq(), Map(), StringTrimmingPolicy.TrimRight, CommentPolicy.apply(), CodePage.getCodePageByName("common"), charsetOption, FloatingPointFormat(2), Seq())
val sparkSchema = new CobolSchema(parsedCopybook, SchemaRetentionPolicy.KeepOriginal, false).getSparkSchema
val spark = SparkSession
  .builder()
  .appName("Spark-Cobol Cobrix example")
  .getOrCreate()
val dfText = spark.read.option("encoding", charsetOption).csv(cobolFilePath)
val rddRow = dfText.rdd.map( line => {
  RowExtractors.extractRecord(parsedCopybook.ast, line.getString(0).getBytes(charsetOption))
})

The charsetOption is added to parseTree() and removed from extractRecord().

tchaari commented Dec 20, 2019

Thanks for your valuable analysis and your positive feedback.

The example I am using doesn't aim to apply a charset per row. It only aims to apply the same charset to all the strings present in the COBOL file.

The only way I have found to get the correct characters (with accents) in the resulting data frame is to apply the charset when decoding the strings in StringDecoders.scala:

def decodeAsciiString(bytes: Array[Byte], trimmingType: Int): String = {
  val buf = new String(bytes, "iso-8859-1")
  if (trimmingType == TrimNone) {
    buf
  } else if (trimmingType == TrimLeft) {
    StringTools.trimLeft(buf)
  } else if (trimmingType == TrimRight) {
    StringTools.trimRight(buf)
  } else {
    buf.trim
  }
}

So if this decodeAsciiString function were given an extra charset option, it would resolve all the issues with the special characters, I think :)
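For example, a hypothetical sketch of that change (the asciiCharset parameter name is illustrative, not the actual Cobrix signature):

import java.nio.charset.Charset

def decodeAsciiString(bytes: Array[Byte], trimmingType: Int, asciiCharset: Charset): String = {
  // Decode using the caller-supplied charset instead of a hard-coded one
  val buf = new String(bytes, asciiCharset)
  if (trimmingType == TrimNone) {
    buf
  } else if (trimmingType == TrimLeft) {
    StringTools.trimLeft(buf)
  } else if (trimmingType == TrimRight) {
    StringTools.trimRight(buf)
  } else {
    buf.trim
  }
}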

yruslan commented Dec 20, 2019

Great, that is exactly what was implemented. The fix is available in 2.0.1.
You can now provide the ASCII charset to parseTree(), e.g. StandardCharsets.ISO_8859_1. The parameter goes right after the EBCDIC code page.
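Based on the snippet earlier in this thread, the updated parseTree() call would look roughly like this (a sketch; please check the exact parameter list of your version):

import java.nio.charset.StandardCharsets

val parsedCopybook = CopybookParser.parseTree(
  ASCII(),
  copybookContents,
  false,
  Seq(),
  Map(),
  StringTrimmingPolicy.TrimRight,
  CommentPolicy.apply(),
  CodePage.getCodePageByName("common"),
  StandardCharsets.ISO_8859_1, // the new ASCII charset parameter, right after the EBCDIC code page
  FloatingPointFormat(2),
  Seq())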

tchaari commented Dec 20, 2019

Big thanks, yruslan. I can confirm that version 2.0.1 resolved all my special character issues.
