Support for special characters in ASCII format #225

Closed

tchaari opened this issue Dec 15, 2019 · 10 comments

tchaari commented Dec 15, 2019

I have some fixed-length COBOL files in ASCII format. They contain special French characters such as é, à, ç, ô...

When I read the file in CSV format using val df = spark.read.csv("path to file"), I can see the accents after a df.show():
001000000011951195 séjour 2019-11-09-00.01.02.276249

When I load the same file using Cobrix, the accents are replaced by three blanks:
001000000011951195 s jour 2019-11-09-00.01.02.276249

This is the code I used to load this data in Cobrix:
val df = spark
  .read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("is_record_sequence", "true")
  .option("encoding", "ascii")
  .load("../cobol_data/test_cobrix.txt")

Are there any options to load special characters correctly, please?

tchaari commented Dec 15, 2019

I think that the issue is coming from this line: https://github.com/AbsaOSS/cobrix/blob/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala#L71

All characters with an ASCII code higher than 127 are replaced by a blank. Are these higher ASCII codes causing issues when parsing EBCDIC or ASCII files?
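To illustrate, here is a minimal, hypothetical sketch of the masking behaviour described above (not the actual Cobrix code): every byte above 127 is replaced by a space, so an ISO-8859-1 'é' (byte 0xE9) comes out as a blank.

import java.nio.charset.StandardCharsets

object MaskingSketch extends App {
  // Hypothetical illustration: replace every byte above 127 with a space
  def decodeMasked(bytes: Array[Byte]): String =
    bytes.map(b => if ((b & 0xFF) > 127) ' ' else (b & 0xFF).toChar).mkString

  val bytes = "séjour".getBytes(StandardCharsets.ISO_8859_1)
  println(decodeMasked(bytes))                            // s jour
  println(new String(bytes, StandardCharsets.ISO_8859_1)) // séjour
}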

yruslan commented Dec 17, 2019

Thanks for the bug report. Will take a look.

yruslan commented Dec 17, 2019

Can confirm. So far I see no reason for high ASCII characters to be 'masked' (replaced by spaces).
We will remove this restriction and rerun our internal integration tests. The fix should be available in the next minor release.

Thanks again for such a detailed bug report!

yruslan self-assigned this Dec 17, 2019
yruslan added the accepted (Accepted for implementation) and bug (Something isn't working) labels Dec 17, 2019
yruslan added this to the 2.0.1 milestone Dec 17, 2019

yruslan commented Dec 18, 2019

Which charset is used for the ASCII data in your case?
Is it UTF-8 or ISO-8859-1, for instance?

I think strings will be treated as UTF-8 by default, but we will add an option to specify a charset.
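For illustration, a small hypothetical sketch of why the charset matters: the same raw bytes decode differently under ISO-8859-1 and UTF-8, because 0xE9 is 'é' in ISO-8859-1 but is not a valid standalone byte in UTF-8.

import java.nio.charset.StandardCharsets

object CharsetSketch extends App {
  // "séjour" encoded in ISO-8859-1: the accented character is the single byte 0xE9
  val raw = "séjour".getBytes(StandardCharsets.ISO_8859_1)
  println(new String(raw, StandardCharsets.ISO_8859_1)) // séjour
  println(new String(raw, StandardCharsets.UTF_8))      // s�jour (replacement character)
}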

tchaari commented Dec 18, 2019

Thanks for your quick feedback on this bug.

I think that an option is the most generic way to provide the charset. By the way, is it possible to also have that option when loading the ASCII file line by line, please?

Here is the calling code that illustrates the idea (where "charsetOption" is the charset value provided by the user):

val copybookContents = Files.readAllLines(Paths.get(copybookPath), StandardCharsets.ISO_8859_1).toArray.mkString("\n")
val parsedCopybook = CopybookParser.parseTree(ASCII(), copybookContents, false, Seq(), Map(), StringTrimmingPolicy.TrimRight, CommentPolicy.apply(), CodePage.getCodePageByName("common"), FloatingPointFormat(2), Seq())
val sparkSchema = new CobolSchema(parsedCopybook, SchemaRetentionPolicy.KeepOriginal, false).getSparkSchema
val spark = SparkSession
  .builder()
  .appName("Spark-Cobol Cobrix example")
  .getOrCreate()
val dfText = spark.read.option("encoding", charsetOption).csv(cobolFilePath)
val rddRow = dfText.rdd.map( line => {
  RowExtractors.extractRecord(parsedCopybook.ast, line.getString(0).getBytes(charsetOption), charsetOption)
})

yruslan commented Dec 18, 2019

Thanks for the code. It illustrates your use case very well.
Yes, we can add a charset option to RowExtractors.extractRecord(); it makes perfect sense in this case.

yruslan commented Dec 20, 2019

The design of spark-cobol does not allow specifying a charset per row; this was done for efficiency reasons. We can, however, do it for the whole copybook. As long as your CSV fields are encoded using the same charset, it should work for you. Approximately, your code will look like this (once the feature is implemented):

val copybookContents = Files.readAllLines(Paths.get(copybookPath), StandardCharsets.ISO_8859_1).toArray.mkString("\n")
val parsedCopybook = CopybookParser.parseTree(ASCII(), copybookContents, false, Seq(), Map(), StringTrimmingPolicy.TrimRight, CommentPolicy.apply(), CodePage.getCodePageByName("common"), charsetOption, FloatingPointFormat(2), Seq())
val sparkSchema = new CobolSchema(parsedCopybook, SchemaRetentionPolicy.KeepOriginal, false).getSparkSchema
val spark = SparkSession
  .builder()
  .appName("Spark-Cobol Cobrix example")
  .getOrCreate()
val dfText = spark.read.option("encoding", charsetOption).csv(cobolFilePath)
val rddRow = dfText.rdd.map( line => {
  RowExtractors.extractRecord(parsedCopybook.ast, line.getString(0).getBytes(charsetOption))
})

The charsetOption is added to parseTree() and removed from extractRecord().

tchaari commented Dec 20, 2019

Thanks for your valuable analysis and your positive feedback.

The example I am using doesn't aim to apply a charset per row. It only aims to apply the same charset to all the strings present in the COBOL file.

The only way I have found to get the correct characters (with accents) in the resulting data frame is to apply the charset when decoding the strings in StringDecoders.scala:

def decodeAsciiString(bytes: Array[Byte], trimmingType: Int): String = {
  val buf = new String(bytes, "iso-8859-1")
  if (trimmingType == TrimNone) {
    buf
  } else if (trimmingType == TrimLeft) {
    StringTools.trimLeft(buf)
  } else if (trimmingType == TrimRight) {
    StringTools.trimRight(buf)
  } else {
    buf.trim
  }
}

So if this decodeAsciiString function were given an extra charset option, it would resolve all the issues with the special characters, I think :)
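For example, a hypothetical sketch of that change (the asciiCharset parameter name is illustrative, not the actual Cobrix signature):

import java.nio.charset.Charset

def decodeAsciiString(bytes: Array[Byte], trimmingType: Int, asciiCharset: Charset): String = {
  // Decode using the caller-supplied charset instead of a hard-coded one
  val buf = new String(bytes, asciiCharset)
  if (trimmingType == TrimNone) {
    buf
  } else if (trimmingType == TrimLeft) {
    StringTools.trimLeft(buf)
  } else if (trimmingType == TrimRight) {
    StringTools.trimRight(buf)
  } else {
    buf.trim
  }
}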

yruslan commented Dec 20, 2019

Great, that is exactly what was implemented. The fix is available in 2.0.1.
You can now provide the ASCII charset to parseTree(), e.g. StandardCharsets.ISO_8859_1. The parameter goes right after the EBCDIC code page.
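Based on the snippet earlier in this thread, the updated parseTree() call would look roughly like this (a sketch; please check the exact parameter list of your version):

import java.nio.charset.StandardCharsets

val parsedCopybook = CopybookParser.parseTree(
  ASCII(),
  copybookContents,
  false,
  Seq(),
  Map(),
  StringTrimmingPolicy.TrimRight,
  CommentPolicy.apply(),
  CodePage.getCodePageByName("common"),
  StandardCharsets.ISO_8859_1, // the new ASCII charset parameter, right after the EBCDIC code page
  FloatingPointFormat(2),
  Seq())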

tchaari commented Dec 20, 2019

Big thanks, yruslan. I can confirm that version 2.0.1 resolved all my special character issues.
