#481 Fix ASCII control characters handling policy. Add 'keep_all' string trimming policy.
yruslan committed Mar 24, 2022
1 parent 746a199 commit e4006e3
Showing 12 changed files with 199 additions and 92 deletions.
24 changes: 14 additions & 10 deletions README.md
@@ -1203,17 +1203,17 @@ Again, the full example is available at

##### Data parsing options

| Option (usage example) | Description |
| ------------------------------------------ |:----------------------------------------------------------------------------- |
| .option("string_trimming_policy", "both") | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`. |
| .option("ebcdic_code_page", "common") | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, `cp875`. `*_extended` code pages supports non-printable characters that converts to ASCII codes below 32. |
| .option("ebcdic_code_page_class", "full.class.specifier") | Specifies a user provided class for a custom code page to UNICODE conversion. |
| .option("ascii_charset", "US-ASCII") | Specifies a charset to use to decode ASCII data. The value can be any charset supported by `java.nio.charset`: `US-ASCII` (default), `UTF-8`, `ISO-8859-1`, etc. |
| .option("is_utf16_big_endian", "true") | Specifies if UTF-16 encoded strings (`National` / `PIC N` format) are big-endian (default). |
| .option("floating_point_format", "IBM") | Specifies a floating-point format. Available options: `IBM` (default), `IEEE754`, `IBM_little_endian`, `IEEE754_little_endian`. |
| Option (usage example) | Description |
| ------------------------------------------ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| .option("string_trimming_policy", "both") | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`, `keep_all`. `keep_all` - keeps control characters when decoding ASCII text files |
| .option("ebcdic_code_page", "common") | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, `cp875`. `*_extended` code pages supports non-printable characters that converts to ASCII codes below 32. |
| .option("ebcdic_code_page_class", "full.class.specifier") | Specifies a user provided class for a custom code page to UNICODE conversion. |
| .option("ascii_charset", "US-ASCII") | Specifies a charset to use to decode ASCII data. The value can be any charset supported by `java.nio.charset`: `US-ASCII` (default), `UTF-8`, `ISO-8859-1`, etc. |
| .option("is_utf16_big_endian", "true") | Specifies if UTF-16 encoded strings (`National` / `PIC N` format) are big-endian (default). |
| .option("floating_point_format", "IBM") | Specifies a floating-point format. Available options: `IBM` (default), `IEEE754`, `IBM_little_endian`, `IEEE754_little_endian`. |
| .option("variable_size_occurs", "false") | If `false` (default) fields that have `OCCURS 0 TO 100 TIMES DEPENDING ON` clauses always have the same size corresponding to the maximum array size (e.g. 100 in this example). If set to `true` the size of the field will shrink for each field that has less actual elements. |
| .option("occurs_mapping", "{\"FIELD\": {\"X\": 1}}") | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping. |
| .option("improved_null_detection", "false") | If `true`, values that contain only 0x0 ror DISPLAY strings and numbers will be considered `null`s instead of empty strings. |
| .option("occurs_mapping", "{\"FIELD\": {\"X\": 1}}") | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping. |
| .option("improved_null_detection", "false") | If `true`, values that contain only 0x0 ror DISPLAY strings and numbers will be considered `null`s instead of empty strings. |

##### Modifier options

@@ -1398,6 +1398,10 @@ at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:608)
A: Update hadoop dll to version 3.2.2 or newer.

## Changelog
- #### 2.4.10 will be released soon.
- [#481](https://github.com/AbsaOSS/cobrix/issues/481) ASCII control characters are now ignored instead of being replaced with spaces.
A new string trimming policy (`keep_all`) allows keeping all control characters in strings (including `0x00`).

- #### 2.4.9 released 4 March 2022.
- [#474](https://github.com/AbsaOSS/cobrix/issues/474) Fix numeric decoder of unsigned DISPLAY format. The decoder made more strict and does not allow sign
overpunching for unsigned numbers.
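
The behaviour described in the #481 changelog entry can be sketched in isolation. The snippet below is a simplified illustration modelled on the filtering loop this commit changes in `AsciiStringDecoderWrapper`; the object name `ControlCharSketch`, the function `decodeAscii`, and the `keepAll` flag are hypothetical and not part of Cobrix, and trimming is left out.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration only; not the Cobrix implementation.
object ControlCharSketch {
  // Drops ASCII control bytes (0x00 to 0x1F) unless keepAll is set.
  // Bytes with the high bit set are negative in Scala and are always kept.
  def decodeAscii(bytes: Array[Byte], keepAll: Boolean): String = {
    val kept = bytes.filter(b => keepAll || b >= 32 || b < 0)
    new String(kept, StandardCharsets.US_ASCII)
  }
}
```

For instance, `ControlCharSketch.decodeAscii(Array[Byte](1, 65, 8, 66), keepAll = false)` yields `"AB"`, whereas before this change the control bytes would have been replaced with spaces.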
@@ -50,16 +50,15 @@ class AsciiStringDecoderWrapper(trimmingType: Int, asciiCharsetName: String, imp
// Filter out all special characters
val buf = new ArrayBuffer[Byte](bytes.length)
while (i < bytes.length) {
if (bytes(i) >= 0 && bytes(i) < 32 /* Special characters are masked */ )
buf.append(32)
else
if (trimmingType == KeepAll || bytes(i) >= 32 || bytes(i) < 0) {
buf.append(bytes(i))
}
i = i + 1
}

val str = new String(buf.toArray, charset)

if (trimmingType == TrimNone) {
if (trimmingType == TrimNone || trimmingType == KeepAll) {
str
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(str)
@@ -99,6 +99,7 @@ object DecoderSelector {
case TrimLeft => StringDecoders.TrimLeft
case TrimRight => StringDecoders.TrimRight
case TrimBoth => StringDecoders.TrimBoth
case KeepAll => StringDecoders.KeepAll
}
}

@@ -31,6 +31,7 @@ object StringDecoders {
val TrimLeft = 2
val TrimRight = 3
val TrimBoth = 4
val KeepAll = 5

// Characters used for HEX conversion
private val HEX_ARRAY = "0123456789ABCDEF".toCharArray
@@ -55,7 +56,7 @@ object StringDecoders {
i = i + 1
}

if (trimmingType == TrimNone) {
if (trimmingType == TrimNone || trimmingType == KeepAll ) {
buf.toString
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(buf.toString)
@@ -81,13 +82,15 @@ object StringDecoders {
var i = 0
val buf = new StringBuffer(bytes.length)
while (i < bytes.length) {
if (bytes(i) < 32 /*Special and high order characters are masked*/ )
buf.append(' ')
else
if (trimmingType == KeepAll || bytes(i) >= 32) {
buf.append(bytes(i).toChar)
} else if (bytes(i) < 0) {
buf.append(' ')
}
i = i + 1
}
if (trimmingType == TrimNone) {

if (trimmingType == TrimNone || trimmingType == KeepAll) {
buf.toString
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(buf.toString)
@@ -116,7 +119,7 @@ object StringDecoders {
new String(bytes, StandardCharsets.UTF_16LE)
}

if (trimmingType == TrimNone) {
if (trimmingType == TrimNone || trimmingType == KeepAll) {
utf16Str
} else if (trimmingType == TrimLeft) {
StringTools.trimLeft(utf16Str)
@@ -19,7 +19,7 @@ package za.co.absa.cobrix.cobol.parser.policies
object StringTrimmingPolicy extends Enumeration {
type StringTrimmingPolicy = Value

val TrimNone, TrimLeft, TrimRight, TrimBoth = Value
val TrimNone, TrimLeft, TrimRight, TrimBoth, KeepAll = Value

def withNameOpt(s: String): Option[Value] = {
val exactNames = values.find(_.toString == s)
@@ -33,6 +33,8 @@ object StringTrimmingPolicy extends Enumeration {
Some(TrimRight)
} else if (sLowerCase == "both") {
Some(TrimBoth)
} else if (sLowerCase == "keep_all") {
Some(KeepAll)
} else {
None
}
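
The option-name lookup extended in this hunk can be exercised on its own roughly as follows. This is a minimal sketch, not the project's code: it uses a sealed trait instead of the `Enumeration` shown above, it skips the exact-name match that `withNameOpt` tries first, and the names `TrimmingPolicySketch`, `Policy`, and `parse` are hypothetical. The accepted strings are the ones listed in the README table.

```scala
// Hypothetical stand-in for StringTrimmingPolicy; for illustration only.
object TrimmingPolicySketch {
  sealed trait Policy
  case object TrimNone extends Policy
  case object TrimLeft extends Policy
  case object TrimRight extends Policy
  case object TrimBoth extends Policy
  case object KeepAll extends Policy

  // Case-insensitive lookup of the option value; unknown names give None.
  def parse(s: String): Option[Policy] = s.toLowerCase match {
    case "none"     => Some(TrimNone)
    case "left"     => Some(TrimLeft)
    case "right"    => Some(TrimRight)
    case "both"     => Some(TrimBoth)
    case "keep_all" => Some(KeepAll)
    case _          => None
  }
}
```

Returning `Option` rather than throwing leaves it to the caller to decide how to report an unsupported policy name.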
@@ -56,7 +56,7 @@ class AsciiStringDecoderWrapperSpec extends WordSpec {
val str = "\u0001\u0005A\u0008\u0010B\u0015\u001F"
val decoder = new AsciiStringDecoderWrapper(TrimNone, "ASCII", false)

assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == " A B ")
assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == "AB")
}

"support left trimming" in {
@@ -81,7 +81,14 @@ class AsciiStringDecoderWrapperSpec extends WordSpec {
val str = "\u0002\u0004A\u0007\u000FB\u0014\u001E"
val decoder = new AsciiStringDecoderWrapper(TrimBoth, "ASCII", false)

assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == "A B")
assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == "AB")
}

"be able to decode strings when keep_all is the trimming policy" in {
val str = "\u0002\u0004A\u0007\u000FB\u0014\u001E"
val decoder = new AsciiStringDecoderWrapper(KeepAll, "ASCII", false)

assert(decoder(str.getBytes(StandardCharsets.UTF_8)) == str)
}

"be serializable and deserializable" in {
6 changes: 3 additions & 3 deletions data/test17_expected/test17d.txt
@@ -1,6 +1,6 @@
{"File_Id":0,"Record_Id":2,"SEGMENT_ID":"C","COMPANY_ID":"9377942526","STATIC_DETAILS":{"COMPANY_NAME":"Joan Q & Z","ADDRESS":"10 Sandton, Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"92714306","TAXPAYER_NUM":959592241},"CONTACTS":[{"PHONE_NUMBER":"+(277) 944 44 55","CONTACT_PERSON":"Janiece Newcombe"}]}}
{"File_Id":0,"Record_Id":6,"SEGMENT_ID":"C","COMPANY_ID":"3483483977","STATIC_DETAILS":{"COMPANY_NAME":"Robotrd Inc.","ADDRESS":"2 Park ave., Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"","TAXPAYER_NUM":31195396},"CONTACTS":[{"PHONE_NUMBER":"+(174) 970 97 54","CONTACT_PERSON":"Tyesha Debow"},{"PHONE_NUMBER":"+(848) 832 61 68","CONTACT_PERSON":"Mindy Celestin"},{"PHONE_NUMBER":"+(455) 184 13 39","CONTACT_PERSON":"Mabelle Winburn"}]}}
{"File_Id":0,"Record_Id":7,"SEGMENT_ID":"C","COMPANY_ID":"7540764401","STATIC_DETAILS":{"COMPANY_NAME":"Eqartion Inc.","ADDRESS":"871A Forest ave., Toronto","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"6 H","TAXPAYER_NUM":87432264},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":7,"SEGMENT_ID":"C","COMPANY_ID":"7540764401","STATIC_DETAILS":{"COMPANY_NAME":"Eqartion Inc.","ADDRESS":"871A Forest ave., Toronto","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"6H","TAXPAYER_NUM":87432264},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":8,"SEGMENT_ID":"C","COMPANY_ID":"4413124035","STATIC_DETAILS":{"COMPANY_NAME":"Xingzhoug","ADDRESS":"74 Qing ave., Beijing","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"2f","TAXPAYER_NUM":50803302},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":12,"SEGMENT_ID":"C","COMPANY_ID":"9546291887","STATIC_DETAILS":{"COMPANY_NAME":"ZjkLPj","ADDRESS":"5574, Tokyo","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"73538919","TAXPAYER_NUM":926102835},"CONTACTS":[{"PHONE_NUMBER":"+(300) 252 33 17","CONTACT_PERSON":"Carrie Celestin"},{"PHONE_NUMBER":"+(907) 101 70 64","CONTACT_PERSON":"Edyth Deveau"},{"PHONE_NUMBER":"+(694) 918 17 44","CONTACT_PERSON":"Jene Norgard"}]}}
{"File_Id":0,"Record_Id":15,"SEGMENT_ID":"C","COMPANY_ID":"9168453994","STATIC_DETAILS":{"COMPANY_NAME":"Test Bank","ADDRESS":"1 Garden str., London","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"82573513","TAXPAYER_NUM":942814519},"CONTACTS":[{"PHONE_NUMBER":"+(768) 691 44 85","CONTACT_PERSON":"Timika Bourke"},{"PHONE_NUMBER":"+(695) 918 33 16","CONTACT_PERSON":"Lynell Riojas"}]}}
@@ -49,12 +49,12 @@
{"File_Id":0,"Record_Id":155,"SEGMENT_ID":"C","COMPANY_ID":"9898799886","STATIC_DETAILS":{"COMPANY_NAME":"Joan Q & Z","ADDRESS":"10 Sandton, Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"i","TAXPAYER_NUM":93022636},"CONTACTS":[{"PHONE_NUMBER":"+(576) 960 82 65","CONTACT_PERSON":"Carrie Maxim"},{"PHONE_NUMBER":"+(211) 823 44 73","CONTACT_PERSON":"Carrie Batman"},{"PHONE_NUMBER":"+(121) 202 45 80","CONTACT_PERSON":"Cliff Gagliano"},{"PHONE_NUMBER":"+(675) 313 76 46","CONTACT_PERSON":"Gabriele Hisle"}]}}
{"File_Id":0,"Record_Id":159,"SEGMENT_ID":"C","COMPANY_ID":"1542972569","STATIC_DETAILS":{"COMPANY_NAME":"ZjkLPj","ADDRESS":"5574, Tokyo","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"62949671","TAXPAYER_NUM":909261108},"CONTACTS":[{"PHONE_NUMBER":"+(759) 249 16 51","CONTACT_PERSON":"Estelle Thorpe"},{"PHONE_NUMBER":"+(66) 307 32 55","CONTACT_PERSON":"Cliff Deveau"},{"PHONE_NUMBER":"+(710) 445 38 90","CONTACT_PERSON":"Sulema Debow"}]}}
{"File_Id":0,"Record_Id":162,"SEGMENT_ID":"C","COMPANY_ID":"5492257935","STATIC_DETAILS":{"COMPANY_NAME":"ECSRONO","ADDRESS":"123/B Prome str., Denver","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"_","TAXPAYER_NUM":67540319},"CONTACTS":[{"PHONE_NUMBER":"+(168) 809 90 63","CONTACT_PERSON":"Alona Celestin"},{"PHONE_NUMBER":"+(845) 120 90 31","CONTACT_PERSON":"Estelle Flatt"}]}}
{"File_Id":0,"Record_Id":167,"SEGMENT_ID":"C","COMPANY_ID":"2366383436","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"\" Z","TAXPAYER_NUM":35788122},"CONTACTS":[{"PHONE_NUMBER":"+(515) 716 22 11","CONTACT_PERSON":"Alona Shapiro"},{"PHONE_NUMBER":"+(649) 897 62 54","CONTACT_PERSON":"Wilbert Tumlin"},{"PHONE_NUMBER":"+(180) 179 20 17","CONTACT_PERSON":"Deshawn Thorpe"},{"PHONE_NUMBER":"+(12) 730 88 41","CONTACT_PERSON":"Sulema Batman"}]}}
{"File_Id":0,"Record_Id":167,"SEGMENT_ID":"C","COMPANY_ID":"2366383436","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"\"Z","TAXPAYER_NUM":35788122},"CONTACTS":[{"PHONE_NUMBER":"+(515) 716 22 11","CONTACT_PERSON":"Alona Shapiro"},{"PHONE_NUMBER":"+(649) 897 62 54","CONTACT_PERSON":"Wilbert Tumlin"},{"PHONE_NUMBER":"+(180) 179 20 17","CONTACT_PERSON":"Deshawn Thorpe"},{"PHONE_NUMBER":"+(12) 730 88 41","CONTACT_PERSON":"Sulema Batman"}]}}
{"File_Id":0,"Record_Id":171,"SEGMENT_ID":"C","COMPANY_ID":"3002677167","STATIC_DETAILS":{"COMPANY_NAME":"ABCD Ltd.","ADDRESS":"74 Lawn ave., New York","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"` I","TAXPAYER_NUM":56661321},"CONTACTS":[{"PHONE_NUMBER":"+(372) 400 84 96","CONTACT_PERSON":"Eliana Godfrey"},{"PHONE_NUMBER":"+(128) 167 19 48","CONTACT_PERSON":"Suk Debow"},{"PHONE_NUMBER":"+(824) 681 73 76","CONTACT_PERSON":"Wilbert Mork"}]}}
{"File_Id":0,"Record_Id":176,"SEGMENT_ID":"C","COMPANY_ID":"3086612212","STATIC_DETAILS":{"COMPANY_NAME":"ECSRONO","ADDRESS":"123/B Prome str., Denver","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"78635498","TAXPAYER_NUM":926430771},"CONTACTS":[{"PHONE_NUMBER":"+(272) 831 90 52","CONTACT_PERSON":"Otelia Benally"},{"PHONE_NUMBER":"+(816) 337 55 41","CONTACT_PERSON":"Mindy Boehme"},{"PHONE_NUMBER":"+(508) 154 21 13","CONTACT_PERSON":"Timika Sauve"},{"PHONE_NUMBER":"+(335) 303 80 26","CONTACT_PERSON":"Timika Flatt"}]}}
{"File_Id":0,"Record_Id":180,"SEGMENT_ID":"C","COMPANY_ID":"1600426180","STATIC_DETAILS":{"COMPANY_NAME":"Pear GMBH.","ADDRESS":"107 Labe str., Berlin","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"E","TAXPAYER_NUM":43447109},"CONTACTS":[{"PHONE_NUMBER":"+(768) 461 89 92","CONTACT_PERSON":"Cliff Debow"},{"PHONE_NUMBER":"+(395) 386 85 35","CONTACT_PERSON":"Gabriele Deveau"},{"PHONE_NUMBER":"+(267) 618 38 57","CONTACT_PERSON":"Deshawn Bourke"}]}}
{"File_Id":0,"Record_Id":184,"SEGMENT_ID":"C","COMPANY_ID":"6926861847","STATIC_DETAILS":{"COMPANY_NAME":"Xingzhoug","ADDRESS":"74 Qing ave., Beijing","TAXPAYER":{"TAXPAYER_TYPE":"A","TAXPAYER_STR":"65111659","TAXPAYER_NUM":909455665},"CONTACTS":[{"PHONE_NUMBER":"+(347) 457 79 19","CONTACT_PERSON":"Cassey Mackinnon"},{"PHONE_NUMBER":"+(176) 205 63 71","CONTACT_PERSON":"Alona Newcombe"},{"PHONE_NUMBER":"+(348) 375 95 34","CONTACT_PERSON":"Starr Maxim"}]}}
{"File_Id":0,"Record_Id":186,"SEGMENT_ID":"C","COMPANY_ID":"9452676140","STATIC_DETAILS":{"COMPANY_NAME":"Pear GMBH.","ADDRESS":"107 Labe str., Berlin","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"","TAXPAYER_NUM":97056018},"CONTACTS":[{"PHONE_NUMBER":"+(123) 240 91 88","CONTACT_PERSON":"Willis Thorpe"}]}}
{"File_Id":0,"Record_Id":190,"SEGMENT_ID":"C","COMPANY_ID":"8581179565","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"W i","TAXPAYER_NUM":89592681},"CONTACTS":[{"PHONE_NUMBER":"+(518) 461 10 86","CONTACT_PERSON":"Otelia Flatt"},{"PHONE_NUMBER":"+(697) 268 10 81","CONTACT_PERSON":"Wilbert Lepe"},{"PHONE_NUMBER":"+(548) 150 86 82","CONTACT_PERSON":"Suk Maxim"}]}}
{"File_Id":0,"Record_Id":190,"SEGMENT_ID":"C","COMPANY_ID":"8581179565","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"Wi","TAXPAYER_NUM":89592681},"CONTACTS":[{"PHONE_NUMBER":"+(518) 461 10 86","CONTACT_PERSON":"Otelia Flatt"},{"PHONE_NUMBER":"+(697) 268 10 81","CONTACT_PERSON":"Wilbert Lepe"},{"PHONE_NUMBER":"+(548) 150 86 82","CONTACT_PERSON":"Suk Maxim"}]}}
{"File_Id":0,"Record_Id":191,"SEGMENT_ID":"C","COMPANY_ID":"7590246923","STATIC_DETAILS":{"COMPANY_NAME":"Joan Q & Z","ADDRESS":"10 Sandton, Johannesburg","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"\\","TAXPAYER_NUM":17652973},"CONTACTS":[]}}
{"File_Id":0,"Record_Id":195,"SEGMENT_ID":"C","COMPANY_ID":"2521796035","STATIC_DETAILS":{"COMPANY_NAME":"Beiereqweq.","ADDRESS":"901 Ztt, Munich","TAXPAYER":{"TAXPAYER_TYPE":"N","TAXPAYER_STR":"","TAXPAYER_NUM":84541629},"CONTACTS":[{"PHONE_NUMBER":"+(295) 174 64 72","CONTACT_PERSON":"Estelle Wallingford"},{"PHONE_NUMBER":"+(173) 201 14 38","CONTACT_PERSON":"Doretha Shapiro"},{"PHONE_NUMBER":"+(756) 614 38 41","CONTACT_PERSON":"Suk Benally"}]}}