Generic data structures to hold a parsed EBCDIC record #184

Closed

yruslan opened this issue Oct 7, 2019 · 13 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@yruslan
Collaborator

yruslan commented Oct 7, 2019

Background

Currently, the only way to extract data from an EBCDIC file using Cobrix is through Spark. While Spark is great for big data workflows, other workflows would benefit from a more generic way to parse EBCDIC data.

Feature

Create generic data structures for holding parsed EBCDIC record data.
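
For illustration, one possible shape is a small, Spark-independent value tree that mirrors the copybook hierarchy. This is a minimal sketch only; all names below are hypothetical and not part of the actual Cobrix API.

```scala
// A minimal sketch of such a structure; all names are hypothetical and not
// part of the actual Cobrix API.
sealed trait FieldValue
case class PrimitiveValue(value: Any)                  extends FieldValue // numbers, strings, decimals
case class GroupValue(fields: Map[String, FieldValue]) extends FieldValue // a COBOL group item
case class ArrayValue(elements: Seq[FieldValue])       extends FieldValue // an OCCURS clause

// A parsed record is then simply the value tree rooted at the top-level group.
case class ParsedRecord(root: GroupValue)
```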

@yruslan yruslan added the enhancement New feature or request label Oct 7, 2019
@yruslan yruslan added the help wanted Extra attention is needed label Dec 12, 2019
@tr11
Collaborator

tr11 commented Jan 11, 2020

I'm interested in this feature too. Let's discuss the best way to do this. Ideally, this could be done together with the effort of outputting standard formats from Cobrix, such as JSON or CSV.

@yruslan
Collaborator Author

yruslan commented Jan 11, 2020

Absolutely, this is exactly the idea. I can create a design document on Google Drive (for instance) so we can comment on it and come up with a design that takes into account requirements from both our sides.

Looking forward to the collaboration.

@tr11
Collaborator

tr11 commented Mar 27, 2020

Revisiting this. I was looking through the code and wondering whether we could:

  • move the RowExtractors object to cobol-parser, or potentially to a new intermediate package, say cobol-reader. The Spark portion comes only at the end of each method, so we could return a generic Scala object and create a thin wrapper in the spark-cobol package that simply converts the fields to Row (see the sketch after this list)
  • move the readers and iterators to cobol-parser/cobol-reader
  • let the spark-cobol package simply convert the generic Scala rows to Spark rows
  • create other converters, either in cobol-parser/cobol-reader or as separate packages, to output to JSON and CSV (for non-nested structs)
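
A tiny sketch of that thin-wrapper idea (illustrative names only, not the actual extractor code): the reader side hands back nested Scala collections, and spark-cobol is the only place that touches the Spark API.

```scala
import org.apache.spark.sql.Row

// Sketch only: assumes the generic extractor returns each record as a nested
// Seq[Any] where inner Seqs represent COBOL group items.
object SparkRowConverter {
  def toSparkRow(record: Seq[Any]): Row =
    Row.fromSeq(record.map {
      case group: Seq[_] => toSparkRow(group.asInstanceOf[Seq[Any]]) // nested group -> nested Row
      case primitive     => primitive                                // primitives pass through unchanged
    })
}
```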

@yruslan
Collaborator Author

yruslan commented Mar 27, 2020

This all sounds reasonable and looks like a logical evolution of the project. It would be great to decrease coupling and make the project more modular and reusable from other frameworks. The only thing I think we should keep in mind is that we need to preserve performance.

I'm wondering whether cobol-reader could provide generic methods that allow conversion to Spark's Row in the reader as a single step, without an intermediate data structure. I'm also thinking of adding performance tests to make sure performance doesn't degrade after the rearrangement.
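
One way this could look (an assumption for discussion, not an agreed design) is to parameterize the extractor with a small handler, so each caller builds its target record type in the same pass that reads the fields:

```scala
// Hypothetical handler abstraction (not the agreed design): spark-cobol would
// supply an implementation that builds org.apache.spark.sql.Row directly,
// while non-Spark callers get plain Scala collections, with no intermediate
// data structure in between.
trait RecordHandler[T] {
  def create(values: Array[Any]): T
}

object SeqRecordHandler extends RecordHandler[Seq[Any]] {
  override def create(values: Array[Any]): Seq[Any] = values.toSeq
}
```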

@tr11
Collaborator

tr11 commented Mar 29, 2020

@yruslan, take a look at https://github.com/tr11/cobrix/tree/refactor-readers; it does what I mentioned in #184 (comment).
With these changes we have:

  • parser, reader, spark packages. Each handles one part of the process
  • the row conversion happens as a single step with no intermediate structures
  • it's possible to call the reader directly and extract data as a nested Seq or Array.

I should have a JSON record builder at some point in the next couple of weeks, probably as an example app, since there's no need to add extra dependencies to the reader.

@yruslan
Collaborator Author

yruslan commented Mar 29, 2020

I've looked through the changes and they look perfect! I'll give them a more thorough look tomorrow; I might have a couple of questions.

@tr11
Collaborator

tr11 commented Mar 30, 2020

I added a test with a potential JSON, XML, and CSV implementation for record builders. A few questions I have:

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.
  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind (see the sketch below). Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.
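
For reference, a minimal sketch of that approach, assuming the parsed record is already available as a nested Map[String, Any] and that jackson-module-scala is on the classpath so Scala maps serialize cleanly:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object JsonRecordBuilder {
  // A single mapper instance is enough; it is thread-safe once configured.
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  // Serialize one parsed record, held as a nested Map[String, Any], to JSON.
  def toJson(record: Map[String, Any]): String =
    mapper.writeValueAsString(record)
}
```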

@yruslan
Collaborator Author

yruslan commented Mar 30, 2020

Looking at it...

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.

Yes, I agree. It can be in the same module but in different packages. If for any reason we'd like to split them later, we can do it at any time.

  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind. Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.

I think yes here as well. Direct conversion to JSON, XML, and CSV might be very useful as something supported by the library rather than just an example, even if it is a very small module. I expect it could be used to bridge EBCDIC-encoded IBM MQ messages to other messaging systems.

@tr11
Collaborator

tr11 commented Mar 30, 2020

I'll let you go through it first and will merge the parser and reader modules after you're done.

For 2, what do you think of the name cobol-serializer for the new package? I can set up the Reader classes akin to what's done on the Spark side right now, and then we can think about how to pass options to them.

@yruslan
Collaborator Author

yruslan commented Mar 30, 2020

cobol-serializer seems alright, but it gives the impression that something can be serialized to Cobol. Maybe cobol-converters? It implies that Cobol data can be converted to various formats. The fact that serializers are used for the conversion can be considered a technical detail. What do you think?

@tr11
Collaborator

tr11 commented Mar 30, 2020

Good point, cobol-converters it is!

@tr11
Collaborator

tr11 commented Apr 8, 2020

Would it be useful to create a PR with these changes for comments and suggestions?

@yruslan
Collaborator Author

yruslan commented Apr 14, 2020

👍 Of course

Sorry for the late response.
