Generic data structures to hold a parsed EBCDIC record #184

Closed

yruslan opened this issue Oct 7, 2019 · 13 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@yruslan
Collaborator

yruslan commented Oct 7, 2019

Background

Currently, the only way to extract data from an EBCDIC file using Cobrix is through Spark. While Spark is great for big data workflows, other workflows would benefit from a more generic way to parse EBCDIC data.

Feature

Create generic data structures for holding parsed EBCDIC record data.
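
For illustration, one possible shape is a small, Spark-independent value tree that mirrors the copybook hierarchy. This is a minimal sketch only; all names below are hypothetical and not part of the actual Cobrix API.

```scala
// A minimal sketch of such a structure; all names are hypothetical and not
// part of the actual Cobrix API.
sealed trait FieldValue
case class PrimitiveValue(value: Any)                  extends FieldValue // numbers, strings, decimals
case class GroupValue(fields: Map[String, FieldValue]) extends FieldValue // a COBOL group item
case class ArrayValue(elements: Seq[FieldValue])       extends FieldValue // an OCCURS clause

// A parsed record is then simply the value tree rooted at the top-level group.
case class ParsedRecord(root: GroupValue)
```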

@yruslan yruslan added the enhancement New feature or request label Oct 7, 2019
@yruslan yruslan added the help wanted Extra attention is needed label Dec 12, 2019
@tr11
Collaborator

tr11 commented Jan 11, 2020

I'm interested in this feature too. Let's discuss the best way to do this. Ideally, this could be done together with the effort of outputting standard formats from Cobrix, such as JSON or CSV.

@yruslan
Collaborator Author

yruslan commented Jan 11, 2020

Absolutely, this is exactly the idea. I can create a design document on Google Drive (for instance) so we can comment on it and come up with a design that takes into account requirements from both our sides.

Looking forward to the collaboration.

@tr11
Collaborator

tr11 commented Mar 27, 2020

Revisiting this. I was looking through the code and wondering whether we could:

  • move the RowExtractors object to cobol-parser, or potentially to a new intermediate package, say cobol-reader. The Spark portion comes only at the end of each method, so we could return a generic Scala object and create a thin wrapper in the spark-cobol package that simply converts the fields to Row (see the sketch after this list)
  • move the readers and iterators to cobol-parser/cobol-reader
  • let the spark-cobol package simply convert the generic Scala rows to Spark rows
  • create other converters, either in cobol-parser/cobol-reader or as separate packages, to output to JSON and CSV (for non-nested structs)
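
A tiny sketch of that thin-wrapper idea (illustrative names only, not the actual extractor code): the reader side hands back nested Scala collections, and spark-cobol is the only place that touches the Spark API.

```scala
import org.apache.spark.sql.Row

// Sketch only: assumes the generic extractor returns each record as a nested
// Seq[Any] where inner Seqs represent COBOL group items.
object SparkRowConverter {
  def toSparkRow(record: Seq[Any]): Row =
    Row.fromSeq(record.map {
      case group: Seq[_] => toSparkRow(group.asInstanceOf[Seq[Any]]) // nested group -> nested Row
      case primitive     => primitive                                // primitives pass through unchanged
    })
}
```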

@yruslan
Collaborator Author

yruslan commented Mar 27, 2020

This all sounds reasonable and looks like a logical evolution of the project. It would be great to decrease coupling and make the project more modular and reusable from other frameworks. The only thing I think we should keep in mind is that we need to preserve performance.

I'm wondering whether cobol-reader could provide generic methods that allow conversion to Spark's Row in the reader as a single step, without an intermediate data structure. I'm also thinking of adding performance tests to make sure performance doesn't degrade after the rearrangement.
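
One way this could look (an assumption for discussion, not an agreed design) is to parameterize the extractor with a small handler, so each caller builds its target record type in the same pass that reads the fields:

```scala
// Hypothetical handler abstraction (not the agreed design): spark-cobol would
// supply an implementation that builds org.apache.spark.sql.Row directly,
// while non-Spark callers get plain Scala collections, with no intermediate
// data structure in between.
trait RecordHandler[T] {
  def create(values: Array[Any]): T
}

object SeqRecordHandler extends RecordHandler[Seq[Any]] {
  override def create(values: Array[Any]): Seq[Any] = values.toSeq
}
```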

@tr11
Collaborator

tr11 commented Mar 29, 2020

@yruslan, take a look at https://github.com/tr11/cobrix/tree/refactor-readers; it does what I mentioned in #184 (comment).
With these changes we have:

  • parser, reader, spark packages. Each handles one part of the process
  • the row conversion happens as a single step with no intermediate structures
  • it's possible to call the reader directly and extract data as a nested Seq or Array.

I should have a JSON record builder at some point in the next couple of weeks, probably as an example app, since there's no need to add extra dependencies to the reader.

@yruslan
Collaborator Author

yruslan commented Mar 29, 2020

I've looked through the changes and they look perfect! I'll give them a more thorough look tomorrow; I might have a couple of questions.

@tr11
Collaborator

tr11 commented Mar 30, 2020

I added a test with a potential JSON, XML, and CSV implementation for record builders. A few questions I have:

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.
  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind (see the sketch below). Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.
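
For reference, a minimal sketch of that approach, assuming the parsed record is already available as a nested Map[String, Any] and that jackson-module-scala is on the classpath so Scala maps serialize cleanly:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object JsonRecordBuilder {
  // A single mapper instance is enough; it is thread-safe once configured.
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  // Serialize one parsed record, held as a nested Map[String, Any], to JSON.
  def toJson(record: Map[String, Any]): String =
    mapper.writeValueAsString(record)
}
```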

@yruslan
Collaborator Author

yruslan commented Mar 30, 2020

Looking at it...

  1. Should the reader package be merged into the parser? There are no extra dependencies and it's unlikely anyone would use the parser without a reader.

Yes, I agree. It can be in the same module but in different packages. If for any reason we'd like to split them later, we can do it at any time.

  2. Is there a need to create a serializers package similar to the spark-cobol package? The test I mentioned uses a Map[String, Any] to hold the data and pass it to jackson-databind. Maybe an example could suffice instead of replicating a lot of what's done on the Spark side.

I think yes here as well. Direct conversion to JSON, XML, and CSV might be very useful as something supported by the library rather than just an example, even if it is a very small module. I expect it could be used to bridge EBCDIC-encoded IBM MQ messages to other messaging systems.

@tr11
Collaborator

tr11 commented Mar 30, 2020

I'll let you go through it first and will merge the parser and reader modules after you're done.

For 2, what do you think of the name cobol-serializer for the new package? I can set up the Reader classes akin to what's done on the Spark side right now, and then we can think about how to pass options to them.

@yruslan
Collaborator Author

yruslan commented Mar 30, 2020

cobol-serializer seems alright, but it gives the impression that something can be serialized to Cobol. Maybe cobol-converters? It implies that Cobol data can be converted to various formats. The fact that serializers are used for the conversion can be considered a technical detail. What do you think?

@tr11
Collaborator

tr11 commented Mar 30, 2020

Good point, cobol-converters it is!

@tr11
Collaborator

tr11 commented Apr 8, 2020

Would it be useful to create a PR with these changes for comments and suggestions?

@yruslan
Collaborator Author

yruslan commented Apr 14, 2020

👍 Of course

Sorry for the late response.
