Always use auto-generated doc values as a back-up for Avro doc-related metadata retrieval. #377

Open
wants to merge 6 commits into base: master

@rulle-io (Contributor) commented Sep 1, 2021

This PR is meant as a solution for issue #579.

It also makes the schema generation process less dependent on a user-provided schema and more fault-tolerant.

Current implementation

dbeam always generates the doc-related properties of an Avro schema based on the input parameters and the ResultSet values.
Optionally, a user can provide a custom "handwritten" schema.
The user-provided schema is used only for Avro doc values.
Thus field names, types, and type lengths are taken from the auto-generated schema.

Drawback(s)

One drawback of this behaviour is that when a new field appears in a DB table, and as a consequence in the source SQL ResultSet (e.g. when SELECT * is used), and the user-provided schema doesn't contain this field, the process throws an error.

Solution

dbeam's auto-generated schema is always used as a back-up when the user-provided schema doesn't contain the field in question.
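
The fallback described above could look roughly like the following. This is a minimal sketch, not the code in this PR: it models field docs as plain maps instead of Avro `Schema` objects, and the class name `DocFallbackSketch`, the method `resolveDocs`, and the doc strings are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: field docs are plain maps keyed by field name (an assumption). */
final class DocFallbackSketch {

  /**
   * For every field of the auto-generated schema, prefer the user-provided doc
   * and fall back to the auto-generated doc when the user schema lacks the field.
   */
  static Map<String, String> resolveDocs(
      Map<String, String> generatedDocs, Map<String, String> userDocs) {
    Map<String, String> resolved = new LinkedHashMap<>();
    for (Map.Entry<String, String> field : generatedDocs.entrySet()) {
      String userDoc = userDocs.get(field.getKey());
      // A missing field is not an error: keep the auto-generated doc instead.
      resolved.put(field.getKey(), userDoc != null ? userDoc : field.getValue());
    }
    return resolved;
  }

  public static void main(String[] args) {
    Map<String, String> generated = new LinkedHashMap<>();
    generated.put("id", "generated doc for id");
    // Hypothetical column that was added to the table after the user schema was written.
    generated.put("created_at", "generated doc for created_at");

    Map<String, String> user = new LinkedHashMap<>();
    user.put("id", "Primary key of the record");

    System.out.println(resolveDocs(generated, user));
  }
}
```

Because the loop iterates over the generated fields only, field names and types keep coming from the auto-generated schema, and a column missing from the user-provided schema no longer stops the process.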

Additional use-case

An unplanned positive side-effect is that a user-provided schema can serve as a dictionary of descriptions (docs) for various fields, so one schema file can be used for multiple tables. We are going to use this side-effect.
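
As a rough illustration of the dictionary use-case, the sketch below applies one shared set of docs to two tables. It rests on assumptions: docs are modelled as a plain map, and `DocDictionarySketch`, `docsForTable`, the table columns, and the "generated doc for ..." placeholder are hypothetical names rather than dbeam's API or output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: one doc dictionary shared by several tables (an assumption). */
final class DocDictionarySketch {

  /** Builds the docs for one table from a shared dictionary of field descriptions. */
  static Map<String, String> docsForTable(List<String> columns, Map<String, String> dictionary) {
    Map<String, String> docs = new LinkedHashMap<>();
    for (String column : columns) {
      // Dictionary entries for columns absent from this table are never looked up.
      docs.put(column, dictionary.getOrDefault(column, "generated doc for " + column));
    }
    return docs;
  }

  public static void main(String[] args) {
    Map<String, String> dictionary = new LinkedHashMap<>();
    dictionary.put("user_id", "Identifier of the user");
    dictionary.put("country", "ISO country code");

    // Two hypothetical tables that share the user_id column.
    System.out.println(docsForTable(Arrays.asList("user_id", "country"), dictionary));
    System.out.println(docsForTable(Arrays.asList("order_id", "user_id"), dictionary));
  }
}
```

The shared column gets the same description in both tables, while a column the dictionary does not know keeps a generated doc.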

  • "Unit tests are included"

Checklist for PR author(s)

  • Changes are covered by unit tests (no major decrease in code coverage %) and/or integration tests.
  • Ensure code formatting (use mvn com.coveo:fmt-maven-plugin:format org.codehaus.mojo:license-maven-plugin:update-file-header)
  • Document any relevant additions/changes in the appropriate spot in javadocs/docs/README.

codecov bot commented Sep 1, 2021

Codecov Report

Merging #377 (7bd0191) into master (2646c35) will increase coverage by 0.42%.
The diff coverage is 92.75%.

@@             Coverage Diff              @@
##             master     #377      +/-   ##
============================================
+ Coverage     91.47%   91.90%   +0.42%     
- Complexity      243      258      +15     
============================================
  Files            26       27       +1     
  Lines           927      963      +36     
  Branches         67       71       +4     
============================================
+ Hits            848      885      +37     
+ Misses           52       50       -2     
- Partials         27       28       +1     

@rulle-io changed the title from "Add initial supplied schema validation to prevent inconsistency betwe…" to "Change to use user-provided for doc-related data only." Sep 16, 2021
@rulle-io changed the title from "Change to use user-provided for doc-related data only." to "Change to use user-provided scheam for doc-related info retrieval only." Sep 16, 2021
@rulle-io (Contributor Author) commented

@labianchin

@rulle-io changed the title from "Change to use user-provided scheam for doc-related info retrieval only." to "Change to use user-provided schema for doc-related info retrieval only." Sep 16, 2021
@rulle-io changed the title from "Change to use user-provided schema for doc-related info retrieval only." to "Use user-provided schema for doc-related info retrieval only." Sep 18, 2021
@rulle-io force-pushed the supplied_schema_validation branch from d232488 to bcd97b7 on February 11, 2022 16:55
@labianchin (Collaborator) commented

Hi. Sorry it took me a while to get here, as I am putting little time into this project...

Is this PR still relevant? It has some conflicts with the just-merged #380.

If so, can you elaborate a bit further on the need for these changes? Specifically: what do we mean by "more fault-tolerant"? And what problem does "less dependent on a user-supplied schema" solve?

Ruslan Altynnikov added 2 commits March 12, 2022 11:18
…en supplied schema (expected data format) and an actual data format, returned by a SQL query.

Reorganize some code to make locations more logical.

Always use generated Avro schema.
Optional user provided schema used for `doc` fields retrieval.
@rulle-io force-pushed the supplied_schema_validation branch from bcd97b7 to 12989f1 on March 12, 2022 10:47
@rulle-io (Contributor Author) commented

Updated the description.

@rulle-io changed the title from "Use user-provided schema for doc-related info retrieval only." to "Always use an auto-generated schema as a back-up for Avro doc-related metadata retrieval." Jan 18, 2023
@rulle-io changed the title from "Always use an auto-generated schema as a back-up for Avro doc-related metadata retrieval." to "Always use an auto-generated doc values as a back-up for Avro doc-related metadata retrieval." Jan 18, 2023
Successfully merging this pull request may close these issues.

Incorrect user-supplied Avro schema (--avroSchemaFilePath) causes dbeam to produce invalid avro files.