
Implemented CSVCodec for S3 Source, config & unit tests #1644

Merged

Conversation

finnroblin
Contributor

Signed-off-by: Finn Roblin [email protected]

Description

Implementation of CSVCodec to parse the InputStream from S3 objects. It includes configuration options similar to those of the CSV Processor, with some slight changes: the CSV file's header/column names can be specified by the user with the header option, or the header can be autodetected by setting detect_header to true.
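For illustration, a pipeline entry using the codec might look roughly like the sketch below. The option names header, detect_header, delimiter, and quote_character come from this PR's description and diff; the surrounding s3 source keys are assumptions and may not match the actual source configuration.

```yaml
# Illustrative sketch only: header/detect_header/delimiter/quote_character
# appear in this PR; the surrounding s3 source keys are assumed.
csv-pipeline:
  source:
    s3:
      codec:
        csv:
          delimiter: ","
          quote_character: "\""
          # either specify the column names explicitly...
          header: ["ip", "verb", "status"]
          # ...or autodetect them from the first row:
          # detect_header: true
```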

Issues Resolved

Resolves #1617

Check List

  • New functionality includes testing.
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@finnroblin finnroblin requested a review from a team as a code owner August 5, 2022 16:39
@codecov-commenter

codecov-commenter commented Aug 5, 2022

Codecov Report

Merging #1644 (cfb92fa) into main (c5ffce3) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #1644   +/-   ##
=========================================
  Coverage     94.26%   94.26%           
  Complexity     1206     1206           
=========================================
  Files           162      162           
  Lines          3419     3419           
  Branches        276      276           
=========================================
  Hits           3223     3223           
  Misses          134      134           
  Partials         62       62           
Impacted Files Coverage Δ
...dataprepper/model/configuration/PipelineModel.java 100.00% <0.00%> (ø)


@DataPrepperPlugin(name = "csv", pluginType = Codec.class, pluginConfigurationType = CSVCodecConfig.class)
public class CSVCodec implements Codec {
private final CSVCodecConfig config;
private static final Logger LOG = LoggerFactory.getLogger(CSVCodec.class);
Contributor

nit: By convention static variables are listed first

Comment on lines 89 to 91
} catch (JsonParseException jsonException) {
LOG.error("A JsonParseException occurred while reading a row of the CSV file, skipping line. Error ", jsonException);
} catch (final Exception e) {
Contributor

You might consider combining these into a single catch block. From a user perspective I don't think the differentiation adds much information
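The suggested consolidation could use Java's multi-catch syntax. A standalone sketch of the idea follows; the exception types here are stand-ins for illustration, not Jackson's actual CsvReadException/JsonParseException:

```java
// Standalone illustration of collapsing two exception handlers into one
// multi-catch block; the exception types are hypothetical stand-ins.
public class MultiCatchExample {
    static class RowTooWideException extends Exception {}
    static class UnclosedQuoteException extends Exception {}

    // Returns the row if it "parses", or null if it is skipped.
    static String parseRow(final String row) {
        try {
            if (row.contains("|")) throw new RowTooWideException();
            if (row.contains("\"")) throw new UnclosedQuoteException();
            return row;
        } catch (RowTooWideException | UnclosedQuoteException e) {
            // one log statement covers both failure modes;
            // the line is skipped either way
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRow("a,b,c")); // a,b,c
        System.out.println(parseRow("a|b"));   // null
    }
}
```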

Contributor Author

I added some more details to the logging messages. From my experience testing, a CsvReadException means the row has too many columns, and a JsonParseException means an unclosed quote character. Do you think differentiating the two is helpful? I guess the lines get skipped either way, so I get where you're coming from with the user's perspective.

Comment on lines 99 to 100
char delimiterAsChar = this.config.getDelimiter().charAt(0);
char quoteCharAsChar = this.config.getQuoteCharacter().charAt(0);
Contributor

Minor: You could define these as private member variables of this class since this same .get().charAt(0) is reused in several spots.

Comment on lines 97 to 98
final CsvMapper firstLineMapper = new CsvMapper();
firstLineMapper.enable(CsvParser.Feature.WRAP_AS_ARRAY); // allows firstLineMapper to read with empty schema
Contributor

You could use createCsvMapper here instead. Should .enable(CsvParser.Feature.WRAP_AS_ARRAY) be moved into createCsvMapper like in the CSV processor? https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/csv-processor/src/main/java/com/amazon/dataprepper/plugins/processor/csv/CSVProcessor.java#L91

Contributor Author

We discussed the getSizeOfFirstLine method offline and Travis suggested that we can get the number of columns through counting the occurrences of delimiter with a loop. I made this change, so the method is simpler and no longer uses the CsvMapper class.
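The delimiter-counting approach described above can be sketched as a standalone method like this (a simplified version for illustration; the method name and edge-case handling in the PR may differ, and this sketch does not account for delimiters inside quoted fields):

```java
// Hypothetical standalone sketch: count the columns in the first CSV line
// by counting delimiter occurrences. Quoted delimiters are not handled.
public class ColumnCounter {
    static int extractNumberOfColumnsFromFirstLine(final String firstLine, final char delimiter) {
        if (firstLine == null || firstLine.isEmpty()) {
            return 0;
        }
        int numberOfSeparators = 0;
        for (int charPointer = 0; charPointer < firstLine.length(); charPointer++) {
            if (firstLine.charAt(charPointer) == delimiter) {
                numberOfSeparators++;
            }
        }
        // n separators delimit n + 1 columns
        return numberOfSeparators + 1;
    }

    public static void main(String[] args) {
        System.out.println(extractNumberOfColumnsFromFirstLine("a,b,c", ',')); // 3
    }
}
```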

final CsvSchema getFirstLineLengthSchema = CsvSchema.emptySchema()
        .withColumnSeparator(delimiterAsChar)
        .withQuoteChar(quoteCharAsChar);

try (final MappingIterator<List<String>> firstLineIterator = firstLineMapper.readerFor(List.class).with(getFirstLineLengthSchema)
Contributor

Minor: Rewriting this to put the Iterator instantiation inside the try block would make this more readable

try { 
    final MappingIterator<List<String>> firstLineIterator = firstLineMapper.readerFor(List.class)
        .with(getFirstLineLengthSchema)
        .readValues(firstLine)
    ...
}

return schema;
}

private String generateColumnHeader(final int colNumber) {
Contributor

There's a good amount of shared logic between this class and the CSV Processor. Could we move some of that shared logic into its own class? Is there a place that makes sense to put it?

@asifsmohammed looks like you created an issue for a similar problem #1643

Contributor Author

+1 and I also noticed this while developing tests — a lot of the code from the Newline and JSON codec unit tests was applicable to the CSV codec unit tests. Sadly I'm not sure where this common code should live. I remember that David Venable suggested packaging the CSV codec and processor together, but I'm not sure how this would look with the CSV processor being its own plugin and the CSV codec attached to S3.

Member

@finnroblin , I believe once we make a generic codec solution (#1532) then we can have one project which has both. As it is now, this code has to reside in the S3 plugin. I'm fine to refactor this later.

Comment on lines 58 to 65
// construct a header from the pipeline config or autogenerate it
final int defaultBufferSize = 8192; // number of chars that can be read before the mark is invalidated;
                                    // large headers can still be read since more buffers will be allocated
reader.mark(defaultBufferSize); // getting the number of columns will consume the first line, so mark the initial location

final int firstLineSize = getSizeOfFirstLine(reader.readLine());
reader.reset(); // move the reader back to the beginning of the file in order to parse the first line
schema = createCsvSchemaFromConfig(firstLineSize);
Contributor

If the first line of the reader contains the header, do we need it when creating the MappingIterator in the section below? Is the mark and reset necessary?

Contributor Author

The branch in the else statement is triggered if there's no header in the csv file. So we want to mark and reset so that the first row of the file can be parsed into parsingIterator a second time (after it's used to figure out the actual number of columns in the csv file, and generate more column names for the header if necessary). I moved this logic to a helper method with a descriptive name which should hopefully improve readability.
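The mark/read/reset pattern described above can be demonstrated in isolation with the standard BufferedReader API. The method and variable names below are illustrative, not the PR's actual helpers:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Minimal sketch of mark/read/reset: peek at the first line to learn the
// column count, then rewind so that same line can still be parsed as data.
public class MarkResetExample {
    static int peekNumberOfColumns(final BufferedReader reader, final char delimiter) throws IOException {
        final int bufferSize = 8192; // the mark is invalidated after this many chars are read
        reader.mark(bufferSize);
        final String firstLine = reader.readLine(); // consumes the first line (assumes non-empty input)
        int columns = 1;
        for (int i = 0; i < firstLine.length(); i++) {
            if (firstLine.charAt(i) == delimiter) {
                columns++;
            }
        }
        reader.reset(); // rewind so the first line can be read again as a data row
        return columns;
    }

    public static void main(String[] args) throws IOException {
        final BufferedReader reader = new BufferedReader(new StringReader("1,2,3\n4,5,6\n"));
        System.out.println(peekNumberOfColumns(reader, ',')); // 3
        System.out.println(reader.readLine()); // 1,2,3  (first line is still available)
    }
}
```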


import java.util.function.Consumer;

@DataPrepperPlugin(name = "csv", pluginType = Codec.class, pluginConfigurationType = CSVCodecConfig.class)
public class CSVCodec implements Codec {
Member

Let's use Csv instead of CSV. I think this is more consistent with the current code base and Java conventions. Please change other classes as well.

Contributor Author

I agree that this change is more consistent with conventions and the code base. I suppose we should also change the CSVProcessor class to CsvProcessor — is it okay to include these changes in this pull request? I can also open a new pull request to change to CsvProcessor if it's out of scope of this pull request.

Member

A different PR would be better. That will help get this one in faster.


private int extractNumberOfColsFromFirstLine(final String firstLine) {
int numberOfSeperators = 0;
int charPointer = 0;
Member

Is there any reason this is not in the for statement?

Contributor Author

There's no reason, oversight on my part. Thanks for catching it!

}
}

private int extractNumberOfColsFromFirstLine(final String firstLine) {
Member

Please use full words - Cols -> Columns.


csvCodec = createObjectUnderTest();
}
// @Test(timeout=10000)
Member

Please remove these commented out lines.

}

private CsvSchema createCsvSchemaFromConfig(final int firstLineSize) {
final List<String> userSpecifiedHeader = config.getHeader();
Contributor Author

The configuration option getHeader can be null, and if it is then the call to size() on line 116 will throw an exception. The proper behavior if header is null is to make userSpecifiedHeader an empty ArrayList. I'll fix this bug in the next revision.
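The null guard described above can be sketched in isolation. The accessor shape here is illustrative, not the actual CSVCodecConfig API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fix described above: if the configured header is null,
// fall back to an empty list so later calls like size() cannot throw an NPE.
public class HeaderGuardExample {
    static List<String> headerOrEmpty(final List<String> configuredHeader) {
        return configuredHeader == null ? new ArrayList<>() : configuredHeader;
    }

    public static void main(String[] args) {
        System.out.println(headerOrEmpty(null).size()); // 0, instead of a NullPointerException
        System.out.println(headerOrEmpty(List.of("ip", "verb", "status")).size()); // 3
    }
}
```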

Contributor

@travisbenedict travisbenedict left a comment

LGTM

}
}

private int getNumberOfColumnsByMarkingBeginningOfInputStreamAndResettingReaderAfter(BufferedReader reader) throws IOException {
Contributor

minor: I think you could just call this getNumberOfColumns since the caller of this function doesn't need to care how it works

LOG.error("Invalid CSV row, skipping this line. This typically means the row has too many columns. Consider using the CSV " +
        "Processor if there might be inconsistencies in the number of columns because it is more flexible. Error ",
        csvException);
} catch (JsonParseException jsonException) {
Contributor

nit: missing final

@dlvenable dlvenable merged commit 5ffd5c6 into opensearch-project:main Aug 9, 2022
engechas pushed a commit to engechas/data-prepper that referenced this pull request Sep 12, 2022
…roject#1644)

* Implemented CSVCodec for S3 Source, config & unit tests

Signed-off-by: Finn Roblin <[email protected]>

* Addressed Travis's feedback

Signed-off-by: Finn Roblin <[email protected]>

* Addressed David's feedback & fixed NPE if header is null

Signed-off-by: Finn Roblin <[email protected]>

* Renamed all instances of CSVProcessor to CsvProcessor (and offshoots like CsvProcessorConfig)

Signed-off-by: Finn Roblin <[email protected]>