Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize strings transformation #371

Merged
merged 65 commits into from
May 26, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
65 commits
Select commit Hold shift + click to select a range
1383053
feat(normalize-strings): Add normalize string transform for GTFS
landonreed Feb 19, 2021
b3b1134
refactor(NormalizeFieldTransformation.java): add some fields
landonreed Mar 16, 2021
3de77bf
refactor(NormalizeFieldTransformation): Fix capitalize function. Add …
binh-dam-ibigroup Mar 17, 2021
f3aaa18
refactor(NormalizeFieldTransformation): Add string replacement (basic…
binh-dam-ibigroup Mar 18, 2021
6a52f5b
refactor(NormalizeFieldTransformation): Add removal of contents in pa…
binh-dam-ibigroup Mar 18, 2021
0f9b6f6
feat(normalize-strings): Add normalize string transform for GTFS
landonreed Feb 19, 2021
fb038a1
refactor(NormalizeFieldTransformation.java): add some fields
landonreed Mar 16, 2021
1616199
refactor(NormalizeFieldTransformation): Fix capitalize function. Add …
binh-dam-ibigroup Mar 17, 2021
f3c1e99
refactor(NormalizeFieldTransformation): Add string replacement (basic…
binh-dam-ibigroup Mar 18, 2021
76643d3
refactor(NormalizeFieldTransformation): Add removal of contents in pa…
binh-dam-ibigroup Mar 18, 2021
4d6c02f
refactor(NormalizeFieldTransformation): Add support for capitalizatio…
binh-dam-ibigroup Mar 19, 2021
0a92484
refactor(NormalizeFieldTransformation): Add output to zip (assume fla…
binh-dam-ibigroup Mar 19, 2021
2ff46d4
Merge remote-tracking branch 'origin/normalize-strings-transformation…
binh-dam-ibigroup Mar 19, 2021
c5a926c
refactor(NormalizeFieldTransformationTest): Convert unit tests to par…
binh-dam-ibigroup Mar 22, 2021
342cf65
refactor(NormalizeFieldTransformation): Make transform steps optional…
binh-dam-ibigroup Mar 22, 2021
f38fd40
refactor(NormalizeFieldTransformation): Make substitutions instance-b…
binh-dam-ibigroup Mar 29, 2021
330decb
refactor(NormalizeFieldTransformation): Use persisted strings to init…
binh-dam-ibigroup Mar 31, 2021
65f8545
refactor(NormalizeFieldTransformation): Tweak code.
binh-dam-ibigroup Apr 1, 2021
086a599
test(NormalizeFieldTransformation): Add basic e2e test
binh-dam-ibigroup Apr 1, 2021
3965c05
refactor(NormalizeFieldTransformation): Define cap exceptions in the …
binh-dam-ibigroup Apr 2, 2021
c4b6161
refactor(NormalizeFieldTransformJobTest): Extract new test class (mak…
binh-dam-ibigroup Apr 6, 2021
ff541d0
refactor(NormalizeFieldTransformJobTest): Remove setup code
binh-dam-ibigroup Apr 6, 2021
0b0f363
refactor(NormalizeFieldTransformJobTest): Make test derive from Datat…
binh-dam-ibigroup Apr 7, 2021
d28e000
refactor(NormalizeFieldTransformJobTest): Persist feedsource only onc…
binh-dam-ibigroup Apr 7, 2021
a62e14e
refactor(NormalizeFieldTransformJobTest): Try a test duplicate and th…
binh-dam-ibigroup Apr 7, 2021
80a725c
Revert "refactor(NormalizeFieldTransformJobTest): Try a test duplicat…
binh-dam-ibigroup Apr 7, 2021
5d5482a
refactor(NormalizeFieldTransformation): Try a delay writing processed…
binh-dam-ibigroup Apr 7, 2021
b1aa23f
refactor(NormalizeFieldTransformation): Use CsvListWriter instead of …
binh-dam-ibigroup Apr 8, 2021
a3a3c54
refactor(NormalizeFieldTransformation): Replace Thread.sleep by touch…
binh-dam-ibigroup Apr 8, 2021
f6d08e6
refactor(NormalizeFieldTransformation): Count modified records.
binh-dam-ibigroup Apr 9, 2021
78f0fb2
refactor(NormalizeFieldTransformation): Remove file timestamp hack (w…
binh-dam-ibigroup Apr 12, 2021
7e266de
refactor(NormalizeFieldTransformation): Refactor lightly and ad JavaDoc.
binh-dam-ibigroup Apr 14, 2021
6833b6f
refactor(Db/ZipTransformation): Move type checks to Db/Zip parent class.
binh-dam-ibigroup Apr 14, 2021
50ce6a6
refactor(ZipTransformation): Extract logic to get Path in zipfs.
binh-dam-ibigroup Apr 14, 2021
9a95a34
refactor(FeedTransformations): Extract validation logic.
binh-dam-ibigroup Apr 14, 2021
7183f3f
Revert "refactor(FeedTransformations): Extract validation logic."
binh-dam-ibigroup Apr 14, 2021
7d92844
refactor(FeedTransformations): Extract validation logic (2nd attempt)
binh-dam-ibigroup Apr 14, 2021
5c3e694
refactor(ZipTransformation): Check table validity against GTFS-plus too.
binh-dam-ibigroup Apr 14, 2021
673d688
Update src/main/java/com/conveyal/datatools/manager/models/transform/…
binh-dam-ibigroup Apr 15, 2021
e03367f
Update src/main/java/com/conveyal/datatools/manager/models/transform/…
binh-dam-ibigroup Apr 15, 2021
c4ef04c
refactor(*TransformJobTest): Make imports explicit
binh-dam-ibigroup Apr 16, 2021
ee52bc8
Merge remote-tracking branch 'origin/normalize-strings-transformation…
binh-dam-ibigroup Apr 16, 2021
ca72507
refactor(NormalizeFieldTransformation): Close csvReader, use Files.cr…
binh-dam-ibigroup Apr 20, 2021
cd883d9
docs(NormalizeFieldTransformation): Add class JavaDoc, tweak other co…
binh-dam-ibigroup Apr 21, 2021
497b6bc
refactor(NormalizeFieldTransformation): Capture exec time. Remove ref…
binh-dam-ibigroup Apr 22, 2021
af2aca8
refactor(NormalizeFieldTransformation): Use YAML list to define subst…
binh-dam-ibigroup Apr 23, 2021
cc52c8c
chore(env.yml.tmp): Update documentation of template YML.
binh-dam-ibigroup Apr 23, 2021
d80a972
refactor(JsonUtil): Parse json only if it is not null.
binh-dam-ibigroup Apr 28, 2021
b2dd61e
refactor(NormalizeFieldTransformation): Address PR comments 1/.
binh-dam-ibigroup Apr 28, 2021
eeec18e
refactor(NormalizeFieldTransformation): Move configs to server.yml, c…
binh-dam-ibigroup Apr 28, 2021
c44c928
refactor(NormalizeFieldTransformation): Address PR comments 2/
binh-dam-ibigroup Apr 29, 2021
862b3bd
refactor(Substitution): Add description field. Update server.env files.
binh-dam-ibigroup Apr 30, 2021
a4a375d
build(deps): bump jackson.version from 2.10.1 to 2.12.1
dependabot[bot] Feb 18, 2021
a692f9c
refactor(MonitorableJob): avoid duplicate completion in completeSucce…
landonreed Feb 23, 2021
cdc6965
refactor(TypedFeedTransformation): Move common code to Db/Zip transfo…
binh-dam-ibigroup May 4, 2021
5ddfc46
Merge branch 'normalize-strings-transformation' of https://github.com…
binh-dam-ibigroup May 4, 2021
9f53b6a
refactor(TypedFeedTransformation): Add missing file
binh-dam-ibigroup May 4, 2021
e731ec4
test(NormalizeFieldTransformJobTest): Remove redundant test, add FIXM…
binh-dam-ibigroup May 4, 2021
bcfa8a7
refactor: combine FeedTransformation and TypedFeedTransformation
evansiroky May 5, 2021
678639e
Merge pull request #379 from ibi-group/normalize-strings-transformati…
binh-dam-ibigroup May 5, 2021
7bd9975
refactor(NormalizeFieldTransformation): Remove unused import, fix cla…
binh-dam-ibigroup May 11, 2021
97c760d
Update configurations/test/server.yml.tmp
binh-dam-ibigroup May 14, 2021
6fb9a0f
refactor(NormalizeFieldTransformation): Remove redundant field 'perfo…
binh-dam-ibigroup May 14, 2021
c9f6303
Update src/main/java/com/conveyal/datatools/manager/models/transform/…
binh-dam-ibigroup May 25, 2021
499908b
refactor(NormalizeFieldTransformation): fix whitespace
landonreed May 26, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions configurations/default/server.yml.tmp
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,28 @@ modules:
load_on_fetch: false
# use_extension: mtc
# update_frequency: 30 # in seconds
manager:
normalizeFieldTransformation:
# Enter capitalization exceptions (e.g. acronyms), in the desired case, and separated by commas.
defaultCapitalizationExceptions:
- ACE
- BART
# Enter substitutions (e.g. substitute '@' with 'at'), one dashed entry for each substitution, with:
# - pattern: the regex string pattern that will be replaced,
# - replacement: the replacement string for that pattern,
# - normalizeSpace: if true, the resulting field value will include one space before and after the replacement string.
# Note: if the replacement must be blank, then normalizeSpace should be set to false
# and whitespace management should be handled in pattern instead.
# Substitutions are executed in order they appear in the list.
defaultSubstitutions:
- description: "Replace '@' with 'at', and normalize space."
pattern: "@"
replacement: at
normalizeSpace: true
- description: "Replace '+' (\\+ in regex) and '&' with 'and', and normalize space."
pattern: "[\\+&]"
replacement: and
normalizeSpace: true
extensions:
transitland:
enabled: true
Expand Down
33 changes: 33 additions & 0 deletions configurations/test/server.yml.tmp
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,39 @@ modules:
load_on_fetch: false
# use_extension: mtc
# update_frequency: 30 # in seconds
manager:
normalizeFieldTransformation:
# Enter capitalization exceptions (e.g. acronyms), in the desired case, and separated by commas.
defaultCapitalizationExceptions:
- ACE
- BART
- SMART
- EB
- WB
- SB
- NB
# Enter substitutions (e.g. substitute '@' with 'at'), one dashed entry for each substitution, with:
# - pattern: the regex string pattern that will be replaced,
# - replacement: the replacement string for that pattern,
# - normalizeSpace: if true, the resulting field value will include one space before and after the replacement string.
# Note: if the replacement must be blank, then normalizeSpace should be set to false
# and whitespace management should be handled in pattern instead.
# Substitutions are executed in order they appear in the list.
defaultSubstitutions:
- description: "Replace '@' with 'at', and normalize space."
pattern: "@"
replacement: at
normalizeSpace: true
- description: "Replace '+' (\\+ in regex) and '&' with 'and, and normalize space."
pattern: "[\\+&]"
replacement: and
normalizeSpace: true
- description: "Remove content in parentheses and adjacent space outside the parentheses."
pattern: "\\s*\\(.+\\)\\s*"
replacement: ""
- description: "Remove content in square brackets and adjacent space outside the brackets."
pattern: "\\s*\\[.+\\]\\s*"
replacement: ""
extensions:
# Enable MTC extension so MTC-specific feed merge tests
mtc:
Expand Down
10 changes: 8 additions & 2 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@
<url>https://github.com/ibi-group/datatools-server.git</url>
</scm>
<properties>
<jackson.version>2.10.1</jackson.version>
<jackson.version>2.12.1</jackson.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!-- Using the latest version of geotools (e.g, 20) seems to cause issues with the shapefile
plugin where the_geom for each feature is null. -->
Expand Down Expand Up @@ -255,6 +255,13 @@
<version>5.7.0</version>
<scope>test</scope>
</dependency>
<!-- Provides JUnit 5.x parametrized tests -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-params</artifactId>
<version>5.5.2</version>
<scope>test</scope>
</dependency>

<!-- Used for loading/fetching/validating/writing GTFS entities. gtfs-lib also provides access to:
- commons-io - generic utilities
Expand Down Expand Up @@ -435,7 +442,6 @@
<artifactId>aws-java-sdk-sts</artifactId>
<version>${awsjavasdk.version}</version>
</dependency>

</dependencies>

</project>
Original file line number Diff line number Diff line change
Expand Up @@ -288,15 +288,16 @@ private String getCallingMethodTrace() {
* Shorthand method to update status object on successful job completion.
*/
public void completeSuccessfully(String message) {
this.complete(false, message);
// Do not overwrite the message (and other fields), if the job has already been completed.
if (!this.completed) this.complete(false, message);
landonreed marked this conversation as resolved.
Show resolved Hide resolved
}

/**
* Set job status to completed with error and message information.
*/
private void complete(boolean isError, String message) {
this.error = isError;
// Skip message update if null.
// Skip message update if the job message is null or the message has already been defined.
if (message != null) this.message = message;
this.percentComplete = 100;
this.completed = true;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
/**
* This job will apply a {@link ZipTransformation} or {@link DbTransformation} to a GTFS zip file or database namespace,
* respectively, and generate the required FeedTransformTarget object from those inputs, which is passed into the
* {@link FeedTransformation#transform} method.
* {@link FeedTransformation#doTransform} method.
*/
public class ArbitraryTransformJob extends MonitorableJob {

Expand Down Expand Up @@ -43,6 +43,6 @@ public void jobLogic() {
target.validate(status);
if (status.error) return;
// If target is valid, perform transformation.
transformation.transform(target, status);
transformation.doTransform(target, status);
}
}
Original file line number Diff line number Diff line change
@@ -1,13 +1,33 @@
package com.conveyal.datatools.manager.models.transform;

import com.conveyal.datatools.common.status.MonitorableJob;
import com.conveyal.gtfs.loader.JdbcGtfsLoader;

import java.util.List;

/**
* This is an abstract class that represents a transformation that should apply to a GTFS in database form. In other
* words, subclasses will provide a transform override method that acts on a database namespace. Sample fields
* matchField and matchValues can be used to construct a WHERE clause for applying updates to a filtered set of records.
*/
public abstract class DbTransformation extends FeedTransformation {
public abstract class DbTransformation extends FeedTransformation<FeedTransformDbTarget> {
public String matchField;
public List<String> matchValues;

protected String getTransformationTypeName() {
return FeedTransformDbTarget.class.getSimpleName();
}

protected void validateFieldNames(MonitorableJob.Status status) {
// Validate fields before running transform.

if (matchField == null) {
status.fail("Must provide valid match field");
return;
}
String cleanField = JdbcGtfsLoader.sanitize(matchField, null);
if (!matchField.equals(cleanField)) {
status.fail("Input match field contained disallowed special characters (only alphanumeric and underscores are permitted).");
}
}
}
Original file line number Diff line number Diff line change
@@ -1,24 +1,18 @@
package com.conveyal.datatools.manager.models.transform;

import com.conveyal.datatools.common.status.MonitorableJob;
import com.conveyal.datatools.manager.DataManager;
import com.conveyal.datatools.manager.gtfsplus.tables.GtfsPlusTable;
import com.conveyal.datatools.manager.models.Snapshot;
import com.conveyal.datatools.manager.models.TableTransformResult;
import com.conveyal.datatools.manager.persistence.Persistence;
import com.conveyal.gtfs.loader.JdbcGtfsLoader;
import com.conveyal.gtfs.loader.Table;
import com.conveyal.gtfs.util.InvalidNamespaceException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

import static com.conveyal.datatools.manager.DataManager.GTFS_DATA_SOURCE;
import static com.conveyal.gtfs.util.Util.ensureValidNamespace;
Expand All @@ -43,50 +37,14 @@ public static DeleteRecordsTransformation create(String table, String matchField
}

@Override
public void transform(FeedTransformTarget target, MonitorableJob.Status status) {
if (!(target instanceof FeedTransformDbTarget)) {
status.fail("Target must be FeedTransformDbTarget.");
return;
}
// Cast transform target to DB flavor.
FeedTransformDbTarget dbTarget = (FeedTransformDbTarget)target;
public void transform(FeedTransformDbTarget dbTarget, MonitorableJob.Status status) {
// Fetch the referenced snapshot to transform.
Snapshot snapshot = Persistence.snapshots.getById(dbTarget.snapshotId);
if (snapshot == null) {
status.fail(String.format("Cannot find snapshot to transform (id=%s)", dbTarget.snapshotId));
return;
}

// TODO: Move validation code into its own method?
final List<Table> tables =
Arrays.stream(Table.tablesInOrder)
.filter(Table::isSpecTable)
.collect(Collectors.toList());
if (DataManager.isModuleEnabled("gtfsplus")) {
// Add GTFS+ tables only if MTC extension is enabled.
tables.addAll(Arrays.asList(GtfsPlusTable.tables));
}
// Check that table name is valid (to prevent SQL injection).
Table matchingTable = null;
for (Table specTable : tables) {
if (specTable.name.equals(table)) {
matchingTable = specTable;
break;
}
}
if (matchingTable == null) {
status.fail("Table must be valid GTFS spec table name (without .txt).");
return;
}
if (matchField == null) {
status.fail("Must provide valid match field");
return;
}
String cleanField = JdbcGtfsLoader.sanitize(matchField, null);
if (!matchField.equals(cleanField)) {
status.fail("Input match field contained disallowed special characters (only alphanumeric and underscores are permitted).");
return;
}
try {
ensureValidNamespace(snapshot.namespace);
} catch (InvalidNamespaceException e) {
Expand All @@ -111,9 +69,14 @@ public void transform(FeedTransformTarget target, MonitorableJob.Status status)
int deleted = preparedStatement.executeUpdate();
LOG.info("{} deleted {} records", this.getClass().getSimpleName(), deleted);
connection.commit();
target.feedTransformResult.tableTransformResults.add(new TableTransformResult(table, deleted, 0, 0));
dbTarget.feedTransformResult.tableTransformResults.add(new TableTransformResult(table, deleted, 0, 0));
} catch (SQLException e) {
throw new RuntimeException(e);
}
}

@Override
public void validateParameters(MonitorableJob.Status status) {
// Does nothing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method needed then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is needed because validateParameters is now abstract in the parent class. Is @Override needed though?

}
}
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
package com.conveyal.datatools.manager.models.transform;

import com.conveyal.datatools.common.status.MonitorableJob;
import com.conveyal.datatools.manager.utils.GtfsUtils;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import org.bson.codecs.pojo.annotations.BsonDiscriminator;

import java.io.Serializable;


/**
* This abstract class is the base for arbitrary feed transformations.
*
Expand All @@ -28,10 +28,11 @@
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME)
@JsonSubTypes({
@JsonSubTypes.Type(value = DeleteRecordsTransformation.class, name = "DeleteRecordsTransformation"),
@JsonSubTypes.Type(value = NormalizeFieldTransformation.class, name = "NormalizeFieldTransformation"),
@JsonSubTypes.Type(value = ReplaceFileFromVersionTransformation.class, name = "ReplaceFileFromVersionTransformation"),
@JsonSubTypes.Type(value = ReplaceFileFromStringTransformation.class, name = "ReplaceFileFromStringTransformation")
})
public abstract class FeedTransformation implements Serializable {
public abstract class FeedTransformation<Target extends FeedTransformTarget> implements Serializable {
private static final long serialVersionUID = 1L;
public String table;
public boolean active = true;
Expand All @@ -46,5 +47,58 @@ public boolean isActive() {
// it could return something object that contains a bool + message.
// boolean isValid();

public abstract void transform(FeedTransformTarget target, MonitorableJob.Status status);
public void doTransform(FeedTransformTarget target, MonitorableJob.Status status) {
try {
// Attempt to cast transform target to correct flavor
// (fails the job if types mismatch.)
Target typedTarget = (Target)target;

// Validate parameters before running transform.
validateTableName(status);
validateFieldNames(status);
// Let subclasses check parameters.
validateParameters(status);
if (status.error) {
return;
}

// Pass the typed transform target to subclasses to transform.
transform(typedTarget, status);
} catch (ClassCastException classCastException) {
status.fail(
String.format("Transformation must be of type '%s'.", getTransformationTypeName())
);
}
}

protected abstract String getTransformationTypeName();

/**
* Contains the logic for this database-bound transformation.
* @param target The database-bound or ZIP-file-bound target the transformation will operate on.
* @param status Used to report success or failure status and details.
*/
public abstract void transform(Target target, MonitorableJob.Status status);

/**
* At the moment, used by DbTransformation to validate field names.
*/
protected abstract void validateFieldNames(MonitorableJob.Status status);

/**
* Handles validation logic prior to performing the transformation.
* Calling status.fail prevents the transform logic from running.
*/
public abstract void validateParameters(MonitorableJob.Status status);

/**
* Checks that the table name for the transform is valid.
*/
protected void validateTableName(MonitorableJob.Status status) {
// Validate fields before running transform.
if (GtfsUtils.getGtfsTable(table) == null) {
status.fail("Table must be valid GTFS spec table name (without .txt).");
return;
}
}
}
Loading