Metafacture Fix (Metafix) is work in progress towards tools and an implementation of the Fix language for Metafacture as an alternative to configuring data transformations with Metamorph. Inspired by Catmandu Fix, Metafix processes metadata not as a continuous data stream but as discrete records. The basic idea is to rebuild constructs from the (Catmandu) Fix language like functions, selectors and binds in Java and combine them with additional functionality from the Metamorph toolbox.
See also the Fix Interest Group for an initiative towards an implementation-independent specification of the Fix language.
This repo contains the actual implementation of the Fix language as a Metafacture module and related components. It started as an Xtext web project with a Fix grammar, from which a parser, a web editor, and a language server are generated. The repo also contains an extension for VS Code/Codium based on that language server. (The web editor has effectively been replaced by the Metafacture Playground, but remains here for its integration into the language server, which we want to move over to the playground.)
Note: If you're using Windows, configure the Git option core.autocrlf before cloning: git config --global core.autocrlf false
Clone the Git repository:
git clone https://github.com/metafacture/metafacture-fix.git
Go to the Git repository root:
cd metafacture-fix/
Run the tests (in metafix/src/test/java) and checks (.editorconfig, config/checkstyle/checkstyle.xml):
./gradlew clean check
To execute a Fix (embedded in a Flux) via CLI:
./gradlew :metafix-runner:run --args="$PWD/path/to.flux"
To execute a Fix (embedded in a Flux) via CLI in Java debug mode, make sure to pipe to log-stream after your fix command in the Flux, or use log-object at the appropriate location. Then:
export JAVA_OPTS="-Dorg.metafacture.metafix.logLevel=DEBUG"; ./gradlew installDist; cd metafix-runner/build/install/metafix-runner; bin/metafix-runner "$PWD/path/to.flux"
(To import the projects in Eclipse, choose File > Import > Existing Gradle Project and select the metafacture-fix directory.)
The repo contains and uses a new Metafix stream module for Metafacture which plays the role of the Metamorph module in Fix-based Metafacture workflows. For the current implementation of the Metafix stream module see the tests in metafix/src/test/java. To play around with some examples, check out the Metafacture Playground. For real-world usage samples see openRub.fix and duepublico.fix. For reference documentation, see Functions and cookbook.
The project metafix-vsc provides an extension for Visual Studio Code / Codium for Fix via the Language Server Protocol (LSP). In its current state the extension supports auto-completion, simple syntax highlighting, and auto-closing of brackets and quotes. This project was created using this tutorial and the corresponding example.
Build the extension:
Important: There is a problem when building the extension on Windows and then installing it on a Linux system: in some cases the Xtext server won't start. So if you want to use the extension not only on Windows, build it on a Linux system or in a Linux subsystem on Windows.
- Install Visual Studio Code / alternative: VS Codium
- Install Node.js (including npm)
- In metafacture-fix execute:
  Unix: ./gradlew installServer
  Windows: .\gradlew.bat installServer
- In metafix-vsc/ execute (tip: if you use Windows, install Cygwin to execute npm commands): npm install
To start the extension in development mode (starting a second Code/Codium instance), follow A. To create a vsix file to install the extension permanently, follow B.
A) Run in dev mode:
- Open metafix-vsc/ in Visual Studio Code / Codium
- Launch the VS Code extension by pressing F5 (opens a new window of Visual Studio Code)
- Open a new file (file ending .fix) or open an existing Fix file (see sample below)
B) Install vsix file:
- Install vsce: npm install -g vsce
- In metafix-vsc/ execute: vsce package
vsce will create a vsix file in the vsc directory which can be used for installation:
- Open VS Code / Codium
- Click the 'Extensions' section
- Click the menu bar and choose 'Install from VSIX...'
Start the web server:
./gradlew jettyRun
Visit http://localhost:8080/, and paste this into the editor:
# Fix is a macro-language for data transformations
# Simple fixes
add_field("hello", "world")
remove_field("my.deep.nested.junk")
copy_field("stats", "output.$append")
# Conditionals
if exists("error")
set_field("is_valid", "no")
log("error")
elsif exists("warning")
set_field("is_valid", "yes")
log("warning")
else
set_field("is_valid", "yes")
end
# Loops
do list(path: "foo", "var": "$i")
add_field("$i.bar", "baz")
end
Content assist is triggered with Ctrl-Space. The input above is also used in FixParsingTest.java.
Run workflows on the web server, passing data, flux, and fix:
- We recommend using double quotation marks for arguments and values in functions, binds and conditionals.
- If using a list bind with a variable, the var option requires quotation marks (do list(path: "<sourceField>", "var": "<variableName>")).
- Fix turns repeated fields into arrays internally, but only marked arrays (with [] at the end of the field name) are also emitted as "arrays" (entities with indexed literals); all other arrays are emitted as repeated fields (see the example after this list).
- Every Fix file should end with a final newline.
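For example (a sketch; field and value names are illustrative), the [] marker decides how an array is emitted:
# emitted as two repeated "subject" literals
add_array("subject", "art", "music")
# emitted as a "subject[]" entity with indexed literals
add_array("subject[]", "art", "music")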
Array wildcards resemble Catmandu's concept of wildcards.
When working with arrays and repeated fields you can use wildcards instead of an index number to select elements of an array.
Wildcard | Meaning
---|---
* | Selects all elements of an array.
$first | Selects only the first element of an array.
$last | Selects only the last element of an array.
$prepend | Selects the position before the first element of an array. Can only be used when adding new elements to an array.
$append | Selects the position after the last element of an array. Can only be used when adding new elements to an array.
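A short sketch using these array wildcards (field names are illustrative):
# upcase every element of the "animals[]" array
upcase("animals[].*")
# copy the last element into a new field
copy_field("animals[].$last", "favourite")
# append a new element
add_field("animals[].$append", "dog")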
Path wildcards resemble Metamorph's concept of wildcards. They are not supported in Catmandu (it has specialized Fix functions instead).
You can use path wildcards to select fields matching a pattern. They only match path segments (field names), though, not whole paths of nested fields. These wildcards cannot be used to add new elements.
Wildcard | Meaning
---|---
* | Placeholder for zero or more characters.
? | Placeholder for exactly one character.
\| | Alternation of multiple patterns.
[...] | Enumeration of characters.
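A short sketch using path wildcards (field names are illustrative):
# matches field names starting with "n", e.g. "name" and "note"
upcase("n*")
# matches "date1" and "date2", but not "date10"
remove_field("date?")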
Includes a Fix file and executes it as if its statements were written in place of the function call.
Parameters:
- path (required): Path to the Fix file (if the path starts with a ., it is resolved relative to the including file's directory; otherwise, it is resolved relative to the current working directory).
Options:
- All options are made available as "dynamic" local variables in the included Fix file.
include("<path>"[, <dynamicLocalVariables>...])
Sends a message to the logs.
Parameters:
- logMessage (required): Message to log.
Options:
- level: Log level to log at (one of DEBUG, INFO, WARN or ERROR). (Default: INFO)
log("<logMessage>"[, level: "<logLevel>"])
Does nothing. It is used for benchmarking in Catmandu.
nothing()
Defines an external map for lookup from a file or a URL. Maps with more than 2 columns are supported but are reduced to a defined key and a value column.
put_filemap("<sourceFile>", "<mapName>", sep_char: "\t")
The separator (sep_char) will vary depending on the source file, e.g.:
Type | Separator
---|---
CSV | , or ;
TSV | \t
Options:
- allow_empty_values: Sets whether to allow empty values in the filemap or to ignore these entries. (Default: false)
- compression: Sets the compression of the file.
- decompress_concatenated: Flags whether to use decompress concatenated file compression.
- encoding: Sets the encoding used to open the resource.
- expected_columns: Sets the number of expected columns; lines with a different number of columns are ignored. Set to -1 to disable the check and allow an arbitrary number of columns. (Default: 2)
- key_column: Defines the column to be used for keys. Uses zero index. (Default: 0)
- value_column: Defines the column to be used for values. Uses zero index. (Default: 1)
Defines an internal map for lookup from key/value pairs.
put_map("<mapName>",
"dog": "mammal",
"parrot": "bird",
"shark": "fish"
)
Defines an external RDF map for lookup from a file or an HTTP(S) resource. Since the RDF map reduces RDF triples to a key/value map, it is mandatory to set the target. The targeted RDF property can optionally be restricted by an RDF language tag.
put_rdfmap("<rdfResource>", "<rdfMapName>", target: "<rdfProperty>")
put_rdfmap("<rdfResource>", "<rdfMapName>", target: "<rdfProperty>", select_language: "<rdfLanguageTag>")
Defines a single global variable that can be referenced with $[<variableName>].
put_var("<variableName>", "<variableValue>")
Defines multiple global variables that can be referenced with $[<variableName>].
put_vars(
"<variableName_1>": "<variableValue_1>",
"<variableName_2>": "<variableValue_2>"
)
Defines a single global variable that can be referenced with $[<variableName>] and assigns it the value of <sourceField>.
to_var("<sourceField>", "<variableName>")
Options:
- default: Default value if the source field does not exist. The option needs to be written in quotation marks because it is a reserved word in Java. (Default: Empty string)
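For example (field and variable names are illustrative):
to_var("institution.isil", "isil", "default": "DE-000")
# the value can now be referenced as $[isil] in subsequent statements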
Creates a new array (with optional values).
add_array("<targetFieldName>")
add_array("<targetFieldName>", "<value_1>"[, ...])
Creates a field with a defined value.
add_field("<targetFieldName>", "<fieldValue>")
Creates a new hash (with optional values).
add_hash("<targetFieldName>")
add_hash("<targetFieldName>", "subfieldName": "<subfieldValue>"[, ...])
Converts a hash/object into an array.
array("<sourceField>")
E.g.:
array("foo")
# {"name":"value"} => ["name", "value"]
Calls a named macro, i.e. a list of statements that have been previously defined with the do put_macro bind.
Parameters:
- name (required): Unique name of the macro.
Options:
- All options are made available as "dynamic" local variables in the macro.
do put_macro("<macroName>"[, <staticLocalVariables>...])
...
end
call_macro("<macroName>"[, <dynamicLocalVariables>...])
Copies a field from an existing field.
copy_field("<sourceField>", "<targetField>")
Replaces the value with a formatted (sprintf-like) version.
---- TODO: THIS NEEDS MORE CONTENT -----
format("<sourceField>", "<formatString>")
Converts an array into a hash/object.
hash("<sourceField>")
E.g.:
hash("foo")
# ["name", "value"] => {"name":"value"}
Moves a field from an existing field. Can be used to rename a field.
move_field("<sourceField>", "<targetField>")
Parses a text into an array or hash of values.
---- TODO: THIS NEEDS MORE CONTENT -----
parse_text("<sourceField>", "<parsePattern>")
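A tentative sketch, assuming Catmandu-like behaviour where capture groups become array elements (and named capture groups would produce a hash); field names and values are illustrative:
# "date": "2015-03-07"
parse_text("date", "([0-9]{4})-([0-9]{2})-([0-9]{2})")
# "date": ["2015", "03", "07"]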
Joins multiple field values into a new field. Can be combined with additional literal strings.
The default join_char is a single space. Literal strings have to start with ~.
paste("<targetField>", "<sourceField_1>"[, ...][, "join_char": ", "])
E.g.:
# a: eeny
# b: meeny
# c: miny
# d: moe
paste("my.string", "~Hi", "a", "~how are you?")
# "my.string": "Hi eeny how are you?"
Prints the current record as JSON either to standard output or to a file.
Parameters:
- prefix (optional): Prefix to print before the record; may include format directives for counter and record ID (in that order). (Default: Empty string)
Options:
- append: Whether to open files in append mode if they exist. (Default: false)
- compression (file output only): Compression mode. (Default: auto)
- destination: Destination to write the record to; may include format directives for counter and record ID (in that order). (Default: stdout)
- encoding (file output only): Encoding used by the underlying writer. (Default: UTF-8)
- footer: Footer which is written at the end of the output. (Default: \n)
- header: Header which is written at the beginning of the output. (Default: Empty string)
- id: Field name which contains the record ID; if found, it will be available for inclusion in prefix and destination. (Default: _id)
- internal: Whether to print the record's internal representation instead of JSON. (Default: false)
- pretty: Whether to use pretty printing. (Default: false)
- separator: Separator which is written after the record. (Default: \n)
print_record(["<prefix>"][, <options>...])
E.g.:
print_record("%d) Before transformation: ")
print_record(destination: "record-%2$s.json", id: "001", pretty: "true")
print_record(destination: "record-%03d.json.gz", header: "After transformation: ")
Creates (or replaces) a field with a random number (less than the specified maximum).
random("<targetField>", "<maximum>")
Removes a field.
remove_field("<sourceField>")
Replaces a regular expression pattern in subfield names of a field. Does not change the name of the source field itself.
rename("<sourceField>", "<regexp>", "<replacement>")
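For example (field and subfield names are illustrative):
# "author": {"first_name": "Jane", "last_name": "Doe"}
rename("author", "_name", "")
# "author": {"first": "Jane", "last": "Doe"}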
Deletes all fields except the ones listed (incl. subfields).
retain("<sourceField_1>"[, ...])
Currently an alias for add_array. We advise you to use add_array instead of set_array due to changing behaviour in an upcoming release. For more information see: #309
Currently an alias for add_field. We advise you to use add_field instead of set_field due to changing behaviour in an upcoming release. For more information see: #309
Currently an alias for add_hash. We advise you to use add_hash instead of set_hash due to changing behaviour in an upcoming release. For more information see: #309
Creates (or replaces) a field with the current timestamp.
Options:
- format: Date and time pattern as in java.text.SimpleDateFormat. (Default: timestamp)
- timezone: Time zone as in java.util.TimeZone. (Default: UTC)
- language: Language tag as in java.util.Locale. (Default: the locale of the host system)
timestamp("<targetField>"[, format: "<formatPattern>"][, timezone: "<timezoneCode>"][, language: "<languageCode>"])
Deletes empty fields, arrays and objects.
vacuum()
Adds a string at the end of a field value.
append("<sourceField>", "<appendString>")
Upcases the first character in a field value.
capitalize("<sourceField>")
Counts the number of elements in an array or a hash and replaces the field value with this number.
count("<sourceField>")
Downcases all characters in a field value.
downcase("<sourceField>")
Only keeps field values that match the regular expression pattern. Works only with arrays of strings/repeated fields.
filter("<sourceField>", "<regexp>")
Flattens a nested array field.
flatten("<sourceField>")
Replaces the string with its JSON deserialization.
Options:
- error_string: Error message as a placeholder if the JSON couldn't be parsed. (Default: null)
from_json("<sourceField>"[, error_string: "<errorValue>"])
Returns the index position of a substring in a field and replaces the field value with this number.
index("<sourceField>", "<substring>")
Extracts an ISBN and replaces the field value with the normalized ISBN; optionally converts and/or validates the ISBN.
Options:
- to: ISBN format to convert to (either ISBN10 or ISBN13). (Default: Only normalize the ISBN)
- verify_check_digit: Whether the check digit should be verified. (Default: false)
- error_string: Error message as a placeholder if the ISBN couldn't be validated. (Default: null)
isbn("<sourceField>"[, to: "<isbnFormat>"][, verify_check_digit: "<boolean>"][, error_string: "<errorValue>"])
Joins an array of strings into a single string.
join_field("<sourceField>", "<separator>")
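For example (field names and values are illustrative):
# "numbers": ["1", "2", "3"]
join_field("numbers", "/")
# "numbers": "1/2/3"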
Looks up matching values in a map and replaces the field value with this match. External files, internal maps as well as RDF resources can be used.
Parameters:
- path (required): Field path to look up.
- map (optional): Name or path of the map in which to look up values.
Options:
- default: Default value to use for unknown values. The option needs to be written in quotation marks because it is a reserved word in Java. (Default: Old value)
- delete: Whether to delete unknown values. (Default: false)
- print_unknown: Whether to print unknown values. (Default: false)
Additional options when printing unknown values:
- append: Whether to open files in append mode if they exist. (Default: true)
- compression (file output only): Compression mode. (Default: auto)
- destination: Destination to write unknown values to; may include format directives for counter and record ID (in that order). (Default: stdout)
- encoding (file output only): Encoding used by the underlying writer. (Default: UTF-8)
- footer: Footer which is written at the end of the output. (Default: \n)
- header: Header which is written at the beginning of the output. (Default: Empty string)
- id: Field name which contains the record ID; if found, it will be available for inclusion in prefix and destination. (Default: _id)
- prefix: Prefix to print before the unknown value; may include format directives for counter and record ID (in that order). (Default: Empty string)
- separator: Separator which is written after the unknown value. (Default: \n)
lookup("<sourceField>"[, <mapName>][, <options>...])
E.g.:
# local (unnamed) map
lookup("path.to.field", key_1: "value_1", ...)
# internal (named) map
put_map("internal-map", key_1: "value_1", ...)
lookup("path.to.field", "internal-map")
# external file map (implicit)
lookup("path.to.field", "path/to/file", sep_char: ";")
# external file map (explicit)
put_filemap("path/to/file", "file-map", sep_char: ";")
lookup("path.to.field", "file-map")
# RDF map (explicit)
put_rdfmap("path/to/file", "rdf-map", target: "<rdfProperty>")
lookup("path.to.field", "rdf-map")
# with default value
lookup("path.to.field", "map-name", "default": "NA")
# with printing unknown values to a file
lookup("path.to.field", "map-name", print_unknown: "true", destination: "unknown.txt")
Adds a string at the beginning of a field value.
prepend("<sourceField>", "<prependString>")
Replaces a regular expression pattern in field values with a replacement string. Regexp capturing is possible; refer to capturing groups by number ($<number>) or name (${<name>}).
replace_all("<sourceField>", "<regexp>", "<replacement>")
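For example (field names and values are illustrative), using numbered capturing groups:
# "author": "Doe, Jane"
replace_all("author", "([^,]+), (.+)", "$2 $1")
# "author": "Jane Doe"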
Reverses the character order of a string or the element order of an array.
reverse("<sourceField>")
Sorts strings in an array. Sorts alphabetically (A-Z) by default; numerical and reverse sorting are available as options.
sort_field("<sourceField>")
sort_field("<sourceField>", reverse: "true")
sort_field("<sourceField>", numeric: "true")
Splits a string into an array and replaces the field value with this array.
split_field("<sourceField>", "<separator>")
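For example (field names and values are illustrative):
# "title": "fix-the-metadata"
split_field("title", "-")
# "title": ["fix", "the", "metadata"]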
Replaces a string with its substring as defined by the start position (offset) and length.
substring("<sourceField>", "<startPosition>", "<length>")
Sums numbers in an array and replaces the field value with this number.
sum("<sourceField>")
Replaces the value with its JSON serialization.
Options:
- error_string: Error message as a placeholder if the JSON couldn't be generated. (Default: null)
- pretty: Whether to use pretty printing. (Default: false)
to_json("<sourceField>"[, pretty: "<boolean>"][, error_string: "<errorValue>"])
Replaces the value with its Base64 encoding.
Options:
- url_safe: Perform URL-safe encoding (uses Base64URL format). (Default: false)
to_base64("<sourceField>"[, url_safe: "<boolean>"])
Deletes whitespace at the beginning and the end of a field value.
trim("<sourceField>")
Deletes duplicate values in an array.
uniq("<sourceField>")
Upcases all characters in a field value.
upcase("<sourceField>")
Encodes a field value as a URI (aka percent-encoding).
Options:
- plus_for_space: Sets whether "space" is encoded as a plus sign (+) or percent-escaped (%20). (Default: true)
- safe_chars: Sets characters that won't be escaped. The ranges 0..9, a..z and A..Z are always safe and should not be specified. (Default: .-*_)
uri_encode("<sourceField>"[, <options>...])
E.g.:
uri_encode("path.to.field", plus_for_space:"false", safe_chars:"")
Ignores records that match a condition.
if <condition>
  reject()
end
Iterates over each element of an array. In contrast to Catmandu, it can also iterate over a single object or string.
do list(path: "<sourceField>")
...
end
Only the current element is accessible in this case (as the root element).
When specifying a variable name for the current element, the record remains accessible as the root element and the current element is accessible through the variable name:
do list(path: "<sourceField>", "var": "<variableName>")
...
end
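For example (field and variable names are illustrative), upcase the name of each author while the record stays accessible as the root element:
# "author": [{"name": "jane"}, {"name": "joe"}]
do list(path: "author", "var": "$i")
  upcase("$i.name")
end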
Iterates over each named element of an array (like do list with a variable name). If multiple arrays are given, iterates over the corresponding elements from each array (i.e., all elements with the same array index, skipping elements whose arrays have already been exhausted).
do list_as(element_1: "<sourceField_1>"[, ...])
...
end
E.g.:
# "ccm:university":["https://ror.org/0304hq317"]
# "ccm:university_DISPLAYNAME":["Gottfried Wilhelm Leibniz Universität Hannover"]
set_array("sourceOrga[]")
do list_as(orgId: "ccm:university[]", orgName: "ccm:university_DISPLAYNAME[]")
copy_field(orgId, "sourceOrga[].$append.id")
copy_field(orgName, "sourceOrga[].$last.name")
end
# {"sourceOrga":[{"id":"https://ror.org/0304hq317","name":"Gottfried Wilhelm Leibniz Universität Hannover"}]}
Executes the statements only once (when the bind is first encountered), not repeatedly for each record.
do once()
...
end
In order to execute multiple blocks only once, tag them with unique identifiers:
do once("maps setup")
...
end
do once("vars setup")
...
end
Defines a named macro, i.e. a list of statements that can be executed later with the call_macro function.
Variables can be referenced with $[<variableName>], in the following order of precedence:
- "dynamic" local variables, passed as options to the call_macro function;
- "static" local variables, passed as options to the do put_macro bind;
- global variables, defined via put_var/put_vars.
Parameters:
- name (required): Unique name of the macro.
Options:
- All options are made available as "static" local variables in the macro.
do put_macro("<macroName>"[, <staticLocalVariables>...])
...
end
call_macro("<macroName>"[, <dynamicLocalVariables>...])
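A sketch of defining and calling a macro (macro, option and field names are illustrative); the dynamic local variable passed to call_macro takes precedence over the static default:
do put_macro("stamp", source: "unknown")
  add_field("data_source", "$[source]")
end
# uses the static default "unknown":
call_macro("stamp")
# overrides it with a dynamic local variable:
call_macro("stamp", source: "Example University")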
Conditionals start with if in case of affirming the condition or unless in case of rejecting the condition. Conditionals require a final end. Additional branches can be set with elsif and else.
if <condition(params, ...)>
...
end
unless <condition(params, ...)>
...
end
if <condition(params, ...)>
...
elsif
...
else
...
end
all_contain: Executes the functions if/unless the field contains the value. If it is an array or a hash, all field values must contain the string.
any_contain: Executes the functions if/unless the field contains the value. If it is an array or a hash, one or more field values must contain the string.
none_contain: Executes the functions if/unless the field does not contain the value. If it is an array or a hash, none of the field values may contain the string.
str_contain: Executes the functions if/unless the first string contains the second string.
all_equal: Executes the functions if/unless the field value equals the string. If it is an array or a hash, all field values must equal the string.
any_equal: Executes the functions if/unless the field value equals the string. If it is an array or a hash, one or more field values must equal the string.
none_equal: Executes the functions if/unless the field value does not equal the string. If it is an array or a hash, none of the field values may equal the string.
str_equal: Executes the functions if/unless the first string equals the second string.
exists: Executes the functions if/unless the field exists.
if exists("<sourceField>")
in: Executes the functions if/unless the field value is contained in the value of the other field. Also aliased as is_contained_in.
is_contained_in: Alias for in.
is_array: Executes the functions if/unless the field value is an array.
is_empty: Executes the functions if/unless the field value is empty.
is_false: Executes the functions if/unless the field value equals false or 0.
is_hash: Alias for is_object.
is_number: Executes the functions if/unless the field value is a number.
is_object: Executes the functions if/unless the field value is a hash (object). Also aliased as is_hash.
is_string: Executes the functions if/unless the field value is a string (and not a number).
is_true: Executes the functions if/unless the field value equals true or 1.
all_match: Executes the functions if/unless the field value matches the regular expression pattern. If it is an array or a hash, all field values must match the regular expression pattern.
any_match: Executes the functions if/unless the field value matches the regular expression pattern. If it is an array or a hash, one or more field values must match the regular expression pattern.
none_match: Executes the functions if/unless the field value does not match the regular expression pattern. If it is an array or a hash, none of the field values may match the regular expression pattern.
str_match: Executes the functions if/unless the string matches the regular expression pattern.
This repo was originally set up with Xtext 2.17.0 and Eclipse for Java 2019-03, following https://www.eclipse.org/Xtext/documentation/104_jvmdomainmodel.html.