diff --git a/CHANGELOG.md b/CHANGELOG.md index 41351c7..8acda5a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,7 @@ ### Unreleased changes -* fix: error messages for `keep_columns` and `drop_columns` do not specify the columns + +* feature: Allow a `colspec_file` config with column info for `fixedwidth` inputs +* feature: error messages for `keep_columns` and `drop_columns` now specify the columns ### v0.4.2
diff --git a/README.md b/README.md index 58adb01..0556bb3 100644 --- a/README.md +++ b/README.md @@ -254,7 +254,7 @@ Each source must have a name (which is how it is referenced by transformations a - Row-based formats: - `.csv`: Specify the number of `header_rows`, and (if `header_rows` > 0, optionally) overwrite the `column` names. Optionally specify an `encoding` to use when reading the file (the default is UTF8). - `.tsv`: Specify the number of `header_rows`, and (if `header_rows` > 0, optionally) overwrite the `column` names. Optionally specify an `encoding` to use when reading the file (the default is UTF8). - - `.txt`: a fixed-width text file; column widths are inferred from the first 100 lines. + - `.txt`: a fixed-width text file. See [here](https://github.com/edanalytics/earthmover/blob/main/docs/fixedwidth-sources.md) for usage information - Column-based formats: `.parquet`, `.feather`, `.orc` — these require the [`pyarrow` library](https://arrow.apache.org/docs/python/index.html), which can be installed with `pip install pyarrow` or similar - Structured formats: - `.json`: Optionally specify a `object_type` (`frame` or `series`) and `orientation` (see [these docs](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)) to interpret different JSON structures. diff --git a/docs/fixedwidth-sources.md b/docs/fixedwidth-sources.md new file mode 100644 index 0000000..0124b22 --- /dev/null +++ b/docs/fixedwidth-sources.md @@ -0,0 +1,93 @@ +# Working with fixed-width source files + +One challenge of working with fixed-width files (FWFs) is that they require additional metadata. In particular, any tool that reads a FWF into a tabular structure needs to know how to slice each row into its constituent columns. Earthmover supports two ways of providing this information: + +## 1. Provide a `colspec_file` + +In your earthmover.yaml config, a `fixedwidth` source is specified much like any other file source. 
Here is a complete example: + +```yaml +sources: + input: + file: ./data/input.txt + colspec_file: ./seed/colspecs.csv # required + colspec_headers: + name: field_name # required + start: start_index # required if `width` is not provided + end: end_index # required if `width` is not provided + width: field_length # required if `start` or `end` is not provided + type: fixedwidth # required if `file` does not end with '.txt' + header_rows: 0 +``` + +Some notes on the available options: + - (required) `colspec_file`: a path to the CSV containing your colspec metadata + - (required) `colspec_headers`: a mapping between the `colspec_file`'s column names and the fields Earthmover requires. **Note that the names and positions of the columns in your `colspec_file` do not matter** + - Of these, only `name` is always required. Your `colspec_file` should contain a column that assigns a name to each field in the FWF + - You must provide either `width`, or both `start` and `end`. If all three are provided, `start` and `end` take precedence + - If you provide `width`, your `colspec_file` should include a column of integer values that specifies the number of characters in each field in the FWF + - If you provide `start` and `end`, your `colspec_file` should include two columns of integer values [giving the extents of the FWF's fields as half-open intervals (i.e., \[from, to\[ )](https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html) + - (optional) `type`: if the input file has a `.txt` extension, you do not need to specify `type`. However, since there is no standard extension for FWFs, it is a good idea to use `type: fixedwidth` + - (optional) `header_rows`: this is almost always 0 for FWFs. Earthmover will usually infer this even if you don't specify it, but we recommend doing so + +### Formatting a `colspec_file` +In accordance with the above, a `colspec_file` must include a column with field names, as well as either a column with field widths, or two columns with start and end positions. 
Both of the following CSVs are valid and equivalent to one another: + +```csv +name,width +date,8 +id,16 +score_1,2 +score_2,2 +``` +For this file, your earthmover.yaml would look like: + +```yaml +colspec_headers: + name: name + width: width +``` + +or + +```csv +start_idx,end_idx,other_data,full_field_name,other_data_2 +0,8,abc,date,def +8,24,abc,id,def +24,26,abc,score_1,def +26,28,abc,score_2,def +``` +For this file, your earthmover.yaml would look like: + +```yaml +colspec_headers: + name: full_field_name + start: start_idx + end: end_idx +``` + +## 2. Provide `colspecs` and `columns` directly + +Alternatively, you can put the same information directly in your Earthmover config, like this: + +```yaml +sources: + input: + file: ./data/input.txt + type: fixedwidth # required if `file` does not end with '.txt' + header_rows: 0 + colspecs: # optional; inferred from the data if omitted + - [0, 8] + - [8, 24] + - [24, 26] + - [26, 28] + columns: # required + - date + - id + - score_1 + - score_2 +``` + +Some notes on the available options: + - (optional, but recommended) `colspecs`: a list of start/end indices [giving the extents of the FWF's fields as half-open intervals (i.e., \[from, to\[ )](https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html). If omitted, the column boundaries are inferred from the data + - (required) `columns`: a list of column names corresponding to the indices in `colspecs` \ No newline at end of file diff --git a/earthmover/nodes/source.py b/earthmover/nodes/source.py index 898eb96..3451cf7 100644 --- a/earthmover/nodes/source.py +++ b/earthmover/nodes/source.py @@ -106,7 +106,7 @@ class FileSource(Source): is_remote: bool = False allowed_configs: Tuple[str] = ( 'debug', 'expect', 'show_progress', 'repartition', 'chunksize', 'optional', 'optional_fields', - 'file', 'type', 'columns', 'header_rows', 'colspecs', 'rename_cols', + 'file', 'type', 'columns', 'header_rows', 'colspec_file', 'colspecs', 'colspec_headers', 'rename_cols', 'encoding', 'sheet', 'object_type', 'match', 'orientation', 'xpath', ) @@ -263,8 +263,62 
@@ def _get_filetype(file: str): ext = file.lower().rsplit('.', 1)[-1] return ext_mapping.get(ext) - @staticmethod - def _get_read_lambda(file_type: str, sep: Optional[str] = None): + def __read_fwf(self, file: str, config: 'YamlMapping'): + colspec_file = config.get('colspec_file') + if not colspec_file: + names = config.get('columns') + if not names: + self.error_handler.throw("No `colspec_file` specified for fixedwidth source. In this case, `columns` must be specified, and `colspecs` may be specified, or else will be inferred") + + return dd.read_fwf(file, colspecs=config.get('colspecs', "infer"), header=config.get('header_rows', "infer"), names=names, converters={c:str for c in names}) + try: + # ensure we find the colspec file relative to the config file that references it (in case of project composition) + file_format = pd.read_csv(os.path.join(os.path.dirname(self.config.__file__), colspec_file)) + # we need to handle this separately because otherwise EM will report that the source file + # (instead of the colspec file) could not be found + except FileNotFoundError: + self.error_handler.throw( + f"colspec file '{colspec_file}' not found" + ) + + colspec_headers = config.get("colspec_headers") + if not colspec_headers: + self.error_handler.throw( + "`colspec_headers` must be specified when supplying a colspec file" + ) + + try: + # name column is required + name_col = colspec_headers["name"] + except KeyError: + self.error_handler.throw( + "a `name` column must be provided when supplying colspec_headers" + ) + + start_col = colspec_headers.get("start") + end_col = colspec_headers.get("end") + width_col = colspec_headers.get("width") + # pandas does not allow specifying both start/end and widths, but we just let start/end take precedence + if start_col and end_col: + use_widths = False + elif width_col: + use_widths = True + else: + self.error_handler.throw( + "either `width` or (`start`, `end`) must be specified when supplying colspec_headers" + ) + + names = 
file_format[name_col] + header = config.get('header_rows', "infer") + converters = {c:str for c in names} + if use_widths: + widths = list(file_format[width_col]) + return dd.read_fwf(file, widths=widths, header=header, names=names, converters=converters) + else: + # look up the start/end columns by the names given in `colspec_headers`, not by hardcoded names + colspecs = list(zip(file_format[start_col], file_format[end_col])) + return dd.read_fwf(file, colspecs=colspecs, header=header, names=names, converters=converters) + + def _get_read_lambda(self, file_type: str, sep: Optional[str] = None): """ :param file_type: @@ -277,13 +331,12 @@ def __get_skiprows(config: 'YamlMapping'): _header_rows = config.get('header_rows', 1) return int(_header_rows) - 1 # If header_rows = 1, skip none. - # We don't want to activate the function inside this helper function. read_lambda_mapping = { 'csv' : lambda file, config: dd.read_csv(file, sep=sep, dtype=str, encoding=config.get('encoding', "utf8"), keep_default_na=False, skiprows=__get_skiprows(config)), 'excel' : lambda file, config: pd.read_excel(file, sheet_name=config.get("sheet", 0), keep_default_na=False), 'feather' : lambda file, _ : pd.read_feather(file), - 'fixedwidth': lambda file, config: dd.read_fwf(file, colspecs=config.get('colspecs', "infer"), header=config.get('header_rows', "infer"), names=config.get('columns'), converters={c:str for c in config.get('columns')}), + 'fixedwidth': self.__read_fwf, 'html' : lambda file, config: pd.read_html(file, match=config.get('match', ".+"), keep_default_na=False)[0], 'orc' : lambda file, _ : dd.read_orc(file), 'json' : lambda file, config: dd.read_json(file, typ=config.get('object_type', "frame"), orient=config.get('orientation', "columns")),
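
Reviewer note: the new docs describe a `width` column and a `start`/`end` pair as equivalent ways to specify the same fields. The sketch below illustrates that equivalence using only the standard library. It is not Earthmover code: `widths_to_colspecs` and `slice_row` are hypothetical helper names introduced here, and the sample row is made up to match the `date`/`id`/`score_1`/`score_2` layout from the docs.

```python
from itertools import accumulate


def widths_to_colspecs(widths):
    """Turn field widths into half-open (start, end) intervals (hypothetical helper)."""
    ends = list(accumulate(widths))   # running total gives each field's end offset
    starts = [0] + ends[:-1]          # each field starts where the previous one ended
    return list(zip(starts, ends))


def slice_row(row, colspecs, names):
    """Cut one fixed-width row into a {name: value} dict, stripping padding (hypothetical helper)."""
    return {name: row[start:end].strip() for name, (start, end) in zip(names, colspecs)}


names = ["date", "id", "score_1", "score_2"]
colspecs = widths_to_colspecs([8, 16, 2, 2])

# Illustrative row: the `id` field is right-padded with spaces to its 16-character width.
row = "20240115" + "S-001".ljust(16) + "87" + "92"
record = slice_row(row, colspecs, names)

print(colspecs)
print(record)
```

The widths `8, 16, 2, 2` produce the same intervals as the `start_idx`/`end_idx` example in the docs, which is why either form of `colspec_file` should yield the same columns.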