implements duration support in extract_metadata #223

Merged
merged 7 commits into WizardMac:dev on Jan 4, 2021

Conversation

@basgys commented Dec 16, 2020

Hi @evanmiller,

I made a first implementation of duration support in extract_metadata. This feature would break backward compatibility, so it is up to you to tell me whether we should use a different strategy.

I used this website as a reference for the existing data types: https://libguides.library.kent.edu/SPSS/DatesTime

Other option

As I mentioned in PR #220, the other option would be to label the type as NUMERIC for both dates and durations, and then include a format field in the JSON payload that describes how the value is displayed:

TIME8 => hh:mm:ss
{
   "type": "NUMERIC",
   "format": "TIME",
   "pattern": "hh:mm:ss"
}

Possible formats:

  1. NUMBER
  2. PERCENT
  3. CURRENCY
  4. DATE
  5. TIME
  6. DATE_TIME
  7. UNSPECIFIED
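In C, these categories would presumably end up as an enum along the following lines (a sketch only; the EXTRACT_METADATA_FORMAT_* naming appears later in this PR, but the exact members and their order here are a guess):

    typedef enum extract_metadata_format_e {
        EXTRACT_METADATA_FORMAT_UNSPECIFIED,
        EXTRACT_METADATA_FORMAT_NUMBER,
        EXTRACT_METADATA_FORMAT_PERCENT,
        EXTRACT_METADATA_FORMAT_CURRENCY,
        EXTRACT_METADATA_FORMAT_DATE,
        EXTRACT_METADATA_FORMAT_TIME,
        EXTRACT_METADATA_FORMAT_DATE_TIME
    } extract_metadata_format_t;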

Note on data type handling

I added the following comment to the code:

    // All-or-nothing is probably not the best strategy for data type extraction.
    // When SPSS/STATA introduces new types, metadata extraction could fail.
    // It would be wiser to simply label the field as "UNKNOWN".

It is a matter of philosophy, I guess, but I would argue that this error should be handled further up the stack. extract_metadata should be responsible, as its name states, for extracting metadata. If a new type is introduced, it should simply return the column as uncategorised (e.g. UNKNOWN) instead of failing. Perhaps the new column is not important at all and does not deserve a complete failure of the extraction process.

Style

Also, note that I haven't worked with C in a long time and I did not want to go crazy with refactoring, but I'm pretty sure there is a more elegant approach than this chain of strncmp calls. What would you suggest?

In Go I would have used a map of strings:

  var categories = map[string]string{
    "TIME8": "DURATION",
    "DATE9": "DATE",
    // ...
  }
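For reference, a rough C sketch of the same table-driven idea (the table entries and the categorize helper below are illustrative, not the full SPSS/Stata format list):

    #include <string.h>

    /* Illustrative prefix table. Longer prefixes must come before shorter
     * ones they contain (e.g. "DATETIME" before "DATE"). */
    static const struct {
        const char *prefix;
        const char *category;
    } categories[] = {
        { "DATETIME", "DATE_TIME" },
        { "DTIME",    "DURATION"  },
        { "TIME",     "DURATION"  },
        { "ADATE",    "DATE"      },
        { "EDATE",    "DATE"      },
        { "SDATE",    "DATE"      },
        { "DATE",     "DATE"      },
    };

    static const char *categorize(const char *format) {
        for (size_t i = 0; i < sizeof(categories)/sizeof(categories[0]); i++) {
            if (strncmp(format, categories[i].prefix, strlen(categories[i].prefix)) == 0)
                return categories[i].category;
        }
        return "UNKNOWN";
    }

This keeps the format-to-category mapping in one place, much like the Go map above.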

@evanmiller

Hi, thanks for the contribution. I don't have a strong opinion on the type vs format debate – I would suggest however looking at the readstat utility to ensure it can still use the JSON output of extract_metadata to read in CSV. Reverse-formatting is a hard problem, so I don't necessarily expect a full solution.

Re: C style I will add some comments in a code review.

@basgys commented Dec 16, 2020

I would suggest however looking at the readstat utility

Of course! Could you point me to this utility? And regarding a potential break in backward compatibility, what do you think?

Cheers!

// https://libguides.library.kent.edu/SPSS/DatesTime
if (format && (strncmp(format, "DATE", sizeof("DATE")-1) == 0 ||
strncmp(format, "ADATE", sizeof("ADATE")-1) == 0 ||
strncmp(format, "EDATE", sizeof("EDATE")-1) == 0 ||
strncmp(format, "SDATE", sizeof("SDATE")-1) == 0)) {
@evanmiller:

Since format is a regular C string, strcmp instead of strncmp will be fine here.

strncmp(format, "TIME11.2", sizeof("TIME11.2")-1) == 0 ||
strncmp(format, "DTIME9", sizeof("DTIME9")-1) == 0 ||
strncmp(format, "DTIME12", sizeof("DTIME12")-1) == 0 ||
strncmp(format, "DTIME15.2", sizeof("DTIME15.2")-1) == 0)) {
@evanmiller:

You might simplify these comparisons with sscanf, which can do basic pattern matching. I typically do something like

long width;
if (sscanf(format, "MTIME%ld", &width) == 1) {
} else if (sscanf(format, "TIME%ld", &width) == 1) {
...

It's probably okay to be a little loose with the allowed widths and precisions.
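A slightly fuller sketch of that approach, keeping widths and precisions deliberately loose (the classify helper and the category strings are placeholders, not part of this PR):

    #include <stdio.h>

    /* Placeholder classification: widths are parsed but not validated. */
    static const char *classify(const char *format) {
        long width;
        if (sscanf(format, "MTIME%ld", &width) == 1 ||
            sscanf(format, "DTIME%ld", &width) == 1 ||
            sscanf(format, "TIME%ld",  &width) == 1)
            return "DURATION";
        if (sscanf(format, "DATETIME%ld", &width) == 1)
            return "DATE_TIME";
        return NULL; /* not a time-like format */
    }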

// All-or-nothing is probably not the best strategy for data type extraction.
// When SPSS/STATA introduces new types, metadata extraction could fail.
// It would be wiser to simply label the field as "UNKNOWN".
// TODO: Remove this failure?
@evanmiller:

I'm okay being strict. It often brings people here to file bug reports :-)

@evanmiller

I would suggest however looking at the readstat utility

Of course! Could you point me to this utility? And regarding a potential break in backward compatibility, what do you think?

Cheers!

This function would need to be modified for instance:

metadata_column_type_t column_type(struct json_metadata* md, const char* varname, int output_format) {

The CSV reading code (part of the main readstat program) is here:

https://github.com/WizardMac/ReadStat/tree/dev/src/bin/read_csv

I'm fine with breaking backward compatibility, but for the sake of politeness this would probably need to be documented and wait until a 1.2 release.

@basgys commented Dec 18, 2020

@evanmiller I've implemented format and pattern on extract_metadata.

If you are happy with this solution, I will update the readstat CSV export to implement those new formats.

I also included STATA date/datetime as requested here: #221 (comment)

Let me know what you think.

@evanmiller

Are the Google Sheets formats standard in any way?

I would strongly suggest looking closely at the TimeFormatStrings library I mentioned in another thread. It solves the exact problem of parsing and converting SPSS / Stata / SAS format strings, and can convert them reliably into a UTS-35 Date Format Pattern or into an Excel format. The library currently lacks numeric and duration support, but I think it would already simplify your existing code and open up some other possibilities for extract_metadata to offer output format options.

Depending on your commitment to this, probably the best solution would be to move some of this logic into the core ReadStat library – i.e. add a READSTAT_TYPE_DATE and READSTAT_TYPE_DATETIME type, and return standardized date values, as well as standardized format strings. That way extract_metadata could be relatively thin. But it would be a significant amount of work, which is why I've been putting it off for a long time!
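For concreteness, the core change described here would amount to something like the following (purely a hypothetical sketch: the existing members are paraphrased from readstat.h, and the last two values are only proposed in this thread, not implemented):

    typedef enum readstat_type_e {
        READSTAT_TYPE_STRING,
        READSTAT_TYPE_INT8,
        READSTAT_TYPE_INT16,
        READSTAT_TYPE_INT32,
        READSTAT_TYPE_FLOAT,
        READSTAT_TYPE_DOUBLE,
        READSTAT_TYPE_STRING_REF,
        /* proposed additions */
        READSTAT_TYPE_DATE,     /* standardized date value */
        READSTAT_TYPE_DATETIME  /* standardized date-time value */
    } readstat_type_t;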

@basgys commented Dec 18, 2020

Are the Google Sheets formats standard in any way?

I took something popular enough as a starting point, but I'd be more than happy to use the Unicode date format pattern if it covers our needs. Shall we use that?

Depending on your commitment to this.

Perhaps before continuing this work, it would be good to clarify where we want to go. I would suggest not going down the rabbit hole of an endless rework. We can start with small incremental improvements.

the best solution would be to move some of this logic into the core ReadStat library

If you want to include more logic in the core ReadStat library, I would be happy to help refactor extract_metadata to simplify this code. I believe using a library like json-c would also help to make the code more robust.

However, as I said above, I don't want to rework the whole library alone. I don't feel comfortable enough with the code base and C in general, and I don't have enough time for this project anyway.

If you are happy with it, I will start with the following changes:

  1. Use UTS-35 date format pattern
  2. Refactor CSV export

@evanmiller

Yes the rabbit hole is deep in this one. I'm fine with the current solution of leaving dates as Numeric and indicating their format in a separate field as you do. I would prefer UTS-35 over Google Sheets since it's a Unicode standard – but I'm not sure how (or if) UTS-35 handles durations so you might need to check that first.

@evanmiller

I believe using a library like json-c would also help to make the code more robust.

I've had good experiences with YAJL, but json-c appears to be better maintained.

@basgys commented Dec 18, 2020

I would prefer UTS-35 over Google Sheets since it's a Unicode standard – but I'm not sure how (or if) UTS-35 handles durations so you might need to check that first.

I will use UTS-35 for dates and if we can't represent duration, I will use the Google Sheets standard. It will be our dialect of UTS-35.

@basgys commented Dec 18, 2020

I've had good experiences with YAJL, but json-c appears to be better maintained.

I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is a disaster waiting to happen (to be slightly dramatic).

@evanmiller

I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is a disaster waiting to happen (to be slightly dramatic).

It would be great to use a real JSON library on the CSV reading side as well.

@basgys commented Dec 20, 2020

I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is a disaster waiting to happen (to be slightly dramatic).

It would be great to use a real JSON library on the CSV reading side as well.

I am not quite sure I understand why the CSV reader has to work with JSON. I assume this would be removed once most of the metadata extraction is transferred from extract_metadata to readstat.

@evanmiller

I am not quite sure I understand why the CSV reader has to work with JSON. I assume this would be removed once most of the metadata extraction is transferred from extract_metadata to readstat.

The CSV reader uses an extract_metadata-compatible JSON file to define the output schema, as described in the README here

https://github.com/WizardMac/ReadStat#command-line-usage-with-csv-input

@basgys left a comment:

Changelog:

  1. Use UTS-35 date format pattern
  2. Refactor CSV export

As general feedback: I don't have SPSS or Stata, so unfortunately I can't really test all these changes. It would be great if you could run some tests on your end, @evanmiller. Let me know what you think.

break;
case EXTRACT_METADATA_FORMAT_DATE:
var->type = READSTAT_TYPE_INT32;
snprintf(var->format, sizeof(var->format), "%s", "%td");
@basgys:

@evanmiller Is it supposed to be %d for dates, %td for date times, and %t for times?

@evanmiller:

I believe %d is a legacy format indicating days since epoch. The %t formats are described in detail here:

https://www.stata.com/manuals13/ddatetimedisplayformats.pdf

Internally Stata may store dates and times as milliseconds since epoch, seconds since epoch, days since epoch, weeks since epoch, etc. (You might start to understand why I punted on the issue.)
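For orientation, the main %t storage units from the linked manual can be summarized in a small table (a sketch, not exhaustive):

    /* Stata %t display formats and the storage unit behind each one;
     * all counts are relative to 01jan1960. Illustrative, not exhaustive. */
    static const struct {
        const char *fmt;
        const char *storage_unit;
    } stata_time_units[] = {
        { "%tc", "milliseconds since 01jan1960 00:00:00.000" },
        { "%td", "days since 01jan1960"                      },
        { "%tw", "weeks since 1960 week 1"                   },
        { "%tm", "calendar months since jan1960"             },
        { "%tq", "quarters since 1960 q1"                    },
        { "%ty", "calendar years (e.g. 2021)"                },
    };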

break;
case EXTRACT_METADATA_FORMAT_DATE:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "%s", "EDATE40");
@basgys:

@evanmiller Same as for .dta: should we include other formats for date-time and time?

@evanmiller:

This is probably adequate for now, unless you want to match the UTS-35 patterns to produce the appropriate output format.
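If that route is ever taken, the mapping could also be table-driven; an illustrative (and deliberately incomplete) sketch, with widths chosen somewhat arbitrarily:

    /* Illustrative UTS-35 pattern -> SPSS print format mapping.
     * Widths are arbitrary and the list is far from complete. */
    static const struct {
        const char *uts35_pattern;
        const char *spss_format;
    } uts35_to_spss[] = {
        { "dd-MMM-yyyy",          "DATE11"     },
        { "MM/dd/yyyy",           "ADATE10"    },
        { "dd.MM.yyyy",           "EDATE10"    },
        { "yyyy/MM/dd",           "SDATE10"    },
        { "dd-MMM-yyyy HH:mm:ss", "DATETIME20" },
        { "HH:mm:ss",             "TIME8"      },
    };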

var->type = READSTAT_TYPE_DOUBLE;
break;
case EXTRACT_METADATA_FORMAT_PERCENT:
var->type = READSTAT_TYPE_STRING;
@evanmiller (Dec 20, 2020):

Is this supposed to be READSTAT_TYPE_DOUBLE? (Same with below)

@evanmiller left a comment:

In addition to in-line comments, are you going to parse date (and duration) strings or expect raw numbers (intervals since epoch) to be supplied in the CSV file? I would also like to see some documentation added to the README about how the new features will work.
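To illustrate the first option, parsing an ISO-style date string from the CSV into a Stata-style day count might look roughly like this (a sketch only: it assumes UTC input and the POSIX strptime/timegm extensions, and the days_since_1960 helper is made up for this example):

    #define _GNU_SOURCE /* for strptime() and timegm() on glibc */
    #include <string.h>
    #include <time.h>

    /* Hypothetical helper: convert "YYYY-MM-DD" into days since 01jan1960
     * (the %td epoch). Returns 0 on success, -1 on parse failure. */
    static int days_since_1960(const char *s, long *out_days) {
        struct tm tm, epoch;
        memset(&tm, 0, sizeof(tm));
        memset(&epoch, 0, sizeof(epoch));
        if (strptime(s, "%Y-%m-%d", &tm) == NULL)
            return -1;
        epoch.tm_year = 1960 - 1900;
        epoch.tm_mon  = 0;
        epoch.tm_mday = 1;
        time_t t = timegm(&tm);
        time_t e = timegm(&epoch);
        if (t == (time_t)-1 || e == (time_t)-1)
            return -1;
        *out_days = (long)((t - e) / 86400);
        return 0;
    }

Expecting raw day counts in the CSV would of course avoid this kind of parsing entirely.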

break;
case EXTRACT_METADATA_FORMAT_CURRENCY:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "%%9.%df", get_decimals(c->json_md, column));
@evanmiller:

Pro C tip: you can use fall-through to apply the same block to multiple cases:

         case EXTRACT_METADATA_FORMAT_NUMBER:
         case EXTRACT_METADATA_FORMAT_PERCENT:
         case EXTRACT_METADATA_FORMAT_CURRENCY:
             var->type = READSTAT_TYPE_DOUBLE;
             snprintf(var->format, sizeof(var->format), "%%9.%df", get_decimals(c->json_md, column));
             break;


break;
case EXTRACT_METADATA_FORMAT_DATE_TIME:
var->type = READSTAT_TYPE_INT32;
snprintf(var->format, sizeof(var->format), "%s", "%td");
@evanmiller:

See above comment about fall-through

break;
case EXTRACT_METADATA_FORMAT_CURRENCY:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "F8.%d", get_decimals(c->json_md, column));
@evanmiller:

Same re: fall-through


@basgys commented Dec 23, 2020

@evanmiller Hi Evan! Sorry for the delay, I just got busy on other projects.

I updated based on your last feedback. Let me know if I missed something, otherwise I think we are ready to merge.

@evanmiller

@basgys Thanks. Please point this PR at the dev branch (master is considered stable) and I will get this merged in.

@basgys changed the base branch from master to dev on December 26, 2020 17:31
@basgys commented Dec 26, 2020

@evanmiller done

@evanmiller

@basgys Thanks. I believe this file will also need updating since the JSON schema has changed:

https://github.com/WizardMac/ReadStat/blob/master/variablemetadata_schema.json

@basgys commented Jan 4, 2021

@basgys Thanks. I believe this file will also need updating since the JSON schema has changed:

https://github.com/WizardMac/ReadStat/blob/master/variablemetadata_schema.json

I updated the schema. I am not very familiar with this standard, so I hope it reflects the output. Otherwise, I would probably need help on this.

@evanmiller

@basgys Thanks. I'm not too familiar with it either, but I just wanted to make sure things weren't falling out of date. I guess we should get the schema under test coverage somehow, but that will be its own project.

The CIFuzzer indicates that there's a new crash introduced by this PR. I will need to investigate it before merging.

@evanmiller commented Jan 4, 2021

The crash turned out to be unrelated so I'll get this merged in. Thanks for your patience!

@evanmiller merged commit d5b9300 into WizardMac:dev on Jan 4, 2021
@basgys commented Jan 20, 2021

@evanmiller I just tried to compile the development branch and got the following error:

src/bin/read_csv/read_csv.c: In function 'produce_column_header':
src/bin/read_csv/read_csv.c:35:41: error: 'METADATA_COLUMN_TYPE_DATE' undeclared (first use in this function)
   35 |     c->is_date[c->columns] = coltype == METADATA_COLUMN_TYPE_DATE;
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
src/bin/read_csv/read_csv.c:35:41: note: each undeclared identifier is reported only once for each function it appears in
make: *** [Makefile:3424: src/bin/read_csv/readstat-read_csv.o] Error 1

Did you merge the bug fix?

@evanmiller

@basgys What is "the bug fix"?

@basgys commented Jan 20, 2021

@basgys What is "the bug fix"?

When you said

The crash turned out to be unrelated so I'll get this merged in. Thanks for your patience!

I thought you meant you had to fix something before merging the PR. My bad.

Does this branch compile for you? Or do you get a similar error?

@evanmiller

I wasn't compiling with CSV support so I didn't see this error. Should be fixed in dev now!

@basgys commented Jan 20, 2021

I wasn't compiling with CSV support so I didn't see this error. Should be fixed in dev now!

It failed again. EXTRACT_METADATA_TYPE_DATE does not exist anymore. I replaced it with

extract_metadata_format_t colformat = column_format(c->json_md, column);
c->is_date[c->columns] = colformat == EXTRACT_METADATA_FORMAT_DATE;

See PR #230
