implements duration support in extract_metadata #223

Merged
merged 7 commits into WizardMac:dev on Jan 4, 2021

Conversation

@basgys commented Dec 16, 2020

Hi @evanmiller,

I made a first implementation of duration support in extract_metadata. This feature would break backward compatibility, so it is up to you to tell me whether we should use a different strategy.

I used this website as a reference for the existing data types: https://libguides.library.kent.edu/SPSS/DatesTime

Other option

As I mentioned in PR #220, the other option would be to label the type as NUMERIC for both dates and durations, and then include a format field in the JSON payload that describes how the value is displayed:

TIME8 => hh:mm:ss
{
   "type": "NUMERIC",
   "format": "TIME",
   "pattern": "hh:mm:ss"
}

Possible formats:

  1. NUMBER
  2. PERCENT
  3. CURRENCY
  4. DATE
  5. TIME
  6. DATE_TIME
  7. UNSPECIFIED
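In C, these categories would presumably end up as an enum along the following lines (a sketch only; the EXTRACT_METADATA_FORMAT_* naming appears later in this PR, but the exact members and their order here are a guess):

    typedef enum extract_metadata_format_e {
        EXTRACT_METADATA_FORMAT_UNSPECIFIED,
        EXTRACT_METADATA_FORMAT_NUMBER,
        EXTRACT_METADATA_FORMAT_PERCENT,
        EXTRACT_METADATA_FORMAT_CURRENCY,
        EXTRACT_METADATA_FORMAT_DATE,
        EXTRACT_METADATA_FORMAT_TIME,
        EXTRACT_METADATA_FORMAT_DATE_TIME
    } extract_metadata_format_t;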

Note on data type handling

I added the following comment to the code:

    // All-or-nothing is probably not the best strategy for data type extraction.
    // When SPSS/STATA introduces new types, metadata extraction could fail.
    // It would be wiser to simply label the field as "UNKNOWN".

It is a matter of philosophy, I guess, but I would argue that this error should be handled further up the stack. extract_metadata should be responsible, as its name states, for extracting metadata. If a new type is introduced, it should simply return the column as uncategorised (e.g. UNKNOWN) instead of failing. Perhaps the new column is not important at all and does not deserve a complete failure of the extraction process.

Style

Also, note that I haven't worked with C in a long time and I did not want to go crazy with refactoring, but I'm pretty sure there is a more elegant approach than this chain of strncmp calls. What would you suggest?

In Go I would have used a map of strings:

  var categories = map[string]string{
    "TIME8": "DURATION",
    "DATE9": "DATE",
    // ...
  }
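For reference, a rough C sketch of the same table-driven idea (the table entries and the categorize helper below are illustrative, not the full SPSS/Stata format list):

    #include <string.h>

    /* Illustrative prefix table. Longer prefixes must come before shorter
     * ones they contain (e.g. "DATETIME" before "DATE"). */
    static const struct {
        const char *prefix;
        const char *category;
    } categories[] = {
        { "DATETIME", "DATE_TIME" },
        { "DTIME",    "DURATION"  },
        { "TIME",     "DURATION"  },
        { "ADATE",    "DATE"      },
        { "EDATE",    "DATE"      },
        { "SDATE",    "DATE"      },
        { "DATE",     "DATE"      },
    };

    static const char *categorize(const char *format) {
        for (size_t i = 0; i < sizeof(categories)/sizeof(categories[0]); i++) {
            if (strncmp(format, categories[i].prefix, strlen(categories[i].prefix)) == 0)
                return categories[i].category;
        }
        return "UNKNOWN";
    }

This keeps the format-to-category mapping in one place, much like the Go map above.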

@evanmiller

Hi, thanks for the contribution. I don't have a strong opinion on the type vs format debate – I would suggest however looking at the readstat utility to ensure it can still use the JSON output of extract_metadata to read in CSV. Reverse-formatting is a hard problem, so I don't necessarily expect a full solution.

Re: C style I will add some comments in a code review.

@basgys commented Dec 16, 2020

I would suggest however looking at the readstat utility

Of course! Could you point me to this utility? And regarding a potential break in backward compatibility, what do you think?

Cheers!

// https://libguides.library.kent.edu/SPSS/DatesTime
if (format && (strncmp(format, "DATE", sizeof("DATE")-1) == 0 ||
strncmp(format, "ADATE", sizeof("ADATE")-1) == 0 ||
strncmp(format, "EDATE", sizeof("EDATE")-1) == 0 ||
strncmp(format, "SDATE", sizeof("SDATE")-1) == 0)) {
@evanmiller:

Since format is a regular C string, strcmp instead of strncmp will be fine here.

strncmp(format, "TIME11.2", sizeof("TIME11.2")-1) == 0 ||
strncmp(format, "DTIME9", sizeof("DTIME9")-1) == 0 ||
strncmp(format, "DTIME12", sizeof("DTIME12")-1) == 0 ||
strncmp(format, "DTIME15.2", sizeof("DTIME15.2")-1) == 0)) {
@evanmiller:

You might simplify these comparisons with sscanf, which can do basic pattern matching. I typically do something like

long width;
if (sscanf(format, "MTIME%ld", &width) == 1) {
} else if (sscanf(format, "TIME%ld", &width) == 1) {
...

It's probably okay to be a little loose with the allowed widths and precisions.
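A slightly fuller sketch of that approach, keeping widths and precisions deliberately loose (the classify helper and the category strings are placeholders, not part of this PR):

    #include <stdio.h>

    /* Placeholder classification: widths are parsed but not validated. */
    static const char *classify(const char *format) {
        long width;
        if (sscanf(format, "MTIME%ld", &width) == 1 ||
            sscanf(format, "DTIME%ld", &width) == 1 ||
            sscanf(format, "TIME%ld",  &width) == 1)
            return "DURATION";
        if (sscanf(format, "DATETIME%ld", &width) == 1)
            return "DATE_TIME";
        return NULL; /* not a time-like format */
    }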

// All-or-nothing is probably not the best strategy for data type extraction.
// When SPSS/STATA introduces new types, metadata extraction could fail.
// It would be wiser to simply label the field as "UNKNOWN".
// TODO: Remove this failure?
@evanmiller:

I'm okay being strict. It often brings people here to file bug reports :-)

@evanmiller

I would suggest however looking at the readstat utility

Of course! Could you point me to this utility? And regarding a potential break in backward compatibility, what do you think?

Cheers!

This function would need to be modified for instance:

metadata_column_type_t column_type(struct json_metadata* md, const char* varname, int output_format) {

The CSV reading code (part of the main readstat program) is here:

https://github.com/WizardMac/ReadStat/tree/dev/src/bin/read_csv

I'm fine with breaking backward compatibility, but for the sake of politeness this would probably need to be documented and wait until a 1.2 release.

@basgys commented Dec 18, 2020

@evanmiller I've implemented format and pattern on extract_metadata.

If you are happy with this solution, I will update the readstat CSV export to implement those new formats.

I also included STATA date/datetime as requested here: #221 (comment)

Let me know what you think.

@evanmiller

Are the Google Sheets formats standard in any way?

I would strongly suggest looking closely at the TimeFormatStrings library I mentioned in another thread. It solves the exact problem of parsing and converting SPSS / Stata / SAS format strings, and can convert them reliably into a UTS-35 Date Format Pattern or into an Excel format. The library currently lacks numeric and duration support, but I think it would already simplify your existing code and open up some other possibilities for extract_metadata to offer output format options.

Depending on your commitment to this, probably the best solution would be to move some of this logic into the core ReadStat library – i.e. add a READSTAT_TYPE_DATE and READSTAT_TYPE_DATETIME type, and return standardized date values, as well as standardized format strings. That way extract_metadata could be relatively thin. But it would be a significant amount of work, which is why I've been putting it off for a long time!
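For concreteness, the core change described here would amount to something like the following (purely a hypothetical sketch: the existing members are paraphrased from readstat.h, and the last two values are only proposed in this thread, not implemented):

    typedef enum readstat_type_e {
        READSTAT_TYPE_STRING,
        READSTAT_TYPE_INT8,
        READSTAT_TYPE_INT16,
        READSTAT_TYPE_INT32,
        READSTAT_TYPE_FLOAT,
        READSTAT_TYPE_DOUBLE,
        READSTAT_TYPE_STRING_REF,
        /* proposed additions */
        READSTAT_TYPE_DATE,     /* standardized date value */
        READSTAT_TYPE_DATETIME  /* standardized date-time value */
    } readstat_type_t;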

@basgys commented Dec 18, 2020

Are the Google Sheets formats standard in any way?

I took something popular enough as a starting point, but I'd be more than happy to use the Unicode date format pattern if it covers our needs. Shall we use that?

Depending on your commitment to this.

Perhaps before continuing this work, it would be good to clarify where we want to go. I would suggest not going down the rabbit hole of an endless rework. We can start with small incremental improvements.

the best solution would be to move some of this logic into the core ReadStat library

If you want to include more logic in the core ReadStat library, I would be happy to help refactor extract_metadata to simplify this code. I believe using a library like json-c would also help to make the code more robust.

However, as I said above, I don't want to rework the whole library alone. I don't feel comfortable enough with the code base and C in general, and I don't have enough time for this project anyway.

If you are happy with it, I will start with the following changes:

  1. Use UTS-35 date format pattern
  2. Refactor CSV export

@evanmiller

Yes the rabbit hole is deep in this one. I'm fine with the current solution of leaving dates as Numeric and indicating their format in a separate field as you do. I would prefer UTS-35 over Google Sheets since it's a Unicode standard – but I'm not sure how (or if) UTS-35 handles durations so you might need to check that first.

@evanmiller

I believe using a library like json-c would also help to make the code more robust.

I've had good experiences with YAJL, but json-c appears to be better maintained.

@basgys commented Dec 18, 2020

I would prefer UTS-35 over Google Sheets since it's a Unicode standard – but I'm not sure how (or if) UTS-35 handles durations so you might need to check that first.

I will use UTS-35 for dates and if we can't represent duration, I will use the Google Sheets standard. It will be our dialect of UTS-35.

@basgys commented Dec 18, 2020

I've had good experiences with YAJL, but json-c appears to be better maintained.

I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is a disaster waiting to happen (to be slightly dramatic).

@evanmiller

I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is a disaster waiting to happen (to be slightly dramatic).

It would be great to use a real JSON library on the CSV reading side as well.

@basgys commented Dec 20, 2020

I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is a disaster waiting to happen (to be slightly dramatic).

It would be great to use a real JSON library on the CSV reading side as well.

I am not quite sure I understand why the CSV reader has to work with JSON. I assume this would be removed once most of the metadata extraction is transferred from extract_metadata to readstat.

@evanmiller

I am not quite sure I understand why the CSV reader has to work with JSON. I assume this would be removed once most of the metadata extraction is transferred from extract_metadata to readstat.

The CSV reader uses an extract_metadata-compatible JSON file to define the output schema, as described in the README here

https://github.com/WizardMac/ReadStat#command-line-usage-with-csv-input

@basgys left a comment:

Changelog:

  1. Use UTS-35 date format pattern
  2. Refactor CSV export

As general feedback: I don't have SPSS or Stata, so unfortunately I can't really test all these changes. It would be great if you could run some tests on your end, @evanmiller. Let me know what you think.

break;
case EXTRACT_METADATA_FORMAT_DATE:
var->type = READSTAT_TYPE_INT32;
snprintf(var->format, sizeof(var->format), "%s", "%td");
@basgys:

@evanmiller Is it supposed to be %d for dates, %td for date times, and %t for times?

@evanmiller:

I believe %d is a legacy format indicating days since epoch. The %t formats are described in detail here:

https://www.stata.com/manuals13/ddatetimedisplayformats.pdf

Internally Stata may store dates and times as milliseconds since epoch, seconds since epoch, days since epoch, weeks since epoch, etc. (You might start to understand why I punted on the issue.)
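For orientation, the main %t storage units from the linked manual can be summarized in a small table (a sketch, not exhaustive):

    /* Stata %t display formats and the storage unit behind each one;
     * all counts are relative to 01jan1960. Illustrative, not exhaustive. */
    static const struct {
        const char *fmt;
        const char *storage_unit;
    } stata_time_units[] = {
        { "%tc", "milliseconds since 01jan1960 00:00:00.000" },
        { "%td", "days since 01jan1960"                      },
        { "%tw", "weeks since 1960 week 1"                   },
        { "%tm", "calendar months since jan1960"             },
        { "%tq", "quarters since 1960 q1"                    },
        { "%ty", "calendar years (e.g. 2021)"                },
    };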

break;
case EXTRACT_METADATA_FORMAT_DATE:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "%s", "EDATE40");
@basgys:

@evanmiller Same as for .dta: should we include other formats for date-time and time?

@evanmiller:

This is probably adequate for now, unless you want to match the UTS-35 patterns to produce the appropriate output format.
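If that route is ever taken, the mapping could also be table-driven; an illustrative (and deliberately incomplete) sketch, with widths chosen somewhat arbitrarily:

    /* Illustrative UTS-35 pattern -> SPSS print format mapping.
     * Widths are arbitrary and the list is far from complete. */
    static const struct {
        const char *uts35_pattern;
        const char *spss_format;
    } uts35_to_spss[] = {
        { "dd-MMM-yyyy",          "DATE11"     },
        { "MM/dd/yyyy",           "ADATE10"    },
        { "dd.MM.yyyy",           "EDATE10"    },
        { "yyyy/MM/dd",           "SDATE10"    },
        { "dd-MMM-yyyy HH:mm:ss", "DATETIME20" },
        { "HH:mm:ss",             "TIME8"      },
    };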

var->type = READSTAT_TYPE_DOUBLE;
break;
case EXTRACT_METADATA_FORMAT_PERCENT:
var->type = READSTAT_TYPE_STRING;
@evanmiller (Dec 20, 2020):

Is this supposed to be READSTAT_TYPE_DOUBLE? (Same with below)

@evanmiller left a comment:

In addition to in-line comments, are you going to parse date (and duration) strings or expect raw numbers (intervals since epoch) to be supplied in the CSV file? I would also like to see some documentation added to the README about how the new features will work.
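To illustrate the first option, parsing an ISO-style date string from the CSV into a Stata-style day count might look roughly like this (a sketch only: it assumes UTC input and the POSIX strptime/timegm extensions, and the days_since_1960 helper is made up for this example):

    #define _GNU_SOURCE /* for strptime() and timegm() on glibc */
    #include <string.h>
    #include <time.h>

    /* Hypothetical helper: convert "YYYY-MM-DD" into days since 01jan1960
     * (the %td epoch). Returns 0 on success, -1 on parse failure. */
    static int days_since_1960(const char *s, long *out_days) {
        struct tm tm, epoch;
        memset(&tm, 0, sizeof(tm));
        memset(&epoch, 0, sizeof(epoch));
        if (strptime(s, "%Y-%m-%d", &tm) == NULL)
            return -1;
        epoch.tm_year = 1960 - 1900;
        epoch.tm_mon  = 0;
        epoch.tm_mday = 1;
        time_t t = timegm(&tm);
        time_t e = timegm(&epoch);
        if (t == (time_t)-1 || e == (time_t)-1)
            return -1;
        *out_days = (long)((t - e) / 86400);
        return 0;
    }

Expecting raw day counts in the CSV would of course avoid this kind of parsing entirely.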

break;
case EXTRACT_METADATA_FORMAT_CURRENCY:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "%%9.%df", get_decimals(c->json_md, column));
@evanmiller:

Pro C tip: you can use fall-through to apply the same block to multiple cases:

         case EXTRACT_METADATA_FORMAT_NUMBER:
         case EXTRACT_METADATA_FORMAT_PERCENT:
         case EXTRACT_METADATA_FORMAT_CURRENCY:
             var->type = READSTAT_TYPE_DOUBLE;
             snprintf(var->format, sizeof(var->format), "%%9.%df", get_decimals(c->json_md, column));
             break;


break;
case EXTRACT_METADATA_FORMAT_DATE_TIME:
var->type = READSTAT_TYPE_INT32;
snprintf(var->format, sizeof(var->format), "%s", "%td");
@evanmiller:

See above comment about fall-through

break;
case EXTRACT_METADATA_FORMAT_CURRENCY:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "F8.%d", get_decimals(c->json_md, column));
@evanmiller:

Same re: fall-through


@basgys commented Dec 23, 2020

@evanmiller Hi Evan! Sorry for the delay, I just got busy on other projects.

I updated based on your last feedback. Let me know if I missed something, otherwise I think we are ready to merge.

@evanmiller

@basgys Thanks. Please point this PR at the dev branch (master is considered stable) and I will get this merged in.

@basgys changed the base branch from master to dev on December 26, 2020 17:31
@basgys commented Dec 26, 2020

@evanmiller done

@evanmiller

@basgys Thanks. I believe this file will also need updating since the JSON schema has changed:

https://github.com/WizardMac/ReadStat/blob/master/variablemetadata_schema.json

@basgys commented Jan 4, 2021

@basgys Thanks. I believe this file will also need updating since the JSON schema has changed:

https://github.com/WizardMac/ReadStat/blob/master/variablemetadata_schema.json

I updated the schema. I am not very familiar with this standard, so I hope it reflects the output. Otherwise, I would probably need help on this.

@evanmiller

@basgys Thanks. I'm not too familiar with it either, but I just wanted to make sure things weren't falling out of date. I guess we should get the schema under test coverage somehow, but that will be its own project.

The CIFuzzer indicates that there's a new crash introduced by this PR. I will need to investigate it before merging.

@evanmiller commented Jan 4, 2021

The crash turned out to be unrelated so I'll get this merged in. Thanks for your patience!

@evanmiller merged commit d5b9300 into WizardMac:dev on Jan 4, 2021
@basgys commented Jan 20, 2021

@evanmiller I just tried to compile the development branch and got the following error:

src/bin/read_csv/read_csv.c: In function 'produce_column_header':
src/bin/read_csv/read_csv.c:35:41: error: 'METADATA_COLUMN_TYPE_DATE' undeclared (first use in this function)
   35 |     c->is_date[c->columns] = coltype == METADATA_COLUMN_TYPE_DATE;
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
src/bin/read_csv/read_csv.c:35:41: note: each undeclared identifier is reported only once for each function it appears in
make: *** [Makefile:3424: src/bin/read_csv/readstat-read_csv.o] Error 1

Did you merge the bug fix?

@evanmiller

@basgys What is "the bug fix"?

@basgys commented Jan 20, 2021

@basgys What is "the bug fix"?

When you said

The crash turned out to be unrelated so I'll get this merged in. Thanks for your patience!

I thought you meant you had to fix something before merging the PR. My bad.

Does this branch compile for you? Or do you get a similar error?

@evanmiller

I wasn't compiling with CSV support so I didn't see this error. Should be fixed in dev now!

@basgys commented Jan 20, 2021

I wasn't compiling with CSV support so I didn't see this error. Should be fixed in dev now!

It failed again. EXTRACT_METADATA_TYPE_DATE does not exist anymore. I replaced it with

extract_metadata_format_t colformat = column_format(c->json_md, column);
c->is_date[c->columns] = colformat == EXTRACT_METADATA_FORMAT_DATE;

See PR #230
