-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implements duration support in extract_metadata #223
Conversation
Hi, thanks for the contribution. I don't have a strong opinion on the type vs format debate – I would suggest however looking at the Re: C style I will add some comments in a code review. |
Of course! Could you point me to this utility? And regarding a potential break in backward compatibility, what do you think? Cheers! |
src/bin/extract_metadata.c
Outdated
// https://libguides.library.kent.edu/SPSS/DatesTime | ||
if (format && (strncmp(format, "DATE", sizeof("DATE")-1) == 0 || | ||
strncmp(format, "ADATE", sizeof("ADATE")-1) == 0 || | ||
strncmp(format, "EDATE", sizeof("EDATE")-1) == 0 || | ||
strncmp(format, "SDATE", sizeof("SDATE")-1) == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since format
is a regular C string, strcmp
instead of strncmp
will be fine here.
src/bin/extract_metadata.c
Outdated
strncmp(format, "TIME11.2", sizeof("TIME11.2")-1) == 0 || | ||
strncmp(format, "DTIME9", sizeof("DTIME9")-1) == 0 || | ||
strncmp(format, "DTIME12", sizeof("DTIME12")-1) == 0 || | ||
strncmp(format, "DTIME15.2", sizeof("DTIME15.2")-1) == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might simplify these comparisons with sscanf
, which can do basic pattern matching. I typically do something like
long width;
if (sscanf(format, "MTIME%ld", &width) == 1) {
} else if (sscanf(format, "TIME%ld", &width) == 1) {
...
It's probably okay to be a little loose with the allowed widths and precisions.
src/bin/extract_metadata.c
Outdated
// All-or-nothing is probably not the best strategy for data type extraction. | ||
// When SPSS/STATA introduces new types, metadata extraction could fail. | ||
// It would be wiser to simply label the field as "UNKNOWN". | ||
// TODO: Removes this failure? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay being strict. It often brings people here to file bug reports :-)
This function would need to be modified for instance: ReadStat/src/bin/read_csv/json_metadata.c Line 188 in f37a744
The CSV reading code (part of the main https://github.com/WizardMac/ReadStat/tree/dev/src/bin/read_csv I'm fine breaking backward compatibility but for the sake of politeness this would probably need to be documented and wait until a 1.2 release. |
6d44744
to
ef0f3a3
Compare
@evanmiller I've implemented format and pattern on If you are happy with this solution, I will updates readtstat CSV export to implement those new formats. I also included STATA date/datetime as requested here: #221 (comment) Let me know what you think. |
Are the Google Sheets formats standard in any way? I would strongly suggest looking closely at the TimeFormatStrings library I mentioned in another thread - this solves the exact problem of parsing and converting SPSS / Stata/ SAS format strings, and can convert them reliably into a UTS-35 Date Format Pattern or into an Excel format. The library currently lacks numeric and duration support, but I think already it would simplify your existing code and open up some other possibilities for Depending on your commitment to this, probably the best solution would be to move some of this logic into the core ReadStat library – i.e. add a |
I took something popular enough as a starting point, but I'd be more than happy to use the Unicode date format pattern if it covers our needs. Shall we use that?
Perhaps before continuing this work, it would be good to clarify where we want to go. I would suggest not to go down the rabbit hole with an endless rework. We can start with small incremental improvements.
If you want to include more logic into the code ReadStat, I would be happy to help refactoring However, as I said above, I don't want to rework the whole library alone. I don't feel comfortable enough with the code base and If you are happy with it, I will start with the following changes:
|
Yes the rabbit hole is deep in this one. I'm fine with the current solution of leaving dates as Numeric and indicating their format in a separate field as you do. I would prefer UTS-35 over Google Sheets since it's a Unicode standard – but I'm not sure how (or if) UTS-35 handles durations so you might need to check that first. |
I've had good experiences with YAJL, but json-c appears to be better maintained. |
I will use UTS-35 for dates and if we can't represent duration, I will use the Google Sheets standard. It will be our dialect of UTS-35. |
I don't have any strong opinion on that matter. I just think concatenating strings to output a JSON payload is waiting for a disaster to happen (to be slightly dramatic). |
It would be great to use a real JSON library on the CSV reading side as well. |
I am not quite sure to understand why the CSV reading has to work with JSON. I assume this would be removed when most of the metadata extraction will be transferred from |
The CSV reader uses an https://github.com/WizardMac/ReadStat#command-line-usage-with-csv-input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changelog:
- Use UTS-35 date format pattern
- Refactor CSV export
As a general feedback, I don't have SPSS nor STATA, so unfortunately I can't really test all those changes. I would be great if you could run some tests on your end @evanmiller. Let me know what you think.
break; | ||
case EXTRACT_METADATA_FORMAT_DATE: | ||
var->type = READSTAT_TYPE_INT32; | ||
snprintf(var->format, sizeof(var->format), "%s", "%td"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evanmiller Is it supposed to be %d
for dates, %td
for date times, and %t
for times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe %d
is a legacy format indicating days since epoch. The %t
formats are described in detail here:
https://www.stata.com/manuals13/ddatetimedisplayformats.pdf
Internally Stata may store dates and times as milliseconds since epoch, seconds since epoch, days since epoch, weeks since epoch, etc. (You might start to understand why I punted on the issue.)
src/bin/read_csv/mod_sav.c
Outdated
break; | ||
case EXTRACT_METADATA_FORMAT_DATE: | ||
var->type = READSTAT_TYPE_DOUBLE; | ||
snprintf(var->format, sizeof(var->format), "%s", "EDATE40"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evanmiller Same as .dta, should we include other formats for date time and time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably adequate for now, unless you want to match the UTS-35 patterns to produce the appropriate output format.
src/bin/read_csv/mod_csv.c
Outdated
var->type = READSTAT_TYPE_DOUBLE; | ||
break; | ||
case EXTRACT_METADATA_FORMAT_PERCENT: | ||
var->type = READSTAT_TYPE_STRING; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be READSTAT_TYPE_DOUBLE? (Same with below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to in-line comments, are you going to parse date (and duration) strings or expect raw numbers (intervals since epoch) to be supplied in the CSV file? I would also like to see some documentation added to the README about how the new features will work.
break; | ||
case EXTRACT_METADATA_FORMAT_CURRENCY: | ||
var->type = READSTAT_TYPE_DOUBLE; | ||
snprintf(var->format, sizeof(var->format), "%%9.%df", get_decimals(c->json_md, column)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pro C tip, you can use fall-through to apply the same block to multiple cases
case EXTRACT_METADATA_FORMAT_NUMBER:
case EXTRACT_METADATA_FORMAT_PERCENT:
case EXTRACT_METADATA_FORMAT_CURRENCY:
var->type = READSTAT_TYPE_DOUBLE;
snprintf(var->format, sizeof(var->format), "%%9.%df", get_decimals(c->json_md, column));
break;
break; | ||
case EXTRACT_METADATA_FORMAT_DATE: | ||
var->type = READSTAT_TYPE_INT32; | ||
snprintf(var->format, sizeof(var->format), "%s", "%td"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe %d
is a legacy format indicating days since epoch. The %t
formats are described in detail here:
https://www.stata.com/manuals13/ddatetimedisplayformats.pdf
Internally Stata may store dates and times as milliseconds since epoch, seconds since epoch, days since epoch, weeks since epoch, etc. (You might start to understand why I punted on the issue.)
src/bin/read_csv/mod_dta.c
Outdated
break; | ||
case EXTRACT_METADATA_FORMAT_DATE_TIME: | ||
var->type = READSTAT_TYPE_INT32; | ||
snprintf(var->format, sizeof(var->format), "%s", "%td"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment about fall-through
break; | ||
case EXTRACT_METADATA_FORMAT_CURRENCY: | ||
var->type = READSTAT_TYPE_DOUBLE; | ||
snprintf(var->format, sizeof(var->format), "F8.%d", get_decimals(c->json_md, column)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same re: fall-through
src/bin/read_csv/mod_sav.c
Outdated
break; | ||
case EXTRACT_METADATA_FORMAT_DATE: | ||
var->type = READSTAT_TYPE_DOUBLE; | ||
snprintf(var->format, sizeof(var->format), "%s", "EDATE40"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably adequate for now, unless you want to match the UTS-35 patterns to produce the appropriate output format.
@evanmiller Hi Evan! Sorry for the delay, I just got busy on other projects. I updated based on your last feedback. Let me know if I missed something, otherwise I think we are ready to merge. |
@basgys Thanks. Please point this PR at the |
@evanmiller done |
@basgys Thanks. I believe this file will also need updating since the JSON schema has changed: https://github.com/WizardMac/ReadStat/blob/master/variablemetadata_schema.json |
I updated the schema. I am not very familiar with this standard, so I hope it reflects the output. Otherwise, I would probably need help on this. |
@basgys Thanks. I'm not too familiar with it either, but I just wanted to make sure things weren't falling out of date. I guess we should get the schema under test coverage somehow, but that will be its own project. The CIFuzzer indicates that there's a new crash introduced by this PR. I will need to investigate it before merging. |
The crash turned out to be unrelated so I'll get this merged in. Thanks for your patience! |
@evanmiller I just tried to compile the development branch and got the following error:
Did you merge the bug fix? |
@basgys What is "the bug fix"? |
When you said
I thought you meant you had to fix something before merging the PR. My bad. Does this branch compiles for you? Or do you have a similar error? |
I wasn't compiling with CSV support so I didn't see this error. Should be fixed in |
It failed again. extract_metadata_format_t colformat = column_format(c->json_md, column);
c->is_date[c->columns] = colformat == EXTRACT_METADATA_FORMAT_DATE; See PR #230 |
Hi @evanmiller,
I made a first implementation to support duration in
extract_metadata
. This feature would break backward compatibility, so it is up to you to tell me if we should use a different strategy.I use this website as a reference for the existing data types https://libguides.library.kent.edu/SPSS/DatesTime.
Other option
As I mentioned on the PR #220, the other option would be to label the type as
NUMERIC
for both dates and durations, but then include a format field in the JSON payload that gives the format:Possible formats:
Note on data type handling
I added the following comment to the code:
It is a philosophy I guess, but I would argue that this error should be handled further up in the stack.
extract_metadata
should be responsible to, as its name states, extracting metadata. Now if a new type is introduced, it should simply return the column as categorised instead of failing. Perhaps this new column is not important at all and does not deserve a complete failure in the extraction process.Style
Also, note that I haven't worked with C in a long time and I did not want to start going crazy with the refactor, but I'm pretty sure we can't have a more elegant function than this chain of
strncmp
. What would you suggest?In Go I would have used a map of strings: