Time/Duration format in extract_metadata #220

Closed
basgys opened this issue Dec 14, 2020 · 4 comments
@basgys
Contributor

basgys commented Dec 14, 2020

Hi WizardMac team!

First of all, thank you very much for your hard work on this open source project.

I've started building a tool to read SPSS files, and I have a problem with time/duration columns. I use readstat/extract_metadata, and extract_metadata reports the time column as a plain numeric variable. However, when I open the same file in WizardMac, it recognises the time column. (See screenshot)

Our tool vs WizardMac

[Screenshot 2020-11-24 at 15 07 06]

extract_metadata output

I tested with both 1.1.4 and 1.1.5.

{
  "type": "SPSS",
  "variables": [
    {
      "type": "DATE",
      "name": "Date1",
      "label": "Date format 1"
    },
    {
      "type": "DATE",
      "name": "Date2",
      "label": "Date format 2"
    },
    {
      "type": "NUMERIC",
      "name": "Heure",
      "label": "Date - heure/seconde"
    },
    {
      "type": "STRING",
      "name": "Texte",
      "label": "Blabla"
    }
  ]
}

Source file

date.sav.zip

Time-related PRs

  1. Question: variables with DATE and DATETIME formats not read correctly in PSPP #211
  2. New time display format for SPSS #155 (Has this been implemented already?)
  3. Support new SPSS formats MTIME and YMDHMS #91

Metadata

Besides time/duration, is there another known data type currently not extracted?

Cheers!

@evanmiller
Contributor

Hi, thanks for the report. The ReadStat library is time-agnostic and simply returns raw numbers alongside the format string – it's up to the client program to make sense of "DATE5", etc. Different formats use different conventions: SAS datetimes count seconds since 1960, Stata datetimes count milliseconds since 1960, SAS and Stata date formats count days since 1960, and SPSS uses seconds since the advent of the Gregorian calendar in 1582 (!). ReadStat is essentially silent on these issues.

I believe extract_metadata has some additional time facilities – but this tool was contributed by someone else, and I don't often use it so I can't shed much light without digging into the source code. As you discovered, Wizard has extensive logic to process time, date, and duration formats correctly, but this logic is not in the main library or in the CLI tools.

Part of that logic exists in a separate library here:

https://github.com/WizardMac/TimeFormatStrings

But this is not utilized by ReadStat or extract_metadata.

Note that all data types are extracted exactly as stored, so it's "just" an issue of formatting. There may be other formats such as Hex or currency that are not presented as expected.
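To illustrate what "making sense of the format string" could look like on the client side, here is a minimal Go sketch. It assumes the client already has the SPSS format string (e.g. "DATE5" or "TIME5") in hand; the `Kind` type and `ClassifyFormat` function are hypothetical names, not part of ReadStat or extract_metadata, and the list of format names is taken from SPSS's documented date/time formats rather than from ReadStat's source.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is a hypothetical coarse classification of an SPSS format string.
type Kind string

const (
	KindDate     Kind = "date"
	KindDateTime Kind = "datetime"
	KindDuration Kind = "duration"
	KindOther    Kind = "other"
)

// ClassifyFormat looks only at the alphabetic prefix of a format string
// such as "DATE11", "TIME5" or "DATETIME20.2" and maps it to a Kind.
func ClassifyFormat(format string) Kind {
	name := strings.ToUpper(strings.TrimSpace(format))
	// Drop the width and decimals: "DATETIME20.2" -> "DATETIME".
	if i := strings.IndexAny(name, "0123456789."); i >= 0 {
		name = name[:i]
	}
	switch name {
	case "DATE", "ADATE", "EDATE", "JDATE", "SDATE", "QYR", "MOYR", "WKYR":
		return KindDate
	case "DATETIME", "YMDHMS":
		return KindDateTime
	case "TIME", "DTIME", "MTIME":
		return KindDuration
	default:
		return KindOther
	}
}

func main() {
	for _, f := range []string{"DATE5", "TIME5", "DATETIME20", "F8.2"} {
		fmt.Printf("%s -> %s\n", f, ClassifyFormat(f))
	}
}
```

The raw value stays untouched; only the interpretation changes depending on the returned kind, which matches the point that the data is extracted exactly as stored.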

@basgys
Contributor Author

basgys commented Dec 14, 2020

Hi Evan!

Thanks for your prompt and extensive answer.

As for date formats, I believe we have implemented the logic to correctly process SPSS/Stata dates in our Go package, which looks like this:

package spss

import (
	"time"
)

// ConvertDate converts an SPSS date into a time.Time, where `d` is the
// number of seconds since 1582-10-14.
func ConvertDate(d int64) time.Time {
	return time.Unix(d+epochDelta+adjustment, 0).UTC()
}

var (
	// epochDelta is the offset, in seconds, from the Unix epoch (1970-01-01)
	// back to the SPSS epoch (1582-10-14), hence the negative sign.
	//
	// Dates in SPSS are recorded in seconds since October 14, 1582,
	// around the introduction of the Gregorian calendar.
	epochDelta int64 = -12219379200

	// adjustment is ten days in seconds (10 * 86400), added to account for
	// the shift from the Julian to the Gregorian calendar.
	adjustment int64 = 864000
)
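As a sanity check on the epoch constant, the offset can be derived from the standard library instead of being hard-coded. The sketch below is only an illustration: `spssEpoch` and `FromSPSSSeconds` are hypothetical names, and it assumes values are whole seconds counted from midnight UTC on 1582-10-14 in Go's proleptic Gregorian calendar.

```go
package main

import (
	"fmt"
	"time"
)

// spssEpoch is midnight UTC on 1582-10-14. Go's time package uses the
// proleptic Gregorian calendar, so no Julian dates are involved here.
var spssEpoch = time.Date(1582, time.October, 14, 0, 0, 0, 0, time.UTC)

// FromSPSSSeconds interprets sec as seconds elapsed since spssEpoch.
// It adds Unix timestamps rather than using time.Duration, because a
// Duration overflows at roughly 292 years.
func FromSPSSSeconds(sec int64) time.Time {
	return time.Unix(spssEpoch.Unix()+sec, 0).UTC()
}

func main() {
	fmt.Println(spssEpoch.Unix())                                  // -12219379200
	fmt.Println(FromSPSSSeconds(0).Format("2006-01-02"))           // 1582-10-14
	fmt.Println(FromSPSSSeconds(12219379200).Format(time.RFC3339)) // 1970-01-01T00:00:00Z
}
```

Under these assumptions the derived offset matches `epochDelta`, and the offset alone maps 0 to 1582-10-14. Whether an additional ten-day shift is appropriate is worth checking against a value whose calendar date is known from SPSS itself.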

This works because extract_metadata tells us it is a date. However, the field Heure in the source file given above is reported as NUMERIC. It is indeed a numeric field, but with special formatting (hh:mm). It would be really nice to have this information added to the JSON output. Adding an extra property for those different data types would be really helpful and would keep the output backward compatible.

{
  "type": "SPSS",
  "variables": [
    {
      "type": "DATE",
      "name": "Date1",
      "label": "Date format 1"
    },
    {
      "type": "DATE",
      "name": "Date2",
      "label": "Date format 2"
    },
    {
      "type": "NUMERIC",
      "name": "Heure",
      "label": "Date - heure/seconde",
      "representation": "duration", // Just an example
      "format": "hh:mm" // Just an example
    },
    {
      "type": "STRING",
      "name": "Texte",
      "label": "Blabla"
    }
  ]
}
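To show why optional properties stay backward compatible for a consumer, here is a rough Go sketch that decodes both the current output and the proposed one with the same struct. The `representation` and `format` fields are just the example names from the JSON above, not an existing extract_metadata feature, and `FormatHHMM` assumes the raw value is a non-negative number of seconds.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Variable mirrors one entry of "variables". The last two fields are the
// hypothetical additions; they stay empty when the keys are absent.
type Variable struct {
	Type           string `json:"type"`
	Name           string `json:"name"`
	Label          string `json:"label"`
	Representation string `json:"representation,omitempty"`
	Format         string `json:"format,omitempty"`
}

// Metadata mirrors the top-level JSON object.
type Metadata struct {
	Type      string     `json:"type"`
	Variables []Variable `json:"variables"`
}

// ParseMetadata decodes a metadata document into a Metadata value.
func ParseMetadata(data []byte) (Metadata, error) {
	var m Metadata
	err := json.Unmarshal(data, &m)
	return m, err
}

// FormatHHMM renders a non-negative number of seconds as hh:mm,
// discarding any leftover seconds.
func FormatHHMM(seconds int64) string {
	return fmt.Sprintf("%02d:%02d", seconds/3600, (seconds%3600)/60)
}

func main() {
	current := `{"type":"SPSS","variables":[{"type":"NUMERIC","name":"Heure","label":"Date - heure/seconde"}]}`
	proposed := `{"type":"SPSS","variables":[{"type":"NUMERIC","name":"Heure","label":"Date - heure/seconde","representation":"duration","format":"hh:mm"}]}`

	for _, doc := range []string{current, proposed} {
		m, err := ParseMetadata([]byte(doc))
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		v := m.Variables[0]
		if v.Representation == "duration" {
			fmt.Println(v.Name, "as duration:", FormatHHMM(52200))
		} else {
			fmt.Println(v.Name, "as plain number:", 52200)
		}
	}
}
```

Existing consumers that do not know about the new keys simply ignore them, while newer ones can switch on `representation` to pick a display format.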

@basgys
Contributor Author

basgys commented Jan 21, 2021

@evanmiller I've released a new version of our project Polymorph, and someone will start testing the metadata extraction more thoroughly. We might have to improve extract_metadata soon.

Otherwise, I think we can consider this feature done and thus close this issue.

@evanmiller
Contributor

@basgys Sounds good!
