WIP: Dates #698
Couple of further comments, for the record
|
I know that @StefanKarpinski has strong opinions about how dates should be dealt with. All I know is that it's a big hairy mess, and that people like the representations used by Joda Time in Java and Lubridate in R. In particular, they make distinctions among instants, intervals, durations, and periods, which can be incredibly useful in dealing with relative date/time offsets, descriptions of time slices, arithmetic on time objects, etc. For statisticians and people in finance, this is incredibly important functionality. I'm super-glad someone's interested enough to work on this and contribute. I don't think borrowing C's date/time classes is a good idea at all. |
This is great functionality to have. My only strong opinion, actually, is that I think that seconds since the epoch as a
(For more time resolution, using a To elaborate on the third point, what I mean is that a single floating-point-since-the-epoch value has a completely unambiguous meaning: the instant when that amount of time has passed since Jan 1st, 1970 UTC. Now interpreting that moment is a different matter and can be ambiguous. That's where different calendar systems give you different results. But at least there's an unambiguous underlying time that is pinpointed. All that being said, I'm by no means an expert on representing and working with times. |
In fact, I can see having times be parameterized by any real type:
    type Time{T<:Real}
        time::T
    end
    type Interval{T<:Real}
        interval::T
    end
You could have
    -(t1::Time, t2::Time) = Time(t1.time-t2.time) |
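A minimal sketch of how such parameterized types might behave, written in current Julia syntax (`struct` rather than the `type` keyword used in this thread). The names `Instant` and `Span` are hypothetical stand-ins for `Time` and `Interval`, and having subtraction return a span rather than another time is an assumption of this sketch, not something the comment above specifies.

```julia
# Hypothetical stand-ins for the thread's Time and Interval types.
struct Instant{T<:Real}
    time::T        # seconds since the epoch, in whatever real type the caller picks
end

struct Span{T<:Real}
    length::T      # elapsed seconds
end

# Assumption of this sketch: Instant - Instant yields a Span,
# and Instant +/- Span yields an Instant.
Base.:-(a::Instant, b::Instant) = Span(a.time - b.time)
Base.:+(t::Instant, s::Span) = Instant(t.time + s.length)
Base.:-(t::Instant, s::Span) = Instant(t.time - s.length)

t1 = Instant(1_354_400_000)      # integer-backed instant
t2 = Instant(1_354_403_600)
elapsed = t2 - t1                # a Span holding 3600

tf = Instant(0.25) + Span(1.25)  # Float64-backed instant holding 1.5
```

Because the field type is a parameter, the same definitions serve integer seconds, floating-point seconds, or a wider integer type without any change to the arithmetic methods.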
Another library to consider stealing from would be the NumPy datetime64 data type: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html I think this is basically what Stefan is proposing, at first glance. That said, it's extremely useful to be able to say "January 12th" without reference to a year, as you can in Joda Time/Lubridate, or an interval of "4 months" that works properly when added to or subtracted from a timestamp, or a duration that is a pair of timestamps. It may well be possible to layer these concepts on top of a representation like the one being suggested; I don't know. |
@HarlanH So this current implementation is essentially an instant in JodaTime terms. It supports one duration unit: a day. Therefore, I think other units and classes can be layered over it. I want to do a financial analytics library in Julia, which obviously can't be done without dates, hence this effort. I think I now have all the primitives I need to implement a business calendar. @StefanKarpinski I'd started with the idea of storing dates and times as milliseconds since epoch, which is how Java's and Ruby's Time classes do it. However, the conversions to and from civil dates turn out to be much more involved. My implementation stores a canonical representation as days (and a fraction of a day for times.. coming soon) since the Julian epoch, which is 1/Jan/-4712. You can implement exactly the same API using milli(micro(nano))seconds since 1/1/1970, including pretty much all functionality of the Python impl HarlanH refers to above. Anyway, the point is, a single number is the canonical representation. Everything else (except possibly a timezone attribute) that's stored in composite types is effectively a cache for performance, and can be regenerated in constant time. Storing julian days as float64 will give us a resolution of 1e-11 seconds |
If it's relatively cheap to parametrize the type, it would be nice to have that option. |
Just as a heads up, a new version should be ready any time now. Not had much time last week, but back on it now. |
DateTimes are now parameterised by the type of the julian day number they store internally. Dates are an alias to DateTime{Integer}. Precision for DateTime{Float64} is about 10E-5 . So you get at least millisecond accuracy. Comments welcome. |
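The resolution of a `Float64` Julian day number can be checked directly with `eps`, which returns the spacing between adjacent floating-point values at a given magnitude. This is only an illustrative check, assuming a present-day Julian day number; the variable names are hypothetical.

```julia
# Julian day number for 2012-12-02 00:00 UT (JD 2451544.5 is 2000-01-01 00:00).
jd = 2456263.5

# Gap between this Float64 and the next representable one, in days.
step_days = eps(jd)

# The same gap expressed in seconds.
step_seconds = step_days * 86_400

println("resolution in days:    ", step_days)
println("resolution in seconds: ", step_seconds)
```

At this magnitude the spacing is 2^-31 days, which is a few tens of microseconds, so a `Float64` Julian day comfortably resolves milliseconds but not much finer.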
type DateTime{T<:Real}
    jd::T
    off::Float
    zone::ASCIIString
To make these packable in arrays, could time-zones be a fixed-length character buffer instead of an arbitrary string?
Yes, good idea. A char[3] should do. The zone is actually only a cosmetic field, never used for computations. The offset is the canonical info, but unfortunately it's not enough to display the zone.
There are fewer than 256 timezones, so a Uint8 offset into an array would be best (an immutable one if we had those).
It would also be possible to express timezone offsets as ±15-minute increments, ranging from -48 for UTC-12:00 to +56 for UTC+14:00. That's only 105 possible values, so it can be represented with an Int8 value easily. This avoids needing a lookup table and seems likely to make computation with timezones a lot easier and more immediate.
The timezone is actually stored in the off::Float field, as a day fraction.
The zone field was meant to store the string zone abbreviation, since there could potentially be a one-to-many relationship between time offsets and named zones.
Thinking some more about it, I think this is a mixing of concerns: the date object should not care what the human-readable zone is. That should be a concern for the date formatting routines. I am therefore minded to remove the string zone field altogether from the type.
Oo. Using an entire 8-byte float to store the time-zone seems excessive. Currently all values are stored in composite types as pointers to heap-allocated values, but the fact that offset is of an abstract type forces that representation. Once Jeff and I get around to implementing immutability of composite types, a field with a concrete type like Float64 or Int8 could be stored in the composite structure. Consider tz::Int8 instead that is the timezone offset from UTC in 15-minute units. Also, making something like a timezone offset subject to the complexities of floating-point arithmetic seems like asking for all kinds of trouble.
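As a rough illustration of the `tz::Int8` idea discussed in this review thread, the sketch below treats the offset as a count of 15-minute steps east of UTC and derives seconds and a display string from it with integer arithmetic only. The helper names `offset_seconds` and `offset_string` are hypothetical and not part of the pull request.

```julia
# Hypothetical helpers. tz counts 15-minute steps east of UTC:
# -48 is UTC-12:00 and +56 is UTC+14:00.

# Offset from UTC in seconds (900 seconds per 15-minute step).
offset_seconds(tz::Int8) = Int(tz) * 900

# Human-readable form such as "UTC+05:30".
function offset_string(tz::Int8)
    minutes = abs(Int(tz)) * 15
    sgn = tz < 0 ? '-' : '+'
    h, m = divrem(minutes, 60)
    return string("UTC", sgn, lpad(h, 2, '0'), ":", lpad(m, 2, '0'))
end

println(offset_string(Int8(22)))    # India, 22 quarter-hours east of UTC
println(offset_seconds(Int8(-20)))  # US Eastern standard time, in seconds
```

Keeping the field an integer sidesteps the floating-point concerns raised above, at the cost of not being able to name the zone; the abbreviation would still have to come from the formatting layer.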
A few lessons learned making accurate, high-performance datetime stuff:
(1) Users care about accuracy, speed and resolution. For important classes of apps (e.g. financial market analysis), it matters that datetime arithmetic is accurate and fast [no, faster].
(2) Nobody has requested sub-second resolution with dates before 1900. The astrophysics community has found that it takes a pair of 64-bit values to work with Julian dates at current levels of accuracy. GPS time signals are accurate to within (less than) 32 nanoseconds. If the representations are compatible, having a long-span datetime and a higher-resolution modern datetime is one good approach.
(3) With time-of-day, timezones matter (even when they are elided), and their robust, portable use will rely on IANA's time zone database. Daylight saving time means that once a year an hour is traversed twice -- that ambiguity should be resolved (by user input, perhaps with a default), not presumed. UTC serves as a 'common denominator' when working with timezones. Multiple conversions are much less error-prone when storing datetimes as, say, (UTC, timechange, isDST, zone) and converting to local time when show()ing it. Indexing each timezone in the IANA database requires 9 bits. Encoding timechange to/from UT in 15min increments requires another 8 bits (9 if timechange is for Standard Time and a separate bit is used to reflect DST -- a more flexible representation). An implementation might use the 16 msbs for a maskable timezone and the 16 lsbs for timechange in, say, 4sec units + DST bit.
(4) Leap seconds exist (starting in 1972, may be discontinued 2016). Most datetime systems ignore them for speed, which is ok only if the user does not care when UTC times are off by a second and the count of seconds from 1972 through 2012 is off by 25.
(5) While floating point is often used, integer types/structures with proper algorithms are safer, faster, and more robust for datetime representation.
(6) Datetime users think of timepoints as moments and as boundaries; they think of timespans (durations) and temporal intervals arithmetically. It helps if the datetime implementation is designed for clean and snappy support of such types [of the sort used in the R package lubridate; JSR-310 has another take]. |
@JeffreySarnoff: this is an incredibly useful list of observations. Thank you so much. Is there a solid C library that implements what you describe sufficiently that we could wrap? Do you consider lubridate a good example to look at? What do they get wrong and right? What about other date/time libraries? |
@StefanKarpinski: (thanks) There are available libs that work, but not in the way suggested. Lubridate is full of solutions to prior R date/time hiccups and other date ops, in addition to the type stuff. Rather than send you there, I am happy to write more focused design notes over the next days: is there some place to deposit a pdf? |
That would be great. The wiki would be a good place for that sort of thing although I'm not entirely sure how to put images up there. Maybe just email a PDF document to the dev list? |
Let me also commend Jeff on his excellent post. The C libraries i'm aware of are: |
@nolta: (thanks) I had not seen ICU, good to know; SOFA is an old friend. |
SOFA is kind of terrifying. |
Not that it looks bad or anything, it just does sooo much. |
I was too terse. SOFA is wholly inappropriate for Julia's core datetime functionality. I just meant that it is known to me and it is solid: I have used some of their routines to check other work, including spherical trigonometry. ... no worries ... toward the end of the week ... my comment will start: Time for Julia |
Julia's datetime encapsulates calendar, clock, microtimer and time zone information. Encoding compactly in C, that information exceeds 64 bits but does not exceed 128 (integer) bits. (is this accurate) This should enhance seamless handling of datetime-typed vectors when working with long timeseries. And would create flexibility -- it then becomes easy for Julia to give itself a package like R's zoo, xts. The datetime realization would use the same C libraries' routines with or without UInt128 and Int128. Absent those types, we have a less facile facility; some of the higher level stuff is best done in Julia and requires the representation handshake to 'just work' seamlessly. |
So it seems like this is a real-life use case for 128-bit integers? @JeffBezanson and I had discussed that at some point and were unable to come up with cases where 64 bits wasn't enough or it wouldn't be more appropriate to just use arbitrary precision integers. I'd still like to allow the programmer to choose how much precision they want, but having the option to use an |
Time for Julia [a] (a) do more by doing less. I have re-reviewed some accessible datetime libraries. After dismissing the problematic and the more restrictively licensed, and disqualifying some good ones because they were not C, FORTRAN or C++ libraries, few remained. Use of the IANA Time Zone database is crucial. Available APIs with that facility do not support leap seconds. Julia has a better time with both capabilities. The most reliable and well-maintained Julia-wrappable datetime API is within ICU (kudos to @nolta). Fortuitously, it provides an encoding which fits well with my proposal. Given (date, time, region), ICU4C would deliver (GMT datetime, timezone), which would be converted into (UTC datetime, timezone index) [leap seconds, fast lookup] and then gently folded into a box. To show() it: unfold, unleap, show(ccall(ICU)). The ICU datetime API knows about regions and instants. Offering capabilities found in Joda-Time and Lubridate requires that and more: giving Julia facile understanding of regions, instants, intervals, durations and granules. For some interested party, other parts of ICU are useful in making an internationalization package. "ICU is a cross-platform Unicode based globalization library. It includes support for locale-sensitive string comparison, date/time/number/currency/message formatting, text boundary detection, character set conversion and so on." - FAQ. ICU http://site.icu-project.org/, License http://source.icu-project.org/repos/icu/icu/trunk/license.html, |
Time for Julia [b] (b) links for Lubridate, Joda-Time, JSR-310: Joda-Time brought its community a better class logic and mellowness. Lubridate co-opted some of that, broadened arithmetic ops and offered R-centric smoothness. This development arc continues with the work Stephen Colebourne is doing on JSR-310, A new Date and Time API for JAVA. Even in pre-release form it is a clear step up. Read "Why JSR-310 isn't Joda-Time": http://blog.joda.org/2009/11/why-jsr-310-isn-joda-time_4941.html For more on Joda-Time and Lubridate: http://joda-time.sourceforge.net/userguide.html, http://www.jstatsoft.org/v40/i03/paper, http://cran.r-project.org/web/packages/lubridate/lubridate.pdf. |
Time for Julia [c] (c) partial solution, for the most part
Some Time Measure Terms
Proleptic: The calendar system obtained by projecting a given calendar system back in time past its actual inception. For example, the Gregorian calendar was first adopted in 1582, but the proleptic Gregorian calendar can be used to indicate earlier dates. To match such an earlier date with whatever calendar was in use at that earlier time, a conversion is required.
Local Time: The time read from an accurate wall clock.
Local Standard Time: Local Time when Daylight Saving Time is not in effect.
Local Daylight Time: Local Standard Time adjusted by the Daylight Saving offset when Daylight Saving Time is in effect.
Timezone: A region where Local Standard Time is obtained by a given offset from UTC and wherein Daylight Saving Time, if used, is in effect over given dates.
Second: The SI Second is a consistently measurable duration, roughly equal to 1/86400 of the [inexactly defined] mean solar day. Formally, it is the duration of 9,192,631,770 periods of radiation corresponding to the transition between two hyperfine levels of the ground state of the cesium 133 atom at 0K.
TAI (International Atomic Time): A time scale with unit interval exactly equal to 1 SI second. First measured in 1955, it uses 1958-Jan-01 for an origin. As each step of the TAI clock corresponds to a duration of 1 SI second, TAI does not reflect leapseconds. "It is recommended by the BIPM that systems which cannot handle leapseconds use TAI instead."
UTC (Coordinated Universal Time): The internationally adopted timebase from which Local Standard Time (and so, Local Daylight Time) is determined. It incorporates leapseconds. Introduced on 1972-Jan-01, it differs from TAI by an integral number of seconds from then on (proleptic use requires formulae for 1961-1971).
These categories of timescale+resolution cover the uses I have seen: (perhaps interfacing to an updated, resolution-customized version of libtai http://cr.yp.to/libtai.html with a separate interface to the timezone database is worth a look)
If the Julia community feels it important to have a flavor of datetime that performs as fast as possible, that is available for the 'common scale'. Doing so would entail having at least one other (probably two) specializations of generic datetime. Alternatively, one may subsume the four timescale categories in a single type of datetime, each instance using a larger multi-field structure. One way of realizing the 'common scale' (a specialization of datetime) follows.
The Gregorian calendar has a largest natural cycle of 400 years. This cycle (re)started on 1600-Jan-01 and 2000-Jan-01 (and will restart on 2400-Jan-01). It is advantageous to work with this cycle, as it allows more coherence through lower level routines. For 'common scale', covering the two cycles surrounding 2000 offers sufficient breadth. Restricting the resolution to microseconds offers sufficient resolution. There are fewer than 600 timezone identifiers.
The Gregorian calendar has a 400 year cycle of 146097 days. Cardinally labeled as sequential daynumbers, identifying each day in 400 years uses 18 bits. The near cycles are: 1600..1999, 2000..2399. Together they cover 292194 days; day labeling 800 years uses 19 bits. It takes 20 bits to label each of the million microseconds within one second; it takes 30 bits to label each of the billion nanoseconds in a second. A single day has 86400|1 seconds, so 37 bits will label a day of microseconds. The microsecond count must correspond to TAI or UTC (not local time). Local Time is determined by a timezone-indexed lookup of the offset from UTC. Covering 292194 days of microseconds requires 55 bits. Allocating 9 bits to hold unique timezone index numbers would suffice for up to 512 identifiers.
So, 64 bits allow representing the years 1600-2399 in microseconds, with timezone indices. With the timezone index in the lower order bits, 'common scale' datetimes over different regions would sort correctly (not necessarily a stable sort) without premasking, but safely using the microsecond count requires pre-shifting. With the microsecond count in the lower order bits, masking out the timezone index is forced for most operations -- which may be safer. |
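The bit budget for this 'common scale' layout can be checked mechanically. The sketch below computes the minimum width for each count and then packs a microsecond count and a timezone index into one `UInt64`, with the index in the low-order bits. All names (`bits_needed`, `pack`, `unpack`, `TZ_BITS`) are hypothetical, the 9-bit index width is taken from the comment above, and leap seconds are ignored in the microsecond total.

```julia
# Smallest number of bits that can label n distinct values (0 .. n-1).
bits_needed(n::Integer) = ndigits(n - 1, base = 2)

days_400   = 146_097                          # one 400-year Gregorian cycle
days_800   = 2 * days_400                     # 1600..2399
micros_800 = days_800 * 86_400 * 1_000_000    # microseconds, ignoring leap seconds

println("days in 400 years:      ", bits_needed(days_400), " bits")
println("days in 800 years:      ", bits_needed(days_800), " bits")
println("microseconds in 1 s:    ", bits_needed(1_000_000), " bits")
println("microseconds in 800 yr: ", bits_needed(micros_800), " bits")

# Assumed layout: microsecond count in the high bits, timezone index in the low 9.
const TZ_BITS = 9
const TZ_MASK = (UInt64(1) << TZ_BITS) - 1

pack(us::Integer, tz::Integer) = (UInt64(us) << TZ_BITS) | (UInt64(tz) & TZ_MASK)
unpack(x::UInt64) = (x >> TZ_BITS, x & TZ_MASK)

packed = pack(micros_800 - 1, 511)            # largest microsecond count, largest index
us, tz = unpack(packed)
```

With the index in the low bits, comparing two packed values as plain unsigned integers orders them by microsecond count first, which is the sorting property the comment describes; recovering the count needs the shift shown in `unpack`.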
A reliable, more current calendar&clock, available-for-use library (needs a separate interface to the IANA timezone database) is http://code.google.com/p/cdatecalc/ |
Ok, I'm just trying to make sure I understand the issues here. Can you give an example of how two valid UTC strings can map to the same POSIX time? |
This is relevant, of course: http://en.wikipedia.org/wiki/Unix_time#Encoding_time_as_a_number |
Like i said, take a look at the 2nd table, particularly the bold entries. |
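For a concrete instance of two UTC labels landing on one POSIX value, the sketch below evaluates the "Seconds Since the Epoch" expression from the POSIX specification for the leap second at the end of 1998 and for the second that follows it. The function name `posix_seconds` is hypothetical; its arguments follow the `struct tm` conventions (`tm_year` is years since 1900, `tm_yday` is the zero-based day of the year).

```julia
# POSIX "Seconds Since the Epoch" expression, using integer division throughout.
function posix_seconds(tm_year, tm_yday, tm_hour, tm_min, tm_sec)
    return tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400 +
           (tm_year - 70) * 31536000 + div(tm_year - 69, 4) * 86400 -
           div(tm_year - 1, 100) * 86400 + div(tm_year + 299, 400) * 86400
end

# 1998-12-31T23:59:60 UTC, the inserted leap second (day 364 of a non-leap year).
leap = posix_seconds(98, 364, 23, 59, 60)

# 1999-01-01T00:00:00 UTC, the very next second.
after = posix_seconds(99, 0, 0, 0, 0)

println(leap)
println(after)
println(leap == after)
```

Both calls give the same number, so the mapping from UTC labels to POSIX time is not one-to-one across a leap second, and the reverse conversion cannot tell the two instants apart.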
Right, ok. So it seems like converting from my proposed |
Sorry about not seeing your comment before posting the same link, Mike. Once I refreshed the page I saw it. |
No -- although for floating point calcs, using ~2.390625 x the desired results' bitwidth is a really good heuristic. The 64bit-limited version of our internal timebase is not of the same nature as the representations we may fully support and others with which we may interwork. OTOH all of these considerations are obviated with the 128bit type for most common uses (timing results from CERN over the life of some project is one exception, long-term ephemeris development from multisourced observations is another). |
POSIX time is profoundly mucked up -- requires waders. The first POSIX
|
Quoting @JeffreySarnoff since his post is getting mangled:
I would very much like to have a design that could, e.g., be used for things like analyzing CERN data. I'm willing to go 128-bit, if that's what it takes. 128 bits seems like it should be enough for anything sane. I'm still not understanding this rounding/precision issue at all. |
do go 128 bits -- there is much that is nicer for users and it saves having The rounding/precision is not an issue that applies to many cases of any
|
Consider using +/- (2^111 - 1), reserving the remaining most
|
I'm pretty weirded out by having 16 bits that we're not doing anything with? What are we going to stuff in there? |
And slightly akimbo, did not view it that way -- an app is not directly accessing the significance of those bits. |
It would be very helpful to have minimal examples of best practice when defining a bitstype that is a subtype of this one. Frequent trial and more frequent error has dogged my placing of or omitting of the parameter. It has been more an effort in permutation counting than understanding the whys and wherefores of explicit vs implicitly given at each possible location within a function signature. Seeing the proper approach to dependency ordering for multi-parameterized bitstype dispatch-specific definitions of reinterpret, convert, promote_type, show and [what are the others that should be part of the mix?] would be great. |
So what's the proposal here? That we use TAI as the internal representation? TAI doesn't exist in any form before 1955. How do we represent earlier dates? Do we instead switch to terrestrial time (TT), of which TAI is a realization? Since people usually specify dates as UT, we'd have to apply a ΔT = TT-UT correction. |
Like TAI's absence from history before 195x, the Gregorian calendar is absent from history before 156x and then remained unadopted everywhere for another ~20 years. The Julian calendar is used only irregularly before 6 CE, and is wholly absent before ~360 BCE (as I recollect). Established calendric frames with desirable properties are not historically covering. For an internal representation, we want the sands of time to be as glass, fused rather than occasionally shifting. [Allegorically, forest rescue teams advise that staying put gives them the best chance to find someone.] Adopting an internal representation that 'stays put' (is used in a proleptic manner) will allow easier coding of interconversions and intrinsically squash some potential inexactness. The first way is to keep the internally monotonic and uniform day|second|femtosecond|Planck-time counter riven to the SI second (which itself may be redefined in a few years) and, therewith, the TAI day of 86_400 SI seconds. Use proleptic astronomical [i.e. year number zero exists] Gregorian rules to obtain an internally consistent, perfectly unwindable internal labeling of year month day with SI-seconds+subsecond-tocs of the day. The second way is to obtain best-in-show DeltaT values from NASA's polynomials (available for years after -2000) http://eclipse.gsfc.nasa.gov/SEcat5/deltatpoly.html and let TAI slip backwards as TT when internalizing time. Externalizing such time uses piecewise rational approximations to invert the quartic, quintic .. polynomials. |
Could you explain what your "first way" in more detail? |
Some simplification does not change the What and Why of the first way. The internal timebase covers all years -3999 .. +3999 and supports no others. One Day_SI (defined as 86_400 Seconds_SI [60 seconds per minute * 60 minutes per hour * 24 hours per day]) is adopted as the nominal unit of internal time (nominal because our time has a heartbeat: uint(beats) times each Day_SI); then there is no conflation of TAI proleptic handling with Gregorian proleptic handling. The temporal perspective that the "first way" evinces applies pari passu. Here is that. The daynumber computation enfolds the first way. A well known algorithm provides the return trip. |
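The daynumber code referred to here is not reproduced in this thread. As a stand-in, the sketch below shows one widely used pair of routines for the proleptic Gregorian calendar with a year zero, based on the 400-year era decomposition described by Howard Hinnant; it is not the poster's implementation. The function names are hypothetical, and counting days from 1970-01-01 is an arbitrary choice made for this example.

```julia
# Days since 1970-01-01 for a proleptic Gregorian date (year zero exists).
function days_from_civil(y::Integer, m::Integer, d::Integer)
    y = m <= 2 ? y - 1 : y                        # treat Jan and Feb as end of previous year
    era = fld(y, 400)                             # which 400-year cycle
    yoe = y - era * 400                           # year of era, 0..399
    doy = div(153 * (m > 2 ? m - 3 : m + 9) + 2, 5) + d - 1   # day of March-based year
    doe = yoe * 365 + div(yoe, 4) - div(yoe, 100) + doy       # day of era, 0..146096
    return era * 146097 + doe - 719468
end

# The return trip: day number back to (year, month, day).
function civil_from_days(z::Integer)
    z += 719468
    era = fld(z, 146097)
    doe = z - era * 146097
    yoe = div(doe - div(doe, 1460) + div(doe, 36524) - div(doe, 146096), 365)
    doy = doe - (365 * yoe + div(yoe, 4) - div(yoe, 100))
    mp = div(5 * doy + 2, 153)                    # March-based month, 0..11
    d = doy - div(153 * mp + 2, 5) + 1
    m = mp < 10 ? mp + 3 : mp - 9
    y = yoe + era * 400
    return (m <= 2 ? y + 1 : y, m, d)
end

println(days_from_civil(2000, 1, 1) - days_from_civil(1600, 1, 1))  # length of one cycle
println(civil_from_days(days_from_civil(-3999, 1, 1)))
```

Both directions use only integer arithmetic and floored division, so the labeling is exactly reversible across the whole -3999 .. +3999 range mentioned above.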
These daynumbers are not a translation of Julian dates. Julian dates are tallying days of 86_400 Seconds_SI. |
Every historical date is astronomical. If you're going to represent time as "proleptic TAI", then you're going to have to apply a ΔT correction to convert back and forth. |
We should not choose proleptic use of the historical TAI for our internal timebase. TAI has only been well aligned with the SI second for the past 10+ years. Before this, the same name has a different nature. DeltaT improves accuracy of distant glimpses, and is of import to many. The sorts of interframe shifting and intraframe sliding entailing DeltaT are not naturally self-managed. A cleaner mechanism obtains when the core competency our internal timeway governs. |
"the first way seen a second way" M The first way is to keep the internally monotonic and uniform day|second|femtosecond|Planck-time counter riven to the SI second (which itself may be redefined in a few years) and, therewith the TAI day of 86_400 SI seconds. Use proleptic astronomical [i.e. year number zero exists] Gregorian rules to obtain an internally consistent, perfectly unwindable internal labeling of year month day with SI-seconds+subsecond-tocs of the day.
|
As I ready files for public display ( fresh DeltaT, daily specials ) and see something of specific interest ...
|
jasQuoRem(n::Signed, d::Unsigned) = begin q=fld(n,d); (q, convert(Signed,(n-(d*q)))) end
jasQuoRem(n::Int64, d::Uint64) = begin q=fld(n,d); (q, convert(Int64,(n-(d*q)))) end
jasQuoRem(n::Int64, d::Int64 ) = jasQuoRem(n, convert(Unsigned, abs(d))) |
Carrying timezone information in Julia without a Composite Type requires two parameters. I find three works best.
|
Hi, I want to do some work with dates and want to know if there is any code to use. I couldn't find a module related to dates inside ~/.julia/. Glen |
I have a feeling that we have a bunch of Date and Time stuff, but I can't see it in the packages. Would be nice to have it in there. |
There's the Calendar package as well as Stefan's draft DateTime type. |
From what I see in Calendar and tm4julia, they look really good. Seems much more intuitive than Python! I'll use Calendar until something better is packaged. Out of curiosity, does anyone know where Stefan's DateTime lives? |
Glen -- I appreciate your good mention of tm4julia. I am progressing better than Sisyphus. |
So here is initial support for Date and DateTime objects in Julia. Currently implemented is date arithmetic.
There are more things to do of course. First is to implement similar date arithmetic on DateTime. The big one however is date formats and parsing... that's a large piece of work by itself. However, the current functionality is self contained and useful.
There are quite a few tests in tests/date.jl
I've had to read a struct off libc, and so there is some simple but dirty looking C code in support/timefuncs.c/h. Not sure if that is the best way.
Let me know what you guys think about all this.