Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Making data gaps visible in line charts by inserting zeros #3512

Closed
wants to merge 1 commit into from

Conversation

fabianmenges
Copy link
Contributor

@fabianmenges fabianmenges commented Sep 21, 2017

Description

For many different kinds of metrics you might want to immediately see that you did not record any values in a given time period (e.g. not registering any user impressions for the first minute of every hour). With the current line chart visualization this would be very difficult to spot, you would have to spot a "missing" marker: 1 out of 60 every hour.

This merge request adds a "insert zeros" control that takes the current selected Time Grain into account to insert 0s into the time series data if the time series did not contain any data points for a given period specified by the Time Grain.

The main difference to resampling is, that is does this entirely on the client side and only very few data points are added to the time series.

Example 1:

We are visualizing the birth_names example datasource. As you can see looking at the markers on the first screenshot we have one data point per year.
markers

As you can see, we selected the time grain "month". Checking the box "Insert Zeros" will now
scan the time series: if two data points are more than one month apart from one another it will insert a 0 data point into the time series accordingly.

zeros

Example 2:

Same dataset birth_names. I deleted all rows with a timestamps between 1967 and 1968. Without markers you would not be able to notice them missing, with markers its still pretty difficult:
markers_missing
When you insert 0s it is easy to see:
zeros_missing

Features:

  • Everything happens on the client side
  • Uses the time series grain,
  • Works with Druid and Sql
  • Works with Grouped and Filtered Data
  • If the time grain is set to "Time Column" or "All" it applies a heuristic to guess the period length.
  • It will only insert a minimal number of 0s to make the missing data visible.

I added unit tests for the granularity/time-grain/period parsing. Happy to add a unit test for the adding 0 logic, I couldn't find a good place to add the spec file.

This is related to: #487 without putting 0s everywhere
Happy to get some feedback

@coveralls
Copy link

coveralls commented Sep 21, 2017

Coverage Status

Coverage remained the same at 69.535% when pulling 1f477f6 on tc-dc:fmegnes/zeros into 9af34ba on apache:master.

@fabianmenges
Copy link
Contributor Author

If you think it makes more sense to enhance resampling on the server side (e.g. support the various period types, derive it from the time grain) I'm happy to improve that experience instead.

@mistercrunch
Copy link
Member

I like the fact that "insert zeros" as a trigger is much more intuitive than "Resampling". I'm thinking server side on this one. Just because pandas is so simple for this.

See how time grain is fairly loosely defined at the moment? https://github.com/apache/incubator-superset/blob/master/superset/db_engine_specs.py#L192

One easy approach would be add a corresponding pandas freq to each time grain. Knowing this, it would be easy to resample whenever needed.

"Resampling" came before the "Time-Grain" concept and is now confusing/obsolete. I feel like we could trigger the pandas resampling based on time grain configuration with a simple checkbox.

In that context, there may be some very few specific use cases where people may want the extra flexibility of resampling with specific fill patterns but I wouldn't be against deprecating the feature.

@mistercrunch
Copy link
Member

mistercrunch commented Sep 22, 2017

An extra point for pandas vs having our own JS function for this is that pandas (from memory) deals nicely with timezones, such that when we want to start thinking about comprehensive time zone management pandas may do some of that magic too.

@fabianmenges
Copy link
Contributor Author

K, I'll take care of this and bring it down to the server side, should be able to get something going by end of next week.

We probably need to have a function to translate the time grains to pandas frequencies (like the one I wrote in JS to ms) because we also need to cover Druid and the granularities for Druid are handled completely independent. Even better would probably be to unify the concept of druid granularity/pandas frequencies (offset-aliases)/sql time grain because they are all used almost interchangeably within superset.

Apart from that I don't think that my function would have problems with time zones since it just looks at raw milliseconds since epoch...

@mosche
Copy link
Member

mosche commented Mar 1, 2018

@fabianmenges This would be super helpful with respect to the time grain problems mentioned in #487
Did you manage to implement this feature on the server side in the meanwhile?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants