This repository has been archived by the owner on Oct 29, 2024. It is now read-only.

chore(line_protocol): fix nanosecond timestamp resolution for points #811

Merged
merged 1 commit into from
Apr 8, 2020

Conversation

sebito91
Contributor

@sebito91 sebito91 commented Apr 8, 2020

Closes #407.
Closes #650.
Closes #649.
Closes #527.
Closes #489.
Closes #346.
Closes #344.
Closes #340.

This PR merges work done in #407 into the current master without inclusion of external requirements (e.g. pandas). Thanks to @AndreCAndersen and @clslgrnc !

@sebito91 sebito91 requested review from aviau and xginn8 as code owners April 8, 2020 20:25
@sebito91 sebito91 self-assigned this Apr 8, 2020
@sebito91 sebito91 removed request for aviau and xginn8 April 8, 2020 20:26
@sebito91 sebito91 merged commit 04205cf into master Apr 8, 2020
@sebito91 sebito91 deleted the merge_407 branch April 8, 2020 21:17
@inselbuch

rockstars!

@sebito91
Contributor Author

sebito91 commented Apr 8, 2020

Please let me know if it's not working and we'll get it fixed. Release v5.2.4 set to come out Friday, April 10th, 2020.

@clslgrnc
Contributor

clslgrnc commented Apr 9, 2020

Should release v5.2.4 come with a warning in the unusual case where someone tries to access a point inserted with v5.2.3 from its timestamp?

As an example if I insert a point every millisecond, some are inserted with a wrong timestamp, but it might not be an issue because when I try to retrieve a point at a given timestamp the same error is made and I retrieve the right point. Migrating to v5.2.4 the error is not made anymore and influxdb would return no points at the requested (correct) millisecond (because the actual point is slightly off).

I agree that this is a corner case and probably not how people should use influxdb.


Edit: Actually, I am not sure the same error is made when retrieving points, and that is probably how this bug was detected.

@cuxcrider

Hi all,

Thank you for continuing to work on this.

I am finding potentially two issues:

  1. (minor) I can confirm I can read in nanosecond timestamps with the dataframe client but only if I specify epoch = 'ns'. If I do not specify epoch I get microsecond precision. I believe this is contrary to the documentation that says nanosecond is the default.

  2. (more troublesome) The dataframe client will write a nanosecond timestamp, but if another timestamp falls within a few hundred nanoseconds of it, the client seems to treat the two as the same timestamp and does not write the additional datapoint. This is strange to me because I do not remember this behavior from the discussions with contributors on #649 (Incorrect nanosecond timestamps being written to influxdb), but if you now run the code I posted there you will see that you get only two data points rather than all four. I have tried Pandas 1.1.2 and 0.23.4 with the same results. I am on Numpy 1.19.1, Python 3.7.9, and Influxdb-python 5.3.0. If you change timestamp '2019-10-04 06:27:19.850557111+00:00' to '2019-10-04 06:27:19.850555111+00:00' then the datapoint is written.

Any thoughts?

Here is my example code you can use to demonstrate the result, just enter in your user, password, host-ip and database name:

from influxdb import InfluxDBClient, DataFrameClient
import numpy as np
import pandas as pd
pd.show_versions()

# I use this to make sure that anything numeric is written to influxdb as a float
def df_int_to_float(df):
    try:
        for i in df.select_dtypes('number').columns.values:
            df[i] = df[i].astype('float64')
    except Exception:
        print('cycle not in dataframe')
    return df

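### remember to replace the placeholder host, user, password, and database name
def main(host='YOUR_HOST', port='8086'):  # placeholder, not a real host
    """Instantiate a connection to the InfluxDB."""
    
    user = 'YOUR_USER'          # placeholder
    password = 'YOUR_PASSWORD'  # placeholder
    db_name = 'YOUR_DB_NAME'    # placeholder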
    client = InfluxDBClient(host, port, user, password)
    client.drop_database(db_name) 
    client.create_database(db_name)
    dfclient = DataFrameClient(host, port, user, password, db_name)
            
    for_df_dict = {"nanFloats": [1.1, float('nan') , 3.3, 4.4], "onlyFloats": [1.1, 2.2, 3.3, 4.4], 
                                  "strings":['one_one', 'two_two' ,'three_three', 'four_four']}
    df = pd.DataFrame.from_dict(for_df_dict)
    df['time'] = ['2019-10-04 06:27:19.850557111+00:00', '2019-10-04 06:27:19.850557184+00:00', '2019-10-04 06:27:42.251396864+00:00',
      '2019-10-04 06:27:42.251396974+00:00']
    df['time'] = pd.to_datetime(df['time'], unit='ns')
    df = df.set_index('time')
    df = df_int_to_float(df) 
    #####  df_types just for informational purposes
    df_types_float = df.select_dtypes(include = ['float64']) 
    df_types_bool = df.select_dtypes(include = ['bool'])
    df_types_obj = df.select_dtypes(include = ['object'])
    ########
    dfclient.write_points(df, 'test', time_precision='n')  
    df_dict = dfclient.query('SELECT * FROM \"test\" ', epoch = 'ns')
    
    

if __name__ == '__main__':
    main()

@lihaoml

lihaoml commented Nov 9, 2020

I got the same issue here with v5.3.0 and v5.2.3

I think the bug is in _dataframe_client.py:

replacing
time = ((dataframe.index.to_timestamp().values.astype(np.int64) / precision_factor).astype(np.int64).astype(str))

with
time = ((dataframe.index.to_timestamp().values.astype(np.int64) // precision_factor).astype(np.int64).astype(str))

fixes the issue for me.

Not a python expert, but it looks like in python3 int/int = float
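
A minimal sketch of why `/` would merge nearby nanosecond timestamps while `//` keeps them apart. It uses plain Python integers rather than the client's numpy arrays, and it assumes the divisor for nanosecond precision is 1; the names `via_true_division` and `via_floor_division` are hypothetical helpers for illustration, not part of influxdb-python. The two timestamps are the first two rows of the example above.

```python
from datetime import datetime, timezone

# Whole seconds since the epoch for 2019-10-04 06:27:19 UTC.
base_s = int(datetime(2019, 10, 4, 6, 27, 19, tzinfo=timezone.utc).timestamp())

# Nanosecond epoch timestamps for the first two rows of the example.
t1 = base_s * 10**9 + 850557111
t2 = base_s * 10**9 + 850557184

precision_factor = 1  # assumption: divisor used for nanosecond precision


def via_true_division(ns):
    # "/" returns a float in Python 3. A 64-bit float has a 53-bit
    # significand, so integers of this size (about 1.57e18) are only
    # representable in steps of 256.
    return int(ns / precision_factor)


def via_floor_division(ns):
    # "//" stays in integer arithmetic, so no precision is lost.
    return ns // precision_factor


print(via_true_division(t1) == via_true_division(t2))    # True: the points collide
print(via_floor_division(t1) == via_floor_division(t2))  # False: the points stay distinct
```

If this matches what the client does internally, two points less than a few hundred nanoseconds apart end up with the same timestamp, and the second write overwrites the first.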
