-
Hmmm, these two issues both look odd to me. For the memory, we do release all the memory we use except for the final df, but whether that memory is returned to the system is a behaviour of the allocator. We will investigate to see if there's anything we can do here. For the printing issue, my guess is that the Python interpreter is blocked by connectorx so that it cannot flush stdout automatically. We will also investigate whether this is the case and how to fix it.
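If it is indeed the allocator holding onto freed pages, one low-risk thing worth trying on glibc-based Linux (an assumption on my part; `malloc_trim` is glibc-specific and the helper name `trim_memory` is mine) is asking glibc to hand freed arenas back to the OS after dropping the df:

```python
import ctypes
import gc

def trim_memory() -> int:
    """Ask glibc to return freed heap pages to the OS (Linux/glibc only)."""
    gc.collect()                     # drop Python-level references first
    libc = ctypes.CDLL("libc.so.6")  # glibc; not present on macOS/Windows
    return libc.malloc_trim(0)       # returns 1 if any memory was released
```

If RSS drops after calling this, the memory was never leaked, only retained by the allocator.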
-
Here is a photo of the same process but run with pandas read_sql. Note that Python's memory ends at 118.2 MB, as opposed to 43,652.3 MB when using ConnectorX. Creating the df with pd.read_sql took on average 13 min per loop (4 loops total); ConnectorX averaged 8 min, so about 40% faster. But my concern is the memory consumption, as this process will run in GCP where resources are costly, which is why I hope to use ConnectorX in place of pd.read_sql (in addition to it being faster). Also, with pd.read_sql the memory would rise to about 16 GB then drop back to around 100 MB on each loop, and it did not accumulate.
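For anyone reproducing the comparison, here is a minimal sketch of measuring the two readers side by side; the connection string, query, and the `psutil` RSS check are illustrative assumptions, not details from the screenshots:

```python
import gc
import time

import connectorx as cx
import pandas as pd
import psutil

URI = "postgresql://user:pass@host:5432/db"  # hypothetical connection string
QUERY = "SELECT * FROM some_table"           # hypothetical query

def measure(label, read):
    proc = psutil.Process()
    start = time.perf_counter()
    df = read()
    elapsed = time.perf_counter() - start
    del df
    gc.collect()
    rss = proc.memory_info().rss / 2**20
    print(f"{label}: {elapsed:.1f}s, RSS after del+gc = {rss:,.1f} MB", flush=True)

measure("pandas", lambda: pd.read_sql(QUERY, URI))
measure("connectorx", lambda: cx.read_sql(URI, QUERY))
```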
-
Hi, will there be any solution for this? Pardon me if you are still working on it. Thanks
-
Hey @abel-diaz, can you try the
-
Success! ConnectorX is awesome! The bug has been fixed in cx: I watched the memory climb then drop with each chunk. It went up to 14 GB, then back down toward 0, as each chunk was written. Also, print statements no longer need a flush to be seen. On average, CX was 47% faster than PD at reading SQL into a df and consumed 33% less memory. Huge thank you, this is a great program. Photos show CX and PD performance. Notice that no memory is held after CX completed; before, it was 43 GB.
-
I have a for loop through 4 tables, and each time I delete the df and run garbage collection. When using pandas read_sql this worked fine and resource usage stayed low; with ConnectorX, the resources are being held. The photo shows Python consuming 43 GB to process all 4 tables. You can see from the print statements that "creating dataframe" is where ConnectorX fetches the data. I admit it's impressive that it took 8 minutes to get 3M rows using 6 cores, but it came at a cost: the memory is not being released.
Is there a call to close the connection, or to let del and gc.collect() do their job? This worked with pandas read_sql but not with ConnectorX. Another (small) issue is that print statements are held until completion and I am forced to use sys.stdout.flush() to see them as they occur; this was not the case with pandas read_sql. A simplified version of my loop is below.
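This is roughly the loop shape (connection string, table names, and the output step are placeholders); note that `print(..., flush=True)` is the built-in way to avoid the separate `sys.stdout.flush()` call:

```python
import gc

import connectorx as cx

CONN = "mssql://user:pass@host:1433/db"  # placeholder connection string
TABLES = ["t1", "t2", "t3", "t4"]        # the 4 tables in my loop

for table in TABLES:
    print(f"creating dataframe for {table}", flush=True)  # flush so it shows immediately
    df = cx.read_sql(CONN, f"SELECT * FROM {table}")
    # ... write df out, e.g. df.to_parquet(f"{table}.parquet") ...
    del df        # drop the only reference to the frame
    gc.collect()  # reclaim Python objects; the allocator may still hold the pages
```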
Main question: why is the memory held, and what can I do about it?
Thanks
Abel