-
Hmmm, these two issues both look odd to me. For the memory, we do release all the memory we use except for the final df, but whether that memory is returned to the system is a behaviour of the allocator. We will investigate to see if there's anything we can do here. For the printing issue, my guess is that the Python interpreter is blocked by connectorx so that it cannot flush stdout automatically. We will also investigate whether this is the case and how to fix it.
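If it is indeed the allocator holding onto freed pages, one low-risk thing worth trying on glibc-based Linux (an assumption on my part; `malloc_trim` is glibc-specific and the helper name `trim_memory` is mine) is asking glibc to hand freed arenas back to the OS after dropping the df:

```python
import ctypes
import gc

def trim_memory() -> int:
    """Ask glibc to return freed heap pages to the OS (Linux/glibc only)."""
    gc.collect()                     # drop Python-level references first
    libc = ctypes.CDLL("libc.so.6")  # glibc; not present on macOS/Windows
    return libc.malloc_trim(0)       # returns 1 if any memory was released
```

If RSS drops after calling this, the memory was never leaked, only retained by the allocator.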
-
Here is a photo of the same process but run with pandas read_sql. Note that Python's memory ends at 118.2 MB, as opposed to 43,652.3 MB when using ConnectorX. Creating the df with pd.read_sql took on average 13 min per loop (4 loops total); ConnectorX averaged 8 min, so about 40% faster. But my concern is the memory consumption, as this process will run in GCP where resources are costly, which is why I hope to use ConnectorX in place of pd.read_sql (in addition to it being faster). Also, with pd.read_sql the memory would rise to about 16 GB then drop back to around 100 MB on each loop, and it did not accumulate.
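For anyone reproducing the comparison, here is a minimal sketch of measuring the two readers side by side; the connection string, query, and the `psutil` RSS check are illustrative assumptions, not details from the screenshots:

```python
import gc
import time

import connectorx as cx
import pandas as pd
import psutil

URI = "postgresql://user:pass@host:5432/db"  # hypothetical connection string
QUERY = "SELECT * FROM some_table"           # hypothetical query

def measure(label, read):
    proc = psutil.Process()
    start = time.perf_counter()
    df = read()
    elapsed = time.perf_counter() - start
    del df
    gc.collect()
    rss = proc.memory_info().rss / 2**20
    print(f"{label}: {elapsed:.1f}s, RSS after del+gc = {rss:,.1f} MB", flush=True)

measure("pandas", lambda: pd.read_sql(QUERY, URI))
measure("connectorx", lambda: cx.read_sql(URI, QUERY))
```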
-
Hi, will there be any solution for this? Pardon me if you are still working on it. Thanks
-
Hey @abel-diaz, can you try the
-
Success! ConnectorX is awesome! The bug has been fixed in cx: I watched the memory climb then drop with each chunk. It went up to 14 GB, then back down toward 0, as each chunk was written. Also, print statements no longer need a flush to be seen. On average, CX was 47% faster than PD at reading SQL into a df and consumed 33% less memory. Huge thank you, this is a great program. Photos show CX and PD performance. Notice that no memory is held after CX completed; before, it was 43 GB.
-
I have a for loop through 4 tables, and each time I delete the df and run garbage collection. When using pandas read_sql this worked fine and resource usage stayed low; with ConnectorX, the resources are being held. The photo shows Python consuming 43 GB to process all 4 tables. You can see from the print statements that "creating dataframe" is where ConnectorX fetches the data. I admit it's impressive that it took 8 minutes to get 3M rows using 6 cores, but it came at a cost: the memory is not being released.
Is there a call to close the connection, or to let del and gc.collect() do their job? This worked with pandas read_sql but not with ConnectorX. Another (small) issue is that print statements are held until completion and I am forced to use sys.stdout.flush() to see them as they occur; this was not the case with pandas read_sql. A simplified version of my loop is below.
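This is roughly the loop shape (connection string, table names, and the output step are placeholders); note that `print(..., flush=True)` is the built-in way to avoid the separate `sys.stdout.flush()` call:

```python
import gc

import connectorx as cx

CONN = "mssql://user:pass@host:1433/db"  # placeholder connection string
TABLES = ["t1", "t2", "t3", "t4"]        # the 4 tables in my loop

for table in TABLES:
    print(f"creating dataframe for {table}", flush=True)  # flush so it shows immediately
    df = cx.read_sql(CONN, f"SELECT * FROM {table}")
    # ... write df out, e.g. df.to_parquet(f"{table}.parquet") ...
    del df        # drop the only reference to the frame
    gc.collect()  # reclaim Python objects; the allocator may still hold the pages
```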
Main question: why is the memory held, and what can I do about it?
Thanks
Abel