'mlr cut' is very slow #1527

For a 5 GB CSV, running 'mlr cut' for one column takes 5 minutes, but it takes only 30 seconds with 'xsv select' (https://github.com/BurntSushi/xsv).

Comments
@tooptoop4 how many columns does the 5 GB CSV have? (Also, if possible, can you link to the CSV file itself? No worries if not, but it'd be helpful.) The reason I ask about column count is #1507, which will be in the next Miller release (6.12).
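As an aside, the column-count question can be answered by reading only the header row of the file. A minimal Python sketch, assuming the semicolon-delimited CSV linked in the next comment:

```python
import csv

# Count the columns of the big CSV by reading only its header row.
# The filename and the ';' delimiter are assumptions taken from the
# semicolon-delimited file linked in the next comment.
with open("progetti_esteso_20231231.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f, delimiter=";"))

print(f"{len(header)} columns")
```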
Hi @johnkerl, I have made a test using this big CSV: https://opencoesione.gov.it/media/open_data//progetti_esteso_20231231.zip

- using QSV: 3.988 total
- using Miller: …
- using duckdb v0.10.0 20b1486d11: 1.382 total
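For anyone reproducing this kind of comparison, a small timing harness along the following lines can drive the tools from Python. This is only a sketch: the exact `mlr` and `qsv` invocations (field-separator flags, output handling) are assumptions that may need adjusting for the installed versions.

```python
import subprocess
import time

# Rough timing harness for comparing single-column extraction on the same
# semicolon-delimited CSV. The exact mlr/qsv flags below are assumptions
# and may need adjusting for the tool versions you have installed.
CSV = "progetti_esteso_20231231.csv"
commands = {
    "mlr": ["mlr", "--csv", "--fs", ";", "cut", "-f", "OC_COD_CICLO", CSV],
    "qsv": ["qsv", "select", "OC_COD_CICLO", "-d", ";", CSV],
}

for name, cmd in commands.items():
    start = time.time()
    with open(f"cut_{name}.csv", "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
    print(f"{name}: {time.time() - start:.3f} s")
```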
The duckdb time is not one to take into account: duckdb fails and does not extract all 1977163 rows. I will investigate further.
In duckdb 0.10 there was this bug. It has been closed, but the fix is not yet available in the compiled stable version.

```python
import duckdb
import time  # import the time module

start_time = time.time()
query = """
COPY (select OC_COD_CICLO from read_csv('progetti_esteso_20231231.csv',delim=';')) to 'cut_duckdb.csv'
"""
duckdb.query(query)
end_time = time.time()
execution_time = end_time - start_time
print("Execution time: {:.2f} seconds".format(execution_time))
```
Thanks @tooptoop4 and @aborruso. Also I would note, …
John, I know. I love Miller, it is so convenient, it is brilliant. You were asking for an example, and I included one that I am working with these days.
Indeed @aborruso I should have mentioned -- the example is quite sufficient -- thank you! :)
Hi @johnkerl, using the new release (6.12.0) I have had no errors. The processing time was 12:45.40 total.