I will be manipulating several large (8,000+ rows) CSV files to analyse and clean the data. First, I'm getting familiar with DataFrame using small CSV files -- for performance, development and testing purposes.
I am getting some unexpected behaviour. After spending many hours on this, I'm hoping you could give me some suggestions on how to resolve the problem.
| df workDir VvDir inFileRef outFileRef cellVal cellValNew |
workDir := FileSystem disk workingDirectory.
((VvDir := workDir / 'Vv') isDirectory)
    ifFalse: [ self halt. ].
inFileRef := (workDir / 'Vv' / 'test.csv') asFileReference.
df := DataFrame readFromCsv: inFileRef withSeparator: (Character tab).
Transcript open; clear.
"Evaluate aBlock on the column with columnName and replace column with the result."
df column: #pw transform: [ :x |
    x keys do: [ :eachkey | "| cellVal cellValNew |"
        cellVal := x at: eachkey.
        Transcript cr;
            show: ' key: ', (eachkey asString);
            show: ' cellVal: ', (cellVal class asString),
                ' ', (cellVal asString), ' | '.
        cellValNew := cellVal asString.
        Transcript show: ' to: ', (cellValNew class asString);
            show: ' ', (cellValNew).
        " ((eachkey) > 8) ifTrue:[ self halt ]. " "--- HALT_2 ---"
        x at: eachkey put: (cellValNew).
    ]
].
" self halt. " "--- HALT_3 ---"
outFileRef := (workDir / 'Vv' / 'testOut.csv') asFileReference.
df writeToCsv: outFileRef.
My small tab-delimited TEST.CSV text file contains a header row plus 9 data rows:
id   pw    Name         phone      balance
1a   111   1a Company   111-1111   0.00
2b   222   2b Company   222-2222   50.22
3c   333   3c Company   333-3333   33.33
4d   444   4d Company   444-4444   0.00
5e   555   5e Company   555-5555   500.00
6f   666   6f Company   666-6666   600
7g   777   7g Company   777-7777   7.00
8h   888   8h Company   888-8888   8.88
9i   999   9i Company   999-9999   9.99
Transcript output is as follows:
key: 1 cellVal: SmallInteger 111 | to: ByteString 111
key: 2 cellVal: SmallInteger 222 | to: ByteString 222
key: 3 cellVal: SmallInteger 333 | to: ByteString 333
key: 4 cellVal: SmallInteger 444 | to: ByteString 444
key: 5 cellVal: SmallInteger 555 | to: ByteString 555
key: 6 cellVal: SmallInteger 666 | to: ByteString 666
key: 7 cellVal: SmallInteger 777 | to: ByteString 777
key: 8 cellVal: SmallInteger 888 | to: ByteString 888
key: 9 cellVal: SmallInteger 999 | to: ByteString 999
At the line marked "--- HALT_2 ---", all of the #pw fields (column data) have been converted to String.
But by the line marked "--- HALT_3 ---", all of the data in the #pw column has reverted to Integer.
I'm hoping you are able to provide some insights, suggestions or a solution, as I've spent many hours on this problem.
When the data is written out to testOut.csv, other fields are converted to values that look like times. Once the problem described above is resolved, I assume I can apply a similar #transform: [ aBlock ] to each column to convert all my data to String objects.
I will be manipulating all data as String objects, sorting, identifying duplicate key values, finding and fixing invalid field data... I hope DataFrame can be a foundation for this project.
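One detail that may be related, offered as a guess and not a diagnosis: the method comment quoted in the snippet says the column is replaced with the result of the block, and the block there ends with `x keys do: [ ... ]`. In Pharo, `do:` answers its receiver, not the values computed inside the loop, whereas `collect:` answers a new collection of the block results. The sketch below uses only a plain Array to show that difference. It assumes nothing about DataFrame internals, and whether `column:transform:` really uses the block's answer this way has not been checked against the DataFrame source.

```smalltalk
| values viaDo viaCollect |
values := #(111 222 333).

"do: evaluates the block for each element but answers the receiver itself,
 so the strings created inside the block are thrown away."
viaDo := values do: [ :each | each asString ].

"collect: answers a new collection holding the result of each block evaluation."
viaCollect := values collect: [ :each | each asString ].

Transcript
    show: viaDo printString; cr;
    show: viaCollect printString; cr.
```

If the assumption holds, having the transform block answer something like `x collect: [ :each | each asString ]` would be the first thing to try. This is untested against DataFrame.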
(from the email by Peter Odehnal)