Problem parsing a simple CSV file #133

Open
olekscode opened this issue Sep 8, 2020 · 1 comment

Comments

@olekscode
Member

(from the email by Peter Odehnal)

I will be manipulating several large (8,000+ rows) CSV files to analyse and clean the data. First, I'm getting familiar with DataFrame using small CSV files -- for performance, development and testing purposes.

I am getting some unexpected behaviour. After spending many hours on this, I'm hoping you could give me some suggestions on how to resolve the problem.

| df workDir VvDir inFileRef outFileRef  cellVal cellValNew |  
workDir := FileSystem disk workingDirectory.
((VvDir := workDir / 'Vv') isDirectory) 
    ifFalse: [ self halt. ].
inFileRef := (workDir / 'Vv' / 'test.csv') asFileReference.

df := DataFrame readFromCsv: inFileRef withSeparator: (Character tab).
Transcript open; clear.
"Evaluate aBlock on the column with columnName and replace column with the result."
df column: #pw transform: [ :x | 
    x keys do:[ :eachkey | "| cellVal cellValNew |"
        cellVal := x at: eachkey.
        Transcript cr;
          show: ' key: ', (eachkey asString);
          show: ' cellVal: ', (cellVal class asString), 
            ' ', (cellVal asString), ' | '.
          cellValNew := cellVal asString.
        Transcript show: ' to: ', (cellValNew class asString);
          show: ' ', (cellValNew).
        " ((eachkey) > 8) ifTrue:[ self halt ]." "--- HALT_2 ---"
        x at: eachkey put: (cellValNew ).
        ]
    ].
" self halt."   "--- HALT_3 ---"
outFileRef := (workDir / 'Vv' / 'testOut.csv') asFileReference.
df writeToCsv: outFileRef.

My small tab-delimited test.csv file contains a header row plus 9 data rows:

id	pw	Name	phone	balance
1a	111	1a Company	111-1111	0.00
2b	222	2b Company	222-2222	50.22
3c	333	3c Company	333-3333	33.33
4d	444	4d Company	444-4444	0.00
5e	555	5e Company	555-5555	500.00
6f	666	6f Company	666-6666	600
7g	777	7g Company	777-7777	7.00
8h	888	8h Company	888-8888	8.88
9i	999	9i Company	999-9999	9.99

Transcript output is as follows:

 key: 1 cellVal: SmallInteger 111 |  to: ByteString 111
 key: 2 cellVal: SmallInteger 222 |  to: ByteString 222
 key: 3 cellVal: SmallInteger 333 |  to: ByteString 333
 key: 4 cellVal: SmallInteger 444 |  to: ByteString 444
 key: 5 cellVal: SmallInteger 555 |  to: ByteString 555
 key: 6 cellVal: SmallInteger 666 |  to: ByteString 666
 key: 7 cellVal: SmallInteger 777 |  to: ByteString 777
 key: 8 cellVal: SmallInteger 888 |  to: ByteString 888
 key: 9 cellVal: SmallInteger 999 |  to: ByteString 999

At the line marked "--- HALT_2 ---", all of the #pw fields (column data) have been converted to String.
But by the line marked "--- HALT_3 ---", all of the data in the #pw column has reverted to Integer.

I'm hoping you are able to provide some insights, suggestions or a solution, as I've spent many hours on this problem.

When the data is written out to testOut.csv, other fields are converted to what look like time values. Once the problem described above is resolved, I assume I can apply a similar #column:transform: block to the other columns to convert all my data to String objects.

I will be manipulating all data as String objects, sorting, identifying duplicate key values, finding and fixing invalid field data... I hope DataFrame can be a foundation for this project.

@olekscode olekscode self-assigned this Sep 8, 2020
@olekscode olekscode added the bug label Sep 8, 2020
@olekscode
Member Author

Related to #117

@olekscode olekscode added this to the v3.0 milestone Jul 26, 2021
@olekscode olekscode removed their assignment Jul 26, 2021