Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Accept any valid delimiters/EOF markers in TCK files #720

Merged
merged 14 commits into from
Feb 4, 2019

Conversation

soichih
Copy link
Contributor

@soichih soichih commented Jan 27, 2019

According to mrtrix/tck specification.

The binary track data themselves are stored as triplets of floating-point values (at this stage in 32 bit floating-point format), one per vertex along the track. Tracks are separated using a triplet of NaN values. Finally, a triplet of Inf values is used to indicate the end of the file.

https://mrtrix.readthedocs.io/en/latest/getting_started/image_data.html#tracks-file-format-tck

The current implementation of streamlines/tck.py / _read() expects that these delimiters are in specific binary format; whatever numpy.nan and numpy.inf happens to be). However, according to IEEE standard, NaN/Inf can have many different binary representations.

EEE 754 NaNs are represented with the exponent field filled with ones (like infinity values), and some non-zero number in the significand (to make them distinct from infinity values); 

https://en.wikipedia.org/wiki/NaN

Therefore, streamlines generated by other programming languages (such as Matlab) can store these NaN and Inf with a slightly different binary format, and nibabel.streamlines.tck fails to detect these delimieters which results in the following error

brainlife/app-convert-tck-to-trk#1 (comment)

I have re-implemented _read() so that it will actually look at each values to see if they are NaN values to determine if the values are stream line delimiters. My code runs around 30% slower (4.7sec v.s. 3.6sec for loading 500k streamlines) but I believe proper checking of NaN/Inf are unavoidable so that this code can load streamlines generated by other programming languages (or numpy might change the definition for numpy.nan and numpy.inf in the future)

@matthew-brett
Copy link
Member

Thanks for this.

Can you add a test, maybe with a currently failing small trk file?

Can you think of a way of speeding up again? Can we first check what the binary format is, then use that in subsequent reads from the same file?

@soichih
Copy link
Contributor Author

soichih commented Jan 28, 2019

@matthew-brett Yes, I can try adding a unit test. Currently though.. "make test" takes a really long time.. (it's stuck on 99% CPU usage for >30 min when I try it) Is it normal?

Can we first check what the binary format is, then use that in subsequent reads from the same file?

Maybe.. but I think it's a bit risky.. tck file could be generated by concatenating different .tck files from different sources, so depending on the implementation, it could end up with having a mix of different binary formats within a single file (I can think of a case where someone tries to by-pass the same obj. buffer that's loaded from a file and appended it to the new aggregated file)

Using a proper float32 NaN/Inf function is slower, but loading 600K fiber would only take a few seconds so I wouldn't worry too much about optimizing this. I think it's best to do this in the right way (and simpler); looking for a specific binary pattern for NaN/Inf or any floating decimal values for that matter is a bit iffy.

I've attached a sample .tck file that fails to load with the current implementation.

matlab.zip

Copy link
Member

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments that may explain at least some of the slowdown. I think you're accidentally coercing to float64, and doing many more appends than necessary.

nibabel/streamlines/tck.py Outdated Show resolved Hide resolved
nibabel/streamlines/tck.py Outdated Show resolved Hide resolved
@soichih
Copy link
Contributor Author

soichih commented Jan 28, 2019

@effigies Thanks for the suggestion. I've applied a few of your suggestions and now my code runs about 5% faster compared to the current production version (instead of slower). I assumed that numpy.append with an empty array is a no-op, but I guess I was wrong.

I've kept most of my original code, as your version doesn't handle the EOF delimiter checking correctly. I think there is a way to make it work.. but I thought mine is a bit simpler. I still can't get the unittest to work, but I've tested my code with various input files I had and made sure that it loads the correct number of fibers/streams.

@matthew-brett
Copy link
Member

For some reason I can't get the Travis-CI page with the test failures - can you describe them?

Can you add a small test file written by Matlab to confirm this fixes the read?

different binary format from the current numpy.nan and numpy.inf
@soichih
Copy link
Contributor Author

soichih commented Jan 30, 2019

@matthew-brett I've added matlab_nan.tck in tests/data. Please let me know if there is anything else I should do.

@effigies
Copy link
Member

The test failures occur because numpy expects a real file object in fromfile, which fails if we have any stream that isn't that. I will try a bytesarray fix...

@coveralls
Copy link

coveralls commented Jan 31, 2019

Coverage Status

Coverage increased (+0.009%) to 91.821% when pulling 486bbb2 on soichih:master into ad6b890 on nipy:master.

@codecov-io
Copy link

codecov-io commented Jan 31, 2019

Codecov Report

Merging #720 into master will increase coverage by <.01%.
The diff coverage is 90.47%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #720      +/-   ##
=========================================
+ Coverage    89.1%   89.1%   +<.01%     
=========================================
  Files          93      93              
  Lines       11468   11469       +1     
  Branches     1991    1990       -1     
=========================================
+ Hits        10218   10220       +2     
+ Misses        911     910       -1     
  Partials      339     339
Impacted Files Coverage Δ
nibabel/cifti2/cifti2.py 96.38% <0%> (ø) ⬆️
nibabel/streamlines/tck.py 99.45% <95%> (+0.54%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad6b890...486bbb2. Read the comment docs.

Copy link
Member

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for tolerating my refactors. One tiny thing I failed to clean up.

@matthew-brett @MarcCote Do one or both of you have time for a quick look? In particular I'd appreciate a check that the logic of the DataError matches the original intent; I tried to replicate the cases that would produce each message, but the structure changed a bit.

nibabel/streamlines/tck.py Outdated Show resolved Hide resolved
@matthew-brett
Copy link
Member

@effigies - sorry - I'm completely overwhelmed with work at the moment, will be until Wednesday or so. @MarcCote - do you mind having a look?

@effigies
Copy link
Member

Thanks for the heads up Matthew.

@MarcCote
Copy link
Contributor

Thanks for pinging me. I'm not explicitly watching Nibabel repo anymore but I'm happy to help whenever I can. I'll have a look tonight.

@effigies effigies changed the title fixed a bug where streamlines/tck can't read .tck generated by matlab FIX: Accept any valid delimiters/EOF markers in TCK files Jan 31, 2019
n_streams += 1
begin = delim + 1

# The rest gets appended to the leftover
Copy link
Contributor

@MarcCote MarcCote Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# The rest gets appended to the leftover
# The rest becomes the new leftover.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you apply this suggestion, as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

Copy link
Contributor

@MarcCote MarcCote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thank you @soichih.

@effigies
Copy link
Member

effigies commented Feb 1, 2019

Thanks, @MarcCote. @soichih If you have a couple minutes to apply the two suggestions, we can go ahead and merge ASAP.

@soichih
Copy link
Contributor Author

soichih commented Feb 4, 2019

@effigies Did I do it right? I believe I've applied @MarcCote 's suggestions.

@effigies
Copy link
Member

effigies commented Feb 4, 2019

Great! Thanks!

@effigies effigies merged commit 4b6ca81 into nipy:master Feb 4, 2019
@effigies effigies added this to the 2.4.0 milestone Mar 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants