Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output file names and forbidden characters #425

Closed
chrisfoulon opened this issue Sep 10, 2020 · 13 comments
Closed

Output file names and forbidden characters #425

chrisfoulon opened this issue Sep 10, 2020 · 13 comments

Comments

@chrisfoulon
Copy link

Hi!

I am trying to reproduce the file naming of dcm2niix and I noticed something that I am not sure how to handle it.
When I try the format %d_%t%s so SeriesDescription, Time and InstanceNumber I have 2 different behaviours.
For example, 2 files where the SeriesDescription contains forbidden characters, as explained in https://github.com/rordenlab/dcm2niix/blob/master/FILENAMING.md:
'COR T2* GRE' gives me 'COR_T2_GRE_20160812173101_6'
But:
'MPR Ob_Ax_I -> S_Min IP_sp:10.0_th:10.0' gives me 'MPR_Ob_Ax_I_-__S_Min_IP_sp_10.0_th_10.0_20150805144117_900'

So, I don't know why '>' adds an underscore in the second case but not the other ':' '*' ... characters?
Is there a different rule for some of the forbidden characters, or is it not normal behaviour?

Thank you in advance!
Chris.

@neurolabusc
Copy link
Collaborator

@chrisfoulon the forbidden characters are converted to an underscore at the file naming stage. However, the asterisk is deleted from the string earlier (during the header reading stage) - not only will it not be allowed for file names, it will not be allowed in any strings saved to the JSON file, along with a few other characters (comma, space, /, , %, Tab, LineFeed, Vertical Tab, Carriage Return) - all of which are replaced by an underscore ("_") with multiple sequential underscores replaced by a single underscore. This has been the consistent behavior, but does describe a major difference in the BIDS JSON files created by dcm2niix versus dicm2nii. I really do not have a strong opinion about this, and do understand that reporting "T2*" in the series description may be less misleading than "T2" (as a T2* image attempts to capture only the susceptibility artifact of a T2-weighted image). However, it will mean all JSON files created by future versions of dcm2niix will be different from prior versions..

@satra, @xiangruili and @effigies would you have any concerns if future versions of dcm2niix preserved line spaces, commas, slashes and backslashes (converted to double backslash), and asterisk symbols to mimic dicm2nii:

	"SeriesDescription": "COR T2* GRE",
	"SequenceName": "*epfid2d1_64",
	"InstitutionAddress": "Medical Center Dr 5,Columbia,SC,US,29203",

versus the current behavior:

	"SeriesDescription": "COR_T2_GRE",
	"SequenceName": "_epfid2d1_64",
	"InstitutionAddress": "Medical_Center_Dr_5_Columbia_SC_US_29203",

Since dicm2nii is widely used, I suspect there will not be many unintended consequences.

If we include the asterisks, we should also consider other symbols like the comma, slash and backslash. However, I am less sure we should include the control Characters, CR, LF, FF, and ESC allowed by DICOM. In general, it might be a good time to make dcm2niix and dicm2nii consistent.

A full list of dicm2nii behavior:

  • space: allowed
  • ,: allowed
  • %: allowed
  • asterisk: allowed
  • /: allowed
  • ascii9 tab: allowed
  • ascii10 line feed: allowed
  • ascii11 vertical tab: allowed
  • asci13 carriage return: allowed
  • backslash converted to double backslash. Note that this character is only allowed for some DICOM VRs.

@effigies
Copy link

  • space: allowed
  • ,: allowed
  • %: allowed
  • asterisk: allowed
  • /: allowed
  • backslash converted to double backslash.

These all make sense to me. I really can't see any good reason to include '\r', '\n', '\t' or vertical tab, and would drop them.

Alternatively, I would be okay with translating any unprintable or non-space whitespace characters to hex escapes and double-backslashing, like \\x00-\\xFF. This would allow downstream tools to reconstruct the original values if desired.

@satra
Copy link

satra commented Sep 11, 2020

@neurolabusc - i didn't realize dcm2niix was sanitizing info. i think the values should just reflect what is in dicom and should use any standard json encoding to convert to the json file.

@effigies - if someone wanted to put a structured field in patient description, that should be pulled out as was entered on the scanner console without sanitization. is there a specific reason to sanitize anything at the dcm2niix level other than json compatibility?

also is there any mandate on which form of utf encoding to adopt on the bids side? i can see many dicom fields containing arbitrary characters depending on locations on the planet.

@effigies
Copy link

Oh right, JSON will just accept \uABCD in strings. That probably makes the most sense.

@xiangruili
Copy link

xiangruili commented Sep 11, 2020

I agree with @satra that we won't bother to convert anything, but simply keep what it is in dicom, unless it is forbidden by JSON.

dicm2nii.m uses output file names which are also legal variable names for Matlab. The consideration was to save a full dicom header in a Matlab struct (field name has the same requirement as variable name) with the same name as the file name.

@neurolabusc
Copy link
Collaborator

@effigies @satra and @xiangruili it is great to have a consensus! While we are changing this, do you all have opinions on how to handle non-English characters for filenames. Consider the name "Søren Müller‬":

  1. DICOM will store this string using ISO 8859, so is stored 0x53F8
  2. JSON and Unix file systems use UTF-8, so is stored 0x53C3B8
  3. Windows filenames require UCS-2/UTF-16, so is stored 0x005300F8

In my experience, many tools have issues with filenames/paths that include non-English characters. This may have changed over the last few years, but I recall large projects like Matlab and Slicer 3D both having issues with this. I suspect many smaller projects will also have issues.

Currently, dcm2niix will translate "Søren Müller‬" to read "Soren Muller‬" in the JSON and "Soren_Muller‬" for the filename (preserving the string length). dicm2nii directly copies the byte values, so the JSON reports "S¯ren M¸ller" and the filename is "S_ren_M_ller". My own sense is we want to stick with English characters, as there may be a lot of unintended consequences if other characters are used in filenames. However, I wonder if some will advocate for a better translation, e.g. "Soren Mueller‬".

@effigies
Copy link

I don't really have a strong opinion. I do generally think that it's each tool's job to address itself to standards, and not to implement sanitization to accommodate other tools' failures. It seems a shame to spend time implementing a hodge podge of heuristic conversions to ASCII, when this is part of what Unicode was designed to put an end to.

I guess I would say that you should either do nothing, or do something simple. It's not your job to maintain an elaborate normalization scheme.

@satra
Copy link

satra commented Sep 11, 2020

@neurolabusc - i would make a distinction between filenames and json files. json files can encode a lot more things than filenames in different OS'es can. so json files should be able to directly save anything in dicom which can be sanitized for filename compliance across OSes. but the value of the key in the json should just directly reflect the dicom value.

@neurolabusc
Copy link
Collaborator

@effigies I agree. UTF-8 must be rated as one of the most successful hacks in computer science. It provides robust backward compatibility with old data, compact representation for English characters, and the extensibility for other scripts. Unfortunately, both DICOM and the Windows file system predate UTF-8.

@satra - To support the UTF-8 used by JSON we need to convert the 95 written characters that differ with ISO 8859. It is worth noting that the Attribute Specific Character Set (0008,0005) allows one to specify additional non-UTF8 character sets, none of which I have examples of. It is interesting that neither gdcmdump or dcmdump support non-English characters, even though they are led by European developers. Horos does support these features.

@neurolabusc
Copy link
Collaborator

I have submitted a path that converts ISO8859 to UTF8 for the JSON (e.g. Søren Müller‬ -> Søren Müller‬ )‬, and applies a simple ISO8859 filter to the filename (e.g. Søren Müller‬ -> Soren Muller)‬. Due to the changes in the JSON, this will cause the dcm_qa_* repositories to report differences. It would be great if those with large datasets can check for unintended consequences (@captainnova). It is about time for a new stable release, so hopefully this modification has no unintended consequences.

Unix users can test the latest developmental commit:

git clone --branch development https://github.com/rordenlab/dcm2niix.git
cd dcm2niix/console
make
./dcm2niix

Windows users should be able to get a compiled version from AppVeyor.

@captainnova
Copy link
Collaborator

It would be great if those with large datasets can check for unintended consequences (@captainnova).

I look forward to testing it, but don't think we have much non-ASCII in our data, which is mostly from countries where keyboards tend to make it difficult to enter non-ASCII characters.

I think using UTF8 in the BIDS file and simple ASCIIization of the filenames is the right approach. I know it's a chicken-and-egg problem, but because unicode would break much of our existing downstream processing, we tend to manually convert fields to restricted ASCII when the need arises. At this moment a colon in a PatientID is causing heartburn because colons are not allowed in qsub job names. (Apparently Siemens scanners fill it in with the time if it is left blank, as often happens with volunteer/phantom scans.)

@captainnova
Copy link
Collaborator

It works fine with the test suite I have gathered, and minimized alteration of values in the .json matches my expectation better. That said, the test suite I used does not include any nonASCII PatientNames or IDs.

@neurolabusc
Copy link
Collaborator

OK, I also had several European users test the new version. New behavior will be noted in the release notes for the upcoming stable release.

neurolabusc added a commit to neurolabusc/dcm_qa that referenced this issue Oct 30, 2020
neurolabusc added a commit that referenced this issue Feb 9, 2021
…ation and Windows function GetOpenFileName will truncate (#425)
yarikoptic added a commit to neurodebian/dcm2niix that referenced this issue Apr 6, 2021
* tag 'v1.0.20210317': (23 commits)
  CI: Travis updates submodules always from remote.
  Update dcm_qa_nih submodule.
  Remove diagnostic message
  help should describe accession number (rordenlab#496)
  Describe and provide kludge for mangled Canon DICOMs (rordenlab#495)
  Update Philips notes (rordenlab#493)
  Update dcm_qa submodule.
  Support Canon Enhanced DICOM (rordenlab#491)
  Use first recognized manufacturer tag (rordenlab#487)
  If mosaics where ANY volume is non-planar, save ALL volumes as 3D(rordenlab#481)
  Add notes
  Kludge to separate vNav files as 3D (rordenlab#481)
  Tell user to override merging (-m o) to separate coils, specific with fmrib usage that disrupts Siemens DCE processing (rordenlab#187)
  Forbid semicolon in filenames as Linux uses this for command concatenation and Windows function GetOpenFileName will truncate (rordenlab#425)
  report totalReadoutTime once
  EstimatedEffectiveEchoSpacing -> EffectiveEchoSpacing (rordenlab#480)
  GE Round factor for Total Readout Time
  Update explicit naming of DICOMs (rordenlab#252), GE file naming changes (rordenlab#476)
  Update issue templates
  GE TotalReadouTime, BEP009 fixes (rordenlab#476)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants