Failed to deserialize category 'entity' with ValueError: No closing quotation #570

0ut0fcontrol · 2024-05-23T12:31:27Z

In 1n5m.cif, a pdbx_description value in the entity category contains embedded quote characters, and parsing it fails with "No closing quotation":
'2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'

There are a total of 3 problematic examples in the wwPDB database: 1n5m, 6szp, 1tsl

I can fix it, but I'm not sure about the best way to do so. Do you have any suggestions?

Code and traceback:

# test_example.py
# biotite=0.40.0
import biotite.structure.io.pdbx as pdbx
from biotite.database import rcsb
import biotite.structure as struc
import biotite.structure.io as strucio

# There are a total of 3 problematic examples in the wwPDB database.
pdb_id = "1n5m"
# pdb_id = "6szp"
# pdb_id = "1tsl"

try:
    cif_file = pdbx.CIFFile.read(f"/tmp/{pdb_id}.cif")
except FileNotFoundError:
    # Download the file on first use, then read the local copy
    rcsb.fetch(pdb_id, "cif", target_path="/tmp")
    cif_file = pdbx.CIFFile.read(f"/tmp/{pdb_id}.cif")

entity = cif_file.block["entity"]

$ python test_example.py 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 658, in __getitem__
    category = CIFCategory.deserialize(category, expect_whitespace)
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 382, in deserialize
    category_dict = CIFCategory._deserialize_looped(
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 486, in _deserialize_looped
    values = shlex.split(data_line)
  File "/usr/lib/python3.9/shlex.py", line 315, in split
    return list(lex)
  File "/usr/lib/python3.9/shlex.py", line 300, in __next__
    token = self.get_token()
  File "/usr/lib/python3.9/shlex.py", line 109, in get_token
    raw = self.read_token()
  File "/usr/lib/python3.9/shlex.py", line 191, in read_token
    raise ValueError("No closing quotation")
ValueError: No closing quotation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/test_example.py", line 18, in <module>
    entity = cif_file.block["entity"]
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 660, in __getitem__
    raise DeserializationError(
biotite.structure.io.pdbx.DeserializationError: Failed to deserialize category 'entity'


$ grep '^_entity.id' -A 20 /tmp/1n5m.cif 
_entity.id 
_entity.type 
_entity.src_method 
_entity.pdbx_description 
_entity.formula_weight 
_entity.pdbx_number_of_molecules 
_entity.pdbx_ec 
_entity.pdbx_mutation 
_entity.pdbx_fragment 
_entity.details 
1 polymer     man acetylcholinesterase                                                     59592.309 2   3.1.1.7 ? 
'CATALYTIC DOMAIN' ? 
2 branched    man 'alpha-L-fucopyranose-(1-6)-2-acetamido-2-deoxy-beta-D-glucopyranose'    367.349   1   ?       ? ? ? 
3 non-polymer syn 'IODIDE ION'                                                             126.904   10  ?       ? ? ? 
4 non-polymer syn 'HEXAETHYLENE GLYCOL'                                                    282.331   1   ?       ? ? ? 
5 non-polymer syn '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' 510.816   1   ?       ? ? ? 
6 non-polymer man 2-acetamido-2-deoxy-beta-D-glucopyranose                                 221.208   1   ?       ? ? ? 
7 non-polymer syn 'CARBONATE ION'                                                          60.009    1   ?       ? ? ? 
8 non-polymer syn 'TETRAETHYLENE GLYCOL'                                                   194.226   1   ?       ? ? ? 
9 water       nat water                                                                    18.015    551 ?       ? ? ? 
#
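
The failure can also be reproduced without downloading anything. The following is a minimal sketch that feeds the offending row from the grep output directly to shlex.split(), the call shown at the bottom of the first traceback. The variable names are illustrative and not part of biotite.

```python
import shlex

# Row 5 of the 'entity' loop in 1n5m.cif, copied from the grep output above.
entity_row = (
    "5 non-polymer syn "
    "'2,2',2\"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' "
    "510.816   1   ?       ? ? ?"
)

try:
    shlex.split(entity_row)
except ValueError as err:
    # shlex treats the embedded " as the start of a new quoted section
    # and never finds its partner before the end of the line.
    error_message = str(err)
else:
    error_message = None

print(error_message)
```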

padix-key · 2024-05-23T13:53:56Z

At first I thought the problem was a quote-escaping error on the RCSB side. However, I then looked into the CIF specification:

Matching single or double quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.

Because data values are invariably separated from other tokens in the file by white space, such a quote-delimited character string may contain instances of the character used to delimit the string provided they are not followed by white space. For example, the data item _example 'a dog's life' is legal; the data value is a dog's life.

Note that constructs such as 'an embedded \' quote' do not behave as in the case of many current programming languages; i.e. the backslash character in this context does not escape the special meaning of the delimiter character. A backslash preceding the apostrophe or double-quote characters does, however, have special meaning in the context of accented characters (paragraph 32 of the document Common semantic features) provided there is no white space immediately following the apostrophe or double-quote character.
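
The closing rule quoted above can be sketched with a regular expression: a quoted value only ends at a matching quote character that is followed by whitespace or the end of the line. This is a rough illustration, not the fix that was eventually merged; split_cif_line and _TOKEN are hypothetical names, and the sketch ignores comments and multi-line text fields.

```python
import re

# Hypothetical tokenizer for a single CIF data line.
_TOKEN = re.compile(
    r"""
      '(?P<single>.*?)'(?=\s|$)    # single-quoted value
    | "(?P<double>.*?)"(?=\s|$)    # double-quoted value
    | (?P<bare>\S+)                # unquoted value
    """,
    re.VERBOSE,
)


def split_cif_line(line):
    """Split one CIF data line into its values, removing delimiting quotes."""
    # Exactly one named group takes part in each match; lastgroup names it.
    return [m.group(m.lastgroup) for m in _TOKEN.finditer(line)]


# The example from the specification.
spec_example = "_example 'a dog's life'"

# Row 5 of the 'entity' loop in 1n5m.cif.
entity_row = (
    "5 non-polymer syn "
    "'2,2',2\"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' "
    "510.816   1   ?       ? ? ?"
)

print(split_cif_line(spec_example))
print(split_cif_line(entity_row))
```

shlex.split(), by contrast, follows POSIX shell rules, where the first matching quote character closes the quoted section regardless of what follows it.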

This means the quote handling based on the shlex module in biotite.structure.io.pdbx is wrong.
This means that the following lines

biotite/src/biotite/structure/io/pdbx/cif.py

Line 450 in 2e053cb

parts = shlex.split(line)

biotite/src/biotite/structure/io/pdbx/cif.py

Lines 476 to 494 in 2e053cb

    
# Rows may be split over multiple lines -> do not rely on
# row-line-alignment at all and simply cycle through columns
column_names = itertools.cycle(column_names)
for data_line in data_lines:
    # If whitespace is expected in quote protected values,
    # use standard shlex split
    # Otherwise use much more faster whitespace split
    # and quote removal if applicable,
    # bypassing the slow shlex module
    if expect_whitespace:
        values = shlex.split(data_line)
    else:
        values = data_line.split()
        for k in range(len(values)):
            # Remove quotes
            if (values[k][0] == '"' and values[k][-1] == '"') or (
                values[k][0] == "'" and values[k][-1] == "'"
            ):
                values[k] = values[k][1:-1]

biotite/src/biotite/structure/io/pdbx/cif.py

Line 974 in 2e053cb

processed_lines[out_i] = shlex.quote(multi_line_str)

biotite/src/biotite/structure/io/pdbx/cif.py

Lines 974 to 978 in 2e053cb

    
    processed_lines[out_i] = shlex.quote(multi_line_str)
    out_i += 1
else:
    # Append multiline string to previous line
    processed_lines[out_i - 1] += " " + shlex.quote(multi_line_str)

need to be replaced. For splitting, re.split() should work instead of shlex.split(), provided you find a robust pattern. shlex.quote() can probably be replaced by

biotite/src/biotite/structure/io/pdbx/cif.py

Lines 995 to 1012 in 2e053cb

    
def _quote(value):
    """
    A less secure but much quicker version of ``shlex.quote()``.
    """
    if len(value) == 0:
        return "''"
    elif value[0] == "_":
        return "'" + value + "'"
    elif "'" in value:
        return '"' + value + '"'
    elif '"' in value:
        return "'" + value + "'"
    elif " " in value:
        return "'" + value + "'"
    elif "\t" in value:
        return "'" + value + "'"
    else:
        return value
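
For the writing side, one possible variation on the function above is to choose the delimiter by the same rule the specification uses for reading: a quote character is safe as a delimiter as long as the value never contains it followed by whitespace. The sketch below is only an assumption about how this could look. quote_cif_value is a hypothetical name, and the set of leading characters that force quoting is a conservative reading of the CIF syntax rather than something taken from biotite.

```python
import re


def quote_cif_value(value):
    """Quote a single-line CIF value so that it can be read back unambiguously."""
    if "\n" in value or "\r" in value:
        # Quoted strings must not extend over more than one line.
        raise ValueError("Multi-line values need a semicolon-delimited text field")
    needs_quotes = (
        value == ""
        # Assumption: these leading characters are reserved for unquoted values.
        or value[0] in "_#$'\"[];"
        or re.search(r"\s", value) is not None
    )
    if not needs_quotes:
        return value
    for delimiter in ("'", '"'):
        # The delimiter is safe if it never appears followed by whitespace.
        if re.search(re.escape(delimiter) + r"\s", value) is None:
            return delimiter + value + delimiter
    raise ValueError("Value cannot be quoted on a single line")


print(quote_cif_value("IODIDE ION"))
print(quote_cif_value("a dog's life"))
print(quote_cif_value("it' s"))
```

Under this rule a value such as a dog's life keeps single quotes, since its apostrophe is followed by a letter, while a value with an apostrophe directly before a space falls back to double quotes.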

padix-key · 2024-05-23T13:54:46Z

Should I assign the issue to you then?

0ut0fcontrol · 2024-05-24T02:47:19Z

Sure, please assign it to me.

padix-key added the bug label May 23, 2024

padix-key assigned 0ut0fcontrol May 24, 2024

0ut0fcontrol added a commit to 0ut0fcontrol/biotite that referenced this issue Jun 30, 2024

fix biotite-dev#570, add embedded quote example of 1n5m.cif into test…

01039d3

…_pdbx.py

0ut0fcontrol added a commit to 0ut0fcontrol/biotite that referenced this issue Jun 30, 2024

fix biotite-dev#570, add embedded quote example of 1n5m.cif into test…

b1d6fd1

…_pdbx.py

0ut0fcontrol added a commit to 0ut0fcontrol/biotite that referenced this issue Jun 30, 2024

fix biotite-dev#570, add embedded quote example of 1n5m.cif into test…

082550b

…_pdbx.py

0ut0fcontrol mentioned this issue Jun 30, 2024

Handle embedded quote in mmcif #619

Merged

padix-key closed this as completed in #619 Jul 14, 2024

padix-key closed this as completed in 0404084 Jul 14, 2024

padix-key mentioned this issue Aug 29, 2024

Bug: Deserialization of some CIF blocks #648

Closed
