Failed to deserialize category 'entity' with ValueError: No closing quotation #570

Closed
0ut0fcontrol opened this issue May 23, 2024 · 3 comments · Fixed by #619

Comments

@0ut0fcontrol
Contributor

0ut0fcontrol commented May 23, 2024

In 1n5m.cif, parsing the pdbx_description value in the entity category fails with "No closing quotation":
'2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'

There are a total of 3 problematic examples in the wwPDB database: 1n5m, 6szp, 1tsl

I can fix it, but I'm not sure about the best way to do so. Do you have any suggestions?

Code and traceback

# test_example.py
# biotite=0.40.0
import biotite.structure.io.pdbx as pdbx
from biotite.database import rcsb
import biotite.structure as struc
import biotite.structure.io as strucio

# There are a total of 3 problematic examples in the wwPDB database.
pdb_id = "1n5m"
# pdb_id = "6szp"
# pdb_id = "1tsl"

try:
    cif_file = pdbx.CIFFile.read(f"/tmp/{pdb_id}.cif")
except FileNotFoundError:
    rcsb.fetch(pdb_id, "cif", target_path="/tmp")
    cif_file = pdbx.CIFFile.read(f"/tmp/{pdb_id}.cif")

entity = cif_file.block["entity"]
$ python test_example.py 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 658, in __getitem__
    category = CIFCategory.deserialize(category, expect_whitespace)
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 382, in deserialize
    category_dict = CIFCategory._deserialize_looped(
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 486, in _deserialize_looped
    values = shlex.split(data_line)
  File "/usr/lib/python3.9/shlex.py", line 315, in split
    return list(lex)
  File "/usr/lib/python3.9/shlex.py", line 300, in __next__
    token = self.get_token()
  File "/usr/lib/python3.9/shlex.py", line 109, in get_token
    raw = self.read_token()
  File "/usr/lib/python3.9/shlex.py", line 191, in read_token
    raise ValueError("No closing quotation")
ValueError: No closing quotation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/test_example.py", line 18, in <module>
    entity = cif_file.block["entity"]
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 660, in __getitem__
    raise DeserializationError(
biotite.structure.io.pdbx.DeserializationError: Failed to deserialize category 'entity'


$ grep '^_entity.id' -A 20 /tmp/1n5m.cif 
_entity.id 
_entity.type 
_entity.src_method 
_entity.pdbx_description 
_entity.formula_weight 
_entity.pdbx_number_of_molecules 
_entity.pdbx_ec 
_entity.pdbx_mutation 
_entity.pdbx_fragment 
_entity.details 
1 polymer     man acetylcholinesterase                                                     59592.309 2   3.1.1.7 ? 
'CATALYTIC DOMAIN' ? 
2 branched    man 'alpha-L-fucopyranose-(1-6)-2-acetamido-2-deoxy-beta-D-glucopyranose'    367.349   1   ?       ? ? ? 
3 non-polymer syn 'IODIDE ION'                                                             126.904   10  ?       ? ? ? 
4 non-polymer syn 'HEXAETHYLENE GLYCOL'                                                    282.331   1   ?       ? ? ? 
5 non-polymer syn '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' 510.816   1   ?       ? ? ? 
6 non-polymer man 2-acetamido-2-deoxy-beta-D-glucopyranose                                 221.208   1   ?       ? ? ? 
7 non-polymer syn 'CARBONATE ION'                                                          60.009    1   ?       ? ? ? 
8 non-polymer syn 'TETRAETHYLENE GLYCOL'                                                   194.226   1   ?       ? ? ? 
9 water       nat water                                                                    18.015    551 ?       ? ? ? 
#
@padix-key
Member

padix-key commented May 23, 2024

At first I thought the problem was a quote-escaping error on the RCSB side. However, I then looked into the CIF specification:

  1. Matching single or double quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.
  2. Because data values are invariably separated from other tokens in the file by white space, such a quote-delimited character string may contain instances of the character used to delimit the string provided they are not followed by white space. For example, the data item _example 'a dog's life' is legal; the data value is a dog's life.
  3. Note that constructs such as 'an embedded \' quote' do not behave as in the case of many current programming languages; i.e. the backslash character in this context does not escape the special meaning of the delimiter character. A backslash preceding the apostrophe or double-quote characters does, however, have special meaning in the context of accented characters (paragraph 32 of the document Common semantic features) provided there is no white space immediately following the apostrophe or double-quote character.
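To make the mismatch concrete, here is a small check of how Python's `shlex.split()` treats values that are legal under the rules quoted above. `shlex` follows POSIX shell quoting, where a quote character closes the string regardless of what follows it. The helper name `shlex_accepts` is made up for this illustration and is not part of biotite.

```python
import shlex


def shlex_accepts(line):
    """Return True if shlex.split() can tokenize the line, False otherwise."""
    try:
        shlex.split(line)
    except ValueError:
        # shlex raises ValueError("No closing quotation") on unbalanced quotes
        return False
    return True


# Legal CIF according to the specification, but rejected by shell-style quoting
spec_example = "_example 'a dog's life'"
# The pdbx_description value from 1n5m that triggers the error
description = "'2,2',2\"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'"

for line in (spec_example, description, "'IODIDE ION'"):
    print(shlex_accepts(line), line)
```

Only the last value, which has no embedded quote characters, is accepted by `shlex`.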

This means the quote handling based on the shlex module in biotite.structure.io.pdbx is wrong.
This means that the following code locations

parts = shlex.split(line)

# Rows may be split over multiple lines -> do not rely on
# row-line-alignment at all and simply cycle through columns
column_names = itertools.cycle(column_names)
for data_line in data_lines:
    # If whitespace is expected in quote protected values,
    # use standard shlex split
    # Otherwise use much more faster whitespace split
    # and quote removal if applicable,
    # bypassing the slow shlex module
    if expect_whitespace:
        values = shlex.split(data_line)
    else:
        values = data_line.split()
        for k in range(len(values)):
            # Remove quotes
            if (values[k][0] == '"' and values[k][-1] == '"') or (
                values[k][0] == "'" and values[k][-1] == "'"
            ):
                values[k] = values[k][1:-1]

processed_lines[out_i] = shlex.quote(multi_line_str)

    processed_lines[out_i] = shlex.quote(multi_line_str)
    out_i += 1
else:
    # Append multiline string to previous line
    processed_lines[out_i - 1] += " " + shlex.quote(multi_line_str)

need to be replaced. For splitting, re.split() should work in place of shlex.split(), provided a robust pattern can be found. shlex.quote() can probably be replaced by

def _quote(value):
    """
    A less secure but much quicker version of ``shlex.quote()``.
    """
    if len(value) == 0:
        return "''"
    elif value[0] == "_":
        return "'" + value + "'"
    elif "'" in value:
        return '"' + value + '"'
    elif '"' in value:
        return "'" + value + "'"
    elif " " in value:
        return "'" + value + "'"
    elif "\t" in value:
        return "'" + value + "'"
    else:
        return value
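
For the splitting side, one possible pattern is sketched below. It is only an illustration of the rule from the CIF specification (a quote character closes a string only when it is followed by whitespace or the end of the line), not the fix that was eventually merged. It uses `re.finditer()` rather than `re.split()`, and the names `_TOKEN` and `split_cif_line` are hypothetical. The sketch assumes a single data line and ignores comments and semicolon-delimited multi-line values.

```python
import re

# One alternative per token type; quoted alternatives are tried first.
# Because the unquoted alternative consumes a whole run of non-whitespace,
# a quoted alternative can only start at the beginning of a token.
_TOKEN = re.compile(
    r"""
      '(?P<single>.*?)'(?=\s|$)   # single-quoted, closed before whitespace/end
    | "(?P<double>.*?)"(?=\s|$)   # double-quoted, closed before whitespace/end
    | (?P<bare>\S+)               # unquoted run of non-whitespace characters
    """,
    re.VERBOSE,
)


def split_cif_line(line):
    """Split one CIF data line into values, removing delimiting quotes."""
    values = []
    for match in _TOKEN.finditer(line):
        # lastgroup is the name of the alternative that matched
        values.append(match.group(match.lastgroup))
    return values


row = """5 non-polymer syn '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' 510.816   1   ?       ? ? ?"""
for value in split_cif_line(row):
    print(value)
```

With this pattern the problematic row from 1n5m splits into ten values, and the description is returned as a single value with its embedded quote characters intact.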

@padix-key
Member

Should I assign the issue to you then?

@padix-key padix-key added the bug label May 23, 2024
@0ut0fcontrol
Contributor Author

Sure, please assign it to me.
