Added a FS for any KS-generated parser. #13

KOLANICH · 2018-07-04T12:14:44Z

It's buggy, doesn't work (I don't see there hinary blobs for now), but I still publish it.

KOLANICH · 2019-07-16T15:30:10Z

@GreyCat, @dgelessus, now it seems to be working.

dgelessus · 2019-07-17T12:50:27Z

I've done a bit of testing - it seems to be working for the most part, but I got this exception repeatedly:

Uncaught exception from FUSE operation read, returning errno.EINVAL.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.7/site-packages/fuse.py", line 734, in _wrapper
    return func(*args, **kwargs) or 0
  File "/home/user/.local/lib/python3.7/site-packages/fuse.py", line 846, in read
    offset, fh)
  File "/home/user/.local/lib/python3.7/site-packages/fuse.py", line 1251, in __call__
    ret = getattr(self, op)(path, *args)
  File "/home/user/work/Kaitai Struct/kaitai_fs/kaitaifs/fs/AnyKSFS.py", line 317, in read
    data = self.get_file_body(obj, offset, length)
  File "/home/user/work/Kaitai Struct/kaitai_fs/kaitaifs/fs/AnyKSFS.py", line 321, in get_file_body
    return obj[offset:offset + length]
TypeError: 'SignatureType' object is not subscriptable

I was testing with the archive/phar_without_stub.ksy spec. signature_type is an enum.

Other than that it seems to be working well. Though I don't quite understand the logic behind how the fields are exposed as files - most fields are grouped together into a single __dict__.json file, but some fields become separate .bin files. Sometimes the __dict__.json actually contains a JSON array rather than an object (which is odd, because in Python __dict__ is always a dict and not a list).

KOLANICH · 2019-07-17T13:48:16Z

Though I don't quite understand the logic behind how the fields are exposed as files - most fields are grouped together into a single __dict__.json

In fact it is not necessarily json, it can also be msgpack or bson or yaml, depending on filter chain. Though I have not yet implemented configuring it from CLI. Also it is possible to represent objects of other types as different files depending on their types by adding a filter. I have not yet decided what to do with strings, to put them into a dict or to put them into txt files, probably we should act depending on their sizes.

but some fields become separate .bin files.

if property type is subclass of bytes or bytearray it goes there

Sometimes the __dict__.json actually contains a JSON array rather than an object (which is odd, because in Python __dict__ is always a dict and not a list).

Yes, I have not yet implemented different naming for arrays.

I was testing with the archive/phar_without_stub.ksy spec. signature_type is an enum

Good catch. Though I am unsure how to represent it. Probably we should switch to from json to yaml and add enum name as an end line comment.

Alsk I wanted to implement caching. Currently to get sizes of json objects they are generated and then discarded. We should cache the sizes.

dgelessus · 2019-07-17T14:09:27Z

probably we should act depending on their sizes

I don't think this would work well in practice - if you do that, every program that wants to read a string field needs to check both for a separate file and in the __dict__.json. Even for human users it would be confusing to have to check in multiple places.

Alsk I wanted to implement caching. Currently to get sizes of json objects they are generated and then discarded. We should cache the sizes.

It might be a good idea to cache the entire string/bytes representation of the JSON objects, that way reading them is also faster. Though to be honest I don't think performance is a big issue in practice - my "generic FS" implementation (#17) has no caching at all, and I never had any noticeable performance problems with it (even on a 2010 ThinkPad with HDD).

dgelessus

Some general comments:

The code doesn't follow PEP 8 - almost all names are camelCase where they should be snake_case, and spaces are missing in many places (especially in assignments and type annotations).
Some parts of the code are IMHO not very understandable currently, especially the TypeSplitter class hierarchy - it's not obvious which methods/fields are meant to be used externally and which ones are just implementation details. In general there should be more comments and docstrings.
Some of the TypeSplitter subclasses are meant as alternatives to each other and cannot be used together. It would be helpful to document this somewhere.
A few third-party libraries (e. g. ruamel.yaml, msgpack) are used even though they are not listed as dependencies in setup.cfg.

dgelessus · 2019-07-17T12:53:42Z

kaitaifs/__main__.py


 from fuse import FUSE

+from functools import partial
+
+from kaitaifs.parser.rar import Rar


Is the Rar import used anywhere in __main__?

dgelessus · 2019-07-17T12:56:01Z

kaitaifs/__main__.py

+def getParserCtor(moduleName:str):
+    import importlib
+    import ast
+    from pathlib import Path


Any particular reason why the modules are imported inside the function? The usual Python convention is to import everything at top level, unless there's a specific reason not to (like lazy loading for performance reasons).

dgelessus · 2019-07-17T12:58:37Z

kaitaifs/__main__.py

 FILESYSTEMS = {
    'rar': RarFS,
    'quake_pak': QuakePakFS,
    'heroes_of_might_and_magic_agg': HeroesOfMightAndMagicAggFS,
    'iso9660': Iso9660FS,
 }

+def getParserCtor(moduleName:str):


The function name should be get_parser_ctor, as per PEP 8. Also, get_parser_class might be a better name (from my experience, the term "ctor" is not used often in Python).

dgelessus · 2019-07-17T13:00:10Z

kaitaifs/__main__.py

+    from pathlib import Path
+    try:
+        mod=importlib.import_module("."+moduleName, package="kaitaifs.parser")
+    except:


This should be except ImportError to avoid catching unrelated exceptions.

I'm also not sure if this try block is a good idea. The main use case for a "generic" KaitaiFS is for working with specs that KaitaiFS doesn't support natively, so there's usually no need to use it with the specs included with KaitaiFS.

Also, this could lead to confusing behavior when a spec exists both at top level and under kaitaifs.parser. If you run kaitaifs --any rar ..., the current code would always use kaitaifs.parser.rar, even if you have a rar module on your PYTHONPATH. If this try block is kept, the logic should be reversed so that it first searches for a top-level module, and only looks into kaitaifs.parser if no top-level module is found.

dgelessus · 2019-07-17T13:11:29Z

kaitaifs/__main__.py

+    a=ast.parse(p.read_text())
+    for el in a.body:
+        if isinstance(el, ast.ClassDef):
+            return getattr(mod, el.name)


Using ast here might be a little overkill. Also, do we know for sure that the top-level class will always be the first class in the generated file?

For context, my implementation originally required the user to pass both module name and class name explicitly. Later I changed it so that the class name is automatically extracted from the module name (by taking the last part of the module name and converting it to CamelCase).

dgelessus · 2019-07-17T13:31:07Z

kaitaifs/fs/AnyKSFS.py

+class TypeSplitter:
+    fileExtension = None
+    def __init__(self):
+        raise NotImplementedError()


This is not a good way to implement abstract classes, because it prevents subclasses from calling super().__init__(...). It would be better to use Python's standard abc module here.

dgelessus · 2019-07-17T13:43:01Z

kaitaifs/fs/AnyKSFS.py

+    objectIndexOverhead = 2 # pair of quotes
+
+    def __init__(self):
+        import ruamel.yaml


Are there any practical differences between this module and PyYAML? I'm asking because in my experience PyYAML is more widely used (I haven't heard of ruamel.yaml before).

dgelessus · 2019-07-17T13:50:09Z

kaitaifs/fs/AnyKSFS.py

+        return res
+
+    #def getxattr(self, path, name, position=0):
+    #    return {}


This should be removed instead of commented out.

dgelessus · 2019-07-17T13:51:52Z

kaitaifs/fs/AnyKSFS.py

+        }
+
+        if isinstance(obj, (bytes, bytearray)):
+            res['st_size'] = len(obj)


What happens for other object types? I remember getting odd behavior with some programs when files don't have a correct size.

dgelessus · 2019-07-17T13:56:34Z

kaitaifs/kaitai_tree_fs.py

-        'st_gid': 0,
-        'st_uid': 0,
+        'st_gid': os.getgid() if not isWin else 0,
+        'st_uid': os.geteuid() if not isWin else 0,


Is this condition necessary? I know that there's some kind of Windows port of FUSE, but I would expect it to handle file UIDs/GIDs properly (or at least ignore them if they don't apply on Windows).

If this condition is really needed, it would be good to add a comment explaining why.

KOLANICH · 2019-07-17T14:38:53Z

if you do that, every program that wants to read a string field needs to check both for a separate file and in the dict.json

IMHO it is not programs, but is for humans manually examining files and executing hex editors on different unknown chunks or doing diffing with generic tools. Finished specialized programs should use the code generated by KSC directly.

Even for human users it would be confusing to have to check in multiple places.

Yes. it is just not comfortable to view large or multiline strings in json. We should definitely use YAML by default.

KOLANICH · 2019-07-17T14:40:09Z

Another thing we may want to have is symlinks when multiple instances point to the same object.

… when not installed

KOLANICH changed the title ~~Addedd a FS for any KS-generated parser.~~ Added a FS for any KS-generated parser. Jul 4, 2018

dgelessus mentioned this pull request Feb 28, 2019

Add a generic filesystem that can mount files using any Kaitai spec #17

Open

KOLANICH force-pushed the any branch from 446f7b9 to 6850edf Compare July 16, 2019 14:32

dgelessus reviewed Jul 17, 2019

View reviewed changes

KOLANICH added 4 commits October 2, 2023 15:00

Fixes in setup.cfg

b29a31d

Fixing imports

9913f75

Moved the cli into __main__ to enable usage with python -m kaitaifs…

c9c6c4a

… when not installed

Added a FS to parse arbitrary files

44e9df0

KOLANICH force-pushed the any branch from 6850edf to 44e9df0 Compare October 2, 2023 12:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added a FS for any KS-generated parser. #13

Added a FS for any KS-generated parser. #13

KOLANICH commented Jul 4, 2018

KOLANICH commented Jul 16, 2019 •

edited

Loading

dgelessus commented Jul 17, 2019

KOLANICH commented Jul 17, 2019 •

edited

Loading

dgelessus commented Jul 17, 2019

dgelessus left a comment

dgelessus Jul 17, 2019 •

edited

Loading

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

dgelessus Jul 17, 2019

KOLANICH commented Jul 17, 2019

KOLANICH commented Jul 17, 2019

Added a FS for any KS-generated parser. #13

Are you sure you want to change the base?

Added a FS for any KS-generated parser. #13

Conversation

KOLANICH commented Jul 4, 2018

KOLANICH commented Jul 16, 2019 • edited Loading

dgelessus commented Jul 17, 2019

KOLANICH commented Jul 17, 2019 • edited Loading

dgelessus commented Jul 17, 2019

dgelessus left a comment

Choose a reason for hiding this comment

dgelessus Jul 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KOLANICH commented Jul 17, 2019

KOLANICH commented Jul 17, 2019

KOLANICH commented Jul 16, 2019 •

edited

Loading

KOLANICH commented Jul 17, 2019 •

edited

Loading

dgelessus Jul 17, 2019 •

edited

Loading