Possible extended ASCII string decompress problem. #490

kirk-sayre-work · 2019-10-17T21:40:15Z

Affected tool:
olevba

Describe the bug
It looks like olevba may be improperly decompressing the values of some VBA strings that contain extended ASCII characters. Several distinct extended ASCII characters in the VBA source end up as the same byte sequence in the output of olevba.

File/Malware sample to reproduce the bug
An example Word document is available at https://github.com/kirk-sayre-work/talks/blob/master/test.docm

How To Reproduce the bug
Compare the output of olevba on the file with the output of oledump.py test.docm -s A3 -v. The string contents differ between the two tools, and the oledump.py output for the string appears to be the one more likely to be correct.

Version information:

OS: Linux
OS Ubuntu 16, 64 bit
Python version: 2.7 /64 bits
oletools version: olevba 0.55.dev4 on Python 2.7.12

Additional context
There are some maldoc campaigns (currently IcedID) that encode payloads in strings with extended ASCII characters. ViperMonkey fails to properly decode the payloads due to what appear to be issues with the decompression of the extended ASCII strings.

decalage2 · 2021-03-10T21:53:23Z

In this sample, the VBA string with special characters seems to be 8F 88 in hex. This is what I get using oledump, or when copy-pasting from the VBA editor into a text editor. The code page for the sample is 1252, so it's standard Western encoding.
olevba on Python 2 converts the string to EF BF BD CB 86. The proper encoding of 8F 88 in UTF-8 should be C2 8F C2 88.
When olevba parses the VBA code, VBA_Module.code_raw contains the right string with 8F 88. So the issue happens when converting that raw string to unicode using the cp1252 codec, and then converting the unicode to UTF-8.
In fact, the cp1252 codec triggers an exception when converting 8F 88 to Unicode:

>>> s=b'\x8F\x88'
>>> u=s.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 0: character maps to <undefined>

But that exception is hidden by olevba because it uses errors='replace' in VBA_Project.decode_bytes:

>>> u=s.decode('cp1252', errors='replace')
>>> u
u'\ufffd\u02c6'
>>> u.encode('utf8')
'\xef\xbf\xbd\xcb\x86'

And this is why the UTF-8 encoded output is incorrect.
It looks like the cp1252 Python codec considers 8F an undefined character (88 on its own decodes to U+02C6, as shown above), whereas MS Office accepts it...

On Wikipedia about CP1252: "According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too."
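A minimal sketch of that "best fit" behaviour, assuming Python 3 and only the five unused positions listed in the quote. The handler name cp1252_c1 and the function decode_cp1252_windows_like are made up for this example and are not part of oletools. Note that 88 is a defined cp1252 position (U+02C6), so under this mapping the sample string comes out as C2 8F CB 86 in UTF-8.

```python
import codecs

# Positions the quote above lists as unused in cp1252.
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def _c1_fallback(exc):
    """Decode error handler: map an undefined cp1252 byte to the
    code point with the same value (a C1 control code)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if not all(b in UNDEFINED_CP1252 for b in bad):
        raise exc
    return "".join(chr(b) for b in bad), exc.end


# "cp1252_c1" is a hypothetical handler name chosen for this sketch.
codecs.register_error("cp1252_c1", _c1_fallback)


def decode_cp1252_windows_like(data):
    """Decode cp1252 bytes, keeping undefined bytes as C1 controls."""
    return data.decode("cp1252", errors="cp1252_c1")


text = decode_cp1252_windows_like(b"\x8f\x88")
print(text.encode("utf-8").hex())  # c28fcb86
```

Registering an error handler keeps the stock codec untouched, so the same call can be used wherever errors='replace' is passed today.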

decalage2 · 2021-03-11T07:42:41Z

TODO:

At least detect when Unicode conversion fails (without errors="replace"), and issue a warning that the VBA source code contains special characters that cannot be converted to Unicode.
Improve the olevba API so that calling applications can be informed when special characters are present, and can get the raw source code instead of the Unicode/UTF-8 on demand.
Check how Windows converts those special characters from code page 1252 to Unicode, and to UTF-8.
If possible, build a modified version of the cp1252 codec to mimic the behaviour of Windows.
Check if other code pages have the same issue with undefined characters.
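The first item could look roughly like the following, assuming Python 3. This is only a sketch: decode_with_warning and the logger name are hypothetical and do not exist in olevba. It tries a strict decode first, and only falls back to errors="replace" after logging a warning and flagging the result as lossy.

```python
import logging

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("olevba_sketch")


def decode_with_warning(raw, codepage="cp1252"):
    """Return (text, lossless).

    lossless is False when the strict decode failed and the text
    was produced with errors="replace" instead.
    """
    try:
        return raw.decode(codepage), True
    except UnicodeDecodeError as exc:
        log.warning(
            "VBA source contains byte 0x%02x at offset %d that codec %s "
            "cannot decode; the raw bytes should be preferred",
            raw[exc.start], exc.start, codepage,
        )
        return raw.decode(codepage, errors="replace"), False


text, lossless = decode_with_warning(b"\x8f\x88")
print(lossless)  # False
```

A calling application could check the flag and ask for the raw source instead, which is what the second item describes.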

kirk-sayre-work · 2021-03-11T15:35:43Z

There is an additional weird wrinkle to the extended ASCII characters. You have tried copying and pasting from the VBA editor; now try adding a loop that calls Debug.Print on each character in the string with Mid(), copy/paste the debug text, and look at the byte values in that text. In this case the original single-byte value (128...255) is used for each of the extended ASCII characters. So it looks like Office uses Unicode for display in the VBA editor, but under the covers it is still using the single-byte extended ASCII values when the string is accessed in VBA (this is also the behavior I see with VBA string decode loops). Maybe there could be an olevba option for display text values vs. raw/underlying text values?
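To make the display vs. raw distinction concrete, here is a small sketch, assuming Python 3. The function string_views is hypothetical and not an olevba option. It exposes both views of the same string bytes: the decoded text a viewer would show, and the per-byte values a VBA decode loop would presumably work with.

```python
def string_views(raw, codepage="cp1252"):
    """Return both views of a VBA string's bytes.

    display:    text decoded with the code page (lossy for undefined bytes)
    raw_values: the underlying single-byte values, one int per character
    """
    return {
        "display": raw.decode(codepage, errors="replace"),
        "raw_values": list(raw),
    }


views = string_views(b"A\x8f\x88")
print(views["raw_values"])  # [65, 143, 136]
```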

decalage2 self-assigned this Oct 18, 2019

decalage2 added 🐛 bug olevba labels Oct 18, 2019

decalage2 added this to the oletools 0.55 milestone Oct 18, 2019

kirk-sayre-work mentioned this issue Mar 9, 2021

Possible fix for extended ASCII issue (#490) #666

Open
