A library and command line tool for extracting indicators of compromise (IOCs) from security reports in PDF, HTML, Word, or text format

malicialab/iocsearcher

iocsearcher

iocsearcher is a Python library and command-line tool to extract indicators of compromise (IOCs), also known as cyber observables, from HTML, PDF, Word (.docx), and text files. It can identify both defanged (e.g., URL hxxp://example[DOT]com) and unmodified IOCs (e.g., URL http://example.com).

Installation

pip install iocsearcher

Supported IOCs

iocsearcher can extract the following IOC types:

  • URLs (url)
  • Domain names (fqdn)
  • IP addresses (ip4, ip6)
  • IP subnets (ip4Net)
  • Hashes (md5, sha1, sha256)
  • Email addresses (email)
  • Phone numbers (phoneNumber)
  • Copyright strings (copyright)
  • CVE vulnerability identifiers (cve)
  • Tor v3 addresses (onionAddress)
  • Social network handles (facebookHandle, githubHandle, instagramHandle, linkedinHandle, pinterestHandle, telegramHandle, twitterHandle, whatsappHandle, youtubeHandle, youtubeChannel)
  • Advertisement/analytics identifiers (googleAdsense, googleAnalytics, googleTagManager)
  • Blockchain addresses (bitcoin, bitcoincash, cardano, dashcoin, dogecoin, ethereum, litecoin, monero, ripple, solana, tezos, tronix, zcash)
  • Payment addresses (webmoney)
  • Chinese Internet Content Provider licenses (icp)
  • Bank account numbers (iban)
  • Trademarks (trademark)
  • Universally unique identifiers (uuid)
  • Android package names (packageName)
  • MITRE ATT&CK Technique identifiers (ttp)
  • Spanish NIF identifiers (nif)

Command Line Usage

To find IOCs in a given file just provide the -f (--file) option. By default, found IOCs are printed to stdout, defanged IOCs are rearmed, and IOCs are deduplicated so they only appear once.

iocsearcher -f file.pdf
iocsearcher -f page.html
iocsearcher -f document.docx
iocsearcher -f input.txt

You can use the -o (--output) option to write IOCs to a file instead of stdout:

iocsearcher -f file.pdf -o iocs.txt

By default, all regexps are applied to the input. If you are only interested in specific IOC types, it is more efficient to name them with the -t (--target) option, which can be given multiple times:

iocsearcher -f file.pdf -t url -t email

There is also a shortcut to scan for all blockchain addresses, -t BLOCKCHAIN:

iocsearcher -f file.pdf -t BLOCKCHAIN

You can also search for IOCs in all files in a directory using the -d (--dir) option. IOCs extracted from each file will be placed in their own .iocs file. You can also place all IOCs found across the input files in the same output file by adding the -o (--output) option:

iocsearcher -d directoryWithFiles -o all.iocs

In HTML files, only the readable text is examined (roughly, the text shown by Firefox's Reader View). If you want to scan the whole HTML content, you can use the -r (--raw) option:

iocsearcher -f page.html -r

If you have a file that you want to interpret as text avoiding filetype detection, you can use the -F (--forcetext) option:

iocsearcher -f input.txt -F

You can store the text extracted from a PDF/HTML/Word file using the -T (--text) option, which will produce a .text file for each input file:

iocsearcher -f file.pdf -T

By default, IOCs are deduplicated. To output every occurrence together with its offset instead, use the -v (--verbose) option:

iocsearcher -f file.pdf -v

You can also produce a ranking of IOCs by number of appearances (without deduplication) by using the -C (--count) option:

iocsearcher -f file.pdf -C -o rank.iocs
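A similar ranking can be approximated in library code. The following is a minimal sketch, assuming matches arrive as 4-tuples of (type, rearmed value, offset, original string), the shape shown for search_raw in the Library Usage section. The rank_iocs helper and the sample data are hypothetical and not part of iocsearcher, and the exact output format of -C may differ.

```python
from collections import Counter

def rank_iocs(raw_matches):
    """Rank (type, value) pairs by number of appearances, most frequent first."""
    counts = Counter(
        (ioc_type, value) for ioc_type, value, _offset, _original in raw_matches
    )
    # most_common() sorts by descending count; ties keep first-seen order.
    return counts.most_common()

# Hypothetical matches, in the tuple shape shown for search_raw.
sample = [
    ("fqdn", "example.com", 10, "example[dot]com"),
    ("fqdn", "example.com", 42, "example.com"),
    ("email", "contact@example.com", 60, "contact[AT]example[dot]com"),
]
ranking = rank_iocs(sample)
```

Counting on the rearmed value means that differently defanged copies of the same indicator are tallied together.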

Library Usage

You can also use iocsearcher as a library by creating a Searcher object and then invoking the functions search_data to identify rearmed and deduplicated IOCs and search_raw to identify all matches, their offsets, and the defanged string. The Searcher object needs to be created only once to parse the regexps. Then, it can be reused to find IOCs in multiple input strings.

python3
>>> import iocsearcher
>>> from iocsearcher.searcher import Searcher
>>> test = 'Find this email contact[AT]example[dot]com'
>>> searcher = Searcher()
>>> searcher.search_data(test)
{('email', 'contact@example.com'), ('fqdn', 'example.com')}
>>> searcher.search_data(test, targets={'email'})
{('email', 'contact@example.com')}
>>> searcher.search_raw(test)
[('email', 'contact@example.com', 16, 'contact[AT]example[dot]com'), ('fqdn', 'example.com', 27, 'example[dot]com')]
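Because search_data returns a flat set of (type, value) pairs, it is often convenient to regroup the result by IOC type. A minimal sketch follows; group_by_type is a hypothetical helper, not part of iocsearcher, and the input set is copied from the output shown above.

```python
from collections import defaultdict

def group_by_type(iocs):
    """Turn a set of (type, value) pairs into {type: sorted list of values}."""
    grouped = defaultdict(list)
    for ioc_type, value in iocs:
        grouped[ioc_type].append(value)
    # Sort the values so the result does not depend on set iteration order.
    return {ioc_type: sorted(values) for ioc_type, values in grouped.items()}

# The set printed by search_data(test) in the session above.
found = {("email", "contact@example.com"), ("fqdn", "example.com")}
by_type = group_by_type(found)
```

In real use, the set passed in would be the return value of searcher.search_data(text).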

You can also open a document without needing to provide its type, get its text, and then use a Searcher object to search for IOCs in the text. For example, if you have a file called file.pdf you can do:

python3
>>> import iocsearcher
>>> from iocsearcher.document import open_document
>>> from iocsearcher.searcher import Searcher
>>> doc = open_document("file.pdf")
>>> text, _ = doc.get_text() if doc is not None else ("", None)
>>> searcher = Searcher()
>>> searcher.search_data(text)

If the file is not a PDF, HTML, Word (.docx), or text document, open_document emits a warning and returns None.

Defang and Rearm

Many security reports defang (i.e., remove the teeth from) malicious indicators, especially network indicators such as URLs, domains, IP addresses, and email addresses. This practice helps prevent users from inadvertently clicking on a malicious indicator and starting a network connection to it. Defanged indicators do not follow the indicator specification and thus require relaxed regular expressions to detect them.

iocsearcher supports some popular defang operations and rearms the IOCs by default so that deduplication works even if the same IOC has been defanged in different ways. However, it is not possible to support all defang operations, as every analyst can come up with their own. If you think iocsearcher is missing support for some popular defang operation, let us know by providing pointers to reports that use them.
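To illustrate why rearming matters for deduplication, here is a naive sketch that handles three common defang styles (hxxp, bracketed dots, bracketed at-signs). The rule list and the rearm function are assumptions made for this example and do not reflect iocsearcher's actual rules; in particular, this version rewrites "hxxp" anywhere in the text, which a real implementation would restrict to URL contexts.

```python
import re

# Illustrative substitutions only; not iocsearcher's actual rule set.
_REARM_RULES = [
    (re.compile(r"hxxp", re.IGNORECASE), "http"),
    (re.compile(r"\[\s*(?:\.|dot)\s*\]", re.IGNORECASE), "."),
    (re.compile(r"\[\s*(?:@|at)\s*\]", re.IGNORECASE), "@"),
]

def rearm(text):
    """Apply each substitution in turn to undo the defang styles listed above."""
    for pattern, replacement in _REARM_RULES:
        text = pattern.sub(replacement, text)
    return text

# The same URL written three ways collapses to a single rearmed value.
variants = ["hxxp://example[DOT]com", "hxxp://example[.]com", "http://example.com"]
unique = {rearm(v) for v in variants}
```

Without the rearm step, the three variants would be reported as three distinct indicators.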

Customizing the Regular Expressions

iocsearcher reads its regular expressions from an INI configuration file. If you want to modify a regexp, add a regexp, change the IOC type associated with a regexp, or disable validation for an existing regexp, you can create a copy of the patterns.ini file in the GitHub repo, edit your copy, and pass it to iocsearcher using the -P (--patterns) option:

iocsearcher -f file.pdf -P mypatterns.ini

Note that if you add a new regexp, the output will be the outermost group if a group exists, and the whole match if the regexp has no groups.
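The group rule can be illustrated with Python's re module. This is a sketch under the assumption that "outermost group" corresponds to group 1, the first group opened in the pattern; the extract helper and the example patterns are hypothetical and are not taken from patterns.ini.

```python
import re

def extract(pattern, text):
    """Return group 1 of each match if the pattern has groups, else the whole match."""
    regex = re.compile(pattern)
    results = []
    for match in regex.finditer(text):
        # regex.groups is the number of capturing groups in the pattern.
        results.append(match.group(1) if regex.groups else match.group(0))
    return results

text = "ref id=AB-1234 end"
with_group = extract(r"id=([A-Z]{2}-\d{4})", text)
without_group = extract(r"id=[A-Z]{2}-\d{4}", text)
```

Wrapping only the identifier in a group keeps the "id=" context in the regexp while excluding it from the reported value.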

Related Tools

There exist multiple other open-source IOC extraction tools and we developed iocsearcher to improve on those. In our FGCS journal paper we propose a novel evaluation methodology for IOC extraction tools and apply it to compare iocsearcher with the following tools:

We believe the results show iocsearcher performs generally best, but that is up to you to judge. We encourage you to read our paper if you have questions about how iocsearcher compares with the above tools and to try the above tools if iocsearcher does not meet your goals.

Filtering

Technically speaking, iocsearcher is an indicator extraction tool, i.e., it extracts indicators regardless of whether they are benign or malicious. Currently, iocsearcher, like most of the other tools mentioned above, does not differentiate malicious indicators (i.e., IOCs) from benign indicators. For example, it will extract all URLs in the given input, whether they are malicious or benign.

Filtering of benign indicators is typically application-specific, so we prefer to keep it as a separate step. Such filtering is oftentimes performed with blocklists or through Natural Language Processing (NLP) techniques.
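As one example of such a separate step, the sketch below drops indicators whose value appears on a caller-supplied list of known-benign values. The filter_benign helper, the list contents, and the exact-match policy are all assumptions for illustration; iocsearcher itself provides no such filter.

```python
def filter_benign(iocs, benign_values):
    """Drop (type, value) pairs whose value exactly matches a known-benign entry."""
    benign = set(benign_values)
    return {(ioc_type, value) for ioc_type, value in iocs if value not in benign}

# Hypothetical extraction result and benign list.
iocs = {
    ("fqdn", "example.com"),
    ("fqdn", "evil.example"),
    ("email", "contact@example.com"),
}
kept = filter_benign(iocs, ["example.com"])
```

Matching is exact, so the email address at the benign domain is kept; an application that wants to drop it too would need its own per-type matching logic.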

License

iocsearcher is released under the MIT license.

This repository includes Base58 decoding code from the monero-python project. That code is located in the iocsearcher/monero folder and it is licensed under BSD 3-Clause.

References

The design and evaluation of iocsearcher and the comparison with prior IOC extraction tools are detailed in our FGCS journal paper:

Juan Caballero, Gibran Gomez, Srdjan Matic, Gustavo Sánchez, Silvia Sebastián, and Arturo Villacañas.
GoodFATR: A Platform for Automated Threat Report Collection and IOC Extraction.
In Future Generation Computer Systems, 2023.

Contributors

The main developer and maintainer of iocsearcher is Juan Caballero. Other members of the MaliciaLab at the IMDEA Software Institute have contributed fixes and helped with testing: Gibran Gomez, Silvia Sebastián, and Srdjan Matic.
