poorting/pcap2parquet

pcap2parquet

Create parquet files from network packet captures (pcap files).

This tool uses Python, tcpdump and tshark, so converting large pcaps can take some time. If you want (way) faster conversions, take a look at the Rust version I made, pcap-converter, which is between 5 and 16 times faster.

Why?

Analysing pcaps can be tricky, especially if you are not used to the tools commonly used for the job (wireshark, tshark, ...).

If you are fairly comfortable with SQL, you can analyse the resulting parquet files with duckdb, which reads parquet files natively!

See below for an example.

Requirements

You need to have tshark and tcpdump installed (and on the path) as both are called by pcap2parquet.

They are invoked without sudo, so make sure both can be executed by your user account.
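A quick way to verify this before starting a long conversion is to look both tools up on the PATH. The sketch below is not part of pcap2parquet: `missing_tools` is a hypothetical helper name, and it only checks that an executable can be found by the current user, not that it can actually read your pcap file.

```python
import shutil


def missing_tools(tools=("tshark", "tcpdump")):
    """Return the names in `tools` that cannot be found on the PATH.

    Hypothetical helper for illustration; not part of pcap2parquet.
    """
    # shutil.which returns None when no matching executable is found
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("tshark and tcpdump are both available")
```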

tcpdump is used to split bigger pcap files into smaller chunks before processing; it does not need access to network interfaces. Run tcpdump -r <pcap file> to check that permissions are set properly.

tshark is used to export the pcap file to an intermediate CSV file, which is then converted to the final parquet file.
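To give an idea of what such an export can look like, the sketch below builds a tshark command line that prints selected fields as CSV. It is an assumption about the general approach, not the exact command pcap2parquet runs: the helper name `tshark_csv_command` and the short field list are made up for illustration, while `-r`, `-T fields`, `-e` and `-E` are standard tshark options.

```python
def tshark_csv_command(pcap_file, fields=("frame.time", "ip.src", "ip.dst", "ip.proto")):
    """Build (but do not run) a tshark command that prints `fields` as CSV.

    Hypothetical helper; the real tool may use other fields and options.
    """
    cmd = ["tshark", "-r", str(pcap_file), "-T", "fields"]
    for field in fields:
        # one -e option per field to export
        cmd += ["-e", field]
    # header row, comma separator, double-quoted values
    cmd += ["-E", "header=y", "-E", "separator=,", "-E", "quote=d"]
    return cmd


if __name__ == "__main__":
    print(" ".join(tshark_csv_command("sample1.pcap")))
```

The returned list can be handed to `subprocess.run` with stdout redirected to a file to produce the CSV. The dotted tshark field names appear to correspond to the underscore column names shown in the duckdb output further down (for example ip.src and ip_src).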

Usage

usage: pcap2parquet.py [-h] [-l] [-r] [--debug] [-V] source parquetdir

Convert pcap file(s) (produced by tshark, tcpdump, snort or others) to parquet format

positional arguments:
  source                Source pcap file or directory containing pcap files
  parquetdir            Directory where to store resulting parquet files

options:
  -h, --help            show this help message and exit
  -l, --log_parse_errors
                        Any lines that cannot be parsed will be stored in a file
                        The filename is equal to the file being processed, 
                        with '-parse-errors.txt' appended. It will be stored
                        in the current working directory
  -r, --recursive       recursively searches for pcap files if source specifies a directory.
  --debug               show debug output
  -V, --version         print version and exit

Example

./pcap2parquet.py sample1.pcap .

Resulting parquet files are stored in parquetdir (the destination directory) under the name of the original file with '.parquet' appended.
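The naming rule can be sketched in a few lines. `parquet_path` is a hypothetical name used only for illustration; it assumes that only the file name (not the source directory) is carried over to the destination.

```python
from pathlib import Path


def parquet_path(source_file, parquetdir):
    """Return <parquetdir>/<source file name>.parquet (illustrative helper)."""
    # Keep the full original name, including '.pcap', and append '.parquet'
    return Path(parquetdir) / (Path(source_file).name + ".parquet")


if __name__ == "__main__":
    print(parquet_path("sample1.pcap", "."))
```

So for the example above, sample1.pcap would end up as sample1.pcap.parquet in the current directory.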

If the source points to a directory then all pcap files in that directory will be converted to parquet files and stored in the destination directory.
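As a rough sketch of how the source argument and the -r option could be resolved into a list of files: the helper name `find_pcaps` is hypothetical, and matching on the `.pcap` extension is an assumption, since the text does not say how pcap files are recognised.

```python
from pathlib import Path


def find_pcaps(source, recursive=False, suffixes=(".pcap",)):
    """List pcap files for a source that is either a file or a directory.

    Assumption: files are recognised by extension (default '.pcap').
    """
    source = Path(source)
    if source.is_file():
        return [source]
    # rglob descends into subdirectories, glob stays in the top level
    candidates = source.rglob("*") if recursive else source.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in suffixes)


if __name__ == "__main__":
    for pcap in find_pcaps(".", recursive=True):
        print(pcap)
```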

Pcap files up to tens of megabytes are converted in a matter of seconds. Larger files of a few hundred megabytes take more time. Files over a gigabyte in size can take minutes to tens of minutes.

To speed up conversion, files larger than 100MB are first split into chunks of 100MB, which are then converted in parallel to the intermediate (CSV) format. By default, half of the cores present in the system are used. Note that hyperthreading effectively doubles the number of cores the system reports. You can use the -n argument to specify more or fewer cores than the default.
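The arithmetic behind this can be sketched as follows. Both function names are hypothetical, and reading "100MB" as 100 million bytes is an assumption; the tool may use a different unit. `os.cpu_count()` reports logical cores, so hyperthreaded cores are counted twice.

```python
import os

CHUNK_BYTES = 100 * 1000 * 1000  # assumption: "100MB" read as 100 million bytes


def default_workers(total_cores=None):
    """Half of the logical cores, but never fewer than one."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // 2)


def chunk_count(file_size_bytes, chunk_bytes=CHUNK_BYTES):
    """Number of pieces to convert; files up to one chunk in size stay whole."""
    if file_size_bytes <= chunk_bytes:
        return 1
    # integer ceiling division
    return (file_size_bytes + chunk_bytes - 1) // chunk_bytes


if __name__ == "__main__":
    print("workers:", default_workers())
    print("chunks for a 1.5 GB file:", chunk_count(1_500_000_000))
```

With more chunks than workers, the chunks are simply processed in batches, which is why very large files still take minutes.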

Using duckdb

To analyse the resulting parquet file with duckdb, install and start the duckdb CLI, then run something like:

-- Create a view from the contents of the 'sample.parquet' file in the current working directory
-- This way the contents will not be loaded into memory (as opposed to 'create table')

create view pcap as select * from 'sample.parquet';

-- Describe the structure of the pcap view:
describe pcap;

┌─────────────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│     column_name     │ column_type │  null   │   key   │ default │ extra │
│       varchar       │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ frame_time          │ TIMESTAMP   │ YES     │         │         │       │
│ ip_src              │ VARCHAR     │ YES     │         │         │       │
│ ip_dst              │ VARCHAR     │ YES     │         │         │       │
│ ip_proto            │ UTINYINT    │ YES     │         │         │       │
│ tcp_flags           │ VARCHAR     │ YES     │         │         │       │
│ col_source          │ VARCHAR     │ YES     │         │         │       │
│ col_destination     │ VARCHAR     │ YES     │         │         │       │
│ col_protocol        │ VARCHAR     │ YES     │         │         │       │
│ dns_qry_name        │ VARCHAR     │ YES     │         │         │       │
│ dns_qry_type        │ VARCHAR     │ YES     │         │         │       │
│ eth_type            │ USMALLINT   │ YES     │         │         │       │
│ frame_len           │ USMALLINT   │ YES     │         │         │       │
│ udp_length          │ USMALLINT   │ YES     │         │         │       │
│ http_request_uri    │ VARCHAR     │ YES     │         │         │       │
│ http_host           │ VARCHAR     │ YES     │         │         │       │
│ http_request_method │ VARCHAR     │ YES     │         │         │       │
│ http_user_agent     │ VARCHAR     │ YES     │         │         │       │
│ icmp_type           │ UTINYINT    │ YES     │         │         │       │
│ ip_frag_offset      │ USMALLINT   │ YES     │         │         │       │
│ ip_ttl              │ UTINYINT    │ YES     │         │         │       │
│ ntp_priv_reqcode    │ VARCHAR     │ YES     │         │         │       │
│ tcp_dstport         │ USMALLINT   │ YES     │         │         │       │
│ tcp_srcport         │ USMALLINT   │ YES     │         │         │       │
│ udp_dstport         │ USMALLINT   │ YES     │         │         │       │
│ udp_srcport         │ USMALLINT   │ YES     │         │         │       │
│ col_info            │ VARCHAR     │ YES     │         │         │       │
│ pcap_file           │ VARCHAR     │ YES     │         │         │       │
├─────────────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤
│ 27 rows                                                       6 columns │
└─────────────────────────────────────────────────────────────────────────┘

-- Return the number of rows in pcap
select count(*) from pcap;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        28683 │
└──────────────┘

-- List the different IP destinations and the number of packets with that destination
-- Order by number of packets (descending order) and limit to the first 10

select ip_dst, count() as count from pcap group by ip_dst order by count desc limit 10;
┌────────────────┬────────┐
│     ip_dst     │ count  │
│    varchar     │ int64  │
├────────────────┼────────┤
│ 172.16.139.250 │ 181258 │
│ 68.64.21.62    │  25733 │
│ 172.16.133.57  │  19723 │
│ 172.16.133.26  │  17767 │
│ 67.217.64.99   │  16875 │
│ 157.56.240.102 │  14851 │
│ 172.16.133.78  │  13602 │
│ 172.16.133.36  │  11675 │
│ 172.16.133.25  │  11569 │
│ 172.16.133.39  │  10818 │
├────────────────┴────────┤
│ 10 rows       2 columns │
└─────────────────────────┘
