tshark extract fields with their string representation


I have a pcap file captured with tshark that I want to analyze and export to a CSV or xls file. According to the tshark documentation, I can use either the -z option with the proper arguments, or -T together with -E and -e. I'm using Python 3.6 on a Debian machine. Currently, my command looks like this:

command="tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
              "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
              "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
              "Subscription-Id,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
              "Multiple-Services-Credit-Control,Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
              "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
              "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)

Later I process the output and load it into a pandas DataFrame like so:

# loops adding TCP and/or UDP ports to scan traffic from
    if args.tcp:
        for port in args.tcp:
            command += " -d tcp.port=={},diameter".format(port)

    if args.udp:
        for port in args.udp:
            command += " -d udp.port=={},diameter".format(port)

    # calling subprocess with output redirection to task variable
    task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)

    # a loop adding new data dictionaries to data_list
    for line in task.stdout:
        line = re.sub(r"'", "", line.decode("utf-8")) # firstly, decode byte string and get rid of '
        # secondly, split string every whitespace or = and obtain dictionary-like list of keys, values
        line = re.split(r"\s|=", line)

        # convert obtained list to ordered dictionary to preserve column order
        # pair up the items so that each item i is a key and item i+1 is its value
        row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))  # "row" avoids shadowing the built-in dict
        data_list.append(row)

    # remove last 4 dictionaries (last 4 lines of task.stdout)
    data_list = data_list[:-4]
    df = pd.DataFrame(data_list).fillna("-") # create data frame from list of dicts and fill each NaN with "-"
    df.to_excel("{}.xls".format(args.output_file), index=False)
    print("Please remember that 'frame' column may not correspond to row index!")
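The splitting step above breaks as soon as a value contains a space or an equals sign. A minimal sketch of a more tolerant parser follows, assuming each output line consists of key='value' pairs (which is what the quote stripping and splitting above imply). The name parse_avp_line and the sample line are hypothetical and only serve as illustration.

```python
import re
from collections import OrderedDict

# Matches key='value' pairs; the value may contain spaces or '=' but no quote.
PAIR_RE = re.compile(r"(\S+?)='([^']*)'")


def parse_avp_line(line):
    """Return an ordered mapping of the key='value' pairs found in one line."""
    return OrderedDict(PAIR_RE.findall(line))


# Hypothetical sample line written in the key='value' style assumed above.
sample = "frame='12' cmd='272' Session-Id='host.example;a=b;2' User-Name='john doe' CC-Request-Type='3'"
row = parse_avp_line(sample)
print(row["Session-Id"], "|", row["User-Name"])
```

If the same key appears more than once on a line, only the last value is kept, which matches the behaviour of building a dictionary from the pairs.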

When I open the output file I can see that it works fine, except that fields such as CC-Request-Number contain numeric values instead of their string representation. For example, in Wireshark I see data like this:

(screenshot of the Wireshark packet details; image not preserved)

and in the output Excel file, the CC-Request-Number column shows 3 in the row corresponding to this packet instead of TERMINATION-REQUEST.

My question is: how can I translate this number to its string representation while using the -z option, or (as I gather from what I've seen on the web) how can I get the fields mentioned above with their values using the -T and -e options? I listed all available fields with tshark -G, but there are too many of them and I can't think of a reasonable way to find the ones I want.
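One way to narrow down the -G listing is to filter it by prefix. The sketch below assumes the tab-separated glossary layout described in the tshark man page, where field records start with "F" followed by the descriptive name and the abbreviation, and it assumes the Diameter AVPs are registered under the "diameter." prefix. The function name diameter_fields and the sample text are hypothetical.

```python
def diameter_fields(glossary_text, prefix="diameter."):
    """Yield (abbreviation, descriptive name) for field records under prefix."""
    for line in glossary_text.splitlines():
        parts = line.split("\t")
        # Field records: "F", descriptive name, abbreviation, type, parent, ...
        if len(parts) >= 3 and parts[0] == "F" and parts[2].startswith(prefix):
            yield parts[2], parts[1]


# Hypothetical excerpt; real input could come from
# subprocess.check_output(["tshark", "-G", "fields"], universal_newlines=True)
sample = (
    "P\tDiameter Protocol\tdiameter\n"
    "F\tCC-Request-Type\tdiameter.CC-Request-Type\tFT_UINT32\tdiameter\n"
    "F\tSource Port\ttcp.srcport\tFT_UINT16\ttcp\n"
)
for abbrev, name in diameter_fields(sample):
    print(abbrev, "-", name)
```

The abbreviations printed this way are the names that -e expects.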

There are 2 answers

Answer by Colonder (best answer)

Thanks to John Zwinck's suggestion, this answer, and the Python documentation on the ElementTree XML API, I implemented the code presented below (I downloaded dictionary.xml and chargecontrol.xml from the official Wireshark GitHub repository):

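import re
import subprocess
import xml.etree.ElementTree as ET
from collections import OrderedDict

import pandas as pd

# args is assumed to be the namespace of command-line arguments parsed earlier in the script
chargecontrol_tree = ET.parse("chargecontrol.xml")
dictionary_tree = ET.parse("dictionary.xml")
chargecontrol_root = chargecontrol_tree.getroot()
dictionary_root = dictionary_tree.getroot()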

# list that will contain data dictionaries
data_list = []

# base command
command = "tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
          "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
          "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
          "Subscription-Id-Data,Subscription-Id-Type,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
          "Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
          "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
          "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)

# loops adding tcp and/or udp ports to scan traffic from
if args.tcp:
    for port in args.tcp:
        command += " -d tcp.port=={},diameter".format(port)

if args.udp:
    for port in args.udp:
        command += " -d udp.port=={},diameter".format(port)

# calling subprocess with output redirection to task variable
task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)

# a loop adding new data dictionaries to data_list
for line in task.stdout:
    line = re.sub(r"'", "", line.decode("utf-8")) # firstly, decode byte string and get rid of '
    # secondly, split string every whitespace or = and obtain dictionary-like list of keys, values
    line = re.split(r"\s|=", line)

    # convert obtained list to ordered dictionary to preserve column order
    # pair up the items so that each item i is a key and item i+1 is its value
    row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))  # "row" avoids shadowing the built-in dict
    data_list.append(row)

# remove last 4 dictionaries (last 4 lines of task.stdout)
data_list = data_list[:-4]
df = pd.DataFrame(data_list).fillna("-") # create data frame from list of dicts and fill each NaN with "-"

# values taken from official wireshark repository
# https://github.com/boundary/wireshark/blob/master/diameter/dictionary.xml
# https://github.com/wireshark/wireshark/blob/2832f4e97d77324b4e46aac40dae0ce898ae559d/diameter/chargecontrol.xml
df["Auth-Application-Id"] = df["Auth-Application-Id"].map({node.attrib["code"]:node.attrib["name"] for node in
      dictionary_root.findall(".//*[@name='Auth-Application-Id']/enum")})

# columns whose numeric values have to be replaced with their enum names
for col in ["CC-Request-Type", "CC-Session-Failover", "Credit-Control-Failure-Handling", "Subscription-Id-Type"]:
    df[col] = df[col].map({node.attrib["code"]: node.attrib["name"] for node in
          chargecontrol_root.findall((".//*[@name='{}']/enum").format(col))})


df.to_excel("{}.xls".format(args.output_file), index=False)
print("Please remember that 'frame' column may not correspond to row index!")
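One caveat with the substitution above: pandas Series.map() returns NaN for any value that has no entry in the mapping, so unknown codes and the "-" placeholders are lost. Below is a minimal sketch of the same lookup with a fallback to the original value. The XML snippet is a hypothetical, heavily trimmed stand-in for the real dictionary files, and enum_names and translate are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for a Diameter dictionary file.
SAMPLE_XML = """
<dictionary>
  <avp name="CC-Request-Type" code="416">
    <enum name="INITIAL_REQUEST" code="1"/>
    <enum name="UPDATE_REQUEST" code="2"/>
    <enum name="TERMINATION_REQUEST" code="3"/>
  </avp>
</dictionary>
"""


def enum_names(root, avp_name):
    """Map enum code (as a string) to enum name for one AVP."""
    path = ".//*[@name='{}']/enum".format(avp_name)
    return {node.attrib["code"]: node.attrib["name"] for node in root.findall(path)}


def translate(values, names):
    """Replace known codes with names and keep everything else unchanged."""
    return [names.get(value, value) for value in values]


root = ET.fromstring(SAMPLE_XML)
names = enum_names(root, "CC-Request-Type")
print(translate(["1", "3", "9", "-"], names))
```

With a DataFrame, the same fallback can be written as df[col] = df[col].map(names).fillna(df[col]).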

Answer by John Zwinck

Strangely, with -T fields and -e, tshark always prints numeric representations, but with the "Custom Fields" output format it prints textual representations. The good news is that the Custom Fields mode is actually 3x faster than the -T fields mode. The bad news is that I know of no way to control the separator character between the custom fields, so it seems rather unusable if your field content may contain spaces.

Instead of -z, try this:

-o column.format:'"time", "%t", "type", "%Cus:diameter.CC-Request-Number"'
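When this option is passed from Python, the nested quotes are easier to handle as an argument list than as a shell string. The sketch below only builds the arguments; the preference string is copied verbatim from the option above (whether the installed tshark version still accepts that preference name is an assumption to verify), and custom_column_args and capture.pcap are hypothetical names.

```python
def custom_column_args(pcap_path, field="diameter.CC-Request-Number"):
    """Build a tshark argument list that prints a time column and one custom field."""
    column_format = 'column.format:"time", "%t", "type", "%Cus:{}"'.format(field)
    return ["tshark", "-r", pcap_path, "-o", column_format]


args = custom_column_args("capture.pcap")  # hypothetical file name
print(args)
# The list can be passed to subprocess.Popen(args, stdout=subprocess.PIPE)
# without shell=True, so no extra shell quoting is needed.
```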