Exercise 2
From Pcap to Packets
Log in via ssh to the Lab Environment and cd into working_directory.
Do you think that Go-Flows has any advantage compared with tcpdump?
Go-Flows has an advantage over tcpdump when many customized options for filtering the captured traffic are needed. Otherwise, tcpdump is usually already familiar and easy to get started with. For simple filtering purposes I consider tcpdump to be faster than Go-Flows.
What are the proportions of TCP, UDP, and ICMP traffic? And traffic that is not TCP, UDP, or ICMP?
About half (~47%) of the capture is TCP traffic. ICMP traffic is about 40% and UDP traffic about 7%. The rest of the traffic makes up about 6%.
How much traffic is related to websites (HTTP, HTTPS)? And DNS traffic?
HTTP traffic: ~14.12%
HTTPS traffic: ~15.25%
DNS traffic: ~0.82%
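Shares like these can be obtained by matching well-known ports (80 for HTTP, 443 for HTTPS, 53 for DNS) against the per-packet CSV. The sketch below is only an illustration of that idea, not the script used for the numbers above: it assumes the export contains the columns sourceTransportPort and destinationTransportPort, the helper port_share is hypothetical, and the small DataFrame is made-up sample data.

```python
import pandas as pd


def port_share(df, port):
    """Percentage of rows whose source or destination port equals `port`."""
    hit = (df["sourceTransportPort"] == port) | (df["destinationTransportPort"] == port)
    # The mean of a boolean Series is the fraction of True values.
    return hit.mean() * 100


# Made-up sample rows; the column names are an assumption about the export.
sample = pd.DataFrame({
    "sourceTransportPort": [51000, 443, 53, 51001],
    "destinationTransportPort": [80, 51002, 51003, 22],
})

for label, port in [("HTTP", 80), ("HTTPS", 443), ("DNS", 53)]:
    print("{}: {:.2f} %".format(label, port_share(sample, port)))
```

Matching on either port counts both requests and responses. Traffic on non-standard ports would be missed by this approach.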
rep-10
Run the following command inside working_directory:
tcpdump -tt -c 10 -nr Ex2_team13.pcap
-tt for timestamps
-c 10 for showing the first 10 packets
-n for not converting addresses to names
-r for reading from pcap
Last line (10th packet) says:
1546318980.014549 IP 203.74.52.109 > 200.130.97.12: ICMP echo request, id 16190, seq 4544, length 12
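With -tt, tcpdump prints the timestamp as seconds since the Unix epoch. As a small aside, the value from the line above can be converted to a readable UTC date with the Python standard library:

```python
from datetime import datetime, timezone

# Timestamp of the 10th packet, copied from the tcpdump output above.
epoch = 1546318980.014549

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
ts = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(ts.strftime("%Y-%m-%d %H:%M:%S %Z"))
```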
rep-11
After running the command
go-flows run features pcap2pkts.json export csv Ex2_team13.csv source libpcap Ex2_team13.pcap
we get the file Ex2_team13.csv.
The following Python script counts how often each protocolIdentifier value occurs:
import pandas as pd
df = pd.read_csv(r'./Ex2_team13.csv')
print(df['protocolIdentifier'].value_counts(sort=True))
Output:
6 889752
1 761985
17 124772
47 107355
58 1308
50 66
103 15
41 2
Name: protocolIdentifier, dtype: int64
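The percentages quoted earlier for TCP, UDP, ICMP, and other traffic follow from these counts. The sketch below redoes that arithmetic. The counts are copied from the output above; the helper protocol_shares and the grouping into "other" are illustrative choices. Protocol numbers 6, 1, and 17 are TCP, ICMP, and UDP. Everything else (47 GRE, 58 ICMPv6, 50 ESP, 103 PIM, 41 IPv6 encapsulation) is grouped as "other", so ICMPv6 is not counted as ICMP here.

```python
# Packet counts per IP protocol number, copied from the output above.
counts = {6: 889752, 1: 761985, 17: 124772, 47: 107355,
          58: 1308, 50: 66, 103: 15, 41: 2}

# IANA protocol numbers for the three protocols of interest.
names = {6: "TCP", 1: "ICMP", 17: "UDP"}


def protocol_shares(counts):
    """Return the percentage of packets for TCP, UDP, ICMP, and the rest."""
    total = sum(counts.values())
    shares = {"TCP": 0.0, "UDP": 0.0, "ICMP": 0.0, "other": 0.0}
    for proto, n in counts.items():
        shares[names.get(proto, "other")] += n / total * 100
    return shares


shares = protocol_shares(counts)
for label, pct in shares.items():
    print("{}: {:.1f} %".format(label, pct))
```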
From Pcap to Flow Vectors
Remember that here we have extracted flows within a time-frame of 10 seconds. Can you think about legitimate and illegitimate situations for case (c), i.e., a source sending traffic to many different destinations in a short time?
TBA
You can additionally count the number of flows that show TCP, UDP, ICMP, and other IP protocols as "mode" protocol. Do you think that you will get a similar proportion as in [rep-11]? Beyond answering "yes" or "no", think about reasons that might make such proportions similar or different (there are some that are worth considering).
TBA
rep-12
After running the command
go-flows run features pcap2flows.json export csv Ex2flows_team13.csv source libpcap Ex2_team13.pcap
we get the file Ex2flows_team13.csv.
The following Python script computes the percentage of sources that communicate with exactly one destination and the percentage that communicate with more than ten destinations:
import pandas as pd
df = pd.read_csv(r'../data/Ex2flows_team13.csv')
dataLength = len(df)
print("Length of dataset: {}".format(dataLength))
singleDestinationFilter = df['distinct(destinationIPAddress)'] == 1
moreThan10DestinationsFilter = df['distinct(destinationIPAddress)'] > 10
percentageOfSingleDst = len(df[singleDestinationFilter]) / dataLength * 100
percentageOfMoreThan10Dst = len(df[moreThan10DestinationsFilter]) / dataLength * 100
print("Single Destination: {} %".format(round(percentageOfSingleDst, 3)))
print("More than 10 destinations: {} %".format(round(percentageOfMoreThan10Dst, 3)))
Output:
Length of dataset: 209434
Single Destination: 94.901 %
More than 10 destinations: 0.796 %
From Pcap to Aggregated Vectors
It is obvious that the three explored time series have different order-of-magnitude, but are they correlated? Time series must be plotted, so we encourage you to do that. Depending on the analysis platform (Python, MATLAB, R, etc.), you have commands that evaluate correlations between signals by outputting a numerical value (0: no correlation, 1: maximum direct correlation, -1: maximum inverse correlation). However, whenever possible, we recommend using plots and visual representations. Plot the three time-series. To better assess correlations, you can scale/normalize signals before plotting them.
TBA
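The correlation check described in the question could be sketched as follows. This is only an illustration on synthetic data, not a result for the lab capture: the names series_a, series_b, and series_c are hypothetical placeholders for the three time series, and the values are randomly generated. The sketch scales each signal with a z-score so the series can share one plot axis, then computes pairwise Pearson correlations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the three time series (hypothetical data).
rng = np.random.default_rng(0)
base = rng.random(60)
series = pd.DataFrame({
    "series_a": 1000 * base,                       # large order of magnitude
    "series_b": 500_000 * base + rng.random(60),   # follows series_a closely
    "series_c": 10 * (1 - base),                   # moves opposite to series_a
})

# z-score normalization: zero mean and unit standard deviation per column.
normalized = (series - series.mean()) / series.std()

# Pairwise Pearson correlation coefficients, each in [-1, 1].
corr = series.corr()
print(corr.round(2))
```

Pearson correlation does not change under this kind of scaling, so normalization matters for the plot (for example via normalized.plot(), which requires matplotlib) rather than for the coefficient itself.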
Additionally, you can assess value distributions by plotting histograms. We recommend also plotting central tendency values (mean, median, standard deviation) superposed on the histograms to check if they are representative of the data. Are they?
TBA