# Exercise 2

## From pcap to packets

Log in via `ssh` to the Lab Environment and `cd working_directory`.

>>>
Do you think that Go-Flows has any advantage compared with tcpdump?
>>>

Go-Flows has an advantage over tcpdump when many customized options for
filtering the traffic capture are needed. Other than that, tcpdump is usually
already known and easy to get started with. For simple filtering purposes I
consider tcpdump to be faster than Go-Flows.

>>>
What are the proportions of TCP, UDP, and ICMP traffic? And traffic that is not
TCP, UDP, or ICMP?
>>>

About half (~47%) of the capture is TCP traffic. ICMP traffic is about 40% and
UDP traffic about 7%. The rest of the traffic makes up about 6%.

>>>
How much traffic is related to websites (HTTP, HTTPS)? And DNS traffic?
>>>

* HTTP traffic: ~14.12%
* HTTPS traffic: ~15.25%
* DNS traffic: ~0.82%

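One way such shares can be computed from the packet CSV is by matching the well-known ports, assuming the feature file also exports `sourceTransportPort` and `destinationTransportPort` (an assumption about the exported columns). A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in for Ex2_team13.csv; in practice: df = pd.read_csv('./Ex2_team13.csv').
# The two port columns are assumed to be part of the exported feature set.
df = pd.DataFrame({
    'sourceTransportPort':      [80, 443, 53, 40000, 443],
    'destinationTransportPort': [40001, 40002, 40003, 80, 40004],
})

def share(port):
    # Percentage of packets where either endpoint uses the given well-known port
    hit = (df['sourceTransportPort'] == port) | (df['destinationTransportPort'] == port)
    return hit.mean() * 100

for name, port in [('HTTP', 80), ('HTTPS', 443), ('DNS', 53)]:
    print("{} traffic: ~{:.2f}%".format(name, share(port)))
```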
### rep-10

Run the following command inside `working_directory`:

`tcpdump -tt -c 10 -nr Ex2_team13.pcap`

* `-tt` for raw timestamps (seconds since the epoch)
* `-c 10` for stopping after the first 10 packets
* `-n` for not converting addresses to names
* `-r` for reading from the pcap file

The last line (10th packet) reads:

`1546318980.014549 IP 203.74.52.109 > 200.130.97.12: ICMP echo request, id 16190, seq 4544, length 12`
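
For further processing, such a `-tt` output line can be split into its timestamp and endpoint addresses. A sketch with a hypothetical regular expression (not part of the exercise tooling):

```python
import re

# The 10th-packet line from the tcpdump output above
line = ("1546318980.014549 IP 203.74.52.109 > 200.130.97.12: "
        "ICMP echo request, id 16190, seq 4544, length 12")

# Hypothetical parser: epoch timestamp, source address, destination address
m = re.match(r'(?P<ts>\d+\.\d+) IP (?P<src>\S+) > (?P<dst>[^:]+):', line)
print(m.group('ts'), m.group('src'), m.group('dst'))
```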

### rep-11

After running the command

`go-flows run features pcap2pkts.json export csv Ex2_team13.csv source libpcap Ex2_team13.pcap`

we get the file `Ex2_team13.csv`.

The following Python script quickly extracts the `protocolIdentifier` values and
their occurrence counts:

```python
import pandas as pd

df = pd.read_csv(r'./Ex2_team13.csv')

print(df['protocolIdentifier'].value_counts(sort=True))
```

Output:

```
6      889752
1      761985
17     124772
47     107355
58       1308
50         66
103        15
41          2
Name: protocolIdentifier, dtype: int64
```

These IANA protocol numbers correspond to TCP (6), ICMP (1), UDP (17), GRE (47),
ICMPv6 (58), ESP (50), PIM (103), and IPv6 encapsulation (41), which is where
the proportions above come from.

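The percentages quoted earlier follow directly from these counts; `value_counts(normalize=True)` produces them in one step. A sketch on a toy column with the same rough mix (the real column comes from the CSV):

```python
import pandas as pd

# Toy protocolIdentifier column: 6 = TCP, 1 = ICMP, 17 = UDP, 47 = GRE
s = pd.Series([6] * 47 + [1] * 40 + [17] * 7 + [47] * 6, name='protocolIdentifier')

# normalize=True turns counts into fractions; *100 yields percentages
percentages = s.value_counts(normalize=True, sort=True) * 100
print(percentages)
```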
## From Pcap to Flow Vectors

>>>
Remember that here we have extracted flows within a time-frame of 10 seconds.
Can you think about legitimate and illegitimate situations for case (c), i.e., a
source sending traffic to many different destinations in a short time?
>>>

TBA

>>>
You can additionally count the number of flows that show TCP, UDP, ICMP, and
other IP protocols as "mode" protocol. Do you think that you will get a similar
proportion as in [rep-11]? Beyond answering "yes" or "no", think about reasons
that might make such proportions similar or different (there are some that are
worth considering).
>>>

TBA

### rep-12

After running the command

`go-flows run features pcap2flows.json export csv Ex2flows_team13.csv source libpcap Ex2_team13.pcap`

we get the file `Ex2flows_team13.csv`.

The following Python script quickly extracts the percentages of sources
communicating with a single destination and with more than ten destinations:

```python
import pandas as pd

df = pd.read_csv(r'../data/Ex2flows_team13.csv')

dataLength = len(df)
print("Length of dataset: {}".format(dataLength))

singleDestinationFilter = df['distinct(destinationIPAddress)'] == 1
moreThan10DestinationsFilter = df['distinct(destinationIPAddress)'] > 10

percentageOfSingleDst = len(df[singleDestinationFilter]) / dataLength * 100
percentageOfMoreThan10Dst = len(df[moreThan10DestinationsFilter]) / dataLength * 100

print("Single Destination: {} %".format(round(percentageOfSingleDst, 3)))
print("More than 10 destinations: {} %".format(round(percentageOfMoreThan10Dst, 3)))
```

Output:

```
Length of dataset: 209434
Single Destination: 94.901 %
More than 10 destinations: 0.796 %
```

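The flow-level protocol proportions asked about earlier can be counted in the same way, assuming the flow CSV exports the protocol as a `protocolIdentifier` column (the exact column name depends on the `pcap2flows.json` feature file). A sketch on toy flows:

```python
import pandas as pd

# Toy flow table; in practice: df = pd.read_csv('./Ex2flows_team13.csv'),
# with whatever protocol column the feature file actually exports.
df = pd.DataFrame({'protocolIdentifier': [6, 6, 6, 17, 1, 6]})

# Percentage of flows per "mode" protocol
proportions = df['protocolIdentifier'].value_counts(normalize=True) * 100
print(proportions.round(2))
```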
## From Pcap to Aggregated Vectors

>>>
It is obvious that the three explored time series have different orders of
magnitude, but are they correlated? Time series must be plotted, so we
encourage you to do that. Depending on the analysis platform (Python, MATLAB, R,
etc.), you have commands that evaluate correlations between signals by outputting
a numerical value (0: no correlation, 1: maximum direct correlation, -1: maximum
inverse correlation). However, whenever possible, we recommend using plots and
visual representations. Plot the three time series. To better assess
correlations, you can scale/normalize the signals before plotting them.
>>>

TBA

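A minimal sketch of the scaling-and-correlating step, on synthetic stand-ins for the three series (the real ones come from the aggregated vectors):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the three aggregated time series
t = np.arange(100)
packets = 1000 * np.sin(t / 10.0) + 5000               # e.g. packets per window
octets = 800000 * np.sin(t / 10.0) + 4000000           # linearly related to packets
flows = np.random.default_rng(0).normal(200, 20, 100)  # unrelated noise

series = pd.DataFrame({'packets': packets, 'octets': octets, 'flows': flows})

# z-score normalization puts all three on one scale before plotting
normalized = (series - series.mean()) / series.std()

# Pearson correlation matrix: 1 direct, -1 inverse, ~0 no correlation
corr = series.corr()
print(corr.round(2))
```

Plotting `normalized` (rather than `series`) makes the co-movement of the first two signals visible despite their different magnitudes.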
>>>
Additionally, you can assess value distributions by plotting histograms. We
recommend also plotting summary statistics (mean, median, standard deviation)
superposed on the histograms to check whether they are representative of the
data. Are they?
>>>

TBA
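
As a sketch of that histogram check, on a synthetic skewed series (all data here are stand-ins for an aggregated series):

```python
import numpy as np

# Synthetic stand-in: a heavy right tail makes the mean unrepresentative
# while the median stays near the bulk of the data
rng = np.random.default_rng(1)
values = rng.lognormal(mean=0.0, sigma=1.0, size=10000)

# Histogram bins/counts (these would be fed to a bar or hist plot)
counts, edges = np.histogram(values, bins=50)

mean, median, std = values.mean(), np.median(values), values.std()
print("mean={:.2f} median={:.2f} std={:.2f}".format(mean, median, std))
```

Superposing vertical lines at the mean and median on such a histogram shows at a glance whether they sit in the bulk of the data; for a skewed sample like this, the mean does not.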