
# Exercise 3

## Analyzing Darkspace Evolution

>>>
Check results from [rep-14] again. Are they correlated? Think for a second
about the possible meaning of the analyzed time series being correlated. What
could be the reason why the drop in the number of unique IP sources after Jan
16 does not cause a proportional drop in the other signals?
>>>

Most of the signals are strongly or at least moderately correlated. Looking at
the different correlations, one possible explanation is that someone was
scanning the network or attacking a large number of different hosts. This
hypothesis is supported by the high correlation of unique destination IPs with
the number of packets and the number of bytes sent. If one IP address (or a
handful of them) sends most of the traffic to many unique destination IPs, then
the drop in unique source IPs after Jan 16 removes mostly low-volume sources and
does not cause a proportional drop in the other signals.
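
To make the correlation argument concrete, here is a minimal sketch of the Pearson correlation between daily signals. All values are made up for illustration and are not taken from [rep-14]; the variable names are hypothetical as well.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

# Made-up daily values (assumption, not lab data).
unique_dsts = [100, 120, 90, 150, 130, 160]
packets = [1000, 1250, 880, 1500, 1310, 1580]
# Unique sources drop sharply from day 4 on, while packets stay high.
unique_srcs = [50, 55, 48, 12, 10, 11]

r_dst_pkt = pearson(unique_dsts, packets)
r_src_pkt = pearson(unique_srcs, packets)
print(f"dst IPs vs packets: {r_dst_pkt:.2f}")
print(f"src IPs vs packets: {r_src_pkt:.2f}")
```

In this toy data the packets follow the destinations closely, while the drop in sources leaves them unaffected, which is the situation the hypothesis describes.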

>>>
Check results from [rep-15] again. Do the results make sense for you? Would you
expect a different ratio in a normal network (no darkspace)?
>>>

In a normal network I would expect the ratio to be much closer to one, albeit
still higher than one. Thinking about my traffic at home, most requests have a
response associated with them, which pulls the ratio towards one. A horizontal
scan on the network, for example, easily skews this ratio.

>>>
You used the median in [rep-15], but you could have used the mean. Does it make
any difference? What's better in your opinion? When to use mean and when
median? Can you figure out pros and cons for both measures of central tendency?
>>>

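The median makes more sense in this case because it is robust against outliers.
The traffic data is very diverse and spread out, so a few extreme values pull
the mean far away from the median. The mean is the better choice when the data
is roughly symmetric without extreme values, or when totals matter, because it
takes every observation into account.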
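
The effect of a single outlier on both measures can be sketched with Python's `statistics` module. The hourly packet counts below are invented for illustration.

```python
import statistics

# Hypothetical packets per hour; the last value is a traffic spike.
packets_per_hour = [120, 130, 125, 118, 122, 127, 9000]

mean_value = statistics.mean(packets_per_hour)
median_value = statistics.median(packets_per_hour)

# The spike drags the mean far above the typical value,
# while the median stays in the middle of the normal hours.
print(f"mean:   {mean_value:.1f}")
print(f"median: {median_value}")
```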

## Analyzing a Short Darkspace Period

>>>
Do values in Table A and Table B coincide? If not, why?
>>>

The values mostly coincide, except for the sums, of course. This is to be
expected since both datasets cover the same time frame. The standard deviation
of bytes is higher in the daily table, probably because it includes times
outside of this particular month in which a lot of bytes were sent.

>>>
Histograms, but particularly box plots, corresponding to hourly counts might
differ from the equivalent histograms and box plots calculated with daily
averaged data. Do you know why? Can you find an explanation?
>>>

The more fine-grained hourly data leads to more striking differences,
especially in the box plots. Averaging over a whole day smooths out short spikes
in traffic, whereas in the hourly counts those spikes stay pronounced. It
therefore usually takes less data to elongate the whiskers of the hourly box
plots or to produce outliers.
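
The smoothing effect of daily averaging can be sketched as follows. The series is synthetic (three days of constant traffic with one hourly spike) and only meant to illustrate the idea.

```python
import statistics

HOURS_PER_DAY = 24

# Synthetic hourly counts for 3 days: flat traffic plus one spike on day 2.
hourly = [100] * (3 * HOURS_PER_DAY)
hourly[30] = 2500

# Daily averages computed from the same data.
daily = [
    statistics.mean(hourly[d * HOURS_PER_DAY:(d + 1) * HOURS_PER_DAY])
    for d in range(3)
]

# Compare how far the largest value sits above the median in each view.
hourly_ratio = max(hourly) / statistics.median(hourly)
daily_ratio = max(daily) / statistics.median(daily)
print(f"hourly max/median: {hourly_ratio}")
print(f"daily  max/median: {daily_ratio}")
```

The spike is 25 times the median in the hourly view but only twice the median after daily averaging, so it stretches the hourly box plot far more.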

>>>
Make sure that you are familiar with the three main protocols appearing in the
`team13_protocol.csv` file. You should know their definition and what they are
used for.
>>>

The three protocols, with their IP protocol numbers, are:

* ICMP (Internet Control Message Protocol), protocol number 1
* TCP (Transmission Control Protocol), protocol number 6
* UDP (User Datagram Protocol), protocol number 17
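
As a small sketch, these protocol numbers can be mapped to names with the constants from Python's `socket` module. It assumes that the protocol column of the CSV holds the numeric identifier; `protocol_name` is a hypothetical helper.

```python
import socket

# IP protocol numbers as exposed by the standard library.
PROTOCOLS = {
    socket.IPPROTO_ICMP: "ICMP",  # 1
    socket.IPPROTO_TCP: "TCP",    # 6
    socket.IPPROTO_UDP: "UDP",    # 17
}

def protocol_name(number):
    """Return the protocol name for a number, or a generic label."""
    return PROTOCOLS.get(number, f"other ({number})")

for number in (1, 6, 17, 47):
    print(number, protocol_name(number))
```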

ICMP is mostly used for error reporting and diagnostics. Devices send ICMP
packets, for example, to check that a particular host is reachable (ping) or to
tell the sender that a packet was too large to be forwarded without
fragmentation. ICMP can also be abused in DoS attacks, where victims are flooded
with echo requests (ping flood) or hit with malformed oversized packets (ping of
death).

TCP is the backbone of the internet, as most HTTP(S) traffic (HTTP/1.1 and
HTTP/2) is sent over TCP. It is connection-oriented: client and server establish
a session before exchanging data. TCP is well-suited for applications that
require data to arrive in order and cannot tolerate dropped packets.

UDP is the counterpart of TCP in that it is connectionless: no session is
established between client and server, and delivery is not guaranteed. Due to
this property, it lends itself well to applications such as VoIP, where data has
to be sent quickly and out-of-order or dropped packets matter little.

>>>
Did you get negative values in [rep-19]? Can you figure out why? And why not in
the case of packets?
>>>

The negative values come from the fact that some source IPs appear under several
protocols (ICMP, TCP and UDP). The same goes for the destination IPs. Adding the
per-protocol shares together therefore gives more than 100%, so the remaining
share of IPs _not_ belonging to these protocols comes out below 0%. This cannot
happen with packets, because each packet belongs to exactly one protocol: it is
sent over either ICMP, TCP, UDP or some other protocol, never several at once.
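
The double counting can be reproduced with a few sets. The addresses and packet counts below are invented, and the "remaining share" is computed as 100% minus the three per-protocol shares, which is an assumption about how [rep-19] derives it.

```python
# Hypothetical unique source IPs seen per protocol.
sources_by_protocol = {
    "ICMP": {"192.0.2.1", "192.0.2.2"},
    "TCP": {"192.0.2.1", "192.0.2.3", "192.0.2.4"},
    "UDP": {"192.0.2.1", "192.0.2.4"},
    "other": {"192.0.2.5"},
}
main_protocols = ("ICMP", "TCP", "UDP")

# Unique sources overall: an IP used with several protocols counts once.
all_sources = set().union(*sources_by_protocol.values())

# Per-protocol share of unique sources; overlapping IPs are counted repeatedly.
ip_shares = {
    p: 100 * len(sources_by_protocol[p]) / len(all_sources)
    for p in main_protocols
}
other_ip_share = 100 - sum(ip_shares.values())

# Packets belong to exactly one protocol, so their shares cannot overlap.
packets_by_protocol = {"ICMP": 10, "TCP": 70, "UDP": 15, "other": 5}
total_packets = sum(packets_by_protocol.values())
other_packet_share = 100 - sum(
    100 * packets_by_protocol[p] / total_packets for p in main_protocols
)

print(f"remaining IP share:     {other_ip_share:.1f}%")
print(f"remaining packet share: {other_packet_share:.1f}%")
```

With five unique sources, the three shares add up to 140%, so the remainder is negative, whereas the packet remainder stays at the true 5%.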

## Analyzing Temporal Patterns

>>>
Do signals in [rep-20] show periodicities?
>>>

Yes. The number of unique TCP source IPs in particular shows a strong diurnal
pattern. It rises sharply every day from shortly after midnight until the
evening, then drops off steeply until midnight. There is also a distinct dip
around lunchtime.

The number of TCP packets, on the other hand, does not seem to have any obvious
periodicities.
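
One way to check for a diurnal pattern beyond visual inspection is the autocorrelation at a lag of 24 hours. The sketch below uses a synthetic hourly series with a 24-hour cycle, not the data from [rep-20]; `autocorrelation` is a hypothetical helper.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

# Synthetic hourly signal over 7 days with a 24-hour cycle (assumption).
hourly = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in range(7 * 24)]

r_24 = autocorrelation(hourly, 24)  # one full day later: same phase
r_12 = autocorrelation(hourly, 12)  # half a day later: opposite phase

print(f"lag 24h: {r_24:.2f}")
print(f"lag 12h: {r_12:.2f}")
```

A clearly positive value at lag 24 combined with a negative value at lag 12 points to a daily cycle; a signal without periodicity would show values near zero at both lags.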