One of the many perks of running QoSient Argus on a perimeter device is that, by default, it generates TCP metrics.
Argus by default, generates extended metrics for TCP that include the connection setup time, window sizes, base sequence numbers, and retransmission counters. You can suppress this detailed information using this variable.
After noticing significant latency to a cloud hosted Virtual Machine (VM), I ran traceroute, which showed my Internet Service Provider (ISP) making questionable routing decisions. That being said, Border Gateway Protocol (BGP) is a dynamic routing protocol. This means routing decisions should only be of concern if they are made consistently and provide degraded service compared with the prior configuration. This left me to answer two questions:
- When did the change occur?
- Was service degraded since the change?
Identify Routing Changes
Having already identified the questionable routing decision, I set out to determine when this routing change was implemented. Looking through the ra manual page showed the following:
dhops: estimate of number of IP hops from dst to this point.
Doing some basic tests with "uniq -c" proved that I should graph this data to take a closer look.
Although ragraph is included with argus-clients, my particular system is unable to run it, so a delimited file will be produced for use in Excel.
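The exact "uniq -c" test is not shown, so the following is only a minimal sketch of what such a check could look like. It assumes pipe-delimited "stime|dhops" records on stdin, possibly with a column-label line and rows with an empty dhops value, both of which are skipped. The function name count_dhops is hypothetical.

```shell
# Hypothetical helper: count how often each dhops value appears.
# Reads pipe-delimited "stime|dhops" records from stdin.
count_dhops() {
    # Keep only rows whose second field is a whole number; this drops
    # any column-label line and rows with an empty dhops value.
    awk -F'|' '$2 ~ /^[0-9]+$/ { print $2 }' | sort -n | uniq -c
}
```

If more than one hop count shows up with a meaningful share of the flows, that is a hint the path changed at some point and the data is worth graphing over time.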
The following command will generate a pipe (“|”) delimited file showing the start time (stime) of the flow and the estimated number of hops (dhops). The stime value will be in epoch and grep is used to remove any flows that don’t have a value for dhops. For the -m switch, “correct” is used to consolidate flows into a single flow record.
racluster -L 0 -c "|" -u -r /var/argus/archive/2019.*.out -m correct -s stime dhops - dst host 188.8.131.52 | grep -v "^.*|$" > /var/argus/184.108.40.206-dhops.csv
From here the data was a straight copy/paste into Excel which showed exactly when the routing change occurred.
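For anyone without Excel at hand, the change point can also be pulled out of the delimited file directly. This is a rough sketch rather than what was used here: it assumes "stime|dhops" records with epoch start times, sorts them by time, and prints every record where dhops differs from the previous one. The function name dhops_changes is hypothetical.

```shell
# Hypothetical helper: report each point where dhops differs from the
# previous record. Reads "stime|dhops" lines from stdin.
dhops_changes() {
    # Sort numerically on the epoch start time so records are in order.
    sort -t'|' -k1,1n |
    awk -F'|' '
        $2 !~ /^[0-9]+$/ { next }    # skip label line and empty values
        prev != "" && $2 != prev { printf "%s|%s->%s\n", $1, prev, $2 }
        { prev = $2 }
    '
}
```

Running it as `dhops_changes < /var/argus/184.108.40.206-dhops.csv` prints one line per change. A path that flaps back and forth will produce many lines, in which case the graph is still the better view.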
Identify Service Degradation
Next was to identify any service degradation since the routing change occurred. Going back through the ra manual page revealed:
tcprtt: TCP connection setup round-trip time, the sum of ’synack’ and ’ackdat’
Charting that will surely show latency. Modifying the above command to include the new field produced the following:
racluster -L 0 -c "|" -u -r /var/argus/archive/2019.*.out -m correct -s stime tcprtt - dst host 220.127.116.11 | grep -v "^.*|$" > /var/argus/18.104.22.168-tcprtt.csv
After doing a copy/paste into Excel and creating a graph, it was clear that significant latency was introduced for a period of time after the change occurred. However, performance has since returned to where it was well before the change was made.
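The graph can be backed up with a quick numeric comparison. The sketch below assumes "stime|tcprtt" records on stdin and takes the epoch time of the routing change as its only argument; it prints the record count and mean tcprtt on either side of that time, in whatever unit the tcprtt column uses. The function name rtt_before_after is hypothetical.

```shell
# Hypothetical helper: mean tcprtt before and after a cutoff time.
# usage: rtt_before_after CUTOFF_EPOCH < stime-tcprtt-file
rtt_before_after() {
    awk -F'|' -v cutoff="$1" '
        $2 !~ /^[0-9.]+$/ { next }               # skip label line and empty values
        $1 < cutoff { bsum += $2; bn++; next }   # flows before the change
        { asum += $2; an++ }                     # flows at or after the change
        END {
            if (bn) printf "before|%d|%.6f\n", bn, bsum / bn
            if (an) printf "after|%d|%.6f\n", an, asum / an
        }
    '
}
```

A mean smooths over short spikes, so it complements the chart rather than replacing it. Running it again with a later cutoff is one way to check whether latency has recovered.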
Excel – Converting Epoch
Converting the epoch stime to a user friendly format requires two steps.
The first step is to insert a new column and use the following formula. Note that A2 is the cell containing the epoch timestamp.
=A2/(24*60*60) + DATE(1970,1,1)
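The formula works because Excel stores dates as a count of days, so dividing the epoch seconds by 86400 gives days since 1970 and adding DATE(1970,1,1) shifts that onto Excel's calendar. As a sanity check outside Excel, the same arithmetic can be sketched as follows, assuming Excel's default 1900 date system, in which DATE(1970,1,1) evaluates to 25569. The function name epoch_to_excel_serial is hypothetical, and the result is in UTC just like the formula's.

```shell
# Hypothetical helper: convert an epoch timestamp to an Excel serial
# date number, mirroring =A2/(24*60*60) + DATE(1970,1,1).
# Assumes the 1900 date system, where DATE(1970,1,1) is 25569.
epoch_to_excel_serial() {
    awk -v e="$1" 'BEGIN { printf "%.6f\n", e / 86400 + 25569 }'
}
```

For example, `epoch_to_excel_serial 1546300800` (midnight UTC on 1 January 2019) gives 43466.000000, which should match what the new column shows for the same epoch value before any date formatting is applied.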
Finally, format the cells using the following Custom type.
QoSient Argus provides a wealth of information. Just have a look at the ra manual page. If you don’t suffer from information overload then your name must be Carter and I’m honored you’re reading this.
Solutions such as Smokeping would provide similar results; however, they must be configured to monitor specific metrics and destinations. Performance data is only generated when tests are launched, and those tests create a surprising amount of traffic.
This is in contrast to Argus, where traffic is passively monitored and metrics are generated from the flows it sees. A very handy utility in one’s toolbox.