One of the many perks of running QoSient Argus on a perimeter device is that it generates TCP performance metrics by default.
ARGUS_GENERATE_TCP_PERF_METRIC
Argus by default, generates extended metrics for TCP that include the connection setup time, window sizes, base sequence numbers, and retransmission counters. You can suppress this detailed information using this variable.
Source
After noticing significant latency to a cloud-hosted Virtual Machine (VM), I ran traceroute, which showed my Internet Service Provider (ISP) making questionable routing decisions. That said, Border Gateway Protocol (BGP) is a dynamic routing protocol, so a routing decision is only a concern if it is made consistently and degrades service compared to the prior configuration. This left me with two questions to answer:
- When did the change occur?
- Was service degraded since the change?
Identify Routing Changes
Having already identified the questionable routing decision, I set out to determine when the routing change was implemented. Looking through the ra manual page showed the following:
dhops
estimate of number of IP hops from dst to this point.
Some basic tests with "uniq -c" suggested that I should graph this data to take a closer look.
While ragraph is included with argus-clients, my particular system is unable to run it, so a delimited file will be produced for use in Excel.
The following command will generate a pipe (“|”) delimited file showing the start time (stime) of the flow and the estimated number of hops (dhops). The stime value will be in epoch time, and grep is used to remove any flows that don’t have a value for dhops. For the -m switch, “correct” is used to consolidate flows into a single flow record.
racluster -L 0 -c "|" -u -r /var/argus/archive/2019.*.out -m correct -s stime dhops - dst host 1.1.1.1 | grep -v "^.*|$" > /var/argus/1.1.1.1-dhops.csv
From here the data was a straight copy/paste into Excel, where a chart showed exactly when the routing change occurred.
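For anyone who would rather skip the spreadsheet, here is a minimal Python sketch of the same idea: read the pipe-delimited stime/dhops rows and report each point where the hop estimate changes. The function names and the sample rows (including the header labels) are made up for illustration and are not real Argus output; in practice the lines would come from the file written by the command above.

```python
def parse_rows(lines, sep="|"):
    """Yield (epoch_seconds, value) pairs, skipping headers and empty fields."""
    for line in lines:
        parts = line.strip().split(sep)
        if len(parts) < 2:
            continue
        try:
            yield float(parts[0]), float(parts[1])
        except ValueError:
            # Column-label row or a record with no value: ignore it.
            continue


def hop_changes(rows):
    """Return (epoch, old_hops, new_hops) each time the hop estimate changes."""
    changes = []
    previous = None
    for stime, hops in sorted(rows):
        if previous is not None and hops != previous:
            changes.append((stime, previous, hops))
        previous = hops
    return changes


# Invented sample rows, shaped like the delimited file (assumption).
SAMPLE = [
    "StartTime|dHops",
    "1546300800.000000|9",
    "1546304400.000000|9",
    "1546308000.000000|12",
    "1546311600.000000|",
]

for stime, old, new in hop_changes(parse_rows(SAMPLE)):
    print(f"{stime:.0f}: dhops {old:.0f} -> {new:.0f}")
```

Since dhops is only an estimate, real data may flap between values; a chart, or a filter that requires the new value to persist for several flows, is likely to be easier to read than the raw list of changes.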
Identify Service Degradation
Next was to identify any service degradation since the routing change occurred. Going back through the ra manual page revealed:
tcprtt
TCP connection setup round-trip time, the sum of ’synack’ and ’ackdat’
Charting that will surely show latency. Modifying the above command to include the new field produced the following:
racluster -L 0 -c "|" -u -r /var/argus/archive/2019.*.out -m correct -s stime tcprtt - dst host 1.1.1.1 | grep -v "^.*|$" > /var/argus/1.1.1.1-tcprtt.csv
After doing a copy/paste into Excel and creating a graph, it was clear that significant latency was introduced for a period of time after the change occurred. However, performance has since returned to the levels seen well before the change was made.
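As a rough alternative to charting every flow, the tcprtt rows can be reduced to one median per day, which makes a temporary latency bump easier to spot. This is a sketch only: the function names and sample rows are invented for illustration, and days are grouped in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median


def parse_rows(lines, sep="|"):
    """Yield (epoch_seconds, value) pairs, skipping headers and empty fields."""
    for line in lines:
        parts = line.strip().split(sep)
        if len(parts) < 2:
            continue
        try:
            yield float(parts[0]), float(parts[1])
        except ValueError:
            continue


def daily_median(rows):
    """Map each UTC date (YYYY-MM-DD) to the median value seen that day."""
    buckets = defaultdict(list)
    for stime, value in rows:
        day = datetime.fromtimestamp(stime, tz=timezone.utc).date().isoformat()
        buckets[day].append(value)
    return {day: median(values) for day, values in sorted(buckets.items())}


# Invented sample rows, shaped like the delimited file (assumption).
SAMPLE = [
    "StartTime|TcpRtt",
    "1546300800.000000|0.020000",
    "1546304400.000000|0.022000",
    "1546308000.000000|0.030000",
    "1546387200.000000|0.080000",
]

for day, rtt in daily_median(parse_rows(SAMPLE)).items():
    print(f"{day}: median tcprtt {rtt:.6f}")
```

The median is used instead of the mean so that a handful of slow handshakes does not dominate a day's figure.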
Excel – Converting Epoch
Converting the epoch stime to a user-friendly format requires two steps.
Formula
The first step is to insert a new column and use the following formula. Note that A2 is the cell containing the epoch timestamp.
=A2/(24*60*60) + DATE(1970,1,1)
Format Cells
Finally, format the cells using the following Custom type.
m/d/yyyy h:mm:ss.000
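To sanity-check the conversion outside Excel, the same arithmetic can be sketched in Python. The constant 25569 is the serial number Excel assigns to DATE(1970,1,1) in its default 1900 date system; the function names are my own. Note that the result is in UTC, so a local-time view needs the timezone offset added.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60
EXCEL_SERIAL_1970 = 25569  # DATE(1970,1,1) in Excel's default 1900 date system


def epoch_to_excel_serial(epoch):
    """Mirror the worksheet formula =A2/(24*60*60) + DATE(1970,1,1)."""
    return epoch / SECONDS_PER_DAY + EXCEL_SERIAL_1970


def epoch_to_utc(epoch):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


print(epoch_to_excel_serial(1546300800))     # 43466.0
print(epoch_to_utc(1546300800).isoformat())  # 2019-01-01T00:00:00+00:00
```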
Closing
QoSient Argus provides a wealth of information. Just have a look at the ra manual page. If you don’t suffer from information overload then your name must be Carter and I’m honored you’re reading this.
Solutions such as Smokeping would provide similar results; however, they must be configured to monitor specific metrics and destinations. Performance data is only generated when tests are launched, and those tests create a surprising amount of traffic.
This is in contrast to Argus, where traffic is passively monitored and metrics are generated from the flows it sees. A very handy utility in one’s toolbox.