QoSient Argus – Identifying Changes in the Network

One of the many perks of running QoSient Argus on a perimeter device is that it generates TCP performance metrics by default.

ARGUS_GENERATE_TCP_PERF_METRIC
Argus by default, generates extended metrics for TCP that include the connection setup time, window sizes, base sequence numbers, and retransmission counters. You can suppress this detailed information using this variable.
Source

After noticing significant latency to a cloud-hosted Virtual Machine (VM), I ran traceroute, which showed my Internet Service Provider (ISP) making questionable routing decisions. That being said, Border Gateway Protocol (BGP) is a dynamic routing protocol, so a routing decision should only be of concern if it is made consistently and provides degraded service compared to the prior path. This left me with two questions to answer:

  1. When did the change occur?
  2. Was service degraded since the change?

Identify Routing Changes

Having already identified the questionable routing decision, I set out to determine when this routing change was implemented. Looking through the ra manual page showed the following:

dhops
estimate of number of IP hops from dst to this point.

Some basic tests with "uniq -c" convinced me to graph this data for a closer look.
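A quick tally like that can be sketched as a small shell function. This is only an illustration, not the exact test that was run: the function name count_dhops is hypothetical, and it assumes pipe-delimited "stime|dhops" rows on standard input, such as the racluster command in this post produces.

```shell
# Hypothetical helper: count how often each dhops value appears.
# Assumes pipe-delimited "stime|dhops" rows on stdin.
# Rows without a purely numeric dhops value (a header row, empty values) are skipped.
count_dhops() {
    awk -F'|' '
        { gsub(/[ \t]/, "", $2) }        # drop any padding around the value
        $2 ~ /^[0-9]+$/ { print $2 }     # keep only numeric hop counts
    ' | sort -n | uniq -c
}
```

Each output line is a count followed by a dhops value. More than one value with a meaningful count is a hint that the path length changed at some point.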

While ragraph is included with argus-clients, my particular system is unable to run it, so I produced a delimited file for use in Excel instead.

The following command generates a pipe (“|”) delimited file containing the start time (stime) of each flow and the estimated number of hops (dhops). The -u switch prints stime as a Unix epoch timestamp, and grep removes any flows that don’t have a value for dhops. For the -m switch, “correct” is used to consolidate flows into a single flow record.

racluster -L 0 -c "|" -u -r /var/argus/archive/2019.*.out -m correct -s stime dhops - dst host 1.1.1.1 | grep -v "^.*|$" > /var/argus/1.1.1.1-dhops.csv

From here the data was a straight copy/paste into Excel, where graphing it showed exactly when the routing change occurred.
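For anyone without Excel handy, the same change points can be pulled out on the command line. The following is a minimal sketch under stated assumptions: dhops_changes is a hypothetical function name, the input is the "stime|dhops" file generated above, and rows are sorted by stime first in case they are not already in time order.

```shell
# Hypothetical helper: print each point where dhops differs from the previous flow.
# Assumes pipe-delimited "stime|dhops" rows on stdin, with stime in epoch seconds.
dhops_changes() {
    LC_ALL=C sort -t'|' -k1,1n | awk -F'|' '
        # Skip the header row and anything without numeric values.
        $1 ~ /^[0-9]/ && $2 ~ /^[0-9]+$/ {
            if (prev != "" && $2 != prev)
                print $1, prev, "->", $2     # stime of first flow with the new value
            prev = $2
        }'
}
```

Usage would look like "dhops_changes < /var/argus/1.1.1.1-dhops.csv". Since dhops is only an estimate, a few isolated flips are probably noise; a value that changes once and then stays put is the more interesting signal.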

Identify Service Degradation

Next was to identify any service degradation since the routing change occurred. Going back through the ra manual page revealed:

tcprtt
TCP connection setup round-trip time, the sum of ’synack’ and ’ackdat’

Charting that field should reveal any change in latency. Swapping dhops for tcprtt in the above command produced the following:

racluster -L 0 -c "|" -u -r /var/argus/archive/2019.*.out -m correct -s stime tcprtt - dst host 1.1.1.1 | grep -v "^.*|$" > /var/argus/1.1.1.1-tcprtt.csv

After doing a copy/paste into Excel and creating a graph, it was clear that significant latency was introduced for a period of time after the change occurred. However, performance has since returned to how things were well before the change was made.
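A rough before/after comparison can also be done without a chart by averaging tcprtt per day. This is a sketch rather than the method used for the graph: daily_avg_rtt is a hypothetical name, the input is assumed to be the "stime|tcprtt" file generated above, and days are bucketed on UTC boundaries.

```shell
# Hypothetical helper: average tcprtt per UTC day.
# Assumes pipe-delimited "stime|tcprtt" rows on stdin, with stime in epoch seconds.
# Output: "<epoch of day start> <mean tcprtt>", one line per day, oldest first.
daily_avg_rtt() {
    awk -F'|' '
        # Skip the header row and anything without numeric values.
        $1 ~ /^[0-9]/ && $2 ~ /^[0-9.]+$/ {
            day = int($1 / 86400) * 86400    # truncate to midnight UTC
            sum[day] += $2
            n[day]++
        }
        END { for (d in sum) printf "%d %.6f\n", d, sum[d] / n[d] }
    ' | sort -n
}
```

Usage would look like "daily_avg_rtt < /var/argus/1.1.1.1-tcprtt.csv". A daily mean hides short spikes, so it complements the graph rather than replacing it.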

Excel – Converting Epoch

Converting the epoch stime to a user-friendly format requires two steps.

Formula

The first step is to insert a new column and use the following formula. Note that A2 is the cell containing the epoch timestamp.

=A2/(24*60*60) + DATE(1970,1,1)
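To see why the formula works: Excel stores dates as a count of days, and DATE(1970,1,1) evaluates to serial number 25569. Dividing the epoch value by 86400 (24*60*60) turns seconds into days since 1970, so an epoch of 1546300800 becomes 17897 + 25569 = 43466, which Excel displays as 1/1/2019. The result is in UTC, since that is what epoch time counts from. The same arithmetic can be checked from the shell with a small sketch; epoch_to_excel_serial is a hypothetical helper name.

```shell
# Hypothetical helper: mirror the Excel formula =A2/(24*60*60) + DATE(1970,1,1).
# 25569 is the Excel serial number for 1970-01-01.
epoch_to_excel_serial() {
    awk -v t="$1" 'BEGIN { printf "%.6f\n", t / 86400 + 25569 }'
}

# epoch_to_excel_serial 1546300800   prints 43466.000000 (midnight UTC, 1/1/2019)
# epoch_to_excel_serial 1546344000   prints 43466.500000 (noon UTC, same day)
```

The fractional part of the serial number is the time of day, which is what the custom cell format in the next step renders.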

Format Cells

Finally, format the cells using the following Custom type.

m/d/yyyy h:mm:ss.000

Closing

QoSient Argus provides a wealth of information. Just have a look at the ra manual page. If you don’t suffer from information overload then your name must be Carter and I’m honored you’re reading this.

Solutions such as Smokeping would provide similar results; however, they must be configured to monitor specific metrics and destinations. Performance data is only generated when tests are launched, and those tests create a surprising amount of traffic.

This is in contrast to Argus where traffic is passively monitored and metrics get generated based on the flows it sees. A very handy utility in one’s toolbox.
