[ICANN-CSC] Detailed information on recent data collection issue

Wed Sep 13 19:08:55 UTC 2017

Dear colleagues,

On the last CSC call, PTI staff were asked to provide a fuller explanation of the data collection issue we identified in the July 2017 report. Here is more information on the topic. If there are further questions, please do not hesitate to ask.

Summary:

During July 2017, PTI identified that service-level-related event data was not fully recorded for part of the reporting period. The majority of this data was subsequently recovered by restoring backups and by augmenting it with alternative logging mechanisms built into our system.

Detailed Explanation:

The Root Zone Management System (RZMS) emits SLA-related events onto a messaging bus (Apache Kafka). Each event carries an event type, a high-resolution timestamp, and other metadata used to categorize it according to the SLA categories. A consumer application later retrieves these events and deposits them into an SLA database for long-term storage, which underpins the SLE dashboard and the tools used to produce the monthly CSC reports. While compiling the monthly CSC report, staff identified anomalies in the data, which investigation traced to a gap in SLE events for a period of time. RZMS logging suggests all events were successfully deposited on the messaging bus, but the consumer application was not able to retrieve all of them.
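
For illustration only, the kind of reconciliation described above can be sketched as a comparison between the events the producer logged as emitted and the events that reached the SLA database. This is a minimal sketch, not the actual RZMS or dashboard code: the record layout (`event_id`, `event_type`, `timestamp`) and the function names `find_missing_events` and `gap_window` are hypothetical, and it assumes each event carries a unique identifier.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SlaEvent:
    # Hypothetical record layout; the real event schema is not described here.
    event_id: str
    event_type: str
    timestamp: datetime


def find_missing_events(emitted, stored):
    """Return events present in the producer log but absent from the database,
    ordered by timestamp."""
    stored_ids = {e.event_id for e in stored}
    missing = [e for e in emitted if e.event_id not in stored_ids]
    return sorted(missing, key=lambda e: e.timestamp)


def gap_window(missing):
    """Return (first, last) timestamps spanned by the missing events, or None."""
    if not missing:
        return None
    return missing[0].timestamp, missing[-1].timestamp


if __name__ == "__main__":
    def at(day):
        return datetime(2017, 7, day, tzinfo=timezone.utc)

    # Made-up sample data: four emitted events, two of which were stored.
    emitted = [SlaEvent(f"e{d}", "example-type", at(d)) for d in (1, 2, 3, 4)]
    stored = [emitted[0], emitted[3]]
    missing = find_missing_events(emitted, stored)
    print([e.event_id for e in missing])
    print(gap_window(missing))
```

The sample data in the usage section is invented purely to exercise the functions; it does not reflect the events or dates affected in July 2017.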

Root Cause Analysis:

To date, we have not definitively identified the underlying technical issue that caused this data capture outage, and thus cannot be sure the issue will not recur. We have improved logging and alerting in the system, which should help us detect and diagnose any recurrence. We’ve upgraded Kafka to the latest version in case the issue was caused by a software bug that has since been addressed upstream. In case we are unable to satisfactorily diagnose the root cause, we are also analyzing whether the message flow can be improved in subsequent versions of RZMS and the SLE Dashboard.

Mitigations:

While the root cause is not fully understood, additional mitigations have been put in place that will limit its impact should it recur, and should allow for complete recovery of any missing data if the issue were to manifest again:

1. Kafka's message retention has been increased from 14 days to 60 days;
2. Alarms have been implemented to notify engineering staff and PTI staff if the dashboard has not seen events for more than 24 hours (we are studying a heartbeat message for the future that would allow us to tune this to a much smaller interval);
3. We have implemented more comprehensive logging of the data via conventional system logs, which are archived via a wholly different path, so that should there be a Kafka-related failure in the future, we have an alternative mechanism to recover all the missing messages.
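
As a rough illustration of the first two mitigations, the sketch below expresses the 24-hour staleness check and the retention window as simple time comparisons. It is a minimal sketch under stated assumptions, not the deployed alarm: the function names `is_stale`, `still_retained` and `retention_ms` are hypothetical, and the conversion to milliseconds assumes retention is configured through a millisecond-valued setting such as Kafka's topic-level `retention.ms`.

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=24)  # alarm threshold from item 2
OLD_RETENTION = timedelta(days=14)         # previous retention from item 1
NEW_RETENTION = timedelta(days=60)         # current retention from item 1


def is_stale(last_event_at, now, threshold=STALENESS_THRESHOLD):
    """True if no event has been seen for longer than the threshold."""
    return now - last_event_at > threshold


def still_retained(event_at, now, retention=NEW_RETENTION):
    """True if an event of this age would still be within the retention window."""
    return now - event_at <= retention


def retention_ms(retention):
    """Express a retention period in whole milliseconds."""
    return retention // timedelta(milliseconds=1)


if __name__ == "__main__":
    now = datetime(2017, 9, 13, tzinfo=timezone.utc)
    print(is_stale(now - timedelta(hours=25), now))   # True
    print(retention_ms(OLD_RETENTION))                # 1209600000
    print(retention_ms(NEW_RETENTION))                # 5184000000
    # A gap noticed 30 days after the fact is recoverable only with the longer window.
    month_old = now - timedelta(days=30)
    print(still_retained(month_old, now, OLD_RETENTION))  # False
    print(still_retained(month_old, now, NEW_RETENTION))  # True
```

The last two lines show why the longer window matters when a gap is only noticed while compiling a monthly report: an event that is 30 days old falls outside a 14-day window but well inside a 60-day one.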

Conclusion:

We believe the additional logging, monitoring and data retention described above give us confidence that we have a wholly independent method of recording events that can fully mitigate data loss in the future, while we continue to investigate the root cause of the issue with the primary mechanism. Should we be unable to satisfactorily identify the likely root cause of the earlier outage, we will evaluate replacing the current technique with an alternative method of capturing the messages in a later software update.

kim