[CPWG] ME community statement about the ICANN Open data platform

John McCormac jmcc at hosterstats.com
Thu Feb 3 18:54:43 UTC 2022


On 03/02/2022 15:26, Chokri Ben Romdhane via CPWG wrote:
> Dear Friends,
> During the ICANN72 ME space session 
> <https://72.schedule.icann.org/meetings/ir8CyynKdp3GwtbsY> , we 
> submitted a statement 
> <https://drive.google.com/file/d/1ZRqAXPrjcU1B9v_6ZwSdHoF-SgLjj_65/view> 
> to the board about the ICANN Open Data Platform, and we received the 
> following answers 
> <https://drive.google.com/file/d/1OoiqWDS7pkT_EplN5J_izfzSQj7aBJY-/view?usp=sharing> 
> from the Board.

In the presentation given, I think that Ashwin Rangan may have been 
unaware of the issues with the ODP when it came to the per-registrar 
data. The main problem with the per-registrar transactions was that 
importing the CSV files into the ODP was not a simple process, due to 
missing data, corrupted data and differing formats in the CSV files.
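
As a rough illustration of the kind of checks such an import needs, 
here is a minimal Python sketch. The column names (registrar-name, 
iana-id, total-domains) and the check_report function are assumptions 
made for the example; the real reports have many more fields and this 
is not how the ODP itself imports data.

```python
import csv
import io

# Hypothetical column subset; real per-registrar reports have more fields.
EXPECTED = ["registrar-name", "iana-id", "total-domains"]


def check_report(text):
    """Return a list of (line_number, problem) tuples for one CSV report."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return [(1, "empty file")]
    # Differing formats: tolerate case and padding differences in headers.
    header = [h.strip().lower() for h in header]
    missing = [c for c in EXPECTED if c not in header]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]
    for row in reader:
        line_no = reader.line_num
        # Corrupted data: a row with the wrong number of fields.
        if len(row) != len(header):
            problems.append(
                (line_no, "expected %d fields, got %d" % (len(header), len(row)))
            )
            continue
        record = dict(zip(header, row))
        # Missing or non-numeric data in the numeric columns.
        for col in ("iana-id", "total-domains"):
            value = record[col].strip()
            if not value:
                problems.append((line_no, "empty " + col))
            elif not value.isdigit():
                problems.append((line_no, "non-numeric %s: %r" % (col, value)))
    return problems
```

Running this over each file before import gives a list of line numbers 
and problems that can be reviewed by hand.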

The limitation of the ODP in handling what are effectively trivial 
datasets is disturbing. With the expansion of the numbers of gTLDs and 
subsequent rounds, the ODP, with a limited dataset licence, would 
quickly be of limited value. That should have been immediately obvious 
to ICANN.

The retention of CSVs in parallel with the ODP is the best strategy. 
This is because the CSV is a more robust format and errors are much 
easier to identify. This is how it was possible to identify the problems 
with the per-registrar data.

There is a serious normalisation problem with the per-registrar data in 
that some registries have their own names for the registrars. The 
language of the column headers is a relatively simple issue to solve 
with a properly designed database schema, but I am not sure how the ODP 
could handle multiple languages. I tried subscribing to the ICANN ME 
mailing list after the presentation.
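
On the registrar naming problem, one possible approach is to key on a 
stable identifier instead of the name. The minimal Python sketch below 
assumes each report row carries an IANA registrar ID alongside the 
name; the canonical_names function is hypothetical. It picks the most 
frequently used spelling per ID as the canonical name.

```python
from collections import Counter, defaultdict


def canonical_names(rows):
    """Map each IANA ID to its most frequently used registrar name.

    rows is an iterable of (iana_id, registrar_name) pairs gathered
    from all registries' reports.
    """
    seen = defaultdict(Counter)
    for iana_id, name in rows:
        # Collapse runs of whitespace so padding differences do not count
        # as separate spellings.
        seen[iana_id][" ".join(name.split())] += 1
    return {
        iana_id: names.most_common(1)[0][0]
        for iana_id, names in seen.items()
    }
```

The per-registry spellings can then be kept in a separate lookup table 
while all counts are stored against the ID.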

Though the ODP is a useful tool, it is lacking historical depth. Some of 
this is due to data formats and data being in PDF format (which varied 
from registry to registry) rather than CSV. I successfully 
reverse-engineered and extracted the data from most of these PDFs back 
to 2006 for some gTLDs to build a database of historical per-registrar 
transactions. It was an interesting exercise.

The formatting in the PDFs varied. Some of the data (deletion figures) 
for .COM and .NET was missing from the per-registrar reports until 
Verisign adopted the new reporting format. There were some other data 
quality issues that have persisted. The .AFRICA per-registrar reports 
have been missing the new-adds and renews data since the gTLD launched. 
The latest (October 2021) report for the gTLD is still missing this 
data.

The ODP offers a useful interface for dealing with the data but the best 
application would be one in Python, Ruby or another programming language 
that downloads datasets to be processed locally. The database schema for 
the per-registrar reports is standardised so it is easy enough to load 
this data into a database with a single statement. The schema for the 
other datasets is also available on the ODP, I think.
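
As a minimal sketch of the local processing idea, the Python below 
loads one already downloaded per-registrar CSV into a database. It uses 
SQLite from the Python standard library, not any particular database 
server, and the table name, the three-column subset and the load_report 
function are assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_report(conn, tld, month, text):
    """Load one per-registrar CSV report into SQLite; return rows loaded."""
    # Hypothetical table holding a small subset of the report columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS registrar_report ("
        "tld TEXT, month TEXT, iana_id INTEGER, "
        "registrar_name TEXT, total_domains INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        (tld, month, int(r["iana-id"]), r["registrar-name"],
         int(r["total-domains"]))
        for r in reader
    ]
    # One parameterised INSERT, executed once per row of the report.
    conn.executemany(
        "INSERT INTO registrar_report VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

With the reports for each month loaded this way, historical queries 
become ordinary SQL over a single table.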

Regards...jmcc
-- 
**********************************************************
John McCormac  *  e-mail: jmcc at hosterstats.com
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  Domnomics - the business of domain names
Ireland        *  https://amzn.to/2OPtEIO
IE             *  Skype: hosterstats.com
**********************************************************
