[CPWG] ME community statement about the ICANN Open data platform
John McCormac
jmcc at hosterstats.com
Thu Feb 3 18:54:43 UTC 2022
On 03/02/2022 15:26, Chokri Ben Romdhane via CPWG wrote:
> Dear Friends,
> During the ICANN72 ME space session
> <https://72.schedule.icann.org/meetings/ir8CyynKdp3GwtbsY> , we
> submitted a statement
> <https://drive.google.com/file/d/1ZRqAXPrjcU1B9v_6ZwSdHoF-SgLjj_65/view>
> to the board about the ICANN Open Data Platform, and we received the
> following answers
> <https://drive.google.com/file/d/1OoiqWDS7pkT_EplN5J_izfzSQj7aBJY-/view?usp=sharing>
> from the Board.
In the presentation given, I think that Ashwin Rangan may have been
unaware of the issues with the ODP when it came to the per-registrar
data. The problems with the per-registrar transactions were mainly that
the importation of the CSV files into the ODP was not a simple process
due to missing data, corrupted data and differing formats in the CSV files.
The limitation of the ODP in handling what are effectively trivial
datasets is disturbing. With the expansion of the numbers of gTLDs and
subsequent rounds, the ODP, with a limited dataset licence, would
quickly be of limited value. That should have been immediately obvious
to ICANN.
The retention of CSVs in parallel with the ODP is the best strategy.
This is because the CSV is a more robust format and errors are much
easier to identify. This is how it was possible to identify the problems
with the per-registrar data.
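
As a rough illustration of why errors are easy to spot in a CSV, the
following is a minimal Python sketch of the kind of check involved. It is
not the method actually used on the ICANN files. The function name
find_csv_problems, the sample data and the column names in it are
assumptions made for the example.

```python
import csv
import io


def find_csv_problems(text, expected_columns):
    """Return (line_number, message) pairs for rows that look broken."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return [(1, "empty file")]
    header = [name.strip() for name in header]

    # Differing formats: a column we rely on is absent from the header.
    missing = [name for name in expected_columns if name not in header]
    if missing:
        problems.append((1, "missing columns: " + ", ".join(missing)))

    for row in reader:
        # Corrupted data: the row has the wrong number of fields.
        if len(row) != len(header):
            problems.append(
                (reader.line_num,
                 f"expected {len(header)} fields, found {len(row)}")
            )
            continue
        # Missing data: a field is present but empty.
        for name, value in zip(header, row):
            if value.strip() == "":
                problems.append(
                    (reader.line_num, f"empty value in column {name}")
                )
    return problems


# Made-up sample: one short row and one row with an empty field.
SAMPLE = (
    "registrar-name,iana-id,total-domains\n"
    "Example Registrar A,90001,120\n"
    "Broken Row,90002\n"
    "Example Registrar C,,45\n"
)

if __name__ == "__main__":
    for line_number, message in find_csv_problems(
        SAMPLE, ["registrar-name", "iana-id", "total-domains"]
    ):
        print(line_number, message)
```

A check like this only flags structural problems. Figures that are
present but wrong still need to be compared against other reports.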
There is a serious normalisation problem with the per-registrar data in
that some registries have their own names for the registrars. The
language of the column headers is a relatively simple issue to solve with
a properly designed database schema, but I am not sure how the ODP could
handle multiple languages. I tried subscribing to the ICANN ME mailing
list after the presentation.
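
One possible way to handle the registrar name problem, sketched in Python
below, is to key everything on the registrar's IANA ID and pick a single
name for each ID. This assumes that each row carries both a name and an
IANA ID, here in columns called "registrar-name" and "iana-id". The
function names, the registrar names and the IDs are made up for the
example.

```python
from collections import Counter, defaultdict


def build_canonical_names(rows):
    """Map each IANA ID to the spelling of the name that is used most often."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["iana-id"]][row["registrar-name"].strip()] += 1
    return {
        iana_id: names.most_common(1)[0][0]
        for iana_id, names in counts.items()
    }


def normalise(rows, canonical):
    """Return copies of the rows with the registrar name replaced."""
    return [
        {**row, "registrar-name": canonical[row["iana-id"]]}
        for row in rows
    ]


# Made-up rows: the same registrar appears under two spellings.
ROWS = [
    {"registrar-name": "Example Registrar, Inc.", "iana-id": "90001"},
    {"registrar-name": "EXAMPLE REGISTRAR INC", "iana-id": "90001"},
    {"registrar-name": "Example Registrar, Inc.", "iana-id": "90001"},
    {"registrar-name": "Another Registrar Ltd", "iana-id": "90002"},
]

if __name__ == "__main__":
    names = build_canonical_names(ROWS)
    for row in normalise(ROWS, names):
        print(row["iana-id"], row["registrar-name"])
```

Choosing the most common spelling is only one option. A lookup table
built from a single authoritative list of registrars would be more
reliable where one is available.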
Though the ODP is a useful tool, it is lacking historical depth. Some of
this is due to data formats and data being in PDF format (which varied
from registry to registry) rather than CSV. I successfully
reverse-engineered and extracted the data from most of these PDFs back
to 2006 for some gTLDs to build a database of historical per-registrar
transactions. It was an interesting exercise.
The formatting in the PDFs varied. Some of the data (deletion figures)
for .COM and .NET was missing from the per-registrar reports until
Verisign adopted the new reporting format. There were some other data
quality issues that have persisted. The .AFRICA per-registrar reports
have been missing the new-adds and renews data and have been so since
the gTLD launched. The latest (October 2021) report for the gTLD is
still missing this data.
The ODP offers a useful interface for dealing with the data but the best
application would be one in Python, Ruby or another programming language
to download datasets to be processed locally. The database schema for
the per-registrar reports is standardised so it is easy enough to load
this data into a database with a single statement. The schema for the
other datasets is also available on the ODP, I think.
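
To give an idea of what local processing might look like, here is a
minimal Python sketch that loads a per-registrar CSV into an SQLite
database. It uses executemany as a stand-in for the single bulk-load
statement that a database server would offer. The table layout, the
function name load_report and the three columns used are assumptions for
the example and are only a small subset of what a real report contains.

```python
import csv
import io
import sqlite3


def load_report(conn, tld, month, text):
    """Load one per-registrar CSV report into the transactions table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "tld TEXT, month TEXT, registrar_name TEXT, "
        "iana_id INTEGER, total_domains INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        (tld, month, r["registrar-name"],
         int(r["iana-id"]), int(r["total-domains"]))
        for r in reader
    ]
    # The with block commits on success and rolls back on an error.
    with conn:
        conn.executemany(
            "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)


# Made-up report with two registrars.
REPORT = (
    "registrar-name,iana-id,total-domains\n"
    "Example Registrar A,90001,120\n"
    "Example Registrar B,90002,45\n"
)

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(load_report(db, "example", "2021-10", REPORT), "rows loaded")
    total = db.execute(
        "SELECT SUM(total_domains) FROM transactions WHERE tld = ?",
        ("example",),
    ).fetchone()[0]
    print("total domains:", total)
```

Storing the TLD and the report month alongside each row makes it
possible to keep many registries and many months in the same table and
query across them.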
Regards...jmcc
--
**********************************************************
John McCormac * e-mail: jmcc at hosterstats.com
MC2 * web: http://www.hosterstats.com/
22 Viewmount * Domain Registrations Statistics
Waterford * Domnomics - the business of domain names
Ireland * https://amzn.to/2OPtEIO
IE * Skype: hosterstats.com
**********************************************************