Every night, the implementation of HDX freshness in Python reads all the datasets from HDX (using the HDX Python library) and then iterates through them one by one performing a sequence of steps.
At a high level it does the following:
Compares expected update frequency with time since last update to give status: fresh, due, overdue or delinquent.
eg. update frequency = every week, last updated 5 days ago:
7 days > 5 days => Fresh
If last updated:
0-6 days ago => Fresh
7-13 days ago => Due
14-21 days ago => Overdue
>21 days ago => Delinquent
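The weekly example above can be written as a small lookup. This is only a sketch: the thresholds are copied from the weekly example, and the names `WEEKLY_THRESHOLDS` and `weekly_status` are hypothetical and not taken from the implementation. Thresholds for other update frequencies are not given here, so the sketch does not try to cover them.

```python
from datetime import datetime

# Upper bounds in days (inclusive) for a weekly update frequency, copied from
# the example above. Anything older than the last bound is Delinquent.
WEEKLY_THRESHOLDS = [(6, "Fresh"), (13, "Due"), (21, "Overdue")]


def weekly_status(last_modified: datetime, now: datetime) -> str:
    """Return the status of a dataset that is expected to update every week."""
    days_since_update = (now - last_modified).days
    for upper_bound, status in WEEKLY_THRESHOLDS:
        if days_since_update <= upper_bound:
            return status
    return "Delinquent"


now = datetime(2017, 6, 30)
print(weekly_status(datetime(2017, 6, 25), now))  # last updated 5 days ago: Fresh
```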
The following goes into more detail:
The steps involved in data freshness are:
- Read the dataset last modified date from HDX (which updates if the data of any resource changes) and the resource last modified dates (which update if the resource data is changed, so each should be covered by the dataset last modified date).
- Load the previous run's last modified dates and replace dates if more recent.
- Check the dataset's resources. If any have not been hashed for 30 days and the number of datasets unhashed in 30 days is less than 1/30 of the number of resources, then mark those resources as to be hashed.
- Get the dataset's update frequency if it has one. If that update frequency is Never (-1 in API), Live (0) or Adhoc (-2) then the dataset is always fresh.
- If the dataset is not fresh based on metadata, then examine the urls of the resources. If they are internal urls (data.humdata.org - the HDX filestore, manage.hdx.rwlabs.org - CPS) then there is no further checking that can be done because when the files pointed to by these urls update, the HDX metadata also updates.
- If the url is externally hosted, open an HTTP GET request to the file and check the header returned for the Last-Modified field. If that field exists, read the date and time from it and, if it is more recent than the resource last modified date, replace the resource date with it.
- If the resource is not fresh by this measure, download the file and calculate an MD5 hash for it.
- Check if the hash has changed compared with hash from the previous run stored in the database.
- There are some resources where the hash changes constantly because they connect to an API which generates a file on the fly. To identify these, download and hash again and check if the hash has changed in the few seconds since the previous hash calculation. If so, we cannot determine freshness.
- If the hash has changed since the last run, but doesn't change constantly (so the resource is not an API), store the hash and replace the resource last modified date with the current date (now).
- Record the dataset and resources last modified dates.
- Calculate freshness by measuring the difference between the current date and last modified date and comparing with the update frequency, setting a status: fresh, due, overdue or delinquent.
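The two checks on externally hosted resources (the Last-Modified header and the MD5 hash) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `newer_header_date` and `check_hash` are hypothetical names, the response headers are passed in as a plain mapping, and the download is an injected callable so the sketch runs without network access. The outcome labels reuse the words "same hash", "api" and "hash" for readability only. Treating a resource with no stored hash as changed is also a simplification.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Tuple


def newer_header_date(headers: Mapping[str, str], stored: datetime) -> Optional[datetime]:
    """Return the Last-Modified date if present and more recent than `stored`."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        header_date = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: treat it as absent
    if header_date.tzinfo is None:
        # Assumption: a date without a usable offset is taken to be UTC.
        header_date = header_date.replace(tzinfo=timezone.utc)
    return header_date if header_date > stored else None


def check_hash(download: Callable[[], bytes], previous_hash: Optional[str]) -> Tuple[str, str]:
    """Hash the file and classify it; returns (outcome, md5 hex digest)."""
    first = hashlib.md5(download()).hexdigest()
    if first == previous_hash:
        return "same hash", first
    # The hash differs from the stored one. Download again straight away to
    # detect files that are generated on the fly and so change on every request.
    second = hashlib.md5(download()).hexdigest()
    if second != first:
        return "api", second  # freshness cannot be determined
    return "hash", first  # genuine change: caller stores hash, sets date to now


# Stand-ins for real downloads (hypothetical data).
def static_download() -> bytes:
    return b"a,b\n1,2\n"


_calls = iter(range(1000))


def api_download() -> bytes:
    return f"generated on request {next(_calls)}".encode()


stored_date = datetime(2017, 6, 1, tzinfo=timezone.utc)
print(newer_header_date({"Last-Modified": "Wed, 21 Jun 2017 07:28:00 GMT"}, stored_date))
print(check_hash(static_download, previous_hash=None)[0])  # hash
print(check_hash(api_download, previous_hash=None)[0])  # api
```

Injecting the download function keeps the comparison logic separate from the HTTP client, which makes the constantly changing case easy to exercise in a test.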
Data freshness is available as a Docker image: https://hub.docker.com/r/mcarans/hdx-data-freshness/. The code for the implementation is here: https://github.com/OCHA-DAP/hdx-data-freshness. It has tests with a high level of coverage.
It produces some simple metrics eg. on a first run (empty database):
*** Resources ***
* total: 10205 *,
adhoc-revision: 3068,
internal-revision: 4921,
revision: 1829,
revision,api: 47,
revision,error: 86,
revision,hash: 192,
revision,http header: 62
*** Datasets ***
* total: 4440 *,
0: Fresh, Updated metadata: 1883,
0: Fresh, Updated metadata,revision,api: 15,
0: Fresh, Updated metadata,revision,hash: 100,
0: Fresh, Updated metadata,revision,http header: 8,
1: Due, Updated metadata: 1710,
2: Overdue, Updated metadata: 12,
3: Delinquent, Updated metadata: 361,
3: Delinquent, Updated metadata,revision,http header: 3,
Freshness Unavailable, Updated metadata: 348
1521 datasets have update frequency of Never
eg. a second run one day later:
*** Resources ***
* total: 10207 *,
adhoc-nothing: 3068,
api: 7,
error: 84,
hash: 1,
internal-nothing: 4920,
internal-revision: 1,
nothing: 2115,
revision: 6,
same hash: 5
*** Datasets ***
* total: 4441 *,
0: Fresh, Updated api: 7,
0: Fresh, Updated hash: 1,
0: Fresh, Updated metadata: 3,
0: Fresh, Updated nothing: 1995,
1: Due, Updated nothing: 1711,
2: Overdue, Updated nothing: 12,
3: Delinquent, Updated nothing: 364,
Freshness Unavailable, Updated nothing: 348
1521 datasets have update frequency of Never
For more detailed analysis, the database it builds can be queried eg.
select count(*) from dbresources where url like '%ourairports%' and dataset_id in (select id from dbdatasets where fresh is null);
select count(*) from dbresources where url like '%ourairports%' and dataset_id in (select id from dbdatasets where update_frequency is null);
The above lines returned the same value (but may not now as the datasets had their update frequency changed), confirming to us that for 48 resources which have a url containing "ourairports", their freshness value is not calculable because the update frequency of the dataset is not set. This is only possible for datasets created prior to the HDX release which made the update frequency (renamed expected update frequency) mandatory.
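To try the two queries without a copy of the real database, they can be run against a toy one. The sketch below assumes an in-memory SQLite database and invents a minimal schema: only the table names and the columns `url`, `dataset_id`, `id`, `fresh` and `update_frequency` come from the queries above, while the resource `id` column, the column types and all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dbdatasets (id TEXT, update_frequency INTEGER, fresh INTEGER);
    CREATE TABLE dbresources (id TEXT, dataset_id TEXT, url TEXT);
    -- d1 has no update frequency, so no freshness could be calculated for it.
    INSERT INTO dbdatasets VALUES ('d1', NULL, NULL), ('d2', 7, 0);
    INSERT INTO dbresources VALUES
        ('r1', 'd1', 'http://ourairports.example/airports.csv'),
        ('r2', 'd2', 'http://ourairports.example/runways.csv'),
        ('r3', 'd1', 'http://other.example/data.csv');
    """
)

# Same query shape as above; only the column tested for NULL differs.
QUERY = """
    SELECT count(*) FROM dbresources
    WHERE url LIKE '%ourairports%'
      AND dataset_id IN (SELECT id FROM dbdatasets WHERE {column} IS NULL)
"""
no_freshness = conn.execute(QUERY.format(column="fresh")).fetchone()[0]
no_frequency = conn.execute(QUERY.format(column="update_frequency")).fetchone()[0]
print(no_freshness, no_frequency)  # 1 1
```

Equal counts in this toy data mirror the observation above: every matching resource without a freshness value belongs to a dataset without an update frequency.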