Data Freshness Process

Every night, the Python implementation of HDX data freshness reads all the datasets from HDX (using the HDX Python library) and then iterates through them one by one, performing a sequence of steps on each.

At a high level it does the following:

Compares expected update frequency with time since last update to give status: fresh, due, overdue or delinquent.

For example, with an update frequency of every week and a last update 5 days ago:

7 days > 5 days => Fresh

If last updated:

0-6 days ago => Fresh

7-13 days ago => Due

14-21 days ago => Overdue

More than 21 days ago => Delinquent
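The weekly example above can be sketched as a small function. This is only an illustration: it assumes the thresholds scale as one, two and three times the update frequency, which generalises the weekly figures and is not confirmed for other frequencies. The name `freshness_status` is hypothetical.

```python
from datetime import datetime, timedelta


def freshness_status(update_frequency_days, last_modified, now):
    """Classify a dataset from its update frequency and last modified date.

    Assumption: thresholds are 1x, 2x and 3x the update frequency, which
    reproduces the weekly example (0-6, 7-13, 14-21, more than 21 days).
    """
    age_days = (now - last_modified).days
    if age_days < update_frequency_days:
        return "fresh"
    if age_days < 2 * update_frequency_days:
        return "due"
    if age_days <= 3 * update_frequency_days:
        return "overdue"
    return "delinquent"


now = datetime(2024, 1, 31)
status = freshness_status(7, now - timedelta(days=5), now)
print(status)
```

With a weekly frequency and a last update 5 days ago, the function returns "fresh", matching the example.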

The following goes into more detail:

The method of determining whether a resource has been updated depends upon where the file is hosted. If it is in HDX, i.e. in the file store, then the update time is reflected in the HDX dataset metadata. If it is hosted externally, then it is not as straightforward to find out whether the file pointed to by a URL has changed. ~~It is possible to use the last_modified field that is returned from an HTTP GET request provided the hosting server supports it or not.~~ (For performance reasons, we open a stream, so that we have the option to read only the header information rather than the entire file.) If it is a link to a file on a server like Apache or Nginx then the field may exist, but if it is a URL that generates a result on the fly, then it does not.
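As a rough sketch of reading only the header information, the snippet below uses the standard library rather than whatever HTTP client the implementation uses. The function names are hypothetical, and it assumes dates are timezone-aware. `urlopen` returns once the response headers have arrived, so the body is not downloaded unless it is explicitly read.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen


def parse_last_modified(headers):
    """Return the Last-Modified header as an aware datetime, or None."""
    value = headers.get("Last-Modified")
    if value is None:
        return None  # e.g. a URL that generates its result on the fly
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # header present but not a valid HTTP date
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def most_recent(resource_last_modified, header_last_modified):
    """Keep the header date only if it is newer than the stored date."""
    if header_last_modified and header_last_modified > resource_last_modified:
        return header_last_modified
    return resource_last_modified


def fetch_last_modified(url, timeout=30):
    """Open a GET request and inspect the headers without reading the body."""
    with urlopen(url, timeout=timeout) as response:
        return parse_last_modified(response.headers)
```

A caller would pass the resource's stored last modified date and the result of `fetch_last_modified` to `most_recent`.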

An alternative approach discussed with researchers at the University of Vienna is to download the files at the external URLs, hash them and compare the hashes. Since there can be temporary connection and download issues with URLs, the code retries multiple times with increasing delays. Also, as there are many requests to be made, they are executed concurrently rather than one by one, using the asynchronous functionality that has been added to the most recent versions of Python. The researchers had done some calculations and asserted that it would be too resource intensive to hash all of the files daily, mainly due to the time needed to download from all of the URLs. However, by applying logic so that we do not need to download all of the files every day, we have reduced the load significantly.
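The combination of retries with increasing delays and concurrent execution might look like the sketch below. It uses only `asyncio` and `hashlib`. The download itself is passed in as `fetch`, a hypothetical coroutine that returns the file's bytes, and the attempt count, delay and concurrency limit are illustrative values, not those of the real code.

```python
import asyncio
import hashlib


async def hash_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Download via fetch(url) and return an MD5 hex digest, retrying on
    connection errors with a delay that grows after each failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            content = await fetch(url)
            return hashlib.md5(content).hexdigest()
        except OSError:  # ConnectionError and timeouts are OSError subclasses
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * attempt)


async def hash_all(fetch, urls, max_concurrent=10, attempts=3, base_delay=1.0):
    """Hash many URLs concurrently, mapping each URL to a digest or an error."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:  # cap the number of simultaneous downloads
            try:
                return url, await hash_with_retries(fetch, url, attempts, base_delay)
            except OSError as exc:
                return url, exc

    return dict(await asyncio.gather(*(worker(url) for url in urls)))
```

Returning the exception instead of raising it lets one failing URL be recorded as an error without aborting the rest of the run.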

The steps involved in data freshness are:

Read the dataset last modified date from HDX (which updates if the data of any resource changes) and the resource last modified dates (which update if the resource data is changed, so should be encapsulated by the dataset last modified date).
Load the previous run's last modified dates and replace dates if more recent.
Check the dataset's resources. If any have not been hashed for 30 days and the number of datasets unhashed in 30 days is less than 1/30 of the number of resources, then we mark the resources as to be hashed.
Get the dataset's update frequency if it has one. If that update frequency is Never (-1 in API), Live (0) or Adhoc (-2) then the dataset is always fresh.
If the dataset is not fresh based on metadata, then examine the URLs of the resources. If they are internal URLs (data.humdata.org - the HDX filestore, manage.hdx.rwlabs.org - CPS) then there is no further checking that can be done because when the files pointed to by these URLs update, the HDX metadata also updates.
If the URL is externally hosted, open an HTTP GET request to the file and check the header returned for the Last-Modified field. If that field exists, then read the date and time from it and, if it is more recent than the resource last modified date, replace that date with it.
~~If the resource is not fresh by this measure then:~~
1. Download the file and calculate an MD5 hash for it.
2. Check if the hash has changed compared with the hash from the previous run stored in the database.
There are some resources where the hash changes constantly because they connect to an API which generates a file on the fly. To identify these, download and hash again and check whether the hash has changed in the few seconds since the previous hash calculation. If so, we cannot determine freshness.
If the hash has changed since the last run, but doesn't change constantly (API), store the hash and replace the resource last modified date with the current date (now).
Record the dataset and resource last modified dates.
Calculate freshness by measuring the difference between the current date and last modified date and comparing with the update frequency, setting a status: fresh, due, overdue or delinquent.
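Some of the decisions in the steps above can be sketched as follows. This is a simplified illustration, not the implementation: the function names are hypothetical, `download` stands in for whatever fetches the file's bytes, and the labels "same hash", "hash" and "api" are borrowed from the metrics output further down.

```python
import hashlib
from urllib.parse import urlparse

# Update frequency codes named in the steps: Never (-1), Live (0), Adhoc (-2).
ALWAYS_FRESH_FREQUENCIES = {-1, 0, -2}
# Hosts the steps list as internal to HDX.
INTERNAL_HOSTS = {"data.humdata.org", "manage.hdx.rwlabs.org"}


def is_always_fresh(update_frequency):
    """Datasets with these update frequencies are always fresh."""
    return update_frequency in ALWAYS_FRESH_FREQUENCIES


def is_internal(url):
    """Internal URLs need no further checks: HDX metadata tracks their changes."""
    return urlparse(url).hostname in INTERNAL_HOSTS


def md5_hex(content):
    return hashlib.md5(content).hexdigest()


def check_hash(previous_hash, download):
    """Compare a fresh hash with the previous run's hash.

    download is a hypothetical callable returning the file's bytes.
    Returns a (label, hash) pair.
    """
    first = md5_hex(download())
    if first == previous_hash:
        return "same hash", first
    # The hash differs from the last run: download again to see whether the
    # content changes constantly (a file generated on the fly by an API).
    second = md5_hex(download())
    if second != first:
        return "api", second  # freshness cannot be determined
    return "hash", first  # a genuine change: store hash, set date to now
```

Only the "hash" outcome would lead to storing the new hash and setting the resource last modified date to the current date.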

The flowchart below represents the logical flow that runs nightly for each dataset and resource in HDX:

Data freshness is available as a Docker image: https://hub.docker.com/r/mcarans/hdx-data-freshness/. The code for the implementation is here: https://github.com/OCHA-DAP/hdx-data-freshness. It has tests with a high level of coverage.

It produces some simple metrics eg. on a first run (empty database):

*** Resources ***
* total: 10205 *,
adhoc-revision: 3068,
internal-revision: 4921,
revision: 1829,
revision,api: 47,
revision,error: 86,
revision,hash: 192,
revision,http header: 62

*** Datasets ***
* total: 4440 *,
0: Fresh, Updated metadata: 1883,
0: Fresh, Updated metadata,revision,api: 15,
0: Fresh, Updated metadata,revision,hash: 100,
0: Fresh, Updated metadata,revision,http header: 8,
1: Due, Updated metadata: 1710,
2: Overdue, Updated metadata: 12,
3: Delinquent, Updated metadata: 361,
3: Delinquent, Updated metadata,revision,http header: 3,
Freshness Unavailable, Updated metadata: 348

1521 datasets have update frequency of Never

eg. a second run one day later:

*** Resources ***
* total: 10207 *,
adhoc-nothing: 3068,
api: 7,
error: 84,
hash: 1,
internal-nothing: 4920,
internal-revision: 1,
nothing: 2115,
revision: 6,
same hash: 5

*** Datasets ***
* total: 4441 *,
0: Fresh, Updated api: 7,
0: Fresh, Updated hash: 1,
0: Fresh, Updated metadata: 3,
0: Fresh, Updated nothing: 1995,
1: Due, Updated nothing: 1711,
2: Overdue, Updated nothing: 12,
3: Delinquent, Updated nothing: 364,
Freshness Unavailable, Updated nothing: 348

1521 datasets have update frequency of Never

For more detailed analysis, the database it builds can be queried, eg.

select count(*) from dbresources where url like '%ourairports%' and dataset_id in (select id from dbdatasets where fresh is null);
select count(*) from dbresources where url like '%ourairports%' and dataset_id in (select id from dbdatasets where update_frequency is null);

The above lines returned the same value (but may not now as the datasets had their update frequency changed), confirming to us that for 48 resources which have a url containing "ourairports", their freshness value is not calculable because the update frequency of the dataset is not set. This is only possible for datasets created prior to the HDX release which made the update frequency (renamed expected update frequency) mandatory.
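To show why the two counts agree, here is a toy reproduction using an in-memory SQLite database. The table and column names come from the queries above. The rest of the schema, the sample rows and the `example.org` URLs are invented for illustration, and the real database may use a different engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dbdatasets (id TEXT PRIMARY KEY, update_frequency INTEGER, fresh INTEGER);
CREATE TABLE dbresources (id TEXT PRIMARY KEY, dataset_id TEXT, url TEXT);
-- d1 has no update frequency, so its freshness cannot be calculated.
INSERT INTO dbdatasets VALUES ('d1', NULL, NULL), ('d2', 7, 0);
INSERT INTO dbresources VALUES
  ('r1', 'd1', 'http://example.org/ourairports/a.csv'),
  ('r2', 'd1', 'http://example.org/ourairports/b.csv'),
  ('r3', 'd2', 'http://example.org/ourairports/c.csv');
""")

QUERY = """
SELECT count(*) FROM dbresources
WHERE url LIKE '%ourairports%'
  AND dataset_id IN (SELECT id FROM dbdatasets WHERE {column} IS NULL)
"""


def count_resources(conn, column):
    """Count matching resources whose dataset has NULL in the given column."""
    if column not in {"fresh", "update_frequency"}:
        raise ValueError(f"unexpected column: {column}")
    return conn.execute(QUERY.format(column=column)).fetchone()[0]


no_freshness = count_resources(conn, "fresh")
no_frequency = count_resources(conn, "update_frequency")
print(no_freshness, no_frequency)
```

In this toy data both counts are 2, because the only dataset without a freshness value is the one without an update frequency.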