Introduction
We would like to be able to determine how fresh the data on HDX is, for two purposes. Firstly, we want to be able to encourage data contributors to make regular updates of their data where applicable, and secondly, we want to be able to tell users of HDX how up to date the datasets they are interested in are.
Important fields
Field | Description | Purpose |
---|---|---|
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated or at least checked to see if it needs updating |
last_modified | Resource last modified date | Indicates the last time the resource was updated irrespective of whether it was a major or minor change |
dataset_date | Dataset time period | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates |
There are two dates that data can have and this can cause confusion, so we define them clearly here as they pertain to datasets:
- Date of update: The last time any resource in the dataset was modified or the dataset was manually confirmed as up to date. The ideal is that the time between updates corresponds with what is selected in the expected update frequency. This is last_modified.
- Time period of data: The earliest start date and latest end date across all resources included in the dataset. This is dataset_date.
The method of determining whether a resource is updated depends upon where the file is hosted. If it is hosted by HDX, then the last modified date is recorded, but if externally, then there can be challenges in determining if a url has been updated or not.
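For illustration, here is a minimal sketch of reading the fields above with the HDX Python library (hdx-python-api), assuming a recent version of that library; the dataset name and user agent are hypothetical:

```python
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

# Read-only connection to the production HDX site (user agent is illustrative)
Configuration.create(hdx_site="prod", user_agent="freshness-example", hdx_read_only=True)

dataset = Dataset.read_from_hdx("example-dataset-name")   # hypothetical dataset name
update_frequency = dataset.get("data_update_frequency")   # expected update frequency
dataset_date = dataset.get("dataset_date")                 # time period of the data

# Latest last_modified across the dataset's resources (ISO timestamps compare as strings)
last_modified = max(resource["last_modified"] for resource in dataset.get_resources())
print(update_frequency, dataset_date, last_modified)
```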
Dataset Aging Methodology
Once we have the last modified dates for all of a dataset's resources and the last date the dataset was manually confirmed as updated in the UI (if available), we can calculate the latest of all of them, which we refer to as “last modified date” from here on. This is used to calculate the dataset’s age which, combined with the update frequency, lets us ascertain the freshness of the dataset.
The implementation of HDX freshness in Python reads all the datasets from HDX (using the HDX Python library) and then iterates through them one by one performing a sequence of steps.
- It gets the dataset's update frequency if it has one. If that update frequency is Never, then the dataset is always fresh.
- If not, it checks if the dataset and resource metadata have changed - this qualifies as an update from a freshness perspective. It compares the difference between the current time and update time with the update frequency and sets a status: fresh, due, overdue or delinquent.
- If the dataset is not fresh based on metadata, then the urls of the resources are examined. If they are internal urls (data.humdata.org - the HDX filestore, manage.hdx.rwlabs.org - CPS) then there is no further checking that can be done because when the files pointed to by these urls update, the HDX metadata is updated.
- If they are urls with an adhoc update frequency (proxy.hxlstandard.org, ourairports.com), then freshness cannot be determined. Currently, there is no mechanism in HDX to specify adhoc update frequencies, but there is a proposal to add this to the update frequency options. At the moment, the freshness value for adhoc datasets is based on whatever has been set for update frequency, but these datasets can be easily identified and excluded from results if needed.
- If the url is externally hosted and not adhoc, then we can open an HTTP GET request to the file and check the header returned for the Last-Modified field. If that field exists, then we read the date and time from it and check if that is more recent than the dataset or resource metadata modification date. If it is, we recalculate freshness.
- If the resource is not fresh by this measure, then we download the file and calculate an MD5 hash for it. In our database, we store previous hash values, so we can check if the hash has changed since the last time we took the hash.
- There are some resources where the hash changes constantly because they connect to an API which generates a file on the fly. To identify these, we hash again and check if the hash changes in the few seconds since the previous hash calculation. A minimal sketch of these header and hash checks follows this list.
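As an illustration of the Last-Modified header check and the hashing steps above, here is a minimal sketch using requests and hashlib; the function names and the revisit delay are illustrative rather than the actual hdx-data-freshness implementation:

```python
import hashlib
import time

import requests

def url_last_modified(url):
    """Open a GET request and return the Last-Modified header, or None if absent."""
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        return response.headers.get("Last-Modified")

def url_md5(url):
    """Download the file at url and return the MD5 hash of its contents."""
    md5 = hashlib.md5()
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=65536):
            md5.update(chunk)
    return md5.hexdigest()

def is_api_generated(url, delay=5):
    """Files generated on the fly change hash constantly: hash twice a few seconds apart."""
    first = url_md5(url)
    time.sleep(delay)
    return url_md5(url) != first
```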
Since there can be temporary connection and download issues with urls, the code retries failed requests several times with increasing delays. Also, as there are many requests to be made, rather than performing them one by one, they are executed concurrently using the asynchronous functionality (asyncio) added in recent versions of Python, as sketched below. The code for the implementation is here: https://github.com/mcarans/hdx-data-freshness. It has tests with a high level of coverage.
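A minimal sketch of how the concurrent checking with retries and increasing delays could look, using asyncio and aiohttp (retry counts, delays and function names are illustrative):

```python
import asyncio

import aiohttp

async def fetch_last_modified(session, url, retries=3):
    """GET url, retrying on failure with an increasing delay between attempts."""
    for attempt in range(1, retries + 1):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as response:
                response.raise_for_status()
                return url, response.headers.get("Last-Modified")
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == retries:
                return url, None
            await asyncio.sleep(attempt * 10)  # wait longer after each failure

async def fetch_all(urls):
    """Check many urls concurrently rather than one by one."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_last_modified(session, url) for url in urls))

# results = asyncio.run(fetch_all(["https://example.com/data.csv"]))
```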
It produces some simple metrics, e.g. on a first run (empty database):
...
and on a second run one day later:
...
For more detailed analysis, the database it builds can be queried, e.g.:
select count(*) from dbresources where url like '%ourairports%' and dataset_id in (select id from dbdatasets where fresh is null);
select count(*) from dbresources where url like '%ourairports%' and dataset_id in (select id from dbdatasets where update_frequency is null);
The above lines return the same value, confirming that for the 48 resources whose url contains "ourairports", the freshness value is not calculable because the update frequency of the dataset is not set. This is only possible for datasets created prior to the HDX release which made the update frequency (renamed expected update frequency) mandatory. On this subject, an indication of a dataset's update frequency is critical to data freshness. Hence, it was proposed to make the data_update_frequency field mandatory instead of optional and to change its name to make it sound less onerous by adding "expected", i.e. expected update frequency. It was confirmed that this field should stay at dataset level, as our recommendation to data providers would be that if a dataset has resources with different update frequencies, it should be divided into multiple datasets. The field is a dropdown with the values: every day, every week, every two weeks, every month, every three months, every six months, every year and never, and it has been implemented (HDX-4919).
It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated; it has been implemented and released to production.
A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets and we can encourage data contributors to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but support issues could become burdensome.
A collaboration has been started with a team at Vienna University who are considering the issue of data freshness from an academic perspective. We will see what we can learn from them but will likely proceed with a more basic and practical approach than what they envisage. Specifically, they are looking at estimating the next change time for a resource based on previous update history, which is in an early stage of research so not ready for use in a real life system just yet.
Next Steps
Running data freshness has shown that there are many datasets with an update frequency of never. This is understandable because for a long time, it was the default option in the UI. As the data freshness database holds organisation information, the first step is to contact all organisations who have datasets with update frequency never and encourage them to put in a time period.
There needs to be a way to specify that although a dataset is updated, it is not according to any schedule, i.e. it is adhoc. However, since this would be an enticing option to pick as it does not require much thought, it is proposed not to let data contributors choose it in the UI, but to wait for them to ask us or for us to identify their dataset as stale and contact them about it.
An issue that needs to be addressed is where data freshness should run and where it should output its results. Currently it is running on a local computer with a local file (SQLite) database, but it uses a database framework (SQLAlchemy) so the database can be changed. Consider that for the HDX UI to use freshness information, it needs access to the freshness data. Mailing providers when their data is overdue also requires that access (unless this is added as a feature to data freshness itself). Also, the data team need to be able to read the data for reporting. There are two possibilities for data storage:
- Use a standalone database like Postgres hosted somewhere to be determined
- Add data into metadata of datasets and resources
For where to run data freshness, one possibility is the data team server, but if there are production processes reliant on it, then it may be better if it is on a server under the control of the developers.
As data freshness collects a lot of metadata, it could be used for more general reporting. If needed, the list of metadata collected could be extended.
Once data freshness is running daily, there are some simple improvements we can make that will have a positive impact on data freshness. For example, we should send an automated mail reminder to data contributors if the update frequency time window for any of their datasets is missed by a certain amount. Even for ones which have an update frequency of "never", there could be an argument for a very rare mail reminder just to confirm data really is static. For the case where data is unchanged, we should give the option for contributors to respond directly to the automated mail to say so (perhaps by clicking a button in the message). Where data has changed, we would provide the link to the dataset that needs updating. We should consider if/how we batch emails if many datasets from one organisation need updating so they are not bombarded.
The number of datasets that are hosted outside of HDX is growing rapidly and these represent a problem for data freshness if their update time is not available. Rather than ignore them, the easiest solution is to send a reminder to users according to the update frequency - the problem is that this would be irrespective of whether they have already updated and so potentially annoying.
Another way is to provide guidance to data contributors so that as they consider how to upload resources, we steer them towards a particular technological solution that is helpful to us, e.g. using a Google spreadsheet with our update trigger added. We could investigate a fuller integration between HDX and Google spreadsheets so that if a data provider clicks a button in HDX, it will create a resource pointing to a spreadsheet in Google Drive with the trigger set up, which opens automatically once they enter their Google credentials. We may need to investigate other platforms, for example creating document alerts in OneDrive for Business and/or macros in Excel spreadsheets (although as noted earlier, this might create a support headache).
Dataset Aging Methodology
A resource's age can be measured as today's date minus its last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.
Thought has previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for calculating those statuses are sound, so they have been used as a foundation. It is important that we distinguish between what we report to our users and data providers and what we need for our automated processing. For the purposes of reporting, the terminology we would use is simply fresh or not fresh. For contacting data providers, we must give them some leeway from the due date (technically the date after which the data is no longer fresh): the automated email would be sent on the overdue date rather than the due date (but in the email we would tell the data provider that we think their data is not fresh and needs to be updated, rather than referring to states like overdue). The delinquent date would also be used in an automated process that tells us it is time for us to manually contact the data providers to see if they have any problems we can help with regarding updating their data.
Dataset age state thresholds (how old must a dataset be for it to have each status), where f is the expected update frequency in days:

Update Frequency | Up-to-date (fresh) | Due (fresh) | Overdue (not fresh) | Delinquent (not fresh) |
---|---|---|---|---|
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 2) | 3 days old (delinquent_age = f + 3) |
Weekly | 0 - 6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14) |
Fortnightly | 0 - 13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14) |
Monthly | 0 - 29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30) |
Quarterly | 0 - 89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60) |
Semiannually | 0 - 179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60) |
Annually | 0 - 364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90) |
Never | Always | Never | Never | Never |
Live | Always | Never | Never | Never |
As Needed | Always | Never | Never | Never |
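As a worked illustration of these thresholds, here is a minimal sketch of how a dataset's age (in days since last modified) could be mapped to a status for the numeric update frequencies above; the function name and dictionary are illustrative, not the actual hdx-data-freshness code:

```python
# Thresholds in days taken from the table above, keyed by update frequency f in days:
# (due_age, overdue_age, delinquent_age)
THRESHOLDS = {
    1: (1, 2, 3),          # Daily
    7: (7, 14, 21),        # Weekly
    14: (14, 21, 28),      # Fortnightly
    30: (30, 44, 60),      # Monthly
    90: (90, 120, 150),    # Quarterly
    180: (180, 210, 240),  # Semiannually
    365: (365, 425, 455),  # Annually
}

def age_status(age, f):
    """Classify a dataset of the given age (days) for expected update frequency f (days)."""
    due_age, overdue_age, delinquent_age = THRESHOLDS[f]
    if age < due_age:
        return "up to date"   # fresh
    if age < overdue_age:
        return "due"          # fresh
    if age < delinquent_age:
        return "overdue"      # not fresh
    return "delinquent"       # not fresh

# e.g. a monthly dataset (f = 30) that is 44 days old:
# age_status(44, 30) -> "overdue"
```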
Number of Files Locally and Externally Hosted
Determining if a Resource is Updated
Here is a presentation about data freshness from January 2017 that provides a good introduction.
Data Freshness Architecture
Data freshness consists of a database, REST API, freshness process and freshness emailer.
There is a Docker container hosting the Postgres database (https://hub.docker.com/r/unocha/alpine-postgres/ - 201703-PR116) with a port open to allow connection from external database clients (hdxdatateam.xyz:5432). There is another Docker container (https://hub.docker.com/r/mcarans/alpine-haskell-postgrest/) that exposes a REST API to the database (http://hdxdatateam.xyz:3000/) - the Docker setup for this is here: https://github.com/OCHA-DAP/alpine-haskell-postgrest. The freshness process and freshness emailer are also within their own Docker containers. The docker-compose configuration that brings all these containers together is here: https://github.com/OCHA-DAP/hdx-data-freshness-docker.
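For example, the REST API can be queried over HTTP using PostgREST's filter syntax; a sketch is below. The table and column names are taken from the freshness database queries earlier, but the exact endpoint paths and available columns are assumptions:

```python
import requests

BASE_URL = "http://hdxdatateam.xyz:3000"

# PostgREST exposes each table as an endpoint; filters use column=operator.value syntax
response = requests.get(
    f"{BASE_URL}/dbdatasets",
    params={"select": "id,fresh,update_frequency", "update_frequency": "is.null"},
    timeout=30,
)
response.raise_for_status()
print(len(response.json()), "datasets have no update frequency set")
```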
Here is an overall view of the architecture:
...
Data Freshness Process
Data Freshness Emailer
Completed Work
Data Freshness Roadmap
Statistics
References
Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:
https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#
Dataset Aging service:
https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit
https://github.com/luiscape/hdx-monitor-ageing-service
University of Vienna paper on methodologies for estimating next change time for a resource based on previous update history:
https://www.adequate.at/wp-content/uploads/2016/04/neumaier2016ODFreshness.pdf
University of Vienna presentation of data freshness: