Important fields

Field | Description | Purpose |
---|---|---|
data_update_frequency | Dataset suggested update frequency | Shows how often the data is expected to be updated, or at least checked to see if it needs updating |
revision_last_updated | Resource last modified date | Indicates the last time the dataset was updated, irrespective of whether it was a major or minor change |
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX, so may not need to change for minor updates |

Thoughts
There are two aspects of data freshness:
1. Date of update: The last time the data was looked at to confirm it is up to date, i.e. it must be examined according to the update frequency
2. Date of data: The actual date of the data - an update could consist of just confirming that the data has not changed
We should send an automated mail reminder to data contributors if the update frequency time window is missed by a certain amount. Perhaps we should give the option for contributors to respond directly to that mail to say that data is unchanged so they don't even need to log into HDX in that case, otherwise provide the link to their dataset that needs updating.
The number of datasets hosted outside of HDX is growing. I think we should try to handle this situation now. The simple but perhaps annoying solution is to send a reminder to users according to the update frequency (irrespective of whether they have already updated, as we cannot tell).
Another way is to provide guidance to users so that, as they consider how to upload resources, we steer them towards a particular technological solution that is helpful to us, e.g. a Google spreadsheet with an update trigger, document alerts in OneDrive for Business, or a macro in an Excel spreadsheet. I don't know if this is possible, but complete automation would be if they could click something in HDX that creates a resource pointing to a spreadsheet in Google Drive, with the trigger set up automatically once they enter their Google credentials.
Approach
- Determine the scope of our problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX itself to calculate this number.
...
- Collect frequency of updates based on the interns' work?
...
- Define the age of datasets by calculating: today's date - last modified date
...
...
- Compare age with frequency and define the logic: how do we define an outdated dataset?
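The hosting split in the first step could be computed from HDX's underlying CKAN API (e.g. the `resources` lists returned by `package_search`). A minimal sketch, assuming CKAN's convention that FileStore uploads carry `url_type == "upload"` and treating everything else (CPS, HXL Proxy, ScraperWiki, other URLs) as external; the function name is hypothetical:

```python
def count_hosting_types(datasets: list) -> dict:
    """Count locally vs externally hosted resources.

    `datasets` is a list of CKAN dataset dicts, e.g. the "results" list
    from a package_search call. CKAN marks FileStore uploads with
    url_type == "upload"; anything else is treated as externally hosted.
    """
    counts = {"local": 0, "external": 0}
    for dataset in datasets:
        for resource in dataset.get("resources", []):
            if resource.get("url_type") == "upload":
                counts["local"] += 1
            else:
                counts["external"] += 1
    return counts
```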
Number of Files Locally and Externally Hosted
Type | Number of Resources | Percentage |
---|---|---|
File Store | 2,102 | 22% |
CPS | 2,459 | 26% |
HXL Proxy | 2,584 | 27% |
ScraperWiki | 162 | 2% |
Others | 2,261 | 24% |
Total | 9,568 | 100% |
Actions
Update frequency needs to be mandatory: HDX-4919 (JIRA: humanitarian.atlassian.net)
Investigate the HTTP last modification date field (the Last-Modified header) - 60% of resources in HDX have this, according to the University of Vienna.
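Checking that header can be sketched with a HEAD request; the function names here are hypothetical, and not every server sends the header (hence the `None` case):

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(header):
    """Parse an HTTP Last-Modified header value into a datetime, or None."""
    return parsedate_to_datetime(header) if header else None


def resource_last_modified(url: str):
    """Return the Last-Modified date of an externally hosted resource.

    Returns None when the server does not send the header (roughly 40%
    of HDX resources, per the figure above).
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))
```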
Classifying the Age of Datasets
Thought has previously gone into classifying the age of datasets and, reviewing this work, the statuses used (up to date, due, overdue and delinquent) and the formulae for determining those statuses are sound. Hence, using that work, we have:
Dataset age state thresholds (how old must a dataset be for it to have this status), where f is the update frequency in days:

Update Frequency | Up-to-date | Due | Overdue | Delinquent |
---|---|---|---|---|
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 1) | 3 days old (delinquent_age = f + 2) |
Weekly | 0 - 6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14) |
Fortnightly | 0 - 13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14) |
Monthly | 0 - 29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30) |
Quarterly | 0 - 89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60) |
Semiannually | 0 - 179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60) |
Annually | 0 - 364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90) |
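The thresholds above can be expressed as a small classifier. A sketch, in which the function name, the frequency keys, and the per-frequency delta table are assumptions mirroring the rows above:

```python
from datetime import date

# Update frequency in days (f in the table above)
FREQUENCIES = {
    "daily": 1, "weekly": 7, "fortnightly": 14, "monthly": 30,
    "quarterly": 90, "semiannually": 180, "annually": 365,
}

# (overdue delta, delinquent delta) added to f, per the thresholds above
DELTAS = {
    1: (1, 2), 7: (7, 14), 14: (7, 14), 30: (14, 30),
    90: (30, 60), 180: (30, 60), 365: (60, 90),
}


def freshness_status(last_update: date, frequency: str, today: date) -> str:
    """Classify a dataset as up-to-date/due/overdue/delinquent by its age."""
    f = FREQUENCIES[frequency]
    overdue_delta, delinquent_delta = DELTAS[f]
    age = (today - last_update).days
    if age < f:
        return "up-to-date"
    if age < f + overdue_delta:
        return "due"
    if age < f + delinquent_delta:
        return "overdue"
    return "delinquent"
```

For example, a weekly dataset is "due" at exactly 7 days old and becomes "overdue" at 14.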
References
Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:
...
https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit
https://github.com/luiscape/hdx-monitor-ageing-service
University of Vienna paper on methodologies for estimating next change time for a resource based on previous update history:
...