Introduction
...
The implementation of HDX freshness in Python reads all the datasets from HDX (using the HDX Python library) and then goes through a sequence of steps. Firstly, it gets the dataset's update frequency if it has one. If that update frequency is Never, then the dataset is always fresh. If not, it checks whether the dataset and resource metadata have changed, which qualifies as an update from a freshness perspective. It compares the difference between the current time and the update time with the update frequency and sets a status: fresh, due, overdue or delinquent. If the dataset is not fresh based on metadata, then the URLs of the resources are examined. If they are internal URLs (data.humdata.org - the HDX filestore, manage.hdx.rwlabs.org - CPS), then the HDX metadata is updated whenever the files pointed to by these URLs update, so there is no further checking to be done. If they are URLs with an adhoc update frequency (proxy.hxlstandard.org, ourairports.com), then freshness cannot be determined. Currently, there is no mechanism in HDX to specify adhoc update frequencies, but there is a proposal to add this to the update frequency options. At the moment, the freshness value for adhoc datasets is based on whatever has been set for update frequency which may not
It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated and has been implemented and released to production. Related to that is ongoing work to make the field visible in the UI (Jira HDX-4254, HDX-4894).
Critical to data freshness is having an indication of the update frequency of the dataset. Hence, it was proposed to make the data_update_frequency field mandatory instead of optional and to change its name to make it sound less onerous by adding "expected", i.e. expected update frequency (Jira HDX-4919). It was confirmed that this field should stay at dataset level, as our recommendation to data providers would be that if a dataset has resources with different update frequencies, it should be divided into multiple datasets. Assuming the field is a dropdown, it could have the values: daily, weekly, fortnightly, monthly, quarterly, semiannually, annually, never. It would be good to have something pop up if the user chooses "never", making it clear that this is for datasets for which data is static. We will have to audit datasets where people pick this option, as we don't want people choosing "never" because they don't want to commit to putting an expected update frequency.
A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets and we can encourage data contributors to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but support issues could become burdensome.
A collaboration has been started with a team at Vienna University who are considering the issue of data freshness from an academic perspective. We will see what we can learn from them but will likely proceed with a more basic and practical approach than the one they envisage. Specifically, they are looking at estimating the next change time for a resource based on its previous update history. This is at an early stage of research, so it is not ready for use in a real-life system just yet.
...
Next Steps
Contact all organisations who have datasets with update frequency Never.
Where should data freshness run and where should it output, e.g. database, HDX metadata? Consider that for the UI to use freshness information, it needs access, and the data team needs access for reporting. As an aside, more general reporting can be done, as data freshness runs every day and collects a lot of dataset metadata (the list of which could be extended).
Add an adhoc update_frequency and only allow admins to set "Never" and "Adhoc".
The expected update frequency field requires further thought particularly on the issue of static datasets, following which there will be interface design and development effort.
...
Field | Description | Purpose |
---|---|---|
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated or at least checked to see if it needs updating |
revision_last_updated | Resource last modified date | Indicates the last time the resource was updated irrespective of whether it was a major or minor change |
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates |
Dataset Aging Methodology
A resource's age can be measured using today's date - last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.
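As a minimal sketch of the age calculation described above, assuming each resource's last update time is already available as a `datetime` (the function names and sample dates here are illustrative, not taken from the HDX freshness code):

```python
from datetime import datetime, timedelta
from typing import List


def resource_age(last_updated: datetime, now: datetime) -> timedelta:
    """Age of a resource: current time minus its last update time."""
    return now - last_updated


def dataset_age(resource_update_times: List[datetime], now: datetime) -> timedelta:
    """Age of a dataset: the lowest age among all of its resources."""
    if not resource_update_times:
        raise ValueError("dataset has no resources")
    return min(resource_age(updated, now) for updated in resource_update_times)


# Hypothetical dataset with two resources, updated 9 and 2 days ago.
now = datetime(2017, 3, 10)
age = dataset_age([datetime(2017, 3, 1), datetime(2017, 3, 8)], now)
print(age.days)  # 2
```

Taking the minimum means that a dataset counts as recently updated as soon as any one of its resources has been updated.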
Thought has previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for calculating those statuses are sound, so they have been used as a foundation. It is important that we distinguish between what we report to our users and data providers and what we need for our automated processing. For the purposes of reporting, the terminology we would use is simply fresh or not fresh. For contacting data providers, we must give them some leeway from the due date (technically the date after which the data is no longer fresh): the automated email would be sent on the overdue date rather than the due date (but in the email we would tell the data provider that we think their data is not fresh and needs to be updated rather than referring to states like overdue). The delinquent date would also be used in an automated process that tells us it is time for us to manually contact the data providers to see if they have any problems we can help with regarding updating their data.
Dataset age state thresholds (how old must a dataset be for it to have this status). Up-to-date is reported as Fresh; Due, Overdue and Delinquent are reported as Not Fresh. In the formulae, f is the update frequency in days.
Update Frequency | Up-to-date | Due | Overdue | Delinquent |
---|---|---|---|---|
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 1) | 3 days old (delinquent_age = f + 2) |
Weekly | 0 - 6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14) |
Fortnightly | 0 - 13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14) |
Monthly | 0 - 29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30) |
Quarterly | 0 - 89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60) |
Semiannually | 0 - 179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60) |
Annually | 0 - 364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90) |
Never | Always | Never | Never | Never |
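The thresholds in the table can be applied along the following lines. This is only a sketch: the `THRESHOLDS` mapping and the `age_status` function are hypothetical names, the day counts are copied from the table, and each threshold is assumed to be a lower bound (for example, a weekly dataset that is 10 days old is treated as due because it has passed 7 days but not yet reached 14).

```python
# Day counts copied from the table: (due, overdue, delinquent).
THRESHOLDS = {
    "daily": (1, 2, 3),
    "weekly": (7, 14, 21),
    "fortnightly": (14, 21, 28),
    "monthly": (30, 44, 60),
    "quarterly": (90, 120, 150),
    "semiannually": (180, 210, 240),
    "annually": (365, 425, 455),
}


def age_status(update_frequency: str, age_days: int) -> str:
    """Return the age status of a dataset given its update frequency and age in days."""
    frequency = update_frequency.lower()
    if frequency == "never":
        # Static datasets are always up to date.
        return "up-to-date"
    due, overdue, delinquent = THRESHOLDS[frequency]
    # Check the oldest threshold first so the most severe status wins.
    if age_days >= delinquent:
        return "delinquent"
    if age_days >= overdue:
        return "overdue"
    if age_days >= due:
        return "due"
    return "up-to-date"


print(age_status("Weekly", 10))   # due
print(age_status("Monthly", 44))  # overdue
```

Keeping the thresholds in a single mapping makes it straightforward to adjust the leeway given to data providers without touching the comparison logic.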
Number of Files Locally and Externally Hosted
Type | Number of Resources | Percentage |
---|---|---|
File Store | 2,102 | 22% |
CPS | 2,459 | 26% |
HXL Proxy | 2,584 | 27% |
ScraperWiki | 162 | 2% |
Others | 2,261 | 24% |
Total | 9,568 | 100% |
Determining if a Resource is Updated
[Drawio diagram: Determining if a Resource is Updated]
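The URL check described in the introduction, where internal URLs need no further checking and adhoc URLs cannot have their freshness determined, could be sketched as below. The host lists come from the introduction; the function name, the "other" category and the sample URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hosts named in the introduction.
INTERNAL_HOSTS = {"data.humdata.org", "manage.hdx.rwlabs.org"}  # HDX filestore, CPS
ADHOC_HOSTS = {"proxy.hxlstandard.org", "ourairports.com"}


def classify_resource_url(url: str) -> str:
    """Classify a resource URL as 'internal', 'adhoc' or 'other' by its host."""
    host = (urlparse(url).hostname or "").lower()

    def matches(hosts) -> bool:
        # Match the host itself or any of its subdomains.
        return any(host == h or host.endswith("." + h) for h in hosts)

    if matches(INTERNAL_HOSTS):
        # HDX metadata already reflects file updates: nothing more to check.
        return "internal"
    if matches(ADHOC_HOSTS):
        # Freshness cannot be determined for these sources.
        return "adhoc"
    return "other"


# Made-up example URLs.
print(classify_resource_url("https://data.humdata.org/dataset/example/file.csv"))  # internal
print(classify_resource_url("http://www.ourairports.com/data/airports.csv"))       # adhoc
```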
References
...