Important fields
Field | Description | Purpose |
---|---|---|
data_update_frequency | Dataset suggested update frequency | Shows how often the data is expected to be updated or at least checked to see if it needs updating |
revision_last_updated | Resource last modified date | Indicates the last time the resource was updated, irrespective of whether it was a major or minor change |
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates |
Approach
- Determine the scope of our problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX itself to calculate this number.
- Collect the frequency of updates, based on the interns' work?
- Define the age of datasets by calculating: Today's date - last modified date
- Compare age with frequency and define the logic: how do we decide that a dataset is outdated?
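The last two steps can be sketched in Python as follows. This is a minimal illustration, not the agreed logic: the function names are hypothetical, and the rule "outdated once the age reaches the update frequency" is only one possible definition, stated here as an assumption.

```python
from datetime import date


def dataset_age_days(last_modified: date, today: date) -> int:
    """Age in whole days: today's date minus the last modified date."""
    return (today - last_modified).days


def is_outdated(age_days: int, frequency_days: int) -> bool:
    """Assumed rule: a dataset is outdated once its age reaches its
    expected update frequency (both expressed in days)."""
    return age_days >= frequency_days


# Example: a weekly dataset last modified 10 days ago.
age = dataset_age_days(date(2017, 1, 1), date(2017, 1, 11))
print(age, is_outdated(age, frequency_days=7))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the calculation repeatable when it is tested.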
Determining if a Resource is Updated
Number of Files Locally and Externally Hosted
Type | Number of Resources | Percentage | Example |
---|---|---|---|
File Store | 2,102 | 22% | |
CPS | 2,459 | 26% | |
HXL Proxy | 2,584 | 27% | |
ScraperWiki | 162 | 2% | |
Others | 2,261 | 24% | |
Total | 9,568 | 100% | |
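As a check on the table, the percentages can be recomputed from the resource counts. The sketch below only uses the counts listed above; the variable names are illustrative.

```python
# Resource counts per hosting type, copied from the table above.
counts = {
    "File Store": 2102,
    "CPS": 2459,
    "HXL Proxy": 2584,
    "ScraperWiki": 162,
    "Others": 2261,
}

total = sum(counts.values())

# Share of each hosting type, rounded to a whole percentage.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} resources, {share}%")
print(f"Total: {total}")
```

Because each share is rounded independently, the rounded percentages add up to 101 rather than exactly 100.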
Classifying the Age of Datasets
Thought has previously gone into classifying the age of datasets. Having reviewed that work, we find that the statuses used (up to date, due, overdue and delinquent) and the formulae for determining those statuses are sound, so we will build on that foundation:
Dataset age state thresholds (how old a dataset must be for it to have each status), where f is the update frequency in days:
Update Frequency | Up-to-date | Due | Overdue | Delinquent |
---|---|---|---|---|
Daily | 0 days old | 1 day old due_age = f | 2 days old overdue_age = f + 1 | 3 days old delinquent_age = f + 2 |
Weekly | 0 - 6 days old | 7 days old due_age = f | 14 days old overdue_age = f + 7 | 21 days old delinquent_age = f + 14 |
Fortnightly | 0 - 13 days old | 14 days old due_age = f | 21 days old overdue_age = f + 7 | 28 days old delinquent_age = f + 14 |
Monthly | 0 - 29 days old | 30 days old due_age = f | 44 days old overdue_age = f + 14 | 60 days old delinquent_age = f + 30 |
Quarterly | 0 - 89 days old | 90 days old due_age = f | 120 days old overdue_age = f + 30 | 150 days old delinquent_age = f + 60 |
Semiannually | 0 - 179 days old | 180 days old due_age = f | 210 days old overdue_age = f + 30 | 240 days old delinquent_age = f + 60 |
Annually | 0 - 364 days old | 365 days old due_age = f | 425 days old overdue_age = f + 60 | 455 days old delinquent_age = f + 90 |
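The thresholds above can be expressed as a simple lookup. The following is a sketch under one assumption that the table does not state explicitly: a dataset keeps a status until its age reaches the next threshold (for example, a weekly dataset is "Due" from 7 to 13 days old). The names `THRESHOLDS` and `classify_age` are hypothetical.

```python
# (due, overdue, delinquent) ages in days, copied from the table above.
THRESHOLDS = {
    "Daily": (1, 2, 3),
    "Weekly": (7, 14, 21),
    "Fortnightly": (14, 21, 28),
    "Monthly": (30, 44, 60),
    "Quarterly": (90, 120, 150),
    "Semiannually": (180, 210, 240),
    "Annually": (365, 425, 455),
}


def classify_age(update_frequency: str, age_days: int) -> str:
    """Return the age status of a dataset given its update frequency
    and its age in days. Assumes each status lasts until the next
    threshold is reached."""
    due, overdue, delinquent = THRESHOLDS[update_frequency]
    if age_days >= delinquent:
        return "Delinquent"
    if age_days >= overdue:
        return "Overdue"
    if age_days >= due:
        return "Due"
    return "Up-to-date"


print(classify_age("Monthly", 45))
```

Checking the thresholds from most to least severe means each age falls into exactly one status without needing explicit upper bounds.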
Thoughts
- Date of update: The last time the data was looked at to confirm it is up to date, i.e. it must be examined according to the update frequency
- Date of data: The actual date of the data - an update could consist of just confirming that the data has not changed
Actions
References
Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:
https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#
Dataset Aging service:
https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit
https://github.com/luiscape/hdx-monitor-ageing-service