Important fields
Last_modified => the last time the dataset (resource) was changed. It records minor updates as well as new data.
date of dataset => the date to which the data refers. It has to change when new data comes to HDX, but it does not have to change for minor updates.
Thoughts
Approach
a) Determine the scope of our problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX to calculate this number.
b) Collect the update frequency of datasets, based on the interns' work
c) Define the age of datasets by calculating: today's date - last_modified
d) Compare age with frequency and define the logic: how do we define an outdated dataset?
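Step c) can be sketched in a few lines of Python. This is only an illustration: the function name `dataset_age_days` is hypothetical, and it assumes last_modified arrives as an ISO 8601 string that should be treated as UTC when it carries no timezone.

```python
from datetime import datetime, timezone
from typing import Optional


def dataset_age_days(last_modified: str, today: Optional[datetime] = None) -> int:
    """Return the whole number of days between last_modified and today."""
    modified = datetime.fromisoformat(last_modified)
    # Assumption: a timestamp without a timezone is in UTC.
    if modified.tzinfo is None:
        modified = modified.replace(tzinfo=timezone.utc)
    now = today if today is not None else datetime.now(timezone.utc)
    return (now - modified).days


# Example with a fixed "today" so the result is reproducible.
age = dataset_age_days(
    "2017-03-01T00:00:00",
    today=datetime(2017, 3, 15, tzinfo=timezone.utc),
)
```

Passing `today` explicitly keeps the calculation testable; in a scheduled job it would normally be left out so that the current time is used.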
Number of Files Locally and Externally Hosted
Type | Number of Resources | Percentage |
---|---|---|
File Store | 2,102 | 22% |
CPS | 2,459 | 26% |
HXL Proxy | 2,584 | 27% |
ScraperWiki | 162 | 2% |
Others | 2,261 | 24% |
Total | 9,568 | 100% |
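The percentages can be reproduced from the resource counts with a short script. The counts are taken from the table; the function name `shares` is only illustrative.

```python
counts = {
    "File Store": 2102,
    "CPS": 2459,
    "HXL Proxy": 2584,
    "ScraperWiki": 162,
    "Others": 2261,
}


def shares(resource_counts):
    """Return the total and each type's share, rounded to a whole percent."""
    total = sum(resource_counts.values())
    percentages = {
        name: round(100 * count / total)
        for name, count in resource_counts.items()
    }
    return total, percentages


total, percentages = shares(counts)
```

Because each share is rounded independently, the rounded values add up to 101 rather than 100.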
Actions
Classifying the Age of Datasets
The classification of dataset age has been considered before. Having reviewed that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for determining them are sound. Hence, using that work, we have:
Dataset age state thresholds (how old must a dataset be for it to have this status), where f is the update frequency in days:
Update Frequency | Up-to-date | Due | Overdue | Delinquent |
---|---|---|---|---|
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 1) | 3 days old (delinquent_age = f + 2) |
Weekly | 0 - 6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14) |
Fortnightly | 0 - 13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14) |
Monthly | 0 - 29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30) |
Quarterly | 0 - 89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60) |
Semiannually | 0 - 179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60) |
Annually | 0 - 364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90) |
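A minimal sketch of how these thresholds could be applied in code. It uses the day counts from the table and assumes each count is the minimum age for that status, so an age between two thresholds takes the lower status (a weekly dataset that is 10 days old is "Due"). The names `Thresholds`, `THRESHOLDS` and `classify` are hypothetical and not taken from any existing HDX code.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    due: int         # minimum age in days to be "Due"
    overdue: int     # minimum age in days to be "Overdue"
    delinquent: int  # minimum age in days to be "Delinquent"


# Day counts copied from the table above.
THRESHOLDS = {
    "Daily": Thresholds(1, 2, 3),
    "Weekly": Thresholds(7, 14, 21),
    "Fortnightly": Thresholds(14, 21, 28),
    "Monthly": Thresholds(30, 44, 60),
    "Quarterly": Thresholds(90, 120, 150),
    "Semiannually": Thresholds(180, 210, 240),
    "Annually": Thresholds(365, 425, 455),
}


def classify(update_frequency: str, age_days: int) -> str:
    """Return the freshness status for a dataset of the given age."""
    limits = THRESHOLDS[update_frequency]
    # Check the oldest status first so the highest matching one wins.
    if age_days >= limits.delinquent:
        return "Delinquent"
    if age_days >= limits.overdue:
        return "Overdue"
    if age_days >= limits.due:
        return "Due"
    return "Up-to-date"


status = classify("Monthly", 45)
```

An update frequency that is not in the table raises a `KeyError` here; how such datasets should be treated is left open.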
References
Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:
https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#
Dataset Aging service:
https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit