Duplicate Content · User Defined Rules #58

Closed
opened 2024-09-13 18:55:36 +00:00 by corbz · 1 comment
Owner

It's now clear that uniquely identifying articles needs to be left to the user: hard-coding a rule will not work, because different RSS sources implement uniqueness in different ways.

Example: one tested feed produced the same article under two different guids and urls, one using /local/ in the path, and the other using /town-name/ instead, making it impossible to rely on guids or urls for uniqueness.

Another example: the BBC publishes updates to existing articles under updated guids, usually ending in #<revision-number-here>, so guids alone aren't good enough. It also releases the same article under multiple categories with different urls, causing the same issue as before.
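
To illustrate the guid problem, here is a minimal sketch using made-up guid values on `example.org` (not real feed data). It assumes the revision marker is a URL fragment, which matches the `#<revision-number-here>` pattern described above but may not hold for every feed.

```python
from urllib.parse import urldefrag

# Hypothetical guids for two revisions of the same article.
guid_rev0 = "https://example.org/news/articles/abc123#0"
guid_rev1 = "https://example.org/news/articles/abc123#1"

# Compared as raw strings, the two revisions look like different articles.
raw_match = guid_rev0 == guid_rev1

# Dropping the fragment makes them compare equal again.
base_rev0, fragment_rev0 = urldefrag(guid_rev0)
base_rev1, fragment_rev1 = urldefrag(guid_rev1)
base_match = base_rev0 == base_rev1
```

Stripping the fragment only addresses the revision suffix. It does nothing for the same article published under several category urls, which is why a single hard-coded rule is not enough.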


Task:

  • Create a new data model and supporting UI elements that allow the user to define custom rules for identifying unique articles.
corbz added the enhancement, bug labels 2024-09-13 18:55:36 +00:00
corbz self-assigned this 2024-09-13 18:55:36 +00:00
corbz added this to the PYRSS project 2024-09-13 18:55:36 +00:00
corbz added a new dependency 2024-09-13 19:59:06 +00:00
Author
Owner

I've added several options to the web UI for identifying duplicate articles:

  • GUID
  • ID
  • URL
  • Title
  • Content Hash

The user can select one or many per subscription.
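
As a rough sketch of how per-subscription rules like these could be modelled (not the actual PYRSS implementation), the snippet below builds a deduplication key from whichever fields are selected. The names `UniqueField`, `Article`, `dedupe_key` and `drop_duplicates` are hypothetical, and it assumes that two articles count as duplicates only when all selected fields match; the real behaviour for multiple selections may differ.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class UniqueField(Enum):
    """Fields a user may select for duplicate detection (hypothetical names)."""
    GUID = "guid"
    ID = "id"
    URL = "url"
    TITLE = "title"
    CONTENT_HASH = "content_hash"


@dataclass(frozen=True)
class Article:
    guid: str
    id: str
    url: str
    title: str
    content: str


def field_value(article, field):
    # The content hash is derived from the body; other fields are read directly.
    if field is UniqueField.CONTENT_HASH:
        return hashlib.sha256(article.content.encode("utf-8")).hexdigest()
    return getattr(article, field.value)


def dedupe_key(article, fields):
    # Sort the selected fields so the key does not depend on selection order.
    ordered = sorted(fields, key=lambda f: f.value)
    return tuple(field_value(article, f) for f in ordered)


def drop_duplicates(articles, fields):
    # Keep the first article seen for each key (assumption: all fields must match).
    seen = set()
    unique = []
    for article in articles:
        key = dedupe_key(article, fields)
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Made-up data: the same story under two guids and two category urls.
a = Article("g#0", "1", "https://example.org/local/story", "Story", "body")
b = Article("g#1", "2", "https://example.org/town-name/story", "Story", "body")

by_guid = drop_duplicates([a, b], {UniqueField.GUID})
by_title = drop_duplicates([a, b], {UniqueField.TITLE})
by_title_and_hash = drop_duplicates(
    [a, b], {UniqueField.TITLE, UniqueField.CONTENT_HASH}
)
```

With this sample data, selecting GUID keeps both entries, while selecting Title (alone or together with Content Hash) collapses them into one.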

corbz closed this issue 2024-10-08 09:18:40 +00:00
Reference: corbz/PYRSS-Website#58