Podcast episode list contains duplicates

Posted: 27 Jan 2017, 16:20
by troycarpenter
A number of my podcasts now have duplicate entries in the episode lists. Here's a screenshot of what I'm talking about:
[Screenshot attachment: podcast-3.JPG]
Notice that every entry is repeated.

When I look in the podcast database, it looks like the entries are repeated there as well. All the fields are the same except the entry ID.

Using the manual "check for new episodes" command appears to trigger it.

Re: Podcast episode list contains duplicates

Posted: 02 Feb 2017, 17:44
by troycarpenter
I now have one podcast that gets duplicate entries all the time. The duplicates appear every day, so even the daily scheduled updates seem to trigger the problem.

I also now know why.

What is happening is that a prefix identifier is inserted early in the URL. Here's an example from the database:

Code: Select all

http://rss.podcastfeed.net/download/5892e700d7fc40f695bb22d6020ef50e7a0f143d/2017/01/Jan%2027%202017%20-%20Hour%203.mp3
http://rss.podcastfeed.net/download/58943880afbd6f361ea54be8cd604508b33975ed/2017/01/Jan%2027%202017%20-%20Hour%203.mp3	
http://rss.podcastfeed.net/download/58958a004c0130ad1dc79e6e1bb375f3e87aef7b/2017/01/Jan%2027%202017%20-%20Hour%203.mp3
(NOTE: I've modified the URLs so the discussion stays focused on the problem rather than on the podcast content itself. They are representative of the entries, not the actual entries.)

Looking at the three entries above, you can see that each day there is a new identifier near the beginning of the URL, but other than that, the episode content is exactly the same. Right now, Madsonic downloads the new episode right away; what I haven't tried is whether an older episode with an older URL can still be downloaded. I will let the duplicates accumulate to see if that's the case.
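Since only that identifier segment changes between fetches, one way to compare entries is to strip it before comparing. Here's a minimal sketch in Python; the `canonical_key` helper is mine, not anything Madsonic does today, and I'm assuming the token always sits directly after the `/download/` path segment:

```python
from urllib.parse import urlparse

def canonical_key(url):
    # Drop the rotating token (assumed to be the path segment right
    # after "/download/") so episodes can be matched across fetches.
    parts = urlparse(url).path.split("/")
    if "download" in parts:
        i = parts.index("download")
        del parts[i + 1]  # remove the per-fetch token
    return "/".join(parts)

a = "http://rss.podcastfeed.net/download/5892e700d7fc40f695bb22d6020ef50e7a0f143d/2017/01/Jan%2027%202017%20-%20Hour%203.mp3"
b = "http://rss.podcastfeed.net/download/58958a004c0130ad1dc79e6e1bb375f3e87aef7b/2017/01/Jan%2027%202017%20-%20Hour%203.mp3"
print(canonical_key(a) == canonical_key(b))  # True
```

With the token removed, all three database rows above collapse to the same key.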

What I don't have is an easy solution to propose. I need to fully understand the implications of using the older URLs (maybe that's why the account was shut down in the past...trying to use the older URLs may have triggered something). My approach, before further analysis, would be to compare the latest RSS feed with the database and update the existing URL entries to the latest ones when the other fields match. This would probably require an option on the podcast screen to enable "deeper analysis" of the RSS feed beyond the existing method (which seems to work most of the time so far), so that not all podcasts need the extra scrutiny. However, even that might not work if the URL has changed without the RSS feed being re-checked.

I'll report back as I analyze this more.

Re: Podcast episode list contains duplicates

Posted: 02 Feb 2017, 17:59
by troycarpenter
UPDATE:

So it seems that the URLs are custom generated when the RSS feed is fetched. This seems to be related to the fact that this is a premium feed and the URLs generated are valid for a limited amount of time. So in the example above, the first two URLs gave a "Forbidden" error while trying to download, but the third one (the latest one fetched) did work.

I also just re-fetched the RSS feed itself, and the identifier was still there, unchanged. It has been less than a day since the fetch that generated the database entry.

I also note that the identifier differs for each URL in the feed; it's not common to all of them.

So right now the best approach would be to do the deep analysis on the podcast entry to see if all the other fields match, and if so, replace the URL instead of adding a new entry. I don't know how to solve the case where someone requests an older episode that has not been downloaded yet and whose URL has changed in the meantime. A possible solution would be to reprocess the XML feed to get new URLs if a download fails, though I'm not sure how I feel about that.
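The refetch-on-failure fallback could be sketched like this. Every name here is hypothetical: `fetch` would return None on a Forbidden error, and `refetch_feed` would re-parse the XML into a map from stable fields to current URLs:

```python
def download_with_refresh(episode_key, stored_url, fetch, refetch_feed):
    # First try the URL we have on record.
    data = fetch(stored_url)          # None means Forbidden / expired token
    if data is not None:
        return data
    # Token expired: re-fetch the feed and look the episode up again.
    fresh = refetch_feed()            # (title, pub_date) -> current URL
    new_url = fresh.get(episode_key)
    return fetch(new_url) if new_url is not None else None

# simulated demo: the stored URL is expired, the refreshed one works
def fake_fetch(url):
    return b"audio bytes" if url == "http://example/fresh.mp3" else None

result = download_with_refresh(
    ("Jan 27 2017 - Hour 3", "2017-01-27"),
    "http://example/expired.mp3",
    fake_fetch,
    lambda: {("Jan 27 2017 - Hour 3", "2017-01-27"): "http://example/fresh.mp3"},
)
```

The cost is an extra feed fetch per failed download, plus the episode being unrecoverable once it drops off the feed entirely.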

Re: Podcast episode list contains duplicates

Posted: 07 Feb 2017, 16:57
by troycarpenter
How about an option for certain podcasts that basically wipes the database for that channel and replaces it with the newly fetched XML? That would imply that the podcast will NOT keep old episodes, and Madsonic would need to physically delete episodes no longer in the XML. Madsonic could then check the disk to see if episodes had previously been downloaded and mark them complete without having to use bandwidth to re-download.

That would solve this problem nicely at the expense of older episodes that drop off the XML feed.
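The wipe-and-replace option could be sketched as follows, assuming episodes already on disk can be matched by the filename at the end of the URL (a simplification; real matching might need more care with renamed or percent-encoded names):

```python
import os
import tempfile

def rebuild_channel(feed_items, media_dir):
    # Rebuild the channel's episode list from the latest feed, marking
    # an episode complete if its file already exists on disk.
    episodes = []
    for title, url in feed_items:
        filename = url.rsplit("/", 1)[-1]
        downloaded = os.path.exists(os.path.join(media_dir, filename))
        episodes.append({"title": title, "url": url, "done": downloaded})
    return episodes

# tiny demonstration with a temporary media directory
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "ep1.mp3"), "w").close()
episodes = rebuild_channel(
    [("Ep 1", "http://example/download/tok1/ep1.mp3"),
     ("Ep 2", "http://example/download/tok2/ep2.mp3")],
    demo_dir,
)
# Ep 1 is marked done (file on disk); Ep 2 still needs downloading.
```

The old rows for the channel would simply be dropped before this list is written back, which is what makes duplicates impossible.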