Looking at my sort-of-todo list, I see at least four projects that either need or would greatly benefit from feed aggregator-like functionality, i.e. not just parsing feeds, but updating a list of them and keeping track of their items.
So it seems clear that the right thing to do is to implement this only once, preferably as some sort of generic library, and then reuse it. Unfortunately, the requirements are quite different: One project needs to track enclosures (which, by the way, also changes how item guids can be identified). Sometimes notifications need to be sent out. Content has to be analysed in different ways. Two of the apps potentially handle large lists of feeds that require prioritized parsing and sophisticated error handling; in the other cases the list is small, and it wouldn’t be worth the bother. Other issues revolve around whether and how to handle cover images, how to handle redirects, whether to ignore entries under certain conditions, and even what data to collect.
You get the point: It’s not that easy to fit all of this under one hood, which is also the reason I’ve been putting it off for a while now. I think I’ve finally come up with a solution that I find satisfying, though.
The whole thing centers around a Django settings.py-like config file. By default, two tables would be used, one for feeds and one for items, and apart from primary and foreign keys, each table would only need a single column: the feed table stores the feed url, the item table the item guid.
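To make that concrete, here is a sketch of what the default schema might look like; take it with a grain of salt, since the table and column names are my own shorthand, not anything final:

    import sqlite3

    db = sqlite3.connect('feedbot.db')
    db.executescript("""
        CREATE TABLE feed (
            id  INTEGER PRIMARY KEY,
            url TEXT NOT NULL                            -- the feed url
        );
        CREATE TABLE item (
            id      INTEGER PRIMARY KEY,
            feed_id INTEGER NOT NULL REFERENCES feed(id),
            guid    TEXT NOT NULL                        -- the item guid
        );
    """)

Everything beyond that would be added by the addins you enable.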
Then, in said config file, you would specify a list of addins that each provide a particular, isolated piece of functionality. Addins can depend on each other, and it would be easy to write your own.
A configuration file might look like this:
    # my-feedbot-config.py

    USER_AGENT = 'MyFeedBot/%s (+url)' % get_version()

    ADDINS = [
        # builtin
        collect('title', 'description', 'author'),
        enclosures(require=True),
        save_bandwidth(),   # needs columns for storing etag etc. in db
        custom_item_filter(handler_func),
        # custom
        check_for_claimcode(),
    ]
Now, given the addins specified above, there’d probably be a new enclosure table, new columns in the item table for storing the metadata and HTTP header info, and the parsing process would call out to your code to handle filtering and claimcodes.
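Since none of this is finalized, the addin API itself is still up in the air; the following is just a rough sketch of how a custom addin like check_for_claimcode might plug in. The hook names (get_columns, on_item), the depends attribute and the claim code pattern are all made up for illustration:

    import re

    # hypothetical claim code pattern, e.g. "claim:ABC123"
    CLAIMCODE_RE = re.compile(r'claim:([A-Z0-9]+)')

    class check_for_claimcode(object):
        # addins can build on each other; this one needs the item
        # description, so it declares a dependency on the builtin
        # collect addin (how dependencies are spelled is still open)
        depends = ['collect']

        def get_columns(self):
            # extra columns this addin adds to the default schema
            return {'item': ['claimcode']}

        def on_item(self, feed, item, parsed):
            # called by the parser for every new or updated item
            match = CLAIMCODE_RE.search(parsed.get('description') or '')
            if match:
                item.claimcode = match.group(1)

The nice part is that an addin like this would simply go into the ADDINS list above, and the library would take care of creating the claimcode column and invoking the hook at the right time.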
I’ve already checked a decent amount of code into a Bazaar branch, but it’s far from finished (or usable). I’ll hopefully have enough time to work on this during the next few weeks and plan to post some updates as I go along (by the way, for once this does not depend on Django).