9.05.2006

Feedorific design, part 1: Feedreader

Here are my thoughts on the design of Feedorific, Django-version:

The various apps will be:
  • Feedreader -
    Accepts and stores feeds from users, gets the XML, and parses out and displays entries and descriptions. Feeds entered will also be stored to the DB.
  • Structureparser -
    Parses stored feeds for their structure.
  • Contentparser -
    Parses stored feeds for their content.
  • Organization -
    I'm very unsure about this last app's design. It will display the fully-parsed feeds, allow searching, tagging, etc. This may also be integrated with a visually-organized display of feed articles by content.
Here is what I have on the design of the feedreader element so far:

Feedreader design
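
Since the Feedreader will be storing feeds and their entries to the DB, here is a first, very rough sketch of what its Django models might look like. The model and field names are my own guesses, and the exact field options vary a bit between Django versions:

from django.db import models

class Feed(models.Model):
    # The URI the user entered, plus the feed's own title and description
    uri = models.URLField()
    title = models.CharField(max_length=200)
    description = models.TextField(blank=True)

class Entry(models.Model):
    # One parsed item belonging to a stored feed
    feed = models.ForeignKey(Feed)
    title = models.CharField(max_length=200)
    description = models.TextField(blank=True)

Presumably the Structureparser and Contentparser apps would then work from these stored rows rather than re-fetching the XML each time.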

Django to the rescue

I found myself having much difficulty getting what I had of the feedparser set up in such a way that I could return HTML that looked good. Also, there are no real tutorials, since no one seems to use just Python and HTML; it seems a lot more common to use some templating language or some other framework to link everything together.

So, I decided to rewrite it all with Django. I don't know Django very well, but it seems powerful enough without being another enterprise-level CMS that I don't need. It is also well-documented, Python-centric, and designed for fast-paced development.
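
To illustrate the kind of split I am after, a view could hand the parsed feed to a template instead of building HTML by hand. This is only a sketch: the view name, template name, and placeholder URL are all made up.

# views.py - pass parsed entries to a template rather than assembling HTML in Python
import feedparser
from django.shortcuts import render_to_response

def show_feed(request):
    # Placeholder URL; the real app will pull stored feeds from the DB
    feed = feedparser.parse("http://example.com/feed.rss")
    return render_to_response("feed.html",
                              {"title": feed.feed.title,
                               "entries": feed.entries})
The feed.html template would then just loop over the entries with a {% for entry in entries %} ... {% endfor %} block, keeping the HTML out of the Python code entirely.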

It's (a)live! Feedorific!

Since the old immortalcuriosity.com was all in Plone, after switching to a different server, I had a blank slate to work with. What better to fill this void with than my feedparsing project! So with a little Apache, a little Python, and a little help from my friends, the current form of the feedparsing system is accessible at www.immortalcuriosity.com. A warning: It looks like junk and barely works in IE. It's perfect in Firefox. Sorry, but I just don't care about a bad browser enough to put the hacks in yet. Maybe later.

Pythonic Discoveries, Part 2

While working on the feed parsing program, I wanted to display a numbered list. In C++, I would have used an incrementing variable and thought nothing of it. But I found this a little more complicated in Python. When I tried to simply set a variable and then display it along with an item in the list I was iterating through, I got an error that an int and a string cannot be concatenated. So I ended up doing it thus:

entrylist = []
for entry in current.entries:
    entrylist.append(entry.title)
bullet = 1
for x in entrylist:
    print str(bullet) + "- " + x
    bullet += 1
I don't know if this was a bad hack, or if there is an easier way to do it, but it worked. It seems interesting that an increment operator like C++'s ++ is not built into Python.
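
For what it's worth, Python's built-in enumerate() can supply the counter, so the second loop could also be written roughly like this:
# Same numbered list, with enumerate() providing the counter
for bullet, title in enumerate(entrylist):
    print str(bullet + 1) + "- " + title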

Feed Parsing, Attempt 1

I wrote a basic program tonight which accepts an RSS feed, displays its entries, and gives the user the option to view a given entry's description. It's not very complex, but it allowed me to figure out some basic Python stuff, and I am pleased with how well it works.
Things to add:
  • Escape the HTML in the descriptions, or display it differently (a rough sketch follows the code below).
  • Make it work on immortalcuriosity.com instead of just the command line.
  • Store the feeds to a database and give the option to refresh data for a feed previously entered.
I think once I get these 3 completed, I will have a good start on the first component of my system.

Oh, here's the code:
# A test program to learn about feedparser. It accepts a
# feed, displays its entries, and gives the option to
# display a given entry's description. At least works with
# Slashdot and KurzweilAI feeds.

import feedparser

# Get the feed to parse
uri = raw_input("Please enter the feed to be parsed: ")

# Grab the feed
current = feedparser.parse(uri)

# Parse the feed
title = current.feed.title
description = current.feed.description

# Print data on the feed
print
print
print uri + " aka " + title + " is described by its owner as: "
print description + "."
print

# Store entry titles and print them
print "The current items are: "
entrylist = []
for entry in current.entries:
    entrylist.append(entry.title)
bullet = 1
for x in entrylist:
    print str(bullet) + "- " + x
    bullet += 1
print

# Store item descriptions
entrydescs = []
for desc in current.entries:
    entrydescs.append(desc.description)

# See if any additional data is desired
contin = raw_input("Would you like to view any of those (Y or N)?: ")

# Find the item and print its description
if contin == "Y":
    checkme = raw_input("Ok, which item do you want to view? ")
    print
    print entrylist[int(checkme)-1] + ": "
    print entrydescs[int(checkme)-1]
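
For the first item on the to-do list above, escaping the HTML in a description could probably be done with the standard library's cgi.escape before printing. A minimal sketch, not yet wired into the program:
import cgi

# Convert <, >, & (and quotes) to entities so raw HTML in a description
# is shown as text rather than interpreted as markup
safe_desc = cgi.escape(entrydescs[int(checkme)-1], quote=True)
print safe_desc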

Helpful Components

I discovered 2 helpful Python modules that will most likely prove very useful in my parsing project.
  • Feedparser - This is a Python module which can parse a wide variety of the most common syndication formats. It is well documented, and seems well suited for the component I will need that takes a feed and parses it into its fixed elements.
  • pyparsing - This module allows for the creation of grammars directly in Python code.
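To give a flavor of pyparsing, here is a toy grammar along the lines of the examples in its documentation (the grammar itself is just an illustration, not something Feedorific needs):
from pyparsing import Word, alphas

# A toy grammar: a greeting word, a comma, a name, and an exclamation point
greeting = Word(alphas) + "," + Word(alphas) + "!"
print greeting.parseString("Hello, World!")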

I also came across a powerful new technique for extracting information from text: text-mining. Instead of the tedious formation of grammars and topics through supervised learning, this technique uses "topic modeling" to form topics and appropriate divisions based on combinations of words that commonly occur together.

Picture!

Although there may be different components added later, here is a basic diagram of the major components of the feed parsing system I aim to create.

A consideration of a personal business interaction

I wished for something today that at first surprised me, but made more sense in afterthought. At a grocery store, when it was my turn to check out, I knew exactly what the cashier would be likely to say, the exact procedure she would follow. After the perfunctory greeting, she would ask whether I had a discount card for the particular store, then scan some or all of the items (depending on their number), then ask which bag type I would like. After all items were scanned and the price totalled, I would brandish my check card, which would prompt the query "Credit or debit?". Of course, I would select debit because I would not want to take the time to get the receipt and sign it when I could simply put in my PIN. In any case, since I selected debit, she would ask whether I would like any cash back, to which I would respond in the negative. After the transaction was completed, she would provide me with a receipt and I would be on my way. Nothing too extraordinary.

But this is what I will do every time, aside from the extremely rare cases of asking for an alternative bag type or for cash back, as a result of special circumstances. In today's case, as I walked out, I found myself thinking, "I sure wish I could set my grocery store config file to those responses as default, and pass special options when necessary." I most likely thought this as a result of spending so much time in a programming environment, where seeking such improvements would be a matter of course. I smiled at my thought, but then asked "Wait, why not?" Why need I take my thought away from other considerations to respond whether I would like paper or plastic? Why would that be necessary more than once? Why not have my preferences for such things stored in my debit card, which I would slide through a reader after unpacking my items from the cart?
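
Just to make the daydream concrete, the "config file" I am picturing would be nothing fancier than a set of stored defaults, overridden only when circumstances call for it (the keys and values here are obviously made up):
# Hypothetical checkout defaults
checkout_prefs = {
    'discount_card': True,
    'bag_type': 'plastic',
    'payment': 'debit',
    'cash_back': False,
}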

I think pondering such possibilities is fruitful, in that there is no reason that even small elements of everyday life such as this should not be improved when possible.

More Efficient Transportation

I considered recently, while sitting in traffic backed up from a toll bridge, how a system could be implemented to solve such bottlenecks. It would be much preferable to have a national system of automatic tolls. Instead of having people sitting in booths taking cash, readers could be placed in such lanes, identical or similar to those in place for the "fast lanes" at tolls. Stickers with barcodes (or the equivalent patterns) could be distributed to every car owner via a national distribution, allowing for uniformity and a better selection of unique identifiers.

Local governments could still be given the option to opt-in or not, setting their own timetable for integration, but the data would be prepared. The readers may be expensive, as well as the maintenance fee that should be charged to allow for participation in the program, but it seems the cost would definitely be less than maintaining a human workforce at the tolls.

Such a system would be far faster from the beginning and would greatly reduce traffic volume at such common chokepoints, decreasing the likelihood of accidents, not to mention the frustration of drivers.

Pythonic discoveries, Part 1

The "item1.function(variable)" form in Python was confusing me, until I got something to work in the interpreter. I made a list:
>>> countries = ['USA', 'Russia', 'Cuba', 'Iceland', 'Greenland', 'Atlantis']
and wanted to add 'France' to it. Trying "append('France')" did not work, as the append() function did not know where to act. However, "countries.append('France')" did work. I realized I had been taking the "x.y" notation as something more complex than it actually is. It can be more easily understood after thinking about list comprehensions. A list comprehension takes an expression followed by a for clause, then zero or more for or if clauses. Thus in:
>>> num = [2, 4, 6]
>>> [3*x for x in num]
[6, 12, 18]
>>> [3*x for x in num if x > 3]
[12, 18]
the expression "3*x" is applied to each term x in the list "num".

When a module is imported, say one called "fruit", one can write "import fruit" and then call "fruit.peel()" (assuming that peel was defined in "fruit"), the same x.y() form. The "x.y" means: "do, or look for, y in x, or in the context of x."
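
Made concrete with two tiny files (fruit is still a made-up module, as above):
# fruit.py
def peel():
    print "peeling..."

# elsewhere, e.g. main.py
import fruit
fruit.peel()    # "do peel in the context of fruit"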

Adumbrations

I used to spend several hours a day combing through feeds, newsletters, and portals concerning topics I found interesting. At first, I merely noted them in memory. Then, as I became interested in being able to refer back to particular developments, I started a simple text file, divided into basic topics, such as "Security", "Programming Languages", etc. After a short time, this became unwieldy and ineffectual.

My next approach was to create an extensible folder structure, finely-grained by topic, containing document files which included the links to stories, tutorials, and other sources of information pertinent to the topic of their directory. I also found it easier to add personal notes and commentary on the information in this way. This method was much more effective, but still very limited.

As time progressed and my collection of information grew, I realized that the structure I had created was limited in two critical respects: 1) adding or editing information was overly time-consuming, as I had to find the appropriate document in the appropriate folder structure, then open, edit, and save it; and 2) searching the structure was limited to the search functionality built into Windows, and later to Google Desktop search. It became clear at that stage that the design of my record system needed to be centered around 3 basic principles:

  1. Ease of entering new information
  2. Ease of searching for information
  3. Ease of altering topical structure
Around the time when I began to grow dissatisfied with the folder/document system, I discovered Treepad. This is a wonderful program, with many potential uses. It allows one to form a structure of "trees" and "leaves" within a single file. Each leaf allows for the insertion of text and images, and the entire structure can be altered, linked, and searched with ease.

Using Treepad was much more effective, and I would have been satisfied with it, had I not found Freemind. Freemind offered all the features of Treepad, but instead of being limited to an Explorer-style visual interface, I had a topic map which could be unfolded as desired. I found it much easier to find information, track developments, and understand topics through this display method than through any structure I had used previously.

I ran into one final difficulty: as my topics of interest and my databank grew, I found that the time required to search through various feeds and portals, pick out interesting information, and record them in my topic structure was simply too great. Thus emerged the need for a system which I now set out to create. This system (for which I lack an effective name currently) will perform the following basic functions:
  • Record an initial list of syndicated feeds and URLs of portals of interest, and allow for new feeds and sites to be imported
  • Read each of the feeds or portals as appropriate and parse the information they contain
  • Tag each piece of content based on an extensible topical structure
  • Record content and applicable metadata in a database
  • Provide an interface for the content to be searched based on a variety of criteria
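As a very rough sketch of how those basic functions might hang together in Python (every name here is a placeholder of my own, not working project code):
import feedparser

def import_feeds(new_uris, stored_feeds):
    # Record an initial list of feeds and allow new ones to be added
    for uri in new_uris:
        if uri not in stored_feeds:
            stored_feeds.append(uri)
    return stored_feeds

def read_and_parse(stored_feeds):
    # Read each feed and parse out the information it contains
    parsed = []
    for uri in stored_feeds:
        feed = feedparser.parse(uri)
        for entry in feed.entries:
            parsed.append({'feed': uri,
                           'title': entry.title,
                           'description': entry.description})
    return parsed

def tag_content(parsed_entries, topics):
    # Naive tagging: attach any topic whose name appears in the title
    for item in parsed_entries:
        item['tags'] = [t for t in topics if t.lower() in item['title'].lower()]
    return parsed_entries

def search(tagged_entries, term):
    # Equally naive search over stored titles
    return [e for e in tagged_entries if term.lower() in e['title'].lower()]
Storage and the search interface would of course end up in the database and in Django views, respectively; this is just the shape of the pipeline.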

The system may also be extended to include the following much more advanced features:

  • Web spiders to find new feeds and sites likely to be of interest, based on the current topical structure and on areas where information is lacking but desired
  • Visual interface for graphical display of topical structure and content relations

I hope to write this system mainly in Python, as it seems to suit the task well with its simplicity and capability. This blog will serve as a record of the steps in its development, from things I learn about Python, to semantics, and probably much more.