Web Based Data Mining the Easy Way

Tuesday, January 10th, 2006

Data mining has been a topic of interest for me lately. Something as boring as taking HTML pages and pulling information from them seems fairly simple at first glance, but it is really a very challenging problem.

There have been various times that I’ve needed to pull a piece or two of information from a web page, and I’ve found myself researching the best ways to do so. Before I started using Python and BeautifulSoup this was no easy challenge. My mind does not work in regular expressions, and I suspect that there really are very few individuals whose minds do. I always seem to wing it because I rarely need to write them. Just when I feel the stroke of genius and accomplishment from getting just the right expression, the whole thing melts before my eyes when an obscure test pattern fails to match.
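To make the regular expression problem concrete, here is a minimal sketch comparing a naive pattern with a real HTML parser. It uses Python's standard `html.parser` module as a stand-in for BeautifulSoup so it runs without extra installs; the pattern, the class name and the sample markup are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# A pattern that looks right at first glance (illustrative only).
NAIVE_HREF = re.compile(r'<a href="([^"]*)"')


class HrefParser(HTMLParser):
    """Collects the href of every anchor tag, however it is written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def hrefs_with_regex(html):
    return NAIVE_HREF.findall(html)


def hrefs_with_parser(html):
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Three anchors: plain, upper case with single quotes, extra attribute first.
SAMPLE = """<a href="/one">1</a>
<A HREF='/two'>2</A>
<a class="x" href="/three">3</a>"""

print(hrefs_with_regex(SAMPLE))
print(hrefs_with_parser(SAMPLE))
```

The regex only catches the first anchor, while the parser finds all three. Every fix to the pattern tends to uncover another obscure case, which is exactly the melting feeling described above.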

A few weeks ago, I found the need to pull some obscure data from a few auction categories, and I started pulling information and processing it with Python. For my purposes, I needed to pull feedback and item pages for a particular user, then extract their auction details so I could store the information and use it for statistical research. With BeautifulSoup this was very straightforward, but neither quick nor very generic in nature. I was happy with BeautifulSoup because it’s the best tool that I’ve used so far, but I was still hoping to find something better or more straightforward.

Today at work I was testing a shopping cart on a recently completed project, and I pulled up some old web testing code that I typically use to help automate some of the more tedious parts that I hate testing. I used a very basic pipe-delimited text file format, and I used the Pamie library to handle the testing. There were a few features that I wanted to add to my file testing scripts, so I decided to rewrite my file spec to be less of a data file and more of a generic language for testing. Within a few minutes I came across the pyparsing library, but I ran out of daylight at work.

When I got home tonight I started browsing through the examples folder and came across a SQL parser example. After playing around with the demo, I thought, why couldn’t pulling website URLs from a page be as easy as “SELECT a FROM example.html”? Within an hour or so of hacking I created just that!

With some more hacking my little demo started to become a pretty powerful tool. Not only could I pull out individual tags and attributes of HTML data as easily as pulling from a database, but I also started working on pulling from multiple HTML files and other features. A pattern that I soon started working on was to pull the href attribute out of an anchor tag. So I settled on the syntax “SELECT a.href FROM *.html WHERE a.href <> NULL”. This will pull every non-empty website URL from a directory of HTML files and bring back a list.
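For anyone curious what evaluating such a query might look like, here is a rough sketch. It is not my pyparsing code: to keep it self-contained it assumes a tiny grammar handled by one regular expression (yes, I know) and uses the standard library's `html.parser` and `glob`. The names `parse_query`, `select_from_html` and `run_query` are hypothetical, and only the two query shapes mentioned in this post are supported.

```python
import glob
import re
from html.parser import HTMLParser

# Assumed mini-grammar:
#   SELECT tag[.attr] FROM file_pattern [WHERE tag.attr <> NULL]
QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<tag>\w+)(?:\.(?P<attr>[\w-]+))?"
    r"\s+FROM\s+(?P<source>\S+)"
    r"(?:\s+WHERE\s+\w+\.(?P<not_null>[\w-]+)\s*<>\s*NULL)?\s*$",
    re.IGNORECASE,
)


def parse_query(query):
    """Split a query string into its parts, or raise ValueError."""
    match = QUERY_RE.match(query)
    if match is None:
        raise ValueError("unsupported query: %r" % query)
    parts = match.groupdict()
    return {
        "tag": parts["tag"].lower(),
        "attr": parts["attr"].lower() if parts["attr"] else None,
        "source": parts["source"],
        "not_null": parts["not_null"].lower() if parts["not_null"] else None,
    }


class TagCollector(HTMLParser):
    """Collects the attributes of every start tag with the given name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.rows.append(dict(attrs))


def select_from_html(html, tag, attr=None, not_null=None):
    """Return one row per matching tag: an attribute value or the full dict."""
    collector = TagCollector(tag)
    collector.feed(html)
    collector.close()
    rows = collector.rows
    if not_null:
        # Treat missing and empty attributes alike, as "<> NULL" does here.
        rows = [row for row in rows if row.get(not_null)]
    if attr:
        return [row.get(attr) for row in rows]
    return rows


def run_query(query):
    """Run a query against every local file matching the FROM pattern."""
    parsed = parse_query(query)
    results = []
    for path in sorted(glob.glob(parsed["source"])):
        with open(path, encoding="utf-8", errors="replace") as handle:
            results.extend(
                select_from_html(
                    handle.read(), parsed["tag"], parsed["attr"], parsed["not_null"]
                )
            )
    return results
```

Calling `run_query("SELECT a.href FROM *.html WHERE a.href <> NULL")` in a folder of saved pages would return a flat list of href values. A real grammar built with pyparsing would be far easier to extend than the single regular expression used here.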

Some features that would be cool to add:

  • Pull data from live websites and not just locally saved files / folders. I would also want to add some intelligent data caching to speed up pulling down websites.
  • Add UNIQUE and/or GROUP BY features and other various useful SQL commands to gather more info. Doing basic COUNT(*) would be useful to add.
  • Add a LIMIT keyword to limit how many records get collected.
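Several of these wishlist items could start life as plain post-processing on the list a query brings back. The sketch below assumes the results are a simple Python list of hashable values (href strings, for example); the function names are hypothetical and none of this exists in the tool yet.

```python
from collections import Counter
from itertools import islice


def unique(rows):
    """UNIQUE: drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def limit(rows, n):
    """LIMIT n: keep at most the first n rows."""
    return list(islice(rows, n))


def count_all(rows):
    """COUNT(*): total number of rows."""
    return len(rows)


def count_by_value(rows):
    """GROUP BY value with COUNT(*): how often each value appears."""
    return Counter(rows)


# Example with made-up results.
rows = ["/home", "/about", "/home", "/contact"]
print(unique(rows))
print(limit(rows, 2))
print(count_all(rows))
print(count_by_value(rows).most_common(1))
```

Doing LIMIT after the fact still reads every file, so a proper version would want to stop collecting as soon as enough rows have been found.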

As long as this tool is useful and I have time, I’ll continue to refine it. I’ll debate cleaning up the code and posting it but I’m not sure of the general usefulness to anyone but myself.