HTML Stripper
Keywords: Python, Strip HTML tags, sgmllib
This class shows how to use sgmllib to remove tags from HTML. It is more accurate than stripping tags with a RegEx, because the parser copes better with unescaped brackets inside attribute values.
Many people seem averse to using the parser classes to handle HTML, but they're really not that hard to use, and doing more than the simplest RegEx over markup is asking for trouble.
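To make the bracket problem concrete, here is a small sketch of what a naive RegEx does with an unescaped ">" inside an attribute. The pattern and the helper name regex_strip are illustrative assumptions, not something from the original recipe:

```python
import re

# Naive pattern (an illustrative assumption): treat everything from "<"
# up to the next ">" as a tag.
NAIVE_TAG = re.compile(r"<[^>]*>")


def regex_strip(some_html):
    """Strip tags with the naive regex; breaks on ">" inside attributes."""
    return NAIVE_TAG.sub("", some_html)


# Simple markup works fine.
plain = regex_strip("<tag>some boring <a>text</a> goes here</tag>")

# An unescaped ">" in the title attribute ends the match too early,
# so part of the attribute leaks into the output.
broken = regex_strip('<a title="a > b">text</a>')

print(repr(plain))
print(repr(broken))
```

The first call returns the expected text, while the second leaves the tail of the attribute (` b">`) in front of "text". That leftover is the kind of trouble a real parser avoids.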
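import sgmllib  # Python 2 only; the module was removed in Python 3

class Stripper(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)

    def strip(self, some_html):
        # Collect the text between tags, then return it.
        self.theString = ""
        self.feed(some_html)
        self.close()
        return self.theString

    def handle_data(self, data):
        # Called by the parser for each run of plain text.
        self.theString += data

stripper = Stripper()
# Prints: some boring text goes here
print stripper.strip("<tag>some boring <a>text</a> goes here</tag>")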
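The class above relies on sgmllib, which exists only in Python 2. As a rough sketch of the same idea on Python 3, assuming the standard library's html.parser.HTMLParser is an acceptable stand-in, something like the following should behave similarly. The chunks list and the reset() call are my own choices here, not part of the original recipe:

```python
from html.parser import HTMLParser


class Stripper(HTMLParser):
    """Collects only the text content, discarding all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def strip(self, some_html):
        # Reset parser state so one instance can be reused safely.
        self.reset()
        self.chunks = []
        self.feed(some_html)
        self.close()  # flush any text still buffered by the parser
        return "".join(self.chunks)

    def handle_data(self, data):
        # Called by the parser for each run of plain text.
        self.chunks.append(data)


stripper = Stripper()
print(stripper.strip("<tag>some boring <a>text</a> goes here</tag>"))
print(stripper.strip('<a title="a > b">text</a>'))
```

The second call shows the case that motivates using a parser at all: the ">" inside the quoted title attribute is treated as part of the attribute value, so only "text" comes back. Collecting pieces in a list and joining once is just a habit to avoid repeated string concatenation; appending to a string as in the original works too.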