Keywords: Python, Strip HTML tags, sgmllib

HTML Stripper

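This class shows how to use sgmllib to remove tags from HTML. (sgmllib is a Python 2 standard-library module; it was removed in Python 3.) A parser is more accurate than a RegEx for stripping tags because it copes better with unescaped brackets inside attribute values.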

Many people seem averse to using the parser classes to handle HTML, but they're really not that hard to use, and doing more than the simplest RegEx over markup is asking for trouble.
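To make the bracket problem concrete, here is a small sketch of what a naive tag-stripping RegEx does with an attribute that contains an unescaped ">". The pattern and the sample markup are illustrative assumptions, not taken from any particular codebase.

```python
import re

# Sample markup (hypothetical): the title attribute contains an unescaped ">".
html = '<a title="a > b">link</a>'

# A common naive pattern: "<", then anything that is not ">", then ">".
# It stops at the ">" inside the attribute value, so part of the tag survives.
naive = re.sub(r"<[^>]*>", "", html)

print(repr(naive))  # prints: ' b">link'
```

The leftover fragment of the attribute leaks into the output, which is the kind of failure a real parser avoids.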

import sgmllib

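class Stripper(sgmllib.SGMLParser):
	"""Collects the text of an HTML document and discards the tags."""

	def __init__(self):
		sgmllib.SGMLParser.__init__(self)
		self.theString = ""

	def strip(self, some_html):
		# Reset the parser state so the same instance can be reused.
		self.reset()
		self.theString = ""
		self.feed(some_html)
		self.close()  # process any data still buffered
		return self.theString

	def handle_data(self, data):
		# Called by the parser for each run of text between tags.
		self.theString += data

stripper = Stripper()
# Prints: some boring text goes here
print stripper.strip("<tag>some boring <a>text</a> goes here</tag>")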
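
Since sgmllib is not available in Python 3, the same idea can be sketched with html.parser.HTMLParser from the Python 3 standard library. This is a minimal port under stated assumptions, not a drop-in replacement: the class name HTMLStripper is made up for this example, and it assumes Python 3.5 or later, where convert_charrefs is available and character references such as &amp; arrive in handle_data already decoded.

```python
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):
    """Collects the text of an HTML document and discards the tags."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def strip(self, some_html):
        # Reset parser state so the same instance can be reused.
        self.reset()
        self._chunks = []
        self.feed(some_html)
        self.close()  # flush any text still buffered by the parser
        return "".join(self._chunks)

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self._chunks.append(data)


stripper = HTMLStripper()
print(stripper.strip("<tag>some boring <a>text</a> goes here</tag>"))
# prints: some boring text goes here
```

Collecting the chunks in a list and joining them once avoids repeated string concatenation on large documents; otherwise the structure mirrors the sgmllib version.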