Knowledge of the following is required:
- Python 3
- Basic HTML
- urllib.request, the Python 3 successor to urllib2 (not mandatory but recommended)
- Basic OOP concepts
- Python data structures - Lists, Tuples
Why parse HTML?
Python is one of the languages most widely used to scrape data from web pages. Scraping is an easy way to gather information. For instance, it can be very helpful for quickly extracting all the links in a web page and checking their validity. This is only one example of many potential uses... so read on!
The next question is: where is this information extracted from? To answer this, let's use an example. Go to the NYTimes website and right-click on the page. Select View page source or simply press Ctrl + U on your keyboard. A new page opens containing a number of links, HTML tags, and content. This is the source from which the HTML Parser scrapes content for NYTimes!
What is HTML Parser?
HTML Parser, as the name suggests, simply parses a web page’s HTML/XHTML content and provides the information we are looking for. It is a class (HTMLParser) whose handler methods can be overridden to suit our requirements. Note that HTML Parser does not download anything itself, so the web page must be fetched first. For this reason, it is often used with urllib.request (the Python 3 replacement for urllib2).
To use the HTML Parser, import the HTMLParser class from the html.parser module:
from html.parser import HTMLParser
Methods in HTML Parser
- HTMLParser.feed(data) - It is through this method that the HTML Parser reads data. In Python 3, data must be a str, so bytes read from the network have to be decoded first. The parser processes as much of the input as forms complete elements, calling the matching handler methods as it goes; incomplete data is buffered until more data is fed or close() is called.
- HTMLParser.close() - This method marks the end of the input and forces the parser to process any buffered data as if it were followed by an end-of-file mark.
- HTMLParser.reset() - This method resets the instance and all unprocessed data is lost.
- HTMLParser.handle_starttag(tag, attrs) - This method deals with start tags only, like <title>. The tag argument is the name of the start tag, whereas attrs is a list of the attributes found inside the start tag. For example, for the tag <Meta name="PT"> the method call would be handle_starttag('meta', [('name', 'PT')]). Note that the tag name was converted to lowercase and each attribute was converted to a (name, value) tuple and added to the list. A tag with several attributes produces one tuple per attribute: for <meta name="application-name" content="The New York Times"> the method call would be handle_starttag('meta', [('name', 'application-name'), ('content', 'The New York Times')]).
- HTMLParser.handle_endtag(tag) - This method is similar to the one above, except that it deals only with end tags like </body>. Since an end tag carries no attributes, this method takes only one argument, which is the tag name itself. For example, the method call for </body> will be handle_endtag('body'). As with handle_starttag(tag, attrs), the tag name is converted to lowercase.
- HTMLParser.handle_startendtag(tag, attrs) - As the name suggests, this method deals with XHTML-style self-closing tags such as <a href="http://nytimes.com" />. The arguments tag and attrs have the same meaning as in HTMLParser.handle_starttag(tag, attrs). If you do not override it, the default implementation simply calls handle_starttag() followed by handle_endtag().
- HTMLParser.handle_data(data) - This method receives the text found between tags, such as the content of <p> ... </p>. This is particularly helpful when you want to look for specific words or expressions. Combined with regular expressions, this method can work wonders.
- HTMLParser.handle_comment(data) - As the name suggests, this method is used to deal with comments. For <!--ny times--> the method call would be handle_comment('ny times').
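To make the calls described in the list above concrete, here is a minimal sketch that records every handler call in order. The class name EventRecorder and the sample markup are made up for illustration; only the overridden method names come from HTMLParser itself. The input is deliberately fed in two chunks, with the first one ending before the closing ">" of the tag, to show that feed() buffers incomplete data.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: stores each handler call as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


recorder = EventRecorder()

# The first chunk stops before the closing ">" of the tag, so the parser
# cannot report anything yet and keeps the text in its buffer.
recorder.feed('<Meta name="PT"')
events_after_first_chunk = list(recorder.events)

# The second chunk completes the tag and adds more markup.
recorder.feed('><p>Hello</p><br/><!--ny times-->')
recorder.close()

print(events_after_first_chunk)
for event in recorder.events:
    print(event)
```

Running this should show an empty list after the first chunk, followed by one tuple per handler call, with the tag name Meta reported in lowercase.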
Whew! That's a lot to process, but these are some of the main (and most useful) methods of HTML Parser. If your head is swirling, don’t worry. Let's look at an example to make things a little clearer.
How does HTML Parser work?
Now that you are equipped with theoretical knowledge, let’s test things out practically. The example below fetches a page with urllib.request, which is part of the Python 3 standard library, so there is nothing to install with pip. (urllib2 was the name of this module in Python 2; it does not exist in Python 3 and cannot be installed there.)
from html.parser import HTMLParser
import urllib.request  # the Python 3 replacement for urllib2


class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        # Initializing lists (per instance, so two parsers never share results)
        self.lsStartTags = []
        self.lsEndTags = []
        self.lsStartEndTags = []
        self.lsComments = []

    # HTML Parser methods
    def handle_starttag(self, startTag, attrs):
        self.lsStartTags.append(startTag)

    def handle_endtag(self, endTag):
        self.lsEndTags.append(endTag)

    def handle_startendtag(self, startendTag, attrs):
        self.lsStartEndTags.append(startendTag)

    def handle_comment(self, data):
        self.lsComments.append(data)


# Creating an object of the overridden class
parser = MyHTMLParser()

# Opening the NYTimes site using urllib.request
html_page = urllib.request.urlopen("https://www.nytimes.com/")

# Feeding the content. read() returns bytes, so decode it to str first
# (UTF-8 is assumed here; undecodable bytes are replaced, not fatal).
parser.feed(html_page.read().decode("utf-8", errors="replace"))
parser.close()

# Printing the extracted values
print("Start tags", parser.lsStartTags)
# print("End tags", parser.lsEndTags)
# print("Start End tags", parser.lsStartEndTags)
# print("Comments", parser.lsComments)
Alternatively, if you don’t want to fetch a page over the network, you can directly feed a string of HTML tags to the parser like so:
parser = MyHTMLParser()
parser.feed('<html><body><title>Test</title></body>')
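Feeding a string is also a convenient way to try out handle_data. The sketch below, which assumes a made-up class name TitleParser, uses the same test string and collects only the text that appears between <title> and </title>. It tracks whether the parser is currently inside the title element, because handle_data on its own does not say which tag the text belongs to.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect them all.
        if self.in_title:
            self.title_parts.append(data)


title_parser = TitleParser()
title_parser.feed('<html><body><title>Test</title></body>')
title_parser.close()

title = "".join(title_parser.title_parts)
print(title)
```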
When running the NYTimes example, print one list at a time. A real page produces a very large amount of output, and printing everything at once can make IDLE slow or unresponsive.
NOTE: In case you get the error: IDLE cannot start the process, start your Python IDLE in administrator mode. This should solve the problem.
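Coming back to the link-extraction use case mentioned at the start, here is a minimal sketch of how it could be done with handle_starttag. The class name LinkCollector and the sample HTML are assumptions made for illustration. Checking each link's validity would be a separate step and is not shown.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples with lowercase names.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


sample_html = (
    '<p>Read <a href="https://www.nytimes.com/">the news</a> or '
    '<A HREF="/section/world">world coverage</A>. '
    '<a name="top">No link here</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
print(collector.links)
```

Because the parser lowercases tag and attribute names, the uppercase <A HREF=...> is picked up too, while the anchor without an href is skipped.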
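HTMLParser.HTMLParseError - In Python 2 and in Python 3 up to version 3.4, this exception could be raised when the HTML Parser encountered corrupt data. It carried three attributes: msg gave the reason for the error, lineno specified the line number where the error occurred, and offset gave the character position where the offending construct started. The exception was deprecated and then removed in Python 3.5. The current parser is tolerant of invalid markup and does its best to keep going instead of raising an error, so you should not try to catch HTMLParseError in modern code.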
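On Python 3.5 and later the parser is lenient with broken markup. The short sketch below, using a made-up class name TagLogger and a deliberately malformed string, feeds HTML with unclosed tags and completes without any exception.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: remembers every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


logger = TagLogger()

# Neither <p> nor <b> is ever closed, yet parsing still succeeds.
logger.feed("<p>unclosed <b>bold")
logger.close()
print(logger.tags)
```

The parser reports what it can recognise and does not validate the document, so checking that tags are balanced is left to your own code.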
That brings us to the end of this article on HTML Parser. Be sure to try out more examples on your own to improve your understanding! Do read about BeautifulSoup, another amazing Python module that helps with HTML scraping. Unlike html.parser, it is not part of the standard library, so you will have to install it before using it. Keep learning and happy Pythoning!