Parsing HTML using Python

28/12/2020
Parsing HTML is one of the most common task done today to collect information from the websites and mine it for various purposes, like to establish price performance of a product over time, reviews of a book on a website and much more. There exist many libraries like BeautifulSoup in Python which abstracts away so many painful points in parsing HTML but it is worth knowing how those libraries actually work beneath that layer of abstraction.

In this lesson, that is what we intend to do. We will find out how values of different HTML tags can be extracted and also override the default functionality of this module to add some logic of our own. We will do this using the HTMLParser class in Python in html.parser module. Let’s see the code in action.

Looking at HTMLParser class

To parse HTML text in Python, we can make use of HTMLParser class in html.parser module. Let’s look at the class dfinition for the HTMLParser class:

class html.parser.HTMLParser(*, convert_charrefs=True)

The convert_charrefs field, if set to True will make all the character references converted to their Unicode equivalents. Only the script/style elements aren’t converted. Now, we will try to understand each function for this class as well to better understand what each function does.

  • handle_startendtag This is the first function which is triggered when HTML string is passed to the class instance. Once the text reaches here, the control is passed to other functions in the class which narrows down to other tags in the String. This is also clear in the definition for this function:
    def handle_startendtag(self, tag, attrs):
    self.handle_starttag(tag, attrs)
    self.handle_endtag(tag)
  • handle_starttag: This method manages the start tag for the data it receives. Its definition is as shown below:
    def handle_starttag(self, tag, attrs):
    pass
  • handle_endtag: This method manages the end tag for the data it receives:
    def handle_endtag(self, tag):
    pass
  • handle_charref: This method manages the character references in the data it receives. Its definition is as shown below:
    def handle_charref(self, name):
    pass
  • handle_entityref: This function handles the entity references in the HTML passed to it:
    def handle_entityref(self, name):
    pass
  • handle_data:This is the function where real work is done to extract values from the HTML tags and is passed the data related to each tag. Its definition is as shown below:
    def handle_data(self, data):
    pass
  • handle_comment: Using this function, we can also get comments attached to an HTML source:
    def handle_comment(self, data):
    pass
  • handle_pi: As HTML can also have processing instructions, this is the function where these Its definition is as shown below:
    def handle_pi(self, data):
    pass
  • handle_decl: This method handles the declarations in the HTML, its definition is provided as:
    def handle_decl(self, decl):
    pass

Subclassing the HTMLParser class

In this section, we will sub-class the HTMLParser class and will take a look at some of the functions being called when HTML data is passed to class instance. Let’s write a simple script which do all of this:

from html.parser import HTMLParser

class LinuxHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
print("Start tag encountered:", tag)

def handle_endtag(self, tag):
print("End tag encountered :", tag)

def handle_data(self, data):
print("Data found :", data)

parser = LinuxHTMLParser()
parser.feed(

<h1>Python HTML parsing module</h1>
)

Here is what we get back with this command:

Python HTMLParser subclass

HTMLParser functions

In this section, we will work with various functions of the HTMLParser class and look at functionality of each of those:

from html.parser import HTMLParser
from html.entities import name2codepoint

class LinuxHint_Parse(HTMLParser):
def handle_starttag(self, tag, attrs):
print("Start tag:", tag)
for attr in attrs:
print(" attr:", attr)

def handle_endtag(self, tag):
print("End tag :", tag)

def handle_data(self, data):
print("Data :", data)

def handle_comment(self, data):
print("Comment :", data)

def handle_entityref(self, name):
c = chr(name2codepoint[name])
print("Named ent:", c)

def handle_charref(self, name):
if name.startswith(‘x’):
c = chr(int(name[1:], 16))
else:
c = chr(int(name))
print("Num ent :", c)

def handle_decl(self, data):
print("Decl :", data)

parser = LinuxHint_Parse()

With various calls, let us feed separate HTML data to this instance and see what output these calls generate. We will start with a simple DOCTYPE string:

parser.feed(‘<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ‘
‘"http://www.w3.org/TR/html4/strict.dtd">’)

Here is what we get back with this call:

DOCTYPE String

Let us now try an image tag and see what data it extracts:

parser.feed(‘<img src="https://linuxhint.com/wp-content/uploads/2016/12/linuxhint_fb_1.png" alt="The Python logo">’)

Here is what we get back with this call:

HTMLParser image tag

Next, let’s try how script tag behaves with Python functions:

parser.feed(‘<script type="text/javascript">’
‘alert("<strong>LinuxHint Python</strong>");</script>’)
parser.feed(‘<style type="text/css">#python { color: green }</style>’)
parser.feed(‘#python { color: green }’)

Here is what we get back with this call:

Script tag in htmlparser

Finally, we pass comments to the HTMLParser section as well:

parser.feed(‘<!– This marks the beginning of samples. –>’
‘<!– [if IE 9]>IE-specific content<![endif]–>’)

Here is what we get back with this call:

Parsing comments

Conclusion

In this lesson, we looked at how we can parse HTML using Python own HTMLParser class without any other library. We can easily modify the code to change the source of the HTML data to an HTTP client.

Read more Python based posts here.

ONET IDC thành lập vào năm 2012, là công ty chuyên nghiệp tại Việt Nam trong lĩnh vực cung cấp dịch vụ Hosting, VPS, máy chủ vật lý, dịch vụ Firewall Anti DDoS, SSL… Với 10 năm xây dựng và phát triển, ứng dụng nhiều công nghệ hiện đại, ONET IDC đã giúp hàng ngàn khách hàng tin tưởng lựa chọn, mang lại sự ổn định tuyệt đối cho website của khách hàng để thúc đẩy việc kinh doanh đạt được hiệu quả và thành công.
Bài viết liên quan

NLTK Tutorial in Python

The era of data is already here. The rate at which the data is generated today is higher than ever and it is always growing....
29/12/2020

Python Matplotlib Tutorial

In this lesson on Python Matplotlib library, we will look at various aspects of this data visualisation library which we...
29/12/2020

Python NumPy Tutorial

In this lesson on Python NumPy library, we will look at how this library allows us to manage powerful N-dimensional array...
29/12/2020