Pyquery Installation
To install Pyquery in Ubuntu, use the command specified below:
$ sudo apt install python3-pyquery
You can also install the latest version of Pyquery with the “pip” package manager by running the following two commands in succession:
$ sudo apt install python3-pip
$ pip3 install pyquery
To install Pyquery on other Linux distributions, install “pip3” from the distribution's package manager and run the second command mentioned above.
Creating a Parsable Document Tree
Before you can parse and extract data from an HTML document, you need to create a document tree. You can create a document tree from simple HTML markup using the code sample below:
from pyquery import PyQuery as pq
document = pq("<html>Hello World !!</html>")
print(document)
print(type(document))
The first statement imports the “PyQuery” class from the “pyquery” module. Next, a new instance of PyQuery class is created. After running the code sample above, you should get the following output:
<html>Hello World !!</html>
<class 'pyquery.pyquery.PyQuery'>
Notice the second line in the output. Here “document” is an instance of the “PyQuery” class, not a string object, even though printing it shows markup. You can quickly list all the methods available for the “document” instance by adding the following extra line to the code sample above:
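from pyquery import PyQuery as pq
document = pq("<html>Hello World !!</html>")
# help() prints the documentation itself, so wrapping it in print() is not needed
help(document)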
You can also browse the API for the “PyQuery” class in the official online documentation.
To create a document tree from a URL, use the following code instead (replace the value of “url” with your desired address):
from pyquery import PyQuery as pq
document = pq(url='https://example.com')
print(document)
To create a document tree from a local HTML file, use the code below (replace the value of “filename” according to your needs):
from pyquery import PyQuery as pq
document = pq(filename='index.html')
print(document)
Now that you have a document tree, you can start parsing it.
Manipulating the Document Tree
You can extract data and manipulate document trees using a variety of methods. Some of the most common methods are listed below with samples. For all usable methods, refer to the PyQuery API documentation.
You can use the “text” method to get the text content of an element:
from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
print(p.text())
You can select a specific tag or element by passing its name as an argument to the “document” instance. After running the above code sample, you should get the following output:
Hello World !!
You can get the attributes of a tag by using the “attr” method. To do so, pick the tag you want to parse (‘p’ in this case) and supply the attribute name as an argument (‘id’ in this case) or use dot notation:
from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
print(document)
print(p.attr("id"), p.attr.id)
After running the above code sample, you should get the following output:
<html><p id="hw">Hello World !!</p></html>
hw hw
You can manipulate CSS using the “css” method. To add CSS styles to a “<p>” tag or any other tag, you can use the following code:
from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
p.css({"color": "red"})
print(document)
print(p.attr("style"))
Replace the {"color": "red"} part with your own custom styles. After running the above code sample, you should get the following output and can verify that the CSS has been correctly applied:
<html><p id="hw" style="color: red">Hello World !!</p></html>
color: red
If you have a pre-styled class, you can use the “addClass” method to apply its existing styles:
from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
p.addClass("mystyle")
You can append and prepend your own custom markup using the code sample below:
from pyquery import PyQuery as pq
document = pq('''<p id="hw">Hello World !!</p>''')
p = document('p')
p.prepend("<p>Hi</p>")
p.append("<p>Bye</p>")
print(document)
Replace the arguments in the “prepend” and “append” methods with your own values. After running the above code sample, you should get the following output:
<p id="hw"><p>Hi</p>Hello World !!<p>Bye</p></p>
To remove the contents of an element, use the “empty” method:
from pyquery import PyQuery as pq
document = pq('''<p id="hw">Hello World !!</p>''')
p = document('p')
p.empty()
print(document)
After running the above code sample, you should get the following output:
<p id="hw"/>
You can use the “filter” method to select specific elements when there are multiple tags of the same type. For instance, the code below picks the “<p>” tag whose “id” is “hello”:
from pyquery import PyQuery as pq
document = pq('''<p id="hello">Hello</p><p id="world">World !!</p>''')
p = document('p')
print(p.filter("#hello"))
After running the above code sample, you should get the following output:
<p id="hello">Hello</p>
You can find multiple tags or elements at once using the “find” method:
from pyquery import PyQuery as pq
document = pq('''<p id="hello">Hello</p><p id="world">World !!</p>''')
print(document.find('p'))
Supply the tag or element name as an argument to the “find” method. After running the above code sample, you should get the following output:
<p id="hello">Hello</p><p id="world">World !!</p>
You can switch between “xml” and “html” parsers using an additional “parser” argument:
from pyquery import PyQuery as pq
document = pq('''<p id="hello">Hello</p><p id="world">World !!</p>''', parser="html")
print(document)
If you need further help with Pyquery, refer to its official documentation and examples.
Conclusion
PyQuery lets you parse HTML documents quickly with minimal code, as its numerous helper functions cover many tasks that would otherwise require custom code. Its jQuery-like syntax and structure also help in selecting elements and nodes without manually traversing the document tree, especially when there is a lot of nested markup.
from Linux Hint https://ift.tt/3xE0Jfr