Examples

This page contains simple examples of how to use Selectolax for HTML parsing and manipulation.

Note

All examples use the Lexbor backend (from selectolax.lexbor import LexborHTMLParser) which provides better performance and features compared to the older Modest backend.

Basic HTML Parsing

There are 3 ways to create or parse objects in Selectolax:

  1. Parse HTML as a full document using LexborHTMLParser()

  2. Parse HTML as a fragment using LexborHTMLParser(..., is_fragment=True)

  3. Create single node using LexborHTMLParser(...).create_node()

  • LexborHTMLParser() - Returns the HTML tree as parsed by Lexbor, unmodified. The HTML is assumed to be a full document. <html>, <head>, and <body> tags are added if missing.

  • LexborHTMLParser(..., is_fragment=True) - Intended for HTML fragments/partials.

    Behaves the same way as DocumentFragment in browsers. Drops <html>, <head>, and <body> tags if present in the input HTML. Use it to parse snippets of HTML that are not complete documents.

from selectolax.lexbor import LexborHTMLParser

html = """
<body>
    <span id="vspan"></span>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">
        <p class='p3' style='display:none;'>Excepteur <i>sint</i> occaecat cupidatat non proident</p>
        <p class='p3' vid>Lorem ipsum</p>
    </div>
    <div>
        <p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>
</body>
"""

fragment = """
<div>
    <p class="p3">
        Hello there!
    </p>
</div>
<script>
    document.querySelector(".p3").addEventListener("click", () => { ... });
</script>
"""

# Parse HTML as a full document
parser = LexborHTMLParser(html)

# Parse HTML as a fragment
frag_parser = LexborHTMLParser(fragment, is_fragment=True)

# Create a new node for `parser`.
node = parser.create_node("div")

CSS Selectors

Select All Elements with CSS

Find all paragraph elements with class ‘p3’ and examine their properties.

from selectolax.lexbor import LexborHTMLParser

html = """
<body>
    <div id="text">
        <p class='p3' style='display:none;'>Excepteur <i>sint</i> occaecat cupidatat non proident</p>
        <p class='p3' vid>Lorem ipsum</p>
    </div>
    <div>
        <p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>
</body>
"""

parser = LexborHTMLParser(html)
selector = "p.p3"

for node in parser.css(selector):
    print('---------------------')
    print('Node: %s' % node.html)
    print('attributes: %s' % node.attributes)
    print('node text: %s' % node.text(deep=True, separator='', strip=False))
    print('tag: %s' % node.tag)
    print('parent tag: %s' % node.parent.tag)
    if node.last_child:
        print('last child inside current node: %s' % node.last_child.html)
    print('---------------------\n')

Output:

---------------------
Node: <p class="p3" style="display:none;">Excepteur <i>sint</i> occaecat cupidatat non proident</p>
attributes: {'class': 'p3', 'style': 'display:none;'}
node text: Excepteur sint occaecat cupidatat non proident
tag: p
parent tag: div
last child inside current node: Excepteur <i>sint</i> occaecat cupidatat non proident
---------------------

---------------------
Node: <p class="p3" vid>Lorem ipsum</p>
attributes: {'class': 'p3', 'vid': ''}
node text: Lorem ipsum
tag: p
parent tag: div
last child inside current node: Lorem ipsum
---------------------

Select First Match

Get the first matching element using CSS selectors.

# `html` is the document from "Basic HTML Parsing", which contains the <h1>
parser = LexborHTMLParser(html)

# Get first h1 element
print("H1: %s" % parser.css_first('h1').text())

Output:

H1: Welcome to selectolax tutorial

Default Return Values

Handle cases where no elements match your selector by providing a default value.

# Return default value if no matches found
print("Title: %s" % parser.css_first('title', default='not-found'))

Output:

Title: not-found

Strict Mode

Ensure exactly one match exists, otherwise raise an error.

# strict=True raises an error when more than one match is found
try:
    result = parser.css_first("p.p3", default='not-found', strict=True)
except Exception as e:
    # Print the exception type together with its message
    print(f"{type(e).__name__}: {e}")

Output:

ValueError: Expected 1 match, but found 2 matches

CSS Chaining

Chain multiple CSS selectors to progressively filter results.

html = """
<div id="container">
    <span class="red"></span>
    <span class="green"></span>
    <span class="red"></span>
    <span class="green"></span>
</div>
"""

parser = LexborHTMLParser(html)

# Chain selectors: start with div, then span, then .red
red_spans = parser.select('div').css("span").css(".red").matches
print([node.html for node in red_spans])

Output:

['<span class="red"></span>', '<span class="red"></span>']

HTML manipulation

Getting HTML data back

You can get HTML data back using the .html or .inner_html properties. Both are available on any node.

from selectolax.lexbor import LexborHTMLParser
html = """
<div id="main">
  <div>Hi there</div>
  <div id="updated">2021-08-15</div>
 </div>
"""
parser = LexborHTMLParser(html)
node = parser.css_first("#main")
print("Inner html:\n")
print(node.inner_html)
print("\nOuter html:\n")
print(node.html)

Output:

Inner html:

  <div>Hi there</div>
  <div id="updated">2021-08-15</div>

Outer html:

<div id="main">
  <div>Hi there</div>
  <div id="updated">2021-08-15</div>
 </div>

Changing HTML

You can also change HTML by setting the .inner_html property.

from selectolax.lexbor import LexborHTMLParser
html = """
<div id="main">
  <div>Hi there</div>
 </div>
"""
parser = LexborHTMLParser(html)
node = parser.css_first("#main")
print("Old html:\n")
print(node.html)

node.inner_html = "<span>Test</span>"
print("\nNew html:\n")
print(node.html)

Output:

Old html:

<div id="main">
  <div>Hi there</div>
 </div>

New html:

<div id="main"><span>Test</span></div>

DOM Navigation

Parent Elements

Get parent element in the DOM tree.

# Print parent of p#stext
print(parser.css_first('p#stext').parent.html)

Output:

<div>
        <p id="stext">Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>

Nested Selectors

Chain CSS selectors to find nested elements.

# Chain CSS selectors
result = parser.css_first('div#text').css_first('p:nth-child(2)').html
print(result)

Output:

<p class="p3" vid>Lorem ipsum</p>

Iterating Over Child Nodes

Walk all child nodes of an element.

for node in parser.css("div#text"):
    for cnode in node.iter():
        print(cnode.tag, cnode.html)

Output:

p <p class="p3" style="display:none;">Excepteur <i>sint</i> occaecat cupidatat non proident</p>
p <p class="p3" vid>Lorem ipsum</p>

DOM Modification

Tag Removal

Completely remove elements from the DOM tree.

# `html` is the document from "Basic HTML Parsing"
parser = LexborHTMLParser(html)

# Remove all p tags
for node in parser.tags('p'):
    node.decompose()

print(parser.body.html)

Output:

<body>
    <span id="vspan"></span>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">


    </div>
    <div>

    </div>
</body>

Tag Unwrapping

Remove tags but preserve their content.

parser = LexborHTMLParser(html)

# Remove p and i tags but keep their content
parser.unwrap_tags(['p', 'i'])
print(parser.body.html)

Output:

<body>
    <span id="vspan"></span>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">
        Excepteur sint occaecat cupidatat non proident
        Lorem ipsum
    </div>
    <div>
        Lorem ipsum dolor sit amet, ea quo modus meliore platonem.
    </div>
</body>

Attribute Manipulation

Add, modify, and remove element attributes.

parser = LexborHTMLParser(html)
node = parser.css_first('div#text')

# Set attributes
node.attrs['data'] = 'secret data'
node.attrs['id'] = 'new_id'
print(node.attributes)

# Remove attributes
del node.attrs['id']
print(node.attributes)
print(node.html)

Output:

{'id': 'new_id', 'data': 'secret data'}
{'data': 'secret data'}
<div data="secret data">
        <p class="p3" style="display:none;">Excepteur <i>sint</i> occaecat cupidatat non proident</p>
        <p class="p3" vid>Lorem ipsum</p>
    </div>

Inserting Nodes

Insert new content into the DOM at specific positions.

html = """
<div id="container">
    <span class="red"></span>
    <span class="green"></span>
    <span class="red"></span>
    <span class="green"></span>
</div>
"""

parser = LexborHTMLParser(html)

# Insert text before an element
red_node = parser.css_first('.red')
red_node.insert_before("Hello")

# Insert HTML nodes
subtree = LexborHTMLParser("<div>Hi</div>")
green_node = parser.css_first('.green')
green_node.insert_before(subtree.body.child)

# Insert before, after, or as child
car_div = parser.create_node("div")
car_div.inner_html = "Car"
green_node.insert_before(car_div)
green_node.insert_after(car_div)
green_node.insert_child(car_div)

print(parser.body.html)

Tree Traversal

Walk every node in the DOM tree and extract text content.

# `html` is the document from "Select All Elements with CSS"
parser = LexborHTMLParser(html)

# Traverse the entire tree
for node in parser.root.traverse(include_text=True):
    if node.tag == '-text':
        text = node.text(deep=True).strip()
        if text:
            print(text)
    else:
        print(node.tag)

Output:

html
head
body
div
p
Excepteur
i
sint
occaecat cupidatat non proident
p
Lorem ipsum
div
p
Lorem ipsum dolor sit amet, ea quo modus meliore platonem.

Common Patterns

Extract Text Content

Extract text content from HTML elements with various formatting options.

parser = LexborHTMLParser('<div><p>Hello <b>world</b>!</p></div>')

# Get text content with different options
node = parser.css_first('p')

# Get all text content
print(node.text())  # "Hello world!"

# Get text with custom separator
print(node.text(separator=' | '))  # "Hello  | world | !"

# Get text without stripping whitespace
print(node.text(strip=False))

Output:

Hello world!
Hello  | world | !
Hello world!

Clean HTML

Remove potentially dangerous or unwanted HTML elements.

dirty_html = '''
<div>
    <p>Good content</p>
    <script>alert('xss')</script>
    <style>body { color: red; }</style>
    <p>More content</p>
</div>
'''

parser = LexborHTMLParser(dirty_html)

# Remove unwanted tags
for tag in parser.css('script, style'):
    tag.decompose()

print(parser.body.html)

Output:

<body><div>
    <p>Good content</p>


    <p>More content</p>
</div>
</body>

Advanced selectors

Text Content Filtering

Use advanced selectors to filter elements based on their text content.

html = """
<script>
 var super_variable = 100;
</script>
<script>
 console.log('debug');
</script>
"""

parser = LexborHTMLParser(html)

# Filter script tags containing specific text
scripts_with_super = parser.select('script').text_contains("super").matches
print([node.text() for node in scripts_with_super])

Output:

['\n var super_variable = 100;\n']

CSS Attribute and Pseudo-class Selectors

html = """
<div>
    <article class="post published" data-id="1">
        <h2>First Post</h2>
        <p>Content of first post</p>
        <div class="meta">
            <span class="author">John</span>
            <span class="date">2023-01-01</span>
        </div>
    </article>
    <article class="post draft" data-id="2">
        <h2>Second Post</h2>
        <p>Content of second post</p>
        <div class="meta">
            <span class="author">Jane</span>
            <span class="date">2023-01-02</span>
        </div>
    </article>
    <aside class="sidebar">
        <div class="widget">
            <h3>Popular Posts</h3>
            <ul>
                <li><a href="#1">First Post</a></li>
                <li><a href="#2">Second Post</a></li>
            </ul>
        </div>
    </aside>
</div>
"""

parser = LexborHTMLParser(html)

# Class selectors
published_posts = parser.css('article.post.published')
print(f"Published posts: {len(published_posts)}")

# Descendant selectors
authors = parser.css('article .meta .author')
for author in authors:
    print(f"Author: {author.text()}")

# Pseudo-class selectors
first_article = parser.css('article:first-child')
if first_article:
    print(f"First article title: {first_article[0].css_first('h2').text()}")

# Attribute value selectors
specific_post = parser.css_first('article[data-id="1"]')
if specific_post:
    print(f"Post ID 1 title: {specific_post.css_first('h2').text()}")

Output:

Published posts: 1
Author: John
Author: Jane
First article title: First Post
Post ID 1 title: First Post

Text Content Pseudo-class Selectors

Use lexbor-specific pseudo-classes for case-sensitive and case-insensitive text matching.

html = '<div><p>hello </p><p id="main">lexbor is AwesOme</p></div>'
parser = LexborHTMLParser(html)

# Case-insensitive search
results_ci = parser.css('p:lexbor-contains("awesome" i)')
print(f"Case-insensitive results: {len(results_ci)}")

# Case-sensitive search
results_cs = parser.css('p:lexbor-contains("AwesOme")')
print(f"Case-sensitive results: {len(results_cs)}")
print(f"Matching text: {results_cs[0].text()}")

Output:

Case-insensitive results: 1
Case-sensitive results: 1
Matching text: lexbor is AwesOme

Sibling Navigation

Navigate between sibling elements in the DOM.

html = """
<nav>
    <a href="/">Home</a>
    <a href="/about">About</a>
    <a href="/contact" class="active">Contact</a>
    <a href="/blog">Blog</a>
</nav>
"""

parser = LexborHTMLParser(html)
active_link = parser.css_first("a.active")

if active_link:
    print(f"Active link: {active_link.text()}")
    # Use .prev / .next twice: whitespace text nodes sit between the <a> elements
    if active_link.prev:
        print(f"Previous link: {active_link.prev.prev.text()}")

    if active_link.next:
        print(f"Next link: {active_link.next.next.text()}")

Output:

Active link: Contact
Previous link: About
Next link: Blog

Table Parsing

Parse HTML tables and extract structured data.

table_html = """
<table class="data-table">
    <thead>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>City</th>
            <th>Occupation</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Alice Johnson</td>
            <td>28</td>
            <td>New York</td>
            <td>Software Engineer</td>
        </tr>
        <tr>
            <td>Bob Smith</td>
            <td>35</td>
            <td>Los Angeles</td>
            <td>Designer</td>
        </tr>
        <tr>
            <td>Carol Brown</td>
            <td>42</td>
            <td>Chicago</td>
            <td>Manager</td>
        </tr>
    </tbody>
</table>
"""

parser = LexborHTMLParser(table_html)

# Extract headers
headers = [th.text() for th in parser.css('thead th')]
print("Headers:", headers)

# Extract data rows
rows = []
for tr in parser.css('tbody tr'):
    row_data = [td.text() for td in tr.css('td')]
    rows.append(row_data)

# Display as structured data
for i, row in enumerate(rows):
    print(f"\nRow {i+1}:")
    for header, value in zip(headers, row):
        print(f"  {header}: {value}")

Output:

Headers: ['Name', 'Age', 'City', 'Occupation']

Row 1:
  Name: Alice Johnson
  Age: 28
  City: New York
  Occupation: Software Engineer

Row 2:
  Name: Bob Smith
  Age: 35
  City: Los Angeles
  Occupation: Designer

Row 3:
  Name: Carol Brown
  Age: 42
  City: Chicago
  Occupation: Manager

Form Data Extraction

Parse HTML forms and extract input data.

form_html = """
<form id="contact-form" method="post" action="/submit">
    <div class="form-group">
        <label for="name">Name:</label>
        <input type="text" id="name" name="name" value="John Doe" required>
    </div>
    <div class="form-group">
        <label for="email">Email:</label>
        <input type="email" id="email" name="email" placeholder="john@example.com">
    </div>
    <div class="form-group">
        <label for="country">Country:</label>
        <select id="country" name="country">
            <option value="us">United States</option>
            <option value="ca" selected>Canada</option>
            <option value="uk">United Kingdom</option>
        </select>
    </div>
    <div class="form-group">
        <label>
            <input type="checkbox" name="newsletter" checked> Subscribe to newsletter
        </label>
    </div>
    <div class="form-group">
        <label for="message">Message:</label>
        <textarea id="message" name="message" rows="4">Hello there!</textarea>
    </div>
    <button type="submit">Submit</button>
</form>
"""

parser = LexborHTMLParser(form_html)

# Extract form metadata
form = parser.css_first('form')
print(f"Form ID: {form.attrs.get('id')}")
print(f"Form method: {form.attrs.get('method')}")
print(f"Form action: {form.attrs.get('action')}")

# Extract input fields
print("\nInput fields:")
for input_field in parser.css('input'):
    field_type = input_field.attrs.get('type', 'text')
    name = input_field.attrs.get('name')
    value = input_field.attrs.get('value', '')
    checked = 'checked' in input_field.attrs

    print(f"  {name} ({field_type}): {value} {'[checked]' if checked else ''}")

# Extract select options
print("\nSelect fields:")
for select in parser.css('select'):
    name = select.attrs.get('name')
    print(f"  {name}:")
    for option in select.css('option'):
        value = option.attrs.get('value')
        text = option.text()
        selected = 'selected' in option.attrs
        print(f"    {value}: {text} {'[selected]' if selected else ''}")

# Extract textarea
print("\nTextarea fields:")
for textarea in parser.css('textarea'):
    name = textarea.attrs.get('name')
    content = textarea.text()
    print(f"  {name}: {content}")

Output:

Form ID: contact-form
Form method: post
Form action: /submit

Input fields:
  name (text): John Doe
  email (email):
  newsletter (checkbox):  [checked]

Select fields:
  country:
    us: United States
    ca: Canada [selected]
    uk: United Kingdom

Textarea fields:
  message: Hello there!