``.

``fragments_fromstring(string)``:
    Returns a list of the elements found in the fragment.

``fromstring(string)``:
    Returns ``document_fromstring`` or ``fragment_fromstring``, based
    on whether the string looks like a full document, or just a
    fragment.

Really broken pages
-------------------

The normal HTML parser is capable of handling broken HTML, but for
pages that are far enough from HTML to call them 'tag soup', it may
still fail to parse the page in a useful way.  A way to deal with this
is ElementSoup_, which deploys the well-known BeautifulSoup_ parser to
build an lxml HTML tree.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _ElementSoup: elementsoup.html

However, note that the most common problem with web pages is the lack
of (or the existence of incorrect) encoding declarations.  It is
therefore often sufficient to only use the encoding detection of
BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's
own HTML parser, which is several times faster.

HTML Element Methods
====================

HTML elements have all the methods that come with ElementTree, but
also include some extra methods:

``.drop_tree()``:
    Drops the element and all its children.  Unlike
    ``el.getparent().remove(el)`` this does *not* remove the tail
    text; with ``drop_tree`` the tail text is merged with the previous
    element.

``.drop_tag()``:
    Drops the tag, but keeps its children and text.

``.find_class(class_name)``:
    Returns a list of all the elements with the given CSS class name.
    Note that class names are space separated in HTML, so
    ``doc.find_class('highlight')`` will find an element like
    ``<div class="sidebar highlight">``.  Class names *are* case
    sensitive.

``.find_rel_links(rel)``:
    Returns a list of all the ``<a rel="{rel}">`` elements.  E.g.,
    ``doc.find_rel_links('tag')`` returns all the links `marked as
    tags <http://microformats.org/wiki/rel-tag>`_.

``.get_element_by_id(id, default=None)``:
    Returns the element with the given ``id``, or the ``default`` if
    none is found.  If there are multiple elements with the same id
    (which there shouldn't be, but there often are), this returns only
    the first.

``.text_content()``:
    Returns the text content of the element, including the text
    content of its children, with no markup.

``.cssselect(expr)``:
    Selects elements from this element and its children, using a CSS
    selector expression.  (Note that ``.xpath(expr)`` is also
    available, as on all lxml elements.)

``.label``:
    Returns the corresponding ``<label>`` element for this element, if
    any exists (None if there is none).  Label elements have a
    ``label.for_element`` attribute that points back to the element.

``.base_url``:
    The base URL for this element, if one was saved from the parsing.
    This attribute is not settable.  It is None when no base URL was
    saved.

``.classes``:
    Returns a set-like object that allows accessing and modifying the
    names in the 'class' attribute of the element.  (New in lxml 3.5).

``.set(key, value=None)``:
    Sets an HTML attribute.  If no value is given, or if the value is
    ``None``, it creates a boolean attribute like
    ``<form novalidate></form>`` or ``<div custom-attribute></div>``.
    In XML, attributes must have at least the empty string as their
    value like ``<form novalidate=""></form>``, but HTML boolean
    attributes can also be just present or absent from an element
    without having a value.

Running HTML doctests
=====================

One of the interesting modules in the ``lxml.html`` package deals with
doctests.  It can be hard to compare two HTML pages for equality, as
whitespace differences aren't meaningful and the structural formatting
can differ.  This is even more of a problem in doctests, where output
is tested for equality and small differences in whitespace or the
order of attributes can make a test fail.  And given the verbosity of
tag-based languages, it may take more than a quick look to find the
actual differences in the doctest output.

Luckily, lxml provides the ``lxml.doctestcompare`` module that
supports relaxed comparison of XML and HTML pages and provides a
readable diff in the output when a test fails.  The HTML comparison is
most easily used by importing the ``usedoctest`` module in a doctest:

.. sourcecode:: pycon

    >>> import lxml.html.usedoctest

Now, if you have an HTML document and want to compare it to an
expected result document in a doctest, you can do the following:

.. sourcecode:: pycon

    >>> import lxml.html
    >>> html = lxml.html.fromstring('''\
    ...    <html><body onload="" color="white">
    ...      <p>Hi !</p>
    ...    </body></html>
    ... ''')

    >>> print lxml.html.tostring(html)
    <html><body onload="" color="white"><p>Hi !</p></body></html>

    >>> print lxml.html.tostring(html)
    <html> <body color="white" onload=""> <p>Hi !</p> </body> </html>

    >>> print lxml.html.tostring(html)
    <html>
      <body color="white" onload="">
        <p>Hi !</p>
      </body>
    </html>

In documentation, you would likely prefer the pretty printed HTML
output, as it is the most readable.  However, the three documents are
equivalent from the point of view of an HTML tool, so the doctest will
silently accept any of the above.  This allows you to concentrate on
readability in your doctests, even if the real output is a straight
ugly HTML one-liner.

Note that there is also an ``lxml.usedoctest`` module which you can
import for XML comparisons.  The HTML parser notably ignores
namespaces and some other XMLisms.

Creating HTML with the E-factory
================================

.. _`E-factory`: http://online.effbot.org/2006_11_01_archive.htm#et-builder

lxml.html comes with a predefined HTML vocabulary for the `E-factory`_,
originally written by Fredrik Lundh.  This allows you to quickly generate
HTML pages and fragments:

.. sourcecode:: pycon

    >>> from lxml.html import builder as E
    >>> from lxml.html import usedoctest
    >>> html = E.HTML(
    ...   E.HEAD(
    ...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
    ...     E.TITLE("Best Page Ever")
    ...   ),
    ...   E.BODY(
    ...     E.H1(E.CLASS("heading"), "Top News"),
    ...     E.P("World News only on this page", style="font-size: 200%"),
    ...     "Ah, and here's some more text, by the way.",
    ...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
    ...   )
    ... )

    >>> print lxml.html.tostring(html)
    <html>
      <head>
        <link href="great.css" rel="stylesheet" type="text/css">
        <title>Best Page Ever</title>
      </head>
      <body>
        <h1 class="heading">Top News</h1>
        <p style="font-size: 200%">World News only on this page</p>
        Ah, and here's some more text, by the way.
        <p>... and this is a parsed fragment ...</p>
      </body>
    </html>

Note that you should use ``lxml.html.tostring`` and **not**
``lxml.etree.tostring``.  ``lxml.etree.tostring(doc)`` will return the
XML representation of the document, which is not valid HTML.  In
particular, things like ``<script src="..."></script>`` will be
serialized as ``<script src="..." />``.

... ... ... ... ... ... a link ... another link ...

a paragraph

... ...

... ... ... annoying EVIL! ... spam spam SPAM! ...

... ... '''

To remove all the suspicious content from this unparsed document, use
the ``clean_html`` function:

.. sourcecode:: pycon

    >>> from lxml.html.clean import clean_html
    >>> print clean_html(html)

a link another link

a paragraph

secret EVIL!

of EVIL! Password: annoying EVIL!spam spam SPAM!

The ``Cleaner`` class supports several keyword arguments to control
exactly which content is removed:

.. sourcecode:: pycon

    >>> from lxml.html.clean import Cleaner

    >>> cleaner = Cleaner(page_structure=False, links=False)
    >>> print cleaner.clean_html(html)
    a link another link

a paragraph

secret EVIL!

of EVIL! Password: annoying EVIL! spam spam SPAM!

    >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
    ...                   page_structure=False, safe_attrs_only=False)

    >>> print cleaner.clean_html(html)
    a link another link

a paragraph

secret EVIL!

of EVIL! Password: annoying EVIL! spam spam SPAM!

You can also whitelist some otherwise dangerous content with
``Cleaner(host_whitelist=['www.youtube.com'])``, which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.

See the docstring of ``Cleaner`` for the details of what can be
cleaned.

autolink
--------

In addition to cleaning up malicious HTML, ``lxml.html.clean``
contains functions to do other things to your HTML.  This includes
autolinking::

    autolink(doc, ...)

    autolink_html(html, ...)

This finds anything that looks like a link (e.g.,
``http://example.com``) in the *text* of an HTML document, and
turns it into an anchor.  It avoids making bad links.

Links in the elements ``<textarea>``, ``<pre>``, ``<code>``, and
anything in the head of the document are not autolinked.  You can
pass in a list of elements to avoid in
``avoid_elements=['textarea', ...]``.

Links to some hosts can be avoided.  By default links to
``localhost*``, ``example.*`` and ``127.0.0.1`` are not
autolinked.  Pass in ``avoid_hosts=[list_of_regexes]`` to control
this.

Elements with the ``nolink`` CSS class are not autolinked.  Pass
in ``avoid_classes=['code', ...]`` to control this.

The ``autolink_html()`` version of the function parses the HTML
string first, and returns a string.
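
How a list of host patterns of this kind might be applied can be
sketched in plain Python 3.  The patterns and the ``is_avoided_host``
helper below are illustrative assumptions; they are not lxml's actual
defaults or internals:

.. sourcecode:: python

    import re
    from urllib.parse import urlsplit

    # Hypothetical patterns, loosely modelled on the defaults described above.
    AVOID_HOSTS = [
        re.compile(r'^localhost', re.I),
        re.compile(r'^example\.', re.I),
        re.compile(r'^127\.0\.0\.1$'),
    ]

    def is_avoided_host(url, patterns=AVOID_HOSTS):
        """Return True if the host part of ``url`` matches any pattern."""
        host = urlsplit(url).hostname or ''
        return any(pattern.search(host) for pattern in patterns)

    print(is_avoided_host('http://example.com/page'))
    print(is_avoided_host('http://lxml.de/'))

A list built in the same way could then be handed to ``autolink()``
through the ``avoid_hosts`` argument.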

wordwrap
--------

You can also wrap long words in your HTML::

    word_break(doc, max_width=40, ...)

    word_break_html(html, ...)

This finds any long words in the text of the document and inserts
``&#8203;`` in the document (which is the Unicode zero-width space).

This avoids the elements ``<pre>``, ``<textarea>``, and ``<code>``.
You can control this with ``avoid_elements=['textarea', ...]``.

It also avoids elements with the CSS class ``nobreak``.  You can
control this with ``avoid_classes=['code', ...]``.

Lastly you can control the character that is inserted with
``break_character=u'\u200b'``.  However, you cannot insert markup,
only text.

``word_break_html(html)`` parses the HTML document and returns a
string.
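
The basic idea can be sketched on a plain string in Python 3.  This is
a simplified illustration, not lxml's implementation: the
``break_long_words`` helper is hypothetical, it ignores markup
entirely, and it cuts at fixed offsets, whereas lxml may choose break
points differently:

.. sourcecode:: python

    import re

    ZERO_WIDTH_SPACE = '\u200b'

    def break_long_words(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
        """Insert break_character into words longer than max_width."""
        def split_word(match):
            word = match.group(0)
            # Cut the word into chunks of at most max_width characters.
            pieces = [word[i:i + max_width]
                      for i in range(0, len(word), max_width)]
            return break_character.join(pieces)

        # Only runs of non-whitespace longer than max_width are touched.
        return re.sub(r'\S{%d,}' % (max_width + 1), split_word, text)

    wrapped = break_long_words('see ' + 'x' * 10, max_width=4)
    print(repr(wrapped))

Because only text is inserted, the result stays valid wherever the
original text was valid, which matches the restriction that
``break_character`` cannot contain markup.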

HTML Diff
=========

The module ``lxml.html.diff`` offers some ways to visualize
differences in HTML documents.  These differences are *content*
oriented.  That is, changes in markup are largely ignored; only
changes in the content itself are highlighted.

There are two ways to view differences: ``htmldiff`` and
``html_annotate``. One shows differences with ``<ins>`` and
``<del>``, while the other annotates a set of changes similar to ``svn
blame``. Both these functions operate on text, and work best with
content fragments (only what goes in ``<body>``), not complete
documents.

Example of ``htmldiff``:

.. sourcecode:: pycon

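    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> doc1 = '''<p>Here is some text.</p>'''
    >>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
    >>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>'''
    >>> print htmldiff(doc1, doc2)
    <p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>
    >>> print html_annotate([(doc1, 'author1'), (doc2, 'author2'),
    ...                      (doc3, 'author3')])
    <p><span title="author1">Here is</span>
       <b><span title="author2">a</span>
       <span title="author3">little</span></b>
       <i><span title="author2">text</span></i>
       <span title="author2">.</span></p>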

As you can see, it is imperfect as such things tend to be.  On larger
tracts of text with larger edits it will generally do better.
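
A Python 3 rendering of the ``htmldiff`` call, as a minimal sketch
with made-up input fragments (the exact placement of the markers in
the output depends on the lxml version, so none is shown here):

.. sourcecode:: python

    from lxml.html.diff import htmldiff

    # Two body fragments that differ only in their text content.
    old = '<p>Here is some text.</p>'
    new = '<p>Here is a lot of text.</p>'

    # htmldiff() takes and returns strings; insertions are wrapped in
    # <ins> and deletions in <del>.
    diff = htmldiff(old, new)
    print(diff)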

The ``html_annotate`` function can also take an optional second
argument, ``markup``.  This is a function like ``markup(text,
version)`` that returns the given text marked up with the given
version.  The default version, the output of which you see in the
example, looks like:

.. sourcecode:: python

    import cgi

    def default_markup(text, version):
        # Wrap the text in a span whose title names the version.
        return '<span title="%s">%s</span>' % (
            cgi.escape(unicode(version), 1), text)
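
For illustration, a custom ``markup`` callback might look like the
following Python 3 sketch.  ``title_markup`` and the ``annotated``
CSS class are hypothetical names chosen for this example; only the
``markup(text, version)`` calling convention comes from the
description above:

.. sourcecode:: python

    from html import escape

    def title_markup(text, version):
        """Wrap already-serialized HTML text in a span naming its version."""
        # The version is escaped because it ends up inside an attribute;
        # the text is passed through unchanged because it is already HTML.
        return '<span class="annotated" title="%s">%s</span>' % (
            escape(str(version), quote=True), text)

    print(title_markup('Here is', 'author1'))

It would then be passed as the second argument, e.g.
``html_annotate([(doc1, 'author1'), (doc2, 'author2')], markup=title_markup)``.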

Examples
========

Microformat Example
-------------------

This example parses the `hCard <http://microformats.org/wiki/hcard>`_
microformat.

First we get the page:

.. sourcecode:: pycon

    >>> import urllib
    >>> from lxml.html import fromstring
    >>> url = 'http://microformats.org/'
    >>> content = urllib.urlopen(url).read()
    >>> doc = fromstring(content)
    >>> doc.make_links_absolute(url)

Then we create some objects to put the information in:

.. sourcecode:: pycon

    >>> class Card(object):
    ...     def __init__(self, **kw):
    ...         for name, value in kw.items():
    ...             setattr(self, name, value)
    >>> class Phone(object):
    ...     def __init__(self, phone, types=()):
    ...         self.phone, self.types = phone, types

And some generally handy functions for microformats:

.. sourcecode:: pycon

    >>> def get_text(el, class_name):
    ...     els = el.find_class(class_name)
    ...     if els:
    ...         return els[0].text_content()
    ...     else:
    ...         return ''
    >>> def get_value(el):
    ...     return get_text(el, 'value') or el.text_content()
    >>> def get_all_texts(el, class_name):
    ...     return [e.text_content() for e in el.find_class(class_name)]
    >>> def parse_addresses(el):
    ...     # Ideally this would parse street, etc.
    ...     return el.find_class('adr')

Then the parsing:

.. sourcecode:: pycon

    >>> for el in doc.find_class('vcard'):
    ...     card = Card()
    ...     card.el = el
    ...     card.fn = get_text(el, 'fn')
    ...     card.tels = []
    ...     for tel_el in el.find_class('tel'):
    ...         card.tels.append(Phone(get_value(tel_el),
    ...                                get_all_texts(tel_el, 'type')))
    ...     card.addresses = parse_addresses(el)
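
The same steps can be tried without network access.  The following
Python 3 sketch runs them against an inline snippet; the ``SAMPLE``
markup and the ``parse_cards`` helper are made up for this
illustration, and it assumes that hCards are marked with the
``vcard`` root class:

.. sourcecode:: python

    from lxml.html import fromstring

    # Hypothetical hCard markup, used instead of fetching a live page.
    SAMPLE = '''
    <div>
      <div class="vcard">
        <span class="fn">Jane Doe</span>
        <span class="tel"><span class="type">work</span>
          <span class="value">+1-555-0100</span></span>
        <div class="adr">1 Example Street</div>
      </div>
    </div>
    '''

    def get_text(el, class_name):
        """Text of the first descendant with the given class, or ''."""
        els = el.find_class(class_name)
        return els[0].text_content().strip() if els else ''

    def get_value(el):
        """Prefer an explicit 'value' child, else the element's own text."""
        return get_text(el, 'value') or el.text_content().strip()

    def parse_cards(doc):
        """Collect the name and phone numbers of every hCard in the tree."""
        cards = []
        for el in doc.find_class('vcard'):
            tels = [
                (get_value(tel_el),
                 [t.text_content().strip()
                  for t in tel_el.find_class('type')])
                for tel_el in el.find_class('tel')
            ]
            cards.append({'fn': get_text(el, 'fn'), 'tels': tels})
        return cards

    cards = parse_cards(fromstring(SAMPLE))
    print(cards)

Plain dictionaries are used here in place of the ``Card`` and
``Phone`` classes only to keep the sketch short.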