========================
Extract HTML or XML Data
========================

The function to extract data from the html or xml file
powered by lxml_ to support XPath_, by cssselect_ to support CSS-Selectors_.

Run below command to install optional dependency.

.. code-block:: shell

    pip install "data_extractor[lxml]"  # For using XPath
    pip install "data_extractor[cssselect]"  # For using CSS-Selectors

Download RSS Sample file for demonstrate.

.. code-block:: shell

    wget http://www.rssboard.org/files/sample-rss-2.xml

Parse it into :class:`data_extractor.lxml.Element`.

.. code-block:: python3

    from pathlib import Path

    from lxml.etree import fromstring

    root = fromstring(Path("sample-rss-2.xml").read_text())

Using :class:`data_extractor.lxml.XPathExtractor` to extract rss channel title.

.. code-block:: python3

    from data_extractor import XPathExtractor

    assert XPathExtractor("//channel/title/text()").extract_first(root) == "Liftoff News"

Using :class:`data_extractor.lxml.TextCSSExtractor`
to extract all rss item links.

.. code-block:: python3

    from data_extractor import TextCSSExtractor

    assert TextCSSExtractor("item>link").extract(root) == [
        "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
        "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
        "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
    ]

Using :class:`data_extractor.lxml.AttrCSSExtractor` to extract rss version.

.. code-block:: python3

    from data_extractor import AttrCSSExtractor

    assert AttrCSSExtractor("rss", attr="version").extract_first(root) == "2.0"

.. _lxml: https://lxml.de
.. _XPath: https://www.w3.org/TR/xpath-10/
.. _cssselect: https://cssselect.readthedocs.io/en/latest/
.. _CSS-Selectors: https://www.w3.org/TR/selectors-3/