========================
Extract HTML or XML Data
========================

Data extraction from HTML or XML documents is
powered by lxml_ for XPath_ support and by cssselect_ for CSS-Selectors_ support.

Run the commands below to install the optional dependencies.

.. code-block:: shell

    pip install "data_extractor[lxml]"  # For using XPath
    pip install "data_extractor[cssselect]"  # For using CSS-Selectors


Download the RSS sample file for demonstration.

.. code-block:: shell

    wget http://www.rssboard.org/files/sample-rss-2.xml


Parse it into :class:`data_extractor.lxml.Element`.

.. code-block:: python3

    from pathlib import Path

    from lxml.etree import fromstring

    root = fromstring(Path("sample-rss-2.xml").read_text())

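
One parsing detail is worth a short sketch: lxml rejects a ``str`` that
carries an XML encoding declaration, while the same document passed as
``bytes`` parses normally. The inline document below is a hypothetical
stand-in for a downloaded file, not the RSS sample itself.

.. code-block:: python3

    from lxml.etree import fromstring

    # Hypothetical stand-in document with an encoding declaration.
    xml_text = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel><title>Demo</title></channel></rss>'
    )

    try:
        fromstring(xml_text)  # str input plus an encoding declaration
        rejected = False
    except ValueError:
        rejected = True

    # The same document as bytes is accepted.
    root = fromstring(xml_text.encode("utf-8"))

For a file on disk, ``Path.read_bytes()`` in place of ``Path.read_text()``
is one way to hand lxml bytes.
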

Use :class:`data_extractor.lxml.XPathExtractor` to extract the RSS channel title.

.. code-block:: python3

    from data_extractor import XPathExtractor

    assert XPathExtractor("//channel/title/text()").extract_first(root) == "Liftoff News"

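
To see what the XPath expression itself selects, the following sketch
evaluates it with lxml directly, without data_extractor. The inline RSS
snippet is a trimmed, hypothetical stand-in for the sample file, so the
block runs without the download.

.. code-block:: python3

    from lxml.etree import fromstring

    # Trimmed stand-in for the sample file (hypothetical content).
    rss = b"""<rss version="2.0">
      <channel>
        <title>Liftoff News</title>
        <item><title>Star City</title></item>
      </channel>
    </rss>"""
    root = fromstring(rss)

    # text() nodes of <title> elements that are direct children of <channel>;
    # the <title> inside <item> is not matched.
    titles = root.xpath("//channel/title/text()")
    first_title = titles[0] if titles else None
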

Use :class:`data_extractor.lxml.TextCSSExtractor`
to extract all RSS item links.

.. code-block:: python3

    from data_extractor import TextCSSExtractor

    assert TextCSSExtractor("item>link").extract(root) == [
        "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
        "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
        "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
    ]

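
As a point of comparison, the same CSS selector can be applied with
``lxml.cssselect.CSSSelector`` directly (this requires the cssselect
package). This is only a sketch of what the selector matches, not a
description of how data_extractor is implemented; the inline snippet and
its ``example.com`` links are hypothetical.

.. code-block:: python3

    from lxml.cssselect import CSSSelector
    from lxml.etree import fromstring

    # Hypothetical stand-in for the sample file.
    rss = b"""<rss version="2.0"><channel>
      <item><link>http://example.com/a</link></item>
      <item><link>http://example.com/b</link></item>
    </channel></rss>"""
    root = fromstring(rss)

    # "item>link" matches <link> elements that are direct children of <item>.
    select_links = CSSSelector("item>link")
    links = [element.text for element in select_links(root)]
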

Use :class:`data_extractor.lxml.AttrCSSExtractor` to extract the RSS version.

.. code-block:: python3

    from data_extractor import AttrCSSExtractor

    assert AttrCSSExtractor("rss", attr="version").extract_first(root) == "2.0"

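
The same extractors should also work on HTML. The sketch below assumes
that an element parsed with ``lxml.html.fromstring`` is accepted in the
same way as the XML root above; the page content is hypothetical.

.. code-block:: python3

    from lxml.html import fromstring

    from data_extractor import AttrCSSExtractor, TextCSSExtractor, XPathExtractor

    # Hypothetical HTML page used only for illustration.
    html = """<html><head><title>Demo page</title></head><body><ul>
    <li><a href="/a">A</a></li><li><a href="/b">B</a></li>
    </ul></body></html>"""
    root = fromstring(html)

    page_title = XPathExtractor("//title/text()").extract_first(root)
    link_texts = TextCSSExtractor("li > a").extract(root)
    link_hrefs = AttrCSSExtractor("li > a", attr="href").extract(root)

Only the calls already used on this page (``extract`` and
``extract_first``) appear here.
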

.. _lxml: https://lxml.de
.. _XPath: https://www.w3.org/TR/xpath-10/
.. _cssselect: https://cssselect.readthedocs.io/en/latest/
.. _CSS-Selectors: https://www.w3.org/TR/selectors-3/