Extract HTML or XML Data

Data extraction from HTML or XML files is powered by lxml for XPath support and by cssselect for CSS selector support.

Run the commands below to install the optional dependencies.

pip install "data_extractor[lxml]"  # For using XPath
pip install "data_extractor[cssselect]"  # For using CSS-Selectors

Download a sample RSS file for demonstration.

wget http://www.rssboard.org/files/sample-rss-2.xml

Parse it into a data_extractor.lxml.Element.

from pathlib import Path

from lxml.etree import fromstring

# Pass bytes so lxml also accepts files that carry an XML encoding declaration.
root = fromstring(Path("sample-rss-2.xml").read_bytes())

Use data_extractor.lxml.XPathExtractor to extract the RSS channel title.

from data_extractor import XPathExtractor

assert XPathExtractor("//channel/title/text()").extract_first(root) == "Liftoff News"
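To see what this expression selects before installing the optional dependencies, here is a rough sketch that uses only the standard library's xml.etree.ElementTree instead of data_extractor. The inline RSS fragment is a hypothetical, trimmed-down stand-in for the downloaded sample. ElementTree supports only a subset of XPath, so the text() step is replaced by reading the element's .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0">
  <channel>
    <title>Liftoff News</title>
    <item><title>Star City</title></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# ElementTree has no text() step: select the element, then read .text.
# Only the title directly under channel matches, not the item title.
title_element = root.find(".//channel/title")
channel_title = title_element.text if title_element is not None else None
print(channel_title)
```

This is an approximation for illustration only. The XPathExtractor call above evaluates the full XPath expression through lxml.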

Use data_extractor.lxml.TextCSSExtractor to extract all RSS item links.

from data_extractor import TextCSSExtractor

assert TextCSSExtractor("item>link").extract(root) == [
    "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
    "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
    "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
]

Use data_extractor.lxml.AttrCSSExtractor to extract the RSS version.

from data_extractor import AttrCSSExtractor

assert AttrCSSExtractor("rss", attr="version").extract_first(root) == "2.0"
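The two CSS-based extractions above can be approximated with the standard library alone, which may help clarify what the selectors match. This is a minimal sketch using xml.etree.ElementTree rather than data_extractor or cssselect. The inline fragment is a hypothetical, shortened stand-in for the sample file, and the ElementTree path ".//item/link" is assumed here as a rough counterpart of the child combinator in "item>link".

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0">
  <channel>
    <title>Liftoff News</title>
    <link>http://liftoff.msfc.nasa.gov/</link>
    <item><link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link></item>
    <item><link>http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# "item>link" matches link elements that are direct children of an item,
# so the channel-level link is left out.
item_links = [link.text for link in root.findall(".//item/link")]

# The rss element is the document root, so its attribute is read directly.
rss_version = root.get("version")

print(item_links)
print(rss_version)
```

The extractor classes bundle the selector and the text or attribute lookup into one reusable object, which the manual version above spells out step by step.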