Extract HTML or XML Data
The functions for extracting data from HTML or XML documents are powered by lxml for XPath support and by cssselect for CSS selector support.
Run the commands below to install the optional dependencies.
pip install "data_extractor[lxml]" # For using XPath
pip install "data_extractor[cssselect]" # For using CSS-Selectors
Download the RSS sample file for demonstration.
wget http://www.rssboard.org/files/sample-rss-2.xml
Parse it into a data_extractor.lxml.Element.
from pathlib import Path
from lxml.etree import fromstring
root = fromstring(Path("sample-rss-2.xml").read_text())
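One caveat when parsing other feeds this way: lxml rejects a `str` that carries an XML encoding declaration, so passing bytes (for a file, `Path(...).read_bytes()`) is the more robust choice. A minimal sketch, using a made-up inline document rather than the downloaded sample:

```python
from lxml.etree import fromstring

# Hypothetical inline feed standing in for a file that declares its encoding.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<rss version="2.0"><channel><title>Example Feed</title></channel></rss>'
)

# Bytes input lets lxml apply the declared encoding itself.
root = fromstring(xml_bytes)
print(root.tag)

# The same document as a decoded str is rejected with ValueError.
try:
    fromstring(xml_bytes.decode("utf-8"))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

A document without an encoding declaration parses from either `str` or bytes.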
Use data_extractor.lxml.XPathExtractor to extract the RSS channel title.
from data_extractor import XPathExtractor
assert XPathExtractor("//channel/title/text()").extract_first(root) == "Liftoff News"
Use data_extractor.lxml.TextCSSExtractor to extract all RSS item links.
from data_extractor import TextCSSExtractor
assert TextCSSExtractor("item>link").extract(root) == [
    "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
    "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
    "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
]
Use data_extractor.lxml.AttrCSSExtractor to extract the RSS version.
from data_extractor import AttrCSSExtractor
assert AttrCSSExtractor("rss", attr="version").extract_first(root) == "2.0"