What is the best way to parse HTML?
Hi
I'm trying to parse HTML.
When I passed the HTML as a string to Titanium.XML.parseString(), it crashed.
I tried something like this:
http.send(); // http is a synchronous http client
var result = http.responseText;
var dom = Titanium.XML.parseString(result);//crash!!
My error is like this:
[ERROR] Error Domain=com.google.GDataXML Code=-1 "The operation couldn’t be completed. (com.google.GDataXML error -1.)". in -[TiDOMDocumentProxy parseString:] (TiDOMDocumentProxy.m:48)
Am I doing something wrong?
Is Titanium.XML.parseString simply unable to parse HTML? If so, is there another way to parse HTML? I need something like getElementById, getElementsByClassName, and so on.
3 Answers
-
Modify htmlparser and soupselect to run under Titanium. Together they let you parse HTML that is not well-formed XML.
What I did was this:
I modified htmlparser to expose its exports on a regular object and used Ti.include to "include the file as if it was written there".
I did the same for soupselect, and they worked well together and passed the unit tests :)
Essentially I added this to the top of the source files:
exports = {};
and this at the bottom:
htmlparser = exports;
With soupselect, I also had to replace the line:
var domUtils = require('htmlparser').DomUtils;
with
var domUtils = htmlparser.DomUtils;
Ti.include('htmlparser.js');
Ti.include('soupselect.js');

var select = soupselect.select;

var body = '<html><head><title>Test</title></head>'
    + '<body>'
    + '<img src="http://cdn.cad-comic.com/comics/2859286598c11964un2ya69354216.jpg" />'
    + '</body></html>';

var handler = new htmlparser.DefaultHandler(function(err, dom) {
    if (err) {
        alert('Error: ' + err);
    } else {
        var img = select(dom, 'img');
        img.forEach(function(img) {
            alert('src: ' + img.attribs.src);
        });
    }
});

var parser = new htmlparser.Parser(handler);
parser.parseComplete(body);
-
YQL is the best way to parse HTML, as long as the web page does not block it.
-
In the end, I implemented a parsing procedure that works directly on the string.
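The post does not show that procedure, so the following is only a minimal sketch of what a string-based approach could look like, assuming all you need is one attribute from every occurrence of a tag. The helper name `extractAttributeValues` is made up for illustration. It will miss unquoted attribute values and will break on tags whose attribute values contain `>`, so it is not a substitute for a real parser.

```javascript
// Hypothetical helper (not from the original post): collect the values of one
// attribute from every occurrence of a tag, using only string/regex operations.
function extractAttributeValues(html, tagName, attrName) {
    // Match each opening tag such as <img ...>, case-insensitively.
    var tagPattern = new RegExp('<' + tagName + '\\b[^>]*>', 'gi');
    // Inside one tag, match attr="value" or attr='value'.
    var attrPattern = new RegExp(
        '\\s' + attrName + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i');

    var values = [];
    var tags = html.match(tagPattern) || [];
    for (var i = 0; i < tags.length; i++) {
        var m = attrPattern.exec(tags[i]);
        if (m) {
            // Group 1 holds a double-quoted value, group 2 a single-quoted one.
            values.push(m[1] !== undefined ? m[1] : m[2]);
        }
    }
    return values;
}
```

In Titanium this could be called as extractAttributeValues(http.responseText, 'img', 'src') and the returned array looped over in the usual way.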