What is the best way to parse HTML?

Modify htmlparser and soupselect to run under Titanium's system. Together they let you parse HTML that is not well-formed XML.

What I did was this:

I modified htmlparser to expose its exports on a regular object and used Ti.include to "include the file as if it was written there".

I did the same for soupselect, and they worked well together and passed the unit tests :)

Essentially I added this to the top of the source files:

```
exports = {};
```

and this at the bottom:

```
htmlparser = exports;
```

With soupselect, I had to substitute the line:

```
var domUtils = require('htmlparser').DomUtils;
```

with

```
var domUtils = htmlparser.DomUtils;
```

With both files modified, usage looks like this:

```
// Load htmlparser first: soupselect reads htmlparser.DomUtils when it is included.
Ti.include('htmlparser.js');
Ti.include('soupselect.js');

var select = soupselect.select;

var body = '<html><head><title>Test</title></head>'
+ '<body>'
+ '<img src="http://cdn.cad-comic.com/comics/2859286598c11964un2ya69354216.jpg" />'
+ '</body></html>';

// The handler callback receives either an error or the parsed dom.
var handler = new htmlparser.DefaultHandler(function(err, dom) {
  if (err) {
    alert('Error: ' + err);
  } else {
    // Find every img element and show its src attribute.
    var imgs = select(dom, 'img');

    imgs.forEach(function(img) {
      alert('src: ' + img.attribs.src);
    });
  }
});

var parser = new htmlparser.Parser(handler);
parser.parseComplete(body);
```
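
The dom handed to the callback is, as far as I can tell, a plain array of nested node objects, which is why img.attribs.src above is just ordinary property access. Here is a rough sketch of walking such a tree by hand, assuming each node carries type, name, attribs and children. The sampleDom literal and the collectSrc helper are made up for illustration; they are not output captured from the parser or part of either library.

```javascript
// Hand-written stand-in for a parsed dom (assumed node shape, not real parser output).
var sampleDom = [{
  type: 'tag', name: 'html', children: [
    { type: 'tag', name: 'body', children: [
      { type: 'tag', name: 'img', attribs: { src: 'a.jpg' } },
      { type: 'tag', name: 'p', children: [
        { type: 'text', data: 'hello' },
        { type: 'tag', name: 'img', attribs: { src: 'b.jpg' } }
      ] }
    ] }
  ]
}];

// Hypothetical helper: depth-first walk that collects the src of every img tag.
function collectSrc(nodes, out) {
  out = out || [];
  (nodes || []).forEach(function (node) {
    if (node.type === 'tag' && node.name === 'img' && node.attribs && node.attribs.src) {
      out.push(node.attribs.src);
    }
    collectSrc(node.children, out); // nodes without children are skipped by the guard above
  });
  return out;
}

var sources = collectSrc(sampleDom);
```

For anything beyond a single tag name, soupselect's select is the more convenient route; this only shows what it is searching through.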

— answered October 19th 2010 by Robin Duckett

2 Comments

This is the way to go; I can confirm it works with Titanium 1.2.2.

One thing to remember: the soupselect file needs to have this at the bottom:
```
soupselect = exports;
```
rather than htmlparser = exports;

Additionally, soupselect doesn't currently install correctly for me using npm, so I just downloaded it from GitHub.

Also beware that Ti.include needs a path; in my case:
```
Ti.include('lib/htmlparser/lib/htmlparser.js')
Ti.include('lib/soupselect/lib/soupselect.js')
```
Unfortunately, there are a lot of warnings when including both libs. It works like a charm, however, and it's even fast.

— commented April 15th 2011 by Cezary Krzyzanowski

htmlparser.js will not work in Titanium. It uses 'stream'.

— commented February 2nd 2015 by Rainer Schleevoigt

Can you help me?
I have the same problem…

— commented September 17th 2010 by matteo annibali

Ciao, I too have the same problem with parsing remote HTML. How did you solve it? Can you share your parsing procedure with us?
Thank you in advance.

— commented October 3rd 2010 by Antonio Calanducci

I would also love to get more detail on this.

— commented April 27th 2011 by nick c

Titanium Community Questions & Answer Archive


3 Answers