Titanium Community Questions & Answer Archive

We felt that 6+ years of knowledge should not die, so this is the Titanium Community Questions & Answer Archive.

What is the best way to parse HTML?

Hi,
I'm trying to parse HTML.
When I pass HTML as a string to Titanium.XML.parseString(), it crashes.
I tried something like this:

http.send(); // http is a synchronous http client
var result = http.responseText;
var dom = Titanium.XML.parseString(result);//crash!!

My error is like this:

[ERROR] Error Domain=com.google.GDataXML Code=-1 "The operation couldn’t be completed. (com.google.GDataXML error -1.)". in -[TiDOMDocumentProxy parseString:] (TiDOMDocumentProxy.m:48)

Am I doing something wrong?
Can Titanium.XML.parseString simply not parse HTML? If so, is there any other way to parse HTML? I need something like getElementById, getElementsByClassName….
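
A minimal sketch of guarding the call, assuming the parse failure surfaces as a JavaScript exception (on the build in the question it may instead only log the error shown above). The helper name `tryParseXml` and its return shape are hypothetical. The parser object is passed in as a parameter so the sketch can also be run outside Titanium.

```javascript
// Hypothetical helper: `xml` is expected to be Titanium.XML, or any object
// with a parseString(text) method that throws on malformed input.
function tryParseXml(xml, text) {
  try {
    // Well-formed XML (including strict XHTML) should parse normally.
    return { ok: true, document: xml.parseString(text) };
  } catch (e) {
    // Typical HTML (unclosed tags, bare ampersands) is expected to end up here.
    return { ok: false, error: e };
  }
}
```

In the question's code this would be called as `tryParseXml(Titanium.XML, http.responseText)`, falling back to a different parsing strategy when `ok` is false.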

— asked September 13th 2010 by Hoseong Hwang
  • html
  • parsing
  • xml
0 Comments

3 Answers

  • node-htmlparser

    node-soupselect

    Modify these two libraries to run under Titanium. They let you parse HTML that is not well-formed XML.

    What I did was this:

    I modified htmlparser to expose its exports on a regular object and used Ti.include to "include the file as if it was written there".

    I did the same for soupselect, and they worked well together and passed the unit tests :)

    Essentially I added this to the top of the source files:

    exports = {};
    

    and this at the bottom:

    htmlparser = exports;
    

    With soupselect, I had to replace the line:

    var domUtils = require('htmlparser').DomUtils;
    

    with

    var domUtils = htmlparser.DomUtils;
    

    With both files modified, usage looks like this:

    Ti.include('htmlparser.js');
    Ti.include('soupselect.js');
    
    var select = soupselect.select;
    
    var body = '<html><head><title>Test</title></head>'
    + '<body>'
    + '<img src="http://cdn.cad-comic.com/comics/2859286598c11964un2ya69354216.jpg" />'
    + '</body></html>';
    
    var handler = new htmlparser.DefaultHandler(function(err, dom) {
      if (err) {
        alert('Error: ' + err);
      } else {
        // select() returns an array of matching nodes
        var imgs = select(dom, 'img');
    
        imgs.forEach(function(img) {
          alert('src: ' + img.attribs.src);
        });
      }
    });
    
    var parser = new htmlparser.Parser(handler);
    parser.parseComplete(body);
    
    — answered October 19th 2010 by Robin Duckett
    2 Comments
    • This is the way to go, I can confirm it with Titanium 1.2.2.

      One thing to remember: the soupselect file needs to end with:

      soupselect = exports;
      

      not htmlparser = exports;

      Additionally, soupselect does not currently install correctly for me via npm, so I just downloaded it from GitHub.

      Also beware that Ti.include needs a path; in my case:

      Ti.include('lib/htmlparser/lib/htmlparser.js');
      Ti.include('lib/soupselect/lib/soupselect.js');
      

      Unfortunately, including both libs produces a lot of warnings. It works like a charm, however, and is even fast.

      — commented April 15th 2011 by Cezary Krzyzanowski
    • htmlparser.js will not work in Titanium. It uses 'stream'.

      — commented February 2nd 2015 by Rainer Schleevoigt
  • YQL is the best way to parse HTML, as long as the web page does not block it.

    — answered September 13th 2010 by Dan Tamas
    1 Comment
    • Hmm, actually the website is closed to the public, so I couldn't access it using YQL.

      — commented September 13th 2010 by Hoseong Hwang
  • In the end, I implemented a parsing procedure that works on the string itself.

    — answered September 17th 2010 by Hoseong Hwang
    3 Comments
    • Can you help me?
      I have the same problem…

      — commented September 17th 2010 by matteo annibali
    • Hi, I too have the same problem parsing remote HTML. How did you solve it? Can you share your parsing procedure with us?
      Thank you in advance.

      — commented October 3rd 2010 by Antonio Calanducci
    • I would also love to get more detail on this.

      — commented April 27th 2011 by nick c
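
The string-based procedure mentioned in the last answer was never posted in the thread. The following is only a minimal sketch of what such an approach might look like, not the author's code: the helper name `extractAttribute` is hypothetical, and it assumes quoted attribute values with no `>` inside them. It will also match attribute names that end in the requested name after a hyphen (for example `data-src` when asked for `src`), so it suits simple, known markup rather than arbitrary pages.

```javascript
// Hypothetical helper: collect the values of one attribute from every
// occurrence of a tag, using only string matching (no DOM).
function extractAttribute(html, tagName, attrName) {
  var results = [];
  // Match each opening tag, e.g. <img ...>, case-insensitively.
  var tagPattern = new RegExp('<' + tagName + '\\b[^>]*>', 'gi');
  // Match attr="value" or attr='value' inside a single tag.
  var attrPattern = new RegExp(
    '\\b' + attrName + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i');
  var tags = html.match(tagPattern) || [];
  for (var i = 0; i < tags.length; i++) {
    var m = attrPattern.exec(tags[i]);
    if (m) {
      // Group 1 holds a double-quoted value, group 2 a single-quoted one.
      results.push(m[1] !== undefined ? m[1] : m[2]);
    }
  }
  return results;
}
```

For the question's use case, `extractAttribute(http.responseText, 'img', 'src')` would return an array of image URLs found in the response.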
The ownership of individual contributions to this community generated content is retained by the authors of their contributions.
All trademarks remain the property of the respective owner.