Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML.
- The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
- ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML is still parsed properly even if there are server tags inside it, which is common, for example, when dynamically setting element attributes.
- A stream based parsing option is available through the StreamedSource class, which allows memory efficient processing of large files using an event iterator. This is essentially a StAX alternative that can process HTML and non-validating XML, and it offers several other features not available in other streaming parsers.
- In its standard form it is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the characters relevant to each search operation.
- Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.
- Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
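
The search-plus-cache approach described above can be illustrated with a small, self-contained sketch. This is not Jericho's code and does not use its API: the class LazyTagFinder and its methods are hypothetical names invented for this example, and the tag recognition is deliberately naive. It only shows the general idea of holding the text in memory, scanning for tags on demand, and remembering tag positions so that repeated searches do not rescan the same characters.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a hypothetical class, not part of Jericho. */
final class LazyTagFinder {
    private final String text;                                   // whole document held in memory
    private final Map<Integer, Integer> tagEndCache = new HashMap<>(); // '<' position -> end (exclusive)
    private int scans = 0;                                       // number of uncached tag scans

    LazyTagFinder(String text) { this.text = text; }

    int scanCount() { return scans; }

    /** Returns {start, end} of the first start tag called name at or after from, or null. */
    int[] nextStartTag(String name, int from) {
        for (int pos = text.indexOf('<', from); pos >= 0; pos = text.indexOf('<', pos + 1)) {
            Integer cachedEnd = tagEndCache.get(pos);
            int end;
            if (cachedEnd != null) {
                end = cachedEnd;                                 // position already known
            } else {
                int close = text.indexOf('>', pos);
                if (close < 0) return null;                      // nothing left that could be a tag
                end = close + 1;
                tagEndCache.put(pos, end);
                scans++;
            }
            if (isNamed(pos, name)) return new int[] {pos, end};
            // Otherwise resume just after this '<', so a stray '<' cannot hide a later tag.
        }
        return null;
    }

    /** True if the candidate at pos starts with the given name followed by a delimiter. */
    private boolean isNamed(int pos, String name) {
        if (!text.regionMatches(true, pos + 1, name, 0, name.length())) return false;
        char next = text.charAt(pos + 1 + name.length());
        return next == '>' || next == '/' || Character.isWhitespace(next);
    }

    public static void main(String[] args) {
        // The stray '<' in "1 < 2" is invalid HTML but does not stop the search.
        LazyTagFinder finder = new LazyTagFinder("<p>1 < 2</p><a href='x'>link</a><p>end</p>");
        int[] span = finder.nextStartTag("a", 0);
        System.out.println(finder.text.substring(span[0], span[1]));
        finder.nextStartTag("a", 0);                             // answered from the cache
        System.out.println("uncached scans: " + finder.scanCount());
    }
}
```

In this sketch the second search for the same tag performs no new scans, and the malformed fragment before the link is simply skipped over rather than treated as an error.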
SDK for Android
Fixed GC performance problem in StreamedSource.
Renderer.setHRLineLength(0) doesn't completely disable rendering of HR element.
Fixed performance problem in Attributes.
Fixed position discarded exception in StreamedSource.
Limited left margin in Renderer based on MaxLineLength.
Little-endian BOM encoding detection broken.
HTML5 elements with forbidden end tags weren't present in