screen scraping - Is there a library similar to lxml or nokogiri for Java?


I want to scrape some screens, ideally using CSS selectors rather than XPath. Is there a Java library like the ones available in Ruby or Python?

There are dozens of screen scraping libraries in Java. To name just a few:

  • TagSoup - A SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
  • Jericho HTML Parser - A simple but powerful Java library for analyzing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions. It is neither an event-based nor a tree-based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the entire source document is first loaded into memory, and then only the relevant segments are searched for each search operation.
  • - Reorders individual elements and produces well-formed XML from dirty HTML. It follows rules similar to those most web browsers use to create the document object model. A user can provide custom tag and rule sets for tag filtering and balancing.
  • NekoHTML - A simple HTML scanner and tag balancer that lets application programmers parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make when writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.
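
To illustrate what "standard XML interfaces" buys you, here is a minimal sketch of a SAX handler that collects link targets. It is only an illustration under some assumptions: `LinkCollector` and `extractLinks` are hypothetical names made up for this example, and the `XMLReader` comes from the JDK's built-in XML parser, so the sample input has to be well-formed. The handler itself only depends on the SAX API, so in principle the same code should work with an `XMLReader` supplied by one of the SAX-compliant HTML parsers listed above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of any library named above.
final class LinkCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Record the href of every <a> element as the parser reports it.
        if ("a".equalsIgnoreCase(qName)) {
            String href = attributes.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    // Accepts any SAX XMLReader, so the source of the reader can be swapped freely.
    static List<String> extractLinks(XMLReader reader, String markup)
            throws IOException, SAXException {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK's own XML parser, which requires well-formed input.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String page = "<html><body><a href=\"/first\">one</a>"
                + "<p><a href=\"/second\">two</a></p></body></html>";
        System.out.println(extractLinks(reader, page));
    }
}
```

Note that this is event-based element matching, not CSS selection; it only shows how ordinary SAX code can sit on top of whichever parser produces the events.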

and many more. But these are the best IMO, since they can deal with any type of content (they understand all kinds of garbage), as I mentioned, although this may not be an issue for you.

Just in case, maybe check the thread as well.

Update: A new project has been released (2010-01-31) that provides CSS selector support. See its website for more details and/or ask its author.

