Java HTML normalizer?


Is there a library that can transform any given HTML page, with JS and CSS all over it, into a minimalistic, uniform format?

For instance, if we render the Stack Overflow homepage, I want it to be shown in a minimal format. I want all other sites to be rendered down the same way.

Sort of like the Lynx web browser, but with minimal graphics.

2

There are 2 answers

0
Chris On BEST ANSWER

To answer your first question: no, I don't think there is a library for that purpose (at least, that is what my googling turned up).

I think the reason is that what you want is a very specialized need.

So, as a solution to your problem, you can parse the HTML yourself and display it the way you want in a JEditorPane or whatever component you are using for display.
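As a rough illustration of the "parse it yourself" idea, here is a minimal sketch that uses the HTML parser bundled with Swing (`ParserDelegator` and `HTMLEditorKit.ParserCallback`) to collect the text of a page. The class name `PlainTextExtractor` and the method `toText` are made up for this example, and which parts of a page count as text is decided by the Swing parser, so treat it as a starting point rather than a finished normalizer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text chunks the Swing HTML parser reports.
final class PlainTextExtractor {

    static String toText(String html) {
        StringBuilder out = new StringBuilder();

        // The callback is invoked once per run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append('\n');
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Calling `PlainTextExtractor.toText(html)` returns one line per text chunk; the result could then be placed in a JEditorPane or any other text component. Overriding `handleStartTag` as well would let you keep some structure (headings, links) instead of dropping all markup.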

I can only suggest the way I would do it (because I am familiar with XML and everything around it):

  • Use a library to ensure that your HTML conforms to XHTML: http://htmlcleaner.sourceforge.net/release.php

  • Then either parse the XML with a DOM or SAX parser and display it the way you want,

or

  • use XSLT to transform the document into some other HTML document that results in a view that fits your needs.

or

  • use one of the available HTML parser libraries. Most of the ones I found were rather outdated (2006), but they could be an option for you.
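To make the XSLT option more concrete, here is a minimal sketch using the JDK's built-in `javax.xml.transform` API. It assumes the input is already well-formed XML with no DOCTYPE declaration. The class name `XhtmlSimplifier`, the method `simplify`, and the stylesheet itself are invented for this example: the stylesheet copies everything through unchanged except `script` and `style` elements and `style`/`class` attributes, which it drops.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: strips scripts, styles and presentational attributes via XSLT.
final class XhtmlSimplifier {

    // Illustrative stylesheet: an identity transform plus one empty template
    // that swallows the nodes we do not want in the output.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // local-name() keeps the match working whether or not the XHTML namespace is declared.
      + "<xsl:template match=\"*[local-name()='script' or local-name()='style']|@style|@class\"/>"
      + "</xsl:stylesheet>";

    static String simplify(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Extending the empty template's `match` pattern (for example with further attributes or elements) is the main knob for deciding how "minimal" the resulting document should be.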

This is just one suggestion for how you could do it. I'm sure there are many other ways to achieve the same thing.

2
Joel On

The best tool I have come across for converting HTML to Lynx-style text is Jericho's Renderer.

It's easy to use:

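    import java.net.URL;
    import net.htmlparser.jericho.Source;
    // Parse from a URL; alternatively: new Source("<html>pass in raw html string</html>")
    Source source = new Source(new URL(sourceUrlString));
    // Render the parsed markup as plain text
    String renderedText = source.getRenderer().toString();
    System.out.println("\nSimple rendering of the HTML document:\n");
    System.out.println(renderedText);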

(from here)

It also handles badly formatted HTML found in the wild very well.

Here are the first few lines of this page rendered this way using Jericho:

Stack Exchange log in | careers | chat | meta | about | faq

Stack Overflow * Questions * Tags * Users * Badges * Unanswered * Ask Question

Java HTML normalizer?

**

IS there a library which can transform any given HTML page with JS, CSS all over it, into a minimalistic uniform format?

For instance, if we render stackoverflow homepage, I want it to be shown in a minimal format. I want all other sites to be rendered down.

Sort of like Lynx web browser but with minimal graphics.

java lynx link|edit|flag asked 2 days ago Kim Jong Woo 593112 89% accept rate Do you want to transform your HTML code to simpler HTML code, or do your want to show this "minimalistic uniform format" to your user? Or do you want to create a image? – Paŭlo Ebermann yesterday simpler html code without sacrificing the relative positioning of the elements. – Kim Jong Woo 16 hours ago

2 Answers

To answer your firtst question: No. I don'nt think there is a library for that purpose. (At least this is what my "googeling" resulted in).

And i think the reason for this is, that what you want is a very special need.

So as a solution for your problem you can parse the html and display it the way you want to in a JEditorpane or whatever you are using for display.

I can only suggest a way i would do it (this is because i am familiar with xml and everything around it).

* 

  Use a library to ensure that your html conforms to xhtml:

http://htmlcleaner.sourceforge.net/release.php

* 

  then either parse the xml with DOM or SAX parsers and display it the

way you want.

or

* use xslt to transform the document into some other html document

which results in a view that fits your needs.

or

* use one of the available html parser librarys. (The most of which i

found where kind of outdated (2006)) but they could be an option for you.

This is just one suggestion how you could do it. I'm sure there are thousands of other ways which will do the same thing.