How to auto submit forms using Java headless browser HtmlUnit


I'm quite new to HtmlUnit. Here is what I'm trying to do.

We have a Crystal server that we call to fetch reports.

We use the RESTful APIs that the Crystal server exposes to fetch those reports.

Crystal doesn't have a direct API that returns the report document itself.

Instead, one of the API endpoints gives us a final link. Opening that link in a regular browser loads the PDF document after roughly three redirects.

I'm trying to reproduce this browser behavior in Java using the HtmlUnit library:

try (final WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(true);
    // Don't abort on script errors or on 4xx/5xx responses along the way
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    // Follow HTTP-header redirects automatically
    webClient.getOptions().setRedirectEnabled(true);
    // htmlPage is declared outside this snippet
    htmlPage = webClient.getPage(linkString);
}

With this I get as far as the second redirect, but not to the document itself.

Any suggestions on how to reach the final page, which is the document?

Do I need to capture the intermediate result and make the third call myself with a new WebClient, or is there an easier way to reach the final page?


There are 2 answers

2
RBRi On BEST ANSWER

There might be several reasons for this. One is the way the redirect is done: by HTTP header or by JavaScript. Both are supported, but if the redirect is done by JavaScript, sometimes a bit more code is required.

Second, handling non-HTML responses looks easy when you are a real person in front of a real browser, but for headless browsers it is not that simple (see https://www.htmlunit.org/filedownload-howto.html for details on how HtmlUnit tries to do that).

What you can do:

First, try to understand which page you reach with your current code: check whether the page type is HtmlPage or UnexpectedPage. If you got an HtmlPage, use asXml() to see what you really got, and try to understand how a browser moves on from there.

The next thing to check is the number of windows you have. Maybe the download opens a new window containing the content (again, see https://www.htmlunit.org/filedownload-howto.html). You can ask the webClient for the list of windows and compare it before and after the call.
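
The two checks above (page type, then windows) might look roughly like this. It is only a sketch, assuming HtmlUnit 3.x (package org.htmlunit; 2.x releases use com.gargoylesoftware.htmlunit instead). The class name PageInspector, the 10-second wait, and the report.pdf output file are illustrative choices, not something taken from the question.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.htmlunit.Page;
import org.htmlunit.WebClient;
import org.htmlunit.WebWindow;
import org.htmlunit.html.HtmlPage;

// Hypothetical helper class; the name is an assumption for illustration.
public class PageInspector {

    public static void main(String[] args) throws Exception {
        String linkString = args[0]; // the final link returned by the REST API

        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            int windowsBefore = webClient.getWebWindows().size();
            webClient.getPage(linkString);

            // Give JavaScript-driven redirects time to finish (10 s is an arbitrary value).
            webClient.waitForBackgroundJavaScript(10_000);

            System.out.println("Windows before/after: " + windowsBefore
                    + " / " + webClient.getWebWindows().size());

            // Inspect what every window currently holds.
            for (WebWindow window : webClient.getWebWindows()) {
                Page enclosed = window.getEnclosedPage();
                if (enclosed == null) {
                    continue;
                }
                String contentType = enclosed.getWebResponse().getContentType();
                System.out.println(enclosed.getClass().getSimpleName()
                        + " " + enclosed.getUrl() + " " + contentType);

                if (enclosed instanceof HtmlPage) {
                    // Still HTML: dump it to see which step a browser would take next.
                    System.out.println(((HtmlPage) enclosed).asXml());
                } else if ("application/pdf".equals(contentType)) {
                    // Non-HTML content (typically an UnexpectedPage): save the raw bytes.
                    try (InputStream in = enclosed.getWebResponse().getContentAsStream()) {
                        Files.copy(in, Path.of("report.pdf"), StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
    }
}
```

If the output still shows only an HtmlPage, the asXml() dump is the place to look for the form, script, or meta refresh that triggers the last step.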

And finally, feel free to open an issue on GitHub and I will try to help with more details.

0
happyDayJeffrey On

I don't know exactly what getPage() returns: a DOM that you can query or modify, or the PDF document data? The kind of result determines how to handle it.

If I had a problem like yours, I would do this:

1. Find the final URL by following the redirects one after another.

2. Use an HTTP client to call that URL with the correct request method.

3. Get the data from the HTTP body (maybe a Blob, JSON, etc.).

4. Convert the data to a PDF file with an open-source library such as Apache PDFBox.

Then you have what you want.
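
Steps 1 to 3 could be sketched with the JDK's built-in java.net.http.HttpClient (Java 11+). The class name ReportDownloader and both method names are made up for this example. It only follows redirects sent as HTTP headers, so a redirect done by JavaScript or an auto-submitted form would not be followed, and it assumes the link needs no session cookies or login. If the final body already is a PDF, the bytes can be written to disk as they are and no conversion step is needed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; names are assumptions for illustration.
public class ReportDownloader {

    /** Follows HTTP-header redirects starting at startUrl and writes the final body to target. */
    public static HttpResponse<byte[]> download(String startUrl, Path target) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow 3xx responses
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(startUrl)).GET().build();

        // Read the whole body as raw bytes so binary content is not corrupted.
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        Files.write(target, response.body());
        return response;
    }

    /** Returns true when the bytes start with the PDF signature "%PDF-". */
    public static boolean looksLikePdf(byte[] body) {
        String signature = "%PDF-";
        if (body.length < signature.length()) {
            return false;
        }
        String head = new String(body, 0, signature.length(), StandardCharsets.US_ASCII);
        return signature.equals(head);
    }
}
```

After calling download(link, Path.of("report.pdf")), response.uri() shows the URL where the redirects ended, and looksLikePdf(response.body()) tells you whether you received the document or just another HTML page.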