I'm writing code to scrape data from a history website. Below is part of a function that scrapes information about a historical place from a table on a Wiki page. The problem is that when I print each item as I save it, everything looks the way I want, but the combined result does not. Here are the tags I use as headers:
List<String> header = new ArrayList<String>();
header.add("Tag 1");
header.add("Tag 2");
header.add("Tag 3");
header.add("Tag 4");
I tried printing each infoItem like this, and the result was exactly what I wanted:
System.out.println(header.get(index) + ". " + infoItem);
// Below is the output
Tag 1. {"name":"Trần Phán, Đầm Dơi","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Tag 2. {"name":"Lịch sử","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Tag 3. {"name":"2016","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Tag 4. {"name":"","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Here is the part of the code that does the scraping:
    JSONObject info = new JSONObject();
    Elements tr = doc.select(" table.wikitable > tbody > tr");
    if (tr.size() > 0) {
        for (Element element : tr) {
            JSONObject infoItem = new JSONObject();
            Elements dataCells = element.select("td:not(:first-child)");
            int index = 0;
            for (Element dataCell : dataCells) {
                if (index >= header.size()) {
                    break;
                }
                Element urlConnect = dataCell.selectFirst("a");
                infoItem.put("name", dataCell.text());
                if (urlConnect != null) {
                    infoItem.put("url", urlConnect.attr("href"));
                }
                info.put(header.get(index), infoItem);
                index++;
            }
        }
    }
    System.out.println(info);
}
Here is info after I put every infoItem into it:
{
    "Tag 3": {
        "name": "",
        "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
    },
    "Tag 2": {
        "name": "",
        "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
    },
    "Tag 4": {
        "name": "",
        "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
    },
    "Tag 1": {
        "name": "",
        "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
    }
}
Please ignore the url for now, since I know how to handle it. My question is why the name field is empty for every tag in the final info.
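
For reference, here is a minimal sketch of the pattern used in the loop above, written with plain java.util maps instead of JSONObject so that it runs without jsoup or org.json. That substitution, the class name SharedItemDemo, and the method names shared and fresh are assumptions made only for illustration. It may help narrow things down: shared creates one item before the loop over cells, as the code above does with infoItem, while fresh creates a new item for every cell.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; plain maps stand in for JSONObject (assumption).
class SharedItemDemo {

    // Mirrors the loop above: ONE item is created, then reused for every cell.
    static Map<String, Map<String, String>> shared(List<String> header, List<String> cells) {
        Map<String, Map<String, String>> info = new LinkedHashMap<>();
        Map<String, String> item = new LinkedHashMap<>(); // created once, outside the loop
        for (int i = 0; i < cells.size() && i < header.size(); i++) {
            item.put("name", cells.get(i));  // overwrites the name written for the previous cell
            info.put(header.get(i), item);   // stores a reference to the same map, not a copy
        }
        return info;
    }

    // Variant for comparison: a NEW item is created for every cell.
    static Map<String, Map<String, String>> fresh(List<String> header, List<String> cells) {
        Map<String, Map<String, String>> info = new LinkedHashMap<>();
        for (int i = 0; i < cells.size() && i < header.size(); i++) {
            Map<String, String> item = new LinkedHashMap<>(); // created inside the loop
            item.put("name", cells.get(i));
            info.put(header.get(i), item);
        }
        return info;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("Tag 1", "Tag 2", "Tag 3", "Tag 4");
        // Cell texts taken from the printed check earlier in the question.
        List<String> cells = Arrays.asList("Trần Phán, Đầm Dơi", "Lịch sử", "2016", "");
        System.out.println("shared: " + shared(header, cells));
        System.out.println("fresh:  " + fresh(header, cells));
    }
}
```

If the real code behaves like shared, every key ends up pointing at the same object, so all of them show whichever name was written last. That would also be consistent with the per-cell println looking correct, since it prints the item before the next cell overwrites it.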