Missing data when putting into JSONObject


I'm writing code to scrape data from a historical website. Below is part of a function that scrapes information about a historical place from a table on the Wiki. The problem is that when I print each item as I save it, everything looks the way I want, but the final object does not. Here are the tags used for the header:

        List<String> header = new ArrayList<String>();
        header.add("Tag 1");
        header.add("Tag 2");
        header.add("Tag 3");
        header.add("Tag 4");

When I print each infoItem like this, the results are exactly what I want:

System.out.println(header.get(index) + ".  " + infoItem);
// Below is the output
Tag 1.  {"name":"Trần Phán, Đầm Dơi","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Tag 2.  {"name":"Lịch sử","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Tag 3.  {"name":"2016","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}
Tag 4.  {"name":"","url":"/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"}

Here is the part of the code that does the scraping:

        JSONObject info = new JSONObject();
        Elements tr = doc.select(" table.wikitable > tbody > tr");
        if (tr.size() > 0) {
            for (Element element : tr) {
                JSONObject infoItem = new JSONObject();
                Elements dataCells = element.select("td:not(:first-child)");
                int index = 0;
                for (Element dataCell : dataCells) {
                    if (index >= header.size()){
                        break;
                    }
                    Element urlConnect = dataCell.selectFirst("a");
                    infoItem.put("name", dataCell.text());
                    if (urlConnect != null) {
                        infoItem.put("url", urlConnect.attr("href"));
                    }
                    info.put(header.get(index), infoItem);
                    index++;
                }
            }
        }
        System.out.println(info);
    }

Here is info after I put every infoItem into it:

{
  "Tag 3": {
    "name": "",
    "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
  },
  "Tag 2": {
    "name": "",
    "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
  },
  "Tag 4": {
    "name": "",
    "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
  },
  "Tag 1": {
    "name": "",
    "url": "/wiki/%C4%90%E1%BA%A7m_D%C6%A1i"
  }
}

Ignore the url for now, because I know how to handle it. I want to ask why the name field is empty in every entry.
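One possible explanation, offered as a guess rather than a confirmed diagnosis: infoItem is created once per row, so every info.put(header.get(index), infoItem) stores a reference to the same object, and each later infoItem.put("name", ...) overwrites the name seen through all of the keys. The sketch below reproduces that effect with plain java.util maps standing in for JSONObject (an assumption made so that it runs without extra libraries). The class name SharedItemDemo and the methods sharedItem and freshItem are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SharedItemDemo {

    // Mirrors the loop in the question: one item object is reused for every cell.
    static Map<String, Map<String, String>> sharedItem(List<String> header, List<String> cells) {
        Map<String, Map<String, String>> info = new LinkedHashMap<>();
        Map<String, String> infoItem = new LinkedHashMap<>(); // created once, outside the cell loop
        for (int i = 0; i < header.size() && i < cells.size(); i++) {
            infoItem.put("name", cells.get(i)); // overwrites the previous name
            info.put(header.get(i), infoItem);  // stores a reference, not a copy
        }
        return info;
    }

    // Variant that creates a new item for each cell, so the entries stay independent.
    static Map<String, Map<String, String>> freshItem(List<String> header, List<String> cells) {
        Map<String, Map<String, String>> info = new LinkedHashMap<>();
        for (int i = 0; i < header.size() && i < cells.size(); i++) {
            Map<String, String> infoItem = new LinkedHashMap<>(); // new object per cell
            infoItem.put("name", cells.get(i));
            info.put(header.get(i), infoItem);
        }
        return info;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("Tag 1", "Tag 2", "Tag 3", "Tag 4");
        List<String> cells = Arrays.asList("Trần Phán, Đầm Dơi", "Lịch sử", "2016", "");
        System.out.println(sharedItem(header, cells)); // every tag shows the last cell's name
        System.out.println(freshItem(header, cells));  // each tag keeps its own name
    }
}
```

If the same thing is happening in the scraping code, moving new JSONObject() inside the inner for loop over dataCells may be enough. Note also that every table row writes to the same four header keys, so later rows would replace earlier ones in info.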
