How to get MathJax elements when crawling data?

I'm crawling a website that contains the following tags, using Cheerio. How can I get the entire text of the p tag, including the span with the attribute "data-mathml"?

<p><strong class="content_question">Đề bài</strong></p>
<p style="text-align: justify;">"a. "
    <span class="MathJax_Preview" style="color: inherit; display: none;"></span>
    <span id="MathJax-Element-1-Frame" 
        class="mjx-chtml MathJax_CHTML" 
        tabindex="0" 
        style="font-size: 121%; position: relative;" 
        data-mathml="<math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;><mn>5</mn></math>" role="presentation"><span id="MJXc-Node-1" class="mjx-math" aria-hidden="true"><span id="MJXc-Node-2" class="mjx-mrow"><span id="MJXc-Node-3" class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.37em; padding-bottom: 0.37em;">5</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mn>5</mn></math></span></span><script type="math/tex" id="MathJax-Element-1">5</script> và <span class="MathJax_Preview" style="color: inherit; display: none;"></span><span id="MathJax-Element-2-Frame" class="mjx-chtml MathJax_CHTML" tabindex="0" style="font-size: 121%; position: relative;" data-mathml="<math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;><mroot><mn>123</mn><mn>3</mn></mroot></math>" role="presentation"><span id="MJXc-Node-4" class="mjx-math" aria-hidden="true"><span id="MJXc-Node-5" class="mjx-mrow"><span id="MJXc-Node-6" class="mjx-mroot"><span class="mjx-root" style="font-size: 50%; vertical-align: 0.774em; width: 0px;"><span id="MJXc-Node-8" class="mjx-mn" style="padding-left: 0.543em;"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.37em; padding-bottom: 0.37em;">3</span></span></span><span class="mjx-box" style="padding-top: 0.045em;"><span class="mjx-surd"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.507em; padding-bottom: 0.553em;">√</span></span><span class="mjx-box" style="padding-top: 0.119em; border-top: 1.6px solid;"><span id="MJXc-Node-7" class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.37em; padding-bottom: 0.37em;">123</span></span></span></span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mroot><mn>123</mn><mn>3</mn></mroot></math></span></span>
    <script type="math/tex" id="MathJax-Element-2">\root 3 \of {123} </script>
" ;"</p>

For the span tag with the "data-mathml" attribute, should I extract the text or the element stored in this attribute when returning data to the client?

        const html = response.data;
        const $ = cheerio.load(html);
        const mathjaxEquations = $("span[data-mathml]");
        console.log({ mathjaxEquations });

Please help me, many thanks!

1 answer

Answer by ggorlen

Based on your comment, you can extract this text with something like Puppeteer. Cheerio doesn't evaluate JS, including MathJax, but browser automation lets the live page run and gives you the opportunity to extract data injected by JS.

const puppeteer = require("puppeteer"); // ^21.0.2

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
  await page.goto(url);
  await page.waitForSelector(".mjx-char");
  await page.$$eval('[data-id="sp-target-div-outstream"]', els =>
    els.forEach(el => el.remove())
  );
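  // This callback runs in the browser, so `$` is not Cheerio. It assumes
  // the target page exposes jQuery as `$`.
  const result = await page.evaluate(() =>
    $("#box-content > p")
      .first()
      .nextUntil(":not(p)") // consecutive <p> siblings after the first one
      .get()
      .map(p =>
        [...p.childNodes]
          .flatMap(node =>
            node.nodeType === Node.TEXT_NODE
              ? node.textContent
              : node.classList?.contains("mjx-chtml")
              ? [...node.querySelectorAll(".mjx-char")].map(
                  char => char.textContent
                )
              : "" // previews, scripts and other nodes contribute nothing
          )
          .join("")
      )
  );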
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

[ 'So sánh', 'a) 5 và 3√123 ;', 'b) 53√6 và 63√5.' ]
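
As a separate aside, if the MathML source is more useful to your client than the rendered characters, the "data-mathml" attribute from the question can be read in the same browser context. This is only a minimal sketch: `readMathml` is a hypothetical helper name, and it assumes the rendered page carries "data-mathml" attributes like the ones in your snippet.

```javascript
// Hypothetical helper (not part of the script above): returns the MathML
// source string stored on each element's data-mathml attribute.
// It relies only on getAttribute, so Puppeteer can serialize it and run
// it inside the page.
const readMathml = els => els.map(el => el.getAttribute("data-mathml"));

// Assumed usage with the `page` object from the Puppeteer script:
//   const mathml = await page.$$eval("[data-mathml]", readMathml);
```

Each returned entry would be a string such as the `<math ...><mn>5</mn></math>` value in the question, which the client could render or parse itself.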

In the page.evaluate callback of the Puppeteer script, replace .join("") with .filter(Boolean) if you want a rawer version of the data, which you can process further and optionally join later:

[
  [ 'So sánh' ],
  [
    'a) ',  '5',
    ' và ', '3',
    '√',    '123',
    ' ;'
  ],
  [
    'b) ', '5', '3',
    '√',   '6', ' và ',
    '6',   '3', '√',
    '5',   '.'
  ]
]
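
If you keep the raw form, joining later could look roughly like the sketch below. The `raw` array is copied from the output above, and `joinTokens` is a hypothetical helper name used only for illustration.

```javascript
// Raw token arrays, copied from the .filter(Boolean) output above.
const raw = [
  ["So sánh"],
  ["a) ", "5", " và ", "3", "√", "123", " ;"],
  ["b) ", "5", "3", "√", "6", " và ", "6", "3", "√", "5", "."],
];

// Hypothetical post-processing step: collapse each paragraph's tokens
// into a single string, one string per paragraph.
const joinTokens = paragraphs => paragraphs.map(tokens => tokens.join(""));

console.log(joinTokens(raw));
```

This reproduces the joined strings from the first output, but leaves room to transform individual tokens before joining.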