Converting HTML to JSON with pandoc


I'm trying to take HTML and generate JSON that keeps the same structure.

I'm trying to use pandoc, as I've had some success converting documents from one format to another with it before.

I'm trying to convert this file:

example.html

<p>Hello guys! What's up?</p>

Using the command:

pandoc -f html -t json example.html

What I expect is something like:

[{ "p": "Hello guys! What's up?"}]

What I get is:

[
  { "Para":
    [
      {"t": "Str", "c": "Hello"},
      {"t": "Space"},
      {"t": "Str", "c": "guys!"},
      {"t": "Space"},
      {"t": "Str", "c": "What's"},
      {"t": "Space"},
      {"t": "Str", "c": "up?"}
    ]
  }
]

The problem seems to be that when pandoc reads the text content, it splits it on every space character and emits an array of separate word and space elements, whereas I expected pandoc to treat the whole string as a single element.

I'm a beginner at pandoc and I've not been able to find out how to tweak that behavior.

Do you have an idea of how I can get the desired output? Do you know of another tool that can do this? Neither the tool nor the language it's written in matters.

Thanks.

Edit: You can test that behavior online with the pandoc online tool.

Edit 2: Workaround. I couldn't find out how to do the HTML->JSON conversion with pandoc. As a workaround, I followed the suggestion from the comments and implemented a solution using Himalaya, which is a Node package. The result is exactly what I wished for, even though it doesn't use pandoc.


There are 2 answers

1
mb21 On

Currently, the pandoc JSON representation is not very human-readable; it is auto-generated from the Haskell pandoc data types (i.e. the document AST). There is some discussion about changing that eventually.

I guess you're looking for something like https://codebeautify.org/xmltojson? There also seem to be plenty of command-line tools that do that.

0
ekiim On

Pandoc is a tool for converting documents. Its JSON output is just another representation that pandoc can handle for its AST (Abstract Syntax Tree):

Original Document --> Pandoc's AST --> Output Document
                   |                |
                pandoc           pandoc

Asking pandoc to output JSON means asking for the AST in its JSON format.

If I understand correctly, you need something more like an XML-to-JSON converter, such as the Python xmljson module or an online tool.

There are plenty of tools for the job as you picture it; just search for "XML to JSON converter".
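To illustrate the kind of structure-preserving conversion the question asks for, here is a minimal sketch that uses only Python's standard-library html.parser. It is not pandoc and not the xmljson module; the names StructureBuilder, simplify and html_to_json are made up for this example, attributes are ignored, and the input is assumed to have explicit closing tags (apart from a few common void elements).

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StructureBuilder(HTMLParser):
    """Builds nested lists of {tag: children} mirroring the element tree."""

    def __init__(self):
        super().__init__()
        self.root = []            # top-level list of nodes
        self.stack = [self.root]  # child lists of the currently open elements

    def handle_starttag(self, tag, attrs):
        children = []
        self.stack[-1].append({tag: children})
        if tag not in VOID_TAGS:
            self.stack.append(children)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Skip whitespace-only text between elements.
        if data.strip():
            self.stack[-1].append(data)


def simplify(nodes):
    """Collapse an element whose only child is text into {"tag": "text"}."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(node)
            continue
        (tag, children), = node.items()
        children = simplify(children)
        if len(children) == 1 and isinstance(children[0], str):
            out.append({tag: children[0]})
        else:
            out.append({tag: children})
    return out


def html_to_json(html):
    """Return a JSON-serialisable structure for the given HTML string."""
    parser = StructureBuilder()
    parser.feed(html)
    parser.close()
    return simplify(parser.root)


print(json.dumps(html_to_json("<p>Hello guys! What's up?</p>")))
```

For the example.html from the question this prints the paragraph as a single "p" entry. A dedicated parser such as Himalaya (mentioned in the question's second edit) will handle real-world HTML far more robustly than this sketch.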

The JSON representation of the AST is normally used to pipe pandoc's output into another program that can handle JSON, so that you can alter the AST and write filters that manipulate the structure of your document.
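As a rough sketch of that kind of post-processing, the following Python joins the Str and Space inlines of each paragraph back into one string. It assumes the JSON layout emitted by recent pandoc versions (a top-level object with a "blocks" list, each node having a "t" type and optional "c" content), which differs from the output shown in the question; the function names are made up for this example, and inline types other than plain text and breaks are simply skipped.

```python
def inlines_to_text(inlines):
    """Join pandoc inline nodes into one plain string."""
    parts = []
    for inline in inlines:
        kind = inline.get("t")
        if kind == "Str":
            parts.append(inline["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        elif kind == "LineBreak":
            parts.append("\n")
        # Other inline types (Emph, Link, ...) are ignored in this sketch.
    return "".join(parts)


def flatten_paragraphs(doc):
    """Turn each Para block of a parsed pandoc JSON document into {"p": text}."""
    result = []
    for block in doc.get("blocks", []):
        if block.get("t") == "Para":
            result.append({"p": inlines_to_text(block["c"])})
    return result


# Hand-written sample in the assumed layout, not captured pandoc output.
sample = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"}, {"t": "Space"},
            {"t": "Str", "c": "guys!"}, {"t": "Space"},
            {"t": "Str", "c": "What's"}, {"t": "Space"},
            {"t": "Str", "c": "up?"},
        ]},
    ],
}

print(flatten_paragraphs(sample))
```

In practice you would pipe the output of pandoc -f html -t json example.html into a script that parses standard input with json.load(sys.stdin) and passes the result to flatten_paragraphs. Block types other than Para (headers, lists, tables) would each need their own handling.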