Pandoc - markdown files converted from html contain extraneous "gibberish"

I used Pandoc to convert HTML files, which were exported from Nimbus Notes, into Markdown. The converted files contain markup that was not in the original notes.

When I view the md file in Obsidian, all the original text and images are there, as they were in the original note in Nimbus. However, the note also contains all kinds of "gibberish" that was not in the original: lines that start with three colons followed by no text, and lines that start with three colons followed by text enclosed in curly brackets. For example (copied from the note in Obsidian):

::: {#note-editor .note-container .theme-light .is-safari style="position: relative;"} 
::: editor-body 
::: {#note-root .export-mode .nedit-root .notranslate .size-normal .style-normal} 
::: {.editable-text .paragraph .indent-0 style="text-align:left;"}

It's such a mess as to be unreadable. How do I fix that?

The script I used was as follows (in zsh):

#!/bin/bash

# Use the current working directory as the root directory
root_dir="."

# Recursively traverse the directory structure
find "$root_dir" -name '*.html' -type f -exec sh -c 'pandoc "$1" -o "${1%.html}.md"' _ {} \;

I want the note to appear without that extraneous markup, looking as it originally did in Nimbus Notes.

UPDATE: I "might" have fixed the problem. I changed the last line in my script to read as follows:

find "$root_dir" -name '*.html' -type f -exec sh -c 'pandoc "$1" -t gfm-raw_html -o "${1%.html}.md"' _ {} \;

So, I added "-t gfm-raw_html" to that line.

It appears as though the gibberish is gone, which is great! Does my formatting look correct?

However, I am now seeing a small handful of notes in Obsidian (i.e., markdown) that are blank, where the original note in Nimbus was not.

Could that be related to the latest change? I don't know why those pages are blank (about 4 out of 30 or so notes). The other notes look great.

There are 2 answers

John MacFarlane On

Those are fenced divs, a pandoc Markdown extension. Apparently Obsidian's Markdown dialect doesn't support them. That's fine: you can disable them by running pandoc with -t markdown-fenced_divs. In that case you may get some raw HTML div tags instead; to disable all of this you can use -t markdown-fenced_divs-native_divs-raw_html. Or you could try something like -t commonmark, -t gfm, or -t markdown_strict. Pandoc supports many different Markdown dialects.
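
For illustration, here is a minimal sketch of how the format string could be plugged into the find loop from the question. It assumes pandoc is on the PATH. The function name convert_tree and its two arguments are made up for this example and are not part of pandoc.

```shell
# convert_tree is a hypothetical helper name, not part of pandoc.
# Usage: convert_tree [root_dir] [output_format]
convert_tree() (
  root_dir="${1:-.}"
  # Default is the format string suggested in the answer; another dialect
  # such as gfm or commonmark can be passed as the second argument.
  to_format="${2:-markdown-fenced_divs-native_divs-raw_html}"
  # Inside sh -c: $1 is the output format, $2 is the HTML file found by find.
  find "$root_dir" -name '*.html' -type f \
    -exec sh -c 'pandoc "$2" -t "$1" -o "${2%.html}.md"' _ "$to_format" {} \;
)
```

Calling `convert_tree . gfm-raw_html` should behave like the updated command in the question, while `convert_tree .` uses the longer markdown-based format string. Passing the format as a positional argument to `sh -c` avoids splicing it into the quoted command string.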

Tesgin On

Update:

Adding -t gfm-raw_html to the pandoc command line fixed the fenced_divs gibberish.

I've confirmed the empty pages are not caused by a pandoc error; they are an artifact of the Nimbus Notes export to HTML. Random pages are exported as blank html files, as well as pdf files. I've reached out to support for a fix. No word yet.
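
For anyone who wants to find the affected notes, here is a rough sketch that lists files containing nothing but whitespace. The function name list_blank is made up for this example. It assumes a "blank" note ends up as an empty or whitespace-only .md file; a blank HTML export may still contain markup, so checking the converted .md files is probably the more useful test.

```shell
# list_blank is a hypothetical helper name.
# Usage: list_blank [root_dir] [name_pattern]
list_blank() (
  root_dir="${1:-.}"
  pattern="${2:-*.md}"
  # grep -q succeeds when a file has at least one non-whitespace character,
  # so a path is printed only for empty or whitespace-only files.
  find "$root_dir" -name "$pattern" -type f -exec sh -c '
    for f do
      grep -q "[^[:space:]]" "$f" || printf "%s\n" "$f"
    done' _ {} +
)
```

Running `list_blank .` after the conversion prints the blank notes, and each matching .html source can then be opened to confirm that the export itself is empty.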