Importing md from html: image links have redundant text


I’m using a script with pandoc to convert html files and assets to markdown. My script “works,” but I have a minor problem: when I look at the markdown note in Obsidian, it has redundant text in the links to the image files in the assets folder:


[![](./assets/3sIcgyHnp7HfaxQy.jpeg)](./assets/3sIcgyHnp7HfaxQy.jpeg)

The proper text should be:


![](./assets/3sIcgyHnp7HfaxQy.jpeg)

So pandoc seems to be interpreting the HTML structure as an image nested inside a link to the same file, leading to the redundant output.

The code in my HTML files, I believe, uses a standard <a> tag with an href attribute to link the image, e.g.:


<a href="./assets/3sIcgyHnp7HfaxQy.jpeg">
    <img class="img-hide" src="./assets/3sIcgyHnp7HfaxQy.jpeg">
</a>

Can anyone help me fix this to get rid of the redundancy?

My full script, including the pandoc line, is as follows:


# Converts html files into md files (in same location as html files)

success=true

# Use the current working directory as the root directory
root_dir="."

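# Recursively traverse the directory structure, suppressing pandoc's output.
# A variable assigned inside `sh -c` is lost when that child shell exits, so each
# child reports failure through its exit status instead; with `-exec ... {} +`,
# find itself exits non-zero if any invocation does.
find "$root_dir" -name '*.html' -type f -exec sh -c '
    rc=0
    for f; do
        pandoc "$f" -t gfm-raw_html --wrap=none -o "${f%.html}.md" >/dev/null 2>&1 || rc=1
    done
    exit "$rc"
' _ {} + || success=false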

if $success; then
    echo "Successfully ran the script."
else
    echo "Some errors occurred during the process."
fi

There are 2 answers

mb21

The input HTML contains a link (the <a> tag) around the image (the <img> tag). You can either remove the links in the HTML or write a pandoc Lua filter to have them removed automatically.
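
For the filter route, here is a minimal sketch rather than a tested drop-in. It assumes the link contains nothing but the image, which is what the `[![](...)](...)` output suggests. The file name `unwrap-images.lua` and the helper `convert_html` are names made up for this example; the pandoc options are the ones from the question plus `--lua-filter`.

```shell
# Write a small pandoc Lua filter. The file name is an arbitrary choice for this sketch.
cat > unwrap-images.lua <<'EOF'
-- Replace any link whose only content is an image with the image itself.
-- Links with other content are left unchanged (the function returns nil).
function Link(el)
  if #el.content == 1 and el.content[1].t == "Image" then
    return el.content[1]
  end
end
EOF

# Hypothetical helper: convert one HTML file to Markdown with the filter applied.
convert_html() {
    pandoc "$1" --lua-filter=unwrap-images.lua -t gfm-raw_html --wrap=none -o "${1%.html}.md"
}
```

Calling `convert_html note.html` (file name hypothetical) would then write `note.md` with a plain `![](...)` image in place of the linked one, without touching the HTML source.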

Tesgin

RESOLVED.

Here's the script I used to fix the image links. It edits the HTML files in place and strips the <a> tags wrapped around the images.

Hope it's helpful to someone else. Enjoy!

#!/bin/bash

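# Remove the <a> tags that wrap <img> elements

# Find and edit HTML files in the current directory and subdirectories
find . -name '*.html' -print0 | while IFS= read -r -d '' file; do
    # Replace <a href="./assets/..."><img ...></a> with the bare <img ...> tag.
    # A single expression removes both halves of the pair together, so the
    # <img> keeps its closing ">" and other links keep their closing </a>.
    # Note: `sed -i ''` is the BSD/macOS form; GNU sed takes `-i` with no argument.
    sed -i '' -e 's/<a href="\.\/assets[^>]*>\(<img[^>]*>\)<\/a>/\1/g' "$file"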
done

echo "HTML files have been modified to remove image hyperlinks."
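
Since the script edits files in place, it may be worth trying the substitution on a sample line first. The sketch below does that on stdin with no `-i`, so nothing is modified. It uses a single-expression form of the substitution and assumes the `<a>`, `<img>` and `</a>` tags sit on one line with nothing between them; the sample line and the function name `unwrap_img` are made up for the example.

```shell
# Sample line modelled on the HTML in the question (assumed to be on one line).
sample='<a href="./assets/3sIcgyHnp7HfaxQy.jpeg"><img class="img-hide" src="./assets/3sIcgyHnp7HfaxQy.jpeg"></a>'

# Hypothetical helper: read HTML on stdin and drop an <a href="./assets/...">
# wrapper that directly encloses an <img> tag, keeping the <img> itself.
unwrap_img() {
    sed -e 's/<a href="\.\/assets[^>]*>\(<img[^>]*>\)<\/a>/\1/g'
}

result=$(printf '%s\n' "$sample" | unwrap_img)
printf '%s\n' "$result"
# prints: <img class="img-hide" src="./assets/3sIcgyHnp7HfaxQy.jpeg">
```

If the tags are spread over several lines in your files, this line-by-line match will not fire, and removing the links with a pandoc filter is likely the safer route.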