Can Microsoft Azure Translator API translate text that has HTML tags?


I'm trying to consume Azure translator API to translate text that has HTML tags. Translation is from English to Finnish or Danish.

I noticed that in one case, a link (an <a href> tag), the link text was not translated: the text "two-column" was simply replaced with a ",".

Any ideas why it does that? Should I strip the HTML tags first? Are there limitations, or do I need to add something to the request to get the right translation?

English Text:

How do you get started?

Select 'Edit' to start working with this basic two-column template with an emphasis on text and examples of text formatting. With your page in edit mode, select this paragraph and replace it with your own text. Then, select the Basic two-column template title and replace it, too. Type your name in the page author field under the title.

You've just edited a page template and you're on your way to making this page your own!

 

Translated Finnish text:

Miten pääset alkuun?

Valitse 'Muokkaa' aloittaaksesi työskentelyn tämän kaksipalstaisen perusmallin , jossa painotetaan tekstiä ja esimerkkejä tekstin muotoilusta.  Kun sivu on muokkaustilassa, valitse tämä kappale ja korvaa se omalla tekstilläsi. Valitse sitten kaksisarakkeisen mallin perusotsikko ja korvaa se myös. Kirjoita nimesi sivun tekijäkenttään otsikon alle.

Olet juuri muokannut sivumallia ja olet matkalla tekemään tästä sivusta omanlaisesi!

 

I expected the text to be translated regardless of the HTML tags, but the returned translation messed up the text by replacing the link text with a comma.

There is 1 answer

Answered by Suresh Chikkam

It's common for translation services to have trouble with text that contains HTML markup. When the input is treated as plain text, a translation API may not handle the tags correctly and can produce unexpected results, such as replacing parts of the text or leaving out certain elements. The Translator text API has a textType parameter for this case: it defaults to plain and can be set to html.

  • Here is sample code that translates text containing HTML tags with the Azure Translator text API. It calls the REST endpoint directly and sets textType=html. (The DocumentTranslationClient works on documents stored in Azure Blob Storage and has no source_text parameter, so it is not suited to inline strings.)
import requests

# Replace these variables with your Translator resource key and region
azure_key = "YOUR_AZURE_SUBSCRIPTION_KEY"
azure_region = "YOUR_RESOURCE_REGION"
azure_endpoint = "https://api.cognitive.microsofttranslator.com"

# Sample English text with HTML tags
english_html_text = """
How do you get started?
Select 'Edit' to start working with this basic <a href="example.com">two-column</a> template with an emphasis on text and examples of text formatting.
"""

# textType=html tells the service to treat the input as HTML rather than plain text
params = {"api-version": "3.0", "from": "en", "to": ["fi", "da"], "textType": "html"}
headers = {
    "Ocp-Apim-Subscription-Key": azure_key,
    "Ocp-Apim-Subscription-Region": azure_region,
    "Content-Type": "application/json",
}
body = [{"Text": english_html_text}]

# Translate the text
response = requests.post(f"{azure_endpoint}/translate", params=params, headers=headers, json=body)
response.raise_for_status()

# Each input item yields one entry in "translations" per target language
translations = {t["to"]: t["text"] for t in response.json()[0]["translations"]}

# Print the translated text
print("Finnish Translation:")
print(translations["fi"])

print("\nDanish Translation:")
print(translations["da"])
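
Whichever way the request is made, it can help to check that the markup survived translation. The sketch below is one way to do that, using only Python's standard html.parser; the names _TagCollector and check_translation are illustrative and not part of any Azure SDK. It flags a changed tag sequence and any link text that contains no letters or digits (as in the question, where the link text became ","). A changed tag order can be legitimate when word order differs between languages, so treat the result as a prompt for review rather than a hard error.

```python
from html.parser import HTMLParser
from typing import List


class _TagCollector(HTMLParser):
    """Records tag names in order, plus the text inside each <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []
        self.link_texts: List[str] = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            self._in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.link_texts[-1] += data


def _collect(html: str) -> _TagCollector:
    collector = _TagCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return collector


def check_translation(source_html: str, translated_html: str) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    src, dst = _collect(source_html), _collect(translated_html)
    problems = []
    if src.tags != dst.tags:
        problems.append(f"tag sequence changed: {src.tags} -> {dst.tags}")
    for text in dst.link_texts:
        if not any(ch.isalnum() for ch in text):
            problems.append(f"link text looks empty or punctuation-only: {text!r}")
    return problems


# Illustrative strings only (not real API output)
source = "this basic <a href='example.com'>two-column</a> template"
suspicious = "tämän <a href='example.com'>,</a> perusmallin"
for problem in check_translation(source, suspicious):
    print(problem)
```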


Translate Text and Tags Separately:

# Example of splitting text and HTML tags
original_text = "How do you get started? Select 'Edit' to start working with this basic <a href='example.com'>two-column</a> template."
text_segments = ["How do you get started? Select 'Edit' to start working with this basic ", "two-column", " template."]
html_tags = ["<a href='example.com'>", "</a>"]


def translate_text_segments(segments):
    # Placeholder so the example runs: replace the body with a real call to
    # your translation service. Here the segments come back unchanged.
    return list(segments)


# Translate text segments (excluding HTML tags) separately.
# Note: fragments lose their sentence context, which can hurt translation quality.
translated_text_segments = translate_text_segments(text_segments)

# After translation, interleave the translated segments with the HTML tags
translated_text = ""
for i, segment in enumerate(translated_text_segments):
    translated_text += segment
    if i < len(html_tags):
        translated_text += html_tags[i]

# Now translated_text contains the translated text with preserved HTML structure
print(translated_text)
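
The manual split above can be automated so the segments and tags do not have to be listed by hand. The following is a minimal sketch using Python's standard html.parser, not a definitive implementation: _Splitter and translate_html are hypothetical names, and the translate argument is any function you supply that maps a list of strings to a list of translated strings (for example, a wrapper around your Translator call). Tags and character entities are passed through untouched; comments and doctype declarations are dropped, and end tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser
from typing import Callable, List, Tuple


class _Splitter(HTMLParser):
    """Splits HTML into ordered ('tag' | 'text', value) parts."""

    def __init__(self) -> None:
        # Keep entities such as &amp; as separate, untranslated parts
        super().__init__(convert_charrefs=False)
        self.parts: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the input
        self.parts.append(("tag", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.parts.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        self.parts.append(("text", data))

    def handle_entityref(self, name):
        self.parts.append(("tag", f"&{name};"))

    def handle_charref(self, name):
        self.parts.append(("tag", f"&#{name};"))


def translate_html(html: str, translate: Callable[[List[str]], List[str]]) -> str:
    """Translate only the text nodes of `html`, leaving the markup in place."""
    splitter = _Splitter()
    splitter.feed(html)
    splitter.close()

    # Only non-blank text nodes are sent for translation
    indexes = [i for i, (kind, value) in enumerate(splitter.parts)
               if kind == "text" and value.strip()]
    translated = translate([splitter.parts[i][1] for i in indexes])
    if len(translated) != len(indexes):
        raise ValueError("translator returned a different number of segments")

    out = [value for _, value in splitter.parts]
    for i, new_text in zip(indexes, translated):
        out[i] = new_text
    return "".join(out)


# Usage with a stand-in "translator" that just upper-cases each segment
def fake_translate(segments: List[str]) -> List[str]:
    return [s.upper() for s in segments]


sample = "Select 'Edit' to start with this basic <a href='example.com'>two-column</a> template."
print(translate_html(sample, fake_translate))
```

The same caveat applies as with the manual version: each text node is translated in isolation, so the service cannot see the full sentence around a link.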