Translating a Spanish Stata dataset into English using Python


I have a large Spanish dataset in Stata with more than 2500 variables, and I want to translate it into English. Many of these variables carry value labels. I am using Google's API for translation. For now I have taken 10 observations and 2 variables that have value labels (p4 and p5), and I am trying to write code to translate them. However, the value labels do not come through correctly. In my original dataset, the p4 variable has the following value labels:

       4 Educación Básica o Preparatoria completa
       6 Educación Media o Humanidades completa
       7 Instituto Profesional o Centros de Formación Técnica incompl
       8 Instituto Profesional o Centros de Formación Técnica complet
       9 Universitaria incompleta
      10 Universitaria completa
     

However, in the translated dataset the p4 variable shows the following labels:

       0 Complete Basic or High School Education
       1 Secondary Education or Complete Humanities
       2 Professional Institute or Technical Training Centers incomplete
       3 Professional Institute or Complete Technical Training Centers
       4 incomplete university
       5 Complete university

Basically, the numeric codes of the value labels are not recorded correctly in the final dataset, which is again in .dta format. How do I modify my Python code to solve this?

Following is my code. Please suggest how to modify this to solve the above issue.

import pandas as pd
from googletrans import Translator, LANGUAGES

# Initialize the translator
translator = Translator()

# Step 1: Read the Stata dataset into Python
df = pd.read_stata('C:\\transl_trial.dta')

# Step 2: Identify the variables with value labels
columns_to_translate = ['p4', 'p5']

from pandas.api.types import CategoricalDtype

# Step 3: Translate the value labels
for col in columns_to_translate:
    # Extract value labels for the column
    value_labels = df[col].cat.categories.tolist()
    print(value_labels)
    translations = {}
    for label in value_labels:
        # Translate from Spanish to English
        translated_text = translator.translate(label, src='es', dest='en').text
        translations[label] = translated_text
    print(translations)
    # Replace the original categories with their translated versions
    df[col] = df[col].replace(translations).astype('category')


output_path = r'C:\\translated_dataset.dta'
df.to_stata(output_path, write_index=False)

There is 1 answer

Iskander14yo

As far as I remember, value labels in Stata datasets are tied to numeric codes, and those codes are lost when you translate the labels. You therefore need to preserve the codes when translating. I think this code solves the issue:

for col in columns_to_translate:
    # Extract the value labels (the pandas categories) for the column
    value_labels = df[col].cat.categories.tolist()

    # Translate the value labels
    translations = {}
    for label in value_labels:
        # Translate from Spanish to English
        translated_text = translator.translate(label, src='es', dest='en').text
        translations[label] = translated_text

    # Rename the categories instead of replacing the values:
    # rename_categories keeps the category order and the categorical codes,
    # so only the label text changes.
    df[col] = df[col].cat.rename_categories(translations)

output_path = r'C:\\translated_dataset.dta'
df.to_stata(output_path, write_index=False)
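
One caveat: by default pd.read_stata turns labelled variables into pandas categoricals, and to_stata numbers the categories 0, 1, 2, ... when it writes them back, so the original Stata values (4, 6, 7, ...) are not kept in the categorical at all. If those exact values matter, a possible alternative is to read the raw numbers with convert_categoricals=False, translate only the label text, and hand the labels back through the value_labels argument of to_stata. The following is a minimal sketch, not a tested solution for your file. It assumes pandas 1.4 or newer, and it assumes each value-label set is named after its variable (p4, p5), which is not guaranteed for every dataset. The helper names translate_labels and translate_stata_file and the translate callable are made up for illustration.

```python
import pandas as pd


def translate_labels(labels, translate):
    """Translate the text of one value-label set and keep its numeric codes.

    labels    -- dict such as {4: "Educación Básica ...", 6: "..."}
    translate -- any callable str -> str (hypothetical hook, e.g. a small
                 wrapper around translator.translate(...).text)
    """
    return {code: translate(text) for code, text in labels.items()}


def translate_stata_file(src, dst, columns, translate):
    # Read the raw numeric codes instead of pandas categoricals, so the
    # values stored in the .dta file (4, 6, 7, ...) are left untouched.
    with pd.read_stata(src, convert_categoricals=False, iterator=True) as reader:
        df = reader.read()
        # Shape: {label_set_name: {code: label}}.
        # Assumption: the label set name equals the variable name.
        all_labels = reader.value_labels()

    new_labels = {
        col: translate_labels(all_labels[col], translate)
        for col in columns
        if col in all_labels
    }

    # Attach the translated labels to the unchanged numeric columns.
    # The value_labels argument requires pandas 1.4 or newer.
    df.to_stata(dst, write_index=False, value_labels=new_labels)
    return new_labels
```

With the objects from the question, the call would look something like translate_stata_file('C:\\transl_trial.dta', output_path, columns_to_translate, lambda s: translator.translate(s, src='es', dest='en').text). If the label sets in your file have different names from the variables, you would need to map them by hand before building new_labels.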