I need some advice on prepping and cleaning my data. I have two survey data sets (2020 and 2021).
The 2021 survey added some questions and changed some wording, but most questions are the same in both years. I had to go through the data sets manually to identify the columns that capture the same information, and I built a reference key to track the matching columns across the two years.
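To show what I mean by the reference key, here is a minimal sketch of how I imagine using it to line up and stack the two years. The column names, the sample rows, and the helpers `harmonize` and `merge_surveys` are all made up for illustration, and I am assuming each survey is loaded as a list of dicts (for example via `csv.DictReader`).

```python
# Hypothetical reference key: 2021 column name -> matching 2020 column name.
REFERENCE_KEY = {
    "q1_age_group": "age",
    "q2_region": "region",
}


def harmonize(rows, rename, year):
    """Rename columns using the reference key and tag each row with its survey year."""
    out = []
    for row in rows:
        new_row = {rename.get(col, col): value for col, value in row.items()}
        new_row["survey_year"] = year
        out.append(new_row)
    return out


def merge_surveys(rows_2020, rows_2021, rename):
    """Stack both years; a column missing from one year is filled with None."""
    combined = harmonize(rows_2020, {}, 2020) + harmonize(rows_2021, rename, 2021)
    # Collect every column name in first-seen order.
    columns = []
    for row in combined:
        for col in row:
            if col not in columns:
                columns.append(col)
    return [{col: row.get(col) for col in columns} for row in combined]


# Made-up example rows, not my real data.
rows_2020 = [{"age": "25-34", "region": "West"}]
rows_2021 = [{"q1_age_group": "35-44", "q2_region": "East", "q3_remote_work": "Yes"}]

merged = merge_surveys(rows_2020, rows_2021, REFERENCE_KEY)
for row in merged:
    print(row)
```

Questions that exist only in 2021 stay in the merged data with `None` for the 2020 rows, so nothing is dropped silently.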
Among the matched columns, I've identified a couple of questions that are very similar in nature but whose response setups are completely different. Can I recode the responses so they are comparable and can be included in a merged dataset without messing anything up? If so, how should I go about doing it?
I've attached screenshots of the questions from both surveys. The similar questions are highlighted in green.
Would measuring similarity between two sentences using cosine similarity be a way to do this?
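To make the idea concrete, this is a minimal sketch of the kind of cosine similarity I have in mind, using plain word counts from the standard library. The `tokenize` and `cosine_similarity` helpers and the two example questions are hypothetical, not taken from my surveys. As far as I understand, a score like this only compares the wording of the questions and says nothing about whether the response options line up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two sentences."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the words the two sentences share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up question wordings for illustration only.
q_2020 = "How satisfied are you with your job?"
q_2021 = "How satisfied are you with your current job overall?"
score = cosine_similarity(q_2020, q_2021)
print(round(score, 3))
```

A high score could flag candidate matches for me to review by hand, but I assume I would still need to check the response options myself.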
Also, would Python or SQL be easier to do this in?
[2020 Question and Response 1](https://i.stack.imgur.com/Wz1Re.png)
[2020 Question and Response 2](https://i.stack.imgur.com/tNESF.png)
[2021 Question and Response 1](https://i.stack.imgur.com/B3voa.png)
[2021 Question and Response 2](https://i.stack.imgur.com/2GHne.png)