I have a 1-dimensional array that stores a categorical feature of my dataset. Each data instance belongs to several categories, and the categories are separated by commas, like so:
Administration Oral ,Aged ,Area Under Curve ,Cholinergic Antagonists/adverse effects/*pharmacokinetics/therapeutic use ,Circadian Rhythm/physiology ,Cross-Over Studies ,Delayed-Action Preparations ,Dose-Response Relationship Drug ,Drug Administration Schedule ,Female ,Humans ,Mandelic Acids/adverse effects/blood/*pharmacokinetics/therapeutic use ,Metabolic Clearance Rate ,Middle Aged ,Urinary Incontinence/drug therapy ,Xerostomia/chemically induced ,
Adult ,Anti-Ulcer Agents/metabolism ,Antihypertensive Agents/metabolism ,Benzhydryl Compounds/administration & dosage/blood/*pharmacology ,Caffeine/*metabolism ,Central Nervous System Stimulants/metabolism ,Cresols/administration & dosage/blood/*pharmacology ,Cross-Over Studies ,Cytochromes/*pharmacology ,Debrisoquin/*metabolism ,Drug Interactions ,Humans ,Male ,Muscarinic Antagonists/pharmacology ,Omeprazole/*metabolism ,*Phenylpropanolamine ,Polymorphism Genetic ,Tolterodine Tartrate ,Urinary Bladder Diseases/drug therapy ,
...
...
Each element of the array represents the categories a data instance belongs to. I need to one-hot encode these so I can use them as a feature to train my algorithm. I understand this can be achieved using scikit-learn, but I am unsure how to implement it. (There are ~150 possible categories and around 1,000 data instances.)
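I recommend you use the get_dummies method in pandas for this. The interface is a little nicer, especially if you're already using pandas to store your data. Because each element holds several comma-separated categories, the string variant Series.str.get_dummies(sep=',') is the one that applies here: it splits each string on the separator and returns one indicator column per category. The sklearn implementations are a bit more involved. If you do decide to go the sklearn route, note that OneHotEncoder and LabelBinarizer both expect a single category per instance, and older versions of OneHotEncoder also required you to first convert your categories to integer values, which you can accomplish with LabelEncoder. For several categories per instance, as here, MultiLabelBinarizer is the closer fit once each string has been split into a list of categories.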
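A minimal sketch of the pandas route, assuming the array has been loaded into a pandas Series of comma-separated strings. The two sample rows are shortened, hypothetical stand-ins for the real data, and the names records, cleaned, and one_hot are illustrative only. The cleanup step is there because the entries in the question carry a space before each comma and a trailing comma.

```python
import pandas as pd

# Hypothetical, shortened rows in the same shape as the question's data:
# categories separated by " ," with a trailing comma.
records = pd.Series([
    "Aged ,Cross-Over Studies ,Female ,Humans ,",
    "Adult ,Cross-Over Studies ,Humans ,Male ,",
])

# Normalize the separators to a bare "," and strip the trailing comma/space,
# so that "Aged " and "Aged" do not end up as two different categories.
cleaned = (
    records.str.replace(r"\s*,\s*", ",", regex=True)
           .str.strip(", ")
)

# One indicator column per distinct category, one row per data instance.
one_hot = cleaned.str.get_dummies(sep=",")

print(one_hot)
```

The resulting DataFrame has one row per data instance and one column per category, so with the sizes mentioned in the question it would be roughly 1,000 rows by 150 columns. Calling one_hot.to_numpy() gives a plain array that can be passed to an estimator.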