I am trying to count the number of records in a PySpark DataFrame, but I am not getting the desired result.
When I ran the equivalent logic on an RDD, I got the expected result. Below is the RDD code:
# We will extract the first element of each split row, assuming it represents the study ID
study_ids = cleaned_data.map(lambda row: row[0].strip('"'))
# Count the number of distinct study IDs
num_studies = study_ids.distinct().count()
print("Number of distinct studies:", num_studies)
I then ran the following code on the DataFrame, which fails with the error that follows:
cleaned_data_df.groupBy('Id').count().orderBy('count', ascending=False).show()
An error occurred while calling o2584.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 157.0 failed 1 times, most recent failure: Lost task 2.0 in stage 157.0 (TID 476) (ip-10-172-184-229.us-west-2.compute.internal executor driver): org.apache.spark.SparkException: [INTERNAL_ERROR] Input row doesn't have expected number of values required by the schema. 14 fields are required while 8 values are provided. SQLSTATE: XX000...
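The message says the schema expects 14 values but at least one row supplies only 8, so one way to investigate might be to count how many fields each split row has before building the DataFrame. The sketch below is plain Python and only illustrates the idea: `field_count_histogram`, `pad_row`, and the placeholder rows are hypothetical names and made-up data, and it assumes `cleaned_data` is an RDD of already-split rows (lists of strings).

```python
from collections import Counter

# Taken from the error message: "14 fields are required".
EXPECTED_FIELDS = 14


def field_count_histogram(rows):
    """Map each observed row length to the number of rows with that length."""
    return Counter(len(row) for row in rows)


def pad_row(row, n_fields=EXPECTED_FIELDS):
    """Return a list of exactly n_fields values: short rows are padded
    with None, long rows are truncated."""
    values = list(row)[:n_fields]
    return values + [None] * (n_fields - len(values))


# Made-up placeholder rows (not real data): one complete, one short.
sample_rows = [
    ["ID_A"] + ["x"] * 13,  # 14 fields
    ["ID_B"] + ["x"] * 7,   # 8 fields, like the row in the error message
]

histogram = field_count_histogram(sample_rows)
padded = [pad_row(row) for row in sample_rows]

print(histogram)
print([len(row) for row in padded])

# On the RDD itself, assuming cleaned_data holds lists of split fields,
# the same checks could look like this:
#   cleaned_data.map(len).countByValue()   # row-length histogram
#   cleaned_data.map(pad_row)              # rows normalized to 14 fields
```

If the histogram shows lengths other than 14, the split step is probably producing short rows (for example because of missing trailing fields), and padding or filtering those rows before creating the DataFrame may avoid the error. Whether padding or dropping is appropriate depends on the data.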
Below is a sample of the data in the DataFrame:
Id Study Title Acronym Status Conditions Interventions Sponsor Collaborators Enrollment Funder Type Type Study Design Start Completion
NCT03630471 Effectiveness of a Problem-solving Intervention for Common Adolescent Mental Health Problems in India PRIDE COMPLETED Mental Health Issue (E.G. Depression Psychosis Personality Disorder Substance Abuse) BEHAVIORAL: PRIDE 'Step 1' problem-solving intervention|BEHAVIORAL: Enhanced usual care Sangath Harvard Medical School (HMS and HSDM)|London School of Hygiene and Tropical Medicine 250.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: PARALLEL|Masking: DOUBLE (INVESTIGATOR OUTCOMES_ASSESSOR)|Primary Purpose: TREATMENT 2018-08-20 2019-02-28
NCT05992571 Oral Ketone Monoester Supplementation and Resting-state Brain Connectivity RECRUITING Cerebrovascular Function|Cognition OTHER: Placebo|DIETARY_SUPPLEMENT: β-OHB McMaster University Alzheimer's Society of Brant Haldimand Norfolk Hamilton Halton 30.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: CROSSOVER|Masking: TRIPLE (PARTICIPANT INVESTIGATOR OUTCOMES_ASSESSOR)|Primary Purpose: BASIC_SCIENCE 2023-10-25 2024-08
NCT00237471 Impact of Tight Glycaemic Control in Acute Myocardial Infarction TERMINATED Myocardial Infarct|Hyperglycemia DRUG: Insulin (tight blood glucose control) Melbourne Health National Health and Medical Research Council Australia|Bristol-Myers Squibb 40.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: PARALLEL|Masking: NONE|Primary Purpose: TREATMENT 2005-10 2006-05
NCT03820271 New Prognostic Predictive Models of Mortality of Decompensated Cirrhotic Patients Waiting for Liver Transplantation SUPERMELD RECRUITING Decompensated Cirrhosis|Liver Transplantation OTHER: SuperMELD Assistance Publique - Hôpitaux de Paris 500.0 OTHER INTERVENTIONAL Allocation: NA|Intervention Model: SINGLE_GROUP|Masking: NONE|Primary Purpose: OTHER 2020-10-01 2023-10-01
NCT06229171 InTake Care: Development and Validation of an Innovative Personalized Digital Health Solution for Medication Adherence Support in Cardiovascular Prevention InTakeCare NOT_YET_RECRUITING Hypertension|Treatment Adherence and Compliance|Digital Health OTHER: adherence support system based on a vocal assistant Istituto Auxologico Italiano Istituti Clinici Scientifici Maugeri SpA|Politecnico di Milano 206.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: PARALLEL|Masking: NONE|Primary Purpose: OTHER 2024-10-01 2026-04-01
NCT02945371 Tailored Inhibitory Control Training to Reverse EA-linked Deficits in Mid-life REV COMPLETED Smoking|Alcohol Drinking|Prescription Drug Abuse|Substance-Related Disorders|Oral Intake Reduced BEHAVIORAL: Person-centered inhibitory control training|BEHAVIORAL: Active behavioral response training University of Oregon 103.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: PARALLEL|Masking: SINGLE (PARTICIPANT)|Primary Purpose: PREVENTION 2014-09 2016-05
NCT01055171 Neuromodulation of Trauma Memories in PTSD & Alcohol Dependence COMPLETED Alcohol Dependence|PTSD DRUG: Propranolol|DRUG: Placebo Medical University of South Carolina National Institute on Alcohol Abuse and Alcoholism (NIAAA) 44.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: PARALLEL|Masking: QUADRUPLE (PARTICIPANT CARE_PROVIDER INVESTIGATOR OUTCOMES_ASSESSOR)|Primary Purpose: TREATMENT 2010-01 2012-08
NCT01125371 Computerized Brief Alcohol Intervention (BI) for Binge Drinking HIV At-Risk and Infected Women COMPLETED Alcohol; Harmful Use|Binge Drinking|Risk Behavior|HIV Infection BEHAVIORAL: Computerized brief alcohol intervention + IVR booster calls|BEHAVIORAL: Computerized brief alcohol intervention|BEHAVIORAL: Attention Control Johns Hopkins University National Institute on Alcohol Abuse and Alcoholism (NIAAA) 439.0 OTHER INTERVENTIONAL Allocation: RANDOMIZED|Intervention Model: PARALLEL|Masking: DOUBLE (INVESTIGATOR OUTCOMES_ASSESSOR)|Primary Purpose: TREATMENT 2011-10 2016-06-07
NCT02554071 Manitoba Pharmacist Initiated Smoking Cessation Pilot Project COMPLETED Smoking Cessation OTHER: Pharmacist - Smoking Cessation Support University of Manitoba Govenment of Manitoba|Canadian Foundation for Pharmacy|Neighbourhood Pharmacy Association of Canada 119.0 OTHER INTERVENTIONAL Allocation: NA|Intervention Model: SINGLE_GROUP|Masking: NONE|Primary Purpose: SUPPORTIVE_CARE 2014-01 2014-11
NCT01772771 Molecular Testing for the MD Anderson Cancer Center Personalized Cancer Therapy Program RECRUITING Glioma|Hematopoietic and Lymphoid Cell Neoplasm|Malignant Solid Neoplasm|Melanoma|Sarcoma PROCEDURE: Biospecimen Collection|OTHER: Genetic Testing|OTHER: Medical Chart Review M.D. Anderson Cancer Center National Cancer Institute (NCI) 12000.0 OTHER OBSERVATIONAL Observational Model: |Time Perspective: p 2012-03-01 2033-03-01