I am trying to write a DataFrame in Parquet format to HDFS. The DataFrame contains all values before writing, but when I write it to disk, the boolean values (true, false) are converted to null.
Here is the code:
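import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Cast every column to StringType, then lower-case the column names
val finalDF = dataFrame
  .select(dataFrame.columns.map(c => col(c).cast(StringType)): _*)
  .select(dataFrame.columns.map(x => col(x).as(x.toLowerCase)): _*)
print("final DF before writing")
finalDF.show(4)
// Append as uncompressed Parquet, partitioned by the Hive partition columns
finalDF.write
  .partitionBy(pushStreamInstance.getPartitionColsForHive.map(name => name): _*)
  .option("compression", "none")
  .mode("append")
  .parquet(pushStreamInstance.getHiveOutputPath)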
When I show the first 4 rows of the DataFrame, I get the following data:
final DF before writing+-------+-----------+------+-----------+-----------+--------+---------------------------+----------+---------------------------+--------------------------+---------------------+------------+------------+---------+-------------+-------------+----------+------------+-----------------+--------------------+-------+----------------+------------+--------+-----------+---------+--------------+------------------+-------------+------------------+------------+------------+----------------+------------------------+----------------+----------------+------------------------+--------------------+-------------+--------------------+----------+--------------------------+-------------+---------------------------+--------------------+-----------------------+-----------+--------+---------------+-------------------------+-------------+-------+----------------+--------------------+----------+-------------+
|hd_conn|stb_profile|action|action_type|call_letter|category|channel_external_identifier|channel_id|client_software_version_app|client_software_version_mw|company_location_name|company_site| device_name|event_cat|event_counter| event_time|event_type|eventcounter|hdmi_connectivity|household_identifier|hw_type|invocation_point| ip_address| item_id| item_title|item_type|language_audio|language_interface|language_shop|language_subtitles| mac| mac_address|oauth2_client_id|operation_system_version|program_category|program_duration|program_reference_number|program_sub_category|program_title|recommendations_flag|resolution|scheduled_trail_identifier| start_time|stream_adaptivity_indicator| stream_content_url|stream_control_protocol|stream_mode|sub_type|subscriber_type|targeted_advertising_flag|time_position| type|viewing_duration| viewing_identifier|event_date|logstash_date|
+-------+-----------+------+-----------+-----------+--------+---------------------------+----------+---------------------------+--------------------------+---------------------+------------+------------+---------+-------------+-------------+----------+------------+-----------------+--------------------+-------+----------------+------------+--------+-----------+---------+--------------+------------------+-------------+------------------+------------+------------+----------------+------------------------+----------------+----------------+------------------------+--------------------+-------------+--------------------+----------+--------------------------+-------------+---------------------------+--------------------+-----------------------+-----------+--------+---------------+-------------------------+-------------+-------+----------------+--------------------+----------+-------------+
| true| HD TV| PLAY| START| BX1| EVENT| UID50075| UID50075| 4.81.1| PXM-SW-3.80.0| 02BRA0| BEPBXL1|F83B1D937298| LINEAR| 278.0|1638173907037| VIEWING| 278.0| true| 4727286| v7| CHANNEL-UP|192.168.1.64|43421543|Archiurbain| SINGLE| FR| FR| FR| OFF|F83B1D937298|F83B1D937298| NONE| Android 9| Magazine| 1080000.0| 43421543| Architecture| Archiurbain| true| SD| 202111203342|1638173520000| FIXED|rtp://239.255.1.1...| IGMP| MULTICAST| LINEAR| R| true| 387031.0|VIEWING| 0.0|F83B1D937298_1638...|2021-11-29| 2021-11-29|
| true| HD TV| PLAY| START| BX1| EVENT| UID50075| UID50075| 4.81.1| PXM-SW-3.80.0| 02BRA0| BEPBXL1|F83B1D937298| LINEAR| 278.0|1638173907037| VIEWING| 278.0| true| 4727286| v7| CHANNEL-UP|192.168.1.64|43421543|Archiurbain| SINGLE| FR| FR| FR| OFF|F83B1D937298|F83B1D937298| NONE| Android 9| Magazine| 1080000.0| 43421543| Architecture| Archiurbain| true| SD| 202111203342|1638173520000| FIXED|rtp://239.255.1.1...| IGMP| MULTICAST| LINEAR| R| true| 387031.0|VIEWING| 0.0|F83B1D937298_1638...|2021-11-29| 2021-11-29|
| true| HD TV| PLAY| START| BX1| EVENT| UID50075| UID50075| 4.81.1| PXM-SW-3.80.0| 02BRA0| BEPBXL1|F83B1D937298| LINEAR| 278.0|1638173907037| VIEWING| 278.0| true| 4727286| v7| CHANNEL-UP|192.168.1.64|43421543|Archiurbain| SINGLE| FR| FR| FR| OFF|F83B1D937298|F83B1D937298| NONE| Android 9| Magazine| 1080000.0| 43421543| Architecture| Archiurbain| true| SD| 202111203342|1638173520000| FIXED|rtp://239.255.1.1...| IGMP| MULTICAST| LINEAR| R| true| 387031.0|VIEWING| 0.0|F83B1D937298_1638...|2021-11-29| 2021-11-29|
| true| HD TV| PLAY| START| BX1| EVENT| UID50075| UID50075| 4.81.1| PXM-SW-3.80.0| 02BRA0| BEPBXL1|F83B1D937298| LINEAR| 278.0|1638173907037| VIEWING| 278.0| true| 4727286| v7| CHANNEL-UP|192.168.1.64|43421543|Archiurbain| SINGLE| FR| FR| FR| OFF|F83B1D937298|F83B1D937298| NONE| Android 9| Magazine| 1080000.0| 43421543| Architecture| Archiurbain| true| SD| 202111203342|1638173520000| FIXED|rtp://239.255.1.1...| IGMP| MULTICAST| LINEAR| R| true| 387031.0|VIEWING| 0.0|F83B1D937298_1638...|2021-11-29| 2021-11-29|
+-------+-----------+------+-----------+-----------+--------+---------------------------+----------+---------------------------+--------------------------+---------------------+------------+------------+---------+-------------+-------------+----------+------------+-----------------+--------------------+-------+----------------+------------+--------+-----------+---------+--------------+------------------+-------------+------------------+------------+------------+----------------+------------------------+----------------+----------------+------------------------+--------------------+-------------+--------------------+----------+--------------------------+-------------+---------------------------+--------------------+-----------------------+-----------+--------+---------------+-------------------------+-------------+-------+----------------+--------------------+----------+-------------+
only showing top 4 rows
All columns that hold boolean values (for example hd_conn) are converted to null, even though every column has StringType as its type. In other words, the type of hd_conn is StringType.
After writing to the disk, I get only nulls for these columns. An example of the output is shown below:
{"hd_conn":"null","stb_profile":"HD TV","action":"EXITED","action_type":"STOP","advertisement_reference":"null","app_duration":"4.5691844E7","app_identifier":"F83B1D16ED51_1649207767570","boot_reason":"null","call_letter":"null","category":"EVENT","channel_external_identifier":"null","channel_id":"null","channel_inactivity_threshold":"null","client_software_version":"null","client_software_version_app":"4.92.2","client_software_version_mw":"PXM-SW-3.84.0","company_location_name":"03KAP0","company_site":"BEPVLA2","connected_deflect_state_action":"null","consecutive_playout_indicator":"null","currency_unit":"null","device_identifier":"null","device_name":"F83B1D16ED51","episode_number":"null","episode_title":"null","error_code":"null","error_key":"null","error_message":"null","error_type":"null","event_cat":"Proximus Pickx","event_counter":"50574.0","event_time":"1649253459414","eventcounter":"50574.0","exception_message":"null","exception_name":"null","external_id":"null","first_sign_of_life":"null","genre":"null","group_title":"null","hdmi_connectivity":"null","household_identifier":"1457100","hw_type":"v7","into_standby":"null","into_standby_elapsed_time":"null","invocation_point":"null","ip_address":"192.168.1.65","isolation_state_action":"null","item_id":"be.px.stbtvclient","item_title":"null","item_type":"null","language_audio":"NL","language_interface":"NL","language_shop":"NL","language_subtitles":"OFF","last_activity_timestamp":"null","last_known_heart_beat":"null","mac":"F83B1D16ED51","mac_address":"F83B1D16ED51","oauth2_client_id":"NONE","operation_system_version":"Android 9","operational_mode":"null","out_of_standby":"null","out_standby_elapsed_time":"null","period_of_inactivity":"null","playout_cat":"null","playout_url":"null","post_cycle_state":"null","power_cycle_type":"null","pre_cycle_state":"null","preview":"null","previous_state_uptime":"null","program_category":"null","program_duration":"null","program_reference_number":"null","program_sub_category":"null","program_title":"null","recommendations_flag":"true","remote_connected":"false","rental_cost":"null","resolution":"null"}
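To narrow down where the nulls appear, one possible check is to count the nulls per column in finalDF before the write and again after reading the Parquet files back with Spark. The sketch below is only an illustration under assumptions: the object name NullCheck, the helper nullCounts, the three toy columns, the local SparkSession and the temporary output path are all hypothetical stand-ins for the real job.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}
import org.apache.spark.sql.types.StringType

object NullCheck {
  // Hypothetical helper: number of null values in each column of df.
  // count() skips nulls, and when() without otherwise() yields null
  // for non-matching rows, so only the null cells are counted.
  def nullCounts(df: DataFrame): Map[String, Long] = {
    val aggs = df.columns.map(c => count(when(col(c).isNull, 1)).as(c))
    val row = df.agg(aggs.head, aggs.tail: _*).head()
    df.columns.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for this illustration
    val spark = SparkSession.builder().master("local[*]").appName("null-check").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real data (assumed columns)
    val dataFrame = Seq(
      (true, "HD TV", "2021-11-29"),
      (false, "SD TV", "2021-11-29")
    ).toDF("HD_CONN", "STB_PROFILE", "event_date")

    // Same idea as in the question: cast to string, lower-case the names
    val finalDF = dataFrame.select(
      dataFrame.columns.map(c => col(c).cast(StringType).as(c.toLowerCase)): _*)
    finalDF.printSchema()
    val before = nullCounts(finalDF)

    // Assumption: a temporary local directory instead of the Hive output path
    val outputPath = Files.createTempDirectory("parquet-null-check").resolve("out").toString
    finalDF.write
      .partitionBy("event_date")
      .option("compression", "none")
      .mode("append")
      .parquet(outputPath)

    // Read the files back with Spark itself and compare the counts
    val after = nullCounts(spark.read.parquet(outputPath))
    before.keys.toSeq.sorted.foreach { c =>
      println(s"$c: nulls before write = ${before(c)}, after read = ${after.getOrElse(c, -1L)}")
    }

    spark.stop()
  }
}
```

If the counts match when the files are read back through Spark, the nulls are more likely introduced by whatever reads the files afterwards than by the Parquet write itself; if they differ, the write path is the place to look.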
I suspect that Parquet is applying some optimization that causes this issue.
Any help would be appreciated.
I am using HDP version 2.6.5 & Spark 2.3.