I have created a DataFusion DataFrame:
+------------+------+----------+----------------+-----------------+
| asin       | vote | verified | unixReviewTime | reviewText      |
+------------+------+----------+----------------+-----------------+
| 0486427706 | 3    | true     | 1381017600     | good            |
| 0486427707 |      | false    | 1376006400     | excellent       |
| 0486427707 | 1    | true     | 1459814400     | Did not like it |
| 0486427708 | 4    | false    | 1376006400     |                 |
+------------+------+----------+----------------+-----------------+
I tried to work this out from the API documentation, but could not figure out how to:
- Convert the unixReviewTime column into a Rust-native timestamp
- Extract the year, month, and day from the newly created column into separate columns
Here is what the JSON data file looks like:
{"asin": "0486427706", "vote": 3, "verified": true, "unixReviewTime": 1381017600, "reviewText": "good", "overall": 5.0}
{"asin": "0486427707", "vote": null, "verified": false, "unixReviewTime": 1376006400, "reviewText": "excellent", "overall": 5.0}
{"asin": "0486427707", "vote": 1, "verified": true, "unixReviewTime": 1459814400, "reviewText": "Did not like it", "overall": 2.0}
{"asin": "0486427708", "vote": 4, "verified": false, "unixReviewTime": 1376006400, "reviewText": null, "overall": 4.0}
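To make the desired result concrete, here is the transformation sketched in plain Python against the sample values above (interpreting the epoch seconds as UTC is my assumption; Spark's from_unixtime uses the session time zone):

```python
# Sketch of the intended result: each unixReviewTime value becomes a
# timestamp, from which year/month/day are extracted as separate columns.
from datetime import datetime, timezone

unix_review_times = [1381017600, 1376006400, 1459814400, 1376006400]

for ts in unix_review_times:
    # Interpret the epoch seconds as UTC (assumption).
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(ts, dt.year, dt.month, dt.day)
```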
It is very easy to do in PySpark as follows:
from pyspark.sql import functions as fn
from pyspark.sql.functions import col
main_df = (
    main_df
    .withColumn(
        'reviewed_at',
        fn.from_unixtime(col('unixReviewTime'))
    )
)
main_df = main_df.withColumn("reviewed_year", fn.year(col("reviewed_at")))
main_df = main_df.withColumn("reviewed_month", fn.month(col("reviewed_at")))
main_df = main_df.withColumn("reviewed_day", fn.dayofmonth(col("reviewed_at")))
Produces: