How to replace emoticons with an empty string in a Scala DataFrame?


Hello stackoverflowers,

Could you please help me work out how to remove emoticons from a string column in a Scala DataFrame?

import spark.implicits._
val df = Seq(
  (8, "bat★  ⛱ ✨‍♂️⛷❤️"),
  (64, "bb")
).toDF("number", "word")

df.show(false)
+------+-----------------------+
|number|word                   |
+------+-----------------------+
|8     |bat★  ⛱ ✨‍♂️⛷❤️|
|64    |bb                     |
+------+-----------------------+

df.select($"word", regexp_replace($"word", "[^\u0000-\uFFFF]", "").alias("word_revised")).show(false)
+-----------------------+---------------+
|word                   |word_revised   |
+-----------------------+---------------+
|bat★  ⛱ ✨‍♂️⛷❤️|bat★  ⛱ ✨‍♂️⛷❤️|
|bb                     |bb             |
+-----------------------+---------------+

The expected result is

+-----------------------+---------------+
|word                   |word_revised   |
+-----------------------+---------------+
|bat★  ⛱ ✨‍♂️⛷❤️|bat|
|bb                     |bb             |
+-----------------------+---------------+

Thank you so much for your help, @fonkap. I am sorry for joining the thread so late; I had to take on another sprint story during the past month. The approach you posted almost works for the emoticons, but the source data from our upstream also contains some unusual symbols. Do you have any suggestion on how to remove them as well?

scala> val df = Seq(
     |   (8, "♥♥♥♥♥☆ Condo֎۩ᴥ★Ąrt Ħouse Ŀocation")
     | ).toDF("airPlaneId", "airPlaneName")
df: org.apache.spark.sql.DataFrame = [airPlaneId: int, airPlaneName: string]

scala> df.select($"airPlaneId", $"airPlaneName", regexp_replace($"airPlaneName", "[^\u0000-\u20CF]", "").alias("airPlaneName_revised")).show(false)
+----------+-----------------------------------+----------------------------+
|airPlaneId|airPlaneName                       |airPlaneName_revised        |
+----------+-----------------------------------+----------------------------+
|8         |♥♥♥♥♥☆ Condo֎۞۩ᴥ★Ąrt Ħouse Ŀocation| Condo֎۞۩ᴥĄrt Ħouse Ŀocation|
+----------+-----------------------------------+----------------------------+

It looks like some symbols unexpectedly remain in airPlaneName_revised: the ones between "Condo" and "Ąrt" in the output above.

Thank you for sharing, @mck. The proposed approach works, but it also causes an unwanted replacement:

scala> df.selectExpr(
     |     "airPlaneId",
     |       "airPlaneName",
     |     "replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '?') airPlaneName_revised"
     | ).show(false)
+----------+------------+--------------------+
|airPlaneId|airPlaneName|airPlaneName_revised|
+----------+------------+--------------------+
|8         |la Cité     |la Cit?             |
|9         |Aéroport    |A?roport            |
|10        |München     |M?nchen             |
|11        |la Tête     |la T?te             |
|12        |Sarrià      |Sarri?              |
+----------+------------+--------------------+

Is there an improved approach that leaves valid accented letters (such as é, ü, and à) untouched and only removes emoji and symbols?

There are 2 answers

fonkap On

regexp_replace is working correctly. It is just that some of the "characters" you wrote actually fall inside the \u0000-\uFFFF interval, so the pattern does not match them.

Proof:

import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object Emoticon {
  def main(args: Array[String]): Unit = {
    val str = "bat★  ⛱ ✨‍♂️⛷❤️"

    // Write the raw string first, then one line per code point.
    val bw = Files.newBufferedWriter(new File("emoji.txt").toPath, StandardCharsets.UTF_8)
    bw.write(str)
    bw.newLine()

    val cps = str.codePoints().toArray
    cps.foreach { cp =>
      // Hexadecimal value of the code point, e.g. 0x002605
      bw.write(String.format(" 0x%06x", cp.asInstanceOf[Object]))
      bw.write(" - ")
      // The code point rendered as a character
      bw.write(new java.lang.StringBuilder().appendCodePoint(cp).toString)
      bw.newLine()
    }
    bw.close()
  }
}

Open emoji.txt with your browser and you'll see:

(image: characters and their hexadecimal representations)

(It is worth noting that some of the visible characters are combinations of several code points.)

The "filtered" string looks like:

(image: the filtered string)

So, everything looks right!
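
The same check can also be done in memory, without writing a file. The following is a minimal sketch in plain Scala (no Spark needed); the object name CodePointCheck and the helper describe are made up for illustration. It lists each code point of a string and flags whether it lies in the \u0000-\uFFFF range (the Basic Multilingual Plane), which is exactly the range the original pattern refuses to match.

```scala
object CodePointCheck {
  // Hypothetical helper: returns (hex code point, isInBmp) for every code point in s.
  def describe(s: String): Seq[(String, Boolean)] =
    s.codePoints().toArray.toSeq.map { cp =>
      (f"U+$cp%04X", Character.isBmpCodePoint(cp))
    }

  def main(args: Array[String]): Unit = {
    // Escapes are used so that invisible code points are easy to spot in the source:
    // star, heart + variation selector, grinning face (U+1F600, a surrogate pair).
    val sample = "\u2605\u2764\uFE0F\uD83D\uDE00"
    describe(sample).foreach { case (hex, inBmp) =>
      println(s"$hex in BMP: $inBmp")
    }
  }
}
```

Only code points flagged false (such as U+1F600) are matched by [^\u0000-\uFFFF]; symbols like ★ (U+2605) are inside the range and therefore survive.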

Finally, to answer your question: you may want to use a narrower character interval, for example [^\u0000-\u20CF], which gives the expected result.

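import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

object Emoticon2 {
  def main(args: Array[String]): Unit = {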
    val spark = SparkSession.builder.master("local[2]").appName("Simple Application").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (8, "bat★  ⛱ ✨‍♂️⛷❤️"),
      (64, "bb")
    ).toDF("number", "word")

    df.show(false)

    df.select($"word", regexp_replace($"word", "[^\u0000-\u20CF]", "").alias("word_revised")).show(false)
  }
}

It will output:

+-----------------------+------------+
|word                   |word_revised|
+-----------------------+------------+
|bat★  ⛱ ✨‍♂️⛷❤️|bat   ‍     |
|bb                     |bb          |
+-----------------------+------------+
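
In the output above, the zero-width joiner (U+200D) that glued the emoji sequence together is still present, because it lies inside the \u0000-\u20CF range. If that matters, one option is to add the zero-width characters to the pattern. The sketch below uses plain String.replaceAll so that it runs without Spark; the object and method names are hypothetical, and the choice of U+200B-U+200D as the set of zero-width characters to strip is an assumption.

```scala
object StripEmoji {
  // Matches anything above U+20CF, plus the zero-width space, non-joiner,
  // and joiner (U+200B-U+200D), which the first range alone would keep.
  val EmojiPattern: String = "[^\\u0000-\\u20CF]|[\\u200B-\\u200D]"

  def strip(s: String): String = s.replaceAll(EmojiPattern, "")

  def main(args: Array[String]): Unit = {
    // "bat", star, zero-width joiner, male sign, variation selector
    val input = "bat\u2605\u200D\u2642\uFE0F"
    println(strip(input))
  }
}
```

Spark evaluates regexp_replace patterns as Java regular expressions, so the same pattern string should be usable as the second argument of regexp_replace.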

For the ranges of the individual Unicode blocks, take a look at: https://jrgraphix.net/research/unicode_blocks.php

mck On

You can remove all non-ASCII characters as below:

val df = Seq(
  (8, "bat★  ⛱ ✨‍♂️⛷❤️"),
  (64, "bb")
).toDF("number", "word")

val df2 = df.selectExpr(
    "number",
    "replace(decode(encode(word, 'ascii'), 'ascii'), '?', '') word_revised"
)

df2.show(false)
+------+------------+
|number|word_revised|
+------+------------+
|8     |bat         |
|64    |bb          |
+------+------------+
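
Why this works, and where it can surprise: encoding to ASCII turns every character that ASCII cannot represent into ?, and the replace call then removes every ?. The sketch below reproduces that round trip in plain Scala with String.getBytes, on the assumption that Spark's encode/decode behave the same way for 'ascii' (the outputs shown in this thread are consistent with that). The object and method names are made up for illustration.

```scala
import java.nio.charset.StandardCharsets.US_ASCII

object AsciiRoundTrip {
  // Characters that US-ASCII cannot represent are replaced by '?' when encoding.
  def roundTrip(s: String): String = new String(s.getBytes(US_ASCII), US_ASCII)

  // Mirrors replace(decode(encode(col, 'ascii'), 'ascii'), '?', '').
  def asciiOnly(s: String): String = roundTrip(s).replace("?", "")

  def main(args: Array[String]): Unit = {
    println(roundTrip("la Cit\u00E9")) // the accented letter becomes '?'
    println(asciiOnly("la Cit\u00E9")) // ... and is then dropped
    println(asciiOnly("Why?"))         // a genuine question mark is dropped too
  }
}
```

Two side effects follow: accented Latin letters are removed along with the symbols, and any question mark that was part of the original text disappears as well.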

It also handles the second example from the question:

val df = Seq((8, "♥♥♥♥♥☆ Condo֎۩ᴥ★")).toDF("airPlaneId", "airPlaneName")

val df2 = df.selectExpr(
    "airPlaneId",
    "replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '') airPlaneName_revised"
)

df2.show(false)
+----------+--------------------+
|airPlaneId|airPlaneName_revised|
+----------+--------------------+
|8         | Condo              |
+----------+--------------------+
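
For the follow-up in the question (keep letters such as é, ü, or Ą, but drop symbols), one possibility is to go back to the regexp_replace idea with a range that ends after the Latin blocks. The sketch below is not taken from either answer: it assumes the data only needs Basic Latin through Latin Extended-B (U+0000-U+024F), and the names are hypothetical. It uses String.replaceAll so that it can run without Spark.

```scala
object KeepLatin {
  // Removes every code point above U+024F (the end of Latin Extended-B).
  val NonLatin: String = "[^\\u0000-\\u024F]"

  def clean(s: String): String = s.replaceAll(NonLatin, "")

  def main(args: Array[String]): Unit = {
    // Hearts and a star around text that contains Latin Extended-A letters
    println(clean("\u2665\u2665 Condo\u2605\u0104rt \u0126ouse"))
    // Latin-1 letters such as u-umlaut are kept
    println(clean("M\u00FCnchen"))
  }
}
```

With Spark, the same pattern would go into regexp_replace($"airPlaneName", "[^\\u0000-\\u024F]", ""). Text in other scripts (Greek, Cyrillic, CJK, and so on) would also be removed, so the upper bound has to match the data.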