How to replace emoticons with an empty string in a Scala DataFrame?

Question

499 views Asked by Tom Tang At 11 November 2020 at 10:00

Hello stackoverflowers,

Could you please help me work out how to remove emoticons from a string column in a Scala (Spark) DataFrame?

import org.apache.spark.sql.functions._  // regexp_replace
import spark.implicits._
val df = Seq(
  (8, "bat★  ⛱ ✨‍♂️⛷❤️"),
  (64, "bb")
).toDF("number", "word")

df.show(false)
+------+-----------------------+
|number|word                   |
+------+-----------------------+
|8     |bat★  ⛱ ✨‍♂️⛷❤️|
|64    |bb                     |
+------+-----------------------+

df.select($"word", regexp_replace($"word", "[^\u0000-\uFFFF]", "").alias("word_revised")).show(false)
+-----------------------+---------------+
|word                   |word_revised   |
+-----------------------+---------------+
|bat★  ⛱ ✨‍♂️⛷❤️|bat★  ⛱ ✨‍♂️⛷❤️|
|bb                     |bb             |
+-----------------------+---------------+

The expected result is

+-----------------------+---------------+
|word                   |word_revised   |
+-----------------------+---------------+
|bat★  ⛱ ✨‍♂️⛷❤️|bat|
|bb                     |bb             |
+-----------------------+---------------+

Thank you so much for your help, @fonkap. I am sorry to come back to this thread so late; I had to take on another sprint story over the past month. The approach you posted works well for most emoticons, but my upstream source data also contains some unusual symbols. Do you have any suggestions for removing those as well?

scala> val df = Seq(
     |   (8, "♥♥♥♥♥☆ Condo֎۩ᴥ★Ąrt Ħouse Ŀocation")
     | ).toDF("airPlaneId", "airPlaneName")
df: org.apache.spark.sql.DataFrame = [airPlaneId: int, airPlaneName: string]

scala> df.select($"airPlaneId", $"airPlaneName", regexp_replace($"airPlaneName", "[^\u0000-\u20CF]", "").alias("airPlaneName_revised")).show(false)
+----------+-----------------------------------+----------------------------+
|airPlaneId|airPlaneName                       |airPlaneName_revised        |
+----------+-----------------------------------+----------------------------+
|8         |♥♥♥♥♥☆ Condo֎۞۩ᴥ★Ąrt Ħouse Ŀocation| Condo֎۞۩ᴥĄrt Ħouse Ŀocation|
+----------+-----------------------------------+----------------------------+

Some symbols unexpectedly remain in the result (֎۞۩ᴥ in the output above).

Thank you for sharing, @mck. The proposed approach works, but it also causes an unwanted replacement:

scala> df.selectExpr(
     |     "airPlaneId",
     |       "airPlaneName",
     |     "replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '?') airPlaneName_revised"
     | ).show(false)
+----------+------------+--------------------+
|airPlaneId|airPlaneName|airPlaneName_revised|
+----------+------------+--------------------+
|8         |la Cité     |la Cit?             |
|9         |Aéroport    |A?roport            |
|10        |München     |M?nchen             |
|11        |la Tête     |la T?te             |
|12        |Sarrià      |Sarri?              |
+----------+------------+--------------------+

Is there an improved approach that leaves valid letters such as these accented characters untouched and only removes emoji and symbols?

Original Q&A

There are 2 answers

**fonkap** · Answer 1 · 2020-11-14T12:58:07+00:00

regexp_replace is working correctly. Some of the "characters" you wrote do fall inside the \u0000-\uFFFF interval, so the pattern [^\u0000-\uFFFF] does not match them.

Proof:

import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Emoticon {
  def main(args: Array[String]): Unit = {
    val str = "bat★  ⛱ ✨‍♂️⛷❤️"

    val bw = Files.newBufferedWriter(new File("emoji.txt").toPath, StandardCharsets.UTF_8)
    bw.write(str)
    bw.newLine()

    // Write one line per Unicode code point: its hex value, then the character itself
    val cps = str.codePoints().toArray
    cps.foreach(cp => {
      bw.write(String.format(" 0x%06x", cp.asInstanceOf[Object]))
      bw.write(" - ")
      bw.write(new java.lang.StringBuilder().appendCodePoint(cp).toString)
      bw.newLine()
    })
    bw.close()
  }
}

Open emoji.txt with your browser and you'll see:

(It is worth noting that some characters are combinations)

The "filtered" string looks like:

So, everything looks right!

Finally, to answer your question: you may want to use a narrower character interval, for example [^\u0000-\u20CF], and you will get the expected result.

object Emoticon2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("Simple Application").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (8, "bat★  ⛱ ✨‍♂️⛷❤️"),
      (64, "bb")
    ).toDF("number", "word")

    df.show(false)

    df.select($"word", regexp_replace($"word", "[^\u0000-\u20CF]", "").alias("word_revised")).show(false)
  }
}

will output:

+-----------------------+------------+
|word                   |word_revised|
+-----------------------+------------+
|bat★  ⛱ ✨‍♂️⛷❤️|bat   ‍     |
|bb                     |bb          |
+-----------------------+------------+

To pick a suitable interval, take a look at the list of Unicode blocks: https://jrgraphix.net/research/unicode_blocks.php
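
A range-based pattern keeps anything at or below its upper bound, whatever the character looks like. The following is a minimal plain-Scala sketch (no Spark) that lists which code points survive the [^\u0000-\u20CF] filter. RangeCheck and survivors are hypothetical names used only for illustration, and the sketch relies on String.replaceAll, which uses the same java.util.regex engine that, as far as I know, backs regexp_replace.

```scala
object RangeCheck {
  // Remove every code point above U+20CF and return the code points that are left.
  // \x{...} is the java.util.regex syntax for a code point given in hex.
  def survivors(s: String): Seq[Int] =
    s.replaceAll("[^\\x{0000}-\\x{20CF}]", "").codePoints().toArray.toSeq

  def main(args: Array[String]): Unit = {
    // 'a', U+2605 (black star), U+200D (zero-width joiner),
    // U+058E (an Armenian symbol), U+1F600 (an emoji outside the BMP)
    val sample = "a\u2605\u200D\u058E\uD83D\uDE00"

    // Only U+0061, U+200D and U+058E are at or below U+20CF, so only they are kept
    survivors(sample).foreach { cp =>
      println(f"U+$cp%04X ${Character.getName(cp)}")
    }
  }
}
```

This would explain why the filtered string above still appears to contain an invisible zero-width joiner after "bat", and why symbols from lower blocks (Armenian, Arabic, phonetic extensions) are untouched by this pattern.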

**mck** · Answer 2 · 2020-12-25T10:40:38+00:00

You can remove all non-ASCII characters as below:

val df = Seq(
  (8, "bat★  ⛱ ✨‍♂️⛷❤️"),
  (64, "bb")
).toDF("number", "word")

val df2 = df.selectExpr(
    "number",
    "replace(decode(encode(word, 'ascii'), 'ascii'), '?', '') word_revised"
)

df2.show(false)
+------+------------+
|number|word_revised|
+------+------------+
|8     |bat         |
|64    |bb          |
+------+------------+
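
To see what the expression does, and what else it removes, here is a plain-JVM sketch of the same round trip. It assumes that Spark's encode and decode behave like the JVM's charset conversion for 'ascii'; AsciiRoundTrip, roundTrip and stripNonAscii are hypothetical names.

```scala
import java.nio.charset.StandardCharsets

object AsciiRoundTrip {
  // Encode to US-ASCII and decode again; characters that ASCII cannot
  // represent come back as the replacement character '?'
  def roundTrip(s: String): String =
    new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII)

  // Same idea as the SQL expression: drop every '?' after the round trip
  def stripNonAscii(s: String): String = roundTrip(s).replace("?", "")

  def main(args: Array[String]): Unit = {
    // "bat" + black star, "la Cité" (with U+00E9), and a string with a real '?'
    Seq("bat\u2605", "la Cit\u00E9", "really?").foreach { s =>
      println(s"[$s] -> [${roundTrip(s)}] -> [${stripNonAscii(s)}]")
    }
  }
}
```

Two side effects follow from this: accented letters are not ASCII, so they are removed along with the emoji, and any question mark that was genuinely part of the text is removed as well.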

val df = Seq((8, "♥♥♥♥♥☆ Condo֎۩ᴥ★")).toDF("airPlaneId", "airPlaneName")

val df2 = df.selectExpr(
    "airPlaneId",
    "replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '') airPlaneName_revised"
)

df2.show(false)
+----------+--------------------+
|airPlaneId|airPlaneName_revised|
+----------+--------------------+
|8         | Condo              |
+----------+--------------------+
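
For the follow-up in the question (accented letters such as é and ü turning into ?), one possible alternative is to match by Unicode general category rather than by ASCII or by code point range. The sketch below is an assumption on my side, not something taken from the answers above: it removes code points in category So ("Symbol, other", which covers most emoji and dingbats) plus the zero-width joiner U+200D and the variation selector U+FE0F, and leaves letters alone. StripSymbols and stripSymbols are hypothetical names; the pattern itself uses standard java.util.regex syntax.

```scala
object StripSymbols {
  // Assumed pattern: "Symbol, other" code points, the zero-width joiner,
  // and variation selector-16 (used to request emoji presentation)
  val SymbolPattern: String = "[\\p{So}\\x{200D}\\x{FE0F}]"

  def stripSymbols(s: String): String = s.replaceAll(SymbolPattern, "")

  def main(args: Array[String]): Unit = {
    Seq(
      "la Cit\u00E9",              // accented letter, should be kept
      "M\u00FCnchen",              // accented letter, should be kept
      "bat\u2605 \u2764\uFE0F",    // black star, heavy heart + variation selector
      "\uD83D\uDE00ok"             // emoji outside the BMP followed by plain text
    ).foreach { s =>
      println(s"[$s] -> [${stripSymbols(s)}]")
    }
  }
}
```

Since regexp_replace accepts Java regular expressions, the same pattern string should be usable as regexp_replace($"airPlaneName", "[\\p{So}\\x{200D}\\x{FE0F}]", ""), though I have not run it against the data in the question. A few caveats: characters that Unicode classifies as letters are kept (ᴥ, I believe, is LATIN LETTER AIN), everyday symbols in category So such as © and ° are removed, and which emoji are recognised depends on the Unicode version supported by the JVM.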
