How can I configure Hive Metastore or Spark to respect owner ACL privileges in Spark SQL?

I'm currently working with Apache Spark and a Hive Metastore. I created a table in the Hive Metastore as user A. However, when I run a DROP query from Spark as user B, the table is dropped even though I never granted any privileges to user B. This looks like a security hole. How can I configure the Hive Metastore or Spark so that the owner privileges defined in the Hive Metastore are enforced for access control?
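
A minimal reproduction of what I'm seeing (a hypothetical sketch; without Kerberos, Spark appears to take its user identity from the HADOOP_USER_NAME environment variable, falling back to the OS user, so I use that to switch identities):

import os
from pyspark.sql import SparkSession

# Assumption: with no Kerberos in play, Spark derives the user identity
# from HADOOP_USER_NAME (falling back to the OS user). It must be set
# before the JVM starts, i.e. before getOrCreate().
os.environ["HADOOP_USER_NAME"] = "B"

spark = (
    SparkSession.builder
    .appName("drop_as_other_user")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)

print(spark.sparkContext.sparkUser())  # prints "B"

# The orders table below is owned by user A, yet this succeeds as user B.
spark.sql("DROP TABLE default.orders")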

The relevant rows from the Hive Metastore backend database are given below.

Table metadata (TBLS)

TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|OWNER|OWNER_TYPE|RETENTION|SD_ID|TBL_NAME|TBL_TYPE     |VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|IS_REWRITE_ENABLED|
------+-----------+-----+----------------+-----+----------+---------+-----+--------+-------------+------------------+------------------+------------------+
     2| 1701160550|    1|               0|A    |USER      |        0|    2|orders  |MANAGED_TABLE|                  |                  |false             |

Table privileges (TBL_PRIVS)

TBL_GRANT_ID|CREATE_TIME|GRANT_OPTION|GRANTOR|GRANTOR_TYPE|PRINCIPAL_NAME|PRINCIPAL_TYPE|TBL_PRIV|TBL_ID|
------------+-----------+------------+-------+------------+--------------+--------------+--------+------+
           5| 1701160550|           1|A      |USER        |A             |USER          |select  |     2|
           6| 1701160550|           1|A      |USER        |A             |USER          |insert  |     2|
           7| 1701160550|           1|A      |USER        |A             |USER          |update  |     2|
           8| 1701160550|           1|A      |USER        |A             |USER          |delete  |     2|
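
For reference, these TBL_PRIVS rows look like the owner grants Hive creates automatically at table creation. To double-check them I query the backing Postgres database directly (a small sketch, assuming psycopg2 is installed; the connection details come from the javax.jdo.option.* properties in hive-site.xml below):

import psycopg2

# Connection details taken from javax.jdo.option.ConnectionURL etc.
# in hive-site.xml below.
conn = psycopg2.connect(
    host="db", port=5432, dbname="mydatabase",
    user="admin", password="admin",
)

with conn, conn.cursor() as cur:
    # The Hive schema on Postgres uses quoted, uppercase identifiers.
    cur.execute('''
        SELECT t."TBL_NAME", p."PRINCIPAL_NAME", p."TBL_PRIV"
        FROM "TBL_PRIVS" p
        JOIN "TBLS" t ON t."TBL_ID" = p."TBL_ID"
        ORDER BY t."TBL_NAME", p."TBL_PRIV"
    ''')
    for tbl, principal, priv in cur.fetchall():
        print(tbl, principal, priv)

conn.close()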

Spark code

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Build the Spark configuration
spark_conf = SparkConf().setAppName("spark_hms")

spark_conf.set("hive.metastore.uris", "thrift://localhost:9083")
spark_conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("spark.sql.catalogImplementation", "hive")
spark_conf.set("spark.hadoop.fs.s3a.access.key", "")  # value omitted here
spark_conf.set("spark.hadoop.fs.s3a.secret.key", "")  # value omitted here
spark_conf.set("spark.hadoop.fs.s3a.endpoint", "")    # value omitted here
spark_conf.set("spark.hadoop.fs.s3a.region", "us-east-1")
spark_conf.set("hive.security.authenticator.manager",
               "org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator")
spark_conf.set("hive.security.authorization.enabled", "true")
spark_conf.set("hive.security.authorization.manager",
               "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory")
spark_conf.set("spark.jars", "")  # value omitted here

# Create a SparkSession with Hive support
spark = SparkSession.builder.config(conf=spark_conf).enableHiveSupport().getOrCreate()
print(spark.sparkContext.getConf().getAll())

spark.sql("DROP TABLE default.orders_trino_R")

hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>hive.metastore.server.min-threads</name>
                <value>10</value>
        </property>
        <property>
                <name>hive.metastore.server.max-threads</name>
                <value>100</value>
        </property>
        <property>
                <name>hive.metastore.client.socket.timeout</name>
                <value>1800</value>
        </property>
        <property>
                <name>hive.metastore.pool.maximumPoolSize</name>
                <value>50</value>
        </property>
        <property>
                <name>metastore.thrift.uris</name>
                <value>thrift://localhost:9083</value>
                <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
        </property>
        <property>
                <name>metastore.task.threads.always</name>
                <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask</value>
        </property>
        <property>
                <name>metastore.expression.proxy</name>
                <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>org.postgresql.Driver</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:postgresql://db:5432/mydatabase</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>admin</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>admin</value>
        </property>
        <property>
                <name>hive.vectorized.execution.enabled</name>
                <value>false</value>
        </property>
        <property>
                <name>spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation</name>
                <value>true</value>
        </property>
        <property>
                <name>fs.defaultFS</name>
                <value>s3a://datalake//</value>
        </property>
        <property>
                <name>hive.metastore.warehouse.dir</name>
                <value>s3a://datalake/</value>
        </property>
        <property>
                <name>fs.s3a.impl</name>
                <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
        </property>
        <property>
                <name>fs.s3a.access.key</name>
                <value>pQ339uhR6wrublDY1Dqn</value>
        </property>
        <property>
                <name>fs.s3a.secret.key</name>
                <value>HrqHchDVytpcVdxNMS7KnQ5haJF4X0NsJIBk138o</value>
        </property>
        <property>
                <name>fs.s3a.endpoint</name>
                <value>min:443</value>
        </property>
        <property>
                <name>fs.s3a.path.style.access</name>
                <value>true</value>
        </property>
        <property>
                <name>hive.support.concurrency</name>
                <value>false</value>
        </property>
        <property>
                <name>hive.txn.timeout</name>
                <value>300</value>
        </property>
        <property>
                <name>iceberg.experimental.extended-statistics.enabled</name>
                <value>true</value>
        </property>
        <property>
                <name>hive.security.authorization.enabled</name>
                <value>true</value>
        </property>
</configuration>
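
For completeness: from what I've read, the hive.security.authorization.* settings above are client-side (HiveServer2) settings, and a non-Hive client like Spark may bypass them entirely, so enforcement might have to happen inside the metastore itself via a pre-event listener. I haven't verified this; a sketch of the metastore-side properties as I understand them, using the storage-based provider as one example of an authorization manager:

<property>
        <name>hive.metastore.pre.event.listeners</name>
        <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
</property>
<property>
        <name>hive.security.metastore.authenticator.manager</name>
        <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
</property>
<property>
        <name>hive.security.metastore.authorization.manager</name>
        <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>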

I'm interested in understanding how to configure Hive Metastore or Spark to respect owner ACL privileges in Spark SQL.
