i have a database with 10 tables. all 10 tables data is stored in different different locations. out of 10 tables, some are managed tables and some are external tables.

some tables location is /apps/hive/warehouse/

some tables location is /warehouse/hive/managed/

some tables location is /warehouse/hive/external/

is there any way to find out total size of the database with out go into each location and find the size, any alternative?

1 Answers

1
gomz On

The below query when run in the Hive Metastore DB would help you in getting the total size occupied by all the tables in Hive. Note: The results you get for this query would be 100% correct only if all the tables are having their stats updated. [This can be checked in the table - TABLE_PARAMS in Metastore DB that I have also mentioned below (How it works?.b)]

Steps:

1. Login into Hive Metastore DB and use the database that is used by hive. hive1 by default.

2. Once done, you can execute the below query to get the total size of all the tables in Hive in bytes. The query takes the sum of total size of all the Hive tables based on the statistics of the tables.

    MariaDB [hive1]> SELECT SUM(PARAM_VALUE) FROM TABLE_PARAMS WHERE PARAM_KEY="totalSize";
    +------------------+
    | SUM(PARAM_VALUE) |
    +------------------+
    |   30376289388684 |
    +------------------+
    1 row in set (0.00 sec)```

3. Remember, the result derived above is for only one replication. 30376289388684 x 3 is the actual size in HDFS including the replication.

How it works?

a. Selecting a random table in Hive with id 5783 and name - test12345 from the TBLS table in Hive Metastore DB.

    MariaDB [hive1]> SELECT * FROM TBLS WHERE TBL_ID=5783;
    +--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
    | TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME  | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
    +--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
    |   5783 |  1555060992 |     1 |                0 | hive  |         0 | 17249 | test12345 | MANAGED_TABLE | NULL               | NULL               |           NULL |
    +--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
    1 row in set (0.00 sec)

b. Checking the different parameters of the table in Hive Metastore table - TABLE_PARAMS for the same Hive table with id - 5783. The totalSize record indicates the total size occupied by this table in HDFS for one of its replica. The next point (c) which is the hdfs du -s can be compared to check this.

The param COLUMN_STATS_ACCURATE with the value true says the table's statistics property is set to true. You can check for tables with this value as false to see if there are any tables in Hive those might have missing statistics.

MariaDB [hive1]> SELECT * FROM TABLE_PARAMS
    -> WHERE TBL_ID=5783;
+--------+-----------------------+-------------+
| TBL_ID | PARAM_KEY             | PARAM_VALUE |
+--------+-----------------------+-------------+
|   5783 | COLUMN_STATS_ACCURATE | true        |
|   5783 | numFiles              | 1           |
|   5783 | numRows               | 1           |
|   5783 | rawDataSize           | 2           |
|   5783 | totalSize             | 324         |
|   5783 | transient_lastDdlTime | 1555061027  |
+--------+-----------------------+-------------+
6 rows in set (0.00 sec)

c. hdfs du -s output of the same table from HDFS. 324 and 972 are the sizes of one and three replicas of the table data in HDFS.

324  972  /user/hive/warehouse/test12345

Hope this helps!