Partitioning By date column in Athena


I'm using AWS Athena to query an S3 bucket that has data partitioned by day only; the partitions look like day=yyyy/mm/dd. When I tried to use Glue to update the partitions every day, it created a new table for each day (since 2017, around 1,500 tables).

I tried to use partition projection like this:

PARTITIONED BY ( 
  day string)

TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'projection.day.format'='yyyy/mm/dd', 
  'projection.day.interval'='1', 
  'projection.day.interval.unit'='DAYS', 
  'projection.day.range'='2017/01/01,NOW', 
  'projection.day.type'='date', 
  'projection.enables'='true'

But the partitions are not updated without MSCK REPAIR. Any ideas? Am I missing something with the partition projection?


There are 3 answers

AswinRajaram

You don't need Glue or MSCK REPAIR TABLE if you are loading partitions using Partition Projection. Just run the CREATE TABLE script once from the query editor and that should be it. Note that when partitions are handled by Partition Projection, you won't see them in the Glue Data Catalog.
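For reference, a minimal sketch of such a CREATE TABLE statement, keeping the single day partition key from the question; the table name, columns, and bucket are hypothetical, 'projection.enabled' is spelled correctly, and the format uses capital MM (months in the Java date format). Whether the slash-separated format is accepted is worth verifying; the other answers cover alternatives.

CREATE EXTERNAL TABLE my_logs (
  id string,
  message string
)
PARTITIONED BY (
  day string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.day.type'='date',
  'projection.day.format'='yyyy/MM/dd',
  'projection.day.interval'='1',
  'projection.day.interval.unit'='DAYS',
  'projection.day.range'='2017/01/01,NOW',
  'storage.location.template'='s3://my-bucket/logs/day=${day}'
);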

Or maybe the script below can help you:

#Import libraries
import boto3
import datetime

#Clients for S3 and Athena
s3 = boto3.client('s3')
athena = boto3.client('athena')

#Get year, month, day for the partition (tomorrow's date; a timedelta handles month/year rollover)
date = datetime.datetime.now() + datetime.timedelta(days=1)
athena_year = str(date.year)
athena_month = str(date.month).rjust(2, '0')
athena_day = str(date.day).rjust(2, '0')

#Parameters for the S3 log location and the Athena table
#Fill these in carefully
s3_bucket = 'sqladmin-cloudtrail'
s3_prefix = 'AWSLogs/XXXXXXXXXXXX/CloudTrail/'
s3_input = 's3://' + s3_bucket + '/' + s3_prefix
s3_output = 's3://aws-athena-query-results-XXXXXXXXXXXXXX-us-east-1'
database = 'athena_log_database'
table_name = 'cloudtrail_logs_table'

#Execute an Athena query and return the response
def run_query(query, database, s3_output):
    query_response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    print('Execution ID: ' + query_response['QueryExecutionId'])
    return query_response

#Main handler: list the region prefixes and add tomorrow's partition for each region
def lambda_handler(event, context):
    result = s3.list_objects(Bucket=s3_bucket, Prefix=s3_prefix, Delimiter='/')
    for regions in result.get('CommonPrefixes'):
        get_region = regions.get('Prefix', '').replace(s3_prefix, '').replace('/', '')
        query = str("ALTER TABLE " + table_name + " ADD PARTITION (region='"
                    + get_region + "',year="
                    + athena_year + ",month="
                    + athena_month + ",day="
                    + athena_day
                    + ") location '" + s3_input
                    + get_region
                    + "/" + athena_year + "/" + athena_month + "/"
                    + athena_day + "';")
        #print(get_region)  # for debugging
        #print(query)  # for debugging
        run_query(query, database, s3_output)

You can run a Glue job with a similar script to create the partitions daily. Just change the ALTER TABLE part to match your table and it should be good to go.
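For the single day=yyyy/mm/dd layout from the question, the generated statement would be simpler; a sketch with a hypothetical table and bucket name:

ALTER TABLE my_logs ADD IF NOT EXISTS
  PARTITION (day='2021/05/01')
  LOCATION 's3://my-bucket/logs/day=2021/05/01/';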

Nicolas Busca

The issue is that you can't have / in the date format. Either replace the / with something else (like a -), or switch to integer columns for year, month, and day and use three partition keys.

Three partitions would look like:

'projection.year.type'='integer',
'projection.year.range'='2017,2025',
'projection.year.format'='day=${year}',
'projection.month.type'='integer',
'projection.month.range'='01,12',
'projection.day.type'='integer',
'projection.day.range'='01,31'
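A full table definition along those lines might look like the sketch below; the table name, columns, and bucket are hypothetical, and the day= prefix plus the slashes in the S3 path are mapped through 'storage.location.template':

CREATE EXTERNAL TABLE my_logs (
  id string,
  message string
)
PARTITIONED BY (
  year int,
  month int,
  day int
)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.year.type'='integer',
  'projection.year.range'='2017,2025',
  'projection.month.type'='integer',
  'projection.month.range'='1,12',
  'projection.month.digits'='2',
  'projection.day.type'='integer',
  'projection.day.range'='1,31',
  'projection.day.digits'='2',
  'storage.location.template'='s3://my-bucket/logs/day=${year}/${month}/${day}'
);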
Chaurasiya
PARTITIONED BY ( 
  day string)

TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'projection.day.format'='yyyy/mm/dd', 
  'projection.day.interval'='1', 
  'projection.day.interval.unit'='DAYS', 
  'projection.day.range'='2017/01/01,NOW', 
  'projection.day.type'='date', 
  'projection.enabled'='true'    **here you have written 'enables'; it should be 'enabled'**

After that, partition projection will work. :)
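With the corrected property, a query that filters on the projected key should resolve partitions without any MSCK REPAIR; a small sketch, assuming the single day string partition from the question and a hypothetical table name:

SELECT *
FROM my_logs
WHERE day = '2017/01/01'
LIMIT 10;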