Data Processing: Athena

2019-09-18 Code

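AWS Data Query Tool: Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) directly with standard SQL. With a few actions in the AWS Management Console, you can point Athena at data stored in Amazon S3, start running ad hoc queries in standard SQL, and get results within seconds.
Amazon Athena can also run data analytics interactively with Apache Spark, with no need to plan, configure, or manage resources. When you run an Apache Spark application on Athena, you submit Spark code for processing and receive the results directly. The simplified notebook experience in the Amazon Athena console lets you develop Apache Spark applications in Python or through the Athena notebook API.
Both Athena SQL and Apache Spark on Amazon Athena are serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically and executes queries in parallel, so results come back quickly even with large datasets and complex queries.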

Backing up a partitioned table

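-- CTAS: copy the source table into a new partitioned table stored at the given S3 location.
-- The partition column (dt) must be the last column in the SELECT list.
CREATE TABLE xxx.test
WITH (
    external_location = 's3://xxx/xxx/test',
    partitioned_by = ARRAY['dt']
)
AS
SELECT * FROM xxx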
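
When the same backup has to be run for several tables, the CTAS statement can be generated from Python. The following is only a minimal sketch: `build_ctas_backup` and its parameters are hypothetical names introduced here, not part of any library, and the column names `product_type` and `amount` are placeholders. The explicit column list is an assumption added so that the partition columns always end up last in the SELECT list, which Athena expects for CTAS.

```python
def build_ctas_backup(source, target, location, columns, partition_cols):
    """Return a CTAS statement that copies `source` into a partitioned `target`.

    Hypothetical helper for illustration; it only builds a string and
    does not talk to Athena.
    """
    # Move the partition columns to the end of the SELECT list,
    # regardless of where they appear in `columns`.
    data_cols = [c for c in columns if c not in partition_cols]
    select_list = ", ".join(data_cols + list(partition_cols))
    partitions = ", ".join("'{}'".format(c) for c in partition_cols)
    return (
        "CREATE TABLE {target}\n"
        "WITH (external_location = '{location}',\n"
        "      partitioned_by = ARRAY[{partitions}])\n"
        "AS\n"
        "SELECT {select_list} FROM {source}"
    ).format(
        target=target,
        location=location,
        partitions=partitions,
        select_list=select_list,
        source=source,
    )


# Placeholder table, location, and column names.
sql = build_ctas_backup(
    source="xxx",
    target="xxx.test",
    location="s3://xxx/xxx/test",
    columns=["dt", "product_type", "amount"],
    partition_cols=["dt"],
)
print(sql)
```

The generated string can then be pasted into the Athena console or submitted through whichever client is already in use.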

Finding string values that end with a space

select distinct product_type from prod where substring(product_type, -1, 1) = ' '

Trimming leading and trailing spaces

trim(article_promo_main_catg, ' ')
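
The same check and cleanup can also be done in Python before the data is uploaded. This is a small sketch on made-up sample values; `ends_with_space` and `trim_spaces` are illustrative helper names, not library functions.

```python
def ends_with_space(value):
    # Same idea as substring(col, -1, 1) = ' ':
    # true when the last character is a space.
    return value.endswith(" ")


def trim_spaces(value):
    # Remove space characters (and only spaces) from both ends.
    return value.strip(" ")


# Made-up sample values for illustration.
samples = ["shoes ", "bags", " hats  "]

flagged = [v for v in samples if ends_with_space(v)]
cleaned = [trim_spaces(v) for v in samples]

print(flagged)
print(cleaned)
```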

Reformatting a date string from yyyymmdd to yyyy-mm-dd

concat(SUBSTRING(DATE,1,4),'-',SUBSTRING(DATE,5,2),'-',SUBSTRING(DATE,7,2)) as biz_date
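
If the conversion is done on the Python side instead, the standard library can parse and reformat the string, and it also rejects impossible dates, which the substring approach would pass through unchanged. A minimal sketch; `to_iso_date` is a hypothetical helper name.

```python
from datetime import datetime


def to_iso_date(yyyymmdd):
    # Parse the compact form, then print it with dashes.
    # strptime raises ValueError for impossible dates such as 20190231.
    return datetime.strptime(yyyymmdd, "%Y%m%d").strftime("%Y-%m-%d")


biz_date = to_iso_date("20190918")
print(biz_date)
```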

Uploading a DataFrame to an Athena table with Python

import datetime

import awswrangler as wr
import pandas as pd


def datetime_beijing(datetime_):
    # Shift a timestamp by +8 hours for log output (assumes the input is in UTC).
    return datetime_ + datetime.timedelta(hours=8)


def gen_part_parquet(df: pd.DataFrame, path: str, part: list, table: str, dtype: dict):
    # Write df to S3 as a partitioned Parquet dataset and register it as a table in database "xxx".
    # mode="overwrite_partitions" replaces only the partitions that appear in df.
    if len(df) > 0:
        print(datetime_beijing(datetime.datetime.now()), '###### generate {type} parquet start...'.format(type=table))
        wr.s3.to_parquet(
            df=df, path=path, dataset=True, mode="overwrite_partitions",
            partition_cols=part, sanitize_columns=True,
            database="xxx", table=table, dtype=dtype,
        )
        print(datetime_beijing(datetime.datetime.now()), '###### generate {type} parquet end...'.format(type=table))
    else:
        print(table, "is empty....")


def gen_s3_parquet(df: pd.DataFrame, path: str, table: str, dtype=None):
    # Write df to S3 as an unpartitioned Parquet dataset and register it as a table in database "xxx".
    # mode="overwrite" replaces everything already stored under path.
    if len(df) > 0:
        print(datetime_beijing(datetime.datetime.now()), '###### generate {type} start...'.format(type=table))
        wr.s3.to_parquet(
            df=df, path=path, dataset=True, mode="overwrite",
            sanitize_columns=True, database="xxx", table=table, dtype=dtype,
        )
        print(datetime_beijing(datetime.datetime.now()), '###### generate {type} end...'.format(type=table))
    else:
        print(table, "is empty....")
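
The `datetime_beijing` helper adds eight hours to whatever it is given, so the logged time is Beijing time only when the machine clock runs on UTC. One possible alternative, sketched here with the standard library only, asks directly for the current time in a fixed UTC+8 zone; `now_beijing` and `BEIJING` are hypothetical names, not part of the code above.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone for Beijing time (UTC+8, no daylight saving).
BEIJING = timezone(timedelta(hours=8), name="UTC+08:00")


def now_beijing():
    # Timezone-aware "now", independent of the server's local timezone.
    return datetime.now(BEIJING)


print(now_beijing(), "###### generate test parquet start...")
```

Because the result is timezone-aware, it behaves the same whether the job runs on a laptop set to local time or on a server set to UTC.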
