Fusic Tech Blog

Fusion of Society, IT and Culture

I Put Together a CloudFormation Template for Creating AWS Glue
2021/10/07

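Hello. I'm Masatani from Technology Development Division 1. My Factorio fever has flared up again lately, and my time is melting away without end.

I have been looking into AWS Glue a lot for a project, repeatedly building environments and scrapping them.

To make that work more efficient, I set things up so that AWS Glue can be built with CloudFormation, and wrote it up in this article.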

What I want to achieve, in three lines

  • With CloudFormation
  • in a reasonably tidy way
  • build a Glue execution environment

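What CloudFormation will create

The goal is an environment in which, once a file to be processed is placed in S3, Glue runs some processing on it and writes the result to a separate file.

The following resources are needed.

  • AWS Glue (the script is sample code)
  • An S3 bucket that stores the scripts and data handled by Glue
  • EventBridge, to trigger Glue
  • CloudTrail, to trigger EventBridge
  • Any other IAM resources and bucket policies that are required

The finished template

Here is the completed template up front.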

AWSTemplateFormatVersion: 2010-09-09
Description: AWS Glue CloudFormation

Parameters:
  S3BucketName:
    Type: String
    Description: bucket name.

  WorkflowName:
    Type: String
    Description: workflow name.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      -
        Label:
          default: Configuration
        Parameters:
          - WorkflowName
          - S3BucketName

Resources:
  Trail:
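    # CloudTrail validates the bucket policy at creation time, so wait for it as well
    DependsOn:
      - S3Bucket
      - S3BucketPolicy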
    Type: AWS::CloudTrail::Trail
    Properties:
      TrailName: !Sub 's3-event-trail-${AWS::StackName}'
      IsLogging: True
      S3BucketName: !Ref S3BucketName
      S3KeyPrefix: cloudtrail
      EventSelectors:
        - DataResources:
            - Type: AWS::S3::Object
              Values:
              - !Sub 'arn:aws:s3:::${S3BucketName}/rawdata/'
          IncludeManagementEvents: False
          ReadWriteType: WriteOnly

  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True

  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:GetBucketAcl
            Resource: !Sub arn:aws:s3:::${S3BucketName}
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/cloudtrail/*
            Condition:
              StringEquals:
                s3:x-amz-acl: bucket-owner-full-control

  EventDrivenWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Name: !Ref WorkflowName
      Description: Glue workflow triggered by S3 PutObject Event

  CrawlerJobTrigger:
    DependsOn: CrawlerJob
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Sub '${WorkflowName}_pre_job_trigger'
      Description: Glue trigger which is listening on S3 PutObject events
      Type: EVENT
      Actions:
        - JobName: !Ref CrawlerJob
      WorkflowName: !Ref EventDrivenWorkflow

  EventBridgeRule:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub s3_file_upload_trigger_rule-${AWS::StackName}
      EventPattern:
        source:
          - aws.s3
        detail-type:
          - AWS API Call via CloudTrail
        detail:
          eventSource:
            - s3.amazonaws.com
          eventName:
            - PutObject
          requestParameters:
            bucketName:
              - !Ref S3BucketName
            key:
              - prefix: rawdata/sample/
      Targets:
        -
          Arn: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
          Id: CloudTrailTriggersWorkflow
          RoleArn: !GetAtt 'EventBridgeGlueExecutionRole.Arn'

  EventBridgeGlueExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub EventBridgeGlueExecutionRole-${AWS::StackName}
      Description: Has permissions to invoke the NotifyEvent API for an AWS Glue workflow.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /

  GlueNotifyEventPolicy:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub GlueNotifyEventPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - glue:notifyEvent
            Resource: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
      Roles:
        - !Ref EventBridgeGlueExecutionRole

  GlueServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub GlueServiceRole-${AWS::StackName}
      Description: Runs the AWS Glue job that has permission to download the script, read data from the source, and write data to the destination after conversion.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

  S3DataPolicy:
    DependsOn:
      - GlueServiceRole
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub S3DataPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - s3:GetObject
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/*
          -
            Effect: "Allow"
            Action:
              - s3:ListBucket
            Resource: !Sub arn:aws:s3:::${S3BucketName}
      Roles:
        - !Ref GlueServiceRole

  CrawlerJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Sub '${WorkflowName}_crawler_job'
      Description: Glue job that converts input data files into csv format.
      Role: !Ref GlueServiceRole
      GlueVersion: "2.0"
      Command:
        Name: glueetl
        PythonVersion: "3"
        ScriptLocation: !Sub 's3://${S3BucketName}/script/crawler.py'
      NumberOfWorkers: 2
      WorkerType: G.1X
      ExecutionProperty:
        MaxConcurrentRuns: 5
      DefaultArguments:
        --job-bookmark-option: job-bookmark-enable
        --job-language: python

A brief explanation of the resources

S3 bucket

  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True

  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:GetBucketAcl
            Resource: !Sub arn:aws:s3:::${S3BucketName}
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/cloudtrail/*
            Condition:
              StringEquals:
                s3:x-amz-acl: bucket-owner-full-control

This creates the single S3 bucket that stores all of the data.

A bucket policy is also set so that CloudTrail can write to the bucket.

CloudTrail

Resources:
  Trail:
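    # CloudTrail validates the bucket policy at creation time, so wait for it as well
    DependsOn:
      - S3Bucket
      - S3BucketPolicy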
    Type: AWS::CloudTrail::Trail
    Properties:
      TrailName: !Sub 's3-event-trail-${AWS::StackName}'
      IsLogging: True
      S3BucketName: !Ref S3BucketName
      S3KeyPrefix: cloudtrail
      EventSelectors:
        - DataResources:
            - Type: AWS::S3::Object
              Values:
              - !Sub 'arn:aws:s3:::${S3BucketName}/rawdata/'
          IncludeManagementEvents: False
          ReadWriteType: WriteOnly

This creates the CloudTrail trail that records S3 PutObject events.

Only events for files placed under the rawdata prefix are recorded.

EventBridge

  EventDrivenWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Name: !Ref WorkflowName
      Description: Glue workflow triggered by S3 PutObject Event

  CrawlerJobTrigger:
    DependsOn: CrawlerJob
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Sub '${WorkflowName}_pre_job_trigger'
      Description: Glue trigger which is listening on S3 PutObject events
      Type: EVENT
      Actions:
        - JobName: !Ref CrawlerJob
      WorkflowName: !Ref EventDrivenWorkflow

  EventBridgeRule:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub s3_file_upload_trigger_rule-${AWS::StackName}
      EventPattern:
        source:
          - aws.s3
        detail-type:
          - AWS API Call via CloudTrail
        detail:
          eventSource:
            - s3.amazonaws.com
          eventName:
            - PutObject
          requestParameters:
            bucketName:
              - !Ref S3BucketName
            key:
              - prefix: rawdata/sample/
      Targets:
        -
          Arn: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
          Id: CloudTrailTriggersWorkflow
          RoleArn: !GetAtt 'EventBridgeGlueExecutionRole.Arn'

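This creates the EventBridge rule that hooks into CloudTrail and triggers Glue, together with the Glue workflow and the EVENT trigger that the rule targets.

The prefix that fires the event can be specified separately from the prefix that CloudTrail records, so here only files under rawdata/sample/ fire the event.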
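To make the two prefixes easier to picture, here is a minimal Python sketch of the two-stage filtering. It is only a simplified model, not how CloudTrail or EventBridge is actually implemented: the function names `cloudtrail_records`, `rule_matches`, and `workflow_fires` are hypothetical, and the event dictionary contains only the fields that the rule's EventPattern looks at.

```python
# Simplified model (an assumption, not the real services) of the two filters:
# 1. CloudTrail only logs writes under the trail's prefix (rawdata/).
# 2. The EventBridge rule only matches keys under its own prefix (rawdata/sample/).

TRAIL_PREFIX = "rawdata/"
RULE_PREFIX = "rawdata/sample/"


def cloudtrail_records(key: str) -> bool:
    """Stage 1: would the trail's data event selector log this object key?"""
    return key.startswith(TRAIL_PREFIX)


def rule_matches(event: dict, bucket: str) -> bool:
    """Stage 2: does the event satisfy the fields listed in the EventPattern?"""
    detail = event.get("detail", {})
    params = detail.get("requestParameters", {})
    return (
        event.get("source") == "aws.s3"
        and event.get("detail-type") == "AWS API Call via CloudTrail"
        and detail.get("eventSource") == "s3.amazonaws.com"
        and detail.get("eventName") == "PutObject"
        and params.get("bucketName") == bucket
        and str(params.get("key", "")).startswith(RULE_PREFIX)
    )


def workflow_fires(bucket: str, key: str) -> bool:
    """Return True only when both stages let a PutObject for this key through."""
    if not cloudtrail_records(key):
        return False
    event = {
        "source": "aws.s3",
        "detail-type": "AWS API Call via CloudTrail",
        "detail": {
            "eventSource": "s3.amazonaws.com",
            "eventName": "PutObject",
            "requestParameters": {"bucketName": bucket, "key": key},
        },
    }
    return rule_matches(event, bucket)
```

Under this model, a file uploaded to rawdata/sample/ passes both stages, while a file under another rawdata/ subfolder is logged by CloudTrail but never reaches the Glue workflow.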

IAM

  EventBridgeGlueExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub EventBridgeGlueExecutionRole-${AWS::StackName}
      Description: Has permissions to invoke the NotifyEvent API for an AWS Glue workflow.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /

  GlueNotifyEventPolicy:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub GlueNotifyEventPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - glue:notifyEvent
            Resource: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
      Roles:
        - !Ref EventBridgeGlueExecutionRole

  GlueServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub GlueServiceRole-${AWS::StackName}
      Description: Runs the AWS Glue job that has permission to download the script, read data from the source, and write data to the destination after conversion.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

  S3DataPolicy:
    DependsOn:
      - GlueServiceRole
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub S3DataPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - s3:GetObject
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/*
          -
            Effect: "Allow"
            Action:
              - s3:ListBucket
            Resource: !Sub arn:aws:s3:::${S3BucketName}
      Roles:
        - !Ref GlueServiceRole

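These are the roles and policies that EventBridge and Glue need in order to run (CloudTrail's access is granted by the bucket policy shown earlier).

They simply grant the required permissions, so there is nothing in particular to explain.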

Glue Job

  CrawlerJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Sub '${WorkflowName}_crawler_job'
      Description: Glue job.
      Role: !Ref GlueServiceRole
      GlueVersion: "2.0"
      Command:
        Name: glueetl
        PythonVersion: "3"
        ScriptLocation: !Sub 's3://${S3BucketName}/script/crawler.py'
      NumberOfWorkers: 2
      WorkerType: G.1X
      ExecutionProperty:
        MaxConcurrentRuns: 5
      DefaultArguments:
        --job-bookmark-option: job-bookmark-enable
        --job-language: python

This creates the Glue job that actually runs.

The job is configured to use the script saved at script/crawler.py in the bucket.

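Wrapping up

By deploying this template, you can quickly build an environment that runs AWS Glue with an S3 PutObject as the trigger.

With this, the scrap-and-build cycle needed for all sorts of experiments has become easy to run.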
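As one possible way to deploy the template from Python, the sketch below builds the arguments for boto3's `create_stack` call. The helper names `build_create_stack_kwargs` and `deploy`, and any stack, bucket, or file names passed in, are hypothetical. Because the template sets explicit `RoleName` values, the stack most likely needs the `CAPABILITY_NAMED_IAM` capability, and the script presumably has to be uploaded to script/crawler.py before the job first runs.

```python
def build_create_stack_kwargs(stack_name, template_body, bucket_name, workflow_name):
    """Build the keyword arguments for CloudFormation's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        # The two parameters declared in the template's Parameters section
        "Parameters": [
            {"ParameterKey": "S3BucketName", "ParameterValue": bucket_name},
            {"ParameterKey": "WorkflowName", "ParameterValue": workflow_name},
        ],
        # Needed because the IAM roles in the template have custom names
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy(template_path, stack_name, bucket_name, workflow_name):
    """Read the template file and create the stack (requires AWS credentials)."""
    import boto3  # third-party; imported here so the helper above works without it

    with open(template_path, encoding="utf-8") as f:
        template_body = f.read()
    kwargs = build_create_stack_kwargs(
        stack_name, template_body, bucket_name, workflow_name
    )
    return boto3.client("cloudformation").create_stack(**kwargs)
```

Deleting the stack again is the "scrap" half of the cycle; note that CloudFormation will not delete a bucket that still contains objects, so the bucket would need to be emptied first.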

If you ever feel like giving Glue a try, please make use of it.

Until next time, enjoy your Glue life.

k-masatany

When swimming in the sea of the internet, I usually take the form of a penguin.