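What I want to do, in three lines
- Using CloudFormation
- in a reasonably tidy way
- build an execution environment for Glue
What we build with CloudFormation
This time we build an environment in which, when a file to be processed is placed in S3, Glue runs some processing on it and writes the result to another file.
The following resources are needed.
- AWS Glue (the script is sample code)
- An S3 bucket that stores the scripts and data Glue works with
- EventBridge, to fire Glue
- CloudTrail, to fire EventBridge
- Any other IAM resources and bucket policies that are required
The finished template
Here is the finished template up front.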
AWSTemplateFormatVersion: 2010-09-09
Description: AWS Glue CloudFormation
Parameters:
  S3BucketName:
    Type: String
    Description: bucket name.
  WorkflowName:
    Type: String
    Description: workflow name.
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      -
        Label:
          default: Configuration
        Parameters:
          - WorkflowName
          - S3BucketName
Resources:
  Trail:
    # CloudTrail validates the bucket policy when the trail is created,
    # so wait for the policy as well as the bucket.
    DependsOn:
      - S3Bucket
      - S3BucketPolicy
    Type: AWS::CloudTrail::Trail
    Properties:
      TrailName: !Sub 's3-event-trail-${AWS::StackName}'
      IsLogging: True
      S3BucketName: !Ref S3BucketName
      S3KeyPrefix: cloudtrail
      EventSelectors:
        - DataResources:
            - Type: AWS::S3::Object
              Values:
                - !Sub 'arn:aws:s3:::${S3BucketName}/rawdata/'
          IncludeManagementEvents: False
          ReadWriteType: WriteOnly
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      # Reference the bucket resource so the policy is created after the bucket.
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:GetBucketAcl
            Resource: !Sub arn:aws:s3:::${S3BucketName}
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/cloudtrail/*
            Condition:
              StringEquals:
                s3:x-amz-acl: bucket-owner-full-control
  EventDrivenWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Name: !Ref WorkflowName
      Description: Glue workflow triggered by S3 PutObject Event
  CrawlerJobTrigger:
    DependsOn: CrawlerJob
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Sub '${WorkflowName}_pre_job_trigger'
      Description: Glue trigger which is listening on S3 PutObject events
      Type: EVENT
      Actions:
        - JobName: !Ref CrawlerJob
      WorkflowName: !Ref EventDrivenWorkflow
  EventBridgeRule:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub s3_file_upload_trigger_rule-${AWS::StackName}
      EventPattern:
        source:
          - aws.s3
        detail-type:
          - AWS API Call via CloudTrail
        detail:
          eventSource:
            - s3.amazonaws.com
          eventName:
            - PutObject
          requestParameters:
            bucketName:
              - !Ref S3BucketName
            key:
              - prefix: rawdata
      Targets:
        -
          Arn: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
          Id: CloudTrailTriggersWorkflow
          RoleArn: !GetAtt 'EventBridgeGlueExecutionRole.Arn'
  EventBridgeGlueExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub EventBridgeGlueExecutionRole-${AWS::StackName}
      Description: Has permissions to invoke the NotifyEvent API for an AWS Glue workflow.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
  GlueNotifyEventPolicy:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub GlueNotifyEventPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - glue:notifyEvent
            Resource: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
      Roles:
        - !Ref EventBridgeGlueExecutionRole
  GlueServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub GlueServiceRole-${AWS::StackName}
      Description: Runs the AWS Glue job that has permission to download the script, read data from the source, and write data to the destination after conversion.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  S3DataPolicy:
    DependsOn:
      - GlueServiceRole
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub S3DataPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - s3:GetObject
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/*
          -
            Effect: "Allow"
            Action:
              - s3:ListBucket
            Resource: !Sub arn:aws:s3:::${S3BucketName}
      Roles:
        - !Ref GlueServiceRole
  CrawlerJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Sub '${WorkflowName}_crawler_job'
      Description: Glue job that converts input data files into csv format.
      Role: !Ref GlueServiceRole
      GlueVersion: "2.0"
      Command:
        Name: glueetl
        PythonVersion: "3"
        ScriptLocation: !Sub 's3://${S3BucketName}/script/crawler.py'
      NumberOfWorkers: 2
      WorkerType: G.1X
      ExecutionProperty:
        MaxConcurrentRuns: 5
      DefaultArguments:
        --job-bookmark-option: job-bookmark-enable
        --job-language: python
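
As a rough illustration of how the template might be deployed, here is a minimal sketch using boto3. The function names, the stack name, and the template path are all assumptions made for this example and are not part of the original setup. One point that does follow from the template itself: because the IAM roles are given explicit RoleName values, CloudFormation requires the CAPABILITY_NAMED_IAM capability when the stack is created.

```python
from pathlib import Path
from typing import Dict, List


def build_parameters(bucket_name: str, workflow_name: str) -> List[Dict[str, str]]:
    """Return the template's two parameters in the shape CloudFormation expects."""
    return [
        {"ParameterKey": "S3BucketName", "ParameterValue": bucket_name},
        {"ParameterKey": "WorkflowName", "ParameterValue": workflow_name},
    ]


def create_stack(stack_name: str, template_path: str,
                 bucket_name: str, workflow_name: str) -> str:
    """Create the stack and return its stack ID (needs AWS credentials)."""
    import boto3  # imported lazily so build_parameters works without boto3

    cloudformation = boto3.client("cloudformation")
    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=Path(template_path).read_text(encoding="utf-8"),
        Parameters=build_parameters(bucket_name, workflow_name),
        # The template names its IAM roles explicitly, so this capability is needed.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]


# Hypothetical values, shown only to illustrate the call:
# create_stack("glue-sandbox", "glue.yml", "my-glue-bucket", "my-workflow")
```

Keeping the parameter builder separate from the AWS call makes it easy to check the parameter names against the template without touching an account.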
A brief explanation of each resource
S3 bucket
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      # Reference the bucket resource so the policy is created after the bucket.
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:GetBucketAcl
            Resource: !Sub arn:aws:s3:::${S3BucketName}
          - Effect: Allow
            Principal:
              Service:
                - cloudtrail.amazonaws.com
            Action:
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/cloudtrail/*
            Condition:
              StringEquals:
                s3:x-amz-acl: bucket-owner-full-control
This creates the S3 bucket that stores all of the data.
A bucket policy is also set so that CloudTrail can access the bucket.
CloudTrail
Resources:
  Trail:
    # CloudTrail validates the bucket policy when the trail is created,
    # so wait for the policy as well as the bucket.
    DependsOn:
      - S3Bucket
      - S3BucketPolicy
    Type: AWS::CloudTrail::Trail
    Properties:
      TrailName: !Sub 's3-event-trail-${AWS::StackName}'
      IsLogging: True
      S3BucketName: !Ref S3BucketName
      S3KeyPrefix: cloudtrail
      EventSelectors:
        - DataResources:
            - Type: AWS::S3::Object
              Values:
                - !Sub 'arn:aws:s3:::${S3BucketName}/rawdata/'
          IncludeManagementEvents: False
          ReadWriteType: WriteOnly
This creates a CloudTrail trail that records S3 PutObject events.
Only events for files placed under the rawdata prefix are recorded.
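
To make the prefix behaviour concrete, the following is a simplified sketch of how a data event selector value narrows what gets logged. It only models the ARN prefix comparison; the function name and the bucket name my-bucket are assumptions for illustration, and the real CloudTrail service applies further conditions (such as ReadWriteType) that are not modelled here.

```python
from typing import List


def is_logged_by_trail(bucket: str, key: str, selector_values: List[str]) -> bool:
    """Return True if the object's ARN starts with one of the selector values."""
    object_arn = f"arn:aws:s3:::{bucket}/{key}"
    return any(object_arn.startswith(value) for value in selector_values)


# Same shape as the Values entry in the template, with a hypothetical bucket name.
selector_values = ["arn:aws:s3:::my-bucket/rawdata/"]

print(is_logged_by_trail("my-bucket", "rawdata/sample/a.csv", selector_values))  # True
print(is_logged_by_trail("my-bucket", "script/crawler.py", selector_values))     # False
```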
EventBridge
  EventDrivenWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Name: !Ref WorkflowName
      Description: Glue workflow triggered by S3 PutObject Event
  CrawlerJobTrigger:
    DependsOn: CrawlerJob
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Sub '${WorkflowName}_pre_job_trigger'
      Description: Glue trigger which is listening on S3 PutObject events
      Type: EVENT
      Actions:
        - JobName: !Ref CrawlerJob
      WorkflowName: !Ref EventDrivenWorkflow
  EventBridgeRule:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub s3_file_upload_trigger_rule-${AWS::StackName}
      EventPattern:
        source:
          - aws.s3
        detail-type:
          - AWS API Call via CloudTrail
        detail:
          eventSource:
            - s3.amazonaws.com
          eventName:
            - PutObject
          requestParameters:
            bucketName:
              - !Ref S3BucketName
            key:
              - prefix: rawdata/sample/
      Targets:
        -
          Arn: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
          Id: CloudTrailTriggersWorkflow
          RoleArn: !GetAtt 'EventBridgeGlueExecutionRole.Arn'
This creates the EventBridge rule that hooks into CloudTrail and fires Glue.
The prefix that fires the event can be specified separately from the prefix that CloudTrail records, so here the event fires only for files under rawdata/sample/.
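
The matching described here can be sketched in plain Python. This is a deliberately simplified matcher, not EventBridge's actual implementation: it handles only exact values and the prefix operator used in the rule above. The function names, the bucket name my-bucket, and the trimmed-down sample events are assumptions for illustration.

```python
from typing import Any, Dict


def _value_matches(candidate: Any, actual: Any) -> bool:
    """Match one pattern entry: either a {"prefix": ...} filter or an exact value."""
    if isinstance(candidate, dict) and "prefix" in candidate:
        return isinstance(actual, str) and actual.startswith(candidate["prefix"])
    return candidate == actual


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field in the pattern is satisfied by the event."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        actual = event[field]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_value_matches(candidate, actual) for candidate in expected):
            # A list means "any of these alternatives".
            return False
    return True


pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject"],
        "requestParameters": {
            "bucketName": ["my-bucket"],  # hypothetical bucket name
            "key": [{"prefix": "rawdata/sample/"}],
        },
    },
}


def sample_event(key: str) -> Dict[str, Any]:
    """Build a trimmed-down PutObject event for the given object key."""
    return {
        "source": "aws.s3",
        "detail-type": "AWS API Call via CloudTrail",
        "detail": {
            "eventSource": "s3.amazonaws.com",
            "eventName": "PutObject",
            "requestParameters": {"bucketName": "my-bucket", "key": key},
        },
    }


print(matches(pattern, sample_event("rawdata/sample/a.csv")))  # True
print(matches(pattern, sample_event("rawdata/other/a.csv")))   # False
```

The second event would still be recorded by the trail, since it sits under rawdata/, but it does not satisfy the rule's narrower prefix and so would not start the workflow.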
IAM
  EventBridgeGlueExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub EventBridgeGlueExecutionRole-${AWS::StackName}
      Description: Has permissions to invoke the NotifyEvent API for an AWS Glue workflow.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
  GlueNotifyEventPolicy:
    DependsOn:
      - EventBridgeGlueExecutionRole
      - EventDrivenWorkflow
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub GlueNotifyEventPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - glue:notifyEvent
            Resource: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:workflow/${EventDrivenWorkflow}
      Roles:
        - !Ref EventBridgeGlueExecutionRole
  GlueServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub GlueServiceRole-${AWS::StackName}
      Description: Runs the AWS Glue job that has permission to download the script, read data from the source, and write data to the destination after conversion.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  S3DataPolicy:
    DependsOn:
      - GlueServiceRole
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub S3DataPolicy-${AWS::StackName}
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Action:
              - s3:GetObject
              - s3:PutObject
            Resource: !Sub arn:aws:s3:::${S3BucketName}/*
          -
            Effect: "Allow"
            Action:
              - s3:ListBucket
            Resource: !Sub arn:aws:s3:::${S3BucketName}
      Roles:
        - !Ref GlueServiceRole
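These are the roles and policies that EventBridge and Glue need in order to run (CloudTrail's access is granted by the bucket policy shown earlier).
They simply grant the required permissions, so there is nothing in particular to explain.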
Glue Job
  CrawlerJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Sub '${WorkflowName}_crawler_job'
      Description: Glue job.
      Role: !Ref GlueServiceRole
      GlueVersion: "2.0"
      Command:
        Name: glueetl
        PythonVersion: "3"
        ScriptLocation: !Sub 's3://${S3BucketName}/script/crawler.py'
      NumberOfWorkers: 2
      WorkerType: G.1X
      ExecutionProperty:
        MaxConcurrentRuns: 5
      DefaultArguments:
        --job-bookmark-option: job-bookmark-enable
        --job-language: python
This creates the Glue job that does the actual work.
It is configured to use the script saved at /script/crawler.py in the bucket.
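
Since the bucket is created by the same stack, the script has to be uploaded once the stack exists. A minimal sketch of that step with boto3 might look like the following; the function names, the local file path, and the bucket name are assumptions for illustration, and only the script/crawler.py key comes from the template.

```python
SCRIPT_KEY = "script/crawler.py"  # key the job's ScriptLocation points at


def script_location(bucket_name: str, script_key: str = SCRIPT_KEY) -> str:
    """Return the S3 URI that the Glue job reads its script from."""
    return f"s3://{bucket_name}/{script_key}"


def upload_script(local_path: str, bucket_name: str, script_key: str = SCRIPT_KEY) -> str:
    """Upload the local script to the bucket (needs AWS credentials)."""
    import boto3  # imported lazily so script_location works without boto3

    boto3.client("s3").upload_file(local_path, bucket_name, script_key)
    return script_location(bucket_name, script_key)


# Hypothetical values, shown only to illustrate the call:
# upload_script("./crawler.py", "my-glue-bucket")
print(script_location("my-glue-bucket"))  # s3://my-glue-bucket/script/crawler.py
```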
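In closing
Running this template quickly builds an environment that runs AWS Glue with a PutObject to S3 as the trigger.
This makes it easy to scrap and build the environment for all kinds of experiments.
If you are thinking of giving Glue a try, please make use of it.
Here's to a good Glue life.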