I've tried setting this up to compare it against an existing non-Glue-based solution, and am running into a couple of issues I'm hoping someone can help with.
Disclaimer: I have little previous experience with Glue Crawlers or Glue ETL - it's possible I'm making a simple (or more than one) mistake.
I have been able to reproduce the behavior by resetting the environment (destroy terraform, empty buckets, delete glue database/tables) and following the process below:
Created 4 buckets:
- xxxx-cloudtrail-glue-testing, for cloudtrail_s3_bucket
- xxxx-cloudtrail-glue-testing-scripts, for etl_script_s3_bucket
- xxxx-cloudtrail-glue-testing-output, for parquet output
- xxxx-cloudtrail-glue-testing-tmp, for parquet_s3_bucket
Configured and applied terraform module with terraform 0.12.29 / aws provider 3.28.0, no issues observed, crawlers/etl/workflow created as expected.
Enabled an organization cloudtrail for multiple accounts for 45 minutes, shipping management cloudtrail events to the xxxx-cloudtrail-glue-testing bucket with no additional prefix configured (I had a whole different set of issues when I tried a POC with s3 data events and using prefixes), to make sure I was working with a fixed number of events sent with a default configuration.
Ran the workflow from the Glue console, which reported that the workflow completed successfully.
2 Glue tables were created: cloudtrail_parquet and raw_xxxx_cloudtrail_glue_testing
Attempted to validate the results by comparing the number of raw events with the number of events in the parquet table:
- select count(*) from cloudtrail_parquet returns 186949
- select count(*) from raw_tmg_cloudtrail_glue_testing is unqueryable and returns the following error:
HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: : expected at the position 2264 of 'struct<securityGroupSet:string,securityGroupIdSet:struct<items:array<struct<groupId:string>>>,filterSet:struct<items:array<struct<name:string,valueSet:struct<items:array<struct<value:string>>>>>>,lookupAttributes:array<struct<attributeKey:string,attributeValue:string>>,startTime:string,endTime:string,maxResults:string,roleArn:string,roleSessionName:string,includeShared:boolean,includePublic:boolean,subnetSet:struct<items:array<struct<subnetId:string>>>,publicIpsSet:string,allocationIdsSet:struct<items:array<struct<allocationId:string>>>,customerGatewaySet:string,networkAclIdSet:string,networkInterfaceIdSet:struct<items:array<struct<networkInterfaceId:string>>>,volumeSet:struct<items:array<struct<volumeId:string>>>,DescribeHostsRequest:struct<MaxResults:int>,vpcSet:struct<items:array<struct<vpcId:string>>>,instancesSet:struct<items:array<struct<instanceId:string,imageId:string,minCount:int,maxCount:int,keyName:string>>>,vpnGatewaySet:string,internetGatewayIdSet:string,vpnConnectionSet:string,routeTableIdSet:string,Host:string,DescribeVpcEndpointsRequest:string,DescribeNatGatewaysRequest:string,DescribeEgressOnlyInternetGatewaysRequest:string,vpcPeeringConnectionIdSet:string,externalId:string,targetGroupArn:string,groupName:string,maxItems:string,durationSeconds:int,regionSet:string,resourceArns:array<string>,pageSize:int,includeAllInstances:boolean,functionVersion:string,name:string,s3BucketName:string,includeGlobalServiceEvents:boolean,isMultiRegionTrail:boolean,enableLogFileValidation:boolean,isOrganizationTrail:boolean,bucketName:string,location:string,certificateArn:string,includes:struct<hasDnsFqdn:boolean,keyTypes:array<string>>,trailNameList:array<string>,resource:string,masterRegion:string,userName:string,functionName:string,resourceName:string,tagging:string,acl:string,website:string,replication:string,logging:string,policy:string,versioning:string,encry
ption:string,publicAccessBlock:string,lifecycle:string,accessKeyId:string,trailName:string,resourceIdList:array<string>,includeShadowTrails:boolean,logGroupName:string,accelerate:string,object-lock:string,notification:string,cors:string,requestPayment:string,DescribeVpcEndpointServiceConfigurationsRequest:string,keySpec:string,keyId:string,encryptionContext:struct<aws\:cloudtrail\:arn:string,aws\:s3\:arn:string,aws\:acm\:arn:string,aws\:cloudfront\:arn:string,aws\:lambda\:FunctionArn:string,aws\:elasticloadbalancing\:arn:string,aws\:codecommit\:env-alg:string,aws\:codecommit\:sig-alg:string,aws\:codecommit\:id:string,glue_catalog_id:string,service:string,aws\:rds\:db-id:string,aws\:pi\:service:string,SecretARN:string,SecretVersionId:string,domainARN:string>,resourceArn:string,paginationToken:string,tagFilters:array<struct<key:string,values:array<string>>>,resourcesPerPage:int,resourceTypeFilters:array<string>,crawlerNames:array<string>,crawlerNameList:array<string>,logStreamName:string,aggregateField:string,filter:struct<eventStatusCodes:array<string>,startTimes:array<struct<from:string>>>,jobName:string,runId:string,includeGraph:boolean,roleName:string,policyArn:string,accountAttributeNameSet:struct<items:array<struct<attributeName:string>>>,versionId:string,hidePassword:boolean,bucket:string,policyStatus:string,catalogId:string,databaseName:string,tablesToDelete:array<string>,DescribeByoipCidrsRequest:struct<MaxResults:int>,maxRecords:int,tableName:string,targetGroupArns:array<string>,targets:array<struct<id:string,port:int,availabilityZone:string,arn:string>>,dhcpOptionsSet:struct<items:array<struct<dhcpOptionsId:string>>>,subnetId:string,description:string,groupSet:struct<items:array<struct<groupId:string>>>,privateIpAddressesSet:string,ipv6AddressCount:int,resourcesSet:struct<items:array<struct<resourceId:string>>>,tagSet:struct<items:array<struct<key:string,value:string>>>,autoScalingGroupNames:array<string>,networkInterfaceId:string,nextToken:string,encryptio
nAlgorithm:string,serviceCode:string,marker:string,registryId:string,repositoryName:string,layerDigest:string,cluster:string,services:array<string>,imageIds:array<struct<imageDigest:string,imageTag:string>>,acceptedMediaTypes:array<string>,registryIds:array<string>,containerInstance:string,secretId:string,policyName:string,reservedInstancesSet:string,instanceId:string,agentVersion:string,agentStatus:string,platformType:string,platformName:string,platformVersion:string,iPAddress:string,computerName:string,agentName:string,jobQueue:string,parameters:struct<user_ids:string,dates:string,write_to_feature_store:string,remove_old_data:string,executionTimeout:array<string>,commands:array<string>>,jobDefinition:string,serviceNamespace:string,resourceIds:array<string>,commitId:string,branchName:string,rule:string,eventPattern:string,stateMachineArn:string,tags:array<struct<value:string,key:string>>,definition:string,streamName:string,limit:double,capacityProviderStrategy:array<struct<capacityProvider:string,weight:double,base:double>>,count:double,enableECSManagedTags:boolean,enableExecuteCommand:boolean,networkConfiguration:struct<awsvpcConfiguration:struct<assignPublicIp:string,securityGroups:array<string>,subnets:array<string>>>,overrides:struct<containerOverrides:array<struct<name:string,command:array<string>,environment:array<struct<name:string,value:string>>>>,cpu:string,memory:string>,taskDefinition:string,partitionValues:array<string>,stackStatusFilter:array<string>,tasks:array<string>,reservedInstancesModificationSet:string,loadBalancerNames:array<string>,GroupName:string,Filters:array<struct<Name:string,Values:array<string>>>,attribute:string,image:string,attributes:array<string>,exclusiveStartShardId:string,streamArn:string,showCacheNodeInfo:boolean,showCacheClustersNotInReplicationGroups:boolean,userData:string,instanceType:string,blockDeviceMapping:struct<items:array<struct<deviceName:string,ebs:struct<volumeSize:int,deleteOnTermination:boolean,volumeType:string>
>>>,availabilityZone:string,monitoring:struct<enabled:boolean>,disableApiTermination:boolean,clientToken:string,iamInstanceProfile:struct<name:string>,tagSpecificationSet:struct<items:array<struct<resourceType:string,tags:array<struct<key:string,value:string>>>>>,instanceMarketOptions:struct<marketType:string,spotOptions:struct<maxPrice:string,spotInstanceType:string>>,hostedZoneId:string,startRecordName:string,startRecordType:string,changeBatch:struct<changes:array<struct<action:string,resourceRecordSet:struct<name:string,type:string,tTL:int,resourceRecords:array<struct<value:string>>>>>>,startRecordIdentifier:string,dNSName:string,destinationKeyId:string,destinationEncryptionAlgorithm:string,sourceEncryptionAlgorithm:string,sourceAAD:string,sourceEncryptionContext:struct<aws\:acm\:arn:string>,destinationAAD:string,destinationEncryptionContext:struct<aws\:elasticloadbalancing\:arn:string,aws\:acm\:arn:string>,instanceIdentityDocument:string,instanceIdentityDocumentSignature:string,totalResources:array<struct<name:string,type:string,doubleValue:double,longValue:double,integerValue:double,stringSetValue:array<string>>>,task:string,status:string,reason:string,containers:array<struct<containerName:string,runtimeId:string,networkBindings:array<struct<bindIP:string,containerPort:double,hostPort:double,protocol:string>>,status:string,exitCode:double,reason:string>>,pullStartedAt:string,pullStoppedAt:string,instanceIds:array<string>,documentName:string,timeoutSeconds:double,comment:string,outputS3BucketName:string,outputS3KeyPrefix:string,interactive:boolean,commandId:string,details:boolean,items:array<struct<typeName:string,schemaVersion:string,captureTime:string,contentHash:string,id:string,title:string,severity:string,status:string,details:struct<DocumentVersion:string,DocumentName:string>>>,associationId:string,executionResult:struct<executionDate:string,status:string,executionSummary:string,errorCode:string>,resourceId:string,resourceType:string,complianceType:string,
executionSummary:struct<executionTime:string,executionId:string,executionType:string>,itemContentHash:string,executionStoppedAt:string,instances:array<string>,loadBalancerName:string,dBInstanceIdentifier:string,dBSnapshotIdentifier:string,clusterId:string,clusterStates:array<string>,launchConfigurationNames:array<string>,names:array<string>,autoScalingGroupName:string,targetGroupARNs:array<string>,loadBalancerArn:string,protocol:string,port:int,defaultActions:array<struct<targetGroupArn:string,type:string>>,securityGroups:array<string>,type:string,ipAddressType:string,subnetMappings:array<struct<subnetId:string>>,scheme:string,loadBalancerArns:array<string>,healthCheckPort:string,healthCheckPath:string,vpcId:string,healthCheckProtocol:string,cacheClusterId:string,pipelineId:string,input:struct<key:string,frameRate:string,resolution:string,aspectRatio:string,interlaced:string,container:string>,output:struct<key:string,thumbnailPattern:string,rotate:string,presetId:string,watermarks:array<struct<presetWatermarkId:string,inputKey:string>>,composition:array<struct<timeSpan:struct<startTime:string,duration:string>>>>,outputKeyPrefix:string,insightSelectors:array<string>,s3KeyPrefix:string,kmsKeyId:string,eventSelectors:array<struct<readWriteType:string,includeManagementEvents:boolean,dataResources:array<string>,excludeManagementEventSources:array<string>>>,availabilityZoneSet:string,availabilityZoneIdSet:string,spotInstanceRequestIdSet:struct<items:array<struct<spotInstanceRequestId:string>>>,expression:string,executableBySet:string,imagesSet:struct<items:array<struct<imageId:string>>>,ownersSet:string,auditContext:struct<additionalAuditContext:string>,segment:struct<segmentNumber:int,totalSegments:int>,queryExecutionId:string,queryString:string,clientRequestToken:string,queryExecutionContext:struct<database:string,catalog:string>,resultConfiguration:struct<outputLocation:string>,workGroup:string,tableInput:struct<partitionKeys:array<struct<name:string,type:string>>,last
AccessTime:string,storageDescriptor:struct<numberOfBuckets:int,location:string,storedAsSubDirectories:boolean,compressed:boolean,sortColumns:array<string>,columns:array<struct<name:string,type:string>>,outputFormat:string,parameters:string,skewedInfo:struct<skewedColumnValueLocationMaps:string,skewedColumnNames:array<string>,skewedColumnValues:array<string>>,serdeInfo:struct<parameters:struct<serialization.format:string>,serializationLibrary:string>,inputFormat:string,bucketColumns:array<string>>,tableType:string,name:string,isRowFilteringEnabled:boolean,parameters:struct<EXTERNAL:string,avro.schema.literal:string,parquet.compress:string,transient_lastDdlTime:string>,retention:int,owner:string>,instanceTypeSet:struct<items:array<struct<instanceType:string>>>,productDescriptionSet:struct<items:array<struct<productDescription:string>>>,tenancy:string,networkInterfaceSet:struct<items:array<struct<deviceIndex:int,subnetId:string,description:string,deleteOnTermination:boolean,associatePublicIpAddress:boolean,groupSet:struct<items:array<struct<groupId:string>>>>>>,queryExecutionIds:array<string>,partitionInputList:array<struct<storageDescriptor:struct<numberOfBuckets:int,location:string,storedAsSubDirectories:boolean,compressed:boolean,sortColumns:array<string>,columns:array<struct<name:string,type:string>>,outputFormat:string,parameters:string,skewedInfo:struct<skewedColumnValueLocationMaps:string,skewedColumnNames:array<string>,skewedColumnValues:array<string>>,serdeInfo:struct<parameters:struct<serialization.format:string>,serializationLibrary:string>,inputFormat:string,bucketColumns:array<string>>,values:array<string>,lastAccessTime:string>>>' but '\' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
I needed the raw data and was not familiar with the error above, so I created a table manually to get results:
select count(*) from manual_cloudtrail_glue_testing_raw returns 187073 events... 124 more than the 186949 events in the parquet table
Ran the workflow again manually from Glue console
select count(*) from cloudtrail_parquet now has 373908 records, which means 186959 records were added, even though no additional files were being written to the bucket
Confirmed with additional queries that these were duplicate event records, but 10 more records were added than in the first ETL run
Ran the workflow a third time manually from the Glue console
select count(*) from cloudtrail_parquet now has 560826 records - 186918 were added, again duplicate records, but this time 31 fewer records than in the first ETL run
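To make the arithmetic behind these numbers easy to check, here is a minimal sketch that derives the rows added per run from the cumulative counts reported above. The variable and function names are illustrative only and are not part of the module or the ETL script.

```python
# Cumulative row counts of cloudtrail_parquet after each workflow run,
# taken from the count(*) results reported above.
cumulative_counts = [186949, 373908, 560826]

# Row count of the manually created raw table.
raw_count = 187073


def per_run_deltas(cumulative):
    """Return how many rows each run added, given cumulative totals."""
    previous = 0
    deltas = []
    for total in cumulative:
        deltas.append(total - previous)
        previous = total
    return deltas


deltas = per_run_deltas(cumulative_counts)
for run, added in enumerate(deltas, start=1):
    # "shortfall" is how many raw events did not show up in that run's output.
    print(f"run {run}: added {added}, shortfall vs raw = {raw_count - added}")
```

With static input, every run would be expected to add the same number of rows, so any variation between the deltas is the inconsistency in question.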
It would require additional testing to be certain, but it appears (at least with static input, as I tested with), that subsequent workflow runs reprocess the same events and convert them to parquet again.
Given static input, the output of each ETL run also appears to be inconsistent (i.e. each run converted a different number of events to parquet).
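One way to go beyond row counts would be to compare event identifiers directly. The sketch below assumes each event carries a unique identifier (for example CloudTrail's eventID field) and that the identifiers from both tables have already been exported into Python lists. The function name and the toy data are hypothetical, not taken from the actual tables.

```python
from collections import Counter


def compare_event_ids(raw_ids, parquet_ids):
    """Compare event IDs from the raw table against the parquet table.

    Returns three values:
      duplicated: {event_id: occurrences} for IDs seen more than once in parquet
      missing:    IDs present in raw but absent from parquet
      unexpected: IDs present in parquet but absent from raw
    """
    raw_set = set(raw_ids)
    parquet_counts = Counter(parquet_ids)
    parquet_set = set(parquet_counts)

    duplicated = {eid: n for eid, n in parquet_counts.items() if n > 1}
    missing = raw_set - parquet_set
    unexpected = parquet_set - raw_set
    return duplicated, missing, unexpected


# Toy data: four raw events; the parquet side holds the output of two runs,
# where the second run reprocessed two events and "e4" was never converted.
raw = ["e1", "e2", "e3", "e4"]
parquet = ["e1", "e2", "e3", "e1", "e2"]

duplicated, missing, unexpected = compare_event_ids(raw, parquet)
print("duplicated:", duplicated)
print("missing:", missing)
print("unexpected:", unexpected)
```

If the duplicates account for the whole increase after each rerun, the job is reprocessing the same input. A non-empty missing set would show which events a given run dropped.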
I'm curious to see if anyone has gotten this to work properly without modifications to the Crawlers or ETL job?
If so, were you able to query the raw log table? Did multiple runs on the same dataset produce duplicate results for you? Has anyone else tried to run a static dataset through this workflow multiple times and compare the output (parquet) to the input (raw) event data to validate the consistency of the ETL process?
Thanks in advance - looking forward to hopefully learning a little more about this!