Customers often need to reproduce issues or workloads from their clusters, or to run what-if scenarios. A customer can easily clone a production cluster, but replicating the workload that ran on it is more complicated. This set of tools was created to bridge that gap.
- ScriptRunner.sh: Takes a list of script files (sqls) and executes all scripts at a pre-determined parallelism
- SimpleReplay.sh: Takes an ordered list of script files (sqls) and replays them at the same cadence as the original workload.
- ParseUserActivityLog.py: Converts the User Activity Log (audit) into a series of SQL scripts named <DB>-<USERNAME>_<PID>.sql
- order_list.sh: Orders the list of scripts by the time their original session started.
- mitigate_duplicates.sh: sed script to allow for duplicated CTAS on the same transaction.
- Enable Amazon Redshift audit logging, specifically the User Activity Log: https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html
- Collect the User Activity Log (audit) files from S3 for the desired time range. You can aggregate them into a single file or run them one at a time, in chronological order, in the same folder.
- Run ParseUserActivityLog.py on the file. A number of .sql files named <DB>-<USERNAME>_<PID>.sql will be generated.
- The optional parameter -c defines credentials to be added to the scripts if you want to replay COPY and UNLOAD statements. Note that the script will try to read/write the original bucket used in the workload.
- The optional -r parameter tries to remove duplicate statements caused by rewrites (it might remove user-issued statements, so use with caution).
- The optional parameter -s will prefix any file that contains a write operation with rw-, so you can easily separate that workload.
- Remove any files that you do not want to replay and make a new file listing all the files you want to replay. Normally I move the files to a sub-folder.
- If you want to replay at the same cadence as the original workload, use the **order_list.sh** script to order the sqls file. A minimal end-to-end sketch of these steps is shown below.
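As an illustration only, the preparation steps above might look like the sketch below. The bucket name, log file pattern, dates, and folder names are placeholders, and the exact invocations of ParseUserActivityLog.py and order_list.sh may differ from this sketch; check the scripts themselves for their actual arguments.

```sh
# Hypothetical example: bucket, prefixes, and dates are placeholders.
# 1. Download the User Activity Log files for the desired time range from S3
aws s3 cp s3://my-audit-bucket/redshift/logs/ ./audit/ --recursive \
    --exclude "*" --include "*useractivitylog*2021-01-01*"

# 2. Aggregate the logs into a single file (decompress first if needed)
gunzip ./audit/*.gz
cat ./audit/*useractivitylog* > useractivitylog.txt

# 3. Generate one .sql script per session from the aggregated log
python ParseUserActivityLog.py useractivitylog.txt

# 4. Move the scripts you want to replay into a sub-folder and list them
mkdir -p replay && mv *-*_*.sql replay/
ls replay/*.sql > sqls

# 5. Order the list by original session start time (needed for SimpleReplay.sh)
./order_list.sh sqls
```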
./ScriptRunner.sh <n> <sqls>
- <n> is the number of sessions created; the scripts referenced in the sqls file will be distributed among the sessions
- <sqls> is the file that has the list of scripts to run (see the example below)
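For example, assuming the scripts to replay were moved to a replay sub-folder and listed in a file named sqls as above, a run with 10 parallel sessions (the number is just a placeholder) could look like:

```sh
# Hypothetical example: distribute the scripts listed in sqls across 10 sessions
ls replay/*.sql > sqls
./ScriptRunner.sh 10 sqls
```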
For long-running replays, it is recommended to use
nohup ./ScriptRunner.sh <n> <sqls> &
tail -f nohup.out
./SimpleReplay.sh [-b <begin time> ] [-e <end time> | -r <seconds to run> ] [ <script list> ] [ <target concurrency> ]
- <script list> optional, is the file with the list of scripts to be executed. Important: scripts in this file need to be properly ordered; use order_list.sh for that. Uses a file named sqls if not declared.
- <target concurrency> optional, is the target concurrency in case you want to speed up the issuing of statements. Defaults to 0, meaning the original cadence is respected. See the example below.
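For example, a replay of the ordered sqls list at the original cadence that stops after one hour (the duration is just a placeholder) could look like:

```sh
# Hypothetical example: replay the scripts listed in sqls at the original
# cadence (<target concurrency> = 0) and stop after 3600 seconds
./SimpleReplay.sh -r 3600 sqls 0
```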
For long-running replays, it is recommended to use
nohup ./SimpleReplay.sh [-b <begin time> ] [-e <end time> | -r <seconds to run> ] [ <script list> ] [ <target concurrency> ] &
tail -f nohup.out
- When the original workload was executed with Redshift's JDBC driver, the statements might show up twice in the log.
- The replay script does not consider idle time within the sessions, only the delay before starting each session.
- You should customize the ParseUserActivityLog.py file to keep or skip specific statements (e.g. avoid writes or COPY). A post-processing alternative is sketched below.
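If you prefer not to edit the parser, one rough alternative (similar in spirit to the -s prefixing option) is to post-filter the generated scripts and set aside the ones that contain write statements. This is only a sketch; the keyword list and the writes folder are illustrations and may need adjusting for your workload.

```sh
# Hypothetical sketch: move generated scripts that contain write statements aside
mkdir -p writes
for f in *.sql; do
    if grep -Eqi '^[[:space:]]*(copy|insert|update|delete|create|drop|alter|truncate)[[:space:]]' "$f"; then
        mv "$f" writes/
    fi
done
```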