Data Engineer certificate

File upload to S3

The AWS CLI automatically splits a large file into a multipart upload:

$ aws s3 cp large_test_file s3://DOC-EXAMPLE-BUCKET/

For example, in the case of a multipart upload of a 100 GB file, you would have the following API calls for the entire process:
- A CreateMultipartUpload call to start the process
- 1,000 individual UploadPart calls, each uploading a 100 MB part, for a total size of 100 GB
- A CompleteMultipartUpload call to finish the process
That is 1,002 API calls in total.
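
A minimal sketch of the same flow with the low-level s3api commands (the bucket, key, part file name, UploadId, and parts.json below are placeholders for illustration; aws s3 cp does all of this for you):

# 1. Start the upload and note the UploadId in the response
aws s3api create-multipart-upload --bucket DOC-EXAMPLE-BUCKET --key large_test_file

# 2. Upload each ~100 MB part (repeat with --part-number 2, 3, ..., saving each returned ETag)
aws s3api upload-part --bucket DOC-EXAMPLE-BUCKET --key large_test_file \
    --part-number 1 --body part-001 --upload-id "EXAMPLE-UPLOAD-ID"

# 3. Finish, passing the list of part numbers and ETags collected in step 2 (parts.json)
aws s3api complete-multipart-upload --bucket DOC-EXAMPLE-BUCKET --key large_test_file \
    --upload-id "EXAMPLE-UPLOAD-ID" --multipart-upload file://parts.json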

To store an object directly in a Glacier storage class (here DEEP_ARCHIVE, i.e. S3 Glacier Deep Archive):

aws s3 cp your-file.txt s3://your-bucket-name/your-file.txt --storage-class DEEP_ARCHIVE

To add a lifecycle policy, apply a configuration file to the bucket:

aws s3api put-bucket-lifecycle-configuration --bucket your-bucket-name --lifecycle-configuration file://lifecycle_policy.json

lifecycle_policy.json, transitioning tagged objects to Glacier after 30 days and expiring them after 365 days:

{
    "Rules": [
        {
            "ID": "Move to Glacier after 30 days",
            "Filter": {
                "Tag": {
                    "Key": "Lifecycle",
                    "Value": "Archive"
                }
            },
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "GLACIER"
                }
            ]
        },
        {
            "ID": "Delete after 365 days",
            "Filter": {
                "Tag": {
                    "Key": "Lifecycle",
                    "Value": "Archive"
                }
            },
            "Status": "Enabled",
            "Expiration": {
                "Days": 365
            }
        }
    ]
}
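
Both rules filter on the tag Lifecycle=Archive, so they only apply to objects carrying that tag. A sketch of tagging an existing object so the policy picks it up (bucket and key are placeholders):

aws s3api put-object-tagging --bucket your-bucket-name --key your-file.txt \
    --tagging 'TagSet=[{Key=Lifecycle,Value=Archive}]'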

Databases

DynamoDB

DynamoDB is a serverless NoSQL database, closest to a key-value store (it does not support joins or aggregations such as SUM/AVG, but it supports horizontal scaling). It can handle millions of requests per second and hundreds of TB of storage. A table consists of a partition key (HASH) or a partition key plus sort key (HASH + RANGE), and attributes that can be any key-value pairs. The maximum item size is 400 KB.
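
A sketch of creating a table with a partition key plus sort key and writing one item (the table, key, and attribute names are made up for illustration):

# HASH = partition key, RANGE = sort key
aws dynamodb create-table --table-name Orders \
    --attribute-definitions AttributeName=CustomerId,AttributeType=S AttributeName=OrderDate,AttributeType=S \
    --key-schema AttributeName=CustomerId,KeyType=HASH AttributeName=OrderDate,KeyType=RANGE \
    --billing-mode PAY_PER_REQUEST

# Write one item; any extra key-value pairs become attributes
aws dynamodb put-item --table-name Orders \
    --item '{"CustomerId": {"S": "c-001"}, "OrderDate": {"S": "2024-06-01"}, "Total": {"N": "42"}}'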

DynamoDB has Provisioned and On-Demand capacity modes (you can switch between them every 24 hours; On-Demand is roughly 2.5x more expensive). Provisioned mode requires you to define:
- Read Capacity Units (RCU):
  - strongly consistent reads: 1 RCU = 1 read per second of up to 4 KB
  - eventually consistent reads: 1 RCU = 2 reads per second of up to 4 KB
- Write Capacity Units (WCU): 1 WCU = 1 write per second of up to 1 KB
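
For example, reading 10 strongly consistent items per second of 6 KB each needs 10 × ceil(6 / 4) = 20 RCUs (only 10 RCUs if eventually consistent), while writing 5 items per second of 2.5 KB each needs 5 × ceil(2.5 / 1) = 15 WCUs.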

There is temporary burst capacity as a buffer, but don't rely on it too much: if both provisioned and burst capacity are exceeded, requests are throttled and the SDK retries with exponential backoff.

Number of partitions = ceil(max(RCUs / 3,000 + WCUs / 1,000, Total size / 10 GB))

WCUs and RCUs are divided evenly across partitions.
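
For example, a table provisioned with 6,000 RCUs and 2,000 WCUs holding 15 GB of data needs ceil(max(6,000 / 3,000 + 2,000 / 1,000, 15 / 10)) = ceil(max(4, 1.5)) = 4 partitions, so each partition gets 1,500 RCUs and 500 WCUs.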

RDS and Aurora

For RDS, monitor ReadIOPS in CloudWatch and check that it stays small and stable (together with CPU, memory, storage, and replica lag).
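
A sketch of pulling ReadIOPS for one instance from CloudWatch (the instance identifier and time window are placeholders):

aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReadIOPS \
    --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
    --start-time 2024-06-01T00:00:00Z --end-time 2024-06-01T01:00:00Z \
    --period 300 --statistics Average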

Make sure the DNS TTL for the RDS endpoint is not too long; in case of a failover you want the TTL to be short (around 30 seconds).

Consider imposing rate limits in the API Gateway.

Use InnoDB as the storage engine for MySQL and MariaDB.

For PostgreSQL, when bulk loading data, disable DB backups and Multi-AZ, and disable synchronous commit and autovacuum (re-enable all of these for regular operation once the load is done). A sketch of flipping the two PostgreSQL parameters is shown below.
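
A sketch of turning those two settings off through the DB parameter group before the load (the parameter group name is a placeholder; turn them back on the same way afterwards):

aws rds modify-db-parameter-group --db-parameter-group-name my-postgres-params \
    --parameters "ParameterName=synchronous_commit,ParameterValue=off,ApplyMethod=immediate" \
                 "ParameterName=autovacuum,ParameterValue=0,ApplyMethod=immediate"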

SQL Server specific: do not enable the simple recovery model, offline mode, or read-only mode; these break Multi-AZ.

Aurora is the AWS implementation of MySQL- and PostgreSQL-compatible databases.

Others

  • DocumentDB is the AWS implementation of MongoDB (1MM RPS)
  • MemoryDB is the AWS implementation of Redis (160MM RPS)
  • Amazon Keyspaces is the AWS implementation of Cassandra (NoSQL) (1K RPS)
  • Neptune is the AWS implementation of a graph database
  • Timestream is the AWS version of a time-series database (like Prometheus); recent data is kept in memory, older data in cost-optimized storage

Redshift

Redshift is a data warehouse, designed for storing data and for online analytical processing (OLAP).