vx-data.com follow up

Topics: Using Parquet, Athena + ETL, API Hosting

Jeff Gensler
13 min read · Sep 13, 2022

Finalizing the Round Explorer

At the end of the previous blog post, I was still exploring how to build the Round Explorer. While I understood how to cache and load the data (using Parquet files + parquet-wasm), I still needed to write the code to upload the individual rounds to the bucket. You can read the previous blog post here:

Using Parquet

Although I had decided on using Parquet, there were still some details I wanted to explore. As the data would likely only be migrated once, I wanted to put in a bit of extra effort to understand whether any efficiencies could be gained. Here are the results of some tests analyzing Parquet file size.

JSON vs Parquet

First, we should break down the original data structure and understand the “base case” of the comparison. I pulled one particular match containing 6 rounds of data. I’ll present a few numbers and establish a range for how much space a bundle of paths for a single match might take.

The first player’s replay from the first round: jq -c '.replay_info.replayRoundInfo[0].replayCharacterInfo[0].replayCharacterStateInfo' match-replay.json | wc
1 1 12313

which suggests a maximum uncompressed size for 7 rounds of around 862KB: 12313 bytes * 10 players * 7 rounds ≈ 862KB

using gzip on a single player’s replay from one round:
jq -c '.replay_info.replayRoundInfo[0].replayCharacterInfo[0].replayCharacterStateInfo' match-replay.json | gzip | wc
6 43 2064

which suggests a maximum of around 144KB: 2064 bytes * 10 players * 7 rounds ≈ 144KB

and a minimum of around 83KB: 2064 bytes * 10 players * 4 rounds ≈ 83KB

using gzip on a whole round’s worth of replays (10 players): jq -c '.replay_info.replayRoundInfo[0].replayCharacterInfo' match-replay.json | gzip | wc
113 647 26777

which suggests a maximum of around 187KB: 26777 bytes * 7 rounds ≈ 187KB

finally, the whole structure in plaintext: jq -c '.' match-replay.json | wc
1 9 1218678

and gzipped, around 245KB: jq -c '.' match-replay.json | gzip | wc
994 5566 245178

Using these numbers, I am guessing that the round replay data alone falls somewhere between 80KB (4 rounds) and 180KB (7 rounds). For this particular match with 6 rounds, we will assume a size of 120KB to 160KB.
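
For reference, here is a rough Python equivalent of the jq + gzip estimates above. This is only a sketch: it assumes match-replay.json has the structure shown and that every player’s round compresses to roughly the same size.

import gzip
import json

with open("match-replay.json") as f:
    match = json.load(f)

# one player's state info from the first round, compressed on its own
states = match["replay_info"]["replayRoundInfo"][0]["replayCharacterInfo"][0]["replayCharacterStateInfo"]
compressed = len(gzip.compress(json.dumps(states, separators=(",", ":")).encode()))

players, rounds = 10, 7
print(f"~{compressed * players * rounds / 1000:.0f} KB for {rounds} rounds")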

By default, Athena CTAS queries store data as Parquet utilizing GZIP compression (docs). This was a bit of a surprise to me as Pandas saves with snappy by default (docs). Here is the result for storing this match using the default parameters in Athena (129KB):

gzip

Overall, these results were a bit surprising to me as I figured Parquet should remove repeated text in the JSON. I realized that I was saving the point data as Well-Known Text (a string!) instead of another format which was likely to blame for the similar size. This was originally done for two reasons:

  • dbt scripts could assume every “geometry” column is stored as text
  • frontend could easily parse both paths (LINESTRING) and map regions (POLYGON) using the above assumption

GZIP vs Snappy

I was curious about storing the file using Snappy compression to see the difference. Switching to Snappy compression, we get a size of 195K, which is larger than the original JSON estimate.

snappy

This matches expectations, as the Snappy project states that it trades compression ratio for speed (website):

For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.

I do wonder if the Engineers behind Athena ran similar tests and chose GZIP over Snappy for this reason.
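
The same comparison is easy to reproduce locally with pandas + pyarrow. A minimal sketch (the DataFrame here is a made-up stand-in for the real match data):

import os

import pandas as pd

# hypothetical frame standing in for one match's replay paths
df = pd.DataFrame({
    "rsn": ["match-1234"] * 10_000,
    "geometry": ["LINESTRING (100 200, 150 250)"] * 10_000,
})

for codec in ("gzip", "snappy"):
    path = f"match.{codec}.parquet"
    df.to_parquet(path, compression=codec)  # requires pyarrow
    print(codec, os.path.getsize(path), "bytes")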

Data types: double -> int

I was curious about keeping WKT but also wanted to see if changing the type of parameters used in the ST_POINT constructor would help. Most of the points were between -32,000 and 32,000 and I figured any loss of the fractional part would not lead to “inaccuracy” when viewed in the frontend. After changing them from double to int, the resulting file was 92KB:

gzip, WKT, int

This was encouraging, as it finally beat the “best case” of the original JSON data by a noticeable margin.
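
Locally, the equivalent experiment is just a dtype cast before writing. A sketch with random stand-in coordinates (so the exact sizes won’t match the Athena results):

import os

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.uniform(-32_000, 32_000, size=(10_000, 2)), columns=["x", "y"]
)

df.to_parquet("points-double.parquet", compression="gzip")
df.astype("int32").to_parquet("points-int.parquet", compression="gzip")

for path in ("points-double.parquet", "points-int.parquet"):
    print(path, os.path.getsize(path), "bytes")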

WKT, Raw XY, WKB

With both compression choice and data types out of the way, it was time to figure out if WKT could be replaced. Before storing the columns as X/Y integers, I double checked the Types page of the Parquet docs to see if a smaller type could be used. Other than Boolean, a 32-bit Integer seemed the most reasonable choice:

gzip, raw X/Y column, int

Again, the results were encouraging: 68KB

JSON and WKT formed an upper bound and raw XY formed the lower bound. I figured there was some middle ground that could apply to all geometry data and not require purpose-built optimizations. This “middle ground” is WKB (Well-Known Binary). The benefit is similar to that of WKT: the same column type can be used for many types of geometry.

This resulted in a reasonable 77KB

gzip, WKB, int
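
To make the three options concrete, here is a sketch of how the same points could be written in each representation (using shapely; the coordinates are illustrative):

import pandas as pd
from shapely.geometry import Point

points = [Point(x, y) for x, y in [(100, 200), (150, 250), (-3200, 4800)]]

# WKT: a single text column works for any geometry type
pd.DataFrame({"geom": [p.wkt for p in points]}).to_parquet("wkt.parquet", compression="gzip")

# WKB: a single binary column, still generic across geometry types
pd.DataFrame({"geom": [p.wkb for p in points]}).to_parquet("wkb.parquet", compression="gzip")

# raw XY: the smallest option, but only works for points
pd.DataFrame(
    {"x": [int(p.x) for p in points], "y": [int(p.y) for p in points]}
).astype("int32").to_parquet("xy.parquet", compression="gzip")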

Data granularity

Hungry to make this data even smaller, I figured I could take an approach similar to the double -> int conversion. I wondered whether further reducing the precision of the points could improve the compressed size at all. I opened the draw tab and drew squares of various sizes:

three squares left to right, coordinates displayed as the last three POLYGONS in the console
  • big: 135 units
  • medium: 50 units
  • tiny: 15 units

For comparison, I zoomed out to get the Car in the upper right. You can barely even see the smallest square!

Surely, a customer won’t notice a 10 unit margin of error. Using raw XY columns and dividing by 10 resulted in 44KB.

gzip, raw XY, int / 10

Although a 100 unit margin of error looks large enough to mistakenly place a player in or out of cover, I was curious about how much smaller the data could get. Using raw XY columns and dividing by 100 resulted in 24KB (!!!)

gzip, raw XY, int / 100

Combining the best of both strategies, I tried using WKB after dividing by 10, which resulted in 64KB.

gzip, WKB, int / 10
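
A sketch of the quantization step used in these experiments, assuming int32 X/Y columns; the frontend would multiply back by the same divisor before drawing:

import os

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.uniform(-32_000, 32_000, size=(10_000, 2)), columns=["x", "y"]
).astype("int32")

for divisor in (1, 10, 100):
    # integer division throws away precision, which is what improves compression
    path = f"xy-div{divisor}.parquet"
    (df // divisor).astype("int32").to_parquet(path, compression="gzip")
    print(divisor, os.path.getsize(path), "bytes")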

While other strategies may exist (see the Parquet Encodings page), these may be too cumbersome to set up and integrate into existing tools. Overall, I would summarize the results as the following:

  • JSON: operationally simple, worst storage
  • WKT: easiest to set up and debug, worst storage
  • WKB: reasonable to set up, reasonable storage
  • Raw XY: implementation dependent, best storage

Athena + ETL

I mentioned in the previous article that Athena lookups require full table scans unless Partitioning / Bucketing is used. Given the Partition limit (100) in CTAS queries, I wrote all of the ETL operating on full table transformations without creating Partitions/Buckets. Given that I wanted to host the individual round replays as Parquet (example: cdn/replays/match-1234.parquet), it was now time to explore how to move data from a single unpartitioned table directory to individual partitioned directories.

Unfortunately, you are on your own to build this type of ETL process. I ended up exploring a few options (mostly trying to optimize for wall-clock time):

  • Download/Upload with driver table
  • CTAS with driver table
  • batching using UNLOAD

I won’t dive into the details, but most of the ETL time was spent moving the S3 files from one location to another, both to rename them and to add the Cache-Control header. I was still avoiding a CDN to save on cost, which meant these headers needed to be set on the objects themselves.
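
For reference, the per-object move looks roughly like this with boto3 (bucket and key names are placeholders); copy_object with MetadataDirective="REPLACE" is what allows Cache-Control to be set on the copy:

import boto3

s3 = boto3.client("s3")

def move_with_cache_control(bucket, src_key, dst_key):
    # copying the object onto its new key is the only way to both rename it
    # and attach new headers such as Cache-Control
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
        CacheControl="public, max-age=31536000",
        MetadataDirective="REPLACE",
    )
    s3.delete_object(Bucket=bucket, Key=src_key)

move_with_cache_control("cdn-bucket", "unload/20220905_000000_00001.gz.parquet", "replays/match-1234.parquet")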

For all strategies, I first created a table with the maximum number of buckets (100), keyed by RSN (the match id). This should reduce the full table scan by a factor of 100, though it still ended up scanning quite a bit of extra data.
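
A sketch of that bucketed copy as an Athena CTAS, submitted with boto3 (database, table, and S3 locations are placeholders):

import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE replays_bucketed
WITH (
  format = 'PARQUET',
  external_location = 's3://my-athena-results/replays_bucketed/',
  bucketed_by = ARRAY['rsn'],
  bucket_count = 100
) AS
SELECT * FROM replays
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "vx_data"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)

A later equality filter on rsn then only needs to read the bucket that value hashes to, which is where the roughly 1/100 scan reduction comes from.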

The batching of UNLOADs (16 concurrent runners) was quick; IIRC, this took around 20 minutes for all rounds. Moving the files took around 60 seconds per 100 RSNs and somewhere over 2 hours in total. This was a one time cost so I figured I would let it run and see just how bad the table scans were. Overall, it ended at about $7, which isn’t too bad, though I am not very pleased with the result and suspect it could be improved by downloading the raw Parquet files and doing the selects/uploads myself. However, that would require spinning up servers when I need them, which is also a bit tedious.

To be a bit more explicit, Athena costs $5 per TB queried. A scan cost of $3.39 would imply about 678GB scanned against the original 80GB of data, which is… undesirable.

migrating the paths

I wanted to keep pushing the machinery further and see what partitioning something based on the players was like. I figured the nickname resolution could be hosted this way, so I gave it a shot (though one or two rows of ~15 characters hardly needs Parquet). This didn’t turn out to be a great idea and ended up costing $11, with the majority of the cost coming from the requests to move the data.

Sep 05 == using ETL to migrate batches of 100 nicknames

Although this experiment wasn’t really necessary, it does highlight some issues when using Athena:

  • Using dbt for whole-table transformations is easy to get started with, though it can pose problems if partitioning is required later
  • Combining dbt and partitioned Athena data is not trivial: either dbt needs to be wrapped in another tool or process (to process a batch) or the adapter needs to be customized to run your specific transformation process on table creation / insert. Perhaps this is a place to use a dbt post-model hook. Here is a document showing how to create an Athena table with more than 100 partitions (see the sketch after this list). Excerpt: “Continue using INSERT INTO statements that read and add no more than 100 partitions each. Continue until you reach the number of partitions that you require.”
  • In general, moving from unpartitioned data to partitioned data requires a specially crafted process that depends on the cardinality of the source.
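
Here is a rough sketch of that INSERT INTO batching loop with boto3; the table names and partition values are placeholders, and in practice you would poll get_query_execution before submitting the next batch:

import boto3

athena = boto3.client("athena")

def run(sql):
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "vx_data"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
    )

rsns = [f"match-{i}" for i in range(1_000)]  # stand-in for the real partition values

# each INSERT INTO may create at most 100 new partitions
for i in range(0, len(rsns), 100):
    batch = ", ".join(f"'{rsn}'" for rsn in rsns[i:i + 100])
    run(f"INSERT INTO replays_partitioned SELECT * FROM replays WHERE rsn IN ({batch})")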

Athena is great for this stage of the project as the read costs scale to zero and S3 storage is relatively cheap. However, I am a bit more worried about what happens when the game launches and incremental processes run as often as every hour of every day. I assume partitioned data is a must, and the storage will need to be reworked to support partitioning while still running dbt transformations.

Reducing API Hosting Costs

Around this time, the first of the month had passed and I received a billing alarm for the month (for both clouds!). I hadn’t really been checking Google Cloud as I was almost always working on Athena / AWS, and the previous month’s alarm for both platforms had triggered earlier in August.

The majority of the costs were the following:

  • Balanced PD Capacity (two 100GB disks for Supabase)
  • HTTP Load balancer: hosting the certificate in front of both the bucket and Supabase instance

I had found that Supabase could run on the lowest instance size (e2-micro) provided joins were not required by the frontend.

Supabase vs Lambda

I had mentioned in the previous post that Supabase was really only there to provide the nickname search and various lookups per-user (leaderboard place, per-map percentile, recent match list). I figured I should try my suggestion from the previous post: bundling a single Parquet file with DuckDB.

The function is only 50 lines, supporting both ilike and eq queries. It was easy enough to bundle using AWS’s Docker base image. Docker Compose also helped to manage the variety of images and build args and made it simple to build/test/push. While my previous post used a single request in the browser as a rough benchmark, I decided to put in a bit more effort to quantify the two platforms. I wrote a load test using K6 to generate many more requests over a 10 second interval. This was very easy to get started with, and the cloud feature proved useful for the tests in the following section.
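
A condensed sketch of what such a function can look like (the real one handles a bit more); the Parquet file name, table columns, and query parameters here are illustrative:

import json

import duckdb

# the Parquet file is bundled into the container image next to the handler
con = duckdb.connect()

def handler(event, context):
    params = event.get("queryStringParameters") or {}
    nickname = params.get("nickname", "")
    rows = con.execute(
        "SELECT rsn, nickname FROM 'players.parquet' WHERE nickname ILIKE ?",
        [f"%{nickname}%"],
    ).fetchall()
    # rows are tuples, so the JSON body is a list of [column_1_value, column_2_value]
    return {
        "statusCode": 200,
        "headers": {"content-type": "application/json"},
        "body": json.dumps(rows),
    }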

Both tests were run from my desktop computer. The Lambda results were measured after the cold start. The results look great: better average, p90, and p95 response times. I did notice that the data transferred was a bit higher for Supabase. As mentioned in the previous blog post, PostgREST/Kong doesn’t GZIP the response, so we receive a plaintext blob of the JSON results. DuckDB’s rows of results have the following schema:

[ column_1_value, column_2_value ]

This avoids repeating the key (column name) for every row.

Overall, Lambda represents a reasonable replacement for Supabase, although extra machinery is needed to re-build and deploy the functions.

SQLite vs DuckDB

DuckDB takes up deployment package space that could otherwise be used by the Parquet file. It happens that Python ships with a SQLite package by default! Surprisingly, it took very little to “migrate” from DuckDB to SQLite. The APIs are nearly identical, although SQLite only supports LIKE and not ILIKE. Performance was similar for my data, though DuckDB did behave a bit better (maybe 10–50ms).
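
The SQLite version of the earlier sketch is nearly identical; it assumes the data was pre-loaded into a bundled players.db instead of a Parquet file, and relies on LIKE being case-insensitive for ASCII by default:

import json
import sqlite3

con = sqlite3.connect("players.db")

def handler(event, context):
    params = event.get("queryStringParameters") or {}
    nickname = params.get("nickname", "")
    # SQLite has no ILIKE; plain LIKE is already case-insensitive for ASCII
    rows = con.execute(
        "SELECT rsn, nickname FROM players WHERE nickname LIKE ?",
        (f"%{nickname}%",),
    ).fetchall()
    return {"statusCode": 200, "body": json.dumps(rows)}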

Lambda vs Lambda@Edge

This blog post contains two great pictures of the architecture of both Lambda@Edge and CloudFront Functions: https://aws.amazon.com/blogs/aws/introducing-cloudfront-functions-run-your-code-at-the-edge-with-low-latency-at-any-scale/

It took a bit of rewriting, as the Lambda@Edge event is different from the HTTP 2.0 event. Like above, I tried both SQLite and DuckDB. Both performed far worse than the container-based deployment, by nearly 100-200ms. I checked the logs and found the following warning:

"OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k"

It seems this is a common issue and is mentioned in several SO questions.
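
Most of the rewriting came down to where the query string lives in each event. A sketch of the kind of adapter this requires (not the exact code):

from urllib.parse import parse_qs

def get_params(event):
    # Lambda@Edge hands you the CloudFront request with a raw query string,
    # while the HTTP 2.0 / function URL event has it pre-parsed into a dict
    if "Records" in event:
        qs = event["Records"][0]["cf"]["request"]["querystring"]
        return {k: v[0] for k, v in parse_qs(qs).items()}
    return event.get("queryStringParameters") or {}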

API behind CDN?

Given that functions have their own URLs, I wondered if hosting them behind the CDN would avoid the need to embed specific URLs into the website (in some form or another). CNAMEs won’t work, as you’d need the function URL to provision/serve a certificate for the custom domain, which isn’t supported. Unfortunately, API Gateway does not have a continuous free tier. It also doesn’t support multiple backends for a single path, so my only options for hosting the Lambdas were either behind CloudFront or using their external URLs. I configured CloudFront with paths and origins for each function and began testing. Note that the following tests target a Lambda deployed to us-east-2 (Ohio). K6 also has their script runners located in us-east-2.

Running test from Ohio -> Ohio (AVG/P95)

  • only Lambda: 27/67
  • CloudFront -> Lambda: 86/180
only Lambda on left / CloudFront -> Lambda on right

Running the test from Seoul -> Ohio accentuated the difference even more (AVG/P95):

  • only Lambda: 190/236
  • CloudFront-> Lambda: 512/807
only Lambda on left / CloudFront -> Lambda on right

I also ended up updating the frontend to use the functions behind CloudFront and ran into the following issue: https://stackoverflow.com/a/985704 . Given that the Lambdas would be running on their own URLs, it seemed quite obvious to avoid CloudFront entirely and allow the browser to run as many parallel requests as possible.

Geographically Distributed Clients

Now that these lookups were “portable,” I could theoretically deploy them to every Lambda-supported region. Given that the game is developed in South Korea, I wanted to make sure the website performs well for those customers. I figured that Europe might as well host a Lambda, too.

  • Seoul -> Seoul: 24/54

Configuring the frontend to utilize the “nearest” Lambda was the next step. Fortunately, CloudFront returns an x-amz-cf-pop header. There is a hosted JSON file of these edge locations (website, JSON file). I wrote a script that uses the Google Maps Geolocation API to resolve the country to a Lat/Lng, then calculates the distance from each edge location to the edge location hosting a function (Gist of this file). A crude attempt, but it should be enough for 90% of customers. For an overview of corner cases, check out this article. The website is built by embedding a file of Terraform outputs which contain the function URLs. Looking up the set of functions to use is trivial after finding the “nearest” edge location. However, this construction could pose a problem if the functions were deleted while a client still has a cached version of the website.
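
The distance calculation itself is just a haversine over the resolved coordinates. A sketch (the region coordinates below are illustrative, not the actual edge location list):

import math

def haversine_km(lat1, lng1, lat2, lng2):
    # great-circle distance between two lat/lng pairs
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2
    )
    return 6371 * 2 * math.asin(math.sqrt(a))

# lat/lng near each region that hosts a function (illustrative values)
REGION_LOCATIONS = {
    "us-east-2": (39.96, -82.99),       # Ohio
    "ap-northeast-2": (37.57, 126.98),  # Seoul
    "eu-central-1": (50.11, 8.68),      # Frankfurt
}

def nearest_region(client_lat, client_lng):
    return min(
        REGION_LOCATIONS,
        key=lambda region: haversine_km(client_lat, client_lng, *REGION_LOCATIONS[region]),
    )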

Google Cloud -> AWS Migration

Now that the website and functions were created, it was time to migrate the domain’s NS servers from Google to Amazon so that an alias record could be used for the apex A record. Usually, these sorts of operations take a very long time to complete given that they are infrequent and cached in many places. After making the change to the NS records, I tried resolving A, NS, and SOA records both from my desktop and from the Google Cloud Shell. These queries failed to resolve for over 24 hours (from multiple DNS providers), so I figured I had missed something (oddly enough, https://mxtoolbox.com/ was able to resolve the queries over this time).

The mistake I made was not realizing that Google Cloud leaves a DNSSEC record even when Cloud DNS is removed (and after NS servers are changed). After enabling DNSSEC on Route53 and updating the values on Google Domains, all DNS providers started working again! This update only took a few minutes.

Decommissioning Supabase

Finally, I was ready to shut down the Supabase instance. I still wanted to keep a copy of the data so I looked into archival options. GCP supports snapshotting volumes and recently added “archive” snapshots for a lower price.

Final Results

Starting: $1.75 a day hosting GCP DNS/LB/Supabase
Ending: $0.05 a day using Route53/CloudFront/Lambda:
