How I built vx-data.com

design choices and lessons learned

Jeff Gensler
Aug 30, 2022 · 18 min read

Over the last two months, I built vx-data.com, a website for analyzing match replays from Veiled Experts, a tactical 5-versus-5 first-person shooter. I fell in love with the game during the public beta and wondered what the underlying data said about the state of the game. There were some recurring themes in the Global Discord, and I was curious whether the underlying data supported these claims. Given the tactical element of the game, I also wondered if certain strategies proved more successful than others. I figured these insights could also help new players and (hopefully) retain them to keep the player base healthy. Prior to building this website, I had built a similar data analysis project to analyze NHL games. That project proved unsuccessful but was a great learning experience in building models and understanding where alpha can come from. Given that both projects relied on scraping and analyzing data, there were many pieces of code that could be reused, and I was pretty confident in my choice of technologies. I was soon humbled and found out that there was still quite a bit I needed to learn about running data systems.

The Underlying Data

There are two websites that Nexon uses to publish the underlying Veiled Experts data:

The first is the “public” version of their website, which contains information on the Agents and Maps as well as the leaderboard for various stats in the game. The second website contains in-depth statistics for the maps, guns, and utility. It also contains individual player profiles and gives players a way to view their match replays. Specifically, the most important pages are the following:

Before a match, players select their Agent and configure Leptons. Agents have specific abilities (like healing during a match or one free self-revive when downed). Leptons also provide another way to augment abilities during combat (like gaining movement speed when health is low).

Pictures show different Agents. Icons show different Leptons.

The match replays are time-series data of player position and direction. The first team to win 4 rounds wins the match, and there are a maximum of 7 rounds total.

Between rounds, players can purchase better guns, ammo, and grenades, as well as “upgrades” that improve various abilities (such as allowing gun attachments or increasing reload speed).

Upgrade levels listed on the right

Engineering the MVP

Given that I want to use functions that summarize/describe the (X, Y) coordinates, I need a database specializing in geospatial queries.

Attempt 1: SQLite

I wanted to keep the project as simple as possible, so I initially looked at SQLite’s SpatiaLite extension. I figured using SQLite would help in the beginning as I didn’t want to pay for any servers and the database could be shared easily enough. Like every technology choice, this comes with its own set of tradeoffs. First, every SQL client needs the ability to load SQLite extensions. GUI tools (like DBeaver or VS Code), CLI tools (like dbt), and eventually Javascript in the browser would all need the ability to load this extension. While the technology worked as expected (creating points and lines from the match replay data), it was extremely cumbersome, so a switch was deemed necessary.
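As a concrete example of the friction, here is roughly what loading the extension looks like from Python. This is a minimal sketch, assuming mod_spatialite is installed on the host and the Python build allows extension loading; the database file name is a placeholder.

import sqlite3

conn = sqlite3.connect("vx_data.db")   # hypothetical database file
conn.enable_load_extension(True)       # not all Python builds allow this
conn.load_extension("mod_spatialite")  # the extension must be on the library path
conn.enable_load_extension(False)

# Initialize the SpatiaLite metadata tables before creating geometry columns
conn.execute("SELECT InitSpatialMetaData(1);")

Every other client (DBeaver, dbt, the browser) needs its own equivalent of those load_extension lines, which is where the approach started to fall apart.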

Attempt 2: PostgreSQL

In my opinion, PostgreSQL’s PostGIS represents the gold standard for GIS database functionality. Even SpatiaLite’s homepage references this project! I was especially interested in the pgRouting extension, which could be used to represent parts of the map using edge weights. One particular map, Shipyard, features a bomb site atop a massive ship. The attackers need to climb one of three scaffolds to reach the top or can attempt to flank the defenders. Either way, climbing up the scaffolds takes much longer than a defender jumping down, so the edge weight depends on the direction traveled. Having this library available would make such an analysis possible to represent and query.
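To make the direction-dependent cost idea concrete, here is a rough sketch of the kind of query pgRouting enables. I never built this; the edge table and vertex IDs are hypothetical, and it assumes PostGIS and pgRouting are installed in the database.

import psycopg2

conn = psycopg2.connect("dbname=vxdata user=postgres")  # placeholder connection

# pgr_dijkstra takes an SQL snippet describing the edges. Separate cost and
# reverse_cost columns let "climbing up" a scaffold cost more than "jumping down".
query = """
SELECT seq, node, edge, cost
FROM pgr_dijkstra(
    'SELECT id, source, target, cost, reverse_cost FROM shipyard_edges',
    1,    -- hypothetical start vertex (attacker spawn)
    42,   -- hypothetical end vertex (top of the ship)
    true  -- directed graph
);
"""

with conn.cursor() as cur:
    cur.execute(query)
    for seq, node, edge, cost in cur.fetchall():
        print(seq, node, edge, cost)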

There isn’t much that needs to be said about PostgreSQL. It performed as expected and I didn’t run into any obstacles for the majority of the project.

The front of the ship from the attacker’s side. A,B,C indicate the main route. L and R indicate the scaffolds on left and right side.

Data Storage: Supabase

Supabase provides a convenient bundle of open source components useful for building web applications. PostgreSQL provides the database layer, while PostgREST and Kong provide the API and authentication layers. The Supabase JS library also makes it easy to access the API layer, and building a website based on tabular data takes no time at all. They also provide a hosted version of this bundle. I used it for a bit but quickly hit the 500MB storage limit and moved the services to a GCP instance. Supabase provides a “studio” dashboard that can be used to enable popular Postgres extensions (PostGIS being one of them!), so there was very little operational work required to get something running.
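Because the API layer is PostgREST behind Kong, any HTTP client works. A minimal sketch of what a query looks like; the table and column names here are hypothetical, and the URL/key values are placeholders.

import requests

SUPABASE_URL = "https://your-project.supabase.co"  # placeholder
SUPABASE_ANON_KEY = "..."                          # placeholder

# PostgREST turns query parameters into SQL, roughly:
#   SELECT nickname, kda FROM player_summary WHERE map_name = 'Derailed' LIMIT 10;
resp = requests.get(
    f"{SUPABASE_URL}/rest/v1/player_summary",  # hypothetical table
    params={"select": "nickname,kda", "map_name": "eq.Derailed", "limit": 10},
    headers={
        "apikey": SUPABASE_ANON_KEY,
        "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
    },
)
rows = resp.json()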

I didn’t find an official Go client library for Supabase, so I decided to use GORM to upload the match data. I thought that schema migration might help in the future if I didn’t get something quite right in the beginning. I did end up needing it to add a CreatedAt field to help with dbt incremental models. GORM was able to handle the column addition, but I still ended up running SQL manually to backfill the column post-migration.

Data Preparation: dbt

The health of this project depends on the ability to create features from the underlying data. From the dbt website:

dbt enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.

dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

In short, you write SQL to transform your data. In my case, I want to predict who might win a round of a match. To do this, I extract various features from the match. There were three categories of features I considered:

  • team-independent features: map name, bomb site locations
  • team-dependent, round-independent features: aggregate rolling Kill/Death/Assist (KDA), aggregate Agent selection, aggregate Lepton selection
  • team-dependent, round-dependent features: aggregate paths traveled, upgrade level, defender/attacker side

A simplification of the tables used to create a table with features and predictors

While I was able to write many SQL files to create each feature, I do wonder if there is a better way to “manage” the features. Because there is an aggregation step between player and team, I must decide which aggregate function to use, write the SQL that performs it, and then name the resulting features. A good example would be a team with players at very high and very low KDA. These players may be able to “outplay” the other team independent of their teammates, and a MAX function may be suitable to capture their presence. The closest thing I found to a way of describing the features and their relations was an example from the Deep Feature Synthesis part of the Featuretools documentation.

Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the “one” DataFrame, the “parent DataFrame”. A relationship between a parent and child is defined like this:

(parent_dataframe, parent_column, child_dataframe, child_column)

relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]

After, you can instruct DFS which aggregation functions to use.

ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "min", "std", "skew"],
    trans_primitives=["time_since_previous"],
)

In my case, I want a framework similar to the one above to describe the possible features and their possible aggregations, but represented in SQL/dbt instead of in Python/pandas. Better yet, a syntax that both tools could utilize would be great. Such a framework might even extend as far back as parsing the JSON out of a response body.
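To illustrate what I mean, here is a hypothetical sketch of such a framework: a small declaration of player-level columns and their allowed aggregations that renders into the SELECT clause of a dbt model. None of this exists in the project today; the column and model names are made up.

# Hypothetical: declare player-level columns and the aggregations to apply
# when rolling them up to the team level, then render SQL for a dbt model.
FEATURE_SPEC = {
    "lag_kills":     ["sum", "max"],
    "lag_deaths":    ["sum"],
    "lag_assists":   ["sum"],
    "upgrade_level": ["avg", "min"],
}

def render_select_clause(spec: dict) -> str:
    lines = []
    for column, aggs in spec.items():
        for agg in aggs:
            lines.append(f"{agg}({column}) AS {agg}_f_{column}")
    return ",\n  ".join(lines)

print(f"""
SELECT
  match_id,
  team_id,
  {render_select_clause(FEATURE_SPEC)}
FROM {{{{ ref('player_round_features') }}}}  -- hypothetical upstream dbt model
GROUP BY match_id, team_id
""")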

Actual dbt DAG (maybe this issue could help communicate meaning in sections of the graph)

Data Analysis: Jupyter + sklearn

Like PostgreSQL, Jupyter Notebooks and sklearn models hardly need any introduction. There are countless tutorials, and the surrounding libraries have been used for quite some time. I used the VS Code extension to explore and analyze the data created by the dbt scripts. Coupled with the existing SSH tunnel, pandas’ read_sql_table function makes it incredibly easy to download data.
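For reference, the download step is only a couple of lines. This is a sketch: the connection string and local port are placeholders for whatever the SSH tunnel exposes.

import pandas as pd
from sqlalchemy import create_engine

# Assumes something like: ssh -L 5433:localhost:5432 user@gcp-instance
engine = create_engine("postgresql+psycopg2://user:password@localhost:5433/postgres")

# Pull the dbt-built table straight into a DataFrame
features_predictors = pd.read_sql_table("features_predictors", engine, schema="public")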

I had two goals when analyzing the match data:

  • game balance: were there features that could predict the outcome of a round?
  • player improvement: given a model (or set of models), can I recommend a change in player behavior that could improve the outcome of a round?

Typically, most video games have a “meta” which consists of the strongest items, characters, or decisions. The developers of the video game make adjustments over time that affect the metagame. Usually, this coincides with new character releases (often so that they are stronger). Changes in the metagame help keep the game exciting and offer a new challenge. Players depend on patch notes to hypothesize how balance adjustments might impact their gameplay.

Given these two goals, I figured I would use Decision Trees for both. My previous data project showed that Decision Trees performed as well as other models. A Decision Tree can also be visualized, which can provide some insight into certain patterns in the data. Like other models, there are several parameters to tune, and they heavily influence how much the model overfits or underfits the data. Cost Complexity Pruning can provide some guidance on parameter values.

Below is an example attempting to predict the winner for attackers on the “Derailed” map.

  • sum_f_nicki, sum_f_dmitry: defender agent selection features
  • sum_f_lag_assists: team summary features of lagging KDA for defending team
  • sum_f_w826: the “Compensation” lepton
  • sum_f_c_site_river, others: player path features

part of the tree used in CCP

When ccp_alpha is set to zero and the other DecisionTreeClassifier parameters are kept at their defaults, the tree overfits, leading to 100% training accuracy and 88% testing accuracy. As alpha increases, more of the tree is pruned, creating a decision tree that generalizes better. In this example, setting ccp_alpha=0.015 maximizes the testing accuracy.

Similar to the example provided in the scikit-learn docs, we can see a convexity in the accuracy-versus-alpha curve as the model gradually shifts from overfit to underfit.
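The pruning sweep itself follows the scikit-learn example closely. A condensed sketch, assuming X and y are the feature matrix and round-winner labels built by the dbt models:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the effective alphas at which nodes would be pruned
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Train one tree per alpha and compare train/test accuracy to pick ccp_alpha
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    print(alpha, clf.score(X_train, y_train), clf.score(X_test, y_test))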

I won’t dive into the results here, but I was surprised by some of what I found while exploring the match data. Overall, there were not many features that gave either attackers or defenders an overwhelming advantage. The path the players took is a somewhat controversial feature, as it uses the outcome of the round to predict itself. However, it showed considerable correlation with winners and losers, and it sets the stage for the next round of analysis.

With game balance explored, player improvement remained. I haven’t found a convenient way to translate the various decision tree values into a summary. Most of the trees contain “obvious” conclusions, such as an attacker failing to make it to the defender side. However, the tree does communicate how catastrophic it is to not have at least one attacker make it to the defender side.

highlighting attackers losing the round
highlighting attackers winning the round

In the above tree, I could hypothesize the following explanation for the blue nodes:

  • when sum_f_defender_side <= 2.5 == False: get at least 3 players to the opponent side
  • when sum_f_a_site_defender_inner_left <= 1.5 == True : this is likely rounds where the attacking team is not traveling to the A bomb site
  • when sum_f_d_site <= 2.5 == True: a slightly favorable but mostly equal distribution of wins and losses. Fewer attackers on site might indicate inability to defend it.
  • when sum_f_d_site <= 2.5 == False: favorable that attackers will win. More attackers on site also means fewer attackers elsewhere on the map and could allow better “trades” when the defenders try to reclaim site.

While exploring the attackers’ ability to win, we may have also uncovered something about the balance of the game’s bomb site locations. Even if all attackers don’t make it to the D bomb site, there is still a slight advantage toward winning, which might indicate that defenders are unable to reclaim the site after a plant.

Given that these interpretations are still up to the observer, I decided to generate trees using a variety of filters (like bomb site location and round number). I also aggregated the features into “groups” and trained some models using a subset of feature groups. Using a combination of filters and feature groups, I present all models on the website along with filters to view models for a specific scenario.
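The generation loop is conceptually simple. This is a hedged sketch of the idea, not the actual code: the filter values, the label column, and some of the feature group columns are hypothetical, and features_predictors is the DataFrame loaded earlier.

import itertools
from sklearn.tree import DecisionTreeClassifier

# Hypothetical scenario filters and feature groups
FILTERS = {"map_name": ["Derailed", "Shipyard"], "round_number": [1, 2, 3, 4, 5, 6, 7]}
FEATURE_GROUPS = {
    "agents":   ["sum_f_nicki", "sum_f_dmitry"],
    "paths":    ["sum_f_c_site_river", "sum_f_defender_side"],
    "upgrades": ["sum_f_upgrade_level"],   # hypothetical column
}

models = {}
for map_name, round_number in itertools.product(*FILTERS.values()):
    subset = features_predictors[
        (features_predictors.map_name == map_name)
        & (features_predictors.round_number == round_number)
    ]
    for group_name, columns in FEATURE_GROUPS.items():
        clf = DecisionTreeClassifier(ccp_alpha=0.015, random_state=0)
        clf.fit(subset[columns], subset["attacker_won"])  # hypothetical label column
        models[(map_name, round_number, group_name)] = clf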

Training Matrix: Prefect

In my previous project, I tried using Airflow and Luigi for task execution when training models. I had more success with Luigi but figured I should try something new. Initially, I used Dagster to build the graph of training jobs. Writing the DAG was a bit troublesome, as I needed to build factory functions for individual nodes and the code was not easy to modify or extend. I also ran into some issues controlling concurrency, so I looked for another task management abstraction. Prefect has a similar annotation-based structure but functions more as a work queue than a DAG. Controlling the concurrency was easy enough with the Dask task runner. I wasn’t yet able to figure out how to manage the underlying features_predictors DataFrame using Dask, so each filter-train task reads from a cached version on disk.
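A minimal sketch of what that structure looks like, assuming Prefect 2 with the prefect-dask task runner; the task body, flow name, and cache path are placeholders rather than the project's actual code.

import pandas as pd
from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task
def train_for_filter(map_name: str) -> str:
    # Each task re-reads the cached DataFrame rather than passing it through Dask
    df = pd.read_parquet("features_predictors.parquet")  # placeholder cache path
    subset = df[df.map_name == map_name]
    # ... fit a DecisionTreeClassifier on `subset` and persist it ...
    return f"trained {map_name} on {len(subset)} rows"

@flow(task_runner=DaskTaskRunner(cluster_kwargs={"n_workers": 4}))
def train_all(maps):
    for map_name in maps:
        train_for_filter.submit(map_name)

if __name__ == "__main__":
    train_all(["Derailed", "Shipyard"])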

Prefect frontend for visualizing task and flow duration
Dask also has a frontend to visualize task progress and performance

Continued Engineering Decisions

Somewhere along the way, I was unhappy with how long dbt was taking to run. I had initially begun the project using only replays from one map, as I wanted to prove the feature creation and model generation process end-to-end. Even if vertically scaling the PostgreSQL instance could alleviate the problem in the short term, I really didn’t want to pay for it and was curious whether another product could provide better speed for the same cost.

Attempt 3: Citus

From their website:

Citus is an open source extension to Postgres that distributes data and queries across multiple nodes in a cluster. …

Citus transforms Postgres into a distributed database with features like sharding, a distributed SQL engine, reference tables, and distributed tables.

Citus provides a docker-compose file to get started with their example. I figured I could use this file and create a Docker Swarm cluster of spot instances on GCP. While the Swarm cluster was easy to get running, I was unable to get the example working when the workers were scheduled on separate hosts.

Somewhere in the middle of building Citus on Swarm, I took a break to write the dbt plugin necessary for creating distributed tables. This turned out to be easier than expected, as you can use Jinja macros defined in other adapters, which left only the unique parts of table creation to implement. Here is the example snippet that runs the Citus function create_distributed_table() after the Postgres table is created. For creating a new table from a large source table (like from a dbt --full-refresh), I am not sure this is the preferred strategy. I was more interested in an MVP to prove this integration was even possible at all.
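The snippet itself lived in the adapter's Jinja macros, but at its core it issues the Citus call below. This is a rough sketch with a hypothetical distribution column and connection string, not the adapter code itself.

import psycopg2

conn = psycopg2.connect("dbname=vxdata user=postgres")  # placeholder connection
conn.autocommit = True

with conn.cursor() as cur:
    # After the regular Postgres table exists, shard it across the Citus workers
    # by the chosen distribution column (column choice is hypothetical here).
    cur.execute("SELECT create_distributed_table('match_replay_round_path', 'match_id');")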

Overall, I had high hopes for Citus. It would allow me to stay in the PostgreSQL ecosystem and keep the various PostGIS functions potentially available. Unfortunately, this didn’t work, so I figured it was time to look for a hosted solution.

Attempt 4: BigQuery

I saw that BigQuery supported some geospatial functions, so I migrated some data over and started rewriting some of the dbt scripts (every database seems to have its own implementation of JSON path extraction 😢). I quickly found out that BigQuery doesn’t support arbitrary coordinate systems:

Constraints

Latitudes must be in the range [-90, 90]. Latitudes outside this range will result in an error.

Longitudes outside the range [-180, 180] are allowed; ST_GEOGPOINT uses the input longitude modulo 360 to obtain a longitude within [-180, 180].

- link

Attempt 5: Athena

Having reasonable doubt after trying BigQuery, I rewrote a smaller subsection of the migration and found that Athena does allow arbitrary latitude and longitude values for its st_point function.
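For reference, a minimal sketch of such a check using pyathena; the staging bucket, region, coordinates, and polygon are placeholders, and it assumes an Athena engine version with the ST_* geospatial functions.

from pyathena import connect

cursor = connect(
    s3_staging_dir="s3://vx-data-athena-results/",  # placeholder bucket
    region_name="us-west-2",                        # placeholder region
).cursor()

# Game-world (X, Y) values like these would be rejected as latitude/longitude
# by BigQuery's ST_GEOGPOINT, but Athena accepts them without complaint.
cursor.execute("""
    SELECT ST_Contains(
        ST_Polygon('polygon ((0 0, 0 3000, 3000 3000, 3000 0, 0 0))'),
        ST_Point(2500.0, 1800.0)
    )
""")
print(cursor.fetchall())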

I used Airbyte to load the various gorm_* tables into the S3 bucket. Then, I spent part of a day rewriting all of the JSON parsing (yet another style!) and was able to get the transformations working. There seems to be some static overhead in the connector implementation (about 5–6 seconds per table) but this was negligible compared to the rest of the data:

Airbyte migrates data to S3 as Parquet files by default. Airbyte migrated this table to about 300 files of roughly 25MB each, which ends up being about 7.5 gigabytes of compressed Parquet files. (Aside: when exploring Citus, I wanted to understand how the create_distributed_table function behaved with the most problematic table I had: the round replay data. I used pg_dump to save and load the round paths, which ended up at around 80GB uncompressed.)

Athena was able to handle parsing (bytes -> JSON -> MULTIPOINT) the Parquet version of the path data in about 4 minutes

public.match_replay_round_path ........... [OK -1 in 234.20s]

and classification (st_contains) in about 8 minutes

public.match_replay_path_classifier ...... [OK -1 in 488.49s]

I wasn’t able to save the exact timing of a --full-refresh with the scripts on the original PostgreSQL server, but it was definitely more than 10 minutes.

One drawback of Athena is the lack of indexing. Unless you utilize partitioning, Athena will scan all data when aggregating results. I was initially worried this would result in extra costs, but I spent less than $2 running several --full-refresh queries during the migration. In my case, most of this application’s workload is incremental, and most data is transferred out only after being summarized. I only really run incrementals when new features are added, and even then I can utilize the --select flag to only refresh certain tables. This migration also allowed me to size down the PostgreSQL instance, given it was only needed to host nickname searches and a few other summary tables.

The Last Big Step: Hosting the Round Explorer

After the Athena migration, I spent most of my time on the frontend. I had already built the page to show and filter the various decision trees but it was missing pieces to help individual players. Finally, I worked on the following:

  • leaderboard with segmentation based on number of rounds played. The existing leaderboards were based on total games played and didn’t reflect the K/D ratio and win percentage.
  • search bar which links to an individual player page. I had only scraped a portion of the data when I started investigating Citus, figuring I could resume the scraping later once I had found a suitable way to store the remaining data. By the time I had ruled out Citus, Nexon had decided to delete (or hide) the match replays. I needed to communicate the “completeness” of the match replays, so I added a bar showing the percentage of match replays over the number of matches from the global match list. Using the available match replays, I also show the map/round summary to highlight potential weaknesses on each map. The numbers shown are percentiles versus other players, so you can easily tell if you are performing better or worse than the rest of the player base.

The last component to implement is the Match Replay. This will help me overlay the map geometry features with the individual paths. It will also highlight each team’s choice of Agents and Leptons and display a comparison of each team’s percentile ranking for each round.

All of the gorm_* tables still existed in PostgreSQL, but all the parsing had moved to Athena. I tried using Airbyte to migrate the S3 files of parsed round data back to Postgres. There was some error at the end of the Airbyte job, and Airbyte retries a sync up to two more times if an error occurs. As this was a very slow process, I had let it run in the background while I worked on other tasks. The next day, I found that I had spent $18 on S3 egress transfer! Yikes! I figured this might be why Nexon shut down the match replays, given that I had downloaded close to 80GB worth of data as JSON.

Although there may not be a way around S3 egress charges, I really didn’t want my tiny PostgreSQL server to be responsible for hosting the massive table of parsed replays. Maybe there is a way to host the Parquet files directly?

parquet-wasm to the rescue!

Fortunately, there is a way to parse Parquet files in the browser: parquet-wasm. There is another pair of libraries that are “pure JavaScript” and “pure TypeScript”, but these are for use in Node.js and not browser-based JavaScript. parquet-wasm had some examples in the README, and it only took a bit of tinkering to get the imports working in React. I ended up manually using ESM modules even though installation via npm tried to import the “bundler” version of the package.

As I was already rendering the models page (about 3000 decision trees), I wanted to refactor that component first to use Parquet. The dot representation of a graph contains newlines, so I had base64-encoded the dot column (I think I had problems un-pickling the DataFrame before b64 encoding). To keep the run configuration used for model training in the same map, I had base64-encoded a JSON-encoded version of this map. Given that Parquet would be serialized as bytes (and not plaintext via PostgREST), I could try flattening the run config and leaving the raw dot data in a DataFrame before saving it as Parquet to the GCP bucket. That should allow the compression library to operate on the repeated strings of the dot files (this SO post refers to this file for an overview). As expected, this resulted in a considerable improvement when hosting the models:

before: using PostgREST + base64 encoded columns
after: using parquet
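The write side of that change is only a few lines. This is a rough sketch rather than the exact code: the flattened columns and the compression codec are illustrative.

import pandas as pd

# One row per trained model: the run config flattened into plain columns and
# the dot source kept as a raw string so columnar compression can work on it.
models_df = pd.DataFrame(
    {
        "map_name": ["Derailed"],          # flattened run config (illustrative)
        "round_number": [1],
        "feature_group": ["paths"],
        "dot": ["digraph Tree { ... }"],   # raw graphviz dot source, no base64
    }
)

models_df.to_parquet("models.parquet", engine="pyarrow", compression="zstd")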

I attribute most of the decrease to having initially saved the data in the “wrong” format. Moving data from PostgreSQL to a bucket could also remove a few layers of possible server-client buffering. After seeing this improvement, I realized that removing the labels from each node in the decision tree and only labeling the root node would both visually shrink the size of the tree and result in fewer bytes of dot data. A small but easy win:

I recorded the timing from before/after (both bucket and server are in the same GCP region):

timing before and after
headers before and after (I’ll finalize the cache-control header value when things finalize)

While inspecting these requests, I noticed that neither PostgREST nor Kong appears to compress any of the JSON data, so moving as much data as possible out of PostgreSQL and into Parquet files seems like a good decision going forward. Caching of data in the buckets is easy to configure. The Cache-Control headers are set as metadata per file, which means I could potentially utilize this strategy for quickly changing data (like the global or player match lists). As the individual match details never change after the match is over, storing this content outside of a database and treating it as static content keeps the system simple and could allow a CDN to cache content even closer to customers.
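Setting that header is a one-off metadata update per object. A sketch with the google-cloud-storage client; the bucket and object names are placeholders, and the max-age is still up in the air.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("vx-data-static")         # placeholder bucket
blob = bucket.blob("models/models.parquet")      # placeholder object

# Match details never change once the match is over, so a long max-age is safe;
# faster-moving files (like the match lists) would get a shorter one.
blob.cache_control = "public, max-age=86400"
blob.patch()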

The Work Ahead

While I still have to finalize the round explorer page, I have no doubt that the data will be hosted via Parquet files. I may explore refactoring the scraper to save directly to the S3 bucket, which would simplify the system’s ingestion. The only other computation required of the database is:

  • searches via nickname (8MB)
  • getting the list of recent matches per-player (15MB)

Given that Athena worked well for the database layer, perhaps a Function-as-a-Service could work with an embedded Parquet file and DuckDB, as sketched below. I have not used the second version of the GCP Cloud Functions product, but the first version could support 100MB compressed. AWS Lambda containers can support 10GB, which would provide plenty of space for the tables to grow.
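The DuckDB side of that idea is only a few lines. This sketch assumes the two tables above were exported as Parquet files bundled with the function; the file and column names are hypothetical.

import duckdb

con = duckdb.connect()  # in-memory; the Parquet files ship inside the function image

# Nickname search straight off the Parquet file
rows = con.execute(
    "SELECT nickname, player_id "
    "FROM read_parquet('nicknames.parquet') "
    "WHERE nickname ILIKE ? LIMIT 20",
    ["%gensler%"],
).fetchall()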

I suspect there will be plenty more to investigate after most of the data movement and hosting parts are built. Besides, most of the value will come from the insights drawn from the data and not necessarily from the technology chosen to efficiently store and serve the underlying data.

Edit: you can read the next steps for the Round Explorer here:

Lessons Learned

I think the biggest surprise was how useful Athena is (actually Presto/Trino). While the architecture initially seems less performant than a traditional database, its simplicity actually fits quite well for this type of project. There isn’t a need for online calculation/prediction, so staying cheap “at rest” is a higher priority than performance.

When looking for databases that support geospatial queries, I found that many have support for the basics (st_point, st_contains). Given my lack of success with Citus, I am curious what other, larger geospatial datasets use for their storage layer.
