Creating a clean, maintainable and powerful dataset can be a real challenge, especially when collecting data from evolving sources. Changes at the data source need to be reflected in the format of the collected data, and maintaining compatibility over time can be incredibly time-consuming and frustrating - a growing backlog of ETL and data-munging janitorial tasks that have to come before the actual analysis of the data.
If you’re responsible for instrumenting your data sources and determining formats, there are a few helpful heuristics that can keep these janitorial tasks from eating your time in the future - and many ways of making the data easier to analyze and interact with.
One of the key considerations for designing a healthy dataset is making sure that the values you collect use the best type for what you’re measuring. By (irresponsibly) oversimplifying the world of computing, we can say that there are five key types to use.
Assigning the right type to an attribute can make all the difference in the world: doing analysis across types is painful, time consuming and only gets more so at scale. Having worked with a ton of in-house analytics solutions, we’ve seen first-hand on many occasions how costly a poorly typed attribute can be.
A great example of a poorly typed value is an attribute Latency that returns a string like “24 ms” rather than a simple number like 24. The engineer responsible for this real-world example almost certainly thought including the unit of measurement in the value was a nice way of documenting his code, but he had inadvertently shot himself in the foot - you can’t ask for the average of a string. His string value generated great human-readable logs for his latency checks, but didn’t allow for simple aggregate analysis. In a NoSQL environment he’d have to run a MapReduce job to get the answer, and in a SQL environment he’d have to write unnecessarily complex, low-performance queries that would run the risk of crashing the program. And if another string like “unknown”, something that couldn’t be meaningfully converted to a number, made its way into the mix, you’d have a serious headache.
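To see the difference in practice, here’s a minimal Python sketch (the attribute name and values are illustrative) contrasting a well-typed latency attribute with the string version:

```python
# Hypothetical latency records: one attribute well typed, one stored as strings.
numeric_latencies = [24, 31, 18]                   # plain numbers aggregate trivially
string_latencies = ["24 ms", "31 ms", "unknown"]   # strings must be parsed first

# The numeric column averages in one line.
average_latency = sum(numeric_latencies) / len(numeric_latencies)

def parse_ms(value):
    """The fragile parsing the string version forces on every consumer."""
    number, unit = value.split()   # raises ValueError on "unknown"
    assert unit == "ms"
    return float(number)
```

The numeric column needs no ceremony at all; the string column needs a parser that any stray value like “unknown” will break.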
Having strongly typed data makes such undesirable states unrepresentable. This is why many experienced data wizards hate the .csv format - the incredibly helpful type information isn’t carried between the commas. We built a schema inference engine and a version control system for data models into our platform, Traintracks, to make sure data is always perfectly type-safe, which lets us generate interfaces that strictly prohibit malformed questions. Strongly typed data pays for itself every day.
A healthy dataset will contain unique identifiers that make sure you can drill down on specific entities. Although something like username can be a nice thing to store, it’s unique identifiers like userID or sessionID that allow you to isolate specific entities without risk of confusion or bad data. We strongly recommend recording all unique identifiers as strings - even if they look like numbers to the human eye. Experience tells us that most queries you’ll want to run about unique identifiers will not be what the minimum, maximum or average of your userID attribute is - but rather finding userIDs that match, contain or begin with something.
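A small illustration of why string identifiers behave better (the IDs here are made up):

```python
# Hypothetical userIDs recorded as strings, even though they look numeric.
user_ids = ["0012345", "0012399", "8801234"]

# Typical ID queries are lexical - match, contain, begin with:
matching = [uid for uid in user_ids if uid.startswith("0012")]

# Storing the same IDs as integers would silently drop the leading zeros,
# turning "0012345" into 12345 and breaking prefix queries entirely.
```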
As the latency example above showed, it’s helpful to make sure you record your measurements as pure numbers. Strive to have a consistent use of units across your dataset - don’t mix seconds with milliseconds! Imagine the horror of having to sort out the latency example above if the stored value could be both “237 ms” and “0.51 s”.
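One way to enforce that consistency is to normalize every measurement to a single unit at ingestion; a minimal sketch, with hypothetical helper names:

```python
# Convert any supported unit to milliseconds before the value is stored.
UNIT_TO_MS = {"ms": 1.0, "s": 1000.0}

def to_milliseconds(value, unit):
    return value * UNIT_TO_MS[unit]   # raises KeyError on an unrecognized unit
```

Failing loudly on an unrecognized unit at ingestion is far cheaper than discovering mixed units in a stored dataset later.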
One of the most common questions about things like sessions is how long the session lasted - or the average time someone spends in your application. Things like sessions can be approached in multiple ways, and many of them have dangerous pitfalls. If you assign each sent event a sessionID you can easily do analysis on what happened in that session, but finding the first and last event of every session is a typical batch job that doesn’t play nice with real-time streaming ambitions. You can elect to have Start Session and End Session event types, each with their own timestamp - but subtracting the starting timestamp from the end for every session is another batch job that will throw a spanner in the works.
Most SDK-based analytics solutions solve this for you at ingestion by including a session duration at the end of the session, or by having predefined reports that pre-calculate and cache the answers to give you the illusion of speed. For real-time use cases we recommend solving duration at ingestion, not during analysis: include the duration of the session as a number, in the time unit of your choice, at the end of the session. Be careful about adding the current session length to each individual event if you don’t have clearly marked session endings - you’ll run into the problem of finding the last event of each session again.
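A minimal sketch of that recommendation - the collector remembers each session’s start and stamps a plain numeric duration onto the End Session event (the event shapes and field names are illustrative):

```python
# In-memory map from sessionID to its start timestamp (milliseconds).
session_starts = {}

def start_session(session_id, timestamp_ms):
    session_starts[session_id] = timestamp_ms
    return {"type": "Start Session", "sessionID": session_id,
            "timestamp": timestamp_ms}

def end_session(session_id, timestamp_ms):
    # Duration is computed once, at ingestion, and stored as a plain number.
    duration_ms = timestamp_ms - session_starts.pop(session_id)
    return {"type": "End Session", "sessionID": session_id,
            "timestamp": timestamp_ms, "durationMs": duration_ms}
```

Average session length then becomes a simple aggregate over durationMs - no batch job pairing starts with ends.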
Don’t be lazy or promiscuous with your use of booleans. If you set gender as a boolean, which we’ve seen on more than one occasion, you’ll constantly be answering questions about which is which. Tracking changes in the state of a boolean between events is also a batch job, and not conducive to real-time analysis. It is often better to have the data source remember its previous state and report state changes in the event than to look for state changes later.
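The “remember previous state, emit changes” idea might look like this (a sketch; the class and field names are ours):

```python
class StateReporter:
    """Remembers the source's previous state and only emits an event on change."""

    def __init__(self, initial_state):
        self.state = initial_state

    def update(self, new_state):
        if new_state == self.state:
            return None   # nothing changed, nothing to report
        event = {"from": self.state, "to": new_state}
        self.state = new_state
        return event
```

The transition events arrive pre-computed, so a streaming consumer never has to diff consecutive events to find where a boolean flipped.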
Some funny hacks with numbers
You can often take some pretty humorous and sometimes convenient shortcuts by storing things as a number. One funny example we came across was recording gender as 0 or 1 (numerically), which allowed them to very quickly answer ad hoc questions about demographics for an arbitrary grouping. The average of that numerical value gave you a percentage - our users in Beijing are 47% female. Amusing in its elegance, it still isn’t best practice to create obscure formats like that. However, that use case later inspired a clever way of tracking how one-sided conversations were in a dating application. Within each chat (with its own unique identifier) every individual chat message could be assigned a numerical interactivity value of 1 or 2, depending on whether the initiator of the conversation sent the message or not. A conversation with an average interactivity of 1.5 meant both parties had sent the same number of messages - and any chat with a pure 1 was a completely ignored advance.
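The interactivity trick is easy to recreate; a sketch assuming a per-message list of senders (the function and parameter names are ours):

```python
def interactivity(senders, initiator):
    """Score each message 1 if the initiator sent it, 2 otherwise,
    and return the chat's average interactivity."""
    scores = [1 if sender == initiator else 2 for sender in senders]
    return sum(scores) / len(scores)
```

A perfectly balanced chat averages 1.5; a string of unanswered messages from the initiator averages exactly 1.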
Ultimately, creating a healthy dataset is a difficult task with a moving target. Make sure you can iterate on your formats, and think carefully about your types; if you can’t iterate on the schema going forward, pay extra attention when setting up your tables. Make sure you’ve REALLY thought things through, because a bad format will cost you every time you interact with it.
Frankly, we felt that the goal of perfectly clean and healthy datasets was practically unrealistic. We built Traintracks to be a scalable and performant way of having a mutable schema on a perfectly type-safe and immutable dataset, and we’re pretty excited about it.