Designing Classes for Data Aggregation Pipelines
This post is less of a tutorial and more a walkthrough of my internal thought process when designing classes for a complex project. It's mostly a tidied-up version of notes I made while going through the process.
I'm working on a project that involves getting candid user data from various sources. By that I mean... tweets, reddit posts, discord posts, etc. Some content is public, like Twitter, and some is private, like Discord. The ultimate goal is to create an object-oriented pipeline where a user can simply call a service with some parameters and get back data formatted to their specifications, from an app with a GUI, a Jupyter notebook, or anywhere else. Easy peasy. (/s, in case that wasn't obvious)
I decided to separate the data collection and processing. It was tempting to try to write a class that would handle any and all services and methods... but ultimately I think that having separate base classes for collection and processing is more scalable and less prone to spaghettification. (There's a joke somewhere about code that starts out clean but eventually becomes spaghetti as it approaches a tech-debt event horizon but I can't settle on sufficiently punchy phrasing.)
I'll talk more about the processing steps in another post. For now, zeroing in on just the data collection part, I started thinking about what all these services have in common (the base class), and what would be class-specific.
This is where frameworks and libraries and SDKs can become more of a hindrance than a help. I'm used to using all these tools when building complex apps and automations, and most of the project's initial functions were written with various helper tools. And there's nothing wrong with that, usually. But they vary quite a lot, and it doesn't make sense to install a separate framework or SDK for every service, then write code against each one's idioms, when you're more or less trying to do the exact same thing for each service. I really needed to strip it down to bare-metal code. (Well, to a point. I'm writing Python with minimal libraries, not machine code or butterflies, emacs or otherwise.)
So I started out with Tweepy, Discord.py, etc., and ended up with Python's requests library. The bulk of this project is reading API docs and digging to find all the relevant endpoints, learning how to correctly format requests, drilling into the response objects properly to get back what I want, etc. For example, in Discord each channel has a unique id that is independent of its guild, so the endpoint does not include the guild id (contrary to what I expected). In Twitter's API, to find tweets by username you must first look up the user by their username and then get the id from that response. And so on.
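To make those two quirks concrete, here's roughly what the raw calls look like with requests. The endpoints are Twitter's v2 and Discord's v10 APIs as of this writing; the tokens, username, and channel id are placeholders.

```python
import requests

BEARER = "YOUR_TWITTER_BEARER_TOKEN"  # placeholder
BOT_TOKEN = "YOUR_DISCORD_BOT_TOKEN"  # placeholder

# Twitter: you can't fetch tweets by username directly; first resolve the
# username to a numeric user id, then request that user's tweets.
user = requests.get(
    "https://api.twitter.com/2/users/by/username/some_username",
    headers={"Authorization": f"Bearer {BEARER}"},
).json()

tweets = requests.get(
    f"https://api.twitter.com/2/users/{user['data']['id']}/tweets",
    headers={"Authorization": f"Bearer {BEARER}"},
).json()

# Discord: channel ids are globally unique snowflakes, so the messages
# endpoint never mentions the guild at all.
messages = requests.get(
    "https://discord.com/api/v10/channels/1234567890/messages",
    headers={"Authorization": f"Bot {BOT_TOKEN}"},
).json()
```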
Once I had all that figured out, I sat down to make a more formal list of the commonalities and differences among services. It's not intended to be comprehensive, just some things I noticed.
Commonalities and Differences Among Services
Things that are universal (or so near as to be functionally universal):
- A request with one or more credentials (some combination of SECRETS, TOKENS, and KEYS)
- A payload response
Things that are very, very common:
- Requirement to create an account, and often to register as a developer/set up a registered app in order to get the required credentials
Things that vary a lot:
- Payload format/response attributes (e.g. 'content' vs 'text' fields, different kinds of nesting, etc)
- Header format (where you put your SECRETS and/or KEYS, whether you need to specify a content-type, etc)
- Endpoint and method names
Some of this was pretty obvious. Of course all data requests will involve a request and a response. But it was helpful to lay everything out in a readable way for myself. Then I was able to look at the list and decide on what I needed, given that I wanted to make an object that was as standardized as possible for different services.
My list ended up being that each object:
- Should instantiate by passing the required keys
- Should have a method that fetches data based on user-supplied parameters (e.g. username, date range, etc)
- Should output a formatted dataframe with the same fields across all services
- Should have a "help" method that will tell the user which credential are necessary (e.g. some services use Bearer tokens, some use Bot tokens) and a linke or explanation on how to find/obtain the credentials
- Should have very informative error handling; it's hard to keep all the little differences straight, and it's very frustrating as a developer to move from one service to another, similar-but-different service
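To make the help and error-handling requirements concrete, here's roughly the shape I have in mind. This is a sketch rather than the project's actual code; the REQUIRED_CREDS attribute and this Discord class body are my own placeholders.

```python
# Sketch only: REQUIRED_CREDS and this class body are placeholders.
class Discord:
    REQUIRED_CREDS = {
        "bot_token": "a Bot token from the Discord Developer Portal "
                     "(https://discord.com/developers/applications)",
    }

    @classmethod
    def help(cls):
        """List which credentials this service needs and where to get them."""
        for name, how in cls.REQUIRED_CREDS.items():
            print(f"{name}: {how}")

    def __init__(self, **creds):
        missing = set(self.REQUIRED_CREDS) - set(creds)
        if missing:
            raise ValueError(
                f"Discord is missing credentials: {sorted(missing)}. "
                "Call Discord.help() to see how to obtain them."
            )
        self.creds = creds
```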
Another way I could have done this would be to make a single DataCollection class, give it a "type", and have the class itself contain the logic to fetch and process each service with a match statement. But I like my way better.
Class Design
The DataCollection class holds the validation function and the standardized dataframe columns. I left room to eventually add the ability for users to define their own object attributes/dataframe columns, but for now each object instantiates just by passing in credentials. It also holds the logic to convert the data into a pandas DataFrame for easy processing.
Each individual class has its own methods to request and parse data into a 2d array. I chose to separate these for flexibility; this way, a user can get the raw data, the 2d array, or the dataframe, depending on their needs.
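To make that split concrete, here's a minimal sketch of the structure, assuming a Twitter bearer token; the standard column names and the parse method name are illustrative placeholders, not the project's final shape.

```python
import pandas as pd
import requests

class DataCollection:
    """Base class: standardized columns and dataframe conversion."""

    COLUMNS = ["service", "id", "timestamp", "content"]  # illustrative

    def make_df(self, payload):
        # Convert a raw payload into a DataFrame via the subclass's parse step
        return pd.DataFrame(self.parse(payload), columns=self.COLUMNS)

class Twitter(DataCollection):
    def __init__(self, bearer_token):
        self.headers = {"Authorization": f"Bearer {bearer_token}"}

    def get_tweets(self, username, count):
        """Fetch the raw API payload for a user's recent tweets."""
        user = requests.get(
            f"https://api.twitter.com/2/users/by/username/{username}",
            headers=self.headers,
        ).json()
        return requests.get(
            f"https://api.twitter.com/2/users/{user['data']['id']}/tweets",
            params={"max_results": count, "tweet.fields": "created_at"},
            headers=self.headers,
        ).json()

    def parse(self, payload):
        """Flatten the raw payload into a 2d array matching COLUMNS."""
        return [["twitter", t["id"], t["created_at"], t["text"]]
                for t in payload.get("data", [])]
```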
The process looks something like:
```python
twitter = Twitter('creds')
tweets = twitter.get_tweets('username', count)
twitter_df = twitter.make_df(tweets)
```
I could combine these 3 method calls into one, but I've split them for greater flexibility. For example, some use cases may require the data in list format instead of a dataframe. If it turns out that the overwhelming majority of users just want a pandas dataframe, I will reconsider, and either combine the methods or (more likely) add a new method that chains the others (sketched below). But I think the current method (and methods! cue rimshot) is readable and easy to follow, instead of being a mystery box where creds go in, ???, df comes out.
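For the record, the chained version would be trivial; something like this hypothetical method on each class (the name is mine, not the project's):

```python
# Hypothetical convenience method chaining fetch -> dataframe;
# not in the project yet, just the shape it would take.
def get_tweets_df(self, username, count):
    return self.make_df(self.get_tweets(username, count))
```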
There is a certain amount of repetition in the individual classes because the requests have differently named parameters. For example, Twitter has "max_results" and Discord has "limit" as query parameters. As of this writing, I plan to condense the actual request logic into the base class and have that function take in a path and headers. A little more elegant.
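The plan looks roughly like this; the `_request` name is my placeholder:

```python
import requests

class DataCollection:
    def _request(self, path, headers, params=None):
        """Shared HTTP logic; subclasses pass in their specific details."""
        resp = requests.get(path, headers=headers, params=params)
        resp.raise_for_status()  # surface HTTP errors loudly
        return resp.json()

# Subclasses then differ only in what they pass in, e.g.:
#   Twitter: self._request(url, self.headers, {"max_results": count})
#   Discord: self._request(url, self.headers, {"limit": count})
```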
Then came deciding what to name the functions. As everyone knows, this is the hardest part of programming. Just ask /r/programmerhumor! It was a genuine challenge, though. On the one hand, uniformity is good: it's ideal to have the same pattern across all objects. On the other hand, it's confusing to tell someone to fetch tweets by calling a method named "get_posts." In the end, I decided to name the fetching functions semantically, after their collection names, even though that means the user has to type something different for each service. I'm not sure this is the best approach, and I may change it in the future. I want to minimize the amount of service-specific documentation that devs have to read, and I'm not sure whether it's more or less intuitive to have the methods named after each service's own nouns. Thoughts on the subject are welcome!
Ultimately, the objects and functions themselves don't require that much logic (some of the data processing does, and I'll talk about that in another post). This section of the project was more about noticing overarching patterns and taking the time to dig through and understand the API docs for a ton of services. Except for paginating tweets... which doesn't require that much logic, exactly; it's just always a (necessary) pain dealing with pagination tokens.
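For the curious, the dance looks roughly like this with Twitter v2's next_token/pagination_token mechanism; the loop structure and names are my own sketch:

```python
import requests

def get_all_tweets(user_id, headers, max_pages=10):
    """Page through a user's timeline until next_token runs out."""
    url = f"https://api.twitter.com/2/users/{user_id}/tweets"
    params = {"max_results": 100}
    tweets = []
    for _ in range(max_pages):
        page = requests.get(url, headers=headers, params=params).json()
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if not next_token:
            break  # no more pages
        params["pagination_token"] = next_token
    return tweets
```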
It's funny because my job title currently is technically Senior Integrations Engineer, but I've been doing straight software engineering for the past 6-8 months. This is the first time I've flexed those "integrations" muscles in quite a while!
At the time of this writing I have built out Twitter and Discord to fetch a user-provided number of tweets from a user or a user-provided number of messages from a channel, respectively. (Hence the heavy reliance on those two for examples!) I am following the same process/patterns for Reddit and Kaggle, which will be fully built out in the upcoming weeks. Microsoft Teams and Slack follow conventions very similar to Discord's and will be easy to add should the need arise.
In upcoming posts I'll talk about the processing side of the pipeline, as well as eventual deployment and building components that will allow non-technical folks to use them with a graphical interface.