Between November and December 2012, I built SoundScout as a way to discover up-and-coming musicians on SoundCloud. What started as a simple command-line script rapidly evolved into a Ruby app running on Sinatra, powered by two gems, deployed to Heroku, fueled by Sidekiq workers managed by a Redis instance. Building SoundScout was one of the most challenging projects I’ve ever undertaken, and one of the most satisfying by far.
It all started when I realized I wanted to find more people to follow on SoundCloud. My first impulse was to scroll through artists in my Rdio collection and then search for each name on SoundCloud in turn, but after a few minutes of rapid tapping, a realization kicked in: A computer could do this!
A quick skim of Rdio’s API documentation confirmed my hope that I’d be able to get the info I needed through their API. Meanwhile, past experience with the SoundCloud API reassured me that it would be smooth sailing on that end. Before long, it occurred to me: this had the potential to be a great final project for CS50, Harvard’s introductory computer science course (and one of my favorite classes last semester). A conversation with Erik convinced me that I could actually go beyond finding the SoundCloud accounts of my favorite Rdio artists: the higher good would be to identify new artists to check out based on the revealed preferences embedded in my Rdio data. And so, it was decided. With that decision, the next month of my life was more or less spoken for.
This is the story of that month, what I learned along the way, and where I ended up.
In the beginning, the basic workflow I had in mind went like this:
- Authorize a user’s SoundCloud and Rdio accounts in order to access personalized data.
- Grab a list of the user’s most-listened-to Rdio artists. (Relying on Rdio for the initial dataset provided the additional advantage of data “cleanliness”; artist names would be spelled correctly and the list of names would arrive free of duplicates.)
- Iterate through the whole list of Rdio artist names; use each artist’s name to seed a separate search on SoundCloud.
- Probe into the superset of SoundCloud results using SoundCloud’s API; grab relevant info such as number of tracks, number of followers, recency of the latest upload, etc.
- Run a custom algorithm against that data to select the “best” of the SoundCloud results for any given Rdio artist.
- Return a permalink to the top few artists’ pages on SoundCloud.
Spreadsheets are my secret weapon, so my first version of the app returned results as a CSV.
By opening the CSV in Excel and turning it into a sortable table, I was able to experiment with sorting by most followers to least followers (and vice versa) and the recency of track uploads. I clicked a few of the permalinks that came out on top to see if the method had any merit. The only problem was that every time I clicked through, I got lost listening to tracks and clicking around on SoundCloud some more. I eventually concluded that this was the opposite of a problem: my plan was working.
Clicking around some more, I noticed that many of the artists’ profiles sported extensive bios. Further examination revealed that many of these bios listed the allegedly “matching” Rdio artist name within the text itself. Artists on SoundCloud were using the text of their bio to list influences, inspirations, and bands they’d opened for in the past. This was a totally emergent behavior (as far as I could tell), but also impressively reliable: I couldn’t find many cases where the pattern didn’t hold.
The current version of SoundCloud’s search API casts a wide net, but I’d stumbled across a great way to filter the results. Given a search results superset seeded by a particular Rdio artist’s name, I could boil it down to just the results that included the seeding artist’s name in the text of the bio. In Ruby (my programming language of choice), I could accomplish that filtration with the following code:
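A minimal sketch of that kind of filter follows; it is an illustration rather than the original snippet. It assumes each search result has already been parsed into a hash whose `description` key holds the bio text (the field name is an assumption here), and `bio_mentions?` and `filter_by_bio` are hypothetical helper names. The sample usernames and bios are made up.

```ruby
# Hypothetical helper: true when the bio text mentions the seeding artist's name.
# Matching is case-insensitive, and profiles with no bio never match.
def bio_mentions?(bio, artist_name)
  return false if bio.nil? || artist_name.to_s.strip.empty?
  bio.downcase.include?(artist_name.downcase)
end

# Keep only the search results whose bio cites the seeding Rdio artist.
# Each result is assumed to be a hash with a "description" key (the bio).
def filter_by_bio(results, artist_name)
  results.select { |result| bio_mentions?(result["description"], artist_name) }
end

# Made-up sample data standing in for one SoundCloud search response.
results = [
  { "username" => "band_a", "description" => "Influences: Metric, M83" },
  { "username" => "band_b", "description" => "We play loud." },
  { "username" => "band_c", "description" => nil }
]

matches = filter_by_bio(results, "Metric")
puts matches.map { |r| r["username"] }.inspect
```

A plain substring match like this is deliberately loose: it also matches a name that appears as an ordinary word, so it works best combined with a second signal.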
With that breakthrough, I had my primary mechanism working, and the results were already starting to look promising. The next step was to graduate from a CSV to a real database.
Rather than go the familiar Rails/ActiveRecord route, I followed a path known to me from past projects: I quickly settled on Sinatra as my web framework and DataMapper as my ORM (an object-relational mapper, a kind of interface to the underlying database). Rather than inserting my results into a CSV line by line, I could create reusable database entries and connections between those entries. This meant that I could store an Rdio artist’s name in one table, associate that artist with a user of my project stored in another table, and then link both to the info for multiple SoundCloud artists stored in a third table.
Rather than crown one “most promising” SoundCloud pick for any given Rdio artist right off the bat, I decided to stash all the members of the subset and their attendant API-derived properties in my local database. Objects in a database can be sifted and sorted in much the same way that rows in a spreadsheet can. Storing info for a large number of results in my local database would let me grab a rich dataset and then use all of its properties as fodder for the development of my algorithm.
Writing the algorithm was the most interesting part of building SoundScout. In boiling down the whole universe of artists to just a handful worthy of displaying to the user, I had to think about what qualities I valued and what metrics had the potential to signal the presence of those qualities. For instance, I decided early on that I wanted to highlight “up-and-coming” artists. To me, this meant artists without too many followers on SoundCloud. But how many is “too many”? Did an arbitrary cutoff make sense, or would I need to determine the cutoff for each user in the context of the rest of the artists in their pool of candidates? I also wanted to showcase artists who would be fun and rewarding to follow—those whose actions on SoundCloud communicated that every listener mattered. And follower count and platform engagement were just two of the properties I examined. I took seven main factors into account, awarded points for each one, and summed all those points into an overall score. Here’s the full list of factors:
- Follower Count Suitability: I awarded points such that the max number went to artists with 2,000 followers; the slope then descended linearly on both sides such that artists with 0 followers and 4,000 followers would each receive zero points for this measure. In a future iteration, I would probably revise this to take into account the context of the user’s candidate pool. I could order the candidates from least followers to most followers and then award the max number of points at the “happy medium,” or midpoint.
- Upload Recency: I wanted to reward artists who were in the middle of an active posting streak, and so intended for the points to shake out so as to favor ultra-recency. Once I applied this measure to my own personal candidate pool, though, I realized that favoring ultra-recent posters unnecessarily and somewhat arbitrarily narrowed the field. I revised the criterion to focus on artists who had posted in the last 10 weeks, or 2.5 months. I could have tweaked that particular cutoff, but the idea was to favor artists for whom SoundCloud engagement was currently top-of-mind—perhaps they were in the middle of releasing and promoting a new album, perhaps they were in the studio and sharing behind-the-scenes work. I’d like to make this metric more useful, perhaps again by tweaking the dials according to the characteristics of a particular candidate pool.
- Join Recency: Another “freshness” metric; I thought it would be nice to shower artists new to the platform with attention, so I rewarded artists whose first upload took place within the past year or so.
- Upload Frequency: I was looking for artists who uploaded, ideally, an average of once per week. To determine this, I just subtracted the date of the least recent track upload from the date of the most recent track upload to get the span of days over which the artist had been uploading, then divided the number of tracks into that span. Unfortunately, this was a very crude metric; to really figure out upload frequency, I should have sought some metric that referred to the distribution or the average gap of time between uploads. Instead, my version of “upload frequency” really just asked: if they’ve been actively uploading over a span of a lot of weeks, do they at least have a commensurate number of tracks to show for it?
- Platform Engagement: The goal here was to enable reciprocity; I wanted to reward artists who were using SoundCloud to the fullest, experimenting with its features, and supporting other sound creators. This measure had two components: number of fellow SoundClouders followed by the user, and number of favorites doled out. I awarded the max number of points available at 500 fellow users followed and 25 likes given. Those cutoffs were arbitrary, but I took them to be sufficient proof of engagement.
- Venn: This was a measure of how much “overlap” a SoundCloud artist’s bio text had with a SoundScout user’s list of favorite Rdio artists. A SoundCloud user who cited Metric, Arcade Fire, and M83 as influences would receive a Venn score of 3 against my collection of favorite Rdio artists, since all of those artists are on my list. Any artist inspired by all three would be an artist I’d be very likely to want to check out. Heavily favoring artists with a “Venn” of at least 2 also had the advantage of suppressing anomalies created when an Rdio artist’s name was also a common word. “Metric” is a good example of that, actually; if “Metric” were the lone word matched in a SoundCloud artist’s bio text, it could easily be them proclaiming their love of the metric system. But when at least two artist names were present in the bio text, I found that they were typically being cited as I expected—that is, as artistic influences. Another criterion followed from this one, then: Confirmation, which just meant that the artist’s bio cited at least two relevant Rdio favorites. Any Venn number of 2 or greater would lead to a Confirmation value of 2, which I used as a multiplier on the final score.
- Has Avatar: I took the presence of an avatar (a picture uploaded to represent the artist on their profile) as a positive signal: for artists without many followers, it provided one way to demonstrate commitment to and engagement with the platform. Also: without an avatar, the track-streaming widgets I planned to display on my results page wouldn’t look right. So this was a practical necessity, too. I allotted “Has Avatar” a value of 0 (for no avatar) or 1 (avatar present), intending to use it as a multiplier in the final evaluation.
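Three of the factors above are concrete enough to sketch in code. The block below is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, and `MAX_POINTS` is an assumed value, since the actual point totals aren't given. The shapes themselves (a peak at 2,000 followers falling to zero at 0 and 4,000, the upload span divided by track count, and a count of favorite artists named in the bio) come straight from the descriptions.

```ruby
require "date"

# Assumed maximum for a single factor; the real point values aren't stated.
MAX_POINTS = 10.0

# Follower Count Suitability: a triangular score that peaks at 2,000 followers
# and falls linearly to zero at 0 and at 4,000 (and stays zero beyond that).
def follower_points(followers, peak: 2_000, max_points: MAX_POINTS)
  distance = (followers - peak).abs
  return 0.0 if distance >= peak
  max_points * (1.0 - distance.to_f / peak)
end

# Upload Frequency (the crude version): days between the oldest and newest
# upload, divided by the number of tracks. Around 7.0 means roughly weekly.
def days_per_track(upload_dates)
  return nil if upload_dates.empty?
  span_days = (upload_dates.max - upload_dates.min).to_i
  span_days.to_f / upload_dates.size
end

# Venn: how many of the user's favorite Rdio artists are named in the bio.
def venn_score(bio, favorites)
  text = bio.to_s.downcase
  favorites.count { |name| text.include?(name.downcase) }
end

puts follower_points(2_000) # peak of the triangle
puts follower_points(1_000) # halfway up the left slope
puts follower_points(4_000) # right edge, zero points

uploads = [Date.new(2012, 11, 1), Date.new(2012, 11, 8),
           Date.new(2012, 11, 15), Date.new(2012, 11, 29)]
puts days_per_track(uploads)

favorites = ["Metric", "Arcade Fire", "M83"]
puts venn_score("Influences: Metric, Arcade Fire, M83", favorites)
```

The context-sensitive version of the follower score would amount to computing `peak` from the candidate pool (for example, its median follower count) instead of hard-coding 2,000.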
My ultimate scoring equation looked like this:
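The equation itself isn't reproduced here, so what follows is a reconstruction from the factor descriptions, not the original formula. It assumes the additive factors are summed and the total is then multiplied by Confirmation (2 when Venn is at least 2; a neutral 1 otherwise is an assumption) and by Has Avatar (0 or 1). The function names and the sample point values are hypothetical.

```ruby
# Confirmation multiplier: 2 when the bio cites at least two of the user's
# favorites. Using 1 (no effect) below that threshold is an assumption.
def confirmation(venn)
  venn >= 2 ? 2 : 1
end

# Sum the per-factor points, then apply the two multipliers.
# `points` maps factor names to the points awarded for that factor.
def overall_score(points, venn:, has_avatar:)
  avatar_multiplier = has_avatar ? 1 : 0
  points.values.sum * confirmation(venn) * avatar_multiplier
end

# Made-up point values for one candidate artist.
points = {
  follower_count: 8.0,
  upload_recency: 5.0,
  join_recency: 2.0,
  upload_frequency: 4.0,
  platform_engagement: 6.0
}

puts overall_score(points, venn: 3, has_avatar: true)  # confirmed, has avatar
puts overall_score(points, venn: 1, has_avatar: true)  # unconfirmed
puts overall_score(points, venn: 3, has_avatar: false) # no avatar, zeroed out
```

Treating Has Avatar as a multiplier rather than a bonus means a missing avatar removes a candidate entirely, which matches its role as a practical requirement for the results page.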
But before I could test out the algorithm, I had to fetch data to feed to it. And fetching the data turned out to be a very, very long process at first.
Fetching Rdio favorites wasn’t so bad; I just had to issue a single API call and then stash all the resulting artists in my database, row by row. (It actually wasn’t quite that simple, but only because I’d missed a crucial feature of the Rdio API; more on that later. Anyway, calling out to Rdio was the fast part.) But fetching SoundCloud results took much, much longer. Let’s say my call to the Rdio API gave me 100 artist names. To run 100 individual searches, each seeded by one of those names, I’d need to make 100 separate calls to the SoundCloud API. Each of those calls could return up to 200 results; I’d need to insert all of the datapoints for each result into my table of prospects and create appropriate connections with all the other tables, too. Furthermore, to get some of the data I wanted, I’d need to make an additional per-artist call to that artist’s page of tracks, to figure out things like number of tracks and upload recency. Basically, each Rdio artist name was the tip of an iceberg that involved inserting hundreds of database rows and making multiple SoundCloud API calls, all of which took time…and all of which I was performing in serial.
That’s right: serial. At first, I only had one SoundCloud API client running. If that client was running a search, it wasn’t grabbing an artist’s tracks; furthermore, whenever the app reached the point in the loop where it inserted a bunch of rows into the database, my one SoundCloud API client would sit idle. As a result, running the script against the 600+ artists in my Rdio collection took days—yes, days. I knew that couldn’t be right, but I didn’t know how to fix it. So I waited.
Waiting wasn’t too bad. I had other classes to attend to, so I spent the better part of a week mostly just babysitting the script when I woke up in the morning and before I went to sleep at night, making sure that it didn’t get hung up on errors and checking in on its progress. Finally, I had my dataset. I ran my algorithm draft and mostly rejoiced: the results were interesting. I felt vindicated, and motivated to continue. But in order to move forward, I’d need to learn a thing or two.
What I needed was to learn how to run background jobs. That phrase had actually kicked around in my head and over text message conversations as I’d watched the script continue its slooooow progress over multiple days. I’d heard of “background jobs” before, and I had a feeling they were the solution, but I didn’t know the “right” way to implement them in the context of Ruby. Some Google searching suggested that Resque was pretty popular, but when I told Erik about my plan, he suggested I check out Sidekiq. Resque is great, and has some strong Google juice because it’s been around for a while. But Sidekiq is an up-and-coming complement/alternative (sort of like the musicians I was seeking to highlight!); as further evidence, Erik showed me the "Background Jobs" page on Ruby Toolbox, which made it clear that Sidekiq is on the rise. As an added plus, Sidekiq’s “Examples” folder included a Sinatra example that looked like it would help me get everything set up right. I was sold.
It took me about half a day to get the hang of Sidekiq, but once it finally clicked, my coding world turned inside out. What had taken days now took minutes. Instead of making API calls wait patiently in line, Sidekiq allowed me to spin up as many as 20 simultaneous API clients, each of which could go run searches and retrieve data independently. Since my app was all about casting a wide net, stashing the detailed results locally, and then running a one-time algorithm on the results, it presented an ideal use case for background jobs…and Sidekiq was the perfect companion.
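The shift from serial to concurrent fetching can be illustrated without Sidekiq or Redis. The sketch below is not Sidekiq code: it uses plain Ruby threads and a `Queue` as a stand-in for a pool of 20 workers, and `fetch_artist` is a hypothetical placeholder for one SoundCloud search, with a short `sleep` simulating network latency.

```ruby
# Hypothetical stand-in for one SoundCloud search; a real version would call
# the API and write rows to the database.
def fetch_artist(name)
  sleep 0.01 # simulated network latency
  "results for #{name}"
end

# Run fetch_artist for every name using a fixed-size pool of worker threads.
def fetch_all(names, worker_count: 20)
  jobs = Queue.new
  names.each { |name| jobs << name }
  jobs.close # once drained, pop returns nil and each worker exits its loop

  results = {}
  lock = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      while (name = jobs.pop)
        data = fetch_artist(name)
        lock.synchronize { results[name] = data } # guard the shared hash
      end
    end
  end

  workers.each(&:join)
  results
end

names = (1..100).map { |i| "artist #{i}" }
results = fetch_all(names)
puts results.size
```

With Sidekiq, the equivalent of `fetch_artist` lives in the `perform` method of a class that includes `Sidekiq::Worker`, jobs are enqueued with `perform_async`, and Redis holds the queue, so the work survives restarts and can run in a separate process from the web app.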
I felt like I was getting close. There was still a lot to do—design a front end, deploy to Heroku, refine the algorithm, hunt down bugs—but the core of the app was working, and I felt like the hard part was behind me. Normally, this is the point in a narrative where I’d say “little did I know how wrong I was”…but fortunately, and to my huge relief, I was mostly right. Designing front ends and deploying to Heroku and equation-tweaking and bug-hunting aren’t exactly easy, but they were all at least paths I’d been down in the past. Background jobs had been the big unknown, but tackling them had provided the biggest payoff. In the case of SoundScout, I learned to run background jobs mostly out of necessity; the app just wasn’t viable without them. But I think the same logic applies even in less-dire situations: great unknowns can yield incredible surprises. I’ll try to remember that the next time familiarity beckons.
The clock was still ticking. For the purposes of CS50, the app was due at noon on a Sunday. The night before, I skipped sleep entirely—knocking out the front end, deployment, and algorithm refinements one by one. As the final half-hour ran down, I captured some screencast footage in a hurry; part of the assignment was to create a short video showcasing your app. I found a Creative Commons-licensed track by a band called Method to use as the soundtrack, rushed to edit the video in iMovie, and uploaded it to YouTube in the nick of time.
Time ran out. The assignment was over. But I wasn’t done. I wanted to see if SoundScout would work for people besides me. So I tweeted out an oblique invitation to try out the app I’d been working on, and got a few responses. I sent those people a link to the site, refreshed my database connection to watch their SoundCloud and Rdio authorizations appear, and then the Sidekiq workers were off to the races.
It was thrilling to see people really using a thing I really built. It was also maddening to see assumptions break down in practice. In one case, the entire app got derailed because a certain artist name derived from Rdio exceeded the default max string length set by DataMapper. (Fixing that was as simple as setting a new, higher maximum manually…but hunting down the bug in the first place took forever!) And extremes on the spectrum presented their own problems; searches seeded with just two Rdio artists led to thin, almost unusable results; a candidate pool derived from an Rdio list of 1000+ artists took so long to sift on the fly that a page load timed out. (The solution there was to cache results in a separate table as part of the initial search process.) But the most interesting realization hit slowly, and arrived as a result of a back-and-forth with my friend Patrick.
Throughout this recap, I’ve been using “members of a user’s Rdio collection” and “Rdio favorites” interchangeably. But they’re not actually interchangeable at all, and their incompatibility was a major source of frustration throughout the development process. What I wanted was a shortlist of each user’s most-listened-to or otherwise most-beloved Rdio artists; what I settled for, after a too-cursory initial read of the documentation, was artists whose albums belonged to the user’s Rdio collection. Rdio collections can be gigantic, as demonstrated by the 1000+ artists case. But without the ability to surface the most important artists in any given collection, I opted to prefer a wide net over an arbitrarily exclusive one.
After trying the app for himself, Patrick asked how I was determining which Rdio artists to focus on. I explained my reasoning, and he said something like: “Weird. Seems like most-listened-to artists is something the Rdio API should have.” To prove my point, I loaded up Rdio’s API documentation to confirm the URL before sharing it with him. But as I scrolled up and down the page, my heart sank and exploded: it was there all along. Rdio has a concept of “Heavy Rotation”: the most-listened-to artists and albums in a network. Turns out, they expose per-user Heavy Rotation through the API as well, complete with a hits count representing relative importance in a user’s collection.
I took a screenshot of my conversation with Patrick as the realization dawned on me in real time.
Moral of the story: don’t give up so easily! The tool you wish existed probably already does. Keep looking, keep asking, keep trying. I hope I’ll never resign myself to apparent limitations so quickly again.
As it was, the late-breaking epiphany led to a madcap rewrite on zero sleep…which, all things considered, turned out to be pretty fun. I think part of me didn’t want it to be over, and my “how could I have missed this?!” moment kept the adrenaline going.
A month later, I still miss that adrenaline sometimes. I miss the terror of the unknown getting toppled by the thrill of discovery, over and over again. But I’ve got a new project in mind; a new unknown I’m ready to face. I’ll find my way down the rabbit hole again.
With thanks to two great Pauls: Paul Bowden, my CS50 TF, and Paul Osman, SoundCloud’s incredibly helpful API Evangelist. (Go check out his new SoundCloud API lesson on Codecademy!) The longest hug goes to Erik, for everything.