The Phantom Subscriptions

Hard Refresh Podcast

English - September 16, 2016 19:37 - 29 minutes - 32.4 MB - ★★★★★ - 7 ratings

Pippin Williamson had a big problem: Thousands of business owners were charging the wrong people for subscriptions to their products — because of his software. No pressure, Pippin! You got this.

Pippin Williamson had a big problem: thousands of business owners were using his software to take payments online, but the latest version was charging the wrong customers for subscriptions. Hear how a coding mistake with data from Paypal and Stripe platforms put businesses at risk, and how they repaired the damage.


About our Guests

Pippin Williamson is the founder of several widely-used WordPress plugins, including Easy Digital Downloads, and an avid teacher in the WordPress community. Learn more about his work and products at pippinsplugins.com.


Chris Klosowski is the Co-Lead Developer of Easy Digital Downloads. Learn more about Chris at chrisk.io.

@DouglasDetrick @hardrefreshshow @pippinsplugins thanks for having us on. Loved the episode and can't wait for the next!


— Chris Klosowski (@cklosowski) September 17, 2016



Show Notes

Find Easy Digital Downloads and the Recurring Payments Extension here.


The episode of the Apply Filters Podcast where Pippin and Brad Touesnard first discussed this issue can be found here.


The Hard Refresh theme is by The Brow, the latest in a long line of beat projects from genre-bending producer Marcus Williams. His ability to blend the visceral headnod qualities of golden era hip hop with dream pop and indie rock has captured fans across the world. He has been a steady fixture in the Portland scene through the years and continues to remain versatile – these tracks can be uptempo and playful one minute, then gritty and determined the next, but always confidently crafted with a unique penchant for sonic detail.


Interstitial music is by The Ocular Concern from their album “Sister Cities,” a danceable and soul-satisfying project co-led by Andrew Oliver and Daniel Duval that ties together West African music, tango, jazz and chamber music. It was produced in Portland, OR, and is released on PJCE Records.


Thanks to Pippin and Chris for sharing their story. Thanks to Fastson for typewriter sounds, Dnlburnett for the office sounds, and freesound.org.


Show Transcript
0:00 – 0:56 — Intro

Intro:

[Catherine] More and more, our lives and businesses depend on internet technology. These are stories about the people who pick up the pieces when it all falls apart. Welcome to Hard Refresh.

Music begins; overlay Pippin describing the bug and ramifications, stops as the beat drops

[Catherine] Pippin Williamson had a big problem. A bug in the latest version of his popular software was charging the wrong customers for new subscriptions. And the worst part about it? It could have been 10 customers, or 10 million.
[Douglas] Hard Refresh is a production of Rocket Lift, a web development company based in Portland, Oregon. I’m your host, Douglas Detrick, and I’m glad you’re with me.

0:56 – 2:36 — How did this all start?

[Pippin] “Well, I’ll go back quite a ways and if it’s too far…”

[Douglas] That’s Pippin. He’s the creator of Easy Digital Downloads, a WordPress plugin that allows business owners to sell digital products on their own websites—pretty much anything that can be downloaded. In April 2016 the plugin was installed on over 50 thousand websites, and things were going well with the business. That is until this story starts…
[Pippin] “We had two customers come to us and say ‘hey I’ve got this weird issue where a charge has been assigned to the wrong customer. Can you help me figure out what’s wrong?’ In development there’s an idea that if you see a weird problem, it could be a bug, it could be a conflict with some other system they’re running on their site, or who knows what.”
[Douglas] Maybe their WordPress software was out of date. Maybe some other plugin was interfering with Easy Digital Downloads. Maybe this was a server problem. Maybe Pippin’s customer just made some honest mistake that caused this, and they didn’t know it.
[Pippin] “But if you ever see that same thing a second time, you’ve got an actual bug in your code.”
[Douglas] And that was the situation he was in: two reports about the same issue. They thought the first support request was a result of random chance.
[Pippin] “We didn’t see any logical explanation for why it would happen. And so when logic says there’s no straightforward answer, it’s probably a weird edge case, or a weird conflict.”
[Douglas] An “edge case” is a software problem that arises only in a particular situation that the vast majority of users don’t encounter. So, when the second support request came in, Pippin knew this wasn’t just an edge case any more.
[Pippin] “Once I had that realization that something was wrong, it took about 30 minutes to identify the bug and figure out exactly why it was happening. Then it was about 72 hours of shitstorm trying to fix it, and freaking out.”
[Douglas] You know when you buy an apple, then you notice it’s got a little bad spot, and then you find the whole fruit is rotten beneath the skin? This bug was like that.

2:36 – 4:10 Chris Klosowski was on vacation…

[Douglas] One of the things we want to do with this podcast is to show how problems with code online can affect our real lives. Well, here’s a great example. Here’s Chris Klosowski.
[Chris] “Pippin and I share the lead responsibility of managing the overall project and direction that Easy Digital Downloads heads in.”
[Douglas] When you work in software development, bugs happen. Sometimes bugs happen when you’re on vacation.
[Chris] “We don’t see our families a lot. So when we take vacations, I’ll take like two weeks and work remotely wherever we’re at as a family, and take a week off when we see my parents. So we were actually in Florida.”
[Chris] “I think we had just gotten to her dad’s…”
[Douglas] And Chris received a phone call from Pippin.
[Chris] “We basically got there and it was like ‘hey, this is going on…’”
[Douglas] “This is going on” as in “we’ve got a big problem but don’t freak out…”
[Chris] “It was bad, but it was one of our best customers that kind of looked at it and saw what was going on, and they were really good at debugging and getting us the information that we needed. We were able to figure out exactly why it happened. That moment was a little bit of panic, but at the same time, that’s the software industry. I’ve been doing it for about six and a half years. It’s not foreign to me that all of a sudden we have to jump on and just start working.”
[Douglas] Software developer goes on vacation. Software developer stops vacationing to fix bug while kids go play on the beach in Florida. It’s a classic tale.

4:10 – 5:45 Back to Pippin: What was at stake?

[Douglas] So we know how that moment went down, when they discovered this bug. But what was at stake? Here’s Pippin again.
[Pippin] “Well, a lot, frankly.”
[Douglas] Remember that the sole function of Easy Digital Downloads and the Recurring Payments extension is to sell products online, and charge customers on a recurring basis. Pippin discovered the product was charging customers for subscriptions that they didn’t buy.
[Pippin] “There could be customers that were suddenly paying for someone else’s subscription. There could be customers who had their subscription canceled incorrectly…”
[Douglas] The product was charging for subscriptions that customers didn’t buy. The problem undermined the entire purpose of the product, and Pippin had no idea how many people might be affected.
[Pippin] “We knew how many customers we had of the plugin. And we knew two or three reports of actual customers that had been affected by it. Beyond that the only thing we could do was speculate.”
[Douglas] The way the plugin works is that a store owner runs the plugin on his or her website, and then uses it to sell to their customers. So, both groups of people could be harmed by this bug.
[Pippin] “We know how many customers we have of the product. But what we don’t know is how many customers all of those sites have. We knew that one site had 250 customers. We knew that another site had thirty thousand customers. Who knows? Is there another site that had fifty thousand? Two hundred thousand? A million?”
[Douglas] Now, Pippin’s company definitely survived this bug, but at this moment in the story, it seemed like maybe it wouldn’t. They feared they had a huge problem where their customers could be losing…
[Pippin] “Dozens, hundreds, or thousands of customers. We were responsible for that. And we were responsible for companies potentially going out of business.”
[Douglas] Destruction on a massive scale.

5:45 – 7:45 — So, What was the problem?

[Douglas] When Easy Digital Downloads started it was limited to single transactions, and business owners were asking for a way to sell ongoing subscriptions. So, the Recurring Payments extension was born. But, the first version of the extension was very basic. As Pippin says, it was…
[Pippin] “…to be frank, kind of shitty. It was very minimal, it was an MVP.”
[Douglas] So, they turned Recurring Payments into a much more robust product. But to do that, they needed to completely re-make the plugin. And that’s where things start to get interesting.
[Pippin] “So we have a bunch of data in the database that’s a record of the subscriptions that customers have to the site.”
[Douglas] Ten bucks a month for this customer, 20 bucks a month for that customer…
[Pippin] “The new version of the plugin, in order to offer all the new features and enhancements that we wanted to, required a completely different structure for the database.”
[Douglas] But you can’t just do a mass copy and paste—each data point has to be carefully fit into a slot in the new database.
[Pippin] “During this upgrade routine we would pull the old data and put it in the new data and there’d be some information that was missing. So we’d have to call Stripe.com or call Paypal to retrieve information about that subscription.”
[Douglas] Remember that we could have millions of database entries that have to be changed—that’s the scale we’re talking about. To understand this, imagine a big room full of people, a hundred of them sitting at a hundred desks, each with two stacks of paper in folders—old data on the left, new data on the right—and a typewriter and a phone in the middle.
[typewriter sound design starts]
[Douglas] No, I mean BIG room.
[lots of reverb as if in a big room]
[Douglas] And I said a hundred people, right?
[Multiply the typewriter sounds, phones ringing, people talking…]
[Douglas] Ah, there we go. Now, for each database entry, the worker picks up a folder from the left, and looks for blanks that need to be filled in to get the entry ready for the new database. If there are blanks, they pick up the phone and they call Stripe and get someone to tell them the customer id number, then hang up, type the number in, and put the finished folder in the pile on the right. Now, those hundred people in that room might be able to do four hundred entries in an hour, and at that rate, doing a million entries would take 2500 hours. All this is just to say: this is why we have computers, folks. A million entries can be done by a computer in a few seconds.
[ends with return bell sound]

7:45 – 10:40 How did they do it?

[Douglas] So, rather than the room full of telephones and typewriters, they wrote some new code to add to the plugin, an upgrade routine that would convert all the old data to the new format.
[Pippin] “So during this upgrade routine, we go through a loop. We say here are all the old subscribers, let’s loop through them one by one and retrieve and update their details in the new database.”
[Douglas] A loop is a programming method used to go through a list of entries one by one, doing something to each. In other words, that’s how you replace all those typewriters.
[Pippin] “During the process there was a bug that basically caused the id number for one customer’s id number to be applied to another customer’s subscription id number. So let’s say that you have five subscriptions. You have a subscription from Paypal, a subscription from Stripe, then a subscription in Stripe, then a subscription in Paypal, and another one in Stripe. It would loop through those. So, first it does a Stripe one, then it does a Paypal….” [fade out]
[Douglas] The loop would look for data that needed to be updated, get the missing information through API requests to Stripe or Paypal, and then update each customer’s record.
[Pippin] “What ended up happening was that during that loop if we did a Stripe customer followed by a Paypal customer, the id number retrieved for that Stripe customer would get assigned to the next Paypal customer.”
[Douglas] Stripe customers have customer id numbers, but Paypal customers do not. So, the loop wrongly assumed that each entry should have a customer id number.
[Pippin] “It turned out that profile id was only needed for Stripe customers. So, the bug came from the fact that that profile id, even though it was not needed, was still getting stored for a Paypal customer.”
[Douglas] Here’s how that worked. Imagine each database entry is numbered: one, two, three, four, five, six…
[Pippin] “Let’s say that we’re on record number seven, and that is a Stripe customer. So, we go and retrieve that profile id and store it in a variable.”
[Douglas] For all of you non-programmers out there, a “variable” in code is like a variable in mathematics. It’s a particular piece of data whose value has to be defined. But, the variable isn’t just the data, it’s also a location to store the data so you can manipulate it. If you think back to your grade school days, it’s a bit like the cubby hole where you stashed your favorite dinosaur sweatshirt during class.
[Pippin] “Now we’ve finished the loop and we go to record number 8. Now this is a Paypal customer. Since the Paypal customers do not need to have a profile id stored, we would normally just ignore the profile id variable, since it’s not needed.”
[Douglas] Instead of telling their loop to ignore the profile id variable for Paypal customers, the code said:
[Pippin] “If this profile id has a value…”
[Douglas] Any value at all, even the one from the previous entry…
[Pippin] “…let’s store it on the customer. Now imagine that second customer makes a new purchase. The new purchase would get attributed to the other customer’s account. Because the id number was improperly duplicated onto their account.”
[Douglas] The takeaway? Well, if we go back to the cubby hole analogy, the problem happened because the contents of one cubby hole were being used by many kids, not just the one it belonged to. That’s what was happening to Pippin’s customers, but there was more at stake than hurt feelings.
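The stale-variable problem described above can be reproduced in a few lines. This is a minimal sketch in Python rather than the plugin's own PHP, and every name in it (`subscriptions`, `fetch_stripe_profile_id`, `migrate_buggy`, the `cus_` id format) is a hypothetical stand-in, not code from Easy Digital Downloads.

```python
# Hypothetical sample data: record 7 is a Stripe customer, record 8 a Paypal one.
subscriptions = [
    {"id": 7, "gateway": "stripe"},
    {"id": 8, "gateway": "paypal"},
]


def fetch_stripe_profile_id(sub):
    """Stand-in for the API request that looks up a Stripe customer id."""
    return f"cus_{sub['id']:03d}"


def migrate_buggy(subs):
    """Copy each record to the new format, reproducing the stale-variable bug."""
    migrated = []
    profile_id = None
    for sub in subs:
        if sub["gateway"] == "stripe":
            profile_id = fetch_stripe_profile_id(sub)
        # Bug: for a Paypal record, profile_id still holds whatever the
        # previous Stripe record left behind, and "has a value" is the only check.
        record = dict(sub)
        if profile_id:
            record["profile_id"] = profile_id
        migrated.append(record)
    return migrated


result = migrate_buggy(subscriptions)
for record in result:
    print(record)
```

In this sketch the Paypal record ends up carrying the Stripe customer's id, which is the kind of cross-assignment Pippin describes.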

10:40 – 12:00 What was at stake?

[Douglas] So what was going on behind the scenes for Pippin? I asked him if the bug affected him personally.
[Pippin] “Oh, absolutely. I felt completely worthless. I mean just, just…. The reality of most issues that get introduced into software like this are very minor. They affect a customer or two. And those hurt, but this was a whole different kind of scale. This was a ‘I want to walk away right now’ kind of feeling, the ‘maybe I’m in the wrong business’ type of feeling.”
[Douglas] Most bugs are like a sliver in your finger, but this one felt like a broken bone. Even though they had only received a few complaints about incorrect charges, Pippin and his team were thinking about the worst possible scenarios—it could be the end of his own business for sure. But it could also destroy his customers’ businesses.
[Pippin] “The scary thing to me was the idea that this problem could exist for a month, six months, a year, and suddenly a site owner discovers ‘I have 50 thousand dollars in incorrect charges.’ Those are the kinds of numbers that ruin businesses.”
[Catherine] Coming up after the break, a conversation about why bugs hurt and why it’s important to disclose them.

12:00 – 12:45 A message from Rocket Lift

[Sponsor Message] “At Rocket Lift, we’re a bunch of nerds, and we think this work is exciting. It’s dramatic, and hard, and important for so many things we take for granted these days. It requires creative problem solving and serious dedication, but usually, when something breaks, it’s behind the scenes. We’re telling these stories to celebrate, and learn from them. If you face challenges with your website, or need a dedicated, ongoing partner to manage risk and improve your website systems, get in touch with us at rocketlift.com.”

12:45 – 16:30 — Matthew and Douglas conversation

[sounds of Matthew and Douglas getting settled into a conversation, intros, etc.]

[Douglas] To get a little more perspective on some of the issues in Pippin’s story, I brought in Matthew Eppelsheimer. He’s the Executive Producer of the show, and the President of Rocket Lift. We started out just talking about how he first met Pippin.
[Conversation]
[Douglas] I’m not a developer. So when I first started in this industry doing marketing and support work three years ago, there was a lot I didn’t understand.

16:30 – 18:30 — It turns out that the actual bug was tiny…

[Douglas] So far, we’ve talked about what Pippin and his team were trying to achieve through this update to Easy Digital Downloads, the unexpected problem they discovered, and the potentially huge amount of damage that the issue could cause. So, knowing all that, you’d think there would be some great big fire-breathing dragon of a bug, right? Well, it turns out that the bug was actually tiny. Like a cricket with a bad temper.
[Pippin] “That’s the infuriating part about it is that it was such an easy and simple mistake to make. It’s one of those mistakes that tons of people make, but you only learn to not make that mistake once you have been affected by it.”
[Douglas] It was a rookie mistake, made by a seasoned veteran. Anyone could have done it, and all he could do was all any of us can ever hope to do—to get back to work.

[Pippin] “Fixing it was literally like a 10 character fix in one single line of code.”
[Douglas] Chris Klosowski gave a little more detail. There’s a PHP function called “unset.”
[Chris] “…and what it basically does is it takes a variable that you’ve defined and takes it out of memory. So it doesn’t persist anymore. And if you try to use it again, you need to recreate it. All we had to do was unset one variable.”
[Douglas] I keep thinking there should be a lot more to this part, but like Pippin said:
[Pippin] “Yeah, it’s crazy. That’s literally it.”
[Douglas] All they had to do was to unset the customer id variable at the beginning of each iteration of the loop. That would keep it from being copied to the wrong customer’s account.
[Pippin] “It was infuriatingly simple.”
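The one-line fix can be sketched the same way. The real fix used PHP's `unset`; in this Python illustration, resetting the variable to `None` at the top of each iteration plays that role. All names here (`subscriptions`, `fetch_stripe_profile_id`, `migrate_fixed`) are hypothetical and chosen only for the example.

```python
# Hypothetical sample data: a Stripe record followed by a Paypal record.
subscriptions = [
    {"id": 7, "gateway": "stripe"},
    {"id": 8, "gateway": "paypal"},
]


def fetch_stripe_profile_id(sub):
    """Stand-in for the API request that looks up a Stripe customer id."""
    return f"cus_{sub['id']:03d}"


def migrate_fixed(subs):
    """Copy each record to the new format without leaking ids between records."""
    migrated = []
    for sub in subs:
        # The fix: clear the variable at the start of every iteration,
        # so nothing from the previous record can carry over.
        profile_id = None
        if sub["gateway"] == "stripe":
            profile_id = fetch_stripe_profile_id(sub)
        record = dict(sub)
        if profile_id:
            record["profile_id"] = profile_id
        migrated.append(record)
    return migrated


result = migrate_fixed(subscriptions)
for record in result:
    print(record)
```

With the reset in place, the Paypal record in this sketch is stored without any profile id.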
[Douglas] So, now they’ve fixed the bug. However, that wasn’t actually repairing all the damage the bug had caused. That was just to keep it from happening again.
[Pippin] “The massive amount of work was to fix problems for customers that had already done this upgrade routine, because their data was already damaged. We had to go back and fix all of the data.”
[Douglas] The next update to the Recurring Payments extension would need to repair the bad data that had been created by the previous version. Remember all those people in the big room with the phones and typewriters? [Bring back a snippet of that sound.] The next version of the plugin would need to re-do that whole process, and fix all the mistakes in every user’s database.

18:30 – 20:30 Testing and implementing the repair

[Pippin] “When I identified it, I just sent the other developer a message that said ‘Uh oh. We have a problem, and we need to fix it immediately.’ Once we figured out what to do, he started building it, the fix for it.”
[Douglas] Pippin’s talking about Chris, who we heard from earlier.
[Pippin] “And I started working with the rest of our team, to make sure they all knew what the problem was, and to be on the lookout for additional problems from other customers.”
[Douglas] As we already heard, they fixed the bug quickly, but they still needed to repair the damaged data on every customer’s website. When they actually looked at the damaged data entries, this is what it looked like:
[Pippin] “Let’s say you have a database of a thousand customers. Three hundred of them might have duplicated id numbers. So, five of them share one id, six of them share another id, five of them share another id. And to fix it, we have to go through and identify all of the duplicates, and then we have to figure out who does that id actually belong to. And then you have to delete the id from all the incorrectly assigned ones.”
[Douglas] So that cleans up your data, but it doesn’t do the most important thing of all: refund all of the incorrect charges.
[Pippin] “Then you have to go back to the original id and say ‘do we have any incorrect charges, any new subscriptions that came from customers that were assigned a duplicate id?’ And if we find any payments incorrectly attributed to them, we have to refund them, we have to cancel the subscriptions, and then we have to delete the stored card data on that account.”
[Douglas] So now they know what to do, and they’ve written the code to do it.
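The first step Pippin describes, identifying which ids are shared by more than one customer, might look something like the sketch below. The record layout and the `find_shared_ids` helper are assumptions made for illustration. Working out who an id really belongs to, and issuing refunds, would need data from the payment gateway and is not shown.

```python
from collections import defaultdict

# Hypothetical customer records after the faulty upgrade.
customers = [
    {"customer": "alice", "profile_id": "cus_001"},
    {"customer": "bob", "profile_id": "cus_001"},
    {"customer": "carol", "profile_id": "cus_002"},
    {"customer": "dave", "profile_id": None},
]


def find_shared_ids(records):
    """Return each profile id that appears on more than one customer."""
    by_id = defaultdict(list)
    for record in records:
        if record["profile_id"]:
            by_id[record["profile_id"]].append(record["customer"])
    return {pid: names for pid, names in by_id.items() if len(names) > 1}


shared = find_shared_ids(customers)
print(shared)
```

Grouping by id first keeps the later steps simple: only the customers listed under a shared id need to be checked against the gateway.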
[Pippin] “Once we had some preliminary fixes in place, there was a lot of testing. We ended up building an event log that would allow us to simulate what we were doing, without actually doing any of it. We could set up our bad database once, and then run simulation and simulation after simulation and make sure that everything came out correctly.”
[Douglas] Once their fix was passing their tests, it was time to test with real data, and that means a real website.

20:30 – 22:40 Taking the fix to a real website

[Douglas] So, Pippin says to his customers:
[Pippin] “‘Yeah, we screwed up. Will you allow us to use your site to test a fix for it?’ So that was a little scary. But, those were the only ways we knew to confirm the issue was truly resolved. No matter how much testing you do with fake data, it’s still fake data. And so you still need to test it in a real environment to be a hundred percent sure that you’ve fixed all the problems.”
[Douglas] Running the test on real data means that if there are problems, you’ve damaged your customer data even more. But, Pippin and Chris anticipated that, so they kept the logging function that allowed them to run the fix on a database without actually changing the data. That way they could run the fix on the real database, verify that the results were correct, and then run the update for real. Chris described how that worked.
[Chris] “We were able to basically run the update routine without actually running the upgrade routine. We are going to take this subscription and this Stripe customer and make this change to it, or we’re not making any changes to this customer, move on to the next one. We got to actually echo out all the staging to a text file. So, we could review it on our end, but we could also hand it off to that customer and have them run it and verify that what they saw was right.”
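The "run it without actually running it" idea Chris describes is often called a dry run. Here is a minimal sketch, assuming the rightful owner of each id is already known. The `repair` function, the `owners` mapping, and the log wording are all hypothetical.

```python
# Hypothetical damaged records and a mapping of each id to its rightful owner.
records = [
    {"customer": "alice", "profile_id": "cus_001"},
    {"customer": "bob", "profile_id": "cus_001"},
    {"customer": "carol", "profile_id": None},
]
owners = {"cus_001": "alice"}


def repair(records, owners, dry_run=True):
    """Log every planned change; only modify the records when dry_run is False."""
    log = []
    for record in records:
        pid = record.get("profile_id")
        if pid and owners.get(pid) != record["customer"]:
            log.append(f"{record['customer']}: remove profile id {pid}")
            if not dry_run:
                record["profile_id"] = None
        else:
            log.append(f"{record['customer']}: no change")
    return log


# Dry run: produces the review log but leaves the data untouched.
dry_log = repair(records, owners, dry_run=True)
print("\n".join(dry_log))
```

The log lines could be written to a text file for review, and the same function called again with `dry_run=False` once the planned changes look right.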
[Douglas] After they identified the bug late Sunday night, Pippin’s team released the fix to all their customers by Wednesday the next week.
[Pippin] “Monday was really us working directly with the team to do tests and simulations. I believe it was Tuesday that we were really testing on all these sites. And then it was Wednesday that we pushed the update out to everybody else and then sent out an email to every customer explaining the problem, how to know if you’re affected, and what you need to do.”
[Douglas] Here’s a little bit of that email: “There has been an important update to the Recurring Payments extension to Easy Digital Downloads. Shortly after releasing version 2.4, a bug was discovered in the upgrade process from versions previous to 2.4. The bug resulted in some user accounts getting the incorrect Stripe customer id assigned to them, which could cause charges and subscriptions to be linked incorrectly in your Stripe account.”

22:40 – 25:10 — What did he learn?

[Douglas] So, what have we learned?
[Pippin] “Any creator, no matter whether you’re software or art, if there’s a major problem, you’re going to be personally affected by it. But, in the long run, as long as you don’t give up, or let it define you or your product, it’s going to be a net positive.”
[Douglas] Pippin mentioned two important reminders he got out of this process. First, the value of defensive coding. He explained it this way.
[Pippin] “Let’s say you want to paint a room, and you want the color to be a particular shade of white.”
[Douglas] And you have to achieve this very particular shade of white by mixing a whole lot of white paint and a tiny bit of black paint together. So you write software to drive a room-painting machine.
[Pippin] “Because it’s super important that this room comes out exactly the right shade, you decide that you are going to do a color check on it. So—I don’t know, whatever kind of equipment you have—you’re going to double check that it’s actually what it says it is.”
[Douglas] That would be defensive coding, making sure your data is what it’s supposed to be. But, let’s say you weren’t so defensive, and the software doesn’t have a way to check the color. Instead of that teeny bit of black paint, your intern puts purple paint in the room-painting machine. If the software has no way to check the color, you’re not going to get the end result that you need.
[Pippin] “You only know it’s wrong after you’ve put it on the wall.”
[Douglas] I hope you like mauve. Or maybe it’s periwinkle? Either way, it’s not good. In this case, unsetting the variable each time through the loop would have been the defensive thing to do. Ok, so what was lesson number two?
[Pippin] “We could have just ignored the problem, pretended it didn’t happen and just hope that nobody who had updated noticed.”
[Douglas] But they didn’t do that. Transparency is the issue here, and it was a critical part of the solution.
[Pippin] “We could have easily done that, and honestly, that would have been the easy answer. It would have killed us inside, and hopefully turned our souls black, but we could have easily done that.”
[Douglas] Pippin’s attitude towards transparency is inspired by the WordPress community, where it’s expected that information about bugs and security flaws is shared quickly and honestly. Pippin felt the same should be true for his business.
[Pippin] “I know that there are definitely some communities, both online and offline, that definitely don’t share the openness that I think a lot of us are used to in the WordPress community. The two primary customers we had were more than willing to let us test on their sites, and that’s something I’m very very grateful for, and I think it was a result of us being very forward and honest from the beginning. I think it was a testament that transparency can go a long way.”

25:10 – 27:45 — Wrap-up

[Douglas] To all of you who are struggling with website issues, even ones that you created yourself, chin up. You’re in good company. And if you do find yourself having caused a serious problem, coming clean is the best option. We’re very much in favor of a technology sector where transparency is the norm, not the exception.
[Matthew/Douglas conversation on transparency]
[Douglas] Thanks for listening to the first episode of Hard Refresh. We can’t wait to share all these great stories with you. The Hard Refresh theme was composed by The Brow, and interstitial music in this episode is by the Ocular Concern, on PJCE Records. Thanks to Fastson for typewriter sounds, Dnlburnett for the office sounds, and freesound.org.

27:45 – 29:24 — End Credits

[theme music]
[Catherine] Recording, editing, mixing, sound design, by Douglas Detrick. Production by Douglas and me, Catherine Bridge. Matthew Eppelsheimer is our Executive Producer. Be sure to subscribe on your podcast platform of choice, and write us a review on iTunes. Head on over to hardrefresh.audio to read more about Pippin, or send us a message. You can find us on Twitter and Facebook. Coming up on episode two of Hard Refresh: Richard Tape of the University of British Columbia runs a multi-site WordPress infrastructure with tens of thousands of individual websites. A change to the WordPress shortcode API broke hundreds of them, and Richard had to get creative to fix them.
[Richard] “I waited until close of play, updated WordPress on a staging environment. My team and I checked it all through and everything looked fine, so we put it into production. It took maybe seven minutes for the first ticket to come through, and it was ‘my whole site’s broken, nothing works.’”
[Catherine] If you, or someone you know, has run into a serious problem while working on the web, we’d like to hear about it. We’re looking for stories where internet technology was part of the problem, and human creativity was vital to the solution. Thanks for listening. Now, perhaps it’s time to treat yourself to a hard refreshment?
