Using Yahoo! Pipes to Extract Data You Want

Let’s say that you’ve read all the hubbub about aXXo – one of the most popular seeders of movies, whether you like him or don’t – and how he suddenly removed all of his torrents from The Pirate Bay. And you want to be sure that you’re getting only true aXXo releases, because you know all sorts of junk gets put out with variations on “axxo” in the title, just so everyone will download it. Hey, it happens.

You could go to a number of other places that allow you to pull an RSS feed by user. But that would be the easy way out. So you decide to stick it out at The Pirate Bay and hope that aXXo comes back. Maybe he will, maybe he won’t. Or maybe you just want to learn something. Then read on.

First, go grab an RSS feed of the category you want. For aXXo, you’ll probably want to get the movies feed. The problem is all the other junk in there that doesn’t come from aXXo. As you probably know, anyone can copy his style, right down to the naming convention. So how do you determine if it’s his? You need to look at the feed itself. Each item has a line that tells you the creator. It looks like this:

<dc:creator>portyu</dc:creator>

That one, incidentally, didn’t come from aXXo, but it was someone who posted a movie that was supposed to be from aXXo. So you want to exclude it. The next question is how. You could easily look through the file and pull them out manually, but that’s a pain. Enter Yahoo! Pipes. You’ll need a Yahoo! ID to create a pipe, so if you don’t have one, go get one now. I’ll wait.

Once you have one, simply sign in to Pipes and select the big blue button that says Create a pipe. Don’t worry, you don’t need to know what a pipe is at this point. Just follow along and you’ll see what happens. Once you click the button, you’re given a somewhat friendly interface where you can drag things around. You’ll need three pieces to make this work: a source, a filter and an output.

Click the Fetch Feed item under Source and drag it into the window, then do the same for Filter (under Operators). You’ll notice when you do that you also get a Pipe Output module in the window.

Enter the feed URL into the Fetch Feed module. Then click the dot at the bottom of the Fetch Feed module and drag it to the dot at the top of the Filter module. This connects the two modules, so the data flows (or is “piped”) from one to the next. In the Filter module, select Permit and all, so that only items matching every rule get through. The rule itself should match item.dc:creator when it is aXXo. All of these are in the drop-down lists except aXXo – you’ll need to type that in. Finally, click the dot at the bottom of the Filter module and drag it to the dot at the top of the Pipe Output module. That’s it.
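If you want to see what that filter rule is doing, here is a rough equivalent in Python. It is only a sketch of the idea, not how Pipes works internally. The sample feed and the function name filter_by_creator are made up for illustration, and I'm assuming dc:creator uses the standard Dublin Core namespace.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, assumed to be the one behind the dc: prefix
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

# Made-up sample feed for illustration; a real feed would be downloaded first
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Movies</title>
    <item><title>Example Movie A</title><dc:creator>aXXo</dc:creator></item>
    <item><title>Example Movie B</title><dc:creator>portyu</dc:creator></item>
  </channel>
</rss>"""


def filter_by_creator(feed_xml, creator):
    """Return the titles of items whose dc:creator is exactly `creator`."""
    root = ET.fromstring(feed_xml)
    titles = []
    for item in root.iter("item"):
        # findtext returns None when the item has no dc:creator element
        if item.findtext(DC_CREATOR) == creator:
            titles.append(item.findtext("title"))
    return titles


print(filter_by_creator(SAMPLE_FEED, "aXXo"))  # ['Example Movie A']
```

The comparison is exact and case-sensitive, so a poster named "axxo" would not get through in this sketch.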

Your new Yahoo! Pipe will have a source feed, a filter and an output. If you’re feeling really adventurous, you can even create multiple inputs for your feed, but if you do that, make sure that every input has the dc:creator element. A feed without it will contribute nothing to your output, because none of its items can pass the filter you set.
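To illustrate the point about multiple inputs, here is a small Python sketch that runs the same creator rule over two feeds. Again, this is my own approximation rather than anything Pipes exposes. Both sample feeds and the name merge_and_filter are hypothetical, and the second feed deliberately has no dc:creator element.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, assumed to be the one behind the dc: prefix
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

# Made-up feed whose items carry dc:creator
FEED_WITH_CREATOR = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item><title>Example Movie A</title><dc:creator>aXXo</dc:creator></item>
  </channel>
</rss>"""

# Made-up feed whose items have no dc:creator at all
FEED_WITHOUT_CREATOR = """<rss version="2.0">
  <channel>
    <item><title>Example Movie C</title></item>
  </channel>
</rss>"""


def merge_and_filter(feeds, creator):
    """Combine items from several feeds, keeping only those by `creator`."""
    kept = []
    for feed_xml in feeds:
        for item in ET.fromstring(feed_xml).iter("item"):
            # A missing dc:creator yields None, which never equals `creator`
            if item.findtext(DC_CREATOR) == creator:
                kept.append(item.findtext("title"))
    return kept


print(merge_and_filter([FEED_WITH_CREATOR, FEED_WITHOUT_CREATOR], "aXXo"))
# ['Example Movie A']
```

Everything from the second feed is dropped, which is why an input without dc:creator adds nothing once the filter is in place.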

Once your pipe has been created, click Save, at which point you can name your pipe. Then you’ll be able to view this pipe, as well as any other pipes that you have, by clicking on the My Pipes link. From there, click the View Results link and you’ll see what sort of effect your changes have made. To subscribe to your pipe, click the More Options button, at which point you can get raw XML, or easily subscribe in a number of the more popular readers.

Yahoo! Pipes is not limited to this sort of application, but this example is a very easy way to show what it can do. There are all sorts of pre-built pipes available from the Pipes home page, and you can even share your own pipes once you’ve built them. What have you built, or even come across?

2 Replies to “Using Yahoo! Pipes to Extract Data You Want”

  1. Hi Joe –

    I haven’t done it specifically on upcoming, but you’ll do it much like outlined above. First, grab a feed, then filter the data you want. You may even want to search for existing upcoming pipes, because there are plenty of them out there.

    But I don’t know if you will be able to flat-out extract that data. Mostly it depends on what sort of format you want to get the data in. When I have created pipes, it’s to create a new feed of filtered data. If that’s what you want, I think you’ll be fine. Otherwise you might need to take your filtered feed and do something else with it (parse with some other program, for instance).

  2. Hey,
    I’m working on a project to extract addresses and phone numbers from the RSS descriptions of events from websites like upcoming.yahoo.com.

    Any idea how that can be achieved?
