It started with Peter and the Wolf.


I’ve been wanting to play with Markoved text for some time, using books in the public domain that are easily accessible via Project Gutenberg. The idea in my head was to create a Twitter bot that would mash Peter Pan together with Call of the Wild (although later I realized that Beowulf would be the better option) and tweet lines from the merged books. As I dug into this project, I realized that the mash-up text wasn’t really interesting enough to hold its own as a bot, but I did like combining different texts to see the results. After a while I realized that the *more interesting* challenge would be to build an app where any two (or more) Gutenberg texts can be combined. Behold, the #notbot!

(And yes, I got my Peter and the Wolf in there.) Go play, and then come back and I will show you how I built it:

Step 1: Set up the function

I looked for a Ruby gem that would generate random sentences from a given text, and found the markov_chains gem by Justin Domingue. From there I wrote a basic script to open and concatenate two text files, and then generate one or more sentences from them. So far, so good!
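
If you're curious what a gem like this does under the hood, here is a minimal sketch in plain Ruby. To be clear, this is not the markov_chains gem's API and not my actual script: `TinyMarkov` and its methods are names I made up for illustration, it only looks back one word, and the two inline strings stand in for the text files.

```ruby
# A hand-rolled, order-1 Markov chain. TinyMarkov is a hypothetical name,
# not part of the markov_chains gem.
class TinyMarkov
  def initialize(text, rng: Random.new)
    @rng = rng
    @starts = []                                        # words that begin a sentence
    @chain = Hash.new { |hash, key| hash[key] = [] }    # word => words seen right after it

    # Treat ".", "!" and "?" as sentence boundaries.
    text.scan(/[^.!?]+[.!?]/).each do |sentence|
      words = sentence.split
      next if words.empty?

      @starts << words.first
      words.each_cons(2) { |current, following| @chain[current] << following }
    end
  end

  # Walk the chain from a random sentence opener until we reach a word
  # with no recorded follower (usually a sentence end) or hit max_words.
  def sentence(max_words: 30)
    raise ArgumentError, "no sentences in source text" if @starts.empty?

    word = @starts.sample(random: @rng)
    out = [word]
    while out.length < max_words && @chain.key?(word)
      word = @chain[word].sample(random: @rng)
      out << word
    end
    out.join(" ")
  end

  def sentences(count)
    Array.new(count) { sentence }
  end
end

# Concatenate two sources, then generate from the combined text.
peter = "Peter went out into the big green meadow. The wolf came out of the forest."
wild  = "Buck did not read the newspapers. The wolf howled at the moon."
puts TinyMarkov.new("#{peter} #{wild}").sentences(2)
```

Because both sources share words like "The" and "wolf", the walk can hop from one book into the other at those points, which is where the mash-up effect comes from.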

Step 2: Get you some books!

I spent a long time poking around the Gutenberg site, reading their guidelines, and trying to determine the best way to get text into my app. Ideally I wanted a user to be able to choose one or more titles in Gutenberg, and merge those. I looked at their file structure, the book ID numbers, etc., and came up with a few solutions to get the books via ID lookup or scraping. However, according to their own guidelines:

The Project Gutenberg website is intended for human users only. Any perceived use of automated tools to access the Project Gutenberg website will result in a temporary or permanent block of your IP address.

Why Gutenberg, whyyyyy? This is a hiccup. One that I don’t really want to mess with. (I already was having trouble accessing the site while using my normal VPN.) I settled for downloading about 20 texts, figuring that would provide a good variety, and hoping to implement a way for users to paste their own text if they want to do a Gutenberg cut-paste manually. It goes without saying that I looked for a variety of books with distinct styles and by a diverse set of authors. 

In my first foray into *multiple Ruby files that play together*, I created a “Corpus” class for my books in a separate Ruby file, with methods to list books, look up file names, add books, and concatenate two books, and I required that file in my main script. My thinking here was to make it easy to add or change books in the future.
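
A class along those lines might look roughly like the sketch below. The method names, the title-to-filename hash, and the file names in the usage comment are assumptions for illustration, not the app's actual code.

```ruby
# corpus.rb: a sketch only. The method names and the title => file name
# hash are assumptions, not the real Corpus class.
class Corpus
  def initialize(dir, books = {})
    @dir = dir
    @books = books.dup   # title => file name
  end

  # Titles available for merging, sorted alphabetically.
  def titles
    @books.keys.sort
  end

  # Full path for a title; raises KeyError for an unknown title.
  def path_for(title)
    File.join(@dir, @books.fetch(title))
  end

  # Register another book and return self so calls can be chained.
  def add(title, filename)
    @books[title] = filename
    self
  end

  # Read each requested book and join them into one string.
  def merge(*titles)
    titles.map { |title| File.read(path_for(title)) }.join("\n")
  end
end

# In the main script (file names here are placeholders):
#   require_relative "corpus"
#   corpus = Corpus.new("books", { "Peter Pan" => "peter_pan.txt" })
#   corpus.add("Beowulf", "beowulf.txt")
#   text = corpus.merge("Peter Pan", "Beowulf")
```

Keeping the title list in one place means adding a book is a one-line change rather than a hunt through the main script.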

Step 3: Books are terrrrrible

The merge function was working well, but chapter headings, random spaces and new lines were making the sentence generation all wonky. So I took advantage of my new knowledge of regular expressions to do some cleaning in Sublime Text.

This was also good practice on multi-cursor use. I tried to strip most everything that wasn’t sentences, including a ton of notes and prefaces and appendixes and whatnot. (Sorry, books!) In my earliest iterations, the Markoved output was saved into a named file in my project folder, but I scrapped that for web implementation. But, knowing how to open/read/write files without breaking everything is potentially useful for projects down the road…
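
I did the cleanup by hand in Sublime Text, but the same kind of find-and-replace could be scripted. Here is a rough Ruby sketch: `clean_book` is a hypothetical helper, and the patterns are illustrative guesses at typical noise (all-caps chapter headings, hard-wrapped lines, stray spaces), not the exact expressions I used.

```ruby
# Hypothetical cleanup helper; the patterns are illustrative guesses,
# not the exact expressions used in Sublime Text.
def clean_book(text)
  text
    .gsub(/\r\n?/, "\n")                                          # normalize line endings
    .gsub(/^[ \t]*(?:CHAPTER|BOOK|PART)[ \t]+\S+[ \t]*$/, "")     # drop lines like "CHAPTER IV."
    .gsub(/[ \t]+/, " ")                                          # collapse runs of spaces and tabs
    .gsub(/\s*\n\s*/, " ")                                        # join hard-wrapped lines
    .strip
end

raw = "CHAPTER I.\r\n\r\nAll children,   except one,\r\ngrow up.\r\n"
puts clean_book(raw)
# prints: All children, except one, grow up.
```

Prefaces, notes, and appendixes vary too much from book to book for a one-size-fits-all pattern, so those would likely still need a manual pass.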

Step 4: Ol’ Blue Eyes

Now I had a working Ruby script and files, but I wasn’t sure how to translate them to a web application. I could have attempted this project using Rails, but since there’s no database to keep track of, I thought that might be overkill. I turned to my pal Joel for advice and he directed me towards Sinatra, a lightweight Ruby web framework (Sinatra comes with delightful messages such as “Sinatra takes the stage on port 4567” and “Sinatra departs to thunderous applause”). So I downloaded and awkwardly poked at Sinatra and got as far as being able to load up a page, get some Ruby variables into that page, and make a drop-down menu for book selection. Where I got stuck, though, was how to take variables from user input on the page and pass them back into a Ruby function (which I have since learned is the job of a “post” route).

Unbeknownst to me, Joel was also taking a stab at my app and much more successfully devised a text generator (that “post” feature) as well as some JavaScript to run a load wheel and the files and bundling that pulls it all together. With his help, I went from here to here.

Step 5: Add a book grid & user input text

My first thought was to rebuild everything Joel had done from scratch, so I would understand how it all works. However, I found that difficult to implement (partly because a lot of the formatting was coming from Bootstrap CSS), so I settled for using his code as a launching pad and adding enough features and tweaks to comfortably call it my own. In doing this, I touched almost every feature and prepared myself well for the next Sinatra app.

  • I added the book grid at the top using Ruby and CSS for a visual of what books are available in the app. The full list appears when you select an option from a drop-down menu. I picked book images labeled OK for non-commercial reuse, so some of them are not the book image you may have in your head.
  • It took me the better part of a Saturday (and I burned soup in the process), but I added a user text option for customized text. This required sending another parameter (the user text) to Sinatra’s post route, teaching the Corpus how to handle user text (since it’s not a .txt file like the rest of the corpus), and adding some logic to update the user text when it changes while still maximizing performance on the book merges. Joel had set up a generator that keeps track of book combinations it has already built, which saves considerable time on sentence generation since some of the books are lengthy files. However, that feature was breaking when new user text was entered (it would keep giving you the old text, even on reload), so I set it to build a fresh generator every time a user entry is involved, and to use the time-saving feature for regular book combos.
  • I added Google fonts (maybe you hate them… I think they’re great) and messed with the CSS styles a bit. I also added a footer with attribution.
  • Here’s a fun fact: I wanted to use the “it was the best of times, it was the worst of times…” passage as sample text because I thought it would Markov well. But in running it, I got the same phrase or sentence over and over; it was not meshing well and not taking advantage of patterns. Why? It turns out that passage is *all one sentence* and the Markov gem looks for periods as key stop points. My great example was actually a terrible example. I kept it but changed all the commas to periods. So, yes, I took liberties with Dickens and I’d do it again, too. (Sorry, Charles.)
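
The caching behavior from the user-text bullet above could be sketched something like this. `GeneratorCache` and the builder block are hypothetical names standing in for whatever turns merged text into a generator; this is my illustration of the idea, not Joel's code.

```ruby
# Sketch of caching one generator per book combination. GeneratorCache and
# the builder block are hypothetical; the real app's code differs.
class GeneratorCache
  def initialize(&builder)
    @builder = builder   # turns merged text into a generator object
    @cache = {}
  end

  # titles: array of book titles; texts: their contents, in the same order.
  # user_text: optional pasted text, which always bypasses the cache.
  def fetch(titles, texts, user_text: nil)
    custom = user_text.to_s.strip
    return @builder.call((texts + [custom]).join("\n")) unless custom.empty?

    # Sort the titles so "A + B" and "B + A" share one cache entry.
    @cache[titles.sort] ||= @builder.call(texts.join("\n"))
  end
end

builds = 0
cache = GeneratorCache.new { |text| builds += 1; text.length }

cache.fetch(["Beowulf", "Peter Pan"], ["Lo, praise.", "All children grow up."])
cache.fetch(["Peter Pan", "Beowulf"], ["All children grow up.", "Lo, praise."])
puts builds   # 1: the second call reuses the cached generator

cache.fetch(["Beowulf"], ["Lo, praise."], user_text: "My own words.")
cache.fetch(["Beowulf"], ["Lo, praise."], user_text: "My own words.")
puts builds   # 3: user text forces a rebuild each time
```

Rebuilding on every user entry is the blunt fix; keying the cache on a hash of the user text would be another option, at the cost of holding more generators in memory.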

Step 6: Deploy!

As with past projects, I used Heroku to publish this app and it was very easy. I’m also using this project to practice using git branches to track specific app features as I go along, so I’m switching to a new branch, completing the feature, and merging back into master as I go.

Step 7: Profit?

I got a good portfolio piece from this work, and lots of relevant practice, plus now I can MOVE ON WITH MY LIFE and for the love of God stop combining books already.

The toy is for fun and discovery, and maybe if you like it you’ll kick a few dollars to Project Gutenberg, even though their lack of a developer API makes me and my VPN cry.

Step 8: Feedback Plz

Y’all, I wrote NO tests for this app, and set no character limits either. There are probably a lot of ways it could break. If you find one, or see anything that looks like a security hazard, let me know? This is a learning process!

Here’s the link again, feel free to share: Book Merge!