/gutenberg project volunteer


The ClearType Press

http://www.brains-N-brawn.com/gutenberg 10/3/2002 casey chesnut

I end up converting the majority of the ~5000 Project Gutenberg etexts to proper ebook format for MS Reader and Pocket PCs. I am critical of Project Gutenberg at certain points throughout this article. Please don't take offense; I think PG kicks a55. If it continues as is, I will still love it; but I hope they might take what I have put together, or build their own automated tools, and make it even more powerful and successful.

Go To Project Gutenberg

I love books. A quick glance around my apartment and the only apparent furnishings are a bed, computers, and technical books. Peruse my harddrive(s) and you'll find even more digitized books, in either ebook or audio format. I save my eyes for technical books and listen to other books for pleasure, primarily from Audible.com. So 3 weeks ago, 2 weeks after the release of /noSink (cooler than this), my ISP cancelled my account unexpectedly. It took me 2 weeks to get connected again ... with the pictures below showing the damage from Barnes and Noble, Amazon, and Borders (think tax writeoff):

BEFORE

AFTER

in both pictures: the left stack has been read, the right stack is pending

Since I move a lot (nomad), and books are heavy and take up a lot of space, I keep shipping them back to my parents, who run the 'Casey Chesnut Memorial Library', which consists of a huge bookshelf. Periodically, whenever I reinvent myself (e.g. from KC++ to KCJ to KC#), the bookshelf gets cleaned and the deprecated books are hauled off to some nearby library, whose comp sci book section is typically doubled or tripled in size, as well as brought up to the correct decade.

<Rant>PG brings up the issue of copyrights and such. I am pissed at how successfully the RIAA decimated mp3.com and napster. At least Rio made it. Unexpectedly, I am even more pissed that the RIAA is now pushing for a law to let them flood the networks with bad copies of songs and such (which they are already doing). There should be no law. They should be able to battle it out with technology fair and square, without their hands being tied. Of course gnutella and kazaa would then incorporate some media or user rating services, but that is beside the point. Granted they cannot wreck my harddrive, but if I end up with a looped song, then they did a good job. What's really cr4ppy is that there are places without these regulations, and since we are battling these same people on a virtual landscape where location does not matter, that puts us at a serious disadvantage. If the US continues towards regulation ... people will start leaving. Walled gardens will make for an even quicker exit strategy. Wonder if FedEx will deliver offshore? Don't forget you would be able to name your floating barge, as well as fight off pirates. Did the hiding place in Atlas Shrugged have a name? If so, that is what I would name mine.</Rant>

Step 1) Complain about problem

As you can kind of see from the badly taken pictures, my favorite publishers are MS Press, Wiley Tech Brief, Wrox Early Adopter, and Apress. For .NET, MS Press has been dominating. The funny thing is, when I get their books they come with a CD that is supposed to have an ebook on it. I'm like, great, 'I'll be able to put the ebook on my PocketPC and read it from there'. Then the alleged ebook ends up being a stupid CHM file. A CHM file is not an ebook! There is only 1 CHM reader on the PocketPC, and it sucks. To read a CHM file on my PPC, I have to decompile it to HTML and then copy all those files over to a storage card and open them individually ... lame.

A true ebook (*.LIT format) on a PPC actually makes for a great reading experience. You get ClearType fonts, which are easy on the eyes. MS Reader does pagination based on the font size you select and the size of the display, so the # of pages varies depending on whether you read it on your desktop computer or your PPC. You can hold a PPC in one hand and read while lying face up, which requires 2 hands with a physical book, not to mention the weight difference (a true bookaholic notices). Also, on the desktop, there is a Text-To-Speech add-in which will read the book to you in a HAL 9000 type voice. I am hoping this feature ultimately makes it to the PPC as well. You might not be able to let people borrow an ebook because of DRM; but I am selfish with my physical books in the 1st place, on the chance that the borrower might enjoy reading in the restroom, thus fecalizing my book, at which point they can keep it.

Regarding the other alleged ebook formats: PDFs suck on a PPC because they don't paginate and they don't have ClearType. Also, they don't always convert from the desktop flavor. TXT is portable to every device, but you don't get pagination or ClearType, and the file sizes are larger too. The PALM formats ... who cares; PALM took the same path as the Mac. When I see a PDB, I'm like, 'I don't want to debug your code'. Neither CHM files nor the new Help file format have a good reader, so they require decompiling to HTML and opening the pages individually. Sometimes this can be over 1000 HTML files, and the small CPUs on the PPCs don't like that many files in one directory.

Step 2) Find books in electronic format

So LIT is clearly the superior format ... except it is difficult to find ebooks at all, regardless of format. Instead of complaining some more, I decided to gen a ton of ebooks! The 1st step being: where does one get a bunch of books in electronic format? [http://www.archive.org/texts/texts.php] Out of those links, Project Gutenberg is clearly the best. Volunteers donate their own time to enter in a book no longer under copyright (typically published before 1923), others proofread their typing, and it is eventually added to the collection of etexts, as they put it, and mirrored across multiple FTP servers to make it available to everyone. So I fire up my bulletproof FTP client and proceed to download the entire thing. After ripping through all the folders, it says it is going to download 15 gigs of info! I don't feel like waiting that long, so I disconnect and look through their directories. It looks like every etext they have also comes in a *.ZIP flavor. I fire up the FTP client again and d/l approximately 7500 ZIP files, just over 2 gigs. That took 1 night.
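For the curious, the bulk download could be scripted in a few lines of Python's ftplib instead of a GUI FTP client. This is only a sketch; the host and path below are placeholders for whichever PG mirror you use:

```python
from ftplib import FTP

def zip_names(listing):
    """Keep only the *.ZIP flavor of each etext from a directory listing."""
    return [n for n in listing if n.lower().endswith('.zip')]

def mirror(host='ibiblio.org', path='/pub/docs/books/gutenberg'):
    # host/path are assumptions; point this at a real PG mirror
    ftp = FTP(host)
    ftp.login()                                  # anonymous login
    ftp.cwd(path)
    for name in zip_names(ftp.nlst()):
        with open(name, 'wb') as f:
            ftp.retrbinary('RETR ' + name, f.write)
    ftp.quit()
```

Grabbing the ZIP flavor instead of the raw TXT is what cuts the 15 gigs down to about 2.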

Just perusing through the files was fun (approximations follow): There were about 25 audio books in MP3 format. 10 videos, almost all of them showing nuclear blasts. 15 XML docs, most being sheet music. 50 in LIT format, the better ones by the author of Tarzan. 50 in PALM format. 50 in PDF format. 1 NFO file. 100 RTF. 5 TEX. The largest text files were the chromosome patterns of the Human Genome Project. Another interesting one was the value of 2^10. I think there was PI to some ridiculous precision as well. 400 HTML, some with pictures. A couple of Word DOCs. etc... I could have spent weeks going through it.

Step 3) Parse indexes

Going back to the FTP site, they have text files called GUTINDEX.*. These contain a listing of all the texts that have been converted. The TXT and ZIP files themselves have minimal metadata, which cannot be parsed in an automated manner. My plan is to parse the index files, and then match the name of the TXT file against the parsed index info to get that file's metadata. The text format of the index is really bad, and is something to this effect:

Mon Year Title and Author                                  [filename.ext]####
Aug 2004 The PG Works Of Gilbert Parker,   Complete [GP127][gp127xxx.xxx]6300*

An hour and a 1- or 2-page-long text-parsing function later, I get an XML file that looks like this: gutenberg.xml (1 meg, DataSet friendly). Not 100%, but close enough for this exercise. The XML file ended up having something like 6500 elements. Note: some entries did not make it because they did not follow even a semblance of the above format. I wrote a chunk of if/then logic to catch as many of these special instances as possible. Some manual intervention would be needed to get all of them.

Title                                      Author               Etc              Mo/Yr  Filename      ID
"Confessio Amantis"                        John Gower           [circa 1375 AD]  5/95   conamxxx.xxx  266
"Pigs is Pigs"                             Ellis Parker Butler                   12/99  pgpgsxxx.xxx  2004
[A Biography of] Sidney Lanier             Edwin Mims                            2/98   lanrbxxx.xxx  1224
[Hans Christian] Andersen's Fairy Tales    [HCA #1]                              1/99   hcaftxxx.xxx  1597
[Harvard] Philosophy 4                     Owen Wister                           3/97   phil4xxx.xxx  862
[Reserved for 2001]                        [Arthur C. Clarke]                    12/99  xxx.xxx       2001
[Reserved for Pietro di Miceli]            [PG Webmaster]                        11/99  xxx.xxx       1964
[Thomas] Bulfinch's Mythology              The Age of Fable #1                   7/02   bmaofxxx.xxx  3327
10000 Dreams Interpreted                                                        5/97   drmntxxx.xxx  926
100%: The Story of a Patriot               Upton Sinclair       [#13]            5/04   strptxxx.xxx  5776
...
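For illustration, here is a hypothetical sketch in Python of how a line in that index format could be pulled apart with a regex. My actual parsing function was a page or two of C#, and entries that stray from the format still need if/then special-casing:

```python
import re

# The pattern is assumed from the sample line above:
#   Mon Year Title and Author [optional tag][filename.ext]ID*
LINE = re.compile(
    r'^(?P<mon>[A-Z][a-z]{2}) (?P<year>\d{4}) +'  # e.g. 'Aug 2004'
    r'(?P<title>.+?) *'                           # title (and author), lazily matched
    r'(?:\[[^\[\]]+\])?'                          # optional extra tag like [GP127]
    r'\[(?P<file>[\w.]+)\]'                       # filename, e.g. gp127xxx.xxx
    r'(?P<id>\d+)\*?$'                            # etext number, optional trailing *
)

def parse_line(line):
    """Return the line's metadata fields, or None if it strays from the format."""
    m = LINE.match(line)
    return m.groupdict() if m else None
```

The `None` results are the special instances that need manual intervention.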

Step 4) Scrub files

Now that I have metadata about the files, I need to prep the actual files. The etexts from the FTP site are arranged by the year the etext was submitted; as far as I am concerned, this is useless info unless somebody is reading along each year ... not likely. I pull the 7500 zip files out of their subdirectories and throw them into 1 directory. This turns up some duplicate file names, ~25 to 50, I forget. Then I pull out the really big Human Genome Project files, about 50. Next, I write a method to look into the ZIP files to see what they contain. I used SharpZipLib, which worked great. Out of the 7500 files, it had problems with about 500 of them (something about Compression 6). For those 500, I used WinZip to extract them all to subdirectories and then wrote a quick function to rezip them using SharpZipLib. From there, my program iterated through the ZIP files. If the contained file was not a TXT file, it was thrown out (about 700). Then, if a ZIP contained more than 1 file, it was thrown out (about 100). Next, if the TXT filename did not match the ZIP filename, it was thrown out (about 50). Then, if the 1st 5 letters of the filename could not be found in the XML file created above for its metadata, it was thrown out (about 550). Finally, I ripped out the ones that were obviously bibles, because I cannot support that (about 5). I know there is a free bible in LIT format, but you'll have to find it on your own. After that 4 or 5 hour scrub, I end up with a directory of just over 6000 ZIP files, each containing a single TXT file which can be looked up in the XML file above.
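The per-ZIP checks can be sketched like this. This is a Python approximation of the C#/SharpZipLib logic; `metadata` stands in for the set of 5-letter filename prefixes pulled from the XML index:

```python
import os
import zipfile

def keep(zip_path, metadata):
    """Apply the scrub rules: single TXT file, matching name, known prefix."""
    with zipfile.ZipFile(zip_path) as z:
        names = z.namelist()
    if len(names) != 1:                          # more than 1 file -> throw out
        return False
    inner = names[0]
    if not inner.lower().endswith('.txt'):       # not a TXT file -> throw out
        return False
    stem = os.path.splitext(os.path.basename(zip_path))[0]
    if os.path.splitext(os.path.basename(inner))[0].lower() != stem.lower():
        return False                             # TXT name != ZIP name -> throw out
    return stem[:5].lower() in metadata          # 1st 5 letters must be in the index
```

The bible check was a separate manual pass, so it is left out of the sketch.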

Step 5) brains-N-brawn publishing

From an earlier misadventure with ebooks, I had written a C++ app to generate (adult) ebooks. Not the easiest thing in the world. I revisited that app for this endeavor. Specifically, I made it add the proper Project Gutenberg title and author info to the ebook from the metadata. I also added thumbnail and cover page images. A couple hours for these mods. It works by compiling a handful of files into the LIT format. The main file is an XML file following the OEBPS open ebook specification. It points to the images and HTML content files, and has Dublin Core metadata. The thumbnail image is a certain size and shows up on the desktop PC, but not the PPC. The cover page is scaled down to PPC format to save space, but also renders on the desktop PC. There is a static HTML About page made by me. Also, there is a CSS file for style info. Finally, the content HTML page is created by a preprocess of the Project Gutenberg TXT file. The compiler is picky and basically only accepts XHTML, so the TXT has to be scrubbed for certain characters, <P> and <BR> tags have to be added, etc... The app takes all those files, runs for a couple seconds, and then writes out a LIT file. I set that up before bedtime, and it finished when I woke up the next day. Then I wrote a quick function to rezip those LIT files into a *L.ZIP file, following the PG naming convention; that took an hour or so.
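The TXT-to-XHTML preprocess boils down to escaping the characters a strict XHTML parser chokes on and wrapping blank-line-separated blocks in paragraph tags. A minimal Python sketch follows (an assumption on my part; the actual C++ app's scrub rules are more involved):

```python
import html

def txt_to_xhtml(text, title):
    """Turn a plain Project Gutenberg TXT into a minimal XHTML content page."""
    # blank lines separate paragraphs in the PG etexts
    paras = [p.strip() for p in text.split('\n\n') if p.strip()]
    # escape &, <, > etc. so the picky LIT compiler accepts the markup
    body = '\n'.join('<p>%s</p>' % html.escape(p) for p in paras)
    return ('<html><head><title>%s</title></head>\n'
            '<body>\n%s\n</body></html>') % (html.escape(title), body)
```

The real preprocess would also handle hard line wraps within paragraphs and the PG small print, but this shows the shape of it.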

Result

Now I have 6063 LIT books, just over 1 gig. Granted, some are just different versions of the same book. If PG will accept them, then I will upload them to their servers to be mirrored for all PPC owners and MS Reader users to enjoy (does PALM have a LIT reader?). My recommendations to PG: lose the 8.3 filenames, use XML indexes, provide a template or tool for submitting ebooks as XML instead of TXT, get rid of the etext## directory structure, separate the small print from the content, add parseable metadata to the TXT files themselves, and automate to avoid the potential human error that comes from repetition.


Source

I will make my source (C# and C++) available to PG in case they want to extend it and use it as a base to help automate their work.

Response

Well, I wrote PG, and their response was basically that they don't think automated converters are good enough. The one they sort of like is here: GutenMark. I will not post their response out of respect for their privacy; but here is my reply, since I don't care about my privacy:

I agree that you should host every imaginable format, or be able to convert to every format. I disagree that you should not host closed or uneditable formats, which you currently do in small quantity, as pointed out in my article. I agree that the books should also be available in other formats, specifically XML. Forget HTML as a format, unless you meant strict XHTML. Since XHTML is a prerequisite for generating LIT, that option is available from my effort. I did not mean that you should only host LIT; it is just my preferred format, and I was offering a direct conversion of what you have in TXT to LIT for the advantages that it offers (ClearType, pagination, smaller size, TTS on the PC). If people manually converted the TXT into a better LIT format with pictures, chapter breaks, etc ... then by all means that should supersede my effort. I would hope that others would do the same for the PALM formats, PDF, etc... As far as making the TXT look pretty, I did not think that was the point; I thought the purpose was to make the books more accessible to everybody, and converting file formats is too complex or involved for a majority of users.

At a minimum, you might look at how I turned the GUTINDEX.* files into XML; that would allow you to easily automate adding parseable metadata to the TXT files, remove current inconsistencies between the indexes and the files, and improve your search functionality.

I will host the converted books on my own server, and will not revisit this until they can break from the past.

Update

If you have a Pocket PC, brains-N-brawn.com mobile (http://www.brains-N-brawn.com/mDefault.aspx) has these 6000+ ebooks online. You can browse them on your Pocket PC, search, and download them without having to synch. The ebook files are optimized for MS Reader and have small file sizes. Also, there are some 25000 Today screen themes that can be browsed, searched, and selected as well. Finally, the articles listed to the right are also available for you to read. casey 2/4/2003

Future

Just ended a contract. I'm looking for interesting and challenging work in web services, speech, and/or mobile development (preferably virtual). The best example of my skill set can be found here: /noSink. As far as articles, I'm thinking about implementing a biometric with CF .NET, or round 2 of the web service specs. Later