Spent the better part of this afternoon transforming one of my old B5 sermon notebooks into a 56.1 MB PDF. Overall I’m satisfied with the results, though it would have been wonderful if the OCR had worked. Then again, I struggled to read my own handwriting… so I don’t think I can blame the OCR for not deciphering it.
This is the equipment that I used:
I decided to do it on a computer as it provides me with much more advanced controls. With the SCX-4828FN, I can only either scan simplex using the ADF or do everything manually on the flatbed. Furthermore I would be unable to preview my progress and any mistakes would only be noticed later. There were also some A4 printed inserts which I had to handle manually. Thankfully PDF is a really flexible document format that allows pages of different sizes.
Another area where I had to intervene was the pages on which I had written with a glitter pen. I didn’t want to feed them through the ADF and risk the glitter rubbing off on the rollers.
I scanned the whole notebook in full color (with printed notes in grayscale) as I mostly used colored pens. Grayscale would definitely have saved a lot of space but I felt that was not necessary. One gotcha was that my root directory was getting full, with only about 25 MB of space remaining! After having scanned over 50% of the notebook, I realized that I couldn’t even save the project as a PDF! So at 6:20pm I decided to start from scratch, as the filenames in the tmp directory were completely random and I didn’t want to piece together the jigsaw puzzle.
The scanning seemed to pause between pages. I suspect the scanner buffers each page before sending it to the PC, though it might also be down to the SANE Linux driver that I was using.
gscan2pdf allowed me to set flexible page number progressions, e.g. incrementing page numbers by different steps, so I could use the simplex ADF to scan one side of a stack of notes before scanning the other side.
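To illustrate the page numbering idea, here is a minimal Python sketch of how the two passes could be numbered. It is not gscan2pdf's own code; the function name `simplex_page_numbers` is made up for illustration, and it assumes the whole stack is flipped over for the second pass, so the back sides come out in reverse order.

```python
def simplex_page_numbers(sheet_count, side):
    """Page numbers for one pass of double-sided sheets through a simplex ADF.

    Hypothetical helper for illustration only.
    Assumption: for the "back" pass the whole stack is flipped over,
    so the last sheet's back side is scanned first.
    """
    if side == "front":
        start, step = 1, 2                  # 1, 3, 5, ...
    elif side == "back":
        start, step = 2 * sheet_count, -2   # 2n, 2n-2, ..., 2
    else:
        raise ValueError("side must be 'front' or 'back'")
    return [start + step * i for i in range(sheet_count)]


fronts = simplex_page_numbers(3, "front")
backs = simplex_page_numbers(3, "back")
print(fronts, backs)
```

Merging the two passes and sorting by page number puts the sheets back in reading order. If the stack is not flipped as a whole, the second pass would need a start of 2 and a step of +2 instead.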
- Better OCR. It might be possible to train tesseract on my handwriting, but I realized that even in the span of one notebook I had switched handwritings and used different pens.
- Better color control. Again the NAA has done a good job with their document Digitising accumulated physical records.
- Though this is a seemingly simple project, running OCR, applying other image-processing filters and exporting to PDF all take processing time. I had no qualms spending 2 hours to rip 1 hour from a DVD, but for this more interactive endeavor a faster processor would have made the experience more enjoyable.
- Write properly. You never know who will need to read what you’ve written, maybe even yourself! Use high-contrast ink; white paper really is the best at the end of the day.
Weird! When I scan a single page, the free disk space on my root file system dips by about 35 MB before going back up to the previous number. However, when I use the ADF and scan pages in bulk, it seems that only the disk space of the last page gets reclaimed, so the program is holding on to space somewhere on my root file system. This looks like a gscan2pdf problem.
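In hindsight, a quick free-space check before each batch would have saved me the restart. A rough Python sketch, assuming roughly 35 MB of temporary space per page (the dip I saw above) and that the temporary files land on the root file system; the helper names `free_mb` and `enough_space` are my own:

```python
import shutil


def free_mb(path="/"):
    """Free space on the file system holding `path`, in MB (decimal)."""
    return shutil.disk_usage(path).free / 1_000_000


def enough_space(pages, per_page_mb=35.0, path="/"):
    """True if `pages` scans should fit, assuming `per_page_mb` per page.

    The 35 MB default is only my observed dip for one page, not a
    documented gscan2pdf figure.
    """
    return free_mb(path) >= pages * per_page_mb


print(f"{free_mb():.0f} MB free; room for a 20-page batch: {enough_space(20)}")
```

With only about 25 MB free, this check would have failed even for a single page.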
Ended up doing a separate scan of the remaining pages and using Preview to fix them up… oh well. Just finished at 8:45pm, a total of 2h 25m! Not great, but alright for a first attempt. Total of 125 pages, resulting in an average of 449 KB per 300dpi scanned B5 page. I am guessing that gscan2pdf used JPEG compression.
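The arithmetic behind that average, and why I suspect JPEG, can be sketched in a few lines of Python. The page dimensions are an assumption (ISO B5, 176 × 250 mm, in 24-bit color); my notebook may well be a slightly different B5 variant, and the A4 grayscale inserts are ignored, so treat the ratio as a ballpark only.

```python
PDF_SIZE_MB = 56.1   # size of the finished PDF
PAGE_COUNT = 125     # pages scanned

# Average size per page in KB (decimal units, 1 MB = 1000 KB)
avg_kb = PDF_SIZE_MB * 1000 / PAGE_COUNT


def raw_page_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Uncompressed size of one scanned page in bytes."""
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumption: ISO B5 (176 x 250 mm), full color, 300 dpi
raw_bytes = raw_page_bytes(176, 250, 300)
ratio = raw_bytes / (avg_kb * 1000)

print(f"average per page: {avg_kb:.0f} KB")
print(f"uncompressed page: {raw_bytes / 1_000_000:.1f} MB")
print(f"approximate compression ratio: {ratio:.0f}:1")
```

An uncompressed page under these assumptions is in the tens of megabytes, so the pages in the PDF are far smaller than raw pixels. That kind of reduction is what makes me guess at lossy compression such as JPEG, though I haven't verified it.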
For those who are interested, you can take a look at this file: Scanned Sermon Notes sample.
- Pages 1 – 2: The first sermon I have recorded in this book
- Page 3: A scanned sermon note
- Page 4: Notes written with a glitter pen
- Page 5: Trying out a new handwriting
I expect subsequent projects to take much less time as I didn’t use my glitter pen in later notebooks.