I am currently attempting to move to a paperless office. Here are my notes from investigating the technical side.
Goals
- This needs to be simple. Currently, filing a sheet of paper requires me to fetch the folder, punch the holes, put the sheet in the right register, and finally put the folder back in the drawer. For the digital version, at most I want to place some sheets of paper into the scanner, press a button, be queried for a title and maybe some tags, then click Save and be done.
- The metadata I create must not be locked into some proprietary application. That means that any titles/keywords must be included with the PDF file. Ideally, any folder hierarchy would also be reflected in the filesystem, though obviously this is tough if you also want to support one document being in multiple folders.
The scanner
Basically, the ScanSnap S1500 seems to be what everyone is using. In the same price segment, there is also the Canon DR-C125.
I chose the Canon because it is a bit smaller, and because it supports TWAIN. As it turns out, though, basically every program I tested supports the ScanSnap anyway, while some do not support TWAIN, so go figure.
I’m reasonably happy with the Canon. Pages are never scanned 100% without skew, but I assume that’s to be expected (most scan tools have deskew functionality). The ability to scan thick and longer-than-usual sheets of paper already came in useful. The optional ability to eject paper onto the surface in front of the scanner is troublesome; new sheets keep pushing the old ones away, and sometimes even end up below earlier ones.
A tool to scan and OCR
Nearly every document manager I tried supports taking documents directly from the scanner. Most apply some version of OCR while they’re at it. Many use the not-so-great Tesseract engine. None really do what I want.
When I’m digitizing my archive and put 30 sheets of paper in the scanner, some of which belong to the same document, I need a UI that allows me to quickly group those sheets together. Apps generally give you only two options: save each scanned page separately, or merge everything into one document.
The one exception is Abbyy FineReader 11, which does allow splitting scanned pages off into new documents, with an interface that is, relatively speaking, pretty close to what I’m looking for. But it then opens a new window for each document and requires you to OCR and save each one manually. It also likes to crash when doing this.
I didn’t have a close look at the OCR results of most of the apps I tried, but I did compare CaptureOnTouch, the tool that ships with the Canon scanner, with Abbyy FineReader 11, and the latter is clearly superior. CaptureOnTouch doesn’t even recognize German umlauts.
I gather from online sources that the OCR in Acrobat X is not supposed to be as good as Abbyy’s either. I also did not try OmniPage.
A tool to tag/organize/view
I tried a huge number of those, for Windows and OS X, mostly looking for one that a) integrates easy batch scanning, as described above, and b) isn’t a metadata blackbox. I’m not a happy camper.
Obviously, storing metadata is a hard problem.
- You can store it in the file, but are limited by formats; and changing the metadata means the file needs to be changed (which is potentially dangerous, impacts backups etc).
- You can store it via xattr or alternate data streams, but it’s easy to lose these across filesystems.
- You can store it in sidecar files, but they are also easy to lose, and cause clutter.
- You can store it in an external database, but then you are locked into the particular tool you are using.
Also, having a document in multiple categories cannot be easily mapped into the filesystem.
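To make the sidecar option from the list above concrete, here is a minimal sketch. The `<name>.pdf.json` naming convention and the `title`/`tags` fields are assumptions of mine for illustration, not something any of the tools I tried actually use. It shows both the upside (the PDF itself is never modified) and the downside (if the sidecar goes missing, the metadata is simply gone).

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def sidecar_path(pdf_path: Path) -> Path:
    """Return the sidecar path for a PDF, e.g. scan.pdf -> scan.pdf.json.

    The ".json" suffix is a hypothetical convention chosen for this sketch.
    """
    return pdf_path.with_name(pdf_path.name + ".json")


def write_sidecar(pdf_path: Path, title: str, tags: list[str]) -> Path:
    """Write title and tags next to the PDF without touching the PDF itself."""
    target = sidecar_path(pdf_path)
    payload = {"title": title, "tags": sorted(set(tags))}  # de-duplicated, stable order
    target.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target


def read_sidecar(pdf_path: Path) -> dict:
    """Read the metadata back; return an empty record if the sidecar was lost."""
    target = sidecar_path(pdf_path)
    if not target.exists():
        return {"title": "", "tags": []}
    return json.loads(target.read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pdf = Path(tmp) / "phone-invoice.pdf"  # hypothetical file name
        pdf.write_bytes(b"%PDF-1.4\n")  # placeholder bytes, not a valid PDF
        write_sidecar(pdf, "Phone invoice", ["invoice", "telecom"])
        print(read_sidecar(pdf))
```

Because the sidecar is a separate file, anything that moves or renames the PDF (a file manager, a sync tool, a backup restore) has to move the `.json` along with it, which is exactly the fragility I am worried about.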
That said, here’s a quick rundown of the apps I did try, and the reasons for dismissing them:
- Blackbox
- Stores metadata in xattr; neat, but not cross-platform. I don’t trust tools not to lose this data over a period of decades.
- Does support TWAIN scanners.
- Maybe the best of the bunch. *Very close*.
- Available on Windows, though that version isn’t quite as nice.
- Unfortunately, does not support storing files outside its library package. This also means any collections I create are in a black box.
- Does OCR (using Tesseract), but doesn’t store it in the PDF itself, so this won’t do any harm to an OCR layer already there.
- Seems like a good tool, but also a total lock-in.
- Simple and nice, but not a lot of functionality (not even PDF preview), basically just a list.
- Strange system of account / account numbers
- Quite heavy, didn’t really leave an impression.
- Uses OpenMeta for tags (xattr).
- But apparently cannot write them to the PDF file.
- Can work with PDFs that aren’t copied into its library package.
- But has its own folder/groups system which does not represent the filesystem.
- Very close.
- Mostly a black box, but allows you to link an external folder structure (using the Index menu item; I didn’t get it at first).
- This allows you to use the app for search, but metadata changes made are not synced to the files.
- It doesn’t seem to show/search metadata from the actual PDF, but I may be mistaken here.
Other tools I tried: Papers (quite nice, but too research-tailored), Mendeley (also totally research-tailored).
Resolution
At this point, I’ve decided to put together a small utility for myself to sort scanned pages into documents, and send the whole thing through OCR. I’m then going to store keywords directly in the PDF files, save those in a relatively flat folder structure, and use Windows search / iFilter to access them, forgoing a specialized GUI.
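The core of that utility, sorting a batch of scanned pages into documents, could look roughly like the following. This is only a sketch under my own assumptions: pages are numbered from 1 in scan order, and the user marks the pages on which a new document begins. The name `group_pages` is made up for illustration; scanning, OCR and writing the PDFs are left out entirely.

```python
from __future__ import annotations


def group_pages(page_count: int, first_pages: list[int]) -> list[list[int]]:
    """Split pages 1..page_count into consecutive documents.

    first_pages holds the 1-based page numbers that start a new document.
    Page 1 always starts a document, whether or not it is listed.
    """
    if page_count < 1:
        return []
    starts = sorted({1, *first_pages})  # de-duplicate and order the markers
    if starts[0] < 1 or starts[-1] > page_count:
        raise ValueError("start pages must lie within 1..page_count")
    # Each document ends right before the next one starts.
    ends = starts[1:] + [page_count + 1]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


if __name__ == "__main__":
    # Six scanned sheets; new documents begin on pages 3 and 4.
    for number, pages in enumerate(group_pages(6, [3, 4]), start=1):
        print(f"document {number}: pages {pages}")
```

Keeping the grouping as plain page-number lists means the same result can be handed to whatever does the actual PDF splitting and OCR afterwards.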
Hi, looking at your blog while browsing the source of SmartInspect-Python project. Take a look at FileCenter by Lucion — it’s a surprisingly good application for scanning. (I’ve done a similarly broad search for paperless-office tools). It keeps the files in real directories and has the ability to manipulate the PDFs (slice and recombine). It also comes with built-in indexing/searching. I don’t work for Lucion — just a happy user.
Mark