Case Example - Tesseract to OCR Faxes

Discussion in 'Documentation & Training Resources' started by Jerry, Dec 9, 2008.

  1. Jerry

    Jerry Administrator Staff Member

    Graham worked with me last evening and today getting the new FaxQ Agent and Tesseract working correctly on my system. I run the synapse client on a fairly fast Windows 2000 machine, and the synapse-server on a Windows 2000 VM running inside ubuntu 8.04 server. My hylafax server is hosted on ubuntu. I have hylafax's recvq directory, where new faxes are received, set up as a shared (and mapped) network drive for the Win2k VM. I use VirtualBox, which is very similar to VMware Player and works well. Graham had to modify the faxq.exe "daemon" a bit to work correctly on my mapped drive, and in addition I had to do the following:

    1. Run the Windows 2000 VM as root. This is because I must run hylafax as root for Graham's Fax Portal gateway to work correctly.
    2. Make sure the Windows 2000 VM folder that is shared with ubuntu as a mapped drive is owned by root, and has read/write enabled.
    3. Make sure ImageMagick is installed per Graham's instructions on the forum.
    4. Set the path for the mapped drive in synapse client under Setting-->User-->Fax Inbox. In my case it is the UNC pathname "\\VBOXSVR\recvq\". I could not get the syntax "S:\" (the designation for my mapped drive) to work.
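
    The share setup in the steps above can be sanity-checked before pointing the client at it. This is only a minimal sketch, assuming Python is available on the machine that needs access; check_fax_inbox is a hypothetical helper and not part of synapse or the FaxQ agent. It confirms the directory is reachable and that a file can actually be created and deleted there.

```python
import os
import tempfile


def check_fax_inbox(path):
    """Return a list of problems found with the fax inbox directory.

    An empty list means the directory is reachable, readable and writable.
    Hypothetical helper for illustration; not part of synapse or FaxQ.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append("not a directory or not reachable: " + path)
        return problems
    if not os.access(path, os.R_OK):
        problems.append("directory is not readable")
    # Create and delete a real probe file, because os.access() can be
    # misleading on network shares.
    try:
        fd, probe = tempfile.mkstemp(dir=path, prefix="probe_")
        os.close(fd)
        os.remove(probe)
    except OSError as exc:
        problems.append("cannot write to directory: %s" % exc)
    return problems
```

    For the UNC path above, the call would look like check_fax_inbox(r"\\VBOXSVR\recvq"); any entries in the returned list point at a share or permission problem rather than an OCR problem.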

    After getting the setup correct, the new faxq.exe agent imports my tiffG3 faxes from hylafax, OCRs them, and stores them in the synapse-server's cache-listener directory. The OCR output is automatically displayed in Inbox-->Faxes.

    Here's a PDF of the sample fax file I was using (with patient info blocked, of course) -- click on "TestFax.pdf" attachment above to view:

    Here's the OCR output from Tesseract for this file (the patient name and demographics were accurate). Since I am really just interested in the results, which were OCR'd accurately along with the lab and patient identifiers, I could quickly edit out the "garbage", throw in a couple of spaces, and be on my way.

    NOV. 26. 2000 4:i2PM USHEALTHWORKS NORTH NO. 560 P. i
    Spokane, W 9921 -1235 . LABORATORIES
    (509) 755-B600| (SOO) 541-7891 i FAX (509)924-0002
    CLIENT SERVICES (509)755-B999| FAX (509)924-5127
    Muaaw mamma 1·n¤m¤s J Anagram
    FIRSTNAME, LASTNAME 0-41819XXXX 04/13/XXXX M 36 Y 509-XXX-XXXX 41819XXXX
    PARK DO. JERRY 1 'I/13/2008 16:00 1 1/13/2008 1000734 Final 1
    COMMENTS: H534579:FT3 , T4, TSH
    For additional diagnostic criteria, see our Iirectory at www.pamI.c0m
    Diagnostic Procedure X In Range Out of Range Units Reference Range Site Code
    Free T3 3.2 pg/mL 2.3-4.2 01
    Thyroxine (T4) 8.8 ug/dL 5.0-12.0 O1
    TSH 0.64 uIU/mL 0.40-5.00 01
    End or Report
    FIRSTNAME, LASTNAME 11/14/2008 08:04 D223

  2. Graham

    Graham Developer Staff Member

    Of course if you run Windows only, you should not have these permission problems!

  3. Jerry

    Jerry Administrator Staff Member

    I should mention that, initially with this file, Tesseract simply reported it was unable to OCR. My problem was Linux permissions preventing Tesseract from accessing the fax file. If you try the new FaxQ agent and find that relatively straightforward faxes cannot be OCR'd, you might have a problem with Tesseract's access to the file as well (though probably for a different reason). So don't just assume it's a Tesseract failure if no faxes, not even simple ones, are being OCR'd.
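
    One way to separate an access problem from a genuine OCR failure is to try reading the file directly before blaming Tesseract. The following is a rough sketch under that assumption; diagnose_fax_file is a hypothetical name, and the check only looks at the first four bytes (the standard TIFF and PDF file signatures).

```python
# Little-endian ("II") and big-endian ("MM") TIFF file signatures.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def diagnose_fax_file(path):
    """Classify a fax file before OCR is attempted.

    Hypothetical helper for illustration; not part of FaxQ or Tesseract.
    """
    try:
        with open(path, "rb") as fh:
            header = fh.read(4)
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "permission denied"
    except OSError as exc:
        return "unreadable: %s" % exc
    if not header:
        return "empty"
    if header in TIFF_MAGIC:
        return "tiff"
    if header == b"%PDF":
        return "pdf"
    return "unknown format"
```

    If this reports "permission denied" or "missing" when run as the same user as the agent, the problem is file access and not recognition.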
  4. laumansm

    laumansm New Member

    As far as I remember, the problem that led to Tesseract not being able to OCR was that I had not installed ImageMagick correctly, so it could not convert my input PDFs into the TIFFs needed for OCR'ing. Marius
  5. Jerry

    Jerry Administrator Staff Member

    Is the PDF-to-TIFF conversion working OK now? And if so, how's the Tesseract OCR conversion rate and accuracy? I'm just working with tiffG3 "regular" fax files so far.
  6. Jerry

    Jerry Administrator Staff Member

    If your fax software creates tiffG3/G4 files of incoming faxes (I think most do), the quick and dirty way to get stuff into the right file format (if you're still having OCR recognition trouble) might be to simply fax stuff back to yourself. The fax in my example was a printed paper lab result that I put through a "real" fax machine back to myself.
  7. Graham

    Graham Developer Staff Member

    Hylafax normally creates faxes in a compressed TIFF or PDF format.

    Tesseract requires an uncompressed TIFF format.

    So, ImageMagick converts compressed TIFFs and PDFs to uncompressed TIFFs.

    You need a working ImageMagick installation!!
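
    The two stages described here can be sketched as a pair of command lines driven from a script. This is an illustration only and not how faxq.exe itself is implemented: the function names are hypothetical, the executable names are assumed to be on the PATH, and the ImageMagick option used is "-compress None". Multi-page PDFs will likely need extra options (resolution, page selection) that are not shown.

```python
import subprocess


def build_convert_command(src, dst, convert_exe="convert"):
    """ImageMagick command that rewrites src as an uncompressed image at dst.

    convert_exe is an assumption: on Windows a bare "convert" may resolve to
    an unrelated system utility, so a full path may be needed, and newer
    ImageMagick releases use "magick" instead.
    """
    return [convert_exe, src, "-compress", "None", dst]


def build_tesseract_command(tiff_path, output_base, tesseract_exe="tesseract"):
    """Tesseract command; the text is written to output_base + '.txt'."""
    return [tesseract_exe, tiff_path, output_base]


def ocr_fax(src, work_tiff, output_base):
    """Convert a compressed fax to an uncompressed TIFF, then OCR it.

    Raises subprocess.CalledProcessError if either tool exits with an error,
    which makes it clear which stage failed.
    """
    subprocess.run(build_convert_command(src, work_tiff), check=True)
    subprocess.run(build_tesseract_command(work_tiff, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

    Running the convert step by hand on one fax is a quick way to confirm the ImageMagick installation works before looking at Tesseract at all.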

  8. Jason

    Jason Developer / Handyman Staff Member

    With all this new stuff, my Snappy Fax is looking OLD ! :(

    Actually, it would be great to receive FAX medication renewals right in my inbox.

    Maybe that's the clincher !

  9. Graham

    Graham Developer Staff Member

    If snappyfax saves the incoming faxes in an accessible directory, you can use the same technique.

  10. Jason

    Jason Developer / Handyman Staff Member

    It does !


    Do you think it's possible to have SnappyFax receive, but the Fax Agent will OCR it and place it in the Fax Area ?
  11. Graham

    Graham Developer Staff Member

    Sure ... the FaxQ agent doesn't care how you get the files to it ... you can drop them there from email if you want.
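
    The "drop files in a directory" model can be pictured as a simple polling loop. The sketch below is not the FaxQ agent's actual code; find_new_faxes, the extension list, and the in-memory "already seen" set are all assumptions made for illustration.

```python
import os

# Assumed set of fax file types; adjust to whatever the fax software writes.
FAX_EXTENSIONS = (".tif", ".tiff", ".pdf")


def find_new_faxes(inbox_dir, already_seen):
    """Return full paths of fax files in inbox_dir not reported before.

    already_seen is a set of file names that is updated in place, so calling
    this repeatedly only returns files that arrived since the last call.
    """
    new_files = []
    for name in sorted(os.listdir(inbox_dir)):
        full_path = os.path.join(inbox_dir, name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories
        if not name.lower().endswith(FAX_EXTENSIONS):
            continue  # skip non-fax files
        if name in already_seen:
            continue  # already handed off for OCR
        already_seen.add(name)
        new_files.append(full_path)
    return new_files
```

    Because the function only looks at what is in the directory, it does not matter whether the files arrived from hylafax, SnappyFax, or a manual copy.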
