Automated OCR for Synapse

Discussion in 'Feature: Requests and Planning' started by Graham, Nov 10, 2007.

  1. Graham

    Graham Developer Staff Member

    One new optional service that we will be offering is a utility that will automatically OCR all your scanned documents. The results will be automatically available to you in the inbox and in the patient's results as a new OCR tab.

    This will be offered as a web service, and initial testing shows that the OCR is much better than most commercial offerings, which don't even offer a way to automate the OCR of documents and rely instead on user-driven GUIs.

    At about 15 seconds per document, I expect it would take a day or so to OCR all your existing documents. After that the utility (Synapse-Web-OCR) will sleep for 5 minutes at a time before waking to check for new documents. Even while it is working there is very little impact on performance, with only a tiny % of CPU use, because the OCR itself is done by a web service.

    By default, only PDF and TIFF images will be OCR'd. There will be a choice of Indo-European languages.
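
    A rough sketch of the loop described above, in Python and purely for illustration. The folder layout, the `ocr` callable and every function name here are assumptions, not the actual Synapse-Web-OCR code. It also shows the arithmetic behind "a day or so": at 15 seconds each, about 5,760 documents fit into 24 hours.

```python
import time
from pathlib import Path

SECONDS_PER_DOCUMENT = 15                    # figure quoted in the post
SLEEP_SECONDS = 5 * 60                       # sleep 5 minutes between passes
OCR_EXTENSIONS = {".pdf", ".tif", ".tiff"}   # only PDF and TIFF by default


def documents_per_day(seconds_per_document=SECONDS_PER_DOCUMENT):
    """How many documents fit into 24 hours at the quoted rate."""
    return (24 * 60 * 60) // seconds_per_document


def pending_documents(folder, already_done):
    """Return PDF/TIFF files in `folder` that have not been OCR'd yet."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() in OCR_EXTENSIONS
        and p.name not in already_done
    )


def run_once(folder, already_done, ocr):
    """OCR every pending document once; `ocr` is a caller-supplied function
    standing in for the call to the web service (hypothetical)."""
    results = {}
    for path in pending_documents(folder, already_done):
        results[path.name] = ocr(path)
        already_done.add(path.name)
    return results


def run_forever(folder, ocr, sleep=time.sleep):
    """Work through the backlog, then wake every 5 minutes to look for new files."""
    done = set()
    while True:
        run_once(folder, done, ocr)
        sleep(SLEEP_SECONDS)
```

    Because the heavy lifting happens on the remote service, the local process spends nearly all of its time waiting, which fits the low CPU use mentioned above.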
  2. Graham

    Graham Developer Staff Member

    One possible outcome of having high-quality OCR is that it may be possible to automatically pull your labs from scanned/faxed reports and have them injected digitally into your database.

    My experience so far is that TIFF is better than PDF, and a fax directly from a laboratory and received electronically by a fax server is way better than a paper report that you scan in.
  3. Jason

    Jason Developer / Handyman Staff Member

    Scanned images of good-quality typed text work very well for me. I use .pdf to store the results of the scan and OCR. I don't think that the format you store the results in makes a difference. In the situation where you have previously created documents, and your plan is to read the file, OCR it, and then store the image and extracted text as a file ... TIFF might be better, but that is unlikely. It might also be quite dependent on the OCR engine used.
  4. Graham

    Graham Developer Staff Member

    Here's an example of the results extractor tool working on the OCR'd text.

    Only some tests are being looked at (e.g. HCT, RCC, RDW etc. are ignored), and from the image you can see that some tests cannot be retrieved because of the way the OCR has come back.

    [​IMG]
  5. Graham

    Graham Developer Staff Member

    [IMG]

    This was a much worse scan, and so the OCR is also much worse. Note, however, that Synapse is able to recognize some of the misspelled results, including "Rilirubin" and "Alkali-ne Phosphatase".
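
    For anyone curious how misspelled names like these could be recognized, here is a minimal sketch using approximate string matching from Python's standard `difflib` module. This is an assumed technique shown for illustration, not necessarily what Synapse does internally; the list of known tests and the 0.8 cutoff are made up.

```python
import difflib
import re

# Hypothetical list of test names the extractor knows about.
KNOWN_TESTS = [
    "Bilirubin",
    "Alkaline Phosphatase",
    "Creatinine",
    "Cholesterol",
    "Triglycerides",
]


def normalise(name):
    """Lower-case and keep letters only, so 'Alkali-ne' becomes 'alkaline'."""
    return re.sub(r"[^a-z]", "", name.lower())


def match_test_name(ocr_name, known=KNOWN_TESTS, cutoff=0.8):
    """Return the closest known test name, or None if nothing is similar enough."""
    lookup = {normalise(k): k for k in known}
    hits = difflib.get_close_matches(
        normalise(ocr_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


print(match_test_name("Rilirubin"))              # Bilirubin
print(match_test_name("Alkali-ne Phosphatase"))  # Alkaline Phosphatase
print(match_test_name("Sodium"))                 # None
```

    A higher cutoff gives fewer false matches but misses more badly garbled names, so the value would need tuning against real scans.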

  6. Graham

    Graham Developer Staff Member

    This is a new screen which shows each captured value with a * beside it if it is outside the normal range, or a ? if there is no normal range recorded for this result type. If there is nothing beside it, it lies within the normal range.

    I have clicked on a result to edit it before inserting the results into Synapse.

    The normal ranges are based upon the values entered into the normals section under Settings.

    [​IMG]
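
    The flagging rule described in this post can be sketched in a few lines of Python. The function names and the sample normal ranges below are illustrative assumptions, not the Synapse implementation or its data.

```python
# Hypothetical normal ranges keyed by result type: (low, high).
# Either bound may be None when only one limit is recorded.
NORMALS = {
    "Glucose fasting": (3.9, 6.0),
    "Creatinine": (60, 125),
}


def flag_result(value, normal_range):
    """Return '*' if outside the range, '?' if no range is recorded, '' if normal."""
    if normal_range is None:
        return "?"
    low, high = normal_range
    if (low is not None and value < low) or (high is not None and value > high):
        return "*"
    return ""


def annotate(name, value, normals=NORMALS):
    """Format one captured value with its flag, as it might appear in a list."""
    return f"{name} {value} {flag_result(value, normals.get(name))}".rstrip()


print(annotate("Glucose fasting", 7.2))  # Glucose fasting 7.2 *
print(annotate("Creatinine", 81))        # Creatinine 81
print(annotate("HCT", 0.41))             # HCT 0.41 ?
```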

  7. Jason

    Jason Developer / Handyman Staff Member

    This file is a downloaded .pdf file that should be a perfect copy of the relevant lab. I suppose the OCR isn't perfect, but this .pdf is a searchable .pdf and the text could somehow be extracted. When I copy and paste the embedded text data, there are no spelling errors, etc.

  8. Jason

    Jason Developer / Handyman Staff Member

    The main reason why my labs didn't import perfectly into Synapse is that the OCR engine made some mistakes.


    The neat part is that it will likely make the exact same mistakes EVERY TIME, as the .pdfs I submit for OCRing are not scanned but digitally assembled PDFs (by my online lab, ilablink).


    The "mistakes" are actually adding spaces where they shouldn't be... examples below.



    WHITE BLOOD CELL COUNT 6.2 4.0 - 11.0 x E9/1- AO
    GLUCOSE- FASTI NG 5.7 3.9 6.0 MMOL/L AO
    CREATI N I N E 81 60-125 UMOL/L AO

    CH LORI D E 106 95-110 MMOL/L AO
    *CHOLESTEROL/ HDL RATIO 4.0 AO
    TRI GLYCERI DES 1.04 MMOL/L AO
    ASPARTATE
    TRANSAMI NASE(AST)
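
    Since the errors are consistent and consist only of extra spaces, one way to cope is to ignore whitespace when comparing the start of each line against the known test names. The sketch below is an illustration under that assumption; `KNOWN_TESTS` and `parse_line` are hypothetical names, not part of Synapse.

```python
import re

# Hypothetical list of test names as the lab prints them.
KNOWN_TESTS = [
    "WHITE BLOOD CELL COUNT",
    "GLUCOSE-FASTING",
    "CREATININE",
    "CHLORIDE",
    "TRIGLYCERIDES",
]


def squash(text):
    """Remove all whitespace and upper-case, so 'CREATI N I N E' -> 'CREATININE'."""
    return re.sub(r"\s+", "", text).upper()


def parse_line(line, known=KNOWN_TESTS):
    """Return (test name, rest of line) for the longest known name that matches
    the start of the line once spaces are ignored, or None if nothing matches."""
    targets = {squash(k): k for k in known}
    seen = ""
    best = None
    for i, ch in enumerate(line):
        if ch.isspace():
            continue
        seen += ch.upper()
        if seen in targets:
            best = (targets[seen], line[i + 1:].strip())
    return best


print(parse_line("CREATI N I N E 81 60-125 UMOL/L AO"))
# ('CREATININE', '81 60-125 UMOL/L AO')
print(parse_line("TRI GLYCERI DES 1.04 MMOL/L AO"))
# ('TRIGLYCERIDES', '1.04 MMOL/L AO')
```

    Keeping the longest match matters when one test name is a prefix of another, for example a plain cholesterol line versus a cholesterol/HDL ratio line.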
  9. Graham

    Graham Developer Staff Member

    Can't you train the OCR engine so it won't make these mistakes?
  10. Jason

    Jason Developer / Handyman Staff Member

    I don't think so, PaperPort/Omnipage isn't trainable.
  11. Jason

    Jason Developer / Handyman Staff Member

    [quote user="Jason"]

    I don't think so, PaperPort/Omnipage isn't trainable.

    [/quote]

    Well, these errors come from the web service OCR. These lab files are digitally created PDFs.
  12. Graham

    Graham Developer Staff Member

    I'm about to release a database update tool to create the OCR rules, but I need the following in US units. I already have the SI units.

    Hb
    Globulins
    Bilirubin
    ALT
    AST
    GGTP
    Alk Phos
    Cholesterol (total)
    HDL
    LDL
    Triglycerides
    Creatinine
    eGFR
    Glucose fasting



    I need the units and the approximate possible ranges (not the normal ranges).

    If I don't get them, I will release as SI and you'll have to update each manually to your own units and ranges.
  13. Graham

    Graham Developer Staff Member

    A tutorial on this new feature is here.
  14. qilin

    qilin Member

    I can give you the units and common reference ranges, but I'm not sure about the possible ranges for many of them (maybe use 0 as the lower limit and about 5 times the upper normal value, or more, as the upper limit?):

    Hb 11.7-15.5 g/dL
    Globulins
    Albumin 3.5-4.9 g/dL
    Globulin, calculated 2.2-4.2 g/dL
    Bilirubin, total 0.2-1.3 mg/dL
    ALT 3-35 U/L
    AST 3-35 U/L
    GGTP 8-78 U/L
    Alk Phos 38-126 U/L
    Cholesterol (total) <200 mg/dL
    HDL >= 40 mg/dL
    LDL < 130 mg/dL
    Triglycerides <150 mg/dL
    Creatinine 0.5-1.2 mg/dL
    eGFR >= 60 mL/min/1.73 m2 (multiply by 1.21 for African-American)
    Glucose fasting 65-99 mg/dL
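
    The suggested rule of thumb (0 as the lower limit, about 5 times the upper normal value as the upper limit) is easy to express in code. This is only a sketch of that heuristic with hypothetical function names; the factor of 5 is the rough guess from this post, not a validated clinical limit.

```python
def possible_range(upper_normal, factor=5):
    """Possible (not normal) range: 0 up to `factor` times the upper normal value."""
    return (0, upper_normal * factor)


def is_plausible(value, upper_normal, factor=5):
    """True if an OCR'd value falls inside the possible range for its test."""
    low, high = possible_range(upper_normal, factor)
    return low <= value <= high


# Fasting glucose, upper normal 99 mg/dL from the list above.
print(possible_range(99))      # (0, 495)
print(is_plausible(57, 99))    # True
print(is_plausible(950, 99))   # False, e.g. an extra digit picked up by the OCR
```

    A check like this would catch gross OCR errors, while values that are merely abnormal still pass and can be flagged separately against the normal range.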
  15. Graham

    Graham Developer Staff Member

    Looks good ... but I won't be able to update the tool till tonight now.

  16. Graham

    Graham Developer Staff Member

    Actually managed to do it now .... version 0.0.4

  17. Graham

    Graham Developer Staff Member

    Consulting today ... and the results for one of my chronic rheumatoid patients were online at the laboratory, but the lab had not sent them to me via HL7. I'll be sending them another complaint. But in the meantime:
    1. I switched to the patient's PaperPort directory.
    2. Printed her results to PDF there.
    3. Imported the PDF as a result.
    4. A few mins later the automatic OCR was done.
    5. Then Synapse extracted the results digitally for importing!

    A little tedious ... but at least better than just having the results online at the laboratory.

  18. Graham

    Graham Developer Staff Member

    Uploading scanned documents to Synapse is one of the more tedious aspects ... but so far essential if one wishes to become paperless.

    I am considering a process whereby Synapse will scan the incoming scan directory, and attempt to determine which document is for which patient. If it is fairly certain, it can then upload automatically to the right patient's file and doc's inbox.

    Because it will be using the OCR web service, though, there will be a charge for this utility.

    Any interest??
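
    To make the "fairly certain" idea concrete, here is one possible sketch: count how many identifying details of each patient appear in the OCR'd text, and only auto-file the document when exactly one patient matches on all of them. The `Patient` fields, the scoring and the threshold are all assumptions for illustration; nothing here reflects a decided Synapse design.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: int
    surname: str
    given_name: str
    date_of_birth: str  # as it would be printed on a document, e.g. "12/03/1950"


def score(patient, text):
    """Count how many identifying fields appear in the OCR text."""
    haystack = text.lower()
    fields = (patient.surname, patient.given_name, patient.date_of_birth)
    return sum(1 for field in fields if field.lower() in haystack)


def match_patient(text, patients, min_score=3):
    """Return the single patient matching on at least `min_score` fields,
    or None if there is no match or the match is ambiguous."""
    candidates = [p for p in patients if score(p, text) >= min_score]
    return candidates[0] if len(candidates) == 1 else None
```

    Anything that returns None would stay in the scan directory for manual filing, so a wrong guess never lands in a patient's file.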
  19. Jason

    Jason Developer / Handyman Staff Member

    Of course.

    How much? My OmniPage / PaperPort gets such good OCR results that I would hesitate to try another method. Maybe an option to purchase the Auto-importing / Data Extraction engine as well. Considering Omnipage / PaperPort
  20. Graham

    Graham Developer Staff Member

    I'd prefer not to have to interface with multiple OCR engines ... and would need one with a command line interface.

    Cost .. about $200/year.
