  Google Open Sources OCR Code, Launches Digital Library Project
Posted by Iphone, 23-Apr-2007, 03:50 AM

A recent Google Inc. blog post by Uber Tech Lead Luc Vincent reveals that an ex-Hewlett-Packard Co. employee now working at the search giant has helped dust off an optical character recognition (OCR) engine with the intent of releasing it as open source.

Tesseract, the engine, was developed by HP Labs but was tucked away when the company pulled out of the OCR business.
Google called on the Information Science Research Institute at the University of Nevada, Las Vegas (UNLV), which is known for its expertise in OCR, to help debug the code. With help from researchers at UNLV, the OCR application made its way into the open source domain, Kazem Taghva, the university's associate director for information science, said Wednesday.
Taghva said that as an open source project, the application could be reviewed and improved by university researchers and OCR experts. "One of the guys who once worked for HP now works for Google," he said. "We are working on the project, but Google really has taken the lead in debugging the software."
Today, the Tesseract OCR project supports only the English language and does not yet include a page layout analysis module, so it performs poorly on material with multiple columns. "It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there," Vincent wrote on the company blog.
With Google's plan, announced Tuesday, to provide a service that lets people search archived news articles, the reasoning behind the software's resurrection becomes perfectly clear, said David Doermann, associate research scientist at the University of Maryland, College Park. "I'm sure Google intends to automate the process," he said. "They are probably not automatically OCRing them now. Most archives likely have been done by hand."
Doermann said putting the application into an open source project will also help with getting answers to problems not addressed by OCR, such as analysis of complex pages, for example scanning figures and drawings, or text that lies on intricate backgrounds.
It may be too early for Google to take advantage of Tesseract OCR as an open source project to build its digital library, but it could help over the long haul, researchers said.
Through the Google News Archive Search service, Google will work with the New York Times Co. and Washington Post Co., as well as news-retrieval services such as Reed Elsevier Inc.'s LexisNexis, to make articles available.
The Wall Street Journal and Factiva, a news-retrieval service owned by Journal publisher Dow Jones & Co. and Reuters Group plc, also will make articles searchable through the Google service.
Consumers will have the option to search full-text articles using keywords. Google will make summaries available to view for free, but access to the content will require a fee.
Google says it won't host content itself or charge content owners or consumers for the service. Content owners will handle article delivery, pricing, and billing to consumers.