NoraTop

anonymous edited this page Oct 9, 2011 · 12 revisions

Background

The WeScience0 project is a preparatory joint initiative with the Norwegian Open Research Archives (NORA) and the UiO Center for Information Technology (USIT). The project is partially funded by NORA in 2009 and forms part of the larger WeScience initiative coordinated by the UiO Language Technology Group. Some general motivation for the project and a preliminary project plan are available in the original project proposal. Related initiatives include the ACL Anthology Reference Corpus, the HyLaP project at DFKI Saarbrücken, and the UK Intute Repository Search.

At least as of August 2009, this page and the other NORA sub-pages primarily serve project-internal communication. Access to these pages is limited to registered wiki users, using the exact user name registered on the NoraGroup page. Please contact StephanOepen or GisleYtrestol if you want additional NORA pages to be created, experience difficulties reading or editing these pages, or need assistance with wiki usage more generally.

Project Objectives

The WeScience0 effort can be sub-divided by basic processing tasks. These include (a) PDF inspection (NoraInspection), (b) text extraction (NoraExtraction), (c) language identification (NoraIdentification), (d) text correction (NoraCorrection), (e) sentence boundary detection (NoraSegmentation), and (f) interfacing to the Lucene search engine (NoraLucene). Please view the individual pages for details and the current state of play.

The main deliverables from the project comprise (i) a flexible pre-processing pipeline, implementing tasks (a) through (e) above; (ii) documented knowledge on the strong and weak points of various existing tools (for text extraction and segmentation, for example) and parameter settings, correlated to common PDF production methods; and (iii) a revised on-line search interface for NORA, based on Lucene and supporting full text search. The IFI Language Technology group has the primary responsibility for tasks (a) through (e) (and, correspondingly, deliverables (i) and (ii)), whereas USIT focuses on task (f) (and deliverable (iii)).
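
As a rough illustration of how tasks (a) through (e) could be chained into one pre-processing pipeline, here is a minimal Python sketch. It is not the project's implementation: every name in it (`Document`, `inspect_pdf`, `extract_text`, `identify_language`, `correct_text`, `segment_sentences`, `run_pipeline`) is hypothetical, and each stage is a toy placeholder (for example, language identification just counts a handful of stop words, and extraction simply copies text that is assumed to be available already).

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical container passed from stage to stage."""
    source: str                      # e.g. a PDF file name (illustrative only)
    raw: str                         # stand-in for the content of the PDF
    text: str = ""
    language: str = "unknown"
    sentences: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]

# Toy stop-word lists; a real identifier would use a proper language model.
EN_WORDS = {"and", "is", "the", "of"}
NO_WORDS = {"og", "er", "det", "som"}


def inspect_pdf(doc: Document) -> Document:
    # (a) Placeholder: only records that the document was looked at.
    doc.notes.append(f"inspected {doc.source}")
    return doc


def extract_text(doc: Document) -> Document:
    # (b) Placeholder: a real stage would call a PDF text extraction tool.
    doc.text = doc.raw
    return doc


def identify_language(doc: Document) -> Document:
    # (c) Toy heuristic: compare stop-word counts for English and Norwegian.
    tokens = re.findall(r"\w+", doc.text.lower())
    en_hits = sum(t in EN_WORDS for t in tokens)
    no_hits = sum(t in NO_WORDS for t in tokens)
    if en_hits > no_hits:
        doc.language = "en"
    elif no_hits > en_hits:
        doc.language = "no"
    return doc


def correct_text(doc: Document) -> Document:
    # (d) Rejoin words hyphenated across line breaks, then normalize whitespace.
    text = re.sub(r"-\n(\w)", r"\1", doc.text)
    doc.text = re.sub(r"\s+", " ", text).strip()
    return doc


def segment_sentences(doc: Document) -> Document:
    # (e) Naive split after sentence-final punctuation followed by whitespace.
    doc.sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.text) if s]
    return doc


DEFAULT_STAGES: List[Stage] = [
    inspect_pdf, extract_text, identify_language, correct_text, segment_sentences,
]


def run_pipeline(doc: Document, stages: List[Stage] = DEFAULT_STAGES) -> Document:
    # Apply each stage in order; stages can be swapped or reordered freely.
    for stage in stages:
        doc = stage(doc)
    return doc
```

Keeping each task as a separate function with the same signature is one way to get the flexibility mentioned above: a stage can be replaced by a wrapper around a different tool, or run with different parameter settings, without touching the rest of the chain.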

TextGrabber

TextGrabber, the software developed to meet the demands of this project, is now finished, and the technical report can be downloaded here.

People Involved
