Background

We are seeking to collect user-generated text to support the evaluation of parser adaptation across domains and genres. We are interested in a variety of registers: Open Access Research Literature, Wikipedia, Technology Blogs, Product Reviews and User Forums. Secondly, we collect text from sources that discuss the Linux operating system or natural language processing. The choice of these domains is motivated by our assumption that users of the corpus will be more familiar with the language used in connection with these topics.

Collected Data

NLP blogs were obtained in mid-April from the sites listed under 'NLP, blog' in the table below.

Linux blogs were also downloaded in mid-April, from the sites listed under 'Linux, blog' in the same table.

Linux forums were extracted from the Unix & Linux subset of the April 2011 Stack Exchange Creative Commons Dump. In this set, a text corresponds to a post (be it a question or an answer). If necessary, threads can be reconstructed using the primary/new id xref file.
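As a rough illustration of what thread reconstruction involves, the sketch below groups posts into threads directly from the `Id` and `ParentId` attributes that the Stack Exchange dump records on each post row. Note that this is a different route from the primary/new id xref file mentioned above, whose format is not described on this page. The sample XML and the function name `group_threads` are illustrative assumptions, not part of the collected data or scripts.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for the dump's posts file (made-up data, not from the corpus).
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" Body="How do I list files?" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="Use ls." />
  <row Id="3" PostTypeId="1" Body="What is a shell?" />
  <row Id="4" PostTypeId="2" ParentId="1" Body="Or try find." />
</posts>"""

def group_threads(xml_text):
    """Map each thread's root post id to the ids of all posts in that thread."""
    threads = defaultdict(list)
    for row in ET.fromstring(xml_text).iter("row"):
        post_id = int(row.get("Id"))
        parent = row.get("ParentId")
        # Answers carry a ParentId pointing at their question;
        # posts without one are treated as the start of a thread.
        root_id = int(parent) if parent is not None else post_id
        threads[root_id].append(post_id)
    return dict(threads)

threads = group_threads(SAMPLE)
print(threads)
```

Posts are listed in file order within each thread, so a real run may want to sort them by creation date as well.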

Linux reviews are from http://www.softpedia.com/reviews/linux/. They may require some manual cleaning: each review typically ends with a sentence like 'Check out these screenshots'.
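Part of this cleaning could probably be automated. The sketch below drops the final sentence of a review when it looks like the screenshot pointer. The regular expression is a guess based on the single example given above and would need tuning against the real data. The function name `strip_trailer` is hypothetical.

```python
import re

# Assumed pattern: a sentence that starts with "Check out" and mentions
# screenshots. Wording in the actual reviews may vary.
TRAILER = re.compile(r"check out\b.*\bscreenshots?\b.*", re.IGNORECASE)

def strip_trailer(sentences):
    """Return the sentences without a trailing screenshot pointer, if present."""
    if sentences and TRAILER.match(sentences[-1].strip()):
        return sentences[:-1]
    return sentences

review = ["The distribution boots quickly.", "Check out these screenshots."]
print(strip_trailer(review))
```

Only the last sentence is tested, so a review that mentions screenshots in its body is left alone.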

The Linux wiki set was created following the method used for WikiWoods.

All data and scripts are in /ltg/jread/workspace/wesearch/data-collection. The content has been extracted by finding the most specific element that contains all the relevant text (for example, blog posts typically contain some element with an attribute indicating that it is the content element). All mark-up related to rendering has been retained for now. Sentences were obtained using the tokenizer (as used in creating WikiWoods).
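The attribute-based case can be sketched with Python's standard `html.parser`. This is only an illustration of the idea, not the extraction scripts themselves: the class name `post-body` is a hypothetical marker that would differ per site, and, unlike the collected data, this sketch discards the mark-up and keeps only the text.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "col", "param"}

class ContentExtractor(HTMLParser):
    """Collect the text inside the first element whose class matches a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker   # hypothetical class name, e.g. "post-body"
        self.depth = 0         # nesting depth inside the content element
        self.done = False      # True once the content element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.marker in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def extract_content(html, marker):
    """Return the whitespace-normalised text of the marked content element."""
    parser = ContentExtractor(marker)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())

page = """<html><body><div class="sidebar">Archive</div>
<div class="post-body entry"><p>First <em>point</em>.</p><br>
<p>Second point.</p></div>
<div class="footer">Comments</div></body></html>"""
print(extract_content(page, "post-body"))
```

The depth counter assumes that every non-void tag is explicitly closed, so pages with unclosed `p` or `li` elements would need a more forgiving parser.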

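| Section        | Source                                   | Documents | Total Items | Avg. Items |
|----------------|------------------------------------------|-----------|-------------|------------|
| NLP, blog      | http://blog.cyberling.org                | 51        | 659         | 12.9       |
| NLP, blog      | http://gameswithwords.fieldofscience.com | 457       | 11,014      | 24.1       |
| NLP, blog      | http://lingpipe-blog.com                 | 343       | 12,693      | 37.0       |
| NLP, blog      | http://nlpers.blogspot.com               | 249       | 7,612       | 30.6       |
| NLP, blog      | http://thelousylinguist.blogspot.com     | 536       | 7,748       | 14.5       |
| Linux, blog    | http://embraceubuntu.com                 | 220       | 2,957       | 13.4       |
| Linux, blog    | http://www.linuxscrew.com                | 312       | 4,049       | 13.0       |
| Linux, blog    | http://www.markshuttleworth.com          | 159       | 4,503       | 28.3       |
| Linux, blog    | http://www.ubuntugeek.com                | 1,631     | 42,770      | 26.2       |
| Linux, blog    | http://ubuntu.philipcasey.com            | 105       | 1,577       | 15.0       |
| Linux, blog    | http://www.ubuntu-unleashed.com          | 278       | 6,362       | 22.9       |
| Linux, forums  | stack exchange                           | 9,945     | 54,249      | 5.5        |
| Linux, reviews | softpedia                                | 249       | 13,430      | 53.9       |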

Initial Parsing Results

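| Section       | Items | Coverage | Length | Ambiguity | Time | Tokens  | Types |
|---------------|-------|----------|--------|-----------|------|---------|-------|
| NLP, wiki     | 11558 | 86.4%    | 18.0   | 10859     | 8.2  | 238059  | 19396 |
| NLP, blog     | 46106 | 81.9%    | 15.5   | 8158      | 6.1  | 838592  | 41771 |
| Linux, wiki   | 40738 | 85.0%    | 18.5   | 12407     | 9.6  | 843082  | 45783 |
| Linux, blog   | 92280 | 83.7%    | 11.1   | 5151      | 3.9  | 1000683 | 48511 |
| Linux, review | 14761 | 84.6%    | 18.1   | 10610     | 7.5  | 304672  | 13158 |
| Linux, forum  | 85743 | 74.8%    | 11.0   | 4885      | 3.1  | 1115412 | 56673 |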

Corpus statistics for each section. Coverage shows what percentage of items received an analysis (using the unadapted parser 'out of the box'), and ambiguity and time give an indication of average parsing complexity (for the 'vanilla' parser configuration). Tokens shows the token count of each section, and types is the number of unique, non-punctuation tokens seen per section.
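To make the definitions of coverage, tokens and types concrete, here is a minimal sketch of how they could be computed for one section. It assumes each item is available as a list of tokens plus a flag saying whether the parser produced an analysis. It also assumes that a token counts as punctuation when it consists only of ASCII punctuation characters and that types are case-sensitive; neither choice is specified above. The name `section_stats` and the sample data are made up for illustration.

```python
import string

def section_stats(items):
    """items: list of (tokens, parsed) pairs, one pair per item."""
    n_items = len(items)
    n_parsed = sum(1 for _, parsed in items if parsed)
    tokens = [tok for toks, _ in items for tok in toks]
    # Assumption: punctuation tokens are those made up solely of
    # ASCII punctuation characters.
    types = {t for t in tokens if not all(c in string.punctuation for c in t)}
    return {
        "items": n_items,
        "coverage": 100.0 * n_parsed / n_items if n_items else 0.0,
        "tokens": len(tokens),
        "types": len(types),
    }

sample = [
    (["The", "parser", "works", "."], True),
    (["The", "kernel", "panics", "!"], True),
    (["lol", "wut", "?", "?"], False),
    (["It", "works", "."], True),
]
stats = section_stats(sample)
print(stats)
```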

Data Preparation

Identifier Format

Initial WDC identifiers take the form of:

where:

Output

Parsing Results on clean data

(wiki results just copied from above)

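| Section       | Items | Coverage | Length | Time | Tokens |
|---------------|-------|----------|--------|------|--------|
| NLP, wiki     | 11558 | 86.4%    | 18.0   | 8.2  | 238059 |
| NLP, blog     | 38498 | 80.8%    | 17.6   | 8.3  | 676080 |
| Linux, wiki   | 40738 | 85.0%    | 18.5   | 9.6  | 843082 |
| Linux, blog   | 64520 | 82.3%    | 13.2   | 5.7  | 854157 |
| Linux, review | 13430 | 80.4%    | 19.8   | 9.2  | 266063 |
| Linux, forum  | 54249 | 82.0%    | 14.8   | 5.6  | 802736 |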

And again on clean blog data:

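| Section     | Items | Coverage | Length (parsed) | Time | Tokens |
|-------------|-------|----------|-----------------|------|--------|
| NLP, blog   | 39726 | 82.9%    | 17.2 (14.8)     | 7.0  | 681896 |
| Linux, blog | 62216 | 82.2%    | 13.0 (11.1)     | 4.6  | 808600 |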

Observed Tags

| Tag        | Frequency | Definition                                                             |
|------------|-----------|------------------------------------------------------------------------|
| p          | 51214     | paragraph                                                              |
| br         | 39709     | single line break                                                      |
| div        | 17426     | section in a document                                                  |
| li         | 17267     | list item                                                              |
| strong     | 8103      | strong text                                                            |
| img        | 7329      | image                                                                  |
| blockquote | 3474      | long quotation                                                         |
| h1         | 3265      | Header 1                                                               |
| code       | 2821      | computer code text                                                     |
| td         | 2623      | cell in a table                                                        |
| em         | 1320      | emphasized text                                                        |
| h4         | 1313      | Header 4                                                               |
| var        | 1166      | variable part of a text                                                |
| h3         | 1051      | Header 3                                                               |
| h2         | 988       | Header 2                                                               |
| small      | 976       | small text                                                             |
| tr         | 945       | row in a table                                                         |
| param      | 211       | parameter for an object                                                |
| ol         | 204       | ordered list                                                           |
| font       | 184       | font, color and size for text (deprecated)                             |
| table      | 183       | table                                                                  |
| input      | 157       | input control                                                          |
| th         | 146       | header cell in a table                                                 |
| tt         | 115       | teletype text                                                          |
| kbd        | 80        | keyboard text                                                          |
| sup        | 76        | superscripted text                                                     |
| s          | 71        | strikethrough text (deprecated)                                        |
| tbody      | 68        | groups the body content in a table                                     |
| sub        | 53        | subscripted text                                                       |
| col        | 44        | attribute values for one or more columns in a table                    |
| label      | 38        | label for an input element                                             |
| dt         | 12        | term in a definition list                                              |
| dd         | 10        | description of a term in a definition list                             |
| acronym    | 6         | acronym                                                                |
| dl         | 6         | definition list                                                        |
| abbr       | 4         | abbreviation                                                           |
| noscript   | 4         | an alternate content for users that do not support client-side scripts |
| h5         | 2         | Header 5                                                               |
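A tag inventory of this kind can be gathered with a small counting pass over the extracted content. The sketch below uses Python's standard `html.parser` and counts opening tags only, which is an assumption: the page does not say how the frequencies above were obtained. The names `TagCounter` and `count_tags` and the sample documents are illustrative.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count every opening (or self-closing) tag seen in one document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1

def count_tags(documents):
    """Sum tag counts over an iterable of HTML strings."""
    total = Counter()
    for doc in documents:
        counter = TagCounter()  # fresh parser per document
        counter.feed(doc)
        counter.close()
        total.update(counter.counts)
    return total

docs = [
    "<p>One<br>two</p>",
    "<DIV><p>Three <strong>four</strong></p><br/></DIV>",
]
counts = count_tags(docs)
for tag, freq in counts.most_common():
    print(tag, freq)
```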

Other Potential Sources

Open Access Research Literature

Wikis

Product Reviews

Blogs

Mailing Lists

User Forums

Related Work

Next Steps

WeSearch/DataCollection (last edited 2012-03-05 16:17:29 by RebeccaDridan)
