WeSearch_DescriptiveStatistics

We begin by reproducing the methodology of Baldwin et al. (2013) using the WDC, with the following exceptions:

  • tokenisation using REPP, with punctuation subsequently removed from tokens (see the sketch below)
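
A minimal sketch of the punctuation step, assuming REPP's output is available as a list of token strings; the helper name and the exact policy (strip punctuation characters, then drop tokens that become empty) are illustrative assumptions, not a description of the original pipeline.

```python
import string

# Translation table that deletes all ASCII punctuation characters.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Hypothetical helper: remove punctuation from REPP tokens.

    Strips punctuation characters from each token and discards
    tokens that consisted solely of punctuation.
    """
    cleaned = (tok.translate(PUNCT_TABLE) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Example: a REPP-style tokenisation of "Hello, world!"
print(strip_punctuation(["Hello", ",", "world", "!"]))  # ['Hello', 'world']
```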

Language Mix

According to langid.py (Lui and Baldwin, 2012), 100% of the WDC is identified as English. Reassuring.
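
A minimal sketch of how such a language mix can be computed with langid.py. `langid.classify` is the tool's documented interface; `wdc_documents` is a placeholder for however the WDC texts are loaded, not part of the original write-up.

```python
from collections import Counter

import langid  # pip install langid

def language_mix(documents):
    """Tally langid.py's predicted language for each document
    and return the proportion of each language."""
    counts = Counter(langid.classify(doc)[0] for doc in documents)
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}

# `wdc_documents` stands in for an iterable of WDC document strings.
# On a 100%-English collection this would return {'en': 1.0}.
```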

References

Baldwin, T., Cook, P., Lui, M., MacKinlay, A., and Wang, L. (2013). "How Noisy Social Media Text, How Diffrnt Social Media Sources?" In Proceedings of the International Joint Conference on Natural Language Processing, pp. 356-364.

Lui, M. and Baldwin, T. (2012). "langid.py: An off-the-shelf language identification tool." In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, System Demonstrations, pp. 25-30.
