Big (and Open) Data for Scholarship of All Sizes: A New Release of the HathiTrust Research Center Extracted Features Dataset

HathiTrust today announces the release of a significantly expanded open dataset, the HathiTrust Research Center (HTRC) Extracted Features (EF) Dataset, Version 1.0. This dataset provides researchers with open access to data extracted from the full text of the HathiTrust Digital Library (HTDL) at an unprecedented scale.

The Extracted Features Dataset opens the complete HathiTrust collection for investigations into historical and cultural trends, the rise and fall of topics within the corpus, and the evolution of words and writing structures in publications dating from the 16th to the late 20th century. It provides quantitative information about word and line counts, parts of speech, and other details within each page of every volume in the HTDL. In addition to these larger-scale investigations, the EF Dataset also allows researchers to closely analyze the contents of a given volume or subset of volumes.
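As an illustration of the per-page counts described above, here is a minimal Python sketch of how page-level token and part-of-speech counts might be aggregated for one volume. The record layout and field names (`pages`, `seq`, `lineCount`, `tokenPosCount`) and the sample values are simplified assumptions made for this example, not the dataset's documented schema.

```python
from collections import Counter

# Hypothetical, simplified record for one volume. The field names and values
# are illustrative assumptions, not the dataset's documented schema.
volume = {
    "id": "example.volume",
    "pages": [
        {"seq": 1, "lineCount": 2,
         "tokenPosCount": {"whale": {"NN": 3}, "the": {"DT": 5}}},
        {"seq": 2, "lineCount": 1,
         "tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 2}}},
    ],
}


def page_token_totals(vol):
    """Return {page sequence number: total token count} for one volume."""
    totals = {}
    for page in vol["pages"]:
        totals[page["seq"]] = sum(
            sum(pos_counts.values())
            for pos_counts in page["tokenPosCount"].values()
        )
    return totals


def volume_term_counts(vol):
    """Sum per-page counts into volume-level term frequencies."""
    counts = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            counts[token.lower()] += sum(pos_counts.values())
    return counts


if __name__ == "__main__":
    print(page_token_totals(volume))
    print(volume_term_counts(volume).most_common(3))
```

Because counts are kept per page, the same records can serve both corpus-wide trend analysis (summing across many volumes) and close inspection of a single volume.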

The data is extracted from 13.7 million volumes found in the HTDL, representing over 5 billion pages consisting of over 2 trillion tokens (words). A preliminary release of the EF Dataset, drawn from a much smaller subset comprising only HathiTrust’s public domain collection, has already enabled novel research from scholars in economics, history, linguistics, literary studies and sociology, among other fields.

“The Extracted Features Dataset creates opportunities for scholarship and teaching that were previously impossible,” said J. Stephen Downie, co-director of HathiTrust Research Center and Associate Dean for Research and Professor at the School of Information Sciences, University of Illinois at Urbana-Champaign. “We look forward to seeing how the scholarly community takes advantage of the EF dataset in their research, labs, and classrooms.”

“We launched the HathiTrust Research Center to help researchers fully mine the entire collection of texts found in HathiTrust,” said Michael Furlough, HathiTrust’s executive director. “This release provides a novel and effective way to do so by generating relevant data from the entire corpus.”

Founded in 2008 and hosted at the University of Michigan, HathiTrust preserves and provides access to millions of digitized books and journals from the collections of more than 120 institutional academic and research partners via its certified trusted digital repository. This searchable archive of published literature from around the world includes both in-copyright and public domain materials from mass digitization programs and partners’ local digitization initiatives.

The HathiTrust Research Center is an advanced research service of HathiTrust and a collaborative research center launched jointly by Indiana University and the University of Illinois.  The Research Center team strives to meet the technical challenges that researchers face when dealing with massive amounts of digital text, by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

The HTRC EF Dataset is released under a Creative Commons CC-BY license. More information about the Extracted Features Dataset, including download information, can be found at the DOI in the formal dataset citation below:

Boris Capitanu; Ted Underwood; Peter Organisciak; Timothy Cole; M. Janina Sarol; J. Stephen Downie (2016): The HathiTrust Research Center Extracted Features Dataset. 1.0 [Dataset]. HathiTrust Research Center. Dataset.

Questions? Please contact


Call for Papers – ISNN 2017 (Sapporo, Hokkaido, Japan, June 21-23, 2017)

The 14th International Symposium on Neural Networks (ISNN 2017), Sapporo, Hokkaido, Japan, June 21-23, 2017

The 14th International Symposium on Neural Networks (ISNN 2017) will be held in Sapporo, Hokkaido, Japan on June 21-23, 2017, following the successes of previous events. Located on the northern island of Hokkaido, Sapporo is the fourth largest Japanese city and a popular summer/winter tourist venue. ISNN 2017 aims to provide a high-level international forum for scientists, engineers, and educators to present the state of the art of neural network research and applications in related fields. The symposium will feature plenary speeches given by world-renowned scholars, regular sessions with broad coverage, and special sessions focusing on popular topics.

Authors are invited to submit full-length papers by the submission deadline through the online submission system. Submitting a paper implies that it is original, is not under review or copyright-protected elsewhere, and will be presented by an author if accepted. All submitted papers will be refereed by experts in the field based on the criteria of originality, significance, quality, and clarity. Papers presented at ISNN 2017 will be published in the EI-indexed proceedings in the Springer LNCS series, and selected papers will be included in special issues of several SCI journals.

Important Dates

– Paper submission: January 1, 2017

– Notification of acceptance: February 1, 2017

– Camera-ready copy and author registration: March 1, 2017

– Conference: June 21-23, 2017


PAN 2017: Author Identification, Profiling, and Obfuscation

We invite you to take part in one of the following shared tasks:
1. Author Identification — Given a document, who wrote it?
This task focuses on author clustering and style breach detection.
Author clustering will be done on short documents of paragraph length.
Style breach detection has the goal of identifying breaches of writing
style in longer texts.
2. Author Profiling — Given a document, what’re its author’s traits?
This task focuses on gender and language variety identification on Twitter,
providing a corpus of tweets annotated with gender and native language of
their respective authors.
3. Author Obfuscation — Given a document, hide its author.
This task works against identification and profiling by automatically
paraphrasing a text to obfuscate its author’s style. The tasks offered are
author masking and obfuscation evaluation.
Learn more at
PAN is held in conjunction with the CLEF’17 conference in Dublin, Ireland.


Summer Internship positions at Information Retrieval and Machine Learning group at Oracle Labs

The Information Retrieval and Machine Learning (IRML) group at Oracle Labs is looking for several highly motivated interns for Summer 2017.
The candidate should be proficient in one or more areas of Machine Learning, including Classification, Statistical NLP, and Computer Vision. Our current focus is on information extraction, structured prediction, topic models, causal inference, and deep learning. Research experience in any of these areas is strongly preferred. The candidate should also have good implementation skills in Java, Scala, Python or another common language. We are open to both Master's and PhD candidates, but PhD candidates are preferred.
About IRML:
The group is tasked with developing core Information Retrieval, statistical Natural Language Processing and Machine Learning technologies in order to help solve complex and challenging business problems. We collaborate with a number of Oracle product groups, working on projects like classification, search relevance, feature selection, Bayesian inference, sentiment analysis, named entity recognition, entity linking, and product attribute extraction. We also publish our research at top conferences.
We’re also looking for people to join our team for full-time positions.
You can learn more about our research here:
About Oracle Labs:
The mission of Oracle Labs is straightforward: identify, explore, and transfer new technologies that have the potential to substantially improve Oracle’s business. Oracle Labs grew out of Sun Microsystems Labs after Oracle’s acquisition of Sun Microsystems. While many product development organizations within Oracle develop leading-edge technologies, Oracle Labs is devoted exclusively to research. The Labs has a wide range of projects, from research into fast VMs for dynamic languages like R and JavaScript, to developing state-of-the-art inference algorithms for clusters of GPUs.
The Information Retrieval and Machine Learning Group is located at Oracle Labs in Burlington, MA. The candidate would need to relocate to this area. Please send your CV to for applications or questions about the position. Oracle is an equal opportunity employer.

Master's programme in Helsinki (Linguistic Diversity in the Digital Age)

Linguistic Diversity in the Digital Age is an integrated international programme that offers you a comprehensive view of all subfields of the science of language. As a student in the programme you will be able to choose among four specialist options: (1) General Linguistics, (2) Phonetics, (3) Language Technology, and (4) Diversity Linguistics.

The language of instruction is English. Applications opened on 1 December and will be accepted until 12 January 2017. Detailed information about the programme and the admission process can be found here:

Information on scholarships for students from outside the EU:



Dear colleagues,

We are pleased to invite you to the international conference SLOVKO 2017, which will be held on 25-27 October 2017 at Hotel Apollo in Bratislava, Slovakia. The conference is aimed at presenting natural language processing, corpus linguistics, terminology, e-terminology and e-terminography, e-lexicography, grammar corpus-based or corpus-driven research, analysis and description of language resources relevant to less researched languages, application of existing approaches to less researched languages, machine translation, tools for linguistic text analysis, parallel corpora, creation and use of language resources, linguistic databases and networks, spoken and dialect corpora, speech analysis and synthesis.

Further information can be found in the first call or on the conference website.
Please send us your registration by e-mail by 30 January 2017.

We are looking forward to seeing you.

Katarina Gajdosova
Jana Levicka
Adriana Zakova
(Organizing Committee)


Dialog Evaluations: morphology

We invite you to take part in MorphoRuEval, a competition for evaluating methods of morphological analysis of Russian. The goal of the competition is to stimulate the development of morphological processing technologies for Russian texts, especially texts from the Internet. Correct processing of such texts is one of the highest-priority tasks in modern applied linguistics: the volume of informal text keeps growing because of social networks, and the entire downstream text-processing pipeline depends heavily on the results of morphological analysis.

The following tracks are offered:

  1. a closed-data track, in which participants may train their models only on the provided data;
  2. an open-data track, in which participants may additionally use a large unannotated corpus.


More detailed information is available in the attached file or on our website in the Dialogue Evaluation section, and can also be requested at geekrya[цобачка]


Second International Workshop “Language, Music and Computing”

In April 2017, the Second International Workshop “Language, Music and Computing” will be hosted by the Saint Petersburg State University of Aerospace Instrumentation and the N. A. Rimsky-Korsakov Saint Petersburg State Conservatory.

The workshop aims to foster interdisciplinary communication and collaboration among linguists, musicians and IT specialists on current problems in the humanities, including the following:

1 Acquisition of language and musical knowledge; the influence of musical skills on language development and speech processing; the influence of linguistic skills on learning and understanding music; the interaction between musical and linguistic practice.

2 Linguistic and musical knowledge, their structure and functioning; implicit and explicit musical and linguistic knowledge; similarities and differences in the understanding of music and language.

3 Automatic classification of linguistic and musical knowledge; formal models of linguistic and musical knowledge; specifics of organizing automated musical and linguistic search.

This year's special topics:
– formal representation of language and music
– sound corpora in language and music

Invited speakers:

  • Sabine Iatridou, USA
  • Sergi Jordà, Spain
  • Merryl Goldberg, USA
  • Elena Riekhakaynen, Russia


Following the workshop, a volume of papers is planned for publication in the Communications in Computer and Information Science series from Springer (indexed in Scopus), and selected Russian-language papers will be published in the journal «Философия и гуманитарные науки в информационном обществе» (“Philosophy and Humanities in the Information Society”, indexed in the Russian Science Citation Index). All published papers will be selected through blind peer review.

Submission deadline: 10 December 2016.
Detailed information can be found on the workshop website:


PhD position in Computer Science on Information Retrieval and Text Mining

The Faculty of Informatics of the Università della Svizzera italiana (a.k.a. University of Lugano), Switzerland, is looking for a candidate for a fully funded PhD position, available for 4 years (starting 1 July 2017 or as soon as possible thereafter), in connection with the SNSF-funded project titled “The Global Structure of Knowledge Networks (GSKN)”.

*The Project*

The GSKN project will develop and test innovative information retrieval techniques and new network-analytic models for the analysis of very large datasets characterized by complex micro-relational structures. The main objective of the project is to study and use the next generation of statistical models for the analysis of “big” network data. The project is organised into three subprojects dealing with: a) the use of database management techniques to create and manage data related to patents; b) the study of innovative data retrieval algorithms and topic modelling techniques to identify and delineate different subnetworks in the global structure of patent networks; and c) the design of new statistical models of patents as social networks that scale up to the analysis of networks of arbitrary size.

*The position*
The PhD position will be in the area of Information Retrieval and Text Mining. The PhD student will work in collaboration with two postdocs in the context of the second subproject, titled “Developing new computational approaches to the analysis of knowledge structures”. This subproject aims at applying innovative data retrieval algorithms and topic modelling techniques to identify and delineate the different subnetworks linking Switzerland to the global structure of patent networks and to represent the evolutionary development of knowledge structures in which Switzerland is embedded. Your main responsibility will be to carry out research in the above area, but you will also be given the opportunity to acquire some teaching experience and qualifications. The work will be supervised by Prof. Fabio Crestani.

*Our offer*
We offer a full-time PhD position with a competitive salary and social security insurance. The annual gross salary of a PhD student, in accordance with Swiss regulations, ranges from CHF 47,040 in the first year to CHF 50,040 in the last year, with taxation around 20%. The university and, in particular, the Faculty of Informatics, whose official language is English, provide excellent facilities for both professional and personal development, and the city of Lugano offers an enjoyable environment all year round.

*Your profile and application process*
Applications are invited as soon as possible, as the selection will start in early March 2017. Applicants are expected to already have, or be about to obtain, a Master's degree (or comparable level) in Computer Science, Information Science or closely related disciplines (Maths, Statistics, etc.).

Please send the following information to Prof. Fabio Crestani, using the subject “GSKN PhD Position”:

1. Motivation letter indicating your background and interests.
2. Full curriculum vitae, including a list of all courses and marks.
3. Contact information for at least two academic references.
4. Evidence of the command of English (fluent in writing and speaking).


Coreference resolution for those who aren't afraid of a challenge

Call for Participation in the Shared Task on Projection-Based Coreference Resolution



Previous shared tasks on coreference resolution (e.g., the SemEval 2010 shared task Coreference Resolution in Multiple Languages, the CoNLL 2011 and CoNLL 2012 shared tasks) operated in a setting where a large amount of training data was provided to train coreference resolvers in a fully supervised manner. Our shared task has a different goal: we are primarily interested in a low-resource setting. In particular, we seek to investigate how well one can build a coreference resolver for a language for which there is no coreference-annotated data available for training.
With a rising interest in annotation projection, we hereby offer a projection-based task which will facilitate the application of existing coreference resolution algorithms to new languages. We believe that with this exciting setting, the shared task can help promote the development of coreference technologies that are applicable to a larger number of natural languages than is currently possible.
This year we will focus on two languages: German and Russian. To mimic a low-resource setting, no German or Russian coreference-annotated data will be provided. Rather, to facilitate system development, the shared task participants will be provided with two versions of an English-German-Russian parallel corpus: an unlabelled version and a labelled version. The labelled version has the English side of the parallel corpus automatically coreference-labelled using the Berkeley coreference resolver, which was trained on the English OntoNotes corpus.


Participants will compete in two tracks:
*  Closed track: projection-based coreference resolution on German and Russian. The only coreference-annotated training data that the participants can use is the English OntoNotes corpus. Alternatively, they can use any of the publicly-available coreference resolvers trained on English OntoNotes. They can then use whatever parallel corpus and method they prefer to project the English annotations into German/Russian and subsequently train a new coreference resolver on the projected annotations. As for additional linguistic information, the participants can use POS information provided by the parser of their choice. Note that they do not have to use the provided English-German-Russian parallel corpus.
*  Open track: coreference resolution on German and Russian with no restriction on the kind of coreference-annotated data the participants can use for training. For instance, they can label their own German/Russian coreference data and use it to train a German/Russian coreference resolver, or they can adopt a heuristic-based approach where they employ knowledge of German/Russian to write coreference rules for these languages.
The participants can choose to take part in either one or both tracks.
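To make the idea of annotation projection concrete, here is a minimal Python sketch of one simple heuristic: each English mention span is mapped to the smallest target-side span covering all tokens aligned to it, and chains that keep fewer than two mentions are dropped. The function names, the toy sentence pair and the word alignment are assumptions made for this illustration. In practice the alignment would come from an external word aligner, and this is not presented as the organizers' method.

```python
def project_span(span, alignment):
    """Project a source span (start, end), inclusive token indices, onto the target.

    `alignment` is a list of (source_index, target_index) pairs.
    Returns the smallest target span covering all aligned tokens,
    or None if no token in the span is aligned.
    """
    start, end = span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return (min(targets), max(targets))


def project_chains(chains, alignment):
    """Project coreference chains; drop chains left with fewer than two mentions."""
    projected = []
    for chain in chains:
        spans = [project_span(mention, alignment) for mention in chain]
        # Remove unaligned mentions and duplicates while preserving order.
        spans = list(dict.fromkeys(s for s in spans if s is not None))
        if len(spans) >= 2:
            projected.append(spans)
    return projected


if __name__ == "__main__":
    # Toy example (assumed data): "Mary said she left" -> "Maria sagte , sie ging"
    alignment = [(0, 0), (1, 1), (2, 3), (3, 4)]
    english_chains = [[(0, 0), (2, 2)]]  # {Mary, she}
    print(project_chains(english_chains, alignment))
```

The projected chains could then serve as automatically labelled training data for a German or Russian resolver, with the usual caveat that alignment errors propagate into the projected annotations.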