Carnegie Mellon University

CRI: CI-SUSTAIN: Collaborative Research: Sustaining Lemur Project Resources for the Long-Term

By Jamie Callan

For more than a decade, the software, datasets, and online services developed and provided by the Lemur Project have supported and enabled a large body of academic and commercial research on search engines, information retrieval, and other areas of computer science that analyze and process human language. This project makes critical enhancements to Lemur Project infrastructure, operates the infrastructure for another three years, and positions it for long-term sustainability. As part of this work, the Galago search engine is enhanced to provide stronger integration of neural networks and other machine learning methods. A new dataset, ClueWeb2020, is developed to replace the widely used ClueWeb09 and ClueWeb12 datasets. These investments will support advanced research for the next decade. The advanced search capabilities developed for the project's open-source Indri and Galago search engines, which are widely used for research, are added to the open-source Lucene search engine, which is widely used by industry. New software applications are developed to simplify migration between Lemur Project search engines and Lucene. These investments improve the state of the art in software important to industry and enable researchers to migrate research to more widely used software. The Lemur Project's research infrastructure attracted a substantial research user community because it makes leading-edge research easy to carry out. These enhancements enable researchers in information retrieval and related areas to conduct a much broader range of experiments and to share their results. Research and industry development supported by the new Lemur Project software will create a new generation of more capable search engines for a variety of tasks.

The project is organized around three types of activities: sustaining software, sustaining datasets, and operation. The project achieves long-term software sustainability by adding support for Indri and Galago functionality to, and creating integration and migration paths with, the open-source Lucene search engine, which has large user and volunteer-developer communities. Research done with Galago or Indri will thus be reproducible in Lucene and more accessible to Lucene's industry users. The project also extends the Galago Application Programming Interface to support the newest developments in neural network (deep learning) document ranking technologies, which are now widely studied and are expected in a state-of-the-art research system. It broadens the utility of Ranklib by supporting neural algorithms for better comparison with high-quality learning-to-rank approaches, and broadens the utility of the Sifaka text mining application with support for additional document and machine learning formats. The older ClueWeb09 and ClueWeb12 datasets are superseded by a new ClueWeb2020 dataset that is designed to last a decade and to support research on newer learning-to-rank and neural network (deep learning) ranking algorithms. The project maintains and operates the existing infrastructure through software maintenance and support, dataset licensing and distribution, and operation of online search services. The new Lemur Project infrastructure supports a broad range of information retrieval research, for example, research on retrieval models, how to train learned rankers, use of semi-structured knowledge bases, result diversification, query optimization, and distributed search. In particular, it greatly improves support for research on learned and neural (deep learning) ranking algorithms, which have become important research topics in recent years. The ClueWeb datasets are used by a broad human language technologies research community. This project makes enhancements that sustain this infrastructure for the research community for at least the next decade.