Carnegie Mellon University

Jamie Callan

Professor, Language Technologies Institute

  • 5419 Gates & Hillman Centers
  • 412-268-4525

Jamie Callan is a Professor at Carnegie Mellon University's Language Technologies Institute (LTI), where he leads research in information retrieval. With a Ph.D. from the University of Massachusetts Amherst, Dr. Callan has been at the forefront of search engine innovation for over three decades. He is internationally recognized for his contributions to advanced search engine architectures, federated retrieval, and neural approaches to document ranking, as evidenced by his extensive publications and an h-index of 73.

Dr. Callan’s current research focuses on leveraging neural techniques and large language models to enhance information retrieval systems, including improvements in term weighting, conversational retrieval, and retrieval-augmented generation. His recent research includes improved first-stage retrieval using a latent vocabulary for sparse systems and hypothetical documents for dense, vector-based systems. He also maintains and distributes the widely used ClueWeb datasets, which have supported groundbreaking research in search and retrieval for more than a decade.
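To make the hypothetical-document idea concrete, here is a minimal sketch of first-stage dense retrieval that searches with an LLM-written passage instead of the raw query, in the spirit of this line of work. It assumes the sentence-transformers library; the model name is just an example, and generate_hypothetical_answer is a hypothetical placeholder for an LLM call, not a real API.

```python
# Sketch: dense retrieval with a hypothetical document standing in for
# the query. generate_hypothetical_answer is a placeholder; in practice
# it would prompt an LLM to draft a plausible answer passage.

import numpy as np
from sentence_transformers import SentenceTransformer

def generate_hypothetical_answer(query: str) -> str:
    # Placeholder for an LLM call; even an imperfect generated passage
    # tends to embed closer to relevant documents than the short query.
    return f"A passage that answers the question: {query}"

def hyde_style_search(query, doc_texts, encoder, top_k=5):
    hypo = generate_hypothetical_answer(query)
    q_vec = encoder.encode([hypo], normalize_embeddings=True)[0]
    d_vecs = encoder.encode(doc_texts, normalize_embeddings=True)
    scores = d_vecs @ q_vec                      # cosine similarity
    ranked = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in ranked]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model only
docs = ["Dense retrievers embed queries and documents in one vector space.",
        "Sparse retrievers score documents by weighted term overlap."]
print(hyde_style_search("how does dense retrieval work?", docs, encoder))
```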

An accomplished educator and mentor, Dr. Callan teaches advanced courses on search engine design and advises students across several programs, fostering the next generation of leaders in information retrieval. He actively serves the academic community as a senior program committee member and area chair for conferences such as SIGIR, WSDM, and ECIR. He is a past Treasurer and past Chair of SIGIR, the international professional society for Information Retrieval research, a co-founding Editor-in-Chief of Foundations and Trends in Information Retrieval, and a past Editor-in-Chief of ACM's Transactions on Information Systems (TOIS). With a commitment to innovation and service, Dr. Callan continues to shape the future of search technologies and their applications.

My research and teaching focus on information retrieval and analysis. I have worked on a wide range of topics over the years, but am particularly interested in search engine architectures, information filtering and text mining. A sample of current projects is shown below. See my personal webpage for more information.

Areas of Focus:

  • Information Retrieval
  • Text Mining and Analytics

Lemur: The Lemur Project develops open-source search engines, toolbars, text analysis tools, search services and datasets that support international research and development. The project is best known for its Indri and Galago search engines, and large-scale ClueWeb datasets. Our software and datasets are widely used in scientific and research applications, and some commercial applications. Lemur's software development philosophy emphasizes state-of-the-art accuracy, flexibility and efficiency.

Search Engines With Knowledge Resources: This project develops new methods for using knowledge graphs and ontologies to improve search engine accuracy, especially for vague, ambiguous, or poorly specified queries. Knowledge graphs and ontologies are less structured than typical relational databases and semantic web resources, but more structured than the text stored in full-text search engines. The weak semantics of these semi-structured resources can support interesting applications, and because they tolerate contradictions, inconsistencies, and mistakes, they are easier to scale to large amounts of information. A search engine can use these resources to identify the probable meanings of query terms, and then use those meanings to find documents that match them, as in the sketch below.
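As a toy illustration of that last step, the sketch below links query terms to entities in a small, invented knowledge graph, picks the sense whose related terms best match the rest of the query, and expands the query with that sense's vocabulary. The graph, entities, and heuristics are all made up for the example; real systems use learned entity linkers and far richer ontologies.

```python
# Toy knowledge-graph query expansion. The graph and the linking and
# disambiguation heuristics are invented for illustration only.

TOY_KG = {
    "jaguar_animal": {"aliases": ["jaguar", "panthera onca"],
                      "related": ["big cat", "rainforest", "predator"]},
    "jaguar_car":    {"aliases": ["jaguar", "jaguar cars"],
                      "related": ["luxury car", "automobile", "dealer"]},
}

def link_entities(terms, kg):
    """Candidate entities whose aliases overlap the query terms."""
    return [eid for eid, e in kg.items()
            if any(t in e["aliases"] for t in terms)]

def pick_sense(terms, candidates, kg):
    """Choose the sense whose related terms best overlap the query."""
    overlap = lambda eid: len(set(kg[eid]["related"]) & set(terms))
    return max(candidates, key=overlap) if candidates else None

def expand_query(query, kg):
    terms = query.lower().split()
    sense = pick_sense(terms, link_entities(terms, kg), kg)
    extra = kg[sense]["related"] if sense else []
    return terms + [t for t in extra if t not in terms]

print(expand_query("jaguar rainforest", TOY_KG))
# -> ['jaguar', 'rainforest', 'big cat', 'predator']
```

The expanded terms would then be passed to the retrieval model alongside the original query, steering it toward documents about the intended sense.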

Selective and Federated Search: I have a long-term interest in environments that contain numerous search engines. Much of my prior research focused on integrating many independent search engines, perhaps operated by different organizations with different interests, into a single federated search system. My recent work investigates a related problem: decomposing a massive text collection into hundreds or thousands of small search engines designed to have skewed utility distributions, so that most index partitions can be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures, but it has far lower computational costs and reveals new challenges and opportunities in large-scale search.

The decomposition process itself creates new text collections, inviting research on which characteristics a collection should have, or avoid, to support accurate search. We have developed new resource selection algorithms that address efficiency problems in existing algorithms and that dynamically adjust search costs based on query difficulty. Our goal is an easily customizable and extensible off-the-shelf method that cuts search costs by an order of magnitude relative to the current state of the art, especially on corpora of more than a billion documents.
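The sketch below shows the general shape of sample-based resource selection in a selective search system, loosely in the spirit of vote-based methods such as ReDDE: rank a small central sample of documents for the query, then search only the shards whose sampled documents dominate the top of that ranking. The shard assignment, sampling rate, and term-overlap scoring are simplified stand-ins for the real clustering and selection algorithms.

```python
# Simplified selective search with sample-based resource selection.
# Shard assignment, sampling, and scoring are stand-ins for the real
# clustering and selection algorithms.

import random
from collections import Counter, defaultdict

def build_shards(docs, assign):
    """assign maps (doc_id, text) to a shard id, e.g. via topical clustering."""
    shards = defaultdict(list)
    for doc_id, text in docs.items():
        shards[assign(doc_id, text)].append(doc_id)
    return shards

def sample_index(docs, shards, rate=0.1, seed=0):
    """Central sample index: a small random sample from every shard."""
    rng = random.Random(seed)
    sample = {}
    for shard_id, doc_ids in shards.items():
        for doc_id in rng.sample(doc_ids, max(1, int(rate * len(doc_ids)))):
            sample[doc_id] = shard_id
    return sample

def select_shards(query, docs, sample, n_shards=2, depth=100):
    """Rank sampled docs by term overlap; the shards with the most
    top-ranked samples are the only ones searched for this query."""
    q = set(query.lower().split())
    score = lambda d: len(q & set(docs[d].lower().split()))
    hits = sorted((d for d in sample if score(d) > 0), key=lambda d: -score(d))
    votes = Counter(sample[d] for d in hits[:depth])
    return [s for s, _ in votes.most_common(n_shards)]

docs = {1: "neural ranking models", 2: "bm25 term weighting",
        3: "protein folding methods", 4: "gene expression data",
        5: "dense retrieval embeddings", 6: "genome sequencing"}
shards = build_shards(docs, lambda i, t: "ir" if i in (1, 2, 5) else "bio")
sample = sample_index(docs, shards, rate=1.0)  # sample everything in this tiny demo
print(select_shards("neural retrieval ranking", docs, sample, n_shards=1))
# -> ['ir']
```

Because only a handful of the hundreds or thousands of shards are searched per query, most index partitions are never touched for most queries, which is where the large cost savings come from.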