Scientific Formula Retrieval via Tree Embeddings

Zichao Wang, Mengxue Zhang, Richard G. Baraniuk, Andrew S. Lan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Exploiting the ever-growing corpus of scientific content calls for new ways and means to effectively organize, search, and retrieve scientific formulae. We propose a new data-driven framework for retrieving similar scientific formulae via learned formula representations based on tree embeddings. FORTE (for FOrmula Representation learning via Tree Embeddings) leverages operator tree representations of symbolic scientific formulae (such as math equations) to explicitly capture their inherent structural and semantic properties. FORTE employs i) a tree encoder that encodes the formula's operator tree into an embedding vector and ii) a tree decoder that directly generates a formula's operator tree from the embedding vector. We also develop a novel tree beam search algorithm that improves the quality of the decoded operator trees. We demonstrate that FORTE (sometimes significantly) outperforms various baseline methods on formula reconstruction and retrieval using a real-world dataset comprising 770k scientific formulae collected on-line.
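As a rough illustration of the operator tree representation the abstract refers to, the sketch below parses a small arithmetic formula into a tree whose internal nodes are operators and whose leaves are variables or constants. It is not the FORTE pipeline: using Python's built-in `ast` parser as a stand-in for a formula parser is an assumption, and the names `Node`, `parse_formula`, and `preorder` are hypothetical helpers introduced only for this example.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator-tree node: an operator, a variable, or a constant."""
    label: str
    children: list = field(default_factory=list)


# Assumption: only these binary operators are supported in this sketch.
_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/", ast.Pow: "^"}


def _to_operator_tree(expr):
    """Recursively convert a Python expression AST into an operator tree."""
    if isinstance(expr, ast.BinOp):
        symbol = _OPS.get(type(expr.op))
        if symbol is None:
            raise ValueError(f"unsupported operator: {type(expr.op).__name__}")
        return Node(symbol, [_to_operator_tree(expr.left),
                             _to_operator_tree(expr.right)])
    if isinstance(expr, ast.Name):
        return Node(expr.id)
    if isinstance(expr, ast.Constant):
        return Node(str(expr.value))
    raise ValueError(f"unsupported expression: {type(expr).__name__}")


def parse_formula(text):
    """Parse a formula written in Python syntax (e.g. 'm * c ** 2')."""
    return _to_operator_tree(ast.parse(text, mode="eval").body)


def preorder(node):
    """Flatten the tree root-first; one simple way to serialize a tree."""
    labels = [node.label]
    for child in node.children:
        labels.extend(preorder(child))
    return labels


if __name__ == "__main__":
    tree = parse_formula("m * c ** 2")
    print(preorder(tree))  # ['*', 'm', '^', 'c', '2']
```

In this picture, a tree encoder would map such a tree to a fixed-length vector, and a tree decoder would regenerate the nodes from that vector; both are learned models and are outside the scope of this sketch.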

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
Editors: Yixin Chen, Heiko Ludwig, Yicheng Tu, Usama Fayyad, Xingquan Zhu, Xiaohua Tony Hu, Suren Byna, Xiong Liu, Jianping Zhang, Shirui Pan, Vagelis Papalexakis, Jianwu Wang, Alfredo Cuzzocrea, Carlos Ordonez
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 11
ISBN (Electronic): 9781665439022
State: Published - 2021
Event: 2021 IEEE International Conference on Big Data, Big Data 2021 - Virtual, Online, United States
Duration: Dec 15 2021 to Dec 18 2021

Publication series

Name: Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021


Conference: 2021 IEEE International Conference on Big Data, Big Data 2021
Country/Territory: United States
City: Virtual, Online


Keywords

  • generative models
  • information retrieval
  • representation learning
  • scientific formulae understanding
  • tree-structured data

ASJC Scopus subject areas

  • Information Systems and Management
  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Information Systems

