How ArXiv Machine Learning Publications Have Changed This Decade

Academic
Advanced
Author

Kian Ghodoussi

Published

March 9, 2025

Prerequisites

pip install sturdy-stats-sdk pandas numpy plotly matplotlib

from IPython.display import display, Markdown, Latex
import pandas as pd
import numpy as np
import plotly.express as px
from sturdystats import Index, Job
import matplotlib.pyplot as plt


from pprint import pprint
## Basic Utilities
px.defaults.template = "simple_white"  # Change the template
px.defaults.color_discrete_sequence = px.colors.qualitative.Dark24 # Change color sequence

def procFig(fig, **kwargs):
    fig.update_layout(plot_bgcolor= "rgba(0, 0, 0, 0)", paper_bgcolor= "rgba(0, 0, 0, 0)",
        margin=dict(l=0,r=0,b=0,t=30,pad=0),
        title_x=.5,
        **kwargs
    )
    fig.layout.xaxis.fixedrange = True
    fig.layout.yaxis.fixedrange = True
    return fig

def displayText(df, highlight):
    """Render each row as Markdown: its top five topics, then its text with `highlight` terms bolded."""
    def processText(row):
        # Numbered list of the paragraph's five most prevalent topics
        t = "\n".join([ f'1. {r["short_title"]}: {int(r["prevalence"]*100)}%' for r in row["paragraph_topics"][:5] ])
        x = row["text"]
        res = []
        for word in x.split(" "):
            for term in highlight:
                # Bold the word (at most once) if it contains a highlight term
                if term in word.lower() and "**" not in word:
                    word = "**"+word+"**"
            res.append(word)
        return f"<em>\n\n#### Result {row.name+1}/{df.index.max()+1}\n\n#### {row['title']}  {row['published']}\n\n"+ t +"\n\n" + " ".join(res) + "</em>"

    res = df.apply(processText, axis=1).tolist()
    display(Markdown("\n\n...\n\n".join(res)))
index = Index(id="index_b6a5a6ffb51e4ed695e80b92a8252a09")
Found an existing index with id="index_b6a5a6ffb51e4ed695e80b92a8252a09".

Exploring Topics

Our Bayesian probabilistic model learns a set of high-level topics from your corpus. These topics are completely custom to your data, whether your dataset has hundreds of documents or billions. The model then maps this set of learned topics to every single word, sentence, paragraph, document, and group of documents in your dataset, providing a powerful semantic index.

Tabular format

This indexing enables us to store data in a granular, structured tabular format, which makes it fast to answer complex questions. Our topicSearch API returns a ranked list of the most prominent topics in the corpus. Each row includes a topic title (short_title), a group title (topic_group_short_title), a topic_id, the number of paragraphs in which the topic is mentioned (mentions), and the share of the data tied to the topic (prevalence, expressed as a fraction).

df = index.topicSearch()
df.head()[["short_title", "topic_group_short_title", "topic_id", "prevalence", "mentions"]]
   short_title                                  topic_group_short_title      topic_id  prevalence  mentions
0  Performance Enhancement Methods              Optimization Techniques      186       0.046857    81545.0
1  Innovations in Machine Learning              Machine Learning Techniques  272       0.034884    50900.0
2  Computational Efficiency Techniques          Optimization and Efficiency  81        0.033749    61648.0
3  Theoretical Foundations in Machine Learning  Theoretical Foundations      325       0.032269    63637.0
4  Performance Analysis                         Evaluation and Assessment    444       0.016385    31990.0
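
To make the two numeric columns concrete, here is a minimal sketch of how mentions and prevalence could be derived from paragraph-level topic weights. The toy weights, the 0.1 mention threshold, and the helper name summarize_topics are all assumptions made for illustration; the model's actual definitions may differ.

```python
from collections import defaultdict

# Hypothetical paragraph-level topic weights: one dict per paragraph,
# mapping topic_id -> share of that paragraph attributed to the topic.
paragraphs = [
    {186: 0.6, 81: 0.4},
    {186: 0.5, 272: 0.5},
    {81: 1.0},
    {272: 0.95, 186: 0.05},
]

def summarize_topics(paragraphs, mention_threshold=0.1):
    """Return {topic_id: {"mentions": int, "prevalence": float}}.

    mentions   = number of paragraphs where the topic weight reaches the threshold
    prevalence = the topic's share of the total weight across all paragraphs
    (Both definitions are assumptions made for this sketch.)
    """
    mentions = defaultdict(int)
    weight = defaultdict(float)
    for para in paragraphs:
        for topic_id, w in para.items():
            weight[topic_id] += w
            if w >= mention_threshold:
                mentions[topic_id] += 1
    total = sum(weight.values())
    return {
        t: {"mentions": mentions[t], "prevalence": weight[t] / total}
        for t in weight
    }

summary = summarize_topics(paragraphs)
# Print topics from most to least prevalent, like the ranked table above
for topic_id, stats in sorted(summary.items(), key=lambda kv: -kv[1]["prevalence"]):
    print(topic_id, stats["mentions"], round(stats["prevalence"], 3))
```

In this toy data, topic 186 appears in three paragraphs but counts as only two mentions, because its weight in the last paragraph falls below the threshold.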

Visualizing all of cs.LG’s Topics

We can see there are two levels of topics: short_title and topic_group_short_title. The topic group is a high-level thematic category, while a topic is a much more granular annotation. A dataset can have hundreds of topics, but usually only 20-50 topic groups. This hierarchy is extremely useful for organizing and exploring data in hierarchical formats such as sunbursts.

The inner circle of the sunburst is the title of the plot. The middle layer is the topic groups, and the leaf nodes are the topics that belong to the corresponding topic group. The size of each node is proportional to how often it shows up in the dataset.

df["title"] = "ArXiv cs.LG Publications"
fig = px.sunburst(df, path=["title", "topic_group_short_title", "short_title"], values="prevalence", hover_data=["topic_id"],)
procFig(fig, height=550).show()
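
The size of a topic group's wedge reflects the combined prevalence of the topics beneath it. The sketch below reproduces that roll-up by hand for the five rows shown earlier. It only illustrates the arithmetic; roll_up is a hypothetical helper, not part of the SDK or of Plotly.

```python
from collections import defaultdict

# The five rows from the topicSearch output above:
# (short_title, topic_group_short_title, prevalence)
rows = [
    ("Performance Enhancement Methods", "Optimization Techniques", 0.046857),
    ("Innovations in Machine Learning", "Machine Learning Techniques", 0.034884),
    ("Computational Efficiency Techniques", "Optimization and Efficiency", 0.033749),
    ("Theoretical Foundations in Machine Learning", "Theoretical Foundations", 0.032269),
    ("Performance Analysis", "Evaluation and Assessment", 0.016385),
]

def roll_up(rows):
    """Sum leaf (topic) prevalence into each parent topic group."""
    groups = defaultdict(float)
    for _title, group, prevalence in rows:
        groups[group] += prevalence
    return dict(groups)

group_totals = roll_up(rows)
root_total = sum(group_totals.values())  # size of the innermost circle

for group, total in sorted(group_totals.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {total:.6f}")
print(f"root: {root_total:.6f}")
```

With the full result set, a group containing several topics would accumulate all of their prevalence values, which is what makes the middle ring a useful summary.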

Focusing on 2022-Present

Because we structure the semantic topics into a tabular format, we are able to store the data in a relational database. We expose this structured SQL functionality directly in our topicSearch API. In this case, we can focus on ArXiv publications from 2022 onward to get a more modern view of ArXiv’s topics.

df = index.topicSearch(filters="published > '2022-01-01'")
df["title"] = "ArXiv cs.LG Publications <br> 2022-Present"
fig = px.sunburst(df, path=["title", "topic_group_short_title", "short_title"], values="prevalence", hover_data=["topic_id"],)
procFig(fig, height=550).show()
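
The filters argument reads like a SQL WHERE clause. As a rough analogy only (the storage engine behind the API is not described here, and the papers table below is an assumed stand-in), the same predicate can be tried against an in-memory SQLite table. ISO-8601 date strings sort correctly as plain text, which is why comparing against a quoted '2022-01-01' behaves as a date filter.

```python
import sqlite3

# Hypothetical stand-in table; the real index schema is not shown in this article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("A Graphical Approach to Document Layout Analysis", "2023-08-07"),
        ("MemeGraphs: Linking Memes to Knowledge Graphs", "2023-06-27"),
        ("An older placeholder paper", "2019-05-14"),  # made-up row for contrast
    ],
)

filters = "published > '2022-01-01'"  # same expression passed to topicSearch above
recent = conn.execute(
    f"SELECT title, published FROM papers WHERE {filters} ORDER BY published"
).fetchall()
conn.close()

for title, published in recent:
    print(published, title)
```

Because the comparison is strict, a paper dated exactly 2022-01-01 would be excluded; use >= if that day should be included.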

Comparing the plots

We can see some meaningful changes between the first and second plots. Most notably, Machine Learning Techniques overtook Optimization Techniques as the most prominent topic group. However, comparing two sunbursts side by side is not the most efficient way to visualize trends or changes over time.
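
One direct way to quantify such a change is to look up each topic group's prevalence in both result sets and rank the groups by the difference. The sketch below does this with made-up numbers; the real values would come from the two topicSearch calls above, and prevalence_shift is a hypothetical helper name.

```python
def prevalence_shift(before, after):
    """Return (group, before, after, delta) tuples, largest gain first.

    `before` and `after` map topic group name -> prevalence.
    A group missing from one side is treated as prevalence 0.0.
    """
    groups = set(before) | set(after)
    shifts = [
        (g, before.get(g, 0.0), after.get(g, 0.0),
         after.get(g, 0.0) - before.get(g, 0.0))
        for g in groups
    ]
    # Sort by descending delta, breaking ties alphabetically for stable output
    return sorted(shifts, key=lambda s: (-s[3], s[0]))

# Illustrative values only, NOT taken from the index.
all_years = {"Optimization Techniques": 0.10, "Machine Learning Techniques": 0.08}
since_2022 = {"Optimization Techniques": 0.09, "Machine Learning Techniques": 0.11}

shifts = prevalence_shift(all_years, since_2022)
for group, old, new, delta in shifts:
    print(f"{group}: {old:.2f} -> {new:.2f} ({delta:+.2f})")
```

A ranked table of deltas makes the gainers and losers explicit, where two separate plots leave the reader to compare wedge sizes by eye.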

Surfacing All the Examples

The topic search is an aggregation on top of our native search and tabular infrastructure. Because topicSearch follows the same API parameters as the underlying search, we can easily switch from an aggregate overview to specific examples.

Honing in on a Topic

Let’s say we want to dig deeper into the Multimodal Representation Learning topic. We can easily surface all the matching examples that went into the topic search results and line graph.

row = df1.loc[df1.short_title == "Multimodal Representation Learning"]
row
    short_title                         topic_id  mentions  prevalence  one_sentence_summary                               executive_paragraph_summary                        topic_group_id  topic_group_short_title             conc       entropy
19  Multimodal Representation Learning  339       47.0      0.009339    This theme explores the alignment and interact...  The provided documents illustrate the advancem...  34              Multimodal and Interactive Systems  14.713141  5.465266
docdf = index.query(SEARCH_QUERY, topic_id=row.topic_id, filters=f"published>'{year1}-01-01'", semantic_search_cutoff=.6, limit=200)

## NB: the number of docs returned lines up with the number of mentions
assert len(docdf) == row.mentions.iloc[0]
displayText(docdf.iloc[[0,-1]], [*SEARCH_QUERY.split(), "multi", "modal", "metadata", "graph", "layout", "imgbert", "images"])

Result 1/47

A Graphical Approach to Document Layout Analysis 2023-08-07

  1. Linguistic Embeddings in NLP: 23%
  2. Multimodal Representation Learning: 18%
  3. Computational Efficiency Techniques: 17%
  4. Performance Improvement Techniques: 10%
  5. Automated Code Analysis: 9%

Document layout analysis (DLA) is the task of detecting the distinct, semantic content within a document and correctly classifying these items into an appropriate category (e.g., text, title, figure). DLA pipelines enable users to convert documents into structured machine-readable formats that can then be used for many useful downstream tasks. Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs. Directly leveraging this metadata, we represent each PDF page as a structured graph and frame the DLA problem as a graph segmentation and classification problem. We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network competitive with SOTA models on two challenging DLA datasets - while being an order of magnitude smaller than existing models. In particular, the 4-million parameter GLAM model outperforms the leading 140M+ parameter computer vision-based model on 5 of the 11 classes on the DocLayNet dataset. A simple ensemble of these two models achieves a new state-of-the-art on DocLayNet, increasing mAP from 76.8 to 80.8. Overall, GLAM is over 5 times more efficient than SOTA models, making GLAM a favorable engineering choice for DLA tasks.

Result 47/47

MemeGraphs: Linking Memes to Knowledge Graphs 2023-06-27

  1. Detection of Online Misbehavior: 29%
  2. Knowledge Graph Embedding: 22%
  3. Multimodal Representation Learning: 20%
  4. Advancements in Methodology: 12%
  5. Theoretical Foundations in Machine Learning: 4%

Memes are a popular form of communicating trends and ideas in social media and on the internet in general, combining the modalities of images and text. They can express humor and sarcasm but can also have offensive content. Analyzing and classifying memes automatically is challenging since their interpretation relies on the understanding of visual elements, language, and background knowledge. Thus, it is important to meaningfully represent these sources and the interaction between them in order to classify a meme as a whole. In this work, we propose to use scene graphs, that express images in terms of objects and their visual relations, and knowledge graphs as structured representations for meme classification with a Transformer-based architecture. We compare our approach with ImgBERT, a multimodal model that uses only learned (instead of structured) representations of the meme, and observe consistent improvements. We further provide a dataset with human graph annotations that we compare to automatically generated graphs and entity linking. Analysis shows that automatic methods link more entities than human annotators and that automatically generated graphs are better suited for hatefulness classification in memes.
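
The bolding applied to matching words in these results comes from the word-level matching inside displayText. Isolated as a standalone function, the logic looks roughly like the sketch below. The name highlight_terms is introduced here for illustration, and unlike the original it also lowercases the search terms before matching.

```python
def highlight_terms(text, terms):
    """Wrap each space-separated word in Markdown bold if it contains any term.

    Matching is case-insensitive substring matching, so "multi" also
    bolds "multimodal". Each word is bolded at most once.
    """
    lowered = [t.lower() for t in terms]
    out = []
    for word in text.split(" "):
        if any(t in word.lower() for t in lowered):
            word = f"**{word}**"
        out.append(word)
    return " ".join(out)

sample = "ImgBERT, a multimodal model that uses only learned representations"
highlighted = highlight_terms(sample, ["multi", "imgbert"])
print(highlighted)
# **ImgBERT,** a **multimodal** model that uses only learned representations
```

Because words are split on spaces only, trailing punctuation such as the comma after ImgBERT ends up inside the bold markers, which matches the behavior of the original helper.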

Unlock Your Unstructured Data Today

from sturdystats import Index

index = Index("Custom Analysis")
index.upload(df.to_dict("records"))
index.commit()
index.train()

# Ready to Explore 
index.topicSearch()
