Downloading AACT database metadata with Pandas and Selenium

In this post I will present how I downloaded the HTML tables that contain the metadata of the "Access to Aggregate Content of ClinicalTrials.gov" (AACT) database and stored them as comma-separated values (CSV) files. The objective is to use this information to create an interactive dashboard that helps users understand the contents of the database.

The metadata consists of three HTML tables on the same webpage. They are the following:

  • The "AACT Data Elements" table that contains the names of the tables and the columns along with the column descriptions.
  • The "AACT Tables" table, with information about study-related tables. The information that I'm interested in is the table name and their description.
  • The "AACT Views & Functions" table, which contains information about the views and functions in used to make data retrieval easier.

Without further ado, let's get started.

Requirements

  • Installing the Chromium driver (the commands below assume a Debian-based system):

sudo apt-get update
sudo apt install chromium-chromedriver -y

  • Installing the following Python packages:
    • selenium: we will use it to load the JavaScript-dependent "AACT Data Elements" table and extract its text. This approach is relatively slow, but since it is a one-off operation, I don't consider spending time on optimization worth it.
    • pandas: we will use it to extract the "AACT Tables" and "AACT Views & Functions" tables. We don't need selenium for those because they are static, meaning that they are already in the HTML when the page is loaded and don't depend on JavaScript to be rendered. Additionally, we will use pandas to do some data manipulation and then store the tables as csv files.
    • lxml: an optional dependency of pandas, used by read_html to parse the HTML and extract the tables.

To install the packages, run the following command:

pip install selenium pandas lxml
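
Before moving on, we can quickly verify the setup (a minimal check, assuming chromedriver is discoverable on the PATH) by starting and stopping a headless browser:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
print(driver.capabilities["browserVersion"])  # prints the browser version if everything works
driver.quit()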

Once we have the requirements installed, we can start coding.

Download the metadata

For downloading the metadata, we will use two approaches, one for static tables (the simple ones) and another for the dynamic table (the one that depends on JavaScript to be rendered).

Downloading the static tables

After inspecting the HTML source code of the page, I found that both static tables share the class name dictionaryDisplay. With that information, we can use the pandas.read_html function to extract the tables from the HTML and then store them as CSV files. I created a file named download-static-tables-metadata.py with the following code:

download-static-tables-metadata.py
import pandas as pd
from datetime import datetime

# Metadata URL
METADATA_URL = "https://aact.ctti-clinicaltrials.org/data_dictionary"

# Date
today = datetime.today().strftime("%Y-%m-%d")

# Download the static tables

## Since we know that exactly two tables are selected
## we can use python destructuring syntax to assign the two pandas DataFrames
## to aact_tables and aact_v_and_f
aact_tables, aact_v_and_f = pd.read_html(
    METADATA_URL, attrs={"class": "dictionaryDisplay"}
)

## Save the DataFrames to CSV
aact_tables.to_csv(f"{today}_aact_tables_metadata.csv", index=False)

aact_v_and_f.to_csv(f"{today}_aact_views_functions_metadata.csv", index=False)

After running the code with the command python download-static-tables-metadata.py, we will have two CSV files with the metadata of the "AACT Tables" and "AACT Views & Functions" tables.
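
As a quick sanity check (a minimal sketch; the row counts will depend on the current state of the page), we can read the files back and inspect their dimensions:

import pandas as pd
from datetime import datetime

today = datetime.today().strftime("%Y-%m-%d")

# Read the freshly saved files back and print their shapes
for name in ("aact_tables_metadata", "aact_views_functions_metadata"):
    df = pd.read_csv(f"{today}_{name}.csv")
    print(name, df.shape)  # (number of rows, number of columns)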

Downloading the dynamic table

To tackle the dynamic table, we will use Selenium to load the page and extract the table. The code is a bit more complex than the previous one, that is why I decided to create a separate file for it.

Additionally, I used Python classes to organize the code and make it more readable and maintainable.

Code explanations

The numbered markers in the code (like # (1)!) point to the explanations listed right after the code block.

JavaScript code

In a few of the functions I decided to use JavaScript to interact with the page. This is because the page updates the DOM after the table is loaded, so interacting through previously located element references raised stale-element exceptions (such as Selenium's StaleElementReferenceException).
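
To illustrate the difference, here is a minimal sketch that deliberately skips the waits to keep the contrast clear (the selectors mirror the ones used in the script below):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://aact.ctti-clinicaltrials.org/data_dictionary")

# Direct interaction: the stored element reference can go stale if the
# jsGrid widget re-renders the pager after we located the link.
next_link = driver.find_element(By.CSS_SELECTOR, "span.jsgrid-pager-nav-button a")
next_link.click()  # may raise StaleElementReferenceException

# JavaScript interaction: the element is looked up in the live DOM at
# execution time, so there is no stored reference left to go stale.
driver.execute_script(
    """Array.from(document.querySelectorAll('span.jsgrid-pager-nav-button a'))
        .find(el => el.textContent.trim() === "Next").click()"""
)
driver.quit()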

download-dynamic-table-metadata.py
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # (1)!
from selenium.webdriver.support import expected_conditions as EC  # (2)!
from datetime import datetime


class AACTDataElementsDownloader:
    """
    Class to download the contents of the AACT data elements
    dynamic table.
    """

    def __init__(self, url):
        self.url = url
        self.driver = None
        self.download_path = None
        self.wait = None
        self.metadata = None

    def setup_webdriver(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument("--headless")  # (3)!
        chrome_options.add_argument("window-size=1920,1080")  # (4)!
        self.driver = webdriver.Chrome(options=chrome_options)
        self.wait = WebDriverWait(self.driver, timeout=10, poll_frequency=1)  # (5)!

    def load_page(self):
        self.driver.get(self.url)

    def wait_for_table(self):
        table_selector = '//*[@id="jsGrid"]/div[2]/table/tbody/tr[1]/td[3]'

        self.wait.until(EC.presence_of_element_located((By.XPATH, table_selector)))
        self.wait.until(
            EC.visibility_of(self.driver.find_element(By.XPATH, table_selector))
        )
        return True

    def click_next_button(self):
        self.driver.execute_script(
            """Array.from(document.querySelectorAll('span.jsgrid-pager-nav-button a'))
                .find(el => el.textContent.trim() === "Next").click()"""
        )

    def update_total_pages(self):
        total_pages = self.driver.execute_script(
            """return document.querySelector("div.jsgrid-pager-container")
                .textContent;"""
        )
        total_pages = int(total_pages.split(" ")[-2])
        return total_pages

    def get_total_pages(self):
        total_pages = self.update_total_pages()
        while total_pages == 0:
            total_pages = self.update_total_pages()
        return total_pages

    def get_data_elements_metadata(self):
        current_page = self.driver.execute_script(
            """return document.querySelector("div.jsgrid-pager-container")
                .textContent;"""
        )
        current_page = int(current_page.split(" ")[-4])
        total_pages = self.get_total_pages()
        table_headers = self.driver.execute_script(
            """return Array.from(document.querySelectorAll("#jsGrid div th"))
            .map(element => element.textContent);"""
        )
        table_data = []
        while current_page <= total_pages:
            print(f"Current page: {current_page} of {total_pages}")
            if self.wait_for_table():
                row_data = self.driver.execute_script(
                    """return Array.from(document.querySelectorAll("table.jsgrid-table tbody tr"))
                        .map(row => Array.from(row.querySelectorAll("td"))
                        .map(cell => cell.textContent));"""
                )
                table_data.extend(row_data)
            if current_page == total_pages:
                break
            self.click_next_button()
            current_page += 1
        self.driver.quit()
        self.metadata = pd.DataFrame.from_records(table_data, columns=table_headers)
        self.metadata.to_csv(self.download_path, index=False)
        print("Data Elements metadata downloaded and saved")

    def download_metadata(self):
        self.setup_webdriver()
        self.load_page()
        self.get_data_elements_metadata()

    def finalize(self):
        self.driver.quit()


if __name__ == "__main__":
    METADATA_URL = "https://aact.ctti-clinicaltrials.org/data_dictionary"
    today = datetime.today().strftime("%Y-%m-%d")
    metadata_downloader = AACTDataElementsDownloader(METADATA_URL)
    metadata_downloader.download_path = f"{today}_aact_data_elements_metadata.csv"
    metadata_downloader.download_metadata()
    metadata_downloader.finalize()
  1. WebDriverWait is used to add Selenium's explicit wait to the code. It will wait for the element to be present in the DOM before trying to extract its text.
  2. EC (expected condition) is used in combination with WebDriverWait to wait for something specific to happen before proceeding with the code.
  3. Headless mode is used to run the browser without a graphical interface. This is useful when running the code in a terminal, for example in a server environment.
  4. I used a Full HD window size to make sure that the table is fully rendered. This matters because the visibility check in wait_for_table only passes for elements that are actually rendered and displayed, which can depend on the window size.
  5. The wait is configured to check every 1 second if the expected condition is met with a time limit of 10 seconds before raising a TimeoutException.

As you can see there are quite a few lines of code, but the process is relatively straightforward:

  1. The page is loaded.
  2. A wait is added to make sure that the table is fully rendered.
  3. The total number of pages is extracted.
  4. The current page number is extracted.
  5. The table data from the current page is extracted and saved into a variable.
  6. The "Next" button is clicked and the current page counter is incremented by 1.
  7. Steps 2, 5 and 6 are repeated until the last page is reached.
  8. The table data is saved as a CSV file.
  9. The browser is closed.

After running the code with the command python download-dynamic-table-metadata.py, you should see an output like this:

Current page: 1 of 12
Current page: 2 of 12
Current page: 3 of 12
Current page: 4 of 12
Current page: 5 of 12
Current page: 6 of 12
Current page: 7 of 12
Current page: 8 of 12
Current page: 9 of 12
Current page: 10 of 12
Current page: 11 of 12
Current page: 12 of 12
Data Elements metadata downloaded and saved

Finally, a file named {today}_aact_data_elements_metadata.csv, where {today} is the current date in YYYY-MM-DD format, will be created in the same directory as the script.

Possible improvements

A few improvements that can be made are:

  • Refactor the get_data_elements_metadata method so the pagination loop detects the last page on its own, for example by checking whether the "Next" button is still available. This way we can avoid the need to know the total number of pages before starting the download.
  • Add a try-except block to handle possible exceptions and gracefully close the browser in case of an error (see the sketch after this list).
  • Add a logger to the script to make it easier to debug and understand what is happening.
  • Add tests.
  • Add type hints.
  • Add docstrings.
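
For the error-handling point, here is a minimal sketch of how download_metadata could be restructured (an assumption about the shape of the fix, not the final implementation):

    def download_metadata(self):
        self.setup_webdriver()
        try:
            self.load_page()
            self.get_data_elements_metadata()
        finally:
            # Always close the browser, even if an exception was raised
            # while loading the page or paginating through the table.
            self.driver.quit()

With this structure, the driver.quit() call inside get_data_elements_metadata and the separate finalize method would become redundant and could be removed.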

Conclusion

In this post I presented how I downloaded the metadata of the "Access to Aggregate Content of ClinicalTrials.gov" (AACT) database using Python, Selenium and Pandas. Selenium was used to deal with the JavaScript-dependent table, and Pandas to extract the data from the simple HTML tables and store all the information in CSV files.

We saw that the Selenium code is a bit more complex than the Pandas code, so it is a good idea to use classes to organize it and make it more readable and maintainable. Additionally, we saw that it is possible to use JavaScript (through Selenium's execute_script) to interact with the page and avoid the stale-element exceptions that can be raised when interacting through previously located element references.

In the upcoming posts I will present how I cleaned the data and metadata, calculated some statistics and stored the results in files ready to be used in a dashboard.


If you find this content useful, please consider supporting my work by buying me a coffee.
