Downloading a static copy of the AACT database with Python

Hi, in this post I will share the method I used to download a static copy of the "Access to Aggregate Content of ClinicalTrials.gov" (AACT) database.

Before starting, a good question to ask is: why would you want to download this information? As a health care professional or a data professional working in the health care realm, having access to a comprehensive database of clinical trials can be very useful. In particular, the AACT database is one of the most comprehensive and detailed databases of clinical trials I have found: it is updated every day, it has good documentation, and it is really easy to use.

To learn more about the information it contains, I recommend you visit the AACT database website and the ClinicalTrials.gov website. Here is a schematic of where the data comes from:

Source: https://clinicaltrials.gov/about-site/about-ctg

Other presentations of the AACT database

Besides downloading static copies of the database, you can also access the AACT database through a web interface or by connecting to it as a SQL database. You can find more information about these options in the AACT database download section.

Requirements

A working Linux installation. I'm running Ubuntu 22.04 in WSL (Windows Subsystem for Linux) on Windows 11.
A Python environment. I'm using miniconda with a Python 3.11.7 virtual environment.
The requests Python package. You can install it with the following command (remember to activate your virtual environment before running the install command):

pip install requests

Download the database

The first step is finding the URL from which we can download the database. In the dropdown menu, we can see that each option tag is linked to a URL with the following pattern:

/static/static_db_copies/daily/<date>

With that information we can recreate the download URL. See the URL variable on line 34 of the download-database.py script below.
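As a small standalone sketch of just the URL step, the snippet below builds the download URL from a date string. The base URL is the one used in download-database.py; the helper name build_download_url and the date check are my own additions for illustration, not part of the original script. It also shows why the trailing slash on the base URL matters when using urljoin.

```python
from datetime import date
from urllib.parse import urljoin

# Base URL as used in download-database.py; note the trailing slash
BASE_URL = "https://aact.ctti-clinicaltrials.org/static/static_db_copies/daily/"


def build_download_url(snapshot_date: str) -> str:
    """Return the download URL for a snapshot date (hypothetical helper)."""
    # fromisoformat raises ValueError for malformed or impossible dates
    parsed = date.fromisoformat(snapshot_date)
    # isoformat() normalizes the date back to YYYY-MM-DD
    return urljoin(BASE_URL, parsed.isoformat())


url = build_download_url("2024-03-01")
print(url)

# Without the trailing slash, urljoin replaces the last path segment ("daily")
url_without_slash = urljoin(BASE_URL.rstrip("/"), "2024-03-01")
print(url_without_slash)
```

The second print shows the pitfall: the "daily" segment is dropped, so the resulting URL no longer matches the pattern above. Whether a given date is actually available on the server is a separate question that this sketch does not check.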

Here is the complete code, feel free to read the comments to understand what each part does and copy the code to your own project:

download-database.py
import requests
import zipfile
from pathlib import Path
from urllib.parse import urljoin


# Get current working directory
ROOT_DIR = Path.cwd()

# Create the downloads directory
DOWNLOADS_DIR = ROOT_DIR.joinpath("downloads")
DOWNLOADS_DIR.mkdir(parents=True, exist_ok=True)

# Create the directory where data will be stored
DATA_DIR = ROOT_DIR.joinpath("data")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Select the desired date of the database
DATE_OF_DATABASE = "2024-03-01"

# Assign a name to the directory where the raw data will be stored
DB_DIR_NAME = f"{DATE_OF_DATABASE}_aact_database"

# Assign a name to the downloaded zip file
DOWNLOAD_DB_FILE_NAME = f"{DB_DIR_NAME}.zip"

# Create the complete download path for the zipfile
DOWNLOAD_PATH = DOWNLOADS_DIR.joinpath(DOWNLOAD_DB_FILE_NAME)

# Create the complete path where the database will be unzipped
UNZIP_DESTINATION_PATH = DATA_DIR.joinpath(DB_DIR_NAME)

# Build the URL
URL = urljoin("https://aact.ctti-clinicaltrials.org/static/static_db_copies/daily/", DATE_OF_DATABASE)

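# Download the file (stream=True avoids loading the whole file into memory)
response = requests.get(URL, stream=True)

# Stop here if the server returned an error (e.g. 404 for an unavailable date),
# otherwise an error page would be saved as the zip file
response.raise_for_status()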

# Get the total file size in bytes
total_size = int(response.headers.get('content-length', 0))

# Print the total file size in MB
print(f"Total size: {total_size/1024/1024:.2f} MB")

# Open a file for writing in binary mode
with open(DOWNLOAD_PATH, 'wb') as f:
    # Iterate over the response content with a chunk size of 1024 bytes
    print(f"Starting download: {DOWNLOAD_PATH}")
    for data in response.iter_content(chunk_size=1024):
        # Write each chunk to the file
        f.write(data)
    print(f"Finished download: {DOWNLOAD_PATH}")

# Unzip the file
with zipfile.ZipFile(DOWNLOAD_PATH, "r") as zip_file:
    zip_file.extractall(UNZIP_DESTINATION_PATH)

# Print the path where the database was unzipped
print(f"Database downloaded and unzipped at {UNZIP_DESTINATION_PATH}")
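If the download is interrupted or the server returns something other than the archive, the unzip step in the script fails with a fairly cryptic error. As a minimal sketch, assuming you want to check the file before extracting it, the helper below uses zipfile.is_zipfile and ZipFile.testzip from the standard library. The function name extract_if_valid and the throwaway demo archive are hypothetical and only there to make the example self-contained.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_if_valid(zip_path: Path, destination: Path) -> list[str]:
    """Extract zip_path into destination and return the archive member names."""
    # A failed download often leaves an HTML error page or a truncated file
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_file:
        # testzip() returns the name of the first corrupt member, or None
        corrupt_member = zip_file.testzip()
        if corrupt_member is not None:
            raise ValueError(f"Corrupt member in archive: {corrupt_member}")
        zip_file.extractall(destination)
        return zip_file.namelist()


# Self-contained demo with a throwaway archive (not the real AACT file)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("postgres.dmp", "placeholder contents")
    extracted = extract_if_valid(demo_zip, tmp_dir / "out")
    print(extracted)
```

In download-database.py, the equivalent call would be extract_if_valid(DOWNLOAD_PATH, UNZIP_DESTINATION_PATH) in place of the plain extractall step.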

To run download-database.py, use the following command:

python download-database.py

After a couple of minutes, the script should finish and you should have the following directory structure:

.
├── data
│   └── 2024-03-01_aact_database
│        ├── data_dictionary.csv
│        ├── nlm_protocol_definitions.html
│        ├── nlm_results_definitions.html
│        ├── postgres.dmp
│        └── schema.png
├── downloads
│   └── 2024-03-01_aact_database.zip
└── download-database.py
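To confirm that the extraction produced this structure, a short check like the following can help. It is only a sketch: the list of expected file names is copied from the tree above and may differ for other snapshot dates, and find_missing_files is a hypothetical helper name. The demo runs against a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path

# File names copied from the directory listing above (assumed, may vary by date)
EXPECTED_FILES = [
    "data_dictionary.csv",
    "nlm_protocol_definitions.html",
    "nlm_results_definitions.html",
    "postgres.dmp",
    "schema.png",
]


def find_missing_files(db_dir: Path, expected: list[str]) -> list[str]:
    """Return the expected file names that are not present in db_dir."""
    return [name for name in expected if not db_dir.joinpath(name).is_file()]


# Demo against a throwaway directory that only contains the dump file
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    demo_dir.joinpath("postgres.dmp").write_text("placeholder")
    missing = find_missing_files(demo_dir, EXPECTED_FILES)
    print(missing)
```

Against the real download, you would pass the unzipped directory (data/2024-03-01_aact_database in this post) and expect an empty list.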

The file we are interested in is the postgres.dmp file, which is a PostgreSQL database dump (backup) of the AACT database. We can use it to recreate a local copy of the database.

The other files are metadata, which I didn't find very useful. I prefer to use the information on the AACT website to get some insight into how the database is structured.

Conclusion

In this post I shared a method to download a static copy of the AACT database using the Python requests library. In upcoming blog posts I will show how to download the database metadata and how to create a local database from the downloaded dump file with Docker.

If you find this content useful, please consider supporting my work by buying me a coffee.
