Downloading AACT database metadata with Pandas and Selenium
In this post I will present how I downloaded the HTML tables that contain the metadata of the "Access to Aggregate Content of ClinicalTrials.gov" (AACT) database and stored them as comma-separated values (CSV) files. The objective is to use this information to create an interactive dashboard that will help users understand the contents of the database.
The metadata consists of three HTML tables on the same webpage:
- The "AACT Data Elements" table that contains the names of the tables and the columns along with the column descriptions.
- The "AACT Tables" table, with information about study-related tables. The information that I'm interested in is the table name and their description.
- The "AACT Views & Functions" table, which contains information about the views and functions in used to make data retrieval easier.
Without further ado, let's get started.
Requirements
- A working Linux installation. I'm running Ubuntu 22.04 in WSL (Windows Subsystem for Linux) on Windows 11.
- A Python environment. I'm using Miniconda with a Python 3.11.7 virtual environment.
- Installing the Chrome driver. On Ubuntu it can be done using the commands below:

  ```
  sudo apt-get update
  sudo apt install chromium-chromedriver -y
  ```
- Installing the following Python packages:
  - `selenium`: we will use it to load the JavaScript-dependent "AACT Data Elements" table and extract its text. This approach is relatively slow, but since it is a one-off operation I don't consider it worth spending time on optimization.
  - `pandas`: we will use it to extract the "AACT Tables" and "AACT Views & Functions" tables. We don't need Selenium for those because they are static, meaning that they are already in the HTML when the page is loaded and don't depend on JavaScript to be rendered. Additionally, we will use `pandas` to do some data manipulation and then store the tables as CSV files.
  - `lxml`: a dependency of `pandas` that is used to parse the HTML and extract the tables.
To install the packages, run the following command:
```
pip install selenium pandas lxml
```
Once we have the requirements installed, we can start coding.
Download the metadata
For downloading the metadata, we will use two approaches: one for the static tables (the simple ones) and another for the dynamic table (the one that depends on JavaScript to be rendered).
Downloading the static tables
After inspecting the HTML source code of the page, I found that both static tables have the same class name, `dictionaryDisplay`. With that information, we can use the `pandas.read_html` function to extract the tables from the HTML and then store them as CSV files. I created a file named `download-static-tables-metadata.py` with the following code:
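Here is a minimal sketch of what that script can look like. The AACT data dictionary URL, the order in which the tables appear on the page and the output file names are assumptions that may need adjusting:

```python
# download-static-tables-metadata.py
import pandas as pd

# AACT data dictionary page (assumed URL; adjust if the page has moved).
URL = "https://aact.ctti-clinicaltrials.org/data_dictionary"

# Both static tables share the class name "dictionaryDisplay", so
# read_html returns them together, in document order.
aact_tables, aact_views_functions = pd.read_html(
    URL, attrs={"class": "dictionaryDisplay"}
)

# Store each table as a CSV file next to the script.
aact_tables.to_csv("aact-tables.csv", index=False)
aact_views_functions.to_csv("aact-views-functions.csv", index=False)
```

Note that `pandas.read_html` fetches the page itself and delegates the parsing to `lxml`, which is why no separate HTTP client is needed.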
After running the code with the command `python download-static-tables-metadata.py`, we will have two CSV files with the metadata of the "AACT Tables" and "AACT Views & Functions" tables.
Downloading the dynamic table
To tackle the dynamic table, we will use Selenium to load the page and extract the table. The code is a bit more complex than the previous one, which is why I decided to create a separate file for it.
Additionally, I used Python classes to organize the code and make it more readable and maintainable.
JavaScript code
In a few of the functions I decided to use JavaScript to interact with the page. This is because the page updates the DOM after the table is loaded, which raised exceptions when trying to interact with the elements through Selenium directly.
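Below is a condensed sketch of that script. The page URL, the CSS selectors for the table, the paginator and the "next" button, and the column names are all assumptions and will likely need adjusting to the real page:

```python
# download-dynamic-table-metadata.py
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

URL = "https://aact.ctti-clinicaltrials.org/data_dictionary"  # assumed


class DataElementsDownloader:
    def __init__(self) -> None:
        options = Options()
        options.add_argument("--headless=new")  # no graphical interface
        options.add_argument("--window-size=1920,1080")  # FullHD viewport
        self.driver = webdriver.Chrome(options=options)
        # Explicit wait: poll every 1 second, give up after 10 seconds.
        self.wait = WebDriverWait(self.driver, timeout=10, poll_frequency=1)

    def wait_for_table(self) -> None:
        # Block until at least one table row is present in the DOM.
        # A production script might instead wait for the old rows to go stale.
        self.wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
        )

    def get_total_pages(self) -> int:
        # The paginator shows a label like "1 of 12" (selector assumed).
        label = self.driver.find_element(By.CSS_SELECTOR, ".pagination-label")
        return int(label.text.split("of")[-1])

    def click_next(self) -> None:
        # Click via JavaScript: the page re-renders the DOM after every
        # page change, so a plain element.click() can raise exceptions.
        button = self.driver.find_element(By.CSS_SELECTOR, "button.next")
        self.driver.execute_script("arguments[0].click();", button)

    def extract_rows(self) -> list[list[str]]:
        rows = self.driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
        return [
            [cell.text for cell in row.find_elements(By.TAG_NAME, "td")]
            for row in rows
        ]

    def download_table(self) -> None:
        self.driver.get(URL)
        self.wait_for_table()
        total_pages = self.get_total_pages()
        data: list[list[str]] = []
        for page in range(1, total_pages + 1):
            print(f"Current page: {page} of {total_pages}")
            data.extend(self.extract_rows())
            if page < total_pages:
                self.click_next()
                self.wait_for_table()
        # Column names are assumptions.
        columns = ["table", "column", "description"]
        pd.DataFrame(data, columns=columns).to_csv(
            "aact-data-elements.csv", index=False
        )
        print("Data Elements metadata downloaded and saved")
        self.driver.quit()


if __name__ == "__main__":
    DataElementsDownloader().download_table()
```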
- `WebDriverWait` is used to add Selenium's explicit wait to the code. It will wait for the element to be present in the DOM before trying to extract its text.
- `EC` (expected conditions) is used in combination with `WebDriverWait` to wait for something specific to happen before proceeding with the code.
- Headless mode is used to run the browser without a graphical interface. This is useful when running the code in a terminal, for example in a server environment.
- I used a FullHD window size to make sure that the table is fully rendered. This is important because the `find_element` method will only find the elements that are visible in the viewport.
- The wait is configured to check every 1 second whether the expected condition is met, with a time limit of 10 seconds before raising a `TimeoutException`.
As you can see, there are quite a few lines of code, but the process is relatively straightforward:
1. The page is loaded.
2. A wait is added to make sure that the table is fully rendered.
3. The total number of pages is extracted.
4. The current page number is extracted.
5. The table data from the current page is extracted and saved into a variable.
6. The next button is clicked and the current page counter is incremented by 1.
7. Steps 2, 5 and 6 are repeated until the last page is reached.
8. The table data is saved as a CSV file.
9. The browser is closed.
After running the code with the command `python download-dynamic-table-metadata.py`, you should see an output like this:
```
Current page: 1 of 12
Current page: 2 of 12
Current page: 3 of 12
Current page: 4 of 12
Current page: 5 of 12
Current page: 6 of 12
Current page: 7 of 12
Current page: 8 of 12
Current page: 9 of 12
Current page: 10 of 12
Current page: 11 of 12
Current page: 12 of 12
Data Elements metadata downloaded and saved
```
Finally, a file with the name `aact-data-elements.csv` will be created in the same directory as the script.
Possible improvements
A few improvements that can be made are:
- Refactor the `download_table` method to use a while loop instead of a for loop. This way we can avoid the need to know the total number of pages before starting the download (see the sketch after this list).
- Add a try-except block to handle possible exceptions and gracefully close the browser in case of an error.
- Add a logger to the script to make it easier to debug and understand what is happening.
- Add tests.
- Add type hints.
- Add docstrings.
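To make the first two items concrete, here is one possible shape for a refactored `download_table`, reusing the names from the sketch above (the "next" button selector is still an assumption). A `finally` clause guarantees the browser is closed whether the download succeeds or fails:

```python
def download_table(self) -> None:
    self.driver.get(URL)
    try:
        data: list[list[str]] = []
        while True:
            self.wait_for_table()
            data.extend(self.extract_rows())
            # Stop when the paginator's "next" button can no longer be
            # clicked, instead of counting pages up front.
            button = self.driver.find_element(By.CSS_SELECTOR, "button.next")
            if not button.is_enabled():
                break
            self.driver.execute_script("arguments[0].click();", button)
        pd.DataFrame(data).to_csv("aact-data-elements.csv", index=False)
    finally:
        # Runs on success and on failure alike, so no orphan browsers.
        self.driver.quit()
```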
Conclusion
In this post I presented how I downloaded the metadata of the "Access to Aggregate Content of ClinicalTrials.gov" (AACT) database using Python, Selenium and Pandas. Selenium was used to deal with the JavaScript-dependent table, and Pandas to extract the data from the simple HTML tables and store all the information in CSV files.
We saw that the Selenium code is a bit more complex than the Pandas code, so it is a good idea to use classes to organize it and make it more readable and maintainable. Additionally, we saw that it is possible to execute JavaScript (through Selenium) to interact with the page and avoid some of the exceptions that can be raised when interacting with the elements through Selenium directly.
In the upcoming posts I will present how I cleaned the data and metadata, calculated some statistics and stored the results in files ready to be used in a dashboard.
If you find this content useful, please consider supporting my work by buying me a coffee.