Examples using API¶
The API can be used in a variety of ways to analyze a dataset and generate compliance reports. This page describes three typical use cases for the API.
General Use-Case¶
The most common use case for the API is to generate a compliance report for a single MRI dataset. In this scenario, a user typically accesses the API through the command-line interface or by writing Python scripts. The user provides the API with the location of the dataset, along with any relevant metadata or additional arguments. The API then analyzes the dataset and generates a report indicating which modalities/subjects/sessions conform to the specified protocol.
On the CLI, specify the arguments as given below:
mrqa --data-source /path/to/dataset --format dicom --name my_dicom_dataset
To check for a BIDS dataset:
mrqa --data-source /path/to/dataset --format bids --name my_bids_dataset
Similarly, in a Python script:
from MRdataset import import_dataset
from mrQA import check_compliance
data_folder = "/home/datasets/XYZ_dataset"
output_dir = '/home/mr_reports/XYZ'
dicom_dataset = import_dataset(data_source=data_folder,
                               ds_format='dicom',
                               name='XYZ_study')
report_path = check_compliance(dataset=dicom_dataset,
                               output_dir=output_dir)
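The returned report_path points to the generated report and can be used directly, for example to print it and open it in a browser. This is a small illustrative follow-up, assuming report_path is the path to the HTML report written by check_compliance:

# Assumes report_path is the HTML report returned by check_compliance above.
import webbrowser

print(f'Compliance report written to {report_path}')
webbrowser.open(f'file://{report_path}')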
To check a BIDS dataset, set the ds_format argument to 'bids':
from MRdataset import import_dataset
from mrQA import check_compliance
data_folder = "/home/datasets/XYZ_dataset"
output_dir = '/home/mr_reports/XYZ'
bids_dataset = import_dataset(data_source=data_folder,
                              ds_format='bids',
                              name='XYZ_study')
report_path = check_compliance(dataset=bids_dataset,
                               output_dir=output_dir)
Parallel Use-Case¶
In some cases, a user may need to generate compliance reports for a large MR dataset. Processing large DICOM datasets can be limited by disk-read speed, all the more so when the data is accessed over a network. Typically, the API takes about an hour to read 100,000 .dcm files. In this scenario, we recommend splitting the dataset, reading the subsets in parallel, and then merging them into a single dataset that is checked for compliance. The complete process is divided into three steps:
Create bash-scripts for each job
from mrQA.run_parallel import create_script

script_list_filepath, mrds_list_filepath = create_script(
    data_source=data_source,
    subjects_per_job=5,
    conda_env='mrcheck',
    conda_dist='anaconda3',
    output_dir=output_dir,
    hpc=False,
)
This will generate two txt files: script_list_filepath and mrds_list_filepath. These files contain the paths to the corresponding bash scripts and the mrds files for each job, respectively. Note that you need to specify the number of subjects per job, i.e. the number of subjects that will be processed by each job. The number of jobs equals the number of subjects divided by the number of subjects per job; for example, with 100 subjects and 5 subjects per job, you will have 20 jobs. The conda_env and conda_dist arguments are required to activate the conda environment in the bash scripts. The hpc argument is optional and specifies whether the jobs are to be run on an HPC.
Submit jobs / execute the generated scripts:
from mrQA.run_parallel import submit_job

submit_job(scripts_list_filepath=script_list_filepath,
           mrds_list_filepath=mrds_list_filepath,
           hpc=False)
This will submit the jobs to the HPC or execute the scripts locally. The hpc argument specifies where the jobs are run: if hpc is False, the scripts are executed locally; if hpc is True, the scripts are submitted to the HPC using sbatch.
Merge datasets and generate report:
from mrQA.run_merge import check_and_merge

check_and_merge(
    mrds_list_filepath=mrds_list_filepath,
    output_path=output_path,
    name=name,
)
This will merge the datasets generated by each job and produce a single report. The mrds_list_filepath is the path to the txt file containing the paths to the mrds files for each job, output_path is the directory where the report will be saved, and name is the name of the dataset.
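Putting the three steps together, a minimal end-to-end sketch could look like the following. The paths and dataset name are placeholders, and step 3 assumes that all jobs launched in step 2 have finished (on an HPC, wait for the submitted jobs to complete before merging):

from mrQA.run_parallel import create_script, submit_job
from mrQA.run_merge import check_and_merge

data_source = '/home/datasets/XYZ_dataset'    # placeholder path
output_dir = '/home/mr_reports/XYZ_parallel'  # placeholder path
name = 'XYZ_study'

# Step 1: split the dataset and write one bash script (and mrds file) per job
script_list_filepath, mrds_list_filepath = create_script(
    data_source=data_source,
    subjects_per_job=50,   # recommended range is 50-100, see the notes below
    conda_env='mrcheck',
    conda_dist='anaconda3',
    output_dir=output_dir,
    hpc=False,
)

# Step 2: execute the generated scripts locally (set hpc=True to submit via sbatch)
submit_job(scripts_list_filepath=script_list_filepath,
           mrds_list_filepath=mrds_list_filepath,
           hpc=False)

# Step 3: merge the per-job subsets and generate a single compliance report
check_and_merge(mrds_list_filepath=mrds_list_filepath,
                output_path=output_dir,
                name=name)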
On the CLI, the above steps can be executed as follows:
mrqa_parallel --data-source /path/to/dataset --format dicom --name my_dicom_dataset --subjects-per-job 5 --conda_env mrcheck --conda_dist anaconda3 --output_dir /path/to/output_dir --hpc False
Additional points to note:
The recommended value of subjects-per-job is 50-100, depending on the size of the dataset.
The current framework was built for HPCs running SLURM. If you are using a different scheduler, you will need to modify the submit_job function in mrQA/run_parallel.py to submit jobs to your scheduler (a hypothetical sketch of this kind of change is given after this list).
The conda_env and conda_dist arguments are required to activate the conda environment in the bash-script. If you are not using conda, you can remove these arguments and the corresponding lines from the bash-script.
The parallelization framework is currently available only for DICOM datasets; we are working on adding support for BIDS datasets. In addition, it has so far been tested on the ABCD dataset, and we are working on testing it on other datasets.
The subject list is generated by reading the folders with the prefix 'sub-' in the dataset. If your dataset does not follow this convention, you will need to modify the _get_subject_ids function in mrQA/parallel_utils.py to generate the subject list (a second sketch after this list illustrates the idea). The subject list is used to split the dataset into subsets and to merge the subsets back into a single dataset. We are working on relaxing this requirement.
Please reach out to us with any questions or suggestions.
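For illustration of the scheduler point above: the helper below is a hypothetical sketch, not mrQA's actual code, showing how the generated scripts could be submitted through a different scheduler command (e.g. qsub for SGE/PBS). The function name and its arguments are assumptions made for this example:

import subprocess
from pathlib import Path

def submit_with_scheduler(script_list_filepath, submit_cmd='qsub'):
    """Submit every generated bash script using the given scheduler command."""
    scripts = Path(script_list_filepath).read_text().splitlines()
    for script in scripts:
        if script.strip():
            # e.g. runs 'qsub /path/to/job_0.sh' instead of sbatch
            subprocess.run([submit_cmd, script.strip()], check=True)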
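Similarly, for the subject-folder convention: a hypothetical sketch of the kind of folder-scanning logic to adapt if your subject folders do not start with 'sub-'. The function name and prefix argument are illustrative only; the actual logic lives in _get_subject_ids in mrQA/parallel_utils.py:

from pathlib import Path

def list_subject_folders(data_root, prefix='sub-'):
    """Return the names of subject folders under data_root that start with prefix."""
    return sorted(p.name for p in Path(data_root).iterdir()
                  if p.is_dir() and p.name.startswith(prefix))

# For a dataset organised as 'subject_001', 'subject_002', ... one could use:
# subjects = list_subject_folders('/path/to/dataset', prefix='subject_')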
Monitoring Use-Case¶
In some cases, a user may want to monitor the compliance of a dataset over time; for example, to check whether the dataset is still compliant with the protocol after a new subject is added. The user can use the API to generate a report for the dataset. If the monitor function is executed again, the dataset is checked for any new files and the report is updated accordingly. If any changes are detected, the user can generate a new report and compare it with the previous one.
To monitor a dataset:
from mrQA import monitor
monitor(name='my_dataset',
        data_source='/path/to/dataset',
        output_dir='/path/to/output_dir')
This will generate a report for the dataset and save it in output_dir as my_dataset_report_timestamp.html, i.e. with a timestamp appended to the report name. The name argument is the name of the dataset, data_source is the path to the dataset, and output_dir is the directory where the report will be saved.
On the CLI, the above steps can be executed as follows:
mrqa_monitor --name my_dataset --data-source /path/to/dataset --output_dir /path/to/output_dir
Additional points to note:
The name argument should be the same each time the user wants to monitor the same dataset. The files are saved in the output_dir as name_report_timestamp.html. If the name argument is different, the files will be saved under a different name and it will not be possible to monitor the dataset. If the name argument is not provided, a random number is used as the name.
The monitor function can currently be used with DICOM datasets only. We are working on adding support for BIDS datasets.
We recommend running the monitor function at least once a day, so that the report stays up to date with any new files added to the dataset (a minimal scheduling sketch follows this list).
Please reach out to us with any questions or suggestions.
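As a minimal sketch of one way to run the daily check mentioned above (not part of mrQA; in practice a cron job or task scheduler invoking mrqa_monitor would serve the same purpose):

import time
from mrQA import monitor

while True:
    # Keep the name fixed across runs so the same dataset record is updated.
    monitor(name='my_dataset',
            data_source='/path/to/dataset',
            output_dir='/path/to/output_dir')
    time.sleep(24 * 60 * 60)   # wait a day before checking for new files again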