Please see LICENSE
for details.
You can install gatk through Conda
conda install -c vacation gatk
If you run into name conflicts with other GATKs, or the installer hangs, you may need to use
conda install --override-channels -c vacation gatk
The jar files are available in the conda installation at `<path to conda>/envs/<environment name>/jar/gatk.jar`
or you can just use the `gatk` command. If you want to adjust the java options, make your first argument the java options, e.g.

gatk -Xmx4g <subcommand>

This example requests 4g of heap memory; without the argument, the wrapper runs with its default JVM settings.
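To make the argument ordering concrete, here is a toy stand-in for the wrapper (the real `gatk` command is not invoked; the stub only echoes how a leading java-options argument is interpreted):

```shell
# Toy stand-in for the conda `gatk` wrapper, illustrating argument order only:
# JVM options, if supplied, must be the first argument, before the subcommand.
gatk() {
  case "$1" in
    -*) echo "JVM options: $1; subcommand: $2" ;;
    *)  echo "JVM options: (defaults); subcommand: $1" ;;
  esac
}

gatk -Xmx4g PrintReads   # explicit 4g heap
gatk PrintReads          # wrapper's default JVM settings
```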
This project is in a pre-release stage of development. It is subject to change without warning. Do not use this code for production work.
If you are looking for the current version of GATK to use in production work (i.e., GATK3), please see the GATK website, where you can download a precompiled executable, read documentation, ask questions and receive technical support.
This repository contains the next generation of the Genome Analysis Toolkit (GATK). The contents of this repository are 100% open source and released under the BSD 3-Clause license (see LICENSE.TXT).
GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.
Builds use the included `./gradlew` script, which will download and use an appropriate gradle version automatically (see examples below). GATK tools are run via the `gatk-launch` frontend script.

To get the large test files, run `git lfs install` after downloading, followed by `git lfs pull` from the root of your git clone to download the large files. The download is several hundred megabytes.

Quick start:
* Build GATK4: `./gradlew bundle` (creates `gatk-VERSION.zip` in `build/`)
* Get help on `gatk-launch`: `./gatk-launch --help`
* Print a list of available tools: `./gatk-launch --list`
* Run a tool: `./gatk-launch PrintReads -I src/test/resources/NA12878.chr17_69k_70k.dictFix.bam -O output.bam`
* Get help on a particular tool: `./gatk-launch PrintReads --help`
You can download and run pre-built versions of GATK4 from the following places:
Starting with the beta release, a zip archive with everything you need to run GATK4 can be downloaded for each release from the github releases page.
Starting with the beta release, you can download a GATK4 docker image from our dockerhub repository. We also host unstable nightly development builds on this dockerhub repository.
To do a full build of GATK4, run:
./gradlew bundle
Equivalently, you can just type:
./gradlew
This creates a zip archive in the `build/` directory with a name like `gatk-VERSION.zip`, containing a complete standalone GATK distribution, including our launcher `gatk-launch`, both the local and spark jars, and this README.

Other ways to build:
* `./gradlew installDist`
* `./gradlew installAll`
* `./gradlew localJar` - the resulting jar will be in `build/libs` with a name like `gatk-package-VERSION-local.jar`, and can be used outside of your git clone.
* `./gradlew sparkJar` - the resulting jar will be in `build/libs` with a name like `gatk-package-VERSION-spark.jar`, and can be used outside of your git clone.

To remove previous builds, run:
./gradlew clean
For faster gradle operations, add `org.gradle.daemon=true` to your `~/.gradle/gradle.properties` file.
This will keep a gradle daemon running in the background and avoid the ~6s gradle start up time on every command.
Gradle keeps a cache of dependencies used to build GATK. By default this goes in `~/.gradle`. If there is insufficient free space in your home directory, you can change the location of the cache by setting the `GRADLE_USER_HOME` environment variable.
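Both tweaks can be scripted; a sketch (the cache location below is a stand-in, point it at whichever filesystem has room):

```shell
# Enable the gradle daemon by appending to ~/.gradle/gradle.properties (idempotent):
mkdir -p "$HOME/.gradle"
grep -qs 'org.gradle.daemon=true' "$HOME/.gradle/gradle.properties" || \
    echo 'org.gradle.daemon=true' >> "$HOME/.gradle/gradle.properties"

# Relocate the dependency cache if your home directory is short on space.
# BIG_DISK is an example placeholder; /tmp is used only so the snippet runs anywhere.
BIG_DISK="${BIG_DISK:-/tmp}"
export GRADLE_USER_HOME="$BIG_DISK/gradle-cache"
mkdir -p "$GRADLE_USER_HOME"
```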
The standard way to run GATK4 tools is via the gatk-launch
wrapper script located in the root directory of a clone of this repository.
`gatk-launch` can be run:
* by extracting the zip archive produced by `./gradlew bundle` to a directory, and running `gatk-launch` from there
* by placing the `gatk-launch` script within the same directory as fully-packaged GATK jars produced by `./gradlew localJar` and/or `./gradlew sparkJar`
* by defining the environment variables `GATK_LOCAL_JAR` and `GATK_SPARK_JAR`, and setting them to the paths to the GATK jars produced by `./gradlew localJar` and/or `./gradlew sparkJar`
`gatk-launch` can run non-Spark tools as well as Spark tools, and can run Spark tools locally, on a Spark cluster, or on Google Cloud Dataproc. Note that invoking the jars with `java -jar` directly and bypassing `gatk-launch` causes several important system properties to not get set, including the htsjdk compression level!

For help on using `gatk-launch` itself, run `./gatk-launch --help`.

To print a list of available tools, run `./gatk-launch --list`.
Spark-based tools have a name ending in `Spark` (e.g., `BaseRecalibratorSpark`). Most other tools are non-Spark-based.

To print help for a particular tool, run `./gatk-launch ToolName --help`.

To run a non-Spark tool, or to run a Spark tool locally, the syntax is: `./gatk-launch ToolName toolArguments`.
Examples:
./gatk-launch PrintReads -I input.bam -O output.bam
./gatk-launch PrintReadsSpark -I input.bam -O output.bam
You can pass JVM options to `gatk-launch` with the `--javaOptions` argument:
./gatk-launch --javaOptions "-Xmx4G" <rest of command>
./gatk-launch --javaOptions "-Xmx4G -XX:+PrintGCDetails" <rest of command>
GATK4 can read inputs directly from Google Cloud Storage, e.g.:

./gatk-launch PrintReads -I gs://mybucket/path/to/my.bam -L 1:10000-20000 -O output.bam

To authenticate, either log in with

gcloud auth application-default login

or activate a service account

gcloud auth activate-service-account --key-file "$PATH_TO_THE_KEY_FILE"

and set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to the file:

export GOOGLE_APPLICATION_CREDENTIALS="$PATH_TO_THE_KEY_FILE"
./gatk-launch ToolName toolArguments -- --sparkRunner SPARK --sparkMaster <master_url> additionalSparkArguments
Examples:
./gatk-launch PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
-- \
--sparkRunner SPARK --sparkMaster <master_url>
./gatk-launch PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
-- \
--sparkRunner SPARK --sparkMaster <master_url> \
--num-executors 5 --executor-cores 2 --executor-memory 4g \
--conf spark.yarn.executor.memoryOverhead=600
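The way the launcher partitions the command line at `--` can be sketched with a small stub (nothing is actually submitted to Spark here; the function only shows which arguments would go to the tool and which to spark-submit):

```shell
# Stub illustrating the `--` convention: arguments before `--` go to the GATK tool,
# arguments after it are passed through to spark-submit.
split_args() {
  local tool=()
  local spark=()
  local after=0
  for a in "$@"; do
    if [ "$a" = "--" ]; then after=1; continue; fi
    if [ "$after" -eq 0 ]; then tool+=("$a"); else spark+=("$a"); fi
  done
  echo "tool args:  ${tool[*]}"
  echo "spark args: ${spark[*]}"
}

split_args PrintReadsSpark -I in.bam -O out.bam -- --sparkRunner SPARK --sparkMaster local
```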
Note that the Spark-specific arguments are separated from the tool arguments by a `--`.

When running on a Spark cluster, `gatk-launch` invokes the `spark-submit` tool behind-the-scenes.

When running on Google Cloud Dataproc, `gatk-launch` invokes the `gcloud` tool behind-the-scenes. As part of the installation, be sure that you follow the `gcloud` setup instructions here. As this library is frequently updated by Google, we recommend updating your copy regularly to avoid any version-related difficulties. Files on Google Cloud Storage are specified with paths of the form `gs://my-gcs-bucket/path/to/my-file`.
You can run GATK4 jobs on Dataproc from your local computer or from the VM (master node) on the cloud.
Once you're set up, you can run a Spark tool on your Dataproc cluster using a command of the form:
./gatk-launch ToolName toolArguments -- --sparkRunner GCS --cluster myGCSCluster additionalSparkArguments
Examples:
./gatk-launch PrintReadsSpark \
-I gs://my-gcs-bucket/path/to/input.bam \
-O gs://my-gcs-bucket/path/to/output.bam \
-- \
--sparkRunner GCS --cluster myGCSCluster
./gatk-launch PrintReadsSpark \
-I gs://my-gcs-bucket/path/to/input.bam \
-O gs://my-gcs-bucket/path/to/output.bam \
-- \
--sparkRunner GCS --cluster myGCSCluster \
--num-executors 5 --executor-cores 2 --executor-memory 4g \
--conf spark.yarn.executor.memoryOverhead=600
Note that the Spark-specific arguments are separated from the tool arguments by a `--`.

To avoid uploading the GATK jar to GCS on every run, set the `GATK_GCS_STAGING` environment variable to a bucket you have write access to (e.g., `export GATK_GCS_STAGING=gs://<my_bucket>/`).

Certain Spark tools require the reference in 2bit format (e.g., `BaseRecalibratorSpark`, `BQSRPipelineSpark` and `ReadsPipelineSpark`). You can convert your fasta to 2bit by using the `faToTwoBit` utility from UCSC - see also the documentation for `faToTwoBit`.
.brew tap homebrew/science
brew install R
The plotting R scripts require certain R packages to be installed. You can install these by running `scripts/docker/gatkbase/install_R_packages.R`: either run it as superuser to force installation into the site library, or run it interactively and create a local library.
sudo Rscript scripts/docker/gatkbase/install_R_packages.R
**or**
R
source("scripts/docker/gatkbase/install_R_packages.R")
A tab completion bootstrap file for the bash shell is now included in releases. This file allows the command-line shell to complete GATK run options in a manner equivalent to built-in command-line tools (e.g. grep).
This tab completion functionality has only been tested in the bash shell, and is released as a beta feature.
To enable tab completion for the GATK, open a terminal window and source the included tab completion script:
source gatk-launch-completion.sh
Sourcing this file allows you to press the tab key twice to get a list of options available to add to your current GATK command. By default you will have to source this file once in each command-line session; tab completion is then available for the rest of that session only.
Note that you must have already started typing an invocation of the GATK (using gatk-launch) for tab completion to initiate:
./gatk-launch <TAB><TAB>
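Under the hood, bash completion just filters a candidate word list by the prefix you have typed; the bash builtin `compgen` shows the idea (the word list below is merely an example, not the full GATK tool list):

```shell
# Given a word list and the prefix "Print", compgen prints the matching candidates,
# which is essentially what the completion script does for GATK tool and option names.
compgen -W "PrintReads PrintReadsSpark BaseRecalibratorSpark" -- Print
```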
To enable GATK tab completion automatically in every new session, add the source command to your `~/.bashrc`:

echo "source <PATH_TO>/gatk-launch-completion.sh" >> ~/.bashrc
where `<PATH_TO>` is the fully qualified path to the `gatk-launch-completion.sh` script.

Do not put private or restricted data into the repo.
Try to keep datafiles under 100kb in size. Larger test files should go into src/test/resources/large
(and subdirectories) so that they'll be stored and tracked by git-lfs as described above.
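One way to spot datafiles that are over the limit but haven't been moved under the git-lfs directory is a `find` with a prune. This is a sketch; a throwaway demo tree is created inline so the command can be shown end-to-end:

```shell
# Build a tiny demo tree: one oversized file in the wrong place, one under large/
mkdir -p demo/src/test/resources/large
dd if=/dev/zero of=demo/src/test/resources/big.bam      bs=1024 count=200 2>/dev/null
dd if=/dev/zero of=demo/src/test/resources/large/ok.bam bs=1024 count=200 2>/dev/null

# List files over 100kb, skipping everything under .../large (those are handled by git-lfs)
find demo/src/test/resources -path '*/resources/large' -prune -o -type f -size +100k -print
```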
GATK4 is BSD licensed. The license is in the top level LICENSE.TXT file. Do not add any additional license text or accept files with a license included in them.
Each tool should have at least one good end-to-end integration test with a check for expected output, plus high-quality unit tests for all non-trivial utility methods/classes used by the tool. Although we have no specific coverage target, coverage should be extensive enough that if tests pass, the tool is guaranteed to be in a usable state.
All newly written code must have good test coverage (>90%).
All bug fixes must be accompanied by a regression test.
All pull requests must be reviewed before merging to master (even documentation changes).
Don't issue or accept pull requests that introduce warnings. Warnings must be addressed or suppressed.
Don't issue or accept pull requests that significantly decrease coverage (less than 1% decrease is sort of tolerable).
Don't use `toString()` for anything other than human consumption (i.e., don't base the logic of your code on the results of `toString()`).
Don't override clone()
unless you really know what you're doing. If you do override it, document thoroughly. Otherwise, prefer other means of making copies of objects.
For logging, use org.apache.logging.log4j.Logger
We mostly follow the Google Java Style guide
Git: Don't push directly to master - make a pull request instead.
Git: Rebase and squash commits when merging.
If you push to master or mess up the commit history, you owe us 1 growler or tasty snacks at happy hour. If you break the master build, you owe 3 growlers (or lots of tasty snacks). Beer may be replaced by wine (in the color and vintage of buyer's choosing) in proportions of 1 growler = 1 bottle.
Before running the test suite, be sure that you've installed git lfs
and downloaded the large test data, following the git lfs setup instructions
To run the test suite, run `./gradlew test`. The test report will be in `build/reports/tests/test/index.html`.

What is run depends on the value of the `TEST_TYPE` environment variable:
* `cloud`, `unit`, `integration`, `spark`: run only the cloud, unit, integration, or Spark tests
* `all`: run the entire test suite

Cloud tests require being logged into `gcloud` and authenticated with a project that has access to the cloud test data. They also require setting several environment variables:
* `HELLBENDER_JSON_SERVICE_ACCOUNT_KEY`: path to a local JSON file with service account credentials
* `HELLBENDER_TEST_PROJECT`: your google cloud project
* `HELLBENDER_TEST_APIKEY`: your google cloud API key
* `HELLBENDER_TEST_STAGING`: a `gs://` path to a writable location
* `HELLBENDER_TEST_INPUTS`: path to cloud test data, e.g., `gs://hellbender/test/resources/`

Setting `TEST_VERBOSITY=minimal` will produce much less output from the test suite.

To run a subset of tests, use gradle's test filtering (see the gradle docs):
* Use `test.single` when you just want to run a specific test class:
./gradlew test -Dtest.single=SomeSpecificTestClass
* Use `--tests` with a wildcard to run a specific test class or method, or to select multiple test classes:
./gradlew test --tests *SomeSpecificTestClass
./gradlew test --tests *SomeTest.someSpecificTestMethod
./gradlew test --tests all.in.specific.package*
To run tests and compute coverage reports, run `./gradlew jacocoTestReport`. The report is then in `build/reports/jacoco/test/html/index.html`. (IntelliJ has a good coverage tool that is preferable for development.)
We use Travis-CI as our continuous integration provider. Travis generates a test report for each build; if TestNG itself crashes, there will be no report generated.

We use Broad Jenkins for our long-running tests and performance tests.
To output stack traces for `UserException`, set the environment variable `GATK_STACKTRACE_ON_USER_EXCEPTION=true`.
After installing git-lfs, run `git lfs install`.

To manually retrieve the large test data, run `git lfs pull` from the root of your GATK git clone.
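For reference, git-lfs tracking is driven by patterns in the repository's `.gitattributes`; an entry covering the large-test directory looks something like the following (illustrative only — the checked-in `.gitattributes` is authoritative):

```
src/test/resources/large/**/* filter=lfs diff=lfs merge=lfs -text
```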
To add a new large file to be tracked by git-lfs, simply:
* Put the new file(s) into `src/test/resources/large` (or a subdirectory)
* `git add` the file(s), then `git commit -a`

There is no need to run `git lfs track` on the files manually: all files in `src/test/resources/large` are tracked by git-lfs automatically.

Ensure that you have `gradle` and the Java 8 JDK installed.
You may need to install the TestNG and Gradle plugins (in preferences)
Clone the GATK repository using git
In IntelliJ, click on "Import Project" in the home screen or go to File -> New... -> Project From Existing Sources...
Select the root directory of your GATK clone, then click on "OK"
Select "Import project from external model", then "Gradle", then click on "Next"
Ensure that "Gradle project" points to the build.gradle file in the root of your GATK clone
Select "Use auto-import" and "Use default gradle wrapper".
Make sure the Gradle JVM points to Java 1.8
Click "Finish"
After downloading project dependencies, IntelliJ should open a new window with your GATK project
Make sure that the Java version is set correctly by going to File -> "Project Structure" -> "Project". Check that the "Project SDK" is set to your Java 1.8 JDK, and "Project language level" to 8 (you may need to add your Java 8 JDK under "Platform Settings" -> SDKs if it isn't there already). Then click "Apply"/"Ok".
Follow the instructions above for creating an IntelliJ project for GATK
Go to Run -> "Edit Configurations", then click "+" and add a new "Application" configuration
Set the name of the new configuration to something like "GATK debug"
For "Main class", enter org.broadinstitute.hellbender.Main
Ensure that "Use classpath of module:" is set to use the "gatk" module's classpath
Enter the arguments for the command you want to debug in "Program Arguments"
Click "Apply"/"Ok"
Set breakpoints, etc., as desired, then select "Run" -> "Debug" -> "GATK debug" to start your debugging session
In future debugging sessions, you can simply adjust the "Program Arguments" in the "GATK debug" configuration as needed
Running JProfiler standalone:
* Build a fully-packaged GATK jar with `./gradlew localJar`
* In JProfiler, select the resulting jar (e.g. `~/gatk/build/libs/gatk-package-4.alpha-196-gb542813-SNAPSHOT-local.jar`) for "Main class or executable JAR" and enter the right "Arguments"

Running JProfiler from within IntelliJ:
To upload snapshots to Sonatype you'll need the following:
You need to configure several additional properties in your `~/.gradle/gradle.properties` file.
If you want to upload a release instead of a snapshot you will additionally need to have access to the gatk signing key and password
#needed for snapshot upload
sonatypeUsername=<your sonatype username>
sonatypePassword=<your sonatype password>
#needed for signing a release
signing.keyId=<gatk key id>
signing.password=<gatk key password>
signing.secretKeyRingFile=/Users/<username>/.gnupg/secring.gpg
To perform an upload, use
./gradlew uploadArchives
Currently all builds are considered snapshots. The archive name is based on the output of `git describe`.
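To see how `git describe`-based names arise, here is a toy repository (the tag name is invented for the demo); one commit past a tag, `git describe` emits `<tag>-<commits-since-tag>-g<short-hash>`:

```shell
# Create a throwaway repo, tag it, commit once more, then ask git to describe HEAD.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git -c user.email=you@example.com -c user.name=you commit -q --allow-empty -m "init"
git tag 4.demo.0
git -c user.email=you@example.com -c user.name=you commit -q --allow-empty -m "work"
git describe --tags    # e.g. 4.demo.0-1-g1a2b3c4
```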
Please see the Docker README in `scripts/docker`. This has instructions for the Dockerfile in the root directory.
Please see the How to release GATK4 wiki article for instructions on releasing GATK4.
To generate GATK documentation, run `./gradlew gatkDoc`. The generated documentation will be placed in the `build/docs/gatkdoc` directory.

We use Zenhub to organize and track github issues.
To add Zenhub to github, go to the Zenhub home page while logged in to github, and click "Add Zenhub to Github"
Zenhub allows the GATK development team to assign time estimates to issues, and to mark issues as Triaged/In Progress/In Review/Blocked/etc.
Apache Spark is a fast and general engine for large-scale data processing. GATK4 can run on any Spark cluster, such as an on-premise Hadoop cluster with HDFS storage and the Spark runtime, as well as on the cloud using Google Dataproc.
In a cluster scenario, your input and output files reside on HDFS, and Spark will run in a distributed fashion on the cluster. The Spark documentation has a good overview of the architecture.
Note that if you don't have a dedicated cluster you can run Spark in standalone mode on a single machine, which exercises the distributed code paths, albeit on a single node.
While your Spark job is running, the Spark UI is an excellent place to monitor the progress.
Additionally, if you're running tests, then by adding -Dgatk.spark.debug=true
you can run a single Spark test and
look at the Spark UI (on http://localhost:4040/) as it runs.
You can find more information about tuning Spark and choosing good values for important settings such as the number of executors and memory settings at the following:
(Note: section inspired by, and some text copied from, Apache Parquet)
We welcome all contributions to the GATK project. A contribution can be an issue report or a pull request. If you're not a committer, you will need to make a fork of the gatk repository and issue a pull request from your fork.
To become a committer, you need to make several high-quality code contributions and be approved by the current committers.
For ideas on what to contribute, check issues labeled "Help wanted (Community)". Comment on an issue to indicate that you're interested in contributing code, and to share your questions and ideas.
To contribute a patch:
* Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
* Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. If applicable, include the issue number in the pull request name.
* Make sure that your code passes all our tests. You can run the tests with ./gradlew test
in the root directory.
* Add tests for all new code you've written. We prefer unit tests but high quality integration tests that use small amounts of data are acceptable.
* Follow the General guidelines for GATK4 developers.
We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some things to consider:
* Write tests for all new code.
* Document all classes and public methods.
* For all public methods, check validity of the arguments and throw IllegalArgumentException
if invalid.
* Use braces for control constructs (`if`, `for`, etc.).
* Make classes, variables, parameters, etc. `final` unless there is a strong reason not to.
* Give your operators some room: not `a+b` but `a + b`, and not `foo(int a,int b)` but `foo(int a, int b)`.
* Generally speaking, stick to the Google Java Style guide
Thank you for getting involved!
The authors list is maintained in the AUTHORS file. See also the Contributors list at github.
Licensed under the BSD License. See the LICENSE.txt file.