NGless Miscellanea [5/5]

General programming, 2016-04-25

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the last in a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous [this post]

NGless has a few not-so-visible details that can come in handy.

Local installation

ngless relies on a few third-party utilities (bwa and samtools, besides any other modules you install) and possibly on reference information. However, it requires neither (1) a superuser install nor (2) fiddling with PATH variables and the like. It is happy to install its data into your home directory and run from there.

You can also install it globally, of course, but in many academic settings, you need to ask permission to install a package globally, while you can do whatever you want in your home directory. NGless is designed with this in mind.

On the fly QC (quality control)

All FastQ files are automatically passed through a QC analysis when you load them and again after any preprocessing step. You do not need to specify QC as a separate step; it just happens. In fact, where possible, ngless runs it on the fly for efficiency.

Best practices should be easy, and QC is a best practice.
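To make "on the fly" concrete, here is a minimal Python sketch of the general idea: statistics are updated as each read streams past, so QC needs no separate pass over the file. This is not ngless's implementation; the names `QCStats` and `with_qc` are hypothetical, and the sketch assumes Phred+33 quality encoding.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


class QCStats:
    """Running statistics, updated one read at a time (hypothetical helper)."""

    def __init__(self, offset: int = 33) -> None:
        self.offset = offset  # assumption: Phred+33 encoded qualities
        self.n_reads = 0
        self.n_bases = 0
        self.base_counts: Counter = Counter()
        self.min_qual: Optional[int] = None
        self.max_qual: Optional[int] = None

    def update(self, seq: str, qual: str) -> None:
        self.n_reads += 1
        self.n_bases += len(seq)
        self.base_counts.update(seq)
        if qual:
            lo = ord(min(qual)) - self.offset
            hi = ord(max(qual)) - self.offset
            self.min_qual = lo if self.min_qual is None else min(self.min_qual, lo)
            self.max_qual = hi if self.max_qual is None else max(self.max_qual, hi)


def with_qc(reads: Iterable[Read], stats: QCStats) -> Iterator[Read]:
    """Pass reads through unchanged while collecting statistics."""
    for name, seq, qual in reads:
        stats.update(seq, qual)
        yield name, seq, qual


reads = [("r1", "ACGT", "IIII"), ("r2", "GGC", "I#I")]
stats = QCStats()
passed = list(with_qc(reads, stats))  # downstream steps consume this stream
print(stats.n_reads, stats.n_bases, stats.min_qual, stats.max_qual)
```

Because `with_qc` is a generator, the statistics are complete exactly when the downstream consumer has finished reading, at no extra I/O cost.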

Subsample mode

Subsample mode simply throws away 99% of the data.

Why would anyone ever want to do this?

This lets you quickly check that your pipeline runs and that the output files look as expected. For example:

ngless --subsample script.ngl

will run script.ngl in subsample mode, which will probably run much faster than the full pipeline, allowing you to quickly spot any issues with your code. A 10-hour pipeline will finish in a few minutes when running in subsample mode.
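As an illustration of what discarding 99% of the data can look like, the Python sketch below keeps one FastQ record in every hundred. How ngless actually selects the reads it keeps is not stated in this post, so the every-100th rule and the helper names `fastq_records` and `subsample` are assumptions made for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group a stream of FastQ lines into 4-line records.

    A trailing incomplete record is silently dropped in this sketch.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def subsample(records: Iterable[List[str]], keep_every: int = 100) -> Iterator[List[str]]:
    """Keep one record out of every `keep_every` (assumed selection rule)."""
    for i, record in enumerate(records):
        if i % keep_every == 0:
            yield record


# Build 1000 toy records in memory instead of reading a real file.
lines = []
for n in range(1000):
    lines += [f"@read{n}", "ACGT", "+", "IIII"]

kept = list(subsample(fastq_records(lines)))
print(len(kept))  # 1% of the input records
```

With roughly 1% of the reads flowing into every later step, each step has proportionally less work to do, which is where the speed-up comes from.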

Subsample mode also changes all your write() calls so that the output files include the subsample extension. That is, a call such as

write(output, ofile='results.txt')

will automatically get rewritten to

write(output, ofile='results.txt.subsample')

This ensures that you do not confuse subsampled results with the real thing. NGless is all about making sure your results are correct, so it tries hard not to confuse you. This is similar to how it always writes output files atomically, so that you never get a partial results file.
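Both ideas, the renamed output and the all-or-nothing write, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `output_name` and `atomic_write` are hypothetical helpers, and writing to a temporary file and then renaming it is one common way to get atomic output, not necessarily what ngless does internally.

```python
import os
import tempfile


def output_name(ofile: str, subsample: bool) -> str:
    """Append the .subsample extension when running in subsample mode."""
    return ofile + ".subsample" if subsample else ofile


def atomic_write(path: str, text: str) -> None:
    """Write to a temporary file in the same directory, then rename it.

    Readers see either the old file or the complete new one, never a
    partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave a stray temporary file behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, output_name("results.txt", subsample=True))
    atomic_write(target, "gene\tcount\n")
    written = sorted(os.listdir(workdir))

print(written)
```

The temporary file is created in the destination directory on purpose: a rename across filesystems would fall back to a copy and lose the atomicity.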

Parallel processing & speed

The main goal of ngless is to save bioinformaticians time while improving the results. However, as a side benefit of having a well-defined language, the interpreter can take automatic advantage of multiple processors.

Consider the following script:

ngless '0.0'

input = fastq('input.fq.gz')
preprocess(input) using |r|:
    r = substrim(r, min_quality=45)
    if len(r) < 45:
        discard
mapped = map(input, reference='hg19')
counted = count(mapped, features=['gene'])
write(counted, ofile='genes.txt')

Almost all the steps in the pipeline can take advantage of multiple processors:

  1. QC is performed on the fly as the file ‘input.fq.gz’ is being read.
  2. preprocess takes advantage of multiple processors by processing reads in parallel.
  3. map calls bwa, which makes use of threads.
  4. count again processes the output of mapping in parallel.
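The preprocess step parallelises naturally because each read is handled independently of the others. The Python sketch below shows that shape. It is not ngless code: `substrim` here is a re-implementation based on the assumption that it keeps the longest stretch of bases whose qualities all reach `min_quality`, qualities are assumed to be Phred+33, and the thresholds are toy values picked to suit the very short example reads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Optional, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def substrim(read: Read, min_quality: int, offset: int = 33) -> Read:
    """Keep the longest run of bases whose quality is >= min_quality."""
    name, seq, qual = read
    best_start, best_len = 0, 0
    start = 0  # start of the current run of good bases
    for i, ch in enumerate(qual):
        if ord(ch) - offset < min_quality:
            start = i + 1  # a low-quality base ends the current run
        elif i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    end = best_start + best_len
    return name, seq[best_start:end], qual[best_start:end]


def preprocess_one(read: Read, min_quality: int = 20, min_length: int = 3) -> Optional[Read]:
    """Trim one read; return None to mean "discard"."""
    trimmed = substrim(read, min_quality)
    if len(trimmed[1]) < min_length:
        return None
    return trimmed


def preprocess_parallel(reads: Iterable[Read], workers: int = 4) -> List[Read]:
    """Apply preprocess_one to every read using a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(preprocess_one, reads)  # preserves input order
        return [r for r in results if r is not None]


reads = [
    ("r1", "ACGTAC", "II#III"),  # one bad base in the middle
    ("r2", "GGGG", "####"),      # all low quality: discarded
    ("r3", "TTTTT", "IIIII"),    # all high quality: kept whole
]
kept = preprocess_parallel(reads)
print(kept)
```

In Python, threads only illustrate the structure; a process pool would be needed for a real CPU speed-up. The point is that no read depends on any other, so the work can be split across workers freely.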

To use more than one core in ngless, just use the option -j with the number of threads you want. For example:

ngless -j8 pipeline.ngl

This will run with 8 cores, speeding up processing considerably.

Source: Meta Rabbit (original link).
