Hi-LASSO: High-performance python and apache spark packages for feature selection with high-dimensional data. [PDF]
Jo J, Jung S, Park J, Kim Y, Kang M.
europepmc +2 more sources
DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark [PDF]
Background: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts.
Michael D. Linderman +3 more
doaj +2 more sources
pmTM-align: scalable pairwise and multiple structure alignment with Apache Spark and OpenMP. [PDF]
Chen W, Yao C, Guo Y, Wang Y, Xue Z.
europepmc +3 more sources
Adding data provenance support to Apache Spark. [PDF]
Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging.
Interlandi M +7 more
europepmc +6 more sources
DNA short read alignment on apache spark [PDF]
The evolution of technologies has unleashed a wealth of challenges by generating massive amounts of data. Recently, biological data has increased exponentially, which has introduced several computational challenges.
Maryam AlJame, Imtiaz Ahmad
doaj +1 more source
TRANSMUT‐Spark: Transformation mutation for Apache Spark [PDF]
Summary: This paper proposes TRANSMUT‐Spark for automating mutation testing of big data processing code within Spark programs. Apache Spark is an engine for big data analytics/processing that hides the inherent complexity of parallel big data programming. Nonetheless, programmers must cleverly combine Spark built‐in functions within programs and guide the ...
João Batista de Souza Neto +3 more
openaire +4 more sources
Efficient processing of complex XSD using Hive and Spark [PDF]
The eXtensible Markup Language (XML) files are widely used by the industry due to their flexibility in representing numerous kinds of data. Multiple applications such as financial records, social networks, and mobile networks use complex XML schemas with
Diana Martinez-Mosquera +2 more
doaj +2 more sources
Large-scale virtual screening on public cloud resources with Apache Spark. [PDF]
Capuccini M +4 more
europepmc +3 more sources
Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark
Aiming at the problem of spatial query processing in distributed computing systems, the design and implementation of new distributed spatial query algorithms is a current challenge.
Panagiotis Moutafis +3 more
doaj +1 more source
Mining Frequency of Drug Side Effects Over a Large Twitter Dataset Using Apache Spark [PDF]
Despite clinical trials by pharmaceutical companies as well as current FDA reporting systems, there are still drug side effects that have not been caught. To find a larger sample of reports, a possible way is to mine online social media. With its current
Dennis Hsu
openalex +5 more sources

