ORG.APACHE.SPARK.SQL.ANALYSISEXCEPTION: QUERIES WITH STREAMING SOURCES
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start() explained. September 17, 2017 • Apache Spark Structured Streaming

APACHE SPARK'S _SUCCESS ANATOMY ON WAITINGFORCODE.COM
In other words, _SUCCESS is there to control whether downstream processes can consume the generated data. This file generation logic makes it easy to start downstream processing as soon as the _SUCCESS file is created. On the other hand, if you want to apply more reactive processing, like event-based processing of every file …
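To illustrate the exception and its fix, here is a minimal Scala sketch, assuming a local Spark session and the built-in rate test source; all names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object WriteStreamStartDemo extends App {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("writeStream.start() demo")
    .getOrCreate()

  // A streaming DataFrame from the built-in "rate" test source
  val stream = spark.readStream.format("rate").load()

  // stream.show() here would throw:
  // org.apache.spark.sql.AnalysisException: Queries with streaming sources
  // must be executed with writeStream.start();

  // The streaming-aware way to materialize the query:
  val query = stream.writeStream
    .format("console")
    .start()

  query.awaitTermination()
}
```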
SERVERLESS STREAMING ON AWS
If you have already worked on AWS and tried to implement streaming applications, you certainly noticed one thing: there is no single way to do it! And if you didn't notice, I hope this blog post will convince you and, along the way, give you a better understanding of the available solutions.
APACHE KAFKA AND MAX.IN.FLIGHT.REQUESTS.PER.CONNECTION ON WAITINGFORCODE.COM
Versions: Apache Kafka 2.3.0. I didn't plan to write this post at all. However, when I was analyzing the idempotent producer, it was hard to understand the out-of-sequence policy for multiple in-flight requests without understanding what this in-flight requests parameter really means.
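For context, a minimal producer sketch with this parameter pinned to 1, assuming a local broker and an illustrative topic name:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
// Only 1 unacknowledged request per broker connection: retries cannot
// reorder records, at the cost of lower throughput.
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
// The idempotent producer discussed in the post; requires in-flight <= 5.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("demo_topic", "key", "value"))
producer.close()
```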
KAFKA TIMESTAMP AS THE WATERMARK ON WAITINGFORCODE.COM
In the first version of my demo application I used Kafka's timestamp field as the watermark. At that moment I was exploring the internals of arbitrary stateful processing, so it wasn't a big deal. But just in case you're wondering why I didn't keep that for the official demo version, I wrote this article.
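A minimal sketch of the approach described above, assuming a local broker and illustrative topic and interval values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder()
  .master("local[*]").appName("Kafka timestamp watermark").getOrCreate()

val kafkaRecords = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "demo_topic")
  .load()

// The Kafka source exposes a "timestamp" column; using it as the
// watermark ties the event-time logic to broker/producer time.
val windowedCounts = kafkaRecords
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()
```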
PROGRAMMING ARTICLES ON WAITINGFORCODE.COM
June 5, 2021 • Apache Spark. Shuffle writers: SortShuffleWriter. In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I was thinking that a part of the shuffle data is kept in memory for performance purposes …
BUCKETS IN APACHE SPARK SQL ON WAITINGFORCODE.COM
To wrap up, a bucket is a technique to partition data inside a given partition. It can accelerate the execution of some operations, like bucketed sampling or joins. On the other hand, since it was made popular by Hive, Apache Spark supports it only for Hive tables. As you can see in the last section, it was impossible to save bucketed JSON or …
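A hedged sketch of the bucketing API, assuming an illustrative JSON input with a user_id field:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]").appName("buckets demo")
  .enableHiveSupport()
  .getOrCreate()

val orders = spark.read.json("/tmp/orders.json")  // illustrative input

// Buckets only work through saveAsTable; a plain .save("/some/path")
// combined with bucketBy throws an AnalysisException, as noted above.
orders.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .mode(SaveMode.Overwrite)
  .saveAsTable("bucketed_orders")
```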
BARRIERS IN JAVA CONCURRENCY ON WAITINGFORCODE.COM
Six methods are implemented in CyclicBarrier, among them await, with or without a timeout: it's used to signal that the barrier was reached by one of the threads from the thread group. If the thread reaching the barrier isn't the last one from the group, it remains in a waiting state until the last one calls await(). All waiting threads wake up when the barrier is … (a minimal sketch follows after these summaries)

WHAT'S NEW IN APACHE SPARK 3.0
All the operations from the title are natively available in relational databases, but doing them with distributed data processing systems is not obvious. Starting from 3.0, Apache Spark gives the possibility to implement them in the data sources.

APACHE KAFKA SOURCE IN STRUCTURED STREAMING
Even though I've already written a few posts about Apache Kafka as a data source in Apache Spark Structured Streaming, I still had some questions in my head. In this post I will try to answer them and leave this Kafka-integration-in-Spark topic for later investigation.

CHECKPOINT STORAGE IN STRUCTURED STREAMING ON WAITINGFORCODE.COM
At the moment of writing this post I'm preparing the content for my first Spark Summit talk about solving the sessionization problem in batch or streaming. Since I'm almost sure that I will be unable to say everything I prepared, I decided to take notes and transform them into blog posts. You're currently reading the first post from this series (#Spark Summit 2019 talk notes).

GRAPH PROCESSING FRAMEWORKS SURVEY ON WAITINGFORCODE.COM
Today is the moment to analyze some major graph processing frameworks and choose the one I'll present in more detail in the coming posts. This article talks about 3 main graph processing frameworks: Apache Spark GraphX, Apache Flink's Gelly library, and the Apache Giraph project. The features of all of them are listed in the first …
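The CyclicBarrier sketch promised in the barriers summary above; a minimal example with 3 illustrative worker threads:

```scala
import java.util.concurrent.CyclicBarrier

// Barrier for 3 parties, plus an action run once by the last arriving thread.
val barrier = new CyclicBarrier(3, () => println("all workers reached the barrier"))

val workers = (1 to 3).map { id =>
  new Thread(() => {
    println(s"worker $id did its part")
    barrier.await()           // blocks until the 2 other workers arrive
    println(s"worker $id released")
  })
}
workers.foreach(_.start())
workers.foreach(_.join())
```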
ISOLATION LEVEL IN APACHE KAFKA CONSUMERS ON WAITINGFORCODE.COM
The isolation level is set for each consumer and sent to the broker with the fetch request. On the broker's side, the isolation level is later transformed into an instance of kafka.server.FetchIsolation (one of FetchLogEnd, FetchTxnCommitted, FetchHighWatermark) and passed from ReplicaManager to the Log, where the physical data retrieval happens.
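How the isolation level is set on the consumer side; a minimal sketch with illustrative broker and group names:

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
// read_committed: fetches stop at the Last Stable Offset, so records from
// open or aborted transactions are never returned to this consumer.
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")

val consumer = new KafkaConsumer[String, String](props)
```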
OUTER JOINS IN APACHE SPARK STRUCTURED STREAMING ON WAITINGFORCODE.COM
Previously we discovered inner stream-to-stream joins in Apache Spark, but they aren't the only supported type. Another one is the outer join, which lets us combine streams without matching rows (a minimal sketch follows after these summaries).

REPROCESSING STATEFUL DATA PIPELINES IN STRUCTURED STREAMING
During my talk, I insisted a lot on the reprocessing part. Maybe because it's the least pleasant part to work with. After all, we all want to test new pipelines rather than reprocess the data because of some regressions in the code or any other errors. Despite that, it's important to know how Structured Streaming integrates with this data engineering task.
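The outer join sketch promised above, assuming the built-in rate source and illustrative watermark delays:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder()
  .master("local[*]").appName("stream-stream outer join").getOrCreate()

// Two rate streams standing in for real sources; columns renamed so the
// join condition stays unambiguous.
val left = spark.readStream.format("rate").load().toDF("leftTime", "leftValue")
  .withWatermark("leftTime", "10 minutes")
val right = spark.readStream.format("rate").load().toDF("rightTime", "rightValue")
  .withWatermark("rightTime", "20 minutes")

// Outer stream-to-stream joins require watermarks on both sides plus a
// time-range condition, so unmatched rows can eventually be emitted.
val joined = left.join(
  right,
  expr("leftValue = rightValue AND rightTime BETWEEN leftTime AND leftTime + interval 1 hour"),
  "leftOuter")
```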
ARTICLES ABOUT APACHE KAFKA ON WAITINGFORCODE.COM
September 27, 2020 • Apache Kafka. Control messages in Apache Kafka. During my last exploration of log compaction, I found a method called isControlBatch. At the time, I only had a rough idea about this category of batches, and that's the reason why I …

WHAT'S NEW IN APACHE SPARK 3.0
GCP BIGTABLE OR AWS DYNAMODB, YET ANOTHER COMPARISON ON WAITINGFORCODE.COM
DIRECT CHANNEL AND SERVICE ACTIVATOR ON WAITINGFORCODE.COM
WHAT'S NEW IN APACHE SPARK 3.1
I have a feeling that a lot of things related to scalability happened in the 3.1 release. The General Availability of Kubernetes, which I will cover next week, is only one of them. The second one is nodes decommissioning!
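A hedged configuration sketch; the flag names below come from the Spark 3.1 documentation as I recall them, so double-check them against your version:

```scala
import org.apache.spark.sql.SparkSession

// Decommissioning switches introduced around Spark 3.1; values illustrative.
val spark = SparkSession.builder()
  .appName("decommissioning demo")
  .config("spark.decommission.enabled", "true")
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .getOrCreate()
```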
SPARK STREAMING CHECKPOINTING AND WRITE AHEAD LOGS ON WAITINGFORCODE.COM
Checkpointing allows Spark to truncate dependencies on previously computed RDDs. In the case of stream processing, their role is extended. In addition, they're … (a minimal sketch follows after these summaries)

APACHE SPARK AND DATA BIGGER THAN THE MEMORY ON WAITINGFORCODE.COM
This post presented Apache Spark's behavior with data bigger than the memory size. As we could see, when a record's size is bigger than the memory reserved for a task, the processing will fail, unless you process the data with only 1 parallel task and the total memory size is much bigger than the size of the biggest line.

HANDLING ANNOTATIONS WITH SPRING ANNOTATIONUTILS ON WAITINGFORCODE.COM
Annotations in Java let us programmers move some of the configuration from configuration files into Java classes. For example, in Spring, we can configure URL mapping directly inside the controllers thanks to the @RequestMapping annotation. But it wouldn't be possible without several utility classes, like AnnotationUtils, described here.
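The checkpointing sketch promised in the first summary above, using the DStream API's recover-or-create pattern with an illustrative checkpoint directory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint demo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("/tmp/streaming-checkpoint")  // metadata + truncated RDD lineage
  ssc
}

// Recreate the context from the checkpoint if one exists, otherwise build it.
val ssc = StreamingContext.getOrCreate("/tmp/streaming-checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()
```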
DATA PIPELINES: ORCHESTRATION, CHOREOGRAPHY OR BOTH? ON WAITINGFORCODE.COM
Some time ago I found an interesting article describing the 2 faces of synchronizing data pipelines: orchestration and choreography. The article ended with an interesting proposal to use both of them as a hybrid solution. In this post, I will try to implement that idea.
WAITING FOR CODE
on waitingforcode.com
* Articles
* Tips
* Tags
* Big Data
  * Big Data algorithms
  * Big Data problems - solutions
  * Data engineering patterns
  * General Big Data
* Java
  * Class loading
  * Garbage Collection
  * Java
  * Java 8
  * Java I/O
  * Java Instrumentation
  * Java bytecode
  * Java collections
  * Java concurrency
  * Java memory model
  * Monitoring
  * Off-heap
* Scala
  * Scala OOP
  * Scala async
  * Scala collections
  * Scala core
  * Scala functional
  * Scala syntax
  * Scala tests
  * Scala types
* Spring
  * Spring Data JPA
  * Spring Integration
  * Spring Web MVC
  * Spring framework
  * Spring security
* Akka
* Apache Airflow
* Apache Avro
* Apache Beam
* Apache Cassandra
* Apache Kafka
* Apache Parquet
* Apache Pulsar
* Apache Spark
* Apache Spark GraphFrames
* Apache Spark GraphX
* Apache Spark SQL
* Apache Spark Streaming
* Apache Spark Structured Streaming
* Apache ZooKeeper
* Data on AWS
* Elasticsearch
* Google Guava
* Graphs
* HDFS
* Hibernate
* JPA
* JUnit
* Maven
* MySQL
* Play Framework
* PostgreSQL
* Programming
* RabbitMQ
* SQL
* Testing
* Time series
* Tomcat
* Web security
CHECK OUT MY NEW COURSE ON DATA ENGINEERING!
Are you a data scientist who wants to extend their data engineering skills? Or a software engineer who wants to work with Big Data? Or maybe a BI developer who wants to move to an engineering position? My course will help you achieve your goal! Join the class →

ALTER DEFAULT PRIVILEGES IN POSTGRESQL
January 19, 2020 • PostgreSQL • Bartosz Konieczny
At first glance, managing user access in PostgreSQL is easy: you simply execute a CREATE USER, give it some grants, assign a role, and often that's all. However, after some time, "permission denied" errors can appear as new objects are created that are not owned by the user. To mitigate the maintenance burden in that case, PostgreSQL proposes the ALTER DEFAULT PRIVILEGES operator. Continue Reading →
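To give the flavor of the operator, a hedged JDBC sketch; connection details, schema, and role names are illustrative:

```scala
import java.sql.DriverManager

val connection = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/demo_db", "admin_user", "secret")
val statement = connection.createStatement()
// Every table admin_user creates in schema "app" from now on will be
// readable by the "reporting" role, without a per-table GRANT.
statement.execute(
  "ALTER DEFAULT PRIVILEGES IN SCHEMA app GRANT SELECT ON TABLES TO reporting")
statement.close()
connection.close()
```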
APACHE KAFKA AND MAX.IN.FLIGHT.REQUESTS.PER.CONNECTION
January 18, 2020 • Apache Kafka • Bartosz Konieczny
I didn't plan to write this post at all. However, when I was analyzing the idempotent producer, it was hard to understand the out-of-sequence policy for multiple in-flight requests without understanding what this in-flight requests parameter really means. Continue Reading →

NIO SELECTOR IN APACHE KAFKA
January 12, 2020 • Apache Kafka • Bartosz Konieczny
It's rare that in order to write one blog post I need to cover more than 3 other topics. But that's what happened with the Apache Kafka idempotent producer post that I will publish soon. Before that, I need to understand and explain the NIO Selector, its role in Apache Kafka, and finally the in-flight requests. Since the first topic was already covered, I will move to the second one. Continue Reading →
SCHEMA CASE SENSITIVITY FOR JSON SOURCE IN APACHE SPARK SQL
January 11, 2020 • Apache Spark SQL • Bartosz Konieczny
On the one hand, I appreciate JSON for its flexibility, but on the other, I hate it for exactly the same thing. It's particularly painful when you work on a project without good data governance. The most popular pain is an inconsistent field type; Spark can manage that by picking the most common type. Unfortunately, it's a little bit trickier for less common problems, for instance when the same field has different case sensitivity. Continue Reading →

HANDLING MULTIPLE I/O FROM ONE THREAD WITH NIO SELECTOR
January 5, 2020 • Java I/O • Bartosz Konieczny
That's the next post I wrote after my unsuccessful analysis of the Apache Kafka source. When I was reading the part responsible for sending requests to the broker, I found that it was partially managed by a Java package that I had never seen before. And that will be the topic of this post. Continue Reading →
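A minimal Selector sketch for the post above: one thread multiplexing many channels, with an illustrative port:

```scala
import java.net.InetSocketAddress
import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel}

val selector = Selector.open()
val server = ServerSocketChannel.open()
server.bind(new InetSocketAddress(9999))
server.configureBlocking(false)                   // mandatory for selector use
server.register(selector, SelectionKey.OP_ACCEPT)

while (selector.select() > 0) {                   // blocks until a channel is ready
  val keys = selector.selectedKeys().iterator()
  while (keys.hasNext) {
    val key = keys.next()
    keys.remove()
    if (key.isAcceptable) {
      val client = server.accept()
      client.configureBlocking(false)
      client.register(selector, SelectionKey.OP_READ)
    }
    // OP_READ handling omitted for brevity
  }
}
```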
APACHE SPARK AND LINE-BASED DATA SOURCES
January 4, 2020 • Apache Spark SQL • Bartosz Konieczny
Under one of my posts I got an interesting question about Apache Spark ignoring the maxPartitionBytes configuration entry for text-based data sources. In this post I will try to answer it. Continue Reading →
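A quick way to observe the behavior in question; the path and sizes are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]").appName("maxPartitionBytes demo").getOrCreate()

// Lower the per-partition cap to 16 MB and check how many partitions the
// text source actually produces.
spark.conf.set("spark.sql.files.maxPartitionBytes", (16 * 1024 * 1024).toString)

val lines = spark.read.textFile("/tmp/big_text_file.txt")
println(s"partitions: ${lines.rdd.getNumPartitions}")
```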
TROUBLESHOOTING 'SYSTEM MEMORY MUST BE AT LEAST' ERROR
December 29, 2019 • Apache Spark • Bartosz Konieczny
When the unit tests work on "your machine" but fail on your colleague's, you know you did something wrong. When the failures are not about test assertions but technical reasons, the "something wrong" turns into "something strange". And it may happen with Apache Spark as well. Continue Reading →
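One workaround I'm aware of for that error in unit tests; spark.testing.memory is an internal flag, so treat this as an assumption to verify against your Spark version:

```scala
import org.apache.spark.sql.SparkSession

// The error comes from the memory manager's sanity check; either raise the
// JVM heap (-Xmx) or, in tests only, override the detected system memory.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit tests")
  .config("spark.testing.memory", (512 * 1024 * 1024).toString) // 512 MB, illustrative
  .getOrCreate()
```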
FROM APACHE SPARK CONNECTOR TO APACHE PULSAR BASIC CONCEPTS
December 28, 2019 • Apache Pulsar • Bartosz Konieczny
Some time ago I saw an interesting presentation about Apache Pulsar and it intrigued me. Compute separated from storage in a streaming system? Sounds great! In this series of posts, I will try to understand how the different challenges were solved, but I will start by making an exercise of figuring out Apache Pulsar's architecture from its Structured Streaming connector. Continue Reading →
IMPLICIT DATETIME CONVERSION IN APACHE SPARK SQL
December 22, 2019 • Apache Spark SQL • Bartosz Konieczny
If you've ever wondered why, when you write "2019-05-10T20:00", Apache Spark considers it a timestamp field, the fact of defining it as a TimestampType is one of the reasons. But another question is: how does Apache Spark do the conversion from a string into the timestamp type? I will give you some hints in this blog post. Continue Reading →
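A one-liner to see the conversion happen; the printed value depends on the session time zone:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]").appName("implicit ts cast").getOrCreate()

// The string literal is implicitly parsed when cast to TIMESTAMP.
spark.sql("SELECT CAST('2019-05-10T20:00' AS TIMESTAMP) AS event_time")
  .show(truncate = false)
```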
EXTENDING STATE STORE IN STRUCTURED STREAMING - REPROCESSING AND LIMITS
December 21, 2019 • Apache Spark Structured Streaming • Bartosz Konieczny
In my previous post I showed you the writing and reading parts of my custom state store implementation. Today it's time to cover the data reprocessing and also the limits of the solution. Continue Reading →
EXTENDING STATE STORE IN STRUCTURED STREAMING - READING AND WRITING STATE
December 15, 2019 • Apache Spark Structured Streaming • Bartosz Konieczny
In my previous post I introduced the classes involved in the interactions with the state store and also showed the big picture of the implementation. Today it's time to write some code :) Continue Reading →
WHY UNSAFEROW.COPY() FOR STATE PERSISTENCE IN THE STATE STORE?
December 14, 2019 • Apache Spark Structured Streaming • Bartosz Konieczny
In my latest Spark+AI Summit 2019 follow-up posts I'm implementing a custom state store. The extension is inspired by the default state store. At the moment of code analysis, one of the places that intrigued me was the put(key: UnsafeRow, value: UnsafeRow) method. Keep reading if you're curious why. Continue Reading →

EXTENDING STATE STORE IN STRUCTURED STREAMING - INTRODUCTION
December 8, 2019 • Apache Spark Structured Streaming • Bartosz Konieczny
When I started to think about implementing my own state store, I had an idea to load the state on demand for a given key from a distributed, single-digit-millisecond-latency store like AWS DynamoDB. However, after analyzing the StateStore API and how it's used in different places, I saw it wouldn't be easy. Continue Reading →

EXTENDING DATA REPROCESSING PERIOD FOR ARBITRARY STATEFUL PROCESSING APPLICATIONS
December 7, 2019 • Apache Spark Structured Streaming • Bartosz Konieczny
After my Summit's talk I got an interesting question on "off" for the data reprocessing of the sessionization streaming pipeline. I will try to develop the answer I gave in this post. Continue Reading →

CUSTOM CHECKPOINT FILE MANAGER IN STRUCTURED STREAMING
December 1, 2019 • Apache Spark Structured Streaming • Bartosz Konieczny
In this post I will start the customization part of the topics covered during my talk. The first customized class will be the one responsible for checkpoint management. Continue Reading →