In this article, I won’t explain all the reasons that motivated our choice of Akka-Stream at MFG Labs and the road towards our library Akka-Stream Extensions. Here, I’ll focus on one precise aspect of our choice: types. And I’ll tell you about a specific extension I’ve created for this project: ShapelessStream.
The code is there on Github.
As you may know, I’m a Type lover. Types are proofs and proofs are types. Proofs are what you want in your code to ensure, in a robust & reliable way, that it does what it pretends, with the support of the compiler.
First of all, let’s recall that typesafety is a very interesting feature of Akka-Stream.
Akka-Stream’s most basic primitive, Flow[A, B], represents a data-flow that accepts elements of type A and will return elements of type B. You can’t pass a C to it, and you are sure that this flow won’t return any C, for example.
At MFG Labs, we have inherited some Scala legacy code mostly based on Akka actors, which provide a very good way to handle failures but which are not typesafe at all (till Akka Typed) and not composable. Developers using Akka tend to scatter the business logic in the code, and it can become hard to maintain. It appeared that in many cases where Akka was used to transform data in a flow or to call external services, Akka-Stream would be a very good way to replace those actors:
Yes, it’s quite weird to say it but Akka-Stream helped us correct most problems that had been introduced using Akka (rightly or wrongly).
Ok, Akka-Stream promotes Types as first-class citizens in your data flows. That’s cool!
But it appears that you often need to handle multiple types in the same input channel:
When you completely control the types in input, you can represent input types by a classic ADT:
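For instance (a hedged sketch; these names are mine, not the original article’s):

```scala
// A classic ADT wrapping the different input types of the flow
sealed trait In
case class AsInt(i: Int) extends In
case class AsString(s: String) extends In
case class AsBoolean(b: Boolean) extends In
```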
… And manage it in a Flow[A, B]:
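Continuing the sketch (the Flow[In, Out, Mat] shape is Akka-Stream 1.x’s):

```scala
import akka.stream.scaladsl.Flow

// One flow pattern-matching over the whole ADT
val flow: Flow[In, String, Unit] = Flow[In].map {
  case AsInt(i)     => s"int: $i"
  case AsString(s)  => s"string: $s"
  case AsBoolean(b) => s"bool: $b"
}
```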
Nice, but you need to wrap all input types in an ADT, and this involves some boring code that can even be different for every custom flow.
Going further: in general, you don’t want to do that, you want to dispatch every element to a different flow according to its type…
… and merge all results of all flows in one single channel…
… and every flow has its own behavior in terms of parallelism, buffering, data generation and back-pressure…
In Akka-Stream, if you wanted to build a flow corresponding to the previous schema, you would have to use FlexiRoute and FlexiMerge. Have a look at the doc and see that it requires quite a bunch of lines to write one of those. It’s really powerful, but quite tedious to implement, and not so typesafe after all. Moreover, you would certainly have to write one FlexiRoute and one FlexiMerge per use-case, as the number of input types and return types depends on your context.
In my latest project, this dispatcher/flows/merger pattern was required in multiple places and, as I’m lazy, I wanted something more elegant & typesafe, if possible, to build this kind of flow graph.
Thinking in terms of pure types and from an external point of view, we can see the previous dispatcher/flows/merger flow graph in pseudo-code as a Flow[(A or B or C), (A2 or B2 or C2)].
And to build the full flow graph, we need to provide a list of flows for all pairs of input/output types corresponding to our graph branches: Flow[A, A2], Flow[B, B2] and Flow[C, C2].
In Shapeless, there are 2 very very very useful structures:
Coproduct is a generalization of the well-known Either. You have A or B in Either[A, B]; with Coproduct, you can have more than 2 alternatives: A or B or C or D. So, for our previous external view of the flow graph, using Coproduct, it could be written as:
Flow[A :+: B :+: C :+: CNil, A2 :+: B2 :+: C2 :+: CNil]
HList allows to build a heterogenous List of elements, keeping & tracking all types at compile time. For our previous list of flows, it fits quite well, as we want to match all input/output types of all flows. It would give:
Flow[A, A2] :: Flow[B, B2] :: Flow[C, C2] :: HNil
So, from an external point of view, the process of building our dispatcher/flows/merger flow graph looks like a function taking an HList of flows as input and returning the built Flow of Coproducts:
Flow[A, A2] :: Flow[B, B2] :: Flow[C, C2] :: HNil => Flow[A :+: B :+: C :+: CNil, A2 :+: B2 :+: C2 :+: CNil]
Let’s write it in terms of Shapeless Scala code:
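A hedged, simplified version of it (the real signature carries more implicit evidence, as shown below):

```scala
import akka.stream.scaladsl.Flow
import shapeless._

// Build one Flow of Coproducts from an HList of Flows; CIn/COut are the
// Coproducts of the input/output types of the flows in HL.
def coproductFlow[HL <: HList, CIn <: Coproduct, COut <: Coproduct](
  flows: HL
): Flow[CIn, COut, Unit] = ???
```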
Fantastic !!!
Now the question is: how can we build, at compile-time, this Flow[CIn, COut, Unit] from an HList of Flows, and be sure that the compiler checks that all links are correctly typed and that all types are managed by the provided flows?
An important concept in Akka-Stream is the separation of concerns between the immutable description of your flow graph and its materialization/run with live resources.
For the curious, you find the same idea in scalaz-stream, but in a FP-purer way, as scalaz-stream directly relies on Free concepts that formalize this idea quite directly.
Akka-Stream has taken a more custom way to respond to these requirements. To build complex data flows, it provides a very nice DSL described here. This DSL is based on the idea of a mutable structure used while building your graph until you decide to fix it definitely into an immutable structure.
An example from the doc:
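From memory, the doc example of that era looked roughly like this (treat the exact DSL as approximate):

```scala
import akka.stream.scaladsl._

val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder[Unit] =>
  import FlowGraph.Implicits._
  val in  = Source(1 to 10)
  val out = Sink.ignore

  val bcast = builder.add(Broadcast[Int](2))
  val merge = builder.add(Merge[Int](2))

  val f1, f2, f3, f4 = Flow[Int].map(_ + 10)

  in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
              bcast ~> f4 ~> merge
}
```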
builder is the mutable structure used to build the flow graph, using the DSL inside the {...} block.
The value g is the immutable structure resulting from the builder; it will later be materialized and run using live resources.
Please note that, once built, the g value can be reused and materialized/run several times: it is just the description of your flow graph.
This idea of mutable builders is really interesting in general: mutability in the small can help a lot to make your building blocks efficient and easy to write/read, without endangering immutability in the large.
My intuition was to hack these mutable Akka-Stream builders using Shapeless type-dependent mechanics to build a Flow of Coproducts from an HList of Flows…
Let’s show the real signature of coproductFlow
:
(I won’t paste the whole dozen lines of implicit evidence here; the real signature is in the project’s code on Github.)
Frightening!!!!!!!
No, don’t be, it’s just the transcription in types of the requirements to build the full flow.
(The ~30 lines of implementation are in the project’s code on Github.)
The Scala code might seem a bit ugly to a few of you. That’s not false but keep in mind what we have done: mixing shapeless-style recursive implicit typeclass inference with the versatility of Akka-Stream mutable builders… And we were able to build our complex flow graph, to check all types and to plug all together at compile-time…
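Here is a hedged usage sketch (much shorter than the original sample; the ShapelessStream import path is from memory, so double-check it against the repo):

```scala
import akka.stream.scaladsl._
import shapeless._

type In = Int :+: String :+: Boolean :+: CNil

val flowInt    = Flow[Int].map(i => s"int: $i")
val flowString = Flow[String].map(s => s"string: $s")
val flowBool   = Flow[Boolean].map(b => s"bool: $b")

// All branches fused into one Flow of Coproducts, checked at compile-time
val fused = ShapelessStream.coproductFlow(flowInt :: flowString :: flowBool :: HNil)

// Inject values by wrapping them in the Coproduct
Source(List(Coproduct[In](1), Coproduct[In]("foo"), Coproduct[In](true)))
  .via(fused)
```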
FYI, Shapeless Coproduct provides a lot of useful operations on Coproducts such as unifying all types or merging Coproducts together.
Imagine you forget to manage one type of the Coproduct in the HList of flows:
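Continuing the sketch above, with the Boolean flow left out:

```scala
// Oops: Boolean is part of the Coproduct but no flow handles it
val fused = ShapelessStream.coproductFlow(flowInt :: flowString :: HNil)

Source(List(Coproduct[In](1), Coproduct[In]("foo"), Coproduct[In](true)))
  .via(fused) // does not compile
```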
If you compile, it will produce a huge dump of implicit-resolution errors (elided here).
OUCHHHH, this is a mix of the worst errors of Akka-Stream and the kind of errors you get with Shapeless :)
Don’t panic, breathe deep and just tell yourself that, in this case, it just means that your types do not fit well.
In general, the first line and the last lines are the important ones.
The first one just means you try to plug a C == Int :+: String :+: Bool :+: CNil to an Int :+: String :+: CNil, and the compiler is angry against you!!
The last ones just mean you try to plug an Int :+: String :+: CNil to a C == Int :+: String :+: Bool :+: CNil, and the compiler is 2X-angry against you!!!
Mixing the power of Shapeless compile-time type-dependent structures with Akka-Stream mutable builders, we are able to build, at compile-time, a complex dispatcher/flows/merger flow graph that checks that all types and all flows correspond to each other, and that plugs them all together in a one-liner…
This code is the first iteration on this principle, but it appeared to be so efficient, and I trusted the mechanism so much (nothing happens at runtime, just at compile-time), that I put it in production two weeks ago. It runs like a charm.
Finally, there are a few specificities/limitations to know:
Wrapping input data into the Coproduct is still the boring part, potentially with some pattern matching. But this is like Json/Xml validation: you only need to validate the data you expect. Yet I expect to reduce the work soon by providing a Scala macro that will generate this part for you, as it’s just mechanical…
Wrapping everything in Coproduct could have some impact on performance if what you expect is pure performance, but in my use-cases IO is so much more impacting that this is not a problem…
coproductFlow is built with a custom FlexiRoute using a DemandFromAll condition & a FlexiMerge using a ReadAny condition. This implies:
the order is NOT guaranteed, due to the nature of the FlexiRoute & FlexiMerge used, and potentially to the flows you provide in your HList (each branch flow has its own parallelism/buffer/backpressure behavior and is not necessarily a 1-to-1 flow);
the slowest branch will slow down all other branches (as with a broadcast). To manage these issues, you can add buffers in your branch flows to allow other branches to go on pulling input data.
The future?
A macro generating the Coproduct wrapping flow
Some other flows based on Shapeless
Have more backpressured and typed fun…
Draft FreeR code is on Github
I’ve recently pushed some Free code & doc to the cool project cats, and I had a few more ideas in my head on optimizing Free, but never took the time to make them concrete. I’ve just found this time during my holidays…
Free Monad is often used to represent embedded DSLs in functional programming languages like Haskell or Scala. One generally represents one’s grammar with a simple Functor ADT describing the available operations; then, from within your programming language, Free Monad provides the facilities to build programs from this grammar and to interpret them later.
To know more about the way to use Free and some more specific theory, please refer to the recent draft doc I’ve pushed on cats.
The well-known classic representation in Scala is the following:
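A hedged reconstruction of that classic three-line encoding (I’ve added map/flatMap so the examples below read naturally; a production version would also be stack-safe):

```scala
trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }

sealed abstract class Free[S[_], A] {
  def flatMap[B](f: A => Free[S, B])(implicit S: Functor[S]): Free[S, B] =
    this match {
      case Pure(a)    => f(a)
      case Suspend(s) => Suspend(S.map(s)(_.flatMap(f)))
    }
  def map[B](f: A => B)(implicit S: Functor[S]): Free[S, B] =
    flatMap(a => Pure(f(a)))
}
final case class Pure[S[_], A](a: A) extends Free[S, A]
final case class Suspend[S[_], A](s: S[Free[S, A]]) extends Free[S, A]
```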
Please note that F[_] should be a Functor to take advantage of the Free construction.
Building a program can then just be a classic sequence of monadic bind/flatMap on Free[S[_], _]:
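For instance, with a trivial instruction set (all names mine):

```scala
// The simplest possible "instruction set": a box holding a value
case class Box[A](a: A)

implicit val boxFunctor: Functor[Box] = new Functor[Box] {
  def map[A, B](fa: Box[A])(f: A => B): Box[B] = Box(f(fa.a))
}

def lift[A](a: A): Free[Box, A] = Suspend(Box(Pure(a)))

val program: Free[Box, Int] =
  for {
    a <- lift(1)
    b <- lift(a + 1)
    c <- lift(b * 2)
  } yield c
```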
This actually constructs a recursive structure looking like:
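Roughly:

```scala
// A left-associated nesting: each new flatMap wraps the whole structure
//   (((free1 flatMap f1) flatMap f2) flatMap f3) flatMap f4
```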
It can be seen as a left-associated sequence of operations and, as with any left-associated structure, appending an element to it has a quadratic complexity. So the more you flatMap, the longer it will take (in O(n²)) to drill down the structure.
So if you try such code:
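Something like this left-associated loop (a hedged sketch reusing lift from above):

```scala
// n flatMaps, each one re-wrapping the whole structure built so far
def gen(n: Int): Free[Box, Int] =
  (1 to n).foldLeft(lift(0))((acc, i) => acc.flatMap(x => lift(x + i)))
```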
You will see that it has a quadratic curve in terms of execution time when you increase n.
The first weakness of classic Free is its left-associativity, which induces a quadratic complexity when flatMapping.
To solve it, the immediate idea is to make Free right-associative instead of left-associative (this idea was proposed by Kiselyov & al. in a paper and is called Continuation-Passing-Style, or also the Codensity construction).
This is already done in current scalaz/cats Free by adding a new element to the Free ADT:
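Approximately the Gosub node you can find in scalaz/cats of that era:

```scala
// reifies a flatMap call instead of applying it immediately
final case class Gosub[S[_], A, B](
  free: Free[S, A],
  f: A => Free[S, B]
) extends Free[S, B]
```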
If you test the same previous code, it now has a linear behavior when n increases.
In their great paper Reflection Without Remorse, Atze van der Ploeg & Oleg Kiselyov show that classic Free is subject to another tricky quadratic behavior when, within your sequence of operations, one needs to observe the current internal state of the Free.
Observing the state requires drilling down the recursive Free structure explicitly, going up and down and again and again. As explained in the paper, this case is quite tricky because it’s very hard to see that it will happen. The deeper the structure is, the longer it takes to observe the current state. It’s important to note that right-association doesn’t help in this case, and that the complexity is once again in O(n²).
The second weakness of Free is its quadratic complexity when observing internal state.
To solve it, in Reflection Without Remorse, they propose a very interesting approach: changing the representation of Free to take advantage of its sequential nature.
A Free becomes the association of 2 elements:
a FreeView representing the current internal state of the Free,
the bind/flatMap functions, stored in an efficient data structure that can prepend/append in O(1).
For the data structure, they propose to use a type-aligned dequeue to keep track of all types.
I have tried to implement this structure using a type-aligned FingerTree in Scala. The code is here. The result is pretty interesting but not really efficient: it has a linear behavior for left-association & observability, but building the FingerTree itself costs far too much as n grows.
As a conclusion, the idea is really nice on paper but, in practice, we need to find something that costs less than this type-aligned dequeue (even if my FingerTree code is really raw, too strict and not optimized at all).
I wanted to improve Free behavior, and decided to create a new version of it called FreeR, thinking in terms of efficient Scala…
I really liked the idea of representing a Free as a pure sequence of operations with a view of the current internal state.
To gain in efficiency, I decided to choose another efficient append/prepend data structure, optimized and very well known: Vector, providing effectively constant-time append/prepend and indexed access.
Then, I decided to relax type alignment a lot, manipulate Any values internally and cast/reify to the right types when required.
BTW, I plagiarized some code written by Alois Cochard for his new IO model in the Scalaz/8.0 branch… Alois is a great dev & had made concrete the ideas I had in my head, so why rewrite them differently? uh ;)
I also decided to reify the 2 kinds of operations:
Bind for flatMap/bind calls,
Map for map calls.
So a Free becomes:
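Roughly this shape (a hedged sketch; the real code in the repo differs in details):

```scala
// A FreeR is a view of the current head plus a Vector of pending,
// loosely-typed operations (the real code casts internally)
final case class FreeR[S[_], A](
  head: FreeView[S, Any],
  ops: Vector[Ops] = Vector.empty
)
```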
with FreeView as:
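In the spirit of the paper’s Pure/Impure view (again approximate):

```scala
sealed abstract class FreeView[S[_], A]
object FreeView {
  final case class Pure[S[_], A](a: A) extends FreeView[S, A]
  final case class Impure[S[_], A](s: S[FreeR[S, A]]) extends FreeView[S, A]
}
```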
and the Ops are:
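Loosely typed on purpose (the casts happen when running the Free):

```scala
sealed trait Ops
final case class Map(f: Any => Any) extends Ops
final case class Bind[S[_]](f: Any => FreeR[S, Any]) extends Ops
```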
FYI, this code is less than 300 lines, so nothing really horrible, except a few ugly casts ;)
The code used for testing can be found here
FreeR behavior is linear, even for millions of flatMap (until the GC triggers naturally), whereas classic Free clearly has a quadratic curve.
The code used for testing can be found here
FreeR behavior is quite linear, even for millions of flatMap (until the GC triggers naturally), whereas classic Free clearly has a quadratic curve.
I finally tried to check the behavior of my new FreeR when using flatMap in a right-associated way, like:
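i.e. nesting to the right (hedged sketch, reusing lift from above):

```scala
// right-associated: each flatMap nests inside the continuation
def genR(i: Int, n: Int): Free[Box, Int] =
  if (i >= n) lift(i)
  else lift(i).flatMap(x => genR(i + 1, n).map(_ + x))
```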
This is not so frequent code, but anyway, Free should be efficient for both left & right associated code.
Using FreeR as described previously, I discovered that it wasn’t efficient in right association when increasing n, because it recursively allocates a lot of one-element Vectors, and it apparently becomes slower and slower (I’m not even sure of the real cause of this).
I refined my representation by distinguishing 3 kinds of Free in my ADT (the exact ADT is in the linked code).
With this optimization, here is the performance in right association:
It is quite comparable to classic Free for n under 1 million, but it becomes quite bad when n gets big. Yet, it remains far more efficient than the previous representation with just a Vector.
I need to work more on this issue (apparently the GC is triggered too early) to see if more optimizations for right association can be found…
Imagine doing a lot of map operations on a Free, like:
free.map(f1).map(f2).map(f3)
If you think just a bit, you will clearly see that it is equivalent to:
free.map(f1 andThen f2 andThen f3)
This is called map-fusion and, as you may have deduced already, my decision to reify the Bind and Map operations explicitly was made for this purpose.
If I can know that there are several Map operations in a row, I can fuse them into one single Map by just calling mapFusion on a Free to optimize it:
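A hedged sketch of the fusion pass over the ops Vector (using the Ops above):

```scala
// Collapse consecutive Map ops into a single composed Map
def fuse(ops: Vector[Ops]): Vector[Ops] =
  ops.foldLeft(Vector.empty[Ops]) {
    case (init :+ Map(g), Map(f)) => init :+ Map(g andThen f)
    case (acc, op)                => acc :+ op
  }
```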
Here is the difference in performance between FreeR and FreeR.mapFusion:
As you can see, mapFusion can be very interesting in some cases.
Finally, I have created a new representation of Free using:
a Vector of operations to store the sequence of computations,
explicitly reified Bind & Map operations.
It allows to have a Free with:
linear complexity when left-associating flatMap, even for big n,
linear complexity when observing internal state (far more acceptable than the other alternatives),
map-fusion of consecutive Map operations,
and, in right association for big n, a cost a bit higher than basic Free, but quite low and acceptable.
It is really interesting, as it makes Free more and more usable in real-life problems, without having to rewrite code bootstrapped with Free in a more optimized way. I personally find it quite promising!
Please note that this code has been written for the great project cats, which will soon be a viable & efficient alternative for functional structures in Scala.
The full code is there.
Don’t hesitate to test, find bugs, contribute, give remarks, ideas…
Have fun in FreeR world…
Scaledn is a Scala EDN parser (runtime & compile-time), serializer & validator, based on Parboiled2 for parsing, Shapeless for heterogenous structures, and the Generic Validation API for validation & serialization.
It works only in Scala 2.11.x.
The code & sample apps can be found on Github.
Because Json is not so good & quite limiting.
EDN is described as an extensible data notation, specified (though not really standardized) there. Clojure & the Datalog used in Datomic are supersets of EDN.
EDN allows many more things than Json, while keeping the same simplicity.
Here are the main points making EDN great to represent & exchange data.
In Json, all numbers (floating or integer, exponential or not) are considered in the same way, so numbers can only be mapped to the biggest number format: BigDecimal. This is really bad in terms of semantics and performance.
In EDN, numbers can be:
integers, mapped to Long in Scala: 12345
floating-point numbers, mapped to Double in Scala: 123.45e-9
arbitrary-precision integers (N suffix), mapped to BigInt in Scala: 1234567891234N
arbitrary-precision decimals (M suffix), mapped to BigDecimal in Scala: 123.4578972345M
Collections in Json are just arrays [...] and maps {...}.
In EDN, you can have:
lists: (1 2 3)
vectors: [1 2 3]
maps: {:a 1, :b 2}
sets: #{1 2 3}
… and all of them can be heterogenous: (1 "toto" true).
Json doesn’t know about characters outside strings.
EDN can manage chars:
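For example (EDN character syntax, from the spec):

```clojure
\c          ; the character c
\newline    ; special named characters
\return
\space
\tab
\u0308      ; unicode escape
```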
There are special syntaxes:
comments: everything between ; and the end of line is ignored, e.g. ; this is a comment
discard: values prefixed with #_ are parsed but discarded, e.g. [1 2 #_3 4] reads as [1 2 4]
These are notions that don’t exist in Json.
Symbols can reference anything external or internal that you want to identify. A Symbol can have a namespace, such as: my.namespace/foo
Keywords are unique identifiers or enumerated values that can be reused in your data structure. A Keyword is just a symbol preceded by a :, such as: :my.namespace/foo
EDN is an extensible format using tags starting with #, such as: #myapp/Person {:first "Fred" :last "Mephisto"}
When parsing EDN, the parser should provide tag handlers that can be applied when a tag is discovered. In this way, you can extend the default format with your own formats.
EDN specifies 2 tag handlers by default:
#inst "1985-04-12T23:20:50.52Z" for RFC-3339 instants,
#uuid "f81d4fae-7dec-11d0-a765-00a0c91e6bf6" for UUIDs.
for UUIDJson is defined to have a root map
node: { key : value }
or [ ... ]
.
Json can’t accept single values outside of this. So Json isn’t really meant to be streamed as you need to find closing tags to finish parsing a value.
EDN doesn’t require this and can consist in multiple heterogenous values:
1
|
|
As a consequence, EDN can be used to stream your data structures.
All of these points make EDN a far better, stricter & more evolutive notation to represent data structures than Json. It can be used in the same way as Json, but it could make a far better RPC string format than Json.
I still wonder why Json has become the de-facto standard, except for the reason that the not-so-serious Javascript language parses it natively, and because people were so sick of XML that they would have accepted anything changing their daily life.
But JS could also parse EDN without any problem, and all the more robust & typed backend languages would earn a lot from using EDN instead of JSON for their interfaces.
EDN could be used in REST APIs & also for streaming APIs. That’s exactly why I wanted to provide a complete Scala API for EDN, to test this idea a bit further.
Scaledn can be used to parse the EDN strings or arrays of chars received by your API.
All types described in the EDN format are isomorphic to Scala types, so I decided to skip the complete AST wrapping those types and to parse directly to Scala types:
"foobar" is parsed to String
123 is parsed to Long
(1 2 3) is parsed to List[Long]
(1 "toto" 3) is parsed to List[Any]
{"toto" 1 "tata" 2} is parsed to Map[String, Long]
{1 "toto" 2 "tata"} is parsed to Map[Long, String]
{1 "toto" true 3} is parsed to Map[Any, Any]
The parser (based on Parboiled2) provides 2 main functions:
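Approximately (a hedged sketch of their shapes; names from memory of the repo):

```scala
import scala.util.Try

trait EDNParsing {
  type EDN = Any

  def parseEDN(in: String): Try[EDN]        // parse one EDN value
  def parseEDNs(in: String): Try[Seq[EDN]]  // parse a sequence of EDN values
}
```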
If you look in the common package, you’ll see that EDN is just an alias for Any ;)
Here is how you can use it:
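A hedged usage sketch:

```scala
import scala.util.Success

parseEDN("""{"toto" 1 "tata" 2}""") match {
  case Success(m: Map[_, _]) => println(m) // Map(toto -> 1, tata -> 2)
  case _                     => sys.error("parse failure")
}
```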
Some people will think Any is a bit too large, and I agree, but it’s quite practical to use. Moreover, using the validation explained a bit later, you can parse your EDN and then map it to a stronger-typed Scala structure, and then Any disappears.
When you use static EDN structures in your Scala code, you can write them in their string format and scaledn can parse them at compile-time using Scala macros and thus prevent a lot of errors you can encounter in dynamic languages.
The macro mechanism is based on quasiquotes & whitebox macro contexts which allow to infer types of your parsed EDN structures at compile-time. For example:
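Something like this (macro names as I remember them from the repo, so double-check):

```scala
import scaledn.macros._

val e = EDN("""(1 2 3)""")        // statically typed as List[Long]
val m = EDN("""{"a" 1 "b" 2}""")  // statically typed as Map[String, Long]
// val bad = EDN("""(1 2""")      // would not even compile: invalid EDN
```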
Whooohooo, magic :)
EDN allows to manipulate heterogenous collections. In Scala, when one thinks heterogenous collection, one thinks Shapeless. Scaledn macros can parse & map your EDN stringified structures to Scala strongly typed structures.
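A hedged sketch with the heterogenous macro:

```scala
import shapeless._

val h = EDNH("""(1 "toto" true)""")
// h : Long :: String :: Boolean :: HNil
```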
Please note the H in EDNH, for heterogenous.
I must say that, using these macros, it might be even simpler to write Shapeless HLists or records than using the Scala API ;)
Scaledn provides different macros depending on the depth of introspection you require in your collection with respect to heterogeneity.
Have a look directly at Macro API
Following ideas implemented by Daniel James in Datomisca, scaledn proposes to use String interpolation mixed with the parsing macros, such as:
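Along these lines (hedged; the exact interpolator is in the repo):

```scala
val name = "toto"
val age  = 34L

val e = EDN"""{:name $name :age $age}"""
// => Map(:name -> "toto", :age -> 34)
```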
Nothing to add, macros are cool sometimes :)
When writing a REST or external API, the received data can never be trusted before being validated. So you generally try to validate what is received and map it to strongly-typed structures. For example:
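A hedged sketch:

```scala
case class Person(name: String, age: Int)

parseEDN("""{:name "toto" :age 34}""")
  .map(validate[Person])
// => Success(Person("toto", 34))
```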
The validation API is the following:
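Roughly, in the Rule/VA vocabulary of the Generic Validation API (signature approximate):

```scala
import play.api.data.mapping._

trait EDNValidation {
  // validates an already-parsed EDN value into a strongly-typed T
  def validate[T](edn: EDN)(implicit rule: RuleLike[EDN, T]): VA[T]
}
```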
Scaledn validation is based on the Generic Validation API developed by my MFG Labs colleague & friend Julien Tournay. This API was developed for Play Framework & Typesafe last year, to generalize the Json validation API to all data formats. But it will never be integrated in Play, as Typesafe considers it too pure-Scala & pure-FP-oriented. Yet, we use this API in production at MFG Labs and maintain/extend it ourselves.
As explained before, the Scaledn parser parses EDN values directly to Scala types, as they are bijective, so validation is often just a runtime cast, and not very interesting in general.
What’s much more interesting is to validate to Shapeless HLists & records and, even more interesting, to case classes & tuples, based on Shapeless’ fantastic auto-generated Generic macros.
Let’s take a few examples to show the power of this feature:
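A couple of hedged examples:

```scala
import shapeless._

// to a tuple
parseEDN("""(1 "toto" true)""")
  .map(validate[(Long, String, Boolean)])
// => Success((1L, "toto", true))

// to an HList
parseEDN("""(1 "toto" true)""")
  .map(validate[Long :: String :: Boolean :: HNil])
// => Success(1L :: "toto" :: true :: HNil)
```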
I think here you can see the power of this validation feature without writing any boilerplate…
Using the Generic Validation API, you can also write Scala structures to any other data format.
Scaledn provides serialization from Scala structures to EDN strings. For example:
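A hedged sketch:

```scala
import scaledn.write._

write(Map("toto" -> 1L, "tata" -> 2L))
// => {"toto" 1, "tata" 2}
```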
The write API is the following:
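Roughly (same Write vocabulary as the validation API; approximate):

```scala
import play.api.data.mapping._

trait EDNWrites {
  // serializes any T that has a Write to an EDN string
  def write[T](t: T)(implicit w: WriteLike[T, String]): String
}
```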
Once again, what’s more interesting is using Shapeless, case classes & tuples:
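Hedged examples again:

```scala
write(Person("toto", 34))
// => {:name "toto", :age 34}   (field names become keywords; hedged)

write(1L :: "toto" :: true :: HNil)
// => (1 "toto" true)
```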
This project is a first draft, so it still requires a bit more work, and there are a few points left to work on.
Don’t hesitate to test, find bugs, contribute, give remarks, ideas…
Have fun in EDN world…
Not an article, just some reflections on this idea…
You know what a functor is?
Take 2 categories C & D, simplifying a category as:
objects and morphisms between objects: f: x -> y,
morphism composition, where . is the composition: (g . f)(x) = g(f(x)),
associativity: h . (g . f) = (h . g) . f,
identity morphisms: id(x): x -> x.
A functor F between C and D associates:
each object x of C with an object F(x) of D,
each morphism f: x -> y of C with a morphism F(f): F(x) -> F(y) of D,
such that:
F(id(x)) = id(F(x)),
F(g . f) = F(g) . F(f).
A Functor is a mapping (a homomorphism) between categories that preserves the structure of the category (the morphisms, the relations between objects), whatever the kind of objects those categories contain.
In scalaz, here is the definition of a Functor:
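From memory, it boils down to this (modulo scalaz’s exact hierarchy):

```scala
trait Functor[F[_]] {
  /** Lift f to operate on the contents of F */
  def map[A, B](fa: F[A])(f: A => B): F[B]
}
```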
You can see the famous map function that you find on many structures in Scala: List[_], Option[_], Map[_, _], Future[_], etc…
Why? Because all these structures are Functors between categories of Scala types…
Math is everywhere in programming & programming is Math, so don’t try to avoid it ;)
So you can write a Functor[List] or a Functor[Option], as those structures are functors.
Now let’s consider HList, the heterogenous List provided by Miles Sabin’s fantastic Shapeless. HList looks like a nice Functor.
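For example, mapping over an HList with a Shapeless polymorphic function:

```scala
import shapeless._, poly._

object f extends Poly1 {
  implicit def caseInt    = at[Int](_ + 1)
  implicit def caseString = at[String](_.length)
}

(1 :: "toto" :: HNil) map f
// => 2 :: 4 :: HNil
```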
Ok, it’s a bit more complex, as this functor requires not one function but one per element type constituting the HList, a kind of polymorphic function. Happily, Shapeless provides exactly the structure to represent this: Poly.
What about writing a functor for HList?
The Scalaz Functor isn’t very helpful (ok, I just copied the HMonoid text & tweaked it ;)).
To be able to write a Functor of HList, we need something else, based on multiple different types…
I spent a few hours having fun on this idea with Shapeless, and tried to implement a Functor for heterogenous structures like HList, Sized, and even non-heterogenous structures.
Here are the working samples.
Here is the code, based on pseudo-dependent types, as in Shapeless.
The signature of the HFunctor has a map function, as expected:
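A heavily hedged sketch of its shape (the real sample is in the linked code):

```scala
import shapeless.Poly

trait HFunctor[HA, F <: Poly] {
  type Result
  def map(ha: HA)(f: F): Result
}
```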
This is just a sandbox to open discussion on this idea so I won’t explain more and let the curious ones think about it…
Have F(un)!
Not an article, just some reflections on this idea…
You know what a monoid is?
an operation e x e -> e (aka a SemiGroup),
an id element: id . e = e . id = e (also called the zero element),
(and some associativity).
In scalaz, here is the definition:
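Approximately (modulo scalaz’s exact hierarchy):

```scala
trait Semigroup[F] {
  def append(f1: F, f2: => F): F
}

trait Monoid[F] extends Semigroup[F] {
  def zero: F
}
```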
You can see the zero & the SemiGroup.append operations, right?
So you can write a Monoid[Int] or a Monoid[List[A]], as those structures are monoids.
Now let’s consider HList, the heterogenous List provided by Miles Sabin’s fantastic Shapeless. HList looks like a nice monoid.
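For example:

```scala
import shapeless._

(1 :: "toto" :: HNil) ++ (true :: 1.5 :: HNil)
// => 1 :: "toto" :: true :: 1.5 :: HNil
```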
What about writing a monoid for HList?
The Scalaz monoid isn’t very helpful, because our monoid operations would mix different types: appending A :: HNil to B :: HNil gives A :: B :: HNil, and zero would be HNil, which doesn’t fit in a Monoid[F] with one single fixed type.
So, to be able to write a Monoid of HList, we need something else, based on multiple different types…
I spent a few hours having fun on this idea with Shapeless, and tried to implement a Monoid for heterogenous structures like HList, Nat, Sized, and even non-heterogenous structures.
Here are the working samples.
Here is the code, based on pseudo-dependent types, as in Shapeless.
The signature of the HMonoid shows the zero and the Semigroup, as expected:
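A heavily hedged sketch of its shape:

```scala
trait HSemiGroup[A, B] {
  type Result
  def append(a: A, b: B): Result
}

trait HZero {
  type Zero
  def zero: Zero
}

trait HMonoid[A, B] extends HZero with HSemiGroup[A, B]
```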
This is just a sandbox to open discussion on this idea so I won’t explain more and let the curious ones think about it…
Have Monoids of Fun!
Forget the buzz title: this project is still very much a draft, but it’s time to expel it out of my R&D sandbox, as imperfect as it might be… before I lose my sanity while wandering in Scala macro hygiene ;)
Daemonad is a nasty Scala macro that aims at snooping monad values deep into (some) monad stacks in the same way as Scala Async, i.e. in a pseudo-imperative way.
This project is NOT yet stable, NOT very robust, so use it at your own risk, but we can discuss about it…
Here is what you can write right now.
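A hedged reconstruction of the kind of block it accepts (the import path is hypothetical):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import daemonad._ // hypothetical import path

val res: Future[Option[Int]] = monadic[Future, Option] {
  val a = snoop2(Future(Some(1))) // snoops the Int inside Future[Option[Int]]
  val b = snoop2(Future(Some(2)))
  a + b
}
```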
I wanted to write a huge & complex Scala macro that would move pieces of code around, create more code, types etc…
I wanted to know the difficulties that it implies.
I felt reckless, courageous!
Nothing could stop me!!!!
Result: I was quite insane, and I will certainly write a post-mortem article about it to show the horrible difficulties I’ve encountered. My advice: don’t hit your head against this wall like I did, and wait for the improved hygienic macros that should come progressively, before writing big macros ;)
I had investigated the Scala Async code and thought it would be possible to generalize it to all kinds of monads, and to go further by managing monad stacks.
Result: simple monads are easy to manage (as seen also in scala-workflow, which I discovered very recently), and some monad stacks can be managed with Scalaz monad transformers.
But don’t think you can use all kinds of monad transformers: the limits of the Scala compiler with type-lambdas in macros, and my very own limits, blocked me from going as far as I expected.
So, for now, it can manage Future/Option/List stacks & also Either \/, using type aliases.
There are 2 ways of seeing monads:
You don’t know what they are… and yet you use them everyday/everywhere. This is what most of us do (and it’s so shameful), using those cool map/flatMap functions provided by Scala libraries that allow to access the values inside Future, List, Option in a protected way, etc… That’s enough for your needs in your everyday life, right?
Or you know what they are… and you want to use them on purpose. This is what hippy developers do in advanced Scala using Scalaz, or even crazier ones in pure FP languages like Haskell.
Guess what I prefer?
Here is the kind of code I’d like to write:
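Something in this spirit (hedged, a deeper stack):

```scala
// a 3-level stack: Future[List[Option[Int]]]
val res = monadic[Future, List, Option] {
  val a = snoop3(Future(List(Some(1), Some(2))))
  val b = snoop3(Future(List(Some(3))))
  a + b
}
```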
Ok I speak about pure functional programming and then about snooping the value out of the monad. This might seem a bit useless or even stupid compared to using directly Monad facilities. I agree and I still wonder about the sanity of this project but I’m stubborn and I try to finish what I start ;)
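A hedged reconstruction consistent with the notes below:

```scala
val res: Future[List[Option[Int]]] = monadic[Future, List, Option] {
  val a = snoop2(Future(List(1))) // snoops at depth 2 (Future[List[_]])
  val b = snoop1(Future(2))       // snoops at depth 1 (Future[_])
  a + b
}
```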
What does it do?
monadic marks the monadic block.
monadic[Future, Option] declares that you manipulate a stack Future[Option] (and no other).
snoopX means that you want to snoop the monad value at the X-th level (1, 2, 3, 4 and no more for now).
The macro rewrites the whole block using the declared monads (here List, Option, Future) and monad transformers (here OptionT & ListT) for this stack, chaining Monad.bind/point/lift/run…
snoop2 is used in first position: if you had used snoop1 there, the macro would have rejected your monadic block. It’s logical: when you use flatMap, you always start with the deepest stack of monads, and I chose not to change the order of your code, as I find this macro is already far too intrusive :)
Let just say that this code is generated by a Scala macro for you.
The current generated code isn’t optimized at all and quite redundant but this is for next iterations.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Note that:
there is a flatMap between the first, second and third list: 2*2*2 = 8 elements… nothing strange, but it can be surprising at first glance ;)
a next iteration should accept any MonadTrans[F[_], _] instead of hardcoding the monad transformers, as now,
… and even accept MonadTrans provided in the user code,
… and find a better name than snoop.
nameHave a look at the code on Github.
Have snoop22(macrofun)!
The code & sample apps can be found on Github
The Zpark-Zstream I article was a PoC trying to use Scalaz-Stream instead of DStream with Spark-Streaming. I had deliberately decided not to deal with fault-tolerance & stream-graph persistence to keep it simple, but without them, it was quite useless for real applications…
Here is a triptych of articles trying to do something concrete with Scalaz-Stream and Spark.
So, what do I want? I wantttttttt a shrewburyyyyyy and to do the following:
Let me remind you that I’m not an expert in ML, but more a student. So if I say or do stupid ML things, be indulgent ;)
Here is what I propose:
Train a collaborative-filtering rating model for a recommendation system (as explained in the Spark doc there), using a first NIO server and a client as presented in part 2.
When the model is trained, create a second server that will accept client connections to receive data.
Stream/merge all received data into one single stream, dstreamize it and perform streamed predictions using the previous model.
As explained in the Spark doc about collaborative filtering, we first need some data to train the model. I want to send those data using a NIO client.
Here is a function doing this:
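A hedged sketch (the file path is mine; NioClient.sendAndCheckSize is the client described in part 2):

```scala
import java.net.InetSocketAddress
import scalaz.concurrent.Task
import scalaz.stream._

// Stream each line of the training file ("user::product::rating")
// to the server, checking the ack'ed sizes
def trainingClient(addr: InetSocketAddress): Process[Task, Bytes] = {
  val data: Process[Task, Bytes] =
    io.linesR("data/als/test.data")
      .map(line => Bytes.of((line + "\n").getBytes))

  NioClient.sendAndCheckSize(addr, data)
}
```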
Now we need the training NIO server, waiting for the training client to connect and piping the received data to the model.
Here is a useful function to help create a server, as described in the previous article part:
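A hedged sketch (NioServer.ackSize is from part 2; the stop mechanism is the Signal trick shown there):

```scala
def server(addr: InetSocketAddress): (Process[Task, Bytes], async.mutable.Signal[Boolean]) = {
  val stop = async.signal[Boolean]
  stop.set(false).run

  // merge all client streams into one, interruptible via the signal
  val proc: Process[Task, Bytes] =
    (stop.discrete wye merge.mergeN(NioServer.ackSize(addr)))(wye.interrupt)

  (proc, stop)
}
```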
We can create the training server with it:
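Hedged usage (the port is mine):

```scala
val (trainingServer, trainingStop) =
  server(new InetSocketAddress("127.0.0.1", 11100))
```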
trainingServer is a Process[Task, Bytes], streaming the training data received from the training client. We are going to train the rating model with it.
To train a model, we can use the following API:
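This is Spark MLlib’s ALS (the real API of that era, modulo parameter names):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Builds a MatrixFactorizationModel from an RDD[Rating]
val model = ALS.train(
  ratings, // RDD[Rating]
  10,      // rank: number of latent factors
  20,      // number of iterations
  0.01     // lambda: regularization parameter
)
```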
So we need to build an RDD[Rating] from the server stream.
Imagine that we have a continuous flow of training data that can be very long, and we want to train the model with just a slice of this flow. To do this, we can:
dstreamize the server output stream,
run the dstreamized stream for some duration,
gather all the RDDs received during this time,
union those RDDs and train the model with the result.
sHere is the whole code with previous client:
(The full code, with the previous client, plus its run output, is on Github; it is too long to reproduce here.)
Fantastic, we have trained our model in a very fancy way, haven’t we?
Personally, I find it interesting that we can take advantage of both APIs…
Now that we have a trained model, we can create a new server to receive data from clients for rating prediction.
Firstly, let’s generate some random data to send for prediction.
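A hedged sketch of the generator (bounds are mine):

```scala
import scala.util.Random

// random "user::product" pairs to be rated by the model
val predictData: Process[Task, Bytes] =
  Process.range(0, 100).map { _ =>
    val user    = Random.nextInt(50)
    val product = Random.nextInt(100)
    Bytes.of(s"$user::$product\n".getBytes)
  }
```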
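And the prediction server, reusing the server helper (the port is mine):

```scala
val (predictServer, predictStop) =
  server(new InetSocketAddress("127.0.0.1", 11101))
```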
predictServer is the stream of data to predict. Let’s stream it to the model by dstreamizing it and transforming all the built RDDs by passing them through the model.
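A hedged sketch (the parsing is mine; model.predict and DStream.transform are the real APIs):

```scala
// dstreamize the prediction stream (dstreamize comes from part 1)
val (consumer, dstream) = dstreamize(
  predictServer map { bytes =>
    val Array(user, product) = new String(bytes.toArray).trim.split("::")
    (user.toInt, product.toInt)
  },
  ssc
)

// predict a rating for each (user, product) pair of each RDD slice
dstream.transform(rdd => model.predict(rdd)).print()
```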
A problem: the shared StreamingContext.
I’ve discovered a problem here: the recommendation model is built in a StreamingContext and uses RDDs built in it, so you must use the same StreamingContext for prediction. So I must build my training dstreamized client/server & my prediction dstreamized client/server in the same context, and thus I must schedule both before starting this context.
Yet the prediction model is built from training data received after starting the context, so it isn’t known before… This is very painful, so I decided to be nasty and consider the model as a variable that will be set later. For this, I used a horrible SyncVar to set the prediction model when it’s ready… Sorry about that, but I need to study this issue more, to see if I can find a better solution, because I’m not satisfied with it at all…
So here is the whole training/predicting painful code:
(The whole ~100-line training/predicting code and its run output are on Github.)
3 long articles to end up training a poor recommendation system with 2 clients/servers… A bit bloated, isn’t it? :)
Anyway, I hope I printed a few ideas & concepts about Spark & scalaz-stream in your brain, and if I’ve reached this target, it’s already enough!
Yet, I’m not satisfied with a few things: the SyncVar hack around the shared StreamingContext is still clumsy, and I must say that calling model.predict from a map function on a DStream might not be so good in a cluster environment. I haven’t dug into this code enough to have a clear mind on it.
But, globally, I’m satisfied: plugging a scalaz-stream Process into a Spark DStream works quite well and might be interesting after all.
works quite well and might be interesting after all.GO TO PART2 <—————————————————————————————————-
Have a look at the code on Github.
Have distributed & resilient yet continuous fun!
The code & sample apps can be found on Github
The Zpark-Zstream I article was a PoC trying to use Scalaz-Stream instead of DStream with Spark-Streaming. I had deliberately decided not to deal with fault-tolerance & stream-graph persistence to keep it simple, but without them, it was quite useless for real applications…
Here is a triptych of articles trying to do something concrete with Scalaz-Stream and Spark.
So, what do I want? I wantttttttt a shrewburyyyyyy and to do the following:
A client:
sends messages of type W (for Write) to a server,
receives messages of type I (for Input) from a server,
is a stream: a scalaz-stream Process.
A client could thus be represented as:
a Process[Task, I] for the input channel (receiving from the server),
a Process[Task, W] for the output channel (sending to the server).
for output channel (sending to server)In scalaz-stream, recently a new structure has been added :
1
|
|
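The Exchange; approximately:

```scala
// read: what you receive from the other side; write: where you send to it
final case class Exchange[I, W](
  read: Process[Task, I],
  write: Sink[Task, W]
)
```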
Precisely what we need!
Now, let’s consider that we work in NIO mode, with everything non-blocking, asynchronous etc…
In this context, a client can be seen as something generating, sooner or later, one (or more) Exchange[I, W], i.e.:
Client[I, W] == Process[Task, Exchange[I, W]]
In the case of a pure TCP client, I and W are often Bytes.
Scalaz-Stream now provides a helper to create a TCP binary NIO client:
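Hedged usage (the real signature takes a few more configuration parameters):

```scala
import scalaz.stream._

val client: Process[Task, Exchange[Bytes, Bytes]] =
  nio.connect(new InetSocketAddress("127.0.0.1", 11100))
```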
Plugging a source on the Exchange: to plug your own data source for writing to the server, Scalaz-Stream provides one more API on Exchange (roughly, a run method that pipes your source into write and emits everything received on read). With this API, we can write data to the client and output the received data.
Yet, in general, we need to send data to the server AND react to data received from it. So we need to be able to gather, in the same piece of code, received & emitted data.
Wye
Scalaz-stream can help us with the following API:
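Approximately (from memory, a method of Exchange[I, W]):

```scala
// def wye[I2, W2](y: Wye[Task, I, W2, W \/ I2])(implicit S: Strategy): Exchange[I2, W2]
```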
Whoaaaaa, complex isn’t it? Actually not so much…
Wye is a fantastic tool that can mix two input streams into one single output stream, pulling deterministically or non-deterministically from the left input, the right input, or both at the same time.
I love ASCII art schemas:
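Something like:

```
  I  (from server) --->---\
                           [ Wye[Task, I, W2, W \/ I2] ] --->--- W \/ I2
  W2 (external)   --->---/
```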
\/ is the ScalaZ disjunction, also called Either in the Scala world.
So Wye[Task, I, W2, W \/ I2] can be seen as:
the left input I: the data received from the server,
the right input W2: the data provided by an external source,
the output W \/ I2: either a W to be written to the server, or an I2 to be emitted locally.
So what does the Exchange.wye API do?
It plugs the Exchange.write: Sink[Task, W] to the W output of the Wye[Task, I, W2, W \/ I2], for sending data to the server.
It plugs the Exchange.read: Process[Task, I], receiving data from the server, to the left input of the Wye.
The W2 branch provides a plug for an external source of data, in the shape of a Process[Task, W2].
The I2 branch can be used to pipe data from the client to an external local process (like streaming out the data received from the server).
Finally, it returns a new Exchange[I2, W2].
As a conclusion, Exchange.wye combines the original Exchange[I, W] with your custom Wye[Task, I, W2, W \/ I2], which represents the business logic of the data exchange between client & server, and finally returns an Exchange[I2, W2], on which you can plug your own data source and retrieve the output data.
wye/run
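A heavily hedged sketch of the echo logic (wye.emitO/emitW are the real primitives mentioned below; the surrounding API names drifted across scalaz-stream versions):

```scala
import scalaz.stream._
import scalaz.stream.ReceiveY.{ReceiveL, ReceiveR}

// emit locally (emitO) whatever the server sends, and forward (emitW)
// the external data to the server
def echoLogic: WyeW[Bytes, Bytes, Bytes, Bytes] = {
  def go: WyeW[Bytes, Bytes, Bytes, Bytes] =
    wye.receiveBoth {
      case ReceiveL(serverData) => wye.emitO(serverData) fby go
      case ReceiveR(localData)  => wye.emitW(localData) fby go
      case _                    => go
    }
  go
}
```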
Please note that I simply reuse the basic echo example provided in scalaz-stream ;)
(The full ~40-line client code is on Github.)
This might seem hard to catch for some people, because of scalaz-stream notations, and wye Left/Right/Both or wye.emitO/emitW. But actually you’ll get used to it quite quickly, as soon as you understand wye. Keep in mind that this code uses the low-level scalaz-stream API without anything else, and it remains pretty simple and straightforward.
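Running it (hedged):

```scala
// the external data we want to send
val data: Process[Task, Bytes] =
  Process.range(0, 10).map(i => Bytes.of(s"hello $i".getBytes))

clientEcho(new InetSocketAddress("127.0.0.1", 11100), data)
  .run.run // run the Process, then run the Task
```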
It would give something like every sent message being echoed back to your console.
Now you know about scalaz-stream clients; what about servers???
Let’s start again :D
A server:
receives messages of type I (for Input) from the client,
sends messages of type W (for Write) to the client,
is a stream: a scalaz-stream Process.
Remember that a client was defined above as Process[Task, Exchange[I, W]].
In our NIO, non-blocking, streaming world, a server can be considered as a stream of clients, right? So, finally, we can model a server as:
Process[Task, Process[Task, Exchange[I, W]]]
Whoooohoooo, a server is just a stream of streams!!!!
Scalaz-Stream now provides a helper to create a TCP binary NIO server:
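Hedged usage:

```scala
val server: Process[Task, Process[Task, Exchange[Bytes, Bytes]]] =
  nio.server(new InetSocketAddress("127.0.0.1", 11100))
```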
Don’t you find that quite elegant? ;)
Here we simply re-use the Exchange described above, so you can use exactly the same APIs as for the client. There is also another useful API with which you can compute some business logic on the data received from the client (I won’t reproduce its long signature here).
Let’s write the echo server corresponding to the previous client (you can find this sample in scalaz-stream too):
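A heavily hedged sketch:

```scala
// echo every client's bytes back to it, emitting them locally too
def serverEcho(addr: InetSocketAddress): Process[Task, Process[Task, Bytes]] =
  nio.server(addr).map { client =>
    client.flatMap { ex => ex.run(ex.read) } // approximate: write back what is read
  }
```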
receivedData is a Process[Task, Process[Task, Bytes]], which is not so practical: we would prefer to gather all the data received from all the clients in one single Process[Task, Bytes], to stream it to another module.
Scalaz-Stream has the solution again:
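merge.mergeN, which non-deterministically flattens a stream of streams by running the inner streams concurrently:

```scala
val allData: Process[Task, Bytes] = merge.mergeN(receivedData)
```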
Please note the Strategy, which corresponds to the way Tasks will be executed, and which can be compared to a Scala ExecutionContext.
Fantastic, let’s plug it on our server:
(The plugged-together server snippet is on Github.)
Finally, we have a server and a client!!!!!
Let’s plug them all together.
First of all, we need to create a server that can be stopped when required. Let’s do it in the scalaz-stream way, using:
wye.interrupt, which stops a process when a side signal becomes true,
async.signal, a value that can be changed asynchronously, exposing 2 APIs: set, to change its value, and discrete, to obtain a Process of its successive values.
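Put together (hedged):

```scala
val stop = async.signal[Boolean]
stop.set(false).run

// serverProcess is the merged server stream built above;
// it runs until the signal becomes true
val stoppableServer = (stop.discrete wye serverProcess)(wye.interrupt)
```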
Without much imagination, we can use a Signal[Boolean].discrete to obtain a Process[Task, Boolean] and wye it with the previous server process using wye.interrupt. Then, to stop the server, you just have to call:
stop.set(true).run
Here is the full code:
(The code is a bit long, so find it & the runner on Github.)
Naturally, you rarely run the client & server in the same code, but it is funny to see how easily you can do that with scalaz-stream, as you just manipulate Processes run on a provided Strategy.
Finally, we can go back to our subject: feeding a DStream using a scalaz-stream NIO client/server.
clientEcho/serverEcho are simple samples, but not very useful.
Now we are going to use a custom client/server I’ve written for this article:
NioClient.sendAndCheckSize is a client streaming all the emitted data of a Process[Task, Bytes] to the server, and checking that the global size has been ack’ed by the server.
NioServer.ackSize is a server acknowledging all received packets by their size (as a 4-byte Int).
Now let’s write a client/server dstreamizing data to Spark (the ~40-line snippet is on Github):
When run, it prints the successive ack’ed sizes (output elided)…
Until 100…
I spent this second part of my triptych mainly explaining a few concepts of the brand new scalaz-stream NIO API. With it, a client becomes just a stream of exchanges, Process[Task, Exchange[I, W]], and a server becomes a stream of streams of exchanges, Process[Task, Process[Task, Exchange[I, W]]].
As soon as you manipulate Processes, you can use the dstreamize API exposed in Part 1 to pipe streamed data into Spark.
Let’s go to Part 3 now in which we’re going to do some fancy Machine Learning training with these new tools.
GO TO PART1 < —————————————————————————–> GO TO PART3
The code & sample apps can be found on Github
The Zpark-Zstream I article was a PoC trying to use Scalaz-Stream instead of DStream with Spark-Streaming. I had deliberately decided not to deal with fault-tolerance & stream-graph persistence to keep it simple, but without them, it was quite useless for real applications…
Here is a triptych of articles trying to do something concrete with Scalaz-Stream and Spark.
So, what do I want? I wantttttttt a shrewburyyyyyy and to do the following:
A scalaz-stream Process[Task, T] is a stream of T elements that can interleave some Tasks (representing an external effect doing something). A Process[Task, T] is built as a state machine that you need to run to process all the Task effects and emit a stream of Ts. This can manage both continuous or discrete, and finite or infinite streams.
I restricted this to Task for the purpose of this article, but it can be any F[_].
A Spark DStream[T] is a stream of RDD[T], built by discretizing a continuous stream of T. An RDD[T] is a resilient distributed dataset, the ground data-structure behind Spark for distributing in-memory batch/map/reduce operations to a cluster of nodes, with fault-tolerance & persistence.
In summary, a DStream slices a continuous stream of T by windows of time and gathers all the Ts of the same window into one RDD[T]. So it discretizes the continuous stream into a stream of RDD[T]. Once built, those RDD[T]s are distributed to the Spark cluster. Spark allows to perform transform/union/map/reduce/… operations on RDD[T]s; therefore DStream[T] takes advantage of the same operations.
Spark-Streaming also persists all operations & relations between DStreams in a graph. Thus, in case of fault in a remote node while performing operations on DStreams, the whole transformation can be replayed (which also means that streamed data are persisted).
Finally, the resulting DStream obtained after map/reduce operations can be output to a file, a console, a DB, etc…
Please note that a DStream[T] is built with respect to a StreamingContext, which manages its distribution in the Spark cluster and all the operations performed on it. Moreover, DStream map/reduce operations & outputs must be scheduled before starting the StreamingContext. It could be somewhat compared to a state machine that you build statically and run later.
You may ask: why not simply build an RDD[T] from a Process[Task, T]?
Yes, sure, we can do it:
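For instance (a hedged sketch):

```scala
import org.apache.spark.SparkContext
import scala.reflect.ClassTag

// runs the process to its end and gathers ALL of its output in memory
def processToRDD[T: ClassTag](p: Process[Task, T])(implicit sc: SparkContext) =
  sc.parallelize(p.runLog.run)
```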
This works, but what if this Process[Task, T] emits a huge quantity of data, or is infinite? You’ll end up with an OutOfMemoryException…
So yes, you can do it, but it’s not so interesting. DStream seems more natural, since it can manage a stream of data as long as it can discretize it over time.
Pull from Process[Task, T], push to DStream[T] with LocalInputDStream.
To build a DStream[T] from a Process[Task, T], the idea is to:
consume all the Ts emitted by the Process[Task, T],
gather the Ts emitted during a window of time & generate an RDD[T] with them,
inject this RDD[T] into the DStream[T],
loop until the process halts.
from different sources of data like files (local/HDFS), from sockets…
The helper that seemed the most appropriate is the NetworkInputDStream
:
NetworkReceiver
based on a Akka actor to which we can push streamed data.NetworkReceiver
gathers streamed data over windows of time and builds a BlockRDD[T]
for each window.BlockRDD[T]
is registered to the global Spark BlockManager
(responsible for data persistence).BlockRDD[T]
is injected into the DStream[T]
.So basically, NetworkInputDStream
builds a stream of BlockRDD[T]
.
It’s important to note that NetworkReceiver
is also meant to be sent to remote workers so that data can be gathered on several nodes at the same time.
But in my case, the data source Process[Task, T]
run on the Spark driver node (at least for now) so instead of NetworkInputDStream
, a LocalInputDStream
would be better. It would provide a LocalReceiver
based on an actor to which we can push the data emitted by the process in an async way.
LocalInputDStream
doesn’t exist in Spark-Streaming library (or I haven’t looked well) so I’ve implemented it as I needed. It does exactly the same asNetworkInputDStream
without the remoting aspect. The current code is there…
Process vs DStream?
There is a common point between DStream and Process: both are built as state machines that are passive until run.
In the case of Process, it is run by playing all the Task effects, while gathering emitted values or without taking care of them, in blocking or non-blocking mode, etc…
In the case of DStream, it is built and registered in the context of a SparkStreamingContext. Then you must also declare some outputs for the DStream, like a simple print, an HDFS file output, etc… Finally, you start the SparkStreamingContext, which manages everything for you until you stop it.
So, if we want to adapt a Process[Task, T] to a DStream[T], we must perform 4 steps (on the Spark driver node):
create our own DStream[T], using a LocalInputDStream[T] providing a Receiver into which we’ll be able to push Ts asynchronously;
build a Sink[Task, T] in charge of consuming all the data emitted by the Process[Task, T] and pushing them using the previous Receiver;
pipe the Process[Task, T] to this Sink &, when the Process[Task, T] has halted, stop the previous DStream[T]: the result of this pipe operation is a Process[Task, Unit], a pure effectful process responsible for pushing the Ts into the dstream without emitting anything;
return the DStream[T] and the effectful consumer Process[Task, Unit].
The dstreamize implementation:
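Its shape (hedged; the full ~27-line implementation is in the repo):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import scala.reflect.ClassTag

def dstreamize[T: ClassTag](
  p: Process[Task, T],
  ssc: StreamingContext
): (Process[Task, Unit], DStream[T]) = ???
```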
Please note that this builds a Process[Task, Unit] and a DStream[T], but nothing has happened yet in terms of data consumption & streaming: both need to be run now.
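Putting it to work (hedged):

```scala
val p: Process[Task, Long] = ??? // your stream

val (consumer, dstream) = dstreamize(p, ssc)

// 1) schedule the dstream operations/outputs
dstream.print()

// 2) start the streaming context
ssc.start()

// 3) run the consumer to start pushing data into the dstream
consumer.run.run
```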
Please note that you have to:
schedule your dstream operations/outputs before starting the streaming context,
start the streaming context before running the consumer.
Running it (output elided), we can see a warmup phase at the beginning, and then windows of 1 sec counting 20 elements, which is great, since one element every 50ms gives 20 elements per second.
Now we can pipe a Process[Task, T] into a DStream[T].
Please note that, as we run the Process[Task, T] on the Spark driver node, if this node fails, there is no real way to restore the lost data. Yet, LocalInputDStream relies on DStreamGraph & BlockRDDs, which persist all DStream relations & all received blocks. Moreover, DStream has exactly the same problem with respect to the driver node for now.
That was fun, but what can we do with that?
In part 2, I propose to have more fun and stream data to a DStream using the brand new Scalaz-Stream NIO API, to create cool NIO client/server streams…
——————————————————————————————————-> GO TO PART2
Today I’m going to write about a Proof of Concept I’ve been working on these last weeks: I wanted to use scalaz-stream as a driver of Spark distributed data processing. This is simply an idea, and I don’t even know whether it is viable or stupid. But the idea is interesting!
2 of my preferred topics these last months are realtime streaming & distributed data processing, and 2 tools have kept running through my head:
Scalaz-Stream, for realtime/continuous streaming using pure functional concepts: I find it very interesting conceptually speaking & very powerful, especially the deterministic & non-deterministic demultiplexers provided out-of-the-box (Tee & Wye).
Spark, for fast/fault-tolerant in-memory, resilient & clustered data processing.
I won’t speak much about Scalaz-Stream, because I wrote a few articles about it.
Spark provides tooling for cluster processing of huge datasets in the same batch mode way as Hadoop, the very well known map/reduce infrastructure. But unlike Hadoop, which relies exclusively on the HDFS cluster file system when distributing data through the cluster, Spark tries to cache data in memory as much as possible, so that latency of access is reduced as much as possible. Hadoop can scale a lot, but is known to be slow in the context of a single node.
Spark is aimed at scaling as much as Hadoop, but running faster on each node using in-memory caching. Fault-tolerance & data resilience are managed by Spark too, using persistence & redundancy based on any nice storage like HDFS, or files, or whatever you can plug into Spark. So Spark is meant to be a super fast in-memory, fault-tolerant batch processing engine.
The basic concept of Spark is the Resilient Distributed Dataset, aka RDD, which is a read-only, immutable data structure representing a collection of objects or a dataset that can be distributed across a set of nodes in a cluster to perform map/reduce style algorithms.
The dataset represented by an RDD is partitioned, i.e. cut into slices called partitions, which can be distributed across the cluster of nodes.
Resilient means the data can be rebuilt in case of fault on a node or data loss. To perform this, the dataset is replicated/persisted across nodes, in memory or in a distributed file system such as HDFS.
So the idea of RDD is to provide a seamless structure to manage clustered datasets with a very simple API in "monadic" style:
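The classic word count, for instance (real Spark API of that era; the input file is mine):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits of that era

val sc = new SparkContext("local[4]", "wordcount")

val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```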
Depending on your SparkContext configuration, Spark takes charge of distributing your data behind the curtain to the cluster nodes, to perform the required processing in a fully distributed way.
One thing to keep in mind is that Spark distributes data to remote nodes, but it also distributes the code/closures remotely. So your code has to be serializable, which is not the case for scalaz-stream in its current implementation.
As usual, before using Spark in any big project, I’ve been diving into its code to know whether I can trust this project. I must say I know Spark’s code better than its API ;)
I find the Spark Scala implementation quite clean, with explicit design choices made clearly for the purpose of performance. The need to provide a compatible Java/Python API and to distribute code across clustered nodes involves a few restrictions in terms of implementation choices. Anyway, I won’t criticize much, because I wouldn’t have written it better, and those people clearly know what they do!
So Spark is very good for performing fast clustered batch data processing. Yet, what if your dataset is built progressively, continuously, in realtime?
On top of the core module, Spark provides an extension called Spark Streaming, aiming at manipulating live streams of data using the power of Spark.
Spark Streaming can ingest different continuous data feeds like Kafka, Flume, Twitter, ZeroMQ or TCP sockets, and perform high-level operations on them, such as map/reduce/groupby/window/…
The core data structure behind Spark Streaming is DStream, for Discretized Stream (and not distributed).
Discretized means it takes a continuous stream of data and makes it discrete by slicing it across time, and wrapping those sliced data into the famous RDD described above.
A DStream is just a temporal data partitioner that can distribute data slices across the cluster of nodes to perform some data processing using Spark capabilities.
(See the illustration in the official Spark Streaming documentation.)
DStream also tries to leverage Spark’s automated persistence/caching/fault-tolerance in the domain of live streaming.
DStream is cool, but it’s completely based on temporal aspects. Imagine you want to slice the stream depending on other criteria: with DStream, it would be quite hard, because the whole API is based on time. Moreover, using DStream, you can discretize a dataflow, but you can’t go the other way and make it continuous again (to my knowledge). This is something that would be cool, isn’t it?
If you want to know more about the DStream discretization mechanism, have a look at the official doc.
As usual, I’m trying to investigate the edge-cases of concepts I like. In general, this is where I can test the core design of a project and determine whether it’s worth investing in it in my every-day life.
I’ve been thinking about scalaz-stream concepts quite a lot, and scalaz-stream is very good at manipulating continuous streams of data. Moreover, it can very easily partition a continuous stream, regrouping data into chunks based on any criteria you can imagine.
Scalaz-stream represents a data processing algorithm as a static state machine that you can run when you want. This is the same idea behind the map/reduce Spark API: you build your chain of map/filter/window and finally reduce it. Reducing a Spark data processing is like running a scalaz-stream machine.
So my idea was the following:
- build a continuous stream of data based on a scalaz-stream Process[F, O],
- discretize the stream: Process[F, O] => Process[F, RDD[O]],
- implement count/reduce/reduceBy/groupBy for Process[F, RDD[O]],
- provide a continuize method to do Process[F, RDD[O]] => Process[F, O].
So I’ve been hacking between scalaz-stream Process[F, O] & Spark RDD[O], and here is the resulting API, that I’ve called ZPark-ZStream (ZzzzzzPark-Zzzzztream).
Let’s play a bit with my little alpha API.
Let’s start with a very simple example: take a simple finite process containing integers:
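For instance (hedged, chosen to match the slices discussed below):

```scala
val p: Process[Task, Long] = Process(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
```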
Now I want to slice this stream of integers into slices of 4 elements.
First, we have to create the classic Spark Streaming context and make it implicit (needed by my API). Please note that I could plug an existing StreamingContext into my code without any problem:
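Hedged:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

implicit val ssc = new StreamingContext("local[4]", "zpark", Seconds(1))
```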
Then, let’s parallelize the previous process:
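Something like (the method name is hypothetical, from memory):

```scala
// discretize by slices of 4 elements
val dp = p.parallelize(4) // Process[Task, RDD[Long]]
```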
Ok folks, now we have a discretized stream of Long that can be distributed across a Spark cluster.
DStream provides a count API which counts the elements of each RDD in the stream.
Let’s do the same with my API:
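Hedged:

```scala
val counted = countRDD(dp) // still a Process of RDDs
```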
What happens here? The count operation on each RDD in the stream is distributed across the cluster in a map/reduce style, and the results are gathered.
Ok that’s cool but you still have a discretized stream Process[Task, RDD[Int]]
and that’s not practical to use to see what’s inside it. So now we are going to re-continuize
it and make it a Process[Task, Int]
again.
(code sample — see the code on Github)
Easy isn’t it?
All together :
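A hedged sketch of the whole pipeline, reusing the ints source sketched above; the method names come from the article (parallelize / countRDD / continuize) but their exact signatures are guesses — see the ZPark-ZStream code on Github:

```scala
import org.apache.spark.rdd.RDD

// discretize into slices of 4 elements, count each slice, re-continuize
val discretized: Process[Task, RDD[Int]]  = ints.parallelize(4)
val counted:     Process[Task, RDD[Long]] = discretized.countRDD()
val continuous:  Process[Task, Long]      = counted.continuize()
```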
Let’s print the result in the console:
(code sample — see the code on Github)
Oh yes that works: in each slice of 4 elements, we actually have 4 elements! Reassuring ;)
Let’s do the same with countByValue
:
(code sample — see the code on Github)
You can see that 4 comes before 3. This is due to the fact that the 2nd slice of 4 elements (3,3,4,4) is converted into an RDD, which is then partitioned and distributed across the cluster to perform the map/reduce count operation. So the order of the results may differ at the end.
An example of map/reduce?
(code sample — see the code on Github)
Please note that:
(code sample — see the code on Github)
Now we could try to slice according to time, in the same spirit as DStream.
First of all, let’s define a continuous stream of positive integers:
(code sample — see the code on Github)
Now, I want integers to be emitted at a given tick for example:
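A hedged sketch of such a ticked stream (naturals is reconstructed here, and awakeEvery needs the usual implicit executor/scheduler instances in scope):

```scala
import scala.concurrent.duration._
import scalaz.concurrent.Task
import scalaz.stream.Process

// an infinite stream of naturals (hedged reconstruction)
def naturals: Process[Task, Int] = {
  def go(i: Int): Process[Task, Int] = Process.emit(i) ++ go(i + 1)
  go(0)
}

// emit one natural per 100 ms by zipping the stream with a clock
val ticked: Process[Task, Int] =
  (naturals zip Process.awakeEvery(100.milliseconds)).map(_._1)
```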
Then, let’s discretize the continuous stream with ZPark-Ztream API:
(code sample — see the code on Github)
The stream is sliced into slices of 500 ms, and all elements emitted during each 500 ms window are gathered in a Spark RDD.
On this stream of RDDs, we can apply countRDD as before and finally re-continuize it. All together we obtain:
(code sample — see the code on Github)
We have approximately 50 elements per slice, which looks like what we expected.
Please note that there is a short warmup period where values are less homogeneous.
DStream keeps track of all created RDD slices of data (following the Spark philosophy of caching as much as possible) and allows windowing operations to redistribute RDDs.
With the ZPark API, you can write the same as follows:
(code sample — see the code on Github)
We can see here that the final intervals don’t have 100 elements as we might expect. This is still a mystery to me and I must investigate a bit more to find out where this difference comes from. I have a few ideas but need to validate them.
Anyway, globally we get 500 elements meaning we haven’t lost anything.
Playing with naturals is funny but let’s work with a real source of data like a file.
It could be anything pluggable on scalaz-stream like kafka/flume/whatever as DStream
provides…
(code sample — see the code on Github)
Is it possible to combine RDD Processes using scalaz-stream ?
(code sample — see the code on Github)
Please note that the Scalaz-Stream driving the Spark RDD stream always remains on the driver node and is never sent to a remote node, as map/reduce closures are in Spark. So Scalaz-Stream is used as a stream driver in this case. Moreover, a Scalaz Process isn’t serializable in its current implementation, so it wouldn’t be possible as is.
After discretizing a process, you can persist each RDD :
(code sample — see the code on Github)
Ok but DStream
does much more, trying to keep every generated RDD in memory and potentially persisting it across the cluster. This makes things stateful & mutable, which is not the approach of a pure functional API like scalaz-stream. So I need to think a bit more about this persistence topic, which is huge.
Anyway I believe I’m currently investigating another way of manipulating distributed streams than DStream
.
Spark is quite amazing and easy to use with respect to the complexity of the subject.
I was also surprised to be able to use it with scalaz-stream so easily.
I hope you liked the idea and I encourage you to think about it and if you find it cool, please tell it! And if you find it stupid, please tell it too: this is still a pure experiment ;)
Have a look at the code on Github.
Have distributed & resilient yet continuous fun!
After 5 months studying theories deeper & deeper on my free-time and preparing 3 talks for scala.io & ping-conf with my friend Julien Tournay aka @skaalf, I’m back blogging and I’ve got a few more ideas of articles to come…
If you’re interested in those talks, you can find pingconf videos here:
Let’s go back to today’s subject: the incoming Play2.3/Scala generic validation API & more.
Julien Tournay aka @skaalf has been working a lot for a few months developing this new API and has just published an article previewing Play 2.3 generic validation API.
This new API is just the logical extension of play2/Scala Json API (that I’ve been working & promoting those 2 last years) pushing its principles far further by allowing validation on any data types.
This new API is a real step further as it will progressively propose a common API for all validations in Play2/Scala (Form/Json/XML/…). It proposes an even more robust design relying on very strong theoretical grounds, making it very reliable & typesafe.
Julien has written his article presenting the new API basics, and he also found time to write great documentation for this new validation API. I must confess the Json API doc was quite messy, but I never found the free time (and courage) to do better. So I’m not going to spend time on the basic features of this new API; I’m going to target advanced features to open your minds to the power of this new API.
Let’s have fun with this new API & Shapeless, this fantastic tool for higher-rank polymorphism & type-safety!
A really cool & new feature of Play2.3 generic validation API is its ability to compose validation Rules in chains like:
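A hedged sketch of such a chain; the package and combinator names follow the validation API preview (min/max are assumed to be in scope and names may differ in the final API):

```scala
import play.api.data.mapping._
import play.api.data.mapping.json.Rules._ // json rules + generic min/max (assumed)
import play.api.libs.json.JsValue

// parse an Int at "age", then refine it; `compose` chains Rules so the
// output of one Rule feeds the next one
val age: Rule[JsValue, Int] =
  From[JsValue] { __ => (__ \ "age").read[Int] }

val validAge: Rule[JsValue, Int] = age compose min(0) compose max(120)
```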
In Play2.1 Json API, you couldn’t do that (you could only map on Reads).
Moreover, with the new validation API, as in the Json API, you can use macros to create basic validators from case-classes.
(code sample — see the code on Github)
Great, but sometimes not enough, as you would like to add custom validations on your class. For example, you want to verify that:
- foo isn’t empty
- bar is > 5
- foo2 is < 10
For that you can’t use the macro and must write your case-class Rule yourself.
(code sample — see the code on Github)
Please note the new From[JsValue]: if it were Xml, it would be From[Xml]; genericity requires some more info.
Ok, that’s not too hard, but sometimes you would like to use the macro first and then, after those primary type validations, refine with custom validations. Something like:
(code sample — see the code on Github)
As you may know, you can’t use +: from Scala Sequence[T] here, as this list of Rules is heterogeneously typed and Rule[I, O] is invariant.
So we are going to use a Shapeless heterogeneous HList for that:
(code sample — see the code on Github)
How to compose Rule[JsValue, FooBar] with Rule[String, String] :: Rule[Int, Int] :: Rule[Long, Long] :: HNil?
We need to convert Rule[JsValue, FooBar] to something like Rule[JsValue, T <: HList].
Based on Shapeless Generic[T]
, we can provide a nice little new conversion API .hlisted
:
(code sample — see the code on Github)
Generic[T]
is able to convert any Scala case-class from/to a Shapeless HList (& Coproduct).
So we can validate a case class with the macro and get a Rule[JsValue, T <: HList]
from it.
How to zip Rule[JsValue, String :: Int :: Long :: HNil] with Rule[String, String] :: Rule[Int, Int] :: Rule[Long, Long] :: HNil?
Again, using Shapeless polymorphic functions and HList RightFolder, we can implement a function:
(code sample — see the code on Github)
This looks like some higher-kinded zip function, so let’s call it HZIP.
(code samples — see the code on Github)
As you can see, the problem in this approach is that we lose the path of Json. Anyway, this can give you a few ideas! Now let’s do something really useful…
As in Play2.1 Json API, the new validation API provides an applicative builder which allows the following:
(code sample — see the code on Github)
But, in Play2.1 Json API and also in new validation API, all functional combinators are limited by the famous Scala 22 limits.
In Scala, you CAN’T write a Tuple23. So you can’t chain Rule[JsValue, A] ~ Rule[JsValue, B] ~ … more than 22 times.
Nevertheless, sometimes you receive huge JSON with much more than 22 fields in it. Then you have to build more complex models like case-classes embedding case-classes… Shameful, isn’t it…
Let’s be shameless with Shapeless HList, which enables unlimited heterogeneously typed lists!
So, with HList, we can write :
(code sample — see the code on Github)
That’s cool, but we want the :: operator to have the same applicative-builder behavior as the ~/and operator:
(code sample — see the code on Github)
This looks like a higher-kinded fold, so let’s call it HFOLD.
We can build this hfold
using Shapeless polymorphic functions & RightFolder.
In a next article, I may write about coding such shapeless feature. Meanwhile, you’ll have to discover the code on Github as it’s a bit hairy but very interesting ;)
Gathering everything, we obtain the following:
(code sample — see the code on Github)
Let’s write a play action using this rule:
(code sample — see the code on Github)
Awesome… now, nobody can say 22 limits is still a problem ;)
Have a look at the code on Github.
Have fun x 50!
Here is the function Play provides to create a websocket:
(code sample — see the code on Github)
A websocket is a persistent bi-directional channel of communication (in/out) and is created with:
- an Iteratee[A, _] to manage all frames received by the websocket endpoint
- an Enumerator[A] to send messages through the websocket
- a FrameFormatter[A] to parse frame content to type A (Play provides default FrameFormatters for String and JsValue)
Here is how you traditionally create a websocket endpoint in Play:
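For reference, a minimal classic endpoint could look like this (an echo/broadcast sketch using the standard Play API):

```scala
import play.api.libs.iteratee.{Concurrent, Iteratee}
import play.api.libs.json.JsValue
import play.api.mvc.WebSocket

// one shared broadcast channel: every received frame is pushed back
// to all connected clients
def ws = WebSocket.using[JsValue] { request =>
  val (out, channel) = Concurrent.broadcast[JsValue]
  val in = Iteratee.foreach[JsValue](js => channel.push(js))
  (in, out)
}
```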
Generally, the Enumerator[A]
is created using Concurrent.broadcast[A]
and Concurrent.unicast[A]
which are very powerful tools but not so easy to fully master (the edge-cases of connection close and errors are always tricky).
You often want to:
- receive messages from multiple connected clients
- broadcast messages to all members or send a message to one precise member
To do that in Play’s non-blocking/async architecture, you often end up developing an Actor topology managing all events/messages on top of the previous Iteratee/Enumerator.
.
The Iteratee/Enumerator
is quite generic but never that easy to write.
The actor topology is quite generic too, because the administration messages are almost always the same (member connected/disconnected, errors, broadcasts…).
Actor Room is a helper managing all of this for you, so you can just focus on message management using actors and nothing else. It provides all default behaviors, and every behavior can be overridden if needed. It exposes only actors and nothing else.
The code is based on the chatroom sample (and a cool sample by Julien Tournay) from Play Framework pushed far further and in a more generic way.
An actor room manages a group of connected members which are supervised by a supervisor.
Each member is represented by 2 actors (1 receiver & 1 sender):
- you MUST create at least a Receiver Actor, because it’s your job to manage your own message format
- the Sender Actor has a default implementation, but you can override it
All actors are managed by one supervisor which has two roles:
- it creates/supervises all receiver/sender actors
- it manages administration messages (routing, forwarding, broadcasting etc…)
(code sample — see the code on Github)
The room creates the Supervisor actor for you and delegates the creation of receiver/sender actors to it.
If you want to broadcast a message or target a precise member, you should use the supervisor.
(code sample — see the code on Github)
You can manage several rooms in the same project.
There is only one message to manage:
(code sample — see the code on Github)
If your websocket frames contain Json, then it should be Received[JsValue]
.
You just have to create a simple actor:
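A hedged sketch of such a receiver; the Received field order and the Broadcast admin message name are assumptions about the library’s message types:

```scala
import akka.actor.Actor
import play.api.libs.json.{JsValue, Json}
import org.mandubian.actorroom._

class MyReceiver extends Actor {
  def receive = {
    case Received(from, js: JsValue) =>
      // context.parent is the Supervisor: use it to broadcast the message
      context.parent ! Broadcast(from, Json.obj("member" -> from, "message" -> js))
  }
}
```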
Please note the Receiver Actor is supervised by the Supervisor
actor. So, within the Receiver Actor, context.parent
is the Supervisor
and you can use it to send/broadcast message as following:
(code sample — see the code on Github)
Please note that each member is identified by a string that you define yourself.
import org.mandubian.actorroom._
(code samples — see the code on Github)
AdminMsgFormatter
typeclass is used by ActorRoom to format administration messages (Connected, Disconnected and Error) by default.
AdminMsgFormatter[JsValue]
and AdminMsgFormatter[String]
are provided by default.
You can override the format as following:
(code sample — see the code on Github)
You just have to create a new actor as following:
(code sample — see the code on Github)
Then you must initialize your websocket with it:
(code sample — see the code on Github)
You can override the following messages:
(code sample — see the code on Github)
Please note Supervisor
is an actor which manages an internal state containing all members:
(code sample — see the code on Github)
You can override the default Supervisor as following:
(code sample — see the code on Github)
A bot is a fake member that you can use to communicate with other members. It’s identified by an ID as any member.
You create a bot with this API:
(code sample — see the code on Github)
Then with returned Member
, you can simulate messages:
(code sample — see the code on Github)
Naturally, you can override the Bot Sender Actor
(code sample — see the code on Github)
So what else??? Everything you can override and everything that I didn’t implement yet…
On the Github project, you will find 2 samples:
- simplest, which is a very simple working sample
- websocket-chat, which is just the Play Framework ChatRoom sample rewritten with ActorRoom
Have fun!
The aim of this article is to show how scalaz-stream could be plugged on existing Play Iteratee/Enumerator and used in your web projects. I also wanted to evaluate in depth the power of scalaz-stream Processes by trying to write a recursive streaming action: I mean a web endpoint streaming data and re-injecting its own streamed data in itself.
If you want to see now how scalaz-stream is used with Play, go to this paragraph directly.
I’m a fan of everything dealing with data streaming and realtime management in backends. I’ve worked a lot on Play Framework and naturally I’ve been using the cornerstone behind Play’s reactive nature: Play Iteratees.
Iteratees (with its counterparts, Enumerators and Enumeratees) are great to manipulate/transform linear streams of data chunks in a very reactive (non-blocking & asynchronous) and purely functional way:
Iteratee is really powerful but I must say I’ve always found them quite picky to use, practically speaking. In Play, they are used in their best use-case and they were created for that exactly. I’ve been using Iteratees for more than one year now but I still don’t feel fluent with them. Each time I use them, I must spend some time to know how I could write what I need. It’s not because they are purely functional (piping an Enumerator into an Enumeratee into an Iteratee is quite trivial) but there is something that my brain doesn’t want to catch.
If you want more details about my experience with Iteratees, go to this paragraph
That’s why I wanted to work with other functional streaming tools, to see if they suffer the same kind of usability toughness or can bring something more natural to me. There are lots of other competitors in the field, such as pipes, conduits and machines. As I don’t have the physical time to study all of them in depth, I’ve chosen the one that appealed to me the most, i.e. Machines.
I’m not yet a Haskell coder even if I can mumble it so I preferred to evaluate the concept with scalaz-stream, a Scala implementation trying to bring machines to normal coders focusing on the aspect of IO streaming.
I’m not going to judge if Machines are better or not than Iteratees, this is not my aim. I’m just experimenting the concept in an objective way.
I won’t explain the concept of Machines in depth because it’s huge and I don’t think I have the theoretical background to do it right now. So, let’s focus on very basic ideas at first:
In scalaz-stream, you don’t manipulate machines, which are too abstract for real-life use-cases; you manipulate simpler concepts:
- Process[M, O] is a restricted machine outputting a stream of O. It can be a source if the monadic effect gets input from I/O or generates procedural data, or a sink if you don’t care about the output. Please note that it doesn’t infer the type of potential input at all.
- Wye[L, R, O] is a machine that takes 2 inputs (left L / right R) and outputs chunks of type O (you can read from left or right, or wait for both before outputting).
- Tee[L, R, O] is a Wye that can only read alternately from left or from right, not from both at the same time.
- Process1[I, O] can be seen as a transducer which accepts inputs of type I and outputs chunks of type O (a bit like Enumeratee).
- Channel[M, I, O] is an effectful channel that accepts inputs of type I and uses them in a monadic effect M to produce potential O.
Here is the StartHere sample provided by scalaz-stream:
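This is, from memory, the classic fahrenheit-to-celsius converter from the scalaz-stream README; details like text.utf8Encode vs process1.utf8Encode vary across versions:

```scala
import scalaz.concurrent.Task
import scalaz.stream._

def fahrenheitToCelsius(f: Double): Double = (f - 32.0) * (5.0 / 9.0)

// read a file line by line, convert each value, write the result to another file
val converter: Task[Unit] =
  io.linesR("testdata/fahrenheit.txt")
    .filter(s => !s.trim.isEmpty && !s.startsWith("//"))
    .map(line => fahrenheitToCelsius(line.toDouble).toString)
    .intersperse("\n")
    .pipe(text.utf8Encode)
    .to(io.fileChunkW("testdata/celsius.txt"))
    .run

// nothing happens until the Task is actually run:
converter.run
```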
But don’t think everything is so simple: Machines is a complex concept with lots of quite abstract theory behind it. What I find very interesting is that it’s possible to vulgarize this very abstract concept with simpler concepts such as Process, Source, Sink, Tee, Wye… that you can catch quite easily, as these are concepts you already manipulated when you were playing in your bathtub as a child (or even now).
After these considerations, I wanted to experiment scalaz-stream with Play streaming capabilities in order to see how it behaves in a context I know.
Here is what I decided to study:
- streaming a Play action response with a scalaz-stream Process
- consuming a WS call as a stream of Array[Byte] chunks driven by a scalaz-stream Process
Here is the existing Play API:
- Ok.stream(Enumerator)
- WS.get(r: ResponseHeader => Iteratee)
As you can see, these APIs depend on Iteratee/Enumerator. As I didn’t want to hack Play too much as a beginning, I decided to try & plug scalaz-stream on Play Iteratees (if possible).
Enumerator[O]
from Process[Task, O]
The idea is to take a scalaz-stream Source[O] (Process[M,O]
) and wrap it into an Enumerator[O]
so that it can be used in Play controller actions.
An Enumerator is a data producer which can generate those data using monadic Future
effects (Play Iteratee is tightly linked to Future
).
Process[Task, O]
is a machine outputting a stream of O
so it’s logically the right candidate to be adapted into an Enumerator[O]. Remember that Task
is just a scalaz Future[Either[Throwable,A]]
with a few helpers and it’s used in scalaz-stream.
So I’ve implemented (at least tried) an Enumerator[O]
that accepts a Process[Task, O]
:
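The implementation lives on Github; its assumed shape is simply:

```scala
import play.api.libs.iteratee.Enumerator
import scalaz.concurrent.Task
import scalaz.stream.Process

// assumed shape; the real implementation steps the Process and feeds
// each emitted O to the Iteratee consuming the Enumerator
def enumerator[O](p: Process[Task, O]): Enumerator[O] = ???
```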
The implementation just synchronizes the states of the
Iteratee[O, A]
consuming theEnumerator
with the states ofProcess[Task, O]
emitting data chunks ofO
. It’s quite simple actually.
Process1[I, O]
from Iteratee[I, O]
The idea is to drive an Iteratee from a scalaz-stream Process so that it can consume an Enumerator and be used in Play WS.
An Iteratee[I, O]
accepts inputs of type I
(and nothing else) and will fold the input stream into a single result of type O
.
A Process1[I, O]
accepts inputs of type I
and emits chunks of type O
but not necessarily one single output chunk. So it’s a good candidate for our use-case but we need to choose which emitted chunk will be the result of the Iteratee[I, O]
. Here, totally arbitrarily, I’ve chosen to take the first emitted chunk as the result (but the last would be as good, if not better).
So I implemented the following:
(code sample — see the code on Github)
The implementation is really raw for experimentation as it goes through the states of the
Process1[I,O]
and generates the corresponding states ofIteratee[I,O]
until first emitted value. Nothing more nothing less…
Everything done in those samples could be done with Iteratee/Enumeratee more or less simply. The subject is not there!
(code samples — see the code on Github)
Please note:
- scalaFuture2scalazTask is just a helper to convert a Future into a Task
- ticker is quite simple to understand: it awaits a Task[Int], emits the Int and repeats again…
- processes.zipWith((a,b) => a) is a tee (2 inputs left/right) that outputs only left data but also consumes right to get the delay effect
- .map(_.toString) simply converts into something writeable by Ok.stream
- .intersperse(",") simply adds "," between each element
(code sample — see the code on Github)
(code sample — see the code on Github)
Please note:
- delayedNumber uses an Akka scheduler to trigger our value after a timeout
- delayedNumerals shows a simple recursive Process[Task, Int] construction which shouldn’t be too hard to understand
Please note:
- reader is a Process1[Array[Byte], String] that folds all received Array[Byte] into a String
- iterateeFirstEmit(reader) simulates an Iteratee[Array[Byte], String] driven by the reader process, folding all chunks of data received from the WS call to routes.Application.sample2()
- .get(rh => iterateeFirstEmit(reader)) returns a Future[Iteratee[Array[Byte], String]] that is run in .flatMap(_.run) to return a Future[String]
- Process.wrap(scalaFuture2scalazTask(maybeValues)) is a trick to wrap the folded Future[String] into a Process[Task, String]
- Process.emitAll(values.split(",")) splits the resulting string again and emits all chunks outside (stupid, just for demo)
(code sample — see the code on Github)
Still there? Let’s dive deeper and be sharper!
WS.executeStream(r: ResponseHeader => Iteratee[Array[Byte], A])
is a cool API because you can build an iteratee from the ResponseHeader, and the iteratee will then consume the received Array[Byte] chunks in a reactive way and fold them. The problem is that until the iteratee has finished, you won’t have any result.
But I’d like to be able to receive chunks of data in realtime and re-emit them immediately so that I can inject them in realtime data flow processing. WS API doesn’t allow this so I decided to hack it a bit. I’ve written WSZ
which provides the API:
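The assumed shape of that hacked API:

```scala
import scala.concurrent.Future
import scalaz.stream.Process

// a realtime stream of received chunks, each chunk redeemed by a
// promise in the underlying AsyncHandler (shape assumed)
object WSZ {
  def getRealTime(url: String): Process[Future, Array[Byte]] = ???
}
```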
This API outputs a realtime Stream of Array[Byte]
whose flow is controlled by promises (Future
) being redeemed in AsyncHttpClient AsyncHandler
. I didn’t care about ResponseHeaders for this experimentation, but they should be taken into account in a more serious implementation.
I obtain a Process[Future, Array[Byte]]
streaming received chunks in realtime and I can then take advantage of the power of machines to manipulate the data chunks as I want.
(code sample — see the code on Github)
Please note:
- def splitFold(splitter: String): Process1[String, String] is just a demo showing that coding a Process transducer isn’t so crazy… Look at the comments in the code.
- translate(Task2FutureNF) converts the Process[Future, Array[Byte]] into a Process[Task, Array[Byte]] using a Scalaz natural transformation.
- p |> splitFold(",") means “pipe the output of process p to the input of splitFold”.
(code sample — see the code on Github)
Let’s finish our trip with a bit of puzzle and mystery.
As soon as my first experimentations of scalaz-stream with Play were operational, I’ve imagined an interesting case:
Is it possible to build an action generating a stream of data fed by itself: a kind of recursive stream.
With Iteratee, it’s not really possible since it can’t emit data before finishing iteration. It would certainly be possible with an Enumeratee but the API doesn’t exist and I find it much more obvious with scalaz-stream API!
The mystery isn’t in the answer to my question: YES it is possible!
The idea is simple:
Naturally, if it consumes its own data, it will recall itself again and again and again until you reach the connections or opened file descriptors limit. As a consequence, you must limit the depth of recursion.
I performed different experiences to show this use-case by zipping the stream with itself, adding elements with themselves etc… And after a few tries, I implemented the following code quite fortuitously :
(code sample — see the code on Github)
Launch it:
(code sample — see the code on Github)
WTF??? Is this the Fibonacci series?
Just to remind you about it:
fib(0) = 0, fib(1) = 1
fib(n) = fib(n-1) + fib(n-2)
Here is the mystery!!!
How does it work???
I won’t tell the answer to this puzzling side-effect and let you think about it and discover why it works XD
But this sample shows exactly what I wanted: Yes, it’s possible to feed an action with its own feed! Victory!
Ok all of that was really funky but is it useful in real projects? I don’t really know yet but it provides a great proof of the very reactive character of scalaz-stream and Play too!
I tend to like scalaz-stream and I feel more comfortable, more natural using Process than Iteratee right now… Maybe this is just an impression so I’ll keep cautious about my conclusions for now…
All of this code is just experimental, so be aware of that. If you like it and see that it could be useful, tell me so that we can create a real library from it!
Have Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun, Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun, Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,!
Here are a few things that bother me when I use Play Iteratees (you don’t have to agree, this is very subjective)…
Note: you should now use play-autosource 2.0, which corrects a few issues & introduces ActionBuilder from Play 2.2.
The code for all autosources & sample apps can be found on Github here
One month ago, I’ve demo’ed the concept of Autosource for Play2/Scala with ReactiveMongo in this article. ReactiveMongo was the perfect target for this idea because it accepts Json structures almost natively for both documents manipulation and queries.
But how does the concept behave when applied to a DB whose data are constrained by a schema and whose queries aren’t Json?
Add the following lines to your project/Build.scala:
(code sample — see the code on Github)
With ReactiveMongo Autosource, you could create a pure blob Autosource using JsObject
without any supplementary information. But with Datomic, it’s not possible, because Datomic forces you to use a schema for your data.
We could create a schema and manipulate JsObject
directly with Datomic and some Json validators. But I’m going to focus on the static models because this is the way people traditionally interact with a Schema-constrained DB.
Let’s create our model and schema.
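A hedged sketch of what a Datomisca schema can look like (attribute names are assumptions, not the article’s exact code):

```scala
import datomisca._
import Datomic._

object PersonSchema {
  val person = Namespace("person")

  // each attribute declares its keyword ident, value type and cardinality
  val name = Attribute(person / "name", SchemaType.string, Cardinality.one)
               .withDoc("A person's name")
  val age  = Attribute(person / "age",  SchemaType.long,   Cardinality.one)

  // transaction data used to provision the schema
  val txData = Seq(name, age)
}
```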
Now that we have our schema, let’s write the autosource.
(code sample — see the code on Github)
If you compile the previous code, you should get the following error:
(code sample — see the code on Github)
Actually, Datomisca Autosource requires 4 elements to work:
- a Json.Format[Person] to convert Person instances from/to Json (network interface)
- an EntityReader[Person] to convert Person instances from Datomic entities (Datomic interface)
- a PartialAddEntityWriter[Person] to convert Person instances to Datomic entities (Datomic interface)
- a Reads[PartialAddEntity] to convert Json to a PartialAddEntity, which is actually a simple map of fields/values to partially update an existing entity (one single field, for example)
It might seem more complicated than in ReactiveMongo, but there is nothing different: the autosource converts Person from/to Json and then converts Person from/to the Datomic structure, i.e. PartialAddEntity. In ReactiveMongo, the only difference is that it understands Json so well that a static model sometimes becomes unnecessary ;)…
Let’s define those elements in Person
companion object.
(code sample — see the code on Github)
Now we have everything to work except a few configurations.
conf/routes
(code sample — see the code on Github)
conf/play.plugins
to initialize the Datomisca plugin:
(code sample — see the code on Github)
conf/application.conf
to initialize the Datomic connection:
(code sample — see the code on Github)
(code sample — see the code on Github)
In Datomic, you can’t do a getAll
without providing a Datomic Query.
But what is a Datomic query? It’s inspired by Datalog
which uses predicates to express the constraints on the searched entities. You can combine predicates together.
With Datomisca Autosource, you can directly send datalog queries in the query parameter q
for GET, or in the body for POST, with one restriction: your query can’t accept input parameters and must return only the entity ID. For example:
[ :find ?e :where [ ?e :person/name "john"] ] --> OK
[ :find ?e ?name :where [ ?e :person/name ?name] ] --> KO
Let’s use it by finding all persons.
(code sample — see the code on Github)
Please note the use of POST here instead of GET, because curl doesn’t like [] in URLs, even with the -g option.
Now you can use all other routes provided by Autosource
Play-Autosource’s ambition was to be DB-agnostic (as much as possible), and showing that the concept can be applied to schemaless DBs (ReactiveMongo & CouchDB) as well as schema-based DBs (Datomic) is a good sign it can work. Naturally, there are a few more elements to provide for Datomic than for ReactiveMongo, but it’s useful anyway.
Thanks to @TrevorReznik for his contribution of a CouchBase Autosource.
I hope to see soon one for Slick and a few more ;)
Have Autofun!
Do you remember JsPath
pattern matching presented in this article?
Let’s now go further with something that you should enjoy even more: Json Interpolation & Pattern Matching.
I’ve had the idea of these features in my mind for some time, but let’s render unto Caesar what is Caesar’s: Rapture.io proved that it could be done quite easily, and I must say I “stole” (got inspired by) a few implementation details from them! (specially the @inline implicit conversion for the string interpolation class, which is required due to a value-class limitation that should be removed in future Scala versions)
First of all, code samples as usual…
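A minimal sketch (assuming the interpolator is simply named json and the library’s imports are in scope):

```scala
import play.api.libs.json._
// plus the json interpolation import from the article's library

val id = 123
val js: JsValue = json"""{
  "foo" : "bar",
  "foo2" : $id,
  "foo3" : [ 1, 2, 3 ]
}"""
```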
Yes, pure Json in a string…
How does it work? Using String interpolation introduced in Scala 2.10.0 and Jackson for the parsing…
In String interpolation, you can also put Scala variables directly in the interpolated string. You can do the same in Json interpolation.
(code sample — see the code on Github)
Please note that string variables must be put between quotes ("…"), because without them the parser will complain.
Ok, so now it’s really trivial to write Json, isn’t it?
String interpolation just replaces the string you write in your code by some Scala code concatenating pieces of strings with variables, as you would write yourself. Kind of: s"toto ${v1} tata" becomes "toto " + v1 + " tata".
But at compile-time, it doesn’t compile your String into Json: the Json parsing is done at runtime with string interpolation. So using Json interpolation doesn’t provide you with compile-time type safety and parsing for now.
In the future, I may replace String interpolation by a real Macro which will also parse the string at compile-time. Meanwhile, if you want to rely on type-safety, go on using
Json.obj / Json.arr
API.
What is one of the first feature that you discover when learning Scala and that makes you say immediately: “Whoaa Cool feature”? Pattern Matching.
You can write:
(code sample — see the code on Github)
Why not do this with Json?
And…. Here it is with Json pattern matching!!!
(code sample — see the code on Github)
Magical?
Not at all… Just unapplySeq
using the tool that enables this kind of Json manipulation as trees: JsZipper
…
The more I use JsZippers, the more places I find where I can use them ;)
(code samples — see the code on Github)
If you like that, please tell it so that I know whether it’s worth pushing it to Play Framework!
These features are part of my experimental project JsZipper presented in this article.
To use it, add the following lines to your SBT Build.scala:
:
(code sample — see the code on Github)
In your Scala code, import the following packages:
(code sample — see the code on Github)
PatternMatch your fun!
Note: you should now use play-autosource 2.0, which corrects a few issues & introduces ActionBuilder from Play 2.2.
The module code and sample app can be found on Github here
Here we go:
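A hedged sketch of the controller part, following the play-autosource README shape (names may differ slightly):

```scala
import play.api.Play.current
import play.api.libs.json.JsObject
import play.modules.reactivemongo.ReactiveMongoPlugin
import play.modules.reactivemongo.json.collection.JSONCollection
import play.autosource.reactivemongo._

// the autosource derives all CRUD endpoints from the target collection
// (storing raw JsObject blobs at this stage)
object Persons extends ReactiveMongoAutoSourceController[JsObject] {
  lazy val db = ReactiveMongoPlugin.db
  val coll    = db.collection[JSONCollection]("persons")
}
```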
Add the play-autosource:reactivemongo dependency:
(code sample — see the code on Github)
(code sample — see the code on Github)
conf/routes
(code sample — see the code on Github)
conf/play.plugins
to initialize the ReactiveMongo plugin:
(code sample — see the code on Github)
conf/application.conf
to initialize the MongoDB connection:
(code sample — see the code on Github)
(curl samples for the generated CRUD endpoints — see the code on Github)
With Play-Autosource, in a few lines you obtain a complete REST CRUD datasource (here storing raw JsObject, but we’ll show later that we can use any type).
It can be useful to kickstart any application in which you’re going to work iteratively on your data models in direct interaction with the front-end.
It could also be useful to front-end developers who need to bootstrap their front-end code with a Play Framework application backend. With Autosource, they don’t have to care about strictly modelizing a datasource on the server side and can dig into their client-side code quite quickly.
Now you tell me: “Hey that’s stupid, you store directly
JsObject
but my data are structured and must be validated before inserting them”
Yes you’re right so let’s add some type constraints on our data:
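One way to do it with plain Play 2.1 Json transformers — a sketch, not necessarily the article’s exact validator:

```scala
import play.api.libs.json._
import play.api.libs.functional.syntax._

// keeps only a String `name` branch and a Number `age` branch,
// rejecting any input where they are missing or wrongly typed
val personValidator: Reads[JsObject] = (
  (__ \ 'name).json.pickBranch(Reads.of[JsString]) and
  (__ \ 'age).json.pickBranch(Reads.of[JsNumber])
).reduce
```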
Try it now:
(code sample — see the code on Github)
You can progressively add constraints on your data in a few lines. With AutoSource
, you don’t need to determine immediately the exact shape of your models and you can work with JsObject
directly as long as you need. Sometimes, you’ll even discover that you don’t even need a structured model and JsObject
will be enough. (but I also advise to design a bit things before implementing ;))
Keep in mind that our sample is based on an implementation for ReactiveMongo so using Json is natural. For other DB, other data structure might be more idiomatic…
Now you tell me: “Funny but but but
JsObject
is evil because it’s not strict enough. I’m a OO developer (maybe abused by ORM gurus when I was young) and my models are case-classes…”
Yes you’re right, sometimes, you need more business logic or you want to separate concerns very strictly and your model will be shaped as case-classes.
So let’s replace our nice little JsObject
by a more serious case class
.
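A minimal sketch of such a model, with the macro-generated Format in the companion object:

```scala
import play.api.libs.json._

case class Person(name: String, age: Int)

object Person {
  // the macro-generated Format is picked up implicitly by the autosource
  implicit val personFormat: Format[Person] = Json.format[Person]
}
```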
Please note that I removed the validations introduced before, because they are not useful anymore: using Json macros, I created an implicit Format[Person]
which is used implicitly by AutoSource.
So, now you can see why I consider AutoSource as a typesafe datasource.
You all know that AngularJS is the new kid on the block and that you must use it if you want to be sexy nowadays.
I’m already sexy, so I must be able to use it without understanding anything about it, and that’s exactly what I’ve done: in 30mn, without knowing anything about Angular (but a few concepts), I wrote a dumb CRUD front page plugged on my wonderful AutoSource
.
This is the most important part of this sample: we need to call our CRUD autosource endpoints from angularJS.
We are going to use Angular resources for it even if it’s not really the best feature of AngularJS. Anyway, in a few lines, it works pretty well in my raw case.
(thanks to Paul Dijou for reviewing this code because I repeat I don’t know angularJS at all and I wrote this in 20mn without trying to understand anything :D)
(code sample — see the code on Github)
Now let’s create our CRUD UI page using angular directives. We need to be able to:
(code sample — see the code on Github)
We need to import AngularJS in our application and create the Angular application using ng-app:
(code sample — see the code on Github)
I know what you think: “Uhuh, the poor guy who exposes his DB directly on the network and who is able to delete everything without any security”
Once again, you’re right. (yes I know I love flattery)
Autosource is by default not secured in any way, and actually I don’t really care about security, because it is your job to secure your exposed APIs; there are so many ways to secure services that I prefer to let you choose the one you want.
Anyway, I’m a nice boy and I’m going to show you how you could secure the DELETE
endpoint using the authentication action composition sample given in Play Framework documentation.
(code sample — see the code on Github)
Nothing too complicated here. If you need to add headers to your responses or params to the querystring, it’s easy to wrap autosource actions. Please refer to the Play Framework doc for more info…
I won’t try it here, the article is already too long but it should work…
Play-Autosource
Core is independent of the DB and provides Reactive (Async/Nonblocking) APIs to fulfill PlayFramework requirements.
Naturally this 1st implementation uses ReactiveMongo, which is one of the best examples of a reactive DB driver. MongoDB fits very well in this concept too, because document records are really compliant with Json datasources.
But other implementations for other DB can be done and I count on you people to contribute them.
DB implementation contributions are welcome (Play-Autosource is just Apache2 licensed) and AutoSource API are subject to evolutions if they appear to be erroneous.
Play-Autosource provides a very fast & lightweight way to create a REST CRUD typesafe datasource in your Play/Scala application. You can begin with blob data such as JsObject
and then elaborate the model of your data progressively by adding constraints or types to it.
There would be many more things to say about Play/Autosource…
There are also lots of features to improve/add because it’s still a very draft module.
If you like it and have ideas, don’t hesitate to discuss, to contribute, to improve etc…
curl -X POST -d '{ "coding" : "Have fun" }' http://localhost:9000/developer
PS: Thanks to James Roper for his article about advanced routing in Play Framework which I copied shamefully XD
The sample app can be found on Github here
Hi again folks!
Now, you may certainly have realized I’m a Play2.1 Json API advocate. But you may also have understood that I’m not interested in Json as an end in itself. What catches my attention is that it’s a versatile arborescent data structure that can be used on web servers & clients, in DBs such as ReactiveMongo, and also when communicating between servers with web services.
So I keep exploring what can be done with Json (specially in the context of PlayFramework reactive architecture) and building the tools that are required to concretize my ideas.
My last article introduced JsPath Pattern Matching and I told you that I needed this tool to use it with JsZipper. It’s time to use it…
Here is what I want to do:
Please note that this idea and its implementation is just an exercise of style to study the idea and introduce technical concepts but naturally it might seem a bit fake. Moreover, keep in mind, JsZipper API is still draft…
Imagine I want to gather twitter user timeline and github user profile in a single Json object.
I also would like to do a few more things that you’ll discover along the way.
Let’s use a Json template such as:
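A plausible instance of such a template (field values are obviously made up, with localhost urls as in the article):

```scala
import play.api.libs.json.Json

val template = Json.obj(
  "streams" -> Json.obj(
    "twitter" -> Json.obj(
      "url"     -> "http://localhost:9000/twitter/statuses",
      "user_id" -> 12345
    ),
    "github" -> Json.obj(
      "url"     -> "http://localhost:9000/github/events",
      "user_id" -> "mandubian"
    )
  )
)
```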
Using the url and user_id found in __ \ streams \ twitter, I can call the Twitter API to fetch the stream of tweets, and do the same for __ \ streams \ github. Finally I replace the content of each node as follows:
(code sample — see the code on Github)
Moreover, I’d like to store multiple templates like previous sample with multiple user_id
to be able to retrieve multiple streams at the same time.
Recently, Stephane Godbillon has released ReactiveMongo v0.9 with corresponding Play plugin. This version really improves and eases the way you can manipulate Json directly with Play & Mongo from Scala.
Let’s store a few instances of the previous template using this API:
(code sample — see the code on Github)
Hard isn’t it?
Note that I use localhost
URL because with real Twitter/Github API I would need OAuth2 tokens and this would be a pain for this sample :)
Now, let’s do the real job, i.e. the following steps:
- fetch the Json templates from the Mongo JsonCollection
- traverse each template and replace its stream nodes using a JsZipperM[Future]
The interesting technical points here are that:
- each call to an external service returns a Future[JsValue]
- so each Json node gets replaced by a Future[JsValue]
- and gathering all results means dealing with a Seq[Future[JsValue]]
We could use Play/Json transformers presented in a previous article but knowing that you have to manage Futures and multiple WS calls, it would create quite complicated code.
Here is where the monadic JsZipper becomes interesting:
- JsZipper allows modifying an immutable JsValue, which is already cool
- JsZipperM[Future] allows modifying a JsValue in the future, and that’s even better!
Actually the real power of JsZipper (besides being able to modify/delete/create a node in immutable Json tree) is to transform a Json tree into a Stream of nodes that it can traverse in depth, in width or whatever you need.
Here is the code because you’ll see how easy it is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Please note:
- Json.toJson(templates) transforms a List[JsObject] into a JsArray, because we want to manipulate pure JsValue with JsZipperM[Future].
- .updateAllM( (JsPath, JsValue) => Future[JsValue] ) is a wrapper API hiding the construction of a JsZipperM[Future]: once built, the JsZipperM[Future] traverses the Json tree and, for each node, calls the provided function, flatMapping on Futures before going to the next node. This makes the WS calls sequential, not parallel.
- case (_ \ "twitter", value): yes, here is the JsPath pattern matching — imagine the crazy stuff you can do mixing Json traversal and pattern matching ;)
- Async means the embedded code will return a Future[Result]; but remember that it DOESN’T mean the Action is synchronous/blocking, because in Play everything is asynchronous/non-blocking by default.
Then you could tell me that this is cool, but that the WS calls are not made in parallel, only sequentially. Yes, it’s true, but note that it’s less than 10 lines of code and could even be reduced. Anyway, here is the parallelized version…
(code sample — see the code on Github)
Note that:
- jsonTemplates.findAll( filter: (JsPath, JsValue) => Boolean ) traverses the Json tree and returns a Stream[(JsPath, JsValue)] containing the filtered nodes. This is not done with Future, because we want all nodes now, to be able to launch all WS calls in parallel.
- Future.traverse(nodes)(T => Future[T]) traverses the filtered values and calls all WS in parallel.
- case (path@(_ \ "twitter"), value) is just JsPath pattern matching once again, keeping track of the full path to be able to return it with the value (path -> resp) for the next step.
- jsonTemplates.set( (JsPath, JsValue)* ) finally updates all values at the given paths. Note how easy it is to update multiple values at multiple paths.
A bit less elegant than the sequential case but not so much.
This sample is a bit stupid but you can see the potential of mixing those different tools together.
Alone, JsZipper and JsPath pattern matching provides very powerful ways of manipulating Json that Reads/Writes can’t do easily.
When you add reactive API on top of that, JsZipper becomes really interesting and elegant.
The sample app can be found on Github here
Have JsZipperM[fun]!
While experimenting Play21/Json Zipper in my previous article, I needed to match patterns on JsPath
and decided to explore a bit this topic.
This article just presents my experimentations on JsPath
pattern matching so that people interested in the topic can tell me if they like it or not and what they would add or remove. So don’t hesitate to let comments about it.
If the result is satisfying, I’ll propose it to Play team ;)
Let’s go to samples as usual.
match/case-style:
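A minimal sketch, assuming the article’s experimental JsPath extractors are in scope:

```scala
import play.api.libs.json._
// plus the experimental JsPath pattern-matching import

val path = __ \ "alpha" \ "beta" \ "gamma"

path match {
  case _ \ "gamma" => println("the path ends with the 'gamma' node")
  case _           => println("no match")
}
```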
val-style:
(code sample — see the code on Github)
Note that I don’t write val __ \ toto = __ \ "toto"
(2x Underscore) as you would expect.
Why? Let’s write it:
(code sample — see the code on Github)
Actually, the 1st __ is considered by the Scala compiler as a variable to be bound. The variable __ would then appear on both the left and right side, which is not good. So I use _ to ignore its value, because I know it’s __. If you absolutely wanted to match against __, you would have written:
(code sample — see the code on Github)
(code sample — see the code on Github)
Note the usage of @@
operator that you can dislike. I didn’t find anything better for now but if anyone has a better idea, please give it to me ;)
(code sample — see the code on Github)
Using _
, I ignore everything before gamma
node.
(code sample — see the code on Github)
Note the \?\ operator, which is also a temporary choice: I didn’t want to reuse \\ because the \?\ operator only works when you match between the first and the last element of the path, and not between anything and anything…
(code samples — see the code on Github)
So, I think we can provide more features and now I’m going to use it with my JsZipper
stuff in my next article ;)
If you like it, tell it!
Have fun!
The code is available on Github project play-json-zipper
JsZipper
is a new tool allowing much more complex & powerful manipulations of Json structures for Play2/Json Scala API (not a part of Play2 core for now)
JsZipper
is inspired by the Zipper concept introduced by Gérard Huet in 1997.
The Zipper allows to update immutable traversable structures in an efficient way. Json is an immutable AST so it fits well. FYI, the Zipper behaves like a loupe that walks through each node of the AST (left/right/up/down) while keeping aware of the nodes on its left, its right and its upper. The interesting idea behind the loupe is that when it targets a node, it can modify and even delete the focused node. The analogy to the pants zipper is quite good too because when it goes down the tree, it behaves as if it was opening the tree to be able to drive the loupe through all nodes and when it goes up, it closes back the tree… I won’t tell more here, it would be too long.
JsZipper
is a specific interpretation of the Zipper concept for the Play/Json API.
Please note, JsZipper
is not an end in itself but a tool useful to provide new API to manipulate Json.
Let’s go to samples because it explains everything.
We’ll use the following Json object:
(code sample — see the code on Github)
(code samples — see the code on Github)
Let’s use
Future
as our Monad because it’s… coooool to do things in the future ;)
Imagine you call several services returning Future[JsValue]
and you want to build/update a JsObject
from it.
Until now, if you wanted to do that with Play2/Json, it was quite tricky and required some code.
Here is what you can do now.
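A hedged sketch using the updateAllM wrapper mentioned above (WS calls and paths are made up):

```scala
import scala.concurrent.Future
import play.api.libs.concurrent.Execution.Implicits._
import play.api.libs.json._
import play.api.libs.ws.WS
// plus the play-json-zipper imports providing updateAllM (see Github)

// every node named "url" is replaced by the Json fetched from that url,
// all other nodes are kept as-is
def enrich(js: JsValue): Future[JsValue] =
  js.updateAllM {
    case (_ \ "url", JsString(url)) => WS.url(url).get().map(_.json)
    case (_, value)                 => Future.successful(value)
  }
```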
(code samples — see the code on Github)
It’s still a draft so it can be improved, but if you like it, don’t hesitate to comment; and if people like it, it could become a part of Play Framework itself.
Have fun!
For info, this dendrograph was pre-computed using a Play2.1 app sucking tweets & filtering/grouping the results in a very manual-o-matic way…
Have Fun(ctional)