Mandubian Blog

ShapelessStream, when Akka-Stream meets Shapeless Coproduct at compile-time

2015-05-05T23:23:00+02:00

You might have seen on my twitter that my current company MFG Labs has opensourced the library Akka-Stream Extensions. We have developed it with Alexandre Tamborrino and Damien Pignaud for our recent production projects based on Typesafe Akka-Stream.

In this article, I won’t explain all the reasons that motivated our choice of Akka-Stream at MFG Labs and the road towards our library Akka-Stream Extensions. Here, I’ll focus on one precise aspect of our choice: types… And I’ll tell you about a specific extension I’ve created for this project: ShapelessStream.

The code is there on github

Akka-Stream loves Types

As you may know, I’m a Type lover. Types are proofs and proofs are types. Proofs are what you want in your code to ensure in a robust & reliable way that it does what it pretends with the support of the compiler.

First of all, let’s remind that typesafety is a very interesting feature of Akka-Stream.

Akka-Stream most basic primitive Flow[A, B] represents a data-flow that accepts elements of type A and will return elements of type B. You can’t pass a C to it and you are sure that this flow won’t return any C for example.

At MFG Labs, we have inherited some Scala legacy code mostly based on Akka actors which provide a very good way to handle failures but which are not typesafe at all (till Akka Typed) and not composable. Developers using Akka tend to scatter the business logic in the code and it can become hard to maintain. It has appeared that in many cases where Akka was used to transform data in a flow, call external services, Akka-Stream would be a very good way to replace those actors:

better type-safety,
fluent & composable code with great builder DSL
same Akka failure management
buffer, parallel computing, backpressure out-of-the-box

Yes, it’s quite weird to say it but Akka-Stream helped us correct most problems that had been introduced using Akka (rightly or wrongly).

Multi-type flows

Ok, Akka-Stream promotes Types as first citizen in your data flows. That’s cool! But it appears that you often need to handle multiple types in the same input channel:

When you control completely the types in input, you can represent input types by a classic ADT:

sealed trait A
case class A1(...) extends A
case class A2(...) extends A
case class A3(...) extends A

… And manage it in Flow[A, B]:

Flow[A].map {
  case A1(...) => // some B
  case A2(...) => // some B
  case A3(...) => // some B
}

Nice but you need to wrap all input types in an ADT and this involves some boring code that can even be different for every custom flow.

Going further, in general, you don’t want to do that, you want to dispatch every element to a different flow according to its type:

… and merge all results of all flows in one single channel…

… and every flow has its own behavior in terms of parallelism, bufferization, data generation and back-pressure…

In Akka-Stream, if you wanted to build a flow corresponding to the previous schema, you would have to use:

a FlexiRoute for the input dispatcher
a FlexiMerge for the output merger.

Have a look at the doc and see that it requires quite a bunch of lines to write one of those. It’s really powerful but quite tedious to implement and not so typesafe after all. Moreover, you certainly would have to write one FlexiRoute and one FlexiMerge per use-case as the number of inputs types and return types depend on your context.

Miles Sabin to the rescue

In my latest project, this dispatcher/flows/merger pattern was required in multiple places and as I’m lazy, I wanted something more elegant & typesafe if possible to build this kind of flow graphs.

Thinking in terms of pure types and from an external point of view, we can see the previous dispatcher/flows/merger flow graph in pseudo-code as:

Flow[
  Input = A1 or A2 or A3, // in input it accepts A1 or A2 or A3
  Ouput = B1 or B2 or B3  // in output it generates B1 or B2 or B3
]

And to build the full flow graph, we need to provide a list of flows for all pairs of input/output types corresponding to our graph branches:

Flow[A1, B1] and Flow[A2, B2] and Flow[A3, B3]

In Shapeless, there are 2 very very very useful structures:

Coproduct is a generalization of the well known Either. You have A or B in Either[A, B]. With Coproduct, you can have more than 2 alternatives A or B or C or D. So, for our previous external view of flow graph, using Coproduct, it could be written as:

Flow[
  A1 :+: A2 :+: A3 :+: CNil,
  B1 :+: B2 :+: B3 :+: CNil
]

HList allows to build heterogenous List of elements keeping & tracking all types at compile time. For our previous list of flows, it would fit quite well as we want to match all input/output types of all flows. It would give:

Flow[A1, B1] :: Flow[A2, B2] :: Flow[A3, B3] :: HNil

So, from an external point of view, the process of building our dispatcher/flows/merger flow graph looks like a Function taking aHlist of flowsin input and returning the builtFlow of Coproducts`:

Flow[A1, B1] :: Flow[A2, B2] :: Flow[A3, B3] :: HNil =>
  Flow[A1 :+: A2 :+: A3 :+: CNil, B1 :+: B2 :+: B3 :+: CNil]

Let’s write it in terms of Shapeless Scala code:

/**
 * Builds at compile-time a fully typed-controlled flow that transforms a HList of Flows to a Flow of the Coproduct of inputs to Coproduct of outputs.
 *
 * @param a Hlist of flows Flow[A1, B1] :: FLow[A2, B2] :: ... :: Flow[An, Bn] :: HNil
 * @return the flow of the Coproduct of inputs and the Coproduct of outputs Flow[A1 :+: A2 :+: ... :+: An :+: CNil, B1 :+: B2 :+: ... +: Bn :+: CNil, Unit]
 */
def coproductFlow[HL <: HList, CIn <: Coproduct, COut <: Coproduct](
  flows: HL
): Flow[CIn, COut, Unit]

Fantastic !!!

Now the question is how can we build at compile-time this Flow[CIn, COut, Unit] from an HList of Flows and be sure that the compiler checks all links are correctly typed and all types are managed by the provided flows?

Akka-Stream Graph Mutable builders

An important concept in Akka-Stream is the separation of concerns between:

constructing/describing a data-flow
materializing with live resources (like actor system)
running the data-flow by plugging live sources/sinks on it (like web, file, hdfs, queues etc…).

For the curious, you find the same idea in scalaz-stream but in a FP-purer way as scalaz-stream directly relies on Free concepts that formalize this idea quite directly.

Akka-Stream has taken a more custom way to respond to these requirements. To build complex data flows, it provides a very nice DSL described here. This DSL is based on the idea of a mutable structure used while building your graph until you decide to fix it definitely into an immutable structure.

An example from the doc:

val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder[Unit] =>
  import FlowGraph.Implicits._
  val in = Source(1 to 10)
  val out = Sink.ignore

  val bcast = builder.add(Broadcast[Int](2))
  val merge = builder.add(Merge[Int](2))

  val f1, f2, f3, f4 = Flow[Int].map(_ + 10)

  in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
              bcast ~> f4 ~> merge
}

builder is the mutable structure used to build the flow graph using the DSL inside the {...} block.

The value g is the immutable structure resulting from the builder that will later be materialized and run using live resources.

Please remark that once built, g value can reused and materialized/run several times, it is just the description of your flow graph.

This idea of mutable builders is really interesting in general: mutability in the small can help a lot to make your building block efficient and easy to write/read without endangering immutability in the large.

Hacking mutable builders with Shapeless

My intuition was to hack these mutable Akka-Stream builders using Shapeless type-dependent mechanics to build a Flow of Coproducts from an HList of Flows…

Let’s show the real signature of coproductFlow:

def coproductFlow[HL <: HList, CIn <: Coproduct, COut <: Coproduct, CInOutlets <: HList, COutInlets <: HList](
  flows: HL
)(
  implicit
    flowTypes: FlowTypes.Aux[HL, CIn, COut],
    obuild: OutletBuilder.Aux[CIn, CInOutlets],
    ibuild: InletBuilder.Aux[COut, COut, COutInlets],
    otrav: ToTraversable.Aux[CInOutlets, List, Outlet[_]],
    itrav: ToTraversable.Aux[COutInlets, List, Inlet[COut]],
    selOutletValue: SelectOutletValue.Aux[CIn, CInOutlets],
    flowBuilder: FlowBuilderC.Aux[CIn, COut, CInOutlets, HL, COutInlets]
): Flow[CIn, COut, Unit]

Frightening!!!!!!!

No, don’t be, it’s just the transcription in types of the requirements to build the full flow.

def coproductFlow[HL <: HList, CIn <: Coproduct, COut <: Coproduct, CInOutlets <: HList, COutInlets <: HList](
  flows: HL
)(
  implicit
    // 1 - Introspects HList of Flows to extract
    //      * all input types in CIn Coproduct
    //      * all output types in COut Coproduct
    flowTypes: FlowTypes.Aux[HL, CIn, COut],

    // 2 - Builds all Akka-Stream outlets branching the FlexiRoute
    //     outlets to the right Flow inlet in the HList
    obuild: OutletBuilder.Aux[CIn, CInOutlets],

    // 3 - Builds all Akka-Stream inlets branching the outlets of each Flow
    //     in the HList to the right inlet of FlexiMerge
    ibuild: InletBuilder.Aux[COut, COut, COutInlets],

    // 4 - Technical structures to be able to traverse
    //     and select all those previously built thingies
    otrav: ToTraversable.Aux[CInOutlets, List, Outlet[_]],
    itrav: ToTraversable.Aux[COutInlets, List, Inlet[COut]],
    selOutletValue: SelectOutletValue.Aux[CIn, CInOutlets],

    // 5 - The AkkaStream mutable Builder that:
    //      * builds the input FlexiRoute with the right output types
    //      * plugs all FlexiRoute outlets to the right flow inlet
    //      * builds the output FlexiMerge with the right input types
    //      * plugs all flow outlets to the right inlet of FlexiMerge
    flowBuilder: FlowBuilderC.Aux[CIn, COut, CInOutlets, HL, COutInlets]
): Flow[CIn, COut, Unit]

The Scala code might seem a bit ugly to a few of you. That’s not false but keep in mind what we have done: mixing shapeless-style recursive implicit typeclass inference with the versatility of Akka-Stream mutable builders… And we were able to build our complex flow graph, to check all types and to plug all together at compile-time…

Sample

// 1 - Create a type alias for your coproduct
type C = Int :+: String :+: Boolean :+: CNil

// The sink to consume all output data
val sink = Sink.fold[Seq[C], C](Seq())(_ :+ _)

// 2 - a sample source wrapping incoming data in the Coproduct
val f = FlowGraph.closed(sink) { implicit builder => sink =>
  import FlowGraph.Implicits._
  val s = Source(() => Seq(
    Coproduct[C](1),
    Coproduct[C]("foo"),
    Coproduct[C](2),
    Coproduct[C](false),
    Coproduct[C]("bar"),
    Coproduct[C](3),
    Coproduct[C](true)
  ).toIterator)

// 3 - our typed flows
  val flowInt = Flow[Int].map{i => println("i:"+i); i}
  val flowString = Flow[String].map{s => println("s:"+s); s}
  val flowBool = Flow[Boolean].map{s => println("s:"+s); s}

// >>>>>> THE IMPORTANT THING
// 4 - build the coproductFlow in a 1-liner
  val fr = builder.add(ShapelessStream.coproductFlow(flowInt :: flowString :: flowBool :: HNil))
// <<<<<< THE IMPORTANT THING

// 5 - plug everything together using akkastream DSL
  s ~> fr.inlet
       fr.outlet ~> sink
}

// 6 - run it
f.run().futureValue.toSet should equal (Set(
  Coproduct[C](1),
  Coproduct[C]("foo"),
  Coproduct[C](2),
  Coproduct[C](false),
  Coproduct[C]("bar"),
  Coproduct[C](3),
  Coproduct[C](true)
))

FYI, Shapeless Coproduct provides a lot of useful operations on Coproducts such as unifying all types or merging Coproducts together.

Some compile errors now ?

Imagine you forget to manage one type of the Coproduct in the HList of flows:

...

// 4 - build the coproductFlow in a 1-liner
  val fr = builder.add(ShapelessStream.coproductFlow(flowInt :: flowString :: HNil))

..

If you compile, it will produce this:

ShapelessExtensionsSpec.scala:97: overloaded method value ~> with alternatives:
[error]   (to: akka.stream.SinkShape[C])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit <and>
[error]   (to: akka.stream.Graph[akka.stream.SinkShape[C], _])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit <and>
[error]   [Out](flow: akka.stream.FlowShape[C,Out])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   [Out](junction: akka.stream.UniformFanOutShape[C,Out])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   [Out](junction: akka.stream.UniformFanInShape[C,Out])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   [Out](via: akka.stream.Graph[akka.stream.FlowShape[C,Out],Any])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   (to: akka.stream.Inlet[C])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit
[error]  cannot be applied to (akka.stream.Inlet[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.CIn]]])
[error]       s ~> fr.inlet
[error]         ^
[error] /Users/pvo/workspaces/mfg/akka-stream-extensions/extensions/shapeless/src/test/scala/ShapelessExtensionsSpec.scala:98: overloaded method value ~> with alternatives:
[error]   (to: akka.stream.SinkShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]]])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit <and>
[error]   (to: akka.stream.Graph[akka.stream.SinkShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]]], _])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit <and>
[error]   [Out](flow: akka.stream.FlowShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]],Out])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   [Out](junction: akka.stream.UniformFanOutShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]],Out])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   [Out](junction: akka.stream.UniformFanInShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]],Out])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   [Out](via: akka.stream.Graph[akka.stream.FlowShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]],Out],Any])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])akka.stream.scaladsl.FlowGraph.Implicits.PortOps[Out,Unit] <and>
[error]   (to: akka.stream.Inlet[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]]])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit
[error]  cannot be applied to (sink.Shape)
[error]            fr.outlet ~> sink
[error]                      ^

OUCHHHH, this is a mix of the worst error of Akka-Stream and the kind of errors you get with Shapeless :)

Don’t panic, breathe deep and just tell yourself that in this case, it just means that your types do not fit well

In general, the first line and the last lines are the important ones.

For input:

ShapelessExtensionsSpec.scala:97: overloaded method value ~> with alternatives:
[error]   (to: akka.stream.SinkShape[C])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit <and>
[error]  cannot be applied to (akka.stream.Inlet[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.CIn]]])
[error]       s ~> fr.inlet
[error]         ^

It just means you try to plug a C == Int :+: String :+: Bool :+: CNil to a Int :+: String :+: CNil and the compiler is angry against you!!

For output:

ShapelessExtensionsSpec.scala:98: overloaded method value ~> with alternatives:
[error]   (to: akka.stream.SinkShape[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]]])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit <and>
[error]   (to: akka.stream.Inlet[shapeless.:+:[Int,shapeless.:+:[String,com.mfglabs.stream.extensions.shapeless.FlowTypes.last.COut]]])(implicit b: akka.stream.scaladsl.FlowGraph.Builder[_])Unit
[error]  cannot be applied to (sink.Shape)
[error]            fr.outlet ~> sink

It just means you try to plug a Int :+: String :+: CNil to a C == Int :+: String :+: Bool :+: CNil and the compiler is 2X-angry against you!!!

Conclusion

Mixing the power of Shapeless compile-time type dependent structures and Akka-Stream mutable builders, we are able to build at compile-time a complex dispatcher/flows/merger flow graph that checks all types and all flows correspond to each other and plugs all together in a 1-liner…

This code is the first iteration on this principle but it appeared to be so efficient and I trusted the mechanism so much (nothing happens at runtime but just at compile-time) that I’ve put that in production two weeks ago. It runs like a charm.

Finally, there are a few specificities/limitations to know:

Wrapping input data into the Coproduct is still the boring part with some pattern matching potentially. But this is like Json/Xml validation, you need to validate only the data you expect. Yet I expect to reduce the work soon by providing a Scala macro that will generate this part for you as it’s just mechanical…
Wrapping everything in Coproduct could have some impact on performance if what you expect is pure performance but in my use-cases IO are so much more impacting that this is not a problem…
coproductFlow is built with a custom FlexiRoute with DemandFromAll condition & FlexiMerge using ReadAny condition. This implies :
- the order is NOT guaranteed due to the nature of used FlexiRoute & FlexiMerge and potentially to the flows you provide in your HList (each branch flow has its own parallelism/buffer/backpressure behavior and is not necessarily a 1-to-1 flow).
- the slowest branch will slow down all other branches (as with a broadcast). To manage these issues, you can add buffers in your branch flows to allow other branches to go on pulling input data

The future?

A macro generating the Coproduct wrapping flow
Some other flows based on Shapeless

Have more backpressured and typed fun…

FreeR - Hybrid Free Monads for Reduced Quadratic Complexity/Observability & Map-Fusion Optimization in Scala

2015-04-09T23:23:00+02:00

Draft FreeR code is on Github

Introduction

I’ve recently pushed some Free code & doc to the cool project cats and I had a few more ideas in my head on optimizing Free and never took time to make them concrete. I’ve just found this time during my holidays…

Free Monad is often used to represent embedded DSL in functional programming languages like Haskell or Scala. One generally represents his grammar with a simple Functor ADT representing the available operations. Then, from within your programming language, Free Monad provides the facilities to:

build in a monadic & stack-safe way a static program composed of a sequence of operations,
compile this program,
run it.

To know more about the way to use Free and some more specific theory, please refer to recent draft doc I’ve pushed on cats

The well-known classic representation in Scala is the following:

sealed abstract class Free[F[_], A]
case class Pure[F[_], A](a: A) extends Free[F, A]
case class Suspend[F[_], A](a: F[Free[F, A]]) extends Free[F, A]

Please note that F[_] should be a Functor to take advantage of Free construction.

Building a program can then just be a classic sequence of monadic bind/flatMap on Free[S[_], _]:

for {
  a <- doA(...)
  b <- doB(a, ...)
  c <- doC(c, ...)
} yield (c)

This actually constructs a recursive structure looking like:

Suspend(F(Suspend(F(Suspend(F(....(Pure(a))))))))

It can be seen as a left-associated sequence of operations and as any left-associated structure, appending an element to it has a quadratic complexity. So the more you flatMap, the longer (in n²) it will take to drill down the structure.

So if you try such code:

def gen[I](i: I): Trampoline[I] = Trampoline.suspend(Trampoline.done(i))

//(a flatMap (b flatMap (c flatMap (...))))
def lftBind[S[_]](n: Int)(gen: Int => S[Int])(implicit M: Monad[S]) = {
  (1 to n).foldLeft(gen(0)){ case (acc, i) => acc flatMap { a => gen(i) } }
}

You will see that it has a quadratic curve in terms of execution time when you increase n.

First weakness of classic Free is its left-associativity that induces a quadratic complexity when flatMapping

Solving left-associated quadratic complexity

To solve it, the immediate idea is to make Free right-associative instead of left associative (This idea was proposed by Kiselyov & al. in a paper and called Continuation-Passing-Style or also Codensity construction)

This is already done in current scalaz/cats.Free by adding a new element to the Free ADT:

/** Call a subroutine and continue with the given function. */
sealed abstract class Gosub[S[_], B] extends Free[S, B] {
  type C
  val a: () => Free[S, C]
  val f: C => Free[S, B]
}

If you test the same previous code, it has a linear behavior when n increases.

Quadratic observability

In this great paper Reflection Without Remorse, Atze van der Ploeg & Oleg Kiselyov show that classic Free are subject to another tricky quadratic behavior when, within your sequence of operations, one need to observe the current internal state of Free.

Observing the state requires to drill down the recursive Free structure explicitly and go up and again down then up and again and again. As explained in the paper, this case is quite tricky because it’s very hard to see that it will happen. The deeper is the structure, the longer it takes to observe current state. It’s important to note that right-association doesn’t help in this case and that complexity is once again in O(n²).

The second weakness of Free is its quadratic complexity when observing internal state.

To solve it, in Reflection Without Remorse, they propose a very interesting approach by changing the representation of Free and take advantage of its sequential nature.

A Free becomes the association of 2 elements:

a FreeView representing current internal state of the Free
a sequence of bind/flatMap functions stored in an efficient data structure that can prepent/append in O(1).

For the data structure, they propose to use a type-aligned dequeue to keep track of all types.

I have tried to implement this structure using a typed-aligned FingerTree in Scala. The code is here. The result is pretty interesting but not much efficient: it has a linear behavior for left-association & observability but…

for lower values of n, FingerTree costs far too much to build
memory cost is so high that it triggers the JVM GC far too soon and limitates a lot what you can do
allocated memory locality isn’t good and requires huge amounts of memory jumps.
type-alignment makes code quite ugly (yes Scala type inference isn’t as powerful as Haskell in this case and laziness of Haskell helps a lot for FingerTree)

As a conclusion, the idea is really nice on paper but in practice, we need to find something that costs less than this type-aligned dequeue (even if my FingerTree code is really raw, too strict and not optimized at all).

FreeR hybrid structure

I wanted to improve Free behavior and decided to create a new version of it called FreeR thinking in terms of efficient Scala…

I really liked the idea of representing a Free as a pure sequence of operations with a view of current internal state.

To gain in efficiency, I decided to choose another efficient append/prepend data structure, optimized and very well known: Vector providing:

append/prepend in O(1) (in average),
random access in constant time,
quite good locality.

Then, I’ve decided to relax a lot type alignment and manipulate Any values internally and cast/reify to the right types when required.

BTW, I plagiarized some code written by Alois Cochard for his new IO model in Scalaz/8.0 branch… Alois is a great dev & had made concrete the idea I had in my head so why rewrite them differently? uh ;)

I also decided to reify the 2 kinds of operations:

Bind for flatMap/bind calls
Map for map calls

So a Free becomes:

// to mitigate the shock of Any in the code
// and provide helpers for casting/reifying
type Val = Any

case class FreeR[S[_], A](
  head: FreeView[S, Val],
  ops: Ops = Vector.empty
) extends FreeR[S, A]

with FreeView as:

// The view of Free internal state can be of 2 types:
sealed abstract class FreeView[S[_], A]

object FreeView {
  // Pure value
  case class Pure[S[_], A](a: A) extends FreeView[S, A]

  // Impure computation determined by the Functor S
  case class Impure[S[_], A](a: S[FreeR[S, A]]) extends FreeView[S, A]
}

and Ops are:

sealed trait Op
object Op {
  case class Map(f: Val => Val) extends Op
  case class Bind[S[_]](f: Val => FreeR[S, Val]) extends Op
}

FYI This code is less than 300 lines so nothing really horrible except a few ugly casts ;)

Left Association

The code used for testing can be found here

FreeR behavior is linear even for millions of flatMap (until the GC triggers naturally) whereas classic Free has clearly quadratic curve.

Observability

The code used for testing can be found here

FreeR behavior is quite linear even for millions of flatMap (until the GC triggers naturally) whereas classic Free has clearly quadratic curve.

Right association complexity

I finally tried to check the behavior of my new FreeR when using flatMap in a right associated way like:

// (... flatMap (_ => c flatMap (_ => b flatMap (_ => a))))
def rgtBind[S[_]](n: Int)(gen: Int => S[Int])(implicit M: Monad[S]) = {
  (1 to n).foldLeft(gen(n)){ case (acc, i) => gen(n-i) flatMap { _ => acc } }
}

This is not so frequent code but anyway, Free should be efficient for left & right associated code.

Using FreeR as described previously, I’ve discovered that it wasn’t efficient in right association when increasing n because it allocates recursively a lot of Vector with one element and it becomes slower and slower apparently (I’m not even sure of the real cause of it).

I’ve refined my representation by distinguishing 3 kinds of Free in my ADT:

// No op
case class FreeR0[S[_], A](head: FreeView[S, Val]) extends FreeR[S, A]

// One single op (typically the right association case)
case class FreeR1[S[_], A](head: FreeView[S, Val], op: Op) extends FreeR[S, A]

// Multiple ops
case class FreeRs[S[_], A](head: FreeView[S, Val], ops: Ops = Vector.empty) extends FreeR[S, A]

With this optimization, here is the performance in right association:

It is quite comparable to classic Free for n under 1 million but it becomes quite bad when n becomes big. Yet, it remains for more efficient than previous representation with just Vector.

I need to work more on this issue (apparently GC is triggered too early) to see if more optimizations for right association can be found…

Cherry on the cake: map-fusion optimization

Imagine doing a lot of map operations on a Free like:

def mapalot(n: Int): Trampoline[Long] = {
  (1 to n).foldLeft(Trampoline.done(0L)){ case (acc, i) => acc map { a => a + 1 } }
}

If you think just a bit, you will clearly see that:

a.map(f).map(g) == a.map(g compose f)

This is called map-fusion and as you may have deduced already, my decision to reify explicitly Bind and Map operations was made in this purpose.

If I can know there are several Map operations in a row, I can fusion them in one single Map by just calling mapFusion on a Free to optimize it:

val free = FreeRTools.mapalot(x)

// optimized free
val freeOpt = free.mapFusion

// run the trampoline
freeOpt.run

Here is the difference in performance between FreeR and FreeR.mapFusion:

As you can see, mapFusion can be very interesting in some cases.

Conclusion

Finally, I have created a new representation of Free using:

type-relaxed version of Reflection w/o Remorse
sequence of operations managed by Scala Vector
reification of Bind & Map operations
differenciation of None/Single/Multiple operations cases
Map Fusion optimization

It allows to have a Free with:

Linear behavior in Left-Assocation, Observability,
Stack-safety is sill ensured,
Right-Association should be optimized because it still has a too high cost for bigger n (yet it is far more acceptable than other alternatives),
Map Fusion can provide an interesting optimization when using multiple consecutive Map operations,
For small n, the cost is a bit higher than basic Free but quite low and acceptable.

It is really interesting as it makes Free more and more usable in real-life problems without having to rewrite the code bootstrapped with Free in a more optimized way. I personally find it quite promising!

Please remark that this code has been written for the great project cats that will soon be a viable & efficient alternative for functional structures in Scala.

The full code is there.

Don’t hesitate to test, find bugs, contribute, give remarks, ideas…

Have fun in FreeR world…

Scaledn: Promote EDN as a far better alternative to Json in Scala

2014-12-21T23:23:00+01:00

SCALEDN, EDN Scala API

Scaledn is a Scala EDN parser (runtime & compile-time)/serializer/validator based on :

It works only in Scala 2.11.x

The code & sample apps can be found on Github

Why EDN?…

Because Json is not so good & quite limitating

EDN is described as an extensible data notation specified (not really standardized) there. Clojure & Datalog used in Datomic are supersets of EDN.

EDN allows much more things than Json while keeping the same simplicity.

Here are the main points making EDN great to represent & exchange Data.

EDN manages number types far better than Json

For Json, all numbers (floating or integer, exponential or not) are all considered in the same way so numbers can only be mapped to the biggest number format: BigDecimal. It is really bad in terms of semantics and performance.

In EDN, numbers can be :

64bits integer aka Long in Scala: 12345
64bits floating point numbers & exponentials aka Double in Scala: 123.45e-9
Natural Integers aka BigInt in Scala: 1234567891234N
Exact Floating Number aka BigDecimal in Scala: 123.4578972345M

EDN knows much more about collections

Collections in Json are just:

lists of heterogenous json values
maps of key strings and json values.

In EDN, you can have:

heterogenous lists

(1 true "toto)

heterogenous vectors/arrays

[1 true "toto]

heterogenous sets

#{1 true "toto}

heterogenous maps with heterogenous keys & values

{1 "toto", "foo" 2}

EDN accepts characters & unicode

Json doesn’t know about characters outside strings.

EDN can manage chars:

// simple char
\c

// special chars
\newline
\return
\space
\tag
\\

// unicode
\u0308

EDN accepts comments & discarded values

There are special syntaxes:

comments are lines starting with ;

; this is a comment

values starting with #_ are parsed but discarded

"toto" 3 #_discarded 1.234

EDN knows about symbols & keywords

These are notions that don’t exist in Json.

Symbols can reference anything external or internal that you want to identify. A Symbol can have a namespace such as:

foo.foo2/bar

Keywords are unique identifiers or enumerated values that can be reused in your data structure. A Keyword is just a symbol preceded by a : such as

:foo.foo2/bar

EDN is extensible using tags

EDN is an extensible format using tags starting with # such as:

#foo/bar value

When parsing EDN format, the parser should provide tag handlers that can be applied when a tag is discovered. In this way, you can extend default format with your own formats.

EDN specifies 2 tag handlers by default:

#inst "1985-04-12T23:20:50.52Z" for RFC-3339 instants
#uuid "f81d4fae-7dec-11d0-a765-00a0c91e6bf6" for UUID

EDN has no root node & can be streamed

Json is defined to have a root map node: { key : value } or [ ... ].

Json can’t accept single values outside of this. So Json isn’t really meant to be streamed as you need to find closing tags to finish parsing a value.

EDN doesn’t require this and can consist in multiple heterogenous values:

1 123.45 "toto" true nil (1 2 3)

As a consequence, EDN can be used to stream your data structures.

Conclusion: EDN should be preferred to Json

All of these points make EDN a far better & stricter & more evolutive notation to represent data structures than Json. It can be used in the same way as Json but could make a far better RPC string format than Json.

I still wonder why Json has become the de-facto standard except for the reason that the not so serious Javascript language parses it natively and because people were so sick about XML that they would have accepted anything changing their daily life.

But JS could also parse EDN without any problem and all more robust & typed backend languages would earn a lot from using EDN instead of JSON for their interfaces.

EDN could be used in REST API & also for streaming API. That’s exactly why, I wanted to provide a complete Scala API for EDN to test this idea a bit further.

Scaledn insight

Runtime Parsing

Scaledn can be used to parse the EDN string or arrays of chars received by your API.

All types described in EDN format are isomorphic to Scala types so I’ve decided to skip the complete AST wrapping those types and directly parse to Scala types.

"foobar" is parsed to String
123 is parsed to Long
(1 2 3) is parsed to List[Long]
(1 "toto" 3) is parsed to List[Any]
{"toto" 1 "tata" 2} is parsed to Map[String, Long]
{1 "toto" 2 "tata"} is parsed to Map[Long, String]
{1 "toto" true 3} is parsed to Map[Any, Any]
etc…

The parser (based on Parboiled2) provides 2 main functions:

import scaledn._
import parser._

// parses only the first EDN value discovered in the String input
def parseEDN(in: ParserInput): Try[EDN] = ...

// parses all EDN values discovered in the String input
def parseEDNs(in: ParserInput): Try[Seq[EDN]] = ...

If you look in common package, you’ll see that EDN is just an alias for Any ;)

Here is how you can use it:

import scaledn._
import parser._

// Single Value
parseEDN("""{1 "foo", "bar" 1.234M, :foo/bar [1,2,3]} #_foo/bar :bar/foo""") match {
  case Success(t) => \/-(t)
  case Failure(f : org.parboiled2.ParseError) => -\/(parser.formatError(f))
}

// Multiple Value
parseEDNs("""{1 "foo", "bar" 1.234M, :foo/bar [1,2,3]} :bar/foo""").success.value should be (
  Vector(
    Map(
      1L -> "foo",
      "bar" -> BigDecimal("1.234"),
      EDNKeyword(EDNSymbol("foo/bar", Some("foo"))) -> Vector(1, 2, 3)
    ),
    EDNKeyword(EDNSymbol("bar/foo", Some("bar")))
  )
))

Some people will think Any is a bit too large and I agree but it’s quite practical to use. Moreover, using validation explained a bit later, you can parse your EDN and then map it to a stronger typed scala structure and then Any disappears.

Compile-time parsing with Macros

When you use static EDN structures in your Scala code, you can write them in their string format and scaledn can parse them at compile-time using Scala macros and thus prevent a lot of errors you can encounter in dynamic languages.

The macro mechanism is based on quasiquotes & whitebox macro contexts which allow to infer types of your parsed EDN structures at compile-time. For example:

> val s:Long = EDN("\"toto\"")

[error]  found   : String("toto")
[error]  required: Long
[error]     val e: Long = EDN("\"toto\"")

Whooohooo magic :)

Classic Scala types

Here is how you can use it:

import scaledn._
import macros._

// All types are just for info and can be omitted below, the macro infers them quite well
val e: String = EDN("\"toto\"")

val bt: Boolean = EDN("true")

val bf: Boolean = EDN("false")

val l: Long = EDN("123")

val d: Double = EDN("123.456")

val bi: BigInt = EDN("123M")

val bd: BigDecimal = EDN("123.456N")

val s: EDNSymbol = EDN("foo/bar")

val kw: EDNKeyword = EDN(":foo/bar")

// Homogenous collection inferred as Vecto[String]
val vector: Vector[String] = EDN("""["tata" "toto" "tutu"]""")

// multiple heterogenous values inferred as Seq[Any]
val s = EDNs("""(1 2 3) "toto" [true false] :foo/bar""")
// note the small s at the end of EDN to inform the macro there are several values

Shapeless heterogenous collections

EDN allows to manipulate heterogenous collections. In Scala, when one thinks heterogenous collection, one thinks Shapeless. Scaledn macros can parse & map your EDN stringified structures to Scala strongly typed structures.

import scaledn._
import macros._

import shapeless.{HNil, ::}
import shapeless.record._
import shapeless.syntax.singleton._

// Heterogenous list
val s = EDNH("""(1 "toto" true)""")
s should equal (1L :: "toto" :: true :: HNil)

// Heterogenous Map/Record
val s3 = EDNH("""{1 "toto" true 1.234 "foo" (1 2 3)}""")
s3 should equal (
  1L ->> "toto" ::
  true ->> 1.234 ::
  "foo" ->> List(1L, 2L, 3L) ::
  HNil
)

please note the H in EDNH for heterogenous

I must say using these macros, it might be even simpler to write Shapeless hlists or records than using scala API ;)

Macro API

Scaledn provides different macros depending on the depth of introspection you require in your collection with respect to heterogeneity.

Have a look directly at Macro API

Mixing macro with Scala string interpolation

Following ideas implemented by Daniel James in Datomisca, scaledn proposes to use String interpolation mixed with parsing macro such as:

import scaledn._
import macros._

import shapeless.{HNil, ::}

val l = 123L
val s = List("foo", "bar")

val r: Long = EDN(s"$l")

val r1: Seq[Any] = EDN(s"($l $s)")
val r2: Long :: List[String] :: HNil = EDNH(s"($l $s)")

Nothing to add, macros are cool sometimes :)

Runtime validation of EDN to Scala

When writing REST or external API, the received data can never be trusted before being validated. So, you generally try to validate what is received and map it to a strong-typed structures. For example:

// parse the received string input
parseEDN("""{ 1 "toto" 2 "tata" 3 "tutu" }""")
// then validate it to a Scala type
  .map(validate[Map[Long, String]])
  .success.value should be (
    play.api.data.mapping.Success(Map(
      1L -> "toto",
      2L -> "tata",
      3L -> "tutu"
    ))
  )

The validation API is the following:

import scaledn._
import validate._

def validate[T](edn: EDN)(implicit r: RuleLike[EDN, T]): Validation[EDN, T] = r.validate(edn)

Scaledn validation is based on Generic Validation API developed by my MFGLabs’s colleague & friend Julien Tournay. This API was developed for Play Framework & Typesafe last year to generalize Json validation API to all data formats. But it will never be integrated in Play as Typesafe considers it to be too pure Scala & pure FP-oriented. Yet, we use this API in production at MFGLabs and maintain/extend it ourselves.

As explained before, Scaledn parser parses EDN values directly to Scala types as they are bijective so validation is often just a runtime cast and not very interesting in general.

What’s much more interesting is to validate to Shapeless HList, Records and even more interesting to CaseClasses & Tuples based on Shapeless fantastic auto-generated Generic macros.

Let’s take a few examples to show the power of this feature:

import scaledn._
import validate._

import play.api.data.mapping._
import shapeless.{HNil, ::}

case class CP(cp: Int)
case class Address(street: String, cp: CP)
case class Person(name: String, age: Int, addr: Address)
// Remark that NO implicits must be declared on our case classes

// HLISTS
parseEDN("""(1 "toto" true nil)""").map(
  validate[Long :: String :: Boolean :: EDNNil.type :: HNil]
).success.value should be (
  Success(1L :: "toto" :: true :: EDNNil :: HNil)
)

// TUPLES
parseEDN("""("toto" 34 {"street" "chboing", "cp" {"cp" 75009}})""").map(
  validate[Tuple3[String, Int, Address]]
).success.value should be (
  Success(("toto", 34, Address("chboing", CP(75009))))
)

// CASECLASSES
parseEDN("""("toto" 34 ("chboing" (75009)))""").map(
  validate[Person]
).success.value should be (
  Success(Person("toto", 34, Address("chboing", CP(75009))))
)

parseEDN("""{"name" "toto", "age" 34, "addr" {"street" "chboing", "cp" {"cp" 75009}}}""").map(
  validate[Person]
).success.value should be (
  Success(Person("toto", 34, Address("chboing", CP(75009))))
)

I think here you can see the power of this validation feature without writing any boilerplate…

Serializing Scala to EDN

Using Generic Validation API, you can also write scala structures to any other data format.

Scaledn provides serialization from scala structures to EDN Strings. For example:

import scaledn._
import write._

toEDNString("toto") should equal ("\"toto\"")
toEDNString(List(1, 2, 3)) should equal ("""(1 2 3)""")

The write API is the following:

import scaledn._
import write._

def toEDNString[I](i: I)(implicit w: WriteLike[I, String]): String = w.writes(i)

Once again, what’s more interesting is using shapeless & caseclasses & tuples.

import scaledn._
import write._

import shapeless.{HNil, ::}

// HLIST
toEDNString(1 :: true :: List(1L, 2L, 3L) :: HNil) should equal ("""(1 true (1 2 3))""")

// TUPLE
toEDNString((23, true)) should equal ("""(23 true)""")

// CASE CLASS
case class Address(street: String, cp: Int)
case class Person(name: String, age: Int, addr: Address)
// Remark that NO implicits must be declared on our case classes

toEDNString(Person("toto", 34, Address("chboing", 75009))) should equal (
  """{"name" "toto", "age" 34, "addr" {"street" "chboing", "cp" 75009}}"""
)

TODO

This project is a first draft so it requires a bit more work.

Here are a few points to work on:

patch remaining glitches/bugs
write more tests for all cases
study streamed parser asap
write sample apps

Don’t hesitate to test, find bugs, contribute, give remarks, ideas…

Have fun in EDN world…

Shapeless HFunctor for Heterogenous Structures (& others)

2014-07-30T08:08:00+02:00

Not an article, just some reflections on this idea…

You know what is a functor?

2 categories C & D, simplifying a category as:
- a set of objects with some arrows/morphisms/functions between objects: f: x -> y
- those morphisms are associative h . (g . f) = (h . g) .f where . is the composition (g . f)(x) = g(f(x)))
- for each object there is an identity morphism id(x) = x -> x
a functor F associates :
- each object x of C with an object of F(x) of D.
- each morphism f: x -> y of C with an element F(f): F(x) -> F(y) of D such that:
  - F(id(x)) = id(F(x))
  - F(g . f) = F(g) . F(f)

A Functor is a mapping (an homomorphism) between categories that conserves the structure of the category (the morphisms, the relation between objects) whatever the kind of objects those categories contain.

In scalaz, here is the definition of a Functor:

trait Functor[F[_]] {
  ////

  /** Lift `f` into `F` and apply to `F[A]`. */
  def map[A, B](fa: F[A])(f: A => B): F[B]
  ...
}

You can see the famous map function that you can find in many structures in Scala : List[_], Option[_], Map[_, _], Future[_], etc…

Why? Because all these structures are Functors between the categories of Scala types…

Math is everywhere in programming & programming is Math so don’t try to avoid it ;)

So you can write a Functor[List] or Functor[Option] as those structures are monoids.

Now let’s consider HList the heterogenous List provided by Miles Sabin’s fantastic Shapeless. HList looks like a nice Functor.

(1 :: true :: HNil) map (
   (i:Int)     => i.toString ,
   (b:Boolean) => b.toString)
) => ("1" :: "true" :: HNil)

Ok, it’s a bit more complex as this functor requires not one function but several for each element type constituting the HList, a kind of polymorphic function. Hopefully, Shapeless provides exactly a structure to represent this: Poly

What about writing a functor for HList?

Scalaz Functor isn’t very helpful (ok I just copy the HMonoid text & tweak it ;)):

To be able to write a Functor of HList, we need something else based on multiple different types…

I spent a few hours having fun on this idea with Shapeless and tried to implement a Functor for heterogenous structures like HList, Sized and even not heterogenous structures.

Here are the working samples.

Here is the code based on pseudo-dependent types as shapeless.

The signature of the HFunctor as a map function as expected:

trait HFunctor[HA, F <: Poly] {
    type Real

    trait HMapper[P <: Poly, In] extends DepFn1[In] { type Out }

    val hmapper: HMapper[F, HA]

    def map(ha: HA)(f: F): hmapper.Out = hmapper(ha)
  }

This is just a sandbox to open discussion on this idea so I won’t explain more and let the curious ones think about it…

Have F(un)!

Shapeless HMonoid for Heterogenous Structures (& others)

2014-07-29T23:23:00+02:00

Not an article, just some reflections on this idea…

You know what is a monoid?

a binary operation taking 2 elements and giving another element e x e -> e (aka a SemiGroup)
a Identity id element id . e = e . id = e (also called zero element)

(and some associativity)

In scalaz, here is the definition:

trait Monoid[F] extends Semigroup[F] { self =>
  ////
  /** The identity element for `append`. */
  def zero: F
  def append(f1: F, f2: => F): F
  ...
}

You can see the zero & the SemiGroup.append operations, right?

So you can write a Monoid[Int] or Monoid[List[A]] as those structures are monoids.

Now let’s consider HList the heterogenous List provided by Miles Sabin’s fantastic Shapeless. HList looks like a nice monoid.

(1 :: "toto" :: HNil) ++ (true :: HNil) => (1 :: "toto" :: true :: HNil)

What about writing a monoid for HList?

Scalaz monoid isn’t very helpful because our monoid operations would look like:

def zero: HNil.type
  def append(f1: H1 <: HList, f2: => H2 <: HList): H3 <: HList

So, to be able to write a Monoid of HList, we need something else based on multiple different types…

I spent a few hours having fun on this idea with Shapeless and tried to implement a Monoid for heterogenous structures like HList, Nat, Sized and even not heterogenous structures.

Here are the working samples.

Here is the code based on pseudo-dependent types as shapeless.

The signature of the HMonoid shows the zero and the Semigroup as expected:

trait HMonoid[A, B] extends HZero with HSemiGroup[A, B]

This is just a sandbox to open discussion on this idea so I won’t explain more and let the curious ones think about it…

Have Monoids of Fun!

Summon Daemonad to macro-snoop into monad stacks

2014-06-11T23:23:00+02:00

The code & sample apps can be found on Github

Forget the buzz title, this project is still very draft but it’s time to expel it out of my R&D sandbox as imperfect as it might be… Before I lose my sanity while wandering in Scala macro hygiene ;)

What?

Daemonad is a nasty Scala macro that aims at:

marking where you manipulate monads or stacks of monads
compile-checking monadic behavior & implicit monad instances
allowing to snoop monad values deep into (some) monad stacks in the same way as ScalaAsync i.e. in a pseudo-imperative way.

This project is NOT yet stable, NOT very robust so use it at your own risks but we can discuss about it…

Here is what you can write right now.

Await.result(
  monadic[Future, Option] {
    val a = Future ( Some(9) )
    val b = Some(7)
    val c = 10
    if(snoop2(a) < 10) snoop1(b) + 10
    else c
  }, duration.Duration("1 second")
) should equal (Some(17))

Motivations

1 - Experiment writing a very ugly Scala macro

I wanted to write a huge & complex Scala macro that would move pieces of code, create more code, types etc…

I wanted to know the difficulties that it implies.

I felt reckless, courageous!

Nothing could stop me!!!!

Result: I was quite insane and I will certainly write a post-mortem article about it to show the horrible difficulties I’ve encountered. My advice: let people like hit their head against the wall and wait for improved hygienic macros that should come progressively before writing big macros ;)

2 - Investigate ScalaAsync generalization to all monads + (some) monad stacks

I had investigated ScalaAsync code and thought it would be possible to generalize it to all kinds of monads and go further by managing monad stacks.

Result : Simple monads are easy to manage (as seen also in scala-workflow which I discovered very recently) and some monad stacks can be managed with Scalaz monad transformers.

But don’t think you can use all kinds of monad transformers: the limits of Scala compiler with type-lambdas in macros and my very own limits blocked me from going as far as I expected.

So for now, it can manage Future/Option/List stacks & also Either \/ using type aliases.

3 - Explicitly Mark monadic blocks

There are 2 ways of seeing monads:

You don’t need or you don’t want to know what is a monad…

… And yet you use it everyday/everywhere.

This is what most of us (and it’s so shameful) do using those cool map/flatMap functions provided by Scala libraries that allow to access the values inside Future, List, Option in a protected way etc… That’s enough for you need in your everyday life, right?

You want to know or you know what is a monad …

… and you want to use them on purpose.

This is what hippy developers do in advanced Scala using Scalaz or even crazier in pure FP languages like Haskell.

Guess what I prefer?

Here is the kind of code I’d like to write :

// I write my datastructure without any map/flatMap function
case class Toto[A](a: A)

// Hey I proved Toto was a monad (yes believe me)

// Let's bring this concept into my scope
implicit object TotoMonad extends Monad[Toto] {
  def bind[A, B](fa: Toto[A])(f: A => Toto[B]): Toto[B] = {
    f(fa.a)
  }

  def point[A](a: => A): Toto[A] = Toto(a)
}

...
// I create my toto
val toto = Toto("this is toto")

...

// Suddenly I decide that I must use this monadic behavior of toto
monadic[Toto] {
  val a = <snoop_value_inside_monad>(toto) // outside the monadic block, you shall not do that
  do_something_with_value(a)
  // The compiler takes care that my structure is used in a pure monadic way
  // and returns a monad Toto of the right type
}
...

Ok I speak about pure functional programming and then about snooping the value out of the monad. This might seem a bit useless or even stupid compared to using directly Monad facilities. I agree and I still wonder about the sanity of this project but I’m stubborn and I try to finish what I start ;)

Back to code Sample

Await.result(
  monadic[Future, Option] {
    val a = Future ( Some(9) )
    val b = Some(7)
    val c = 10
    if(snoop2(a) < 10) snoop1(b) + 10
    else c
  }, duration.Duration("1 second")
) should equal (Some(17))

What does it do ?

monadic marks the monadic block
monadic[Future, Option] declares that you manipulate a stack Future[Option] (and no other)
snoopX means that you want to snoop the monad value at X-th level (1, 2, 3, 4 and no more for now)
the macro checks for implicit instances of monads (here List, Option, Future) and monad transformers (here OptionT & ListT) for this stack
the macro translates this code into crazy embedded Monad.bind/point/lift/run…
snoop2 is used in first position: if you have used snoop1, the macro would have rejected your monadic block. It’s logical, when you use flatMap, you always start with the deeper stack of monad and I chose not to change the order of your code as I find this macro is already far too intrusive :)

I’m sure you don’t want to see the code you would have to write for this, this is quite long and boring.

Let just say that this code is generated by a Scala macro for you.

The current generated code isn’t optimized at all and quite redundant but this is for next iterations.

What is working ?

stacks with List/Option/Either up to depth 4
custom Monads
a few preliminary checkings that prevent you from writing stupid code but not so much
if/then/else & case/match in some cases

What isn’t working ?

monadic block assigned to a val not explicitly typed.
many things or edge-cases with mixed monad depth and if/match.
can’t use advanced monad transformers like StateT or WriterT in monadic block because Scala compiler doesn’t allow what I expected with type lambdas. This needs to be studied further.

A very stupid example to finish with 4-depth stack

it should """snoop4 on stupid stack""" in {
    type S[T] = ({ type l[T] = \/[String, T] })#l[T]
    Await.result(
      monadic[Future, S, List, Option] {
        val a: Future[S[List[Option[Int]]]] = Future(\/-(List(Some(5), Some(10))))
        val b: S[List[Option[Int]]] = \/-(List(Some(1), Some(2)))
        val c: List[Option[Int]] = List(Some(3), Some(4))
        val d: Option[Int] = Some(2)
        (snoop4(a) + snoop3(b) * 2 - snoop2(c)) / snoop1(d)
      }, duration.Duration("1 second")
    ) should equal (\/-(List(Some(2), Some(1), Some(3), Some(2), Some(4), Some(4), Some(5), Some(5))))
  }

Note that:

you have to use a type alias to Scalaz \/ to one parametric type.
you have to help a bit the compiler about type alias S or it will infer \/[A, B] which is not what we want for the monad. This might seem tedious but I’ll study if I can go around this.
look at the result: you have a 2 elements list and at the end, you have a 8 elements list: WHAT???? No it’s normal, this is the result flatmap between first and second and third list. 2*2*2 = 8… nothing strange but it can be surprising at first glance ;)

TODO

refactor all code because it’s ugly, not robust and redundant!!!
rely on MonadTrans[F[_], _] instead of hardcoding monad transformers as now.
accept custom MonadTrans provided in the user code.
steal some inspiration from scala-workflow because I find this code cool.

Special Thanks

Eugene Burmako (@xeno_by) for helping me each time I was lost in macros
Jason Zaugg (@retronym) for Scala Async and splicer
Daniel James (@dwhjames) for the snoop name
Thibaut Duplessis (@ornicar) for the monad stack idea

Have a look at the code on Github.

Have snoop22(macrofun)!

ZPark-Ztream II (Part 3/3): Fancy Spark Streamed Machine-Learning & new Scalaz-Stream NIO API

2014-03-10T19:19:00+01:00

Synopsis

The code & sample apps can be found on Github

Zpark-Zstream I article was a PoC trying to use Scalaz-Stream instead of DStream with Spark-Streaming. I had deliberately decided not to deal with fault-tolerance & stream graph persistence to keep simple but without it, it was quite useless for real application…

Here is a tryptic of articles trying to do something concrete with Scalaz-Stream and Spark.

So, what do I want? I wantttttttt a shrewburyyyyyy and to do the following:

Plug Scalaz-Stream Process[Task, T] on Spark DStream[T] (Part 1)
Build DStream using brand new Scalaz-Stream NIO API (client/server) (Part 2)
Train Spark-ML recommendation-like model using NIO client/server (Part 3)
Stream data from multiple NIO clients to the previous trained ML model mixing all together (Part 3)

[Part 3/3] Fancy Spark Machine Learning with NIO client/server & DStream…

Let’s remind that I’m not an expert in ML but more a student. So if I tell or do stupid ML things, be indulgent ;)

Here is what I propose:

Train a collaborative filtering rating model for a recommendation system (as explained in Spark doc there) using a first NIO server and a client as presented in part 2.
When model is trained, create a second server that will accept client connections to receive data.
Stream/merge all received data into one single stream, dstreamize it and perform streamed predictions using previous model.

Train collaborative filtering model

Training client

As explained in Spark doc about collaborative filtering, we first need some data to train the model. I want to send those data using a NIO client.

Here is a function doing this:

//////////////////////////////////////////////////
// TRAINING CLIENT
def trainingClient(addr: InetSocketAddress): Process[Task, Bytes] = {

  // naturally you could provide much more data
  val basicTrainingData = Seq(
    // user ID, item ID, Rating
    "1,1,5.0",
    "1,2,1.0",
    "1,3,5.0",
    "1,4,1.0",
    "2,4,1.0",
    "2,2,5.0"
  )
  val trainingProcess =
    Process.emitAll(basicTrainingData map (s => Bytes.of((s+"\n").getBytes)))

  // sendAndCheckSize is a special client sending all data emitted 
  // by the process and verifying the server received all data 
  // by acknowledging all data size
  val client = NioClient.sendAndCheckSize(addr, trainingProcess)

  client
}

Training server

Now we need the training NIO server waiting for the training client to connect and piping the received data to the model.

Here is a useful function to help creating a server as described in previous article part:

def server(addr: InetSocketAddress): (Process[Task, Bytes], Signal[Boolean]) = {

  val stop = async.signal[Boolean]
  stop.set(false).run

  // this is a server that is controlled by a stop signal
  // and that acknowledges all received data by their size
  val server =
    ( stop.discrete wye NioServer.ackSize(addr) )(wye.interrupt)

  // returns a stream of received data & a signal to stop server
  (server, stop)
}

We can create the training server with it:

val trainingAddr = NioUtils.localAddress(11100)
//////////////////////////////////////////////////
// TRAINING SERVER
val (trainingServer, trainingStop) = server(trainingAddr)

trainingServer is a Process[Task, Bytes] streaming the training data received from training client. We are going to train the rating model with them.

Training model

To train a model, we can use the following API:

// the rating with user ID, product ID & rating
case class Rating(val user: Int, val product: Int, val rating: Double)

// A RDD of ratings
val ratings: RDD[Rating] = ...

// train the model with it
val model: MatrixFactorizationModel = ALS.train(ratings, 1, 20, 0.01)

Building `RDD[Rating]` from server stream

Imagine that we have a continuous flow of training data that can be very long.

We want to train the model with just a slice of this flow. To do this, we can:

dstreamize the server output stream
run the dstream for some time
retrieve the RDDs received during this time
union all of those RDDs
push them to the model

Here is the whole code with previous client:

val trainingAddr = NioUtils.localAddress(11100)

//////////////////////////////////////////////////
// TRAINING SERVER
val (trainingServer, trainingStop) = server(trainingAddr)

//////////////////////////////////////////////////
// TRAINING CLIENT
val tclient = trainingClient(trainingAddr)

//////////////////////////////////////////////////
// DStreamize server received data
val (trainingServerSink, trainingDstream) = dstreamize(
  trainingServer
      // converts bytes to String (doesn't care about encoding, it shall be UTF8)
      .map  ( bytes => new String(bytes.toArray) )
      // rechunk received strings based on a separator \n 
      // to keep only triplets: "USER_ID,PROD_ID,RATING"
      .pipe (NioUtils.rechunk { s:String => (s.split("\n").toVector, s.last == '\n') } )
  , ssc
)

//////////////////////////////////////////////////
// Prepare dstream output 
// (here we print to know what has been received)
trainingDstream.print()

//////////////////////////////////////////////////
// RUN

// Note the time before
val before = new Time(System.currentTimeMillis)

// Start SSC
ssc.start()

// Launch server
trainingServerSink.run.runAsync( _ => () )

// Sleeps a bit to let server listen
Thread.sleep(300)

// Launches client and awaits until it ends
tclient.run.run

// Stop server
trainingStop.set(true).run

// Note the time after
val after = new Time(System.currentTimeMillis)

// retrieves all dstreamized RDD during this period
val rdds = trainingDstream.slice(
  before.floor(Milliseconds(1000)), after.floor(Milliseconds(1000))
)

// unions them (this operation can be expensive)
val union: RDD[String] = new UnionRDD(ssc.sparkContext, rdds)

// converts "USER_ID,PROD_ID,RATING" triplets into Ratings
val ratings = union map { e =>
  e.split(',') match {
    case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
  }
}

// finally train the model with it
val model = ALS.train(ratings, 1, 20, 0.01)

// Predict
println("Prediction(1,3)=" + model.predict(1, 3))

//////////////////////////////////////////////////
// Stop SSC
ssc.stop()

Run it

-------------------------------------------
Time: 1395079621000 ms
-------------------------------------------
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,4,1.0
2,2,5.0

-------------------------------------------
Time: 1395079622000 ms
-------------------------------------------

-------------------------------------------
Time: 1395079623000 ms
-------------------------------------------

-------------------------------------------
Time: 1395079624000 ms
-------------------------------------------

-------------------------------------------
Time: 1395079625000 ms
-------------------------------------------

Prediction(1,3)=4.94897842056338

Fantastic, we have trained our model in a very fancy way, haven’t we?

Personally, I find it interesting that we can take advantage of both APIs…

Predict Ratings

Now that we have a trained model, we can create a new server to receive data from clients for rating prediction.

Prediction client

Firstly, let’s generate some random data to send for prediction.

//////////////////////////////////////////////////
// PREDICTION CLIENT
def predictionClient(addr: InetSocketAddress): Process[Task, Bytes] = {

  // PREDICTION DATA
  def rndData =
    // userID
    (Math.abs(scala.util.Random.nextInt) % 4 + 1).toString +
    // productID
    (Math.abs(scala.util.Random.nextInt) % 4 + 1).toString +
    "\n"

  val rndDataProcess = Process.eval(Task.delay{ rndData }).repeat

  // a 1000 elements process emitting every 10ms
  val predictDataProcess =
    (Process.awakeEvery(10 milliseconds) zipWith rndDataProcess){ (_, s) => Bytes.of(s.getBytes) }
      .take(1000)

  val client = NioClient.sendAndCheckSize(addr, predictDataProcess)

  client
}

Prediction server

val predictAddr = NioUtils.localAddress(11101)
//////////////////////////////////////////////////
// PREDICTION SERVER
val (predictServer, predictStop) = server(predictAddr)

Prediction Stream

predictServer is the stream of data to predict. Let’s stream it to the model by dstreamizing it and transforming all built RDDs by passing them through model

//////////////////////////////////////////////////
// DStreamize server
val (predictServerSink, predictDstream) = dstreamize(
  predictServer
      // converts bytes to String (doesn't care about encoding, it shall be UTF8)
      .map  ( bytes => new String(bytes.toArray) )
      // rechunk received strings based on a separator \n
      .pipe (NioUtils.rechunk { s:String => (s.split("\n").toVector, s.last == '\n') } )
  , ssc
)

//////////////////////////////////////////////////
// pipe dstreamed RDD to prediction model
// and print result
predictDstream map { _.split(',') match {
  // converts to integers required by the model (USER_ID, PRODUCT_ID)
  case Array(user, item) => (user.toInt, item.toInt)
}} transform { rdd =>
  // prediction happens here
  model.predict(rdd)
} print()

Running all in same `StreamingContext`

I’ve discovered a problem here because the recommendation model is built in a StreamingContext and uses RDDs built in it. So you must use the same StreamingContext for prediction. So I must build my training dstreamized client/server & prediction dstreamized client/server in the same context and thus I must schedule both things before starting this context.

Yet the prediction model is built from training data received after starting the context so it’s not known before… So it’s very painful and I decided to be nasty and consider the model as a variable that will be set later. For this, I used a horrible SyncVar to set the prediction model when it’s ready… Sorry about that but I need to study more about this issue to see if I can find better solutions because I’m not satisfied about it at all…

So here is the whole training/predicting painful code:

//////////////////////////////////////////////////
// TRAINING

val trainingAddr = NioUtils.localAddress(11100)

// TRAINING SERVER
val (trainingServer, trainingStop) = server(trainingAddr)

// TRAINING CLIENT
val tclient = trainingClient(trainingAddr)

// DStreamize server
val (trainingServerSink, trainingDstream) = dstreamize(
  trainingServer
      // converts bytes to String (doesn't care about encoding, it shall be UTF8)
      .map  ( bytes => new String(bytes.toArray) )
      // rechunk received strings based on a separator \n
      .pipe (NioUtils.rechunk { s:String => (s.split("\n").toVector, s.last == '\n') } )
  , ssc
)

// THE HORRIBLE SYNCVAR CLUDGE (could have used atomic but not better IMHO)
var model = new SyncVar[org.apache.spark.mllib.recommendation.MatrixFactorizationModel]
// THE HORRIBLE SYNCVAR CLUDGE (could have used atomic but not better IMHO)


//////////////////////////////////////////////////
// PREDICTING
val predictAddr = NioUtils.localAddress(11101)

// PREDICTION SERVER
val (predictServer, predictStop) = server(predictAddr)

// PREDICTION CLIENT
val pClient = predictionClient(predictAddr)

// DStreamize server
val (predictServerSink, predictDstream) = dstreamize(
  predictServer
      // converts bytes to String (doesn't care about encoding, it shall be UTF8)
      .map  ( bytes => new String(bytes.toArray) )
      // rechunk received strings based on a separator \n
      .pipe ( NioUtils.rechunk { s:String => (s.split("\n").toVector, s.last == '\n') } )
  , ssc
)

// Piping received data to the model
predictDstream.map {
  _.split(',') match {
    case Array(user, item) => (user.toInt, item.toInt)
  }
}.transform { rdd =>
  // USE THE HORRIBLE SYNCVAR
  model.get.predict(rdd)
}.print()

//////////////////////////////////////////////////
// RUN ALL
val before = new Time(System.currentTimeMillis)

// Start SSC
ssc.start()

// Launch training server
trainingServerSink.run.runAsync( _ => () )

// Sleeps a bit to let server listen
Thread.sleep(300)

// Launch training client
tclient.run.run

// Await SSC termination a bit
ssc.awaitTermination(1000)
// Stop training server
trainingStop.set(true).run
val after = new Time(System.currentTimeMillis)

val rdds = trainingDstream.slice(before.floor(Milliseconds(1000)), after.floor(Milliseconds(1000)))
val union: RDD[String] = new UnionRDD(ssc.sparkContext, rdds)

val ratings = union map {
  _.split(',') match {
    case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
  }
}

// SET THE HORRIBLE SYNCVAR
model.set(ALS.train(ratings, 1, 20, 0.01))

println("**** Model Trained -> Prediction(1,3)=" + model.get.predict(1, 3))

// Launch prediction server
predictServerSink.run.runAsync( _ => () )

// Sleeps a bit to let server listen
Thread.sleep(300)

// Launch prediction client
pClient.run.run

// Await SSC termination a bit
ssc.awaitTermination(1000)
// Stop server
predictStop.set(true).run

Run it…

-------------------------------------------
Time: 1395144379000 ms
-------------------------------------------
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,4,1.0
2,2,5.0

**** Model Trained -> Prediction(1,3)=4.919459410565401

...

-------------------------------------------
Time: 1395144384000 ms
-------------------------------------------
----------------

-------------------------------------------
Time: 1395144385000 ms
-------------------------------------------
Rating(1,1,4.919459410565401)
Rating(1,1,4.919459410565401)
Rating(1,1,4.919459410565401)
Rating(1,1,4.919459410565401)
Rating(1,2,1.631952450379809)
Rating(1,3,4.919459410565401)
Rating(1,3,4.919459410565401)

-------------------------------------------
Time: 1395144386000 ms
-------------------------------------------
Rating(1,1,4.919459410565401)
Rating(1,1,4.919459410565401)
Rating(1,3,4.919459410565401)
Rating(1,3,4.919459410565401)
Rating(1,3,4.919459410565401)
Rating(1,4,0.40813133837755494)
Rating(1,4,0.40813133837755494)

...

Final conclusion

3 long articles to end in training a poor recommendation system with 2 clients/servers… A bit bloated isn’t it? :)

Anyway, I hope I printed in your brain a few ideas, concepts about spark & scalaz-stream and if I’ve reached this target, it’s already enough!

Yet, I’m not satisfied about a few things:

Training a model and using it in the same StreamingContext is still clumsy and I must say that calling model.predict from a map function in a DStream might not be so good in a cluster environment. I haven’t been digging this code enough to have a clear mind on it.
I tried using multiple clients for prediction (like 100 in parallel) and it works quite well but I have encountered problems ending both my clients/servers and the streaming context and I often end into having zombies SBT process that I can’t kill until reboot (some threads remain RUNNING while other AWAITS and sockets aren’t released… resources issues…). Closing cleanly all of these tools creating threads & more after intensive work isn’t yet good.

But, I’m satisfied globally:

Piping a scalaz-stream Process into a spark DStream works quite well and might be interesting after all.
The new scalaz-stream NIO API considering clients & servers as pure streams of data gave me so many ideas that my free-time has suddenly been frightened and went away.

GO TO PART2 <—————————————————————————————————-

Have a look at the code on Github.

Have distributed & resilient yet continuous fun!

ZPark-Ztream II (Part 2/3): Fancy Spark Streamed Machine-Learning & new Scalaz-Stream NIO API

2014-03-09T19:19:00+01:00

Synopsis

The code & sample apps can be found on Github

Zpark-Zstream I article was a PoC trying to use Scalaz-Stream instead of DStream with Spark-Streaming. I had deliberately decided not to deal with fault-tolerance & stream graph persistence to keep simple but without it, it was quite useless for real application…

Here is a tryptic of articles trying to do something concrete with Scalaz-Stream and Spark.

So, what do I want? I wantttttttt a shrewburyyyyyy and to do the following:

Plug Scalaz-Stream Process[Task, T] on Spark DStream[T] (Part 1)
Build DStream using brand new Scalaz-Stream NIO API (client/server) (Part 2)
Train Spark-ML recommendation-like model using NIO client/server (Part 3)
Stream data from multiple NIO clients to the previous trained ML model mixing all together (Part 3)

[Part 2/3] From Scalaz-Stream NIO client & server to Spark DStream

Scalaz-Stream NIO Client

What is a client?

Something sending some data W (for Write) to a server
Something reading some data I (for Input) from a server

Client seen as `Process`

A client could be represented as:

a Process[Task, I] for input channel (receiving from server)
a Process[Task, W] for output channel (sending to server)

In scalaz-stream, recently a new structure has been added :

final case class Exchange[I, W](read: Process[Task, I], write: Sink[Task, W])

Precisely what we need!

Now, let’s consider that we work in NIO mode with everything non-blocking, asynchronous etc…

In this context, a client can be seen as something generating soon or later one (or more) Exchange[I, W] i.e :

Client[I, W] === Process[Task, Exchange[I, W]]

In the case of a pure TCP client, I and W are often Bytes.

Creating a client

Scalaz-Stream now provides a helper to create a TCP binary NIO client:

// the address of the server to connect to
val address: InetSocketAddress = new InetSocketAddress("xxx.yyy.zzz.ttt", port)

// create a client
val client: Process[Task, Exchange[Bytes, Bytes]] = nio.connect(address)

client map { ex: Exchange[Bytes, Bytes] =>
  // read data sent by server in ex.read
  ???
  // write data to the server with ex.write
  ???
}

Plug your own data source on `Exchange`

To plug your own data source to write to server, Scalaz-Stream provides 1 more API:

case class Exchange[I, W](read: Process[Task, I], write: Sink[Task, W]) {
  /**
   * Runs supplied Process of `W` values by sending them to remote system.
   * Any replies from remote system are received as `I` values of the resulting process.
   */
  def run(p:Process[Task,W]): Process[Task,I]

  // the W are sent to the server and we retrieve only the received data
}

With this API, we can write data to the client and output received data.

// some data to be sent by client
val data: Process[Task, W] = ...

// send data and retrieve only responses received by client
val output: Process[Task, I] = client flatMap { ex =>
  ex.run(data)
}

val receivedData: Seq[Bytes] = output.runLog.run

Yet, in general, we need to:

send custom data to the server
expect its response
do some business logic
send more data
etc…

So we need to be able to gather in the same piece of code received & emitted data.

Managing client/server business logic with `Wye`

Scalaz-stream can help us with the following API:

case class Exchange[I, W](read: Process[Task, I], write: Sink[Task, W]) {
...
  /**
   * Transform this Exchange to another Exchange where queueing, and transformation of this `I` and `W`
   * is controlled by supplied WyeW.
   */
  def wye(w: Wye[Task, I, W2, W \/ I2]): Exchange[I2, W2]
...

// It transforms the original Exchange[I, W] into a Exchange[I2, W2]

}

Whoaaaaa complex isn’t it? Actually not so much…

Wye is a fantastic tool that can:

read from a left and/or right branch (in a non-deterministic way: left or right or left+right),
perform computation on left/right received data,
emit data in output.

I love ASCII art schema:

> Wye[Task, I, I2, W]

    I(left)       I2(right)
          v       v
          |       |
          ---------
              |
 ---------------------------
|    Wye[Task, I, I2, W]    |
 ---------------------------
              |
              v
              W

\/ is ScalaZ disjunction also called `Either in the Scala world.

So Wye[Task, I, W2, W \/ I2] can be seen as:

> Wye[Task, I, W2, W \/ I2]

          I       W2
          v       v
          |       |
          ---------
              |
 ---------------------------
| Wye[Task, I, W2, W \/ I2] |
 ---------------------------
              |
          ---------
          |       |
          v       v
          W       I2

So what does this `Exchange.wye` API do?

It plugs the original Exchange.write: Sink[Task, W] to the W output of the Wye[Task, I, W2, W \/ I2] for sending data to the server.
It plugs the Exchange.read: Process[Task, I] receiving data from server to the left input of the Wye[Task, I, W2, W].
The right intput W2 branch provides a plug for an external source of data in the shape of Process[Task, W2].
The right output I2 can be used to pipe data from the client to an external local process (like streaming out data received from the server).
Finally it returns an Exchange[I2, W2].

In a summary:

> (ex:Exchange[I, W]).wye( w:Wye[Task, I, W2, W \/ I2] )

        ex.read
          |
          v
          I       W2
          v       v
          |       |
          ---------
              |
 -----------------------------
| w:Wye[Task, I, W2, W \/ I2] |
 -----------------------------
              |
          ---------
          |       |
          v       v
          W       I2
          |
          v
      ex.write

======> Returns Exchange[I2, W2]

As a conclusion, Exchange.wye combines the original Exchange[I, W] with your custom Wye[Task, I, W2, W \/ I2] which represents the business logic of data exchange between client & server and finally returns a Exchange[I2, W2] on which you can plug your own data source and retrieve output data.

Implement the client with `wye/run`

// The source of data to send to server
val data2Send: Process[Task, Bytes] = ...

// The logic of your exchange between client & server
val clientWye: Wye[Task, Bytes, Bytes, Bytes \/ Bytes])= ...
// Scary, there are so many Bytes

val clientReceived: Process[Task, Bytes] = for {
  // client connects to the server & returns Exchange
  ex   <- nio.connect(address)

  // Exchange is customized with clientWye
  // Data are injected in it with run
  output <- ex.wye(clientWye).run(data2Send)
} yield (output)

Implement simple client/server business logic?

Please note, I simply reuse the basic echo example provided in scalaz-stream ;)

def clientEcho(address: InetSocketAddress, data: Bytes): Process[Task, Bytes] = {

  // the custom Wye managing business logic
  def echoLogic: Wye[Bytes, Bytes, Bytes, Byte \/ Bytes] = {

    def go(collected: Int): WyeW[Bytes, Bytes, Bytes, Bytes] = {
      // Create a Wye that can receive on both sides
      receiveBoth {
        // Receive on left == receive from server
        case ReceiveL(rcvd) =>
          // `emitO` outputs on `I2` branch and then...
          emitO(rcvd) fby
            // if we have received everything sent, halt
            (if (collected + rcvd.size >= data.size) halt
            // else go on collecting
            else go(collected + rcvd.size))

        // Receive on right == receive on `W2` branch == your external data source
        case ReceiveR(data) =>
          // `emitW` outputs on `W` branch == sending to server
          // and loops
          emitW(data) fby go(collected)

        // When server closes
        case HaltL(rsn)     => Halt(rsn)
        // When client closes, we go on collecting echoes
        case HaltR(_)       => go(collected)
      }
    }

    // Init
    go(0)
  }

  // Finally wiring all...
  for {
    ex   <- nio.connect(address)
    rslt <- ex.wye(echoSent).run(emit(data))
  } yield {
    rslt
  }
}

This might seem hard to catch to some people because of scalaz-stream notations and wye Left/Right/Both or wye.emitO/emitW. But actually, you’ll get used to it quite quickly as soon as you understand wye. Keep in mind that this code uses low-level scalaz-stream API without anything else and it remains pretty simple and straighforward.

Run the client for its output

// create a client that sends 1024 random bytes
val dataArray = Array.fill[Byte](1024)(1)
scala.util.Random.nextBytes(dataArray)
val clientOutput = clientEcho(addr, Bytes.of(dataArray))

// consumes all received data... (it should contain dataArray)
val result = clientOutput.runLog.run

println("Client received:"+result)

It would give something like:

Client received:Vector(Bytes1: pos=0, length=1024, src: (-12,28,55,-124,3,-54,-53,66,-115,17...)

Now, you know about scalaz-stream clients, what about servers???

Scalaz-stream NIO Server

Let’s start again :D

What is a server?

Something listening for client(s) connection
When there is a client connected, the server can :
- Receive data I (for Input) from the client
- Send data W (for Write) to the client
A server can manage multiple clients in parallel

Server seen as `Process`

Remember that a client was defined above as:

Client === Process[Task, Exchange[I, W]]

In our NIO, non-blocking, streaming world, a server can be considered as a stream of clients right?

So finally, we can model a server as :

Server === Process[Task, Client[I, W]]
       === Process[Task, Process[Task, Exchange[I, W]]]

Whoooohoooo, a server is just a stream of streams!!!!

Writing a server

Scalaz-Stream now provides a helper to create a TCP binary NIO server:

// the address of the server
val address: InetSocketAddress = new InetSocketAddress("xxx.yyy.zzz.ttt", port)

// create a server
val server: Process[Task, Process[Task, Exchange[Bytes, Bytes]]] =
  nio.server(address)

server map { client =>
  // for each client
  client flatMap { ex: Exchange[Bytes, Bytes] =>
    // read data sent by client in ex.read
    ???
    // write data to the client with ex.write
    ???
  }
}

Don’t you find that quite elegant? ;)

Managing client/server interaction business logic

There we simply re-use the Exchange described above so you can use exactly the same API than the ones for client. Here is another API that can be useful:

type Writes1[W, I, I2] = Process[I, W \/ I2]

case class Exchange[I, W](read: Process[Task, I], write: Sink[Task, W]) {
...
  /**
   * Transforms this exchange to another exchange, that for every received `I` will consult supplied Writer1
   * and eventually transforms `I` to `I2` or to `W` that is sent to remote system.
   */
  def readThrough[I2](w: Writer1[W, I, I2])(implicit S: Strategy = Strategy.DefaultStrategy) : Exchange[I2,W]
...
}

// A small schema?
            ex.read
              |
              v
              I
              |
 ---------------------------
|    Writer1[W, I, I2]      |
 ---------------------------
              |
          ---------
          |       |
          v       v
          W       I2
          |
          v
       ex.write

======> Returns Exchange[I2, W]

With this API, you can compute some business logic on the received data from client.

Let’s write the echo server corresponding to the previous client (you can find this sample in scalaz-stream too):

def serverEcho(address: InetSocketAddress): Process[Task, Process[Task, Bytes]] = {

  // This is a Writer1 that echoes everything it receives to the client and emits it locally
  def echoAll: Writer1[Bytes, Bytes, Bytes] =
    receive1[Bytes, Bytes \/ Bytes] { i =>
      // echoes on left, emits on right and then loop (fby = followed by)
      emitSeq( Seq(\/-(i), -\/(i)) ) fby echoAll
    }

  // The server that echoes everything
  val receivedData: Process[Task, Process[Task, Bytes]] =
    for {
      client <- nio.server(address)
      rcv    <- ex.readThrough(echoAll).run()
    } yield rcv
  }

  receivedData
}

receivedData is Process[Task, Process[Task, Bytes]] which is not so practical: we would prefer to gather all data received by clients in 1 single Process[Task, Bytes] to stream it to another module.

Scalaz-Stream has the solution again:

package object merge {
  /**
   * Merges non-deterministically processes that are output of the `source` process.
   */
  def mergeN[A](source: Process[Task, Process[Task, A]])
    (implicit S: Strategy = Strategy.DefaultStrategy): Process[Task, A]
}

Please note the Strategy which corresponds to the way Tasks will be executed and that can be compared to Scala ExecutionContext.

Fantastic, let’s plug it on our server:

// The server that echoes everything
def serverEcho(address: InetSocketAddress): Process[Task, Bytes] = {

  // This is a Writer1 that echoes everything it receives to the client and emits it locally
  def echoAll: Writer1[Bytes, Bytes, Bytes] =
    receive1[Bytes, Bytes \/ Bytes] { i =>
      // echoes on left, emits on right and then loop (fby = followed by)
      emitSeq( Seq(\/-(i), -\/(i)) ) fby echoAll
    }

  // The server that echoes everything
  val receivedData: Process[Task, Process[Task, Bytes]] =
    for {
      client <- nio.server(address)
      rcv    <- ex.readThrough(echoAll).run()
    } yield rcv
  }

  // Merges all client streams
  merge.mergeN(receivedData)
}

Finally, we have a server and a client!!!!!

Let’s plug them all together

Run a server

First of all, we need to create a server that can be stopped when required.

Let’s do in the scalaz-stream way using:

wye.interrupt :

/**
   * Let through the right branch as long as the left branch is `false`,
   * listening asynchronously for the left branch to become `true`.
   * This halts as soon as the right branch halts.
   */
  def interrupt[I]: Wye[Boolean, I, I]

async.signal which is a value that can be changed asynchronously based on 2 APIs:

/**
   * Sets the value of this `Signal`. 
   */
  def set(a: A): Task[Unit]

  /**
   * Returns the discrete version of this signal, updated only when `value`
   * is changed ...
   */
  def discrete: Process[Task, A]

Without lots of imagination, we can use a Signal[Boolean].discrete to obtain a Process[Task, Boolean] and wye it with previous server process using wye.interrupt. Then, to stop server, you just have to call:

signal.set(true)

Here is the full code:

// local bind address
val addr = localAddress(12345)

// The stop signal initialized to false
val stop = async.signal[Boolean]
stop.set(false).run

// Create the server controlled by the previous signal
val stoppableServer = (stop.discrete wye echoServer(addr))(wye.interrupt)

// Run server in async without taking care of output data
stopServer.runLog.runAsync( _ => ())

// DO OTHER THINGS

// stop server
stop.set(true)

Run server & client in the same code

// local bind address
val addr = localAddress(12345)

// the stop signal initialized to false
val stop = async.signal[Boolean]
stop.set(false).run

// Create the server controlled by the previous signal
val stoppableServer = (stop.discrete wye serverEcho(addr))(wye.interrupt)

// Run server in async without taking care of output data
stoppableServer.runLog.runAsync( _ => ())

// create a client that sends 1024 random bytes
val dataArray = Array.fill[Byte](1024)(1)
scala.util.Random.nextBytes(dataArray)
val clientOutput = clientEcho(addr, Bytes.of(dataArray))

// Consume all received data in a blocking way...
val result = clientOutput.runLog.run

// stop server
stop.set(true)

Naturally you rarely run the client & server in the same code but this is funny to see our easily you can do that with scalaz-stream as you just manipulate Process run on provided Strategy

Finally, we can go back to our subject: feeding a DStream using a scalaz-stream NIO client/server

Pipe server output to DStream

clientEcho/serverEcho are simple samples but not very useful.

Now we are going to use a custom client/server I’ve written for this article:

NioClient.sendAndCheckSize is a client streaming all emitted data of a Process[Task, Bytes] to the server and checking that the global size has been ack’ed by server.
NioServer.ackSize is a server acknowledging all received packets by their size (as a 4-bytes Int)

Now let’s write a client/server dstreamizing data to Spark:

// First create a streaming context
val ssc = new StreamingContext(clusterUrl, "SparkStreamStuff", Seconds(1))

// Local bind address
val addr = localAddress(12345)

// The stop signal initialized to false
val stop = async.signal[Boolean]
stop.set(false).run

// Create the server controlled by the previous signal
val stoppableServer = (stop.discrete wye NioServer.ackSize(addr))(wye.interrupt)

// Create a client that sends a natural integer every 50ms as a string (until reaching 100)
val clientData: Process[Task, Bytes] = naturalsEvery(50 milliseconds).take(100).map(i => Bytes.of(i.toString.getBytes))
val clientOutput = NioClient.sendAndCheckSize(addr, clientData)

// Dstreamize the server into the streaming context
val (consumer, dstream) = dstreamize(stoppableServer, ssc)

// Prepare dstream output
dstream.map( bytes => new String(bytes.toArray) ).print()

// Start the streaming context
ssc.start()

// Run the server just for its effects
consumer.run.runAsync( _ => () )

// Run the client in a blocking way
clientOutput.runLog.run

// Await SSC termination a bit
ssc.awaitTermination(1000)

// stop server
stop.set(true)

// stop the streaming context
ssc.stop()

When run, it prints :

-------------------------------------------
Time: 1395049304000 ms
-------------------------------------------

-------------------------------------------
Time: 1395049305000 ms
-------------------------------------------
0
1
2
3
4
5
6
7
8
9
...

-------------------------------------------
Time: 1395049306000 ms
-------------------------------------------
20
21
22
23
24
25
26
27
28
29
...

Until 100…

Part2’s conclusion

I spent this second part of my tryptic mainly explaining a few concepts of the new scalaz-stream brand new NIO API. With it, a client becomes just a stream of exchanges Process[Task, Exchange[I, W]] and a server becomes a stream of stream of exchanges Process[Task, Process[Task, Exchange[I, W]]].

As soon as you manipulate Process, you can then use the dstreamize API exposed in Part 1 to pipe streamed data into Spark.

Let’s go to Part 3 now in which we’re going to do some fancy Machine Learning training with these new tools.

GO TO PART1 < —————————————————————————–> GO TO PART3

ZPark-Ztream II (Part 1/3): Fancy Spark Streamed Machine-Learning & new Scalaz-Stream NIO API

2014-03-08T19:19:00+01:00

Synopsis

The code & sample apps can be found on Github

Zpark-Zstream I article was a PoC trying to use Scalaz-Stream instead of DStream with Spark-Streaming. I had deliberately decided not to deal with fault-tolerance & stream graph persistence to keep simple but without it, it was quite useless for real application…

Here is a tryptic of articles trying to do something concrete with Scalaz-Stream and Spark.

So, what do I want? I wantttttttt a shrewburyyyyyy and to do the following:

Plug Scalaz-Stream Process[Task, T] on Spark DStream[T] (Part 1)
Build DStream using brand new Scalaz-Stream NIO API (client/server) (Part 2)
Train Spark-ML recommendation-like model using NIO client/server (Part 3)
Stream data from multiple NIO clients to the previous trained ML model mixing all together (Part 3)

[Part 1/3] From Scalaz-Stream Process to Spark DStream

Reminders on Process[Task, T]

Scalaz-stream Process[Task, T] is a stream of T elements that can interleave some Tasks (representing an external something doing somewhat). Process[Task, T] is built as a state machine that you need to run to process all Task effects and emit a stream of T. This can manage both continuous or discrete, finite or infinite streams.

I restricted to Task for the purpose of this article but it can be any F[_].

Reminders on DStream[T]

Spark DStream[T] is a stream of RDD[T] built by discretizing a continuous stream of T. RDD[T] is a resilient distributed dataset which is the ground data-structure behind Spark for distributing in-memory batch/map/reduce operations to a cluster of nodes with fault-tolerance & persistence.

In a summary, DStream slices a continuous stream of T by windows of time and gathers all Ts in the same window into one RDD[T]. So it discretizes the continuous stream into a stream of RDD[T]. Once built, those RDD[T]s are distributed to Spark cluster. Spark allows to perform transform/union/map/reduce/… operations on RDD[T]s. Therefore DStream[T] takes advantage if the same operations.

Spark-Streaming also persists all operations & relations between DStreams in a graph. Thus, in case of fault in a remote node while performing operations on DStreams, the whole transformation can be replayed (it also means streamed data are also persisted).

Finally, the resulting DStream obtained after map/reduce operations can be output to a file, a console, a DB etc…

Please note that DStream[T] is built with respect to a StreamingContext which manages its distribution in Spark cluster and all operations performed on it. Moreover, DStream map/reduce operations & output must be scheduled before starting the StreamingContext. It could be somewhat compared to a state machine that you build statically and run later.

From Process[Task, T] to RDD[T]

You may ask why not simply build a RDD[T] from a Process[Task, T] ?

Yes sure we can do it:

// Initialize Spark Context
implicit scc = new SparkContext(...)

// Build a process
val p: Process[Task, T] = ...

// Run the process using `runLog` to aggregate all results
// and build a RDD using spark context parallelization
val rdd = sc.parallelize(p.runLog.run)

This works but what if this Process[Task, T] emits huge quantity of data or is infinite? You’ll end in a OutOfMemoryException…

So yes you can do it but it’s not so interesting. DStream seems more natural since it can manage stream of data as long as it can discretize it over time.

From Process[Task, T] to DStream[T]

Pull from `Process[Task, T]`, Push to `DStream[T]` with `LocalInputDStream`

To build a DStream[T] from a Process[Task, T], the idea is to:

Consume/pull the T emitted by Process[Task, O],
Gather emitted T during a window of time & generate a RDD[T] with them,
Inject RDD[T] into the DStream[T],
Go to next window of time…

Spark-Streaming library provides different helpers to create DStream from different sources of data like files (local/HDFS), from sockets…

The helper that seemed the most appropriate is the NetworkInputDStream:

It provides a NetworkReceiver based on a Akka actor to which we can push streamed data.
NetworkReceiver gathers streamed data over windows of time and builds a BlockRDD[T] for each window.
Each BlockRDD[T] is registered to the global Spark BlockManager (responsible for data persistence).
BlockRDD[T] is injected into the DStream[T].

So basically, NetworkInputDStream builds a stream of BlockRDD[T]. It’s important to note that NetworkReceiver is also meant to be sent to remote workers so that data can be gathered on several nodes at the same time.

But in my case, the data source Process[Task, T] run on the Spark driver node (at least for now) so instead of NetworkInputDStream, a LocalInputDStream would be better. It would provide a LocalReceiver based on an actor to which we can push the data emitted by the process in an async way.

LocalInputDStream doesn’t exist in Spark-Streaming library (or I haven’t looked well) so I’ve implemented it as I needed. It does exactly the same as NetworkInputDStream without the remoting aspect. The current code is there…

`Process` vs `DStream` ?

There is a common point between DStream and Process: both are built as state machines that are passive until run.

In the case of Process, it is run by playing all the Task effects while gathering emitted values or without taking care of them, in blocking or non-blocking mode etc…
In the case of DStream, it is built and registered in the context of a SparkStreamingContext. Then you must also declare some outputs for the DStream like a simple print, an HDFS file output, etc… Finally you start the SparkStreamingContext which manages everything for you until you stop it.

So if we want to adapt a Process[Task, T] to a DStream[T], we must perform 4 steps (on the Spark driver node):

build a DStream[T] using LocalInputDStream[T] providing a Receiver in which we’ll be able to push asynchronously T.
build a custom scalaz-stream Sink[Task, T, Unit] in charge of consuming all emitted data from Process[Task, T] and pushing them using previous Receiver.
pipe the Process[Task, T] to this Sink[Task, T, Unit] & when Process[Task, T] has halted, stop previous DStream[T]: the result of this pipe operation is a Process[Task, Unit] which is a pure effectful process responsible for pushing T into the dstream without emitting anything.
return previous DStream[T] and effectful consumer Process[Task, Unit].

`dstreamize` implementation

def dstreamize[T : ClassTag](
  p: Process[Task, T],
  ssc: StreamingContext,
  storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): (Process[Task, Unit], ZparkInputDStream[T]) = {

  // Build a custom LocalInputDStream
  val dstream = new ZparkInputDStream[T](ssc, storageLevel)

  // Build a Sink pushing into dstream receiver
  val sink = receiver2Sink[T](dstream.receiver.asInstanceOf[ZparkReceiver[T]])

  // Finally pipe the process to the sink and when finished, closes the dstream
  val consumer: Process[Task, Unit] =
    (p to sink)
    // when finished, it closes the dstream
    .append ( eval(Task.delay{ dstream.stop() }) )
    // when error, it closes the dstream
    .handle { case e: Exception =>
      println("Stopping on error "+e.getMessage)
      e.printStackTrace()
      eval(Task.delay{ dstream.stop() })
    }

  // Return the effectful consumer sink and the DStream
  (consumer, dstream)
}

Please remark that this builds a Process[Task, Unit] and a DStream[T] but nothing has happened yet in terms of data consumption & streaming. Both need to be run now.

Use it…

// First create a streaming context
val ssc = new StreamingContext(clusterUrl, "SparkStreamStuff", Seconds(1))

// Create a data source sample as a process generating a natural every 50ms 
// (take 1000 elements)
val p: Process[Task, Int] = naturalsEvery(50 milliseconds).take(1000)

// Dstreamize the process in the streaming context
val (consumer, dstream) = dstreamize(p, ssc)

// Prepare the dstream operations (count) & output (print)
dstream.count().print()

// Start the streaming context
ssc.start()

// Run the consumer for its effects (consuming p and pushing into dstream)
// Note this is blocking but it could be runAsync too
consumer.run.run

// await termination of stream with a timeout
ssc.awaitTermination(1000)

// stops the streaming context
ssc.stop()

Please note that you have to:

schedule your dstream operations/output before starting the streaming context.

start the streaming context before running the consumer.

Run it…

14/03/11 11:32:09 WARN util.Utils: Your hostname, localhost.paris.zenexity.com resolves to a loopback address: 127.0.0.1; using 10.0.24.228 instead (on interface en0)
14/03/11 11:32:09 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
-------------------------------------------
Time: 1394533933000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1394533934000 ms
-------------------------------------------
14

-------------------------------------------
Time: 1394533935000 ms
-------------------------------------------
20

-------------------------------------------
Time: 1394533936000 ms
-------------------------------------------
20

...

Ok cool, we can see a warmup phase at beginning and then windows of 1 sec counting 20 elements which is great since one element every 50ms gives 20 elements in 1sec.

Part 1’s conclusion

Now we can pipe a Process[Task, T] into a DStream[T].

Please not that as we run the Process[Task, T] on the Spark driver node, if this node fails, there is no real way to restore lost data. Yet, LocalInputDStream relies on DStreamGraph & BlockRDDs which persist all DStream relations & all received blocks. Moreover, DStream has exactly the same problem with respect to driver node for now.

That was fun but what can we do with that?

In part2, I propose to have more fun and stream data to DStream using the brand new Scalaz-Stream NIO API to create cool NIO client/server streams…

——————————————————————————————————-> GO TO PART2

ZPark-Ztream: Driving Spark distributed stream with Scalaz-Stream

2014-02-13T18:18:00+01:00

The code & sample apps can be found on Github

Today I’m going to write about a Proof of Concept I’ve been working on those last weeks: I wanted to use scalaz-stream as a driver of Spark distributed data processing. This is simply an idea and I don’t even know whether it is viable or stupid. But the idea is interesting!

Introduction

2 of my preferred topics those last months are :

Realtime streaming
Realtime clustered data processing (in-memory & fault-tolerant)

2 tools have kept running through my head those last months:

Scalaz-Stream for realtime/continuous streaming using pure functional concepts: I find it very interesting conceptually speaking & very powerful, specially the deterministic & non-deterministic demuxtiplexers provided out-of-the-box (Tee & Wye).
Spark for fast/fault-tolerant in-memory, resilient & clustered data processing.

I won’t speak much about Scalaz-Stream because I wrote a few articles about it.

Let’s focus on Spark.

Spark provides tooling for cluster processing of huge datasets in the same batch mode way as Hadoop, the very well known map/reduce infrastructure. But at the difference of Hadoop which is exclusively relying on HDFS cluster file systems when distributing data through the cluster, Spark tries to cache data in memory as much as possible so that latency of access is reduced as much as possible. Hadoop can scale a lot but is known to be slow in the context of a single node.

Spark is aimed at scaling as much as Hadoop but running faster on each node using in-memory caching. Fault-tolerance & data resilience is managed by Spark too using persistence & redundancy based on any nice storage like HDFS or files or whatever you can plug on Spark. So Spark is meant to be a super fast in-memory, fault-tolerant batch processing engine.

RDD Resilient Distributed Dataset

The basic concept of Spark is Resilient Distributed Dataset aka RDD which is a read-only, immutable data structure representing a collection of objects or dataset that can be distributed across a set of nodes in a cluster to perform map/reduce style algorithms.

The dataset represented by this RDD is partitioned i.e. cut into slices called partitions that can be distributed across the cluster of nodes.

Resilient means these data can be rebuilt in case of fault on a node or data loss. To perform this, the dataset is replicated/persisted across nodes in memory or in distributed file system such as HDFS.

So the idea of RDD is to provide a seamless structure to manage clustered datasets with very simple API in “monadic”-style :

val sc = new SparkContext(
  "local[4]",
  "Simple App",
  "YOUR_SPARK_HOME",
  List("target/scala-2.10/simple-project_2.10-1.0.jar")
)

val logData = sc.textFile(logFile, 2).cache().filter(line => line.contains("a")).map( _ + "foo" ).count()

Depending on your SparkContext configuration, Spark takes in charge of distributing behind the curtain your data to the cluster nodes to perform the required processing in a fully distributed way.

One thing to keep in mind is that Spark distributes data to remote nodes but it also distributes the code/closures remotely. So it means your code has to be serializable which is not the case of scalaz-stream in its current implementation.

Just a word on Spark code

As usual, before using Spark in any big project, I’ve been diving in its code to know whether I can trust this project. I must say I know Spark’s code better than its API ;)

I find Spark Scala implementation quite clean with explicit choices of design made clearly in the purpose of performance. The need to provide a compatible Java/Python API and to distribute code across clustered nodes involves a few restrictions in terms of implementation choices. Anyway, I won’t criticize much because I wouldn’t have written it better and those people clearly know what they do!

Spark Streaming

So Spark is very good to perform fast clustered batch data processing. Yet, what if your dataset is built progressively, continuously, in realtime?

On top of the core module, Spark provides an extension called Spark Streaming aiming at manipulating live streams of data using the power of Spark.

Spark Streaming can ingest different continuous data feeds like Kafka, Flume, Twitter, ZeroMQ or TCP socket and perform high-level operations on it such as map/reduce/groupby/window/…

DStream

The core data structure behind Spark Streams is DStream for Discretized Stream (and not distributed).

Discretized means it gets a continuous stream of data and makes it discrete by slicing it across time and wrapping those sliced data into the famous RDD described above.

A DStream is just a temporal data partitioner that can distribute data slices across the cluster of nodes to perform some data processing using Spark capabilities.

Here is the illustration in official Spark Stream documentation:

DStream also tries to leverage Spark automated persistence/caching/fault-tolerance to the domain of live streaming.

DStream is cool but it’s completely based on temporal aspects. Imagine you want to slice the stream depending on other criteria, with DStream, it would be quite hard because the whole API is based on time. Moreover, using DStream, you can discretize a dataflow but you can’t go in the other way and make it continuous again (in my knowledge). This is something that would be cool, isn’t it?

If you want to know more about DStream discretization mechanism, have a look at the official doc.

As usual, I’m trying to investigate the edge-cases of concepts I like. In general, this is where I can test the core design of the project and determine whether it’s worth investigating in my every-day life.

Driving Spark Streams with Scalaz-Stream

I’ve been thinking about scalaz-stream concepts quite a lot and scalaz-stream is very good at manipulating continuous streams of data. Moreover, it can very easily partition a continuous stream regrouping data into chunks based on any criteria you can imagine.

Scalaz-stream represents a data processing algorithm as a static state machine that you can run when you want. This is the same idea behind map/reduce Spark API: you build your chain of map/filter/window and finally reduce it. Reducing a spark data processing is like running a scalaz-stream machine.

So my idea was the following:

build a continuous stream of data based on scalaz-stream Process[F, O]

discretize the stream Process[F, O] => Process[F, RDD[O]]

implement count/reduce/reduceBy/groupBy for Process[F, RDD[O]]

provide a continuize method to do Process[F, RDD[O]] => Process[F, O]

So I’ve been hacking between Scalaz-stream Process[F, O] & Spark RDD[O] and here is the resulting API that I’ve called ZPark-ZStream (ZzzzzzPark-Zzzzztream).

Let’s play a bit with my little alpha API.

Discretization by simple slicing

Let’s start with a very simple example.

Take a simple finite process containing integers:

val p: Process[Task, Long] = Process(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L)

Now I want to slice this stream of integer by slices of 4 elements.

First we have to create the classic Spark Streaming context and make it implicit (needed by my API).

Please remark that I could plug existing StreamingContext on my code without any problem.

val clusterUrl = "local[4]"
implicit ssc = new StreamingContext(clusterUrl, "SparkSerial", Seconds(1))

Then let’s parallelize the previous process :

val prdd: Process[Task, RDD[Long]] = p.parallelize(4)
// type is just there to show what scalac will infer
// Just to remind that Task is the Future equivalent in Scalaz

Ok folks, now, we have a discretized stream of Long that can be distributed across a Spark cluster.

DStream provides count API which count elements on each RDD in the stream.

Let’s do the same with my API:

val pcount: Process[Task, RDD[Int]] = prdd.countRDD()

What happens here? The `count operation on each RDD in the stream is distributed across the cluster in a map/reduce-style and results are gathered.

Ok that’s cool but you still have a discretized stream Process[Task, RDD[Int]] and that’s not practical to use to see what’s inside it. So now we are going to re-continuize it and make it a Process[Task, Int] again.

val pfinal: Process[Task, Int] = pcount.continuize()

Easy isn’t it?

All together :

val p =
  Process(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
  .parallelize(4)
  .countRDD()
  .continuize()

Let’ print the result in the console

def stdOutLines[I]: Sink[Task, I] =
  Process.constant{ (s: I) => Task.delay { println(s" ----> [${System.nanoTime}] *** $s") }}

(p through stdOutLines).run.run
// 1 run for the process & 1 run for the Task

 ----> [1392418478569989000] *** 4
 ----> [1392418478593226000] *** 4

Oh yes that works: in each slice of 4 elements, we actually have 4 elements! Reassuring ;)

Let’s do the same with countByValue:

val p =
  Process(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
  .parallelize(4)
  .countRDDByValue()
  .continuize()

(p through stdOutLines).run.run
// 1 run for the process & 1 run for the Task

 ----> [1392418552751011000] *** (1,2)
 ----> [1392418552751176000] *** (2,2)
 ----> [1392418552770527000] *** (4,2)
 ----> [1392418552770640000] *** (3,2)

You can see that 4 comes before 3. This is due to the fact the 2nd slice of 4 elements (3,3,4,4) is converted into a RDD which is then partitioned and distributed across the cluster to perform the map/reduce count operation. So the order of return might be different at the end.

An example of map/reduce ?

val p =
  Process(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
  .parallelize(4)
  .mapRDD(_ + 1L)
  .reduceRDD(_ + _)
  .continuize()

(p through stdOutLines).run.run
 ----> [1392418619885745000] *** 10 (2+2+3+3)
 ----> [1392418619905817000] *** 18 (4+4+5+5)

Please note that:

p mapRDD f === p.map{ rdd => rdd map f }

Discretization by time slicing

Now we could try to slice according to time in the same idea as DStream

First of all, let’s define a continuous stream of positive integers:

def naturals: Process[Task, Int] = {
  def go(i: Int): Process[Task, Int] =
    Process.await(Task.delay(i)){ i => Process.emit(i) ++ go(i+1) }

  go(0)
}

Now, I want integers to be emitted at a given tick for example:

def naturalsEvery(duration: Duration): Process[Task, Int] =
  (naturals zipWith Process.awakeEvery(duration)){ (i, b) => i }

Then, let’s discretize the continuous stream with ZPark-Ztream API:

val p: Process[Task, RDD[Int]] =
  naturalsEvery(10 milliseconds).discretize(500 milliseconds)

The stream is sliced in slice of 500ms and all elements emitted during these 500ms are gathered in a Spark RDD.

On this stream of RDD, we can applycountRDD` as before and finally re-continuize it. All together we obtain:

val p =
  naturalsEvery(10 milliseconds)
  .take(5000)  // takes only 5000 because an infinite stream is hard to log in an article
  .discretize(500 milliseconds)
  .countRDD()
  .continuize()

(p through stdOutLines).run.run

 ----> [1392395213389954000] *** 47
 ----> [1392395213705505000] *** 28
 ----> [1392395214191637000] *** 47
 ----> [1392395214688724000] *** 48
 ----> [1392395215189453000] *** 45
 ----> [1392395215697655000] *** 48
 ----> [1392395240677357000] *** 50
 ----> [1392395241175632000] *** 49
 ----> [1392395241674446000] *** 50
 ----> [1392395242175416000] *** 50
 ----> [1392395242675183000] *** 50
 ----> [1392395243177056000] *** 50
 ----> [1392395243676848000] *** 49
 ----> [1392395244175938000] *** 49
 ----> [1392395244676315000] *** 50
 ----> [1392395245175042000] *** 50
 ----> [1392395245677394000] *** 50
 ...

Approximatively we have 50 elements per slice which looks like what we expected.

Please note that there is a short period of warmup where values are less homogenous.

Discretization by time slicing keeping track of time

DStream keeps track of all created RDD slices of data (following Spark philosophy to cache as much as possible) and allows to do operation of windowing to redistribute RDD.

With ZPark API, you can write the same as following:

val p =
  naturalsEvery(10 milliseconds)
  .take(500)
  .discretizeKeepTime(500 milliseconds)
  .windowRDD(1000 milliseconds)
  .map { case (time, rdd) =>
    (time, rdd.count())
  }

(p through stdOutLines).run.run

 ----> [1392397573066484000] *** (1392397571981061000,68)
 ----> [1392397574069315000] *** (1392397572981063000,85)
 ----> [1392397575058895000] *** (1392397573981072000,87)
 ----> [1392397576059640000] *** (1392397574981078000,89)
 ----> [1392397577069518000] *** (1392397575981086000,89)
 ----> [1392397577538941000] *** (1392397576981095000,82)

We can see here that final interval haven’t 100 elements as we could expect. This is still a mystery to me and I must investigate a bit more to know where this differences comes from. I have a few ideas but need to validate.

Anyway, globally we get 500 elements meaning we haven’t lost anything.

Mixing scalaz-stream IO & Spark streaming

Playing with naturals is funny but let’s work with a real source of data like a file.

It could be anything pluggable on scalaz-stream like kafka/flume/whatever as DStream provides…

val p =
  io.linesR("testdata/fahrenheit.txt")
    .filter(s => !s.trim.isEmpty && !s.startsWith("//"))
    .map(line => line.toDouble)
    .discretize(100 milliseconds)
    .mapRDD { x => (x, 1L) }
    .groupByKey()
    .mapRDD { case (k, v) => (k, v.size) }
    .continuize()

(p through stdOutLines).run.run

 ----> [1392398529009755000] *** (18.0,23)
 ----> [1392398529010064000] *** (19.0,22)
 ----> [1392398529010301000] *** (78.0,22)
 ----> [1392398529010501000] *** (55.3,22)
 ----> [1392398529010700000] *** (66.0,22)
 ----> [1392398529010892000] *** (64.0,22)
...

Infusing tee with RDD Processes

Is it possible to combine RDD Processes using scalaz-stream ?

val p0 = naturalsEvery(100 milliseconds).take(50).discretize(250 milliseconds)
val p1 = naturalsEvery(100 milliseconds).take(50).discretize(250 milliseconds)
val p =
 (p0 zipWith p1){ (a,b) =>
   new org.apache.spark.rdd.UnionRDD(ssc.sparkContext, Seq(a,b))
 }.countRDDByValue()
  .continuize()

(p through stdOutLines).run.run

 ----> [1392412464151650000] *** (0,2)
 ----> [1392412464151819000] *** (1,2)
 ----> [1392412464230343000] *** (2,2)
 ----> [1392412464230528000] *** (3,1)
 ----> [1392412464477775000] *** (4,2)
 ----> [1392412464477921000] *** (5,2)
 ----> [1392412464478034000] *** (6,2)
 ----> [1392412464478143000] *** (3,1)
 ----> [1392412464726860000] *** (8,2)
 ----> [1392412464727039000] *** (7,2)
 ----> [1392412464975370000] *** (9,2)
 ----> [1392412464975511000] *** (10,2)
 ----> [1392412464975620000] *** (11,2)
 ----> [1392412465224087000] *** (12,2)
 ----> [1392412465224227000] *** (13,2)
 etc...

Please note that I drive Spark RDD stream with Scalaz-Stream always remains on the driver node and is never sent to a remote node as map/reduce closures are in Spark. So Scalaz-stream is used a stream driver in this case. Moreover, Scalaz Process isn’t serializable in its current implementation so it wouldn’t be possible as is.

What about persistence & fault tolerance?

After discretizing a process, you can persist each RDD :

p.discretize(250 milliseconds).mapRDD { _.persist() }

Ok but DStream does much more trying to keep in-memory every RDD that is generated and potentially persist it across the cluster. This makes things stateful & mutable which is not the approach of pure functional API like scalaz-stream. So, I need to think a bit more about this persistence topic which is huge.

Anyway I believe I’m currently investigating another way of manipulating distributed streams than DStream.

Conclusion

Spark is quite amazing and easy to use with respect to the complexity of the subject.

I was also surprised to be able to use it with scalaz-stream so easily.

I hope you liked the idea and I encourage you to think about it and if you find it cool, please tell it! And if you find it stupid, please tell it too: this is still a pure experiment ;)

Have a look at the code on Github.

Have distributed & resilient yet continuous fun!

New Play 2.3 Validation API : breaking the 22 limits with Shapeless

2014-01-31T08:08:00+01:00

The code & sample apps can be found on Github

After 5 months studying theories deeper & deeper on my free-time and preparing 3 talks for scala.io & ping-conf with my friend Julien Tournay aka @skaalf, I’m back blogging and I’ve got a few more ideas of articles to come…

If you’re interested in those talks, you can find pingconf videos here:

Entropic history & Play2.3 new validation API

Play2, scalaz-stream & SciFi

Let’s go back to our today’s subject : Incoming Play2.3/Scala generic validation API & more.

Julien Tournay aka @skaalf has been working a lot for a few months developing this new API and has just published an article previewing Play 2.3 generic validation API.

This new API is just the logical extension of play2/Scala Json API (that I’ve been working & promoting those 2 last years) pushing its principles far further by allowing validation on any data types.

This new API is a real step further as it will progressively propose a common API for all validations in Play2/Scala (Form/Json/XML/…). It proposes an even more robust design relying on very strong theoretical ground making it very reliable & typesafe.

Julien has written his article presenting the new API basics and he also found time to write great documentation for this new validation API. I must confess Json API doc was quite messy but I’ve never found freetime (and courage) to do better. So I’m not going to spend time on basic features of this new API and I’m going to target advanced features to open your minds about the power of this new API.

Let’s have fun with this new APi & Shapeless, this fantastic tool for higher-rank polymorphism & type-safety!

Warm-up with Higher-kind Zipping of Rules

A really cool & new feature of Play2.3 generic validation API is its ability to compose validation Rules in chains like:

val rule1: Rule[A, B] = ...
val rule2: Rule[B, C] = ...

val rule3: Rule[A, C] = rule1 compose rule2

In Play2.1 Json API, you couldn’t do that (you could only map on Reads).

Moreover, with new validation API, as in Json API, you can use macros to create basic validators from case-classes.

case class FooBar(foo: String, bar: Int, foo2: Long)

val rule = Rule.gen[JsValue, FooBar]

/** Action to validate Json:
  * { foo: "toto", bar: 5, foo2: 2 }
  */
def action = Action(parse.json) { request =>
  rule.validate(request.body) map { foobar =>
    Ok(foobar.toString)
  } recoverTotal { errors =>
    BadRequest(errors.toString)
  }
}

Great but sometimes not enough as you would like to add custom validations on your class. For example, you want to verify :

foo isn’t empty
bar is >5
foo2 is <10

For that you can’t use the macro and must write your caseclass Rule yourself.

case class FooBar(foo: String, bar: Int, foo2: Long)

import play.api.data.mapping.json.Rules
import Rules._

val rule = From[JsValue] { __ =>
  (
    (__ \ "foo").read[String](notEmpty) ~
    (__ \ "bar").read[Int](min(5)) ~
    (__ \ "foo2").read[Long](max(10))
  )(FooBar.apply _)
}

Please note the new From[JsValue]: if it were Xml, it would be From[Xml], genericity requires some more info.

Ok that’s not too hard but sometimes you would like to use first the macro and after those primary type validations, you want to refine with custom validations. Something like:

Rule.gen[JsValue, FooBar] +?+?+ ( (notEmpty:Rule[String, String]) +: (min(5):Rule[Int, Int]) +: (min(10L):Rule[Long,Long]) )
// +?+?+ is a non-existing operator meaning "compose"

As you may know, you can’t do use this +: from Scala Sequence[T] as this list of Rules is typed heterogenously and Rule[I, O] is invariant.

So we are going to use Shapeless heterogenous Hlist for that:

val customRules =
  (notEmpty:Rule[String, String]) ::
  (min(5):Rule[Int, Int]) ::
  (min(10L):Rule[Long,Long]) ::
  HNil
// customRules is inferred Rule[String, String]) :: Rule[Int, Int] :: Rule[Long,Long]

How to compose `Rule[JsValue, FooBar]` with `Rule[String, String]) :: Rule[Int, Int] :: Rule[Long,Long]` ?

We need to convert Rule[JsValue, FooBar] to something like Rule[JsValue, T <: HList].

Based on Shapeless Generic[T], we can provide a nice little new conversion API .hlisted:

val rule: Rule[JsValue, String :: Int :: Long :: HNil] = Rule.gen[JsValue, FooBar].hlisted

Generic[T] is able to convert any caseclass from Scala from/to Shapeless HList (& CoProduct).

So we can validate a case class with the macro and get a Rule[JsValue, T <: HList] from it.

How to compose `Rule[JsValue, String :: Int :: Long :: HNil]` with `Rule[String, String]) :: Rule[Int, Int] :: Rule[Long,Long]`?

Again, using Shapeless Polymorphic and HList RightFolder, we can implement a function :

Rule[String, String]) :: Rule[Int, Int] :: Rule[Long,Long] :: HNil) =>
    Rule[String :: Int :: Long :: HNil, String :: Int :: Long :: HNil]

This looks like some higher-kind zip function, let’s call it HZIP.

Now, we can compose them…

val ruleZip = Rule.gen[JsValue, FooBar].hlisted compose hzip(
  (notEmpty:Rule[String, String]) ::
  (min(5):Rule[Int, Int]) ::
  (min(10L):Rule[Long,Long]) ::
  HNil
)

Finally, let’s wire all together in a Play action:

def hzipper = Action(parse.json) { request =>
  ruleZip.validate(request.body) map { foobar =>
    Ok(foobar.toString)
  } recoverTotal { errors =>
    BadRequest(errors.toString)
  }
}

// OK case
{
  "foo" : "toto",
  "bar" : 5,
  "foo2" : 5
} => toto :: 5 :: 8 :: HNil

// KO case
{
  "foo" : "",
  "bar" : 2,
  "foo2" : 12
} => Failure(List(
  ([0],List(ValidationError(error.max,WrappedArray(10)))),
  ([1],List(ValidationError(error.min,WrappedArray(5)))),
  ([2],List(ValidationError(error.required,WrappedArray())))
))

As you can see, the problem in this approach is that we lose the path of Json. Anyway, this can give you a few ideas! Now let’s do something really useful…

Higher-kind Fold of Rules to break the 22 limits

As in Play2.1 Json API, the new validation API provides an applicative builder which allows the following:

(Rule[I, A] ~ Rule[I, B] ~ Rule[I, C]).tupled => Rule[I, (A, B, C)]

(Rule[I, A] ~ Rule[I, B] ~ Rule[I, C])(MyClass.apply _) => Rule[I, MyClass]

But, in Play2.1 Json API and also in new validation API, all functional combinators are limited by the famous Scala 22 limits.

In Scala, you CAN’T write :

a case-class with >22 fields
a Tuple23

So you can’t do Rule[JsValue, A] ~ Rule[JsValue, B] ~ ... more than 22 times.

Nevertheless, sometimes you receive huge JSON with much more than 22 fields in it. Then you have to build more complex models like case-classes embedding case-classes… Shameful, isn’t it…

Let’s be shameless with Shapeless HList which enables to have unlimited heterogenously typed lists!

So, with HList, we can write :

val bigRule =
  (__  \ "foo1").read[String] ::
  (__  \ "foo2").read[String] ::
  (__  \ "foo3").read[String] ::
  (__  \ "foo4").read[String] ::
  (__  \ "foo5").read[String] ::
  (__  \ "foo6").read[String] ::
  (__  \ "foo7").read[String] ::
  (__  \ "foo8").read[String] ::
  (__  \ "foo9").read[String] ::
  (__  \ "foo10").read[Int] ::
  (__  \ "foo11").read[Int] ::
  (__  \ "foo12").read[Int] ::
  (__  \ "foo13").read[Int] ::
  (__  \ "foo14").read[Int] ::
  (__  \ "foo15").read[Int] ::
  (__  \ "foo16").read[Int] ::
  (__  \ "foo17").read[Int] ::
  (__  \ "foo18").read[Int] ::
  (__  \ "foo19").read[Int] ::
  (__  \ "foo20").read[Boolean] ::
  (__  \ "foo21").read[Boolean] ::
  (__  \ "foo22").read[Boolean] ::
  (__  \ "foo23").read[Boolean] ::
  (__  \ "foo25").read[Boolean] ::
  (__  \ "foo26").read[Boolean] ::
  (__  \ "foo27").read[Boolean] ::
  (__  \ "foo28").read[Boolean] ::
  (__  \ "foo29").read[Boolean] ::
  (__  \ "foo30").read[Float] ::
  (__  \ "foo31").read[Float] ::
  (__  \ "foo32").read[Float] ::
  (__  \ "foo33").read[Float] ::
  (__  \ "foo34").read[Float] ::
  (__  \ "foo35").read[Float] ::
  (__  \ "foo36").read[Float] ::
  (__  \ "foo37").read[Float] ::
  (__  \ "foo38").read[Float] ::
  (__  \ "foo39").read[Float] ::
  (__  \ "foo40").read[List[Long]] ::
  (__  \ "foo41").read[List[Long]] ::
  (__  \ "foo42").read[List[Long]] ::
  (__  \ "foo43").read[List[Long]] ::
  (__  \ "foo44").read[List[Long]] ::
  (__  \ "foo45").read[List[Long]] ::
  (__  \ "foo46").read[List[Long]] ::
  (__  \ "foo47").read[List[Long]] ::
  (__  \ "foo48").read[List[Long]] ::
  (__  \ "foo49").read[List[Long]] ::
  (__  \ "foo50").read[JsNull.type] ::
  HNil

// inferred as Rule[JsValue, String] :: Rule[JsValue, String] :: ... :: Rule[JsValue, List[Long]] :: HNil

That’s cool but we want the :: operator to have the same applicative builder behavior as the~/and` operator:

Rule[JsValue, String] :: Rule[JsValue, Long] :: Rule[JsValue, Float] :: HNil =>
  Rule[JsValue, String :: Long :: Float :: HNil]

This looks like a higher-kind fold so let’s call that HFOLD.

We can build this hfold using Shapeless polymorphic functions & RighFolder.

In a next article, I may write about coding such shapeless feature. Meanwhile, you’ll have to discover the code on Github as it’s a bit hairy but very interesting ;)

Gathering everything, we obtain the following:

/* Rules Folding */
val ruleFold = From[JsValue]{ __ =>
  hfold[JsValue](
    (__  \ "foo1").read[String] ::
    (__  \ "foo2").read[String] ::
    (__  \ "foo3").read[String] ::
    (__  \ "foo4").read[String] ::
    (__  \ "foo5").read[String] ::
    (__  \ "foo6").read[String] ::
    (__  \ "foo7").read[String] ::
    (__  \ "foo8").read[String] ::
    (__  \ "foo9").read[String] ::
    (__  \ "foo10").read[Int] ::
    (__  \ "foo11").read[Int] ::
    (__  \ "foo12").read[Int] ::
    (__  \ "foo13").read[Int] ::
    (__  \ "foo14").read[Int] ::
    (__  \ "foo15").read[Int] ::
    (__  \ "foo16").read[Int] ::
    (__  \ "foo17").read[Int] ::
    (__  \ "foo18").read[Int] ::
    (__  \ "foo19").read[Int] ::
    (__  \ "foo20").read[Boolean] ::
    (__  \ "foo21").read[Boolean] ::
    (__  \ "foo22").read[Boolean] ::
    (__  \ "foo23").read[Boolean] ::
    (__  \ "foo25").read[Boolean] ::
    (__  \ "foo26").read[Boolean] ::
    (__  \ "foo27").read[Boolean] ::
    (__  \ "foo28").read[Boolean] ::
    (__  \ "foo29").read[Boolean] ::
    (__  \ "foo30").read[Float] ::
    (__  \ "foo31").read[Float] ::
    (__  \ "foo32").read[Float] ::
    (__  \ "foo33").read[Float] ::
    (__  \ "foo34").read[Float] ::
    (__  \ "foo35").read[Float] ::
    (__  \ "foo36").read[Float] ::
    (__  \ "foo37").read[Float] ::
    (__  \ "foo38").read[Float] ::
    (__  \ "foo39").read[Float] ::
    (__  \ "foo40").read[List[Long]] ::
    (__  \ "foo41").read[List[Long]] ::
    (__  \ "foo42").read[List[Long]] ::
    (__  \ "foo43").read[List[Long]] ::
    (__  \ "foo44").read[List[Long]] ::
    (__  \ "foo45").read[List[Long]] ::
    (__  \ "foo46").read[List[Long]] ::
    (__  \ "foo47").read[List[Long]] ::
    (__  \ "foo48").read[List[Long]] ::
    (__  \ "foo49").read[List[Long]] ::
    (__  \ "foo50").read[JsNull.type] ::
    HNil
  )
}

Let’s write a play action using this rule:

def hfolder = Action(parse.json) { request =>
  ruleFold.validate(request.body) map { hl =>
    Ok(hl.toString)
  } recoverTotal { errors =>
    BadRequest(errors.toString)
  }
}

// OK
{
  "foo1" : "toto1",
  "foo2" : "toto2",
  "foo3" : "toto3",
  "foo4" : "toto4",
  "foo5" : "toto5",
  "foo6" : "toto6",
  "foo7" : "toto7",
  "foo8" : "toto8",
  "foo9" : "toto9",
  "foo10" : 10,
  "foo11" : 11,
  "foo12" : 12,
  "foo13" : 13,
  "foo14" : 14,
  "foo15" : 15,
  "foo16" : 16,
  "foo17" : 17,
  "foo18" : 18,
  "foo19" : 19,
  "foo20" : true,
  "foo21" : false,
  "foo22" : true,
  "foo23" : false,
  "foo24" : true,
  "foo25" : false,
  "foo26" : true,
  "foo27" : false,
  "foo28" : true,
  "foo29" : false,
  "foo30" : 3.0,
  "foo31" : 3.1,
  "foo32" : 3.2,
  "foo33" : 3.3,
  "foo34" : 3.4,
  "foo35" : 3.5,
  "foo36" : 3.6,
  "foo37" : 3.7,
  "foo38" : 3.8,
  "foo39" : 3.9,
  "foo40" : [1,2,3],
  "foo41" : [11,21,31],
  "foo42" : [12,22,32],
  "foo43" : [13,23,33],
  "foo44" : [14,24,34],
  "foo45" : [15,25,35],
  "foo46" : [16,26,36],
  "foo47" : [17,27,37],
  "foo48" : [18,28,38],
  "foo49" : [19,29,39],
  "foo50" : null
} => toto1 :: toto2 :: toto3 :: toto4 :: toto5 :: toto6 :: toto7 :: toto8 :: toto9 ::
  10 :: 11 :: 12 :: 13 :: 14 :: 15 :: 16 :: 17 :: 18 :: 19 ::
  true :: false :: true :: false :: false :: true :: false :: true :: false ::
  3.0 :: 3.1 :: 3.2 :: 3.3 :: 3.4 :: 3.5 :: 3.6 :: 3.7 :: 3.8 :: 3.9 ::
  List(1, 2, 3) :: List(11, 21, 31) :: List(12, 22, 32) :: List(13, 23, 33) ::
  List(14, 24, 34) :: List(15, 25, 35) :: List(16, 26, 36) :: List(17, 27, 37) ::
  List(18, 28, 38) :: List(19, 29, 39) :: null ::
  HNil


// KO
{
  "foo1" : "toto1",
  "foo2" : "toto2",
  "foo3" : "toto3",
  "foo4" : "toto4",
  "foo5" : "toto5",
  "foo6" : "toto6",
  "foo7" : 50,
  "foo8" : "toto8",
  "foo9" : "toto9",
  "foo10" : 10,
  "foo11" : 11,
  "foo12" : 12,
  "foo13" : 13,
  "foo14" : 14,
  "foo15" : true,
  "foo16" : 16,
  "foo17" : 17,
  "foo18" : 18,
  "foo19" : 19,
  "foo20" : true,
  "foo21" : false,
  "foo22" : true,
  "foo23" : false,
  "foo24" : true,
  "foo25" : false,
  "foo26" : true,
  "foo27" : "chboing",
  "foo28" : true,
  "foo29" : false,
  "foo30" : 3.0,
  "foo31" : 3.1,
  "foo32" : 3.2,
  "foo33" : 3.3,
  "foo34" : 3.4,
  "foo35" : 3.5,
  "foo36" : 3.6,
  "foo37" : 3.7,
  "foo38" : 3.8,
  "foo39" : 3.9,
  "foo40" : [1,2,3],
  "foo41" : [11,21,31],
  "foo42" : [12,22,32],
  "foo43" : [13,23,33],
  "foo44" : [14,24,34],
  "foo45" : [15,25,35],
  "foo46" : [16,26,"blabla"],
  "foo47" : [17,27,37],
  "foo48" : [18,28,38],
  "foo49" : [19,29,39],
  "foo50" : "toto"
} => Failure(List(
  (/foo50,List(ValidationError(error.invalid,WrappedArray(null)))),
  (/foo46[2],List(ValidationError(error.number,WrappedArray(Long)))),
  (/foo27,List(ValidationError(error.invalid,WrappedArray(Boolean)))),
  (/foo15,List(ValidationError(error.number,WrappedArray(Int)))),
  (/foo7,List(ValidationError(error.invalid,WrappedArray(String))
))))

Awesome… now, nobody can say 22 limits is still a problem ;)

Have a look at the code on Github.

Have fun x 50!

Play 2.2 Actor Room: websocket (&more) room manager only with actors

2013-09-22T17:17:00+02:00

The code & sample apps can be found on Github here

Actor-Room makes it easy to:

create any group of connected entities (people or not) (chatroom, forum, broadcast pivot…).
manage connections, disconnections, broadcast, targetted message through actor and nothing else.

For now, members can be:

websocket endpoints through actors without taking care of Iteratees/Enumerators…
Bots to simulate members

Reminders on websockets in Play

Here is the function Play provides to create a websocket:

def async[A](
  f: RequestHeader => Future[(Iteratee[A, _], Enumerator[A])]
)(implicit frameFormatter: FrameFormatter[A]): WebSocket[A]

A websocket is a persistent bi-directional channel of communication (in/out) and is created with:

an Iteratee[A, _] to manage all frames received by the websocket endpoint
an Enumerator[A] to send messages through the websocket
an implicit FrameFormatter[A] to parse frame content to type A (Play provides default FrameFormatter for String or JsValue)

Here is how you traditionally create a websocket endpoint in Play:

object MyController extends Controller {
    def connect = Websocket.async[JsValue]{ rh =>
        // the iteratee to manage received messages
        val iteratee = Iteratee.foreach[JsValue]( js => ...)

        // the enumerator to be able to send messages
        val enumerator = // generally a PushEnumerator
        (iteratee, enumerator)
    }
}

Generally, the Enumerator[A] is created using Concurrent.broadcast[A] and Concurrent.unicast[A] which are very powerful tools but not so easy to understand exactly (the edge-cases of connection close, errors are always tricky).

You often want to:

manage multiple client connections at the same time
parse messages received from websockets,
do something with the message payload
send messages to a given client
broadcast messages to all connected members
create bots to be able to simulate fake connected members
etc…

To do that in Play non-blocking/async architecture, you often end developing an Actor topology managing all events/messages on top of the previous Iteratee/Enumerator.

The Iteratee/Enumerator is quite generic but always not so easy to write.

The actor topology is quite generic because there are administration messages that are almost always the same:

Connection/Forbidden/Disconnection
Broadcast/Send

Actor Room is a helper managing all of this for you. So you can just focus on message management using actors and nothing else. It provides all default behaviors and all behaviors can be overriden if needed. It exposes only actors and nothing else.

The code is based on the chatroom sample (and a cool sample by Julien Tournay) from Play Framework pushed far further and in a more generic way.

What is Actor Room?

An actor room manages a group of connected members which are supervised by a supervisor

Member = 2 actors (receiver/sender)

Each member is represented by 2 actors (1 receiver & 1 sender):

You MUST create at least a Receiver Actor because it’s your job to manage your own message format
The Sender Actor has a default implementation but you can override it.

Supervisor = 1 actor

All actors are managed by 1 supervisor which have two roles:

Creates/supervises all receiver/sender actors
Manages administration messages (routing, forwarding, broadcasting etc…)

Code sample step by step

Create the Actor Room

// default constructor
  val room = Room()

  // constructor with custom supervisor
  // custom supervisor are described later
  val room = Room(Props(classOf[CustomSupervisor]))

The room creates the Supervisor actor for you and delegates the creation of receiver/sender actors to it.

If you want to broadcast a message or target a precise member, you should use the supervisor.

room.supervisor ! Broadcast("fromId", Json.obj("foo" -> "bar"))
  room.supervisor ! Send("fromId", "toId", Json.obj("foo" -> "bar"))

You can manage several rooms in the same project.

Create the mandatory Receiver Actor

There is only one message to manage:

/** Message received and parsed to type A
  * @param from the ID of the sender
  * @param payload the content of the message
  */
case class Received[A](from: String, payload: A) extends Message

If your websocket frames contain Json, then it should be Received[JsValue].

You just have to create a simple actor:

// Create an actor to receive messages from websocket
class Receiver extends Actor {
  def receive = {
    // Received(fromId, js) is the only Message to manage in receiver
    case Received(from, js: JsValue) =>
      (js \ "msg").asOpt[String] match {
        case None =>
          play.Logger.error("couldn't msg in websocket event")

        case Some(s) =>
          play.Logger.info(s"received $s")
          // broadcast message to all connected members
          context.parent ! Broadcast(from, Json.obj("msg" -> s))
      }
  }
}

Please note the Receiver Actor is supervised by the Supervisor actor. So, within the Receiver Actor, context.parent is the Supervisor and you can use it to send/broadcast message as following:

context.parent ! Send(fromId, toId, mymessage)
context.parent ! Broadcast(fromId, mymessage)

// The 2 messages
/** Sends a message from a member to another member */
case class   Send[A](from: String, to: String, payload: A) extends Message

/** Broadcasts a message from a member */
case class   Broadcast[A](from: String, payload: A) extends Message

Create your Json websocket endpoint

Please note that each member is identified by a string that you define yourself.

import org.mandubian.actorroom._

class Receiver extends Actor {
  def receive = {
    ...
  }
}

object Application extends Controller {
  val room = Room()

  /** websocket requires :
    * - the type of the Receiver actor
    * - the type of the payload
    */
  def connect(id: String) = room.websocket[Receiver, JsValue](id)

  // or
  def connect(id: String) = room.websocket[JsValue](id, Props[Receiver])

}

All together

import akka.actor._

import play.api._
import play.api.mvc._
import play.api.libs.json._

// Implicits
import play.api.Play.current
import play.api.libs.concurrent.Execution.Implicits._

import org.mandubian.actorroom._

class Receiver extends Actor {
  def receive = {
    case Received(from, js: JsValue) =>
      (js \ "msg").asOpt[String] match {
        case None => play.Logger.error("couldn't msg in websocket event")
        case Some(s) =>
          play.Logger.info(s"received $s")
          context.parent ! Broadcast(from, Json.obj("msg" -> s))
      }
  }
}

object Application extends Controller {

  val room = Room()

  def websocket(id: String) = room.websocket[Receiver, JsValue](id)

}

Extend default behaviors

Override the administration message format

AdminMsgFormatter typeclass is used by ActorRoom to format administration messages (Connected, Disconnected and Error) by default.

AdminMsgFormatter[JsValue] and AdminMsgFormatter[String] are provided by default.

You can override the format as following:

// put this implicit in the same scope where you create your websocket endpoint
implicit val msgFormatter = new AdminMsgFormatter[JsValue]{
    def connected(id: String) = Json.obj("kind" -> "connected", "id" -> id)
    def disconnected(id: String) = Json.obj("kind" -> "disconnected", "id" -> id)
    def error(id: String, msg: String) = Json.obj("kind" -> "error", "id" -> id, "msg" -> msg)
}

// then this msgFormatter will be used for all administration messages  
def websocket(id: String) = room.websocket[Receiver, JsValue](id)

Override the Sender Actor

You just have to create a new actor as following:

class MyCustomSender extends Actor {

  def receive = {
    case s: Send[JsValue]        => // message send from a member to another one

    case b: Broadcast[JsValue]   => // message broadcast by a member

    case Connected(id)           => // member "id" has connected

    case Disconnected(id)        => // member "id" has disconnected

    case Init(id, receiverActor) => // Message sent when sender actor is initialized by ActorRoom

  }

}

Then you must initialize your websocket with it

def connect(id: String) = room.websocket[JsValue](_ => id, Props[Receiver], Props[MyCustomSender])

You can override the following messages:

// public sender messages
/** Sender actor is initialized by Supervisor */
case class   Init(id: String, receiverActor: ActorRef)

/** Sends a message from a member to another member */
case class   Send[A](from: String, to: String, payload: A) extends Message

/** Broadcasts a message from a member */
case class   Broadcast[A](from: String, payload: A) extends Message

/** member with ID has connected */
case class   Connected(id: String) extends Message

/** member with ID has disconnected */
case class   Disconnected(id: String) extends Message

Override the Supervisor Actor

Please note Supervisor is an actor which manages a internal state containing all members:

var members = Map.empty[String, Member]

You can override the default Supervisor as following:

class CustomSupervisor extends Supervisor {

    def customBroadcast: Receive = {
      case Broadcast(from, js: JsObject) =>
        // adds members to all messages
        val ids = Json.obj("members" -> members.map(_._1))

        members.foreach {
          case (id, member) =>
            member.sender ! Broadcast(from, js ++ ids)

          case _ => ()
        }
    }

    override def receive = customBroadcast orElse super.receive
  }

Create a bot to simulate member

A bot is a fake member that you can use to communicate with other members. It’s identified by an ID as any member.

You create a bot with these API:

case class Member(id: String, val receiver: ActorRef, val sender: ActorRef) extends Message

def bot[Payload](id: String)
    (implicit msgFormatter: AdminMsgFormatter[Payload]): Future[Member]

def bot[Payload](
    id: String,
    senderProps: Props
  )(implicit msgFormatter: AdminMsgFormatter[Payload]): Future[Member]


def bot[Payload](
    id: String,
    receiverProps: Props,
    senderProps: Props): Future[Member]

Then with returned Member, you can simulate messages:

val room = Room()

val bot = room.bot[JsValue]("robot")

// simulate a received message
bot.receiver ! Received(bod.id, Json.obj("foo" -> "bar"))

Naturally, you can override the Bot Sender Actor

/** The default actor sender for Bots */
class BotSender extends Actor {

  def receive = {
    case s =>
      play.Logger.info(s"Bot should have sent ${s}")

  }

}

val bot = room.bot[JsValue]("robot", Props[BotSender])

So what else??? Everything you can override and everything that I didn’t implement yet…

On github project, you will find 2 samples:

simplest which is a very simple working sample.
websocket-chat which is just the Play Framework ChatRoom sample rewritten with ActorRoom.

Have fun!

Scalaz-Stream Plug'n'Play2 Iteratee/WS + Recursive streaming

2013-08-21T23:23:00+02:00

The code for all autosources & sample apps can be found on Github here

The aim of this article is to show how scalaz-stream could be plugged on existing Play Iteratee/Enumerator and used in your web projects. I also wanted to evaluate in depth the power of scalaz-stream Processes by trying to write a recursive streaming action: I mean a web endpoint streaming data and re-injecting its own streamed data in itself.

If you want to see now how scalaz-stream is used with Play, go to this paragraph directly.

Why Scalaz-Stream when you have Play Iteratees?

Play Iteratees are powerful & cool but…

I’m a fan of everything dealing with data streaming and realtime management in backends. I’ve worked a lot on Play Framework and naturally I’ve been using the cornerstone behind Play’s reactive nature: Play Iteratees.

Iteratees (with its counterparts, Enumerators and Enumeratees) are great to manipulate/transform linear streams of data chunks in a very reactive (non-blocking & asynchronous) and purely functional way:

Enumerators identifies the data producer that can generate finite/infinite/procedural data streams.
Iteratee is simply a data folder built as a state machine based on 3 states (Continue, Done, Error) which consumes data from Enumerator to compute a final result.
Enumeratee is a kind of transducer able to adapt an Enumerator producing some type of data to an Iteratee that expects other type of data. Enumeratee can be used as both a pipe transformer and adapter.

Iteratee is really powerful but I must say I’ve always found them quite picky to use, practically speaking. In Play, they are used in their best use-case and they were created for that exactly. I’ve been using Iteratees for more than one year now but I still don’t feel fluent with them. Each time I use them, I must spend some time to know how I could write what I need. It’s not because they are purely functional (piping an Enumerator into an Enumeratee into an Iteratee is quite trivial) but there is something that my brain doesn’t want to catch.

If you want more details about my experience with Iteratees, go to this paragraph

That’s why I wanted to work with other functional streaming tools to see if they suffer the same kind of usability toughness or can bring something more natural to me. There are lots of other competitors on the field such as pipes, conduits and machines. As I don’t have physical time to study all of them in depth, I’ve chosen the one that appealed me the most i.e. Machines.

I’m not yet a Haskell coder even if I can mumble it so I preferred to evaluate the concept with scalaz-stream, a Scala implementation trying to bring machines to normal coders focusing on the aspect of IO streaming.

Scratching the concepts of Machine / Process ?

I’m not going to judge if Machines are better or not than Iteratees, this is not my aim. I’m just experimenting the concept in an objective way.

I won’t explain the concept of Machines in depth because it’s huge and I don’t think I have the theoretical background to do it right now. So, let’s focus on very basic ideas at first:

Machine is a very generic concept that represents a data processing mechanism with potential multiple inputs, an output and monadic effects (typically Future input chunks, side-effects while transforming, delayed output…)
To simplify, let say a machine is a bit like a mechano that you construct by plugging together other more generic machines (such as source, transducer, sink, tee, wye) as simply as pipes.
Building a machine also means planning all the steps you will go through when managing streamed data but it doesn’t do anything until you run it (no side-effect, no resource consumption). You can re-run a machine as many times as you want.
A machine is a state machine (Emit/Await/Halt) as Iteratee but it manages error in a more explicit way IMHO (fallback/error)

In scalaz-stream, you don’t manipulate machines which are too abstract for real-life use-cases but you manipulate simpler concepts:

Process[M, O] is a restricted machine outputting a stream of O. It can be a source if the monadic effect gets input from I/O or generates procedural data, or a sink if you don’t care about the output. Please note that it doesn’t infer the type of potential input at all.
Wye[L, R, O] is a machine that takes 2 inputs (left L / right R) and outputs chunks of type O (you can read from left or right or wait for both before ouputting)
Tee[L, R, O] is a Wye that can only read alternatively from left or from right but not from both at the same time.
Process1[I, O] can be seen as a transducer which accepts inputs of type I and outputs chunks of type O (a bit like Enumeratee)
Channel[M, I, O] is an effectul channel that accepts input of type I and use it in a monadic effect M to produce potential O

What I find attractive in Machines?

Machines is producer/consumer/transducer in the same place and Machines can consume/fold as Iteratee, transform as Enumeratee and emit as Enumerator at the same time and it opens lots of possibilities (even if 3 concepts in one could make it more complicated too).
I feel like playing with legos as you plug machines on machines and this is quite funny actually.
Machines manages monadic effects in its design and doesn’t infer the type of effect so you can use it with I/O, Future and whatever you can imagine that is monadic…
Machines provide out-of-the-box Tee/Wye to compose streams, interleave, zip them as you want without writing crazy code.
The early code samples I’ve seen were quite easy to read (even the implementation is not so complex). Have a look at the StartHere sample provided by scalaz-stream:

property("simple file I/O") = secure {

    val converter: Task[Unit] =
      io.linesR("testdata/fahrenheit.txt")
        .filter(s => !s.trim.isEmpty && !s.startsWith("//"))
        .map(line => fahrenheitToCelsius(line.toDouble).toString)
        .intersperse("\n")
        .pipe(process1.utf8Encode)
        .to(io.fileChunkW("testdata/celsius.txt"))
        .run

    converter.run
    true
  }

But don’t think everything is so simple, machines is a complex concept with lots of theory behind it which is quite abstract. what I find very interesting is that it’s possible to vulgarize this very abstract concept with simpler concepts such as Process, Source, Sink, Tee, Wye… that you can catch quite easily as these are concepts you already manipulated when you were playing in your bathtub when you were child (or even now).

Scalaz-stream Plug’n’Play Iteratee/Enumerator

After these considerations, I wanted to experiment scalaz-stream with Play streaming capabilities in order to see how it behaves in a context I know.

Here is what I decided to study:

Stream data out of a controller action using a scalaz-stream Process
Call an AsyncWebService & consume the response as a stream of Array[Byte] using a scalaz-stream Process

Here is existing Play API :

Action provides Ok.stream(Enumerator)
WS call consuming response as a stream of data WS.get(r: ResponseHeader => Iteratee)

As you can see, these API depends on Iteratee/Enumerator. As I didn’t want to hack Play too much as a beginning, I decided to try & plug scalaz-stream on Play Iteratee (if possible).

Building `Enumerator[O]` from `Process[Task, O]`

The idea is to take a scalaz-stream Source[O] (Process[M,O]) and wrap it into an Enumerator[O] so that it can be used in Play controller actions.

An Enumerator is a data producer which can generate those data using monadic Future effects (Play Iteratee is tightly linked to Future).

Process[Task, O] is a machine outputting a stream of O so it’s logically the right candidate to be adapted with a Enumerator[O]. Let’s remind’ Task is just a scalaz Future[Either[Throwable,A]] with a few helpers and it’s used in scalaz-stream.

So I’ve implemented (at least tried) an Enumerator[O] that accepts a Process[Task, O]:

def enumerator[O](p: Process[Task, O])(implicit ctx: ExecutionContext) =
    new Enumerator[O] {
      ...
      // look the code in github project
      ...
  }

The implementation just synchronizes the states of the Iteratee[O, A] consuming the Enumerator with the states of Process[Task, O] emitting data chunks of O. It’s quite simple actually.

Building `Process1[I, O]` from `Iteratee[I, O]`

The idea is to drive an Iteratee from a scalaz-stream Process so that it can consume an Enumerator and be used in Play WS.

An Iteratee[I, O] accepts inputs of type I (and nothing else) and will fold the input stream into a single result of type O.

A Process1[I, O] accepts inputs of type I and emits chunks of type O but not necessarily one single output chunk. So it’s a good candidate for our use-case but we need to choose which emitted chunk will be the result of the Iteratee[I, O]. here, totally arbitrarily, I’ve chosen to take the first emit as the result (but the last would be as good if not better).

So I implemented the following:

def iterateeFirstEmit[I, O](p: Process.Process1[I, O])(implicit ctx: ExecutionContext): Iteratee[I, O] = {
  ...
  // look the code in github project
  ...
}

The implementation is really raw for experimentation as it goes through the states of the Process1[I,O] and generates the corresponding states of Iteratee[I,O] until first emitted value. Nothing more nothing less…

A few basic action samples

Everything done in those samples could be done with Iteratee/Enumeratee more or less simply. The subject is not there!

Sample 1 : Generates a stream from a Simple Emitter Process

def sample1 = Action {
  val process = Process.emitAll(Seq(1, 2, 3, 4)).map(_.toString)

  Ok.stream(enumerator(process))
}

> curl "localhost:10000/sample1" --no-buffer
1234

Sample 2 : Generates a stream from a continuous emitter

/** A process generating an infinite stream of natural numbers */
val numerals = Process.unfold(0){ s => val x = s+1; Some(x, x) }.repeat

// we limit the number of outputs but you don't have it can stream forever...
def sample2 = Action {
  Ok.stream(enumerator(numerals.map(_.toString).intersperse(",").take(40)))
}

> curl "localhost:10000/sample2" --no-buffer
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,

Sample 3 : Generates a stream whose output frequency is controlled by a tee with numeral generator on left and ticker on right

/** ticks constant every delay milliseconds */
def ticker(constant: Int, delay: Long): Process[Task, Int] = Process.await(
  scalaFuture2scalazTask(delayedNumber(constant, delay))
)(Process.emit).repeat

def sample3 = Action {
  Ok.stream(enumerator(
    // creates a Tee outputting only numerals but consuming ticker // to have the delayed effect
    (numerals tee ticker(0, 100))(processes.zipWith((a,b) => a))
      .take(100)
      .map(_.toString)
      .intersperse(",")
  ))
}

Please note :

scalaFuture2scalazTask is just a helper to convert a Future into Task
tickeris quite simple to understand: it awaits Task[Int] and emits thisInt and repeats it again…
processes.zipWith((a,b) => a) is a tee (2 inputs left/right) that outputs only left data but consumes right also to have the delay effect.
.map(_.toString) simply converts into something writeable by Ok.stream
.intersperse(",") which simply add `”,” between each element

> curl "localhost:10000/sample3" --no-buffer
1... // to simulate the progressive apparition of numbers on screen
1,...
1,2...
...
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100

Sample 4 : Generates a stream using side-effect to control output frequency

/** Async generates this Int after delay*/
def delayedNumber(i: Int, delay: Long): Future[Int] =
  play.api.libs.concurrent.Promise.timeout(i, delay)

/** Creates a process generating an infinite stream natural numbers
  * every `delay milliseconds
  */
def delayedNumerals(delay: Long) = {
  def step(i: Int): Process[Task, Int] = {
    Process.emit(i).then(
      Process.await(scalaFuture2scalazTask(delayedNumber(i+1, delay)))(step)
    )
  }
  Process.await(scalaFuture2scalazTask(delayedNumber(0, delay)))(step)
}

def sample4 = Action {
  Ok.stream(enumerator(delayedNumerals(100).take(100).map(_.toString).intersperse(",")))
}

Please note:

delayedNumber uses an Akka scheduler to trigger our value after timeout
delayedNumerals shows a simple recursive `Process[Task, Int] construction which shouldn’t be too hard to understand

> curl "localhost:10000/sample4" --no-buffer
0... // to simulate the progressive apparition of numbers every 100ms
0,...
0,1...
0,1,...
0,1,2...
...
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99

Sample 5 : Generates a stream by consuming completely another stream

// a process folding all Array[Byte] into a big String
val reader: Process.Process1[Array[Byte], String] = processes.fold1[Array[Byte]]((a, b) => a ++ b )
  .map{ arr => new String(arr) } |> processes.last

def sample5 = Action {
  // the WS call with response consumer by previous Process1[Array[Byte], String] driving the Iteratee[Array[Byte], String]
  val maybeValues: Future[String] =
    WS.url(routes.Application.sample2().absoluteURL())
      .get(rh => iterateeFirstEmit(reader))
      .flatMap(_.run)

  Ok.stream(enumerator(
    // wraps the received String in a Process
    // re-splits it to remove ","
    // emits all chunks
    Process.wrap(scalaFuture2scalazTask(maybeValues))
      .flatMap{ values => Process.emitAll(values.split(",")) }
  ))
}

Please note:

reader is a Process1[Array[Byte], String] that folds all receivedArray[Byte]into aString`
iterateeFirstEmit(reader) simulates an Iteratee[Array[Byte], String] driven by the reader process that will fold all chunks of data received from WS call to routes.Application.sample2()
.get(rh => iterateeFirstEmit(reader)) will return a Future[Iteratee[Array[Byte], String] that is run in .flatMap(_.run) to return a Future[String]
Process.wrap(scalaFuture2scalazTask(maybeValues)) is a trick to wrap the folded Future[String] into a Process[Task, String]
Process.emitAll(values.split(",")) splits the resulting string again and emits all chunks outside (stupid, just for demo)

> curl "localhost:10000/sample5" --no-buffer
1234567891011121314151617181920

Still there? Let’s dive deeper and be sharper!

Building recursive streaming action consuming itself

Hacking WS to consume & re-emit WS in realtime

WS.executeStream(r: ResponseHeader => Iteratee[Array[Byte], A]) is cool API because you can build an iteratee from the ResponseHeader and then the iteratee will consume received `Array[Byte] chunks in a reactive way and will fold them. The problem is that until the iteratee has finished, you won’t have any result.

But I’d like to be able to receive chunks of data in realtime and re-emit them immediately so that I can inject them in realtime data flow processing. WS API doesn’t allow this so I decided to hack it a bit. I’ve written WSZ which provides the API:

def getRealTime(): Process[Future, Array[Byte]]
// based on
private[libs] def realtimeStream: Process[Future, Array[Byte]]

This API outputs a realtime Stream of Array[Byte] whose flow is controlled by promises (Future) being redeemed in AsyncHttpClient AsyncHandler. I didn’t care about ResponseHeaders for this experimentation but it should be taken account in a more serious impl.

I obtain a Process[Future, Array[Byte]] streaming received chunks in realtime and I can then take advantage of the power of machines to manipulate the data chunks as I want.

Sample 6 : Generates a stream by forwarding/refolding another stream in realtime

/** A Process1 splitting input strings using splitter and re-grouping chunks */
def splitFold(splitter: String): Process.Process1[String, String] = {
  // the recursive splitter / refolder
  def go(rest: String)(str: String): Process.Process1[String, String] = {
    val splitted = str.split(splitter)
    println(s"""$str - ${splitted.mkString(",")} --""")
    (splitted.length match {
      case 0 =>
        // string == splitter
        // emit rest
        // loop
        Process.emit(rest).then( Process.await1[String].flatMap(go("")) )
      case 1 =>
        // splitter not found in string 
        // so waiting for next string
        // loop by adding current str to rest
        // but if we reach end of input, then we emit (rest+str) for last element
        Process.await1[String].flatMap(go(rest + str)).orElse(Process.emit(rest+str))
      case _ =>
        // splitter found
        // emit rest + splitted.head
        // emit all splitted elements but last
        // loops with rest = splitted last element
        Process.emit(rest + splitted.head)
               .then( Process.emitAll(splitted.tail.init) )
               .then( Process.await1[String].flatMap(go(splitted.last)) )
    })
  }
  // await1 simply means "await an input string and emits it"
  Process.await1[String].flatMap(go(""))
}

def sample6 = Action { implicit request =>
  val p = WSZ.url(routes.Application.sample4().absoluteURL()).getRealTime.translate(Task2FutureNT)

  Ok.stream(enumerator(p.map(new String(_)) |> splitFold(",")))
}

Please note:

def splitFold(splitter: String): Process.Process1[String, String] is just a demo that coding a Process transducer isn’t so crazy… Look at comments in code
.translate(Task2FutureNF) converts the Process[Future, Array[Byte]] to Process[Task, Array[Byte]] using Scalaz Natural Transformation.
p |> splitFold(",") means “pipe output of process p to input of splitFold”.

> curl "localhost:10000/sample6" --no-buffer
0...
01...
012...
...
01234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798

Let’s finish our trip with a bit of puzzle and mystery.

THE FINAL MYSTERY: recursive stream generating Fibonacci series

As soon as my first experimentations of scalaz-stream with Play were operational, I’ve imagined an interesting case:

Is it possible to build an action generating a stream of data fed by itself: a kind of recursive stream.

With Iteratee, it’s not really possible since it can’t emit data before finishing iteration. It would certainly be possible with an Enumeratee but the API doesn’t exist and I find it much more obvious with scalaz-stream API!

The mystery isn’t in the answer to my question: YES it is possible!

The idea is simple:

Create a simple action
Create a first process emitting a few initialization data
Create a second process which consumes the WS calling my own action and re-emits the received chunks in realtime
Append first process output and second process output
Stream global output as a result of the action which will back-propagated along time to the action itself…

Naturally, if it consumes its own data, it will recall itself again and again and again until you reach the connections or opened file descriptors limit. As a consequence, you must limit the depth of recursion.

I performed different experiences to show this use-case by zipping the stream with itself, adding elements with themselves etc… And after a few tries, I implemented the following code quite fortuitously :

/** @param curDepth the current recursion depth
  * @param maxDepth the max recursion depth
  */
def sample7(curDepth: Int, maxDepth: Int) = Action { implicit request =>

  // initializes serie with 2 first numerals output with a delay of 100ms
  val init: Process[Task, String] = delayedNumerals(100).take(2).map(_.toString)

  // Creates output Process
  // If didn't reach maxDepth, creates a process consuming my own action
  // If reach maxDepth, just emit 0
  val outputProcess =
    if(curDepth < maxDepth) {
      // calling my own action and streaming chunks using getRealTime

      val myself = WSZ.url(
        routes.Application.sample7(curDepth+1, maxDepth).absoluteURL()
      ).getRealTime.translate(Task2FutureNT).map(new String(_))
      // splitFold isn't useful, just for demo
      |> splitFold(",")

      // THE IMPORTANT PART BEGIN
      // appends `init` output with `myself` output
      // pipe it through a helper provided scalaz-stream `processes.sum[Long]`
      // which sums elements and emits partial sums
      ((init append myself).map(_.toLong) |> processes.sum[Long])
      // THE IMPORTANT PART END
      // just for output format
      .map(_.toString).intersperse(",")
    }
    else Process.emit(0).map(_.toString)

  Ok.stream(enumerator(outputProcess))
}

Launch it:

curl "localhost:10000/sample7?curDepth=0&maxDepth=10" --no-buffer
0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765

WTF??? This is Fibonacci series?

Just to remind you about it:

e(0) = 0
e(1) = 1
e(n) = e(n-1) + e(n-2)

Here is the mystery!!!

How does it work???

I won’t tell the answer to this puzzling side-effect and let you think about it and discover why it works XD

But this sample shows exactly what I wanted: Yes, it’s possible to feed an action with its own feed! Victory!

Conclusion

Ok all of that was really funky but is it useful in real projects? I don’t really know yet but it provides a great proof of the very reactive character of scalaz-stream and Play too!

I tend to like scalaz-stream and I feel more comfortable, more natural using Process than Iteratee right now… Maybe this is just an impression so I’ll keep cautious about my conclusions for now…

All of this code is just experimental so be aware about it. If you like it and see that it could be useful, tell me so that we create a real library from it!

Have Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun, Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun, Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,Fun,!

PostScriptum

A few more details about Iteratees

Here are a few things that bother me when I use Play Iteratee (you don’t have to agree, this is very subjective):

Enumeratees are really powerful (maybe the most powerful part of the API) but they can be tricky: for ex, defining a new Enumeratee from scratch isn’t easy at first sight due to the signature of the Enumeratee itself, Enumeratee composes differently on left (with Enumerators) and on right (with Iteratees) and it can be strange at beginning…
Enumerators are not defined (when not using helpers) in terms of the data they produce but with respect to the way an Iteratee will consume the data they will produce. You must somewhat reverse your way of thinking which is not so natural.
Iteratees are great to produce one result by folding a stream of data but if you want to consume/cut/aggregate/re-emit the chunks, the code you write based on Iteratee/Enumeratee quickly becomes complex, hard to re-read and edge cases (error, end of stream) are hard to treat.
When you want to manipulate multiple streams together, zip/interleave them, you must write very complex code too.
End of iteration and Error management with Iteratees isn’t really clear IMHO and when you begin to compose Iteratees together, it becomes hard to know what will happen…
If you want to manipulate a stream with side-effecting, you can do it with Enumeratees but it’s not so obvious…

Play AutoSource & Datomisca: Proofing the concept on Datomic, a Schema DB

2013-07-11T18:18:00+02:00

Now you should use play-autosource 2.0 correcting a few issues & introducing ActionBuilder from play2.2

The code for all autosources & sample apps can be found on Github here

Brand New Autosources

Play AutoSource now have 2 more implementations :

Datomic based on Datomisca, the Scala API I developed with Daniel James (@dwhjames) sponsored by Pellucid Analytics & Zenexity which is presented in this article
CouchBase contributed by Mathieu Ancelin @TrevorReznik
Slick/JDBC based on Play2-Slick contributed by Renato Cavalcanti and Loic Descotte

One month ago, I’ve demo’ed the concept of Autosource for Play2/Scala with ReactiveMongo in this article. ReactiveMongo was the perfect target for this idea because it accepts Json structures almost natively for both documents manipulation and queries.

But how does the concept behave when applied on a DB for which data are constrained by a schema and for which queries aren’t Json.

Using Datomisca-Autosource in your Play project

Add following lines to your project/Build.scala

val mandubianRepo = Seq(
  "Mandubian repository snapshots" at "https://github.com/mandubian/mandubian-mvn/raw/master/snapshots/",
  "Mandubian repository releases" at "https://github.com/mandubian/mandubian-mvn/raw/master/releases/"
)

val appDependencies = Seq()

val main = play.Project(appName, appVersion, appDependencies).settings(
  resolvers ++= mandubianRepo,
  libraryDependencies ++= Seq(
    "play-autosource"   %% "datomisca"       % "1.0",
    ...
  )
)

Create your Model + Schema

With ReactiveMongo Autosource, you could create a pure blob Autosource using JsObject without any supplementary information. But with Datomic, it’s not possible because Datomic forces to use a schema for your data.

We could create a schema and manipulate JsObject directly with Datomic and some Json validators. But I’m going to focus on the static models because this is the way people traditionally interact with a Schema-constrained DB.

Let’s create our model and schema.

// The Model (with characters pointing on Datomic named entities)
case class Person(name: String, age: Long, characters: Set[DRef])

// The Schema written with Datomisca
object Person {
  // Namespaces
  val person = new Namespace("person") {
    val characters = Namespace("person.characters")
  }

  // Attributes
  val name       = Attribute(person / "name",       SchemaType.string, Cardinality.one) .withDoc("Person's name")
  val age        = Attribute(person / "age",        SchemaType.long,   Cardinality.one) .withDoc("Person's age")
  val characters = Attribute(person / "characters", SchemaType.ref,    Cardinality.many).withDoc("Person's characterS")

  // Characters named entities
  val violent = AddIdent(person.characters / "violent")
  val weak    = AddIdent(person.characters / "weak")
  val clever  = AddIdent(person.characters / "clever")
  val dumb    = AddIdent(person.characters / "dumb")
  val stupid  = AddIdent(person.characters / "stupid")

  // Schema
  val schema = Seq(
    name, age, characters,
    violent, weak, clever, dumb, stupid
  )

Create Datomisca Autosource

Now that we have our schema, let’s write the autosource.

import datomisca._
import Datomic._

import play.autosource.datomisca._

import play.modules.datomisca._
import Implicits._

import scala.concurrent.ExecutionContext.Implicits.global
import play.api.Play.current

import models._
import Person._

object Persons extends DatomiscaAutoSourceController[Person] {
  // gets the Datomic URI from application.conf
  val uri = DatomicPlugin.uri("mem")

  // ugly DB initialization ONLY for test purpose
  Datomic.createDatabase(uri)

  // Datomic connection is required
  override implicit val conn = Datomic.connect(uri)
  // Datomic partition in which you store your entities
  override val partition = Partition.USER

  // more than ugly schema provisioning, ONLY for test purpose
  Await.result(
    Datomic.transact(Person.schema),
    Duration("10 seconds")
  )

}

Implementing Json <-> Person <-> Datomic transformers

If you compile previous code, you should have following error:

could not find implicit value for parameter datomicReader: datomisca.EntityReader[models.Person]

Actually, Datomisca Autosource requires 4 elements to work:

Json.Format[Person] to convert Person instances from/to Json (network interface)
EntityReader[Person] to convert Person instances from Datomic entities (Datomic interface)
PartialAddEntityWriter[Person] to convert Person instances to Datomic entities (Datomic interface)
Reads[PartialAddEntity] to convert Json to PartialAddEntity which is actually a simple map of fields/values to partially update an existing entity (one single field for ex).

It might seem more complicated than in ReactiveMongo but there is nothing different. The autosource converts Person from/to Json and then converts Person from/to Datomic structure ie PartialAddEntity. In ReactiveMongo, the only difference is that it understands Json so well that static model becomes unnecessary sometimes ;)…

Let’s define those elements in Person companion object.

object Person {
...
  // Classic Play2 Json Reads/Writes
  implicit val personFormat = Json.format[Person]

  // Partial entity update : Json to PartialAddEntity Reads
  implicit val partialUpdate: Reads[PartialAddEntity] = (
    ((__ \ 'name).read(readAttr[String](Person.name)) orElse Reads.pure(PartialAddEntity(Map.empty))) and
    ((__ \ 'age) .read(readAttr[Long](Person.age)) orElse Reads.pure(PartialAddEntity(Map.empty)))  and
    // need to specify type because a ref/many can be a list of dref or entities so need to tell it explicitly
    (__ \ 'characters).read( readAttr[Set[DRef]](Person.characters) )
    reduce
  )

  // Entity Reads (looks like Json combinators but it's Datomisca combinators)
  implicit val entity2Person: EntityReader[Person] = (
    name      .read[String]   and
    age       .read[Long]     and
    characters.read[Set[DRef]]
  )(Person.apply _)

  // Entity Writes (looks like Json combinators but it's Datomisca combinators)
  implicit val person2Entity: PartialAddEntityWriter[Person] = (
    name      .write[String]   and
    age       .write[Long]     and
    characters.write[Set[DRef]]
  )(DatomicMapping.unlift(Person.unapply))

...
}

Now we have everything to work except a few configurations.

Add AutoSource routes at beginning `conf/routes`

->      /person                     controllers.Persons

Create `conf/play.plugins` to initialize Datomisca Plugin

400:play.modules.datomisca.DatomicPlugin

Append to `conf/application.conf` to initialize MongoDB connection

datomisca.uri.mem="datomic:mem://mem"

Insert your first 2 persons with Curl

>curl -X POST -d '{ "name":"bob", "age":25, "characters": ["person.characters/stupid", "person.characters/violent"] }' --header "Content-Type:application/json" http://localhost:9000/persons --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 21

{"id":17592186045423} -> oh a Datomic ID

>curl -X POST -d '{ "name":"john", "age":43, "characters": ["person.characters/clever", "person.characters/weak"] }' --header "Content-Type:application/json" http://localhost:9000/persons --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 21

{"id":17592186045425}

Querying is the biggest difference in Datomic

In Datomic, you can’t do a getAll without providing a Datomic Query.

But what is a Datomic query? It’s inspired by Datalog which uses predicates to express the constraints on the searched entities. You can combine predicates together.

With Datomisca Autosource, you can directly send datalog queries in the query parameter q for GET or in body for POST with one restriction: your query can’t accept input parameters and must return only the entity ID. For ex:

[ :find ?e :where [ ?e :person/name "john"] ] --> OK

[ :find ?e ?name :where [ ?e :person/name ?name] ] --> KO

Let’s use it by finding all persons.

>curl -X POST --header "Content-Type:text/plain" -d '[:find ?e :where [?e :person/name]]' 'http://localhost:9000/persons/find' --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 231

[
    {
        "name": "bob",
        "age": 25,
        "characters": [
            ":person.characters/violent",
            ":person.characters/stupid"
        ],
        "id": 17592186045423
    },
    {
        "name": "john",
        "age": 43,
        "characters": [
            ":person.characters/clever",
            ":person.characters/weak"
        ],
        "id": 17592186045425
    }
]

Please note the use of POST here instead of GET because Curl doesn’t like [] in URL even using -g option

Now you can use all other routes provided by Autosource

Autosource Standard Routes

Get / Find / Stream

GET /persons?… -> Find by query
GET /persons/ID -> Find by ID
GET /persons/stream -> Find by query & stream result by page

Insert / Batch / Find

POST /persons + BODY -> Insert
POST /persons/find + BODY -> find by query (when query is too complex to be in a GET)
POST /persons/batch + BODY -> batch insert (multiple)

Update / batch

PUT /persons/ID + BODY -> Update by ID
PUT /persons/ID/partial + BODY -> Update partially by ID
PUT /persons/batch -> batch update (multiple)

Delete / Batch

DELETE /persons/ID -> delete by ID
DELETE /persons/batch + BODY -> batch delete (multiple)

Conclusion

Play-Autosource’s ambition was to be DB agnostic (as much as possible) and showing that the concept can be applied to schemaless DB (ReactiveMongo & CouchDB) and schema DB (Datomic) is a good sign it can work. Naturally, there are a few more elements to provide for Datomic than in ReactiveMongo but it’s useful anyway.

Thank to @TrevorReznik for his contribution of CouchBase Autosource.

I hope to see soon one for Slick and a few more ;)

Have Autofun!

Play2 Json Interpolation & Pattern Matching

2013-07-04T08:08:00+02:00

EXPERIMENTAL / DRAFT

Do you remember JsPath pattern matching presented in this article ?

Let’s now go further with something that you should enjoy even more: Json Interpolation & Pattern Matching.

I’ve had the idea of these features for some time in my mind but let’s render unto Caesar what is Caesar’s : Rapture.io proved that it could be done quite easily and I must say I ~~stole~~ got inspired by a few implementation details from them! (specially the @inline implicit conversion for string interpolation class which is required due to a ValueClass limitation that should be removed in further Scala versions)

First of all, code samples as usual…

Create JsValue using String interpolation

scala> val js = json"""{ "foo" : "bar", "foo2" : 123 }"""
js: play.api.libs.json.JsValue = {"foo":"bar","foo2":123}

scala> js == Json.obj("foo" -> "bar", "foo2" -> 123)
res1: Boolean = true

scala> val js = json"""[ 1, true, "foo", 345.234]"""
js: play.api.libs.json.JsValue = [1,true,"foo",345.234]

scala> js == Json.arr(1, true, "foo", 345.234)
res2: Boolean = true

Yes, pure Json in a string…

How does it work? Using String interpolation introduced in Scala 2.10.0 and Jackson for the parsing…

In String interpolation, you can also put Scala variables directly in the interpolated string. You can do the same in Json interpolation.

scala> val alpha = "foo"
alpha: String = foo

scala> val beta = 123L
beta: Long = 123

scala> val js = json"""{ "alpha" : "$alpha", "beta" : $beta}"""
js: play.api.libs.json.JsValue = {"alpha":"foo","beta":123}

scala> val gamma = Json.arr(1, 2, 3)
gamma: play.api.libs.json.JsArray = [1,2,3]

scala> val delta = Json.obj("key1" -> "value1", "key2" -> "value2")
delta: play.api.libs.json.JsObject = {"key1":"value1","key2":"value2"}

scala> val js = json"""
     |         {
     |           "alpha" : "$alpha",
     |           "beta" : $beta,
     |           "gamma" : $gamma,
     |           "delta" : $delta,
     |           "eta" : {
     |             "foo" : "bar",
     |             "foo2" : [ "bar21", 123, true, null ]
     |           }
     |         }
     |       """
js: play.api.libs.json.JsValue = {"alpha":"foo","beta":123,"gamma":[1,2,3],"delta":{"key1":"value1","key2":"value2"},"eta":{"foo":"bar","foo2":["bar21",123,true,null]}}

Please note that string variables must be put between "..." because without it the parser will complain.

Ok, so now it’s really trivial to write Json, isn’t it?

String interpolation just replaces the string you write in your code by some Scala code concatenating pieces of strings with variables as you would write yourself. Kind-of: s"toto ${v1} tata" => "toto + v1 + " tata" + ...

But at compile-time, it doesn’t compile your String into Json: the Json parsing is done at runtime with string interpolation. So using Json interpolation doesn’t provide you with compile-time type safety and parsing for now.

In the future, I may replace String interpolation by a real Macro which will also parse the string at compile-time. Meanwhile, if you want to rely on type-safety, go on using Json.obj / Json.arr API.

Json pattern matching

What is one of the first feature that you discover when learning Scala and that makes you say immediately: “Whoaa Cool feature”? Pattern Matching.

You can write:

scala> val opt = Option("toto")
opt: Option[String] = Some(toto)

scala> opt match {
  case Some(s) => s"not empty option:$s"
  case None    => "empty option"
}
res2: String = not empty option:toto

// or direct variable assignement using pattern matching

scala> val Some(s) = opt
s: String = toto

Why not doing this with Json?

And…. Here it is with Json pattern matching!!!

scala> val js = Json.obj("foo" -> "bar", "foo2" -> 123L)
js: play.api.libs.json.JsObject = {"foo":"bar","foo2":123}

scala> js match {
  case json"""{ "foo" : $a, "foo2" : $b }""" => Some(a -> b)
  case _ => None
}
res5: Option[(play.api.libs.json.JsValue, play.api.libs.json.JsValue)] =
Some(("bar",123))

scala> val json"""{ "foo" : $a, "foo2" : $b}""" = json""" { "foo" : "bar", "foo2" : 123 }"""
a: play.api.libs.json.JsValue = "bar"
b: play.api.libs.json.JsValue = 123

scala> val json"[ $v1, 2, $v2, 4 ]" = Json.arr(1, 2, 3, 4)
v1: play.api.libs.json.JsValue = 1
v2: play.api.libs.json.JsValue = 3

Magical?

Not at all… Just unapplySeq using the tool that enables this kind of Json manipulation as trees: JsZipper…

The more I use JsZippers, the more I find fields where I can use them ;)

More complex Json pattern matching

scala> val js = json"""{
    "key1" : "value1",
    "key2" : [
      "alpha",
      { "foo" : "bar",
        "foo2" : {
          "key21" : "value21",
          "key22" : [ "value221", 123, false ]
        }
      },
      true,
      123.45
    ]
  }"""
js: play.api.libs.json.JsValue = {"key1":"value1","key2":["alpha",{"foo":"bar","foo2":{"key21":"value21","key22":["value221",123,false]}},true,123.45]}

scala> val json"""{ "key1" : $v1, "key2" : ["alpha", $v2, true, $v3] }""" = js
v1: play.api.libs.json.JsValue = "value1"
v2: play.api.libs.json.JsValue = {"foo":"bar","foo2":{"key21":"value21","key22":["value221",123,false]}}
v3: play.api.libs.json.JsValue = 123.45

scala> js match {
    case json"""{
      "key1" : "value1",
      "key2" : ["alpha", $v1, true, $v2]
    }"""   => Some(v1, v2)
    case _ => None
  }
res9: Option[(play.api.libs.json.JsValue, play.api.libs.json.JsValue)] =
Some(({"foo":"bar","foo2":{"key21":"value21","key22":["value221",123,false]}},123.45))

// A non matching example maybe ? ;)
scala>  js match {
    case json"""{
      "key1" : "value1",
      "key2" : ["alpha", $v1, false, $v2]
    }"""   => Some(v1, v2)
    case _ => None
  }
res10: Option[(play.api.libs.json.JsValue, play.api.libs.json.JsValue)] = None

If you like that, please tell it so that I know whether it’s worth pushing it to Play Framework!

Using these features right now in a Scala/SBT project

These features are part of my experimental project JsZipper presented in this article.

To use it, add following lines to your SBT Build.scala:

object ApplicationBuild extends Build {
  ...
  val mandubianRepo = Seq(
    "Mandubian repository snapshots" at "https://github.com/mandubian/mandubian-mvn/raw/master/snapshots/",
    "Mandubian repository releases" at "https://github.com/mandubian/mandubian-mvn/raw/master/releases/"
  )
  ...

  val main = play.Project(appName, appVersion, appDependencies).settings(
    resolvers ++= mandubianRepo,
    libraryDependencies ++= Seq(
      ...
      "play-json-zipper"  %% "play-json-zipper"    % "0.1-SNAPSHOT",
      ...
    )
  )
  ...
}

In your Scala code, import following packages

import play.api.libs.json._
import syntax._
import play.api.libs.functional.syntax._
import play.api.libs.json.extensions._

PatternMatch your fun!

Play AutoSource : 2'30 Kickstart Full REST & CRUD Datasource in Play App (demo'ed with ReactiveMongo + AngularJS)

2013-06-11T18:18:00+02:00

Now you should use play-autosource 2.0 correcting a few issues & introducing ActionBuilder from play2.2

The module code and sample app can be found on Github here

Here we go:

0’ : Create App

> play2 new auto-persons
       _            _
 _ __ | | __ _ _  _| |
| '_ \| |/ _' | || |_|
|  __/|_|\____|\__ (_)
|_|            |__/

play! 2.1.1 (using Java 1.7.0_21 and Scala 2.10.0), http://www.playframework.org

The new application will be created in /Users/pvo/zenexity/workspaces/workspace_mandubian/auto-persons

What is the application name? [auto-persons]
>

Which template do you want to use for this new application?

  1             - Create a simple Scala application
  2             - Create a simple Java application

> 1
OK, application auto-persons is created.

Have fun!

10’ : edit project/Build.scala, add `play-autosource:reactivemongo` dependency

val mandubianRepo = Seq(
  "Mandubian repository snapshots" at "https://github.com/mandubian/mandubian-mvn/raw/master/snapshots/",
  "Mandubian repository releases" at "https://github.com/mandubian/mandubian-mvn/raw/master/releases/"
)

val appDependencies = Seq()

val main = play.Project(appName, appVersion, appDependencies).settings(
  resolvers ++= mandubianRepo,
  libraryDependencies ++= Seq(
    "play-autosource"   %% "reactivemongo"       % "1.0-SNAPSHOT",
    "org.specs2"        %% "specs2"              % "1.13"        % "test",
    "junit"              % "junit"               % "4.8"         % "test"
  )
)

30’ : Create new ReactiveMongo AutoSource Controller in app/Person.scala

package controllers

import play.api._
import play.api.mvc._

// BORING IMPORTS
// Json
import play.api.libs.json._
import play.api.libs.functional.syntax._
// Reactive JSONCollection
import play.modules.reactivemongo.json.collection.JSONCollection
// Autosource
import play.autosource.reactivemongo._
// AutoSource is Async so imports Scala Future implicits
import scala.concurrent.ExecutionContext.Implicits.global
import play.api.Play.current

// >>> THE IMPORTANT PART <<<
object Persons extends ReactiveMongoAutoSourceController[JsObject] {
  val coll = db.collection[JSONCollection]("persons")
}

50’ : Add AutoSource routes at beginning `conf/routes`

->      /person                     controllers.Persons

60’ : Create `conf/play.plugins` to initialize ReactiveMongo Plugin

400:play.modules.reactivemongo.ReactiveMongoPlugin

70’ : Append to `conf/application.conf` to initialize MongoDB connection

mongodb.uri ="mongodb://localhost:27017/persons"

80’ : Launch application

> play2 run

[info] Loading project definition from /.../auto-persons/project
[info] Set current project to auto-persons (in build file:/.../auto-persons/)

[info] Updating {file:/.../auto-persons/}auto-persons...
[info] Done updating.
--- (Running the application from SBT, auto-reloading is enabled) ---

[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000

(Server started, use Ctrl+D to stop and go back to the console...)
[info] Compiling 5 Scala sources and 1 Java source to /.../auto-persons/target/scala-2.10/classes...
[warn] there were 2 feature warnings; re-run with -feature for details
[warn] one warning found
[success] Compiled in 6s

100’ : Insert your first 2 persons with Curl

>curl -X POST -d '{ "name":"bob", "age":25 }' --header "Content-Type:application/json" http://localhost:9000/person --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 33

{"id":"51b868ef31d4002c0bac8ba4"} -> oh a BSONObjectId

>curl -X POST -d '{ "name":"john", "age":43 }' --header "Content-Type:application/json" http://localhost:9000/person --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 33

{"id":"51b868fa31d4002c0bac8ba5"}

110’ : Get all persons

>curl http://localhost:9000/person --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 118

[
  {"name":"bob","age":25.0,"id":"51b868ef31d4002c0bac8ba4"},
  {"name":"john","age":43.0,"id":"51b868fa31d4002c0bac8ba5"}
]

115’ : Delete one person

>curl -X DELETE http://localhost:9000/person/51b868ef31d4002c0bac8ba4 --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 33

{"id":"51b868ef31d4002c0bac8ba4"}

120’ : Verify person was deleted

>curl -X GET http://localhost:9000/person/51b868ef31d4002c0bac8ba4 --include

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
Content-Length: 37

ID 51b868ef31d4002c0bac8ba4 not found

125’ : Update person

>curl -X PUT -d '{ "name":"john", "age":35 }' --header "Content-Type:application/json" http://localhost:9000/person/51b868fa31d4002c0bac8ba5 --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 33

{"id":"51b868fa31d4002c0bac8ba5"}

130’ : Batch insert 2 persons (johnny & tom) with more properties

>curl -X POST -d '[{ "name":"johnny", "age":15, "address":{"city":"Paris", "street":"rue quincampoix"} },{ "name":"tom", "age":3, "address":{"city":"Trifouilly", "street":"rue des accidents de poucettes"} }]' --header "Content-Type:application/json" http://localhost:9000/person/batch --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 8

{"nb":2}

135’ : Get all persons whose name begins by “john”

>curl -X POST -d '{"name":{"$regex":"^john"}}' --header "Content-Type:application/json" http://localhost:9000/person/find --include

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 175

[
  {"name":"john","age":35.0,"id":"51b868fa31d4002c0bac8ba5"},
  {"id":"51b86a1931d400bc01ac8ba8","name":"johnny","age":15.0,"address":{"city":"Paris","street":"rue quincampoix"}}
]

140’ : Delete all persons

>curl -X DELETE -d '{}' --header "Content-Type:application/json" http://localhost:9000/person/batch --include

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 7

deleted

145’ : Take 5’ rest

150’ : Done

So what was demonstrated here?

With Play-Autosource, in a few lines, you obtain :

A backed abstract datasource (here implemented for ReactiveMongo but it could be implemented for other DBs)
All CRUD operations are exposed as pure REST services
The datasource is typesafe (here JsObject but we’ll show later that we can use any type)

It can be useful to kickstart any application in which you’re going to work iteratively on our data models in direct interaction with front-end.

It could also be useful to Frontend developers who need to bootstrap frontend code with Play Framework application backend. With Autosource, they don’t have to care about modelizing strictly a datasource on server-side and can dig into their client-side code quite quickly.

Adding constraints & validation

Now you tell me: “Hey that’s stupid, you store directly JsObject but my data are structured and must be validated before inserting them”

Yes you’re right so let’s add some type constraints on our data:

object Persons extends ReactiveMongoAutoSourceController[JsObject] {
  val coll = db.collection[JSONCollection]("persons")

  // we validate the received Json as JsObject because the autosource type is JsObject
  // and we add classic validations on types
  override val reader = __.read[JsObject] keepAnd (
    (__ \ "name").read[String] and
    (__ \ "age").read[Int](Reads.min(0) keepAnd Reads.max(117))
  ).tupled
}

Try it now:

curl -X POST -d '{ "nameXXX":"bob", "age":25 }' --header "Content-Type:application/json" http://localhost:9000/person --include

HTTP/1.1 400 Bad Request
Content-Type: application/json; charset=utf-8
Content-Length: 62

{"obj.name":[{"msg":"validate.error.missing-path","args":[]}]}

You can add progressively constraints on your data in a few lines. With AutoSource, you don’t need to determine immediately the exact shape of your models and you can work with JsObject directly as long as you need. Sometimes, you’ll even discover that you don’t even need a structured model and JsObject will be enough. (but I also advise to design a bit things before implementing ;))

Keep in mind that our sample is based on an implementation for ReactiveMongo so using Json is natural. For other DB, other data structure might be more idiomatic…

Use typesafe models

Now you tell me: “Funny but but but JsObject is evil because it’s not strict enough. I’m a OO developer (maybe abused by ORM gurus when I was young) and my models are case-classes…”

Yes you’re right, sometimes, you need more business logic or you want to separate concerns very strictly and your model will be shaped as case-classes.

So let’s replace our nice little JsObject by a more serious case class.

// the model
case class Person(name: String, age: Int)
object Person{
  // the famous Json Macro which generates at compile-time a Reads[Person] in a one-liner
  implicit val fmt = Json.format[Person]
}

// The autosource... shorter than before
object Persons extends ReactiveMongoAutoSourceController[Person] {
  val coll = db.collection[JSONCollection]("persons")
}

Please note that I removed the validations I had introduced before because there are not useful anymore: using Json macros, I created an implicit Format[Person] which is used implicitly by AutoSource.

So, now you can see why I consider AutoSource as a typesafe datasource.

Let’s be front-sexy with AngularJS

You all know that AngularJS is the new kid on the block and that you must use it if you want to be sexy nowadays.

I’m already sexy so I must be able to use it without understanding anything to it and that’s exactly what I’ve done: in 30mn without knowing anything about Angular (but a few concepts), I wrote a dumb CRUD front page plugged on my wonderful AutoSource.

Client DS in app/assets/javascripts/persons.js

This is the most important part of this sample: we need to call our CRUD autosource endpoints from angularJS.

We are going to use Angular resources for it even if it’s not really the best feature of AngularJS. Anyway, in a few lines, it works pretty well in my raw case.

(thanks to Paul Dijou for reviewing this code because I repeat I don’t know angularJS at all and I wrote this in 20mn without trying to understand anything :D)

var app =
  // injects ngResource
  angular.module("app", ["ngResource"])
  // creates the Person factory backed by our autosource
  // Please remark the url person/:id which will use transparently our CRUD AutoSource endpoints
  .factory('Person', ["$resource", function($resource){
    return $resource('person/:id', { "id" : "@id" });
  }])
  // creates a controller
  .controller("PersonCtrl", ["$scope", "Person", function($scope, Person) {

    $scope.createForm = {};

    // retrieves all persons
    $scope.persons = Person.query();

    // creates a person using createForm and refreshes list
    $scope.create = function() {
      var person = new Person({name: $scope.createForm.name, age: $scope.createForm.age});
      person.$save(function(){
        $scope.createForm = {};
        $scope.persons = Person.query();
      })
    }

    // removes a person and refreshes list
    $scope.remove = function(person) {
      person.$remove(function() {
        $scope.persons = Person.query();
      })
    }

    // updates a person and refreshes list
    $scope.update = function(person) {
      person.$save(function() {
        $scope.persons = Person.query();
      })
    }
}]);

CRUD UI in index.scala.html

Now let’s create our CRUD UI page using angular directives. We need to be able to:

list persons
update/delete each person
create new persons

@(message: String)

@main("Welcome to Play 2.1") {

     ng-controller="PersonCtrl">
      
       for="name">name: ng-model="createForm.name"/>
       for="age">age: ng-model="createForm.age" type="number"/>
       ng-click="create()">Create new person
      
      
                   ng-repeat="person in persons">
                                                          
    

}

name age actions

 ng-model="person.name"/>
 type="number" ng-model="person.age"/>
 ng-click="update(person)">Update ng-click="remove(person)">Delete





Import Angular in main.scala.html

We need to import angularjs in our application and create angular application using ng-app

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
@(title: String)(content: Html)




 ng-app="app">
    
        </span>@title<span class="nt">
         rel="stylesheet" media="screen" href="@routes.Assets.at("stylesheets/main.css")">
         rel="shortcut icon" type="image/png" href="@routes.Assets.at("images/favicon.png")">
        
        
        

        
    
    
        @content
    




What else??? Oh yes Security…

I know what you think: “Uhuh, the poor guy who exposes his DB directly on the network and who is able to delete everything without any security”

Once again, you’re right. (yes I know I love flattery)

Autosource is by default not secured in any way and actually I don’t really care about security because this is your job to secure your exposed APIs and there are so many ways to secure services that I prefer to let you choose the one you want.

Anyway, I’m a nice boy and I’m going to show you how you could secure the DELETE endpoint using the authentication action composition sample given in Play Framework documentation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
// FAKE USER class to simulate a user extracted from DB.
case class User(name: String)
object User {
  def find(name: String) = Some(User(name))
}

object Persons extends ReactiveMongoAutoSourceController[Person] {
  // The action composite directly copied for PlayFramework doc
  def Authenticated(action: User => EssentialAction): EssentialAction = {
    // Let's define a helper function to retrieve a User
    def getUser(request: RequestHeader): Option[User] = {
      request.session.get("user").flatMap(u => User.find(u))
    }

    // Now let's define the new Action
    EssentialAction { request =>
      getUser(request).map(u => action(u)(request)).getOrElse {
        Done(Unauthorized)
      }
    }
  }

  val coll = db.collection[JSONCollection]("persons")

  // >>> IMPORTANT PART <<<
  // We simply override the delete action
  // If authenticated, we call the original action
  override def delete(id: BSONObjectID) = Authenticated { _ =>
    super.delete(id)
  }

  def index = Action {
    Ok(views.html.index("ok"))
  }

  // the login action which log any user
  def login(name: String) = Action {
    Ok("logged in").withSession("user" -> name)
  }

  // the logout action which log out any user
  def logout = Action {
    Ok("logged out").withNewSession
  }
}



Nothing to complicated here.
If you need to add headers in your responses and params to querystring, it’s easy to wrap autosource actions. Please refer to Play Framework doc for more info…

I won’t try it here, the article is already too long but it should work…









Play-Autosource is DB agnostic

Play-Autosource Core is independent of the DB and provides Reactive (Async/Nonblocking) APIs to fulfill PlayFramework requirements.

Naturally this 1st implementation uses ReactiveMongo which is one of the best sample of DB reactive driver. MongoDB fits very well in this concept too because document records are really compliant to JSON datasources.

But other implementations for other DB can be done and I count on you people to contribute them.

DB implementation contributions are welcome (Play-Autosource is just Apache2 licensed) and AutoSource API are subject to evolutions if they appear to be erroneous.









Conclusion

Play-Autosource provides a very fast & lightweight way to create a REST CRUD typesafe datasource in your Play/Scala application. You can begin with blob data such as JsObject and then elaborate the model of your data progressively by adding constraints or types to it.

There would be many more things to say about Play/Autosource:


you can also override writers to change output format
you have some alpha streaming API also
etc…



There are also lots of features to improve/add because it’s still a very draft module.

If you like it and have ideas, don’t hesitate to discuss, to contribute, to improve etc…

curl -X POST -d "{ "coding" : "Have fun"} http://localhost:9000/developer

PS: Thanks to James Roper for his article about advanced routing in Play Framework which I copied shamefully XD



















Reactive Json Crafting : JsZipper + ReactiveMongo + multiple Async WS calls
2013-05-01T17:17:00+02:00
EXPERIMENTAL / DRAFT

The sample app can be found on Github here





Hi again folks!

Now, you may certainly have realized I’m Play2.1 Json API advocate. But you may also have understood that I’m not interested in Json as an end in itself. What catches my attention is that it’s a versatile arborescent data structure that can be used in web server&client, in DB such as ReactiveMongo and also when communicating between servers with WebServices.

So I keep exploring what can be done with Json (specially in the context of PlayFramework reactive architecture) and building the tools that are required to concretize my ideas.


My last article introduced JsPath Pattern Matching and I told you that I needed this tool to use it with JsZipper. It’s time to use it…
Here is why I want to do:

  Build dynamically a Json structure by aggregating data obtained by calling several external WS such as twitterAPI or github API or whatever API.

  Build this structure from a Json template stored in MongoDB in which I will find the URL and params of WebServices to call.

  Use Play2.1/WS & ReactiveMongo reactive API meaning resulting Json should be built in an asynchronous and non-blocking way.

  Use concept of JsZipper introduced in my previous article to be able to modify efficiently Play2.1/Json immutable structures.




Please note that this idea and its implementation is just an exercise of style to study the idea and introduce technical concepts but naturally it might seem a bit fake. Moreover, keep in mind, JsZipper API is still draft…





The idea of Json template

Imagine I want to gather twitter user timeline and github user profile in a single Json object.

I also would like to:


configure the URL of WS and query parameters to fetch data
customize the resulting Json structure



Let’s use a Json template such as:

1
2
3
4
5
6
7
8
9
10
11
12
{
  "streams" : {
    "twitter" : {
      "url" : "http://localhost:9000/twitter/statuses/user_timeline",
      "user_id" : "twitter_nick"
    },
    "github" : {
      "url" : "http://localhost:9000/github/users",
      "user_id" : "github_nick"
    }
  }
}



Using the url and user_id found in __\streams\twitter, I can call twitter API to fetch the stream of tweets and the same for__\streams\github`. Finally I replace the content of each node as following:

1
2
3
4
5
6
7
8
9
10
{
  "streams" : {
    "twitter" : {
      // TWITTER USER TIMELINE HERE
    },
    "github" : {
      // GITHUB USER PROFILE HERE
    }
  }
}



Moreover, I’d like to store multiple templates like previous sample with multiple user_id to be able to retrieve multiple streams at the same time.





Creating Json template in Play/ReactiveMongo (v0.9)

Recently, Stephane Godbillon has released ReactiveMongo v0.9 with corresponding Play plugin. This version really improves and eases the way you can manipulate Json directly with Play & Mongo from Scala.

Let’s store a few instance of previous templates using this API:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// gets my mongo collection
def coll = db.collection[JSONCollection]("templates")

def provision = Action { Async {
  val values = Enumerator(
    Json.obj(
      "streams" -> Json.obj(
        "twitter" -> Json.obj(
          "url" -> "http://localhost:9000/twitter/statuses/user_timeline",
          "user_id" -> "twitter_nick1"
        ),
        "github" -> Json.obj(
          "url" -> "http://localhost:9000/github/users",
          "user_id" -> "github_nick1"
        )
      )
    ),
    ... more templates
  )

  coll.bulkInsert(values).map{ nb =>
    Ok(Json.obj("nb"->nb))
  }

} }



Hard isn’t it?

Note that I use localhost URL because with real Twitter/Github API I would need OAuth2 tokens and this would be a pain for this sample :)









Reactive Json crafting

Now, let’s do the real job i.e the following steps:


retrieve the template(s) from Mongo using ReactiveMongo JsonCollection
call the WebServices to fetch the data using Play Async WS
update the Json template(s) using Monadic JsZipper JsZipperM[Future]



The interesting technical points here are that:


ReactiveMongo is async so we get Future[JsValue]
Play/WS is Async so we get also Future[JsValue]
We need to call multiple WS so we have a Seq[Future[JsValue]]



We could use Play/Json transformers presented in a previous article but knowing that you have to manage Futures and multiple WS calls, it would create quite complicated code.

Here is where Monadic JsZipper becomes interesting:


JsZipper allows modifying immutable JsValue which is already cool
JsZipperM[Future] allows modifying JsValue in the future and it’s even better!



Actually the real power of JsZipper (besides being able to modify/delete/create a node in immutable Json tree) is to transform a Json tree into a Stream of nodes that it can traverse in depth, in width or whatever you need.





Less code with WS sequential calls

Here is the code because you’ll see how easy it is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// a helper to call WS
def callWSFromTemplate(value: JsValue): Future[JsValue] =
  WS.url((value \ "url").as[String])
    .withQueryString( "user_id" -> (value \ "user_id").as[String] )
    .get().map{ resp => resp.json }

// calling WS sequentially
def dataSeq = Action{
  Async{
    for{
      templates <- coll.find(Json.obj()).cursor[JsObject].toList   // retrieves templates from Mongo
      updated   <- Json.toJson(templates).updateAllM{
        case (_ \ "twitter", value) => callWSFromTemplate(value)
        case (_ \ "github", value)  => callWSFromTemplate(value)
        case (_, value)             => Future.successful(value)
      }
    } yield Ok(updated)
  }
}



Please note:


Json.toJson(templates) transforms a List[JsObject] into JsArray because we want to manipulate pure JsValue with JsZipperM[Future].
.updateAllM( (JsPath, JsValue) => Future[JsValue] ) is a wrapper API hiding the construction of a JsZipperM[Future]: once built, the `JsZipperM[Future] traverses the Json tree and for each node, it calls the provided function flatMapping on Futures before going to next node. This makes the calls to WS sequential and not parallel.
case (_ \ "twitter", value) : yes here is the JsPath pattern matching and imagine the crazy stuff you can do mixing Json traversal and pattern matching ;)
Async means the embedded code will return Future[Result] but remember that it DOESN’T mean the Action is synchronous/blocking because in Play, everything is Asynchronous/non-blocking by default.



Then you could tell me that this is cool but the WS are not called in parallel but sequentially. Yes it’s true but imagine that it’s less than 10 lines of code and could even be reduced. Yet, here is the parallelized version…





Parallel WS calls

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def dataPar = Action{
  Async{
    coll.find(Json.obj()).cursor[JsObject].toList.flatMap{ templates =>
      // converts List[JsObject] into JsArray
      val jsonTemplates = Json.toJson(templates)

      // gathers all nodes that need to be updated
      val nodes = jsonTemplates.findAll{
        case (_ \ "twitter", _) | (_ \ "github", _) => true
        case (_, value) => false
      }

      // launches WS calls in parallel and updates original JsArray
      Future.traverse(nodes){
        case (path@(_ \ "twitter"), value) => callWSFromTemplate(value).map( resp => path -> resp )
        case (path@(_ \ "github"), value)  => callWSFromTemplate(value).map( resp => path -> resp )
      }.map{ pathvalues => Ok(jsonTemplates.set(pathvalues:_*)) }
    }
  }
}



Note that:


jsonTemplates.findAll( filter: (JsPath, JsValue) => Boolean ) traverses the Json tree and returns a Stream[(JsPath, JsValue)] containing the filtered nodes. This is not done with Future because we want to get all nodes now to be able to launch all WS calls in parallel.
Future.traverse(nodes)(T => Future[T]) traverses the filtered values and calls all WS in parallel.
case (path@(_ \ "twitter"), value) is just JsPath pattern matching once again keeping track of full path to be able to return it with the value path -> resp for next point.
jsonTemplates.set( (JsPath, JsValue)* ) finally updates all values at given path. Note how easy it is to update multiple values at multiple paths.



A bit less elegant than the sequential case but not so much.









Conclusion

This sample is a bit stupid but you can see the potential of mixing those different tools together.

Alone, JsZipper and JsPath pattern matching provides very powerful ways of manipulating Json that Reads/Writes can’t do easily.

When you add reactive API on top of that, JsZipper becomes really interesting and elegant.

The sample app can be found on Github here

Have JsZipperM[fun]!



















Play2 Json Path Pattern Matching
2013-05-01T17:17:00+02:00
EXPERIMENTAL / DRAFT








While experimenting Play21/Json Zipper in my previous article, I needed to match patterns on JsPath and decided to explore a bit this topic.
This article just presents my experimentations on JsPath pattern matching so that people interested in the topic can tell me if they like it or not and what they would add or remove. So don’t hesitate to let comments about it.
If the result is satisfying, I’ll propose it to Play team ;)



Let’s go to samples as usual.

Very simple pattern matching

match/scale-style

1
2
3
4
5
scala> __ \ "toto" match {
  case __ \ key => Some(key)
  case _ => None
}
res0: Option[String] = Some(toto)



val-style

1
2
scala> val _ \ toto = __ \ "toto"
toto: String = toto



Note that I don’t write val __ \ toto = __ \ "toto" (2x Underscore) as you would expect.

Why? Let’s write it:

1
2
3
scala> val __ \ toto = __ \ "toto"
<console>:20: error: recursive value x$1 needs type
val __ \ toto = __ \ "toto"



Actually, 1st __ is considered as a variable to be affected by Scala compiler. Then the variable __ appears on left and right side which is not good.

So I use _ to ignore its value because I know it’s __. If I absolutely wanted to match with __, you would have written:

1
2
scala> val JsPath \ toto = __ \ "toto"
toto: String = toto









Pattern matching with indexed path

1
2
3
4
5
6
7
8
9
scala> val (_ \ toto)@@idx = (__ \ "toto")(2)
toto: String = toto
idx: Int = 2

scala> (__ \ "toto")(2) match {
  case (__ \ "toto")@@idx => Some(idx)
  case _      => None
}
res1: Option[Int] = Some(2)



Note the usage of @@ operator that you can dislike. I didn’t find anything better for now but if anyone has a better idea, please give it to me ;)





Pattern matching the last element of a JsPath

1
2
scala> val _ \ last = __ \ "alpha" \ "beta" \ "delta" \ "gamma"
last: String = gamma



Using _, I ignore everything before gamma node.





Matching only the first element and the last one

1
2
3
4
5
6
7
8
scala> val _ \ first \?\ last = __ \ "alpha" \ "beta" \ "gamma" \ "delta"
first: String = alpha
last: String = delta

scala> val (_ \ first)@@idx \?\ last = (__ \ "alpha")(2) \ "beta" \ "gamma" \ "delta"
first: String = alpha
idx: Int = 2
last: String = delta



Note the \?\ operator which is also a temporary choice: I didn’t want to choose \\ ause \?\ operator only works in the case where you match between the first and the last element of the path and not between anything and anything…





A few more complex cases

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
scala> val (_ \ alpha)@@idx \ beta \ gamma \ delta = (__ \ "alpha")(2) \ "beta" \ "gamma" \ "delta"
alpha: String = alpha
idx: Int = 2
beta: String = beta
gamma: String = gamma
delta: String = delta

scala> val (_ \ alpha)@@idx \ _ \ _ \ delta = (__ \ "alpha")(2) \ "beta" \ "gamma" \ "delta"
alpha: String = alpha
idx: Int = 2
delta: String = delta

scala> val _@@idx \?\ gamma \ delta = (__ \ "alpha")(2) \ "beta" \ "gamma" \ "delta"
idx: Int = 2
gamma: String = gamma
delta: String = delta

scala> (__ \ "alpha")(2) \ "beta" \ "gamma" \ "delta" match {
  case _@@2 \ "beta" \ "gamma" \ _ => true
  case _ => false
}
res4: Boolean = true









And finally using regex?

1
2
3
4
5
6
7
8
scala> val pattern = """al(\d)*pha""".r
pattern: scala.util.matching.Regex = al(\d)*pha

scala> (__ \ "foo")(2) \ "al1234pha" \ "bar" match {
  case (__ \ "foo")@@idx \ pattern(_) \ "bar" => true
  case _ => false
}
res6: Boolean = true



So, I think we can provide more features and now I’m going to use it with my JsZipper stuff in my next article ;)

If you like it, tell it!

Have fun!



















JsZipper : Play2 Json advanced (& monadic) manipulations
2013-05-01T17:17:00+02:00
EXPERIMENTAL / DRAFT





The code is available on Github project play-json-zipper


JsZipper is a new tool allowing much more complex & powerful manipulations of Json structures for Play2/Json Scala API (not a part of Play2 core for now)
JsZipper is inspired by the Zipper concept introduced by Gérard Huet in 1997.

The Zipper allows to update immutable traversable structures in an efficient way. Json is an immutable AST so it fits well. FYI, the Zipper behaves like a loupe that walks through each node of the AST (left/right/up/down) while keeping aware of the nodes on its left, its right and its upper. The interesting idea behind the loupe is that when it targets a node, it can modify and even delete the focused node. The analogy to the pants zipper is quite good too because when it goes down the tree, it behaves as if it was opening the tree to be able to drive the loupe through all nodes and when it goes up, it closes back the tree… I won’t tell more here, it would be too long.

JsZipper is a specific interpretation of Zipper concept for Play/Json API based on :
Scala Streams to go through / update / construct Json AST nodes in a lazy way
Monadic aspects to provide funnier ways of manipulating the Json AST (plz see below)




Please note, JsZipper is not an end in itself but a tool useful to provide new API to manipulate Json.



Let’s go to samples because it explains everything.

We’ll use following Json Object.

1
2
3
4
5
6
7
8
9
10
11
scala> val js = Json.obj(
  "key1" -> Json.obj(
    "key11" -> "TO_FIND",
    "key12" -> 123L,
    "key13" -> JsNull
  ),
  "key2" -> 123,
  "key3" -> true,
  "key4" -> Json.arr("TO_FIND", 345.6, "test", Json.obj("key411" -> Json.obj("key4111" -> "TO_FIND")))
)
js: play.api.libs.json.JsObject = {"key1":{"key11":"TO_FIND","key12":123,"key13":null},"key2":123,"key3":true,"key4":["TO_FIND",345.6,"test",{"key411":{"key4111":"TO_FIND"}}]}



Basic manipulations

Setting multiple paths/values

1
2
3
4
5
scala> js.set(
  (__ \ "key4")(2) -> JsNumber(765.23),
  (__ \ "key1" \ "key12") -> JsString("toto")
)
res1: play.api.libs.json.JsValue = {"key1":{"key11":"TO_FIND","key12":"toto","key13":null},"key2":123,"key3":true,"key4":["TO_FIND",345.6,765.23,{"key411":{"key4111":"TO_FIND"}}]}



Deleting multiple paths/values

1
2
3
4
5
6
scala> js.delete(
  (__ \ "key4")(2),
  (__ \ "key1" \ "key12"),
  (__ \ "key1" \ "key13")
)
res2: play.api.libs.json.JsValue = {"key1":{"key11":"TO_FIND"},"key2":123,"key3":true,"key4":["TO_FIND",345.6,{"key411":{"key4111":"TO_FIND"}}]}



Finding paths/values according to a filter

1
2
3
4
5
6
scala> js.findAll( _ == JsString("TO_FIND") ).toList
res5: List[(play.api.libs.json.JsPath, play.api.libs.json.JsValue)] = List(
  (/key1/key11,"TO_FIND"),
  (/key4(0),"TO_FIND"),
  (/key4(3)/key411/key4111,"TO_FIND")
)



Updating values according to a filter based on value

1
2
3
4
5
scala> js.updateAll( (_:JsValue) == JsString("TO_FIND") ){ js =>
  val JsString(str) = js
  JsString(str + "2")
}
res6: play.api.libs.json.JsValue = {"key1":{"key11":"TO_FIND2","key12":123,"key13":null},"key2":123,"key3":true,"key4":["TO_FIND2",345.6,"test",{"key411":{"key4111":"TO_FIND2"}}]}



Updating values according to a filter based on path+value

1
2
3
4
5
6
7
scala> js.updateAll{ (path, js) =>
  JsPathExtension.hasKey(path) == Some("key4111")
}{ (path, js) =>
  val JsString(str) = js
  JsString(str + path.path.last)
}
res1: play.api.libs.json.JsValue = {"key1":{"key11":"TO_FIND","key12":123,"key13":null},"key2":123,"key3":true,"key4":["TO_FIND",345.6,"test",{"key411":{"key4111":"TO_FIND/key4111"}}]}



Creating an object from scratch

1
2
3
4
5
6
7
scala> val build = JsExtensions.buildJsObject(
  __ \ "key1" \ "key11" -> JsString("toto"),
  __ \ "key1" \ "key12" -> JsNumber(123L),
  (__ \ "key2")(0)      -> JsBoolean(true),
  __ \ "key3"           -> Json.arr(1, 2, 3)
)
build: play.api.libs.json.JsValue = {"key1":{"key11":"toto","key12":123},"key3":[1,2,3],"key2":[true]}









Let’s be funnier with Monads now

Let’s use Future as our Monad because it’s… coooool to do things in the future ;)

Imagine you call several services returning Future[JsValue] and you want to build/update a JsObject from it.
Until now, if you wanted to do that with Play2/Json, it was quite tricky and required some code.

Here is what you can do now.

Updating multiple FUTURE values at given paths

1
2
3
4
5
6
7
8
scala> val maybeJs = js.setM[Future](
  (__ \ "key4")(2)        -> future{ JsNumber(765.23) },
  (__ \ "key1" \ "key12") -> future{ JsString("toto") }
)
maybeJs: scala.concurrent.Future[play.api.libs.json.JsValue] = scala.concurrent.impl.Promise$DefaultPromise@6beb722d

scala> Await.result(maybeJs, Duration("2 seconds"))
res4: play.api.libs.json.JsValue = {"key1":{"key11":"TO_FIND","key12":"toto","key13":null},"key2":123,"key3":true,"key4":["TO_FIND",345.6,765.23,{"key411":{"key4111":"TO_FIND"}}]}



Update multiple FUTURE values according to a filter

1
2
3
4
5
6
7
8
9
10
scala> val maybeJs = js.updateAllM[Future]( (_:JsValue) == JsString("TO_FIND") ){ js =>
  future {
    val JsString(str) = js
    JsString(str + "2")
  }
}
maybeJs: scala.concurrent.Future[play.api.libs.json.JsValue] = scala.concurrent.impl.Promise$DefaultPromise@35a4bb1a

scala> Await.result(maybeJs, Duration("2 seconds"))
res6: play.api.libs.json.JsValue = {"key1":{"key11":"TO_FIND2","key12":123,"key13":null},"key2":123,"key3":true,"key4":["TO_FIND2",345.6,"test",{"key411":{"key4111":"TO_FIND2"}}]}



Creating a FUTURE JsArray from scratch

1
2
3
4
5
6
7
8
scala> val maybeArr = JsExtensions.buildJsArrayM[Future](
  future { JsNumber(123.45) },
  future { JsString("toto") }
)
maybeArr: scala.concurrent.Future[play.api.libs.json.JsValue] = scala.concurrent.impl.Promise$DefaultPromise@220d48e4

scala> Await.result(maybeArr, Duration("2 seconds"))
res0: play.api.libs.json.JsValue = [123.45,"toto"]



It’s still draft so it can be improved but if you like it, don’t hesitate to comment and if people like it, it could become a part of Play Framework itself

Have fun!



















Quick Survey about most basic concept in Functional Programming
2013-04-13T14:14:00+02:00
The question


What’s the first word coming in your mind when I say:
“Most basic concept of functional programming?”

















For info, this dendrograph was pre-computed using Play2.1 app sucking Tweets & filtering/grouping the results in a very manual-o-matic way…

Have Fun(ctional)

Mandubian Blog

ShapelessStream, when Akka-Stream meets Shapeless Coproduct at compile-time

Akka-Stream loves Types

Multi-type flows

Miles Sabin to the rescue

Akka-Stream Graph Mutable builders

Hacking mutable builders with Shapeless

Sample

Some compile errors now ?

For input:

For output:

Conclusion

FreeR - Hybrid Free Monads for Reduced Quadratic Complexity/Observability & Map-Fusion Optimization in Scala

Introduction

Solving left-associated quadratic complexity

Quadratic observability

FreeR hybrid structure

Left Association

Observability

Right association complexity

Cherry on the cake: map-fusion optimization

Conclusion

Scaledn: Promote EDN as a far better alternative to Json in Scala

SCALEDN, EDN Scala API

Why EDN?…

EDN manages number types far better than Json

EDN knows much more about collections

EDN accepts characters & unicode

EDN accepts comments & discarded values

EDN knows about symbols & keywords

EDN is extensible using tags

EDN has no root node & can be streamed

Conclusion: EDN should be preferred to Json

Scaledn insight

Runtime Parsing

Compile-time parsing with Macros

Classic Scala types

Shapeless heterogenous collections

Macro API

Mixing macro with Scala string interpolation

Runtime validation of EDN to Scala

Serializing Scala to EDN

TODO

Shapeless HFunctor for Heterogenous Structures (& others)

Shapeless HMonoid for Heterogenous Structures (& others)

Summon Daemonad to macro-snoop into monad stacks

What?

Motivations

1 - Experiment writing a very ugly Scala macro

2 - Investigate ScalaAsync generalization to all monads + (some) monad stacks

3 - Explicitly Mark monadic blocks

You don’t need or you don’t want to know what is a monad…

You want to know or you know what is a monad …

Back to code Sample

What is working ?

What isn’t working ?

A very stupid example to finish with 4-depth stack

TODO

Special Thanks

ZPark-Ztream II (Part 3/3): Fancy Spark Streamed Machine-Learning & new Scalaz-Stream NIO API

Synopsis

[Part 3/3] Fancy Spark Machine Learning with NIO client/server & DStream…

Train collaborative filtering model

Training client

Training server

Training model

Building RDD[Rating] from server stream

Run it

Predict Ratings

Prediction client

Prediction server

Prediction Stream

Running all in same StreamingContext

Run it…

Final conclusion

ZPark-Ztream II (Part 2/3): Fancy Spark Streamed Machine-Learning & new Scalaz-Stream NIO API

Synopsis

[Part 2/3] From Scalaz-Stream NIO client & server to Spark DStream

Scalaz-Stream NIO Client

What is a client?

Building `RDD[Rating]` from server stream

Running all in same `StreamingContext`

Client seen as `Process`

Plug your own data source on `Exchange`

Managing client/server business logic with `Wye`

So what does this `Exchange.wye` API do?

Implement the client with `wye/run`

Server seen as `Process`

Pull from `Process[Task, T]`, Push to `DStream[T]` with `LocalInputDStream`

`Process` vs `DStream` ?

`dstreamize` implementation

How to compose `Rule[JsValue, FooBar]` with `Rule[String, String]) :: Rule[Int, Int] :: Rule[Long,Long]` ?

How to compose `Rule[JsValue, String :: Int :: Long :: HNil]` with `Rule[String, String]) :: Rule[Int, Int] :: Rule[Long,Long]`?