Anyone worked with Spark in Scala?

Discussion in 'Tech Heads' started by AgelessDrifter, Oct 16, 2018.

  1. AgelessDrifter

    AgelessDrifter TZT Neckbeard Lord

    Post Count:
    43,342
    I've never worked with a compiled language before. I like that it catches more bugs at compile time so you don't have jobs running for hours before hitting an exception. But there's gotta be a better debugging workflow than tweaking the source code, compiling a .jar with sbt, spark-submitting to actually run the job, and then sifting through tons of logs to see the output in console, right?
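
    For reference, my setup is about as bare-bones as it gets - a build.sbt roughly like this (names and versions here are just what I happen to be on, not a recommendation):

        // build.sbt - minimal Spark project; name, versions, and class are placeholders
        name := "my-spark-job"
        version := "0.1"
        scalaVersion := "2.11.12"

        // "provided" because the cluster supplies Spark at runtime;
        // sbt only needs it for compilation
        libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.2" % "provided"

        // then the loop is:
        //   sbt package
        //   spark-submit --class com.example.Main target/scala-2.11/my-spark-job_2.11-0.1.jar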
     
  2. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    I don't work directly w/Spark/Scala - but the workflow you described sounds very standard to me.

    This is where the whole CI/CD methodology comes in, I think. Obviously doing all that manually is a pain and inefficient, so ppl tend to set up pipeline automation for it. The ideal standard for development these days (from what I can tell) is: check in your code, and the pipeline handles all the compiling, publishing, testing, and displaying of any logs/errors.
     
  3. AgelessDrifter

    AgelessDrifter TZT Neckbeard Lord

    Post Count:
    43,342
    Hmm, it's only one line each to compile and run; it just takes so fucking long every time I make a minor tweak

    I've been spoiled by jupyter notebooks
     
  4. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    Yep, it's a bitch, but that's how compiled shit goes. Again, I'm stating all this out of osmosis-level knowledge (I'm not a build-and-release guy, but I do run the systems underneath, so have to understand the concepts for work). It is my assumption that all modern compiled languages are handled this way, but it's entirely possible that specific languages (hell maybe all of them) have other intermediate steps (unit testing?) that can be taken without running your shit through full CI/CD.
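
    Like, from what I gather you can at least unit test the logic locally under `sbt test` without ever touching a cluster - something like this, I think (total sketch from a non-Spark guy, all names made up):

        import org.apache.spark.sql.SparkSession
        import org.scalatest.FunSuite

        // sketch: Spark runs in-process via local[*], so `sbt test`
        // exercises the logic with no cluster or spark-submit involved
        class WordCountSpec extends FunSuite {
          test("counts words") {
            val spark = SparkSession.builder()
              .master("local[*]")   // local, in-process Spark
              .appName("test")
              .getOrCreate()
            try {
              val words = spark.sparkContext.parallelize(Seq("a", "b", "a"))
              val counts = words.map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()
              assert(counts("a") == 2)
              assert(counts("b") == 1)
            } finally {
              spark.stop()
            }
          }
        }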
     
  5. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    only worked w/ spark in python

    ur compile-y time workflow sounds normal to me tho

    & tbh even in python you end up doing an even worse version of a lot of what you said when you work in full clusters w/ spark, because you can't see STDOUT from inside executors on core nodes until a job finishes. then you gotta yarn logs -applicationId &> filename that shit, and you end up with a 20 GB output log because why not
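
    scala version of the gotcha is basically this, i assume (same deal as python):

        import org.apache.spark.sql.SparkSession

        // sketch: why you never see executor output on the driver console
        object StdoutGotcha {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("stdout-gotcha").getOrCreate()
            spark.sparkContext.parallelize(1 to 10).foreach { x =>
              // this runs in an executor JVM on a core node, so it goes to
              // that executor's stdout file, NOT the console you're watching
              println(s"processing $x")
            }
            spark.stop()
            // only after the job ends can you pull it all down, with
            // something like: yarn logs -applicationId <appId> &> spark.log
          }
        }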
     
  6. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    tldr; build pipelines, git gud
     
  7. AgelessDrifter

    AgelessDrifter TZT Neckbeard Lord

    Post Count:
    43,342
    Hmm, yeah that does sound horrible. spark-shell's logging is nightmarishly verbose by default as well. Even with the logging level set to WARN I have to scroll up through 40 or 50 lines of bullshit to see what the output was if I want to see it in console
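
    (Setting the level is just this in the shell, fwiw - plus editing conf/log4j.properties if you want it to stick for spark-submit runs:)

        // sc already exists inside spark-shell
        sc.setLogLevel("WARN")   // or "ERROR" to suppress even more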

    I guess I was just wondering if standard practice didn't look something more like building up the bulk of the code in some souped-up variation on spark-shell or a scala equivalent of a jupyter notebook, and then porting it over to a stand-alone package once everything is more or less working
     
  8. AgelessDrifter

    AgelessDrifter TZT Neckbeard Lord

    Post Count:
    43,342
    I'm not sure what pipeline means in this context, Ut

    And I'm not sure it'd solve the (minor) issue of having to wait ~1-2 minutes for code to compile and Spark to boot up every time I track down a missing closing } or whatever, if it's anything like what I'm guessing it means
     
  9. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    buttumno has never even used spark, much less half used it in a full cluster. have 0 faith he knows what he's talking about, but it sounds like he believes he would find it ez to write a better core-node logging system than what ships w/ spark by default. do not believe

    re: souped up spark-shell -- not familiar with anything like that, but i don't see why it couldn't exist. i very rarely ever do code development in a REPL though and think ppl who do that are strange, so i generally see little difference between compiling+running code and just running interpreted code. my workflow's basically the same either way
     
  10. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    i p much only use REPLs to make sure i didnt forget what the syntax is for something when im p sure i remember it

    or when i need a calculator
     
  11. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    i assume ur not manually doing each step tho ofc?

    like, you have a bash or scala or python script that automates compilation & then maybe zipping up the file structure & then spark-submit + cmd line args?
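
    e.g. in sbt-land i'd guess even a dumb custom task wrapping the loop would do it, something like (total sketch, sbt 1.x slash syntax, class/master/paths are made up):

        // in build.sbt: `sbt sparkSubmit` = package + submit in one command
        import scala.sys.process._

        lazy val sparkSubmit = taskKey[Unit]("package the jar and spark-submit it")

        sparkSubmit := {
          val jar = (Compile / packageBin).value  // rebuilds the jar if stale
          val cmd = Seq(
            "spark-submit",
            "--class", "com.example.Main",
            "--master", "yarn",
            jar.getAbsolutePath
          )
          val exit = cmd.!  // streams spark-submit's output to this console
          if (exit != 0) sys.error(s"spark-submit failed with exit code $exit")
        }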
     
  12. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    yeah, if it's only 1-2 mins then it's not going to help much (except to automate repetitive steps).

    i think what i see, though, is that when a whole dev team takes on a project that requires regular compilation, they first build a pipeline so everyone uses the same system (check-ins triggering builds, logs centralized to a convenient location, etc.)

    also - the current trend (and something i only see accelerating) is AWS generally bending over backwards to simplify setup for developers (and in the process getting them dependent on AWS-proprietary implementations of open-sourcy stuff)

    So things like Amazon EMR: https://aws.amazon.com/emr/features/spark/

    (not sure if this specifically will help you, but have a look)
     
  13. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    this is actually 100% true and i also enjoy talking out of my ass, humor me a bit k?
     
  14. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    (it's actually not entirely true - i did implement something called databricks here at work, which i'm p sure was built by the original Spark developers, who created a service on AWS to streamline all this shit).

    it was pretty snazzy but not cheap - you had to pay for it on TOP of what you already pay AWS - but it really made it easy to distribute a bunch of jobs out to spark clusters, have them auto-scale up as needed, spit back output, and do other neat things.

    we liked it pretty well until ppl saw the bill and we've since dropped it Lol!
     
  15. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    the clusters i'm workin in are in EMR. EMR makes cluster spinup & basic 'big data' package setup super simple, but afaik it doesnt do shit to improve spark's core-node logging capabilities, unless there's some nice package we're not using

    you can attempt to jury-rig your own by writing to hdfs or s3, for example

    but if you write to hdfs you have to have each task spawned by each executor write to its own unique filename, bc hdfs is write-once read-many. and then you'll have to write something to combine all of these into an uber-file regularly if you want real-time output (rough sketch of the hdfs version below)

    and if you write to s3 you have to go out of your way just to simulate appending to a file, since appending to s3 objects is not rlly a thing
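
    the hdfs version of the hack looks roughly like this in scala (sketch - paths, rdd & numbers are made up):

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.spark.TaskContext
        import org.apache.spark.sql.SparkSession

        // sketch: every task writes its own hdfs file, since hdfs is
        // write-once; partition id + attempt number avoid collisions on retry
        object ExecutorHdfsLog {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("executor-hdfs-log").getOrCreate()
            spark.sparkContext.parallelize(1 to 1000, 8).foreachPartition { iter =>
              val tc = TaskContext.get()
              val path = new Path(
                s"hdfs:///logs/myjob/part-${tc.partitionId()}-${tc.attemptNumber()}")
              val out = FileSystem.get(new Configuration()).create(path)
              iter.foreach(x => out.writeBytes(s"processing $x\n"))
              out.close()
            }
            spark.stop()
          }
        }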
     
  16. Utumno

    Utumno Administrator Staff Member

    Post Count:
    39,317
    seems like neither s3 nor hdfs is ideal for this... wouldn't shared network storage be best? can it write out to Amazon EFS?

    that's cool about EMR though - it sounds like you've had positive experiences w/it?

    Also, don't you work in a govt. hellhole? Last time you posted about this stuff, it sounded like you'd be doomed to wrangling w/ IT numbnuts who were professional bureaucrats ensuring nothing could ever get done.
     
  17. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    there's still plenty of the IT numbnuts shit. EMR is a good service, but, for example, what should be a 5 minute wait for us to spin up or down a new cluster is generally more like a 3 hour to 72 hour wait. i'll leave the bitchin' details outa this forum tho

    i dont have any experience with EFS. what would the advantage be of it over hdfs? fwiw the hdfs issues can be overcome to get a single file of continuous real-time output; it's just a pain in the ass, and it's exactly the kind of thing you'd expect spark's built-in logging to handle, but for some reason it doesn't handle it at all
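
    fwiw the "combine into an uber-file" step i mentioned earlier is basically this, rerun periodically from the master (sketch, paths are made up):

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.io.IOUtils

        // sketch: concatenate all the per-task log files into one readable file
        object MergeLogs {
          def main(args: Array[String]): Unit = {
            val fs = FileSystem.get(new Configuration())
            val merged = fs.create(new Path("hdfs:///logs/myjob-merged.log"), true) // overwrite
            fs.listStatus(new Path("hdfs:///logs/myjob/"))
              .sortBy(_.getPath.getName)
              .foreach { status =>
                val in = fs.open(status.getPath)
                IOUtils.copyBytes(in, merged, 4096, false) // false: leave streams open
                in.close()
              }
            merged.close()
          }
        }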
     
  18. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    + altho there's lots of gov't hellhole aspects to it, i get to work on really cool math problems & w/ pretty fancy tech, even if the general context seems intent on trying its hardest to fuck it up, so i'm p happy right now
     
  19. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    oh unless our EMR-node-attached storage is EFS? i was thinking it was EBS but i'm not sure

    if that's the case then i don't think it solves the problem. ain't nobody got time to ssh into each node to see what it be talkin' about
     
  20. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    44,931
    oic from the docs. EFS is like a somewhat fancier hdfs. does EFS support simultaneous file write?