aws s3/ec2/emr tips+tricks

Discussion in 'Tech Heads' started by Agrul, Nov 13, 2017.

  1. Chemosh

    Chemosh TZT Addict

    Post Count:
    4,094
    I'm really good with S3. What questions you guys got
     
  2. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    here's an S3 QUESTIOn 4 U chemosh

    if i want to compute w/ s3 shit somewhere do i have to write it to disk first or can i load s3 data directly into RAM on whatever machine i'm working from
     
  3. Chemosh

    Chemosh TZT Addict

    Post Count:
    4,094
    Why would you have to load from disk? S3 is just object storage that you can download your content from using RESTful API calls. You could write code in whatever language you like using our SDK and save the content to memory via your variables.

    However, that being said, if your content is too large, disk may be your only option.
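
    For example, with the Python SDK (boto3) it's roughly this (bucket/key names are made up), in memory for small stuff and falling back to disk when the object is too big:

    import boto3

    s3 = boto3.client('s3')

    # small object: pull the bytes straight into a variable, no file on disk
    data = s3.get_object(Bucket='my-bucket', Key='some/key.csv')['Body'].read()

    # big object: stream it to disk instead
    s3.download_file('my-bucket', 'some/big/key.csv', '/tmp/key.csv')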
     
  4. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    i've just been using the aws cli to transfer s3 objects

    i guess i should be using boto3 if i want to directly load s3 objects into memory since im working in python
     
  5. Chemosh

    Chemosh TZT Addict

    Post Count:
    4,094
    What are you doing? If the contents of the objects are proper, you could use Athena
     
  6. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    we're implementing differential privacy at massive scale, then applying quadratic and mixed-integer linear programming at even larger scale to turn noisy differentially private query estimates back into microdata

    we've organized most of our data access through numpy multiarrays, scipy.sparse sparse arrays (of various specific types), and pandas dataframes so far - haven't been using SQL at all, just numpy/scipy/pandas usual read_csv functions & standard python slicing. eventually we move intermediate computations on this stuff into spark RDDs and/or dataframes, and call a .map operator repeatedly over a function that does a lot of the aforementioned differential privacy and mathematical programming
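
    roughly this kind of .map pattern, as a toy pyspark sketch (the block layout and the noise step here are just stand-ins, not the real DP/optimization code):

    from pyspark.sql import SparkSession
    import numpy as np

    spark = SparkSession.builder.appName("dp-sketch").getOrCreate()

    def privatize(block):
        # stand-in for the per-block differential-privacy + optimization step
        name, values = block
        noisy = np.asarray(values, dtype=float) + np.random.laplace(scale=1.0, size=len(values))
        return name, noisy.tolist()

    blocks = [("block-1", [1.0, 2.0, 3.0]), ("block-2", [4.0, 5.0])]
    results = spark.sparkContext.parallelize(blocks).map(privatize).collect()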

    athena is one of the standard AWS tools? we may have access to it, not sure. what does it mean for the contents of a bucket to be 'proper'?
     
  7. Chemosh

    Chemosh TZT Addict

    Post Count:
    4,094
    If your objects are in a particular format, say csv or json, you could query the prefix in s3 that contains your data and run SQL queries against it.
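
    For example, from boto3 it looks roughly like this (bucket / database / table names are made up, and this assumes you've already defined an external table over that prefix):

    import boto3

    athena = boto3.client('athena')

    # assumes a table was already defined over the s3 prefix (CREATE EXTERNAL TABLE ... LOCATION 's3://...')
    resp = athena.start_query_execution(
        QueryString="SELECT count(*) FROM my_table",
        QueryExecutionContext={'Database': 'my_database'},
        ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'},
    )
    # poll get_query_execution / fetch rows with get_query_results using this id
    print(resp['QueryExecutionId'])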
     
  8. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    i guess that might work. i think i would rather just be able to directly load each csv into a scipy.sparse matrix, pandas dataframe, or dense numpy multiarray though
     
  9. Sifter

    Sifter TZT Addict

    Post Count:
    2,841
    If you are using the AWS SDK for java, you can just read from or write to an InputStream. You don't want to use local files IMO, unless you have lots of data.

    If you've got an s3 bucket setup, you can retrieve your s3 Object doing something like this:

    S3Object someObject = s3Client.getObject(new GetObjectRequest("somebucket", "somekey"));
    //and if you want to do something with the stream.
    S3ObjectInputStream someStream = someObject.getObjectContent();

    You could also do something like, retrieve a bunch of s3 objects using the bucket and key prefix, then join all the streams together into a SequenceInputStream, read and perform some operations on all the data, and write your processed data to another file in an s3 bucket. No intermediary files needed.
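
    The boto3 equivalent of that multi-object pattern looks roughly like this (bucket and prefix are placeholders): list the keys under a prefix, read each body into memory, and put the combined result back to s3.

    import boto3

    s3 = boto3.client('s3')

    # collect every object under a prefix and concatenate the raw bytes in memory
    parts = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='somebucket', Prefix='some/prefix/'):
        for obj in page.get('Contents', []):
            parts.append(s3.get_object(Bucket='somebucket', Key=obj['Key'])['Body'].read())

    combined = b''.join(parts)
    # ... do something with combined ...
    s3.put_object(Bucket='somebucket', Key='processed/output.csv', Body=combined)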
     
  10. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    oic. i guess the boto3 equivalent of that first bit is

    response = s3_client.get_object(Bucket=bn,Key=obj['Key'])
     
  11. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    here we go

    thx u internet

    client = boto3.client('s3')  # assuming boto3 and pandas are imported as usual
    obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
    grid_sizes = pd.read_csv(obj['Body'])

    https://dluo.me/s3databoto3
     
  12. Sifter

    Sifter TZT Addict

    Post Count:
    2,841
  13. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    hrm, the opposite direction sounds less easy:

    To upload files, it is best to save the file to disk and upload it using a bucket resource (and delete it afterwards using os.remove if necessary).

    my_bucket.upload_file('file', Key='path/to/my/file')

    It may also be possible to upload it directly from a python object to an S3 object but I have had lots of difficulty with this.
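
    something like this seems like it should work for going straight from memory though (the frame and names are made up, haven't battle-tested it):

    import io
    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    df = pd.DataFrame({'a': [1, 2, 3]})  # stand-in for whatever's being uploaded

    # serialize to an in-memory buffer and put it straight to s3, no temp file
    buf = io.BytesIO(df.to_csv(index=False).encode('utf-8'))
    s3.put_object(Bucket='my-bucket', Key='path/to/my/file.csv', Body=buf)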
     
  14. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
  15. Chemosh

    Chemosh TZT Addict

    Post Count:
    4,094
    Using Athena in the same region will cost 0 on data transfer out
     
  16. Agrul

    Agrul TZT Neckbeard Lord

    Post Count:
    43,840
    im not sure that helps my current use cases but is really neat to know, might be useful in the future. by cost you mean monetary of course? does that apply no matter the transfer rate per month?
     
  17. Chemosh

    Chemosh TZT Addict

    Post Count:
    4,094
    All data transfer within the same region is free for most if not all services. This means it may be cheaper for you to run EC2 instances to parse your data if it's a huge amount of data. Like running an EMR job
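
    Kicking off a cluster like that from boto3 looks roughly like this (release label, instance types, and the s3 path to the job script are all placeholders):

    import boto3

    emr = boto3.client('emr')

    resp = emr.run_job_flow(
        Name='parse-my-data',
        ReleaseLabel='emr-5.29.0',
        Applications=[{'Name': 'Spark'}],
        Instances={
            'InstanceGroups': [
                {'InstanceRole': 'MASTER', 'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
                {'InstanceRole': 'CORE', 'InstanceType': 'm5.xlarge', 'InstanceCount': 2},
            ],
            'KeepJobFlowAliveWhenNoSteps': False,
        },
        Steps=[{
            'Name': 'spark step',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {'Jar': 'command-runner.jar',
                              'Args': ['spark-submit', 's3://my-bucket/jobs/my_job.py']},
        }],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
    print(resp['JobFlowId'])  # cluster terminates itself when the step finishes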