Discussion in 'Tech Heads' started by Agrul, Nov 13, 2017.
I'm really good with S3. What questions you guys got
here's an S3 QUESTION 4 U chemosh
if i want to compute w/ s3 shit somewhere do i have to write it to disk first or can i load s3 data directly into RAM on whatever machine i'm working from
Why would you have to load from disk? S3 is just object storage that you can download your content from using RESTful API calls. You could write code in whatever language you like using our SDK and keep the content in memory in your variables.
That being said, if your content is too large, disk may be your only option.
i've just been using the aws cli to transfer s3 objects
i guess i should be using boto3 if i want to directly load s3 objects into memory since i'm working in python
What are you doing? If the objects' contents are in a proper format, you could use Athena
we're implementing differential privacy at massive scale, then applying quadratic and mixed-integer linear programming at even larger scale to turn noisy differentially private query estimates back into microdata
we've organized most of our data access through numpy multiarrays, scipy.sparse sparse arrays (of various specific types), and pandas dataframes so far - haven't been using SQL at all, just the usual numpy/scipy/pandas read_csv functions & standard python slicing. eventually we move intermediate computations on this stuff into spark RDDs and/or dataframes, and call a .map operator repeatedly over a function that does a lot of the aforementioned differential privacy and mathematical programming
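the spark piece of that is basically this shape, fwiw (the real noise + solver code is swapped out here for a dummy laplace step so the sketch actually runs; all the names are made up):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dp-reconstruction-sketch').getOrCreate()
sc = spark.sparkContext

def privatize_and_solve(block):
    # stand-in for the real work: add calibrated noise to the block's query
    # answers, then run the QP/MILP that turns the noisy estimates back into
    # microdata. here it's just laplace noise + clipping/rounding so it executes.
    noisy = block + np.random.laplace(scale=1.0, size=block.shape)
    return np.clip(np.rint(noisy), 0, None)

# pretend each element is one block's histogram, built upstream with numpy/pandas
blocks = [np.random.randint(0, 100, size=8) for _ in range(1000)]

results = sc.parallelize(blocks, numSlices=64).map(privatize_and_solve).collect()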
athena is one of the standard AWS tools? we may have access to it, not sure. what does it mean for the contents of a bucket to be 'proper'?
If your objects are in a particular format, say csv or json, you could query the prefix in s3 that contains your data and run SQL queries against it.
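For example, with boto3 it's roughly this (database, table, bucket, and column names are all placeholders here, and note start_query_execution is async - the results land in the output location you give it):

import boto3

athena = boto3.client('athena')

# point an external table at the csv prefix, then query it with plain SQL
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.grid_sizes (
    cell_id string,
    population int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/path/to/my/'
"""

for query in (ddl, 'SELECT cell_id, population FROM mydb.grid_sizes LIMIT 10'):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'mydb'},
        ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'},
    )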
i guess that might work. i think i would rather just be able to directly load each csv into a scipy.sparse matrix, pandas dataframe, or dense numpy multiarray though
If you are using the AWS SDK for Java, you can just read from or write to an InputStream. You don't want to use local files IMO, unless you have lots of data.
If you've got an s3 bucket setup, you can retrieve your s3 Object doing something like this:
S3Object someObject = s3Client.getObject(new GetObjectRequest("somebucket", "somekey"));
//and if you want to do something with the stream.
S3ObjectInputStream someStream = someObject.getObjectContent();
You could also do something like: retrieve a bunch of s3 objects using the bucket and key prefix, join all the streams together into a SequenceInputStream, read and perform some operations on all the data, and write your processed data to another file in an s3 bucket. No intermediate files needed.
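If you're in python, the boto3 version of that prefix pattern would look roughly like this (bucket/prefix/key names are placeholders, and list_objects_v2 only returns the first 1000 keys, so paginate if you have more):

import boto3
import pandas as pd

s3 = boto3.client('s3')

# pull every object under a prefix, concatenate in memory, write the result back
listing = s3.list_objects_v2(Bucket='somebucket', Prefix='somekeyprefix/')
frames = []
for item in listing.get('Contents', []):
    body = s3.get_object(Bucket='somebucket', Key=item['Key'])['Body']
    frames.append(pd.read_csv(body))

combined = pd.concat(frames, ignore_index=True)
s3.put_object(Bucket='somebucket', Key='processed/combined.csv',
              Body=combined.to_csv(index=False).encode('utf-8'))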
oic. i guess the boto3 equivalent of that first bit is
response = s3_client.get_object(Bucket=bn, Key=obj['Key'])
here we go
thx u internet
obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
grid_sizes = pd.read_csv(obj['Body'])
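and if i want a dense numpy array or a scipy.sparse matrix instead of a dataframe, something like this ought to work (assuming the csv is plain numeric and the sparse one was saved with scipy.sparse.save_npz; keys are made up):

import io
import boto3
import numpy as np
import scipy.sparse as sp

client = boto3.client('s3')

# dense array straight from a csv object
obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
dense = np.loadtxt(io.BytesIO(obj['Body'].read()), delimiter=',')

# sparse matrix from an .npz written by scipy.sparse.save_npz
obj = client.get_object(Bucket='my-bucket', Key='path/to/my/matrix.npz')
sparse = sp.load_npz(io.BytesIO(obj['Body'].read()))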
hrm, the opposite direction sounds less easy:
To upload files, it is best to save the file to disk and upload it using a bucket resource (and deleting it afterwards using os.remove if necessary).
It also may be possible to upload it directly from a python object to a S3 object but I have had lots of difficulty with this.
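though poking at it, put_object and upload_fileobj do seem to take stuff straight from memory, so maybe it's not that bad after all. something like this (bucket/key names made up):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# dataframe -> csv text -> bytes -> s3, no temp file on disk
s3.put_object(Bucket='my-bucket', Key='path/to/output.csv',
              Body=df.to_csv(index=False).encode('utf-8'))

# or hand any file-like object to upload_fileobj
buf = io.BytesIO(df.to_csv(index=False).encode('utf-8'))
s3.upload_fileobj(buf, 'my-bucket', 'path/to/output.csv')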
Using Athena in the same region will cost you 0 for data transfer out
i'm not sure that helps my current use cases but it's really neat to know, might be useful in the future. by cost you mean monetary of course? does that apply no matter how much you transfer per month?
All data transfer within the same region is free for most if not all services. This means it may be cheaper for you to run ec2 instances to parse your data if it's a huge amount of data. Like running an EMR job
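e.g. once a cluster is up, kicking a pyspark script at it is one API call; roughly (cluster id and script location are placeholders):

import boto3

emr = boto3.client('emr')

# submit a spark step to an already-running cluster
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',
    Steps=[{
        'Name': 'parse-the-data',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     's3://my-bucket/jobs/parse_data.py'],
        },
    }],
)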
i just got to use a small 4-node EMR cluster in AWS w/ 4x m4.16xlarge nodes in it, parallelizing jobs across the nodes using spark
holy shit those suckers fly. did a job that was taking us 48 hours in an hr and 12 mins, and i think i can get that shit down to like 15 minutes with a few more o' them nodes
Usability pretty good? Can it intelligently scale up and down to accommodate different sized jobs?
autoscaling is amazing when it's dialed in.
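fwiw, on EMR you can hang scaling rules off a task instance group; very roughly, via boto3 (ids, capacities, thresholds, and the metric choice are all just illustrative):

import boto3

emr = boto3.client('emr')

# attach a scale-out rule to a task instance group: add 2 nodes whenever
# available YARN memory drops below 15% over a 5-minute period
emr.put_auto_scaling_policy(
    ClusterId='j-XXXXXXXXXXXXX',
    InstanceGroupId='ig-XXXXXXXXXXXXX',
    AutoScalingPolicy={
        'Constraints': {'MinCapacity': 2, 'MaxCapacity': 16},
        'Rules': [{
            'Name': 'scale-out-on-low-yarn-memory',
            'Action': {
                'SimpleScalingPolicyConfiguration': {
                    'AdjustmentType': 'CHANGE_IN_CAPACITY',
                    'ScalingAdjustment': 2,
                    'CoolDown': 300,
                },
            },
            'Trigger': {
                'CloudWatchAlarmDefinition': {
                    'ComparisonOperator': 'LESS_THAN',
                    'EvaluationPeriods': 1,
                    'MetricName': 'YARNMemoryAvailablePercentage',
                    'Period': 300,
                    'Threshold': 15.0,
                    'Statistic': 'AVERAGE',
                },
            },
        }],
    },
)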