Using the NDK Performantly


DAN GALPIN: Android Internals–
Writing Performant Native Code.

If you haven’t heard
enough about me already,

I have spent 5 plus years
talking to developers just

like you around
the world, and it

is awesome to be here in Hurst.

And I spent 15 years
as a software developer

before doing that.

So I have a little
bit of street cred.

I actually started
developing Android right

around Android 1.1.

Seriously actually, I started
developing commercially then.

I was working with
it ever since 1.0.

I wear a lot of hats.

This is like one of the
smallest hats that I wear.

And I have no shame.

And I think I’m kind of
funny sometimes, especially

with lack of sleep like now.

All right.

Performant is an
entirely invented word.

It is not a real word.

How many of you knew that
performant wasn’t a real word?

OK.

Good, I got a bunch of English
majors here, that’s awesome.

So actually according to
Urban Dictionary and most

of the research that I did–
because I do extensive research

on a talk like this–
performant was actually

invented by software developers.

OK?

And there’s some
theories behind this.

But the Urban
Dictionary defines it

as having adequate performance.

But really this would
not be nearly as cool

of a talk– native code
having adequate performance.

So this is why the
word was created, OK?

It just doesn’t
have the ring to it.

We don’t like adequate
in our industry.

We like awesome.

So yeah, so rather than title it
that, I use the invented word.

And we’re going to be
talking about really, really

tiny benchmarks.

Because in order for your
app to actually perform well,

you have to do everything
and make sure everything

happens within
16.67 milliseconds.

That is how you get
60 frames per second.

But most of the
benchmarks that I’m

going to be talking
about in this lecture

are in nanoseconds.

So we take this
into nanoseconds.

That’s a lot of nanoseconds.

So this stuff is
really, really fast.

So don’t worry when I tell you
it takes five times as long

to do something on one
version as the other.

Because it’s really fast still.

But it’s good for you to know.

All right.

So again, do not panic.

Remember, the clock
speed of the Nexus 5,

which is what I used to
do most of my testing–

because I wanted something that
could run KitKat as well as run

L and M– is about 2.3 gigahertz,
tops, with four cores.

So that’s like billions and
billions of instructions.

I know, I’m like sounding
like Carl Sagan here.

So we’re talking in very
small micro benchmark terms.

Now Android Internals is this
kind of thing I’m working up.

I want to see what
you guys think of it.

So afterwards we’ll have a quiz.

And it really is about the
unique voyages of discovery

we can take in an open
source platform like Android.

So the idea is not
just to understand

how to code Android, but
understand how it works,

so that when you
run into problems

you have a better
idea of actually what

you’re dealing with.

And we’re going to do it kind
of this way– we’re gonna

actually test our assumptions.

We’re going to benchmark.

We’re going to look
at source code.

And we’re going to
debug, even in native.

So this voyage one
takes us into the land

of optimizations in ART.

If you saw my talk yesterday,
I got a bit into that.

But this time we’re going
to be a bit more pragmatic,

because we’re really going to
talk about how the world of J

and I has changed as we’ve
moved from a world of Dalvik

to a world of art.

But I am getting
ahead of myself.

So let’s talk about native code.

Most of you guys have
actually done native code.

But I’m talking about code
written using the NDK.

And we’re talking about
primarily C and C++ code that

interfaces with the Android
runtime using JNI.

Here is a really, really
abbreviated architecture

diagram of what this looks like.

Applications written with the
SDK take the form of these,

you know, dex classes, that
execute on the Android Runtime.

They interact with
system libraries

via the SDK framework classes.

And SDK application
code is written

in a language like Java,
that the runtime can support.

So the Linux kernel was
written primarily in C and C++,

and so are the system libraries.

The framework and
the Java runtime

call into these libraries
using the Java Native Interface

or JNI.

Now the NDK
essentially allows you

to write a dynamically
linked native library.

But it can’t run directly
against the system libraries,

because these ABIs or
APIs aren’t stable.

So the purpose of the
NDK is to give you

a stable application
binary interface to run

your own compiled code
against, that provides access

to only the most critical OS
features, so the platform can

still continue to grow
and expand, and change

how they implement
things, and be awesome.

But your application
code is talking to this

through this ABI.

It’s all important stuff.

And that’s what it looks like.

Boom, your application code now
talks to your library, which

is going straight to native.

Let's talk a little bit
about the history of the NDK.

The original first versions of
Android did not even have it.

But we got it in Cupcake.

And we’ve been slowly
expanding it ever since.

So in the first versions
of it, you got a C runtime,

really minimal C++ support,
Zlib compression, logging,

networking, dynamic linking,
some math– not a lot,

but enough.

We then added graphics.

So the first version
couldn’t even talk to OpenGL.

But we added graphics there.

And you know what took the
longest part about this slide

was actually trying to find
all these images again.

It’s like, when have I used
a slide with these images?

And then Gingerbread
really expanded things.

Gingerbread got
much more serious

about gaming and multimedia.

So we added our native
application API.

That was the first
version of Android

where you could actually
build a native application

without needing to use
any Java whatsoever.

And also sound,
which is really cool,

like OpenSL ES was
really nice to have.

We continued to evolve it.

In Ice Cream Sandwich, we added
the OpenMAX AL media layer.

Not many people know
this, but you actually

can access RenderScript directly
from the NDK, as of KitKat.

And it’s pretty cool stuff.

It was a long-standing request.

And we also did a bunch
of graphic stuff here.

That’s why these
aren’t in order.

But in Jelly Bean
MR2, we added OpenGL ES 3.0.

And in Lollipop, we added ES 3.1,
as well as 64-bit support.

So that’s pretty cool.

So let’s talk about
some assumptions.

So our assumptions are basically
to follow the suggestions

in the Perf-JNI article.

If you have not
read this article,

it is the gospel
for looking at how

to deal with JNI on Android.

But do they still
make sense today?

We haven’t updated the
article since we shipped ART.

So here are the basic
things you have to do, OK?

Absolutely critical when you’re
trying to make JNI performant.

One is you’re going to
cache field and method IDs.

You’re going to do
it intelligently.

Two, you’re going to
GetStrings in a reasonable way.

And you’re going to
copy things in native.

These are the only
three real tips we gave.

But I’ll go more into details.

So how did we benchmark this?

We actually used
something called Caliper.

Now how many of you have
actually have ever heard

of Caliper in this room?

No one, that’s good.

Oh, one person, sorry.

I had never heard of
it before doing this.

But I was interested
in doing benchmarking.

It turns out if you
actually look at AOSP,

we have Caliper
tests checked in.

This is actually how we
benchmark our VM ourselves.

And we use this
thing called Vogar.

And if you actually look at
what’s checked into Vogar,

it’s a really ancient
version of Caliper.

I’m hoping some day we
actually update that.

It would have made my
life a little easier.

But Caliper is a
really cool framework

for running micro benchmarks.

All right.

So let’s get to the first thing.

This– if you haven’t
used the NDK before–

is how you access a
class from native code.

So once you have the class,
and I’m passing this class

in from Java– you can see
jclass type– that is actually

the class information.

And that’s what we
call it GetFieldID,

the name of the field, the type
of the field– so in this case,

integer.

And then finally I
can call GetIntField

to actually pull the value.

So that’s how we actually
access an integer that’s

inside of a Java class
from native code.
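For reference, here is a minimal sketch of that pattern in C; the class and field names are illustrative, not the slide's actual code.

    #include <jni.h>

    /* Read an int field of a Java object from native code. */
    jint read_counter(JNIEnv* env, jclass clazz, jobject obj) {
        /* Look up the field ID by name and type signature ("I" means int). */
        jfieldID fid = (*env)->GetFieldID(env, clazz, "counter", "I");
        if (fid == NULL) {
            return 0; /* field not found; an exception is pending */
        }
        /* Pull the value out of the object. */
        return (*env)->GetIntField(env, obj, fid);
    }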

All right.

So the first suggestion, which
is a really, really good one,

is to cache field
and method IDs.

And here’s why.

Those field and method
IDs are just numbers.

They don’t actually change
once the class has been loaded.

And if you want to be really,
really good about when

you actually grab them, inside
of the static initializer

of a class in Java.

You can actually call some
JNI code– in this case,

I’m calling ir nativeInIt.

And inside of
nativeInit– it really

shouldn't have been
named nativeInit, now

that I’m looking at this
slide, but that’s OK.

You could see I’m
getting that field ID.

And that field ID
will be good as long

as this class is loaded.

So that’s pretty awesome.

I don’t have to think about it.

I’m just storing it in
a little variable there

that’s associated
with my native class.
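A minimal sketch of that caching pattern, assuming a Java class whose static initializer calls nativeInit() once; the class and field names are illustrative.

    #include <jni.h>

    /* Valid for as long as the class stays loaded. */
    static jfieldID g_counter_field = NULL;

    /* Called once from the Java static initializer: static { nativeInit(); } */
    JNIEXPORT void JNICALL
    Java_com_example_Counter_nativeInit(JNIEnv* env, jclass clazz) {
        g_counter_field = (*env)->GetFieldID(env, clazz, "counter", "I");
    }

    JNIEXPORT jint JNICALL
    Java_com_example_Counter_nativeGetCounter(JNIEnv* env, jobject obj) {
        /* No GetFieldID round trip here; we reuse the cached ID. */
        return (*env)->GetIntField(env, obj, g_counter_field);
    }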

All right.

Let’s talk about performance.

So here is how it
benchmarks on a Nexus 5

running KitKat and Marshmallow.

And you’ll notice something.

ART takes longer.

That’s going to be in
general, a common theme.

ART is more complicated
than Dalvik in general.

And so it’s even
more important today

than it was initially
to cache these things.

Because devices
are running faster.

And ART is already
faster doing most things.

So you’ll even notice this
maybe a little bit more

than even these
benchmarks should say.

So let’s look at
the code and try

to figure out why this is
faster, or slower I should say.

And the really key thing
is this thing here,

this ScopedJNIThreadState
and ScopedObjectAccess.

This is why JNI
actually does not

run at lightning warp speed.

And that’s because every
single thread in Android

can be in one of two states.

It could be in running state.

Actually it can be in more
than that– but two states

that we care about.

It can be in running state.

That’s actually when we’re
in the Java virtual machine,

and we’re actually
executing stuff.

And it has access to
all that great– sorry,

I should say the runtime.

We do not have a Java
virtual machine in Android.

You can strike that
from your memory.

The Android runtime–
that’s when it’s in there.

Or it can be in the non-running
state or native state.

So when we’re actually
accessing a variable

like this, which is an
int field, all of these

are variables that are
inside of the runtime.

We actually have to switch
the state of our thread

in order to do that.

And that means we’re doing a
whole bunch of synchronization.

And that synchronization
is expensive.

It’s expensive on the order
of about 300 nanoseconds.

Now to give you some context,
because 300 nanoseconds

is a really small number.

An average
function call in ART

is about five nanoseconds.

In Dalvik, it's more like 10.

 

So once again, we’re
talking about something

that is a really tiny number.

But it’s still like 60 times
longer than a standard function

  About

call.

So it’s still something
to think about.

So let’s look at
our first work hard.

And yes, based upon our
benchmark caching field

and method IDs is great
for Dolvic and ART.

It’s even better in ART.

All right, so let’s
look at the suggestion

two of this, which was
use GetStringChars.

Now this was kind
of interesting.

So basically as you
probably all know,

the standard for Java–
which ART also follows–

is to treat all strings
as double-byte character

strings, UCS-2.

And this is important
because we’re

in a world that’s
highly international,

single byte strings are kind
of passe, et cetera, et cetera.

Not to mention as it
turns out, the VM actually

doesn’t particularly
have great instructions

for dealing with bytes.

So it’s actually kind of nice to
have these things in these two

byte characters, thank you.

So the suggestion here is that
rather than calling GetStringUTFChars,

like we have
there at the bottom,

we actually call
GetStringChars, which actually

takes our string and gets
us the closest to being

a native representation of
it that you could imagine.

And we would expect this
to always outperform

the UTF equivalent,
where it actually

has to do a copy of memory.
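To make the comparison concrete, here is a minimal sketch of the two calls side by side; the function names are illustrative.

    #include <jni.h>
    #include <string.h>

    /* GetStringUTFChars always copies (and converts to modified UTF-8). */
    JNIEXPORT jint JNICALL
    Java_com_example_Strings_nativeUtfLength(JNIEnv* env, jclass clazz, jstring s) {
        const char* utf = (*env)->GetStringUTFChars(env, s, NULL);
        if (utf == NULL) return -1;
        jint len = (jint) strlen(utf);
        (*env)->ReleaseStringUTFChars(env, s, utf);
        return len;
    }

    /* GetStringChars may hand back the runtime's own UTF-16 data, but the
       runtime first has to decide whether it can avoid a copy. */
    JNIEXPORT jchar JNICALL
    Java_com_example_Strings_nativeFirstChar(JNIEnv* env, jclass clazz, jstring s) {
        const jchar* chars = (*env)->GetStringChars(env, s, NULL);
        if (chars == NULL) return 0;
        jchar first = ((*env)->GetStringLength(env, s) > 0) ? chars[0] : 0;
        (*env)->ReleaseStringChars(env, s, chars);
        return first;
    }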

All right, so
let’s look at this.

I took a 15 character string.

I ran some benchmarks on this.

And I was actually really
astonished to see two things.

One, as we expected,
ART is slower.

But it's actually much slower than I expected.

And two, this was a real shock.

if you look at those
two blue lines,

on a 15 character string,
GetStringUTFChars actually

performs faster
than GetStringChars.

How can that possibly happen?

Because we already said
GetStringChars doesn’t

have to copy the
string, it doesn’t

have to translate the
string to UTF-8.

So something is happening
to actually make it,

on this very short string,

faster to do all
that copying and translation.

So let’s try a longer string,
just to see if I’m crazy here.

So this is a string that’s
100 characters long.

And we see more of
what we would expect.

So GetStringUTFChars is now
slower than GetStringChars.

So the question is,
why was GetStringUTFChars

ever faster under ART?

So let’s look at
some source code.

So you can see here that
GetStringUTFChars always

has to copy.

So what it does,
it goes through,

and it actually just goes
to that copy operation.

Well, GetStringChars actually
has to go and check, huh?

Well, can I actually
avoid this copy?

So it actually goes,
looks at the key,

and checks to see if
that’s a movable object.

And it turns out that’s actually
somewhat of an expensive call.

So you can see here, there
is this FindContinuousSpaceFromObject

call.

That just sounds dangerous– and
what is this actually doing

in real life?

Well, it actually
calls this, which

has a for loop in
it, which looks

through the continuous spaces.

So you can see already
here, even though the VM

is doing all of this work to
try to avoid this little tiny 15

character, 30-byte mem
copy, it's actually

failing to run this
particular case optimally.

And so somewhere in between
15 and 100 characters

happens to be the
break-even point.

What does this
really, really mean?

It means unless you're passing
very, very large strings

around, do whatever's the most
convenient for you, honestly.

It’s not a big deal.

You have to do a whole bunch
of crazy stuff in native code

to actually make your code
handle two-byte characters,

and it may
or may not be worth it.

You’ll probably want to look
at actually profiling it.

So here’s our scorecard.

So in general, yes,
GetStringChars

is going to be faster for long
strings, but not always.

I’ll give it 3/4
of a star for ART.

Here’s another suggestion
that came out of there,

which is use GetStringRegion.

This is kind of interesting.

So here is what that looks like.

So normally if you want to copy
a string into a native buffer–

in this case my native buffer, I
mean, is just literally a buffer

of characters.

You’re going to
call GetStringChars,

and then you’ll mem copy it,
and then et cetera, et cetera.

And you’ll see I’m also doing
some memory deallocation here,

just to be fair on both sides.

You can see it’s actually
several lines of code

and several more accesses.

Because every time you
actually do something

like GetStringChars
or GetStringRegion,

you’re actually talking
to the VM as well.

Well, this is
actually kind of cool.

Sorry, you’re talking to native
code as well and to the VM.

So this is kind of cool here.

And you could actually
use GetStringRegion

and GetStringRegion
does the copy for you.

 

That’s kind of nice.

Also one thing I’m
doing here, which

is a nice little
optimization is I’m actually

passing the length of
the string into this.

And that’s kind of cool.

Because as it turns out, passing
extra parameters into JNI

is almost free.

It takes literally on the order
of a couple of nanoseconds

for every single additional
parameter you want to use.

So that’s awesome.

And if I were going to
actually query the string,

and say give me the
size of the string,

that would be another
300 nanosecond round trip

through the machine.

So adding additional
parameters is a great way

of optimizing your JNI.
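Putting those two ideas together, here is a minimal sketch: GetStringRegion copies straight into a native buffer, and the length comes in as an extra parameter so native code never has to call back and ask for it. The names and the fixed buffer size are illustrative.

    #include <jni.h>

    #define MAX_CHARS 256

    JNIEXPORT void JNICALL
    Java_com_example_Strings_nativeCopyString(JNIEnv* env, jclass clazz,
                                              jstring s, jint length) {
        jchar buffer[MAX_CHARS];
        if (length > MAX_CHARS) {
            length = MAX_CHARS; /* clamp for the sketch */
        }
        /* One call: the runtime copies the characters into our buffer.
           No GetStringChars / memcpy / ReleaseStringChars dance. */
        (*env)->GetStringRegion(env, s, 0, length, buffer);
        /* ... work with buffer ... */
    }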

So I thought I’d point this out.

This is sort of a little
minor optimization here.

But these things are what
you’re thinking about.

Again you’re trying to avoid
round trips on both sides.

You’re trying to avoid
extra calls into the VM

or into the runtime, I
should say, from native code.

And you’re also trying
to do the other way.

You’re trying to avoid
extra calls into native code

from the run time.

All right.

So what does this really
look like after all of this?

Well, it’s kind of
as you’d expect.

GetStringRegion is way faster.

You’re avoiding doing
an extra allocation.

 

And so that’s going to,
in general, be good.

 

And you can also see
that ART is actually

a lot slower than in Dalvik.

And a lot is all relative.

Again these are all
little tiny things.

You might think after this
talk that ART isn’t very fast.

And I don’t want to give that
impression to you at all.

In fact, ART is scary fast at
doing almost anything but this.

So in almost any other way it
is going to blow away Dalvik.

So do not take this as any kind
of indictment against ART.

In fact– when I was
asking one of the

internal guys about why
this is the case– ART

actually was written
at a time when we had multiple

processor cores in a system.

So when they started
designing it and writing it,

they were thinking the entire
time about deadlock problems.

And I would say that ART
takes an incredibly conservative

approach to make
sure that you’re not

going to have deadlock.

And if you look in the
list of bugs on AOSP,

you will find deadlock bugs
in Dalvik– most of which

have been fixed.

But I think part of
what you’re seeing

is that the ART team wanted this
to be incredibly robust.

And that’s why you’re
seeing a little bit of this.

So maybe in the
future, we can actually

figure out how to make
these even closer together.

But that’s what it’s like today.

All right.

So another big win on ART,
and a big win on Dalvik

to use GetStringRegion.

All right.

Let’s talk about a problem
that a lot of people have which

is sharing raw data
with native code.

And this is also part of this.

Now if you haven’t figured
this out by the talk,

JNI calls are
relatively expensive.

And you know again,
this is relative.

We’re talking about five
nanoseconds for a regular call,

to about 300 nanoseconds–
on a Nexus 5,

to be fair– of a JNI call.

So what are we really
talking about, the overhead

of a one-way call?

 

Or I’m sorry, a two-way call.

This is a two-way call.

So on Dalvik our overhead was
a little less than 130

nanoseconds.

On ART, it’s almost twice that.

And good thing that
devices are getting faster.

You can see I’ve
also benchmarked

the Nexus 6P and a Nexus
9, both in 64-bit mode.

And you can see they’re
actually pretty fast.

But even the Nexus
9 actually doesn’t

outscore Dalvik
running on a Nexus 5,

for doing these kinds of things.

So JNI is expensive.

And the real goal of
all this– if there's

any takeaway from this entire
lecture– is: avoid chattiness.

Every bit of chattiness
you add adds extra time.

And a lot of that is stuff
you don’t even think of.

So for example, let’s
say you’re like, you

know what, I’m going to avoid
writing a whole bunch of code.

If you’ve ever
played with Unity–

how many people here
have played with Unity?

So one of the ways in which
you talk to Android from Unity

is to use something
called AndroidJavaProxy.

And AndroidJavaProxy
is really cool.

Because basically it takes
in proxy interfaces,

and it creates a dynamic
class, essentially

on the fly, that’s used to fill
out some interface that you can

then use to talk to a whole
bunch of internal systems–

what you may not realize is
that by doing that,

you are getting the
chattiest possible interface

into Android.

And so if you’re
trying to do something

over and over and
over again, that’s

going to actually
impact your performance.

So for example, let’s say
you’re trying to read bytes out

of some class in
Java one at a time,

you realize this is going
to very, very quickly

exhaust all of your CPU
time on the main thread.

So you really do have
to be careful with what

you do on this.

And think about
the interfaces you

have between your
native code and the VM.

All right.

Let’s go back over
to this thing.

So how do we actually
deal with sending

big chunks of data between
native code and the runtime?

And there’s this cool thing
called a direct byte buffer.

I don’t know how many people
have played with direct byte

buffers here before.

You pretty much only ever want
to deal with a direct byte

buffer if you’re
working in native code.

There’s really no
other reason for them

to exist, as far as I can tell.

Although the VM might
choose to not actually

allocate this memory out
of its normal page pool.

So on some VMs, it
actually might get you

memory you don’t
normally have access to.

But in our runtimes,
it does not.

 

And you get this
nice allocateDirect call.

And then when you’re
inside native code,

you can just get an address
for that chunk of memory,

and start writing to it–
which is really cool.

And there’s no like,
I want to free this.

There’s no like,
release address.

It’s one call.

So that’s nice and fast right?

In theory.

So this is what this
looks like when you’re

using direct byte buffer.
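Roughly, the native side of that looks like this; it assumes the Java side called ByteBuffer.allocateDirect() and passed the buffer down, and the names are illustrative.

    #include <jni.h>
    #include <stdint.h>
    #include <string.h>

    JNIEXPORT void JNICALL
    Java_com_example_Buffers_nativeFill(JNIEnv* env, jclass clazz,
                                        jobject direct_buffer, jint size) {
        /* One JNI call to get the raw address; there is no release call. */
        uint8_t* data = (uint8_t*) (*env)->GetDirectBufferAddress(env, direct_buffer);
        if (data == NULL) {
            return; /* not a direct buffer */
        }
        /* Write straight into memory the runtime can also see. */
        memset(data, 0x2A, (size_t) size);
    }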

So let’s look at
the performance.

Again we’re looking
at benchmarks

here all the time
on these things–

they keep me up sleepless
nights doing these–

and what that
actually looks like.

And as you can see
once again, this

is a very, very slow access.

Because DirectBuffers
actually involve even more

synchronization.

And so then actually we’re
talking about something

that aren’t running
a Nexus 5, is almost

in the 600 nanoseconds range.

So once you actually grab this,

the answer is you really
want to use it for something.

If you’re using it to pass
an integer, not a good idea.

You want to actually use
it to pass lots of data.

All right.

But there’s another
side to this.

Once you’re inside of code
that’s running on the runtime,

what’s the performance of byte
buffer– direct byte buffer

versus regular byte buffer?

So let’s take a look at that.

What?

OK, now let me just back
up a little bit here.

Because you’re seeing something
really, really strange here.

You’re seeing that
first of all Dalvik

Nexus 5 direct byte
buffer is the slowest

call by a substantial
amount, compared

to all of these other calls.

OK?

It takes 300 nanoseconds.

The other thing you’re noticing
is a direct byte buffer

is way slower than the
standard one, which

is backed by a standard
byte array in Java.

So once again, two things
that are kind of weird.

And wherever we see really
weird stuff like this,

other than scratching
my head, it

is time to go and
explore some code,

and try to figure out
why that’s the case.

All right.

So here is what actually
happens when you do

allocate and allocateDirect.

You actually get
a different class.

We’re using polymorphism
here, it’s awesome.

You either get ByteArrayBuffer
or DirectByteBuffer,

one of the two, OK?

And as you can see,
ByteArrayBuffer

is backed by an array.

And DirectByteBuffer is actually
backed by this class called

MemoryBlock.

All right.

And here’s how we
start reading integer.

We use the call and get Int.

And in ByteArrayBuffer,
it’s pretty standard.

It actually goes into another
class, calling Memory.peekInt.

And inside of MemoryBlock,
we have a little bit–

an extra bit of indirection.

We actually have to
call into the block

class, which calls
into Memory.peekInt,

but a different call.

That first peekInt is
taking a backing array,

and that other one is taking
an address plus offset.

And yes, you are actually
looking essentially

at pointer arithmetic
inside of the runtime right

here– not something
you see very often.

So what does this mean?

Well, when you’re
actually looking

at how this is implemented–
if you try to find the source

code, this is what you’ll see.

You’ll see probably the
most classic implementation

of how to pull
data from an array

and get it into an integer that
you see inside of the ByteArrayBuffer

class.

And inside of MemoryBlock,
it actually calls into JNI.

All right, all right.

So now, remember– let’s
go back to this graph here.

So we saw that ART is way,
way faster than Dalvik

at doing this.

And yet we just demonstrated
looking at the source code,

that it’s actually
calling into JNI.

So that’s really weird.

Why is it so much faster?

All right.

So once again, here’s
what actually happens

inside of that native code.

But that really doesn’t matter.

Because we’ve shown almost all
of the cost of this operation

is going to be in synchronizing
between the different thread

states, between running and native.

So it turns out that ART is
actually doing a little trick.

And that is, when it
actually declares the method,

it’s declaring it with
this little exclamation

point on it– which is a
flag to the VM that says,

well, this is a
dangerous function.

Actually it’s a flag
to the person coding it

that it’s a dangerous function.

It’s a flag to the
VM saying this is

a very non-dangerous function.

It’s not going to try
to do anything in Java.

It’s not going to
last very long.

So let’s not actually
go through and change

the state of the thread at all.

Let’s just run this code
as quickly as possible.
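As an illustration of what that flag looked like, here is roughly how such a method was registered inside the platform at the time; this is internal plumbing rather than app-facing API, and the names here are approximate.

    #include <jni.h>
    #include <stdint.h>

    static jint Memory_peekIntNative(JNIEnv* env, jclass clazz, jlong address) {
        return *(jint*) (intptr_t) address; /* simplified body */
    }

    static const JNINativeMethod gMethods[] = {
        /* The "!" prefix marks the method as fast JNI: it is short and never
           calls back into the runtime, so no full thread-state transition. */
        { "peekIntNative", "!(J)I", (void*) Memory_peekIntNative },
    };

    /* Registered with something like:
       (*env)->RegisterNatives(env, memoryClass, gMethods, 1); */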

So once again this is
how long it actually

takes to read that
integer from a ByteBuffer.

Now it’s still about half
the speed– even on ART–

of our standard ByteBuffer call–
even with all of that, even

with this fast switching.

And that’s because if you
actually go and throw this

into a debugger, you realize
that that whole method

where it's using
lots and lots of shifts

in order to do it, is actually
not getting run at all.

It’s actually an intrinsic.

And so that’s how this
is speeding it up.

And also even if it
was running that code,

ART is just really fast.

Like you know, it’s
really, really fast.

And it turns out, there is
some overhead in doing even

this fast JNI call.

Because it still has
to set up the call

stack and all the
other things that it

would have to do
to actually switch

from running in the
runtime to running native.

And that takes about
50 to 60 nanoseconds,

according to my benchmarks
just to do that, in fast JNI.

All right.

So is there anything
we can do to avoid

having to make a JNI
call for every single int

we want to read?

It turns out there is.

We can actually get it all
at once using something

like this– so we
can get buffer,

we can allocate an array, we can
wrap that in a new ByteBuffer,

and then we can get that.

OK, and then we have to
fiddle with the position,

because otherwise
there’s an overflow,

there’s no fast call to
actually just give me

the contents of that buffer
that’ll actually work.

So believe it or not, this
is what you have to do–

and what does that
look like, if you

do all of those allocations?

And this is even including
deallocations and stuff

like that.

And the answer is of
course, it’s pretty slow.

It’s really, really
slow on Dalvik.

You can see like–
this is where you

start getting into multiple
levels of optimization here.

But if you’re going to
be moving a lot, lot,

lot of data around– big,
big, big chunks of data–

and you're going to be
accessing that from inside

of the runtime, then yes.

This is a strategy that
might make sense for you.
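For the native side of that strategy, here is a rough sketch of the same move-it-all-at-once idea, using SetByteArrayRegion to copy a direct buffer's contents into a Java byte[] in a single JNI call; this is an alternative to the Java-side wrap-and-get shown on the slide, and the names are illustrative.

    #include <jni.h>

    JNIEXPORT void JNICALL
    Java_com_example_Buffers_nativeDrainInto(JNIEnv* env, jclass clazz,
                                             jobject direct_buffer,
                                             jbyteArray dest, jint size) {
        const jbyte* src =
            (const jbyte*) (*env)->GetDirectBufferAddress(env, direct_buffer);
        if (src == NULL) {
            return;
        }
        /* One JNI call moves the whole payload into the Java heap array;
           the runtime can then read it with no further native round trips. */
        (*env)->SetByteArrayRegion(env, dest, 0, size, src);
    }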

Like for example,
one of the things

you might want to be doing
is using like FlatBuffers,

to move big C structures
to the runtime.

How many of you are
familiar with FlatBuffers,

first of all, when I say that?

All right, so FlatBuffers
are really cool.

They’re an open source
project that my team created.

And it basically allows you
to do really, really efficient

translation from
stuff that’s coming in

either from disk
or from network,

into structures
that you can use.

It is about as efficient as
you can get, given the amount

of flexibility that it has.

It’s actually very similar
to protobufs, if any of you

have used that– except that
it’s designed from the start

to run on mobile, and to
run really, really fast.

So if you’re doing
something like that,

you might actually get some
performance out of this.

All right, so now since
we have a little time,

I wanted to show you
just a little bit of how

you use JNI in Android Studio.

All right.

So once again, how
many of you here

have actually tried doing
this in Android Studio?

OK, so that’s not an enormous,
enormous number of people.

But that’s OK.

Because this is
actually really cool.

This made my life so
much easier than trying

to actually deal with
the various things that

go on in JNI.

Here’s a whole bunch
of native declarations

that are inside of my
JNI benchmark class.

And you can see
the kinds of things

you’d expect, like
these ByteArrayCalls

and these string calls.

So let’s say I wanted to
add another native method.

OK.

So I’m going to type Native.

And let’s have it
return an Int–

I don’t have to call it JNI,
but just for consistency, I’m

going to call it JNI pass
a bunch of stuff to native.

So we’re going to pass let’s
say a string, a ByteBuffer,

an integer, a long
et cetera, et cetera.

And you see a couple of
things have happened here.

Probably the most useful
thing is that we actually now

are compiling the native
code and the Java code,

all in one Gradle build–
which is really awesome.

Because we can do
stuff like say, hey,

this function
actually isn’t found.

We can’t resolve this.

So you see, it shows up red.

It knows that it’s
not in my native code.

So here’s the really,
really cool trick.

For anyone who’s done
a lot of JNI code,

the ability to do
this is awesome.

I can do create function
here, click on this,

and now I have a native function
that’s been created inside

of that C file.

And this is really cool.

First of all, it’s also done
some helper things for me.

It thinks, interestingly enough,
that I might want to get this string

as UTF rather than
double-byte characters.

But hey, you know, it’s
probably what your code wants.

And then it’s also gotten
the ByteArrayElements for me.

And it’s released
them at the end.

Because it’s
assuming that you’re

actually going to want
to use these things.

And so it actually puts
in that code for you.
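For reference, the stub it generates looks roughly like this; the exact class name, method name, and parameter list from the demo are not shown here, so the ones below are assumptions.

    #include <jni.h>

    JNIEXPORT jint JNICALL
    Java_com_example_JniBenchmark_jniPassABunchOfStuffToNative(
            JNIEnv* env, jobject thiz,
            jstring text, jbyteArray data, jint count, jlong timestamp) {
        /* The generated helpers: grab the string as UTF-8... */
        const char* utf = (*env)->GetStringUTFChars(env, text, NULL);
        /* ...and pin or copy the byte array elements... */
        jbyte* bytes = (*env)->GetByteArrayElements(env, data, NULL);

        jint result = 0;
        /* TODO: real work goes here. */

        /* ...then release both before returning. */
        (*env)->ReleaseByteArrayElements(env, data, bytes, JNI_ABORT);
        (*env)->ReleaseStringUTFChars(env, text, utf);
        return result;
    }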

So this is really, really cool.

And the best part is when I
go back to my benchmark class

here, you’ll see
it’s no longer red.

It’s actually done the
compile and we are golden.

We are actually ready to now
run that inside of this class.

So if you haven’t had
a chance to play around

with Android Studio and
its support for the NDK,

I highly recommend it.

It’s still a little bit of work
to get your Gradle project up

and running.

Because you’ve still got to
use the experimental version

of Gradle.

But you don’t actually have to
use the experimental version

of Android Studio.

It is now in mainline.

So go check it
out, play with it,

and make sure that
you’re not making

your applications that actually
use the NDK very chatty.

If there’s anything you can take
back from this entire lecture, it's that.

All right, so I’m
going to switch back

into non-mirroring
mode, so I can

finish all of the
exciting slides that

are left in my presentation,
which is really just this.

If you need to get in touch
with me, this is how you do it.

And I hope you have
enjoyed the talk.

I hope you’ve
learned a little bit.

I have time for some questions,
if anyone wants to stump me,

this is a really
good chance to do it.

Because you most likely will.

But other than that,
again, it’s not

that it’s scary to use the NDK.

It is really cool to use the
new Android Studio stuff.

And you just have
to be cognizant

of the kinds of
performance problems

you could create with it.

And I hope you’ve got a
little bit of it from this.

And also once again,
what’s wonderful

about a platform like
Android, an open source

platform like Android, you
can go and explore the code.

You can actually understand how
we solve these very difficult

problems in many cases.

And you can learn something
and take something back

with your engineering
career, and use it again.

And that to me is half the fun
of working in an open source

project.

I mean, wouldn’t it be awesome
if everyone could simply say,

you know, here’s the reason
why that doesn’t perform well.

Let’s go look at
the source code.

And I think everyone
should be able to do that.

So I’m super excited to be
able to work on a development

project that actually does
have an open source backend.

So that being said,
thank you very

much for coming this morning.

[APPLAUSE]


 

And I will take questions now.

OK?

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: I do not know.

That is a really good question.

You’ve now stumped me.

I’m so embarrassed.

That’s OK.

And yes?

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: Uh huh.

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: Mm hm.

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: So if it
is, like let’s say

you’re doing like Java getString
critical, which would mark that

as being in use.

That will never be moved.

That is fixed in memory.

DirectByteBuffers are also fixed
in memory, they can’t be moved.

There is a little bit of
weirdness around that.

Because if you look at the
way they are allocated,

there is a little bit of code
that checks around moving them.

But once you’re
actually accessing them,

getting that direct byte buffer
address, it is fixed in memory.

So it can be moved, however,
outside of that call.

So once that call goes
away, my understanding

is that it can be moved.

So again, it’s protected for the
lifetime of that call, I think.

That’s a really good question.

I think that’s what I remember,
and don’t quote me on that one.

I might be wrong.

It might be always protected.

But in looking at
the allocator, there

are actually two different kinds
of allocation that can happen.

And for a very, very small–
like less than three pages,

it goes into the
movable allocation pool.

And for things that are
larger than that, at least

in the current implementation,
it’s not movable ever.

So yeah, kind of yes and no.

Yeah?

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: It depends
on whether or not–

so the question is, if you’re
using a backend to deal with

this data, and you’re
talking to C++ code,

ultimately you need to get
that data into your C++ code,

is it better to just use the
networking services that are

built into the NDK, or is it
better to actually use and do

everything in Java?

And there’s kind of two
questions I have about this.

Is the performance of your
networking something you

actually even care
that much about?

That’s the first thing.

If you’re not on
the main thread,

and you’re processing
some stuff in native code,

you might not care that’s it’s
a little bit more expensive.

Because you’re not actually
affecting the frame

rate of your application.

And you might be saving
an enormous amount of time

by actually using the
implementations that

are in Java.

So as a general
rule, you really want

to look to see whether
or not you actually

care about that particular
performance loss,

and then weigh it.

Yes, for performance
you're going

to do way better if
you parse something

completely in native code–
especially if you’re not using

it on the Java side of things.

Then yeah, that
would make sense.

But the real question
you have to ask

is what’s the cost of that?

What’s the cost in
terms of opportunity?

How much more time is
it going to take me?

Is it really worthwhile?

And that’s– with all of these
things, that’s what I say.

If it’s an easy optimization,
like let’s throw a couple

parameters into a JNI
call, by all means do it.

Don’t waste more time,
don’t waste more battery.

But if it’s going to mean
rewriting an entire library,

then really look closely at
it and say, how much am I

really gaining out of this?

Mm hm?

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: Mm hm.

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: OK, so the
question is about JNI

versus using Renderscript.

When does it make sense?

So what’s really cool about
Renderscript first of all,

is that Renderscript is
actually LLVM bitcode that

gets compiled on the device.

And there’s some beautiful
things that you get from that.

One is that it can
be actually optimized

for that particular CPU
that’s running on that device,

to some degree.

They’re certain kinds
of optimizations

you can’t do for LLVM.

But there’s a whole
bunch that you can.

There’s intrinsics that you
can actually swap in and out.

There’s like peephole
optimizations

that are specific to actually
how that particular device

works.

So one of the secrets
of Renderscript

is that it actually can generate
better code than the Compiler

can, in some cases.

The second thing is it’s
also running in a kernel.

It’s actually running in
its own little tiny machine,

that is used to run
massively parallel stuff.

And it’s really set up to
do that very, very well.

So if your problem space
falls into Renderscript,

something that’s really
helped my parallelization,

and something that
is also helped

by using these intrinsics that
you get from the LLVM byte

code, then by all means use it.

But as a general
rule, I would say

that again you’re looking at
opportunity, time and cost.

If you’re not seeing that it’s
a performance issue that’s

impacting you, it may not make
sense to go through to that.

Part of the reason we have JNI
is to be able to reuse all this

crazy amount of C and C++
code that’s out there.

And so for me, it’s you always
have to balance these things.

But from a true
performance standpoint,

it is very possible
that Renderscript

will be the highest
performing way

to do certain kinds
of operations,

because it can just do
a whole bunch of things

that the Compiler can’t
do because it just

doesn’t know enough about
the system architecture.

And it really depends on how
well the individual OEMs have

actually managed to– or
chip providers have actually

managed to optimize the
Renderscript Compiler

on their particular chipsets.

So there’s a lot
of variables here.

I wish there was a
cut and dry answer.

But what’s great
about Renderscript–

the really cool reason you might
want to use it anyway– even

despite all that, is
because as I said,

the LLVM bitcode gets compiled
on the individual system.

So you only have to ship
one copy of the byte code.

You don’t have to use a
dependency on the NDK.

Or you don’t have to worry about
it bloating the size of your

build with a bunch of
different executables.

And that by itself
might be worth

investigating Renderscript,
just for that one reason.

Now with 64-bit, I
believe you actually

do need to ship 64-bit bitcode,
so it's not completely

transparent to
architecture, I think.

I haven’t actually tried this.

But I vaguely remember
reading that somewhere.

AUDIENCE: [INAUDIBLE]

DAN GALPIN: Sure.

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: So if you do
mallocs and frees,

it’s separate.

It’s actually using a
different allocator.

It’s using JE [? Malek ?]
when you’re doing stuff from

the NDK, and you’re
using [? Rozalek, ?] when

you’re in the virtual machine.

And the reason is that if you
went to the talk yesterday,

RosAlloc is really,
really good at garbage

collecting in the background.

And we’re trying to avoid heap
fragmentation by bucketing

all of our memory allocations.

jemalloc is not trying
to have everything cleaned up

in the background.

It doesn't have to
be as parallelized.

So it's a slightly faster
allocator than RosAlloc,

when you’re in native.

So yeah, they don’t share space.

It’s been a long
time thing in Android

that if you
desperately, desperately

need to run something that
couldn’t run inside of the heap

space that we give
you, in the run time

you can add native code.

There’s other ways
to do that too.

You can run multiple VMs.

Like by launching each activity
into a different process.

There’s all sorts of ways
of getting around this.

And even using ashmem as a last-
ditch resort– if you actually

are completely out
of all the memory,

we allow you to do that.

But realistically, yes, they’re
entirely separate heaps.

 

Yes?

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: Oh, it’s
not that complicated.

It’s just if you
are– so the reason

I say– complicated was
probably not the right term

to use there.

It’s mostly that we’ve changed
the structure of the way

the Gradle files look.

So if you actually look
at what we’ve done,

we’ve added the concept of model
into the experimental version

of Gradle.

Oh sorry, you can’t see.

Let me mirror, let me
mirror the display.

I just did like the dumb
Californian thing here.

All right.

So now you can see
what I’m seeing.

So if you take a look at
the Build.Gradle here,

you’ll notice that we’ve
added this concept of model.

So now Android is now at the
top, model is at the top.

So basically you need to
go through and restructure

your Gradle build a
little bit, in order

to take advantage of this.

There is some pretty
good stuff online.

You also see all the
kinds of standard things you'd

expect to see in the
old NDK build are there.

You can actually
add libraries here.

And also turn on– sorry,
static libraries as well as

dynamic libraries here.

So pretty basic stuff.

You can see I’m not using
any of this in that.

This is also how you build
different product flavors.

I’m building x86 arm
seven and arm eight.

Actually I’m building
all, because I

have all of these here.

It’s hilarious.

But any case, this is how
you would do product flavors,

and dependencies.

Again just like
normal Gradle stuff.

So it’s a little
different in structure.

but it’s really
not hard to set up

once you’ve actually set it up.

If you want, I can even show
you debugging, it’s really cool.

I think actually– I may not
be able to show you debugging.

We’re basically out of time.

But if you want I’ll
show you debugging.

If anyone wants to come
to a table out there,

I can show you how
the debugger works.

AUDIENCE: [INAUDIBLE]

 

DAN GALPIN: So if
[AUDIO OUT] two

options [AUDIO OUT] Google Play.

You can either upload
each individual variant

as a separate multi-APK chunk.

Basically those are all
separated by version codes.

Or your other option is, you
can put them all into one APK,

and it will do the right
thing when it actually

launches the applications.

You’ve got two options.

It really depends how
much native code you have,

and what percentage of
your APK size it is.

For some people, even having six
flavors of their NDK libraries

will only be a negligible
amount of their space.

For others, let’s say
you’re running something

really big and
heavy like Unity–

you know, it has its own
runtime and all sorts of stuff.

You’re definitely going to
have to seriously consider

[AUDIO OUT]
distribute it on Play.

So a multi-APK is
really the way to go

if you want [AUDIO OUT]
lots of different versions.

And [AUDIO OUT] native
machine like that, and I

highly recommend doing it.

[AUDIO OUT] use our translator
to actually run ARM code.

It’s pretty fast, but it’s not
nearly as battery efficient

as x86 code.

So I highly recommend
doing an x86 build as well.

And I think one of the
big things I hope we do

is make multiple APK
even easier to use.

Because right now, there’s
sort of a partitioning

scheme we suggest.

And it’s a little bit
more of a challenge

to walk through the first
time on the Play Store.

So I’m hoping we actually
make that better.

I think we're out of time though.

So I can totally take
questions afterwards.

But thank you all for coming.

I hope this was fun.

[APPLAUSE]

And enjoy the rest
of your barbecue.

[MUSIC PLAYING]

 

 
