Macro magic, Coroutines in production, Faster text parsing with SIMD at Bay Area C++

JFrog is a proud Community Sponsor for the Bay Area C++ Meetup Group

September 14, 2022

< 1 min read

Lightning Talk Extravaganza!

RSVP for this in-person event here: https://www.meetup.com/cpp-bay-area/events/287849729/

This month, prepare yourself for a variety of fast-paced and interesting talks from excellent speakers in the Bay Area.

We are looking for more speakers! If you have an idea for a 5 to 15 minute talk, please post your idea in the comments section below or in the #topic-suggestions channel in our Discord server.

Topics include:

  • Macro magic
  • Coroutines in production
  • Software Engineering Career Motivations and Job Market Trends
  • Faster text parsing with SIMD
  • Your talk here

Video Transcript

US – Frog Field
00:00
Okay, welcome to JFrog. Thanks a lot for coming out to our company. We're going to kick off the meetup in just a sec, but first I wanted to say just a few words as the host. So hopefully everybody is enjoying the food. How's the pizza?
00:16
Okay, excellent.
00:18
Um, also — did folks discover or find the raffle at the entrance?
00:24
Ok. So for folks who are nodding, your chances went up. If you didn't discover the raffle yet, feel free to scan this QR code. Margo, do you mind —
00:35
with this on the back table? So we'll leave it on the table by the food, and um, to enter to win the raffle, do it in the next five to ten minutes, because we're going to draw the winner closer to the beginning, and then we'll announce it at the end of the meetup,
00:51
and just a few words also — my name is Stephen Chin. I run the developer relations program at JFrog. So we sponsor meetups and open source projects and a lot of community efforts.
01:03
I'm very happy to be able to sponsor the Bay Area C++ group. So it's awesome that you guys come out here each month and are able to do great technical presentations. And I think you guys are in store for something very special tonight, with a series of awesome lightning talks by speakers
Unknown Speaker
01:20
and just one other note. One of the open source projects we sponsor is Conan,
US – Frog Field
01:26
the C/C++ package manager. So if you have questions about Conan, or you're interested in how you would package or manage your libraries: it's an entirely open-source effort which is not related to our commercial offerings, and it's a great way to manage your C and C++ libraries.
01:41
So with that
Unknown Speaker
01:42
I'm going to turn it over to Max, and thanks very much for joining us here.
US – Frog Field
01:51
All right.
01:52
So, hey, I'm Max, and welcome to the September C++ Bay Area meetup. Um — so, out of everyone here, who's here for the first time?
02:05
Oh, nice, Nice!
02:09
Let's see — we started the recording. I'll tell you to follow us on Twitter at CppBayArea, and we're also on Discord now. You can get the invite for that by going to discord.cppbayarea.com.
02:24
Um,
02:27
Let's see. Oh — if you want to take photos, we encourage you to do that, and you can tweet them or post them on the Discord. That'd be cool.
02:36
Uh, we're going to be ending at around 8:30. So
02:42
Ah get ready for that.
02:45
Um, Let’s see.
02:47
I'm supposed to describe recent C++ events. I don't know — well, there's CppCon going on right now, but if you're just finding out about this now, I'm sorry to say, you missed it. So
03:00
Oops. But now I'm supposed to say upcoming ones. So this is where I say there'll be another one next year. So there you go. There'll also probably be another one of these next month. Also, the LLVM meetup is next Monday
03:15
in Mountain View. You can look that up on meetup.com — search for
03:20
LLVM.
03:23
Let’s see.
03:25
And if anyone here has talks that they want to give, please reach out to us on either the meetup page or twitter or discord. We are definitely looking for more speakers.
03:36
That would be very helpful. Or if you want to help organize — if you're thinking: what is this guy doing, just struggling to read from, you know,
03:44
this, this phone? Anybody could do a better job —
03:48
you know, put your money where your mouth is. Uh, you try this next month.
03:52
So
03:54
Let's see. So, uh — is anybody here hiring?
03:58
Who here is hiring? All right — Allison.
04:12
My employer, Aurora Innovation, which is a publicly held vendor of
04:20
vehicle autonomy systems. We have an office in Mountain View. We are full-time-remote possible, with many remote workers.
04:29
I checked on the way here, and we have, ah, fifty-five software job openings; three say explicitly C++. There are locales associated with all the job openings, but those are just where the hiring manager
04:46
theoretically works.
04:47
So if you're interested in learning more about that, I'm happy to talk to you about it, or else look at Aurora's careers page. Thanks.
04:57
All right.
05:02
Oh, yeah. Who else is hiring? Sorry. Okay, let me — this is an efficient algorithm here to find everyone.
05:12
Hey, I work at Apple. We do high-performance computing in circuit simulation. We're looking for people who know some numerical analysis, and maybe a little bit of electrical engineering as well. So if you do, send me your resume. Thank you.
05:32
All right. Another one over here.
05:42
Yeah, my team at NVIDIA is hiring as well. We work on an open source project called CUTLASS — you can look it up on GitHub. It's essentially deep learning libraries, HPC, and open source BLAS libraries,
06:00
which are specific to NVIDIA GPUs. If you're interested, um, you can just ping anyone on the CUTLASS page, or come talk to me.
06:10
all right. Anyone else hiring
06:13
All right. JFrog.
06:15
Yes,
06:20
don’t we all um
06:23
all Righty
06:28
um. Oh, so after this we’re going to be getting dinner. We’re actually going to hold a poll to figure out where we’re going to eat,
06:36
because we couldn’t decide on our own.
06:39
Ah,
06:41
Let's see. I'm going to thank the host, which is JFrog, for doing pretty much everything for us.
06:49
I don’t even know really what I did for this, So I started reading off the script,
06:55
and this is where I will introduce the first speaker for tonight. That is Matt. He is going to be talking about C++.
Unknown Speaker
07:07
All right.
US – Frog Field
07:13
Whoa,
Unknown Speaker
07:32
Yeah, I’ll be it.
Unknown Speaker
07:33
What is it? Share it’s sharing the the problem.
US – Frog Field
07:51
So I was sort of—
07:53
I was sort of roped into this very exciting, um, sort of last-minute. It's like, oh, I should do something. And I've seen a lot of presentations where people do various benchmarks of standard containers and see, like: oh, vectors are actually way more efficient — why does anybody use this for literally anything? Um, you know—
08:10
Like, how does that compare? How much does it matter to, like, actually allocate things ahead of time, that sort of thing? So I thought it'd be fun to investigate that on my own, just for a lightning talk, and see if I can make some cool graphs and see what happens from that. So — oh,
Unknown Speaker
08:43
um,
US – Frog Field
08:45
yeah. So I made a piece of code that basically, um,
08:48
puts a bunch of different standard containers together and tests a few basic tasks — like insertion, like random find — on vectors, preallocated vectors, unsorted and sorted vectors, lists, sets, and preallocated sets,
09:05
just to see, like, what kind of difference it is, and also because I want to — I want to make a cool graph. I've seen a lot of graphs where people will show,
09:13
as, like, the number of elements in the container grows, how the performance varies, and compare that between the containers. And I thought it'd be cool, because they do that — but it also matters how big the item is,
09:27
for the performance, a lot, and they'll sort of show different comparisons. But I was like: oh, I wonder if I can make, like, a 3D graph, where one axis is how many of them there are and how different things compare, because they have very different profiles based on those two parameters.
09:45
Um. So I wrote a big script — it's…
09:49
I need to upload it still; I finished this, like, last night — and then I dumped all that into a JSON file. I wrote a big Python script that uses one of the many plotting libraries to turn that into a 3D visualization of the data.
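The speaker's actual code isn't shown in the transcript, but a minimal sketch of the kind of harness being described — timing a "random find" task against a vector's linear scan versus an `unordered_set`'s hash lookup — might look like this (the function names here are made up; a real run would sweep element count and element size and dump the timings to JSON):

```cpp
#include <algorithm>
#include <chrono>
#include <unordered_set>
#include <vector>

// Time a single invocation of f, in nanoseconds.
template <class F>
long long time_ns(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// "Random find" task, vector flavor: a linear scan per probe.
long long hits_in_vector(const std::vector<int>& v, const std::vector<int>& probes) {
    long long hits = 0;
    for (int p : probes)
        hits += (std::find(v.begin(), v.end(), p) != v.end());
    return hits;
}

// Same task, unordered_set flavor: one hash lookup per probe.
long long hits_in_set(const std::unordered_set<int>& s, const std::vector<int>& probes) {
    long long hits = 0;
    for (int p : probes)
        hits += static_cast<long long>(s.count(p));
    return hits;
}
```

Wrapping each call in `time_ns` for several container sizes gives exactly the kind of data points the 3D plots are built from.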
10:07
Yeah, I
Unknown Speaker
10:11
i’m, i’m. So i’m sorry.
Unknown Speaker
10:16
Thank you.
US – Frog Field
10:19
So I'm not — I'm not Max, y'all.
10:36
So he needs to re-log in to his Google account.
11:36
Speaking of CppCon this week — if you're wondering where the usual hosts are: ah, John and Richard, they are both at CppCon, so they thought it would be a cruel joke they could play to put strager and me in charge of everything
11:53
and just be totally radio silent for the last month. So we—
11:56
Ah, we sort of stumbled our way through this and realized a few nights ago —
12:03
specifically last night — that we were missing a lot of talks. So we, we—
12:08
specifically, I — bullied a couple of people into writing talks, uh, in the next…
12:15
yeah, within a day or so. So this is, uh —
12:19
it’ll be better when
12:21
you know, the event will be better organized.
12:24
Ah, John and Richard,
Unknown Speaker
12:31
It's going to be good stuff.
US – Frog Field
12:32
Um,
12:34
Yeah, I sort of explained the why. There are some good links too, um — if you want to see something like this, where they do just general comparisons. Um, I think I found it really interesting when I was younger, and I was like: oh — there's a lot of nuance in actual hardware, compared to, like, big-O notation, when learning these things,
12:54
and the optimizations that happen can render what you thought not true. And that was sort of, like, I think—
13:02
Um — oh, yeah. Brief answer:
13:07
brief tangent time. When I was implementing this, I created a struct that's basically templated on an arbitrary size, to have the arbitrary-size elements, and I was like, oh, I can—
13:26
Ah,
13:28
Yeah — so I hash it to do comparisons, and I have a value that I can use for an operator, just to be able to insert things and emplace things without running into issues.
13:38
But I ran into the problem that I wanted to pass it into generic insertion functions that don't know the size. But you can't do partial template specialization for functions — you can only do it for structs and classes.
13:54
But what you can do is you can create a struct that’s a template, and then do partial specialization on that.
14:00
Then, instead of having to specialize the whole function, you just do that, and then you have a static method. But it seems like such a hack,
14:06
and I don't know why this isn't an actual part of the language, because it's basically the same thing. And I just thought I would mention it as an aside, to be like: why am I the only person that's run into this? I don't understand.
14:21
Like, I probably could have just given up and done it in a different way. All the answers I found online were like, "you're probably doing it wrong" — but at that point I was committed.
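The workaround being described — delegating from a function template to a partially specialized class template with a static method — might look like this minimal sketch (names are hypothetical, not the speaker's code):

```cpp
#include <cstddef>

// We want behavior specialized on a size parameter, but function
// templates cannot be partially specialized. The classic workaround:
// partially specialize a struct and expose a static method.
template <class T, std::size_t N>
struct InsertHelper {
    static int kind() { return 0; }   // generic case
};

template <class T>
struct InsertHelper<T, 1> {           // partial specialization on the size
    static int kind() { return 1; }
};

// The user-facing function template just forwards to the struct's
// static method, instead of (impossibly) being partially specialized.
template <class T, std::size_t N>
int insert_kind() { return InsertHelper<T, N>::kind(); }
```

Callers see an ordinary function template, while all the dispatch lives in the struct.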
Unknown Speaker
14:32
It’s It’s like It’s all right.
Unknown Speaker
14:41
I’m: like i’m not this preacher.
US – Frog Field
15:28
So this is the Python script that I wrote that basically takes the big JSON that I dumped from my benchmark,
15:36
and, um, at the bottom it should create a bunch of lovely 3D graphs visualizing the results we got.
15:47
Whoa! That’s the scout one.
Unknown Speaker
15:57
I don’t see you at the back.
US – Frog Field
16:01
Um. So in red we have the list, which — so, okay, let's move over a little bit.
Unknown Speaker
16:10
I said, Oh, I got it. I got it.
US – Frog Field
16:12
Ok. So here we have the number of elements. So as we start here — also, all of these are logarithmic, because of the kind of data it is. I promise this looks better than the alternative.
16:23
Oh,
16:26
So, for instance, the list, kind of as you would expect, doesn't care about the number of — um, about the size of the element. It only cares about how many there are, in terms of performance. This is for random access.
16:36
So I basically just, um, picked a bunch of elements out of the list, like a sequence, and then I just kept the values. And then I went back, and I'm like: okay, find these as fast as you can. And then I measured how long it took to find them. So for — here you can see it's a curve like that. But for
16:55
for the vectors it's more sort of flat over here, and then it starts to curve up as both, um, number and size increase. But when you have stuff like — so an unordered set, obviously, is the very bottom. It's by far the fastest.
17:12
The sorted stuff is actually much, much faster than the unsorted, as you'd kind of expect — up there with the list, which I guess makes sense, because it has to do an iteration to index. I would have actually expected it to be a little bit different, but
17:30
in my test it ended up being out there,
17:32
and I thought it was really cool that you can sort of like, go around and visualize this.
17:42
Yeah,
17:44
— oh, "find," sorry.
17:46
Not like a window find — sorry, yeah, I was not clear. This is, this is "find this element."
17:54
"Find." Yeah, the misnomer. My bad.
18:00
I love it.
18:03
It’s
18:18
Yeah, there's — ah, there was a bit of the methodology that is perhaps not ideal. I probably should have done more runs, but this was just what I got from my most recent run. There are a few artifacts; if I ran it a bunch of times it would probably average out. But, um,
18:32
like, I think it’s more the trends. You can also, if you want you can see it
18:37
after you go through. I’m going to run the the non-normalized version, because I have a over that
18:42
or the non-logarithmic version.
18:45
Um. And so here we have — I'll try to be clear this time. Here we have,
18:49
um, random insert, which is: it inserts a random item into a random spot. If it's a sorted thing, it will just insert it where it's supposed to be. If it's not, it just picks a spot, and it's like: you go here.
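Preallocation matters for random insert because a growing vector reallocates its buffer as it fills up. A quick sketch (not from the talk) of the guarantee that `reserve` provides — no reallocation until the reserved capacity is exceeded:

```cpp
#include <cstddef>
#include <vector>

// Returns true if pushing n elements after reserve(n) ever
// reallocated the buffer. Per the standard, it should not:
// reserve guarantees stable storage until size exceeds capacity.
bool reallocated_after_reserve(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);                  // preallocate once, up front
    const int* before = v.data();
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i));
    return v.data() != before;     // pointer moved => reallocation happened
}
```

Without the `reserve` call, the same loop typically reallocates several times, which is the cost the "preallocated" curves avoid.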
Unknown Speaker
19:06
We’re still
US – Frog Field
19:17
until we see something somewhere.
19:21
The list is sort of much slower with a small number of elements. But as you get increasingly
19:30
larger elements —
19:34
yeah, on random insert, the effects of having to resize the vector start to become increasingly worse,
19:40
and it was sort of interesting to see that
19:43
obviously the um,
19:46
The unordered sets are still, like, way at the bottom, and if you turn off, um, the log scale it becomes even more apparent. It's completely flat against the bottom. It's not even close
19:57
here. These are all cursed. So we have trim-back, which is basically just: I removed, like, five elements from, ideally, one side. If it was, like, a set, I just
20:09
picked, at the edge, the frontmost five elements and removed those.
20:16
Yeah, this was kind of a mess.
20:18
we’re going to go. We’re going to do a
Unknown Speaker
20:20
as i’m right. I’m.
US – Frog Field
20:34
No. You should just use a pointer. Never do this.
20:39
I just think it’s interesting,
20:44
because, again, I've seen these presentations before, and I find it very interesting — especially when you have a lot of elements — again, how it compares to lists, and how it can drastically outperform them, even on insertion sometimes.
21:04
But this is
Unknown Speaker
21:07
there we go.
US – Frog Field
21:09
So this is random find. The—
21:12
we have at the very bottom. You can see
21:19
the light, it will be green —
21:23
uh, that's, like, the unordered sets and the
21:28
preallocated sorted vectors. The sorted vectors here are basically as fast as
21:33
the hash maps, which I found interesting, at least compared to having to do the manual find with the lists and the vectors. I was honestly shocked at how much faster this was. But I guess it makes sense.
21:46
And, yeah, it does seem to care about the element size a little bit.
21:54
It’s a winner,
22:05
and then you can see a lot more clearly, for, um, the random insert, just how much of a difference it makes, um, just being the vector — like, the vector,
22:15
yeah, just the standard, bog-standard, unsorted, not-preallocated vector — it does horribly.
22:21
And then you can see at the very bottom here, barely making a dent, the unordered sets,
22:28
just completely flat across. But over here it's fine. But I think it's interesting: you can see, as either of these gets bigger, it just goes up like this.
22:39
It’s more just like consistent across
22:44
this one. It’s impossible to see, because the
Unknown Speaker
22:50
and that’s honestly pretty much it um
Unknown Speaker
22:54
tough question.
US – Frog Field
22:57
I kind of just wanted to.
23:00
This is honestly just an experiment that I wanted to do for a while, because I thought it would be fun to visualize this, see what it looks like — and a fun experiment to learn these libraries.
23:09
But if you all have questions, I'm happy to take them.
23:25
Yeah, for this I was trying to— originally, with the very small elements, it was very hard to tell. I use
23:35
clock_gettime with CLOCK_MONOTONIC, to the millisecond. I'm using just the Linux standard clock libraries.
23:52
Well, I left my — left my script, so— oh, it's just the next speaker I think I'm supposed to do now, which will be Macro Magic.
24:27
All right.
24:28
I need to pull up my notes.
24:30
Okay,
24:32
This is "Extending C++ with Macros for Fun, and Not So Much Profit." I did this purely for fun,
24:40
and this is about terse lambdas, because lambdas in C++ can be pretty verbose.
24:46
So.
24:48
Ah — why would you want to use macros? I'm going to talk about that a little bit, and the motivation behind why I made this, and some design principles that I used for coming up with the macro. Then we can talk about the actual macro, and I'll wrap up with some of the difficulties I had.
25:05
So why would you want to use Macros?
25:07
Ah, the main reason I know of for macros is if there's some feature missing from the language — particularly, particularly around C++ being verbose. You can also use them to polyfill attributes. You can say, oh, I want to
25:26
mark this variable as possibly unused — but if you want to be able to support before C++17, when that attribute became standard, you can do that with a macro.
25:37
One example is forwarding: uh, std::forward is painful to use sometimes, particularly with lambdas, where you don't know the types of your arguments, and
25:51
it can be much nicer to just use a macro to do that for yourself.
25:55
Another example: if you have, like, a std::expected — which is coming in C++23, I believe — and you want to propagate the error upwards, like return it if it's an error,
26:08
it can be painful to keep doing this manually. You could write a macro to do it for you.
26:15
not saying that it’s necessarily a great idea, but you could do it.
26:19
If you’re curious. This is what those macro definitions would look like,
26:23
although I haven't tested them. This is slide code.
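The slides themselves aren't visible in the transcript, but the first two macros — the attribute polyfill and the forwarding helper — would plausibly look something like this. This is untested "slide code" in the same spirit; the names `MY_MAYBE_UNUSED`, `MY_FWD`, and `pass_through` are made up for illustration:

```cpp
#include <utility>

// Polyfill for [[maybe_unused]] on pre-C++17 compilers.
#if __cplusplus >= 201703L
#define MY_MAYBE_UNUSED [[maybe_unused]]
#else
#define MY_MAYBE_UNUSED
#endif

// Shorthand for perfect forwarding -- handy in generic lambdas,
// where the argument types are never spelled out.
#define MY_FWD(x) std::forward<decltype(x)>(x)

// Example: forwards its argument without naming its type.
auto pass_through = [](auto&& x) -> decltype(auto) { return MY_FWD(x); };
```

A `TRY`-style macro for propagating `std::expected` errors is messier, since it needs an early `return` inside an expression, which is why it stays firmly in the "not necessarily a great idea" column.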
26:28
So this is the story of writing a macro like that, to hack terse lambdas into the language.
26:35
So, my macro design principles:
26:38
don’t use a macro
26:40
if it's possible — regular functions or templates are just better. Ah, magic macros change the syntax of C++, which just makes it harder for people to understand the code when they're coming to the project. It really has to be worth it if you're going to add a macro.
26:58
And yeah, you really shouldn’t use macros most of the time
27:02
in the for-profit case I wouldn’t use Macros much.
27:06
But the other thing is, I would say: don't do anything too cute. So in the Reddit post that I made about this,
27:15
I saw people recommending, oh, why don't you #define underscore-one for the first parameter of the macro, or #define lowercase r as the macro. And I'm like: I don't like that, because that pollutes the global namespace, and that's not a good thing.
27:30
And I also think that it's important to make the macro fit into the language in a way that makes sense.
27:37
And I also think that it’s a bad idea to do things that are too complicated in particular. There’s this library in boost. It’s boost preprocessor.
27:46
It can be a blast to play around with. But anytime that I come back to Code that I wrote with it. I have no idea what it’s doing,
27:54
so I recommend avoiding that,
27:57
And then, because C++ is very limited in what you can do with macros, you're going to have to be flexible about the syntax.
28:03
So why terse lambdas? Colby Pike, who's also known as vector-of-bool, wrote a blog post, "Now I Am Become Perl," in 2018. In this blog post, Pike argues for terse lambdas, pointing out that they can help with readability in some cases. For example,
28:23
with regular C++ lambdas, the syntax can get in the way of the code, and especially for examples that are more complex than this, it requires — for me — pushing things onto my mental stack to understand what the code is doing.
28:37
So in this case, if I'm reading the code, I see: oh, we're sorting some strings. Okay?
28:43
And then, oh, suddenly there's a whole bunch of lambda in there. We have a lambda coming in, and then I have to check: do these parameters make sense? Const string reference — that makes sense for sorting; two of the same — that makes sense; it might also make sense to take a string_view.
28:59
So these things are going through my mind. And then: oh, we're sorting by the size of the strings. That kind of makes sense. But then I have to
29:08
piece together all these little pieces that I read to understand what’s going on, and
29:13
I find that that makes it harder to read if a lambda is this long.
29:18
But if you imagine using vector-of-bool's imaginary syntax, you might have some terse lambda that looks like this, and for me this means I don't have to do any of that mental stack pushing, because I just look at that and I see: oh, we're comparing the first argument and the second argument by size,
29:35
and I do that without having to — it's just one chunk that I read it as.
29:40
So I can very quickly see that this is sorting the strings by size.
29:44
But if you look at that, that's nowhere near valid C++.
29:48
You can't take the reference of an integer literal, for instance, and that's
29:54
not what’s intended at all. But you can imagine you could replace those Ampersands with an underscore,
30:01
and then maybe make some variable names for that.
30:04
So if you squint, if you imagine some macro that looks like this, this looks kind of similar to that magic macro syntax.
30:17
So how would we make this look and feel a little bit more like a lambda? Because I really don't like seeing something like this that looks like a function call when it's not a function at all — it's creating a lambda for us. So the thing that sets lambdas apart, when I'm reading C++ code, is the capture list,
30:40
although, being terse —
30:42
uh, "terse lambda" is surprisingly long to type. So there we go. It's a little bit better.
30:51
So how about we do this? How do we get those underscore-ones and so on on the screen?
30:57
If we want to have —
30:59
could you move the zoom thing? — if we want to have the parameters:
31:05
we don't know the right number to create just from the macro invocation.
31:09
We don't know the number of arguments that we're supposed to make.
31:12
So how do we do this? We could try overloading a bunch of lambdas,
31:18
but that’s kind of horrible,
31:20
and it would also prevent us from having that prefix capture list,
31:24
so we could also make the number of arguments be variadic, and then magically create
31:32
uh, underscore-one and so on as variable names — and that's what I ended up trying.
31:38
So
31:40
you can imagine having some magic way of creating the variable names, and the new macro
31:48
just returning the expression that you were given. And the reason why there are no brackets at the beginning of that #define is because we want to see that capture list there.
31:59
And so you might have this being expanded to that. And you're like: okay, so how do I get that — the _1 and _2?
32:05
Well, if we have a little bit of template machinery, we can return the nth parameter of a pack fairly straightforwardly, so I just have to do this,
32:16
but that still has a problem, because what happens if you pass fewer arguments in that.
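A sketch of that nth-parameter machinery, plus a toy version of the macro built on it. This is my reconstruction, not the speaker's actual code — and, as just noted, it fails to compile if the lambda receives fewer arguments than the expression names:

```cpp
#include <cstddef>
#include <utility>

// Return the Nth argument of a pack, preserving value category.
template <std::size_t N, class T, class... Ts>
constexpr decltype(auto) nth_arg(T&& t, Ts&&... ts) {
    if constexpr (N == 0)
        return std::forward<T>(t);
    else
        return nth_arg<N - 1>(std::forward<Ts>(ts)...);
}

// Toy terse-lambda macro: _1 and _2 name the first two arguments.
// Like the version discussed above, this breaks at compile time if
// the lambda is called with fewer arguments than the macro extracts.
#define TL(expr)                                            \
    [](auto&&... ts) -> decltype(auto) {                    \
        [[maybe_unused]] auto&& _1 = nth_arg<0>(ts...);     \
        [[maybe_unused]] auto&& _2 = nth_arg<1>(ts...);     \
        return (expr);                                      \
    }
```

With this, `TL(_1.size() < _2.size())` expands to a two-argument generic lambda usable as a comparator.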
32:21
Oh, yeah — it's also nice to use a variadic, because they really are like arguments; then you can make a variadic lambda if you wanted to. Like, you could return the sum of all of the arguments.
32:32
How do you do it if there are fewer parameters being passed into the thing than are being used? So, like, with std::ranges it's fine passing in a single string and saying: does this predicate match it? And we obviously need to continue compiling in that case.
32:48
So how would we do that?
32:50
My solution to that
32:52
was to return a placeholder type?
32:54
I called it not_a_parameter, because I figured that that would be a little bit better of an error message to see.
33:01
And then you can just extract that particular argument with, uh—
33:08
yeah. So in this case, this is the function I use for extracting the argument, and I'm doing the bounds check first, because it was easy to implement that way, and then a little bit of recursive templates
33:20
is the way that I use to pull out the particular index,
33:28
and that’s about it. It works.
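The placeholder approach being described might look like this sketch — reconstructed, and using `std::forward_as_tuple` for the extraction rather than the speaker's hand-rolled recursion:

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Dummy type bound to an _N that has no corresponding argument.
// Using it in the expression then produces a readable-ish error,
// while *not* using it lets the lambda compile anyway.
struct not_a_parameter {};

// Bounds check first (as in the talk), then pull out the Nth
// argument; otherwise hand back the placeholder.
template <std::size_t N, class... Ts>
constexpr decltype(auto) nth_or_placeholder(Ts&&... ts) {
    if constexpr (N < sizeof...(Ts))
        return std::get<N>(std::forward_as_tuple(std::forward<Ts>(ts)...));
    else
        return not_a_parameter{};
}
```

So a terse lambda whose body mentions only `_1` still compiles when called with one argument, because binding `_2` to `not_a_parameter` is harmless until the expression actually uses it.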
33:31
But there are some additional features and problems with this. So for one, those examples that I showed, uh, return everything by value. And if you're going to return strings by value, for instance, that's kind of expensive.
33:48
Maybe you'd want, in some cases, to return things by reference. But what if you made that — if you made that macro always return by decltype(auto), which would return by reference?
33:59
Then what happens when you return, like, a literal value? If you return parenthesized things, C++ says: oh, it's the value type of that expression — and so that's going to be an rvalue reference, and then we're returning a dangling reference.
34:13
Or, if you remove the parentheses, then it would return that by value. But then, if you're dealing with struct members,
34:19
it's going to return those by value when you meant to return them by reference. So that's kind of an unsolvable problem, and my solution to that was just to create a second macro.
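The trade-off being described can be seen in a small sketch (illustrative only — these are not the talk's macros):

```cpp
struct Point { int x; };

// Value-returning flavor: always copies -- safe, but expensive for
// big members, and you cannot write through the result.
auto get_x_by_value = [](Point& p) { return p.x; };

// Reference-propagating flavor: decltype(auto) plus a parenthesized
// return yields int& here. Cheap for struct members -- but the same
// pattern returns a dangling reference if the expression names a
// local or a temporary, which is why one macro cannot do both jobs.
auto get_x_by_ref = [](Point& p) -> decltype(auto) { return (p.x); };
```

Neither choice is right for every expression, which is the "unsolvable problem" that motivates having two macros.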
34:31
And then you have, like: what if you want to make your lambda noexcept-friendly, or SFINAE-friendly? Those are
34:39
surprisingly difficult, particularly in 2018, when I was playing with this, when compilers did not support C++20 at all.
34:47
And then, as the C++20 features slowly rolled in, I was able to implement a little bit at a time.
34:54
And then there's also the question of how many parameters you want to support. I arbitrarily chose the number four, because I was thinking:
35:01
if you’re using more than that,
35:03
go use a regular lambda.
35:06
Yeah, if you're interested in looking at my code, it's posted on GitHub. It's MIT-licensed, and it's scary to imagine people actually using this in production — so don't do that.
35:22
Anyways, that's how I made a terse lambda macro.
35:47
Honestly, 2018 is long enough ago that I don't remember.
35:51
It was one of the first.
35:53
It was either the first or the second.
36:00
All right. And now we are going to be learning about a faster strlen.
36:06
You just use that science.
Unknown Speaker
36:08
It’s
US – Frog Field
36:08
I’ll get it bye.
36:11
Thank you, Max.
36:18
It’s still presenting your
Unknown Speaker
36:21
they.
US – Frog Field
36:40
There we go.
36:43
Oh, my name's cut off on the corner. I am strager. Um, I'm going to be talking about faster text processing with SIMD. So, who here is familiar with SIMD?
36:55
About half of you.
36:56
and who here has done text processing or text parsing
37:01
about the same, and who wants to make their code faster.
37:05
It should be everybody. Okay,
37:08
So, as Max said, we're going to discuss strlen. So strlen, at its core, is finding the null terminator, right?
37:17
So we’re going to be dealing with C. Strings here.
37:20
So our function, find_null, takes in a char pointer,
37:24
and as Andrei Alexandrescu always says: if you want to write fast code, you have to start with an infinite loop. So we write an infinite loop,
37:32
and we process the first character by checking. If it’s
37:36
null — and if it is, we're done; we can return the pointer to that null —
37:41
and otherwise we need to move on to the next character in the string, and then loop through and try it.
37:48
So that’s the naive implementation.
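The naive implementation just described might look like this sketch (`find_null` is my best reading of the function name in the transcript):

```cpp
#include <cstddef>

// Naive strlen core: start with an infinite loop, check one
// character, return when we hit the null terminator, otherwise
// advance to the next character and go around again.
const char* find_null(const char* p) {
    for (;;) {
        if (*p == '\0')
            return p;   // found the terminator
        ++p;            // move on to the next character
    }
}

std::size_t naive_strlen(const char* s) {
    return static_cast<std::size_t>(find_null(s) - s);
}
```

One byte examined per loop iteration — which is exactly what the SIMD fast path will improve on.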
37:51
So here's a diagram showing a string, and we get a pointer to the letter T, the first character. We check if it's a null — it's not a null, so we move on — and we keep doing that for every character
38:06
until we hit the null terminator, and then we stop our algorithm.
38:11
So what we want to do is take that algorithm
38:16
and make it faster. But instead of changing the code, we're going to add new code on top. So what we're going to do is use SIMD to process sixteen bytes at a time, and when we process sixteen bytes at a time—
38:29
Um!
38:31
We will process sixteen bytes, and we’ll look for null terminators in those sixteen bytes.
38:37
It's going to be hardware-accelerated, so we're not writing, like, a mini loop for the sixteen bytes — it's going to be just one instruction.
38:44
So after we process the sixteen bytes, then we're going to do the slower algorithm if we do notice a null terminator, because we need to know exactly where the null terminator is.
38:56
So we’re going to put the fast path above the old implementation.
39:01
So what that looks like is, we get a pointer to the beginning of the string just like before,
39:06
and we use a vector register.
39:09
So on my ARM machine right here, it's a uint8x16_t; on Intel there's an equivalent,
39:15
and that stores sixteen eight-bit integers.
39:20
So we were load up all sixteen at once,
39:23
and we're going to compare them component-wise with a bunch of null terminators — or, well, null bytes —
39:29
so component-wise means we check. Well, the Cpu will check the l against the first null terminator,
39:36
and it gives us false, which will represent a zero.
39:40
Then it checks the second, and it is zero. You know, it does all of these in parallel, because it's hardware — it can be parallelized
39:47
um
39:48
as just one instruction — so, super fast. Once we have the result of this equality check,
39:55
we will check if there are any ones in the result — any trues — and that tells us whether there are any null terminators or not. In this case it's all zeros, which means there are no null terminators.
40:07
So we just skip the sixteen bytes, and then move on to this part the second half of the string.
40:14
So the second half. When we do the comparison. There is a null terminator there at the very end,
40:20
so what we need to do is exit the fast-path loop, and then go on to the slow path, which checks character by character.
40:30
Okay, so here’s the implementation for that.
40:33
Ah, first we need to allocate a bunch of zeros, that’s pretty straightforward, and inside the loop we have this load. So this is a sixteen-byte load, the vld1q_u8
40:45
stands for vector load one quadword,
40:50
and u8 is the type we care about. So quadword is the sixteen bytes.
40:56
Um, so this is loading it into our chars variable.
41:00
Then we do a vceqq_u8,
41:04
which is a vector component-wise equality comparison.
41:08
So that’s the equals, and we compare the zeros with the characters.
41:13
Then, in order to figure out if there are any ones in the result, we use max to fold them: if they’re all zeros, max is going to give us zero, and if any of them are one, we’re going to get a one out.
41:24
So we use that to tell: did we find any null terminators? If we did find a null terminator in the sixteen bytes, we go to the slow path, and we check character by character,
41:35
and if we found no null terminators, we skip ahead sixteen bytes.
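Putting the pieces together, here is a sketch of the fast-path/slow-path loop described above. On AArch64 it uses the NEON intrinsics named in the talk (vld1q_u8, vceqq_u8, vmaxvq_u8); elsewhere a scalar stand-in keeps the sketch runnable. The structure and function name are assumptions, and the caller must guarantee sixteen readable bytes per read (which is exactly the issue the talk turns to next):

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Sketch of the fast-path/slow-path strlen from the talk.
// Assumption: at least 16 readable bytes per iteration (e.g. trailing
// null padding), otherwise the wide reads can run off the allocation.
std::size_t fast_strlen(const char* s) {
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(s);
    for (;;) {
#if defined(__ARM_NEON) && defined(__aarch64__)
        uint8x16_t chars = vld1q_u8(p);                     // load 16 bytes at once
        uint8x16_t eq    = vceqq_u8(chars, vdupq_n_u8(0));  // lane-wise compare with 0
        if (vmaxvq_u8(eq) != 0) break;                      // any lane equal to zero?
#else
        bool any_null = false;                              // scalar stand-in for the
        for (int i = 0; i < 16; ++i)                        // vector compare
            if (p[i] == 0) { any_null = true; break; }
        if (any_null) break;
#endif
        p += 16;                                            // fast path: skip 16 bytes
    }
    while (*p != 0) ++p;                                    // slow path: exact position
    return static_cast<std::size_t>(
        p - reinterpret_cast<const std::uint8_t*>(s));
}
```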
41:50
Yes. So what happens when we run the program?
41:54
We get a bus error, exactly for the reason you said. So the problem is that we’re
42:03
Yeah, yeah, I’m gonna do that right now.
42:05
So when we’re reading sixteen bytes at a time, if our string is smaller than sixteen bytes, then we’re going to be reading out of bounds of the string, and the problem with that is, one, you can get garbage data. But that’s not really a problem for our algorithm, because garbage after the null terminator is fine.
42:23
But the real issue is we get the crash, because what is out of bounds might be unallocated pages, and if it’s unallocated pages, the CPU is going to trap and our program is going to crash,
42:35
so to fix that issue
42:38
we just add more null terminators. We could add a check in our loop to check the length. Well, we don’t know the length, so we can’t do anything but add more null terminators.
42:53
Uh, so what this does is it means any read that we have that is sixteen bytes wide
42:59
is going to be in bounds of our array.
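One way to get that guarantee, sketched here under the assumption that you control the allocation (hypothetical helper, not from the talk, which adds the padding when loading files from disk):

```cpp
#include <cstring>
#include <string>

// Sketch: copy the input into a buffer with 16 extra null bytes at the
// end, so every 16-byte-wide read that starts inside the string stays
// in bounds of the allocation. (Hypothetical helper name.)
std::string padded_copy(const char* src, std::size_t len) {
    std::string buf(len + 16, '\0');   // payload + 16 null terminators
    std::memcpy(&buf[0], src, len);    // copy the payload over the zeros
    return buf;
}
```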
43:03
So once we fix that
43:05
ah, for small strings
43:08
like the Hello World string, it’s basically the same speed as our simple implementation, because our fast path doesn’t have a chance to do anything.
43:18
Ah, but for huge strings we get a huge performance improvement, about fifteen times or fifteen and a half times speedup, which is what you would expect if you’re processing sixteen bytes at a time instead of one byte at a time.
43:31
So actually we’re about the speed of the standard strlen implementation already with huge strings. But of course, for small strings we’re still slower. So how do we make the small string case faster?
43:43
Well, to make the small string case faster, we need to optimize that, uh, that fallback. And how we’re going to do that is, we can observe that this number we build at the bottom with the zeros and the ones
44:01
we can convert that to a binary number,
44:04
and for reasons you need to swap it.
44:07
Ah, you know, put it backwards,
44:09
and we can use a CPU instruction to count the number of zeros,
44:13
and that’ll tell us the length of the string, which is eleven, because you know five for the Hello, five for the world, and then the space.
44:21
So this is the code for that.
44:25
After we do the equality comparison, we do a bitmask operation. We check if the mask has any ones in it, indicating null terminators. If it does have a null terminator in it, we count the zeros, and that’s where we found the null terminator,
44:44
otherwise we can
44:47
skip ahead sixteen bytes,
44:51
and there’s now no more slow path.
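A portable sketch of that mask-based version: one bit per byte, set where the byte is zero, and the number of trailing zeros in the mask is the offset of the first null in the block. The scalar loops here stand in for the vector compare and the movemask/bit-reverse-and-count step from the talk (std::countr_zero in C++20, rbit+clz on ARM):

```cpp
#include <cstddef>
#include <cstdint>

// Portable sketch of the mask-based strlen with no slow path.
// Assumption: at least 16 readable bytes past the string's end.
std::size_t mask_strlen(const char* s) {
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(s);
    for (std::size_t base = 0;; base += 16) {
        unsigned mask = 0;
        for (unsigned i = 0; i < 16; ++i)   // build the per-byte bitmask
            if (p[base + i] == 0)
                mask |= 1u << i;
        if (mask != 0) {                    // a null is in this block
            unsigned idx = 0;               // count trailing zeros by hand
            while (!(mask & (1u << idx)))   // (a single instruction on
                ++idx;                      //  real hardware)
            return base + idx;
        }
        // no null found: skip ahead 16 bytes, no slow path needed
    }
}
```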
44:54
So this is the previous version.
44:59
And when we do this on a small string we’re really, really fast, we’re actually faster than the standard strlen implementation, because on a very small string
45:09
the standard strlen needs to do some bookkeeping overhead, because the standard strlen can’t take advantage of the fact that we have a bunch of null bytes at the end, so it needs to do some checks
45:20
to see if it’s safe to access that memory. We just assume that we can access the memory, because we have control over the strings.
45:26
So our strlen is quite a bit faster.
45:38
This benchmark does not count the allocation of the string. This is just the strlen operation.
45:59
So the comment was that we need to reallocate the string to add the null terminators, and that would slow things down. But here we’re assuming that you have control over allocating the data.
46:09
So in my case I’m building a compiler, and we’re loading files from disk. And when I’m reading those files I can just add the sixteen bytes in the allocation,
46:19
um, or if you’re downloading things from the network, you make a slightly bigger buffer. You know, so usually you have control over the strings, it’s not just given to you. If you’re writing a general-purpose library, you don’t have that ability, but you wouldn’t be able to take advantage of these optimizations anyway, because you don’t know
46:37
if you’re going to copy the string to the heap, you don’t know how big to allocate in the first place. The point of this function is to find the length of the string. Ah, but for other operations that might make sense. Um, I’ve seen some JSON parsing libraries do that, where you give it a string,
46:55
and it’ll copy the string for you
46:57
so that it could do these optimizations.
47:01
So on small strings our masking approach is faster.
47:06
Ah, but let’s look at huge strings.
47:09
So our huge string case is, ah, two and a half times slower than our other method. And the reason it’s slower with huge strings is because that bitmask function that I talked about is actually quite a bit of code on ARM. On Intel it’s one instruction, but on ARM we need to do some emulation, and it’s kind of gross.
47:28
It compiles down to a handful of assembly instructions, which are pretty fast. You can see it is still much faster than the naive character-by-character implementation, but it’s still some overhead you need to account for.
47:47
So the small string is just hello space world, so it’s eleven
47:51
bytes plus null terminators, and the huge string here is one megabyte.
48:00
Yeah, it’s a really big one.
48:03
There are just two. It’s small and huge.
48:09
Okay. And here are some credits. So I’ll leave it open to questions.
48:27
Yes.
48:29
So the comment was that instead of adding the null terminators
48:34
um, instead of adding null terminators like we do here, instead you can, like, uh,
48:41
figure out, you down-align the address,
48:45
and by down-aligning the address you can guarantee you never hit the edge of the page, and you never hit that CPU trap, and that is what the standard library implementation of strlen does,
48:57
um. But as you saw
49:00
in our benchmark, we were able to beat the performance.
49:05
We were able to beat the performance of the standard strlen, because we don’t need to do that,
49:12
but we don’t necessarily need to do that, right?
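The down-aligning trick from the question can be sketched as simple pointer arithmetic (hypothetical helper name; the first iteration must then ignore the bytes before the real start of the string):

```cpp
#include <cstdint>

// Round the address down to a 16-byte boundary. Page boundaries are
// multiples of 16, so an aligned 16-byte read never crosses into a
// possibly-unmapped page, and the CPU trap can't happen.
const char* align_down_16(const char* p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<const char*>(
        addr & ~static_cast<std::uintptr_t>(15));
}
```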
49:25
So the comment was that we could combine my optimization with the standard library implementation. I don’t know if the standard library does this masking optimization.
49:33
I know that ARM provides strlen functions; that’s kind of what I based this implementation on. I don’t know if Apple or any other OSes use that version, or if they have their own. I didn’t check.
49:48
So yeah, it could be an optimization opportunity.
50:15
So the comment was that the Standard Library will have a slow path for the beginning of the string, and then it goes fast once it’s aligned. But what we could do is align our strings so we can avoid that check and always do the fast path
50:29
That would work if you’re getting the lengths of strings you’re processing from the beginning. But often when you’re parsing, like, JSON or other human languages or whatever,
50:42
you’re you’re not necessarily aligned right.
50:44
You’re in the middle of a string somewhere. So one example I have is parsing comments, and you don’t know whether a comment in the source code is going to be aligned to a sixteen-byte boundary,
50:58
so you would still need to do the slow path for comment processing, or just use unaligned accesses.
51:12
So SIMD operations, some of them need to be aligned, and some of them don’t need to be aligned. It depends on what functions you use. The function I use to load the data does not require alignment.
51:24
The vld1q_u8 does not need alignment. If you just do a cast instead and try to read, that would require alignment. It actually depends on the CPU, some CPUs do different things. To be on the safe side, you should use the functions that allow unaligned access.
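A common portable way to express that "unaligned access" safely is a memcpy into a register-sized temporary; this is a general sketch, not from the talk. Unlike dereferencing a cast pointer, which formally requires alignment and can trap on some CPUs, memcpy is always legal, and compilers turn it into a single unaligned load where the hardware allows one:

```cpp
#include <cstdint>
#include <cstring>

// Safely load 8 bytes from an address with no alignment guarantee.
std::uint64_t load_u64_unaligned(const char* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);   // compiles to one load on most CPUs
    return v;
}
```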
52:13
Oh, so can we parallelize this across threads?
52:16
Is that what your question is?
52:18
Um, so for, like, a real example. So I showed strlen because it’s a super basic function that everybody understands. But realistically, you wouldn’t search for just the null terminator; especially in C++, you could just store the size. So what you would do is something like find
52:37
a newline character, like a line terminator, or something like that. And you could use the same algorithm to find the line terminator.
52:46
Um, and here you would know the size of your whole string, and you could parallelize with, like, threads, and divvy up the work in, like, megabyte chunks or something, if you wanted,
52:59
using std::transform_reduce or Threading Building Blocks or whatever library.
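The chunked-parallel idea can be sketched like this. It uses std::async to stay self-contained, where the talk suggests std::transform_reduce or Threading Building Blocks, and std::count stands in for the SIMD inner loop; the function name and chunk size are assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// When the total size is known, divvy the buffer into fixed-size chunks
// and scan each chunk concurrently, then sum the per-chunk results.
std::size_t parallel_count_newlines(const char* data, std::size_t len,
                                    std::size_t chunk = 1 << 20) {
    std::vector<std::future<std::size_t>> tasks;
    for (std::size_t off = 0; off < len; off += chunk) {
        std::size_t n = std::min(chunk, len - off);
        tasks.push_back(std::async(std::launch::async, [=] {
            // one task per chunk; a SIMD scan could replace std::count
            return static_cast<std::size_t>(
                std::count(data + off, data + off + n, '\n'));
        }));
    }
    std::size_t total = 0;
    for (auto& t : tasks) total += t.get();   // sum the per-chunk counts
    return total;
}
```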
53:21
Okay, so what can the compiler do to optimize these things? The compiler does a bad job in general. If you carefully code it just right, you can make Clang or GCC compile the basic strlen function into a call to the standard strlen, um, instead of actually writing the loop out.
53:40
I have not seen any compiler vectorize these things. One of the problems is that it doesn’t know the length of the string, so it doesn’t know if it’s safe to access some number of elements ahead of time. I’ve tried telling it the length of the string
53:57
and then doing a strlen, and it still doesn’t work. I haven’t been able to get a compiler to optimize these
54:03
without manual SIMD operations.
54:11
Okay, thank you very much.
user avatar
Unknown Speaker
54:18
Yes,
user avatar
US – Frog Field
54:24
you don’t need a zero
user avatar
Unknown Speaker
54:25
oops.
user avatar
US – Frog Field
54:30
Okay? So the comment is, to find if there is a zero in a word, you don’t need,
54:36
you don’t need to use SIMD. You are correct, but SIMD gives us access to bigger registers. We could use a long, like a sixteen- or sixty-four-bit integer, to do this operation, or at least the first version of it.
54:50
Um, but the problem is then we would only be going eight bytes at a time instead of sixteen bytes, and if you only go eight bytes at a time, you only get an 8x speedup instead of a 16x speedup.
55:09
Um, for some algorithms you are able to, um, use a sixty-four-bit integer or a thirty-two-bit integer, but for some more complicated things you need component-wise operations. So, for example, if you’re going to find identifiers, you need to be able to find, like, A to Z,
55:28
and in order to find A to Z you need to do less-than or greater-than, and to do that, it just doesn’t work without the help of SIMD instructions.
55:35
So you’re right, like, a strlen function would be able to be faster even if you didn’t have SIMD. But more complicated things wouldn’t be able to take advantage of those registers.
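The word-at-a-time trick the commenter is most likely referring to (an assumption on my part) is the classic SWAR zero-byte test, which needs only plain integer operations:

```cpp
#include <cstdint>

// Detect a zero byte in a 64-bit word without SIMD: the expression is
// nonzero exactly when some byte of v is zero. Subtracting 1 from each
// byte borrows through a zero byte and sets its high bit, ~v masks out
// bytes that already had their high bit set.
bool has_zero_byte(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```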
55:50
All right, also, thanks to JFrog for hosting.
55:59
All right. We ended perfectly on time, which was my main goal here,
56:05
but we have three minutes left, which is just enough time for us to do the raffle. Look at the size of that thing!
56:14
And I hope everyone signed up who wanted to.
56:20
All right. Margo’s going to announce our winner,
56:23
hey? Banana is, I mean, Michelle is holding our Okay. The winner is Ivan. Oh, Yang,
56:33
what?
56:38
Yes,
56:39
All right, come on inside and get your prize. Thank you so much, everyone, for coming. Feel free to take any of the leftover pizza.
56:48
All right. So the very last thing we need to do, the most important part of, you know, the C++ Bay Area Meetup, is, of course, the dinner afterwards. So we were able to narrow it down to two places, and we are going to let you guys vote on where you want to eat. So the first option we have
57:06
is Off the Rails, uh, Brewing Co., which is right down the street, uh, on Murphy Street, in Sunnyvale.
57:13
uh I actually don’t know if I would drive. It is, I imagine, maybe like ten minutes or so.
57:18
And then the second option is five guys, which is.