AMD EPYC Advanced User Training on Expanse

Presented on Wednesday, April 21, 2021 by Mark Klonower (AMD), Mary Thomas, Mahidhar Tatineni, Bob Sinkovits (SDSC).

The complexity of the AMD EPYC architecture, with its large core counts, non-uniform memory access and distributed L3 caches, can make it challenging to obtain peak performance. This 3-hour webinar covers a range of intermediate-to-advanced topics that will help you to make most effective use of Expanse and other EPYC-based systems. These include an overview of the EPYC architecture, AMD's compilers and math libraries, strategies for mapping processes and tasks to compute cores, Slurm, application tuning and profiling tools.

Github | Youtube | Chat Text

Scrollable/Clickable Transcript

Okay let's go ahead and get started good morning everyone or good afternoon on the east coast.
Today we've got a great webinar from 9am to 1pm so it's going to be a long one on the AMD EPYC advanced user training on Expanse before we get started, I want to bring everyone's attention to the XSEDE code of conduct.
We are all generally well behaved, but if you observe any conduct that you think is inappropriate, please let us know via the code of conduct page at xsede.org.
And we also want to emphasize that XSEDE is committed to providing training events that foster inclusion and show respect for everyone. If you see any terminology that you think is inappropriate, please let us know by sending an email to terminology@xsede.org.
Here's the agenda, and I will hand it off to Mary Thomas, who I think is going to do introductions.
All right, Jeff, if you just leave that
up, because I was only sharing the agenda, but...
That's okay, I'll just do it.
Okay, great. Thank you. So good morning everybody, and thank you for showing up for this first EPYC advanced user training event that we've had. We have a full morning planned for you, and our goal is that at the end of it,
you have some skills and
tools for doing some of the more complicated
work that needs to be done as you port your code to the EPYC processor. So without further ado,
I'll go ahead and turn control over to Mark. For each of the talks we do have a Q&A. Let's go ahead and make Mark a co-host so he can start sharing. Mark is an engineer with AMD, so he'll give us a good in-depth overview of the EPYC processors. Without much further ado, Mark, welcome.
You're maybe on mute, because I can't hear you.
Still can't hear you, Mark.
How's that?
Yes, I can hear you now.
All right.
And your presentation looks good.
Okay, thank you. I appreciate the opportunity to
share about the EPYC processors, and hopefully in this short little presentation you'll walk away understanding the architecture and how that affects your programming.
Here's the agenda.
So just at a high level: there are traditional monolithic CPUs, and then EPYC took a different path; almost 10 years ago we started designing it.
The key takeaways here are that you no longer really have a monolithic cache, and also, because there are so many cores now, you're going to find that some resources like I/O and memory
can be closer to some cores than others, and we've made ways to take advantage of that. That's just something to always keep in mind; it's different from what you're used to
in the past.
Here are a few of the key aspects of Rome.
I'm sure you've all probably read about it, but it has these chiplets, which are CCDs, compute core dies, down there.
A bit of trivia: these are the same chiplets used in Ryzen and Threadripper; we just sort them for different market segments. Reliability is key when it comes to EPYC processors.
And then we have an I/O die.
One of the main benefits is that the I/O die, having so many pins, is in the larger 14-nanometer geometry, whereas the chiplets, being cores and cache, really benefit from being in 7 nanometer. We're able to mix and match the technology for performance and manufacturing,
which we do in Ryzen as well.
Every EPYC processor that we've put out to date has been in the same SP3 socket, characterized by 128 PCIe lanes and eight memory channels that can run up to 3200 MT/s.
Just something to keep in mind.
Another key takeaway is what we call a core complex, a CCX, over here in the lower left. That's key because a CCX is a number of cores that share the same L3 cache. It's easy to see from the picture that
the caches are not going to be contiguous between CCDs, and they're not even contiguous inside the CCD on Rome.
Each CCD has two of these core complexes, the CCXs, where four cores (eight threads) share 16 megabytes of L3 cache, and access to that cache is pretty consistent:
a couple of nanoseconds difference between locations there.
There's not much else here, but you'll also see from this picture,
and that picture there in the middle is very close to what an actual delidded part looks like,
that there are two CCDs in each of the four quadrants. That's really the way it works out from a topology point of view when you start pinning the different workloads to the different cores.
So here's more of a block diagram, and you can see the four quadrants of that I/O die, the big gray block in the middle.
They have very symmetrical capabilities: you can hook up to two CCDs in each quadrant, each quadrant has 32 PCIe lanes, and each quadrant has two
memory channels. Now, by default all the memory channels are interleaved, so for the CPU socket there's one NUMA node.
But you can set it to break up into either two NUMA nodes, the left half and the right half each with four interleaved memory channels, or, for the best memory performance,
you can set it up to have four NUMA nodes. That's a BIOS setting, so I'm not sure exactly what you guys have it set to, but by putting it
into four NUMA nodes, that's called NPS4, for NUMA nodes per socket, a couple of CCDs are going to be closer to
a pair of memory channels, and I/O lanes as well. So if you think about the drivers that service various I/O devices, it's good to pin those drivers and support code close, on a core that's within that quadrant, to get the best latency and performance.
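The quadrant layout just described can be inspected and exploited from user space. Below is a minimal job-script sketch, assuming a Rome node configured for NPS4 with `numactl` installed; `./my_app` is a placeholder binary, not anything from the talk:

```shell
#!/bin/bash
# Hypothetical fragment. Show the NUMA layout the BIOS exposes
# (NPS4 means 4 nodes per socket, each with 2 memory channels):
numactl --hardware

# Run a latency-sensitive task on quadrant 1's cores, allocating
# only from that quadrant's node-local memory:
numactl --cpunodebind=1 --membind=1 ./my_app
```

The same `--cpunodebind`/`--membind` pair is how you would keep an I/O-heavy helper process in the quadrant that owns its device.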
How we make the parts is we populate two to eight chiplets, giving you different
product SKUs.
Now, you guys have a 64-core part, which means eight CCDs, and I show this to you not because you're going to go buy different
parts, but because you can model any of these 48-core, 32-core, or 16-core devices simply by not pinning code to those cores.
Here's what your device looks like; all the cores are active within each CCX and CCD.
But if you were to pin your workloads only to every other core, the unused cores, assuming you have power management turned on, quickly power down,
and you would end up with what looks like a 32-core device. That's worked really well for machine learning
and different types of workloads that really reward a high cache-to-core ratio; you can see here you go from four megabytes of L3 per core on average up to eight.
And you can see significant improvements by doing that. Also, if you have a high
memory bandwidth workload, where you just know you're being limited by memory bandwidth, then obviously dropping down
to every other core not only gets you that high cache ratio, but these cores now have twice the memory bandwidth per core. Just something else you might find helpful.
That's why we make the parts that way, and how you can model any of those eight-CCD parts over here on the left: simply don't use the cores, let them go to sleep, and that extra power and cache will be distributed as needed within the system.
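One way to do the core-thinning just described is sketched below as a batch-script fragment; it assumes a 64-core Rome socket with SMT off, and `./my_app` is a placeholder:

```shell
#!/bin/bash
# Emulate a 32-core part by running on every other core (0,2,...,62).
# The idle siblings can enter CC6, roughly doubling L3 and memory
# bandwidth per active core. taskset's start-end:stride syntax
# selects the core list.
export OMP_NUM_THREADS=32
taskset -c 0-63:2 ./my_app
```

The same trick with a stride of 4 would mimic a 16-core, high-cache-per-core SKU.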
So here's the
product stack at a glance there.
Now, NUMA: I've talked a lot about that already, but I'll just reinforce it.
Again, by default there's a single NUMA node per socket, and depending on the BIOS settings you can change that to two NUMA nodes, that's the blue line down the middle,
or four NUMA nodes, where it's divided up by quadrant. For most HPC workloads we find NPS4 to be very useful. Obviously, if you have a large set of data that all the cores need access to,
leaving it at a single NUMA node per socket works best. Or if you have
legacy code and you just want to run it the way you used to run it on older devices, it's probably helpful just to leave it at a single NUMA node per socket.
Let me talk a little bit about power management.
You may wonder why I bring it up while talking about the architecture, but it is important; it's a big difference from what you're used to.
The power management inside EPYC is a lot like what you'd find on your phone, notebook, or modern desktop processor:
it distributes power where it's needed, on demand. So like I mentioned before, if you don't pin one of your workloads to a certain core, it will quickly
go to sleep; it can even power off, if you leave that enabled.
I know there's a lot of habit from other processors where you turn off boosting at times, thinking that it gives you better latency.
You can certainly turn off
allowing CC6, which actually powers off the cores that are not used; you can disable that within the OS, and that will save you some latency, especially on network types of threads.
But doing anything more than that buys very little.
It's like 10 nanoseconds, or maybe a little more, to come out of C2 versus coming out of C1; that's very different from
what you've seen in the past. And if you've disabled CC6 within the OS, there's no way, if you let a core go to C1,
that it'll ever drop down to that deeper state, what we call CC6.
I have this graph over here; this is the 7742, which I think is the processor that you have. Legal won't let me show you the code we ran, but it's not an HPC-type code. There is math involved, square roots, so it's not an
integer-type workload. It starts off at 3.4 GHz up there, and as you add cores, in this particular workload, it drops down to 3.2 GHz.
So you just can't judge the base frequency and boost behavior by looking at speeds and feeds; we have a very simple spec of a base frequency and a simple all-core boost max.
If we were to run that same workload, but for whatever reason we had a thread going that was banging away on some PCIe Gen4 I/O device,
well, that I/O power would have to be accounted for, and you would quickly start dropping down 50 to hundreds of megahertz, just because that I/O power is shared with the rest of the CPU; it is an SoC.
Just something to keep in mind: if you're wondering why you're not seeing the frequency you expected, you may look around and see what else is drawing power within the system.
But we do encourage you to manage the power not from the BIOS so much as from the OS, and let the chip
play it smart, just like you do on your phone and your notebook. To a large degree it's very good at maximizing the performance
for each of the different workloads.
Just one more comment on that.
If you disable CC6 because you need the latency, say you're concerned about that for networking or whatever, it will take away a little bit of power from the overall budget, limiting how much the cores can boost.
So if at all possible, just let the OS and chip
handle power.
I have a few
comments on best practices.
It is an x86 64-bit processor, so most of the instructions you're used to just work.
AVX-512 is not supported.
There are a few other odds and ends that aren't supported, but they're pretty obscure.
If you're used to running on fewer cores,
this will be something to think about: you just have to think about how many cores you want to run.
You will have a lot more cache per core, but as you go above 32 cores, if you're using all 64 cores, you may even have less memory bandwidth per core than what you used to have on your 28- or 20-core device that had six channels. So if you're wondering, I've got double the
cores here, why am I not getting double the performance, something to think about is that you are limited to eight memory channels. They're fast and very efficient compared to what you're used to, but it's still eight memory channels for 64 cores; something to think about.
Probably the takeaway from this slide is: when you're doing your makes, go ahead and use all the cores
on the system; that will make the build go a lot faster. The other thing is, you can do what we call hybrid programming,
where you have one MPI rank for each L3 and then assign that many threads for the cores you have within that L3. By default that will be four cores, unless you have
SMT, what Intel calls hyper-threading, turned on, in which case it could be eight threads.
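The hybrid layout just described, one MPI rank per L3 with one thread per core in that CCX, might look like the following sketch. The mapping flags shown are Open MPI's; other MPI implementations spell this differently, and `./my_app` is a placeholder:

```shell
#!/bin/bash
# 64-core Rome socket, SMT off: 16 CCXs with 4 cores each.
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# One rank per L3 cache domain, 4 processing elements (cores) per rank:
mpirun -np 16 --map-by ppr:1:l3cache:pe=4 ./my_app
```

Mapping by `l3cache` keeps each rank's threads sharing one 16 MB L3, which is exactly the locality the speaker is recommending.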
And just a comment on SMT: most HPC users
disable it, because one, you're going to start being limited by memory bandwidth per core anyway, and two, if you're doing a lot of 64-bit double-precision floating point,
you're sharing a resource between two hardware threads when it can be used very effectively by a single core.
Here are some resources; these are all
open to anybody who can access the Internet.
A lot of good stuff there. If there's something you don't see, my email is at the beginning of these slides; just ping me and I'll see if there's anything I can send you that would help with
whatever problem you're having finding documentation.
I have the next few slides in here for the compiler and libraries, but I know we're going to have some extended sessions later on
that will go into a lot of detail here.
AMD does have an optimizing compiler and libraries; we just released a 3.0 version of those.
I would say we're at the point where we're similar:
I think the C/C++ compiler is very competitive with what Intel offers.
For Fortran, and for the libraries, the ones we have are fairly competitive, in some cases better; I think Intel's set is a lot broader, and certainly it has AVX-512.
But our compiler is based on LLVM,
and it's free; anybody can access it, along with the libraries and profiling tools.
Another little note I was going to point out here is SPEC CPU 2017.
It's an open, industry-standard benchmark; Intel has chaired it for years.
But what I want to point out is that they have a floating-point set of tests and an integer set of tests; you can see here what the tests are.
And the good thing is, if you go down, I picked up a 7742 submission from Dell; you can go here and click on this link and it will take you to those results. They're very detailed, and they will show you the compiler flags used with the AOCC
compiler to get the best scores.
It's a good jumpstart for these types of workloads, just to see how we got our best scores. There'll be some base flags that are applied to all the workloads,
and what you might be looking for is more the peak flags, how the individual workloads were tuned by AMD. There's a floating-point set of tests and an integer set of tests; it's a good jumpstart if you're looking at what flags to try.
One final note on things to watch out for.
Almost all the customers I deal with use Intel tools. They're fantastic, and they work on AMD processors, but there are a few pitfalls. One: don't use the -x flags; they go out and
check whether you're on an Intel processor or not, and they're just not going to be happy with you.
There are workarounds for that. For pre-2020 MKL, it was good to set an environment variable,
there on the screen, that made sure it used the AVX2 instruction capabilities. The newer 2020 MKL, I don't know if it was in the initial release or maybe the first update,
does pretty well. They've gotten rid of that debug flag, but they no longer
penalize us as much, so it still works fairly well. Of course, GCC and others have
architecture flags built in for AMD processors.
A little warning: for some of the ISV codes like MATLAB, you have to make sure, if you're using the older MKL, that the debug flag is set; otherwise you'll get
fairly slow results.
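The environment variable being referred to for pre-2020 MKL is, to my knowledge, the one below. It was an undocumented debug switch that Intel removed in MKL 2020, so treat this as a sketch for older MKL versions only:

```shell
# Pre-2020 Intel MKL: force the AVX2 code path instead of the slow
# generic path MKL otherwise selects on non-Intel CPUs.
export MKL_DEBUG_CPU_TYPE=5
```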
And I think that's all the slides I have. I talked through it way too quickly, so I guess we can have the questions now.
So Mark, this is Mahidhar. Maybe I can
read off a few questions from the chat.
Let's see.
The first one: what's the main difference between the EPYC line of processors and the Ryzen 3000 series?
Oh man, I don't know... I think the difference is the I/O die, right?
I have one on my computer.
You know, they use the same chiplets, and like I said, we have a way: we do these electrical tests and we look at the characteristics.
The reason the Ryzen parts have higher boost frequencies is that they don't have quite the stringent reliability requirements, meaning how many years it will last under what temperature. So: same chiplets, some functionality is the same, but a different I/O die, meaning
within that I/O die we have a data fabric, and they're able to run their data fabric faster than what we run ours at on EPYC, and they have fewer memory channels.
Does that answer it?
I think that was from William, right?
You can unmute.
Yes? Yes, okay, great.
The next one was:
how does the design we saw today change or apply to the 7003 series? I guess that's the next
one, the one that just got released.
I've heard of that.
It's pretty good.
Yeah.
So,
I mean, there are so many little things that change that add up to big changes.
But the one that would affect you guys the most is that a CCD now has a single CCX, where all eight cores share a 32-megabyte L3 cache.
And the data fabric runs a little faster. In general, there are times, when we do the cost analysis and look at the performance, that the Rome chip is a better value and in some cases performs well enough.
But those are the differences: you have a different core, and the cores now share one L3 per CCD.
It's not like the jump from our first generation, Naples, to Rome, where it was twice the cores; that was a huge difference.
This one is a
smaller step, more in the middle.
There was a question about how an end user could control pinning in the new layout, and I think we're going to cover that in my talk.
Okay, yes.
We have some binding scripts that make it easy.
Or at least easier; it's complicated, but...
Then the one thing I wanted to note is that we do run with NPS4
on Expanse, yeah.
Good choice, which makes that pinning even more important.
So there was a question about whether end users can enable or disable CC6; on lots of systems the system administrators handle that, right?
Yeah, you certainly can do it from the OS, but
I think you need some sort of root privileges.
Yeah, I
don't know.
But can it be done without a reboot?
Yes. Okay, yes.
Like I said, unless you're really worried about latency, like network latency,
I would leave it enabled. You do take a little bit of a hit coming out of
power-down; the core has to power back up and reload its caches and so on.
Mark, that's a question I can take back and
figure out if it's even an option in a production setting,
actively enabling and disabling it.
There was a question on how to find out if the OS supports the multiple-L3 topology,
meaning scheduling with awareness of the L3s as well as NUMA.
Martin, you can unmute and ask.
Yeah, basically I'm just concerned whether the OSes we're running, CentOS of a certain version, have the support for the multiple L3s, as you mentioned, on the EPYC chips.
Whether you need a sufficiently new kernel to support this.
Yeah, I mean, if you do the updates, the default 7.3 kernel certainly is aware of the L3, but I do not know what setting to set to
make sure that it takes advantage of it. I always just do the pinning: whenever you start your workload, you pin the threads, and then you have control, saying, I want to share these threads within this L3,
so that the OS doesn't allow them to jump around. It's okay if a thread jumps around, but you could lose like 10% performance:
if you had a thread running and the OS decides it's not being active, so the next time it comes up it moves it over to another quadrant,
you can see where that could be
Not a good thing. So almost all the HPC
types that I work with just control it through MPI and OpenMP,
with
whatever you use,
wrapper scripts, that way.
Right, right; that's the way to do it, and then you don't have to worry about it.
I mean, we did see issues with MPI and OpenMP codes where we ended up with two
threads on the same core, and it really slowed things down.
Oh, and we had to go and
bind them after the fact.
I know; that's what we do when we do our
SPEC runs, too.
There was a question about precompiled x86-64 code: can you just run it?
I guess it should be okay, as long as it doesn't have any AVX-512 code paths or some of those,
and as long as it wasn't built with the Intel compiler with a specific architecture flag.
Yeah, and a lot of people do that. A lot of people do not specify an architecture flag, but they have precompiled code and they use it
all the time, because a lot of times we're running on clusters that had everything from Broadwell to Skylake, so they didn't make it specific to an architecture, and it ran fine on this.
You know.
Does using fewer CPU cores mean getting a higher CPU boost?
Yeah, exactly.
Yeah, exactly. But if, say, you're going to run eight threads, spread them out,
assuming they don't need to communicate or share data between threads. If instead of spreading them out you put them all on one CCD: we monitor temperature, current, and power
at hundreds of places across 30-something billion transistors, so if you localize all that work being done,
then even though all the other cores are powered off and nothing's happening on those memory channels, you may hit a thermal limit or even a current-density limit, and you'll start throttling. So spread them out, if that works.
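Spreading threads out as suggested can be expressed with OpenMP's placement variables. A sketch for eight independent threads on a 64-core Rome socket, one per CCD; `./my_app` is a placeholder:

```shell
#!/bin/bash
# Place 8 threads on the first core of each CCD (cores 0,8,...,56)
# rather than packing one CCD, spreading the thermal and
# current-density load across the package.
export OMP_NUM_THREADS=8
export OMP_PLACES="{0},{8},{16},{24},{32},{40},{48},{56}"
export OMP_PROC_BIND=spread
./my_app
```

If the threads share data heavily, the opposite choice (packing one CCX to share an L3) may win; measure both.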
I think there was a related question: when not running on all cores to improve
memory bandwidth per core and other things, does the power consumption decrease, or does the boost just bump up until it actually hits the envelope?
Could you say that one more time?
So when you're not using all the cores,
because you want to
improve the memory bandwidth per core or some other aspect, does the power consumption decrease,
or is the boost not large enough
to fill up the full envelope?
Right, you're going to hit that 3.4 GHz cap;
I think yours are limited at 3.4.
Yeah, it definitely will redistribute the power, but there are these local limits,
current maximums, within the chip.
Yeah. And...
I have found that when you're running 32 cores, say every other core on a 64-core device, the bottleneck is out at the collective memory controllers. When you go to all 64 cores, or 128 threads if you have SMT enabled,
the bottleneck is up here on that
Infinity Fabric interface between the CCDs; that gives you an idea of what's going on there. Also, I don't know if you guys are running your DDR at 3200,
or if you have two DIMMs per channel and you're down-clocking to 2933 because of that loading.
I think we're at 3200, but I can check that.
But that's another thing: when running at 2933, the data fabric runs at 1467, half of 2933; the data fabric tops out at that on Rome.
And so it's synchronous with the memory at 2933, so latency is actually better at 2933 versus 3200. So if you wanted to save power and have better latency, you can down-clock your memory,
save some power, and get a better boost.
All the little tricks. That's another difference with Milan: its data fabric runs at 1600 MHz.
Okay, yeah, lots of different things.
The last question: how does a dual-CPU system share one PCIe-slot peripheral, like a NIC, effectively?
Oh yeah.
Really good question.
The good thing is that when you're in a dual-socket system,
half the PCIe lanes become the interface over to the second chip, so it's running much faster than what you're used to, and there are 64 lanes on it, so it's pretty fast. But you do have to cross over and go out that one device whenever you're doing a cluster.
Some OEMs, and I don't know which system you guys have, take that Gen4 device and use the Open Compute 3.0 standard. Who's the OEM for your system?
Dell? Okay, so Dell has this; I don't know if they put it in your system, but it's really clever. They have that Open Compute 3.0 slot on the 7525, and they put that module in there. It's Gen4, so they bifurcate it and run a x8 to each socket, and since
it's Gen4, you get 100 gigabits
on each of those x8 links, and therefore it is local to that socket.
Well, that's nice. Yeah, we don't...
We don't have that, because our connection is HDR100.
Yeah, yeah.
Exactly: you can have it where it's a x8 on the output, or two 100-gigabit ports coming up; it's really clever.
But even if you haven't come across that, it's a lot faster than it used to be.
But still, you're right, it has to come over that one link.
That was it. If I missed any questions, please
either open your mic and ask, or raise your hand.
Okay, there's one more at the end: how much benefit is seen when using SMT,
with, say, a primarily integer workload,
or something like DGEMM?
I know on
DGEMM, we run it; we have 64 cores with SMT off and you get a score, and if we turn on SMT we may get 10% more.
It uses more power, of course, and all that. But again, it's really good for databases and virtual machines that are not memory-limited; on the DGEMM-type things you may see a 10% uplift for the node. It could be a little higher, but probably not much.
Another question: any tips about maximizing performance
with the GCC compiler for OpenMP Fortran code?
You know, I'm afraid I don't have that experience.
Obviously, use the architecture flags.
Yeah, I'll...
I'll take a note and see if there's something on our website
with guidance on GCC and Fortran.
The next question was basically
a follow-up on the DGEMM threads: use one compute thread per core, and then additionally have a non-compute progress thread on one of the SMT siblings; it's something like that.
Yeah, like this.
I don't quite understand your question.
I guess I can explain
the idea: we basically have primarily floating-point computations, and that would be done on all the cores. Additionally, we have a communication progress thread that would be primarily integer,
shared on one of the cores via SMT.
How much benefit is seen by that, versus
having it just be time-shared; basically, how much benefit you can get, and what the relatively optimal configuration is.
So when you turn on SMT,
the caches are all dynamically shared, except for the L1 instruction cache, so by turning on SMT all the cores are immediately going to have only half the instruction cache; everybody takes that penalty.
And so, if you're not using those other hardware threads,
that may be
a bigger penalty than what you wanted. But otherwise, yeah, it's a clever
idea; it might work for you.
All right. Hey Mark, thank you. I'm going to step in; we're now ready to transition to our next speaker
at 10:45, Marty Kandes, who is going to talk about the AOCC compiler. So let's go ahead and let him screen share and
get him set up. Marty, are you with us, and can you hear me?
Yes, can you hear me?
I can hear you great.
And I see your screen.
Anyway, thank you very much, Mark; that was excellent. Some great questions, folks, and you can keep asking them; we'll try to answer them as we go.
And then...
There we go, Marty, I see your talk; it looks good.
Okay, great.
So thanks, everybody, for joining us today for this AMD advanced user training.
My topic today is to give you an introduction to the AMD
compilers and math libraries that you might want to try and use with your codes on Expanse, or any other AMD EPYC system you might be running on; they're becoming a bit more popular.
Just to preface: this is sort of my first foray into actually using the AMD compilers and math libraries myself, so hopefully it's not going to be too advanced for most of you.
It's really just my attempt over the last week and a half to figure out how to start using the compilers and
some of the libraries, and to look at some of the performance characteristics and things to look out for when you start testing your codes and benchmarking them.
For example, if you're thinking about applying for time on Expanse for your project and you have some custom code, you need to submit performance benchmarks to XSEDE to
request time, and whatnot.
So those are some caveats. There's a lot to cover here; there are a lot of different libraries, and I've only been able to touch on a few, so
it's more of a high-level overview, with a few concrete examples from my experience, basically.
For the talk today I'll skip over the code of conduct; I think Jeff covered that in the beginning. So yeah, this is the outline for today. I'll first touch on
using the compilers, and then give you a quick example of a code that we compiled with AOCC; we'll look at the performance, just trying to understand
what's available. One thing I'll say off the bat is that the naming on the website is the AMD Optimizing C/C++ Compiler, but it also includes Fortran.
I think Mark's slides actually had a different name for the compiler, so maybe they're changing the branding.
After I go through the high-level overview of the compiler and some of my initial work there, I'll
give you an overview of all the libraries that are currently available, so if it fits within
your code base to compile against those libraries, you have an idea of what's available.
I'll also give you some explicit examples that I used to do some benchmarks, in particular against the AMD
math library and the AMD BLIS library, and then give you a summary and references; I think a lot of them Mark might have had on
his reference slide as well. Then we'll take a Q&A, and of course, if there are any important questions that come up along the way, hopefully someone on my team stops me and lets me know.
Okay — sorry, this week my Zoom bar is kind of in my way, so I can only see the slides.
So the AMD AOCC compiler is really a set of compilers for C, C++, and Fortran — it looks like Mark's slides were talking about Java as well now, maybe — and these are tuned for the AMD EPYC architecture.
I think this compiler stack started with the first-generation EPYC line of processors, to target this new architecture they've been building up over the last few years, and it's all based on LLVM. If you're familiar with LLVM — I am not — so hopefully there aren't too many LLVM questions, or maybe someone else in the workshop today can answer them in the chat.
It supports C, C++, and Fortran; the Fortran compiler is based off the Flang compiler with some added features, and there's also OpenMP support across all three languages.
It's well tested on the latest operating systems, and for us in particular, in the User Services group here at SDSC, we're building all of the applications that we provide in the standard software module environment with Spack.
We've actually been working a lot with the AMD team, who are optimizing certain applications for the EPYC architecture and building AMD-optimized versions of the Spack packages for those applications — if you have questions on that at some point, it's probably a topic we'll talk about in the future.
Just a few things: if you're not familiar with the LLVM toolchain, I would definitely dig into it at some point if you're really going to get into using the AMD compilers, since it's all based off of LLVM.
I know very little about it beyond the basics — the different pieces that fit together, from the front end, through the intermediate representation that things get translated into and optimized, to your final binary.
But essentially, even if you're not familiar with LLVM — as I wasn't when I got started in the last week and a half or so — the compilers that come with AOCC can just be used as-is.
You can do much more advanced things with LLVM, like optimizing at the intermediate-representation level, stopping the compilation and tweaking things; there are a lot of advanced things I know you can do, but I have not done them myself, so I will not be teaching that.
If you don't know a lot about compilers, LLVM is probably a good thing to learn: it will give you some idea of what's actually going on in most compilers.
The C/C++ support is through the Clang project — AMD's AOCC is essentially a custom derivative of Clang — which supports x86 and Arm architectures and the various standard versions of C and C++ that Clang now supports.
There are probably a ton of Clang-specific compiler options and features that also work with AOCC, which you can find in the Clang documentation; it's quite extensive as well.
On the Fortran side — I'm a Fortran programmer by training myself, so a lot of the examples here are Fortran, though I have a C example for you as well.
One thing I wanted to say about Flang, which I was still learning about last night, is that there are actually a lot of different variants of it. The version of Flang that AMD has forked and optimized — and maybe even contributes back to — is "Classic Flang."
So if you're a Fortran programmer and you get a little confused about which version of Flang the AOCC version is based off of, this is the language on the Classic Flang GitHub page.
I think that's an important note; it looks like there's a lot of history here. And as you can see from the note at the end, they're planning on replacing Classic Flang in the future.
In my opinion, from my experience so far, there still seems to be a bit of work needed in some of the Flang Fortran support, but I'll touch on that once we get into it.
I have these references at the end, but there are quick guides for the different compiler options you can use with AOCC. A lot of them are very similar to options you see with other compilers — GCC and gfortran, the Intel compilers.
So if you have a makefile or some sort of build process, you can pretty much take a quick look at these guides, recompile with AOCC — whether it's Clang or Flang — and get running pretty quickly.
That was basically my approach: can I take a code that I have now and have used for years with, say, gfortran, read the basic documentation, change the compiler options where needed, and see what happens?
So that's the example I want to walk through to start. Let me just check the time here.
This is a code that I wrote probably 10 years ago, at the beginning of my PhD. It's a simple finite-difference simulation code that solves a somewhat more complicated version of the Schrödinger equation, to simulate quantum mechanical particles in a one-dimensional ring topology, and it's all written in Fortran — probably all Fortran 90.
These are the reasons I chose it: it's pretty simple Fortran (I wasn't as advanced a Fortran programmer then as I maybe am now), and it's a serial, single-core code — no complications from OpenMP or MPI, just straightforward serial code, since there isn't a whole lot of heavy-duty work needed to carry out the simulations it was intended for.
However, there are a few features that I thought would test the compiler stack: I wanted to see what the performance was, what features are available, and maybe which ones have or haven't been optimized.
A few things to note: this is a quantum mechanical simulation payload, so the large memory component of the code is all double-precision complex arrays.
It makes a lot of use of elemental and other whole-array operations, which are available in Fortran, plus a few intrinsic Fortran functions on complex arrays — taking the complex conjugate of an array, summing over an array.
And one of the core things the code does is call a LAPACK routine that's at the heart of the time-stepping process for the simulation.
Originally my goal was to get to the point where I was compiling this against the AMD LAPACK, essentially, that they provide; I did not get there, but I got some basic results just running the code compiled with different compilers.
I set up the makefile to use the different compilers: gfortran, which is all I've used in the past for this simulation code; the latest ifort we have on Expanse; and the latest version of AOCC Flang, 2.2, that we have on Expanse.
These are the basic settings I've always used for gfortran, and the translation of those options for both the Intel Fortran compiler and the Flang compiler. It's pretty basic — most people who've compiled Fortran code are familiar with many of these options — so it's a one-to-one comparison for the most part, though there are a few things you might notice that don't carry over exactly.
Looking at the results, you can see that, interestingly, gfortran 8.3.1 was the fastest of the gfortran builds, which is a little strange, because gfortran 10.2 is the one that actually has the Zen optimizations present — so I'm not exactly sure what's going on there.
I did see a note in the quick guide for the compiler options that, for whatever reason, the recommended GCC version was 9.3; that guide was last updated, I think, in December. So I'm not exactly sure what's going on there, but these are the results.
You can see that using the Intel compiler on Expanse's EPYCs is significantly slower than the gfortran-compiled version, and it slows down even further with AOCC Flang.
As one check, I ran the Intel-compiled version on our GPU nodes, which actually have Intel processors on Expanse, and the performance comes back down into a range closer to the gfortran-compiled version on the EPYCs.
So what do I think is going on here? There are a few possibilities. Obviously, if you've read any of the details, there's a little bit of chicanery that goes on when compilers compile for architectures other than their own vendor's, so I could be running into those issues here.
The other thing that may be an issue with AOCC Flang — my only hypothesis — is that this is all double-precision complex arrays, and maybe the double-precision complex support just isn't really there yet. That's the only thing I can think of that would cause such a slowdown; I could be completely wrong, since I'm not familiar enough with all the compiler options, but that's my only hypothesis right now.
One thing I did notice, which I think Mark talked about: this is a single-core code, and I could definitely see the L3 cache contention.
We have shared nodes on Expanse, where multiple users run codes on the same 128-core nodes — in theory you could have 128 different users each running a single-core code on the same node.
There's a very noticeable difference between a node loaded up with many tasks and processes running versus one that's pretty quiet, so you will see that kind of cache-contention performance hit, depending on the random draw of how busy the node you land on happens to be.
Another thing I noticed — I don't know if it really had any effect — is that in the output and build warnings I got when compiling this code, there were a few things I didn't quite understand.
For example, at the bottom here, there's this warning that certain arguments weren't used during compilation. I don't know why it's warning — maybe these options are specific to the Flang driver, and at some point Clang gets involved and warns about them — because those compiler options were in the documentation.
Everything still seemed to work: I checked energy conservation and mass conservation in the output, so everything was working scientifically, but you might get some spurious warnings that don't make sense to you once you get started.
And there was something else I wanted to say on this point —
Marty — we have a couple of questions; Marty wants to chime in.
I think there were a couple of suggestions in the chat: -mtune does set the instruction scheduling, but you might actually need -march to use the instructions.
Yes, I did check this. -march or -mtune is supposed to set, I think, the minimum necessary to run locally on the architecture you're compiling on, and you can actually check interactively what architecture it's using when you set -mtune on a specific machine.
I checked that, and it was using znver2, so I don't know — I can definitely go back and try it explicitly with -march, but I did check this and I don't expect it to change the results.
The other suggestion — the reason given is basically vectorization: double complex arrays are not going to vectorize well.
Yeah — one aside here about LAPACK: I'm not compiling against any of the AMD libraries. This is a reference LAPACK routine that I've used for ages; it's distributed with the code.
So this is not using any optimized version of the AMD libraries, or any of the other optimized LAPACK libraries we have on Expanse. Like I said, this is very simple code — I took one LAPACK routine and distributed it with the code.
So yes — optimizing that is one of the goals of this exercise: if I get to linking against a more optimized LAPACK, it may really speed up the code.
Any other questions?
There are more questions in the chat. Okay — should we address them?
I think, for the most part, there was a suggestion that you could do -march=native — yes, that would work. On the PGI question: no, I don't think we've tried it, but we do have the NVIDIA HPC compilers on the GPU side of Expanse, so we can probably try them on the CPUs and see what the characteristics are.
Yeah, like I said, this was very straightforward: just convert a makefile from what I've always used in the past to AOCC and see what happens.
I will definitely go back and check with -march=native, but like I said, I did verify that -mtune was setting znver2. We'll see.
For the second part of the talk I wanted to back off a bit, since I didn't have a lot of success performance-wise with AOCC just off the bat — there may clearly be an issue there with, as I hypothesized, the complex-math parts of the compiler support.
So I wanted to step back and work through a few concrete, smaller micro-benchmarks to test some of the other math libraries; that's what the second part of the talk is about.
But I do want to start with a high-level overview of what the libraries are and what's available to you, because I have only really tested the math library and BLIS myself.
The AMD math library — LibM, as you'll see it in the documentation — is really just the AMD-optimized version of the libm library available with basically any Linux distribution.
A lot of the time when you compile your code, it's linking — maybe without you being aware of it — against the OS distribution's libm for some of your function calls.
This is the optimized version, so if you want to start with the AMD-optimized libraries and see what performance gains they might give you, this is probably where I would start with your own code.
The next level up, if you're doing any sort of linear algebra, is AMD BLIS, the BLAS library. It provides the same BLAS functionality you'd find in OpenBLAS or MKL, and I'll do an example of using it by the end of the talk.
The next step up the linear algebra stack is libFLAME, which is the AMD-optimized LAPACK. I did not get to using it myself — that was kind of my goal for today, but it just didn't happen — but if you use a lot of LAPACK routines in your code, this is where you're going to want to go.
The one thing that wasn't necessarily clear to me, coming from the Fortran side of the world, is that this is a C-only implementation, and what exactly the compatibility layer they talk about provides in terms of supporting Fortran codes — I assume it does support Fortran through that layer.
And if you do a little bit of C and a little bit of Fortran, you know that the memory layout of arrays — the ordering between rows and columns — is flip-flopped, so I'm curious how they handle that in the compatibility layer without hurting performance.
But if you're using any sort of LAPACK routines, BLIS combined with libFLAME is your full-stack linear algebra library. And of course they now also have an MPI ScaLAPACK optimized for AMD as well.
That covers the dense linear algebra libraries. There's also a sparse linear algebra library, for sparse matrices and vectors, that's also optimized. Again, I haven't used it myself, but if you do a lot of sparse matrix operations, that's where you'll want to look.
Outside of linear algebra, they have an AMD-optimized FFTW library, and I have actually used this myself — not in this talk, but as part of our benchmarking on Expanse. I think we used it for our production Quantum ESPRESSO builds, among a few other things, where it got us the best performance.
So we have used it to support applications deployed on Expanse as part of getting the system ready for production, and we have seen instances where compiling some of these community codes against the AMD FFTW gives the best performance on the AMD hardware — this is definitely one I've seen production results for.
The last set of math libraries available right now: if you do Monte Carlo simulations, they have random number generator libraries — both general pseudo-random number generation and secure, cryptographic random number generation. That's where you'd go to test your code against those libraries; I have not played with them myself.
The next example is the concrete thing you can take away from today — there's code on GitHub that you'll be able to use and play with yourself if you have access to Expanse.
It's a very simple example I found in a reference I'll give at the end of the talk, based on a very old benchmark I'd never heard of before myself: the Savage benchmark.
It's kind of interesting — it was used in a review of the Zen 1 architecture to assess AMD LibM performance, although clearly the original loop of 2,500 iterations needs to be a bit larger today.
If you Google this benchmark, you'll find that over the last 30 to 40 years people have also been tracking how well it does on things like graphing calculators, which is kind of fun. Essentially, what it's doing is assessing the floating-point performance of transcendental functions.
If you look at the inner loop, you can simplify it in your head: it's really just supposed to be a = a + 1. All it does is accumulate a sum, but it makes it a complicated sum by routing it through transcendental functions — mathematically everything cancels out.
It's one of those benchmarks I like, where you know exactly what the answer should be, but once you dig into the details and actually do the experimental testing, it gets interesting.
I included the runtime performance of this benchmark circa 1983 just for reference. Here's what I got today — well, in the last week — running a modern version of this simple benchmark that I wrote, compiled against the standard libm distributed with CentOS 8 on Expanse, with the different compilers, and there are some interesting results.
The number of iterations I ran the loop over was N = 1 million, 10 million, 20 million, 50 million, 100 million, up to a billion — a little bigger than 2,500, but I had to scale up to actually show some differences.
You'll see that all of the compilers do well up to 10 million iterations, but something weird happens with GCC and AOCC Clang at 20 million: the runtime performance explodes, and I don't know why. It looks like a bug of some sort — not in my code, but maybe somewhere in the library something gets wedged in a weird spot; that's my hypothesis.
Compiling the same code with the Intel compiler we have on Expanse, it scales all the way to the highest iteration counts, at a billion, and gets good performance relative to the other compilers.
So the reason I think there's something wrong in the standard libm, somewhere down the chain, is that Intel is probably pulling in most of its own optimized versions of the standard libm functions used in this benchmark, rather than the ones from the CentOS distribution — essentially saying "no, we're going to use our version" — and that's why it keeps scaling. That's my assumption; you'd really have to dig into the generated assembly code to see what's actually going on.
Now, here is exactly the same code, compiled and linked against the AMD LibM library that we have on Expanse. It's a little more complicated to get at, because it lives in the AOCC 2.2 module we have on Expanse, so you might have to explicitly link to it — you have to define a path to it.
Because of how we have the module environment set up, you really do have to point to where it's located on Expanse, at least if you're using a different compiler. You can see here that using the AMD math library saves the day, at least in terms of whatever that blow-up was with GCC and AOCC Clang linked against the stock libm — although the performance is still a little bit faster with Intel.
And this also speaks to the suggestion about -march versus -mtune=native: you can see the performance is essentially the same, at least with each compiler running on the same architecture it compiled for.
The other interesting note, if you noticed, is that at high iteration counts this loop actually runs into what in dynamical systems we'd call a fixed point, which is kind of interesting.
I think this is an issue introduced by the floating-point math; it's not going to change with any compiler.
And the reason you can tell it's due to the 64-bit limit in the floating-point computations is that if you redo this in Fortran using 128-bit math, everything works out fine.
Those are just some other observations I made. The other thing: if you go back to the example, you'll notice that I did not use the fast-math flag.
If you do use the fast-math flag with any of the compilers, what's likely going on is that they symbolically pre-compute — they recognize that the expression can be simplified to, almost, a = a + 1 — so everything comes out much, much nicer in terms of the results.
The reference I'll mention discusses this benchmark, and you can see what they did: they actually ran it at the -O3 optimization level, turning up all optimizations, to assess the different libraries' performance. I think just turning off fast math is a little more fair, since I believe it keeps IEEE 754 compliance in check.
So those are some other observations and assumptions about what's going on with this benchmark.
Moving on, the next two examples I wanted to show involve matrix multiplication: both the intrinsic matrix multiplication in Fortran, and DGEMM performance from a BLAS — the AMD BLIS library.
I've used this benchmark before for teaching students about parallel programming, so I took a bit of code I've used in the past and extended it into this matrix multiplication example.
It's a simple set of codes built around the Hilbert matrix, which — like the Savage benchmark — is a kind of quirky set of computations that highlights numerical issues.
The Hilbert matrix has actually been used in a lot of numerical studies of linear algebra algorithms, probably since the 80s at least from the references I've seen, because it's a very ill-conditioned matrix.
But it's very simple to define and set up, and a lot of results about it can be derived analytically — the eigenvalues, eigenvectors, tons of things — so when you test things numerically, you can see where they break down.
That's why it's useful: it's one of those benchmarks I like, where you can write things down analytically and check that everything happened correctly.
The benchmark I ran here just takes the square of the matrix — multiply the Hilbert matrix by itself, for some dimension that I chose. The check I performed, to make sure everything seemed to be working, is that the first element of the computed result approaches π²/6.
You can use this kind of benchmark to check things pretty quickly whenever you have a little analytical handle on a linear algebra calculation you want to benchmark numerically.
I did everything in this case in Fortran, because I wanted to look at the intrinsic matrix multiplication in Fortran and the support for it.
Here are the results comparing the different compilers. Numerically, the result I was tracking is basically the same across all of them, down to the base precision, but the performance is wildly different.
gfortran did quite well again. AOCC Flang did pretty well but could be improved — so again, it's possible that the Flang support for some of the Fortran intrinsics is not quite there yet; I don't know for sure.
And in the Intel compiler case, it looks like they just don't want to do Fortran well for you on this architecture, for some reason. I did not compare how this runs on an Intel core, but I suspect it probably does better there.
The next benchmark replaces that intrinsic Fortran matrix multiplication with a standard BLAS routine, DGEMM.
The first result here just takes the reference Netlib BLAS/LAPACK DGEMM, compiles it, and runs it, and you can see it does quite poorly — which is not surprising: it's just the reference implementation, not optimized for any architecture with any sort of modern vectorization.
The next set of checks used OpenBLAS, AMD BLIS, and MKL, and you can see all of them do quite well: across the different BLAS libraries there was very little difference, compared to some of the other results we've seen.
The one thing I did want to note is that the OpenBLAS and AMD BLIS numerical precision is quite the same — there may be some relation there — which is one thing that stood out to me. And AMD BLIS is still a bit faster overall for the set of runs that I did.
I don't think there's anything else I wanted to comment on here. That's my initial experience with AOCC, AMD LibM, and AMD BLIS — a very simple start at learning a whole new set of tools that we know are going to be around for a long time and that we'll be using on Expanse for at least the next five years.
So I encourage you: if you're a C/C++ or Fortran programmer and you have your own custom code that might benefit from both the AOCC compiler stack and the libraries, definitely give them a try. And as you can kind of see, I can't say it will be obvious when they'll be performant for your specific code and use case — you really have to make that assessment yourself and do the checks.
For example, I'm still curious about using the AMD libFLAME library to see what the support is for my complex tridiagonal solver in that first code I tried to get started with; I'd be curious to see how that works out.
And that's about it. I have a bunch of references here at the end of the slides — links to the different guides you can look at.
I've included both the AOCC 2.2 guides, because that's what we have on Expanse right now, and the AOCC 3.0 guides, since we'll be moving to 3.0 once we update our module environment on Expanse in the next month or two — they just released AOCC 3.0, I think, last month.
There's also all the material on the developer.amd.com site, where you can check out the individual libraries and the resources they have there. As Mark mentioned, there are tuning guides, and even developer manuals with much lower-level material, if you really want to dig into how the compilers and the architecture work together, beyond the compilers themselves.
One thing I do like about AMD's approach here is that a lot of this is open source on GitHub, so you can actually dig into what they're doing, what developments are being made, and what changes are coming down the line — which is a little different, I think, from some of the other vendors we've worked with in the past.
The last set of references I want to point to is from PRACE, the European partnership for advanced computing.
One of the best comprehensive guides that I came across was the one that they put out last year. It's based on the first-generation (Naples) architecture, but it
really covers everything, and they have lots of benchmarks in there for the different things they tested, including the STREAM benchmark, the GEMM benchmarks, and FFTW benchmarks.
They really did a lot of work testing the architecture to understand
what was going to be different, essentially, and what you need to take account of when you're developing on this new architecture.
And they also have, I noticed at least last year, an updated, more architecture-focused look at the AMD
Rome architecture, so for the Zen 2 architecture they have a nice guide as well. That's where I wanted to end, and I'll take any questions that people have.
Marty, there's a few. I think one was for the AMD folks: is there a plan to make AOCC compatible with NVHPC?
I mean, you do have the source, so you can actually build it yourself.
So yeah, I obviously don't know what their view is, but one thing I did notice is that the Flang derivative that they're using is an open-source version of the PGI Fortran compiler.
So maybe. I don't know.
Mark was going to say something, okay.
That might be changing, because that's the classic Flang, right,
the one derived from the open-source PGI.
Mark, I saw you unmute, so if you have anything to add.
Okay, no worries, I can follow up
on that, and yeah, I'm also going to put up some links to where you can get support and questions like this answered.
There was a question about linking the libraries: is anything needed beyond loading the AOCC module and the library modules?
We'll obviously have to handle the link part, but I think other than that,
the library names haven't been changed; it will be the normal BLAS or LAPACK names, so unless of course it's a different library, you won't need to change the link line.
Yeah, and one thing I noticed is, at
least with the Spack packages that we've deployed, you have to explicitly link against the AMD LibM. It's not going to work to just give the simple -lamdlibm kind of flag.
You have to point to where it's located, and it actually lives inside the AOCC compiler install, so if you're linking against it with any other compiler you have to
point back over to the other compiler,
or the AOCC component.
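As a concrete illustration of that point, here is a hedged sketch of a link line. The install prefix below is a placeholder, not the real Expanse location; `module show` on the relevant module should reveal where AMD LibM actually sits inside the AOCC install.

```shell
# Point explicitly at the AMD LibM that ships with the AOCC install
# (the path below is hypothetical).
AMDLIBM_DIR=/path/to/aocc/amdlibm
gcc -O2 -o mycode mycode.c -L"${AMDLIBM_DIR}/lib" -lamdlibm -lm
```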
Yeah, and also just a comment from me for anyone who's interested: I think we only have a few of the AMD libraries on Expanse right now, but we will likely be putting the rest into production at some point. So we have the AMD LibM, and BLIS.
Do we have libFLAME, maybe?
We do, libFLAME.
And FFTW. I think those are the four that we have right now.
The question is where we can see some examples of linking the libraries, and whether there's anything beyond just the name of the library.
Yeah, so I would definitely point at this PRACE best practice guide for AMD EPYC; it has a lot of linking examples. The user guides for the different packages will have some too.
I have a few in the slides that you can look at.
But yeah.
And then later we can always add examples to our examples directory, if you have any specific things.
Yeah, for example, maybe I'll add some of these as examples for Expanse.
Yeah, it was just a suggestion, just to make it concrete.
I think, Martin, you had a comment saying that in your experience MKL was a little faster than OpenBLAS and BLIS also.
I want to say that's true in a lot of cases, but we've seen a few specific instances where
MKL has been really poor.
And I don't know what the reason was, but we've had a couple of users who've had to completely rework everything to
use the AMD libraries to get good performance. Quantum ESPRESSO is a good example: I think we got much better performance using
the AMD FFTW and OpenBLAS versus
going down the MKL road, for some reason.
Okay, Mahi, I'm going to step in so we stay on time, and point out that we have a good
amount of time at the end for more questions, and I think compiling might be one of them, so let's move on. Mahidhar is going to speak next.
Thank you, Marty, that was really informative.
I think putting some of those links from Marty and Mark on the GitHub repository we'll be posting things on would be helpful. So Mahidhar Tatineni will be talking about
Slurm and runtime configurations.
I see your window.
I know your audio works.
So it looks good; go ahead with the presentation.
Thank you.
I'll get started. What I'm going to go over is some of the Slurm runtime configuration, and the tuning in terms of layout and binding that we've looked at.
A lot of the tools that I'm going to be talking about, the binding and affinity tools, were developed by Manu Shantharam, and he's on the chat, so some questions he will be able to answer better than me, but I can help.
Some of this might be a little bit of a repeat, but I wanted it in for completeness in case someone picks up this presentation separately. So what I'm going to do is
look at
the AMD processor architecture,
the hardware details and NUMA options, and where things
could be an issue in terms of laying out the tasks. Then I'll talk about the Expanse task layout and affinity scripts that we've developed, and some of the things we learned while benchmarking and helping early users. All of these are in flux and being modified
quite a bit, so we appreciate feedback on them. So, from Mark's talk this morning, you know that basically there are
eight core complex dies (CCDs) on each of the
processors; I have the picture down there.
And there are eight memory channels per socket. It is DDR4 at 3.2 gigahertz for us.
And as Mark noted, you can actually abstract the memory and I/O into separate quadrants, and that's the way we've set it up; I'll talk about that.
Then, at the core complex die level, there are two core complexes (CCXs) per CCD, and the four cores in the CCX share the L3 cache for that CCX.
There's a total of 16 of these CCDs on the node, so
you end up,
on each processor, with basically 256 megabytes of L3 cache total.
Each core has a private L2.
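As a quick sanity check on those numbers, here is the topology arithmetic. The 16 MB of L3 per CCX is my assumption; it is consistent with the 256 MB per-processor total quoted in the talk.

```shell
# 2 sockets x 8 CCDs x 2 CCXs x 4 cores; 16 MB L3 slice per CCX (assumed)
cores_per_node=$(( 2 * 8 * 2 * 4 ))
ccxs_per_node=$(( 2 * 8 * 2 ))
l3_per_socket_mb=$(( 8 * 2 * 16 ))
echo "$cores_per_node cores, $ccxs_per_node CCXs, $l3_per_socket_mb MB L3 per socket"
# -> 128 cores, 32 CCXs, 256 MB L3 per socket
```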
Now, there are different ways of booting this up, and the way we've set it up is NPS4, which is four NUMA domains per socket, so the four quadrants
are in different domains. This is a BIOS option, so if you need to change it, it needs a complete reboot, essentially.
I have one NUMA domain circled there. Basically, the memory is interleaved across two memory channels, and PCIe devices will be local to one of these NUMA domains; so, for example, one of those could be the InfiniBand card.
On Expanse I'd have to go back and check, but it's not on the first NUMA domain; it's one of the others.
So this is the typical HPC configuration to have: NPS4,
with the NUMA domains. What this does mean is that your applications,
when you're running your applications, have to be NUMA-aware, and the ranks and memory
should be pinned to the cores and NUMA domains,
as much as possible.
In some cases a code can otherwise take quite a big performance hit, like the HPL code, for example.
So that's
kind of the quick overview of the architecture. Let's look at the
Expanse task layout and affinity scripts.
Before I get into that: you can obviously do this with MPI options; all MPI implementations have affinity options.
You can, for example, with Open MPI use the --map-by option and say map by L3 cache. So, for example, with this command I'm running 32 tasks and mapping them by L3 cache. This was an HPL run, and a node has 32
CCXs, so you can basically
put an MPI task on each one of them and then have four threads sharing the same L3 cache.
Well, that's the effect of that mapping.
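A sketch of that Open MPI invocation might look like the following. The exact --map-by spelling varies a little between Open MPI versions, and `xhpl` stands in for the HPL binary; treat both as approximations, not the presenter's actual command.

```shell
# One rank per L3 cache (CCX), 4 cores per rank: 32 ranks x 4 threads = 128
export OMP_NUM_THREADS=4
mpirun -np 32 --map-by ppr:1:l3cache:pe=4 --bind-to core ./xhpl
```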
Similarly, with Intel MPI you can use the pin-domain options. And again, there are lots of these options; I'm just giving you an example here.
So you can do, for example, a compact pinning with the I_MPI_PIN_DOMAIN option, and you basically get the same layout if you use 32 MPI tasks with
four threads.
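The Intel MPI equivalent might look like this. It is a sketch; there are many pinning knobs, and this is just one combination consistent with what was described.

```shell
# Give each rank a domain sized by OMP_NUM_THREADS and pack threads in it
export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact
mpirun -np 32 ./xhpl
```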
The other thing is that a lot of applications will have their own pinning options. For example, for a NAMD run,
this is a two-node example that I'm showing here, where basically I have four MPI tasks per node.
And essentially I'm doing the affinity-based options on the NAMD line, so you can see it's doing +ppn 31. If you're wondering why it's not 32, we have one core set aside for communication there. You can specifically use particular cores with the +pemap option, and
that works very well. So obviously there are different ways of doing this just using MPI options and combining them with the application pinning options.
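For reference, a NAMD line along those lines might look like this. The +pemap/+commap ranges and the input file name are illustrative, not taken from the actual job script.

```shell
# 31 worker threads per rank; cores 0/32/64/96 reserved for communication
ibrun namd2 +ppn 31 +setcpuaffinity \
      +pemap 1-31,33-63,65-95,97-127 +commap 0,32,64,96 \
      stmv.namd > stmv.log
```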
But we found that a lot of times it might be more useful to
basically use
Expanse's affinity wrappers, so that we can lay out the tasks exactly how we want and also bind them,
using OpenMP flags and MPI
binding, because it will be calling srun underneath.
But what you see as an end user is just ibrun and affinity. The basic usage is: you just use ibrun with the executable and whatever executable options, and ibrun will figure out, from your Slurm request,
what you're actually trying to do, and
try to do the best job with it.
But you can also use affinity options.
There's another script called affinity that lets you
set hints on how you want to lay out the tasks, and then the script takes care of finding all the core IDs that it needs to use and applies the appropriate binding options under the covers.
You could be doing this yourself by writing a script that finds out all these things and then
passing the affinity flags and all sorts of things by hand to get the right layout; we've just put all that into a nice package.
So, the various affinity options. We can do a scatter, which basically scatters the ranks; there's a compact option too. The scatter option will scatter ranks across all NUMA domains in a cyclic manner.
You could do a scatter at the CCD level,
which will scatter the MPI ranks across all the CCD domains in a cyclic manner.
You could also do a scatter at the CCX level, which lets you place the ranks across CCX domains, again in a cyclic manner. So, in the HPL example that I was showing with Open MPI before, you could give the scatter-CCX option to get the same result, basically, in terms of layout.
We've also
given a little bit more control on the scatter options with a block size, so that
it's scattering the ranks across NUMA domains in a cyclic manner, but with a block's worth of consecutive ranks packed into a single domain.
That way you can control exactly where your MPI tasks land, essentially, by using the different options.
The valid block sizes obviously depend on the cpus-per-task that you're using in Slurm. One thing to keep in mind is that we are
using the Slurm
request that you have as the
probe, in terms of what you want to do.
So you have to set the right thing up in that request, and then it depends on the domain type:
with a NUMA domain you can go up to a block size of 16, on a CCD you can go up to a block size of eight, and on a
CCX you can go up to a block size of four. So those are all the things you can do.
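To make the scatter and block-size behavior concrete, here is a toy model of the layout rules just described. This is my reconstruction of the documented behavior, not the actual SDSC affinity script; the domain sizes on Expanse are 16 cores for a NUMA domain, 8 for a CCD, and 4 for a CCX.

```shell
# Print the first core assigned to each MPI rank when blocks of `block`
# consecutive ranks are packed into one domain and blocks cycle over the
# node's domains (scatter-numa/ccd/ccx style).  Toy model, bash syntax.
scatter_first_cores() {  # args: ntasks cpus_per_task domain_size block
  local ntasks=$1 cpt=$2 dsize=$3 block=$4
  local ndomains=$(( 128 / dsize )) out="" rank
  for (( rank = 0; rank < ntasks; rank++ )); do
    local blk=$(( rank / block )) pos=$(( rank % block ))
    local domain=$(( blk % ndomains )) wrap=$(( blk / ndomains ))
    out+="$(( domain * dsize + (wrap * block + pos) * cpt )) "
  done
  echo "${out% }"
}

# Slide example: 6 tasks, 1 cpu each, scatter over NUMA domains, block=3
scatter_first_cores 6 1 16 3   # -> 0 1 2 16 17 18
# scatter-ccx with 4 cpus per task: ranks start on cores 0, 4, 8, ...
scatter_first_cores 3 4 4 1    # -> 0 4 8
```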
This slide is basically a guide for the layout diagrams that I'm going to be showing in the upcoming slides. Imagine these 16 boxes as one NUMA domain; you have several of these NUMA domains, and then within a domain,
cores zero to three are one CCX and four to seven are the next CCX, and so on.
This is not the physical layout; it's just for you to figure out how things are being placed,
the core numbers that are being used, essentially.
So, the first example is just using two nodes and saying, hey, we're going to have one CPU per task and only four MPI tasks per node.
And we're running this with no options; we're just doing ibrun, so by default it's just going to cycle across all the NUMA domains. You will end up with
task one on the first NUMA domain, task two on the second NUMA domain, and so on, and a similar thing repeats on node two. So this is where you'll see tasks one, two, three, four, and then five, six, seven, eight on the next node.
Now, using the compute partition again and saying cpus-per-task equals one but ntasks-per-node equals six, and, if you go down here, saying do the scattering but with a block of three, then you end up with three tasks on the first
NUMA domain and three on the second one, and then you flip to the next node, and so on.
Going to the
next slide, what I have is:
cpus-per-task has now moved up to two. So now, if you look at it, the
first task, which actually has two cores
assigned to it, takes up the first two cores on the NUMA domain, the second task takes up the next two, the third task the next two, and so on.
But you still have three MPI tasks in each NUMA domain. So you can see how you can use the combination of what you ask for in the Slurm batch script
and the affinity options to spread out the tasks where you want.
Now, if you had
kept cpus-per-task at two, but scattered at the CCD level with a block size of two, you'll see that
in the first NUMA domain, the first task is on cores zero and one,
but the first eight cores are part of one
CCD, so it's now moving the third and fourth tasks to the second CCD, which will start at cores eight and nine, basically. Then it goes to the next
NUMA domain and does the same.
So you can see how you can use a combination of the particular scatter option and the block option to lay out the tasks the way you want.
Now, if you had done the same thing but with a scatter-CCX option, then, since each CCX is basically four cores, four plus four and so on, the layout now goes first task on the first CCX, second on the second one, and so on.
All of these, under the covers, are calling srun and also using the
OpenMP binding options
to do the layout. So how does this look on the node if you actually
look at the cores? I did this with a test HPL run, which wasn't really pushing the node or using too much memory, basically just to illustrate how things get laid out.
So you have nodes equals one, ntasks-per-node is 32, and cpus-per-task is four,
so basically you're using all 128 cores in this run.
And we're doing ibrun with affinity,
with scatter-CCX, so you'll see:
basically, the MPI task is actually at 100%, and the other threads are not quite pushing all the cores to the limit, so it's actually easy to see how things are laid out here. This is output from htop, so you can,
you can actually,
log in and look; once I'm done with the slides, I'll actually run this interactively and show you how it looks.
So basically we are
placing the MPI task on the first core of each CCX, and those show at the full hundred percent: core 1, 5,
9, and so on, in htop's numbering.
htop is nice in the way it shows the utilization for all the cores, and you can see the threads are all there.
Just to confirm, I also logged in and looked at the
binding and how it looks.
And you can see that
core zero,
and then cores one, two, and three; it's not in sequence, but they all have the same process ID, so you can kind of see the binding.
So I think there was a question:
if you use,
if you enable task affinity with the cgroup task plugin? Yeah, you could probably do it that way too, I think.
And I don't know how it would conflict with this.
What I was trying to ask is: when you enable the cgroup plugin, Slurm is going to pick the exact CPUs for you, and you have no idea which CPUs were taken by Slurm, unless you define your sbatch options to say this is the affinity that I want.
Yeah, you could request,
hmm,
because if you go with the default
ntasks-per-node and cpus-per-task, you will end up with arbitrary CPUs per node, and then if you use the affinity in your ibrun, it will just use any of those cores and doesn't care about the affinity.
So, Manu can probably jump in, but basically I think we are setting them explicitly in ibrun and affinity.
Yeah, so here, at least in this initial version of the script, we are setting which cores get mapped to which ranks based on the number of tasks you ask for. So if Slurm by default is doing some assignment,
then it will definitely conflict with ibrun.
That's right.
So, right now, if you get the entire node, the way it's getting assigned is you get all 128 cores.
It doesn't assign per rank, so hence there is no conflict, but if that changes, yes, there will be a conflict.
Right, but in the previous slides you were just asking for, like, four tasks and one CPU per task. You don't have any idea which cores have been picked for your job, and then whatever affinity you choose is not going to work for ibrun.
Right now it'll work, because you are asking for the entire node, so you'll have,
the entire node, yes, that's right.
Yeah, it doesn't work for shared nodes.
Yes, yeah.
Yes, and I should stress that all of this is on the compute partition, which is exclusive per node.
On shared it gets a lot trickier, because,
I think it's also almost impossible to optimize at that level, because we really don't know which cores we'll get from the scheduler itself, since there could be other jobs running that have already mixed up
everything. Right now there are a lot of single-core jobs running on the shared partition, so you could, you will, have a suboptimal setup no matter what we do, basically.
So there you almost have to live with what you get, and I think Marty was kind of alluding to the,
the issue with,
with the shared partition earlier, I think, right? And yeah, we're going to have that no matter what we do.
I mean, at this point we really can't schedule at, say, the CCX level or the CCD level, for example.
That would be the way to get around what you're saying, right?
That's nice:
scheduling at a different level.
Right. Right now we're kind of not thinking about the shared partition when we're doing these things.
I think there was a question about,
if you go to the previous slides,
no, a little further,
the one before that.
For this case you're asking for two nodes, and then you're asking for, like,
six tasks and one core per task. My assumption was, with 128 cores per node, this is going to be a shared job, and you have no idea where those cores are going to land.
Oh, this was more of an example.
Like, I didn't want to show all 128 tasks. Our compute partition is actually exclusive.
So I mean, in this case, if the partition is exclusive, then no matter how many cores and tasks you're using, you're basically taking the whole nodes
for the job. Does that work for you? Yeah.
Which is the case here; the partition compute is exclusive.
Ah, right, okay.
So, okay.
So, Mahi, you had a question: what if you have a package that,
that is setting its own affinity?
I think in that case you probably shouldn't use the affinity script, because I presume it knows better how to bind at that level.
Yep, that's correct.
So you just leave the affinity part out, basically.
If you look at the script, it's basically
quite simple what it's doing under the covers; it's just saving you the trouble of
finding all the
cores and doing that assignment and so on.
And thanks, Martin, for that. Yeah, you can just do ps with options to get the affinity info.
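Spelling that out, here is one way to check placement by hand. PSR is the CPU a thread last ran on (standard procps columns); the current shell's pid, `$$`, stands in here for your application's pid.

```shell
# Where did each thread of this process last run?
ps -Lo pid,tid,psr,comm -p $$

# And the affinity mask that was actually set, straight from /proc:
grep Cpus_allowed_list /proc/$$/status
```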
I think that answers it. There was another question above.
I think that's your question.
Okay.
One other thing I wanted to talk about is the Slurm affinity setup for the MPI plus Pthreads codes. Now, this was interesting; I don't know if other AMD
sites have seen this, but if you have a code that's either just Pthreads or MPI plus Pthreads, and
you launch it, even in a non-shared compute node environment, what we were seeing was that Pthreads
tasks were essentially landing on the same core sometimes, and it was quite random. Some runs would work, and then some would just hang, in the sense that they would be 20x slower, because two of the threads ended up on the same core.
So this is kind of the genesis of why Manu wrote the script, basically the slurm-affinity-production script.
There's a little bit of a caveat: if your code is using Pthreads and your code has the logic to do the right affinity, then there shouldn't be an issue.
You can set that affinity up yourselves in your code, and that's the optimal way to do it. But what we saw is that there are some community applications, and in particular this issue showed up with the RAxML code, which is quite
commonly used on our system.
They just let the threads land wherever the OS lets them go.
And it wasn't so much of an issue
on our earlier systems like Comet, or on Stampede2, because the core count wasn't that high, and the code was fine using the whole node; we never saw this issue.
On Expanse, though, we were seeing that the code doesn't scale beyond a particular core count, or at least it's more efficient at smaller counts, around 40 or so for the particular test case we were running.
And what was happening was you would end up on a shared node with threads
landing on the same core, ending up in a situation where you would get really poor performance. So what this script does is:
it waits for a bit after your job launches, and then,
then it actually
looks at your Slurm information,
looks for the binding that needs to be done, and basically binds the threads explicitly.
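The mechanism can be sketched like this. It is an illustration only, not the actual slurm-affinity-production script; `PID` and the core list are placeholders for what the real script reads out of Slurm for each rank.

```shell
# Wait for the app to spawn its Pthreads, then pin each thread ID of the
# rank's process to its own core, so no two threads share one.
sleep 15
cores=(0 1 2 3)               # placeholder: cores Slurm assigned this rank
i=0
for t in /proc/"${PID}"/task/*; do
    taskset -pc "${cores[i % ${#cores[@]}]}" "$(basename "$t")"
    i=$(( i + 1 ))
done
```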
So it's a little bit trickier to use, in the sense that
it's not a single line like we had for ibrun, but what I'm showing you here is basically this:
on the left-hand side is the actual run script, so we have ntasks-per-node set, and cpus-per-task is four.
And then,
what we are doing is,
it's RAxML, so it's an MPI plus Pthreads code,
and we're launching the tasks through the
wrapper script with the options. On the right you see what's in the wrapper script for raxmlHPC-HYBRID. What we're doing is setting up the modules, and then there's the script I was talking about, the slurm-affinity-production script.
You don't need the --test flag anymore; right now we have it in production. And then there's the RAxML executable.
Then, basically,
it launches; that will be your MPI task, essentially, and then the slurm affinity script will go figure out
what to do with the threads and how to bind them. Manu, I assume you're looking at the MPI task ID and then figuring out which cores to use, right?
Correct. And since this is for shared-node jobs, each rank will have its own set of cores assigned to it, so based on that you bind the threads to the cores.
So yeah, that's kind of it.
So obviously, this
waits a little bit before it binds, so if your run is real quick you won't see it work, but it's fine if you have a test that runs for 20 minutes.
Manu, how long do you wait before you do the binding?
Initially, I think it is maybe 10 to 20 seconds, and it polls every second after that.
So that's basically what it does.
So, you will see this with,
like, the MPI plus Pthreads codes; they've always been tricky on
larger SMPs. We had a similar issue when we were running on our vSMP nodes on Gordon,
where the threads would end up in the wrong place, and there the penalty was obviously much higher, because that node was,
physically, I mean, it was aggregating physically separate nodes into one, so a thread moving off a socket was actually moving off to a different physical node, which was really bad.
It's not that bad here, but we did see the lockup, a real slowdown, even,
in this case because of threads on the same core: a different issue.
So let me actually show all of these in action before I go to the summary slide.
I'm going to stop sharing the slides.
And I'm going to move to my HPL directory and submit that job, and then I can also,
can everyone see my terminal?
Yes, we can.
So if you look at,
this is the full HPL script. You can see I have nodes equals one, ntasks-per-node is 32, and cpus-per-task is four.
I set up all my,
the compile environment, the modules that were used to compile the code.
And I'm also doing a module load sdsc,
which actually puts ibrun and affinity on your path.
And this particular one was actually built using BLIS, so I have some very specific flags that get set up.
So I'm doing ibrun with affinity, then the scatter-CCX option, and then the HPL executable.
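Condensed, the batch script being shown looks roughly like this. The module names and the spelling of the affinity option are approximations from the talk, not copied from the real script.

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

module load aocc amdblis sdsc   # sdsc module puts ibrun/affinity on PATH
export OMP_NUM_THREADS=4
ibrun affinity scatter-ccx ./xhpl
```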
I'm hoping the machine is not that busy and I'm going to get through real quick.
There are other jobs running now, so I'm going to just go to
the node and run htop on it. So you can see, at the beginning, when the memory is being loaded, it's just the MPI tasks, and then now you have the binding that I was showing you in the slides.
And you can see that it's doing okay. I also have a little wrapper script for what Martin was talking about, which,
which kind of shows where all the threads are bound; you can actually see all the binding data.
So, Manu, you said it would be possible to share the ibrun and affinity scripts? Well, yeah, I think so; we can put them on GitHub or
somewhere, whatever works.
And Martin, if you have an Expanse account, you can just grab it.
With,
with attribution to Manu, that is.
Yeah, definitely. And actually, more importantly, I think we would like feedback on how it looks and whether there are some other options you want added into it, and so on.
As,
a lot of this is
new. We've used it for some of our,
some of our tests already.
But our experience is that users always find some other thing that we didn't think of.
So you can see all the layout was fine. I'm going to the other node now.
This one is the RAxML run that I was talking about, and I have it here; yeah, I do have to remove some stuff.
So I'm
logged in for my job.
And now, this is another job, and,
you can see this is a shared node, so I'm not using all the tasks.
So yeah, you can actually see that it's picked some random set of cores.
And Manu, sure, this goes to your point about the scheduler: it might just give us some different set of nodes and set of cores, which is what's happened here. So you can see I didn't get a nice contiguous set, even though the node is not full.
Right, yeah.
So the node is shared between your job and other jobs.
Yeah, exactly. This could be the case where something else is running on the other cores. This particular Pthread script does handle that very well, so we can actually find what Slurm has given us and then use that for the distribution.
Manu, you can add something if you want.
So yeah, this particular RAxML code here, it's a shared-node job, so
if you ask for, say, 40 cores, it will give you 40 cores, but it's not specific to either a CCD, a CCX, or a NUMA domain; it just gives you whatever it has.
That's correct.
And the problem with this code was that although each MPI rank, say, gets four cores, some threads in those four cores would run on the same core. For example,
one core would be idle, and three cores would show 100 percent, 100 percent, and 50-and-50 percent utilization per thread.
Basically, something was going on, either with the OS or I don't know what, that was causing the code to get
into that idle state, and that was causing severe performance degradation. So what we did is just spread out the threads across the cores that have been assigned, instead of having one idle core. That is what this script does.
Was that clear?
I think, yeah, I think that's,
that seems fine, yeah.
So I think there are a few comments in the chat. Mark, you said that the BIOS supports exposing each CCX as a NUMA domain, so I think the question is more on the scheduling side, like how that would complicate things. I don't know if there's been any
work on the Slurm end of things for scheduling at that level.
Right, that's the question, right.
I'll ask internally how we do it, but I do know that people do use that for
applications where you don't have control,
telecom applications that only look at NUMA.
So we're kind of working around it by using these scripts and doing the best we can with what's there, but yeah, if we could schedule at the NUMA level, that would be a nice thing.
Yeah, I think what you guys have is really good. I was just thinking of the case where you don't get the entire node, yeah.
No, what you guys have is amazing, really nice.
Let me see what else is there.
Yeah, Mahi, at this point we're in the Q&A session for your talk, so there are a lot of questions in the chat that I think you're best looking at, and if anybody just wants to open up a mic and ask a question,
feel free to do so, or,
either works.
And I can look at the questions. So, just to summarize: basically, we saw the NUMA domain setup with 16 cores on each; we are running NPS4, so that's the way it is on Expanse.
You saw the
CCDs per processor, with two core complexes per CCD and four cores per CCX sharing the L3. And we have a script for MPI/OpenMP kinds of setups, and also for MPI plus Pthreads kinds of setups.
You can follow along on Expanse.
Oh, I think I have the wrong link here, but we'll fashion an update this
week. That's one of the links that works, but
we're going to point to the Expanse user guide instead.
These tools are definitely open for update, and I'm sure Manu, Mark, and I and others can help if there are questions, and we'll be happy to modify them if you find a case that doesn't work.
So I think we are
good from the presentation standpoint. I'll walk through the chat and see if anything wasn't answered.
I got most of them inline, but,
let me check for open questions.
So, the command to find the affinity: I think Martin answered that further down,
but basically there are ps options that you can use
to check the binding on a program.
yeah I think you brought up on questions if you have anyone else has any questions, let me know.
So, on the affinity, I think Mark answered it, but yeah, basically the affinity script pins
to the same set of cores, and
the Slurm affinity script basically binds the threads; the second one is meant for Pthreads codes,
because we've seen a lot of cases where
there is no environment variable you can set to bind the processes, so you have to do it kind of afterwards.
Well, if there aren't any more questions, and there have been a lot of great ones in the chat and a lot of good questions asked by everybody, we can take an early break and return at 11:55am Pacific time, when Bob Sinkovits will give one of his amazing talks on
profiling and working with the uProf
product that AMD has. And I want to share the screen for a moment, Jeff, if I could,
because we've got a couple of links.
I have a window here, and I'm going to put this in the chat, but if you go to, can you all see my screen showing the
training link
for this program? We've added a link to the GitHub repository, so that will take you to a repo where we're going to put the talks from the presenters, and then a few key helpful links that
we'll add in here, especially some of the ones from Mahidhar's talk, and Mark offered to put a few in there, but you'll be able to come back here
in a day or two, once we round up all the presentations and get them uploaded to the repository. So I'll put this also into the chat, but just in case you lose it, it's on the training
event page, which will stay with this event and its information, and that's it. You're free to get up and roam about the country, or your office, wherever you are.
Any questions.
We'll just keep the Zoom session active, and I'll be around to answer questions if anybody has anything they need to ask. Thanks.
Thank you, Jeff.
And advancing the slides.
Yeah, okay, go ahead and stop. Awesome, thank you.
Okay, let's go ahead and get started. Welcome back, everyone, thanks for hanging in there. Next we've got Bob Sinkovits presenting on profiling applications using AMD uProf, and I'll stop sharing and let Bob take over.
Hey, good afternoon for most of you, or late morning for those of you in California. So I'm going to be talking about the AMD uProf tool, and let me just go ahead and get my
slides shared.
Okay, can everybody see my slides?
Yeah, it looks good. Alrighty.
Yeah, so I'm going to be talking about profiling applications using the AMD uProf tool, and I'll have this link in a few places in my talk. Now let me just go ahead, I'm going to cut and paste, sorry,
I'm going to cut and paste this into the chat in case anybody wants to follow along there.
We'll also be putting these links on the GitHub repository page
for the training session. Thank you.
Alrighty, so, diving in.
Yeah, before we start profiling, let's just spend a minute on why we should profile your code.
So, presumably if you're profiling your code, you're interested in improving the performance: you want to make it run faster, or you want to make better use of your allocation.
I would say that for profiling there are really two things you're trying to do. First, you want to determine what portions of your code are using the most time.
So if you're working on a small application, something that you're intimately familiar with, that you may have written, you probably don't need to profile; you might already have a pretty good idea of where your code is spending its time.
But a lot of modern HPC and data-intensive applications are big, often thousands and in some cases millions of lines of code, so you really can't just dive in, start at the first function, and work your way through
trying to optimize it. You really want to know where to focus your effort.
And in most cases you'll find that your time is being spent in just a very small number of routines.
And then sometimes you want to go a little bit deeper, and you want to figure out why those portions of the code are taking so much time.
So understanding why a section of code is so time consuming can sometimes give you valuable insights into how it can be improved.
I would say that a lot of us would stop at the first step, and I often do that: once I determine the function or the block of code that's using the most time,
I pretty much know how to proceed. But I'll be showing an example with the AMD uProf tool of how to get some additional insight by going a little bit deeper.
So AMD uProf is a proprietary performance analysis tool that is just for AMD hardware.
It allows you to do profiling in multiple ways. The basic one is what we call time-based profiling; if you're familiar with gprof, it's very similar: you get a report showing which functions or routines are using the most time.
uProf goes a little bit further; it will try to assign CPU time to individual lines of code, which is helpful, but as I'll show, when you are
profiling your code with optimization turned on, it's hard to assign usage to particular lines of code, since there may have been transformations of the software by the compiler.
If you want to get that information, though, you compile with the -g flag to get symbol information.
And uProf will provide profiling data in two formats. There are CSV files, which we're going to be looking at today, just regular comma-separated value files; these are human readable and you can open them in spreadsheets.
And it also generates a SQLite database that can then be imported into the AMD uProf GUI. Now, in addition to the time-based profiling,
you can also access the performance counters to look at cache usage, branch prediction, and instruction-based sampling. Today we're just going to look at cache usage, which I think is the most common use case.
So the workflow involves three steps: there are the collect, translate, and analyze phases.
In the collect phase, this is where we're going to run the application and generate, or collect, the profile data.
Then in the translate stage, we need to process the profile data to aggregate and correlate it and save it into the database and the CSV file, and then finally there's the analyze step.
So collect and translate, this is analogous to what's going on with gprof, where you first run your code,
it generates a gmon.out file, which is not human readable, and then you do the translate phase, where you use the profile data along with the executable to generate a human-readable profile.
So the first step I'll mention is collect. We're going to do that using the AMDuProfCLI tool with the collect command.
And sampling can be done using a variety of predefined configurations, which I'm going to show on the next two slides, or you could also use the more advanced features and specify specific events. I think in general, though, you're going to want to stick to
the predefined configurations. So we show the kind of generic form of the command here: AMDuProfCLI,
first argument is collect, meaning that we're going to be using this in collect mode, followed by --config and the type of sampling that we're going to do; we then specify -o, the output file, and then our executable followed by its arguments.
The specification of the output file is just a little confusing here. What
is given with -o is actually a prefix, so
depending on whether you're running on Linux or Windows, you're going to get an
output.caperf file or an output.prd file.
And that's going to be fed into the translate stage, and of course, since we are on Expanse, it's going to be on Linux. Let me just check the chat here.
Um, how can you load uProf? Great question, and I should have mentioned this earlier: we do not yet have uProf available.
The systems staff is in the process of installing
uProf; it turns out it's not a simple installation, since it does require kernel modifications.
So it's going to be available soon, and I believe that it's going to be available as a module, but if we have anybody from the systems or user support group here, feel free to weigh in and correct me if I got that wrong.
Yeah, I think that's good, and we may also need a feature request, depending on how we implement the permissions, but it will be
documented as soon as it's available.
Great, thank you.
So, I mentioned that we have predefined configurations for the collect phase.
The basic one is the time-based sampling; it's abbreviated as TBP, and we're going to use this configuration to identify where the program is spending its time. Think of this as being like gprof and then
a little bit more. And then we have a bunch of different configurations: the next one is for assessing the performance, including an extended version.
We can investigate branch behavior and data access. For the abbreviations here, PMU is performance monitoring unit and PMC is performance monitoring counter; I'll talk about that very briefly.
We can also investigate instruction access, do cache analysis, instruction-based sampling, and kind of a lower-level performance assessment.
You'll notice that for each of these, with the exception of time-based sampling, we have a number of PMU events where we're actually monitoring the performance counters.
These codes, for example PMCx076 and so on, are not going to mean anything unless you go to
the user guide and see what these counters actually correspond to, but you can see now the convenience of having
these predefined configurations, rather than having to go through the user guides and figure out which counters should be used.
So, a quick aside on what we call paranoid levels. On newer Linux distributions there's a parameter called perf_event_paranoid.
And this controls, essentially, how many performance counters you can access simultaneously. There are a couple of links here, one on Stack Exchange and one on Stack Overflow, if you're interested in diving a little bit deeper, but once you
give users access to the performance counters, it does introduce some security holes.
So setting different values for perf_event_paranoid determines how many of those counters you can access simultaneously.
And we're trying to find the right balance between security and usability, and what we're going to be doing is setting
the paranoid level to one, which is going to allow you to do the time-based profiling and the data-access-based profiling, and you'll still be able to access specific counters at this level. So if I go back, I had tried doing
branch profiling, and the number of counters, in this case eight counters, was too large for this
paranoid level. But you can always go back and use the more advanced features of uProf, if you really want to do branch profiling, and just specify a subset of those counters.
We have a question here.
Is it possible to integrate uProf with the PAPI library? Um,
I don't know.
Hey Bob, I'm the one who asked this question, and I'd like to know.
Okay, that's fine.
I can double check that and get back to you.
Let me go on. Okay, so the second phase of using uProf is going to be translate, and this is where we take the profiling data and we turn it into a format that we can use.
In this case we're going to use the same command, AMDuProfCLI, so it's really important to note here that it's just a single tool that we're using for both collecting and translating, but in this case, instead of collect, we're going to specify report.
We have to provide
the name of the
profile data, that is, the name of the file containing the profile data that was generated during the collect phase. So here we're going to specify -i,
and we're going to use the same prefix that we did before, output,
with the .caperf extension, since we're working in a Linux environment.
What this is going to do is generate a new directory, output, which contains two files; it gives you both an output.csv and an output.db,
so the former is, of course, plain text, while the latter is a database file that is read by the GUI.
And that gets us to the last phase, the analyze phase. The profiling data can be analyzed anywhere, so my preferred way of doing this would be to install the AMD uProf
GUI on my local machine, copy the database over, and use the GUI, but unfortunately AMD does not currently have an implementation for the Mac.
Well, as of at least a month ago; that may have been remedied, but as far as I know there's still no GUI for the Mac.
But you always have access to that CSV file, so what I've been doing is taking
the CSV file and opening it in a spreadsheet, and this actually works pretty well. It requires just a little bit of
reformatting, column resizing, and familiarizing yourself with the output. Maybe it's a little clunkier than using the GUI, but it still gives you access to all of the information.
Next we're going to take a look at a simple example. I put together a toy code, and this is in the GitHub repo.
This is the intro example, where I have a main program and calls to subroutines sub1 and sub2; it's going to call sub1 twice, it's going to call sub2 once, and those subroutines in turn are going to call the functions F1 and F2.
So sub1 is going to call F1 n times and F2 n times, and then sub2 is going to call F2 n times.
So there are going to be three stages in doing the profiling. First we're going to do our build.
The only thing that we need to make sure we do is specify the -g flag so that we get symbol information.
In this case, since we're running on Expanse, I'm going to set the march flag to core-avx2, and I'm going to turn on
optimization with -O3, and you'll see following that, in brackets, I optionally specify inline-level equal to zero. So one of the transformations that the optimizing Intel Fortran compiler does
is what we call function inlining: it will take the body of a function and replicate it where it's called, and this can really boost performance by avoiding the function call overhead.
It also gives you a lot more opportunities for vectorization of loops. I'm going to name this executable intro, and then we're compiling the intro.f file.
Then the next phase is going to be my collect phase. Again, I'm going to use the AMDuProfCLI tool.
I'm going to do collect; in this case I'm going to use the TBP, time-based profiling, configuration, specify the output base name as intro-tbp, and then execute intro. In this case I'm going to set n to 1 billion.
And then, finally, the translate stage: again the AMDuProfCLI tool, but this time, instead of collect, report, and then I need to specify
the profile data, in this case that would be the intro-tbp
.caperf file.
Just to show you what this code looks like, here are a few key snippets from the code. On the left-hand side is the Fortran main program.
You'll see that I'm allocating arrays w, x, y, and z, all to be of size n; this is that parameter that I set equal to 1 billion in the previous example. I'm going to initialize the arrays x and y,
just setting them to a value based on the loop index, and I'm going to call sub1 twice, first with x, y, and z, next with y, x, and z, and then finally I'm going to make a call to subroutine 2.
On the right side, we see the bodies of these subroutines. So subroutine 1,
all it does is, element by element, call the functions F1 and F2; subroutine 2, element by element, is going to call the function F2.
And then the functions F1 and F2 just do, in this case, kind of a nonsense calculation on the two arguments:
in the first case the square root of x divided by the square root of y, plus y over x, and in the second case the square root of the product divided by the square root of
the ratio, plus the ratio. I just put this in here so that we have a function that takes up a reasonable amount of time and makes our code compute bound rather than memory bound. And I see a comment
from William: for uProf usage, are there any particular compiler flags to turn on?
No, you don't need to specify any particular flags. I just used -g so that we got the symbol information, but unlike gprof,
you don't need a specific flag in order to use uProf; with gprof you have to compile with the -pg flag, so it's nice that we don't need that.
Okay, so this is the CSV output that I get, and I'm not expecting you to read this, it's kind of small, and it is hard to read CSV files, but I just want to point out the key
sections of the report. Up at the top you'll see the execution environment and profile details, and this is really handy: it keeps track of which executable you're running,
what machine you're running on, the processor architecture, and so on. It's really easy to lose track of that information.
You'll then see a section called hot functions, which is a listing of the functions that are accounting for most of the time, and then after that we get into the function details.
So, when I did the time-based profiling with inlining,
you see on the left the hot functions. You'll notice that we have main there at the top; it's the third line, using a little over 14 seconds.
But you'll also notice that there's no usage assigned to sub1 or sub2 or F1 or F2, and that's because all of that code was inlined.
There was a transformation of the code, so essentially we now have just a single function. On the right-hand side we can see a breakdown of where time was spent in main. Now keep in mind that
the assignment of time to individual lines of code is always a little sketchy.
I would take these results with
a grain of salt. I mean, here it doesn't look too bad; we see that F1 and F2 accounted for
about five seconds and about seven seconds, respectively, which is what we expect. These were just function definitions that were inlined into the main program, and then there's a smattering of time on other lines.
Now, I went back and did the profiling with inlining disabled. So, going back to the compilation stage, I added that inline-level equal to zero to see how the results would change.
And without inlining, you'll see on the left-hand side that most of our usage is in F1 and F2, as expected,
some usage by sub1 and sub2, and then also a smattering of time for the system calls. I want to point out, though, on the right-hand side,
where we do a deeper dive into the function, that this is a case where the assignment of time to lines of code wasn't correct. Again, it's really hard, especially for optimized code, to assign usage to a particular line. If we look at these first few lines here, function F1
took about 32 seconds.
The line that you would think would have accounted for most of the usage, where we're calculating the square roots and doing the divisions,
was assigned seven seconds, whereas another statement in function F1 was assigned the bulk of the time. So it's not quite right; take these with a grain of salt.
Okay, so I'm going to do a brief digression before demonstrating data access profiling. What we've been doing with uProf, and what we would have done with gprof, is using it to figure out
where the code is spending most of its time, which functions are particularly expensive, but that doesn't give us any insight into why a particular function is expensive. So for codes with low computational intensity,
by which I mean that we're doing a relatively small amount of computation for each word of data that we're operating on, performance depends on how well we manage data movement,
in particular how well we make use of the cache, and uProf gives us access to the performance counters and allows us to go a little bit deeper.
I've talked already about memory-bound and compute-bound codes. In memory-bound codes, the performance is limited by how fast we can deliver data to the CPU.
And here our goal is going to be to apply cache-level optimizations so the CPU is not starved for data. The example that we looked at earlier would be a compute-bound code,
where the performance of the processor is the limiting factor. We're doing a lot of expensive operations, divisions and square roots, on the data, so we can
deliver the data fast enough and we're just limited by how fast the CPU is. But in real applications we're often going to see a combination of compute-bound and memory-bound code.
So, continuing in the digression, hopefully a lot of you have seen figures like this already of the computer memory hierarchy. It's usually shown as a pyramid where, at the very top, we have registers, which are limited.
Modern processors have maybe on the order of 100 registers. These registers hold data that is immediately available for use by
the CPU, by the functional units, on the next clock cycle. Very, very fast, very small, but of course expensive.
Down at the bottom of the hierarchy we have external storage, disk or possibly even tape, and just above that we have DRAM, our main memory.
Typically this is, you know, 100 to 200 gigabytes on modern HPC nodes.
But even though memory is fast compared to disk, it's very, very slow compared to the CPU. So what we'd like to do
is try to move data from DRAM up into the cache hierarchy. If we're looking at the L1 cache, which is where we really want data to live,
it's small, on the order of about 10 kilobytes,
but we can access it very, very quickly; this is fast memory that sits directly on the processor. And then we also have L2 and L3 levels of cache.
If you're working on a very, very small problem, all of the data may fit into cache, but in general your problem is going to be way too large to fit in cache, so you can only take a small fraction of your data and have it live in cache at any one time.
So, in order to take advantage of cache, we need to exploit what we call temporal and spatial locality.
The idea of temporal locality is that data that was recently accessed is likely to be used again, so once we load data into cache, we would like to use it again before we have to push it out to make room for other data.
Spatial locality means that if a piece of data is accessed, it's likely that neighboring data elements in memory will also be needed. So caches are organized into lines, I say here typically 64 bytes;
for modern processors I've never seen anything other than 64 bytes, and an entire line is
loaded at once. So our goal in cache-level optimization is going to be very simple: exploit these two principles to minimize data access times, and in particular try to minimize the number of cache misses.
So we need to look at
how data is laid out in memory. We're going to start with a simple example of a one-dimensional array. This one is easy: 1D arrays are stored as contiguous blocks of data in memory.
So, looking at a little snippet of C, let's say that we're going to declare an array of integers, and assume that they are four bytes long.
They're going to be laid out with the first element, element zero, at an offset of zero;
element one is going to be offset by four bytes, element two by eight bytes, element three by 12 bytes, and so on.
So cache optimization for a 1D array is going to be very straightforward, and you're probably going to write
optimal code without even trying. Here's an example where I take all of the elements of the array and just increment them by 100.
Here, the way I would just naturally write the loop, for i equals zero, i less than n, i plus plus, is going to access those elements in the right order.
So what's going on under the hood, what's going on with regard to cache?
When we first encounter this loop,
the processor is going to load elements zero through 15 into cache, and then we're going to increment x(0) through x(15).
When we get to element 16, it's not in cache; we're going to have to load the next cache line, the next 16 elements,
and increment those. But in reality, the processor is going to recognize the pattern of data access and prefetch the next cache line before it's needed.
This is a perfect example of a memory-bound code: we're reading in a piece of data and doing a minimal amount of computation, in this case just incrementing and storing.
So we're really going to be limited by the performance of the memory subsystem. In this case there's really nothing else we can do; the code is already optimal.
A question that comes up is, do I have control over cache? The short answer is no.
You know, sometimes we get questions like, is there an assembly language instruction that I can use to say I want to load a particular location in memory into cache, and there's not.
Modern processors directly implement advanced cache replacement strategies, branch prediction, and prefetching, so I'm going to say the best you can do is just follow the standard practices of exploiting temporal and spatial locality.
And that gets us to multi-dimensional arrays. Here's where things get a little more complicated, and you can easily write
suboptimal code. From the computer's point of view, there's no such thing as a two-dimensional array; this is just what we call syntactic sugar, a convenience for the programmer. Under the hood, multi-dimensional arrays are stored as linear blocks of data.
Here's where it gets a little complicated: there are two ways of doing this. There's column-major order and row-major order.
In column-major order, and this is what's used in Fortran, R, and MATLAB, the first, or leftmost, index varies the fastest. So if I have this array, this four-by-four grid of numbers
zero through 15, I'm going to go down the first column, 0, 4, 8, 12, then 1, 5, 9, 13, and they're going to be laid out in memory as 0, 4, 8, 12, that was the first column, then 1, 5, 9, 13, that's the second column.
And row-major order is where the last, or rightmost, index varies the fastest, and this is used in Python, Mathematica, and C, C++, and the other C-like languages. Here we're going to go across the first row, 0 through 3,
then the second row, 4 through 7, and so on.
It would be really nice if there were a single convention, but that's never going to happen. It's kind of like driving: in the US and most of Europe you drive on the right side of the road,
and in the UK and the former Commonwealth countries you drive on the left-hand side of the road. Nobody's going to make the change; we're kind of stuck with it. So you need to be aware of this as you're working with multi-dimensional arrays.
So let's say we wanted to add two multi-dimensional arrays in a loop nest. It's going to be different depending on whether we're working with Fortran or with C.
In properly written Fortran code, note that I have the rightmost index outermost and the leftmost index innermost, so I do: do j, then do i, and then z(i,j) = x(i,j) + y(i,j).
Properly written C code is going to flip that around, so we have for i, then for j, and then z[i][j] = x[i][j] + y[i][j].
And that gets us to performance and how we're going to use uProf.
So I put together this code, again it's in the GitHub repo, where I measured the runtime for the addition of two large n-by-n matrices, and I saw some expected behavior, but I also saw some really unusual behavior.
I ran this for some pretty big matrices, n equal to 32,767, 32,768, and 32,769. When I used the proper loop nesting, this code took a little over two seconds to run,
with a little bit of variation, but pretty much within the run-to-run variation. When I did the improper nesting, when I had my loops ordered the wrong way, it took a lot longer.
But notice the wild variation we're seeing: for the first case it took about 15 times longer, so nearly 30 seconds rather than about two seconds, and for the third case, 32,769, it took about 10 times longer,
about 22 seconds versus about two. But for that middle case, 32,768, it took about 60 times longer. So we're trying to figure out what's going on. As you would imagine, this is a memory-bound code; something's going on with the data access patterns.
Remember, we're talking about cache; cache is so much smaller than main memory that you can't hold the entire problem in cache.
So, with the improper loop nesting, we're loading cache lines and we might be using as little as one element of each cache line before it gets pushed out and we have to reload new data. So it's bad in all the cases, but it's particularly bad in the middle case, 32,768.
So what we can do is use uProf and do data access profiling. In this case I'm going to compile using the gfortran compiler.
The reason I did this is that the Intel Fortran compiler sometimes recognizes when you have loops ordered improperly and will re-nest them.
I wanted to make sure that the bad version of the code performed badly, that we didn't have that optimization being done behind the scenes.
So we compile the codes and we're going to collect.
The only thing that's different here is that, instead of tbp for time-based profiling, we're doing data access, and we do this for both the good and the bad versions of the code. So I have one version, good,
for double-precision matrix addition with the proper loop nesting, and a bad version for the improper loop nesting.
When I do this, if I look at the function details section of the report,
I get insight into the cache usage. In particular, these are the numbers of cache misses that we had for the properly nested and the improperly nested codes.
We'll see for the properly nested version that we had about 10 to 11 thousand data cache misses; for the improper nesting it was much larger in every case.
And in particular, look at the 32,768 case, where we went from about 11,000 cache misses to nearly a million cache misses.
I'm just going to escape here for a minute and show you this in the spreadsheet.
Sorry, wrong one.
Let me share again.
There it is. Just to show you the block of code that I was looking at, we'll go back here to the properly nested code.
So, just to highlight, we can see here for this function that we had those 10,419 misses; this was for the good version of the code, using 32,768.
Now we go over here to the bad version of the code.
I'm sorry, I realized I need to...
no, I'm sorry, that was correct.
Here we can see, in particular, in that cell,
that we had nearly a million,
nearly a million cache misses. So this is the kind of information that we couldn't have gotten from time-based profiling, and that we couldn't get with a tool like gprof. Let me just switch back to the presentation.
So we would expect that we would have a lot more cache misses when we have the improper loop nesting, but why is it so bad for that middle case, 32,768?
And it has to do with how data in main memory is mapped to cache; there are whole textbooks written on this, so I won't go into the details.
But the location in cache that main memory gets mapped to is based on a portion of the memory address, and when we're accessing data with a large power-of-two stride, we keep banging on the same cache lines.
So not only was having the improper nesting bad, but when we used a problem size that was a large power of two it was even worse; we kept hitting the same cache lines.
So we were probably loading a cache line, the entire 64 bytes, operating on just one element, then throwing that cache line away to make room for other data that couldn't fit into cache.
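To make that mapping concrete, here is a small sketch of how an address picks a cache set (a simplified direct-mapped model with illustrative sizes, not the actual EPYC cache geometry). With a large power-of-two stride every access lands in the same set; padding the stride by one cache line spreads the accesses out:

```python
LINE_BYTES = 64   # cache line size
NUM_SETS = 64     # illustrative set count (a power of two)
DOUBLE = 8        # bytes per double-precision element

def cache_set(byte_addr):
    # the set index is just a slice of the address bits
    return (byte_addr // LINE_BYTES) % NUM_SETS

# stride of 32768 doubles: every access maps to the same set
same = {cache_set(i * 32768 * DOUBLE) for i in range(16)}

# pad the stride by one cache line (8 doubles): accesses spread out
spread = {cache_set(i * (32768 + 8) * DOUBLE) for i in range(16)}

print(len(same), len(spread))
```

This is also why the classic remedy, besides fixing the loop order, is to pad the array's leading dimension away from a large power of two.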
So, just about wrapping up here: where to go for help.
There's something I ran right into when I first put together this presentation: uProf or μProf?
AMD tends to use these two names interchangeably, and sometimes on the same web page or document.
If you do a Google search for uProf or μProf (I guess the μ stands for micro profile), they seem to yield the same hits for the official AMD page.
So with μProf on the left and uProf on the right, they take you to exactly the same material. But if I go a little bit further down and look at the non-AMD websites, for example Reddit, Geeks3D, NotebookCheck, GitHub and so on, you'll see that they turn up slightly different hits.
And in the case of μProf there are a few cases where we found unrelated hits; this second one here on the right side, r/uprof, I believe is unrelated to the AMD uProf we're covering here.
Yes, so Mahidhar, I think you're answering this: what if you search AMD uProf as opposed to μProf? Yes, so you're going to get the top hits
for the AMD developer site, and you'll get to the same material.
But again, if you're looking at community input, Reddit and Stack Overflow and so on, you're going to get different hits. So my recommendation, if you're doing a little bit of digging and you want to go beyond the official AMD documentation, is that I would probably do both.
And with that, I'm just about ready to wrap up. Like I said, we don't have uProf installed yet.
What you can do, though, if you're interested, is read the uProf user guide; I'll paste this into the chat when we're all done. Then, once uProf is installed on Expanse, you can go back and rerun these exercises.
I'm going to say, try using different compilers and different problem sizes.
For the intro example, where we were looking at time-based profiling, you may want to
experiment with the contents of functions f1 and f2; you'll get some interesting results if you use sine, cosine and log functions instead of divisions and square roots.
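For example (these are hypothetical stand-ins for the exercise's f1 and f2; the actual function bodies are whatever the intro example used):

```python
import math

def f1(x):
    # division-and-square-root flavor of the work loop
    return math.sqrt(x) / (x + 1.0)

def f2(x):
    # transcendental flavor: sine, cosine and log instead
    return math.sin(x) * math.cos(x) + math.log(x + 1.0)

# a work loop you could time under the profiler with each flavor
total = sum(f1(x) + f2(x) for x in range(1, 100000))
print(total)
```

The interesting part is how the profiler attributes time differently when the hot functions call hardware-friendly operations versus library transcendentals.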
You can also dive a little bit deeper into inlining by breaking the file into multiple files.
With that, let's take questions. Let me take a look at the chat first.
So there's a question here: what about MPI codes, meaning parallel codes that use MPI? I haven't tried uProf with an MPI code.
I think what's going to happen is that it's going to be similar to gprof, in that you're going to get a profile per process.
Which could still be useful, but uProf isn't a tracing tool like TAU, so you're not going to get information about, say,
bottlenecks or load imbalances or communication patterns.
And I'm happy to take any additional questions.
Okay, another question: how does AMD uProf compare to the perf tools? To be honest, I'm not
personally familiar with the perf tools. I'm going to guess, though, that AMD uProf goes a bit deeper; I don't know whether the perf tools give you access to
the performance counters, in which case you wouldn't be able to get information like cache misses and mispredicted branches. But if we have anybody from AMD, or other SDSCers here who are more familiar with perf, feel free to weigh in.
Great, thank you. Thanks, Mark, for posting the link about perf for AMD applications. Mahidhar, did you want to make any comments on it?
Hey, this is Mahidhar.
Yeah, you can use perf to get a lot of cache-related stats as well. Okay, I have never tried it on the AMD EPYC, so I don't know about that, but you should be able to get certain events, and those events can be cache events or branch events.
Okay, yeah, and I imagine we're going to run into the same issues with the perf paranoid level.
mm hmm.
At least if you want to
track multiple counters at the same time.
Great, thanks, Mahidhar.
I'll go ahead and stop sharing. Do we have any more questions?
I know we've been here for a while; everybody's brains are probably getting full from all the technical material.
All right, well, Mary, I'm going to hand things back to you.
Thank you, Bob, that was a nice presentation, and very important for optimizing and tuning on the AMD EPYC processor. So thank you to all our presenters for their really in-depth, informative
presentations. We'll be sending out a survey in the next few days to get some feedback from you; it's the XSEDE survey, but we might add
some of our own questions. It's the first time we've done this kind of workshop, and we thought it was really important for the AMD architecture.
And so we have a few minutes dedicated to any kind of questions that you want to ask. There were still a lot of questions out there about
the compiler and library presentation by Mahidhar, and then the Slurm and runtime configurations, so now's your chance to ask them. In the meantime, the talks and key links will go onto the repo for the workshop. So, does anybody want to ask a question,
or share an opinion or a thought?
Or do you guys just want your get-out-of-school pass early today?
And that's fine, that's fine.
So, if you don't have any questions...
Oh, great.
Well, compiling and libraries.
Sure, go ahead.
I'm actually online right now trying to compile my code, and I'm trying to use HDF5. I load the hdf5 module, but...
Nothing. I mean, some of the apps like h5dump and so on show up in the path, but the libraries don't actually show up in my path, and so when I compile, it doesn't find the HDF5 library.
I think there are a few things there. HDF5 is, I think, one of the libraries where we have a hidden module; that is probably what you picked up.
Are you loading any compilers before you do the hdf5 load?
Yeah, I'm trying various things, like the Intel compiler.
So if you have the right combination of things, then the only other
thing you would have to do is add the path to the library; we usually set the HDF5HOME variable to point to the actual location.
You can just use printenv on HDF5HOME.
Yeah, I think that's what it will be.
In caps, so.
And then /lib should give you the library location, and /include the headers.
It's all one word.
Yeah, it's all one word; once you load the module, it's set by the module.
Then you'll have to modify your makefile, or your various scripts, to essentially include the library path, and then it should work fine, yep.
Okay, it's there. Thanks a lot.
Okay, yeah, and if there are any other issues, you can just send in a ticket and we can follow up.
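Putting that answer together as a sketch (the install path here is made up; on Expanse the hdf5 module sets HDF5HOME for you, and you would splice these flags into your makefile):

```python
import os

# stand-in for what "module load hdf5" would set; the real path will differ
os.environ.setdefault("HDF5HOME", "/opt/apps/hdf5")

home = os.environ["HDF5HOME"]
# compile-line additions: headers under /include, libraries under /lib
flags = f"-I{home}/include -L{home}/lib -lhdf5"
print(flags)
```

The same pattern works for any of the hidden library modules that follow the HOME-variable convention.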
Any other questions comments.
Bob, I think there was one that was
about profiling codes that use libraries, and where to get
debug versions of those libraries.
Yeah, I think
to support that, what we'd have to do is actually deploy the packages with the debug flags on; I have seen that on other systems.
Yeah, if you're interested in that, we'll probably have to work with you, and maybe just install those debug versions in your home directory, if you're going to do some sort of extensive debugging analysis.
And the question above about the AOCL.
Making some of those libraries, like the AMD math library, a little bit more accessible: this is one of my feedback comments for our Spack people at some point. I don't know the current state of the latest AOCC
Spack package, but we could make some environment variables available that people can more easily hook into,
versus having to explicitly point to the path where the AMD library is
all the time. So yeah, that's a fair comment; I think we'll bring it back to our Spack folks, and at least when we deploy some of these packages in the future it'll be a little bit easier to link to them.
Thanks, Mahidhar. Anybody else have any questions or comments, or any input you want to share about your experiences on Expanse today? That would be helpful.
Well, it's been a long morning with a lot of... oh, there's a question from Omri.
Can one of our experts take that?
Yeah, I think so.
Omri, yeah, that's exactly what we are doing; that's the HOME variable that I mentioned,
which gets set for each package. I think the issue is that the library is sitting under AOCC, and if you're building with some other compiler, then that won't be loaded.
Yes, the module system will prevent you from loading the two compilers at once, so yeah, that's a particular issue.
And if you're wondering why we call it HOME and didn't use ROOT:
it actually broke some packages, because that was a reserved word for some of them.
So yeah, for every module that you load, you can just capitalize the module name and then append HOME to it, and that variable points to the location.
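That naming rule is simple enough to write down (my paraphrase of the convention, not an SDSC script):

```python
def module_home_var(module_name):
    # Expanse convention: capitalized module name + "HOME", e.g. hdf5 -> HDF5HOME
    return module_name.upper() + "HOME"

for mod in ("hdf5", "fftw", "openblas"):
    print(mod, "->", module_home_var(mod))
```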
There's a good question by Ron, asking about an automated way to take an MPI code on Expanse and have it run with a bunch of affinity options to see which one is optimal.
Not one that I can think of; do you want to take a stab at it?
Basically, they can
use ibrun and
iterate over the different flags to ibrun.
That's what I can think of, yeah.
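One way to script that suggestion is to generate one run per binding choice and time each (a sketch; I'm using srun's --cpu-bind values as the concrete knobs, since ibrun's own flags vary by site, so check the Expanse documentation for the exact spellings):

```python
# candidate CPU-binding options to sweep over (srun examples, not exhaustive)
bindings = ["--cpu-bind=cores", "--cpu-bind=sockets", "--cpu-bind=rank"]

app = "./my_mpi_app"  # hypothetical executable name
for opt in bindings:
    # in a batch job, run each line, record the wall time, keep the fastest
    print(f"srun {opt} {app}")
```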
All right, well then, I want to thank everybody for coming, and thank all of our presenters, and thank you all for working on Expanse and becoming part of our community.
Feel free to send us more feedback about workshops of this type, or our other workshops, so we can determine what information people need as they progress in their work on Expanse. And with that, we will shut down and say goodbye to everybody.
You all have a great rest of your week, and we'll see you online. Bye, thank you, everyone.

